Generating low-level code from a higher-level language

Generating C
Motivation
Offshoring
Tagless-final embedding of (a subset of) C in OCaml
Mutable variables in tagless-final style
Realistic, modular, composable, algebraic optimizations with some correctness guarantees: DAXPY
An advanced example: generating high-performance BLAS code

Generating C

Heterogeneous metaprogramming systems use a higher-level host language to generate code in a lower-level object language. Their appeal is taking advantage of the module system, higher-order functions, data types, type system and verification tools of the host language to quicker produce high-performant lower-level code with some correctness guarantees.

We present two heterogeneous metaprogramming systems whose host language is OCaml and object language is C. The first relies on offshoring: treating a subset of (MetaOCaml-generated) OCaml as a different notation for (a subset of) C. The second embeds C in OCaml in tagless-final style. The systems have been used in several projects, including the generation of C supersets OpenCL and OpenMP.

Generating C with some correctness guarantees is far less trivial than it may appear, with pitfalls abound: e.g., local variables may only be introduced in statement context; mutable variables are not first-class. Maintenance has challenges of its own, e.g., extensibility. On many examples, we expound the pitfalls we have come across in our experience, and describe the design and implementation to address them.

Version

The current version is August 29, 2023

References

generating-c.pdf [530K]
The paper ``Generating C: Heterogeneous Metaprogramming -- System Description''
(greatly extended, journal version of the FLOPS 2022 paper) is published in
Science of Computer Programming, v.231, January 2024, 103015, Elsevier. doi:10.1016/j.scico.2023.103015

Motivation

``This make me think that the authors are under the impression that compilers can magically transform bad code into efficient high-quality code by clever enough optimizations. As a compiler writer, I consider this a complete myth that many students and non-compiler-developers hope is a truth. I agree that compilers should aim to improve the efficiency of the code given as input, but one cannot expect the compiler to recognize algorithms and swap them for ones with better asymptotic complexity or make other major code transformations that really should be the job of the programmer. [emphasis mine]''

Magnus Myreen. Quoted by permission.

The reason for existence of code generation is helping programmers make major code transformations -- by conveniently and concisely expressing the transformation steps and by relieving the programmers from worrying if the result even compiles.

Offshoring

Offshoring is a facility to translate an imperative, first-order subset of OCaml to C or other low-level language. It is based on an observation that a first-order subset of OCaml may be regarded as a non-standard notation for C. Indeed, consider the vector addition code

    let addv = fun n vout v1 v2 ->
       for i=0 to n-1 do 
           vout.(i) <- v1.(i) + v2.(i) done

It is rather easy to imagine the C code it corresponds to. If we could implement this correspondence, then by generating OCaml we may, in effect, generate C. Offshoring is the library that realizes such a correspondence. Thus,

If we can generate code in a host language with some correctness guarantees, and
If we can translate the generated host language code into C preserving the guarantees

then we attain C code generation with the correctness guarantees. The first premise is homogeneous metaprogramming: the facility provided by MetaOCaml. It indeed generates OCaml code that is assuredly well-formed, well-typed, and well-scoped. The second premise is the job of offshoring. In effect, it turns homogeneous metaprogramming into heterogeneous.

Offshoring is very different from an OCaml-to-C compiler. The latter has to handle the full OCaml. Offshoring needs to deal only with a small, imperative subset of OCaml, without closures, higher-order functions, algebraic data types, etc. Therefore, it is much simpler than a compiler to implement and maintain, and much easier to ensure correctness. Although the supported subset of OCaml is small, it has proved adequate for numerical, high-performance computing (including OpenCL and OpenMP programming) and embedded system programming.

Offshoring was first proposed by Eckhardt, Kaiabachev, Pasalic, Swadi and Taha (New Generation Computing, 2007). It is completely re-thought and re-implemented in BER MetaOCaml since version N107. The paper Generating C describes this system, also mentioning why the original offshoring became unavailable and what has been done to make the re-implementation to last. Here we elaborate the running example of offshoring, and point to further examples.

The running example is the vector addition code addv above. The input to offshoring is MetaOCaml generated code: so-called closed code value. In our example, the simplest way to obtain such a value is by enclosing the addv code in MetaOCaml brackets:

    let addv_staged = 
      .<fun n vout v1 v2 -> for i=0 to n-1 do vout.(i) <- v1.(i) + v2.(i) done>.

Code values can be printed out:

    val addv_staged : (int -> int array -> int array -> int array -> unit) code = .<
      fun n_1 vout_2 v1_3 v2_4 ->
              for i_5 = 0 to n_1 - 1 do
                Stdlib.Array.set vout_2 i_5
                  ((Stdlib.Array.get v1_3 i_5) + (Stdlib.Array.get v2_4 i_5))
              done>.

We see the original addv, but with the desugared array indexing operations and with renamed identifiers, to ensure uniqueness.

To easily support a variety of low-level languages, offshoring actually maps a subset of OCaml to a simple imperative `intermediate language' (IR). One may see the IR code for the running example by

    let _ = Offshoring.(offshore (module DefaultConv) addv_staged)

whose result is

    - : proc_t =
    ([("n_1", TInt); ("vout_2", TArray1 TInt); ("v1_3", TArray1 TInt); ("v2_4", TArray1 TInt)],
     TUnit,
     Block (<abstr>,
      For
       {id = "i_5"; ty = TInt;
        lwb = Const (Const_int 0);
        upb = LocalVar "n_1";
        step = Const (Const_int 1);
        body = Block (<abstr>,
          Exp
           (FunCall ("array1_set",
             [LocalVar "vout_2"; LocalVar "i_5";
              FunCall ("+",
               [FunCall ("array1_get",
                 [LocalVar "v1_3"; LocalVar "i_5"]);
                FunCall ("array1_get",
                 [LocalVar "v2_4"; LocalVar "i_5"])])])))}))

Such an IR expression may then be `pretty-printed', say, to C. The function offshore_to_c combines the IR translation with pretty-printing. For example,

    let _ = Offshoring.offshore_to_c ~name:"addv" addv_staged

produces (on the standard output; the optional ~out argument may specify a different output stream)

    void addv(int n_1,int * vout_2,int * v1_3,int * v2_4){
      for (int i_5 = 0; i_5 < n_1; i_5 += 1)
        (vout_2[i_5]) = (v1_3[i_5]) + (v2_4[i_5]);
    }

which is how one might have imagined the C code corresponding to the original OCaml addv.

Suppose the size of the arrays to add is statically known, and is small enough so that we wish to fully unroll the loop. To generate the unrolled code, we first have to generalize addv:

    let iota n = List.init n Fun.id
    
    let addvg n vout v1 v2 =
      iota n |> List.iter (fun i -> vout.(i) <- v1.(i) + v2.(i))

and test on sample arrays that addvg has the same behavior as addv.

Trying to offshore addvg naively

    let addvg_staged_full =
      .<fun n vout v1 v2 -> 
        iota n |> List.iter (fun i -> vout.(i) <- v1.(i) + v2.(i))>.

gives an error however: the code value (the bracketed expression) contains a first-class function and is hence outside the offshorable subset.

If the size n of the arrays is indeed known at the generation time, addvg may be staged differently:

    let addvg_staged n =
      .<fun vout v1 v2 -> 
        .~(iota n |> List.map (fun i -> .<vout.(i) <- v1.(i) + v2.(i)>.) |> seq)>.
    
    (* val addvg_staged : int -> (int array -> int array -> int array -> unit) code *)

where seq : unit code list -> unit code builds a code sequence. As the inferred type indicates, addvg_staged takes an integer and produces the code of a function of three array arguments. For example, let addvg_staged4 = addvg_staged 4 generates:

    val addvg_staged4 : (int array -> int array -> int array -> unit) code = .<
      fun vout_11 v1_12 v2_13 ->
            Stdlib.Array.set vout_11 0
              ((Stdlib.Array.get v1_12 0) + (Stdlib.Array.get v2_13 0));
            Stdlib.Array.set vout_11 1
              ((Stdlib.Array.get v1_12 1) + (Stdlib.Array.get v2_13 1));
            Stdlib.Array.set vout_11 2
              ((Stdlib.Array.get v1_12 2) + (Stdlib.Array.get v2_13 2));
            Stdlib.Array.set vout_11 3
              ((Stdlib.Array.get v1_12 3) + (Stdlib.Array.get v2_13 3))>.

We may again test that the generated code has the behavior of addv 4. Finally, we map it to C:

    let _ = Offshoring.offshore_to_c ~name:"addv4" addvg_staged4

obtaining

    void addv4(int * vout_11,int * v1_12,int * v2_13){
      (vout_11[0]) = (v1_12[0]) + (v2_13[0]);
      (vout_11[1]) = (v1_12[1]) + (v2_13[1]);
      (vout_11[2]) = (v1_12[2]) + (v2_13[2]);
      (vout_11[3]) = (v1_12[3]) + (v2_13[3]);
    }

Thus, addv is easy to stage for no unrolling, but addvg for full unrolling. One may wish to write vector addition in such a way so that no unrolling, full unrolling, partial unrolling, scalar promotion etc. can be done without constantly re-writing it. The example addv.ml demonstrates how to write the vector addition once, in a clear, textbook form -- and apply a variety of optimizations in a modular way, reusing or extending earlier optimizations. Eventually, after strip mining and scalar promotion we obtain C code with the OpenBLAS look-and-feel:

    void addv(int n_16,double * ar_17,double * ar_18,double * ar_19){
      const int t_21 = n_16 & -4;
      const int t_22 = (0 + 3) & -4;
      for (int i_33 = 0; i_33 < t_22; i_33 += 1)
        (ar_17[i_33]) = (ar_18[i_33]) + (ar_19[i_33]);
      for (int i_24 = t_22; i_24 < t_21; i_24 += 4){
        const double t_26 = (ar_18[i_24 + 0]) + (ar_19[i_24 + 0]);
        const double t_28 = (ar_18[i_24 + 1]) + (ar_19[i_24 + 1]);
        const double t_30 = (ar_18[i_24 + 2]) + (ar_19[i_24 + 2]);
        const double t_32 = (ar_18[i_24 + 3]) + (ar_19[i_24 + 3]);
        (ar_17[i_24 + 0]) = t_26;
        (ar_17[i_24 + 1]) = t_28;
        (ar_17[i_24 + 2]) = t_30;
        (ar_17[i_24 + 3]) = t_32;
      }
      for (int i_23 = t_21; i_23 < n_16; i_23 += 1)
        (ar_17[i_23]) = (ar_18[i_23]) + (ar_19[i_23]);
    }

References

offshore_simple.ml [9K]
The code for the running example, among others

An illustration of the extensibility of offshoring: generating C code that includes calls to library functions (C file stdio, in our case) using library-specific data types (FILE)
offshore_ext.ml [3K]
The code for the example
file_stub.ml [2K]
The stub for a simple subset of C FILE i/o
offshore_ext_gen.c [<1K]
Testing the generated C code

addv.ml [15K]
The complete development for the advanced example. It generates addv_offshored.c when made
addv_gen.c [2K]
Testing the generated C code

Generating C
The paper that presents the new offshoring, describing the attractiveness, and also pitfalls and challenges of treating a subset of OCaml as a notation for C

Tagless-final embedding of (a subset of) C in OCaml

Embedding a subset of C into OCaml as a DSL in tagless-final style is another way to generate low-level code with some correctness guarantees. At the very least, the generated code should be well-formed, well-typed and well-scoped -- in other words, it shall compile without errors or (serious) warnings. Unlike offshoring, which is implicitly heterogeneous metaprogramming, in the tagless-final embedding heterogeneous programming is explicit. The paper Generating C presents it in detail. Here we show the running example, and point out others.

The running example is vector addition: the same example used for Offshoring. To remind, it is written in OCaml as

    let addv = fun n vout v1 v2 ->
       for i=0 to n-1 do 
           vout.(i) <- v1.(i) + v2.(i) done

We now try to write it as an embedded DSL program; specifically, a program in a statement-oriented DSL. Many low-level languages draw a sharp distinction between an expression and a statement. Consider C: x < 10 is an expression, but while(x < 10) {x = foo();} is a statement. Only expressions are allowed as the terminating condition of while loop. An expression always yields a value; a statement often does not. Still, there are value-yielding statements: return.

By examining the vector addition code, we determine the DSL operations needed to write it:

    module type cde = sig
      type 'a exp                           (* Abstract type of an expression *)
    
      val int   : int -> int exp
      val ( + ) : int exp -> int exp -> int exp
      val ( - ) : int exp -> int exp -> int exp
    
      type 'a stm                           (* A statement *)
      val for_ : int exp -> int exp -> (int exp -> unit stm) -> unit stm
    
      type 'a arr                           (* An array (variable) *)
      val array_get : 'a arr -> int exp -> 'a exp
      val array_set : 'a arr -> int exp -> 'a exp -> unit stm
    
      type 'a proc_t                        (* The complete function *)
      val mk3arr : 
        (int exp -> int arr -> int arr -> int arr -> unit stm) -> 
        (int -> int array -> int array -> int array -> unit) proc_t
    end

It is a typed first-order statement-oriented DSL embedded in OCaml. In the tagless-final style, an embedded DSL is specified as an OCaml signature, which defines types for DSL categories, and combinators that correspond to DSL expression forms. (Speaking algebraically, the embedded language is represented as a (multi-sorted) algebra whose operations are the syntactic forms of the language.) The type 'a exp represents DSL expressions of type 'a; 'a stm statements, and 'a proc_t complete procedures. Expressions are built of integer literals and variables (at present, arguments) using addition, subtraction and array dereference operations. Statements are the for-loop and array element assignment. The operation mk3arr builds the procedure header (of procedures taking three array arguments) and provides arguments to use in the function body. It looks too particular. We can generalize later. The strength of the tagless-final approach is extensibility: therefore, we may start with a particular and add more general when and if needed.

The operations of the cde signature indeed let us express vector addition:

    module ADDV(C:cde) = struct
      open C
      let addv_dsl = mk3arr @@ fun n vout v1 v2 ->
        for_ (int 0) (n - int 1) @@ fun i ->
          array_set vout i @@ array_get v1 i + array_get v2 i
    end

It looks quite like the OCaml addv code above, after desugaring of array indexing operators. With custom indexing operators one can make addv_dsl even more resembling addv.

One obvious implementation of the signature is metacircular: 'a exp, 'a stm and 'a proc_t are all just 'a. That is, the DSL is (a subset of) OCaml itself. Such an implementation is useful for testing. We leave it as an exercise to the reader.

Another implementation is in terms of the MetaOCaml code:

    module MOCde : (cde with type 'a proc_t = 'a code) = struct
      type 'a exp = 'a code
    
      let int   = fun x -> .<x>.
      let ( + ) = fun x y -> .<.~x + .~y>.
      let ( - ) = fun x y -> .<.~x - .~y>.
    
      type 'a stm = 'a code
      let for_ lwb upb body =
          .<for i= .~lwb to .~upb do .~(body .<i>.) done>.
    
      type 'a arr = 'a array code
      let array_get a i   = .< (.~a).(.~i) >.
      let array_set a i v = .< (.~a).(.~i) <- .~v >.
    
      type 'a proc_t = 'a code                       (* The complete function *)
      let mk3arr body = 
          .<fun n vout v1 v2 -> .~(body .<n>. .<vout>. .<v1>. .<v2>.)>.
    end

It is in effect a compiler, from the DSL to OCaml. For example, let module M = ADDV(MOCde) in M.addv_dsl produces

    - : (int -> int array -> int array -> int array -> unit) code = .<
      fun n_1 vout_2 v1_3 v2_4 ->
              for i_5 = 0 to n_1 - 1 do
                Stdlib.Array.set vout_2 i_5
                  ((Stdlib.Array.get v1_3 i_5) + (Stdlib.Array.get v2_4 i_5))
              done>.

which is identical to the result addv_staged in the Offshoring example earlier.

But we are interested in C. Let's write the implementation of cde that generates C (repetitive definitions are elided; see the complete code for detail):

    module CCde  = struct
      type 'a exp = string
    
      let int = string_of_int
      let ( + ) = Printf.sprintf "(%s + %s)"
    
      type 'a stm = string
      let for_ lwb upb body =
        let i = gensym "i" in
        "for(int " ^ i ^ "=" ^ lwb ^ "; " ^ i ^ "<=" ^ upb ^ "; " ^ i ^ "++){\n" ^
        "  " ^ body i ^ "}"
    
      type 'a arr = string
      let array_get = Printf.sprintf "%s[%s]"
      ...
    end

Now,

    let _ = let module M = ADDV(CCde) in print_endline M.addv_dsl

outputs

    void addv(int n_1, int* vout_2, int* v1_3, int* v2_4){
    for(int i_5=0; i_5<=(n_1 - 1); i_5++){
      vout_2[i_5]=(v1_3[i_5] + v2_4[i_5]);}
    }

This is exactly the addv C code produced by Offshoring. We have generated both OCaml and C code from the same DSL expression (addv_dsl).

The C and OCaml code for addv indeed look very similar, which was the inspiration for offshoring. They both look similar to addv_dsl. The similarities should not be surprising: the addv expressions in different languages are different interpretations of the same DSL expression addv_dsl. Speaking algebraically, they are different homomorphic images of the same term. One may write many more similar implementations of the cde signature, to obtain code in Fortran, WASM, LLVM IR, etc.

There is much room for improvement. For example, printf and string concatenations are an eye sore. Not for the DSL user however: They are all hidden behind the abstraction layer. The end user only uses the combinators of the cde signature, as in addv_dsl. The combinators are typed and enforce the typing discipline on the generated language. The type of the for_ combinator, for example, ensures that the lower and upper bounds of the loop are integer expressions, and the loop body is a statement. The existence of the metacircular or MOCde implementations shows that the typing is compatible with OCaml typing, and is, hence, sound. To ascertain that only well-formed code is generated, we only have to examine the implementation CCde (and check, e.g., that statement generators produce code ending in a closing brace or a semicolon). After such check we may claim that any well-typed application of the combinators produces well-formed and well-typed code -- without any need to examine that code. (Our actual implementation uses C AST for C code generation, which ensures well-formedness. C AST does not ensure well-typedness however; but the tagless-final encapsulation does.) Finally, because all the details of generation are hidden in DSL combinators, the user-written code like addv_dsl remains legible.

References

tf_simple.ml [13K]
The complete code, corresponding to the simple examples of offshoring, offshore_simple.ml [9K]

Mutable variables in tagless-final style

Let's consider the first extension to the addv DSL developed earlier: mutable variables. They are not as simple as they appear (see the separate web page on this topic). The culprit, as usual, is aliasing. Mutable variables presented a big problem for offshoring, which was solved just recently. The signature below incorporates the learned lessons: the main is distinguishing mutable variables, represented by the dedicated type 'a mut. From the DSL point of view, this is the syntactic distinction: Mutable variables are not expressions. Therefore, mutable variables in our DSL (as in C, Pascal, etc.) are not first class: one cannot pass them as an argument or return as a result. One may only dereference of assign to them.

    module type cdemut = sig
      include cde
    
      val float  : float -> float exp           (* also add floats for the sake of example *)
      val ( +. ) : float exp -> float exp -> float exp
    
      type 'a mut                               (* mutable variables *)
      val (let*) : 'a exp -> ('a mut -> 'w stm) -> 'w stm
      val dref   : 'a mut -> 'a exp
      val (:=)   : 'a mut -> 'a exp -> unit stm
    
      val (@.)   : unit stm -> 'a stm -> 'a stm (* sequencing *)
    
      val ret    : 'a exp -> 'a stm             (* return *)
    
      val mk1arr : 
        (int exp -> float arr -> 'a stm) -> 
        (int -> float array -> 'a) proc_t
    end

The types make it clear that mutable variables may only be introduced in the statement context. Aliasing is impossible: an expression like

    let* x = 0 in let* y = x in ...

gives a type error pointing at the second x:

    Error: This expression has type int mut
    but an expression was expected of type 'a exp

We may only write

    let* x = 0 in let* y = dref x in ...

where it is clear that y is not an alias of x; it is a new mutable DSL variable initialized to the current value of x.

The extended cdemut signature lets us write the next example: vector summation.

    module VSUM(C:cdemut) = struct
      open C
      let vsum = 
        mk1arr @@ fun n v ->
        let* sum = float 0. in
        begin for_ (int 0) (n - int 1) @@ fun i ->
          sum := dref sum +. array_get v i end
        @.
        ret (dref sum)
    end

Since defining new variables in C requires writing their type, our compiler has to perform type inference, which is straightforward. (Again, some definitions are elided for clarity.)

    module CMutCde : (cdemut with type 'a proc_t = string) = struct
    
      type 'a typ = 
        | TInt   : int typ
        | TFloat : float typ
        | TUnit  : unit typ
    
      let string_of_typ : type a. a typ -> string = function
        | TInt   -> "int"
        | TFloat -> "double"
        | TUnit  -> "void"
    
      type 'a exp = 'a typ * 'a CCde.exp
    
      let int x   = (TInt,   string_of_int x)
    
      (* typing rule *)
      let trule2 : type a b c. a typ -> b typ -> c typ ->
        (a CCde.exp -> b CCde.exp -> c CCde.exp) ->
        (a exp -> b exp -> c exp) = fun ty1 ty2 tyr f ->
          fun (_,x) (_,y) -> (tyr, f x y)
    
      let ( + ) = trule2 TInt TInt TInt CCde.( + )
      let ( +.) = trule2 TFloat TFloat TFloat (Printf.sprintf "(%s + %s)")
    
      type 'a stm = 'a typ * 'a CCde.stm
    
      let for_ (TInt,lwb) (TInt,upb) body =
        (TUnit,
         CCde.for_ lwb upb (fun i -> let (TUnit,b) = body (TInt,i) in b))
    
      type 'a arr = 'a typ * 'a CCde.arr
      let array_get (ty, a) (TInt, i) = (ty, CCde.array_get a i)
    
      let ret (t,x) = (t, Printf.sprintf "return %s;" x)
      let (@.) (TUnit,s1) (t,s2) = (t,Printf.sprintf "%s\n%s" s1 s2)
    
      type 'a mut = 'a typ * string
      let (let*) : 'a exp -> ('a mut -> 'w stm) -> 'w stm = fun (t,x) body ->
        let v = gensym "x" in
        let (tb,b) = body (t,v) in
        (tb,Printf.sprintf "%s %s = %s;\n%s" (string_of_typ t) v x b)
    
      let dref (t,v) = (t,v)
      let (:=) (_,v) (_,x) = (TUnit, Printf.sprintf "%s = %s;" v x)
    end

Since CMutCde is an extended interpreter, it can interpret the earlier code like ADDV, with the same result. For vector summation,

    let _ = let module M = VSUM(CMutCde) in print_endline M.vsum

we obtain

    double sumv(int n_11, double* v_12){
    double x_13 = 0.;
    for(int i_14=0; i_14<=(n_11 - 1); i_14++){
      x_13 = (x_13 + v_12[i_14]);}
    return x_13;
    }

We can also implement the signature cdemut in terms of the Offshoring IR (see the paper), and hence take the full advantage of its pretty-printers, to C or other low-level languages.

References

tf_simple.ml [13K]
The complete code, corresponding to the simple examples of offshoring, offshore_simple.ml [9K]

Mutable Variables and Reference Types

Generating low-level code from a higher-level language

Generating C

Motivation

Offshoring

Tagless-final embedding of (a subset of) C in OCaml

Mutable variables in tagless-final style

Last updated January 1, 2024