Mutable Variables and Reference Types: L-values demystified and deprecated

 
In Fortran/Algol tradition, a mutable variable is a variable that can be directly mutated as such -- by assignment, increment, etc. In contrast, ML tradition uses variables of reference types, bound to `boxed values', or reference cells. These are ordinary, lambda-calculus variables that can be substituted for but not mutated. What is mutated is their value: the `box'. The two models seem equivalent and easily relatable. When we actually try to relate them (in the context of the implicitly heterogeneous metaprogramming: writing OCaml as if it were C), we discover many subtleties, which make the problem nearly intractable. One complication is that mutable variables are not first-class, and a compositional mapping of reference-type variables to mutable variables is simply impossible.

Upon systematic investigation we discover a way to formally relate the two traditions. Practically, we can write (or generate) first-order OCaml code using reference-type variables without restrictions, and straightforwardly map it to the efficient and idiomatic C and be certain of its meaning, in C. We may hence write in (a subset of) OCaml as if it were C, using to the full extent the modules, higher-order functions and other abstraction facilities of OCaml -- and obtaining all the performance guarantees of C code, benefiting from vectorization, etc. of modern C compilers.

In the Fortran/Algol tradition, the semantics of assignment such as x = x + 1 is non-compositional, and to understand it, L-values were introduced decades ago. L-values simplify assignment by taking all the complexity upon themselves. Our relation of two traditions showed L-values are not needed. What is complicated about the mutable-variable assignment is not its semantics but an unusual syntax: in C, it is a mix-fix operator * _ = _, further obfuscated by syntax sugar.


 

Two traditions of mutation

There are two traditions for dealing with mutable variables. In Fortran/Algol tradition (that continues to Lisp/Scheme, Pascal, C, C++, Java and their descendants) a mutable variable is a variable that can be directly mutated as such -- by assignment, increment and other such operations. In contrast, in ML tradition (typically ML-like languages, Haskell and some part of Scheme) mutable variables are represented by variables of reference types, bound to `boxed values', or reference cells. These are ordinary, `lambda-calculus' variables that can be substituted but not mutated. What is mutated is their value: the `box'.

The following table contrasts the two traditions on the simplest example (for concreteness, using C and OCaml).

Fortran/Algol
    int x = 1;
    x = x + 1;
ML
    let x : int ref = ref 1
    x := !x + 1
In Fortran/Algol tradition, the mutable variable has the type int: the type of the value that is retrieved and assigned. ML tradition defines an immutable, lambda-calculus--like variable, of a reference type: int ref. One also has to explicitly allocate the box: ref 1. Mutable variables can be accessed directly. In the ML tradition, they have to be explicitly de-referenced: see the ! operator. In ML tradition, dereference ! and assignment := are ordinary functions. In Algol tradition, assignment is nothing but. We will talk about it in more detail later.

 

Problem

Vast amounts of good, correct, useful code is written in either tradition. Either way of dealing with mutation seems to be fine. What's the problem?

There are two: practical and theoretical. Practically, we want to map code in the reference-type style to code with direct mutation: e.g., map imperative first-order OCaml code to C. The next section tells why it is useful. The theoretical problem is the meaning of mutation.

The theoretical problem is sure tough, but it was already solved, long time ago by Strachey. He introduced so-called L-values -- the notion that has become widely used, especially in C-like languages. Elegant as they are, L-values are not simple. One may say, L-values simplify the problem of assignment by shifting all the complexity upon themselves.

This work started from the need to solve the practical problem. Unexpectedly, the solution gave a new perspective on assignment and mutation, without any need for L-values.

The present page summarizes the presentations at the ML Family workshop and the IFIP WG2.1 meeting, extending them with better examples and clearer explanations. I'm grateful to the participants of these meetings for questions and comments, which indeed helped clarify the presentation.

References
Do Mutable Variables Have Reference Types?
ACM SIGPLAN ML Family workshop, September 15, 2022.

Mutable Variables and Reference Types - or, L-values Demystified
IFIP WG2.1 meeting, December 14, 2022

<http://arxiv.org/abs/2211.04107>
Preprint describing the formalization

Generating C
The found solution has been implemented, in MetaOCaml and in the tagless-final embedding of languages with mutable variables

 

Background: Generating C and Offshoring

Generating C (or other such low-level language) is inevitable if we want the convenience and guarantees of abstractions -- and we want the code that runs in a constrained environment (e.g., a low-powered robot); involves OpenMP, OpenCL (i.e., GPGPU) or AVX512 instructions; benefits from profitable but highly domain-specific optimizations typical in HPC. In fact, we have done all of the above, using the so-called offshoring technique first proposed in (Eckhardt, Taha et al., 2007) and re-thought and re-implemented in the current MetaOCaml.

The key idea of offshoring is the close correspondence between imperative first-order OCaml and C code -- so by generating OCaml we in effect generate C. Let's look at the two pieces of code below: vector addition in idiomatic OCaml and C. The differences in syntax cannot hide how strikingly similar the two pieces of code are.

    let addv = fun n vout v1 v2 -> 
      for i=0 to n-1 do
        vout.(i) <- v1.(i) + v2.(i) done
    void addv(int n, int* vout, int* v1, int* v2) {
     for(int i=0; i<=n-1; i++)
       vout[i] = v1[i] + v2[i];
    }
They are so similar that one may say that it is the same code, but in different surface syntaxes: different `dialects'. In other words, an imperative first-order subset of OCaml can be thought of as a different notation for C.

We can generate OCaml code using all conveniences and abstractions of OCaml as a host language, and with some correctness guarantees -- using MetaOCaml. If we can map the generated OCaml code to idiomatic C in a way that clearly preserves its meaning, we in effect accomplish generating C conveniently, and with correctness guarantees.

OCaml and C, however, follow different traditions when it comes to mutable variables. Hence comes the challenge of mapping code written in the reference-type style to direct-mutation, idiomatic and efficient C code. Alas, it is not simple. In fact, it is simply impossible.

References
Generating C
The paper that discusses in detail the need and the challenges of generating C

 

Mapping reference-type variables to mutable variables: simply impossible

Recall, the problem is to relate Algol and ML approaches to mutable variables. Specifically: to be able to write idiomatic imperative OCaml code, which makes sense by OCaml rules, and to straightforwardly map it to idiomatic C code, which makes sense by C rules -- the same sense. This mapping should be clear and easy to understand -- therefore, it should be compositional. We now demonstrate why such straightforward mapping is impossible.

Let's consider the following OCaml code

    !x + 1
where x is a variable of type int ref. This expression is clearly a part of some larger expression -- but a compositional translation should not care about it. That is what being compositional means, after all. The easiest mapping that comes to mind (at least, the one that described in Eckhardt, Taha et al, 2007 paper and was implemented, until very recently, in BER MetaOCaml) is:
SimpleMapping
OCaml variables of t ref types correspond to (or: are a model of) mutable C variables of type t. Therefore, !x in OCaml maps to just x in C.

SimpleMapping indeed works, in many cases, for example:

    let x = ref 0 in foo (!x + 1)
This OCaml code clearly corresponds to the following C code
    int x = 0; foo (x + 1);
The sub-expression !x + 1 indeed maps to x + 1 in C.

Consider, however, the same !x + 1 sub-expression in a different context:

    fun x -> foo (!x + 1)
The corresponding C code (that is, C code of the same meaning) should be
    void myfun(int * x) { foo (*x + 1); }
The very same !x + 1 now maps to *x + 1 in C.

Here is the second OCaml example:

    !x + foo x
where, again, x has the type int ref, and foo is obviously of type int ref -> int. SimpleMapping would give the C code
    x + foo(x)
which is quite wrong: it erases the difference between a pointer and what it points to. One may, and should, object: SimpleMapping is just undefined for a naked ref-type variable. That is not a good answer either since it makes our example code untranslatable. Peeking ahead, one might say that a naked ref-type OCaml variable x should map to &x in C. The whole example should hence be translated as:
    x + foo(&x);
This is a good translation. Consider, however, our expression in the following context:
    let rec foo x = !x + foo x
(a bit silly code, but should do for the example.) The corresponding C code should be
    int foo(int * x) { return *x + foo(x); }
with !x + foo x mapping to an unexpected expression.

As the final example, consider the OCaml expression

    !!x
where x has the type int ref ref. How should its C mapping look like? Should it be x, or *x, or **x?

The upshot: C has mutable variables and, separately, variables of pointer types. OCaml, on the other hand, uses ref-type variables to model both. The mapping from C to OCaml is, hence, non-injective. Therefore, the inverse (functional) mapping from ref-type OCaml variables to mutable C variables cannot exist.

One may attempt to build a closer model of mutable variables in OCaml, defined by the following signature:

    type 'a mut    (* abstract *)
    val newmut : 'a -> ('a mut -> 'w) -> 'w
    val demut  : 'a mut -> 'a 
    val setmut : 'a mut -> 'a -> unit
Although better, it is still not quite right since it permits the OCaml code like:
    newmut 1 (fun x -> foo x)
That is, this model lets us pass a variable of a mut type as an argument and return it as a result. In C, one cannot do that! (One can in C++). Mutable variables in C are not first-class. That is another problem with modeling them in OCaml, with OCaml variables, which are first-class.

 

State of the Art (until very recently)

Offshoring was proposed in 2007. It must have faced the problem of mapping OCaml variables of reference types. It did face, and did come up with the solution: SimpleMapping mentioned earlier. SimpleMapping does not generally work, as we have just seen. To make it work, the original offshoring outlawed all the cases where it might not work. It imposed strong restrictions on the code subject to offshoring:

These restrictions implicitly appeared in the Eckhardt, Taha. et al. 2007 paper -- although the paper itself never mentioned any restrictions. One infers them by carefully examining the typing rules in the Appendix of that paper. In effect, the restrictions completely outlaw aliasing -- which is the source of much complexity with mutation.

SimpleMapping with the accompanying restrictions has been the state of the art, and implemented in BER MetaOCaml, until very recently. To illustrate, the OCaml code

    let x = ref 0 in x := !x+1
is mapped to the following C code.
    int x = 0; x = x + 1;
However, the following, also simple imperative OCaml code
    let x = ref 0 in
    let y = x in
    y := 41; x := !x+1
violates the restrictions and cannot be offshored.

In my experience, the restrictions are more or less adequate for numeric code. They do force workarounds or small cheating from time to time (e.g., reference-type arguments -- result, or out arguments, in Pascal-speak -- are needed occasionally).

 

Semantics of Assignment

To gain further insight, let's look at the theoretical problem: the meaning of the simplest assignment below.
    x = x + 1;
We see x on both sides. Clearly, the two things named x do not mean the same (i.e., do not both mean an integer). Assignment, hence, is not an ordinary function: it is a special form, and cannot be handled compositionally. Whereas its right-hand--side is an expression (where x means the current integer value of the variable), the left-hand side must be something else. Strachey called it `L-value'.

L-values are not restricted to variable occurrences. The samples below exemplify the L-value rules in the C standard:

    x     = x + 1;
    a[i]  = a[i] + 1;
    s.f   = s.f + 1;
    sp->f = sp->f + 1;
    *p    = *p + 1;
    ( z > 0 ? a : b )[i] = a[i] + 1;
    // ( z > 0 ? x : y )   = x + 1;     // not L-value
    // ( z > 0 ? *p : *q ) = *p + 1;    // not L-value
    *( z > 0 ? p : q ) = ( z > 0 ? *p : *q ) + 1;
where we have assumed the declarations:
    int x = 1;
    int a[2] = {1,1}; int b[2] = {10,10}; int i = 1;
    struct sf {int f;} s = {.f = 1}; struct sf * sp = &s;
    int * p = &x;

Let's look again at the original assignment expression.

    x = x + 1;
The main insight is that the expression is a syntax sugar: an abbreviated form of the following full expression (which is valid C):
    *&x = *&x + 1;
It, in turn, is also syntax sugar: a mix-fix * _ = _ form of
    assign(&x,*&x + 1);
where assign is an ordinary C function (which could be implemented as a built-in):
    void assign(int *p, int v) { /* movl %esi, 0(%rdi) */ }
The two sides of assignment are hence ordinary expressions, but of different types. The mystery of assignment is solved: assignment can be given a compositional meaning after all. What is unusual is not semantics, but a mix-fix form of the assignment operator. Had it been the simple infix operator like *=, there would have been no mystery to start with.

The desugaring helps us understand that a mutable variable per se is not an expression. It is a syntactic category all by itself. That is why mutable variables are not first-class. Mutable variables can be turned into an expression by prefixing them with &, which gives a pointer-type expression. The dereferencing operator * is an ordinary function, of a pointer-type argument. Again, * that appears on the left-hand--side in the desugared assignment operation *p = e is not a separate operator. It is part of the mix-fix syntax * _ = _ of the assignment operation.

To re-iterate, the assignment operation *p = e is a mix-fix expression with * being part of the syntax. It can be re-written as an ordinary procedure invocation: assign(p,e), where both p and e are expressions, but of different types. In particular, p is a pointer-type expression. There is no longer any need for L-values. Let's look again an the L-value examples earlier. They all can be desugared into the * _ = _ mix-fix form, and then converted to the assign procedure invocation, if desired.

    *&x       = *&x + 1;
    *(a+i)    = a[i] + 1;
    *&s.f     = s.f + 1;
    *&(sp->f) = sp->f + 1;
    *p        = *p + 1;
    *(( z > 0 ? a : b )+i) = a[i] + 1;
    *( z > 0 ? p : q )     = ( z > 0 ? *p : *q ) + 1;
If one insists on L-values, they can now be defined much easily: an L-value is anything that starts with a star.
References
lvalues.c [2K]
Many forms of L-values in C (before and after desugaring)

 

Writing idiomatic OCaml as if it were idiomatic C

Again, our goal is to be able to write idiomatic imperative OCaml code, which makes sense by OCaml rules, and to straightforwardly map -- offshore -- it to idiomatic C code, which makes sense by C rules -- the same sense. We, hence, have to map reference-type variables of OCaml to mutable variables, used in idiomatic C. Alas, as we have seen, there is no hope for simple mapping, because the mapping from mutable variables of C to reference-type variables of OCaml is non-injective. OCaml's reference-type variables can be a model of either a mutable C variable, or a (constant) C variable of a pointer type. Any offshoring translation has to disambiguate somehow. The extant offshoring translation based on SimpleMapping disambiguates by putting pointer-type C variables and expressions out of scope.

We now present a different offshoring translation, which maps idiomatic imperative first-order OCaml to idiomatic C preserving meaning while imposing no restrictions whatsoever on reference/pointer types. References of references, pointers of pointers, mutable variables of pointer types are all allowed.

The translation does have to disambiguate OCaml reference-type variables representing C mutable variables from those representing pointer-type variables. To this end, the translation does mutability inference, whose starting point (axiom) is: the variable introduced by the let-expression of the following form

    let x : t ref = ref e in ...
corresponds to a mutable C variable of type t of the same name, initialized by e. All other OCaml variables correspond to constant C variables.

Here is the translation in full. It is parameterized by the set of free variables considered mutable.

For example, OCaml's

    let x = ref 0 x := !x+1
is translated to
    int x = 0; *&x = *&x+1;
which, if desired, can be prettified by dropping *& to
    int x = 0; x = x+1;

The extant offshoring cannot translate

    let x = ref 0
    let y = x in
    y := 41; x := !x+1
But the new translation can, producing (after prettifying)
    int x = 0;
    int * const y = &x;
    *y = 41; x = x+1;
Indeed, idiomatic, easy to understand OCaml is translated to idiomatic, easy to understand (by C standards) C of the same meaning.

Adding increment/decrement is easy: OCaml's incr is translated to ++* in C, in any context.

 

Solving the puzzle of the address-of operator

The address-of operator `&' of C can be applied to arbitrarily expressions, not just variables. Its semantics in that case is a puzzle: it is not fully strict (like `*' or `!' operators, which evaluate their arguments) and it is not fully non-strict (like sizeof) either. Our demystified L-values easily solve the puzzle.

The following C code demonstrates the unusual semantics of &, contrasting it with sizeof:

    #include <stdio.h>
    
    int sz(int ** p) {
      return sizeof p[1][2];
    }
    
    int * ad(int ** p) {
      return &(p[1][2]);
    }
    
    int main(void){
      int* p1 [] = {NULL,NULL,NULL};
    
      printf("sz p1: %d\n",sz(p1));     // prints 4
      printf("sz NULL: %d\n",sz(NULL)); // prints 4
    
      printf("ad p1: %p\n",ad(p1));     // prints 0x8
      printf("ad NULL: %p\n",ad(NULL)); // segmentation fault
    }
As expected, sizeof does not evaluate its argument; therefore p[1][2] is always defined, even if p is NULL. The operator & is different, and puzzling: it seems to evaluate its argument, but only to some extent.

The C standard deals with the puzzle by defining four different cases of & expressions: applying & to functions; to L-value expressions; to expressions of the form * exp; and expressions of the form exp[exp]. The latter two are special cases. We now give the semantics to & without needing special cases.

First, we introduce an `internal' address-of, which applies only to variables (including register-storage--class variables). Our analysis of assignment earlier actually used this internal address-of. Its semantics is the location of the variable: an address in memory or the register. One may think of the internal address-of as producing an operand for an assembly instruction (such as %rdi, (%rdi), 10(%rsp), etc.). The `surface' operator & of C is then a `macro' of sort:

  1. If applied to a variable (which can be a function name), it checks its storage class. If register, report an error. Otherwise, apply the internal address-of operator;
  2. In all other cases, & must be applied to an expression that starts with * (perhaps after desugaring). The & operator removes the *. Recall, on our analysis every L-value expression starts with * (or: an expression that starts with *, perhaps after desugaring, is defined to be L-value.)

Let us see how it works on our example. The expression p[1][2] is a sugared version of *(p[1]+2). Therefore,

    &p[1][2] ≡  &*(p[1]+2) ≡ p[1]+2
and the ad function above is equivalent (desugars) to
    int * ad(int ** p) {
      return p[1]+2;
    }
which explains the program behavior, with no puzzles.
References
<https://en.cppreference.com/w/c/language/operator_member_access>
The C standard on the & operator

 

Formalities

The new translation above was presented rather informally and hand-wavingly. But there are formalities and theorems.
References
reftypes.pdf [444K]
Do Mutable Variables Have Reference Types? Short paper with formalization, etc.
<http://arxiv.org/abs/2211.04107>

The paper introduces calculi ICaml (a model of OCaml with reference types) and CoreCE (a simple first-order language with mutable variables) and formally defines the translation of the former to the latter. It states the meaning preservation theorem.

refcalculi.ml [12K]
The code gives (executable) denotational semantics of ICaml and CoreCE (as tagless-final interpreters) and defines the translation in full detail, as an ICaml interpreter in terms of CoreCE. The mere fact the latter interpreter type-checks proves the translation type-preserving. The meaning preservation needs (an informal, at present) argument examining the mapping of the denotations of expression forms. (Quite a few mappings are syntactically the identity.)

 

Conclusions