previous   next   up   top

Tutorial: Systematic Generation of Optimal Code with MetaOCaml

 

A common application of generative programming is building high-performance computational kernels highly tuned to the problem at hand. A typical simple linear algebra kernel is specialized to data representation, loop unrolling factors, and the a priori knowledge (e.g., if a matrix is symmetric, almost tridiagonal, etc.). It is tedious and error prone to specialize by hand, writing numerous variations of the same algorithm.

The widely used generators such as ATLAS and SPIRAL reliably produce highly tuned specialized code, but they are difficult to extend. In ATLAS, which generates C with printf, even ensuring that parentheses are balanced is a challenge. According to ATLAS creator, debugging is nightmare.

A typed staged programming language such as MetaOCaml lets us state a general, obviously correct algorithm and add layers of specializations in a modular way. By ensuring that the generated code always compiles and letting us quickly run it, MetaOCaml makes writing generators less daunting and more productive. The tutorial answers two simple Shonan Challenges.

The tutorial on systematic generation of optimal numeric kernels using the staged programming language MetaOCaml was presented at the tutorial session of the conference of Commercial Users of Functional Programming (CUFP 13) on September 23, 2013 in Boston, USA. This page collects the notes for the tutorial along with the examples, problems and exercises.

Tutorial problems


 

Overview of the tutorial

The goal of the tutorial is to teach how to write typed code generators, how to make them modular, and how to introduce local domain-specific optimizations with MetaOCaml. By the end of the tutorial the participants implement a simple domain-specific language (DSL) for linear algebra, with layers of optimizations for the memory layout of matrices and vectors and their algebraic properties. They will generate a few simple and optimal BLAS kernels. Hopefully the participants will see that writing generators is not too complicated and that (staged) types are of great help. They will get the taste of the ``Abstraction without guilt''.

The participants are not expected to know MetaOCaml but should be familiar with a modern functional language, in particular OCaml, F#, SML, Haskell.

The tutorial is meant to be interactive. There will be many questions to the audience, and, hopefully from the audience. The tutorial is built around live coding, in interaction with the MetaOCaml and its type checker. The tutorial includes many exercises and larger, homework, projects, to work alone or in groups.

References
Lightweight Modular Staging: Program Generation and Embedded Compilers in Scala
< http://scala-lms.github.com >
Similar in spirit tutorials on optimizing numeric code with metaprogramming facilities of Scala

 

Why MetaOCaml

Writing code generators in a typed staged language like MetaOCaml benefits in several ways. First, the generated code will be well-formed, with all parentheses matching. Such a guarantee is a dear wish when writing C with printf (as done in ATLAS) or C++ with Matlab. The generated code is well-typed and so will compile without errors. There is no longer puzzling out a compilation error in the generated code, which is typically large, obfuscated and with unhelpful variable names. Mainly, code generation errors are reported in terms of the generator rather than the generated code. The tutorial will give many chances to see the importance of good error reporting.

MetaOCaml generators are hygienic, producing well-scoped code, with no unbound variables. Otherwise, hygiene violations are quite hard to detect in practice and may lead to the devious error of unintentionally bound variables. Most importantly, MetaOCaml is typed. Types, staged types in particular, really do help write the code. All throughout the tutorial we will be writing code in live interaction with the type checker -- accepting type errors not as a punishment but as a valuable hint. We shall see on many occasions that once we fix the type signature, the generator practically writes itself. The type checker will tell us where to put a staging annotation.

The staging annotations of MetaOCaml are like the ``assembler'' instructions of metaprogramming. We need higher-level abstractions. The final benefit of MetaOCaml is that it is part of OCaml, and hence can take the full advantage of OCaml's abstraction and combination facilities, from higher-order functions to modules.

 

Tutorial problems

The tutorial is based on the progression of problems, solved in interaction with the audience and MetaOCaml. Except for the introductory one, all are slightly simplified versions of real-life problems. In fact, the last two were suggested by high-performance computing (HPC) researchers, as challenges to program generation community (Shonan challenges). The common theme is building high-performance computational kernels highly tuned to the problem at hand. Hence most problems revolve around simple linear algebra -- a typical and most frequently executed part in high-performance computing.

 

Introduction to staging and MetaOCaml: Power

The tutorial starts with the introduction to staging and MetaOCaml on the power example -- the example used by A.P.Ershov in his 1977 paper that has launched partial evaluation. The example demonstrates all features of MetaOCaml: brackets and escapes to build code values, cross-stage persistence, printing code values, and compiling the generated code and executing it -- so-called `running' of code values.

Two insights are worth stressing. First, a code value may represent open code, with free variables. In particular, in .<fun x -> .~(r := .<x>. ...)>. the code value .<x>. is the variable ``x'' in the generated code. We may store such variables in reference cells and pass them as arguments and function results. Staging hence lets us manipulate (future-stage) variables symbolically. However, we cannot determine the real name of a future-stage variable let alone tamper with it. Also, once a code value is constructed, it cannot be taken apart and examined. This pure generativity of MetaOCaml helps maintain hygiene: open code can be manipulated but the lexical scoping is still preserved.

The second insight is two different ways of using MetaOCaml. First is run-time specialization: specializing a frequently used function to some data obtained at run-time, e.g., from user input -- see the file power_rts.ml. Here running the code is essential. The generated code can also be saved into a file. Since the generated code is ordinary OCaml, it can be compiled with ordinary OCaml compilers, even ocamlopt, and later linked into various ordinary OCaml applications. In this approach, the MetaOCaml facility to run code values is not essential: running is used for testing or for empirical code search.

Thus, MetaOCaml can be used not only for run-time specialization, but also for offline generation of specialized library code, e.g., of BLAS and Linpack libraries. MetaOCaml can implement sane ATLAS. The further examples in the tutorial will stress this point.

References
power.ml [6K]
Code and explanations

power_rts.ml [2K]
The example of run-time specialization

square.ml [<1K]
The externally defined `square' function

 

Compile an image-processing DSL

A common application of staging is building DSL compilers. First, one implements the DSL as an interpreter; its staging yields a compiler. A computationally intensive image-processing DSL will clearly benefit from compilation. This problem is meant for advanced users who are already familiar with MetaOCaml. They are encouraged to work on this problem during the MetaOCaml introduction.

Our gray-scale image processing DSL is fairly minimal, conventional, untyped -- with the following abstract syntax:

     type binop = Add | Sub | Mul | Lt | Gt | Eq
     type exp =
       | LitInt of int                       (* integer literal *)
       | Load of string                      (* load from a file *)
       | Display of exp                      (* display the image *)
       | It                                  (* the variable `it' *)
       | Binop of exp * binop * exp          (* binary operation *)
       | If of exp * exp * exp               (* if-then-else conditional *)
       | Iterate of exp * exp                (* Iterate img exp *)

The only unfamiliar operation iterate takes an image and an expression and evaluates the expression on each pixel of the image, binding it to the current pixel value. The result of evaluating the expression, which must be an integer, is stored in the current pixel. Written with a bit of syntactic sugar, the following DSL expression

     display (iterate (load "takaosan.pgm")
                (if_ (it >% int 128) (int 2 *% (it -% int 128)) (int 0)))
loads an image, thresholds and enhances its luminance, and displays the result. Clearly interpreting the DSL has the overhead of analyzing and dispatching on the expression's syntax (interpretive overhead) plus the overhead of run-time tags (which characterize a value as int, boolean or image).

The task is to write a compiler for the DSL. To save time, the tutorial provides the interpreter eval, along with the supporting library of reading and displaying images. We are to stage the interpreter, specializing eval to the given DSL program. Does your compiler remove both the interpretive and the tagging overheads? Has the staging also given us the DSL type checker?

References
imgdsl.mli [2K]
The definition of the DSL

img_sugar.ml [<1K]
Syntactic sugar

img_trans.ml [<1K]
Transcript of a sample session

img_interp.ml [3K]
The interpreter to stage. The code includes a few tests.

grayimg.mli [2K]
grayimg.ml [5K]
Low-level image processing library

takaosan.pgm [1302K]
A sample image to process: Takao-san view, April 2008

Christoph Armin Herrmann and Tobias Langhammer: Combining partial evaluation and staged interpretation in the implementation of domain-specific languages.
Science of Computer Programming, 2006, N1, v62, pp. 47-65
< http://dx.doi.org/10.1016/j.scico.2006.02.002 >
Our tutorial problem is a very simple version of the image-processing DSL compiler implemented in the above paper.

 

Generating optimal FIR filters

Filters are all around in radio, sound and image processing. With specialized DSP and now fast CPU, it became possible to filter signals digitally, in software. Digital filters transform digitized signals represented as sequences of samples, or uniformly sampled signal values. This tutorial deals with so-called finite impulse response (FIR) filters, of the following general form
     y[i] = Sum{b[k] * x[i-k] | k=0..m-1}
The sample of the output signal y[i] is a linear combination of m current and past input signal samples x[i], x[i-1],... x[i-m+1]. The values b[k] are filter coefficients. Since the number of coefficients (and often their values) are usually known, it makes sense to specialize the filter computation. Speed is important: filters often process long signals or work online.

The filtering example offers plenty of exercises and small projects: try different boundary conditions and smoothing windows; do filtering inplace; try IIR (recursive) filters; play with sound; implement a design methodology for particular kinds of filters (low-pass, two-tap, etc.); write a framework, graphical or otherwise, for filter design.

The filtering example teaches us loop unrolling and the importance of let-insertion. We see that great many effective optimizations depend on domain knowledge and so cannot be realistically implemented in a general-purpose optimizing compiler. We also see that MetaOCaml is powerful enough for the needed optimizations, but the techniques so far were ad hoc. We need a systematic design, building optimizations from separate components. We need DSL. The rest of the tutorial will build a DSL, for linear algebra.

References
filter.ml [9K]
The illustrated tutorial for specializing FIR filters

Richard W. Hamming: Digital Filters
1997, 1983, 1989. 284 pp.
A lucid, enjoyable introduction to digital filters

Julius O. Smith III: Introduction to Digital Filters with Audio Applications Stanford Center for Computer Research in Music and Acoustics (CCRMA)
< https://ccrma.stanford.edu/~jos/filters/filters.html >

 

Complex vector arithmetic

This problem has been posed by an HPC researcher from the University of Toukyou -- as part of Shonan Challenges, by HPC researchers to meta-programmers. The problem is realistic, to specialize code for a particular data representation. Some architectures favor a so-called vector-of-structures representation whereas GPUs and vector processors favor the structure-of-vectors. Porting a program from GPGPU to a supercomputer or vice versa requires changing data representation. The goal is to automate such changes, to implement the algorithm only once and then mechanically adjust it to different data layouts.

Specifically the problem is to implement point-wise multiplication of two complex vectors, making it possible to systematically specialize the code to different representations for a complex vector: as complex array; as two separate float array, for real and imaginary parts; as a single float array with the real part in the first half; as BigArray.Array2, etc. We should be able to automatically generate different versions of the code for all these different representations.

Again there are plenty of exercises: try other choices for the layout of complex vectors; try the polar representation of complex numbers; generate code for other complex vector operations, for example, computing a * x[j] + y[j] where a is a scalar. (The common BLAS name for the latter operation is caxpy, 'p' stands for plus.) All these extension should be added modularly, reusing the already written code and the already implemented DSL.

References
Reiji Suda: Shonan Challenge: Complex number representation. June 8, 2012
< https://github.com/StagedHPC/shonan-challenge/blob/master/problems/complexNumber/ComplexNumber.txt >

complex_short.ml [10K]
code without two last exercises

cmplx.mli [<1K]
defining data type cmplx

 

Final Project

Having learned the systematic approach to linear algebra specializations, we should apply it to the digital filter specializations. The project is to re-implement the generators for optimal FIR filters, relying on the linear algebra DSL. Make sure to perform algebraic simplifications, so to generate truly optimal code for the echo filter.
References
Generating optimal FIR filters

 

Conclusions

Building even complicated generators is simple if we take advantage of abstraction and types. OCaml's excellent abstraction facilities -- from higher-order functions to module system -- let us write and debug generators in small pieces, and compose optimizations from separate layers. Staged types are of particularly great help. They help ensure that compiling the generated code produces no errors. Problematic generators are reported with helpful error messages that refer to the generator (rather than generated) code. Furthermore, on many occasions we have seen that to stage code, we merely need to give it the desired signature, submit the code to the type checker and fix the reported errors. Often the fix is adding a staging annotation at the flagged location. The type checker actively helps us write the code.


Last updated November 5, 2013

This site's top page is http://okmij.org/ftp/

oleg-at-pobox.com or oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!

Converted from HSXML by HSXML->HTML