Joint work with Jacques Carette.
< http://www.cas.mcmaster.ca/~carette/metamonads/index.html >
Besides the paper, the page points to the generator code, and the generated Gaussian Elimination codes to handle real and integer matrices using various pivoting strategies.
Shifting the Stage: Staging with Delimited Control
Section 5.2 of the JFP paper uses the same example of generating specialized Gaussian Elimination codes to illustrate an alternative, notationally more convenient approach -- relying on delimited control rather than monads.
Can we generate optimal code in one pass, without further transformations such as common-subexpression-elimination, without looking into the code at all after it has been generated? The papers below answer affirmatively. The papers discuss in detail the generation of a straight-line C, Fortran, etc. code for the power-of-two FFT. The papers show that with the help of Abstract Interpretation we can exactly match the quality of code generated by the well-known system FFTW. The second paper demonstrates that our generated power-of-two FFT code has exactly the same number of floating-point operations as that in codelets of FFTW. The significant difference from FFTW is that we do not do any intensional code analysis -- the generated code is a black box and can not be processed nor manipulated any further. Moreover, the generated code cannot even be compared in equality. The reason for these restrictions is to maintain strong invariants about the generated code, e.g., that the generated code is automatically strongly typed and does not need to be typechecked. These invariants and the absence of ad hoc manipulation on the generated low-level code significantly increase our confidence in the result.
Unlike FFTW, we know precisely where savings are coming from, and which particular equivalences contribute to which improvements in the code.
The paper makes the case that ``there are benefits to focusing on writing better generators rather than on fixing the results of simple generators.''
The corresponding MetaOCaml source code
< http://www.metaocaml.org/examples/fft.ml >
Generating optimal FFT code and relating FFTW and Split-Radix
Matching FFTW in operation counts
This work explores the promises of generative programming languages and techniques for the high-performance computing expert. We show that complex, architecture-specific optimizations can be implemented in a type-safe, purely generative framework. Peak performance is achievable through the careful combination of a high-level, multi-stage evaluation language -- MetaOCaml -- with low-level code generation techniques. Nevertheless, our results also show that generative approaches for high-performance computing do not come without technical caveats and implementation barriers concerning productivity and reuse. We describe these difficulties and identify ways to hide or overcome them, from abstract syntaxes to heterogeneous generators of code generators, combining high-level and type-safe multi-stage programming with a back-end generator of imperative code.
Joint work with Albert Cohen, Se'bastien Donadio, Mari'a Jesu's Garzara'n, Christoph Armin Herrmann, and David A. Padua.
< http://www-rocq.inria.fr/~acohen/publications/CDGHKP06.ps.gz >
From `Albert Cohen - Publications'
A particular area of current interest shared by PL and HPC researchers is how to use domain-specific languages (DSL) to capture and automate patterns and techniques of code generation, transformation, and optimization that recur in an application domain. For example, HPC has created and benefited from expressive DSLs such as OpenMP directives, SPIRAL's signal processing language, and specialized languages for stencil computations and domain decomposition. Moreover, staging helps to build efficient and expressive DSLs because it assures that the generated code is correct in the form of precise static guarantees.
Alas, the communication between PL researchers working on staging and HPC practitioners could be better. On one hand, HPC practitioners often do not know what PL research offers. For example, current staging technology not only makes sure that the generated code compiles without problems, but further eliminates frequent problems such as matrix dimension mismatch. Staged programming can detect and prevent these problems early during development, thus relieving users from scouring reams of obscure generated code to debug compiler errors, or waiting for expensive trial computations to produce dubious results. On the other hand, PL researchers often do not know how much HPC practitioners who write code generators value this or that theoretical advance or pragmatic benefit -- in other words, how the HPC wish list is ranked by importance.
This workshop aimed to solicit and discuss real-world applications of assured code generation in HPC that would drive PL research in meta-programming. The workshop participants consisted of three groups of people: PL theorists, HPC researchers, and PL-HPC intermediaries (that is, people who are working with HPC professionals, translating insights from PL theory to HPC practice). The workshop benefited PL and staging theorists by informing them what HPC practice actually needs. HPC practitioners may have also benefited, for example by learning new ways to understand or organize the code generators they are already writing and using. To promote this mutual understanding, the workshop emphasized presentations that elicited questions rather than merely provide answers.
The workshop is organized together with Chung-chieh Shan and Yukiyoshi Kameyama. The Shonan seminar series is sponsored by Japan's National Institute of Informatics (NII).
< http://www.nii.ac.jp/shonan/wp-content/uploads/2011/09/No.2012-4.pdf >
The final report
One of the most interesting parts of the report is the section
`Tangible Outcomes'. The foremost is the Shonan Challenge, a set of
representative HPC examples where staging could be of help, in
producing more maintainable code and letting domain experts perform
modifications at a higher-level of abstraction.
Shonan Challenge is a collection of crisp problems posed by HPC and domain experts, for which efficient implementations are known but were tedious to write and modify. The challenge is to generate a similar efficient implementation from the high-level specification of a problem, performing the same optimizations, but automatically. It should be easy to adjust optimizations and the specification, maintaining confidence in the generated code.
We describe our initial set of benchmarks and provide three solutions to two of the problems. We hope that the Shonan Challenge will clarify the state of the art and stimulate the theory and technology of staging just as the POPLmark challenge did for meta-theory mechanization. Since each Shonan Challenge problem is a kernel of a significant HPC application, each solution has an immediate practical application.
The paper is written with Baris Aktemur, Yukiyoshi Kameyama, and Chung-chieh Shan.
< http://groups.google.com/group/stagedhpc >
< https://github.com/StagedHPC/shonan-challenge >
The mailing list (Google Group) and the GitHub repository for the Shonan Challenge. The list is open: subscription requests are welcome.
MetaOCaml as a domain-specific optimizer: Shonan Challenge 1
B/F ratio -- of the
hand-optimized code. The generated code correctly handles the edge cases,
which were problematic for the hand-written code. The generated
code assuredly has no array bound errors, which also troubled the
hand-written code.
The stencil EDSL is the straightforward application of the previous,
seemingly purely academic, `useless' work on generating optimal
Gibonacci codes and the let-insertion with delimited control. The key
insight is regarding a stencil as a sliding-window cache, for real or virtual
arrays. The implementation demonstrated that scope extrusion is a
problem in practice after all, justifying the research into its static
prevention.
The main challenge in high-performance computing (HPC) is overcoming the
memory bottleneck, reducing the amount of time CPU (local node) spends
waiting for data. The speed disparity between the CPU and the memory subsystem
is often characterized by the so-called B/F ratio: the ratio of
the number of bytes moved from or to the main memory during a computation
to the number of executed floating-point operations. The challenge
of the HPC code optimization is to reduce the B/F ratio and
the amount of needed local memory.
Signal processing or finite-element algorithms can often be stated as element-wise computations on shifted arrays. The following is the running example in the challenge.
w = a * S[1] a
b = a - w + S[-1] w
Here a is a global input array, b is a global output array,
and w is a working array. The operation S[n]
shifts the argument vector by n (to the left, if n is
positive). All arithmetic operations on vectors (addition,
multiplication, subtraction) are elementwise. Global arrays
are shared or distributed throughout a supercomputer; reading or writing
them requires accessing off-the-chip memory or the inter-node
communication.The naive implementation, neglecting for now edge elements, can be written in OCaml as
for i=0 to (n-2) do
w.(i) <- a.(i) *. a.(i+1)
done;
for i=1 to (n-1) do
b.(i) <- a.(i) -. w.(i) +. w.(i-1)
done
Assuming w is a local array, the first loop reads 2*(n-1)
elements from the global array a and does (n-1) floating-point
additions. The second loop reads n-1 elements from a, writes n-1
elements to b and does 2*(n-1) floating-point computations.
With 8-byte floating-point values, the B/F ratio is 32/3. Current supercomputers can sustain B/F of about 1 to 2. Therefore,
the naive implementation will run the order of magnitude worse
the peak performance, because global memory cannot keep
up with the CPU. The slow-down will likely to be larger since for large n, the array w won't fit into the local memory (cache).
After the array computation finishes, the output array b becomes the
input array for the next `time step', and the computation continues.
The B/F ratio can be reduced by doing two time steps at once:
w1 = a * S[1] a
wm = a - w1 + S[-1] w1
w2 = wm * S[1] wm
b = wm - w2 + S[-1] w2
The naive implementation has B/F of 32/6 -- with twice as many
floating-point operations per global memory access. However,
the implementation requires three times more local memory.
We generate the code that reads and writes only 1 floating-point value
per loop iteration while executing 6 floating-point operations, for B/F = 16/6 -- without using any local arrays. The code
is expected to run nearly at the peak speed of the supercomputer.
The enclosed code develops the EDSL for stencil computation in three steps, starting with the naive implementation generating many duplicate array accesses. The array accesses are then cached in local variables. Finally, a stencil is introduced as a sliding-window cache. For illustration, here is the naive generator of the double-time-step computation:
let t21 max_bounded a b =
forlooph 2 min_bounded max_bounded (
let a = gref0 a in
let w1 = a *@ ashift 1 a in
let wm = a -@ w1 +@ ashift (-1) w1 in
let w2 = wm *@ ashift 1 wm in
b <-@ wm -@ w2 +@ ashift (-1) w2)
where *@, +@ etc. are element-wise operations on arrays, ashift
n is the array shift S[n] and forlooph 2 is a loop combinator
that unrolls the first and the last two iterations. The generator will
complain if the halo value is less than 2 because it cannot see
(correctly) that array indices are in-bounds. The EDSL code looks
quite like the mathematical notation. To generate the optimal code, we merely add stencil_memo combinators: let t25 max_bounded a b =
let p = new_prompt () and q = new_prompt () in
push_prompt q (fun () ->
forlooph 2 min_bounded max_bounded (with_body_prompt p (
let a = stencil_memo q p (gref0 a) in
let w1 = stencil_memo q p (a *@ ashift 1 a) in
let wm = stencil_memo q p (a -@ w1 +@ ashift (-1) w1) in
let w2 = stencil_memo q p (wm *@ ashift 1 wm) in
b <-@ wm -@ w2 +@ ashift (-1) w2)))
The combinators need to know the loop scope (the scope of loop-local
variables) and the outer scope (the scope of the variables to store data
cross-iteration). So-called prompts q and p denote those scopes.
The generated code, shown in the comments in stencil.ml, reads and
writes a global array once per iteration and does six floating-point
operations, with the B/F of 16/6.
The hand-optimized code has essentially the same loop body and the
same B/F. However, the first versions of the hand-written code left
the elements 0, 1, and n-1 of the output array
uninitialized. Furthermore, the code read past the end of the input
array. Getting edge cases right is really difficult. The generated
code has very long and boring pieces of code before and after the main
loop, to properly compute the edge elements. Writing such tedious code
by hand is excruciatingly boring. Code generation truly helps.
The stencil challenge proved to be a surprisingly good motivation for the research on staging. It required the code generation with effects, in particular, effects crossing binders. Implementing the challenge showed that scope extrusion does happen in practice. Although the fact of scope extrusion is easy to see when compiling the generated code, finding the culprit within the generator turns out difficult.
problems/tiling/main1.cpp committed to the Shonan-challenge GitHub repository on 2012-05-21
stencil_hand.ml [4K]
Reference code: specification and the hand-written optimized code
The code correctly handles the edges cases and has unit tests.
stencil.ml [28K]
The MetaOCaml code of the solution to the challenge
The file gradually develops the EDSL for stencil computations and has lots of generated sample code and numerical tests.
The code requires MetaOCaml and the delimcc library.
bounded.mli [3K]
bounded.ml [5K]
Partially known dynamic values for safe array indexing
A library for statically tracking bounds on unknown dynamic values.
oleg-at-pobox.com or oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!
Converted from HSXML by HSXML->HTML