previous   next   up   top

Program Generation for High-Performance Computing

 


 

Multi-stage programming with functors and monads: eliminating abstraction overhead from generic code

[The Abstract of the paper]
We use multi-stage programming, monads and OCaml's advanced module system to demonstrate how to eliminate all abstraction overhead from generic programs while avoiding any inspection of the resulting code. We demonstrate this clearly with Gaussian Elimination as a representative family of symbolic and numeric algorithms. We parameterize our code to a great extent -- over domain, input and permutation matrix representations, determinant and rank tracking, pivoting policies, result types, etc. -- at no run-time cost. Because the resulting code is generated just right and not changed afterward, MetaOCaml guarantees that the generated code is well-typed. We further demonstrate that various abstraction parameters (aspects) can be made orthogonal and compositional, even in the presence of name-generation for temporaries, and ``interleaving'' of aspects. We also show how to encode some domain-specific knowledge so that ``clearly wrong'' compositions can be rejected at or before generation time, rather than during the compilation or running of the generated code.

Joint work with Jacques Carette.

Version
The current version is October 2008.
References
SCP-metamonads.pdf [311K]
Science of Computer Programming, v76, N5, May 2011, pp. 349-375
doi:10.1016/j.scico.2008.09.008

< http://www.cas.mcmaster.ca/~carette/metamonads/index.html >
Besides the paper, the page points to the generator code, and the generated Gaussian Elimination codes to handle real and integer matrices using various pivoting strategies.

Shifting the Stage: Staging with Delimited Control
Section 5.2 of the JFP paper uses the same example of generating specialized Gaussian Elimination codes to illustrate an alternative, notationally more convenient approach -- relying on delimited control rather than monads.

 

Generating optimal code with confidence

Can we generate optimal code in one pass, without further transformations such as common-subexpression-elimination, without looking into the code at all after it has been generated? The papers below answer affirmatively. The papers discuss in detail the generation of a straight-line C, Fortran, etc. code for the power-of-two FFT. The papers show that with the help of Abstract Interpretation we can exactly match the quality of code generated by the well-known system FFTW. The second paper demonstrates that our generated power-of-two FFT code has exactly the same number of floating-point operations as that in codelets of FFTW. The significant difference from FFTW is that we do not do any intensional code analysis -- the generated code is a black box and can not be processed nor manipulated any further. Moreover, the generated code cannot even be compared in equality. The reason for these restrictions is to maintain strong invariants about the generated code, e.g., that the generated code is automatically strongly typed and does not need to be typechecked. These invariants and the absence of ad hoc manipulation on the generated low-level code significantly increase our confidence in the result.

Unlike FFTW, we know precisely where savings are coming from, and which particular equivalences contribute to which improvements in the code.

The paper makes the case that ``there are benefits to focusing on writing better generators rather than on fixing the results of simple generators.''

References
A Methodology for Generating Verified Combinatorial Circuits
Joint work with Kedar N. Swadi and Walid Taha.
Proc. of EMSOFT'04, the Fourth ACM International Conference on Embedded Software, September 27-29, 2004, Pisa, Italy. ACM Press, pp. 249 - 258.
< http://www.cs.rice.edu/~taha/publications/conference/emsoft04.pdf >

The corresponding MetaOCaml source code
< http://www.metaocaml.org/examples/fft.ml >

Generating optimal FFT code and relating FFTW and Split-Radix
Matching FFTW in operation counts

 

In search of a program generator to implement generic transformations for high-performance computing

[The Abstract of the paper]
The quality of compiler-optimized code for high-performance applications is far behind what optimization and domain experts can achieve by hand. Although it may seem surprising at first glance, the performance gap has been widening over time, due to the tremendous complexity increase in microprocessor and memory architectures, and to the rising level of abstraction of popular programming languages and styles. This paper explores in-between solutions, neither fully automatic nor fully manual ways to adapt a computationally intensive application to the target architecture. By mimicking complex sequences of transformations useful to optimize real codes, we show that generative programming is a practical means to implement architecture-aware optimizations for high-performance applications.

This work explores the promises of generative programming languages and techniques for the high-performance computing expert. We show that complex, architecture-specific optimizations can be implemented in a type-safe, purely generative framework. Peak performance is achievable through the careful combination of a high-level, multi-stage evaluation language -- MetaOCaml -- with low-level code generation techniques. Nevertheless, our results also show that generative approaches for high-performance computing do not come without technical caveats and implementation barriers concerning productivity and reuse. We describe these difficulties and identify ways to hide or overcome them, from abstract syntaxes to heterogeneous generators of code generators, combining high-level and type-safe multi-stage programming with a back-end generator of imperative code.

Joint work with Albert Cohen, Se'bastien Donadio, Mari'a Jesu's Garzara'n, Christoph Armin Herrmann, and David A. Padua.

Version
The current version is September, 2006.
References
< http://dx.doi.org/10.1016/j.scico.2005.10.013 >
Science of Computer Programming, Volume 62, Issue 1, September 2006, pp. 25-46.

< http://www-rocq.inria.fr/~acohen/publications/CDGHKP06.ps.gz >
From `Albert Cohen - Publications'

 

Bridging the theory of staged programming languages and the practice of high-performance computing

This discussion-heavy workshop took place at Shonan Village Center, Kanagawa, Japan on May 19-22 2012. It aimed to bridge the theory of programming languages (PL) with the practice of high-performance computing (HPC). The topic of the discussion was code generation, or staging, a form of meta-programming. Both the PL and HPC communities have come to realize the importance of code generation: whereas PL theorists widely regard staging as the leading approach to making modular software and expressive languages run fast, HPC practitioners widely regard staging as the leading approach to making high-performance software reusable and maintainable. Thanks to this confluence of theory and practice, we have had the rare chance to bring together PL researchers and the potential consumers of their work.

A particular area of current interest shared by PL and HPC researchers is how to use domain-specific languages (DSL) to capture and automate patterns and techniques of code generation, transformation, and optimization that recur in an application domain. For example, HPC has created and benefited from expressive DSLs such as OpenMP directives, SPIRAL's signal processing language, and specialized languages for stencil computations and domain decomposition. Moreover, staging helps to build efficient and expressive DSLs because it assures that the generated code is correct in the form of precise static guarantees.

Alas, the communication between PL researchers working on staging and HPC practitioners could be better. On one hand, HPC practitioners often do not know what PL research offers. For example, current staging technology not only makes sure that the generated code compiles without problems, but further eliminates frequent problems such as matrix dimension mismatch. Staged programming can detect and prevent these problems early during development, thus relieving users from scouring reams of obscure generated code to debug compiler errors, or waiting for expensive trial computations to produce dubious results. On the other hand, PL researchers often do not know how much HPC practitioners who write code generators value this or that theoretical advance or pragmatic benefit -- in other words, how the HPC wish list is ranked by importance.

This workshop aimed to solicit and discuss real-world applications of assured code generation in HPC that would drive PL research in meta-programming. The workshop participants consisted of three groups of people: PL theorists, HPC researchers, and PL-HPC intermediaries (that is, people who are working with HPC professionals, translating insights from PL theory to HPC practice). The workshop benefited PL and staging theorists by informing them what HPC practice actually needs. HPC practitioners may have also benefited, for example by learning new ways to understand or organize the code generators they are already writing and using. To promote this mutual understanding, the workshop emphasized presentations that elicited questions rather than merely provide answers.

The workshop is organized together with Chung-chieh Shan and Yukiyoshi Kameyama. The Shonan seminar series is sponsored by Japan's National Institute of Informatics (NII).

References
< http://www.nii.ac.jp/shonan/seminar019/ >
The web page of the workshop

< http://www.nii.ac.jp/shonan/wp-content/uploads/2011/09/No.2012-4.pdf >
The final report
One of the most interesting parts of the report is the section `Tangible Outcomes'. The foremost is the Shonan Challenge, a set of representative HPC examples where staging could be of help, in producing more maintainable code and letting domain experts perform modifications at a higher-level of abstraction.

Shonan Challenge for Generative Programming

 

Shonan Challenge for Generative Programming

[The Abstract of the paper]
The appeal of generative programming is ``abstraction without guilt'' (coined by Ken Kennedy): eliminating the vexing trade-off between writing high-level code and highly-performant code. Generative programming also promises to formally capture the domain-specific knowledge and heuristics used by high-performance computing (HPC) experts. How far along are we in fulfilling these promises? To gauge our progress, a recent Shonan Meeting on ``bridging the theory of staged programming languages and the practice of high-performance computing'' proposed to use a set of benchmarks, dubbed ``Shonan Challenge''.

Shonan Challenge is a collection of crisp problems posed by HPC and domain experts, for which efficient implementations are known but were tedious to write and modify. The challenge is to generate a similar efficient implementation from the high-level specification of a problem, performing the same optimizations, but automatically. It should be easy to adjust optimizations and the specification, maintaining confidence in the generated code.

We describe our initial set of benchmarks and provide three solutions to two of the problems. We hope that the Shonan Challenge will clarify the state of the art and stimulate the theory and technology of staging just as the POPLmark challenge did for meta-theory mechanization. Since each Shonan Challenge problem is a kernel of a significant HPC application, each solution has an immediate practical application.

The paper is written with Baris Aktemur, Yukiyoshi Kameyama, and Chung-chieh Shan.

References
Shonan-challenge.pdf [230K]
Paper published in the Proceedings of the ACM SIGPLAN 2013 Workshop on Partial Evaluation and Program Manipulation, PEPM 2013 (Rome, Italy, January 21-22, 2013), pp. 147-154
doi:10.1145/2426890.2426917

< http://groups.google.com/group/stagedhpc >
< https://github.com/StagedHPC/shonan-challenge >
The mailing list (Google Group) and the GitHub repository for the Shonan Challenge. The list is open: subscription requests are welcome.

MetaOCaml as a domain-specific optimizer: Shonan Challenge 1

 

Generating optimal stencil code

Generating memory-bandwidth--optimal code for array computations -- typical of PDE solvers and signal processing kernels -- was one of the Shonan Challenge benchmarks. The challenge was submitted by Takayuki Muranushi, an astrophysicist whose work depends on solving PDEs on truly huge meshes. This article answers the challenge. It develops a domain-specific language (DSL) for stencil computations, embedded in (Meta)OCaml. The generated code attains the bandwidth optimality -- the B/F ratio -- of the hand-optimized code. The generated code correctly handles the edge cases, which were problematic for the hand-written code. The generated code assuredly has no array bound errors, which also troubled the hand-written code.

The stencil EDSL is the straightforward application of the previous, seemingly purely academic, `useless' work on generating optimal Gibonacci codes and the let-insertion with delimited control. The key insight is regarding a stencil as a sliding-window cache, for real or virtual arrays. The implementation demonstrated that scope extrusion is a problem in practice after all, justifying the research into its static prevention.

The main challenge in high-performance computing (HPC) is overcoming the memory bottleneck, reducing the amount of time CPU (local node) spends waiting for data. The speed disparity between the CPU and the memory subsystem is often characterized by the so-called B/F ratio: the ratio of the number of bytes moved from or to the main memory during a computation to the number of executed floating-point operations. The challenge of the HPC code optimization is to reduce the B/F ratio and the amount of needed local memory.

Signal processing or finite-element algorithms can often be stated as element-wise computations on shifted arrays. The following is the running example in the challenge.

     w = a * S[1] a
     b = a - w + S[-1] w
Here a is a global input array, b is a global output array, and w is a working array. The operation S[n] shifts the argument vector by n (to the left, if n is positive). All arithmetic operations on vectors (addition, multiplication, subtraction) are elementwise. Global arrays are shared or distributed throughout a supercomputer; reading or writing them requires accessing off-the-chip memory or the inter-node communication.

The naive implementation, neglecting for now edge elements, can be written in OCaml as

     for i=0 to (n-2) do
       w.(i) <- a.(i) *. a.(i+1)
     done;
     for i=1 to (n-1) do
       b.(i) <- a.(i) -. w.(i) +. w.(i-1)
     done
Assuming w is a local array, the first loop reads 2*(n-1) elements from the global array a and does (n-1) floating-point additions. The second loop reads n-1 elements from a, writes n-1 elements to b and does 2*(n-1) floating-point computations. With 8-byte floating-point values, the B/F ratio is 32/3. Current supercomputers can sustain B/F of about 1 to 2. Therefore, the naive implementation will run the order of magnitude worse the peak performance, because global memory cannot keep up with the CPU. The slow-down will likely to be larger since for large n, the array w won't fit into the local memory (cache).

After the array computation finishes, the output array b becomes the input array for the next `time step', and the computation continues. The B/F ratio can be reduced by doing two time steps at once:

     w1 = a * S[1] a
     wm = a - w1 + S[-1] w1
     w2 = wm * S[1] wm
     b  = wm - w2 + S[-1] w2
The naive implementation has B/F of 32/6 -- with twice as many floating-point operations per global memory access. However, the implementation requires three times more local memory.

We generate the code that reads and writes only 1 floating-point value per loop iteration while executing 6 floating-point operations, for B/F = 16/6 -- without using any local arrays. The code is expected to run nearly at the peak speed of the supercomputer.

The enclosed code develops the EDSL for stencil computation in three steps, starting with the naive implementation generating many duplicate array accesses. The array accesses are then cached in local variables. Finally, a stencil is introduced as a sliding-window cache. For illustration, here is the naive generator of the double-time-step computation:

     let t21 max_bounded a b =
       forlooph 2 min_bounded max_bounded (
         let a = gref0 a in
         let w1 = a *@ ashift 1 a in 
         let wm = a -@ w1 +@ ashift (-1) w1 in
         let w2 = wm *@ ashift 1 wm in
         b <-@ wm -@ w2 +@ ashift (-1) w2)
where *@, +@ etc. are element-wise operations on arrays, ashift n is the array shift S[n] and forlooph 2 is a loop combinator that unrolls the first and the last two iterations. The generator will complain if the halo value is less than 2 because it cannot see (correctly) that array indices are in-bounds. The EDSL code looks quite like the mathematical notation. To generate the optimal code, we merely add stencil_memo combinators:
     let t25 max_bounded a b =
       let p = new_prompt () and q = new_prompt () in
       push_prompt q (fun () ->
       forlooph 2 min_bounded max_bounded (with_body_prompt p (
         let a  = stencil_memo q p (gref0 a) in
         let w1 = stencil_memo q p (a *@ ashift 1 a) in 
         let wm = stencil_memo q p (a -@ w1 +@ ashift (-1) w1) in
         let w2 = stencil_memo q p (wm *@ ashift 1 wm) in
         b <-@ wm -@ w2 +@ ashift (-1) w2)))

The combinators need to know the loop scope (the scope of loop-local variables) and the outer scope (the scope of the variables to store data cross-iteration). So-called prompts q and p denote those scopes. The generated code, shown in the comments in stencil.ml, reads and writes a global array once per iteration and does six floating-point operations, with the B/F of 16/6.

The hand-optimized code has essentially the same loop body and the same B/F. However, the first versions of the hand-written code left the elements 0, 1, and n-1 of the output array uninitialized. Furthermore, the code read past the end of the input array. Getting edge cases right is really difficult. The generated code has very long and boring pieces of code before and after the main loop, to properly compute the edge elements. Writing such tedious code by hand is excruciatingly boring. Code generation truly helps.

The stencil challenge proved to be a surprisingly good motivation for the research on staging. It required the code generation with effects, in particular, effects crossing binders. Implementing the challenge showed that scope extrusion does happen in practice. Although the fact of scope extrusion is easy to see when compiling the generated code, finding the culprit within the generator turns out difficult.

Version
The current version is October, 2012.
References
stencil-challenge.cpp [6K]
Takayuki Muranushi: PDE challenge
A local copy of the file problems/tiling/main1.cpp committed to the Shonan-challenge GitHub repository on 2012-05-21
The file describes the challenge showing the naive and the hand-optimized code.

stencil_hand.ml [4K]
Reference code: specification and the hand-written optimized code
The code correctly handles the edges cases and has unit tests.

stencil.ml [28K]
The MetaOCaml code of the solution to the challenge
The file gradually develops the EDSL for stencil computations and has lots of generated sample code and numerical tests.
The code requires MetaOCaml and the delimcc library.

bounded.mli [3K]
bounded.ml [5K]
Partially known dynamic values for safe array indexing
A library for statically tracking bounds on unknown dynamic values.



Last updated November 5, 2013

This site's top page is http://okmij.org/ftp/

oleg-at-pobox.com or oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!

Converted from HSXML by HSXML->HTML