Program Generation for High-Performance Computing

 

 

In search of a program generator to implement generic transformations for high-performance computing

The quality of compiler-optimized code for high-performance applications is far behind what optimization and domain experts can achieve by hand. Although it may seem surprising at first glance, the performance gap has been widening over time, due to the tremendous complexity increase in microprocessor and memory architectures, and to the rising level of abstraction of popular programming languages and styles. This paper explores in-between solutions, neither fully automatic nor fully manual ways to adapt a computationally intensive application to the target architecture. By mimicking complex sequences of transformations useful to optimize real codes, we show that generative programming is a practical means to implement architecture-aware optimizations for high-performance applications.

This work explores the promises of generative programming languages and techniques for the high-performance computing expert. We show that complex, architecture-specific optimizations can be implemented in a type-safe, purely generative framework. Peak performance is achievable through the careful combination of a high-level, multi-stage evaluation language -- MetaOCaml -- with low-level code generation techniques. Nevertheless, our results also show that generative approaches for high-performance computing do not come without technical caveats and implementation barriers concerning productivity and reuse. We describe these difficulties and identify ways to hide or overcome them, from abstract syntaxes to heterogeneous generators of code generators, combining high-level and type-safe multi-stage programming with a back-end generator of imperative code.

Joint work with Albert Cohen, Sébastien Donadio, María Jesús Garzarán, Christoph Armin Herrmann, and David A. Padua.

Version
The current version is September, 2006
References
SCP-search.pdf [511K]
Science of Computer Programming, Volume 62, Issue 1, September 2006, pp. 25-46 doi:10.1016/j.scico.2005.10.013

 

Multi-stage programming with functors and monads: eliminating abstraction overhead from generic code

We use multi-stage programming, monads and OCaml's advanced module system to demonstrate how to eliminate all abstraction overhead from generic programs while avoiding any inspection of the resulting code. We demonstrate this clearly with Gaussian Elimination as a representative family of symbolic and numeric algorithms. We parameterize our code to a great extent -- over domain, input and permutation matrix representations, determinant and rank tracking, pivoting policies, result types, etc. -- at no run-time cost. Because the resulting code is generated just right and not changed afterward, MetaOCaml guarantees that the generated code is well-typed. We further demonstrate that various abstraction parameters (aspects) can be made orthogonal and compositional, even in the presence of name-generation for temporaries, and ``interleaving'' of aspects. We also show how to encode some domain-specific knowledge so that ``clearly wrong'' compositions can be rejected at or before generation time, rather than during the compilation or running of the generated code.
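
As a small illustration of the idea -- a minimal sketch in BER MetaOCaml, not the paper's actual Gaussian Elimination generator -- the dot-product generator below is parameterized over the domain by a functor, yet the generated code bears no trace of that abstraction.

    (* The abstraction parameter: a domain whose operations build code *)
    module type DOMAIN = sig
      type v
      val zero : v code
      val add  : v code -> v code -> v code
      val mul  : v code -> v code -> v code
    end

    module FloatDomain : DOMAIN with type v = float = struct
      type v = float
      let zero    = .<0.0>.
      let add x y = .< .~x +. .~y >.
      let mul x y = .< .~x *. .~y >.
    end

    (* A generic generator: dot product of two n-element arrays,
       with the loop fully unrolled at generation time *)
    module Dot (D : DOMAIN) = struct
      let gen n = .< fun x y -> .~(
        let rec loop i acc =
          if i >= n then acc
          else loop (i+1) (D.add acc (D.mul .<x.(i)>. .<y.(i)>.))
        in loop 0 D.zero) >.
    end

Running the generator of Dot(FloatDomain) with n = 3 produces code like fun x y -> ((0.0 +. x.(0) *. y.(0)) +. x.(1) *. y.(1)) +. x.(2) *. y.(2): the functor and the combinators have completely evaporated.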

Joint work with Jacques Carette.

Version
The current version is October 2008
References
SCP-metamonads.pdf [311K]
The paper published in Science of Computer Programming, v76, N5, May 2011, pp. 349-375 doi:10.1016/j.scico.2008.09.008

<http://www.cas.mcmaster.ca/~carette/metamonads/index.html>
Besides the paper, the page points to the generator code and to the generated Gaussian Elimination codes for handling real and integer matrices with various pivoting strategies.

Shifting the Stage: Staging with Delimited Control
Section 5.2 of the JFP paper uses the same example of generating specialized Gaussian Elimination codes to illustrate an alternative, notationally more convenient approach: relying on delimited control rather than monads.

 

Generating optimal code with confidence

Can we generate optimal code in one pass, without further transformations such as common-subexpression elimination, and without looking into the code at all after it has been generated? The papers below answer affirmatively. They discuss in detail the generation of straight-line C, Fortran, etc. code for the power-of-two FFT, and show that with the help of abstract interpretation we can exactly match the quality of code generated by the well-known system FFTW. The second paper demonstrates that our generated power-of-two FFT code has exactly the same number of floating-point operations as the codelets of FFTW. The significant difference from FFTW is that we do not do any intensional code analysis: the generated code is a black box that cannot be processed or manipulated any further. Moreover, generated code values cannot even be compared for equality. The reason for these restrictions is to maintain strong invariants about the generated code, e.g., that it is well-typed by construction and need not be typechecked. These invariants, and the absence of ad hoc manipulation of the generated low-level code, significantly increase our confidence in the result.

Unlike FFTW, we know precisely where savings are coming from, and which particular equivalences contribute to which improvements in the code.

The paper makes the case that ``there are benefits to focusing on writing better generators rather than on fixing the results of simple generators.''
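
To give the flavor of the approach, here is a minimal sketch in (BER) MetaOCaml -- an illustration only, far simpler than the FFT generator of the papers. The abstract interpretation tracks whether a value is statically known; the smart multiplication then simplifies at generation time, so multiplications by 0.0 or 1.0 never appear in the generated code, without that code ever being inspected.

    (* A generator value: either statically known, or dynamic code *)
    type sv = Known of float | Dyn of float code

    let concretize : sv -> float code = function
      | Known x -> .<x>.                 (* cross-stage persistence *)
      | Dyn e   -> e

    (* Smart multiplication: algebraic simplification at generation time *)
    let ( *! ) x y = match (x, y) with
      | (Known 0., _) | (_, Known 0.) -> Known 0.      (* 0 * e = 0     *)
      | (Known 1., e) | (e, Known 1.) -> e             (* 1 * e = e     *)
      | (Known a, Known b) -> Known (a *. b)           (* constant fold *)
      | _ -> Dyn .< .~(concretize x) *. .~(concretize y) >.

The simplifications happen by construction: the generated code remains a black box, and MetaOCaml still guarantees that it is well-typed.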

References
A Methodology for Generating Verified Combinatorial Circuits
Joint work with Kedar N. Swadi and Walid Taha.
Proc. of EMSOFT'04, the Fourth ACM International Conference on Embedded Software, September 27-29, 2004, Pisa, Italy. ACM Press, pp. 249-258. doi:10.1145/1017753.1017794

Generating optimal FFT code and relating FFTW and Split-Radix
Matching FFTW in operation counts

 

Bridging the theory of staged programming languages and the practice of high-performance computing

This discussion-heavy workshop took place at Shonan Village Center, Kanagawa, Japan on May 19-22, 2012. It aimed to bridge the theory of programming languages (PL) with the practice of high-performance computing (HPC). The topic of the discussion was code generation, or staging, a form of meta-programming. Both the PL and HPC communities have come to realize the importance of code generation: whereas PL theorists widely regard staging as the leading approach to making modular software and expressive languages run fast, HPC practitioners widely regard staging as the leading approach to making high-performance software reusable and maintainable. Thanks to this confluence of theory and practice, we had the rare chance to bring together PL researchers and the potential consumers of their work.

A particular area of current interest shared by PL and HPC researchers is how to use domain-specific languages (DSL) to capture and automate patterns and techniques of code generation, transformation, and optimization that recur in an application domain. For example, HPC has created and benefited from expressive DSLs such as OpenMP directives, SPIRAL's signal processing language, and specialized languages for stencil computations and domain decomposition. Moreover, staging helps to build efficient and expressive DSLs because it provides precise static guarantees about the correctness of the generated code.

Alas, the communication between PL researchers working on staging and HPC practitioners could be better. On one hand, HPC practitioners often do not know what PL research offers. For example, current staging technology not only ensures that the generated code compiles without problems, but also rules out frequent errors such as matrix dimension mismatch. Staged programming can detect and prevent these problems early during development, thus relieving users from scouring reams of obscure generated code to debug compiler errors, or waiting for expensive trial computations to produce dubious results. On the other hand, PL researchers often do not know how much HPC practitioners who write code generators value this or that theoretical advance or pragmatic benefit -- in other words, how the HPC wish list is ranked by importance.
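
As a hypothetical illustration of the latter point (the example is not from the workshop materials): if array sizes are tracked in types, say as phantom type parameters, a staged matrix-vector-product generator fails to typecheck when the dimensions disagree -- before a single line of code is generated, let alone compiled or run.

    (* A sketch: 'm and 'n are phantom sizes; a real library would keep
       the constructors abstract so that sizes cannot be forged *)
    type 'n vec       = Vec of float array
    type ('m, 'n) mat = Mat of float array array

    (* The type says: an m-by-n matrix times an n-vector is an m-vector.
       Applying matvec_gen to a vector of the wrong phantom size is a
       type error at generation time *)
    let matvec_gen : ('m, 'n) mat code -> 'n vec code -> 'm vec code =
      fun m v ->
        .< let Mat m = .~m and Vec v = .~v in
           Vec (Array.map
                  (fun row ->
                    Array.fold_left (+.) 0.0 (Array.map2 ( *. ) row v))
                  m) >.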

This workshop aimed to solicit and discuss real-world applications of assured code generation in HPC that would drive PL research in meta-programming. The workshop participants consisted of three groups of people: PL theorists, HPC researchers, and PL-HPC intermediaries (that is, people who are working with HPC professionals, translating insights from PL theory to HPC practice). The workshop benefited PL and staging theorists by informing them what HPC practice actually needs. HPC practitioners may have also benefited, for example by learning new ways to understand or organize the code generators they are already writing and using. To promote this mutual understanding, the workshop emphasized presentations that elicited questions rather than merely provided answers.

The workshop was organized together with Chung-chieh Shan and Yukiyoshi Kameyama. The Shonan seminar series is sponsored by Japan's National Institute of Informatics (NII).

References
<https://shonan.nii.ac.jp/seminars/019/>
The web page of the workshop

shonan1-report.pdf [73K]
The final report
One of the most interesting parts of the report is the section `Tangible Outcomes'. The foremost is the Shonan Challenge, a set of representative HPC examples where staging could be of help in producing more maintainable code and letting domain experts perform modifications at a higher level of abstraction.

Shonan Challenge for Generative Programming

 

Staging and High-Performance Computing: Theory and Practice

The second meeting on high-performance computing and staging took place at the Shonan Village Center, Kanagawa, Japan on May 27-30, 2014. It was the follow-up to the May 2012 meeting (Shonan Seminar 19) ``Bridging the theory of staged programming languages and the practice of high-performance computing''. At the first meeting, researchers in staged programming languages were learning of the problems in high-performance computing (HPC) from HPC experts. Since then, thanks in part to the Shonan Challenges put forward at that meeting, the theory and tools of staged programming languages have progressed to the point where they could already be useful in HPC practice. The present follow-up meeting gauged this readiness and set further milestones. It served as a forum for staged programming researchers to present their progress and for HPC practitioners to evaluate it, fostering collaboration on real-life applications.

Generative programming, in particular in the form of staging, is widely recognized in HPC as the leading approach to resolving the conflict between high performance on the one hand, and portability and maintainability on the other. In its general form, staging is an implementation technique for efficient domain-specific languages (DSL), letting HPC experts conveniently express their domain-specific knowledge and optimization heuristics. However, the results and tools of current staging research are little used in the HPC community. Partly this is because HPC practitioners are not aware of the progress in staging; staging researchers are likewise unaware of what HPC practitioners really need. The first Shonan meeting brought together HPC practitioners and programming language (PL) researchers to break this awareness barrier. The meeting aimed to solicit and discuss real-world applications of assured code generation in HPC that would drive PL research in meta-programming.

The first Shonan meeting succeeded in its aim. It developed a set of benchmarks -- representative HPC examples where staging could be of help in producing more maintainable code and letting domain experts perform modifications at a higher level of abstraction. This set was dubbed `Shonan Challenge'.

The Shonan Challenge has greatly stimulated research and development of staging, resulting in extensible compilers based on staging (Rompf et al., POPL 2013) and the revival of MetaOCaml (ML 2013, FLOPS 2014). Answers to the Shonan Challenge have been presented in the overview paper (Aktemur et al., PEPM 2013), in (Rompf et al., POPL 2013) and in a poster at APLAS 2012. Shonan Challenge problems (specifically, the Hidden Markov Model benchmark) were discussed at the staging tutorials given in 2013 at the premier PL conferences PLDI, ECOOP, and ICFP/CUFP.

The present follow-up meeting was intended as the place to report the progress in staging back to the HPC practitioners who posed the challenges, to evaluate how well the developed tools already meet the needs of HPC practice, and to determine what is yet to be done.

As before, the workshop participants consisted of three groups of people: PL theorists, HPC researchers, and PL-HPC intermediaries (that is, people who are working with HPC professionals, translating insights from PL theory to HPC practice). To promote mutual understanding, we planned for the workshop to have plenty of time for discussion, emphasizing tutorial, brainstorming and working-group sessions rather than mere conference-like presentations.

The workshop was organized together with Yukiyoshi Kameyama and Jeremy Siek. The Shonan seminar series is sponsored by Japan's National Institute of Informatics (NII).

References
<http://shonan.nii.ac.jp/seminar/056/>
The web page of the workshop

shonan2-report.pdf [126K]
The final report

Bridging the theory of staged programming languages and the practice of high-performance computing
The first Shonan meeting

 

Shonan Challenge for Generative Programming

The appeal of generative programming is ``abstraction without guilt'' (coined by Ken Kennedy): eliminating the vexing trade-off between writing high-level code and writing highly performant code. Generative programming also promises to formally capture the domain-specific knowledge and heuristics used by high-performance computing (HPC) experts. How far along are we in fulfilling these promises? To gauge our progress, a recent Shonan Meeting on ``bridging the theory of staged programming languages and the practice of high-performance computing'' proposed to use a set of benchmarks, dubbed ``Shonan Challenge''.

Shonan Challenge is a collection of crisp problems posed by HPC and domain experts, for which efficient implementations are known but are tedious to write and modify. The challenge is to generate a similarly efficient implementation from the high-level specification of a problem, performing the same optimizations, but automatically. It should be easy to adjust the optimizations and the specification, maintaining confidence in the generated code.

We describe our initial set of benchmarks and provide three solutions to two of the problems. We hope that the Shonan Challenge will clarify the state of the art and stimulate the theory and technology of staging just as the POPLmark challenge did for meta-theory mechanization. Since each Shonan Challenge problem is a kernel of a significant HPC application, each solution has an immediate practical application.

Joint work with Baris Aktemur, Yukiyoshi Kameyama, and Chung-chieh Shan.

References
Shonan-challenge.pdf [230K]
The paper published in the Proceedings of the ACM SIGPLAN 2013 Workshop on Partial Evaluation and Program Manipulation, PEPM 2013 (Rome, Italy, January 21-22, 2013), pp. 147-154 doi:10.1145/2426890.2426917

<http://groups.google.com/group/stagedhpc>
<https://github.com/StagedHPC/shonan-challenge>
The mailing list (Google Group) and the GitHub repository for the Shonan Challenge. The list is open: subscription requests are welcome.

The MetaOCaml Tutorial book describes solutions to two Shonan challenges

 

Do-it-yourself domain-specific optimizing compiler

This article is the record of a short presentation at the October 2006 meeting of the IFIP working group on program generation (WG 2.11). The presentation reported on a small, seemingly successful experiment in optimized DSL compilation. The goal was to apply domain-specific optimizations (such as loop unrolling, the state-passing transformation, memoization) to a textbook algorithm written in OCaml. The optimization tactics are written in MetaOCaml; hence both the code and the tactics are typed. The type system prevents obvious nonsense, statically ensuring that the optimization result is well-typed. The output is either the optimized OCaml or (with offshoring) C or Fortran. The key idea was combining MetaOCaml with camlp4. MetaOCaml is responsible for code generation and for writing tactics. The camlp4 syntactic extension turns a block of OCaml code into a typed AST, which can then be easily (re-)interpreted by ordinary users through tactics.

By the time of the 2006 presentation, the subject had already been dead for a year. It is described here just for the record. One can recognize in that old code the beginnings of the tagless-final approach. In fact, the first tagless-final interpreter was written in the e-mail message ``Tagless staged interpreter in 5 definitions'' sent on November 3, 2006, after I came back from the WG 2.11 meeting. Strangely, I did not recognize the connection back then. One can also see in the WG 2.11 presentation the anticipation of Scala-Virtualized and Lightweight Modular Staging (LMS).

The running example of the WG 2.11 presentation was the convolution of two sequences. Here is the textbook algorithm, expressed in what looks like OCaml.

    let dot1 = funM ENVID n a1 a2 ->
      let sum = ref 0.0 in
      for i = 0 to (n-1) do
        sum := !sum +. a1.(i) *. a2.(n-i-1)
      done;
      !sum

Replacing the odd-looking funM ENVID with just fun gives genuine OCaml code, which can be evaluated as such. As it stands though, dot1 generates code, in the form of the MetaOCaml quotation (brackets):

    val dot1 : ('a, int -> float array -> float array -> float) code =
      .<fun n_3 -> fun a1_4 -> fun a2_5 ->
         let sum_6 = (ref 0.0) in
         for i_7 = 0 to (n_3 - 1) do
          let i_8 = i_7 in
          (sum_6 := ((! sum_6) +. (a1_4.(i_8) *. a2_5.((n_3 - i_8) - 1))))
         done;
         (! sum_6)>.
The generated code reproduces what we started with, but with more parentheses and less readable identifiers.

The curious ENVID appearing in dot1 is quite boring:

    module ENVID = struct
     type 'v m   = 'v
     let retS x = x  let retL x = x  let ret x  = x
     let bind m f = f m   let run x = x
     let app f x = .<.~f .~x>.
     let ( + ) x y = .<Pervasives.( + ) .~x .~y>.
     let ( +.) x y = .<Pervasives.( +.) .~x .~y>.
     ...
     let fif c t e = .<if .~c then .~t else .~e>.
     let fseq e1 e2 = .<begin .~e1; .~e2 end>.
     let ffor f e1 e2 body = 
       if f then .<for i = .~e1 to .~e2 do .~(body .<i>.) done>.
            else .<for i = .~e1 downto .~e2 do .~(body .<i>.) done>.
     let ref e = .<ref .~e>.
     let fass e1 e2 = .<(.~e1) := (.~e2)>.
     let ( ! ) e = .<! (.~e) >.
     let aref e1 e2 = .<(.~e1).(.~e2)>.
     let ym fc = .<let rec ym f x = f (ym f) x in ym .~fc>.
    end
It defines code-generating combinators, such as app for building a function application or ffor for generating a for-loop. ENVID is the starting point for optimizations, which replace some of the ENVID generators with optimizing ones.

We now demonstrate the partial loop unrolling optimization. First we introduce the combinator for full loop unrolling:

    let rec full_unroll lb ub body =
      if lb = ub then body lb
      else if lb < ub then .<begin .~(body lb); 
                              .~(full_unroll (lb+1) ub body) end>.
      else .< () >.
For example,
    .<let r = ref 0 in
      .~(full_unroll 1 3 (fun i ->
          .<r := !r + i>.))>.
produces the following code
    .<let r_1 = ref 0 in
      r_1 := ((! r_1) + 1); r_1 := ((! r_1) + 2); r_1 := ((! r_1) + 3)>.

Partial loop unrolling by the given factor is written in terms of the full unrolling:

    let partial_unroll factor e1 e2 body = 
       .<let lb = .~e1 and ub = .~e2 in
         let ntimes = (ub-lb+1)/factor in
         begin
           for ii = 0 to ntimes-1 do
             .~(full_unroll 0 (factor-1) 
                  (fun i -> body .< ii*factor+lb +  i >.))
           done;
           for i = ntimes*factor+lb to ub do .~(body .<i>.) done
         end>.

To apply the tactic we define ENVPU, which is like ENVID but with ffor generating a partially unrolled loop.

    module ENVPU(N : sig val unroll_factor : int end) = struct
     include ENVID
     let ffor true e1 e2 body = 
        partial_unroll N.unroll_factor e1 e2 body
    end
    module ENVPU2 = ENVPU(struct let unroll_factor = 2 end)
Using the just defined ENVPU2 instead of ENVID in dot1:
    let dotu = funM ENVPU2 n a1 a2 ->
      let sum = ref 0.0 in
      for i = 0 to (n-1) do
        sum := !sum +. a1.(i) *. a2.(n-i-1)
      done;
      !sum
produces the code with the for-loop unrolled twice:
    .<fun n_3 -> fun a1_4 -> fun a2_5 ->
       let sum_6 = (ref 0.0) in
       let lb_7 = 0
       and ub_8 = (n_3 - 1) in
       let ntimes_9 = (((ub_8 - lb_7) + 1) / 2) in
       for ii_12 = 0 to (ntimes_9 - 1) do
        let i_14 = (((ii_12 * 2) + lb_7) + 0) in
        (sum_6 := ((! sum_6) +. (a1_4.(i_14) *. a2_5.((n_3 - i_14) - 1))));
        let i_13 = (((ii_12 * 2) + lb_7) + 1) in
        (sum_6 := ((! sum_6) +. (a1_4.(i_13) *. a2_5.((n_3 - i_13) - 1))))
       done;
       for i_10 = ((ntimes_9 * 2) + lb_7) to ub_8 do
        let i_11 = i_10 in
        (sum_6 := ((! sum_6) +. (a1_4.(i_11) *. a2_5.((n_3 - i_11) - 1))))
       done;
       (! sum_6)>.
The code is messy, as is typical of optimized code. It would have been tedious to write it by hand.

What made it possible to interpret the for-loop in dot1 as the optimized partially-unrolled loop is the syntactic extension funM. It did a source-to-source transformation of a block of OCaml code into its typed AST, in what is now called the tagless-final style. For example,

    funM ENVID n a  ->
      let sum = ref 0 in
      for i = 0 to n do
        sum := !sum + a.(i)
      done;
      !sum
becomes
    .<fun n a -> .~(ENV.run (
      ENV.bind (ENV.ref (ENV.retL 0)) (fun sum' ->
      .<let sum = .~sum' in .~(
      ENV.fseq
       (ENV.ffor true (ENV.retL 0) (ENV.retS .<n>.) (fun i' ->
         .<let i = .~i' in .~(
           ENV.fass (ENV.retS .<sum>.)
                    (ENV.(+) (ENV.(!) (ENV.retS .<sum>.))
                             (ENV.aref (ENV.retS .<a>.) (ENV.retS .<i>.))))>.))
       (ENV.(!) (ENV.retS .<sum>.)))>.)))
    >.
which is the code passed to the OCaml compiler. It is up to ENV then to interpret for-loops or let-statements. Thus the point of funM is to let the users define their own interpretations of OCaml code -- interpretations performing code analyses or incorporating domain-specific optimizations.

There is a lot of room for improvement; the most obvious is to turn funM into a functor -- or better yet, a first-class functor. That is, given the signature of a typed AST

    module type SYM = sig
     type 'a repr
     val int  : int  -> int repr
     val unit : unit -> unit repr
     val app  : ('a->'b) repr -> 'a repr -> 'b repr
     val lam  : ('a repr -> 'b repr) -> ('a->'b) repr
     val (+)  : int repr -> int repr -> int repr
     val (!)  : 'a ref repr -> 'a repr
     val ref  : 'a repr -> 'a ref repr
     val let_ : 'a repr -> ('a repr -> 'b repr) -> 'b repr
     val fseq : unit repr -> 'a repr -> 'a repr     (* sequencing *)
     val ffor : bool -> int repr -> int repr -> 
                (int repr -> unit repr) -> unit repr
     val aref : 'a array repr -> int repr -> 'a repr
     val fass : 'a ref repr -> 'a repr -> unit repr (* assignment *)
     val run  : 'a repr -> 'a code
    end
we re-write the following source code fragment
    funM ENVID n a  ->
      let sum = ref 0 in
      for i = 0 to n do
        sum := !sum + a.(i)
      done;
      !sum
as follows:
    fun (module S:SYM) -> let open S in run (
      lam (fun n -> lam (fun a ->
        let_ (ref (int 0)) (fun sum ->
          fseq 
            (ffor true (int 0) n (fun i ->
              fass sum ((! sum) + aref a i)))
            (! sum)))))
The result is truly in the tagless-final style; it also looks quite like ``Scala-virtualized''. Another obvious improvement is to re-write the syntactic extension pa_lift as a ppx source-to-source transformer.

At the 2006 meeting, a couple of people said to me after the presentation that they liked it. One person added he might like to collaborate on this topic. ``Alas, my student has just graduated,'' he said, ``and I currently don't have others to work on this.'' This was the end of the story.

Version
The current version is October, 2006
References
pa_lift.ml [6K]
The old camlp4 syntactic extension that interprets the OCaml code as a tagless-final DSL

convolution.ml [3K]
The code for the convolution optimization example described in the article

gib.ml [9K]
The code for optimizing the Gibonacci example: tactics for the state- and continuation-passing transformations

tagless_final.ml [5K]
The modern reconstruction of the 2006 WG2.11 talk, in the tagless-final terms

 

Generating optimal stencil code

Generating memory-bandwidth--optimal code for array computations -- typical of PDE solvers and signal processing kernels -- was one of the Shonan Challenge benchmarks. The challenge was submitted by Takayuki Muranushi, an astrophysicist whose work depends on solving PDEs on truly huge meshes. This article answers the challenge. It develops a domain-specific language (DSL) for stencil computations, embedded in (Meta)OCaml. The generated code attains the bandwidth optimality -- the B/F ratio -- of the hand-optimized code. It also correctly handles the edge cases and assuredly has no array bound errors, both of which were problematic for the hand-written code.

The stencil EDSL is the straightforward application of the previous, seemingly purely academic, `useless' work on generating optimal Gibonacci codes and the let-insertion with delimited control. The key insight is regarding a stencil as a sliding-window cache, for real or virtual arrays. The implementation demonstrated that scope extrusion is a problem in practice after all, justifying the research into its static prevention.
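
To recall the technique: below is a minimal sketch of let-insertion with delimited control, assuming BER MetaOCaml and the delimcc library (the combinator genlet and the example are illustrative, not excerpts from stencil.ml). Shifting to a prompt moves the computation of an expression out of its context and binds it to a fresh variable, so the result is computed once and shared.

    open Delimcc

    (* Insert a let-binding for the expression e at the nearest enclosing
       prompt p; return the code of the fresh variable t *)
    let genlet : 'w code prompt -> float code -> float code = fun p e ->
      shift p (fun k -> .< let t = .~e in .~(k .<t>.) >.)

    (* The prompt is pushed inside the binder of x, so the inserted let
       remains within the scope of x *)
    let shared =
      let p = new_prompt () in
      .< fun x -> .~(push_prompt p (fun () ->
           let y = genlet p .<x *. x>. in
           .< .~y +. .~y >.)) >.
    (* generates: fun x -> let t = x *. x in t +. t *)

Pushing the prompt outside of the fun x bracket would make genlet insert the let outside the binder of x -- exactly the scope extrusion mentioned at the end of this article.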

The main challenge in high-performance computing (HPC) is overcoming the memory bottleneck: reducing the amount of time the CPU (local node) spends waiting for data. The speed disparity between the CPU and the memory subsystem is often characterized by the so-called B/F ratio: the ratio of the number of bytes moved from or to the main memory during a computation to the number of executed floating-point operations. The challenge of HPC code optimization is to reduce the B/F ratio and the amount of needed local memory.

Signal processing or finite-element algorithms can often be stated as element-wise computations on shifted arrays. The following is the running example in the challenge.

    w = a * S[1] a
    b = a - w + S[-1] w
Here a is a global input array, b is a global output array, and w is a working array. The operation S[n] shifts the argument vector by n (to the left, if n is positive). All arithmetic operations on vectors (addition, multiplication, subtraction) are elementwise. Global arrays are shared or distributed throughout a supercomputer; reading or writing them requires accessing off-the-chip memory or the inter-node communication.

The naive implementation, neglecting for now edge elements, can be written in OCaml as

    for i=0 to (n-2) do
      w.(i) <- a.(i) *. a.(i+1)
    done;
    for i=1 to (n-1) do
      b.(i) <- a.(i) -. w.(i) +. w.(i-1)
    done
Assuming w is a local array, the first loop reads 2*(n-1) elements from the global array a and does (n-1) floating-point multiplications. The second loop reads n-1 elements from a, writes n-1 elements to b, and does 2*(n-1) floating-point operations. In total, that is 4*(n-1) global accesses of 8 bytes each against 3*(n-1) floating-point operations; the B/F ratio is hence 32/3. Current supercomputers can sustain a B/F of about 1 to 2. Therefore, the naive implementation will run an order of magnitude below the peak performance, because the global memory cannot keep up with the CPU. The slow-down is likely to be even larger since, for large n, the array w won't fit into the local memory (cache).

After the array computation finishes, the output array b becomes the input array for the next `time step', and the computation continues. The B/F ratio can be reduced by doing two time steps at once:

    w1 = a * S[1] a
    wm = a - w1 + S[-1] w1
    w2 = wm * S[1] wm
    b  = wm - w2 + S[-1] w2
The naive implementation has B/F of 32/6 -- with twice as many floating-point operations per global memory access. However, the implementation requires three times more local memory.

We generate code that reads one and writes one floating-point value per loop iteration while executing 6 floating-point operations, for a B/F of 16/6 -- without using any local arrays. The code is expected to run nearly at the peak speed of the supercomputer.

The enclosed code develops the EDSL for stencil computation in three steps, starting with the naive implementation, which generates many duplicate array accesses. The array accesses are then cached in local variables. Finally, a stencil is introduced as a sliding-window cache. For illustration, here is the naive generator of the double-time-step computation:

    let t21 max_bounded a b =
      forlooph 2 min_bounded max_bounded (
        let a = gref0 a in
        let w1 = a *@ ashift 1 a in 
        let wm = a -@ w1 +@ ashift (-1) w1 in
        let w2 = wm *@ ashift 1 wm in
        b <-@ wm -@ w2 +@ ashift (-1) w2)
where *@, +@ etc. are element-wise operations on arrays, ashift n is the array shift S[n], and forlooph 2 is a loop combinator that unrolls the first and the last two iterations. The generator will complain, correctly, if the halo value is less than 2, because it then cannot verify that all array indices are in bounds. The EDSL code looks quite like the mathematical notation.

To generate the optimal code, we merely add stencil_memo combinators:

    let t25 max_bounded a b =
      let p = new_prompt () and q = new_prompt () in
      push_prompt q (fun () ->
      forlooph 2 min_bounded max_bounded (with_body_prompt p (
        let a  = stencil_memo q p (gref0 a) in
        let w1 = stencil_memo q p (a *@ ashift 1 a) in 
        let wm = stencil_memo q p (a -@ w1 +@ ashift (-1) w1) in
        let w2 = stencil_memo q p (wm *@ ashift 1 wm) in
        b <-@ wm -@ w2 +@ ashift (-1) w2)))
The combinators need to know the loop scope (the scope of loop-local variables) and the outer scope (the scope of the variables that store data across iterations). The so-called prompts q and p denote those scopes. The generated code, shown in the comments in stencil.ml, reads and writes a global array once per iteration and does six floating-point operations, with a B/F of 16/6.

The hand-optimized code has essentially the same loop body and the same B/F. However, the first versions of the hand-written code left the elements 0, 1, and n-1 of the output array uninitialized. Furthermore, the code read past the end of the input array. Getting edge cases right is really difficult. The generated code has long, tedious stretches before and after the main loop that properly compute the edge elements. Writing such code by hand is excruciatingly boring. Code generation truly helps.

The stencil challenge proved to be a surprisingly good motivation for the research on staging. It required code generation with effects, in particular, effects crossing binders. Implementing the challenge showed that scope extrusion does happen in practice. Although the fact of scope extrusion is easy to see when compiling the generated code, finding the culprit within the generator turns out to be difficult.

Version
The current version is October, 2012
References
stencil-challenge.cpp [6K]
Takayuki Muranushi: PDE challenge
A local copy of the file problems/tiling/main1.cpp committed to the Shonan-challenge GitHub repository on 2012-05-21
The file describes the challenge, showing the naive and the hand-optimized code.

stencil_hand.ml [4K]
Reference code: specification and the hand-written optimized code
The code correctly handles the edge cases and has unit tests.

stencil.ml [28K]
The MetaOCaml code of the solution to the challenge
The file gradually develops the EDSL for stencil computations and has lots of generated sample code and numerical tests.
"The code requires MetaOCaml and the delimcc library.

bounded.mli [3K]
bounded.ml [5K]
Partially known dynamic values for safe array indexing
A library for statically tracking bounds on unknown dynamic values.