previous  next  contents  top

Differentiating Parsers

 

We describe differentiation of parsers with respect to the input stream. We obtain incremental parsers, which yield (fragments) of the parsed data as more input becomes available. The differentiated parser is undoable: at any point, e.g., upon encountering a parsing error, we can discard the portion of input up to a previously noted checkpoint and resume parsing supplying an alternative input.


 

Differentiating off-the-shelf parsers using delimited control

We demonstrate the conversion of a regular parser to an incremental one in byte-code OCaml. The converted, incremental parser lets us parse from a stream that is only partially known. The parser may report what it can, asking for more input. When more input is supplied, the parsing resumes. The converted parser is not only incremental but also undoable and restartable. If, after ingesting a chunk of input the parser reports a problem, we can `go back' and supply a different piece of input.

The conversion procedure is automatic and largely independent of the parser implementation. The parser should either be written without visible side effects, or else we should have access to its global mutable state. The parser otherwise may be written with no incremental parsing in mind, available to us only in a compiled form. The conversion procedure relies on the inversion of control, accomplished with the help of delimited continuations provided by the delimcc library.

In this section, we illustrate the conversion on the example of Genlex.make_lexer from the OCaml standard library. We also use Stream.iter, Stream.from and other OCaml library functions -- again, as they are, in the compiled form in the standard library. We have made no modifications to any part of the OCaml system. We show the complete code and a sample incremental parsing session below.

We develop our example interactively, by submitting definitions and expressions to the OCaml interpreter (the top level) and observing its responses. In the transcript below, the responses are indented. The responses always include the inferred type for the expression or the definition. We assume the delimcc library has already been installed.

 

Preliminaries: a sample non-incremental parser

First, preliminaries:

     open Delimcc;;
     open Stream;;
     open Printf;;
We define a sample parser (a lexer, to be precise) and demonstrate its common, non-incremental operation. We apply the lexer to a sample stream, build from a user-supplied string. We print out the produced tokens with the help of the standard library function Stream.iter.
     let lexer = Genlex.make_lexer ["+";"-";"*";"/";"let";"="; "("; ")"];;
          val lexer : char Stream.t -> Genlex.token Stream.t = <fun>
     
      (* The getChar-like function for Stream.from, reading characters *)
      (* from a user-supplied string str                               *)
     let stream_fn1 str = fun pos ->
       (* printf "stream_fn1: asked for char #%d\n" pos; *)
       if pos < String.length str then Some str.[pos] else None;;
          val stream_fn1 : string -> int -> char option = <fun>
     
     let show_token_stream = Stream.iter (function 
       | Genlex.Kwd s    -> printf "Stream elem: Kwd %s\n" s
       | Genlex.Ident s  -> printf "Stream elem: Ident %s\n" s
       | Genlex.Int d    -> printf "Stream elem: int %d\n" d
       | Genlex.Float f  -> printf "Stream elem: float %g\n" f
       | Genlex.String s -> printf "Stream elem: \"%s\"\n" s
       | Genlex.Char c   -> printf "Stream elem: '%c'\n" c
     );;
          val show_token_stream : Genlex.token Stream.t -> unit = <fun>
An indented line after each definition shows the response of the OCaml top-level, with the inferred type for each function. We can try our lexer out:
     let test1 = show_token_stream (lexer (Stream.from (stream_fn1 
        "let x = 1 + 2 in (* comment *) \"xxx\" ^ x")));;
     
          Stream elem: Kwd let
          Stream elem: Ident x
          Stream elem: Kwd =
          Stream elem: int 1
          Stream elem: Kwd +
          Stream elem: int 2
          Stream elem: Ident in
          Stream elem: "xxx"
          Stream elem: Ident ^
          Stream elem: Ident x
          val test1 : unit = ()
 

Differentiation

We now invert the parser, in the following two lines. We define a new getChar `callback' that uses delimited control operator shift:

     type stream_req = ReqDone | ReqChar of int * (char option -> stream_req);;
     let stream_inv p = fun pos -> shift p (fun sk -> ReqChar (pos,sk));;
          val stream_inv : stream_req Delimcc.prompt -> int -> char option = <fun>
If we wish to ask the user for chunks, rather than for one character at a time, we need a simple buffering layer.
     let rec filler buffer buffer_pos = function ReqDone -> ReqDone
       | ReqChar (pos,k) as req ->
       let pos_rel = pos - buffer_pos in
       let _ = 
         (* we don't accumulate already emptied buffers. We could. *)
         assert (pos_rel >= 0) in
       if pos_rel < String.length buffer then
         (* desired char is already in buffer *)
         filler buffer buffer_pos (k (Some buffer.[pos_rel])) 
         else
         (* buffer underflow. Ask the user to fill the buffer *)
         req
     ;;
          val filler : string -> int -> stream_req -> stream_req = <fun>
     
     let cont str (ReqChar (pos,k) as req) = filler str pos req;;
          val cont : string -> stream_req -> stream_req = <fun>
     
     let finish (ReqChar (pos,k)) = filler "" pos (k None);;
          val finish : stream_req -> stream_req = <fun>
 

Interactive play with the derived incremental parser

We are all set. Please compare the iterator below with test1 above. The composition show_token_stream (lexer (Stream.from ... )) is exactly as before. We merely replaced the getChar-like function stream_fn with stream_inv.

     let test2c0 = 
       let p = new_prompt () in
       push_prompt p (fun () ->
         filler "" 0 (
         show_token_stream (lexer (Stream.from (stream_inv p))); ReqDone))
     ;;
          val test2c0 : stream_req = ReqChar (0, <fun>)
Again, an indented line underneath each statement is the response of the OCaml top-level. The response above says that the parser is suspended, awaiting the character number 0. We now feed the parser chunks of input. The chunks do not have to contain complete lexems. For example, a comment may start in one chunk and end in another.
     let test2c1 = cont "let x = 1 + 2 " test2c0;;
     
          Stream elem: Kwd let
          Stream elem: Ident x
          Stream elem: Kwd =
          Stream elem: int 1
          Stream elem: Kwd +
          Stream elem: int 2
          val test2c1 : stream_req = ReqChar (14, <fun>)
We see that the parser ate the chunk, produced some tokens, and waits for more input. We give it more:
     let test2c2 = cont "in (* com " test2c1;;
     
          Stream elem: Ident in
          val test2c2 : stream_req = ReqChar (24, <fun>)
The supplied chunk had a non-terminated comment, to be closed in the next chunk:
     let test2c3 = cont "ment *) " test2c2;;
          val test2c3 : stream_req = ReqChar (32, <fun>)
     
     let test2c4 = cont "\"xx" test2c3;;
          val test2c4 : stream_req = ReqChar (35, <fun>)
     
     let test2c5 = cont "x\" " test2c4;;
          Stream elem: "xxx"
          val test2c5 : stream_req = ReqChar (38, <fun>)
Finally the parser produced something, because it got the matching double quote. We feed the final chunk:
     let test2c6 = cont " ^ x" test2c5;;
          Stream elem: Ident ^
          val test2c6 : stream_req = ReqChar (42, <fun>)
     
     let test2c7 = finish test2c6;;
          Stream elem: Ident x
          val test2c7 : stream_req = ReqDone
and we are finished. 

Checkpointing and re-parsing with alternative input

Derived incremental parsers are restartable and undoable. We may restart our parser, although we have already terminated its stream and obtained the complete parse, test2c7. Since we possess the reference to an intermediate state, test2c6, we can restart parsing from that point supplying alternative input:

     let test2c71 = cont "yx * 10" test2c6;;
          Stream elem: Ident xyx
          Stream elem: Kwd *
          val test2c71 : stream_req = ReqChar (49, <fun>)
     
     let test2c8  = finish test2c71;;
          Stream elem: int 10
          val test2c8 : stream_req = ReqDone
and obtaining an `alternative' parse. We can resume parsing from any other noted checkpoint, e.g., test2c1. 

The intuition for the differentiation

Parsers, scanners and other input processors often let the user specify a getChar-like `callback'. A parser invokes the callback to read input data. By supplying a special callback, essentially shift f f, we obtain an `inverted parser', which may throw a special `exception' indicating that more input is needed. The exception contains a resumption function, which takes input data and resumes the parser, yielding parsing results or another resumption exception. If the parser has visible side-effects such as mutations, we should be able to access that mutable state, so that we could see the partial results and check-point them. The inversion turns a function that pulls its input to the function into which we push input. That inversion is differentiation, letting us observe the behavior of the function as we feed it one piece of input at a time.

The debugger analogy might be helpful. The operation shift f f is akin to a breakpoint, like INT 3 on x86. When we set this break point in a callback to parse we drop into the debugger whenever the callback is invoked, that is, whenever the parser needs a new character. Once in the debugger, we can examine our state and resume the parser giving it a character of our choice. The control delimiter push_prompt (see test2c0) is akin to the try block. It lets us catch the `debugging exception' -- install own own INT3 handler, so to speak -- and so script the debugger in our own program. Delimited continuations offer us a scriptable debugger. Unlike the free-wheeling debugger like gdb, the debugger of delimited continuations respects abstractions and is type sound, supporting static invariants of the type system.

As in calculus, it makes sense to differentiate only continuous functions, which produce output as input is consumed. If the result is produced only after the whole input is taken, differentiation is not helpful. Yacc-produced parsers let the users specify `semantic actions'. We can embed `breakpoints' in them too, making a parser produce results incrementally, as a stream of fired semantic actions and recognized tokens. This approach can, for example, invert a DOM XML parser into a streaming, pull-like SAX XML parser.

 

Differentiating pure parsers

Dan Doel has applied the above differentiation technique to Haskell parsers, built with parser combinators. He started with a pure parser over an ordinary string; he then turned to a parser written in monadic style and consuming a monadic list. He used the CC-delcont library of delimited control, which he packaged for Hackage. Dan Doel described his results in two messages posted on the Haskell-Cafe mailing list, [inv-parser-pure] and [inv-parser-monadic].

In his first message, Dan Doel differentiated a pure, string-based parser built with Parsec. The technique applies to any parsing function as it is, without any need to re-write it in a monadic style or add delimited continuation type constraints. The shortcoming are discussed extensively in his message and summarized below. It is remarkable however how much can be accomplished with pure parsers.

To make a parser resumable, Dan Doel wrapped it in a CC monadic function that uses delimited control to request a chunk of input -- as was explained in the OCaml example earlier. As the wrapper accumulates progressively longer pieces of input, it applies the parser to those pieces. If the parser could already yield a successful result given the input so far, that input is remembered as a checkpoint and the successful partial parse is sent to the user. In effect, the wrapper accumulates the history of user input and successive parses of partial input.

Like in calculus, we feed the function progressively longer inputs and observe differences in the output. Since for pure and total functions the act of computation is unobservable (if by observation we mean looking at at program's results), rather than accumulating partial outputs we can accumulate partial inputs. This approach is reminiscent of Peter Thiemann's WASH -- which too remembered all history of user input to implement undo and resumptions.

There are two main shortcomings. Although re-parsing may be unobservable from the point of view of printed results, it does take time. Running the parser on the whole input whenever the user submitted a new chunk can be quite time consuming. A bigger problem is detection of parse errors. We run the parser on partial input as if it were the complete input. If the parsing fails, we cannot generally tell if it is because the input is incomplete or because it is unparseable already. For example, suppose the user submitted a chunk containing an open parenthesis without the matching close parenthesis. Parsing this chunk as if it were the complete input would fail. Supplying more input would `fix' the problem. On the other hand, submitted input may contain a mismatched close parenthesis. More input will not not fix that parsing error.

Genuine differentiation would solve both problems. We could see results as they are being produced. We can accumulate partial outputs rather than partial inputs. We can see a parsing error from a partial input and so alert the user that the input is wrong already. ``Whether this is enough motivation to push monads further into one's data structures, or drop Haskell and go with implicitly monadic code everywhere in OCaml is left as an exercise to the reader. :), '' Dan Doel wrote at the end of his message.

 

Differentiating a monadic parser in Haskell

In his second message [inv-parser-monadic] Dan Doel has made the parser monadic so that delimited-control--based inversion could be applied as it has been in OCaml. The parser no longer takes an ordinary string as its input. Rather, the input is a data structure incorporating monadic values, so that effects could be performed as more elements are demanded:

     data MList' m a = MNil | MCons a (MList m a)
     type MList m a = m (MList' m a)
Such lazy effectful data structures ``with embedded effects'' are typical in functional logic programming. In fact, exactly the above MList was used in [lazy-nondet].

``So, success! Tokens are printed out as soon as the lexer is able to recognize them, properly interleaved with other IO side effects, and resuming from an intermediate parse does not cause duplication of output,'' he wrote.

Indeed, we differentiate a parser by embedding `breakpoints' in the input stream, in the getChar-like routine the parser uses to obtain next piece of input. The MList data structure is a clever way to embed, or conceal the getChar routines. Dan Doel has warned that when converting list functions to MList functions, we have to be very careful not to perform side effects twice. The same care had to be taken in the development of LogicT.

Finally, Dan Doel noted that writing data structures like lists in terms of fixpoints of functors lets us derive both a pure version and the embedded monadic version of the data type, just by changing the fixpoint functor. The benefits of coding recursive data types with explicit open recursion have been described in Tim Sheard and Emir Pasalic: Two-Level Types and Parameterized Modules JFP v14 (5):547-587, 2004.''

 

History and references

This web page is based on the messages

Incremental, undoable parsing in OCaml as the general parser inversion
posted on the Caml-list on Tue, 3 Jul 2007 04:38:36 -0700 (PDT).

Parser inversions [Was: CC-delcont-0.1...] posted on the Haskell-Cafe on Wed, 1 Aug 2007 19:46:45 -0700 (PDT).

[inv-parser-pure]
Dan Doel. Re: ANNOUNCE: CC-delcont-0.1; Delimited continuations for Haskell
posted on the Haskell-Cafe mailing list on Sun Jul 15 06:57:23 EDT 2007.

[inv-parser-monadic]
Dan Doel. ANNOUNCE: CC-delcont-0.1; Delimited continuations for Haskell
posted on the Haskell-Cafe mailing list on Mon Jul 16 21:59:03 EDT 2007.

[lazy-nondet]
Purely Functional Lazy Non-deterministic Programming

[delimcc]
The delimcc library: delimited continuations in OCaml



Last updated December 5, 2009

This site's top page is http://okmij.org/ftp/

oleg-at-pobox.com or oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!

Converted from HSXML by HSXML->HTML