previous   next   up   top

Incremental multi-level input processing and collection enumeration

 

Iteratee/Enumeratee is a way to process data -- file bytes, network datagrams or matrix elements -- incrementally, declaratively and with tight resource control. The approach has been inspired by fold, parser/printer combinators and circuit diagrams. There are several variants of the Iteratee-based I/O, available as independent libraries on Hackage or as part of bigger applications such as web servers. Here we give motivation, justification and explanation, pointing out earlier and experimental versions as well as related projects.


 

Iteratee IO: safe, practical, declarative input processing

We introduce input processing with left-fold enumerator -- Iteratee IO -- as a safe, declarative and practical alternative to Handle and Lazy IO for input processing. The approach applies to the handling of data taken from in-memory structures, databases, files, sockets, etc. Binary and random IO is supported. The input data may also come from (deeply) embedded, e.g., chunk-encoded or encrypted streams. The approach is naturally incremental and permits IO interleaving without any unsafe operations. One stream can feed several processors in parallel; one processor can incrementally read several streams. The approach is algebraic and declarative, giving us an incremental online writer-parser--combinator library. The left-fold enumerator as a general concept has been used in other functional languages.

Our running example is HTTP request processing in Haskell, specifically, reading lines (terminated by CR, LF or CRLF) from a file descriptor and then from the chunk-encoded body. The main example illustrates multiplexing across two file descriptors and the full IO interleaving. The same line parser is used to process data from the file descriptor stream and from the embedded chunk-encoded stream, which is incrementally decoded.

We focus on:

We achieve both high performance and encapsulation of input processing layers that can be freely composed. The technique has been validated in a production web-application server written in Haskell; database access library Takusen; TIFF and WAVE file readers.

References
Please search Hackage for `enumerator' or `iteratee'.

IterateeIO-talk.pdf [159K]
IterateeIO-talk-notes.pdf [194K]
Talks about Iteratee IO
The first version of the talk was presented at DEFUN08, ACM SIGPLAN 2008 Developer Tracks on Functional Programming. September 27, 2008. Victoria, Canada.
The present version of the talk (given at the research colloquium of the Center for Software Technology, Department of Information and Computing Sciences, Utrecht University on December 17, 2009) describes the December 2009 version of the Iteratee IO.
The slides have been updated for the November 2010 version of IterateeM.
The talk is aimed at practitioners not afraid of Haskell, in particular, server programmers. That is, programmers who program long-running distributed applications and are painfully aware of the issues of reading from sockets, latency, buffering, many layers of decoding, proper resource disposal and sustaining high load.

README.dr [4K]
Description of the Iteratee IO library and the code accompanying the talk.

IterDemo.hs [14K]
A set of simple examples demonstrating Iteratees: solving a wide set of problems by combining and recombining of a few primitives. The examples revolve around reading text data and counting lines and words, or searching for a specific word.

Lazy-vs-correct.txt [5K]
Lazy vs correct IO: A message on composability and performance of Iteratee IO, posted on the Haskell-Cafe mailing list on Thu, 18 Sep 2008 23:51:59 -0700 (PDT).
We show that although the Iteratee IO library is not optimized at all, it already outperforms Lazy IO. The final benchmark, counting words in all GHC documentation, demonstrates the differences between Lazy and Iteratee IO: Iteratee IO finishes the test whereas Lazy IO runs out of file descriptors.

 

The design of iteratees

The protagonists are Stream, Iteratee, Enumerator, and the offspring Enumeratee. Stream carries the data, Iteratee consumes them, Enumerator produces them. Enumeratee is both a consumer and a producer, incrementally decoding the outer stream and producing the nested stream of decoded data. Enumerators compose to give a bigger producer, effectively concatenating the sources. Iteratee compose serially and in parallel. Iteratee can also nest, so to process nested streams or several streams incrementally.
     
     data Stream el = EOF (Maybe ErrMsg) | Chunk [el]
Stream is a (continuing) sequence of elements bundled in Chunks. The EOF variant tells that no more data will be coming: the stream is exhausted, either due to EOF or some error. Chunk [a] gives the currently available part of the stream. In particular, Chunk [] signifies a stream with no immediately available data but which is still continuing. When a stream processor receives such empty chunk, it should ``suspend itself'' -- indicate its desire for more data.
     
     data Iteratee el m a = IE_done a
                          | IE_cont (Maybe ErrMsg)
                                    (Stream el -> m (Iteratee el m a, Stream el))
Iteratee is a generic stream processor -- what is being folded over a stream. An iteratee exists in one of the three states: The control/error message too includes the continuation, to handle the possible reply by the stream producer. Like in Common Lisp, all iteratee errors are hence potentially restartable.

We assume that all iteratees are `good' -- given bounded input, they do the bounded amount of work and take the bounded amount of resources. Given a terminated stream, an iteratee should move to the done state, returning the results computed so far.

It turns out, iteratee forms a monad and a monad transformer. After all, Stream is a gradually consumed state; to read another chunk, the iteratee has to make an `operating system call' to the stream producer. Iteratee is a State and Cont monad, and hence an Error monad. We can write iteratees in the familiar do notation. The shown type of iteratees is not the only one possible; the comments in the source code debate two other choices.

     
     type Enumerator el m a = Iteratee el m a -> m (Iteratee el m a)
Enumerator is an encapsulation of a data source, a stream producer -- what folds an iteratee over the stream. An enumerator takes an iteratee and applies it to the stream data as they are being produced, until the source is depleted or the iteratee said it had enough. After disposing of buffers and other source-draining resources, enumerator returns the final value of the iteratee. Enumerator thus is an iteratee transformer.

Iteratees and enumerators operate over a base monad m, allowing producers and consumers side-effecting actions to obtain and digest data (e.g., reading or writing files). That base monad is often IO. That is not the only choice: Iteratee el IO is a monad too; it too may act as the base monad. If we instantiate the type of Enumerator el m a by choosing m to be the iteratee monad, we obtain Enumeratee:

     
      type Enumeratee elo eli m a =
        Iteratee eli m a -> Iteratee elo m (Iteratee eli m a)
Enumeratee, as any other enumerator, produces a stream, of the elements of type eli. Since an enumeratee acts in the monad Iteratee el IO, it may grab chunks of elo elements and use them as the source of its own stream. Thus enumeratee is a generic stream decoder. Stream decoding, or nested parsing, is rather common: buffering, character encoding, compression, encryption, SSL are just a few examples. The TIFF library relies on enumeratees, to decode a stream of bytes from a TIFF file, to the stream of 2- or 4-byte signed integers and then, for example, to the stream of rational numbers (the image resolution).

Concerns indeed separate: enumerators know how to get to the next element; iteratees know what to do with the next element. The character of Iteratees is the encapsulation of input processing layers that can be freely composed.

The November 2010 version of the code adds guidelines for writing simple iteratees and enumerators. The library uses standard combinators such as monadic bind to build iteratee chains; there are no more $$ and >>==.

Version
The current version is November 2010.
References
Iteratee.hs [25K]
Pure iteratees and parsing combinators. Justification for design decisions. Parsing of a stream chunk-encoded in another stream.

IterateeM.hs [52K]
Monadic and general iteratees: the main library. Examples of chunk-decoding and IO multiplexing.

IterateeIO-talk-notes.pdf [194K]
The updated slides of the Iteratee talk, originally given at Utrecht

Wc.hs [12K]
Many examples of nested and parallel processing of input stream.

IterateeMCPS.hs [39K]
The CPS variant of the January 2010 version of IterateeM.hs

Parallel composition of iteratees: one source to several sinks

Parallel composition of streams: several sources to one sink
Using Iteratee as a base monad to process several streams, at different speeds and inter-dependently.

 

Composing stream producers

A salient feature of Iteratee IO is compositionality: building complex stream operations by combining simpler ones. Iteratees -- stream consumers -- are monads and can be built by binding primitive iteratees, often using the do-notation. Composing enumerators -- stream producers -- gives the producer for the concatenated stream. Stream transformers, enumeratees, are consumers and producers at the same time and so are often regarded as complex. A more general view on Iteratee IO gives a simpler picture: stream consumers are monads, and stream producers are monoids. Stream combinators become fewer in number but clearer in semantics -- demonstrated by the examples of three different ways of composing (lists of) stream producers below.

Iteratees are akin to getChar and getLine from the standard System.IO library, parsing the implicit stdin-like stream. Like getChar, iteratees transparently take care of buffering and error handling. The stream `handle' is always implicit, and hence cannot be prematurely closed. Iteratees are more general than System.IO readers since iteratees' stdin is not restricted to a character stream. The type of iteratee stream elements e is part of the Iteratee type. Compare the signatures of the corresponding functions that extract the current element of the stream:

     System.IO.getChar :: IO Char
     IterateeM.head    :: Monad m => Iteratee e m e
Iteratees are not restricted to the IO monad either: the iteratee head above works in any monad and is therefore pure. We give iteratees a short alias, whose name should become clear shortly:
     type R e m = Iteratee e m
Writing iteratee programs should be just as familiar as writing System.IO programs, only without the restrictions on the type of stream elements and the monad.

The stream consumed by getChar is produced by the the Haskell run-time system, which is part of the Haskell environment rather than of a Haskell program. Iteratee IO brings the producers inside, letting the program itself produce streams and connect producers with consumers. A stream producer takes a consumer (a parser of the stream of es into a value of some type a), feeds it stream data and returns the resulting consumer.

     type Enumerator e m a = R e m a -> m (R e m a)
A producer encapsulates stream generation; a consumer encapsulates stream parsing. A consumer should work with any producer that generates the stream of the right type; correspondingly, a producer should work uniformly with any consumer so long as their stream types match. We enforce the abstraction boundary by requiring the enumerator to be parametric in a, the type of the parsing result. A general stream producer hence has the following type.
     newtype L e mi mo = L{unL :: forall a. R e mi a -> mo (R e mi a)}
Enumerator is a particular case of L e mi mo when mi and mo are the same.

The general stream producer L e mi mo is a monoid:

     instance (Monad mi, Monad mo) => Monoid (L e mi mo) where
        mempty = L return
        mappend (L e1) (L e2) = L $ \i -> e1 i >>= e2
It becomes intuitive that composing (or, mappending) enumerators effectively concatenates their streams. The parameters mi and mo of L e mi mo are monads; moreover, mo should be the same or `bigger' than mi: the producer should do all the effects of the consumer, and perhaps a few more. The following type class establishes the partial order on monads: what it means to be `bigger':
     class (Monad m1, Monad m2) => m1 :>=: m2 where
        promote :: m2 a -> m1 a
with at least the following obvious instances.
     instance Monad m => m :>=: m where 
        promote = id
     instance (Monad m, MonadTrans t, Monad (t m)) => t m :>=: m where 
        promote = lift

A notable particular case of L e mi mo is the outer monad mo being R e' m. The type L e m (R e' m) describes a producer that gets its data by reading another stream, of e', transforming one or more of e' into one or more of e. The type L e m (R e' m) thus describes a stream transformer -- an enumeratee! Enumerators and enumeratees hence unify in the general stream producer.

A producer connects with a consumer by a regular functional application (after the L unwrapping). For convenience, we introduce a right-associative infix operator

     L f ||| x = f x

Since an enumeratee is a consumer of the outer, e' stream, it can be connected with the producer of that stream. The composition then acts as a inner stream producer. We obtain a combinator to compose enumeratees with enumerators or with themselves:

     compL :: (mo :>=: m) => L e m mo -> L e' m (R e m)  -> L e' m mo
     compL (L e12) (L e23) = L (\ie' -> e12 (e23 ie') >>= promote . run)

We now demonstrate various compositions on a simple running example of composing the list of take enumeratees:

     takes = [takeL 2, takeL 3, takeL 4]
(where takeL is IterateeM.take wrapped in L). Granted, the example is a bit contrived: normally we want to compose a variety of enumeratees rather than the single take. On the other hand, take n makes a neater example.

First, takeL, as any stream producer, is a monoid, and can be mappended:

     c1 :: Monad m => L e m (R e m)
     c1 = foldl1 mappend takes
We could have just as well used foldr1: mappend is associative. Mappend'ing producers `concatenates' their sources. Therefore, c1 == take (2+3+4). To demonstrate this equality, we run c1 on a sample input stream [1..15], with stream2list as the consumer that reads the whole stream and returns it as a list.
     c1r = runIdentity $ run =<< (enum1chunkL [1..15] `compL` c1) ||| stream2list
     -- [1,2,3,4,5,6,7,8,9]
We have composed c1 with the enumerator enum1chunkL to obtain the enumerator of the prefix stream, which we then hook up with the stream2list iteratee. The result, in the comments, shows that c1 indeed behaves like take 9. The code to run c1 can be written differently but equivalently as
     c1r' = runIdentity $ run =<< 
           enum1chunkL [1..15] ||| (runI =<< c1 ||| stream2list)
     -- [1,2,3,4,5,6,7,8,9]
Instead of composing c1 with the producer enum1chunkL on the right we composed it with the consumer stream2list on the left, which leads to another way of composing takes.

The second method of composing enumeratees is to apply each to a consumer and compose the results. If e_j is a producer and i is an iteratee (for the inner stream), (e_i ||| i) is an iteratee for the outer stream. A list of such iteratees can be composed in several ways, sequentially:

     c2s :: Monad m => R e m a -> R e m [a]
     c2s = \i -> sequence $ map (\e -> runI =<< e ||| i) takes
     
     c2sr = runIdentity $ run =<< enum1chunkL [1..15] ||| c2s stream2list
     -- [[1,2],[3,4,5],[6,7,8,9]]
or in parallel:
     c2p :: Monad m => R e m a -> R e m [a]
     c2p = \i -> parallel $ map (\e -> runI =<< e ||| i) takes
      where
      parallel [i] = liftM (: []) i
      parallel (i:t) = do
        (iv,tv) <- enumPair i (parallel t)
        return (iv:tv)
     
     c2pr = runIdentity $ run =<< enum1chunkL [1..15] ||| c2p stream2list
     -- [[1,2],[1,2,3],[1,2,3,4]]

Finally, enumeratees compose `telescopically': enumeratees are stream transformers and can be arranged, by compL, into a pipeline -- very much like the Unix shell pipeline -- to transform stream elements further and further. We have seen compL in action already, in the term c1r, which composed an enumerator for the stream e' with a e'-to-e enumeratee giving the enumerator for the stream e.

     -- compare with c1: compL vs mappend
     c4  = foldl1 compL takes

(again, foldr1 works just as well). Piping take n_j into each other is not very interesting. It is easy to see that the result is equivalent to the single take (minimum [n_1,...n_j]). Indeed, c4 behaves like take (minimum [2,3,4]) == take 2:

     c4r = runIdentity $ run =<< enum1chunkL [1..15] `compL` c4' ||| stream2list
     -- [1,2]

We have shown a general view on Iteratee IO that unifies enumerators and enumeratees as general stream producers. Whereas iteratees, stream consumers, are monads and monad transformers, stream producers are monoids -- which gives numerous ways of composing them.

Version
The current version is January 2012.
References
ComposeAdv.hs [6K]
The commented code illustrating different compositions

Compose.hs [5K]
The older code using the Iteratee library without any embellishments. It was originally posted as Re: Composing a list of Enumeratees on the Haskell-Cafe mailing list on Fri, 7 Oct 2011 23:27:40 -0700 (PDT).

 

Random and binary IO: Reading TIFF

Iteratees presuppose sequential processing. A general-purpose input method must also support random IO: processing a seek-able input stream from an arbitrary position, jumping back and forth through the stream. We demonstrate random IO with iteratees, as well as reading non-textual files and converting raw bytes into multi-byte quantities such as integers, rationals, and TIFF dictionaries. Positioning of the input stream is evocative of delimited continuations.

We use random and binary IO to write a general-purpose TIFF library. The library emphasizes incremental processing, relying on iteratees and enumerators for on-demand reading of tag values. The library extensively uses nested streams, tacitly converting the stream of raw bytes from the file into streams of integers, rationals and other user-friendly items. The pixel matrix is presented as a contiguous stream, regardless of its segmentation into strips and physical arrangement.

We show a representative application of the library: reading a sample TIFF file, printing selected values from the TIFF dictionary, verifying the values of selected pixels and computing the histogram of pixel values. The pixel verification procedure stops reading the pixel matrix as soon as all specified pixel values are verified. The histogram accumulation does read the entire matrix, but incrementally. Neither pixel matrix processing procedure loads the whole matrix in memory. In fact, we never read and retain more than the IO-buffer-full of raw data.

Version 2.0 of the library demonstrates restartable exceptions to seek (that is, control the position) in a stream.

Version
The current version is November 2010.
References
RandomIO.hs [10K]
Random and Binary IO with Iteratees
The code implements the enumerator for a seekable stream and iteratees for reading multi-byte integral quantities in big- and little-endian formats.

Tiff.hs [19K]
TIFF library

TiffTest.hs [8K]
A sample application of the TIFF library

Binary IO and the TIFF library in Scheme

 

Parallel composition of iteratees: one source to several sinks

A compelling application of iteratees is incremental processing of a single stream by several consumers in parallel, in constant memory.

Feeding several parallel consumers off the same stream is quite common. Examples include Unix tee, tcpdump and similar wire-tapping commands; reading a stream of numbers and computing the average and the variance, which requires keeping the count of the numbers, of their sum and the sum of their squares. Our running example is even more familiar: Unix wc command, which reports the number of lines, words and characters in its input stream or files.

The input data may come from the network or a pipe, which we cannot rewind and re-read. We should handle large input, fast; storing received data in memory or files for re-processing is not an option either. We really have to process data incrementally and in parallel.

The combinator enumPair composes two iteratees in parallel, turning a pair of iteratees into an iteratee for a pair of their results:

     enumPair :: Monad m => Iteratee el m a -> Iteratee el m b ->
                 Iteratee el m (a,b)
     enumPair i1 i2 = enum2 i1 i2 >>= runI2
One should not confuse enumPair with the lifted pair: enumPair i1 i2 composes the iteratees i1 and i2 in parallel, feeding the same chunk of input to both of them. In contrast, liftM2 (,) i1 i2 composes the iteratees sequentially, feeding the stream only to i1, with i2 getting the left-over after i1 is finished.

The combinator enumPair is expressed in terms of enum2, a signal-splitter--like enumeratee with one source and two sinks.

     enum2 :: Monad m => Iteratee el m a -> Iteratee el m b ->
              Iteratee el m (Iteratee el m a, Iteratee el m b)
The enumeratee splits the stream without any buffering, passing the received chunk to both iteratees, before asking for the next chunk. The enumeratee continues for as long as there are stream data and at least one of the iteratees wants them.

Our running example is like wc, reporting the number of words and characters in a given file. It can be elegantly written with Lazy IO:

     main_words_char_lazy = do
        name:_ <- getArgs
        file   <- readFile name
        print $ (length $ words file, length file)
Here is the same code using iteratees:
     main_words_char_iter = do
        name:_  <- getArgs
        counter <- run =<< enum_file name 
          ((runI =<< enum_words stream_count) `enumPair` stream_count)
        print counter
The iteratee stream_count counts the elements in its stream; we apply it to the stream of characters produced by enum_file and to the stream of words produced by decoding the character stream.

The ugly side of Lazy IO turns up when we run the code. As the size of the input file increases, from 2.4MB, to 4.9MB, to 24.8MB, so does the peak memory use of the lazy IO code: from 43MB to 78MB and finally, 337MB. Maximal residency grows too: from 20MB to 37MB and 163MB. The program must be reading the whole file in memory. That is not acceptable: we should not need 1/3-Gigabyte of memory to count words and characters in a 25-Megabyte file.

In contrast, the iteratee-based code counts in constant memory: the peak memory use is 2MB and the maximal residency is 108KB -- regardless of the size of the input file.

I am grateful to John Lato for introducing parallel composition and describing its advantages.

Version
The current version is November 2010.
References
Wc.hs [12K]
The benchmark code and the results. The code also tests counting words in very many files, demonstrating that the iteratee-based code is also frugal about file descriptors, needing only one.

 

Parallel composition of streams: several sources to one sink

Iteratee is a generic stream processor, what is being folded over a stream -- one stream. The standard Haskell Handle-based IO seems more general then, letting us read from several streams at once. Not only we can zip two files together, we can read two files at different paces. Furthermore, how much we read from one file may depend on what we have read from the other. Can we do all that with iteratees, while maintaining their advantages of incremental processing, precise resource control and the prevention of resource leaks? It turns out that we can. More surprising is how trivial it is. Neither we have to change the Iteratee library, nor do we have to know how iteratees are implemented. Monadic parsing combinators of the Iteratee library suffice.

Recall that Iteratee el m is a monad and so can be used as a base monad of another iteratee. That is how we derived enumeratees. We now observe that Iteratee el is also a monad transformer. We can transform any monad into an Iteratee monad -- including the iteratee monad. Laying iteratee monad transformers upon each other gives an us access to several streams at once.

A simple example of returning the first elements from two streams should make it concrete. Neither stream should be read beyond the first element. The most naive code works:

     i1 :: Monad m => Iteratee el1 (Iteratee el2 m) (el1, el2)
     i1 = do
          e1 <- head 
          e2 <- lift head
          return (e1,e2)
All signatures hereafter can be and have been inferred; we write them for clarity. The signature indeed clarifies that the iteratee i1 gets the element of the type el1 from the inner stream, and the element of the type el2 from the base monad, which happens to be iteratee as well and so can read from (its own) stream.

Since i1 is an iteratee, we can pass it to an enumerator. The nesting of streams becomes apparent in the (inferred) signature: the inner stream contains Ints and the outer stream has elements of the type el2.

     t11 :: (Monad m) => Iteratee el2 m (Iteratee Int (Iteratee el2 m) (Int, el2))
     t11 = enum_pure_1chunk [1::Int,2,3,4] i1
We complete the inner enumeration by running it. The type of t12 shows the result to be an iteratee, of an outer stream. We may pass it to a suitable enumerator (see t13) and run it:
     t12 :: (Monad m) => Iteratee el2 m (Int, el2)
     t12 = run =<< enum_pure_1chunk [1::Int,2,3,4] i1
     
     t13 :: (Monad m) => m (Iteratee Char m (Int, Char))
     t13 = enum_pure_1chunk "5678" $ run =<< enum_pure_1chunk [1::Int,2,3,4] i1
     
     t14 :: (Monad m) => m (Int, Char)
     t14 = run =<< (enum_pure_1chunk "5678" $
                      run =<< enum_pure_1chunk [1::Int,2,3,4] i1)
Running the complete example, runIdentity t14, gives us the pair (1,'5'), the heads of the two streams. With different enumerators we can verify that neither stream has been read beyond the first element.

We may even process two streams interdependently, reading incrementally and by various amounts. The amount to read from one stream may depend on what we have read from the other. For example, we will read the n-th word from the file test2.txt and find the first line in the file test1.txt with that word. We return the line paired with the word.

     zwl :: (MonadIO m) => Int -> Iteratee String (Iteratee Line m) (Line, String)
     zwl n = do
             drop n     -- Drop the words from the word stream before we
             w <- head  -- get the desired n-th word
             liftIO $ putStrLn $ "Got the word: " ++ w
                        -- From the line stream, drop the lines that
                        -- do not contain the word w
             lift $ dropWhile (not . isInfixOf w)
                        -- Read the first line that does contain w
             l <- lift $ head
             return (l,w)
We pass the two-stream iteratee zwl first to the enumerator of the inner stream of words, tzwlI, and then to the enumerator of the outer stream of lines, tzwlO. We run the whole computation and print the result:
     tzwlI :: (MonadIO m) => Int -> Iteratee Line m (Line, String)
     tzwlI n = run =<< enum_file_gen "test2.txt" (runI =<< enum_words (zwl n))
     
     tzwlO :: (MonadIO m) => Int -> m (Line, String)
     tzwlO n = run =<< enum_file_gen "test1.txt" (runI =<< enum_lines (tzwlI n))
     
     tzwlO5 = tzwlO 5 >>= print
The traces of read operations show we read both files incrementally, with interleaving, and only as far as needed.

Iterating iteratee transformers is reminiscent of lightweight monadic regions, which were obtained by iterating ST-monad--like transformers. Many-stream iteratees have many similarities with the regions, including the region-like discipline of allocating and disposing of stream resources.

Version
The current version is November 2010.
References
IterateeN.hs [7K]
The shown and other examples of multiple-stream processing

Merge.hs [8K]
A two-headed enumeratee: Merging two sorted streams into one sorted stream
The problem well illustrates the reading of two streams at various paces. How much to read from a stream can be determined only dynamically, based on the data read from both streams. We design not merely an iteratee with two input streams; we design an enumeratee. Thus we truly merge two input streams into a one, to be processed by inner iteratees as any other stream. We also process the two input streams by chunks rather than element-by-element. If a stream can yield several pieces of data at once, we take advantage of that, improving the performance.

How to zip folds: A library of fold transformers (streams)
The question of parallel processing of several streams is quite like the question of expressing list zipWith via list fold. (Recall the origins of iteratees and enumerators in the left fold.) If the only way to obtain elements of a list is to fold over it, could we express the zip function? At first blush, it seems impossible: after all, zip requires simultaneous processing of two lists whereas list fold can at best give us only nested processing. Yet we can write zip using only folds to deconstruct lists, as the referenced article explains.

Lightweight monadic regions

 

Towards the best collection traversal interface

Most programming languages support collections, represented by an in-memory data structure, a file, a database, or a generating function. A programming language system gives us typically one of the two interfaces to systematically access elements of a collection. One traversal API is based on enumerators -- e.g., for-each , map , filter higher-order procedures -- of which the most general is fold . The second approach relies on streams, a.k.a. cursors, lazy lists. Generators such as the ones in Icon, Ruby and Python are a hybrid approach.

It is well-known that given a cursor interface to a collection, we can implement an enumerator. It is less appreciated that given an enumerator interface, we can always derive a cursor -- in an automatic way. We demonstrate that generic procedure for languages with and without first-class continuations.

Now that cursors and enumerators are inter-convertible, an implementor of a collection has a choice: which of the two interfaces to implement natively? We argue that he should offer the enumerator interface as the native one. The paper elaborates that enumerators are superior: in efficiency; in ease of programming; in more predictable resource utilization and avoidance of resource leaks. We present a design of the overall optimal collection traversal interface, which is based on a left-fold-like combinator with premature termination. The design has been implemented and tested in practice.

Version
The current version is 1.5, Nov 6, 2003.
References
LL3-collections-enumerators.txt [23K]
Towards the best collection API: an extended abstract
LL3-collections-talk.pdf [55K]
An extended abstract and a poster presentation at the Lightweight Languages 2003 (LL3) workshop. November 8, 2003, MIT, Cambridge MA.

Generic Zipper: the context of a traversal

Enumerating collections: searching for the best iterator

Relationship between iterations and continuations

 

Justifying layered streams for input processing

Handling many internet protocols and file formats -- describing structured, multiply encoded data -- emphasizes layered stream processing. For example, the lowest-level stream reads data in blocks or ethernet frames. An overlay stream provides abstractions of characters or bytes. Higher streams take care of Base64, UTF and other decoding, assembling data into words, data structures, or Unicode characters. It is often useful to layer a stream that reads at most N units (octets, bits, etc) from a lower-level stream. A layered stream may transparently accumulate the SHA-1 digest of the read data and check the signature afterwards. It is important to be able to push and peel off the layers at run time. The latter is indispensable when dealing with HTTP streams carrying multi-part messages.
Version
The current version is 1.1, 2002.
References
layered-io.txt [6K]
Advocating layered IO on the example of HTTP multi-part processing. A message Layered I/O posted on the Haskell-Cafe mailing list on Tue, 14 Sep 2004 23:55:25 -0700 (PDT).

io.txt [26K]
An early draft of an ambitious layered IO proposal, cast in the context of Scheme. More polished drafts exist, and even a prototype Scheme implementation. Unfortunately, once it became clear that the ideas are working out, the motivation fizzled.

 

Streams in Linear Algebra

The linear matrix library LinAlg is built upon matrix streams, which provide sequential view/access to a matrix or its parts. Many of Linear Algebra algorithms turn out to require only sequential access to a matrix or its rows/columns, which is simpler and faster than random access. All streams in LinAlg are light-weight: they use no heap storage and leave no garbage.

A stream-wise access to a collection is an important access method, which may even be supported by hardware. For example, Pentium III floating-point extension (Internet Streaming SIMD Extension) lets programmers designate arrays as streams and provides instructions to handle such data efficiently (Internet Streaming SIMD Extensions, Shreekant (Ticky) Thakkar and Tom Huff, Computer, Vol. 32, No. 12, December 1999, pp. 26-34). Streaming is a typical memory access model of DSPs: that's why DSP almost never incorporate a data cache (See "DSP Processors Hit the Mainstream," Jennifer Eyre and Jeff Bier, Computer, Vol. 31, No. 8, August 1998, pp. 51-59). A memory architecture designed in an article "Smarter Memory: Improving Bandwidth for Streamed References" (IEEE Computer, July 1998, pp.54-63) achieves low overall latencies because the CPU is told by a compiler that a stream operation is to follow. LinAlg offers this streaming access model to an application programmer.

Matrix streams may stride a matrix by an arbitrary amount. This lets us traverse a matrix along the diagonal, by columns, by rows, etc. Streams can be constructed of a Matrix itself, or from other matrix views (MatrixColumn, MatrixRow, MatrixDiag). In the latter case, the streams are confined only to specific portions of the matrix.

Many functions of LinAlg are written in terms of streams, for example, the computation of vector norms, the addition of a vector to the diagonal or the anti-diagonal of a matrix, Aitken-Lagrange interpolation. Singular value decomposition SVD demonstrates many applications of streams: e.g., multiplying a matrix by a rotation matrix avoids random access to matrix elements and the corresponding range checks and offset calculations. The stream code is also more lucid. One may create a stream that spans over a part of another stream. We use substreams, for example, to efficiently reflect the upper triangle of a square matrix onto the lower one, yielding a symmetric matrix. The SVD computation uses subranging extensively, e.g., for left Householder transformations.

LinAlg's streams may span an arbitrary rectangular block of a matrix, including the whole matrix, a single matrix element, a matrix row or a column, or a part thereof. Assigning a block of one matrix to a block of another takes only one line -- which, due to inlining, is just as efficient as the direct loop with pointer manipulation.

References
LinAlg: Basic Linear Algebra and Optimization classlib


Last updated February 4, 2012

This site's top page is http://okmij.org/ftp/

oleg-at-pobox.com or oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!

Converted from HSXML by HSXML->HTML