Comparing SSAX and Expat XML parsers in features and performance

Introduction to the benchmark
Expat and SSAX
Summary of the performance benchmark
Why SSAX?
1. SSAX has a better API
Answers to comments
References

We present the results of a benchmark that compares the performance of C and Scheme XML parsers. The benchmark gives an insight into the performance of Scheme, compared to the best C code. We also discuss differences between SSAX and Expat in terms of features and the user interface.

Originally this page was a compilation of two articles posted on a newsgroup comp.lang.scheme On Nov 1 and 5, 2001, extended with answers to common comments. This page reports the latest benchmark results for the current, as of this writing, version of SSAX.

Introduction to the benchmark

The benchmark is "untagging" of an XML document realizing a full binary tree, such as

     <node><node><node><node><leaf>0</leaf><leaf>1</leaf></node>
                       <node><leaf>2</leaf><leaf>3</leaf></node></node>
                 <node><node><leaf>4</leaf><leaf>5</leaf></node>
                       <node><leaf>6</leaf><leaf>7</leaf></node></node></node>
           <node><node><node><leaf>8</leaf><leaf>9</leaf></node>
                       <node><leaf>0</leaf><leaf>1</leaf></node></node>
                 <node><node><leaf>2</leaf><leaf>3</leaf></node>
                       <node><leaf>4</leaf><leaf>5</leaf></node></node>
     </node></node>

The content of leaf nodes is a single-digit string. In loftier terms, we compute a string-value of the document root.

Expat and SSAX

To make the comparison meaningful, we need to discuss input models of Expat and SSAX.

Expat is the fastest XML parser [Expat-tutorial]. It is written in C by James Clark. It is a SAX parser. An application that uses Expat must open an XML file or stream and read its blocks into memory. The application should pass these blocks to Expat, indicating the block size and if it is the last block of the document. An application can potentially load the whole document into memory and pass this single block to Expat. The parser is specifically optimized for such a scenario, because Expat uses shared substrings as much as possible. The first benchmark application, string-value.c, reads the whole document into memory, passes it to Expat and asks the parser to compute the string value. It takes 0.05 sec user time for string-value.c to handle an XML document that represents a full binary tree of depth 15 (32768 leaf nodes, total size 884,723 bytes). This is indeed fast.

The Expat input mode assumes that a calling application must know when the document ends. Otherwise, an application cannot read the document from a stream, in whole or in blocks. When the input document is read from a regular file, it is trivial to find out when we are finished reading. The OS will tell us. It is not so trivial to determine the end of an XML document if it is wrapped in MIME or other envelopes. Furthermore, if we take a document from a (tcp) pipe, it may be impossible to tell offhand when to stop reading. Furthermore, if we unwittingly try to read one extra character, we become blocked and possibly deadlocked.

SSAX is also a SAX parser [SSAX]. It is written completely in Scheme. SSAX uses a different input model. Rather than relying on an application to feed data to it, SSAX reads characters from a given input port itself. SSAX reads ahead by no more than one character, and only when the parser is positive the character to read ahead must be available. SSAX does not need to be told when the document is ended. On the contrary, SSAX will tell you when it has finished parsing a root (or other) element. SSAX therefore is safe to be used on pipes, to process several documents in a file, or to handle selected parts of a document.

Therefore, to meaningfully compare SSAX with Expat, we need a different benchmark application: string-value-by-one.c . It first loads an XML document into a memory buffer. It then passes the content of that buffer one character at a time to Expat. It takes 0.34 user seconds for string-value-by-one.c to handle the same XML document as above (full binary tree of depth 15).

A SSAX benchmark string-value-ssax.scm is fully equivalent to string-value-by-one.c. This Scheme code also loads an XML document, opens the memory buffer as a string port and passes the port to SSAX. It takes 0.32 user seconds to handle the same document.

A benchmark string-value-ssax-ss.scm is also Scheme code: it is a version of string-value-ssax that relies on a custom function substring->symbol. The latter avoids copying a substring if the symbol has already been seen. The files substring-symbol.c, read-NCName-ss.scm and string-value-ssax-ss.scm in [SSAX-benchmarks] give more details.

The complete code for the benchmark is part of the SSAX project at SourceForge [SSAX-benchmarks].

Summary of the performance benchmark

	string-value	string-value -by-one	string-value -ssax	string-value -ssax-ss
bench1-file15.xml `884,723` bytes	`0.05s` `241`	`0.34s` `25`	`0.32s` `842`	`0.28s` `1503`
bench1-file16.xml `1,769,459` bytes	`0.09s` `483`	`0.68s` `49`	`0.65s` `2353`	`0.57s` `3744`
bench1-file17.xml `3,538,931` bytes	`0.18s` `963`	`1.38s` `97`	`1.31s` `5311`	`1.12s` `8391`
bench1-file18.xml `7,077,875` bytes	`0.36s` `1926`	`2.74s` `193`	`2.58s` `10283`	`2.24s` `14431`
bench1-file19.xml `14,155,763` bytes	`0.73s` `3850`	`5.52s` `385`	`5.52s` `16381`	`4.61s` `20846`
bench1-file20.xml `28,311,539` bytes	`1.43s` `7698`	`11.04s` `770`	`11.17s` `26513`	`9.64s` `32411`

All numbers in the above table are the medians of three consecutive runs after two warm-up runs. The first number is the user time, in seconds. The number underneath is the number of page reclaims, as reported by getrusage(). The number of page reclaims is related (albeit not directly) to the memory requirements of a process. The user time and the number of page reclaims reflect the resources used only by the parser. Resources used during the application start up and loading of the document in memory are excluded from the reported counts.

The timings are precise: they were reported by getrusage(), using a microsecond virtual clock. The system time is insignificant. For example, the system type for the benchmark string-value-ssax-ss on the file bench1-file20.xml is 0.22 seconds, which is about 2% of the total time. All benchmarks were run on a FreeBSD 4.6.2-RELEASE system, Pentium IV 2GHz CPU, 1GB RAM. The numbers above reflect activities that occur entirely in memory. There is no i/o whatsoever -- neither application i/o nor VM-related i/o. There were no page faults. The large page-reclaim count for Scheme programs is most likely due to initial heap allocations. After a threshold is reached, the heap growth significantly slows down: note a sublinear increase in page-reclaim counts for Scheme benchmarks when processing large XML files.

We were using Expat library version 1.95.5, installed as a FreeBSD package. Several other components of FreeBSD depend on that library. We compiled the C code with gcc 2.95.3, option -O2. We used a Bigloo 2.4b Scheme system, which was itself compiled with the flags -O3 -fomit-frame-pointer -mcpu=i686. The complete set of options and other compilation details is enumerated in Makefile [SSAX-benchmarks].

The most striking is the comparison of columns string-value-by-one and string-value-ssax-ss: A large, practically relevant Scheme program, an XML parser, runs measurably faster than an equivalent well-written C program! Some XML performance benchmarks [Parser-benchmark] show that even the fastest Java XML parser (XP) is slower than Expat by an order of magnitude; Perl and Python parsers that are based on Expat [sic!] are slower than Expat by a factor of 20-25 (for large files). Thus the SSAX parser seems quite competitive in performance. Many thanks to Manuel Serrano for his excellent optimizing compiler, Bigloo. I must also stress that SSAX is a pure functional parser. The whole parser and all of its handlers are completely referentially transparent.

Why SSAX?

Despite some similarities between SSAX and Expat (which came as a surprise to myself), SSAX is not a Scheme clone of Expat. SSAX is intended to be a toolkit for various markup-related tasks. One example -- parsing of (potentially invalid) HTML -- has been demonstrated recently [HTML-parsing]. It was so easy to write an HTML parser/lexer with SSAX. The major difference between SSAX and Expat however is that SSAX has a better interface.

SSAX has a better API

The application interface of SSAX has several advantages over Expat API. SSAX minimizes the amount of application-specific state that has to be shared among user-supplied event handlers. In particular, SSAX makes the maintenance of an element stack unnecessary, which eliminates several classes of common bugs. SSAX is written in a pure-functional subset of Scheme. Therefore, the event handlers are referentially transparent, which makes them easier for a programmer to write and to reason about. The more expressive, reliable and easier to use application interface for the event-driven XML parsing is the outcome of implementing the parsing engine as an enhanced tree fold combinator, which fully captures the control pattern of the depth-first tree traversal.

A tutorial article about Expat [Expat-tutorial] explains well how Expat is supposed to be used. A user application most certainly has to have

a good stack mechanism in order to keep track of current context... The things you're likely to want to keep on a stack are the currently opened element and it's attributes. You push this information onto the stack in the start handler and you pop it off in the end handler.

Compare that description with the following SSAX application, which converts an XML document to a simplified (for the sake of clarity) SXML:

     (define (simple-XML->SXML port)
       (reverse
        ((SSAX:make-parser
          NEW-LEVEL-SEED
          (lambda (elem-gi attributes namespaces expected-content seed)
          '())
     
          FINISH-ELEMENT
          (lambda (elem-gi attributes namespaces parent-seed seed)
            (cons (cons elem-gi (reverse seed)) parent-seed))
     
          CHAR-DATA-HANDLER
          (lambda (string1 string2 seed)
            (if (string-null? string2) (cons string1 seed)
                (cons* string2 string1 seed)))
     
         )
         port '())))

As you see, simple-XML->SXML does not need any stack to keep track of the current context. There is no need to maintain the stack across several callbacks and to check for underflows and other errors. The SSAX callbacks hardly do anything at all. The simpler the handlers are, the easier it is to write them and to reason about them.

Answers to comments

Choice of XML trees as input files for the benchmark

XML representation of trees has many deeply- nested elements.Processing such documents exercises the parser engine. In contrast, marked-up Shakespeare's plays (another popular choice for benchmark input files) are comprised mostly of character data. They do not exert the element parsing engine.

Is the benchmark fair to Expat?

The most striking result is that a Scheme application is only 1.4 times slower than a comparable well-written C application.
Yes, but only if you slow down the C application by a factor of 7 first.

Whatever overhead is imposed on string-value-by-one.c compared to string-value.c, it applies equally to string-value-ssax-ss. The benchmark does not subject Expat to more overhead than it is imposed on SSAX. Therefore, the comparisons of two parsers is fair and meaningful.

SSAX could have been written like Expat, to accept chunks of arbitrary size. SSAX could have read the whole document into memory and parse it from there. As the benchmark shows, parsing from memory and avoiding any string copies increases the performance by a factor of 7. Yet I did not feel that solution satisfactory. I do parse XML documents from pipes; oftentimes the documents are generated on the fly so I do not know what the final size of the document will be. In that situation, I can not load the document into memory without "pre-parsing" of the input stream first. Furthermore, SSAX can handle documents that are simply too big to load into memory.

We should remark that if the Expat-like input mode is required by a particular application, it can be achieved in SSAX. Furthermore, we can turn SSAX into an XML Pull parser with the help of a general iteration inversion procedure, as described in [SSAX-pull].

Is SSAX really referentially transparent?

It is emphasized in SSAX title comments and elsewhere that the input port is treated as a linear, read-once parameter. This technique is similar to that of Clean. I/O in Clean is referentially transparent.

Because the port is treated in SSAX as a read-once parameter,

     (let ((result (foo port))) ...)

is equivalent to

     (let-values (((result new-port) (foo* port))) ... )

and can be mechanically transformed into the latter if needed. Transformations of that kind cannot generally be performed on code that uses assignments indiscriminately (which is not referentially transparent, that is).

In short, SSAX is as referentially transparent as Clean is -- which has been proven to be.

What about performance of Expat if it were to wrapped in Scheme?

Just for the sake of argument, if one were to wrap Expat into Scheme primitive objects, and invoke Expat from an interpreter, how do you think it will perform?

If you invoke Expat from a Scheme interpreter, you should expect the the performance similar to that of Perl or Python wrappers around Expat [Parser-benchmark].

The wrappers start off well for small documents; but as the document size increases to around 3 MB, the wrapped Expat becomes 20-30 times slower than the original Expat. The reason is the cost of invoking of Perl/Python callbacks from the Expat code and the related cost of translating strings and other data across the language barrier.

References

[Expat-tutorial] Clark Cooper. Using Expat. September 01, 1999.
< http://www.xml.com/pub/a/1999/09/expat/index.html>

[SSAX] SSAX and SXML at SourceForge.
< http://ssax.sourceforge.net/>

[SSAX-benchmarks] SSAX and Expat benchmarks.
< http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/ssax/SSAX/benchmarks/>

[Parser-benchmark] Clark Cooper. Summary of XML Parser Performance Testing. May 05, 1999.
< http://www.xml.com/pub/a/Benchmark/exec.html>

[HTML-parsing] Permissive parsing of perhaps invalid HTML

[SSAX-pull] XML pull parsing and SSAX

[Expat-vs-SSAX thread] Expat vs. SSAX, C vs. Scheme (Bigloo)Two articles posted on a newsgroup comp.lang.scheme on Thu, 1 Nov 2001 13:41:22 -0800 and Mon, 5 Nov 2001 13:40:16 -0800.

Last updated May 9, 2003

This site's top page is http://okmij.org/ftp/

oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!

Converted from SXML by SXML->HTML