From www@deja.com Sun Nov 28 15:08:20 1999
Message-Id: <81sctg$dpf$1@nnrp1.deja.com>
From: oleg@pobox.com
Subject: A simple XML _lexer_ in Scheme
Date: Sun, 28 Nov 1999 23:12:50 GMT
Reply-To: oleg@pobox.com
Keywords: lexical analysis, parsing, XML, SAX, Scheme
Newsgroups: comp.lang.scheme,comp.text.xml
Summary: Announcing a framework to parse XML documents

This is to announce a small XML _lexer_ in Scheme. It breaks an XML document into lexical "units": character data and markup. The package implements a set of low- to medium-level parsers for various productions defined in the XML Recommendation.

This lexer is therefore neither a Simple API for XML (SAX) parser nor a Document Object Model (DOM) parser. Rather, it is a framework, a set of "Lego blocks" one can use to build a SAX or a DOM parser -- or a specialized lightweight parser for a particular document type. Incidentally, this article shows a snippet from such a specialized parser, for Weather Observation Format (OMF XML) documents.

The present XML lexer has the "sequential" feel of SAX yet the "functional style" of DOM. Like a SAX parser, the lexer scans the document only once and permits incremental processing. An application that handles document elements in order can run as efficiently as possible. _Unlike_ a SAX parser, the XML lexer does not require that an application register stateful callbacks and surrender control to the parser. Rather, it is the application that drives the lexer, calling the lexer's functions to obtain the current lexical or syntactic element. These functions do not maintain or mutate any state save the input port. Therefore, the lexer permits parsing of XML in a pure functional style, with the input port being a monad (or a linear, read-once parameter).

The XML lexer is a collection of functions that can be partitioned into several levels.
Lower-level scanners deal with such XML productions as a Name, a markup token, a CDATA section, a character reference, Attributes*, and an ExternalID. Higher-level functions -- SSAX:read-root-element, SSAX:read-content-norm-ws, SSAX:read-ANY-body-norm-ws -- parse more general productions describing a document's prolog, its elements, and their content. These functions expand character references and normalize whitespace.

At the very bottom of the hierarchy are primitive token parsers: next-token, next-token-of, skip-while, skip-until, and find-string-from-port?. They comprise an input parsing library; these generic lexers can help parse any kind of input stream. The library was built in a minimalist spirit: I tried to distill the most powerful of the most frequently used lexical scanning functions, while keeping their number low enough that I could remember them all. The XML lexer seems to be proof that the library is capable enough.

Parsing XML is slightly more complex than it appears. For example, consider the 'Attribute' production:

  [41] Attribute ::= Name Eq AttValue
  [10] AttValue  ::= '"' ([^<&"] | Reference)* '"'
                   | "'" ([^<&'] | Reference)* "'"
  [25] Eq        ::= S? '=' S?

(The numbers in brackets refer to productions in the XML Recommendation, http://www.w3.org/TR/1998/REC-xml-19980210.html.) This grammar seems very regular and trivial to follow. However, one has to keep in mind the "normalization rules", formulated in the XML Recommendation in plain English:

  "Before the value of an attribute is passed to the application or
  checked for validity, the XML processor must normalize it as follows:
   - a character reference is processed by appending the referenced
     character to the attribute value
   - an entity reference is processed by recursively processing the
     replacement text of the entity [named entities amp lt gt quot apos
     are assumed pre-declared]
   - a whitespace character (#x20, #xD, #xA, #x9) is processed by
     appending #x20 to the normalized value, except ...
  [elided]"

Suddenly, reading attribute values becomes a recursive endeavor, requiring one to handle external or internal entities (which in turn may refer to other external, internal, or character entities).

The parsing library and the XML lexers do not use regular expressions. This is deliberate. For one thing, Olin Shivers's excellent regexp package has not become a final SRFI yet, let alone become universally supported. Another reason is efficiency. Granted, one can seemingly parse XML documents easily with (especially Perl-style) regular expressions; I have done that myself. The trouble is that these expressions are more powerful than their name suggests: for example, they may involve backtracking. Therefore one cannot apply Perl-style regexps _directly_ to a strictly sequential input stream; one has to read the stream into a buffer first, and thus has to make assumptions about the maximum size of an element (to properly set a low-water mark for the input buffer) -- or be prepared to deal with spill-overs. The present XML lexer and the input parsing library were specifically designed to read the input port strictly sequentially, without any backtracking.

Developing the XML lexer gave me an immense appreciation for XML and the XML committee, who designed the language so that it is easy to parse with a simple one-character look-ahead. This appreciation and insight have made all the trouble worthwhile.

References:
  http://pobox.com/~oleg/ftp/Scheme/SSAX.scm

The code has built-in regression tests, which aim to validate all the major functions. I do not parse the DTD yet, as the documents I handle are valid by construction (provided they are well-formed -- that is, provided a generator finished successfully).
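The whitespace clause of the normalization rule quoted above is easy to state on its own. Here is a minimal sketch in standard Scheme -- not part of the package -- that implements just that one clause, eliding character references, entity references, and the "except ..." case:

  ; A sketch of the attribute-whitespace normalization rule quoted
  ; above: each whitespace character (#x20, #xD, #xA, #x9) contributes
  ; a #x20 to the normalized value.  Character and entity references
  ; are elided.
  (define (normalize-attr-ws str)
    (list->string
      (map (lambda (c)
             (if (memv c '(#\space #\return #\newline #\tab))
                 #\space
                 c))
           (string->list str))))

  ; e.g. (normalize-attr-ws (string #\a #\newline #\b)) => "a b"

The real work, of course, is in the elided clauses: entity references force the recursive processing described above, which is why the lexer handles normalization during scanning rather than as a string post-pass.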
Parsing library:
  http://pobox.com/~oleg/ftp/Scheme/parsing.html

I used the lexer to parse OMF documents and store the information back into a database (thus reversing the work of a Metcast server):
  http://zowie.metnet.navy.mil/~spawar/JMV-TNG/XML/OMF.html

The following is a top-level function in that application, which appears rather generic:

(call-with-input-file omf-file-name
  (lambda (port)
    (let ((root-element (SSAX:read-root-element 'OMF port)))
      (let loop ((counter 0)
                 (token (SSAX:read-content-norm-ws port)))
        (cond
          ((not (xml-token? token))
           (esc-file "Error: OMF-PARSE expected a START or END tag,"
                     "rather than '" token "'\n"))
          ((eq? 'START (xml-token-kind token))
           (cond
             ((assq (xml-token-head token) OMF-PARSE:OMF-ITEM-handlers) =>
              (lambda (handler-assoc)
                ((cdr handler-assoc) token port)
                (loop (++ counter) (SSAX:read-content-norm-ws port))))
             (else
              (esc-file "Error: Unrecognized element: " token
                        "\nin the content of the root element"))))
          (else
           (SSAX:assert-token token 'END (xml-token-head root-element))
           (cerr "\nProcessed elements: " counter nl)))))))
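The loop above dispatches on OMF-PARSE:OMF-ITEM-handlers, an association list from element names to handler procedures. The following is a hedged sketch of its shape only -- the element name SYN and the handler body are hypothetical illustrations, not the application's actual table:

  ; A hypothetical handler table in the shape the dispatch above
  ; expects: an assoc list keyed by element name.  Each handler
  ; receives the START token and the port, and is responsible for
  ; consuming the element's content through its matching END tag.
  (define (handle-SYN start-token port)
    ; Illustrative stub only: a real handler would read the element's
    ; content from PORT and store the decoded report in the database.
    (display "seen element: ")
    (display (xml-token-head start-token))
    (newline))

  (define OMF-PARSE:OMF-ITEM-handlers
    (list (cons 'SYN handle-SYN)))

Because the application, not the parser, owns this table and the driving loop, adding support for a new element type is a matter of consing another pair onto the list.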