From www@deja.com Sun Nov 28 15:08:20 1999
Message-Id: <81sctg$dpf$1@nnrp1.deja.com>
From: oleg@pobox.com
Subject: A simple XML _lexer_ in Scheme
Date: Sun, 28 Nov 1999 23:12:50 GMT
Reply-To: oleg@pobox.com
Keywords: lexical analysis, parsing, XML, SAX, Scheme
Newsgroups: comp.lang.scheme,comp.text.xml
Summary: Announcing a framework to parse XML documents

This is to announce a small XML _lexer_ in Scheme. It breaks an XML document into lexical "units": character data and markup. The package implements a set of low- to medium-level parsers for various productions defined in the XML Recommendation.

This lexer is therefore neither a Simple API for XML (SAX) parser nor a Document Object Model (DOM) parser. Rather, it is a framework, a set of "Lego blocks" one can use to build a SAX or a DOM parser -- or a specialized lightweight parser for a particular document type. Incidentally, this article shows a snippet from such a specialized parser, for Weather Observation Format (OMF XML) documents.

The present XML lexer has the "sequential" feel of SAX yet the "functional style" of DOM. Like a SAX parser, the lexer scans the document only once and permits incremental processing. An application that handles document elements in order can run as efficiently as possible. _Unlike_ a SAX parser, the XML lexer does not require that an application register stateful callbacks and surrender control to the parser. Rather, it is the application that drives the lexer, calling the lexer's functions to obtain the current lexical or syntactic element. These functions do not maintain or mutate any state save the input port. Therefore, the lexer permits parsing of XML in a pure functional style, with the input port being a monad (or a linear, read-once parameter).

The XML lexer is a collection of functions that can be partitioned into several levels.
Lower-level scanners deal with such XML productions as a Name, a markup token, a CDATA section, a character reference, Attributes*, and an ExternalID. Higher-level functions -- SSAX:read-root-element, SSAX:read-content-norm-ws, SSAX:read-ANY-body-norm-ws -- parse more general productions describing a document's prolog, its elements, and their content. These functions expand character references and normalize whitespace.

At the very bottom of the hierarchy are primitive token parsers: next-token, next-token-of, skip-while, skip-until, and find-string-from-port?. They comprise an input parsing library; these generic lexers can help parse any kind of input stream. The library was built in a minimalist spirit: I tried to distill the most powerful of the most frequently used lexical scanning functions, while keeping their number low enough that I could remember them all. The XML lexer seems to be proof that the library is capable enough.

Parsing XML is slightly more complex than it appears. For example, consider the 'Attribute' production:

  [41] Attribute ::= Name Eq AttValue
  [10] AttValue  ::= '"' ([^<&"] | Reference)* '"'
                   | "'" ([^<&'] | Reference)* "'"
  [25] Eq        ::= S? '=' S?

(The numbers in brackets refer to productions in the XML Recommendation, http://www.w3.org/TR/1998/REC-xml-19980210.html.) This grammar seems very regular and trivial to follow. However, one has to keep in mind the "normalization rules", formulated in the XML Recommendation in plain English:

  "Before the value of an attribute is passed to the application or
  checked for validity, the XML processor must normalize it as follows:
   - a character reference is processed by appending the referenced
     character to the attribute value
   - an entity reference is processed by recursively processing the
     replacement text of the entity [named entities amp lt gt quot apos
     are assumed pre-declared]
   - a whitespace character (#x20, #xD, #xA, #x9) is processed by
     appending #x20 to the normalized value, except ...
  [elided]"

Suddenly, reading attribute values becomes a recursive endeavor, requiring one to handle external or internal entities (which in turn may refer to other external, internal, or character entities).

The parsing library and the XML lexers do not use regular expressions. This is deliberate. For one thing, Olin Shivers's excellent regexp package has not become a final SRFI yet, let alone become universally supported. Another reason is efficiency. Granted, one can seemingly parse XML documents easily with (especially Perl-style) regular expressions; I have done that myself. The trouble is that these expressions are more powerful than their name suggests: for example, they may involve backtracking. Therefore one cannot apply Perl-style regexps _directly_ to a strictly sequential input stream; one has to read the stream into a buffer first, and thus has to make assumptions about the maximum size of an element (to properly set a low-water mark for the input buffer) -- or be prepared to deal with spill-overs. The present XML lexer and the input parsing library were specifically designed to read the input port strictly sequentially, without any backtracking.

Developing the XML lexer gave me an immense appreciation for XML and the XML committee, who designed the language so that it is easy to parse with a simple one-character look-ahead. This appreciation and insight have made all the trouble worthwhile.

References:
  http://pobox.com/~oleg/ftp/Scheme/SSAX.scm

The code has built-in regression tests, which aim to validate all the major functions. I do not parse the DTD yet, as the documents I handle are valid by construction (provided they are well-formed -- that is, provided a generator finished successfully).
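The whitespace clause of the normalization rule quoted above is easy to state on its own. Here is a minimal sketch in standard Scheme -- not part of the package -- that implements just that one clause, eliding character references, entity references, and the "except ..." case:

  ; A sketch of the attribute-whitespace normalization rule quoted
  ; above: each whitespace character (#x20, #xD, #xA, #x9) contributes
  ; a #x20 to the normalized value.  Character and entity references
  ; are elided.
  (define (normalize-attr-ws str)
    (list->string
      (map (lambda (c)
             (if (memv c '(#\space #\return #\newline #\tab))
                 #\space
                 c))
           (string->list str))))

  ; e.g. (normalize-attr-ws (string #\a #\newline #\b)) => "a b"

The real work, of course, is in the elided clauses: entity references force the recursive processing described above, which is why the lexer handles normalization during scanning rather than as a string post-pass.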
Parsing library:
  http://pobox.com/~oleg/ftp/Scheme/parsing.html

I used the lexer to parse OMF documents and store the information back into a database (thus reversing the work of a Metcast server):
  http://zowie.metnet.navy.mil/~spawar/JMV-TNG/XML/OMF.html

The following is a top-level function in that application, which appears rather generic:

(call-with-input-file omf-file-name
  (lambda (port)
    (let ((root-element (SSAX:read-root-element 'OMF port)))
      (let loop ((counter 0)
                 (token (SSAX:read-content-norm-ws port)))
        (cond
          ((not (xml-token? token))
           (esc-file "Error: OMF-PARSE expected a START or END tag,"
                     "rather than '" token "'\n"))
          ((eq? 'START (xml-token-kind token))
           (cond
             ((assq (xml-token-head token) OMF-PARSE:OMF-ITEM-handlers) =>
              (lambda (handler-assoc)
                ((cdr handler-assoc) token port)
                (loop (++ counter) (SSAX:read-content-norm-ws port))))
             (else
              (esc-file "Error: Unrecognized element: " token
                        "\nin the content of the root element"))))
          (else
           (SSAX:assert-token token 'END (xml-token-head root-element))
           (cerr "\nProcessed elements: " counter nl)))))))
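The loop above dispatches on OMF-PARSE:OMF-ITEM-handlers, an association list from element names to handler procedures. The following is a hedged sketch of its shape only -- the element name SYN and the handler body are hypothetical illustrations, not the application's actual table:

  ; A hypothetical handler table in the shape the dispatch above
  ; expects: an assoc list keyed by element name.  Each handler
  ; receives the START token and the port, and is responsible for
  ; consuming the element's content through its matching END tag.
  (define (handle-SYN start-token port)
    ; Illustrative stub only: a real handler would read the element's
    ; content from PORT and store the decoded report in the database.
    (display "seen element: ")
    (display (xml-token-head start-token))
    (newline))

  (define OMF-PARSE:OMF-ITEM-handlers
    (list (cons 'SYN handle-SYN)))

Because the application, not the parser, owns this table and the driving loop, adding support for a new element type is a matter of consing another pair onto the list.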