XML and Scheme

 

A micro-talk presentation at a Workshop on Scheme and Functional Programming 2000
Montreal, 17 September 2000

 

This talk will propose consistent or conformant Scheme implementations of W3C Recommendations: XML Infoset, XPath query language and a small subset of XSL Transformations.
 

XML semantics

The following is the view of XML semantics as set out by the W3 Consortium:

At the top of the hierarchy is an XML information set, Infoset: an abstract data set that describes information available in a well-formed XML document. Infoset's goal is to present in some form all relevant pieces of data and their abstract, container-slot relationships to each other. The implementation of this nest of containers as well as means of accessing items and properties are beyond the scope of the Infoset specification. XML document, with tags in familiar angular brackets is one of the concrete instances of an XML Infoset. Conversion is through parsing/unparsing.

The XML Path Language, XPath, makes the tree structure of XML Infoset explicit, and more practical. For example, XPath groups character information items into strings. Still the tree model of XPath is conceptual only, and does not mandate any particular implementation. XPath is a query language over an XPath tree. A Document Object Model (DOM) is another specialization of Infoset. It not only describes a tree view of a document but also makes the tree real to an application, via a set of interfaces to navigate the tree and query or update its nodes. XML Stylesheet Language Transformations, XSLT, build upon the XPath tree to describe transformations from branches and leaves of one tree onto another.

This talk will show a conformant instance of this mostly abstract hierarchy with Scheme as an implementation language. Scheme represents the XML Infoset and the XPath tree -- which are data structures. Scheme is also used to express XPath queries and tree transformations.
 

Sample XML and SXML

This is a micro-talk, and therefore, will flash a few simple examples. The reference section points to more detailed documents.

The following is a sample XML document to be used throughout this talk.

  <Forecasts TStamp='958082142'>

  <TAF TStamp='958066200' LatLon='36.583, -121.850' 
    BId='724915' SName='KMRY, MONTEREY PENINSULA'>

  <VALID TRange='958068000, 958154400'>111730Z 111818</VALID>

  <PERIOD TRange='958068000, 958078800'>
   <PREVAILING>31010KT P6SM FEW030</PREVAILING></PERIOD>

  <PERIOD TRange='958078800, 958104000' Title='FM2100'>
   <PREVAILING>29016KT P6SM FEW040</PREVAILING></PERIOD>

  <PERIOD TRange='958104000, 958154400' Title='FM0400'>
   <PREVAILING>29010KT P6SM SCT200</PREVAILING>
   <VAR Title='BECMG 0708' TRange='958114800, 958118400'>
      VRB05KT</VAR>

  </PERIOD></TAF> 
  </Forecasts> 

Here is its representation in SXML:

  (Forecasts (@ (TStamp  "958082142"))
   (TAF (@ (SName  "KMRY, MONTEREY PENINSULA")(BId  "724915")
           (LatLon  "36.583, -121.850")(TStamp  "958066200"))
    (VALID (@ (TRange  "958068000, 958154400"))
        "111730Z 111818")

    (PERIOD (@ (TRange  "958068000, 958078800")) 
       (PREVAILING  "31010KT P6SM FEW030"))

    (PERIOD (@ (Title  "FM2100")
               (TRange  "958078800, 958104000")) 
       (PREVAILING  "29016KT P6SM FEW040"))

    (PERIOD (@ (Title  "FM0400")
               (TRange  "958104000, 958154400")) 
       (PREVAILING  "29010KT P6SM SCT200")
       (VAR (@ (TRange  "958114800, 958118400") 
               (Title  "BECMG 0708")) 
          "VRB05KT"))))
This SXML data structure is produced by four lines of code -- an XML parser:
  (call-with-input-string doc-string 
    (lambda (port) 
      (SSAX:element->SXML  
        (SSAX:read-content-norm-ws port) port)))

The SXML form appears a bit more concise. For a document without external parsed entities and within a single namespace, SXML conforms to the XML Information Set specification.
 

XPath and SXPath

Let's consider querying of the above tree. The SXML data structure above (assumed to be bound to an identifier tree1 throughout this talk) contains three PERIOD elements. In the first query, we want to retrieve the body of the PREVAILING element from the second PERIOD.

Here is the query expressed in XPath, abbreviated notation:
    //PERIOD[2]/PREVAILING/text()

The following line shows the query in SXPath (evaluated by a sxpath interpreter)
    ((sxpath '(// (PERIOD 2) PREVAILING *text*)) tree1)

The block of code below is an SXML query in the full notation. It is an executable Scheme code.

  ((node-reduce
      (node-closure (node-typeof? 'PERIOD)) 
      (node-pos 2)
      (select-kids (node-typeof? 'PREVAILING)) 
      (select-kids (node-typeof? '*text*))
   ) tree1) 
The result of each of the queries is the same: a nodeset ("29016KT P6SM FEW040")
 

In the second example, we ask for a TRange attribute of those PERIODS that also have a Title attribute. Of the three PERIODS (see SXML above), only the last two have the Title attribute. Again, we compare XPath with SXPath:

  //PERIOD[@Title]/@TRange 
  ((sxpath '(// (PERIOD (@ Title)) @ TRange))  tree1)
In both cases, the result is the same nodeset:
((TRange "958078800, 958104000") (TRange "958104000, 958154400"))
 

Finally, we are asking for a Title attribute of such a PERIOD whose time range includes the time moment of 958104010 epoch seconds.

  ((sxpath `(// (PERIOD (@ TRange *text*
     ,(lambda (trange)
        (with-input-from-string trange 
          (lambda ()
            (let*
              ((tbegin (read)) (delim (read-char))
               (tend (read)))
             (or (< tbegin 958104010 tend) '()) ))) )
       )) @ Title)) tree1)
==>
   ((Title "FM0400"))
An SXPath expression may contain an arbitrary Scheme code as a filter. The let* block parses the TRange attribute and makes sure the range includes the given time moment. The corresponding expression in XPath would be more complex (requiring escape into JavaScript).

Thus, abbreviated SXPath expressions look identical to XPath expressions modulo parentheses and path separators.
 

Unparsing

This piece of code translates SXML back to its markup form. These few lines can handle every SXML tree: the code is the most generic unparser. See the References for the complete source code.

  (SRV:send-reply 
    (post-order tree1
     `((@ ((*default* ; local override for attrs
            . ,(lambda (attr-key value)
                 ((enattr attr-key) value))))
        . ,(lambda (trigger value)
             (list '@ value))) 
       (*default* . ,(lambda (tag elems)
                      (apply (entag tag) elems)))
       (*text* . ,(lambda (trigger str) str))
  )) )

 

SXSLT

Finally this slide shows several examples of SXML transformations: re-writing from one tree to another. In this particular example, we re-write the sample SXML tree into a flat tree that contains TRange attributes of every PERIOD element. The first fragment is in XML Stylesheet Language Transformations (XSLT):

  <xsl:template match="/"> 
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="PERIOD/@TRange">
    <xsl:value-of/>
  </xsl:template>

The next piece of code traverses the tree and applies handlers in post-order:

  (post-order tree `(
    (PERIOD
       ((TRange
          ((*text* . ,id))
          . ,id))
        . ,id) 
    (*text* . ,(lambda (trigger value) '())) 
    (*default* . ,id)
  ))
Most of the handlers are just identity transformers of a node or a branch. The handler (lambda (trigger value) '()) suppresses all the text content by default. However, this default is overridden in the context were a TRange node was seen during traversal of a PERIOD branch.

The last fragment accomplishes the same task by traversing the SXML tree in pre-order. In this case we don't need to specify the whole bunch of identity transformers. The engine does not descend into branches unless necessary, that is, unless a handler is specified for a node or a branch. In contrast, in the post-order example above, the entire tree is always traversed once. Note how SXSLT looks similar to XSLT.

  (apply-templates tree1 
    `((PERIOD @ TRange *text* 
        . ,(lambda (node) node))))
==>
     ("958068000, 958078800" "958078800, 958104000" 
      "958104000, 958154400" "958114800, 958118400") 

It all works. I have tested all the examples of this talk: I ran them and pasted the results from an emacs *scheme* window into the presentation. Details are available in the References section below.

Conclusions

We have shown an s-expression-style for XML, XPath and XSLT. SXML, SXPath and SXSLT however are not merely s-expressions: they can denote executable Scheme code and can be evaluated by an eval function as they are. Thus an XML document and operations on it can be expressed in Scheme -- and regarded either as data structures or as code.

 


References

[XML] World Wide Web Consortium. Extensible Markup Language (XML)
Version 1.0. W3C Recommendation 10 February 1998

[XML Infoset] World Wide Web Consortium. XML Information Set.
W3C Working Draft.

[XPath] World Wide Web Consortium. XML Path Language (XPath).
Version 1.0. W3C Recommendation 16 November 1999.

[XSLT] World Wide Web Consortium. XSL Transformations (XSLT).
Version 1.0. W3C Recommendation 16 November 1999.

[SXML] SXML is an instance of XML Infoset as S-expressions. SXML is an Abstract Syntax Tree of an XML document.
SXML.html

[SXPath]

[SXmanip] Evaluating SXML: XSL Transformations in Scheme


Last updated March 4, 2001

This site's top page is http://okmij.org/ftp/

oleg-at-okmij.org
Your comments, problem reports, questions are very welcome!