On parent pointers in SXML trees

* Abstract

In this article we discuss and compare five different methods of
determining the parent of an SXML node. The existence of these methods
is a crucial step in a constructive proof that SXML is a complete
model of the XML Information set, and SXPath is a complete
implementation of the XPath Recommendation. Four of the five
techniques of determining the parenthood require no mutations, and can
be implemented in a pure and strict functional language. Three of the
five techniques rely on annotations attached to SXML nodes. The
techniques differ in the nature of the annotations and in the time
complexity of attaching and using these annotations.  The complete
source code of four techniques is included.

* Definitions of terms

See
	http://pobox.com/~oleg/ftp/Scheme/xml.html
for the definitions of SXML, SXPath, and other terms used in this
article.


* Introduction

An XML processing application may need to locate a parent, an
ancestor, or a sibling of the current XML element. The XPath
Recommendation specifies "parent::", "ancestor::" and various sibling
axes for that purpose. Correspondingly, an item in an XPath data
structure, or in the XML Information set in general, has a "parent"
property. That property is an upward link from a child to its
parent. S-expressions then seem lacking in that respect: S-expressions
represent directed trees and trees cannot have upwards pointers. It
may appear therefore that S-expressions cannot properly model the XML
information set. One might conclude that SXML -- an abstract syntax
tree of an XML document in the form of S-expressions -- is deficient
and the SXML query language (SXPath) cannot fully implement the XPath
Recommendation.

In this article we consider three major approaches for locating the
parent of an SXML node. One approach is to search the SXML tree for a
parent, the second approach relies on SXML trees decorated with some
kind of parent "pointers". The third approach is based on a query
re-writing.  Each approach entails a trade-off among these parameters:
space and time complexities of adding annotations to SXML nodes; speed
of locating the parent; space leaks and undesired sharing; complexity
of serialization and de-serialization of the annotated SXML;
generality; functional purity.

The approach of annotating the SXML tree with parent "pointers" is
particularly tricky. Parent pointers, if taken literally, turn a tree
into a doubly connected graph. It is far from straightforward to
create cyclical structures in a pure functional way. Furthermore, the
double connectivity is a curse. We can no longer compare two SXML
fragments with 'equal?' or print them with 'display' or 'write'. SXML
datastructures with true parent pointers require extra care, to avoid
inadvertently following the back pointer and thus falling into an
infinite loop. The double connectivity has a more serious problem. If
we take a node from an SXML tree and move it into another tree, the
node still points to its original parent, and to its parent, and
eventually to the root of the whole tree. A serious space leak ensues:
If we reference only a leaf node, the entire datastructure is
reachable and cannot be garbage collected.

We demonstrate three implementations of the annotation-based
approach. One of them turns an SXML structure into a general cyclic
graph, another makes an SXML structure a DAG. The third implementation
adds the parent "pointers" to SXML tree and still keeps the data
structure a tree, with only a single path between any two nodes.

	  - Searching for a parent in a SXML tree
	  - Parent pointers as SXML node annotations
	  - A genuine _tree_ with thunked parent pointers
	  - Building an SXML tree with parent pseudo-pointers
	  - Rewriting an XPath query to avoid backward axes

Only one method requires mutation. All other techniques in this
article are purely functional and can be implemented in a strict
functional language. The dictionary-based implementation seems to
offer the best trade-off. The annotated SXML tree can be created "on
the fly," as we parse the XML document. The annotated SXML can be
easily serialized and reconstructed. The approach avoids space leaks
caused by undesired sharing. If the dictionary of nodes is implemented
using an appropriate data structure such as a hash table, locating the
parent of a node is efficient.


The first four techniques are implemented in the source code file
parent-pointers.scm that accompanies the article. The code has been
checked in the SSAX CVS repository as
	SSAX/examples/parent-pointers.scm
You can also view the code as
 http://cvs.sf.net/cgi-bin/viewcvs.cgi/ssax/SSAX/examples/parent-pointers.scm

The code verifies the four parenthood techniques by printing out
ancestorial chains for a sample XML document. For each element in the
sample document, we print the names of its parent element, its
grandparent element, etc. The function to print ancestorial chains,
'print-ancestors', is generic. It takes a source SXML tree and a
get-parent function. Each of the four techniques implemented in
parent-pointers.scm computes the SXML data structure in its own way
(with its own annotations, if any) and defines its own get-parent
function. For simplicity, the source code parent-pointers.scm elides
XML attributes and namespaces. The full source code of SSAX and
SXPathLib implements parenthood queries without any simplifications.


* Searching for a parent in a SXML tree

The absence of an explicit upward link from a child to a parent in
SXML appears to break the isomorphism between SXML and the XPath data
model. The XPath Recommendation however offers help. The "Data Model"
Section of the Recommendation (Section 5) says that each node in an
XPath tree is unique, and nodes never share children. SXML trees
constructed by the SSAX parser have this property. Since an XPath
expression may include an absolute location path, the evaluation
context of the expression must contain the root node. These two facts
show that the upward link from a child element to its parent is always
(conceptually) present in an SXML tree. To determine the parent of a
SXML node, we search the SXML tree, from the root node down, for a
node that contains the given node among its children:

	parent(x) = { y | y=child*(root), x=child(y) }

The first solution in parent-pointers.scm shows a simple
implementation of this idea. We search an SXML tree in the depth-first
order. We test our implementation by determining and printing
ancestorial chains for each element in a sample XML document. A
function 'node-parent' in SXPath.scm is a "production" implementation
of the same idea, which accounts for XML attributes and namespaces. A
function 'sxp:parent' in Kirill Lisovsky's SXPath-ext relies on a more
efficient breadth-first search for parents.

The present search-based approach has an advantage of requiring no
annotations on SXML nodes. We work with a single linked SXML tree (as
it was produced by ssax:xml->sxml) and incur no storage overhead for
parent pointers. The tree can be easily serialized and reconstructed,
using ordinary 'write' and 'read' Scheme functions. The present
approach trades time for space. If we start from the root, searching
the whole tree for a parent is clearly expensive. On the other hand,
if we know some ancestor of the current node (because we took a note
of the ancestor before the downward traversal), the search space can
be significantly reduced.


* Parent "pointers" as SXML node annotations

The second approach to determining the parenthood takes advantage of
an aux-list offered by the SXML Specification. An aux-list of a SXML
node is a list of ancillary properties -- annotations -- of a node. An
association "(parent parent-pt)" may be one of such annotations. This
approach clearly trades space for speed of locating the parent of a
node. We show three realizations of the annotation approach, which
differ in the meaning of the "pointer".


* Real parent pointers

The first implementation of the annotation approach (Solution 2 of
parent-pointers.scm) decorates SXML nodes with real parent
pointers. Each SXML node except the *TOP* includes an aux-list with an
association `(parent ,ptr-to-parent) where ptr-to-parent is a
reference to the parent of that node.

An SXML data structure with true parent pointers is no longer a
tree. Rather, it is a general, cyclic graph. Care must be taken when
displaying, comparing or processing such a data structure. In
particular, we should use display-circle or SRFI-38 but never a naive
display to output this SXML data structure.

We can construct the annotated SXML as we parse the corresponding XML
document. A function 'xml->true-parent-sxml' in parent-pointers.scm is
such a parser, which is a custom instantiation of SSAX. The time
overhead of adding annotations is therefore negligible. Locating a
parent is also very fast: we merely need to extract the reference to
the parent from the aux-list of a node.

To create real parent pointers as we construct SXML, we must do
destructive updates.  It is not possible to build cyclic data
structures in a strict language without mutations. Because of them,
the parsing code 'xml->true-parent-sxml' is notably trickier.

The advantage of this approach is the speed of locating the parent of
a node. It is a constant-time operation. Adding the annotations is
also easy and can be done at the time of parsing XML. The obvious
disadvantage is the space overhead for storing parent-pointer
annotations in aux-lists. A more serious drawback is double
connectivity of the annotated SXML data structure. We can no longer
compare (by equal?), print or naively traverse SXML fragments lest we
fall into the infinite loop. Furthermore, double connectivity leads to
a serious space leak, as we have discussed earlier.

Kirill Lisovsky's SXPath-ext offers a variation of this approach. His
annotated SXML datastructures wrap parent pointers in thunks. Because
procedural values in Scheme are opaque (and in general are not
examined by 'display' or 'equal?' functions), there is no longer a
danger of falling into the infinite loop when comparing, printing or
traversing such SXML. Thunks however require more space. Furthermore,
the true pointers within a thunk are not hidden from the garbage
collector. If any node of the SXML structure is reachable, the whole
structure is reachable and cannot be garbage collected.

When we write out the annotated SXML, we must obviously exclude the
parent pointer annotations from the serialization process. The
exclusion applies even if the pointers are wrapped in thunks: we
cannot portably serialize closures. The de-serialized SXML will
therefore lack any parent pointer annotations. These annotations have
to be added explicitly in a dedicated traversal pass.


* A genuine _tree_ with thunked parent pointers

The method in this section also relies on a SXML tree whose nodes are
annotated with parent pointers. To be more precise, each node except
the *TOP* should have an aux-list with an association 
`(parent ,thunk-ptr-to-parent)

In contrast to the previous approach, thunk-ptr-to-parent is not the
reference to the parent of the node. Rather, it is a nullary
procedure, a thunk, which, when applied, yields an SXML node that is
the parent of the current node.

There is another difference from the true parent pointers technique.
An SXML tree with thunked parent pointers can be built without any
mutations whatsoever, in a pure functional way. Furthermore, the SXML
datastructure with thunked parent annotations is _truly_ a tree.  We
must incur however a separate tree traversal pass -- which takes an
SXML tree without parent annotations (e.g., the output of the ordinary
ssax:xml->sxml) and returns an SXML tree with annotations.

There is a subtlety in the current approach. Indeed, it sounds
contradictory: how can we add backward pointers to a tree in a pure
functional way and still maintain the tree property (there is only
one path between any two nodes)? If we look carefully at the
annotating function 'add-thunk-parent-pt' in parent-pointer.scm
we note an odd feature: if 'node' is a proper SXML node other than
the *TOP*, then
    (memq node (node-children (get-parent-from-thunk-pt node)))
is actually #f. However
    (member* node (node-children (get-parent-from-thunk-pt node)))
is #t
where member* is the ordinary procedure member that uses a special
equal*? procedure to compare SXML nodes modulo their aux-lists.

The odd feature arises from the fact that the doubly connected graph
with forward and backward node pointers is "virtual". The graph is
computed on demand as a *fixpoint* of an expression (elem-add-parent
original-tree).  Because our parent pointers are thunks, we delay a
fixpoint iteration until it is demanded. We therefore avoid the
divergence of the fixpoint, notwithstanding the strict nature of our
host language. The fact that a node is not a child of its parent (if
we use eq? to compare nodes) -- but it is a child of its parent if we
use equal*? -- should not matter in a pure functional context. Indeed,
the notion of an extensional equality, eq?, does not exist in a pure
functional context, as Henry Baker pointed out. In fact, pure
functional languages like Haskell do not provide the eq? equality
predicate.

The present technique suffers the overhead of storing the thunked
parent pointers. Closures typically take quite a bit of memory. Adding
annotations needs a separate pass over the SXML tree. Determining the
parent of a node is also quite expensive as it requires performing one
fixpoint iteration. One advantage is that despite the thunked parent
pointers, our tree remains a tree at all times. Therefore, we can
safely print it out using 'display' and traverse it, without worrying
about falling into an infinite loop.

The present approach is perhaps more academic than practical. It does
however make us think what a parent pointer is or could be. We thus
come to the dictionary approach, which seems to give the best overall
trade-off.


* Building an SXML tree with parent pseudo-pointers

The present technique also locates the parent of a node with the help
of annotations attached to SXML tree nodes. Each node except the *TOP*
should have an aux-list with an association 
`(parent ,node-id-of-the-parent)

Here 'node-id-of-the-parent' is a pseudo-pointer to the parent of the
current node. A pseudo-pointer is an node-id, a symbol, which is a
key in a dictionary of node-ids. The *TOP* element of our annotated
SXML tree must have an aux-list with an association
     `(node-id-dict ,dictionary-of-node-ids)
where dictionary-of-node-ids is, in the present case, an associative
list of (node-id pointer). The performance of the get-parent procedure
improves if we use a more advanced data structure for the dictionary
of node ids, for example, a balanced tree or a hash table.

We can construct the annotated SXML as we parse the corresponding XML
document. A custom XML parser xml->dict-parent-sxml (an instantiation
of SSAX) in parent-pointer.scm does that. The time overhead of adding
annotations is therefore negligible. To determine the parent of a node,
we extract the id of the parent from the aux-list of the node, and
locate the parent itself through the dictionary-of-node-ids. If the
latter is implemented as a balanced tree or a hash table, the lookup
should be reasonably fast.

In contrast to the true-parent-pointer approach above, parent
pseudo-pointers annotations can be added to an SXML tree in a pure
functional way. Absolutely no mutations to SXML nodes are necessary.
There is also another important difference between the two approaches.
An SXML data structure with true parent pointers is not a tree and is
not a DAG. It contains directed cycles. The double connectivity means
that if any node is reachable, all of the data structure is
reachable. This fact presents a grave problem for the garbage
collector and leads to a significant space leak. In contrast, an SXML
data structure with parent pseudo-pointers is a DAG. Each node but
the *TOP* one is referenced from its parent and from the dictionary of
node-ids. However, there are no directed cycles. If we retain a leaf
of such a tree, the rest of the tree can be garbage collected.

Thus, in the present approach we incur the overhead of parent
pseudo-pointers, which amounts to a list of two symbols per node, plus
the size of the dictionary of nodes. The dictionary of nodes can be
used for other purposes, for example, for resolving XML IDs. In that
case, the dictionary is not entirely an overhead. Adding the
annotations is fast because the annotated tree can be constructed as
we parse an XML document. Locating the parent can be fast, given an
efficient implementation of the dictionary of nodes. As we have noted,
the dictionary can be used for other purposes. The annotated SXML data
structure is a DAG and does not contain directed cycles. Therefore,
the danger of a space leak is avoided. When serializing the annotated
SXML, we should exclude the dictionary of node-ids but can write out
the parent pseudo-pointers annotations as they are. We can use the
ordinary Scheme functions 'write' and 'read' for the body of the
annotated SXML. When we read the serialized SXML back, we merely need
to reconstruct the dictionary of node-ids, which requires one pass
through the tree. Overall, the present approach seems to give the best
overall trade-off.


* Rewriting an XPath query to avoid backward axes

The third approach to implementing parenthood queries in trees is
eliminating such queries. In some contexts, an SXPath expression with
an upward-looking axis such as parent:: or ancestor:: can be
re-written into an SXPath expression with only downward-looking
axes. For example, an XPath expression
	//p[../div]
can be replaced with
	//div/p
Indeed, the former expression selects all 'p' nodes in the document
that have 'div' parents. On the other hand, "//div/p" finds all 'div'
nodes and then extracts and returns their 'p' children. The end result
is the same.

In much more detail, this topic has been researched in a paper

   XPath: Looking Forward
   Dan Olteanu, Holger Meuss, Tim Furche, Fran[E7]ois Bry

   Research Report PMS-FB-2002-4, 2002
   The Electronic Library journal, volume 20, number 4, 275-287, 2002

http://www.pms.informatik.uni-muenchen.de/publikationen/PMS-FB/PMS-FB-2002-4.abs
I am grateful to Je'ro^me Sime'on for pointing out this paper.

The paper introduces two algorithms for transforming (subsets of)
XPath 1.0 expressions containing reverse axes into reverse-axis-free
equivalents.

The query re-writing technique has an advantage of no space overhead
for storing parent "pointer" annotations. This approach can process
SXML data structures that are truly trees. The trees can be easily
serialized and reconstructed, using ordinary 'write' and 'read' Scheme
functions. The approach has the excellent performance: an XPath query
with only forward-looking axes can be efficiently implemented, in a
stream-wise fashion. The disadvantage is that not every query can be
re-written in this way.