From ssax-sxml-admin@lists.sourceforge.net Tue Aug 23 08:50:42 2005
To: ssax-sxml@lists.sourceforge.net
In-Reply-To: <548334CFD2F7594FAB7B013CAA81128304C2203F@ex-mail-sea-02.ant.amazon.com>
Message-ID: <20050823084749.D7CA0AC5E@Adric.metnet.navy.mil>
Subject: [ssax-sxml] Stream-wise processing (differencing) of two SXML documents
List-Archive: <http://sourceforge.net/mailarchive/forum.php?forum=ssax-sxml>
X-Original-Date: Tue, 23 Aug 2005 01:47:49 -0700 (PDT)
Date: Tue, 23 Aug 2005 01:47:49 -0700 (PDT)
Status: OR


Hello!

It is often easier to compare trees if we can handle them as streams,
so we can walk the trees in lock-step or at different paces. The good
example is the famous `same fringe' problem (deciding if two trees
have the same fringe, i.e., the sequence of leaves).

This message describes a not-so-trivial example of detecting the
differences in character content within the corresponding paragraphs
of two SXML documents. To simplify the matters, we assume that both
documents have the same number of paragraphs. When comparing the
paragraphs, we disregard the changes in the markup (e.g., added
emphasis) and the arrangement of character content in strings. For
example, the following two paragraphs are to be considered the same:

    (p "This" " " "is a " (strong "simple") " example!")
    (p "This is a simple " (br) (em "example") "!")

We also disregard enclosure of the paragraphs within section,
chapters, etc. As the result, we return SXML documents in which the
changed paragraphs are annotated (enclosed in the `changed'
element). Thus we highlight not only the traversal but also the
functional update of SXML documents. All in all, this message
emphasizes processing data in (overlaid) streams.

The complete code for this example is committed to the SSAX-SXML
CVS repository, as SSAX/examples/streams-diff.scm. It should be
available through ViewCVS shortly.

The simplest way to turn a tree or any other enumerable data structure
into a stream is through a zipper. See [1-2] for the references to
Huet's zipper. Our zipper is different however -- it and the
conversion procedure do not depend on the data structure. They only
depend on the _interface_ of the enumerator. Our zipper is not a
derivative (in the calculus sense) of a data structure; rather, it is
the derivative of an enumerator.

Besides walking the data structure in the ``natural order'' (that is,
in the order it is being enumerated), we can update the nodes and so
return a changed data structure. This is a pure functional update: the
original tree is intact so we can trivially ``roll-back'' our
update. If the enumerator is written to preserve sharing (see [3]) the
updated data structures shares as much of its content with the
original data structure as possible. We can also bookmark certain
points in the traversal and return to them repeatedly.


For the present message, we assume the following enumerator interface:
      ENUM :: TRAVERSAL-FN -> TREE
The enumerator takes the traversal function
     TRAVERSAL-FN :: NODE -> NODE
applies it to the data structure in question, and returns (perhaps
changed) data structure. The traversal function receives the current
node and returns a replacement node (which may be the same as the
input one).

Our zipper is merely a record with two elements, the current node and
the procedure to advance the zipper:

(define-record-type *zipper*
      (make-zipper z-curr-node z-k)
      zipper?
      (z-curr-node curr-node)
      (z-k next))

The procedure that turns the enumerator into zipper --
``differentiates'' the enumerator -- is a very simple one

; enum->stream :: ENUM -> ZIPPER or TREE
(define (enum->stream enum)
  (reset (enum (lambda (node) (shift f (make-zipper node f))))))

It relies on delimited continuation operators shift and reset
[4]. This should not be surprising as Huet's zipper is essentially a
delimited continuation represented as a data structure.


The pre-post-order SXML traversal operator is a good candidate
enumerator to turn into a stream. A better candidate actually is
pre-post-order defined in SSAX/examples/sxml-to-sxml.scm (because it
traverses in the left-to-right deterministic order).

Our first example therefore is to number sections of an SXML document,
sequentially. Given the document

(define doc1
  '(*TOP*
     (Section (@)
       (p "par 1")
       (Section (@ (class "subsection"))
	 (p "par 11")))
     (Section (@) (p "par 2"))))

we wish to obtain

(*TOP*
  (section
    (@ (counter 1))
    (p "par 1")
    (section
      (@ (counter 2) (class "subsection"))
      (p "par 11")))
  (section (@ (counter 3)) (p "par 2")))

that is, add an attribute `counter' to each section. 

Here's the code that accomplishes this

(define (number-sections doc)
  (let loop
    ((counter 1)
     (cursor
       (enum->stream
	 (lambda (traversal-fn)
	   (pre-post-order doc
	     `((*text* . ,(lambda (tag str) str)) ; basically, identity
		(*default* . ,(lambda x x))       ; ditto
		(Section
		  ((@ *preorder* 	; Attr node within a section
		     . ,(lambda (tag . elems) 
			  (cons tag (traversal-fn elems)))))
		  . ,(lambda x x))))))))          ; otherwise, id
    (if (zipper? cursor)
      (loop (+ 1 counter)
	((next cursor) `((counter  ,counter) . ,(curr-node cursor))))
      cursor				; not a zipper, it's the result
      )))

It appears that pre-post-order well-complements the streaming. In the
stylesheet, we define the nodes of our interest, and the zipper takes
care of the rest. In the above example, the stream is made of the
attribute lists of all Section nodes. 

With preorder, we can move not only sequentially, but also up, down,
left and right and can both add and remove nodes. See the messages
[2,3] for more detail.


Now, the main example. Suppose we have a document built of paragraphs,
elements `p'. Paragraphs may be enclosed in other elements, e.g.,
sections and chapters. The paragraphs have mixed content: they contain
text and additional markup, for example, `em', font annotations, etc.
Suppose we have an edited document: no paragraph was removed and no
new paragraph added. However, the chapters and sections may be
rearranged. The text within the paragraphs may be edited, and the
markup may be introduced or removed. We wish to find out paragraphs
whose character content was changed, and annotate them -- enclose in
the `changed' element. When comparing paragraphs, we want to disregard
the changes in the markup and the arrangement of character content in
strings.


Here's the example of an SXML document and the edited one:

(define docl
  '(*top*
     (Section 1
       (p "This" " " "is a " (strong "simple") " example!")
       (p "Second paragraph.")
       (p (@ (id "1")) "Third paragraph.")
       )))

(define docr
  '(*top*
     (Section 1
       (p "This is a simple " (br) (em "example") "!")
       (Section 11 (p "2nd" " paragraph."))
       (Section 12 (p "Third paragraph" "."))
       )))


Here are the two documents after annotation:

(*TOP*
  (section
    1
    (p "This" " " "is a " (strong "simple") " example!")
    (changed (p "Second paragraph."))
    (p (@ (id "1")) "Third paragraph.")))
(*TOP*
  (section
    1
    (p "This is a simple " (br) (em "example") "!")
    (section 11 (changed (p "2nd" " paragraph.")))
    (section 12 (p "Third paragraph" "."))))


The function diff-cc, defined in streams-diff.scm, converts the
documents into streams of streams of characters, which are quite
easily compared. Incidentally, no strings are copied or duplicated.


[1] http://pobox.com/~oleg/ftp/Scheme/zipper-in-scheme.txt
[2] http://www.haskell.org/pipermail/haskell/2005-April/015769.html
[3] http://www.haskell.org/pipermail/haskell/2005-May/015844.html

[4] See SSAX/lib/shift-reset.scm for the Danvy/Filinski's implementations
of shift/reset, which should work on any R5RS Scheme system.
See Martin Gasbichler and Michael Sperber,
    Final Shift for Call/cc: Direct Implementation of Shift and Reset
    ICFP'02:  ACM SIGPLAN Notices, 37, N9, pp. 271-282.
for an excellent explanation of the Danvy/Filinski code.