From ssax-sxml-admin@lists.sourceforge.net Tue Aug 23 08:50:42 2005 To: ssax-sxml@lists.sourceforge.net In-Reply-To: <548334CFD2F7594FAB7B013CAA81128304C2203F@ex-mail-sea-02.ant.amazon.com> Message-ID: <20050823084749.D7CA0AC5E@Adric.metnet.navy.mil> Subject: [ssax-sxml] Stream-wise processing (differencing) of two SXML documents List-Archive: X-Original-Date: Tue, 23 Aug 2005 01:47:49 -0700 (PDT) Date: Tue, 23 Aug 2005 01:47:49 -0700 (PDT) Status: OR Hello! It is often easier to compare trees if we can handle them as streams, so we can walk the trees in lock-step or at different paces. The good example is the famous `same fringe' problem (deciding if two trees have the same fringe, i.e., the sequence of leaves). This message describes a not-so-trivial example of detecting the differences in character content within the corresponding paragraphs of two SXML documents. To simplify the matters, we assume that both documents have the same number of paragraphs. When comparing the paragraphs, we disregard the changes in the markup (e.g., added emphasis) and the arrangement of character content in strings. For example, the following two paragraphs are to be considered the same: (p "This" " " "is a " (strong "simple") " example!") (p "This is a simple " (br) (em "example") "!") We also disregard enclosure of the paragraphs within section, chapters, etc. As the result, we return SXML documents in which the changed paragraphs are annotated (enclosed in the `changed' element). Thus we highlight not only the traversal but also the functional update of SXML documents. All in all, this message emphasizes processing data in (overlaid) streams. The complete code for this example is committed to the SSAX-SXML CVS repository, as SSAX/examples/streams-diff.scm. It should be available through ViewCVS shortly. The simplest way to turn a tree or any other enumerable data structure into a stream is through a zipper. See [1-2] for the references to Huet's zipper. Our zipper is different however -- it and the conversion procedure do not depend on the data structure. They only depend on the _interface_ of the enumerator. Our zipper is not a derivative (in the calculus sense) of a data structure; rather, it is the derivative of an enumerator. Besides walking the data structure in the ``natural order'' (that is, in the order it is being enumerated), we can update the nodes and so return a changed data structure. This is a pure functional update: the original tree is intact so we can trivially ``roll-back'' our update. If the enumerator is written to preserve sharing (see [3]) the updated data structures shares as much of its content with the original data structure as possible. We can also bookmark certain points in the traversal and return to them repeatedly. For the present message, we assume the following enumerator interface: ENUM :: TRAVERSAL-FN -> TREE The enumerator takes the traversal function TRAVERSAL-FN :: NODE -> NODE applies it to the data structure in question, and returns (perhaps changed) data structure. The traversal function receives the current node and returns a replacement node (which may be the same as the input one). Our zipper is merely a record with two elements, the current node and the procedure to advance the zipper: (define-record-type *zipper* (make-zipper z-curr-node z-k) zipper? (z-curr-node curr-node) (z-k next)) The procedure that turns the enumerator into zipper -- ``differentiates'' the enumerator -- is a very simple one ; enum->stream :: ENUM -> ZIPPER or TREE (define (enum->stream enum) (reset (enum (lambda (node) (shift f (make-zipper node f)))))) It relies on delimited continuation operators shift and reset [4]. This should not be surprising as Huet's zipper is essentially a delimited continuation represented as a data structure. The pre-post-order SXML traversal operator is a good candidate enumerator to turn into a stream. A better candidate actually is pre-post-order defined in SSAX/examples/sxml-to-sxml.scm (because it traverses in the left-to-right deterministic order). Our first example therefore is to number sections of an SXML document, sequentially. Given the document (define doc1 '(*TOP* (Section (@) (p "par 1") (Section (@ (class "subsection")) (p "par 11"))) (Section (@) (p "par 2")))) we wish to obtain (*TOP* (section (@ (counter 1)) (p "par 1") (section (@ (counter 2) (class "subsection")) (p "par 11"))) (section (@ (counter 3)) (p "par 2"))) that is, add an attribute `counter' to each section. Here's the code that accomplishes this (define (number-sections doc) (let loop ((counter 1) (cursor (enum->stream (lambda (traversal-fn) (pre-post-order doc `((*text* . ,(lambda (tag str) str)) ; basically, identity (*default* . ,(lambda x x)) ; ditto (Section ((@ *preorder* ; Attr node within a section . ,(lambda (tag . elems) (cons tag (traversal-fn elems))))) . ,(lambda x x)))))))) ; otherwise, id (if (zipper? cursor) (loop (+ 1 counter) ((next cursor) `((counter ,counter) . ,(curr-node cursor)))) cursor ; not a zipper, it's the result ))) It appears that pre-post-order well-complements the streaming. In the stylesheet, we define the nodes of our interest, and the zipper takes care of the rest. In the above example, the stream is made of the attribute lists of all Section nodes. With preorder, we can move not only sequentially, but also up, down, left and right and can both add and remove nodes. See the messages [2,3] for more detail. Now, the main example. Suppose we have a document built of paragraphs, elements `p'. Paragraphs may be enclosed in other elements, e.g., sections and chapters. The paragraphs have mixed content: they contain text and additional markup, for example, `em', font annotations, etc. Suppose we have an edited document: no paragraph was removed and no new paragraph added. However, the chapters and sections may be rearranged. The text within the paragraphs may be edited, and the markup may be introduced or removed. We wish to find out paragraphs whose character content was changed, and annotate them -- enclose in the `changed' element. When comparing paragraphs, we want to disregard the changes in the markup and the arrangement of character content in strings. Here's the example of an SXML document and the edited one: (define docl '(*top* (Section 1 (p "This" " " "is a " (strong "simple") " example!") (p "Second paragraph.") (p (@ (id "1")) "Third paragraph.") ))) (define docr '(*top* (Section 1 (p "This is a simple " (br) (em "example") "!") (Section 11 (p "2nd" " paragraph.")) (Section 12 (p "Third paragraph" ".")) ))) Here are the two documents after annotation: (*TOP* (section 1 (p "This" " " "is a " (strong "simple") " example!") (changed (p "Second paragraph.")) (p (@ (id "1")) "Third paragraph."))) (*TOP* (section 1 (p "This is a simple " (br) (em "example") "!") (section 11 (changed (p "2nd" " paragraph."))) (section 12 (p "Third paragraph" ".")))) The function diff-cc, defined in streams-diff.scm, converts the documents into streams of streams of characters, which are quite easily compared. Incidentally, no strings are copied or duplicated. [1] http://pobox.com/~oleg/ftp/Scheme/zipper-in-scheme.txt [2] http://www.haskell.org/pipermail/haskell/2005-April/015769.html [3] http://www.haskell.org/pipermail/haskell/2005-May/015844.html [4] See SSAX/lib/shift-reset.scm for the Danvy/Filinski's implementations of shift/reset, which should work on any R5RS Scheme system. See Martin Gasbichler and Michael Sperber, Final Shift for Call/cc: Direct Implementation of Shift and Reset ICFP'02: ACM SIGPLAN Notices, 37, N9, pp. 271-282. for an excellent explanation of the Danvy/Filinski code.