The following simple functions surprisingly often suffice when parsing an input stream. They either skip, or build and return tokens according to the inclusion or delimiter semantics. The list of characters to expect, include, or to break at may vary from one invocation of a function to another. This feature allows the functions to easily parse even context-sensitive languages.
EOF is generally frowned upon, and thrown up
upon if encountered. Still, sometimes the EOF "symbol"
can be expected: it may appear in a list of break-characters, coded as a
The input stream to parse is specified as a
PORT, which is usually the last (and optional) argument. It defaults to the
current input port if omitted.
This package relies on a function parser-error, which must be defined by a user of the package. The function has the following signature:
parser-error PORT MESSAGE SPECIALISING-MSG*
Many procedures of this package call the
report a parsing error. The first argument is a port, which typically
points to the offending character or its neighborhood. Most of the
Scheme systems let the user query a PORT for the current
position. MESSAGE is the description of the error. Other arguments
supply more details about the problem.
QUERY_STRING[in a separate document]
find-string-from-port? STR IN-PORT [MAX-NO-CHARS]
STRwithin the first
MAX-NO-CHARSchars of the input port
MAX-NO-CHARSmay be omitted: in that case, the search span would be limited only by the end of the input stream
STRwas not found
find-string-from-port?is a part of SLIB. Therefore, it is available on all platforms supported by SLIB.
The commented source code.
assert-curr-char CHAR-LIST STRING [PORT]
PORTand looks it up in the
CHAR-LISTof expected characters. If the read character was found among the expected, it is returned. Otherwise, the procedure writes a nasty message using
STRINGas a comment, and quits.
skip-until CHAR-LIST [PORT]
PORTuntil one of the break characters is encountered. This break character is returned. The break characters are specified as the
CHAR-LIST. This list may include
EOF, which is to be coded as a symbol
skip-until NUMBER [PORT]
NUMBERof characters from the
skip-while CHAR-LIST [PORT]
PORTto the first character that is not a member of the
CHAR-LIST-- or till the EOF, whichever occurs sooner. This character or the EOF object is returned. This character is left on the stream.
PORTand peeks at it. This function is useful when parsing LR(1)-type languages (one-char-read-ahead).
The commented source code.
A thorough verification code.
next-token PREFIX-CHAR-LIST BREAK-CHAR-LIST [COMMENT-STRING] [PORT]
PREFIX-CHAR-LIST), if any, and reads a sequence of characters up to (but not including) a break character, one of the
BREAK-CHAR-LIST. The string of characters thus read is returned. The break character is left on the input stream.
The list of break characters may include
EOF, which is to be coded as a symbol
EOF is fatal, generating an error message including a specified
COMMENT-STRING (if any).
As we cannot tell offhand the final size of the token we are reading, we make a guess, pre-allocate a string, and grow it by quanta if necessary. The quantum is always the length of the token buffer before it was extended the last time. This amounts to a Fibonacci-type extension, which has been shown optimal.
An internal procedure
determines the initial buffer allocation policy. By default, the
policy is to reuse a statically-allocated buffer. For Scheme systems
without preemptive threads or shared substrings, this policy proved
Another implementation of the same procedure is
next-token-list-based, which accumulates token characters
in a list rather than placing them into an extendible string buffer.
In reality, the list version works just as fast as the string buffer
version above, yet the list version allocates 50% more memory and has
to run garbage collection 50% as many times (Gambit-C 3.0).
; Example: a snippet from CGI:url-unquote (let ((read-primitive-token (lambda () (if (eof-object? (peek-char)) "" (next-token '() '(#\= #\+ #\& #\% *eof*) "URL-unquoting"))))) ) ...) ; Parsing of a comma-delimited list (let loop () (cons (next-token '(#\space) '(#\, *eof*) "" port) (if (eof-object? (read-char port)) '() (loop))))
next-token-of INC-CHARSET [PORT]
PORTthat belong to the list of characters
INC-CHARSET. The reading stops at the first character which is not a member of the set. This character is left on the stream. All the read characters are returned in a string.
next-token-of PRED [PORT]
PRED(a procedure of one argument) returns non-
#f. The reading stops at the first character for which
#f. That character is left on the stream. All the results of evaluating of
#fare returned in a string.
PRED is a procedure that takes one argument (a character or the
EOF object) and returns a character or
#f. The returned character does not have to be the same as the input argument to the
PRED. For example,
(next-token-of (lambda (c) (cond ((eof-object? c) #f) ((char-alphabetic? c) (char-downcase c)) (else #f))))will try to read an alphabetic token from the current input port, and return it in lower case.
next-token. It reads one line of text from the
PORT, and returns it as a string. A line is a (possibly empty) sequence of characters terminated by
LF(or even the end of file).
CRLFcombination) is removed from the input stream. The terminating character(s) is not a part of the return string either.
EOFis encountered before any character is read, the return value is EOF.
read-line. It is renamed to avoid conflicts with many Scheme implementations, which define a similar but not identical function.
read-string N [PORT]
Ncharacters from the
PORT, and returns them in a string. If
EOFis encountered before
Ncharacters are read, a shorter string will be returned.
Nis not positive, an empty string will be returned.
The commented source code.
A thorough verification code.
patin a file
(define (count-occurrences pat filename) (call-with-input-file filename (lambda (port) (let loop ((count 0)) (if (find-string-from-port? pat port) (loop (+ 1 count)) count)))))
or, more succinctly,
(define (count-occurrences-do pat filename) (call-with-input-file filename (lambda (port) (do ((count 0 (+ 1 count))) ((not (find-string-from-port? pat port)) count)))))
If a file
ab abra abracadabra abracadabra abracaDabracadabra abracadabraabracadabrathen
(count-occurrences "abracadabra" "/tmp/a.t") (count-occurrences-do "abracadabra" "/tmp/a.t")both return 5.
(string-length pat)characters in memory, no matter how big the input file is.
Given an input port that (possibly) contains an HTML or XML text, the function copies the text without the "tags" to the
current-output-port. To be more precise, the function omits from copying anything between
>, and the brackets themselves. Comments,
<!-- ... -->, are disregarded as well. Note, the comments may contain nested tags.
(define (dehtmlize port) (let loop () (let ((token (next-token '() '(#\< *eof*) "opening tag" port))) (display token) (and (eq? #\< (read-char port)) (or (and (eq? #\! (peek-char port)) ; check for possible comment (eq? #\- (peek-next-char port)) (eq? #\- (peek-next-char port)) (find-string-from-port? "-->" port) (loop)) (and (find-string-from-port? ">" port) (loop)))))))
The article referenced below embeds the dehtmlize function in a complete program that fetches a URL and prints out the corresponding HTML document without tags and comments.
The article also compares dehtmlize to the corresponding Perl code. The Scheme code has a notable advantage over the Perl code in its ability to dehtmlize the input stream "on the fly". The Perl code loads the entire text of the input document into memory. Any Perl code must do that as you cannot apply regular expression substitutions to a port as an input port generally does not allow "backtracking" by an arbitrary amount. The Scheme code works with a strictly sequential input port; it needs as much memory as the longest token -- which is far smaller than the whole document. The Perl code does four passes through the whole input (one to read it all into memory; two others do regular expression substitutions specified in dehtmlize; and yet another pass writes out the result). The Scheme code accomplishes all the work in only one pass. The size of the Scheme code is comparable with that of the Perl code.
A quick comparison of Scheme and Perl regarding
parsing of input [plain text file]
The article itself. It was originally posted as HTTP fetching on a newsgroup
comp.lang.schemeon 28 Feb 2001 03:07:49 +0100.
read-all, which takes an input port and returns the string of all characters read from the port till the end-of-file. The port can be opened on a file, terminal or a communication channel.
(define (read-all port) (let loop ((accum '())) (let ((c (read-char port))) (if (eof-object? c) (apply string (reverse accum)) (loop (cons c accum))))))This method is clumsy and slow: all the characters read from the port have to be scanned once more during the reversal of the list, and one more time when building a string. Furthermore, applying a function (in our case,
string) to a large argument list is inefficient. Some Scheme systems even limit the number of arguments that can be passed to a function.
(define (read-all port) (let ((quantum 8192)) (let loop ((chunks-rev '())) (let ((chunk (read-string quantum port))) (if (or (eof-object? chunk) (zero? (string-length chunk))) (string-concatenate-reverse chunks-rev) (loop (cons chunk chunks-rev)))))))where
read-stringis one of the "standard" functions implemented on many systems. It is also a part of the parsing library above. The function
string-concatenate-reverseis in SRFI-13. For completeness, here is the code of that function:
; See SRFI-13 for details (define (string-concatenate-reverse string-list) (let* ((final-size ; total the length of strings in string-list (let loop ((size 0) (lst string-list)) (if (pair? lst) (loop (+ size (string-length (car lst))) (cdr lst)) size))) (final-str (make-string final-size)) (dest-i (- final-size 1))) (let outer ((lst string-list)) (if (pair? lst) (let ((str (car lst))) (do ((i (- (string-length str) 1) (- i 1))) ((negative? i) (outer (cdr lst))) (string-set! final-str dest-i (string-ref str i)) (set! dest-i (- dest-i 1)))) final-str))))
(define (read-all port) (next-token '() '(*eof*) "reading string" port))This version of
read-allruns 4 times faster in elapsed time, 3 times faster in CPU time and allocates 3 times less memory, compared with the chunked read procedure above. The benchmark was run by a Gambit-C 3.0 interpreter, on a Pentium III Xeon 500 MHz computer with 128 MB of main memory and FreeBSD 4.0-RELEASE operating system.
(define (read-all port) (read-string (get-upper-bound-size-estimate) port))This procedure allocates no garbage and is twice as fast as the previous one (the one that used next-token).
port-copy, if it is available:
(define (read-all port) (call-with-output-string (lambda (o-port) (port-copy port o-port))))This version is about 20% slower than the
read-string-based implementation above. However this version does not require any estimate for the size of the port.
Re: Nicest and most efficient way to (read-all port)=>string posted on a newsgroup
comp.lang.scheme on Fri, 2 Nov 2001 11:31:14 -0800
Ports and the enhanced i/o
aa(AA(BB)CC)aaaa((DD)EE)aaa(FF(GG))aaa(HH)aaawe should produce the list of strings
(AA(BB)CC) (BB) (DD) ((DD)EE) (FF(GG)) (GG) (HH)
A solution is given below. It has been tested on Gambit-C 3.0, Bigloo 2.4b and SCM 5d6. We assume that myenv-bigloo.scm or a similar platform-specific prelude and input-parse.scm have been loaded.
portand adds the list of found parenthesized strings to
accum. The latter is the list of paren-delim strings seen so far. Initially,
accumshould be an empty list.
(define (parse-nested-strs port accum) (if (eof-object? (skip-until '(#\( *eof*) port)) accum ; done reading, no more open paren encountered ; we have seen and read the opening paren (let ((accum (read-accum-within-paren port '("(") accum))) (parse-nested-strs port accum))))
read-accum-within-parenis a helper. We invoke it when we saw an opening parenthesis. The function reads
portuntil the corresponding closing parenthesis and returns the updated
accum. If there are nested paren-delimited strings, they will be added to
accumas well. At the top of the returned accum is the longest paren-delimited string. The argument
fragmentsis the list of seen text fragments, in the reverse order.
(define (read-accum-within-paren port fragments accum) (let* ((token (next-token '() '(#\( #\) ) "reading body" port)) (delim (read-char port)) ; what we've seen after the token (fragments (cons token fragments)) ; update the list of fragments ) (case delim ((#\) ) ; we've seen the matching closing paren; ; return what we've got (cons (apply string-append (reverse (cons ")" fragments))) accum) ) (else ; we've seen an opening paren: recurse (let ((accum (read-accum-within-paren port '("(") accum))) (read-accum-within-paren port (cons (car accum) fragments) accum))))))As a test, the following code:
(write (call-with-input-string "aa(AA(B(X())(Y)B)CC)aaaa((DD)EE)aaa(FF(GG))aaa(HH)aaa" (lambda (port) (parse-nested-strs port '()))))prints the following expected result:
("(HH)" "(FF(GG))" "(GG)" "((DD)EE)" "(DD)" "(AA(B(X())(Y)B)CC)" "(B(X())(Y)B)" "(Y)" "(X())" "()")
Re: Parsing of nested strings, help needed posted on a newsgroup
comp.lang.scheme on 1 Jul 2002 19:17:48 -0700
This problem, posed on comp.lang.scheme in August 2003, was to
read (floating point) numbers from a file and return them as a list.
The file may contain comments, which are marked by the character
#\# in the first position, and span through the end of
the line. Files of that structure are often used as configuration
files for numeric code. Given the sample file
/tmp/a.txt of the following content
# comment # another line of comment 1.23 1.223e12 1.e0 2.3e23 # comment again 2.344 -1.223e23 2.2334 +1.23e0 # #the expression
(call-with-input-file "/tmp/a.txt" read-numbers)should yield
(1.23 1.223e12 1.0 2.3e23 2.344 -1.223e23 2.2334 1.23)
read-numbers is given below. It is
portable and should work on any R5RS system. The code makes no
assumptions about the distribution of the numbers or comments. The
code scans the input
exactly once. There is no need to read
a string and scan it over again.
(define (read-numbers port) ; we're at the beginning of the line. Check the first char for ; #\# to see if it starts a comment (define (state-1 port accum) (let ((c (peek-char port))) (if (eof-object? c) (finish accum) (case c ((#\#) ; it was a comment -- skip to the end of the line (skip-until '(#\newline *eof*) port) (state-1 port accum)) ; recur (else (state-2 port accum)))))) ; it was probably a number ; we're sure we're not at the comment. Read the number. ; keep track if we're at the end of the line or not (define (state-2 port accum) (let ((c (skip-while '(#\space) port))) (if (or (eof-object? c) (char=? c #\newline)) (begin (read-char port) (state-1 port accum)) (let ((token (next-token-of (lambda (c) ; recognizer for valid tokens (cond ((eof-object? c) #f) ((string-index "0123456789eE+-." c) c) (else #f))) port))) (if (equal? token "") (error "bad number") (state-2 port (cons (string->number token) accum))))))) (define (finish accum) (reverse accum)) ; Get the ball rolling (state-1 port '()))
Re: Reading a file in Scheme posted on a newsgroup
comp.lang.scheme on Mon, 25 Aug 2003 12:10:50 -0700
The problem is to read numbers, strings and tokens from a
file of comma-separated-values. The input port should contain
comma-separated tokens, interspersed with the arbitrary amount of
white space. The tokens are: numbers (that is, anything parseable by
number->string), strings and anything else. The input
file may also contain `missing' tokens, that is, two consecutive
commas or the trailing comma, within the arbitrary whitespace. Such
missing tokens are considered `symbols' whose string representation is
the empty string. The string token in the input stream should be a
R5RS string datum: the string should be enclosed in double quotes and
may contain (escaped) backslashes, escaped quotes, and other escaped
characters as permitted by R5RS or the particular Scheme
nan-interpis either a procedure
string-or-symbol -> valor an associative list with string or symbolic keys. The function returns the list of read values: a numeric token is converted to a number; strings and anything else are passed to the
nan-interpfunction and its result is used in the list of return
nan-interpis an associative list, we do the lookup in the list then.
(write (call-with-input-string " 1.223, \"str\" , \"complex \\\\ string, indeed\\\"\" , 2.23 , nil" (lambda (port) (read-csv (lambda (x) x) port)))) ;==> (1.223 "str" "complex \\ string, indeed\"" 2.23 nil)
(display (call-with-input-string "-1,+2,-15e-15,, nan, 22.3334, 2.23 , , \"nil\"," (lambda (port) (read-csv `((,(string->symbol "") . 0.0) (nan . -999) ("nil" . -1.0)) port)))) ;==> (-1 2 -1.5e-14 0.0 -999 22.3334 2.23 0.0 -1.0 0.0)The latter example exhibits `skipped' tokens: two commas in a row and the trailing comma. The skipped tokens are represented by 0.0 in the result list, as specified in the associative list in the first argument to
comp.lang.schemeon Tue, 3 Feb 2004 16:40:14 -0800
This site's top page is http://okmij.org/ftp/
Converted from SXML by SXML->HTML