From oleg@pobox.com Tue Feb 27 18:23:30 2001
Date-Sent: Tue, 27 Feb 2001 18:23:28 -0800 (PST)
From: oleg@pobox.com
Message-Id: <200102280223.SAA28284@adric.cs.nps.navy.mil>
Date: 28 Feb 2001 03:07:49 +0100
Newsgroups: comp.lang.scheme
To: comp.lang.scheme@mailgate.org
References: <3A9DA003@MailAndNews.com>
Subject: HTTP fetching
Reply-to: oleg@pobox.com
Errors-to: oleg@pobox.com
Status: OR
Description: a quick comparison of Scheme and Perl regarding parsing
  of input: quick-and-dirty removal of HTML tags and comments

The problem:

> I have the following short Perl script:
> #!/usr/bin/perl -w
> use LWP::UserAgent;
> use HTTP::Request;
> use strict;
> sub get_url {
>   my $url = shift or die "Need URL!";
>   my $ua = new LWP::UserAgent;
>   my $req = new HTTP::Request 'GET', $url;
>   $ua->timeout(30);
>   my $src = $ua->request($req);
>   return $src->content unless (!$src->is_success);
>   return 0;
> }
>
> sub dehtmlize {
>   my $src = shift or die "Need HTML!";
>   $src =~ s/<.*?>//g;
>   $src =~ s|||g;
>   print $src . "\n";
> }
>
> dehtmlize(get_url('http://www.scheme.org'));
> This script removes HTML (imperfectly, but close enough) off of the URL you
> pass to it. My question is how would I, using any Scheme environment you'd
> like (as long as it's available for free), write the equivalent of this?

Here's the code that does what you want. The Scheme code has a notable
advantage over the Perl code in its ability to dehtmlize the input
stream "on the fly". The Perl code you quoted loads the entire text of
the document to process into memory (string $src->content). Any Perl
code must do that, because you can't apply regular expression
substitutions to a port: an input port generally does not allow
"backtracking" by an arbitrary amount. The Scheme code works with a
strictly sequential input port; it needs as much memory as the longest
token -- which is far smaller than the whole document.
The Perl code does four passes through the whole input (one to read it
all into memory; two others do regular expression substitutions
specified in dehtmlize; and yet another pass writes out the result).
The Scheme code accomplishes all the work in only one pass. Its size is
comparable with that of the Perl code. The code also spells out the
get-url method.

Tested under Gambit 3.0.

; port contains possibly an HTML text
; This function removes everything within < and >
; (and the brackets themselves).
; The code removes comments as well (which may
; contain nested tags).
; This is the first approximation to converting HTML to plain text.
; For a more advanced example, use a SSAX parser, which actually parses
; the stream: SSAX.scm

(define (dehtmlize port)
  (let loop ()
    (let ((token (next-token '() '(#\< *eof*) "opening tag" port)))
      (display token)
      (and (eq? #\< (read-char port))
        (or
          (and (eq? #\! (peek-char port)) ; check for possible comment
            (eq? #\- (peek-next-char port))
            (eq? #\- (peek-next-char port))
            (find-string-from-port? "-->" port)
            (loop))
          (and
            (find-string-from-port? ">" port)
            (loop)))))))

(include "myenv.scm")

; Given a URL, fetch it using the GET method. Return a port from
; which to read the reply.
; If http_proxy env variable is set, we will
; use that proxy
                        ; Gambit 3.0
(define (get-url url)
  (define (do-fetch schema dummy host resource)
    (let* ((proxy (OS:getenv "http_proxy"))
           (target-host (or proxy host))
           (http-port (##open-input-output-file
                        (string-append "tcp://" target-host
                          (if (string-index target-host #\:) "" ":80")))))
      (for-each
        (lambda (str) (display str http-port))
        `("GET " ,@(if proxy (list url) (list "/" resource)) " HTTP/1.0\r\n"
          "Host: " ,host "\r\n"
          "User-agent: Scheming-puppy/1.1\r\n"
          "\r\n"    ; Empty line finishes the request
          ))
      (flush-output http-port)
      (cerr "\nrequest sent...\n")
      (let ((ret-code (begin (read http-port) (read http-port))))
        (case ret-code
          ((200)                ; now skip the headers
           (let loop ()
             ; read-text-line handles CR, LF, and CRLF as
             ; line terminators
             (let ((header (read-text-line http-port)))
               (cerr "Header: " header nl)
               (cond
                 ((eof-object? header)
                  (error "unexpected EOF while reading the headers"))
                 ((equal? "" header) ; read the empty line after the headers
                  http-port)
                 (else (loop))))))
          (else
            (error "Bad return code: " ret-code))))
      ))
  (apply do-fetch (string-split url '(#\/) 4)))

(dehtmlize (get-url "http://www.schemers.org/"))

The input parsing primitives are explained in
        http://pobox.com/~oleg/ftp/Scheme/parsing.html
find-string-from-port? is also a part of SLIB.

The only far-flung extension is dealing with "tcp://" file names. As a
matter of fact, a Scheme system takes such a string as a regular file
name; the magic happens thanks to an "extended" version of open(2).
        http://pobox.com/~oleg/ftp/syscall-interpose.html
It is possible to imbue any language system or application with such
extended file opening powers. No recompilation is necessary (and in
some cases, no relinking is required either, courtesy of LD_PRELOAD).
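
For readers without myenv.scm at hand, the one-pass idea behind dehtmlize can be sketched with nothing but read-char and peek-char. This is only an illustration, not the code from the post: the names strip-tags, skip-tag, and skip-comment and the explicit output-port argument are hypothetical, and the sketch approximates, rather than reproduces, what next-token and find-string-from-port? do.

```scheme
; Consume input through the closing ">" of an ordinary tag.
; Returns #t if ">" was found, #f on end of input.
(define (skip-tag in)
  (let loop ()
    (let ((c (read-char in)))
      (cond
        ((eof-object? c) #f)
        ((char=? c #\>) #t)
        (else (loop))))))

; Consume input through the closing "-->" of a comment.
; Counting consecutive dashes avoids any backtracking on the port.
(define (skip-comment in)
  (let loop ((dashes 0))
    (let ((c (read-char in)))
      (cond
        ((eof-object? c) #f)
        ((char=? c #\-) (loop (+ dashes 1)))
        ((and (char=? c #\>) (>= dashes 2)) #t)
        (else (loop 0))))))

; Copy characters from in to out, dropping tags and comments.
; Only one character of lookahead (peek-char) is ever needed.
(define (strip-tags in out)
  (let loop ()
    (let ((c (read-char in)))
      (cond
        ((eof-object? c) 'done)
        ((not (char=? c #\<)) (write-char c out) (loop))
        ; after "<": is this the start of a "<!--" comment?
        ((and (eqv? #\! (peek-char in))
              (begin (read-char in) (eqv? #\- (peek-char in)))
              (begin (read-char in) (eqv? #\- (peek-char in))))
         (read-char in)                     ; consume the second dash
         (and (skip-comment in) (loop)))
        ; otherwise treat it as an ordinary tag
        (else (and (skip-tag in) (loop)))))))
```

Assuming get-url as defined above, it could be invoked as (strip-tags (get-url "http://www.schemers.org/") (current-output-port)). Like dehtmlize, it stops silently if a tag or comment is never closed.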