From oleg@pobox.com Tue Feb 27 18:23:30 2001
Date-Sent: Tue, 27 Feb 2001 18:23:28 -0800 (PST)
From: oleg@pobox.com
Message-Id: <200102280223.SAA28284@adric.cs.nps.navy.mil>
Date: 28 Feb 2001 03:07:49 +0100
Newsgroups: comp.lang.scheme
To: comp.lang.scheme@mailgate.org
References: <3A9DA003@MailAndNews.com>
Subject: HTTP fetching
Reply-to: oleg@pobox.com
Errors-to: oleg@pobox.com

Description: a quick comparison of Scheme and Perl at parsing input:
quick-and-dirty removal of HTML tags and comments.

The problem:

> I have the following short Perl script:
> #!/usr/bin/perl -w

> use LWP::UserAgent;
> use HTTP::Request;
> use strict;
> sub get_url {
>     my $url = shift or die "Need URL!";
>     my $ua  = new LWP::UserAgent;
>     my $req = new HTTP::Request 'GET', $url;
>     $ua->timeout(30);
>     my $src = $ua->request($req);
>     return $src->content unless (!$src->is_success);
>     return 0;
> }
>
> sub dehtmlize {
>     my $src = shift or die "Need HTML!";
>     $src =~ s/<.*?>//g;
>     $src =~ s|<!--\s+.*?\s+-->||g;
>     print $src . "\n";
> }
>
> dehtmlize(get_url('http://www.scheme.org'));

> This script removes HTML (imperfectly, but close enough) off of the URL you 
> pass to it.  My question is how would I, using any Scheme environment you'd 
> like (as long as it's available for free), write the equivalent of this?

Here's the code that does what you want. The Scheme code has a notable
advantage over the Perl code: it dehtmlizes the input stream "on the
fly". The Perl code you quoted loads the entire text of the document
into memory (the string $src->content). Perl code built on s///
substitutions has to do that: you can't apply a regular expression
substitution to a port, because an input port generally does not allow
"backtracking" by an arbitrary amount. The Scheme code works with a
strictly sequential input port; it needs only as much memory as the
longest token, which is far smaller than the whole document. The Perl
code makes four passes through the whole input (one to read it all into
memory; two more for the regular expression substitutions in dehtmlize;
and yet another to write out the result). The Scheme code does all the
work in a single pass. Its size is comparable to that of the Perl
code, and it also spells out the get-url procedure. Tested under
Gambit 3.0.


; port is an input port that may contain HTML text.
; This function removes everything between < and >
; (and the brackets themselves).
; The code removes comments <!-- ... --> as well (which may
; contain nested tags).
; This is a first approximation to converting HTML to plain text.
; For a more advanced example, use an SSAX parser, which actually parses
; the stream: SSAX.scm

(define (dehtmlize port)
  (let loop ()
    (let ((token (next-token '() '(#\< *eof*) "opening tag" port)))
      (display token)
      (and (eq? #\< (read-char port))
	   (or (and (eq? #\! (peek-char port))	; check for possible comment
		    (eq? #\- (peek-next-char port))
		    (eq? #\- (peek-next-char port))
		    (find-string-from-port? "-->" port)
		    (loop))
	       (and
		(find-string-from-port? ">" port)
		(loop)))))))
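
For readers without next-token and find-string-from-port? at hand, the
same one-pass idea can be sketched with nothing but read-char, peek-char
and write-char. This is only an illustration, not the code tested above:
the names next-is?, comment-start?, take-up-to, skip-past, strip-tags and
strip-tags-string are made up for this sketch, and it assumes a Scheme
with string ports (open-input-string, open-output-string and
get-output-string, as in SRFI 6 or R7RS).

```scheme
; Hypothetical helpers; none of these names come from the library above.

; #t if the next character in the port is ch; nothing is consumed.
(define (next-is? ch in)
  (let ((c (peek-char in)))
    (and (char? c) (char=? c ch))))

; Called right after a #\< has been read. Returns #t if "!--" follows.
; It consumes at most the "!" and the first "-"; the character that
; decided the answer is left in the port.
(define (comment-start? in)
  (and (next-is? #\! in)
       (begin (read-char in) (next-is? #\- in))
       (begin (read-char in) (next-is? #\- in))))

; The first n elements of lst (or all of them, if there are fewer).
(define (take-up-to n lst)
  (if (or (= n 0) (null? lst))
      '()
      (cons (car lst) (take-up-to (- n 1) (cdr lst)))))

; Consume characters up to and including the first occurrence of
; pattern. Returns #t if it was found, #f at end of input.
; Only the last (string-length pattern) characters are remembered,
; so an input such as "--->" still matches the pattern "-->".
(define (skip-past pattern in)
  (let ((rev-pattern (reverse (string->list pattern)))
        (n (string-length pattern)))
    (let loop ((recent '()))            ; most recent character first
      (let ((c (read-char in)))
        (if (eof-object? c)
            #f
            (let ((recent (take-up-to n (cons c recent))))
              (or (equal? recent rev-pattern)
                  (loop recent))))))))

; Copy in to out, dropping <...> tags and <!-- ... --> comments.
(define (strip-tags in out)
  (let loop ()
    (let ((c (read-char in)))
      (cond
        ((eof-object? c) #t)
        ((not (char=? c #\<)) (write-char c out) (loop))
        ((comment-start? in) (and (skip-past "-->" in) (loop)))
        (else (and (skip-past ">" in) (loop)))))))

; Convenience wrapper for trying the sketch on a string.
(define (strip-tags-string str)
  (let ((out (open-output-string)))
    (strip-tags (open-input-string str) out)
    (get-output-string out)))

(display (strip-tags-string
          "<p>Hello, <!-- <b>hidden</b> --><i>world</i>!</p>"))
(newline)
; prints: Hello, world!
```

Like dehtmlize, the sketch never holds more than a few characters of
the input: skip-past remembers only as many characters as the pattern
is long.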


(include "myenv.scm")

; Given a URL, fetch it using the GET method. Return a port from
; which to read the reply. If the http_proxy environment variable is
; set, we use that proxy (its value is taken as host[:port]).

; Gambit 3.0
(define (get-url url)
  (define (do-fetch schema dummy host resource)
    (let* ((proxy (OS:getenv "http_proxy"))
           (target-host (or proxy host))
           (http-port (##open-input-output-file
                       (string-append "tcp://" target-host
                         (if (string-index target-host #\:) "" ":80")))))
      (for-each (lambda (str) (display str http-port))
       `("GET "
         ,@(if proxy (list url) (list "/" resource))
         " HTTP/1.0\r\n"
         "Host: " ,host "\r\n"
         "User-agent: Scheming-puppy/1.1\r\n"
         "\r\n"         ; Empty line finishes the request
         ))
      (flush-output http-port)
      (cerr "\nrequest sent...\n")
      (let ((ret-code (begin (read http-port) (read http-port))))
	(case ret-code
	  ((200)		; now skip the headers
	   (let loop ()		; read-text-line handles CR, LF, and CRLF as
				; line terminators
	     (let ((header (read-text-line http-port)))
	       (cerr "Header: " header nl)
	       (cond
		((eof-object? header)
		 (error "unexpected EOF while reading the headers"))
		((equal? "" header) ; read the empty line after the headers
		 http-port)
		(else (loop))))))
	  (else (error "Bad return code: " ret-code))))
      ))

  (apply do-fetch (string-split url '(#\/) 4)))

(dehtmlize (get-url "http://www.schemers.org/"))
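
string-split is not a standard procedure; it is presumably supplied by
the library code loaded above. Assuming it splits at each delimiter and
stops after the given number of fields, its effect on a URL can be
sketched as below. split-at-most is a hypothetical stand-in, written
only to show how the URL falls apart into the four arguments of
do-fetch (schema, dummy, host, resource).

```scheme
; Hypothetical stand-in for the string-split call used in get-url.
; Split str at each occurrence of the character delim into at most
; max-fields fields; the last field keeps the rest of the string,
; delimiters included.
(define (split-at-most str delim max-fields)
  (let ((len (string-length str)))
    (let loop ((start 0) (i 0) (fields-left max-fields) (acc '()))
      (cond
        ; end of string, or only one field left: take the remainder
        ((or (= i len) (= fields-left 1))
         (reverse (cons (substring str start len) acc)))
        ; delimiter: close the current field and start the next one
        ((char=? (string-ref str i) delim)
         (loop (+ i 1) (+ i 1) (- fields-left 1)
               (cons (substring str start i) acc)))
        (else
         (loop start (+ i 1) fields-left acc))))))

(write (split-at-most "http://www.schemers.org/" #\/ 4))
(newline)
; prints: ("http:" "" "www.schemers.org" "")

(write (split-at-most "http://www.schemers.org/a/b.html" #\/ 4))
(newline)
; prints: ("http:" "" "www.schemers.org" "a/b.html")
```

In this sketch a URL with no path at all, such as "http://host",
yields only three fields, so a caller that expects four would have to
append the trailing "/" first.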


The input parsing primitives are explained in
        http://pobox.com/~oleg/ftp/Scheme/parsing.html
find-string-from-port? is also a part of SLIB.

The only far-flung extension is dealing with "tcp://" file names. As a
matter of fact, a Scheme system takes such a string as a regular file
name; the magic happens thanks to an "extended" version of open(2).
        http://pobox.com/~oleg/ftp/syscall-interpose.html
It is possible to imbue any language system or application with such
extended file opening powers. No recompilation is necessary (and in
some cases, no relinking is required either, courtesy of LD_PRELOAD).