Service C++ functions and classes for "advanced" i/o, (arithmetic)
compression, and networking

$Id: README,v 2.7 2005/06/26 23:04:50 oleg Exp oleg $

    ***** Platforms
    ***** Verification files
    ***** Highlights and idioms
        ---- Extended file names
        ---- Explicit Endian I/O of short/long integers
        ---- Reading and writing of floating-point numbers
        ---- Stream sharing
        ---- Simple variable-length coding of short integers
        ---- Arithmetic compression of a stream of integers
        ---- TCP streams
        ---- TCP transactor, a shell RPC-like tool
        ---- Logging Service
        ---- Convenience Functions
        ---- Portability Tips
    ***** Grand plans
    ***** Revision history

Comments/questions/problem reports/etc. are all very welcome. Please send
them to me at oleg -at- pobox.com or oleg -at- okmij.org
    http://pobox.com/~oleg/ftp/

***** Platforms

I have personally compiled and tested this package on the following
platforms:
    i686/FreeBSD 4.9, gcc 3.2 and gcc 2.95.2
    i686/Linux 2.4.21, gcc 3.2.3

I have received reports that the library compiles and tests on Fedora
Core 2 with GCC 3.3.3 and 3.4.0.

The previous (2.6) version also works on
    Sun Ultra-2/Solaris 2.6, gcc 2.95.2
    i686/FreeBSD 4.0, gcc 2.95.2
    i686/Linux 2.2.14, gcc 2.95.2
    WinNT, Visual C++ 6.0
    BEOS R4b4
    SunSparc20/Solaris 2.4, gcc 2.7.2, libg++ 2.7.1
    SunSparc20/Solaris 2.3, SunPro C++ compiler
    HP 9000/{750,770,712}, HP/UX 9.0.5, 9.0.7 and 10.0, gcc 2.7.2,
        libg++ 2.7.2
    PowerMac 7100/80, 8500/132, Metrowerks CodeWarrior C++, v. 7 - 11
    Intel, Windows95, Borland C++ 4.5/5.0 (the binaries then ran under
        Windows NT 4.0 beta)

I know that the package also works on DEC Alpha and on a Concurrent
Maxion 8000/RTU 6.2V25 (all with the gcc 2.7.2 compiler).

***** Verification files

    vmyenv, vendian_io, vendian_io_ext, vhistogram, varithm,
    vTCPstream, vTCPstream_server, vvoc

Don't forget to compile and run them; see the comments in the Makefile
for details. The verification code checks that all the functions in this
package have compiled and run well. The code can also serve as an example
of how the package's classes/functions can be used.

For each verification executable, the distribution includes a
corresponding *.lst file containing the output produced by that
validation code on one particular platform (Sun Ultra-2/Solaris 2.6, to
be precise). You can use these files for reference, or as a base for
regression tests.

***** Highlights and idioms

---- Extended file names

The package adds support for "extended" file names: file names that
contain a pipe symbol ('|') in a leading or trailing position, or that
start with a "tcp://" or "ltcp://" prefix. These "files" can be opened
for reading, writing, or even reading _and_ writing.

    EndianIn istream; istream.open("gunzip < /tmp/aa.gz |");
    EndianOut stream("| compress > /tmp/aa.Z");
    image.write_pgm("| xv -");
    FILE * fp = fopen("tcp://localhost:7","r");
    fstream fp("| cat | cat",ios::in|ios::out);

The "pipes" can be uni- or bi-directional. "Piped" file names are
actually commands that are passed to '/bin/sh', which is launched in a
subprocess. The process' stdin, stdout or both are plumbed to a pipe or a
bidirectional socket, which is returned to the user as a "file
descriptor". The code in vendian_io.cc shows many examples of using
various extended file names.

This extension is implemented at the lowest possible level, right before
the request to open a file goes to the OS. A function sys_open() (in a
source file sys_open.c) acts as a "patch": that is, if you call
sys_open() instead of open() to open a file, you get all the open()
functionality plus the extended file names.

The Makefile contains some "black magic" that shows how to effectively
"substitute" the standard open(2) function with sys_open(), without
changing any of the system code. The substitution is completely safe and
does not require any extra privileges or permissions beyond what a
regular user already has. With this substitution in place, *no matter*
how you open a file -- with open(), fopen(), fstream(), etc. -- you can
submit extended file names and enjoy their functionality.
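Here is a rough sketch of calling sys_open() directly. The prototype
declared below is an assumption (it presumes sys_open() takes exactly the
open(2) arguments; the real declaration is in the package sources), and
the gunzip pipe is the same example as above:

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int sys_open(const char * name, int flags, int mode);  // assumed open(2)-like prototype

    int main(void)
    {
      // The "file name" is really a shell command: it is handed to /bin/sh
      // in a subprocess whose stdout is plumbed to the descriptor we get back.
      const int fd = sys_open("gunzip < /tmp/aa.gz |", O_RDONLY, 0);
      if( fd < 0 )
        { perror("sys_open"); return 1; }

      char buffer[512];
      const ssize_t n = read(fd, buffer, sizeof(buffer));
      printf("read %ld bytes of uncompressed data\n", (long)n);
      close(fd);
      return 0;
    }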
---- Explicit Endian I/O of short/long integers

    EndianOut stream("/tmp/aa");
    stream.set_littlendian();
    stream.write_long(1);

That is, 1 will be written as a long integer with the least significant
byte first, NO MATTER which computer (computer architecture) the code is
running on. Using an explicit endian specification as above is the only
way to ensure portability of binary files containing arithmetic data.

Note that it is perfectly appropriate to pass, say, -1 or any other
signed integer to write_short(), even though write_short() was declared
to take an unsigned short. Any signed number can be transformed into the
corresponding unsigned number without any loss of precision or range. You
can use a typecast if your compiler wants it. The reverse transformation,
unsigned->signed, is not generally possible (say, 32768 cannot be
represented as a signed short). Still, if we know that we wrote a signed
integer, we are justified in demanding a signed number back, e.g.,

    const short exponent = (signed short)read_short("reading exp");

The cast is really necessary here. The methods read_short()/write_short()
are intentionally made to take or return unsigned numbers. This is to
emphasize that these methods operate on 16-bit chunks: they move 16-bit
quantities without assigning any particular meaning to them. It is the
user who provides all the interpretation, by using typecasts.

---- Reading and writing of floating-point numbers

It is certainly possible to use EndianIO to read/write floating-point
numbers in a portable way. Although EndianIn/EndianOut streams currently
support reading/writing of only integers, every FP number can be split
into exponent/mantissa parts, and reconstructed from them, in a portable,
platform-independent way. ANSI C/POSIX specify the functions frexp(),
ldexp() and modf() for that purpose. See the functions write_double() and
read_double() in the file vendian_io.cc as an example. These functions
transfer floating-point numbers without any loss of precision.

Chances are, however, that a particular application does not require the
full precision. If you can afford to lose some of it, you can write out
the values in a more compact way. For example, if single precision is
enough for you, only the first 24 bits of the mantissa need to be
written. BTW, if you can tolerate some loss, the best strategy would be
to scan the array of numbers to write, determine the min and max values,
subtract the min value from all the elements of the array, and normalize
the differences to be in a range, say, [0,255] or [0,65535], as sketched
below. You can then use ArithmCodingIn/Out to read/write the
thus-normalized numbers (taking advantage of the (lossless) compression
built into these c++advio streams).

Because efficient storing and communication of floating-point numbers is
so application-specific, the write_double() and read_double() functions
are not made members of the EndianIn/EndianOut classes.
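Here is a rough sketch of the reduced-precision idea, using only
frexp()/ldexp() and the integer EndianIn/EndianOut operations. It is not
the package's write_double()/read_double() (those are in vendian_io.cc
and keep full precision); the write_float24()/read_float24() names are
made up for this sketch, read_long() is assumed to mirror read_short(),
and EndianIn/EndianOut are taken to come from endian_io.h:

    #include "endian_io.h"
    #include <math.h>

    // Write a double to roughly single precision: sign, exponent and the
    // first 24 bits of the mantissa, all as integers, so the file stays
    // portable across byte orders.
    static void write_float24(EndianOut& out, double v)
    {
      int exponent;
      double mantissa = frexp(v,&exponent);    // v == mantissa * 2^exponent
      const int negative = mantissa < 0;
      if( negative )
        mantissa = -mantissa;
      out.write_short(negative);                               // sign flag
      out.write_short((unsigned short)(short)exponent);        // exponent
      out.write_long((unsigned long)(mantissa * (1UL<<24)));   // 24-bit mantissa
    }

    static double read_float24(EndianIn& in)
    {
      const int negative = in.read_short("reading sign");
      const short exponent = (signed short)in.read_short("reading exponent");
      const double mantissa =
          (double)in.read_long("reading mantissa") / (1UL<<24);
      return ldexp(negative ? -mantissa : mantissa, exponent);
    }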
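And here is a bare-bones sketch of the min/max normalization strategy
mentioned above, in plain C++ with no package classes; quantize_to_byte()
and the 8-bit range are merely illustrative. The resulting bytes can then
be put() into an ArithmCodingOut stream (see the arithmetic compression
section below) to pick up the lossless compression:

    #include <algorithm>

    // Map an array of doubles onto [0,255]: find the min and max, then
    // scale each difference from the min into the 8-bit range. The reader
    // needs only the min and the scale factor to (approximately) undo it.
    static void quantize_to_byte(const double * data, const int n,
                                 unsigned char * out,
                                 double& min_out, double& scale_out)
    {
      const double min = *std::min_element(data,data+n);
      const double max = *std::max_element(data,data+n);
      const double scale = max > min ? 255.0/(max-min) : 0.0;
      for(int i=0; i<n; i++)
        out[i] = (unsigned char)((data[i]-min)*scale + 0.5);
      min_out = min; scale_out = scale;
    }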
---- Stream sharing

EndianIn/Out streams can share the same i/o buffer. This is useful when
one needs to read/write a "stratified" (layered) file consisting of
various variable-bit encoded data interspersed with headers. For example,
a file may begin with a header (telling the total number of data items,
normalization factors, etc.), followed by some variable-bit encoding of
items, followed by another header, followed by an arithmetically
compressed stream of data, and so on. Like a waffle, a file can be made
of many layers, each of them interpreted using a different stream, with
all the streams collectively sharing the same file and the same file
pointer. The situation is similar to sharing an open file and a file
pointer among parent and child (forked) processes. Note that merely
opening a stream on a dup()-ed file handle, or sync()-ing the stream,
doesn't cut it entirely. See endian_io.cc for more discussion. This
package implements stream sharing in a safe and portable way: it works on
a Mac and on WinNT just as well as on different flavors of UNIX.

---- Simple variable-length coding of short integers

The code is intended for writing a collection of short integers where
many of them are rather small in value; still, big values can crop up at
times, so we can't limit the size of the encoding to anything less than
16 bits. The code is a variation of a start-stop code described in
Appendix A, "Variable-length representations of the integers", of the
book "Text Compression" by T. Bell, J. Cleary and I. Witten, pp. 290-295.
The present code features support for both negative and positive numbers,
an optimization based on the fact that all numbers are no larger than
2^15-1 in absolute value, and an assumption that most of them are smaller
than 512 (in absolute value).

---- Arithmetic compression of a stream of integers

The present package provides a clean C++ implementation of Bell, Cleary
and Witten's arithmetic compression code, with a clear separation between
a model and the coder. ArithmCodingIn / ArithmCodingOut act as i/o
streams that encode signed short integers you put() to them, and decode
them when you get() them.

An ArithmCodingIn/Out object needs a "plug-in" of a class
Input_Data_Model when the stream is created. The Input_Data_Model object
is responsible for providing the codec with the probabilities
(frequencies) a given data item is expected to appear with, and for
finding a symbol given its cumulative frequency. Input_Data_Model may
also modify itself to account for a new symbol. Thus, the ArithmCoding
class is a sort of 'iostream' class that writes/reads data items to/from
the stream, performing encoding/decoding. It relies upon the
Input_Data_Model for the probabilities needed to perform the arithmetic
coding.

The current version of the package provides two Input_Data_Model
plug-ins, both performing adaptive "modeling" of a stream of integers.
The first plug-in uses a simple 0-order adaptive prediction (like the
model given in the Witten book). The other one takes a histogram to
sketch the initial distribution, and is a bit more sophisticated in
updating the model. It is used in compressing a wavelet decomposition of
an image.

The code below (excerpted from varithm.cc) demonstrates how the coder
classes are actually used. The first example writes two different streams
(of different patterns, which is why it was better to encode them
separately) into the same file:

    EndianOut stream("/tmp/aa");
    stream.set_littlendian();
    const int sample_header = 12345;
    {
      AdaptiveModel model(-1,4);
      ArithmCodingOut ac(model);
      ac.open(stream);
      ...                     // put() the data items into ac
    }

See varithm.cc for the complete code of this example, including the loop
that put()s the data items and the encoding of the second stream.
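For the decoding direction, here is a minimal sketch. It assumes, by
symmetry with the example above, that ArithmCodingIn is likewise
constructed from a model, open()-ed on an EndianIn stream, and hands back
the decoded integers via get(); n_items and process() are placeholders
for application code, and varithm.cc remains the authoritative example.

    EndianIn istream;
    istream.open("/tmp/aa");
    istream.set_littlendian();     // match the writer's byte order
                                   // (assumed to mirror EndianOut)
    {
      AdaptiveModel model(-1,4);   // built exactly as on the encoding side
      ArithmCodingIn ac(model);
      ac.open(istream);
      for(int i=0; i<n_items; i++) // n_items: however many items were written
        process(ac.get());         // get() yields the decoded integers
    }                              // close this coder before reading the next layer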
---- TCP streams

A TCP stream is an iostream running on top of a TCP connection: one opens
a connection to a server and talks to it through the familiar stream
operations. For example, reading and checking a server's response line
looks like

    link >> resp_code;
    if( !link.get(buffer,sizeof(buffer)-1,'\r').good() )
      _error("error reading a response line from the link");
    if( resp_code >= 300 )
      _error("bummer");
    ...etc...

See also vTCPstream.cc for more examples. This code has been used in the
HTTP VFS, http://pobox.com/~oleg/ftp/HTTP-VFS.html, and in the TCP
transactor tool below. TCP streams are helpful on the server side as
well: see vTCPstream_server.cc for an example.

---- TCP transactor, a shell RPC-like tool

tcp-trans is an application that performs a single transaction -- a
request/reply exchange -- with a "server" on the other end of a TCP pipe.
tcp-trans is based on TCPStreams (see above), and shows an example of
their usage. The code establishes a connection to a server, sends a
simple request, listens for the reply and prints it out on its standard
output. It can thus be used to talk to any TCP server (an HTTP daemon, or
an RPC-like service).

tcp-trans is particularly useful as a scripting tool (in sh or other
scripts) to talk to TCP daemons. For example,

    tcp-trans localhost:80 "GET / HTTP/1.0" ""

will fetch the root web page off the site, and

    tcp-trans some.host:25 "expn postmaster" "quit"

reveals the real person behind the postmaster. See the title comments in
the tcp-trans.cc code for more examples.

---- Logging Service

A trivial service to help log various system activities onto stderr, or
some other log stream or file. One can use it like

    Logger() << "Log this message" << "... and this too!" << endl;

Note that the endl at the end is not necessary: the ~Logger() destructor
will take care of it (provided anything was logged at all). The Logger
class is intended to be as light-weight as possible, so that all the
logging operations can be inlined. Other examples:

    Logger clog;
    clog << "\nConnecting to " << connection_parms.q_host_to_connect()
         << ':' << connection_parms.q_port_to_connect() << endl;
    ...
    const int resp_code = read_response_status_line(link);
    clog << "\nreceived response code " << resp_code << endl;
    ...
    Logger() << "soft errors will be re-tried " << max_retries_count
             << " times ";

---- Convenience Functions

The package defines a few functions I found convenient to use, like
message(...) (which is equivalent to fprintf(stderr,...)) and _error(...)
(the same as message(...) followed by abort()). One doesn't need to
#include the corresponding standard headers to use them. Also included:

    xgetenv()            - getenv() with a fall-back clause
    get_file_size()      - also with a default clause
    does_start_with_ci() - an amazingly useful function in input parsing

See vmyenv.cc for examples of their usage. The validation file vmyenv.cc
also illustrates how to catch an abort condition without crashing the
main process (macro must_have_failed()).
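A usage sketch follows. The exact prototypes live in myenv.h; the
two-argument forms of xgetenv() and does_start_with_ci() shown here are
assumptions inferred from the descriptions above, and vmyenv.cc remains
the authoritative example:

    #include "myenv.h"

    int main(void)
    {
      // getenv() with a fall-back: use /tmp when TMPDIR is not set
      const char * tmpdir = xgetenv("TMPDIR","/tmp");
      message("scratch files will go to %s\n",tmpdir);

      // case-insensitive prefix test, handy when parsing input
      const char line [] = "Content-Length: 42";
      if( does_start_with_ci(line,"content-length:") )
        message("got the Content-Length header\n");

      // _error() is message() followed by abort()
      if( tmpdir[0] != '/' )
        _error("TMPDIR must be an absolute path, got '%s'",tmpdir);
      return 0;
    }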
---- Portability Tips

Borland C++ 4.5 is sometimes unhappy with the order in which the BitIn,
BitOut (in endian_io.h) and ArithmCodingIn, ArithmCodingOut (in arithm.h)
classes are derived. Right now,
    class BitIn : BitIOBuffer, public EndianIn
upsets BC because "RTTI class BitIn being derived from non-RTTI class
BitIOBuffer". I have a hunch that an error like that could be avoided by
tinkering with the C++ compiler options. On the other hand, merely
switching the order of inheritance,
    class BitIn : public EndianIn, BitIOBuffer
solves the problem. The same goes for BitOut, ArithmCodingIn, and
ArithmCodingOut.

***** Grand plans

Consider a shared BitIO class that permits switching ArithmCoding streams
freely, w/o the overhead of padding bits. See the message by Erik Kruus,
Jun 23, 2000.

***** Revision history

Version 2.7 - Jun 2005
    - Compiles with GCC 3.2-3.4
    - Added support for the "ltcp://" file name prefix, to open a
      listening socket and accept one connection. The code was
      contributed by Bernhard Mogens Ege.
    - Added an example, simple-proxy.cc, which has actually been used as
      a simple inetd-like server.

Version 2.6 - Nov 2000
    - Added passive open to TCPStream and the corresponding validation
      test vTCPstream_server.cc
    - Renamed the library libserv.a to libcppadvio.a
    - Minute corrections (mainly to make the compiler happier)

Version 2.5 - Jan 2000
    - added tcp-trans.cc, a TCP transactor, a shell RPC-like tool
    - a new section on reading and writing of floating-point numbers
    - "renaming" of open() is tested on Solaris and FreeBSD systems
    - A user of a TCPStream can affect async i/o and error call-backs by
      instantiating and registering a CurrentNetCallback object
    - sys_open() supports more "extended" file names, which denote TCP
      connections and bidirectional pipes
    - validation code (vendian_io.cc) was updated to test the new
      functionality (esp. sys_open())
    - added double pow(long x, long y) and double pow(double x, long y)
    - a few minor adjustments to please gcc 2.95/egcs 2.xx on Linux,
      FreeBSD, Solaris and BeOS platforms

Version 2.4 - Mar 1998
    - a few minor adjustments to please gcc 2.8.1 and Visual C++ 5.0
    - added a primitive Logging service
    - added TCP streams
    - extended i/o is done in a more universal way (by "renaming"
      open(2), although no system function is changed)

Version 2.3 - Mar 1997
    - added xgetenv(), does_start_with_ci(), get_file_size()
    - created vmyenv.cc to validate myenv.h's functions
    - a few adjustments (mainly to endian_io.h and arithm.h) to account
      for changes in the implementation (and interfaces) of the C++
      iostream library made in new versions of libg++ (v. 2.7.2) and
      Metrowerks CodeWarrior (v. 11). This brings c++advio closer to the
      (ever evolving) C++ standard.
    - _Vocabulary_ (an embedded language, actually) is now distributed
      with c++advio; see voc.h for more detail.

Version 2.2.3 - Mar 1996
    - sys_open.cc now accepts an input pipe with more than one link as a
      "file" name
    - endian_io.*: added an EndianIOData::unshare() method to break
      sharing of a streambuffer (if there was any). This method is
      intended for destructors only (it makes the code more portable).
    - careful attention to comparisons between signed and unsigned
      (mainly to get gcc 2.7.2 to shut up)
    - now everything compiles with gcc 2.7.2/libg++ 2.7.1 and Metrowerks
      CodeWarrior 8
    - portability tweaks in myenv.h (declaring bool for platforms that
      lack it)
    - arithm_modadh.*: a more logical (and efficient) way of
      "pulling-to-the-front" when updating adaptive model frequency
      counters by more than 1. Also, the initial distribution is slightly
      tweaked. The upshot is that the compression is a tiny bit better
      (at least, the algorithm makes more sense).

Version 2.2.1 - Jun 1995
    Fixed the last remaining incompatibility glitches. Now exactly the
    same code compiles on a Mac with CodeWarrior 6 and on Unix with gcc
    2.6.3.

Version 2.2 - May 1995
    Added a variable-length (start/stop) coding of signed short integers.
    Added dealing with simple histograms of an integer-valued
    distribution.

Version 2.1 - Mar 1995
    Introducing bool where appropriate (instead of int) and adding checks
    to make sure an EndianIn/Out stream was opened successfully.

Version 2.0 - Feb 1995
    Big change: splitting EndianIO into EndianIn and EndianOut and
    removing all libg++-specific things; everything should be very
    portable now.
    Making sharing of the streambuffer portable.

Version 1.4 - Feb 1994
    Updated for libg++ 2.5.3.

Version 1.3 - Aug 1993
    Introducing attachment of one stream to another, or sharing of a
    streambuf among several streams. Took care of properly terminating an
    arithmetic coding stream by writing a few phony bits at the end (so
    we won't hit the EOF on reading). Thus it is now possible to
    concatenate arithmetic coding streams.

Version 1.2 - Jun 1992
    Updated to compile under gcc/g++ 2.2.1 and work with libg++ 2.0.
    The first implementation of the arithmetic coding package.

Version 1.1 - Nov 1991 - May 1992
    Initial revision.