Service C++ functions and classes for "advanced" i/o, (arithmetic)
compression, and networking

$Id: README,v 2.7 2005/06/26 23:04:50 oleg Exp oleg $

    ***** Platforms
    ***** Verification files
    ***** Highlights and idioms
        ---- Extended file names
        ---- Explicit Endian I/O of short/long integers
        ---- Reading and writing of floating-point numbers
        ---- Stream sharing
        ---- Simple variable-length coding of short integers
        ---- Arithmetic compression of a stream of integers
        ---- TCP streams
        ---- TCP transactor, a shell RPC-like tool
        ---- Logging Service
        ---- Convenience Functions
        ---- Portability Tips
    ***** Grand plans
    ***** Revision history

Comments/questions/problem reports/etc. are all very welcome. Please send
them to me at oleg -at- pobox.com or oleg -at- okmij.org
    http://pobox.com/~oleg/ftp/

***** Platforms

I have personally compiled and tested this package on the following
platforms:
    i686/FreeBSD 4.9, gcc 3.2 and gcc 2.95.2
    i686/Linux 2.4.21, gcc 3.2.3

I have received reports that the library compiles and tests on Fedora
Core 2 with GCC 3.3.3 and 3.4.0.

The previous (2.6) version also works on
    Sun Ultra-2/Solaris 2.6, gcc 2.95.2
    i686/FreeBSD 4.0, gcc 2.95.2
    i686/Linux 2.2.14, gcc 2.95.2
    WinNT, Visual C++ 6.0
    BEOS R4b4
    SunSparc20/Solaris 2.4, gcc 2.7.2, libg++ 2.7.1
    SunSparc20/Solaris 2.3, SunPro C++ compiler
    HP 9000/{750,770,712}, HP/UX 9.0.5, 9.0.7 and 10.0, gcc 2.7.2,
        libg++ 2.7.2
    PowerMac 7100/80, 8500/132, Metrowerks CodeWarrior C++, v. 7 - 11
    Intel, Windows95, Borland C++ 4.5/5.0 (the binaries then ran under
        Windows NT 4.0 beta)

I know that the package also works on DEC Alpha and on a Concurrent
Maxion 8000/RTU 6.2V25 (all with the gcc 2.7.2 compiler).

***** Verification files

    vmyenv, vendian_io, vendian_io_ext, vhistogram, varithm,
    vTCPstream, vTCPstream_server, vvoc

Don't forget to compile and run them; see the comments in the Makefile
for details. The verification code checks that all the functions in this
package have compiled and run well. The code can also serve as an example
of how the package's classes/functions can be used.

For each verification executable, the distribution includes a
corresponding *.lst file containing the output produced by that
validation code on one particular platform (Sun Ultra-2/Solaris 2.6, to
be precise). You can use these files for reference, or as a base for
regression tests.

***** Highlights and idioms

---- Extended file names

The package adds support for "extended" file names: file names that
contain a pipe symbol ('|') in a leading or trailing position, or that
start with a "tcp://" or "ltcp://" prefix. These "files" can be opened
for reading, writing, or even reading _and_ writing.

    EndianIn istream; istream.open("gunzip < /tmp/aa.gz |");
    EndianOut stream("| compress > /tmp/aa.Z");
    image.write_pgm("| xv -");
    FILE * fp = fopen("tcp://localhost:7","r");
    fstream fp("| cat | cat",ios::in|ios::out);

The "pipes" can be uni- or bi-directional. "Piped" file names are
actually commands that are passed to '/bin/sh', which is launched in a
subprocess. The process' stdin, stdout or both are plumbed to a pipe or a
bidirectional socket, which is returned to the user as a "file
descriptor". The code in vendian_io.cc shows many examples of using
various extended file names.

This extension is implemented at the lowest possible level, right before
the request to open a file goes to the OS. A function sys_open() (in a
source file sys_open.c) acts as a "patch": that is, if you call
sys_open() instead of open() to open a file, you get all the open()
functionality plus the extended file names.

The Makefile contains some "black magic" that shows how to effectively
"substitute" the standard open(2) function with sys_open(), without
changing any of the system code. The substitution is completely safe and
does not require any extra privileges or permissions beyond what a
regular user already has. With this substitution in place, *no matter*
how you open a file -- with open(), fopen(), fstream(), etc. -- you can
submit extended file names and enjoy their functionality.
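Here is a rough sketch of calling sys_open() directly. The prototype
declared below is an assumption (it presumes sys_open() takes exactly the
open(2) arguments; the real declaration is in the package sources), and
the gunzip pipe is the same example as above:

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int sys_open(const char * name, int flags, int mode);  // assumed open(2)-like prototype

    int main(void)
    {
      // The "file name" is really a shell command: it is handed to /bin/sh
      // in a subprocess whose stdout is plumbed to the descriptor we get back.
      const int fd = sys_open("gunzip < /tmp/aa.gz |", O_RDONLY, 0);
      if( fd < 0 )
        { perror("sys_open"); return 1; }

      char buffer[512];
      const ssize_t n = read(fd, buffer, sizeof(buffer));
      printf("read %ld bytes of uncompressed data\n", (long)n);
      close(fd);
      return 0;
    }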
---- Explicit Endian I/O of short/long integers

    EndianOut stream("/tmp/aa");
    stream.set_littlendian();
    stream.write_long(1);

That is, 1 will be written as a long integer with the least significant
byte first, NO MATTER which computer (computer architecture) the code is
running on. Using an explicit endian specification as above is the only
way to ensure portability of binary files containing arithmetic data.

Note that it is perfectly appropriate to pass, say, -1 or any other
signed integer to write_short(), even though write_short() was declared
to take an unsigned short. Any signed number can be transformed into the
corresponding unsigned number without any loss of precision or range. You
can use a typecast if your compiler wants it. The reverse transformation,
unsigned->signed, is not generally possible (say, 32768 cannot be
represented as a signed short). Still, if we know that we wrote a signed
integer, we are justified in demanding a signed number back, e.g.,

    const short exponent = (signed short)read_short("reading exp");

The cast is really necessary here. The methods read_short()/write_short()
are intentionally made to take or return unsigned numbers. This is to
emphasize that these methods operate on 16-bit chunks: they move 16-bit
quantities without assigning any particular meaning to them. It is the
user who provides all the interpretation, by using typecasts.

---- Reading and writing of floating-point numbers

It is certainly possible to use EndianIO to read/write floating-point
numbers in a portable way. Although EndianIn/EndianOut streams currently
support reading/writing of only integers, every FP number can be split
into exponent/mantissa parts, and reconstructed from them, in a portable,
platform-independent way. ANSI C/POSIX specify the functions frexp(),
ldexp() and modf() for that purpose. See the functions write_double() and
read_double() in the file vendian_io.cc as an example. These functions
transfer floating-point numbers without any loss of precision.

Chances are, however, that a particular application does not require the
full precision. If you can afford to lose some of it, you can write out
the values in a more compact way. For example, if single precision is
enough for you, only the first 24 bits of the mantissa need to be
written. BTW, if you can tolerate some loss, the best strategy would be
to scan the array of numbers to write, determine the min and max values,
subtract the min value from all the elements of the array, and normalize
the differences to be in a range, say, [0,255] or [0,65535], as sketched
below. You can then use ArithmCodingIn/Out to read/write the
thus-normalized numbers (taking advantage of the (lossless) compression
built into these c++advio streams).

Because efficient storing and communication of floating-point numbers is
so application-specific, the write_double() and read_double() functions
are not made members of the EndianIn/EndianOut classes.
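Here is a rough sketch of the reduced-precision idea, using only
frexp()/ldexp() and the integer EndianIn/EndianOut operations. It is not
the package's write_double()/read_double() (those are in vendian_io.cc
and keep full precision); the write_float24()/read_float24() names are
made up for this sketch, read_long() is assumed to mirror read_short(),
and EndianIn/EndianOut are taken to come from endian_io.h:

    #include "endian_io.h"
    #include <math.h>

    // Write a double to roughly single precision: sign, exponent and the
    // first 24 bits of the mantissa, all as integers, so the file stays
    // portable across byte orders.
    static void write_float24(EndianOut& out, double v)
    {
      int exponent;
      double mantissa = frexp(v,&exponent);    // v == mantissa * 2^exponent
      const int negative = mantissa < 0;
      if( negative )
        mantissa = -mantissa;
      out.write_short(negative);                               // sign flag
      out.write_short((unsigned short)(short)exponent);        // exponent
      out.write_long((unsigned long)(mantissa * (1UL<<24)));   // 24-bit mantissa
    }

    static double read_float24(EndianIn& in)
    {
      const int negative = in.read_short("reading sign");
      const short exponent = (signed short)in.read_short("reading exponent");
      const double mantissa =
          (double)in.read_long("reading mantissa") / (1UL<<24);
      return ldexp(negative ? -mantissa : mantissa, exponent);
    }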
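And here is a bare-bones sketch of the min/max normalization strategy
mentioned above, in plain C++ with no package classes; quantize_to_byte()
and the 8-bit range are merely illustrative. The resulting bytes can then
be put() into an ArithmCodingOut stream (see the arithmetic compression
section below) to pick up the lossless compression:

    #include <algorithm>

    // Map an array of doubles onto [0,255]: find the min and max, then
    // scale each difference from the min into the 8-bit range. The reader
    // needs only the min and the scale factor to (approximately) undo it.
    static void quantize_to_byte(const double * data, const int n,
                                 unsigned char * out,
                                 double& min_out, double& scale_out)
    {
      const double min = *std::min_element(data,data+n);
      const double max = *std::max_element(data,data+n);
      const double scale = max > min ? 255.0/(max-min) : 0.0;
      for(int i=0; i<n; i++)
        out[i] = (unsigned char)((data[i]-min)*scale + 0.5);
      min_out = min; scale_out = scale;
    }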
---- Stream sharing

EndianIn/Out streams can share the same i/o buffer. This is useful when
one needs to read/write a "stratified" (layered) file consisting of
various variable-bit encoded data interspersed with headers. For example,
a file may begin with a header (telling the total number of data items,
normalization factors, etc.), followed by some variable-bit encoding of
items, followed by another header, followed by an arithmetically
compressed stream of data, and so on. Like a waffle, a file can be made
of many layers, each of them interpreted using a different stream, with
all the streams collectively sharing the same file and the same file
pointer. The situation is similar to sharing an open file and a file
pointer among parent and child (forked) processes. Note that merely
opening a stream on a dup()-ed file handle, or sync()-ing the stream,
doesn't cut it entirely. See endian_io.cc for more discussion. This
package implements stream sharing in a safe and portable way: it works on
a Mac and on WinNT just as well as on different flavors of UNIX.

---- Simple variable-length coding of short integers

The code is intended for writing a collection of short integers where
many of them are rather small in value; still, big values can crop up at
times, so we can't limit the size of the encoding to anything less than
16 bits. The code is a variation of a start-stop code described in
Appendix A, "Variable-length representations of the integers", of the
book "Text Compression" by T. Bell, J. Cleary and I. Witten, pp. 290-295.
The present code features support for both negative and positive numbers,
an optimization based on the fact that all numbers are no larger than
2^15-1 in absolute value, and an assumption that most of them are smaller
than 512 (in absolute value).

---- Arithmetic compression of a stream of integers

The present package provides a clean C++ implementation of Bell, Cleary
and Witten's arithmetic compression code, with a clear separation between
a model and the coder. ArithmCodingIn / ArithmCodingOut act as i/o
streams that encode signed short integers you put() to them, and decode
them when you get() them.

An ArithmCodingIn/Out object needs a "plug-in" of a class
Input_Data_Model when the stream is created. The Input_Data_Model object
is responsible for providing the codec with the probabilities
(frequencies) a given data item is expected to appear with, and for
finding a symbol given its cumulative frequency. Input_Data_Model may
also modify itself to account for a new symbol. Thus, the ArithmCoding
class is a sort of 'iostream' class that writes/reads data items to/from
the stream, performing encoding/decoding. It relies upon the
Input_Data_Model for the probabilities needed to perform the arithmetic
coding.

The current version of the package provides two Input_Data_Model
plug-ins, both performing adaptive "modeling" of a stream of integers.
The first plug-in uses a simple 0-order adaptive prediction (like the
model given in the Witten book). The other one takes a histogram to
sketch the initial distribution, and is a bit more sophisticated in
updating the model. It is used in compressing a wavelet decomposition of
an image.

The code below (excerpted from varithm.cc) demonstrates how the coder
classes are actually used. The first example writes two different streams
(of different patterns, which is why it was better to encode them
separately) into the same file:

    EndianOut stream("/tmp/aa");
    stream.set_littlendian();
    const int sample_header = 12345;
    {
      AdaptiveModel model(-1,4);
      ArithmCodingOut ac(model);
      ac.open(stream);
      ...                     // put() the data items into ac
    }

See varithm.cc for the complete code of this example, including the loop
that put()s the data items and the encoding of the second stream.
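For the decoding direction, here is a minimal sketch. It assumes, by
symmetry with the example above, that ArithmCodingIn is likewise
constructed from a model, open()-ed on an EndianIn stream, and hands back
the decoded integers via get(); n_items and process() are placeholders
for application code, and varithm.cc remains the authoritative example.

    EndianIn istream;
    istream.open("/tmp/aa");
    istream.set_littlendian();     // match the writer's byte order
                                   // (assumed to mirror EndianOut)
    {
      AdaptiveModel model(-1,4);   // built exactly as on the encoding side
      ArithmCodingIn ac(model);
      ac.open(istream);
      for(int i=0; i<n_items; i++) // n_items: however many items were written
        process(ac.get());         // get() yields the decoded integers
    }                              // close this coder before reading the next layer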
---- TCP streams

A TCP stream is an iostream running on top of a TCP connection: one opens
a connection to a server and talks to it through the familiar stream
operations. For example, reading and checking a server's response line
looks like

    link >> resp_code;
    if( !link.get(buffer,sizeof(buffer)-1,'\r').good() )
      _error("error reading a response line from the link");
    if( resp_code >= 300 )
      _error("bummer");
    ...etc...

See also vTCPstream.cc for more examples. This code has been used in the
HTTP VFS, http://pobox.com/~oleg/ftp/HTTP-VFS.html, and in the TCP
transactor tool below. TCP streams are helpful on the server side as
well: see vTCPstream_server.cc for an example.

---- TCP transactor, a shell RPC-like tool

tcp-trans is an application that performs a single transaction -- a
request/reply exchange -- with a "server" on the other end of a TCP pipe.
tcp-trans is based on TCPStreams (see above), and shows an example of
their usage. The code establishes a connection to a server, sends a
simple request, listens for the reply and prints it out on its standard
output. It can thus be used to talk to any TCP server (an HTTP daemon, or
an RPC-like service).

tcp-trans is particularly useful as a scripting tool (in sh or other
scripts) to talk to TCP daemons. For example,

    tcp-trans localhost:80 "GET / HTTP/1.0" ""

will fetch the root web page off the site, and

    tcp-trans some.host:25 "expn postmaster" "quit"

reveals the real person behind the postmaster. See the title comments in
the tcp-trans.cc code for more examples.

---- Logging Service

A trivial service to help log various system activities onto stderr, or
some other log stream or file. One can use it like

    Logger() << "Log this message" << "... and this too!" << endl;

Note that the endl at the end is not necessary: the ~Logger() destructor
will take care of it (provided anything was logged at all). The Logger
class is intended to be as light-weight as possible, so that all the
logging operations can be inlined. Other examples:

    Logger clog;
    clog << "\nConnecting to " << connection_parms.q_host_to_connect()
         << ':' << connection_parms.q_port_to_connect() << endl;
    ...
    const int resp_code = read_response_status_line(link);
    clog << "\nreceived response code " << resp_code << endl;
    ...
    Logger() << "soft errors will be re-tried " << max_retries_count
             << " times ";

---- Convenience Functions

The package defines a few functions I found convenient to use, like
message(...) (which is equivalent to fprintf(stderr,...)) and _error(...)
(the same as message(...) followed by abort()). One doesn't need to
#include the corresponding standard headers to use them. Also included:

    xgetenv()            - getenv() with a fall-back clause
    get_file_size()      - also with a default clause
    does_start_with_ci() - an amazingly useful function in input parsing

See vmyenv.cc for examples of their usage. The validation file vmyenv.cc
also illustrates how to catch an abort condition without crashing the
main process (macro must_have_failed()).
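A usage sketch follows. The exact prototypes live in myenv.h; the
two-argument forms of xgetenv() and does_start_with_ci() shown here are
assumptions inferred from the descriptions above, and vmyenv.cc remains
the authoritative example:

    #include "myenv.h"

    int main(void)
    {
      // getenv() with a fall-back: use /tmp when TMPDIR is not set
      const char * tmpdir = xgetenv("TMPDIR","/tmp");
      message("scratch files will go to %s\n",tmpdir);

      // case-insensitive prefix test, handy when parsing input
      const char line [] = "Content-Length: 42";
      if( does_start_with_ci(line,"content-length:") )
        message("got the Content-Length header\n");

      // _error() is message() followed by abort()
      if( tmpdir[0] != '/' )
        _error("TMPDIR must be an absolute path, got '%s'",tmpdir);
      return 0;
    }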
---- Portability Tips

Borland C++ 4.5 is sometimes unhappy with the order in which the BitIn,
BitOut (in endian_io.h) and ArithmCodingIn, ArithmCodingOut (in arithm.h)
classes are derived. Right now,
    class BitIn : BitIOBuffer, public EndianIn
upsets BC because "RTTI class BitIn being derived from non-RTTI class
BitIOBuffer". I have a hunch that an error like that could be avoided by
tinkering with the C++ compiler options. On the other hand, merely
switching the order of inheritance,
    class BitIn : public EndianIn, BitIOBuffer
solves the problem. The same goes for BitOut, ArithmCodingIn, and
ArithmCodingOut.

***** Grand plans

Consider a shared BitIO class that permits switching ArithmCoding streams
freely, w/o the overhead of padding bits. See the message by Erik Kruus,
Jun 23, 2000.

***** Revision history

Version 2.7 - Jun 2005
    - Compiles with GCC 3.2-3.4
    - Added support for the "ltcp://" file name prefix, to open a
      listening socket and accept one connection. The code was
      contributed by Bernhard Mogens Ege.
    - Added an example, simple-proxy.cc, which has actually been used as
      a simple inetd-like server.

Version 2.6 - Nov 2000
    - Added passive open to TCPStream and the corresponding validation
      test vTCPstream_server.cc
    - Renamed the library libserv.a to libcppadvio.a
    - Minute corrections (mainly to make the compiler happier)

Version 2.5 - Jan 2000
    - added tcp-trans.cc, a TCP transactor, a shell RPC-like tool
    - a new section on reading and writing of floating-point numbers
    - "renaming" of open() is tested on Solaris and FreeBSD systems
    - A user of a TCPStream can affect async i/o and error call-backs by
      instantiating and registering a CurrentNetCallback object
    - sys_open() supports more "extended" file names, which denote TCP
      connections and bidirectional pipes
    - validation code (vendian_io.cc) was updated to test the new
      functionality (esp. sys_open())
    - added double pow(long x, long y) and double pow(double x, long y)
    - a few minor adjustments to please gcc 2.95/egcs 2.xx on Linux,
      FreeBSD, Solaris and BeOS platforms

Version 2.4 - Mar 1998
    - a few minor adjustments to please gcc 2.8.1 and Visual C++ 5.0
    - added a primitive Logging service
    - added TCP streams
    - extended i/o is done in a more universal way (by "renaming"
      open(2), although no system function is changed)

Version 2.3 - Mar 1997
    - added xgetenv(), does_start_with_ci(), get_file_size()
    - created vmyenv.cc to validate myenv.h's functions
    - a few adjustments (mainly to endian_io.h and arithm.h) to account
      for changes in the implementation (and interfaces) of the C++
      iostream library made in new versions of libg++ (v. 2.7.2) and
      Metrowerks CodeWarrior (v. 11). This brings c++advio closer to the
      (ever evolving) C++ standard.
    - _Vocabulary_ (an embedded language, actually) is now distributed
      with c++advio; see voc.h for more detail.

Version 2.2.3 - Mar 1996
    - sys_open.cc now accepts an input pipe with more than one link as a
      "file" name
    - endian_io.*: added an EndianIOData::unshare() method to break
      sharing of a streambuffer (if there was any). This method is
      intended for destructors only (it makes the code more portable).
    - careful attention to comparisons between signed and unsigned
      (mainly to get gcc 2.7.2 to shut up)
    - now everything compiles with gcc 2.7.2/libg++ 2.7.1 and Metrowerks
      CodeWarrior 8
    - portability tweaks in myenv.h (declaring bool for platforms that
      lack it)
    - arithm_modadh.*: a more logical (and efficient) way of
      "pulling-to-the-front" when updating adaptive model frequency
      counters by more than 1. Also, the initial distribution is slightly
      tweaked. The upshot is that the compression is a tiny bit better
      (at least, the algorithm makes more sense).

Version 2.2.1 - Jun 1995
    Fixed the last remaining incompatibility glitches. Now exactly the
    same code compiles on a Mac with CodeWarrior 6 and on Unix with gcc
    2.6.3.

Version 2.2 - May 1995
    Added a variable-length (start/stop) coding of signed short integers.
    Added dealing with simple histograms of an integer-valued
    distribution.

Version 2.1 - Mar 1995
    Introducing bool where appropriate (instead of int) and adding checks
    to make sure an EndianIn/Out stream was opened successfully.

Version 2.0 - Feb 1995
    Big change: splitting EndianIO into EndianIn and EndianOut and
    removing all libg++-specific things; everything should be very
    portable now.
    Making sharing of the streambuffer portable.

Version 1.4 - Feb 1994
    Updated for libg++ 2.5.3.

Version 1.3 - Aug 1993
    Introducing attachment of one stream to another, or sharing of a
    streambuf among several streams. Took care of properly terminating an
    arithmetic coding stream by writing a few phony bits at the end (so
    we won't hit the EOF on reading). Thus it is now possible to
    concatenate arithmetic coding streams.

Version 1.2 - Jun 1992
    Updated to compile under gcc/g++ 2.2.1 and work with libg++ 2.0.
    The first implementation of the arithmetic coding package.

Version 1.1 - Nov 1991 - May 1992
    Initial revision.