;; -*-mode: Outline; -*-

TOC
  General
  Awards
  Keynote. Neal Stephenson: Programmer as a writer. Do it right the first time
  A Logic File System
  Undo for Operators: Building an Undoable E-mail Store
  Role Classification of Hosts within Enterprise Networks Based on Connection Patterns
  A Cooperative Internet Backup Scheme
  The Convergence of Ubiquity: The Future of Wireless Network Security
  Network Programming for the Rest of Us
  In-Place Rsync: File Synchronization for Mobile and Wireless Devices
  Application-specific Delta-encoding via Resemblance Detection
  Opportunistic Use of Content Addressable Storage for Distributed File Systems
  StarFish: Highly-Available Block Storage
  Secure and Flexible Global File Sharing
  The CryptoGraphic Disk Driver
  The Design of the OpenBSD Cryptographic Framework
  A Binary Rewriting Defense against Stack-based Buffer Overflow Attacks
  Kernel Support for Faster Web Proxies
  Multiprocessor Support for Event-Driven Programs
  POSIX Access Control Lists on Linux
  Privman: A Library for Partitioning Applications
  The TrustedBSD MAC Framework: Extensible Kernel Access Control for FreeBSD 5.0
  Using Read-Copy-Update Techniques for System V IPC in the Linux 2.5 Kernel
  An Implementation of User-level Restartable Atomic Sequences on the NetBSD Operating System
  Providing a Linux API on the Scalable K42 Kernel
  BSD SuperBOF
  Plan9

* General

June 12-14, 2003, San Antonio, TX

Attendance: I haven't seen the exact numbers; subjectively there were noticeably fewer people compared to past conferences. USENIX 2001 in Boston was attended by ca. 1500 people. This year there seemed to be only half as many. The exhibition was particularly poor -- there were only about 14 exhibitors.

The conference is selective: in the General track, 24 out of 104 submitted papers were accepted. In the FREENIX track, 27 out of 70 (cf. last year's 53) submitted papers were accepted. In the FREENIX track, each paper had 10-15 reviewers.

* Awards

Rick Adams -- the founder of UUCP (the first wide-area e-mail system), the author of the SLIP protocol, and the founder of UUNET (the second Internet Service Provider in the U.S.) -- received the USENIX Lifetime Achievement (a.k.a. "The Flame") award.

The creators of CVS received the Software Tools User Group (STUG) award. The first version of CVS was a shell script! It was written at the VU in Amsterdam, in 1986.

* Keynote: Programmer as a writer. Do it right the first time

The keynote address was given by Neal Stephenson, the author of Cryptonomicon, Snow Crash, The Diamond Age, and the upcoming Quicksilver.

Neal Stephenson is a professional writer. He started his insightful keynote with the observation that programmers and professional writers have a lot in common. In particular, both groups build complex mental structures, which they must serialize through a keyboard by pressing one key at a time. A good writer (and a good programmer) does not work by distilling good ideas from a large pool of good and bad ones, but by producing few if any bad ideas in the first place. It is important to give ideas time to mature [in the subconscious] so that only good ideas percolate to the conscious level.

The gist of his talk was an argument against a defined _methodology_ of writing (where 'methodology' is to be understood as in 'software development methodology' or in the Capability Maturity Model, CMM). A good novel should be written mostly right the first time. A good novel is written mainly subconsciously, with a great involvement of emotion. Neal Stephenson is a great fan of Antonio Damasio.
He quoted extensively from Damasio's book "Descartes' Error" [see, however, a critique of Damasio at http://www.cogsci.soton.ac.uk/~harnad/Tp/bookrev.htm].

At the beginning of his talk, Neal Stephenson noted that it was 20 years ago, almost to the day, that he signed his first contract and became a professional writer. He told an interesting and amusing story of writing his first book. It was a hot summer in Iowa City. Neal Stephenson had a regular job, and yet had a hunch that writing might be for him. He had written a "query" -- a plot summary, the outline of a book, biographies of characters, and a few sample chapters -- and started to send them to editors, whom he picked at random from trade directories. Many rejection letters followed. Finally, one editor wrote that he was intrigued by the outline and the sample chapter and asked for the rest of the novel. After a brief exhilaration, reality set in: there was no novel yet. He had to write it. With all his vacation time and the 4th of July holiday, there were 10 days in which to write a novel. He rented a modern typewriter, secluded himself in his apartment and started to type. Soon a problem appeared: the typewriter had a modern plastic ribbon. The plastic mellowed and became sticky: it was July in Iowa City, and the apartment was hot. The only way to prevent the ribbon from getting stuck was to keep the ribbon moving. And the only way to keep the ribbon moving was to keep pressing the keys. That discovery did wonders for his productivity. He didn't have time to think: he had to keep pressing the keys and write the first thing that came into his mind. He sent the thus-created manuscript to the editor. The latter replied that his publishing house couldn't print that -- but the work was interesting and should be published. Eventually, Neal Stephenson got an agent, a publisher, and his first published book, "The Big U".

And so Neal Stephenson became a professional writer, and started to think about his next book. He wanted to follow some system of writing a book, so at least he could tell himself that he was not a slacker but a diligent worker. He picked a "distillation theory". The theory states that to write a book you start with a big draft. You keep re-writing -- distilling -- it until you remove the bad parts and bring the draft down to the size of a novel. The process is similar to making good whiskey out of large quantities of bad beer. Neal Stephenson spent a year working on a huge draft. When he started to distill it, he made a grim discovery. The good stuff in the draft was so intertwined with large quantities of bad stuff that it was impossible to separate them. He panicked and quickly wrote a new novel, which was published with little editing -- partly because it was written at the right size the first time. He had found his process: do it right the first time.

He also found that he tends to write good stuff in the morning and bad stuff in the afternoon. So he adopted the following regimen: he would write only in the morning. And to fill the rest of the day, he would help a friend as a construction worker. Surprisingly, that worked well. The construction work provided the "background" for subconscious work, where ideas were born. Some of these ideas would bubble to the conscious surface by the morning, and he would write them down.

Neal Stephenson discovered another remarkably helpful technique: a fountain pen. The great advantage of the pen is that it is _slow_.
Ideas, in his view, come faster than we can "serialize" them on paper or at the keyboard. Therefore, the ideas are stored in an "accumulation buffer" in the brain. While in the buffer, the ideas interact and purify. If we empty the buffer too soon, we don't give the ideas enough time to mature and we get half-baked prose. There is a virtue in slow typing or in hand-writing. Needless to say, Neal is not a fan of PowerPoint. He contends that writing on a blackboard gives a teacher and a student much-needed time to think.

Neal Stephenson gave a few more examples in support of his thesis. For instance, many stories of Charles Dickens were first published in a magazine, over the course of several months. It is not often emphasized that Dickens didn't have time to distill his novels and to write draft after draft. He had to write in monthly installments, with barely time for one draft. He had to do it right the first time.

One member of the audience observed that in software development, the distillation theory is called "the process". That theory states that given a number of average or bad programmers, after many iterations, good code can be produced -- just as everybody can be a writer if they merely have the patience and diligence to write enough drafts. The reality of book writing and of software development shows the falsity of these views. Not everybody can be a good writer. Bad writers always write bad prose. Bad programmers always write bad code.

The message of the talk is also against debugging. It's all too prevalent for programmers to quickly type in code they don't quite understand and aren't sure is correct, and then debug it into shape. Again, experience (e.g., with ftp clients and Sun's tooltalk daemon) shows that code badly written the first time remains buggy and filled with holes, no matter how much it is debugged.

* A Logic File System

Yoann Padioleau and Olivier Ridoux, IRISA / University of Rennes
General Track, pp. 99-112 of the Proceedings
http://www.usenix.org/events/usenix03/tech/padioleau.html

This was the most interesting, insightful, and innovative talk at the conference. The goal of the work is to combine navigation and querying in a file system. Roughly speaking, the talk attempted to bring filesystems from the age of hierarchical and network databases into the age of relational -- and, moreover, deductive -- databases, and yet maintain the familiar cd/ls/mkdir/mv interface, file paths and shell globs.

The current filesystems are built around the hierarchical model (links, if present, are only uni-directional; therefore filesystems do not actually follow the network database model). Unfortunately, a hierarchy is a tree rather than a lattice: a hierarchy decomposes the system around one aspect, disregarding the others. For example, the typical layout of the filesystem emphasizes the scope: /, /usr, /usr/local, /usr/local/share/ghostscript/7.05. Other aspects, e.g., library code, are smeared all over the place: /lib, /usr/lib, /usr/local/lib and /usr/local/share/ghostscript/7.05/lib. Locating files that are local to an installation is easy (/usr/local); locating all the library files is hard. Many interesting queries are difficult to execute or even to formulate: for example, to list all directories that contain only empty subdirectories, to "cd" to a directory with files foo.h or foo.c (a disjunctive query), or to list all the files *.h _or_ *.c in a given tree that are _not_ in (sub)directories named Attic.
To answer such queries, we have to write quite complex logical formulas, made more complex by the fact that we must keep the navigational structure in mind. Many operating systems have tools that impose a relational layer on the file system -- glimpse, locate, find -- and offer a keyword-based search. However, these tools often return a great number of results, which are not grouped in any particular way. For example, "locate sendmail.cf" may return a couple of file names. But "locate !sendmail.cf" (where ! stands for negation), if it were allowed, would list almost every file on the system. Clearly the user will find such a result unmanageable. The user would much prefer the output to be automatically grouped according to the most general and informative categories: the most general keywords that _will_ refine the search. Furthermore, the user should be able to navigate such a virtual directory of results using the familiar 'cd' and 'ls' commands, or reclassify a file with a 'mv' command. Finally, if the user adds or deletes a file, the virtual file tree should be updated automatically.

We would like to "cd /lib" and descend into the "directory" that contains only library files, grouped into subdirectories "root", "usr", "local". We would like to "cd /bin|/sbin" and get to the root of the "filesystem tree" that contains only executable files (again, grouped into "subdirectories"). We may even do "cd !src" to get a view of the filesystem without any source files. The Logical File System described in this talk (LISFS for short, not to be confused with the Log File System) can accomplish all that -- now. The prototype implementation is fully functional and reasonably fast. An ls benchmark shows hardly any performance overhead, despite the far more expressive filesystem queries.

The logical filesystem LISFS takes a relational view: files are entities that are described by their properties (attributes). To locate a file, we execute a query. LISFS lets a user compose a query incrementally: "cd /query" executes the query and makes the corresponding collection current; "cd query1" refines the original query. Incidentally, this means that "cd a/b" and "cd b/a" are the same: the path separator stands for a conjunction, and therefore is commutative. The closed-world assumption makes negation queries possible (e.g., "cd !/lib").

LISFS goes farther than the naive relational model. A user may create taxonomies of attribute values. For example, the user may create a category "lib" with sub-categories "shared-lib" and "static-lib" -- using the mkdir command. LISFS will infer that anything classified as a shared library is also a library. The inference rules can be more complex: we can define a category "dual-lib" for a project's libraries that can be linked both statically and dynamically. When LISFS reifies the result of a query as a filesystem directory, the most general categories that partition the result become the names of subdirectories. LISFS thus lets us navigate the inference tree. The complex inference requires a logical engine -- and indeed the current version of LISFS includes a Prolog interpreter as a part of the filesystem code! Truly, a file path is a logical formula.

Like the semantic file system, LISFS has intrinsic attributes, whose values are derived from the content of a file (e.g., the file's size or the resolution of a JPEG image). Unlike the semantic file system, a user can define his own attributes and organize them into arbitrary taxonomies.
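The query model sketched above can be made concrete with a toy example. The following Python fragment is purely illustrative -- it is not LISFS (which is a user-level NFS server with a logic engine), and all names in it are invented -- but it shows the three ingredients: paths as conjunctive queries with negation, a small taxonomy closure, and listings grouped by the attributes that would refine the search.

    # Toy model of logical-file-system navigation: a path is a conjunctive
    # query over file attributes, not a walk down a fixed hierarchy.
    # Purely illustrative; names and behavior are this sketch's, not LISFS's.

    FILES = {
        "libc.so": {"shared-lib", "usr"},
        "libm.a":  {"static-lib", "usr"},
        "gs.ps":   {"doc", "local"},
        "ls":      {"bin", "root"},
    }

    # A tiny taxonomy: anything classified as shared-lib or static-lib is a lib.
    PARENT = {"shared-lib": "lib", "static-lib": "lib"}

    def closure(attrs):
        # Add implied ancestors (shared-lib implies lib, and so on).
        out = set(attrs)
        for a in attrs:
            while a in PARENT:
                a = PARENT[a]
                out.add(a)
        return out

    def cd(path):
        # Interpret "/a/b/!c" as the query: a AND b AND NOT c.
        # Conjunction is commutative, so "cd a/b" == "cd b/a".
        terms = [t for t in path.split("/") if t]
        result = set()
        for name, attrs in FILES.items():
            attrs = closure(attrs)
            if all((t[1:] not in attrs) if t.startswith("!") else (t in attrs)
                   for t in terms):
                result.add(name)
        return result

    def ls(names):
        # Group a query result by the attributes that still discriminate,
        # i.e. would refine the search -- these play the role of subdirectories.
        refining = {}
        for n in names:
            for a in closure(FILES[n]):
                refining.setdefault(a, set()).add(n)
        return {a: files for a, files in refining.items()
                if 0 < len(files) < len(names)}

    print(cd("/lib"))        # {'libc.so', 'libm.a'} -- inferred via the taxonomy
    print(cd("/lib/!usr"))   # set() -- negation under the closed-world assumption
    print(ls(cd("/lib")))    # {'shared-lib': {'libc.so'}, 'static-lib': {'libm.a'}}

Note that neither file is tagged "lib" directly: the taxonomy supplies the inference, which is the job LISFS delegates to its logic engine in full generality.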
The paper describes the algorithms, the implementation of directory operations, and the obvious optimizations (e.g., caching). LISFS is implemented truly at the low, file-system level, as a user-level NFS daemon. Therefore, no application needs to be modified in any way to take advantage of LISFS. For example, wildcard expansion in a shell (glob) "just works". LISFS is written with the help of a PerlFS toolkit. However, Yoann told me that the implementation language was not truly Perl. Yoann designed his own language with types and an ability to express post-conditions and invariants. That source language was then mechanically translated into Perl. Perhaps Yoann might wish to supply more details?

The only filesystem that comes close to LISFS in terms of the integration of query and navigation was the file system of the Apple Newton (the so-called 'soup'). But NewtonScript never had any logical inference. In future research, Yoann plans to implement views of a file. We could then edit selected sections of a TeX document as if they were a separate file (Emacs' all-mode is a glimpse of that functionality). It seems that LISFS can be meaningfully integrated with the Semantic Web.

Yoann was asked whether a common person (without knowledge of logic, etc.) would be able to meaningfully navigate LISFS. Yoann gave an excellent answer: a common person can use Google. LISFS is not much different.

The prototype and more information are available at www.irisa.fr/LIS

* Undo for Operators: Building an Undoable E-mail Store

Aaron B. Brown and David A. Patterson, University of California, Berkeley
General Track, pp. 1-14 of the Proceedings
http://www.usenix.org/events/usenix03/tech/brown.html

A winner of the best paper award.

Statistics show that operator errors are the dominant cause of Internet service outages. Backups, and now rewindable storage, let the administrator "time travel" -- roll back the system to some good state, fix the problem, and then replay the log of changes. This form of time travel (like time travel in general) may create paradoxes. For example, a bug in anti-virus e-mail scanning software caused the deletion of all e-mail attachments. The administrator notices the error, restores the system from a backup, patches the anti-virus software, and replays the log of arrived messages. A user who had received, read and deleted a message with an empty attachment receives that old message once again, now with the full attachment. The user will be confused -- the deleted message is resurrected, with a different body.

The presenter noted that paradoxes in general cannot be avoided. He proposed to mitigate them. In particular, we can rely on people's ability to tolerate paradoxes, if they are explained. In this respect, e-mail is a perfect application for this approach, because the users are human. Therefore, the goal of the project is to detect paradoxes and send an explanatory e-mail message.

The authors have built a prototype based on rewindable storage (from Network Appliance) and SMTP/IMAP proxies. The proxies monitor user interactions with the e-mail system and maintain a log of events. An event is sending a message, reading a particular message, deleting it, etc. The record of an event must identify its action (the "verb") and its attributes. The latter must be in an environment-independent form: not "deleting message number 42" but deleting the message with a particular message-id. Each action verb is associated with compensation/explanation hooks.
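To make the preceding description concrete, here is a minimal sketch of what a verb-based event record with compensation hooks might look like. The names and structure are invented for illustration; they are not the authors' prototype API.

    # Sketch of verb-based event records with compensation/explanation hooks.
    # All names here are illustrative, not the authors' prototype.

    from dataclasses import dataclass

    @dataclass
    class Event:
        verb: str     # "deliver", "read", "delete", ...
        attrs: dict   # environment-independent: message-id, body hash --
                      # never "message number 42"

    COMPENSATIONS = {}

    def compensates(verb):
        # Register an explanation/compensation hook for a verb.
        def register(fn):
            COMPENSATIONS[verb] = fn
            return fn
        return register

    @compensates("deliver")
    def explain_changed_delivery(event, replayed_msg):
        # Invoked when the replayed message diverges from the recorded one,
        # e.g. its body hash changed after the anti-virus software was fixed.
        return ("To %s: message %s was re-delivered after a repair; "
                "its content may differ from what you saw before."
                % (replayed_msg["rcpt"], event.attrs["message-id"]))

    def replay(log, replayed_state):
        explanations = []
        for event in log:
            msg = replayed_state.get(event.attrs["message-id"])
            if msg is not None and msg["body-hash"] != event.attrs["body-hash"]:
                hook = COMPENSATIONS.get(event.verb)
                if hook:
                    explanations.append(hook(event, msg))
        return explanations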
Play-forward routines check for consistency between the replayed state and the recorded event stream. If an inconsistency is detected (e.g., a message with a particular id now has a different hash of the body), the compensation hook associated with the verb is invoked. The identification of events and the development of compensation/explanation routines are the responsibility of the sysadmin. Furthermore, as Eric Allman observed, SMTP is a far more complex protocol, which cannot be reduced to a single "Deliver" event (as it is in the prototype).

I noted that explanatory messages sent during the forward replay are themselves a paradox. The author agreed and said that the explanatory messages must be identified as such, to prevent the system from diverging. That means the system is not modular: a compensating routine must be aware of the compensation done by all the other routines. Furthermore, the effect of "compensating" for one paradox may trigger, in an indirect way, another paradox (e.g., sending a message that bounces and causes mailer-daemon messages) -- in a way that is difficult to identify. There doesn't seem to be a good way to keep the paradoxes from snowballing. The presenter said that they are considering running the forward-replaying process in a VM under VMware, so they can tightly monitor and control the process.

I got the impression that the whole approach is not thought through. The paper seems very ad hoc. I'm puzzled how it was chosen as one of the two best papers.

* Role Classification of Hosts within Enterprise Networks Based on Connection Patterns

Godfrey Tan, Massachusetts Institute of Technology; Massimiliano Poletto, Mazu Networks; John Guttag and Frans Kaashoek, Massachusetts Institute of Technology
General Track, pp. 15-28 of the Proceedings
http://www.usenix.org/events/usenix03/tech/tan.html

The idea is to group hosts based on their communication patterns -- more precisely, on the number of bytes transferred between pairs of hosts over the observation period. The present paper is yet another paper (out of many that followed at USENIX 2003) whose motivation is sorely lacking and whose approach is ad hoc at best. This and many other presented papers originated from summer internships. Perhaps that can explain their immaturity.

The authors claimed that their host classification can be used for intrusion detection. Given the fact that they need to accumulate quite a lot of statistics, their intrusion detection would come a bit late. Furthermore, an intruder who is not a fool takes care not to be conspicuous. If he must probe other machines, he does so at a low rate.

The paper presented a grouping algorithm and a group-differencing algorithm. The latter correlates two groups of hosts identified at two different times. The group identification algorithm builds the graph of hosts that communicate "regularly" and takes two-connected components as the initial groups. The latter are then merged according to various heuristics. The "regular communication" requirement all but precludes the usefulness of the system as an intrusion-detection tool.

* A Cooperative Internet Backup Scheme

Mark Lillibridge, Sameh Elnikety, Andrew Birrell, Mike Burrows, and Michael Isard, HP Systems Research Center
General Track, pp. 29-42 of the Proceedings
http://www.usenix.org/events/usenix03/tech/lillibridge.html

The talk proposed a variation of Internet backup at a far lower price: 25 cents per MB per month, compared with the typical $7/MB/month of commercial Internet backup providers. The price is so low because the users of the system store (parts of) each other's backups. The proposed backup system has no centralized state, no trust assumptions among the backup partners, no need for fixed identities of the partners, and no administrative overhead.

The backup software creates a back-up file, expands it with a Reed-Solomon erasure code, encrypts the blocks (it is important to encrypt after the expansion!), signs each block, and distributes the blocks among the partners. Currently, the list of partners is obtained from a central server. In all other respects, the system is peer-to-peer.

Most of the talk was a presentation of quite clever algorithms for the detection and punishment of free-loaders -- people who participate in the system but don't store backup blocks of other people. To check if a partner indeed stores the blocks we sent him, from time to time we send the partner a nonce and a block number and ask for the hash of that block. The system must be gracious enough to distinguish between a freeloader and a computer that merely crashed. Therefore, the system includes the notion of a "pre-payment". When a new user joins the system, for three weeks he has to store backup blocks of his partners "for free" (without any right to send the partners his own backup blocks). After three weeks of good behavior (with partners verifying that the blocks are indeed stored), the new user gains the backup benefits. The presenter also explained a sophisticated algorithm to detect a man-in-the-middle attack.

The system is implemented. It provides a backup rate of 1MB/sec on a LAN on Pentium Pro 20. Some of the algorithms are noteworthy, and can be applied to any peer-to-peer service with untrusted partners.
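The storage-verification challenge lends itself to a short sketch. The Python below is illustrative only -- the paper's actual message formats and hash choices are not reproduced -- but it shows the nonce-plus-block-number challenge and why only a partner that really kept the block can answer it (the challenger can recompute the expected answer from its own copy of the data).

    # Sketch of the challenge-response check that a partner really stores a block.
    # Illustrative only -- not the protocol or the code from the paper.

    import hashlib
    import os

    class Partner:
        """The remote peer holding (encrypted, signed) backup blocks for us."""
        def __init__(self):
            self.blocks = {}            # block number -> block bytes

        def store(self, number, block):
            self.blocks[number] = block

        def respond(self, number, nonce):
            # A partner that still has the block can compute H(nonce || block);
            # a freeloader that discarded it cannot.
            block = self.blocks.get(number)
            if block is None:
                return None
            return hashlib.sha1(nonce + block).digest()

    def challenge(partner, number, my_copy_of_block):
        # The challenger regenerates the block (or its hash) from its own data.
        nonce = os.urandom(16)
        expected = hashlib.sha1(nonce + my_copy_of_block).digest()
        return partner.respond(number, nonce) == expected

    # Usage: periodically challenge partners about randomly chosen blocks.
    p = Partner()
    p.store(7, b"encrypted+signed block payload")
    print(challenge(p, 7, b"encrypted+signed block payload"))   # True
    print(challenge(p, 8, b"another block we think we sent"))   # False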
* The Convergence of Ubiquity: The Future of Wireless Network Security

William A. Arbaugh, University of Maryland, College Park
Invited Talk

The talk discussed several difficult-to-solve or yet-unsolved security problems in wireless (in particular, 802.11) networks:
 -- wireless security is different from wired security: hop-by-hop vs. end-to-end
 -- problems with WEP and potential problems with WPA
 -- Denial-of-Service (DoS) attacks: they are easy, numerous, and _not_ preventable

** wireless security is different from wired security: hop-by-hop vs. end-to-end

In a wired network, many computers are protected by a firewall. An external attacker typically has no access to the LAN and the LAN protocol. He cannot spoof ARP, or monitor and interfere at the transport (Ethernet, PPP) layer. In a wireless network, an external attacker can do all that. In a wireless network, every attacker is an insider. In the 802.11 group of standards, frequency hopping and spread spectrum are predictable; therefore, they offer no defense. With a parabolic antenna, one can access a wireless network up to 2 miles from the access point!

End-to-end security is orthogonal to transport-layer (hop-by-hop) security. End-to-end security (such as SSL) does not protect against routing attacks, DoS, or social engineering. For example, when a wireless user is connecting to his bank but is instead redirected to the attacker's site, the browser will pop up a warning, saying "a certificate issued to www.your-bank.com cannot be verified. Do you want to proceed, cancel, or look at the certificate?" It's too easy to make a forged certificate look like the real one. A user not trained in security will probably notice no difference and accept the forged certificate.

** problems with WEP and potential problems with WPA

The presenter said that WEP -- Wired Equivalent Privacy -- is a very unlucky security standard. Many of the attacks against WEP could only succeed in very specific circumstances (if bytes are arranged a certain way, or if ciphers are composed in a certain way). Unfortunately, WEP uses the ciphers precisely in that most vulnerable way. Some particular problems with WEP: it relies on the MAC address (Media Access Control address: the Ethernet address of the wireless card) as the client's identifier. Alas, the MAC address is stored in flash memory and therefore is mutable. An attacker can change his MAC address to be anything he wishes. The CRC in WEP offers no integrity protection: the CRC and the XOR operation in the encryption layer are both linear. An attacker may flip bits in a packet without breaking the CRC. WEP does not protect against replay of packets at all.

WPA is a stop-gap attempt to fix WEP. WPA uses two new algorithms -- TKIP and Michael -- which are not mature and not well-tested. In fact, there was a report that TKIP was broken.

** Denial-of-Service (DoS) is a real problem

Alas, all current and under-consideration wireless standards seem to ignore DoS. Furthermore, they make DoS too easy. One can easily turn a microwave oven into a transmitter that jams all wireless networks within a wide range. Unfortunately, one can inflict damage against a specific access point in a far easier way. For example, management frames in 802.11 are not authenticated and can easily be spoofed. An attacker can pretend to be an access point and occasionally transmit frames saying "don't transmit for xx seconds" or "go to sleep", and effectively shut down the network. Furthermore, there are tools that do exactly that: Air-Jack and WLAN-Jack. A Google search for these names will give the source code and executables. The presenter demonstrated these tools. He said he uses them in the class he teaches at the university, to prevent students from browsing the web while in class.

* Network Programming for the Rest of Us

Glyph Lefkowitz, Twisted Matrix Labs; Itamar Shtull-Trauring, Zoteca
FREENIX Track, pp. 77-90 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/lefkowitz.html

The presenter started his talk with the remark that manual-transmission cars indeed offer more control. However, an automatic-transmission car is much easier to use, and is good enough. During the talk, he urged programmers several times to use a higher-level networking framework. The talk presented one such networking framework -- implemented in a high-level language, Python. The presenter made a good case for a higher-level language.

Notable features of their framework:
- Cross-platform.
- Low-level peculiarities are not hidden. The framework offers portable (Windows, UNIX, etc.) networking. A user can still set a particular socket option if he knows what he's doing. This is in contrast to the least-common-denominator approach of Java.
- The event loop is made explicit. A user can integrate networking events with GUI and other frameworks' events. Actually, all asynchronous networking I/O is based on an event loop. The presenter said that multi-threading is easy to add to an event loop (message passing); the reverse is harder.
- The event loop is integrated with select(), poll(), kqueue and WaitForMultipleObjects, depending on the platform.
- Testing is integrated into the framework. Currently they have ca. 700 tests. Regression tests are run _automatically_ on each CVS check-in.
- Separation of the protocol (HTTP, SMTP, FTP) from the transport (TCP, UDP, SSL).
- Many high-level protocols are supported: HTTP, NNTP, SSH, IRC, FTP, etc.
- The code is LGPL and is available at www.twistedmatrix.com

They have ca. 20 developers. I asked the presenter if they had investigated any other high-level language, such as OCaml. He didn't seem to be familiar with OCaml. He said that they inherited a prototype framework written in Python, and have been enhancing it.

* In-Place Rsync: File Synchronization for Mobile and Wireless Devices

David Rasch and Randal Burns, Johns Hopkins University
FREENIX Track, pp. 91-100 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/rasch.html

The ordinary rsync protocol needs temporary space for the file being synchronized. That can be a problem for mobile and other memory-constrained devices. The talk presented a modification of the rsync protocol that operates in-place -- that is, it applies updates to an existing old copy of the file in-place, bringing the copy up-to-date. The rsync protocol sends the patches as COPY and ADD operations. In the modified rsync protocol, the server topologically sorts the COPY operations and sends them in an order that avoids conflicts. In rare circumstances (circular dependencies), a COPY is converted into an ADD. ADD operations now have to carry an offset into the target file. Compression suffers a little, but the overall result is encouraging.

* Application-specific Delta-encoding via Resemblance Detection

Fred Douglis and Arun Iyengar, IBM T.J. Watson Research Center
General Track, pp. 113-126 of the Proceedings
http://www.usenix.org/events/usenix03/tech/douglis.html

The topic of the talk is grouping files in the file system according to their resemblance, to increase the efficiency of delta-encoding. Delta-encoding is a differential encoding against some base. An example: rsync. Rsync takes an old version of a file as the base. What if we don't have an earlier version of a file? What if the earlier version was renamed or moved to a different place? The goal of the talk was therefore finding a suitable base for a given file -- to be more precise, finding a collection of similar files so that the files in the collection can be delta-encoded against each other. Possible applications: mail servers or web caches.

To determine if two files resemble each other, the standard technique was used: extracting features as "shingles" (a sliding window of 20-30 bytes: a truncated suffix tree), Rabin-fingerprinting the shingles, subsampling the space of features, and measuring the resemblance in that space. It turns out that 30 features are enough to judge the similarity of files. Experiments confirmed that resemblance implies good delta-encoding. To be more precise, given two files that share 5 out of 30 features, difference encoding was better than compression. VCDiff was used for delta-encoding.

Using shingles rather than file blocks (as Pastiche does) let the authors detect similarity in far more cases. For example, if we take a file, remove the first byte from the beginning and add that byte at the very end, we end up with a file that differs from the original one in every block. Fred Douglis's system will still find the two files highly similar, and so will efficiently delta-encode them.
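A minimal sketch of the shingling idea follows. It substitutes Python's md5-based hashing for Rabin fingerprints and picks arbitrary constants, so it only illustrates the technique of min-sampling shingle hashes and comparing the resulting feature sets; it is not the paper's implementation.

    # Sketch of shingle-based resemblance detection: hash every overlapping
    # byte window ("shingle"), keep the smallest N hashes as the file's
    # feature set, and estimate resemblance as feature overlap.
    # Illustrative only: real systems use Rabin fingerprints and tuned
    # window/sample sizes; md5 and the constants below are arbitrary choices.

    import hashlib
    import os

    WINDOW = 24      # shingle size in bytes (the paper's range is 20-30)
    FEATURES = 30    # number of retained features

    def features(data):
        shingles = (data[i:i + WINDOW] for i in range(len(data) - WINDOW + 1))
        hashes = {int.from_bytes(hashlib.md5(s).digest()[:8], "big")
                  for s in shingles}
        return set(sorted(hashes)[:FEATURES])     # min-wise sample

    def resemblance(a, b):
        fa, fb = features(a), features(b)
        return len(fa & fb) / max(len(fa | fb), 1)

    old = os.urandom(4096)
    new = old[1:] + old[:1]      # move the first byte to the end, as in the example above
    print(round(resemblance(old, new), 2))   # close to 1.0, although every fixed-size block differs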
* Opportunistic Use of Content Addressable Storage for Distributed File Systems

Niraj Tolia, Carnegie Mellon University and Intel Research Pittsburgh; Michael Kozuch, Intel Research Pittsburgh; Mahadev Satyanarayanan, Carnegie Mellon University and Intel Research Pittsburgh; Brad Karp, Intel Research Pittsburgh; Thomas Bressoud, Intel Research Pittsburgh and Denison University; Adrian Perrig, Carnegie Mellon University
General Track, pp. 127-140 of the Proceedings
http://www.usenix.org/events/usenix03/tech/tolia.html

Content Addressable Storage means that the file name -- the identifying tag of a piece of data -- is the cryptographic hash of that data's contents. The goal of the project is improving _read_ performance in distributed file systems such as AFS and Coda. Often a group of local (wireless) clients is working with the same remote server. If a local client needs a file, it can ask other clients in the neighborhood whether they have the file or some blocks of that file in their local caches. When a client does a binary install or upgrade, or resumes an image of a commonly used virtual machine, neighboring clients are very likely to possess most of the needed files.

Implementation: split a file into separately addressable chunks. The file server represents a file by its "recipe" -- a small document that contains the description and the version of the file and lists the names of all its chunks (cf. i-node). Recall that the name of a chunk is the hash of that chunk's content. To download a file, the client first asks for the recipe and then checks the neighboring clients to see if they have some of the needed chunks.
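The recipe mechanism can be sketched briefly. The chunking policy, field names and SHA-1 below are illustrative assumptions, not the paper's format; the point is that a file is reassembled from chunks named by their content hashes, fetched from whichever neighbor happens to cache them.

    # Sketch of content-addressable "recipes": a file is named by the list of
    # its chunk hashes; chunks are fetched from whoever has them.
    # Names and chunking policy here are illustrative, not the paper's.

    import hashlib

    CHUNK = 4096

    def chunk_name(data):
        return hashlib.sha1(data).hexdigest()

    def make_recipe(path, version, data):
        chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        return {"path": path,
                "version": version,
                "chunks": [chunk_name(c) for c in chunks]}

    def fetch(recipe, neighbor_caches, server):
        """Reassemble a file, preferring chunks cached by nearby clients."""
        out = []
        for name in recipe["chunks"]:
            for cache in neighbor_caches:
                if name in cache:
                    data = cache[name]
                    break
            else:
                data = server[name]             # fall back to the file server
            assert chunk_name(data) == name     # content addressing makes verification trivial
            out.append(data)
        return b"".join(out)

    # Usage with a made-up path and a toy "server" keyed by chunk hash.
    data = b"x" * 10000
    recipe = make_recipe("/coda/usr/bin/emacs", 1, data)
    server = {chunk_name(data[i:i + CHUNK]): data[i:i + CHUNK]
              for i in range(0, len(data), CHUNK)}
    print(fetch(recipe, neighbor_caches=[{}], server=server) == data)   # True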
* StarFish: Highly-Available Block Storage

Eran Gabber, Jeff Fellin, Michael Flaster, Fengrui Gu, Bruce Hillyer, Wee Teck Ng, Banu Oezden, and Elizabeth Shriver, Lucent Technologies Bell Laboratories
FREENIX Track, pp. 151-164 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/gabber.html

A winner of the best paper award.

StarFish is a RAID-1-like system with geographically distributed, Internet-accessible "disks", which supports on-the-fly replication and dynamic, hot recovery from disk failures. Because the system looks like a SCSI disk, it works with any OS. StarFish is not a distributed file system (it works at the block level), it does not attempt to solve the multiple-writer problem (that is, only one writer is assumed), and it is not iSCSI. StarFish is built on cheap commodity hardware. All the software is open source (cf. EMC SRDF). StarFish is implemented under FreeBSD 4.x. In fact, it has been working in the authors' lab since Feb 28th, 2001.

Implementation: Host Elements and Storage Elements. A Host Element (HE) is a small box with SCSI connectors. To a computer, the HE looks like a SCSI disk. When the HE receives a SCSI write command from the master, the HE assigns the command a sequence number and sends the command, over TCP/IP, to the Storage Elements (SEs). Once the HE receives enough acknowledgments from the SEs, the HE acknowledges the command to the master. When the master sends a SCSI read command, the HE may either ask all SEs for the needed blocks, or ask one SE (the one with the least latency).

Recovery: the HE keeps a log of recent writes, for replaying on SEs that come back from short failures. If an SE was down for an extensive period of time, the SE asks the neighboring SEs for their replay logs (which they keep). While an SE recovers, it receives and executes writes from the HE. Thus SE recovery is "hot": when the recovery is completed, the SE is fully in the current state. HE failure is remedied through replication. Again, StarFish is not a distributed backup system: StarFish does not have a master store. All SEs in StarFish are peers, and all of the SEs may be off-site.

Measurements indicate an overhead of 14% on a PostMark benchmark; the HE and SEs were connected via the Internet with a simulated latency equal to the latency of the Internet connection between NY and LA. Online recovery causes 4-22% degradation in performance. Experiments showed that the best configuration has three SEs, two of which are enough for the quorum (that is, two write ACKs are enough for the HE to send the ACK to the SCSI master). One SE not in the quorum can have high latency (i.e., be far off-site).

* Secure and Flexible Global File Sharing

Stefan Miltchev, University of Pennsylvania; Vassilis Prevelakis, Drexel University; Sotiris Ioannidis, University of Pennsylvania; John Ioannidis, AT&T Labs - Research; Angelos D. Keromytis, Columbia University; Jonathan M. Smith, University of Pennsylvania
FREENIX Track, pp. 165-178 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/miltchev.html

The goal of the DisCFS project is to provide distributed access to remote files in a secure fashion. Clients are identified by their certificates. The file server does not need accounts for every remote client: decentralized access control. The presenter mentioned that they have been using the KeyNote system for managing authorization certificates. KeyNote's signed certificates may contain complex conditions (based on the identity of the requester, his domain, the time of day, etc.). Furthermore, a person can delegate a subset of his permissions to another person. This greatly simplifies the management of certificates.

Implementation: a user-level NFS server, accessible over IPSec. For a macro-benchmark -- a kernel compile -- the processing of credentials is amortized, and there is no difference in performance between DisCFS and NFS (on which DisCFS is based). Future work: implement short-lived refresher credentials, to be presented along with the regular credentials, to support flexible revocation.

* The CryptoGraphic Disk Driver

Roland C. Dowdeswell, The NetBSD Project; John Ioannidis, AT&T Labs Research
FREENIX Track, pp. 179-186 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/dowdeswell.html

The goal is to protect a personal computer, specifically a laptop or a removable disk, in case of theft or unauthorized access. The system is assumed single-user. It's important to keep the overhead as low as possible.

Implementation: an encrypting pseudo-disk. The approach is similar to FreeBSD's vnd, Linux loopback mounts or software RAID, only with an encryption layer in between. An encrypted disk is unlocked with the help of a parameter file, which keeps the keys and other relevant data. The parameter file should be kept on a separate floppy or a USB dongle. In the latter case, there is no need to type a password to unlock the disk (as long as the dongle is present). The system supports N-factor authentication: several parameter files must be present to unlock the disk. Overhead: large compilations off the protected disk are affected by 1-5%.

* The Design of the OpenBSD Cryptographic Framework

Angelos D. Keromytis, Columbia University; Jason L. Wright and Theo de Raadt, OpenBSD Project
General Track, pp. 181-196 of the Proceedings
http://www.usenix.org/events/usenix03/tech/keromytis.html

One (rare) interesting talk was about the cryptographic kernel subsystem in OpenBSD. The subsystem has been used in OpenBSD for over three years. The subsystem provides an abstraction layer over various cryptographic accelerators, crypto-cards, and pure software ciphers. Clients -- user processes or kernel systems such as IPSec -- request cryptographic services in a uniform format, by queuing a request. The presenter showed a slide with the format of the request. One of the fields was named "continuation"! It was exhilarating to see this mention of an advanced programming-language concept; such mentions are vanishingly rare at USENIX.

As the talk progressed, it became clear that the service request is not exported to user processes. Userland can interact with the crypto-system only through /dev/crypto, by doing _synchronous_ I/O. I spoke with the presenter after the talk. I asked him to confirm that this "continuation" (a call-back from the crypto-system) is not an upcall. He confirmed that the call-back is available only to clients within the kernel. He said that if a client is the kernel part of a process, the process immediately goes to sleep after submitting the request. The callback is a three-instruction function that executes a wakeup for the thread.

IPSec is the only special case. IPSec is split into several phases. The first phase processes a raw packet, and queues a request with the crypto-service to verify the signature or decode the payload. After queuing the request, IPSec just "returns" (to the interrupt handler), meaning that, as far as the interrupt handler is concerned, the packet has been processed. The callback extracts the encapsulated packet and queues it for processing by the IP stack. The presenter confirmed that this non-trivial callback is quite specific to IPSec and the TCP/IP stack. The callback operates entirely within the kernel. It doesn't need to (re)acquire locks because the OpenBSD kernel is single-threaded.
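The contrast between in-kernel clients (asynchronous, callback-driven) and userland (synchronous I/O on /dev/crypto) can be modeled in a few lines. This is a toy Python model with invented names, not the OpenBSD framework or its request format:

    # Toy model of an asynchronous crypto service with per-request callbacks
    # ("continuations") for in-kernel clients, and a synchronous wrapper of the
    # kind userland gets through /dev/crypto. Invented names; not OpenBSD code.

    import queue
    import threading

    class CryptoRequest:
        def __init__(self, op, data, callback):
            self.op = op                 # e.g. "encrypt", "verify"
            self.data = data
            self.callback = callback     # invoked when the driver finishes

    class CryptoFramework:
        def __init__(self):
            self.q = queue.Queue()
            threading.Thread(target=self._worker, daemon=True).start()

        def dispatch(self, req):
            self.q.put(req)              # queue the request in a uniform format

        def _worker(self):
            while True:
                req = self.q.get()
                result = req.data[::-1]          # stand-in for the real cipher work
                req.callback(req, result)        # "continuation": notify the client

    framework = CryptoFramework()

    def synchronous_crypt(op, data):
        # What a sleeping process sees: submit, sleep, be woken by the callback.
        done = threading.Event()
        result = []
        def wakeup(req, res):            # the three-instruction callback, in spirit
            result.append(res)
            done.set()
        framework.dispatch(CryptoRequest(op, data, wakeup))
        done.wait()
        return result[0]

    print(synchronous_crypt("encrypt", b"attack at dawn"))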
* A Binary Rewriting Defense against Stack-based Buffer Overflow Attacks

Manish Prasad and Tzi-cker Chiueh, Stony Brook University
General Track, pp. 211-224 of the Proceedings
http://www.usenix.org/events/usenix03/tech/prasad.html

The talk showed how to defend against buffer overflows by instrumenting binary code. No access to the source code is needed. The instrumentation involves the prologue and epilogue of functions. When a function is entered, we remember the return address in a special database. The instrumented epilogue checks that the address we're about to return to is present in the database. In order for the instrumentation to work, we have to reliably identify the prologue and epilogue of functions by _statically_ examining _binary_ code: we have to disassemble the binary code. The authors showed that they can disassemble x86 code with 99.9% accuracy. Still, they miss a few indirect functions -- functions entered not through an explicit call instruction but via an indirect jump whose target cannot be determined statically. Therefore, their technique can't handle hand-crafted assembly, multi-threaded applications, or self-modifying code.

It's amazing how much effort is expended in trying to patch code written in unsafe languages. Perhaps it would be cheaper just to forbid unsafe languages?

* Kernel Support for Faster Web Proxies

Marcel-Catalin Rosu and Daniela Rosu, IBM T.J. Watson Research Center
General Track, pp. 225-238 of the Proceedings
http://www.usenix.org/events/usenix03/tech/rosu.html

Challenge: a web proxy or a web server has to track a great number of open TCP connections. Executing select()/poll() on a large descriptor set is very inefficient. This problem has been described and extensively investigated by Peter Druschel and his students at Rice University. Marcel-Catalin Rosu gave another solution: select() in user space. The kernel exports the state of file descriptors as a read-only shared memory segment. The application's source does not have to be changed: one should merely relink the application with a select() library. User-level select() reduces the overhead by 50-70%. With user-level select(), a web proxy does not saturate the CPU even at 240 req/sec.

* Multiprocessor Support for Event-Driven Programs

Nickolai Zeldovich, Stanford University; Alexander Yip, Frank Dabek, and Robert T. Morris, Massachusetts Institute of Technology; David Mazières, New York University; Frans Kaashoek, Massachusetts Institute of Technology
General Track, pp. 239-252 of the Proceedings
http://www.usenix.org/events/usenix03/tech/zeldovich.html

The event loop of event-driven programs fetches events and executes the associated handlers (callbacks). The callbacks are executed sequentially. This work tries to parallelize callbacks. It assumes coarse-grained parallelism -- most of the events pertain to independent objects. The solution: color the callbacks. This is the standard approach, e.g., for multi-processor-aware memory allocators. Callbacks of different colors are presumed independent and can be executed concurrently. It is the _programmer_ who has to assign colors to callbacks. One can color callbacks by the number of the file descriptor (socket) on which they operate. However, if these callbacks access a shared resource (a cache), then the callbacks must be colored the same.

The authors applied their solution to the SFS toolkit (see the talk by David Mazières at USENIX 2001). Callbacks that execute encryption and decryption operations are parallelizable. Still, the authors obtained a speedup of only x1.5 on a 4 [sic!] CPU box. The presenter blamed the Linux kernel. I questioned whether typical callbacks are indeed independent. Although HTTP requests may seem independent, the corresponding handlers must write data into the common log and update the common statistics.

* POSIX Access Control Lists on Linux

Andreas Gruenbacher, SuSE Labs
FREENIX Track, pp. 259-272 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/gruenbacher.html

The talk gave a brief overview of ACLs and the scope of their implementation on various systems. Alas, there is no standard for ACLs: the POSIX 1003.1e/1003.2c working group dissolved without producing a standard. The latest draft, draft 17, is considered the unofficial ACL document. Flavors of ACLs: POSIX-like (draft 17 and its various implementations), NFSv4, AFS, Windows/CIFS. The upshot: interoperability between POSIX and NFS/Samba may be acceptable. Still, users expecting seamless integration between different ACL systems "are likely to be disappointed".

* Privman: A Library for Partitioning Applications

Douglas Kilpatrick, Network Associates Laboratories
FREENIX Track, pp. 273-284 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/kilpatrick.html

Privman is a library for OpenBSD-style privilege separation. After a privileged process initializes the privman library, it forks itself. The child drops the privileges.
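The pattern the library packages up -- fork, drop privileges in the child, and have the privileged parent serve requests over a socket, as the following paragraphs describe in more detail -- can be sketched generically. This is illustrative Python (names invented; fd passing needs Python 3.9+ on Unix), not the privman C API:

    # Generic sketch of privilege separation, in the spirit described above.
    # Not the privman API: privman is a C library with its own calls and a
    # policy file; this sketch only shows the fork / drop-privileges /
    # ask-the-parent pattern.

    import os
    import socket

    ALLOWED = {"/etc/hosts"}    # stand-in for the policy file; in reality a privileged path

    def parent_loop(sock):
        # Privileged parent: serve open() requests that the policy permits.
        while True:
            path = sock.recv(1024).decode()
            if not path:
                break
            if path in ALLOWED:
                fd = os.open(path, os.O_RDONLY)
                socket.send_fds(sock, [b"ok"], [fd])    # pass the descriptor itself
                os.close(fd)
            else:
                sock.send(b"denied")

    def child_loop(sock):
        # Unprivileged child: do the real work; ask the parent when necessary.
        try:
            os.setgid(65534)
            os.setuid(65534)            # drop to nobody (a no-op unless run as root)
        except PermissionError:
            pass
        sock.send(b"/etc/hosts")
        msg, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
        if msg == b"ok":
            print("child read:", os.read(fds[0], 64))

    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    if os.fork() == 0:
        parent_sock.close()
        child_loop(child_sock)
        os._exit(0)
    else:
        child_sock.close()
        parent_loop(parent_sock)
        os.wait()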
The parent and the child processes communicate via a pipe. Whenever the child needs to execute a privileged operation (open a file it can't normally access, or bind to a privileged port), the child asks the parent to execute the operation (and return the appropriate file descriptor, in the above two examples). Before the parent process executes an operation on behalf of the child, the parent process consults a policy (specified in a flexible configuration file) to determine whether that operation is permitted.

The privileged operation setuid() presents a difficult problem. When a child asks the parent to change the child's uid, the parent must kill the current child process and fork another one with the new uid. However, the parent must preserve the state of the child process across such a restart. In effect, we need to obtain a closure of a process, to run in a new dynamic (new user id) environment. Alas, Unix is ill-suited for obtaining a closure of a process. The author used privman to implement an FTP server. An FTP server does need to execute setuid(). However, the state to preserve is relatively small (only the identity of the connected client and the corresponding /etc/passwd entry).

Micro-benchmarks show up to a 20-times degradation in performance (for opening privileged files, for example). However, macro-benchmarks of FTP performance show only a 5% penalty, with no noticeable decrease in the number of supported clients or in throughput. It seems that privileged operations are not that frequent.

* The TrustedBSD MAC Framework: Extensible Kernel Access Control for FreeBSD 5.0

Robert Watson, Wayne Morrison, and Chris Vance, Network Associates Laboratories; Brian Feldman, The FreeBSD Project
FREENIX Track, pp. 285-296 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/watson.html

The goal of the project is to build a framework for implementing various policies and authorization requirements. All services of the FreeBSD kernel -- VFS, sockets, IPC, etc. -- were modified to invoke the framework for all authorization decisions. The framework in turn invokes the appropriate policy modules. The modules are written by users and can be loaded dynamically (a module can prevent users from unloading it). The framework also supports attaching policy-neutral labels to objects, and a useful composition of policies (intersection of rights). Other policy compositions are possible. The project has implemented quite a few policies: MAC, LOMAC, Biba, fixed-label hierarchical multi-level security, SEBSD, etc. They have finished porting SELinux's FLASK/TE (type enforcement) to SEBSD. The authors spent quite a bit of time making sure that policy decisions are properly serialized on SMP systems. The benchmarks showed that the overhead of labeling slow objects (such as files, whose labeling is implemented via extended attributes on FFS2 under FreeBSD 5.x) is negligible.

* Using Read-Copy-Update Techniques for System V IPC in the Linux 2.5 Kernel

Andrea Arcangeli, SuSE Labs; Mingming Cao, Paul E. McKenney, and Dipankar Sarma, IBM Corporation
FREENIX Track, pp. 297-310 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/arcangeli.html

The topic is _low-level_ concurrent access to (read-mostly) shared data on a multi-processor. The standard method relies on atomic instructions such as test-and-set and atomic increment. These atomic instructions have high overhead because they require flushing the instruction pipeline and committing all outstanding writes.
For example, an atomic increment on a Pentium 4 takes 135 cycles. An alternative, for mostly-constant data (such as routing tables or configuration data), is read-copy-update (RCU). The technique applies when a data structure is accessed via an indirect pointer. For example, slots of the routing table are accessed only through the pointer to the routing table, which is stored in a global kernel table. None of the kernel data structures reference routing slots directly. This assumption implies that no pointer to a routing-table slot remains in CPU registers after a context switch.

To update the piece of shared data, we allocate another piece, copy the old information, apply the updates, and then store the pointer to the new copy in the corresponding kernel table (update the global pointer). However, we cannot deallocate the old copy until all CPUs in the system have performed a context switch. The latter condition guarantees that no outstanding pointer to the old copy of the data exists. To protect against multiple writers, we need a spin lock. The advantage of the RCU technique is that readers need to assert _no_ lock. Readers can access the data structure as if it were completely immutable and not subject to any race condition.

The authors took advantage of RCU in implementing System V semaphores. Allocating and deallocating a semaphore is a relatively rare event (compared with the P and V operations on it). Previously, the table of semaphores and the current semaphore id were protected by a global lock, which added significant overhead. Now the access to the semid and the lookup of a semaphore control block are done without any locking and without any contention. The RCU API and the efficient implementation of System V semaphores are in the Linux 2.5 kernel.
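The access pattern -- lock-free readers, copy-update-and-swap writers -- can be illustrated with a conceptual sketch. Python's garbage collector and its global interpreter lock stand in for the grace-period machinery, so this shows only the shape of the technique, not the memory-ordering guarantees of the kernel RCU API:

    # Conceptual illustration of the read-copy-update access pattern.
    # Readers take no lock; writers copy, modify, and publish a new version by
    # swapping a single reference. In the kernel, the old copy may be freed only
    # after a grace period (every CPU has context-switched); here Python's GC
    # stands in for that, so only the access pattern is shown.

    import threading

    class RoutingTable:
        def __init__(self, routes):
            self.routes = dict(routes)      # treated as immutable once published

    _current = RoutingTable({"10.0.0.0/8": "eth0"})
    _writer_lock = threading.Lock()         # writers still exclude each other

    def lookup(prefix):
        table = _current                    # one dereference; no lock taken
        return table.routes.get(prefix)     # old or new version, never a torn one

    def update(prefix, nexthop):
        global _current
        with _writer_lock:                  # protect against concurrent writers
            new = RoutingTable(_current.routes)    # copy
            new.routes[prefix] = nexthop           # update the private copy
            _current = new                         # publish: swap the pointer
        # kernel RCU would now wait for a grace period before freeing the old copy

    update("192.168.1.0/24", "wlan0")
    print(lookup("192.168.1.0/24"))         # 'wlan0'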
* An Implementation of User-level Restartable Atomic Sequences on the NetBSD Operating System

Gregory McGarry
FREENIX Track, pp. 311-322 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/mcgarry.html

The paper describes a mechanism for implementing atomic operations on a _uniprocessor_ system. Atomic instructions such as test-and-set or fetch-and-add enable higher-level mutual-exclusion facilities. Some CPUs offer atomic instructions, but some (such as the ARM2 and MIPS R3000) do not. Interlocking can be emulated in userland code (Lamport's mutual exclusion algorithm) or by invoking the appropriate kernel facility. Both methods have a high cost; a syscall is especially expensive. Even if a particular platform offers atomic instructions, they are quite expensive to execute.

The paper describes an alternative approach: restartable sequences. Restartable sequences execute entirely in userland. The approach is optimistic: in the best case there is almost no overhead. The approach is general and gives good performance. Restartable sequences turn out to be more efficient even if a platform does offer atomic instructions. Therefore, restartable sequences are now the preferred mechanism for atomic operations in the NetBSD threads library on _uniprocessor_ systems.

A restartable sequence is a _simple_ sequence of instructions: the sequence must not invoke any functions, must not execute any emulated instructions or traps, and must be idempotent -- give the same effect no matter how many times it executes. For example, reading a value and writing a constant into a memory location (the software implementation of test-and-set) is a simple sequence. An application notifies the kernel of the beginning and the end of all restartable sequences within the application. If the kernel, upon a context switch, finds the program counter within the range of a restartable sequence, the program counter is reset to the beginning of the sequence before control is given back to the user thread. Resetting the program counter to the beginning of a sequence is a roll-back. The idempotency requirement of a restartable sequence guarantees that the roll-back is sound. Because the kernel is non-preemptible on a uniprocessor, restarting a sequence guarantees the atomicity of the operation.

* Providing a Linux API on the Scalable K42 Kernel

Jonathan Appavoo, University of Toronto; Marc Auslander, Dilma Da Silva, David Edelsohn, Orran Krieger, Michal Ostrowski, Bryan Rosenburg, Robert W. Wisniewski, and Jimi Xenidis, IBM T.J. Watson Research Center
FREENIX Track, pp. 323-336 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/appavoo.html

K42 is a new open-source operating system kernel being developed at IBM: http://www.research.ibm.com/K42 . K42 is aimed at 64-bit CPUs and is designed to scale from a few to thousands of CPUs. K42 is similar to an Exokernel; its IPC is similar to that of the L4 micro-kernel. It appears that K42 takes advantage of the old SOM technology (part of OpenDoc) -- although the paper never mentioned SOM. The paper describes the implementation of a Linux API and a Linux application binary interface (ABI) on K42. Most Linux applications run on K42 without recompilation (if an important application does not run, the K42 team fixes the OS until the application runs). Linux outperforms K42 with the Linux personality on a uniprocessor, but from about 10 CPUs on, K42 outperforms Linux. K42 scales linearly, and on a 24-CPU multiprocessor K42 is twice as fast as Linux (according to the SDET benchmark).

* BSD SuperBOF

** NetBSD

There will be a grand release, 2.0, in the fourth quarter:
- SMP kernel on x86 and AMD64 platforms
- kqueues
- compatibility layers for Mach and Darwin
- crypto-disk (see the presentation above), UFS2, GCC 3.3

We can now cross-build NetBSD from Linux for any target platform. There are 230 NetBSD developers, who constitute the membership of the NetBSD Foundation. Directors of the foundation -- who must be members, that is, active developers -- are elected for 2-year terms. The foundation coordinates donations and manages intellectual property. For example, NetBSD and pkgsrc are registered trademarks of the foundation. Directors perform mainly administrative work and resolve disputes between developers.

** OpenBSD

OpenBSD has introduced several new facilities that make it very difficult to execute buffer overflow attacks: randomizing the placement of shared libraries in memory, and W^X. The latter, well-publicized facility arranges that any executable piece of memory is not writable by default. That arrangement required extensive modifications to the linker, the dynamic loader, and even the GCC compiler (which used to generate trampolines on the stack and thus needed the stack to be executable). The OpenBSD packet filter pf has been significantly enhanced. One particular new feature is ALTQ, which provides quality-of-service queueing and throttling, triggered based on IP addresses and other conditions.

** FreeBSD

FreeBSD had a very busy year: the release of FreeBSD 5.0 and (just a week before the conference) FreeBSD 5.1. One particular new feature in the works is mandatory access controls and other security policies.
They will be more pervasive in version 5.2 (see the talk on TrustedBSD above).

* Plan9

This year Plan9 had a show-and-tell BoF, which was attended by ca. 60 people. That was quite a surprise. Another surprise is that Plan9 is not dead. Plan9 will truly become open source within a month. Lucent (Bell Labs) is running some cell phone systems on Plan9, with a real-time scheduler. The group of Dave Presotto at Bell Labs is actively developing the system, moving towards high availability. Furthermore, there is a group at MIT that is developing Plan9. There are other developers, in particular at the Los Alamos Lab (in the Computer and Computational Sciences group). The latter group is using Plan9 to make the booting of Linux faster; that feature is important for their 1024-node cluster. They are also looking at Plan9 in the context of Grid computing. As Ron Minnich (LANL) said, Plan9 is far superior to Globus. Ron Minnich said that DOE is looking into Plan9. Plan9 has GCC 3.0, with the entire toolchain.

http://www.cs.bell-labs.com/plan9dist/