;; -*-mode: Outline; -*-

TOC
  General
  Awards
  Keynote. Neal Stephenson: Programmer as a writer. Do it right the first time
  A Logic File System
  Undo for Operators: Building an Undoable E-mail Store
  Role Classification of Hosts within Enterprise Networks Based on Connection Patterns
  A Cooperative Internet Backup Scheme
  The Convergence of Ubiquity: The Future of Wireless Network Security
  Network Programming for the Rest of Us
  In-Place Rsync: File Synchronization for Mobile and Wireless Devices
  Application-specific Delta-encoding via Resemblance Detection
  Opportunistic Use of Content Addressable Storage for Distributed File Systems
  StarFish: Highly-Available Block Storage
  Secure and Flexible Global File Sharing
  The CryptoGraphic Disk Driver
  The Design of the OpenBSD Cryptographic Framework
  A Binary Rewriting Defense against Stack-based Buffer Overflow Attacks
  Kernel Support for Faster Web Proxies
  Multiprocessor Support for Event-Driven Programs
  POSIX Access Control Lists on Linux
  Privman: A Library for Partitioning Applications
  The TrustedBSD MAC Framework: Extensible Kernel Access Control for FreeBSD 5.0
  Using Read-Copy-Update Techniques for System V IPC in the Linux 2.5 Kernel
  An Implementation of User-level Restartable Atomic Sequences on the NetBSD Operating System
  Providing a Linux API on the Scalable K42 Kernel
  BSD SuperBOF
  Plan9

* General

June 12-14, 2003, San Antonio, TX

Attendance: I haven't seen the exact numbers; subjectively there were noticeably fewer people compared to past conferences. USENIX 2001 in Boston was attended by ca. 1500 people. This year there seemed to be only half as many. The exhibition was particularly poor -- there were only about 14 exhibitors.

The conference is selective: in the General track, 24 out of 104 submitted papers were accepted. In the FREENIX track, 27 out of 70 (cf. last year's 53) submitted papers were accepted. In the FREENIX track, each paper had 10-15 reviewers.

* Awards

Rick Adams -- the founder of UUCP (the first wide-area e-mail system), the author of the SLIP protocol, and the founder of UUNET (the second Internet Service Provider in the U.S.) -- received the USENIX Lifetime Achievement (a.k.a. "The Flame") award.

The creators of CVS received the Software Tools User Group (STUG) award. The first version of CVS was a shell script! It was written at the VU in Amsterdam, in 1986.

* Keynote: Programmer as a writer. Do it right the first time

The keynote address was given by Neal Stephenson, the author of Cryptonomicon, Snow Crash, The Diamond Age, and the upcoming Quicksilver.

Neal Stephenson is a professional writer. He started his insightful keynote with the observation that programmers and professional writers have a lot in common. In particular, both groups build complex mental structures, which they must serialize through a keyboard by pressing one key at a time. A good writer (and a good programmer) does not work by distilling good ideas from a large pool of good and bad ones, but by producing few if any bad ideas in the first place. It is important to give ideas time to mature [in the subconscious] so that only good ideas percolate to the conscious level.

The gist of his talk was an argument against a defined _methodology_ of writing (where 'methodology' is to be understood as in 'software development methodology' or in the Capability Maturity Model, CMM). A good novel should be written mostly right the first time. A good novel is written mainly subconsciously, with a great involvement of emotion. Neal Stephenson is a great fan of Antonio Damasio.
He quoted extensively from Damasio's book "Descartes' Error" [see, however, a critique of Damasio at http://www.cogsci.soton.ac.uk/~harnad/Tp/bookrev.htm].

At the beginning of his talk, Neal Stephenson noted that it was 20 years ago, almost to the day, that he signed his first contract and became a professional writer. He told an interesting and amusing story of writing his first book. It was a hot summer in Iowa City. Neal Stephenson had a regular job, and yet had a hunch that writing might be for him. He had written a "query" -- a plot summary, the outline of a book, biographies of characters, and a few sample chapters -- and started to send them to editors, whom he picked at random from trade directories. Many rejection letters followed. Finally, one editor wrote that he was intrigued by the outline and the sample chapter and asked for the rest of the novel. After a brief exhilaration, reality set in: there was no novel yet. He had to write it. With all his vacation time and the 4th of July holiday, there were 10 days in which to write a novel. He rented a modern typewriter, secluded himself in his apartment and started to type. Soon a problem appeared: the typewriter had a modern plastic ribbon. The plastic mellowed and became sticky: it was July in Iowa City, and the apartment was hot. The only way to prevent the ribbon from getting stuck was to keep the ribbon moving. And the only way to keep the ribbon moving was to keep pressing the keys. That discovery did wonders for his productivity. He didn't have time to think: he had to keep pressing the keys and write the first thing that came into his mind. He sent the thus-created manuscript to the editor. The latter replied that his publishing house couldn't print that -- but the work was interesting and should be published. Eventually, Neal Stephenson got an agent, a publisher, and his first published book, "The Big U".

And so Neal Stephenson became a professional writer, and started to think about his next book. He wanted to follow some system of writing a book, so at least he could tell himself that he was not a slacker but a diligent worker. He picked a "distillation theory". The theory states that to write a book you start with a big draft. You keep re-writing -- distilling -- it until you remove the bad parts and bring the draft down to the size of a novel. The process is similar to making good whiskey out of large quantities of bad beer. Neal Stephenson spent a year working on a huge draft. When he started to distill it, he made a grim discovery. The good stuff in the draft was so intertwined with large quantities of bad stuff that it was impossible to separate them. He panicked and quickly wrote a new novel, which was published with little editing -- partly because it was written at the right size the first time. He had found his process: do it right the first time.

He also found that he tends to write good stuff in the morning and bad stuff in the afternoon. So he adopted the following regimen: he would write only in the morning. And to fill the rest of the day, he would help a friend as a construction worker. Surprisingly, that worked well. The construction work provided the "background" for subconscious work, where ideas were born. Some of these ideas would bubble to the conscious surface by the morning, and he would write them down.

Neal Stephenson discovered another remarkably helpful technique: a fountain pen. The great advantage of the pen is that it is _slow_.
Ideas, in his view, come faster than we can "serialize" them on paper or at the keyboard. Therefore, the ideas are stored in an "accumulation buffer" in the brain. While in the buffer, the ideas interact and purify. If we empty the buffer too soon, we don't give the ideas enough time to mature and we get half-baked prose. There is a virtue in slow typing or in hand-writing. Needless to say, Neal is not a fan of PowerPoint. He contends that writing on a blackboard gives a teacher and a student much-needed time to think.

Neal Stephenson gave a few more examples in support of his thesis. For instance, many stories of Charles Dickens were first published in a magazine, over the course of several months. It is not often emphasized that Dickens didn't have time to distill his novels and to write draft after draft. He had to write in monthly installments, with barely time for one draft. He had to do it right the first time.

One member of the audience observed that in software development, the distillation theory is called "the process". That theory states that given a number of average or bad programmers, after many iterations, good code can be produced -- just as everybody can be a writer if they merely have the patience and diligence to write enough drafts. The reality of book writing and of software development shows the falsity of these views. Not everybody can be a good writer. Bad writers always write bad prose. Bad programmers always write bad code.

The message of the talk is also against debugging. It's all too prevalent for programmers to quickly type in code they don't quite understand and aren't sure is correct, and then debug it into shape. Again, experience (e.g., with ftp clients and Sun's tooltalk daemon) shows that code badly written the first time remains buggy and filled with holes, no matter how much it is debugged.

* A Logic File System

Yoann Padioleau and Olivier Ridoux, IRISA / University of Rennes
General Track, pp. 99-112 of the Proceedings
http://www.usenix.org/events/usenix03/tech/padioleau.html

This was the most interesting, insightful, and innovative talk at the conference. The goal of the work is to combine navigation and querying in a file system. Roughly speaking, the talk attempted to bring filesystems from the age of hierarchical and network databases into the age of relational -- and, moreover, deductive -- databases, and yet maintain the familiar cd/ls/mkdir/mv interface, file paths and shell globs.

The current filesystems are built around the hierarchical model (links, if present, are only uni-directional; therefore filesystems do not actually follow the network database model). Unfortunately, a hierarchy is a tree rather than a lattice: a hierarchy decomposes the system around one aspect, disregarding the others. For example, the typical layout of the filesystem emphasizes the scope: /, /usr, /usr/local, /usr/local/share/ghostscript/7.05. Other aspects, e.g., library code, are smeared all over the place: /lib, /usr/lib, /usr/local/lib and /usr/local/share/ghostscript/7.05/lib. Locating files that are local to an installation is easy (/usr/local); locating all the library files is hard. Many interesting queries are difficult to execute or even to formulate: for example, to list all directories that contain only empty subdirectories, to "cd" to a directory with files foo.h or foo.c (a disjunctive query), or to list all the files *.h _or_ *.c in a given tree that are _not_ in (sub)directories named Attic.
To answer such queries, we have to write quite complex logical formulas, made more complex by the fact that we must keep the navigational structure in mind. Many operating systems have tools that impose a relational layer on the file system -- glimpse, locate, find -- and offer a keyword-based search. However, these tools often return a great number of results, which are not grouped in any particular way. For example, "locate sendmail.cf" may return a couple of file names. But "locate !sendmail.cf" (where ! stands for negation), if it were allowed, would list almost every file on the system. Clearly the user will find such a result unmanageable. The user would much prefer the output to be automatically grouped according to the most general and informative categories: the most general keywords that _will_ refine the search. Furthermore, the user should be able to navigate such a virtual directory of results using the familiar 'cd' and 'ls' commands, or reclassify a file with a 'mv' command. Finally, if the user adds or deletes a file, the virtual file tree should be updated automatically.

We would like to "cd /lib" and descend into the "directory" that contains only library files, grouped into subdirectories "root", "usr", "local". We would like to "cd /bin|/sbin" and get to the root of the "filesystem tree" that contains only executable files (again, grouped into "subdirectories"). We may even do "cd !src" to get a view of the filesystem without any source files. The Logical File System described in this talk (LISFS for short, not to be confused with the Log File System) can accomplish all that -- now. The prototype implementation is fully functional and reasonably fast. An ls benchmark shows hardly any performance overhead, despite the far more expressive filesystem queries.

The logical filesystem LISFS takes a relational view: files are entities that are described by their properties (attributes). To locate a file, we execute a query. LISFS lets a user compose a query incrementally: "cd /query" executes the query and makes the corresponding collection current; "cd query1" refines the original query. Incidentally, this means that "cd a/b" and "cd b/a" are the same: the path separator stands for a conjunction, and therefore is commutative. The closed-world assumption makes negation queries possible (e.g., "cd !/lib").

LISFS goes farther than the naive relational model. A user may create taxonomies of attribute values. For example, the user may create a category "lib" with sub-categories "shared-lib" and "static-lib" -- using the mkdir command. LISFS will infer that anything classified as a shared library is also a library. The inference rules can be more complex: we can define a category "dual-lib" for a project's libraries that can be linked both statically and dynamically. When LISFS reifies the result of a query as a filesystem directory, the most general categories that partition the result become the names of subdirectories. LISFS thus lets us navigate the inference tree. The complex inference requires a logical engine -- and indeed the current version of LISFS includes a Prolog interpreter as a part of the filesystem code! Truly, a file path is a logical formula.

Like the semantic file system, LISFS has intrinsic attributes, whose values are derived from the content of a file (e.g., the file's size or the resolution of a JPEG image). Unlike the semantic file system, a user can define his own attributes and organize them into arbitrary taxonomies.
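The query model sketched above can be made concrete with a toy example. The following Python fragment is purely illustrative -- it is not LISFS (which is a user-level NFS server with a logic engine), and all names in it are invented -- but it shows the three ingredients: paths as conjunctive queries with negation, a small taxonomy closure, and listings grouped by the attributes that would refine the search.

    # Toy model of logical-file-system navigation: a path is a conjunctive
    # query over file attributes, not a walk down a fixed hierarchy.
    # Purely illustrative; names and behavior are this sketch's, not LISFS's.

    FILES = {
        "libc.so": {"shared-lib", "usr"},
        "libm.a":  {"static-lib", "usr"},
        "gs.ps":   {"doc", "local"},
        "ls":      {"bin", "root"},
    }

    # A tiny taxonomy: anything classified as shared-lib or static-lib is a lib.
    PARENT = {"shared-lib": "lib", "static-lib": "lib"}

    def closure(attrs):
        # Add implied ancestors (shared-lib implies lib, and so on).
        out = set(attrs)
        for a in attrs:
            while a in PARENT:
                a = PARENT[a]
                out.add(a)
        return out

    def cd(path):
        # Interpret "/a/b/!c" as the query: a AND b AND NOT c.
        # Conjunction is commutative, so "cd a/b" == "cd b/a".
        terms = [t for t in path.split("/") if t]
        result = set()
        for name, attrs in FILES.items():
            attrs = closure(attrs)
            if all((t[1:] not in attrs) if t.startswith("!") else (t in attrs)
                   for t in terms):
                result.add(name)
        return result

    def ls(names):
        # Group a query result by the attributes that still discriminate,
        # i.e. would refine the search -- these play the role of subdirectories.
        refining = {}
        for n in names:
            for a in closure(FILES[n]):
                refining.setdefault(a, set()).add(n)
        return {a: files for a, files in refining.items()
                if 0 < len(files) < len(names)}

    print(cd("/lib"))        # {'libc.so', 'libm.a'} -- inferred via the taxonomy
    print(cd("/lib/!usr"))   # set() -- negation under the closed-world assumption
    print(ls(cd("/lib")))    # {'shared-lib': {'libc.so'}, 'static-lib': {'libm.a'}}

Note that neither file is tagged "lib" directly: the taxonomy supplies the inference, which is the job LISFS delegates to its logic engine in full generality.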
The paper describes the algorithms, the implementation of directory operations, and the obvious optimizations (e.g., caching). LISFS is implemented truly at the low, file-system level, as a user-level NFS daemon. Therefore, no application needs to be modified in any way to take advantage of LISFS. For example, wildcard expansion in a shell (glob) "just works". LISFS is written with the help of a PerlFS toolkit. However, Yoann told me that the implementation language was not truly Perl. Yoann designed his own language with types and an ability to express post-conditions and invariants. That source language was then mechanically translated into Perl. Perhaps Yoann might wish to supply more details?

The only filesystem that comes close to LISFS in terms of the integration of query and navigation was the file system of the Apple Newton (the so-called 'soup'). But NewtonScript never had any logical inference. In future research, Yoann plans to implement views of a file. We could then edit selected sections of a TeX document as if they were a separate file (Emacs' all-mode is a glimpse of that functionality). It seems that LISFS can be meaningfully integrated with the Semantic Web.

Yoann was asked whether a common person (without knowledge of logic, etc.) would be able to meaningfully navigate LISFS. Yoann gave an excellent answer: a common person can use Google. LISFS is not much different.

The prototype and more information are available at www.irisa.fr/LIS

* Undo for Operators: Building an Undoable E-mail Store

Aaron B. Brown and David A. Patterson, University of California, Berkeley
General Track, pp. 1-14 of the Proceedings
http://www.usenix.org/events/usenix03/tech/brown.html

A winner of the best paper award.

Statistics show that operator errors are the dominant cause of Internet service outages. Backups, and now rewindable storage, let the administrator "time travel" -- roll back the system to some good state, fix the problem, and then replay the log of changes. This form of time travel (like time travel in general) may create paradoxes. For example, a bug in anti-virus e-mail scanning software caused the deletion of all e-mail attachments. The administrator notices the error, restores the system from a backup, patches the anti-virus software, and replays the log of arrived messages. A user who had received, read and deleted a message with an empty attachment receives that old message once again, now with the full attachment. The user will be confused -- the deleted message is resurrected, with a different body.

The presenter noted that paradoxes in general cannot be avoided. He proposed to mitigate them. In particular, we can rely on people's ability to tolerate paradoxes, if they are explained. In this respect, e-mail is a perfect application for this approach, because the users are human. Therefore, the goal of the project is to detect paradoxes and send an explanatory e-mail message.

The authors have built a prototype based on rewindable storage (from Network Appliance) and SMTP/IMAP proxies. The proxies monitor user interactions with the e-mail system and maintain a log of events. An event is sending a message, reading a particular message, deleting it, etc. The record of an event must identify its action (the "verb") and its attributes. The latter must be in an environment-independent form: not "deleting message number 42" but deleting the message with a particular message-id. Each action verb is associated with compensation/explanation hooks.
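To make the preceding description concrete, here is a minimal sketch of what a verb-based event record with compensation hooks might look like. The names and structure are invented for illustration; they are not the authors' prototype API.

    # Sketch of verb-based event records with compensation/explanation hooks.
    # All names here are illustrative, not the authors' prototype.

    from dataclasses import dataclass

    @dataclass
    class Event:
        verb: str     # "deliver", "read", "delete", ...
        attrs: dict   # environment-independent: message-id, body hash --
                      # never "message number 42"

    COMPENSATIONS = {}

    def compensates(verb):
        # Register an explanation/compensation hook for a verb.
        def register(fn):
            COMPENSATIONS[verb] = fn
            return fn
        return register

    @compensates("deliver")
    def explain_changed_delivery(event, replayed_msg):
        # Invoked when the replayed message diverges from the recorded one,
        # e.g. its body hash changed after the anti-virus software was fixed.
        return ("To %s: message %s was re-delivered after a repair; "
                "its content may differ from what you saw before."
                % (replayed_msg["rcpt"], event.attrs["message-id"]))

    def replay(log, replayed_state):
        explanations = []
        for event in log:
            msg = replayed_state.get(event.attrs["message-id"])
            if msg is not None and msg["body-hash"] != event.attrs["body-hash"]:
                hook = COMPENSATIONS.get(event.verb)
                if hook:
                    explanations.append(hook(event, msg))
        return explanations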
Play-forward routines check for consistency between the replayed state and the recorded event stream. If an inconsistency is detected (e.g., a message with a particular id now has a different hash of the body), the compensation hook associated with the verb is invoked. The identification of events and the development of compensation/explanation routines are the responsibility of the sysadmin. Furthermore, as Eric Allman observed, SMTP is a far more complex protocol, which cannot be reduced to a single "Deliver" event (as it is in the prototype).

I noted that explanatory messages sent during the forward replay are themselves a paradox. The author agreed and said that the explanatory messages must be identified as such, to prevent the system from diverging. That means the system is not modular: a compensating routine must be aware of the compensation done by all the other routines. Furthermore, the effect of "compensating" for one paradox may trigger, in an indirect way, another paradox (e.g., sending a message that bounces and causes mailer-daemon messages) -- in a way that is difficult to identify. There doesn't seem to be a good way to keep the paradoxes from snowballing. The presenter said that they are considering running the forward-replaying process in a VM under VMware, so they can tightly monitor and control the process.

I got the impression that the whole approach is not thought through. The paper seems very ad hoc. I'm puzzled how it was chosen as one of the two best papers.

* Role Classification of Hosts within Enterprise Networks Based on Connection Patterns

Godfrey Tan, Massachusetts Institute of Technology; Massimiliano Poletto, Mazu Networks; John Guttag and Frans Kaashoek, Massachusetts Institute of Technology
General Track, pp. 15-28 of the Proceedings
http://www.usenix.org/events/usenix03/tech/tan.html

The idea is to group hosts based on their communication patterns -- more precisely, on the number of bytes transferred between pairs of hosts over the observation period. The present paper is yet another paper (out of many that followed at USENIX 2003) whose motivation is sorely lacking and whose approach is ad hoc at best. This and many other presented papers originated from summer internships. Perhaps that can explain their immaturity.

The authors claimed that their host classification can be used for intrusion detection. Given the fact that they need to accumulate quite a lot of statistics, their intrusion detection would come a bit late. Furthermore, an intruder who is not a fool takes care not to be conspicuous. If he must probe other machines, he does so at a low rate.

The paper presented a grouping algorithm and a group-differencing algorithm. The latter correlates two groups of hosts identified at two different times. The group identification algorithm builds the graph of hosts that communicate "regularly" and takes two-connected components as the initial groups. The latter are then merged according to various heuristics. The "regular communication" requirement all but precludes the usefulness of the system as an intrusion-detection tool.

* A Cooperative Internet Backup Scheme

Mark Lillibridge, Sameh Elnikety, Andrew Birrell, Mike Burrows, and Michael Isard, HP Systems Research Center
General Track, pp. 29-42 of the Proceedings
http://www.usenix.org/events/usenix03/tech/lillibridge.html

The talk proposed a variation of Internet backup at a far lower price: 25 cents per MB per month, compared with the typical $7/MB/month of commercial Internet backup providers. The price is so low because the users of the system store (parts of) each other's backups. The proposed backup system has no centralized state, no trust assumptions among the backup partners, no need for fixed identities of the partners, and no administrative overhead.

The backup software creates a back-up file, expands it with a Reed-Solomon erasure code, encrypts the blocks (it is important to encrypt after the expansion!), signs each block, and distributes the blocks among the partners. Currently, the list of partners is obtained from a central server. In all other respects, the system is peer-to-peer.

Most of the talk was a presentation of quite clever algorithms for the detection and punishment of free-loaders -- people who participate in the system but don't store backup blocks of other people. To check if a partner indeed stores the blocks we sent him, from time to time we send the partner a nonce and a block number and ask for the hash of that block. The system must be gracious enough to distinguish between a freeloader and a computer that merely crashed. Therefore, the system includes the notion of a "pre-payment". When a new user joins the system, for three weeks he has to store backup blocks of his partners "for free" (without any right to send the partners his own backup blocks). After three weeks of good behavior (with partners verifying that the blocks are indeed stored), the new user gains the backup benefits. The presenter also explained a sophisticated algorithm to detect a man-in-the-middle attack.

The system is implemented. It provides a backup rate of 1MB/sec on a LAN on Pentium Pro 20. Some of the algorithms are noteworthy, and can be applied to any peer-to-peer service with untrusted partners.
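The storage-verification challenge lends itself to a short sketch. The Python below is illustrative only -- the paper's actual message formats and hash choices are not reproduced -- but it shows the nonce-plus-block-number challenge and why only a partner that really kept the block can answer it (the challenger can recompute the expected answer from its own copy of the data).

    # Sketch of the challenge-response check that a partner really stores a block.
    # Illustrative only -- not the protocol or the code from the paper.

    import hashlib
    import os

    class Partner:
        """The remote peer holding (encrypted, signed) backup blocks for us."""
        def __init__(self):
            self.blocks = {}            # block number -> block bytes

        def store(self, number, block):
            self.blocks[number] = block

        def respond(self, number, nonce):
            # A partner that still has the block can compute H(nonce || block);
            # a freeloader that discarded it cannot.
            block = self.blocks.get(number)
            if block is None:
                return None
            return hashlib.sha1(nonce + block).digest()

    def challenge(partner, number, my_copy_of_block):
        # The challenger regenerates the block (or its hash) from its own data.
        nonce = os.urandom(16)
        expected = hashlib.sha1(nonce + my_copy_of_block).digest()
        return partner.respond(number, nonce) == expected

    # Usage: periodically challenge partners about randomly chosen blocks.
    p = Partner()
    p.store(7, b"encrypted+signed block payload")
    print(challenge(p, 7, b"encrypted+signed block payload"))   # True
    print(challenge(p, 8, b"another block we think we sent"))   # False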
* The Convergence of Ubiquity: The Future of Wireless Network Security

William A. Arbaugh, University of Maryland, College Park
Invited Talk

The talk discussed several difficult-to-solve or yet-unsolved security problems in wireless (in particular, 802.11) networks:
 -- wireless security is different from wired security: hop-by-hop vs. end-to-end
 -- problems with WEP and potential problems with WPA
 -- Denial-of-Service (DoS) attacks: they are easy, numerous, and _not_ preventable

** wireless security is different from wired security: hop-by-hop vs. end-to-end

In a wired network, many computers are protected by a firewall. An external attacker typically has no access to the LAN and the LAN protocol. He cannot spoof ARP, or monitor and interfere at the transport (Ethernet, PPP) layer. In a wireless network, an external attacker can do all that. In a wireless network, every attacker is an insider. In the 802.11 group of standards, frequency hopping and spread spectrum are predictable; therefore, they offer no defense. With a parabolic antenna, one can access a wireless network up to 2 miles from the access point!

End-to-end security is orthogonal to transport-layer (hop-by-hop) security. End-to-end security (such as SSL) does not protect against routing attacks, DoS, or social engineering. For example, when a wireless user is connecting to his bank but is instead redirected to the attacker's site, the browser will pop up a warning, saying "a certificate issued to www.your-bank.com cannot be verified. Do you want to proceed, cancel, or look at the certificate?" It's too easy to make a forged certificate look like the real one. A user not trained in security will probably notice no difference and accept the forged certificate.

** problems with WEP and potential problems with WPA

The presenter said that WEP -- Wired Equivalent Privacy -- is a very unlucky security standard. Many of the attacks against WEP could only succeed in very specific circumstances (if bytes are arranged a certain way, or if ciphers are composed in a certain way). Unfortunately, WEP uses the ciphers precisely in that most vulnerable way. Some particular problems with WEP: it relies on the MAC address (Media Access Control address: the Ethernet address of the wireless card) as the client's identifier. Alas, the MAC address is stored in flash memory and therefore is mutable. An attacker can change his MAC address to be anything he wishes. The CRC in WEP offers no integrity protection: the CRC and the XOR operation in the encryption layer are both linear. An attacker may flip bits in a packet without breaking the CRC. WEP does not protect against replay of packets at all.

WPA is a stop-gap attempt to fix WEP. WPA uses two new algorithms -- TKIP and Michael -- which are not mature and not well-tested. In fact, there was a report that TKIP was broken.

** Denial-of-Service (DoS) is a real problem

Alas, all current and under-consideration wireless standards seem to ignore DoS. Furthermore, they make DoS too easy. One can easily turn a microwave oven into a transmitter that jams all wireless networks within a wide range. Unfortunately, one can inflict damage against a specific access point in a far easier way. For example, management frames in 802.11 are not authenticated and can easily be spoofed. An attacker can pretend to be an access point and occasionally transmit frames saying "don't transmit for xx seconds" or "go to sleep", and effectively shut down the network. Furthermore, there are tools that do exactly that: Air-Jack and WLAN-Jack. A Google search for these names will give the source code and executables. The presenter demonstrated these tools. He said he uses them in the class he teaches at the university, to prevent students from browsing the web while in class.

* Network Programming for the Rest of Us

Glyph Lefkowitz, Twisted Matrix Labs; Itamar Shtull-Trauring, Zoteca
FREENIX Track, pp. 77-90 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/lefkowitz.html

The presenter started his talk with the remark that manual-transmission cars indeed offer more control. However, an automatic-transmission car is much easier to use, and is good enough. During the talk, he urged programmers several times to use a higher-level networking framework. The talk presented one such networking framework -- implemented in a high-level language, Python. The presenter made a good case for a higher-level language.

Notable features of their framework:
- Cross-platform.
- Low-level peculiarities are not hidden. The framework offers portable (Windows, UNIX, etc.) networking. A user can still set a particular socket option if he knows what he's doing. This is in contrast to the least-common-denominator approach of Java.
- The event loop is made explicit. A user can integrate networking events with GUI and other frameworks' events. Actually, all asynchronous networking I/O is based on an event loop. The presenter said that multi-threading is easy to add to an event loop (message passing); the reverse is harder.
- The event loop is integrated with select(), poll(), kqueue and WaitForMultipleObjects, depending on the platform.
- Testing is integrated into the framework. Currently they have ca. 700 tests. Regression tests are run _automatically_ on each CVS check-in.
- Separation of the protocol (HTTP, SMTP, FTP) from the transport (TCP, UDP, SSL).
- Many high-level protocols are supported: HTTP, NNTP, SSH, IRC, FTP, etc.
- The code is LGPL and is available at www.twistedmatrix.com

They have ca. 20 developers. I asked the presenter if they had investigated any other high-level language, such as OCaml. He didn't seem to be familiar with OCaml. He said that they inherited a prototype framework written in Python, and have been enhancing it.

* In-Place Rsync: File Synchronization for Mobile and Wireless Devices

David Rasch and Randal Burns, Johns Hopkins University
FREENIX Track, pp. 91-100 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/rasch.html

The ordinary rsync protocol needs temporary space for the file being synchronized. That can be a problem for mobile and other memory-constrained devices. The talk presented a modification of the rsync protocol that operates in-place -- that is, it applies updates to an existing old copy of the file in-place, bringing the copy up-to-date. The rsync protocol sends the patches as COPY and ADD operations. In the modified rsync protocol, the server topologically sorts the COPY operations and sends them in an order that avoids conflicts. In rare circumstances (circular dependencies), a COPY is converted into an ADD. ADD operations now have to carry an offset into the target file. Compression suffers a little, but the overall result is encouraging.

* Application-specific Delta-encoding via Resemblance Detection

Fred Douglis and Arun Iyengar, IBM T.J. Watson Research Center
General Track, pp. 113-126 of the Proceedings
http://www.usenix.org/events/usenix03/tech/douglis.html

The topic of the talk is grouping files in the file system according to their resemblance, to increase the efficiency of delta-encoding. Delta-encoding is a differential encoding against some base. An example: rsync. Rsync takes an old version of a file as the base. What if we don't have an earlier version of a file? What if the earlier version was renamed or moved to a different place? The goal of the talk was therefore finding a suitable base for a given file -- to be more precise, finding a collection of similar files so that the files in the collection can be delta-encoded against each other. Possible applications: mail servers or web caches.

To determine if two files resemble each other, the standard technique was used: extracting features as "shingles" (a sliding window of 20-30 bytes: a truncated suffix tree), Rabin-fingerprinting the shingles, subsampling the space of features, and measuring the resemblance in that space. It turns out that 30 features are enough to judge the similarity of files. Experiments confirmed that resemblance implies good delta-encoding. To be more precise, given two files that share 5 out of 30 features, difference encoding was better than compression. VCDiff was used for delta-encoding.

Using shingles rather than file blocks (as Pastiche does) let the authors detect similarity in far more cases. For example, if we take a file, remove the first byte from the beginning and add that byte at the very end, we end up with a file that differs from the original one in every block. Fred Douglis's system will still find the two files highly similar, and so will efficiently delta-encode them.
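A minimal sketch of the shingling idea follows. It substitutes Python's md5-based hashing for Rabin fingerprints and picks arbitrary constants, so it only illustrates the technique of min-sampling shingle hashes and comparing the resulting feature sets; it is not the paper's implementation.

    # Sketch of shingle-based resemblance detection: hash every overlapping
    # byte window ("shingle"), keep the smallest N hashes as the file's
    # feature set, and estimate resemblance as feature overlap.
    # Illustrative only: real systems use Rabin fingerprints and tuned
    # window/sample sizes; md5 and the constants below are arbitrary choices.

    import hashlib
    import os

    WINDOW = 24      # shingle size in bytes (the paper's range is 20-30)
    FEATURES = 30    # number of retained features

    def features(data):
        shingles = (data[i:i + WINDOW] for i in range(len(data) - WINDOW + 1))
        hashes = {int.from_bytes(hashlib.md5(s).digest()[:8], "big")
                  for s in shingles}
        return set(sorted(hashes)[:FEATURES])     # min-wise sample

    def resemblance(a, b):
        fa, fb = features(a), features(b)
        return len(fa & fb) / max(len(fa | fb), 1)

    old = os.urandom(4096)
    new = old[1:] + old[:1]      # move the first byte to the end, as in the example above
    print(round(resemblance(old, new), 2))   # close to 1.0, although every fixed-size block differs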
* Opportunistic Use of Content Addressable Storage for Distributed File Systems

Niraj Tolia, Carnegie Mellon University and Intel Research Pittsburgh; Michael Kozuch, Intel Research Pittsburgh; Mahadev Satyanarayanan, Carnegie Mellon University and Intel Research Pittsburgh; Brad Karp, Intel Research Pittsburgh; Thomas Bressoud, Intel Research Pittsburgh and Denison University; Adrian Perrig, Carnegie Mellon University
General Track, pp. 127-140 of the Proceedings
http://www.usenix.org/events/usenix03/tech/tolia.html

Content Addressable Storage means that the file name -- the identifying tag of a piece of data -- is the cryptographic hash of that data's contents. The goal of the project is improving _read_ performance in distributed file systems such as AFS and Coda. Often a group of local (wireless) clients is working with the same remote server. If a local client needs a file, it can ask other clients in the neighborhood whether they have the file or some blocks of that file in their local caches. When a client does a binary install or upgrade, or resumes an image of a commonly used virtual machine, neighboring clients are very likely to possess most of the needed files.

Implementation: split a file into separately addressable chunks. The file server represents a file by its "recipe" -- a small document that contains the description and the version of the file and lists the names of all its chunks (cf. i-node). Recall that the name of a chunk is the hash of that chunk's content. To download a file, the client first asks for the recipe and then checks the neighboring clients to see if they have some of the needed chunks.
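The recipe mechanism can be sketched briefly. The chunking policy, field names and SHA-1 below are illustrative assumptions, not the paper's format; the point is that a file is reassembled from chunks named by their content hashes, fetched from whichever neighbor happens to cache them.

    # Sketch of content-addressable "recipes": a file is named by the list of
    # its chunk hashes; chunks are fetched from whoever has them.
    # Names and chunking policy here are illustrative, not the paper's.

    import hashlib

    CHUNK = 4096

    def chunk_name(data):
        return hashlib.sha1(data).hexdigest()

    def make_recipe(path, version, data):
        chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        return {"path": path,
                "version": version,
                "chunks": [chunk_name(c) for c in chunks]}

    def fetch(recipe, neighbor_caches, server):
        """Reassemble a file, preferring chunks cached by nearby clients."""
        out = []
        for name in recipe["chunks"]:
            for cache in neighbor_caches:
                if name in cache:
                    data = cache[name]
                    break
            else:
                data = server[name]             # fall back to the file server
            assert chunk_name(data) == name     # content addressing makes verification trivial
            out.append(data)
        return b"".join(out)

    # Usage with a made-up path and a toy "server" keyed by chunk hash.
    data = b"x" * 10000
    recipe = make_recipe("/coda/usr/bin/emacs", 1, data)
    server = {chunk_name(data[i:i + CHUNK]): data[i:i + CHUNK]
              for i in range(0, len(data), CHUNK)}
    print(fetch(recipe, neighbor_caches=[{}], server=server) == data)   # True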
* StarFish: Highly-Available Block Storage

Eran Gabber, Jeff Fellin, Michael Flaster, Fengrui Gu, Bruce Hillyer, Wee Teck Ng, Banu Oezden, and Elizabeth Shriver, Lucent Technologies Bell Laboratories
FREENIX Track, pp. 151-164 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/gabber.html

A winner of the best paper award.

StarFish is a RAID-1-like system with geographically distributed, Internet-accessible "disks", which supports on-the-fly replication and dynamic, hot recovery from disk failures. Because the system looks like a SCSI disk, it works with any OS. StarFish is not a distributed file system (it works at the block level), it does not attempt to solve the multiple-writer problem (that is, only one writer is assumed), and it is not iSCSI. StarFish is built on cheap commodity hardware. All the software is open source (cf. EMC SRDF). StarFish is implemented under FreeBSD 4.x. In fact, it has been working in the authors' lab since Feb 28th, 2001.

Implementation: Host Elements and Storage Elements. A Host Element (HE) is a small box with SCSI connectors. To a computer, the HE looks like a SCSI disk. When the HE receives a SCSI write command from the master, the HE assigns the command a sequence number and sends the command, over TCP/IP, to the Storage Elements (SEs). Once the HE receives enough acknowledgments from the SEs, the HE acknowledges the command to the master. When the master sends a SCSI read command, the HE may either ask all SEs for the needed blocks, or ask one SE (the one with the least latency).

Recovery: the HE keeps a log of recent writes, for replaying on SEs that come back from short failures. If an SE was down for an extensive period of time, the SE asks the neighboring SEs for their replay logs (which they keep). While an SE recovers, it receives and executes writes from the HE. Thus SE recovery is "hot": when the recovery is completed, the SE is fully in the current state. HE failure is remedied through replication. Again, StarFish is not a distributed backup system: StarFish does not have a master store. All SEs in StarFish are peers, and all of the SEs may be off-site.

Measurements indicate an overhead of 14% on a PostMark benchmark; the HE and SEs were connected via the Internet with a simulated latency equal to the latency of the Internet connection between NY and LA. Online recovery causes 4-22% degradation in performance. Experiments showed that the best configuration has three SEs, two of which are enough for the quorum (that is, two write ACKs are enough for the HE to send the ACK to the SCSI master). One SE not in the quorum can have high latency (i.e., be far off-site).

* Secure and Flexible Global File Sharing

Stefan Miltchev, University of Pennsylvania; Vassilis Prevelakis, Drexel University; Sotiris Ioannidis, University of Pennsylvania; John Ioannidis, AT&T Labs - Research; Angelos D. Keromytis, Columbia University; Jonathan M. Smith, University of Pennsylvania
FREENIX Track, pp. 165-178 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/miltchev.html

The goal of the DisCFS project is to provide distributed access to remote files in a secure fashion. Clients are identified by their certificates. The file server does not need accounts for every remote client: decentralized access control. The presenter mentioned that they have been using the KeyNote system for managing authorization certificates. KeyNote's signed certificates may contain complex conditions (based on the identity of the requester, his domain, the time of day, etc.). Furthermore, a person can delegate a subset of his permissions to another person. This greatly simplifies the management of certificates.

Implementation: a user-level NFS server, accessible over IPSec. For a macro-benchmark -- a kernel compile -- the processing of credentials is amortized, and there is no difference in performance between DisCFS and NFS (on which DisCFS is based). Future work: implement short-lived refresher credentials, to be presented along with the regular credentials, to support flexible revocation.

* The CryptoGraphic Disk Driver

Roland C. Dowdeswell, The NetBSD Project; John Ioannidis, AT&T Labs Research
FREENIX Track, pp. 179-186 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/dowdeswell.html

The goal is to protect a personal computer, specifically a laptop or a removable disk, in case of theft or unauthorized access. The system is assumed single-user. It's important to keep the overhead as low as possible.

Implementation: an encrypting pseudo-disk. The approach is similar to FreeBSD's vnd, Linux loopback mounts or software RAID, only with an encryption layer in between. An encrypted disk is unlocked with the help of a parameter file, which keeps the keys and other relevant data. The parameter file should be kept on a separate floppy or a USB dongle. In the latter case, there is no need to type a password to unlock the disk (as long as the dongle is present). The system supports N-factor authentication: several parameter files must be present to unlock the disk. Overhead: large compilations off the protected disk are affected by 1-5%.

* The Design of the OpenBSD Cryptographic Framework

Angelos D. Keromytis, Columbia University; Jason L. Wright and Theo de Raadt, OpenBSD Project
General Track, pp. 181-196 of the Proceedings
http://www.usenix.org/events/usenix03/tech/keromytis.html

One (rare) interesting talk was about the cryptographic kernel subsystem in OpenBSD. The subsystem has been used in OpenBSD for over three years. The subsystem provides an abstraction layer over various cryptographic accelerators, crypto-cards, and pure software ciphers. Clients -- user processes or kernel systems such as IPSec -- request cryptographic services in a uniform format, by queuing a request. The presenter showed a slide with the format of the request. One of the fields was named "continuation"! It was exhilarating to see this mention of an advanced programming-language concept; such mentions are vanishingly rare at USENIX.

As the talk progressed, it became clear that the service request is not exported to user processes. Userland can interact with the crypto-system only through /dev/crypto, by doing _synchronous_ I/O. I spoke with the presenter after the talk. I asked him to confirm that this "continuation" (a call-back from the crypto-system) is not an upcall. He confirmed that the call-back is available only to clients within the kernel. He said that if a client is the kernel part of a process, the process immediately goes to sleep after submitting the request. The callback is a three-instruction function that executes a wakeup for the thread.

IPSec is the only special case. IPSec is split into several phases. The first phase processes a raw packet, and queues a request with the crypto-service to verify the signature or decode the payload. After queuing the request, IPSec just "returns" (to the interrupt handler), meaning that, as far as the interrupt handler is concerned, the packet has been processed. The callback extracts the encapsulated packet and queues it for processing by the IP stack. The presenter confirmed that this non-trivial callback is quite specific to IPSec and the TCP/IP stack. The callback operates entirely within the kernel. It doesn't need to (re)acquire locks because the OpenBSD kernel is single-threaded.
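The contrast between in-kernel clients (asynchronous, callback-driven) and userland (synchronous I/O on /dev/crypto) can be modeled in a few lines. This is a toy Python model with invented names, not the OpenBSD framework or its request format:

    # Toy model of an asynchronous crypto service with per-request callbacks
    # ("continuations") for in-kernel clients, and a synchronous wrapper of the
    # kind userland gets through /dev/crypto. Invented names; not OpenBSD code.

    import queue
    import threading

    class CryptoRequest:
        def __init__(self, op, data, callback):
            self.op = op                 # e.g. "encrypt", "verify"
            self.data = data
            self.callback = callback     # invoked when the driver finishes

    class CryptoFramework:
        def __init__(self):
            self.q = queue.Queue()
            threading.Thread(target=self._worker, daemon=True).start()

        def dispatch(self, req):
            self.q.put(req)              # queue the request in a uniform format

        def _worker(self):
            while True:
                req = self.q.get()
                result = req.data[::-1]          # stand-in for the real cipher work
                req.callback(req, result)        # "continuation": notify the client

    framework = CryptoFramework()

    def synchronous_crypt(op, data):
        # What a sleeping process sees: submit, sleep, be woken by the callback.
        done = threading.Event()
        result = []
        def wakeup(req, res):            # the three-instruction callback, in spirit
            result.append(res)
            done.set()
        framework.dispatch(CryptoRequest(op, data, wakeup))
        done.wait()
        return result[0]

    print(synchronous_crypt("encrypt", b"attack at dawn"))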
* A Binary Rewriting Defense against Stack-based Buffer Overflow Attacks

Manish Prasad and Tzi-cker Chiueh, Stony Brook University
General Track, pp. 211-224 of the Proceedings
http://www.usenix.org/events/usenix03/tech/prasad.html

The talk showed how to defend against buffer overflows by instrumenting binary code. No access to the source code is needed. The instrumentation involves the prologue and epilogue of functions. When a function is entered, we remember the return address in a special database. The instrumented epilogue checks that the address we're about to return to is present in the database. In order for the instrumentation to work, we have to reliably identify the prologue and epilogue of functions by _statically_ examining _binary_ code: we have to disassemble the binary code. The authors showed that they can disassemble x86 code with 99.9% accuracy. Still, they miss a few indirect functions -- functions entered not through an explicit call instruction but via an indirect jump whose target cannot be determined statically. Therefore, their technique can't handle hand-crafted assembly, multi-threaded applications, or self-modifying code.

It's amazing how much effort is expended in trying to patch code written in unsafe languages. Perhaps it would be cheaper just to forbid unsafe languages?

* Kernel Support for Faster Web Proxies

Marcel-Catalin Rosu and Daniela Rosu, IBM T.J. Watson Research Center
General Track, pp. 225-238 of the Proceedings
http://www.usenix.org/events/usenix03/tech/rosu.html

Challenge: a web proxy or a web server has to track a great number of open TCP connections. Executing select()/poll() on a large descriptor set is very inefficient. This problem has been described and extensively investigated by Peter Druschel and his students at Rice University. Marcel-Catalin Rosu gave another solution: select() in user space. The kernel exports the state of file descriptors as a read-only shared memory segment. The application's source does not have to be changed: one should merely relink the application with a select() library. User-level select() reduces the overhead by 50-70%. With user-level select(), a web proxy does not saturate the CPU even at 240 req/sec.

* Multiprocessor Support for Event-Driven Programs

Nickolai Zeldovich, Stanford University; Alexander Yip, Frank Dabek, and Robert T. Morris, Massachusetts Institute of Technology; David Mazières, New York University; Frans Kaashoek, Massachusetts Institute of Technology
General Track, pp. 239-252 of the Proceedings
http://www.usenix.org/events/usenix03/tech/zeldovich.html

The event loop of event-driven programs fetches events and executes the associated handlers (callbacks). The callbacks are executed sequentially. This work tries to parallelize callbacks. It assumes coarse-grained parallelism -- most of the events pertain to independent objects. The solution: color the callbacks. This is the standard approach, e.g., for multi-processor-aware memory allocators. Callbacks of different colors are presumed independent and can be executed concurrently. It is the _programmer_ who has to assign colors to callbacks. One can color callbacks by the number of the file descriptor (socket) on which they operate. However, if these callbacks access a shared resource (a cache), then the callbacks must be colored the same.

The authors applied their solution to the SFS toolkit (see the talk by David Mazières at USENIX 2001). Callbacks that execute encryption and decryption operations are parallelizable. Still, the authors obtained a speedup of only x1.5 on a 4 [sic!] CPU box. The presenter blamed the Linux kernel. I questioned whether typical callbacks are indeed independent. Although HTTP requests may seem independent, the corresponding handlers must write data into the common log and update the common statistics.

* POSIX Access Control Lists on Linux

Andreas Gruenbacher, SuSE Labs
FREENIX Track, pp. 259-272 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/gruenbacher.html

The talk gave a brief overview of ACLs and the scope of their implementation on various systems. Alas, there is no standard for ACLs: the POSIX 1003.1e/1003.2c working group dissolved without producing a standard. The latest draft, draft 17, is considered the unofficial ACL document. Flavors of ACLs: POSIX-like (draft 17 and its various implementations), NFSv4, AFS, Windows/CIFS. The upshot: interoperability between POSIX and NFS/Samba may be acceptable. Still, users expecting seamless integration between different ACL systems "are likely to be disappointed".

* Privman: A Library for Partitioning Applications

Douglas Kilpatrick, Network Associates Laboratories
FREENIX Track, pp. 273-284 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/kilpatrick.html

Privman is a library for OpenBSD-style privilege separation. After a privileged process initializes the privman library, it forks itself. The child drops the privileges.
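The pattern the library packages up -- fork, drop privileges in the child, and have the privileged parent serve requests over a socket, as the following paragraphs describe in more detail -- can be sketched generically. This is illustrative Python (names invented; fd passing needs Python 3.9+ on Unix), not the privman C API:

    # Generic sketch of privilege separation, in the spirit described above.
    # Not the privman API: privman is a C library with its own calls and a
    # policy file; this sketch only shows the fork / drop-privileges /
    # ask-the-parent pattern.

    import os
    import socket

    ALLOWED = {"/etc/hosts"}    # stand-in for the policy file; in reality a privileged path

    def parent_loop(sock):
        # Privileged parent: serve open() requests that the policy permits.
        while True:
            path = sock.recv(1024).decode()
            if not path:
                break
            if path in ALLOWED:
                fd = os.open(path, os.O_RDONLY)
                socket.send_fds(sock, [b"ok"], [fd])    # pass the descriptor itself
                os.close(fd)
            else:
                sock.send(b"denied")

    def child_loop(sock):
        # Unprivileged child: do the real work; ask the parent when necessary.
        try:
            os.setgid(65534)
            os.setuid(65534)            # drop to nobody (a no-op unless run as root)
        except PermissionError:
            pass
        sock.send(b"/etc/hosts")
        msg, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
        if msg == b"ok":
            print("child read:", os.read(fds[0], 64))

    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    if os.fork() == 0:
        parent_sock.close()
        child_loop(child_sock)
        os._exit(0)
    else:
        child_sock.close()
        parent_loop(parent_sock)
        os.wait()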
The parent and the child processes communicate via a pipe. Whenever the child needs to execute a privileged operation (open a file it can't normally access, or bind to a privileged port), the child asks the parent to execute the operation (and return the appropriate file descriptor, in the above two examples). Before the parent process executes an operation on behalf of the child, the parent process consults a policy (specified in a flexible configuration file) to determine whether that operation is permitted.

The privileged operation setuid() presents a difficult problem. When a child asks the parent to change the child's uid, the parent must kill the current child process and fork another one with the new uid. However, the parent must preserve the state of the child process across such a restart. In effect, we need to obtain a closure of a process, to run in a new dynamic (new user id) environment. Alas, Unix is ill-suited for obtaining a closure of a process. The author used privman to implement an FTP server. An FTP server does need to execute setuid(). However, the state to preserve is relatively small (only the identity of the connected client and the corresponding /etc/passwd entry).

Micro-benchmarks show up to a 20-times degradation in performance (for opening privileged files, for example). However, macro-benchmarks of FTP performance show only a 5% penalty, with no noticeable decrease in the number of supported clients or in throughput. It seems that privileged operations are not that frequent.

* The TrustedBSD MAC Framework: Extensible Kernel Access Control for FreeBSD 5.0

Robert Watson, Wayne Morrison, and Chris Vance, Network Associates Laboratories; Brian Feldman, The FreeBSD Project
FREENIX Track, pp. 285-296 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/watson.html

The goal of the project is to build a framework for implementing various policies and authorization requirements. All services of the FreeBSD kernel -- VFS, sockets, IPC, etc. -- were modified to invoke the framework for all authorization decisions. The framework in turn invokes the appropriate policy modules. The modules are written by users and can be loaded dynamically (a module can prevent users from unloading it). The framework also supports attaching policy-neutral labels to objects, and a useful composition of policies (intersection of rights). Other policy compositions are possible. The project has implemented quite a few policies: MAC, LOMAC, Biba, fixed-label hierarchical multi-level security, SEBSD, etc. They have finished porting SELinux's FLASK/TE (type enforcement) to SEBSD. The authors spent quite a bit of time making sure that policy decisions are properly serialized on SMP systems. The benchmarks showed that the overhead of labeling slow objects (such as files, whose labeling is implemented via extended attributes on FFS2 under FreeBSD 5.x) is negligible.

* Using Read-Copy-Update Techniques for System V IPC in the Linux 2.5 Kernel

Andrea Arcangeli, SuSE Labs; Mingming Cao, Paul E. McKenney, and Dipankar Sarma, IBM Corporation
FREENIX Track, pp. 297-310 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/arcangeli.html

The topic is _low-level_ concurrent access to (read-mostly) shared data on a multi-processor. The standard method relies on atomic instructions such as test-and-set and atomic increment. These atomic instructions have high overhead because they require flushing the instruction pipeline and committing all outstanding writes.
For example, an atomic increment on a Pentium 4 takes 135 cycles. An alternative, for mostly-constant data (such as routing tables or configuration data), is read-copy-update (RCU). The technique applies when a data structure is accessed via an indirect pointer. For example, slots of the routing table are accessed only through the pointer to the routing table, which is stored in a global kernel table. None of the kernel data structures reference routing slots directly. This assumption implies that no pointer to a routing-table slot remains in CPU registers after a context switch.

To update the piece of shared data, we allocate another piece, copy the old information, apply the updates, and then store the pointer to the new copy in the corresponding kernel table (update the global pointer). However, we cannot deallocate the old copy until all CPUs in the system have performed a context switch. The latter condition guarantees that no outstanding pointer to the old copy of the data exists. To protect against multiple writers, we need a spin lock. The advantage of the RCU technique is that readers need to assert _no_ lock. Readers can access the data structure as if it were completely immutable and not subject to any race condition.

The authors took advantage of RCU in implementing System V semaphores. Allocating and deallocating a semaphore is a relatively rare event (compared with the P and V operations on it). Previously, the table of semaphores and the current semaphore id were protected by a global lock, which added significant overhead. Now the access to the semid and the lookup of a semaphore control block are done without any locking and without any contention. The RCU API and the efficient implementation of System V semaphores are in the Linux 2.5 kernel.
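The access pattern -- lock-free readers, copy-update-and-swap writers -- can be illustrated with a conceptual sketch. Python's garbage collector and its global interpreter lock stand in for the grace-period machinery, so this shows only the shape of the technique, not the memory-ordering guarantees of the kernel RCU API:

    # Conceptual illustration of the read-copy-update access pattern.
    # Readers take no lock; writers copy, modify, and publish a new version by
    # swapping a single reference. In the kernel, the old copy may be freed only
    # after a grace period (every CPU has context-switched); here Python's GC
    # stands in for that, so only the access pattern is shown.

    import threading

    class RoutingTable:
        def __init__(self, routes):
            self.routes = dict(routes)      # treated as immutable once published

    _current = RoutingTable({"10.0.0.0/8": "eth0"})
    _writer_lock = threading.Lock()         # writers still exclude each other

    def lookup(prefix):
        table = _current                    # one dereference; no lock taken
        return table.routes.get(prefix)     # old or new version, never a torn one

    def update(prefix, nexthop):
        global _current
        with _writer_lock:                  # protect against concurrent writers
            new = RoutingTable(_current.routes)    # copy
            new.routes[prefix] = nexthop           # update the private copy
            _current = new                         # publish: swap the pointer
        # kernel RCU would now wait for a grace period before freeing the old copy

    update("192.168.1.0/24", "wlan0")
    print(lookup("192.168.1.0/24"))         # 'wlan0'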
* An Implementation of User-level Restartable Atomic Sequences on the NetBSD Operating System

Gregory McGarry
FREENIX Track, pp. 311-322 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/mcgarry.html

The paper describes a mechanism for implementing atomic operations on a _uniprocessor_ system. Atomic instructions such as test-and-set or fetch-and-add enable higher-level mutual-exclusion facilities. Some CPUs offer atomic instructions, but some (such as the ARM2 and MIPS R3000) do not. Interlocking can be emulated in userland code (Lamport's mutual exclusion algorithm) or by invoking the appropriate kernel facility. Both methods have a high cost; a syscall is especially expensive. Even if a particular platform offers atomic instructions, they are quite expensive to execute.

The paper describes an alternative approach: restartable sequences. Restartable sequences execute entirely in userland. The approach is optimistic: in the best case there is almost no overhead. The approach is general and gives good performance. Restartable sequences turn out to be more efficient even if a platform does offer atomic instructions. Therefore, restartable sequences are now the preferred mechanism for atomic operations in the NetBSD threads library on _uniprocessor_ systems.

A restartable sequence is a _simple_ sequence of instructions: the sequence must not invoke any functions, must not execute any emulated instructions or traps, and must be idempotent -- give the same effect no matter how many times it executes. For example, reading a value and writing a constant into a memory location (the software implementation of test-and-set) is a simple sequence. An application notifies the kernel of the beginning and the end of all restartable sequences within the application. If the kernel, upon a context switch, finds the program counter within the range of a restartable sequence, the program counter is reset to the beginning of the sequence before control is given back to the user thread. Resetting the program counter to the beginning of a sequence is a roll-back. The idempotency requirement of a restartable sequence guarantees that the roll-back is sound. Because the kernel is non-preemptible on a uniprocessor, restarting a sequence guarantees the atomicity of the operation.

* Providing a Linux API on the Scalable K42 Kernel

Jonathan Appavoo, University of Toronto; Marc Auslander, Dilma Da Silva, David Edelsohn, Orran Krieger, Michal Ostrowski, Bryan Rosenburg, Robert W. Wisniewski, and Jimi Xenidis, IBM T.J. Watson Research Center
FREENIX Track, pp. 323-336 of the Proceedings
http://www.usenix.org/events/usenix03/tech/freenix03/appavoo.html

K42 is a new open-source operating system kernel being developed at IBM: http://www.research.ibm.com/K42 . K42 is aimed at 64-bit CPUs and is designed to scale from a few to thousands of CPUs. K42 is similar to an Exokernel; its IPC is similar to that of the L4 micro-kernel. It appears that K42 takes advantage of the old SOM technology (part of OpenDoc) -- although the paper never mentioned SOM. The paper describes the implementation of a Linux API and a Linux application binary interface (ABI) on K42. Most Linux applications run on K42 without recompilation (if an important application does not run, the K42 team fixes the OS until the application runs). Linux outperforms K42 with the Linux personality on a uniprocessor, but from about 10 CPUs on, K42 outperforms Linux. K42 scales linearly, and on a 24-CPU multiprocessor K42 is twice as fast as Linux (according to the SDET benchmark).

* BSD SuperBOF

** NetBSD

There will be a grand release, 2.0, in the fourth quarter:
- SMP kernel on x86 and AMD64 platforms
- kqueues
- compatibility layers for Mach and Darwin
- crypto-disk (see the presentation above), UFS2, GCC 3.3

We can now cross-build NetBSD from Linux for any target platform. There are 230 NetBSD developers, who constitute the membership of the NetBSD Foundation. Directors of the foundation -- who must be members, that is, active developers -- are elected for 2-year terms. The foundation coordinates donations and manages intellectual property. For example, NetBSD and pkgsrc are registered trademarks of the foundation. Directors perform mainly administrative work and resolve disputes between developers.

** OpenBSD

OpenBSD has introduced several new facilities that make it very difficult to execute buffer overflow attacks: randomizing the placement of shared libraries in memory, and W^X. The latter, well-publicized facility arranges that any executable piece of memory is not writable by default. That arrangement required extensive modifications to the linker, the dynamic loader, and even the GCC compiler (which used to generate trampolines on the stack and thus needed the stack to be executable). The OpenBSD packet filter pf has been significantly enhanced. One particular new feature is ALTQ, which provides quality-of-service queueing and throttling, triggered based on IP addresses and other conditions.

** FreeBSD

FreeBSD had a very busy year: the release of FreeBSD 5.0 and (just a week before the conference) FreeBSD 5.1. One particular new feature in the works is mandatory access controls and other security policies.
They will be more pervasive in version 5.2 (see the talk on TrustedBSD above).

* Plan9

This year Plan9 had a show-and-tell BoF, which was attended by ca. 60 people. That was quite a surprise. Another surprise is that Plan9 is not dead. Plan9 will truly become open source within a month. Lucent (Bell Labs) is running some cell phone systems on Plan9, with a real-time scheduler. The group of Dave Presotto at Bell Labs is actively developing the system, moving towards high availability. Furthermore, there is a group at MIT that is developing Plan9. There are other developers, in particular at the Los Alamos Lab (in the Computer and Computational Sciences group). The latter group is using Plan9 to make the booting of Linux faster; that feature is important for their 1024-node cluster. They are also looking at Plan9 in the context of Grid computing. As Ron Minnich (LANL) said, Plan9 is far superior to Globus. Ron Minnich said that DOE is looking into Plan9. Plan9 has GCC 3.0, with the entire toolchain.

http://www.cs.bell-labs.com/plan9dist/