A few years ago, file sharing utilities were the first to explore the power of peer-to-peer computing. Protocols such as Napster, Gnutella, Freenet, and KaZaa provided users with an easy way to share popular files, documents, and other data. First generation protocols had their limitations. Freenet and Gnutella did not provide guaranteed results. Napster had limited scalability due to its use of centralized servers, and Gnutella's scalability was limited by the use of broadcast query messages. All of these protocols relied on the popularity (and resulting large number of replicas) of documents to make them easily accessible via search.
The second generation of peer-to-peer systems, including Tapestry [1,2], Chord , content-addressable networks (CAN) , and Pastry , provide reliable decentralized object location and routing (DOLR) functionality while only keeping local state logarithmic to the network size. These systems scale well in design, and under non-failure conditions, guarantee that queries find the desired object if it exists in the network.
To exploit this deterministic file location property, we've designed and implemented Interweave, a fully decentralized P2P file location utility on top of Tapestry. Interweave allows users to advertise local files by publishing IDs through the Tapestry object location layer, where IDs are generated from specified keywords. Users can do multi-field searches on multiple keywords or filenames, and prune their results further with constraints on file size and last modified date. Most importantly, search results are precise, and a document can easily be found, even if there is only one copy residing in a global network. Finally, Interweave leverages Tapestry's network locality to limit its results by the network distance between the client and each replica.
The Interweave design includes no points of centralization and no inherent scalability limitations, paving the way for its deployment on a large/global user base. The current implementation is feature complete, and interoperates with the current Tapestry Java implementation.
Current work revolves around improving the usability of the Interweave client, and performance optimizations to reduce the number of results returned (and the associated bandwidth usage) of popular requests.
Instant messaging (IM) is a new and useful way to communicate between networked users. Existing solutions such as AOL IM, MSN Messenger, ICQ, and Yahoo Messenger are generally centralized protocols with limited reliability and scalability properties.
In Shuttle, we take the traditional IM design, and implement it on a purely decentralized peer-to-peer infrastructure (Tapestry [1,2]). In the resulting protocol, users find their friends using a fully distributed lookup service rather than a centralized server. Communication is purely between peers. Contact lists are encrypted and distributed into the network storage layer of Tapestry, making them accessible from anywhere on the network.
Shuttle can also leverage the fault-tolerant routing aspects of Tapestry to route around congested and failed links to deliver messages more reliably. Finally, the completely server-less design of Shuttle means the system architecture does not limit scaling of the system to any sized network.
Shuttle is currently implemented on the current Java implementation of Tapestry. Plans are underway to deploy it as a long running service on an "always-up" Tapestry network anchored by a set of local Berkeley machines.
In today's chaotic Internet, data and services are mobile and replicated widely for availability, scalability, durability, and locality. Components within this global infrastructure interact in novel, rich, and complex ways, greatly stressing traditional approaches to location and routing. This article explores an alternative to traditional approaches, Tapestry [1, 2], which provides a peer-to-peer overlay location and routing infrastructure offering efficient, scalable, location-independent routing of arbitrary size messages directly to the closest copy of an object or a service using only localized resources. Tapestry supports a generic decentralized object location and routing (DOLR) API using a self-repairing, soft-state based fault-tolerant routing layer.
Using the Tapestry architecture, we have built two Tapestry implementations: a C-based simulator that can simulate up to 10,000 nodes; and a Java-based implementation that supports an efficient virtual node mode, where multiple virtual Tapestry nodes execute on a single physical machine. We have deployed the Java implementation on the large-scale PlanetLab  network testbed, which consists of approximately 100 machines at 42 academic and corporate institutions on three continents (North America, Europe, and Australia).
Using virtualization, we are performing detailed evaluation of the performance of Tapestry, by comparing the efficiency of overlay operations on Tapestry to idealized performance metrics, and analyzing Tapestry's large-scale behavior under highly dynamic, high-load scenarios, where client and server nodes may enter or leave the network in very large groups, and the rate of change in network membership is high. Finally, we are evaluating the usability and generality of Tapestry's extensible up-call interface.
Existing P2P overlay networks [1-5] utilize scalable routing tables to map unique identifiers to network locations as a decentralized object location and routing (DOLR) layer, and to locate nearby copies of object replicas as distributed hashtables (DHT). They allow network applications such as distributed file systems and distributed web caches to efficiently locate and manage object replicas across a wide-area network.
While these systems excel at locating objects and object replicas, they rely on known globally unique identifiers (GUID) for each object, commonly generated by applying a secure hash function over the data. This is problematic for text-based objects, where replicas are often similar but not identical. Replicas then can have different IDs, complicating their management.
To address this problem, we propose an approximate text addressing (ATA) layer, which uses a combination of block text fingerprints and P2P object location to find matches between highly similar data objects. To validate the ATA architecture, we design a decentralized spam-filtering application that leverages ATA to accurately identify junk email messages despite formatting differences and evasion efforts by marketing firms.
For evaluation, we use both probability models to predict the accuracy of our fingerprint vector scheme, and confirm those predictions by running experiments on real email traces. We measure both the accuracy rate, and also the false positive rate, and determine the size of fingerprint vector necessary to achieve an acceptable level of accuracy. We also simulate the spam filter application, and measure the success rate of matching a "marked" spam message as a function of both the percentage of nodes marking the message, and also the number of hops a request travels before termination. This allows us to tune the tradeoff between accuracy and network bandwidth for any given "marking rate" of a spam message.
Future work includes the implementation of the decentralized spam filter, and its deployment on the Tapestry network. We also plan to repeat our experiments on larger email datasets, while further validating the ATA design with the design and analysis of additional applications.
Figure 1: Fingerprint vector: a fingerprint vector is generated from the set of checksums of all substrings of length L, post-processed with sort, selection, and reverse operations.
I am currently investigating how introspection can be used in conjunction with code and data migration techniques to improve an Internet service's ability to adapt to changing network and resource conditions. Many of today's Internet services, such as e-mail, perform poorly when the connection to the server is high latency or low bandwidth. We have also shown that tomorrow's Internet services, based on peer-to-peer technologies, exhibit this same sensitivity to changes in network conditions. This sensitivity is usually due to inefficient use of the network connection.
Code and data migration techniques can be used to decrease this sensitivity by placing frequently used data objects nearer to the client, and by sending code fragments to run on the server to avoid round trip times. We are investigating how introspection can be used to build models of user access, so the infrastructure can predict expected user performance when code and data migration is used. Based on these models, the infrastructure can then decide when to employ the various code and data migration techniques to optimize a given metric.