Peer-to-Peer Web Indexing and Search

Boon Thau Loo, Jinyang Li, Frans Kaashoek1, David Karger2, and Robert Morris 3
(Professors Joseph M. Hellerstein and Ion Stoica)

This project examines the feasibility of peer-to-peer full-text keyword search of the Web. Two classes of keyword search techniques are in use or have been proposed: flooding of queries over an overlay network (as in Gnutella), and intersection of index lists stored in a distributed hash table (DHT). A simple model of capacity and workload suggests that the Internet does not have enough capacity to make naive use of either of these techniques attractive for Web search. We then present a number of existing and novel optimizations, such as caching and precomputation, bloom filters, gap compression and adaptive set intersections [1] with clustering. We estimate their effect on performance, and conclude that in combination these optimizations would bring the problem to within an order of magnitude of feasibility. To achieve the last order of magnitude, we propose using Fagin's algorithm (FA) [2] in conjunction with a ranking function to generate incremental results. This may allow savings in communication if users terminate their queries early.

We intend to build a DHT-based search engine, and compare its performance characteristics with Gnutella and KaZaA.

[1]
E. D. Demaine, A. López-Ortiz, and J. I. Munro, "Adaptive Set Intersections, Unions, and Differences," Proc. ACM-SIAM Symp. Discrete Algorithms, San Francisco, CA, January 2000.
[2]
R. Fagin and A. L. M. Naor, "Optimal Aggregation Algorithms for Middleware," Symp. Principles of Database Systems, Santa Barbara, CA, May 2001.
1Professor, MIT
2Professor, MIT
3Professor, MIT

Send mail to the author : (boonloo@eecs.berkeley.edu)


Edit this abstract