Approximate Text Addressing and Spam Filtering

Feng Zhou, Li Zhuang, Ben Y. Zhao, and Ling Huang
(Professors Anthony D. Joseph and John D. Kubiatowicz)
(NSF) ANI-9985250, (NSF) ITR CCR-0085899, and (UC MICRO) 00-049

Existing P2P overlay networks [1-5] utilize scalable routing tables to map unique identifiers to network locations as a decentralized object location and routing (DOLR) layer, and to locate nearby copies of object replicas as distributed hashtables (DHT). They allow network applications such as distributed file systems and distributed web caches to efficiently locate and manage object replicas across a wide-area network.

While these systems excel at locating objects and object replicas, they rely on known globally unique identifiers (GUID) for each object, commonly generated by applying a secure hash function over the data. This is problematic for text-based objects, where replicas are often similar but not identical. Replicas then can have different IDs, complicating their management.

To address this problem, we propose an approximate text addressing (ATA) layer, which uses a combination of block text fingerprints and P2P object location to find matches between highly similar data objects. To validate the ATA architecture, we design a decentralized spam-filtering application that leverages ATA to accurately identify junk email messages despite formatting differences and evasion efforts by marketing firms.

For evaluation, we use both probability models to predict the accuracy of our fingerprint vector scheme, and confirm those predictions by running experiments on real email traces. We measure both the accuracy rate, and also the false positive rate, and determine the size of fingerprint vector necessary to achieve an acceptable level of accuracy. We also simulate the spam filter application, and measure the success rate of matching a "marked" spam message as a function of both the percentage of nodes marking the message, and also the number of hops a request travels before termination. This allows us to tune the tradeoff between accuracy and network bandwidth for any given "marking rate" of a spam message.

Future work includes the implementation of the decentralized spam filter, and its deployment on the Tapestry network. We also plan to repeat our experiments on larger email datasets, while further validating the ATA design with the design and analysis of additional applications.


Figure 1: Fingerprint vector: a fingerprint vector is generated from the set of checksums of all substrings of length L, post-processed with sort, selection, and reverse operations.

[1]
B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph, Tapestry: An Infrastructure for Fault-Tolerant Wide-Area Location and Routing, UC Berkeley Computer Science Division Report No. UCB/CSD 01/1141, April 2001.
[2]
K. Hildrum, J. D. Kubiatowicz, S. Rao, and B. Y. Zhao, "Distributed Object Location in a Dynamic Network," Proc. ACM Symp. Parallel Algorithms and Architectures, Winnepeg, Canada, August 2002.
[3]
A. Rowstron and P. Druschel, "Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems," Proc. IFIP/ACM Middleware, Heidelberg, Germany, November 2001.
[4]
I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, "Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications," Proc. SIGCOMM, San Diego, CA, August 2001.
[5]
S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Schenker, "A Scalable Content-Addressable Network," Proc. SIGCOMM, San Diego, CA, August 2001.

More information (http://www.cs.berkeley.edu/~ravenben/tapestry) or

Send mail to the author : (ravenben@eecs.berkeley.edu)


Edit this abstract