Recent work in the networking and OS research communities have developed (and continue to improve) highly scalable peer-to-peer systems called distributed hash tables (DHTs). These systems can provide the foundation for designing a massively distributed query engine. Such a query engine can be utilized not only to query over a wide-area network (tapping into distributed resources), but also query data that naturally resides in the network without the need to transport the data to a central collection point. We believe such as system will be useful in many applications, particularly for network monitoring.
DHTs are able to map a key (a sequence of bits) to a node in the network in the face of simultaneous nodes entering, leaving, and failing in the network. Above this mapping, the common hash table function, put() and get() can be implemented. We extend this interface slightly to allow for query processing.
We use existing hash-based distributed/parallel relational join algorithms (such as the symmetric hash join) while relaxing traditional consistency semantics (ACID) to allow the system to scale to over ten-thousand machines in simulation, and (so-far) dozens in an actual implementation over the Millennium Cluster.
Our primary research focus is on supporting continuous queries, which appear to be natural to the class of applications we are interested in supporting. In addition to relational joins, aggregation is also needed for many of the applications. Key properties for any operator include low communication, little or no synchronization, and are non-blocking. These open the door to questions of dynamic query optimization, approximate answers (such as online aggregation), and multi-query sharing.
1International Computer Science Institute