Flux, A Fault-Tolerant, Load-Balancing Operator for Continuous Query Systems

Mehul A. Shah
(Professor Joseph M. Hellerstein)

Continuous query (CQ) systems have been proposed for a variety of critical, high-throughput, long-running tasks. Examples include network monitoring tasks such as intrusion or denial-of-service detection, phone call record processing, web-log, or click-stream processing. These applications can be implemented as continuously processing dataflows, a generalization of database query plans. To provide high-throughput and low-latencies, CQ dataflows can be scaled by parallelizing them across a network of workstations, also known as a cluster or a shared-nothing platform.

Due to the long-running nature of CQ dataflows, two types of problems that arise in a cluster must be addressed for this method of scaling dataflows to be viable. First, as workload and runtime conditions evolve, detrimental imbalances in resource availability can arise across the cluster. CQ dataflows must continue to run efficiently in the face of these imbalances. Second, the machines on which CQ dataflows are running are bound to unexpectedly fail. Without using any redundancy, a single failure can seriously hamper or defeat the purpose of the CQ dataflow altogether. The CQ system must prevent data loss and provide high-availability for critical, long-running applications.

Our solution is to embed the logic for load-balancing and fault-tolerance into the communication abstraction, Flux, that is used to glue together a dataflow's constituent operators. In the absence of faults, Flux manages the state and input data partitioning of CQ operators to rebalance and keep them running efficiently. It also coordinates redundant processing of CQ operators and automatic fail-over to provide high-availability. The Flux mechanism is tunable so that an application designer can trade reliability for performance in portions of a CQ dataflow that do not require high-reliability. By judiciously embedding Flux in a CQ dataflow, an application designer can provide controlled degradation for the dataflow in the face of resource imbalances and machine faults that arise unexpectedly.

We are currently implementing Flux in the TelegraphCQ code base and developing robust CQ applications using Flux to demonstrate its utility.

More information (http://www.cs.berkeley.edu/~mashah) or

Send mail to the author : (mashah@eecs.berkeley.edu)

Edit this abstract