DRAFTS: Distributed Real-Time Applications Fault Tolerant Scheduling

Claudio Pinello
(Professor Alberto L. Sangiovanni-Vincentelli)
Gigascale Silicon Research Center

Some applications are so critical that they need to be resilient to faults in the computing architecture. Typically this is achieved by redundantly scheduling the application on the architecture. Starting from a network of process, some or all of the processes and the data they exchange are replicated. Additional processes may be needed for voting on the results of different replicas of a same process to establish a common result: the consensus problem [1]. We must then devise an assignment and schedule of the augmented network onto the distributed architecture.

It seems profitable to relieve the designer from the burden of devising a fault-tolerant distributed schedule, and opt for an approach based on synthesis.

In order to use resources efficiently, we want to allow the flexible use of passive replicas (replicas of a process that run only when the main replica undergoes a fault). Preliminary results have shown the usefulness of this technique in achieving higher schedulability by "reclaiming" resources from non-running replicas [2]. A further venue of improvement may arise in the context of gracefully degrading applications, where replicas are not an exact copy of the original process. Rather, there may be simpler versions with reduced functionality and/or accuracy and likely less resource requirements [3]. This exposes an opportunity to achieve higher schedulability by requiring strong fault resilience only of the light-weight versions.

Moreover, we want to allow general architectures, removing the strict boundaries of the modules and busses found in the TTA. This also enables more general fault models for the communication subsystem. The resulting architecture is a full-fledged distributed multiprocessor system, where each node can be used per se and not as a mere duplicate of another one. All the parallelism in the hardware can then be exploited to speed up the execution of parallel processes of the application [4,5] without affecting the degree of fault tolerance. We note that most of the results cited above have been derived under very restrictive assumption on the fault model. We believe some of their founding principles can be rephrased in a more general framework. The expected outcome of this research is a systematization of a set of design techniques that could allow for an easy exploration of design alternatives arising from different fault models.

M. Barborak, M. Malek, and A. Dahbura, "The Consensus Problem in Fault-Tolerant Computing," ACM Computing Surveys, Vol. 25, No. 2, 1993.
K. Ahn, J. Kim, and S. Hong, "Fault-Tolerant Real-Time Scheduling Using Passive Replicas," Proc. Pacific Rim Int. Symp. Fault-Tolerant Systems, Taipei, Taiwan, December 1997.
M. Caccamo, G. Buttazzo, and L. Sha, "Capacity Sharing for Overrun Control," Proc. IEEE Real-Time Systems Symp., Orlando, FL, November 2000.
C. Dima, A. Girault, C. Lavarenne, and Y. Sorel, "Off-line Real-Time Fault-Tolerant Scheduling," Proc. Euromicro Workshop on Parallel and Distributed Processing, Mantova, Italy, February 2001.
J. Aguilar and M. Hernandez, "Fault Tolerance Protocols for Parallel Programs Based on Tasks Replication," Proc. MASCOTS, San Francisco, CA, August-September 2000.

More information (http://www-cad.eecs.berkeley.edu/~pinello/distributedsystems) or

Send mail to the author : (pinello@eecs.berkeley.edu)

Edit this abstract