Reactive systems are those that maintain an ongoing interaction with their environment at a speed dictated by the latter. Examples of such systems include web servers, network routers, sensor nodes, and autonomous robots. While we increasingly rely on the correct operation of reactive systems, it is becoming ever harder to deploy bug-free systems.
In this paper, we propose a formal framework for automatically recovering
a class of reactive systems from run-time failures. This class of systems
comprises those whose executions can be divided into rounds such that each
round performs a new unit of work. We show how the system recovery
problem can be modeled as an instance of an online learning problem.
On the theoretical side, we give a strategy that is near-optimal,
and state and prove bounds on its performance. On the practical side, we
demonstrate the effectiveness of
our approach through the case study of a buggy network monitor.
Our results indicate that online learning provides a useful basis for
constructing autonomic reactive systems.
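
To illustrate how recovery can be cast as online learning, the following is a minimal sketch and not the strategy analyzed in this paper. It assumes that in each round the system chooses one of a fixed set of recovery actions, and that a loss in [0, 1] can be observed afterwards for every action (a full-information setting). It uses the standard Hedge (multiplicative weights) rule. The action names, the loss values, and the function `hedge_recovery` are hypothetical and chosen only for illustration.

```python
import math
import random


def hedge_recovery(actions, loss_fn, rounds, eta=0.5, seed=0):
    """Choose one recovery action per round using the Hedge rule.

    loss_fn(t, action) must return a loss in [0, 1] for round t.
    Returns the final normalized weights and the total loss incurred.
    """
    rng = random.Random(seed)
    weights = {a: 1.0 / len(actions) for a in actions}
    total_loss = 0.0
    for t in range(rounds):
        # Sample an action with probability proportional to its weight.
        chosen = rng.choices(actions, weights=[weights[a] for a in actions])[0]
        total_loss += loss_fn(t, chosen)
        # Full-information update: each action is penalized by its own loss.
        for a in actions:
            weights[a] *= math.exp(-eta * loss_fn(t, a))
        # Renormalize so the weights stay a probability distribution.
        norm = sum(weights.values())
        weights = {a: w / norm for a, w in weights.items()}
    return weights, total_loss


# Hypothetical recovery actions and per-round losses (assumed, not measured).
ACTIONS = ["retry", "skip_input", "restart"]
ASSUMED_LOSS = {"retry": 1.0, "skip_input": 0.6, "restart": 0.2}

weights, total_loss = hedge_recovery(
    ACTIONS, lambda t, a: ASSUMED_LOSS[a], rounds=50
)
best_action = max(weights, key=weights.get)
```

Under these assumed losses the weight mass concentrates on the action with the smallest cumulative loss, which is the behavior an online learner is expected to show when one recovery action is consistently cheapest.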