Fault-Tolerant Protocols for Distributed Real-Time Systems

Graduate Students: Sreekanth Brahmamdam, Ashish Mehra, Jennifer Rexford

Professor Farnam Jahanian

Sponsors: Rackham School of Graduate Studies, University of Michigan; Office of Naval Research

With the ever-increasing reliance on digital computers in embedded real-time systems for applications as diverse as avionics, manufacturing and process control, air-traffic control, and patient life-support monitoring, the need for dependable systems that deliver services in a timely manner has become crucial. Embedded real-time systems often interact with the external environment and operate under strict timing and fault-tolerance requirements. The focus of this research project is the management of redundancy with predictable timing behavior in distributed real-time systems. The primary objectives are (a) to investigate replication protocols for building predictable fault-tolerant systems; and (b) to develop a prototype testbed that can be used to demonstrate the feasibility of various replication schemes.

This effort takes two complementary approaches to the problem of replication in distributed real-time systems. The first approach supports fault tolerance by providing various multicast services for real-time process groups. Its primary goal is the integration of real-time scheduling theory with multicast protocols to support predictable management of replicated data. The second approach deals with alternative replication protocols that exploit weaker consistency among the local states of replicated servers in a real-time system. These schemes may give up atomic or causal consistency among replicas in exchange for less expensive replication protocols. This approach is based on a new concept, called window consistency, which provides a framework for supporting controlled inconsistency between replicas in a real-time application. The window-consistent replication service exploits the ability of many real-time applications to tolerate controlled inconsistency in their data in order to provide timely delivery of service. In this model, a distributed system can recover from a failure even though not all servers have maintained identical copies of the replicated data.
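The idea of controlled inconsistency can be illustrated with a minimal sketch. The text above does not give a formal definition of window consistency, so the sketch assumes one plausible reading: a backup's copy of an object is acceptable as long as it reflects a primary state that is no older than a per-object time window. The names `BackupCopy`, `refreshed_at`, `is_window_consistent`, and `can_take_over`, as well as the sample objects and window values, are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BackupCopy:
    """A backup's copy of one replicated object (hypothetical structure)."""
    value: float
    refreshed_at: float  # time at which this copy last matched the primary


def is_window_consistent(copy: BackupCopy, now: float, window: float) -> bool:
    """Assumed rule: the copy may lag the primary by at most `window` time units."""
    return now - copy.refreshed_at <= window


def can_take_over(copies: Dict[str, BackupCopy], now: float,
                  windows: Dict[str, float]) -> bool:
    """A backup is usable after a failure only if every object is within its window."""
    return all(is_window_consistent(copy, now, windows[name])
               for name, copy in copies.items())


# Illustrative data: two objects with different tolerance for staleness.
copies = {
    "altitude": BackupCopy(value=10200.0, refreshed_at=99.6),
    "heading": BackupCopy(value=271.0, refreshed_at=98.0),
}
windows = {"altitude": 0.5, "heading": 1.0}

# "altitude" is 0.4 time units old (within 0.5); "heading" is 2.0 old (outside 1.0).
print(can_take_over(copies, now=100.0, windows=windows))
```

Under this reading, the backup never needs to hold an identical copy of the primary's data; it only needs each object refreshed often enough to stay inside its window, which is what allows a cheaper protocol than atomic or causal replication.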