Time-bounded Restoration of Real-time Communication Services


Table of Contents:

  1. Principal Investigator.
  2. Productivity Measures.
  3. Summary of Objectives and Approach.
  4. Detailed Summary of Technical Progress.
  5. Transitions and DOD Interactions.
  6. Software and Hardware Prototypes.
  7. List of Publications.
  8. Invited and Contributed Presentations.
  9. Honors, Prizes or Awards Received.
  10. Project Personnel Promotions.
  11. Project Staff.
  12. URLs.
  13. Keywords.
  14. Business Office.
  15. Expenditures.
  16. Students.
  17. Book Plans.
  18. Sabbatical Plans.
  19. Related Research.
  20. History.


Principal Investigator.


Productivity Measures.


Summary of Objectives and Approach.

  1. Objective: Develop dependable real-time communication protocols that provide firm QoS guarantees on end-to-end message delay, delay jitter, and bandwidth, and restore the QoS-guaranteed communication service disrupted by (persistent) network failures within a pre-specified time bound in multi-hop point-to-point networks.
  2. Approach:
    Real-time channel service paradigm: Deterministic per-connection QoS guarantees are achieved by resource reservation, admission control, traffic enforcement, and CPU/link scheduling.
    Primary-backup channel scheme: Backup channels are set up a priori for each primary channel. The resource overhead for setting up backup channels is minimized by a resource sharing and overbooking technique.
    Efficient failure detection: Two behavior-based mechanisms ensure fast and perfect detection of real-time channel failures.
    Fast/Robust failure handling: Failure reporting, channel switching, and resource reconfiguration after failure recovery are performed in a timely manner without affecting regular traffic.
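
The admission control and backup resource sharing named in the approach above can be illustrated with a small sketch. The report does not give the actual algorithm, so everything here is an assumption made for illustration: the names `Link`, `admit_primary`, and `admit_backup` are hypothetical, bandwidth is the only resource modeled, and at most one network component is assumed to fail at a time. Under that assumption, backups whose primaries share no component never activate together, so a link only needs spare bandwidth for the worst single failure rather than for the sum of all backups.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Link:
    """One link with a fixed capacity (all names here are hypothetical)."""
    capacity: float
    primary_bw: float = 0.0
    # For each network component, the total backup bandwidth that would be
    # activated on this link if that component failed.
    backup_demand: dict = field(default_factory=lambda: defaultdict(float))

    def spare_bw(self) -> float:
        # Assumed single-failure model: the spare pool only has to cover
        # the worst single component failure, not the sum of all backups.
        return max(self.backup_demand.values(), default=0.0)

    def admit_primary(self, bw: float) -> bool:
        # Admission test: primaries plus the shared spare pool must fit.
        if self.primary_bw + bw + self.spare_bw() > self.capacity:
            return False
        self.primary_bw += bw
        return True

    def admit_backup(self, bw: float, primary_components: set) -> bool:
        # Tentatively add this backup's demand to every component its
        # primary depends on, then re-check the worst case.
        trial = dict(self.backup_demand)
        for c in primary_components:
            trial[c] = trial.get(c, 0.0) + bw
        spare = max(trial.values(), default=0.0)
        if self.primary_bw + spare > self.capacity:
            return False
        for c in primary_components:
            self.backup_demand[c] += bw
        return True
```

For example, on a link of capacity 10 carrying 4 units of primary traffic, two 3-unit backups whose primaries use disjoint components together need only 3 units of spare bandwidth, whereas two backups whose primaries share a component need 6.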


Detailed Summary of Technical Progress.

  1. We have designed and implemented a real-time channel protocol that provides deterministic guarantees on such performance-QoS parameters as end-to-end message delay, delay jitter, and bandwidth in packet-switched multi-hop networks. The QoS guarantee is maintained on a per-connection basis, and well-behaved connections are protected from unexpected overload conditions. To achieve high network-wide throughput, our protocol features a new "process-per-channel" protocol model that associates a channel handler with each established channel. The protocol also exports a rich, well-defined API (Application Program Interface) through which applications can specify and negotiate QoS parameters.

  2. While the real-time channel protocol supports performance-QoS guarantees, we also need to reduce or bound the disruption of QoS-guaranteed communication services when the network fails. To restore a real-time connection quickly after a failure, we set up one or more backup channels in addition to the primary channel for each connection. A backup channel remains a cold standby and carries no data until it is activated, so it consumes no bandwidth under normal conditions. However, a backup channel is not free: for immediate activation upon failure of the primary, it requires the same amount of resources as its primary channel to be reserved. We have developed a resource sharing technique that minimizes the resources reserved for fault tolerance without compromising dependability. Two important dependability QoS parameters are guaranteed: the recovery delay bound, and the probability that the application does not suffer a service disruption longer than that recovery delay bound.

  3. The first step in handling a failure is detecting it. Behavior-based failure-detection schemes without hardware support suffice for datagram network applications, because these applications do not mandate fast failure recovery and reliable message delivery can be achieved through acknowledgement/retransmission. However, effective failure detection with high coverage and low latency is key to providing fast, time-bounded failure recovery of real-time channels. We have developed two low-overhead behavior-based failure-detection schemes and have evaluated their performance through extensive fault-injection experiments on a laboratory testbed. To this end, we used an integrated fault-injection environment, called DOCTOR, which provides a complete set of tools for automated fault-injection experiments in a distributed environment.

  4. Reporting detected failures to the end nodes of the affected channels and switching failed primary channels to their healthy backups must be done quickly. At the same time, these operations must be robust, so that healthy connections are insulated from the recovery process for failed connections. For time-bounded and robust transmission of time-critical control messages (e.g., failure report messages), we use special-purpose real-time channels. After a failed primary channel is replaced by one of its healthy backups, the resources of the failed primary need to be released and a new backup channel should be established to maintain the connection's dependability QoS. Research on efficient QoS control is currently underway.
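
The failure handling described in items 2 through 4 can be sketched as follows. This is only an illustration under stated assumptions, not the project's implementation: the classes `Channel` and `Connection`, the method `handle_failure_report`, and the function `recovery_delay_bound` are hypothetical names, a channel is reduced to the set of components on its route, and the disruption bound is assumed to be the sum of per-phase bounds for detection, reporting, and backup activation.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    path: tuple            # component names along the route (hypothetical)
    healthy: bool = True


@dataclass
class Connection:
    primary: Channel
    backups: list = field(default_factory=list)

    def handle_failure_report(self, failed_component: str) -> bool:
        """Process one failure report at an end node.

        Marks every channel that traverses the failed component, then
        promotes the first healthy backup if the primary was hit.
        Returns True if the connection has a working primary afterwards.
        """
        for ch in [self.primary, *self.backups]:
            if failed_component in ch.path:
                ch.healthy = False
        healthy = [b for b in self.backups if b.healthy]
        if self.primary.healthy:
            # Primary unaffected; only drop backups that were lost.
            self.backups = healthy
            return True
        if not healthy:
            return False       # no backup left: recovery is not possible
        # Channel switching: the backup becomes the new primary.
        self.primary = healthy[0]
        self.backups = healthy[1:]
        return True


def recovery_delay_bound(detect: float, report: float, activate: float) -> float:
    """Worst-case disruption as a sum of per-phase bounds (assumed model)."""
    return detect + report + activate
```

After a successful switch the connection has one fewer backup, which is why a new backup channel has to be established to keep the same dependability QoS.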


Transitions and DOD Interactions.

  1. Real-time channel software to Honeywell Advanced Technology Center for various tracking applications.

  2. DOCTOR (fault-injection tool set) to Lockheed Martin in Denver.

  3. Ongoing discussions on porting the fault-injection tool and fault-tolerant communication software to the testbeds at NAWC, NSWC, SEI, and Allied Signal.


Software and Hardware Prototypes.

  1. Prototype Name: Real-Time Channel Protocol

  2. Prototype Name: Failure Detection Protocol

  3. Prototype Name: Backup Channel Protocol


List of Publications.

  1. Real-time channel service:

    A. Mehra, A. Indiresan, and Kang G. Shin, "Structuring Communication Software for Quality-of-Service Guarantees," Proc. of IEEE Real-Time Systems Symposium, December 1996.
  2. Fast failure recovery:

    S. Han and K. G. Shin, "Fast Restoration of Real-Time Communication Service from Component Failures in Multi-hop Networks," Proc. of ACM SIGCOMM Symposium, September 1997.
  3. Efficient failure detection:

    S. Han and K. G. Shin, "Experimental Evaluation of Failure-Detection Schemes in Real-time Communication Networks," Proc. of IEEE International Symposium on Fault-Tolerant Computing, June 1997.


Invited and Contributed Presentations.

  1. Real-time channel service:

    "Structuring Communication Software for Quality-of-Service Guarantees," at IEEE Real-Time Systems Symposium, December 1996.
  2. Fast failure recovery:

    "Fast Restoration of Real-Time Communication Service from Component Failures in Multi-hop Networks," at ACM SIGCOMM, September 1997.
  3. Efficient failure detection:

    "Experimental Evaluation of Failure-Detection Schemes in Real-time Communication Networks," at IEEE International Symposium on Fault-Tolerant Computing, June 1997.


Honors, Prizes or Awards Received.

Project Staff.


URLs.

  1. Annual Report FY97
  2. QUAD FY97 (PowerPoint format)
  3. Vugraph_SIGCOMM97 (postscript)


Keywords.

  1. Real-time communication
  2. Failure recovery
  3. Multi-hop network


Business Office.


Expenditures.

  1. FY97: 64%


Current and Former Students.

  1. Name: Dr Ashish Mehra
  2. Name: Dr Atri Indiresan
  3. Name: Mr Harold Rosenberg
  4. Name: Mr Seungjae Han
  5. Name: Mr Charles Meissner


Book Plans.


Sabbatical Plans.


Related Research.

  1. HARTS project
  2. ARMADA project
  3. TENET project at Berkeley
  4. IETF
  5. ATM Forum


History.