Deterministic Fault Injection of Distributed Systems
Ensuring that a system meets its prescribed specification is a growing
challenge that confronts software developers and system engineers.
Meeting this challenge is particularly important for distributed
systems with strict dependability and timeliness constraints. This
paper presents a technique, called script-driven probing and fault
injection, for the evaluation and validation of dependable
protocols. The proposed approach serves three purposes for a target
protocol: i) detecting design or implementation errors, ii) identifying
violations of the protocol specification, and iii) gaining insight into
design decisions made by the implementors. To
demonstrate the capabilities of this technique, the paper briefly
describes a probing and fault injection tool, called the PFI
tool, and several experiments on two protocols: the Transmission
Control Protocol (TCP) and the Group Membership Protocol (GMP). The
tool can be used to delay, drop, reorder, duplicate, and modify
messages. It can also introduce new messages into the system to probe
participants. In the case of TCP, we used the PFI tool to
reproduce the experiments reported by Comer and Lin in 1994 on several
TCP implementations, in a very short time and without access to the
vendors' TCP source code. We also ran several new experiments that are
difficult to perform using past approaches based on packet monitoring
and filtering. In the case of GMP, we used the tool to test the
fault-tolerance capabilities of an implementation under various
failure models including daemon/link crash, send/receive omissions,
and timing failures. Furthermore, by selective reordering of messages
and spontaneous transmission of new messages, we were able to guide a
distributed computation into hard-to-reach global states without
instrumenting the protocol implementation.
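
The message manipulations described above (dropping, reordering, and
injecting messages) can be illustrated with a minimal Python sketch. It
is not the PFI tool or its scripting interface; every name in it
(Message, FaultInjectionLayer, drop_every, swap_pairs, run_demo) is
hypothetical. It assumes only that a fault injection layer sits between
a protocol and the network and passes each outgoing message through a
chain of scripts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Message:
    seq: int
    payload: str


# A script maps one intercepted message to the messages to forward:
# [] drops it, [msg] passes it through, [msg, msg] duplicates it.
Script = Callable[[Message], List[Message]]


def drop_every(n: int) -> Script:
    """Hypothetical script: drop every n-th message it sees."""
    count = 0

    def script(msg: Message) -> List[Message]:
        nonlocal count
        count += 1
        return [] if count % n == 0 else [msg]

    return script


def swap_pairs() -> Script:
    """Hypothetical script: hold one message, release it after the next."""
    held: List[Message] = []

    def script(msg: Message) -> List[Message]:
        if not held:
            held.append(msg)
            return []
        return [msg, held.pop()]

    return script


class FaultInjectionLayer:
    """Sits between the protocol under test and the network (assumed design)."""

    def __init__(self, scripts: List[Script],
                 deliver: Callable[[Message], None]) -> None:
        self.scripts = scripts
        self.deliver = deliver

    def send(self, msg: Message) -> None:
        # Run the message through each script in order.
        pending = [msg]
        for script in self.scripts:
            pending = [out for m in pending for out in script(m)]
        for m in pending:
            self.deliver(m)

    def inject(self, msg: Message) -> None:
        # Spontaneously introduce a new message, bypassing the scripts.
        self.deliver(msg)


def run_demo() -> List[int]:
    received: List[Message] = []
    layer = FaultInjectionLayer([swap_pairs(), drop_every(3)], received.append)
    for seq in range(1, 5):
        layer.send(Message(seq, "data"))
    layer.inject(Message(99, "probe"))
    return [m.seq for m in received]
```

Because the scripts use counters rather than random choices, the same
input sequence always yields the same faults, which is what makes a
fault scenario repeatable. Here run_demo() sends messages 1 through 4
and the receiver observes sequence numbers 2, 1, 3, followed by the
injected probe 99.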
sdawson@engin.umich.edu