One of the novel features of our architecture is the identification of a modular collection of middleware services on OSF Mach-RT that can be used for developing ERA. The objective is to take full advantage of the base operating system and its real-time extensions, including features such as a preemptive microkernel, a real-time scheduling framework, clock and timer support, and bounded IPC. These building blocks include services for managing computation and communication resources, for providing access to shared and/or replicated data in a networked environment, and for managing system dependability. The figure illustrates the various software layers in the suite of reusable middleware services. Each layer exports a well-defined interface to other services or to applications that are built on top of it. Each service layer may support one or more protocols. Related services are grouped together to illustrate the functional dependencies among them.
Real-time multicast communication services
These services provide a collection of protocols with various delivery guarantees for
sending messages to a group of destinations, including real-time
(datagram) multicast, real-time atomic multicast, and atomic FIFO
multicast. These services exploit the
underlying real-time channel protocol and hardware support
in a system without exposing the implementation to
higher-layer services.
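The FIFO part of these delivery guarantees can be illustrated with a minimal sketch. It assumes each sender numbers its messages from 0 and that the receiver holds back anything that arrives early; the class name `FifoReceiver` and its methods are hypothetical and are not part of the project's interface. Atomicity and timing bounds are deliberately left out.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers messages from each sender in the order they were sent.

    Hypothetical illustration only: per-sender sequence numbers plus a
    hold-back buffer. It does not model atomic delivery or deadlines.
    """

    def __init__(self):
        self.next_seq = defaultdict(int)   # next expected sequence number per sender
        self.held = defaultdict(dict)      # early messages: sender -> {seq: payload}
        self.delivered = []                # messages handed to the application

    def receive(self, sender, seq, payload):
        if seq < self.next_seq[sender]:
            return  # duplicate of an already delivered message
        self.held[sender][seq] = payload
        # Deliver every message that is now in sequence for this sender.
        while self.next_seq[sender] in self.held[sender]:
            n = self.next_seq[sender]
            self.delivered.append((sender, n, self.held[sender].pop(n)))
            self.next_seq[sender] = n + 1


rx = FifoReceiver()
rx.receive("A", 1, "second")   # arrives early, held back
rx.receive("A", 0, "first")    # releases both messages from A
rx.receive("B", 0, "hello")
rx.receive("A", 0, "first")    # duplicate, ignored
```

Keeping the hold-back buffer inside the receiver is what lets a higher layer see ordered delivery without knowing how the underlying channel behaves.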
Clock synchronization service
The clock synchronization service provides access to synchronized clocks and
logical timestamps. It provides a bound on the deviation between
processor clocks in the presence of hardware clock drift and
failures. Timestamping events in a distributed computation
makes it possible to compare the real-time ordering and separation
of those events. The logical timestamping service can be
used to assign timestamps that capture the causal ordering of events
in a distributed computation.
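Both kinds of timestamp can be sketched briefly. The sketch assumes that clocks on any two processors differ by at most `epsilon`, and it uses a Lamport-style counter as one common way to produce logical timestamps; the source does not say which logical clock scheme the service uses, and all names here are hypothetical.

```python
def definitely_before(ts_a, ts_b, epsilon):
    """True if event a must precede event b in real time.

    ts_a and ts_b are readings from two different synchronized clocks
    whose deviation is bounded by epsilon (an assumed parameter).
    """
    return ts_b - ts_a > epsilon


class LamportClock:
    """Lamport-style logical clock (an assumed scheme, for illustration)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or message send: advance the counter.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Message receipt: jump past the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
e1 = p.tick()          # local event on p
m = p.tick()           # p sends a message stamped m
e2 = q.receive(m)      # q receives it
e3 = q.tick()          # later local event on q
```

If two synchronized timestamps differ by no more than `epsilon`, their real-time order cannot be decided from the clocks alone, which is where the bound on clock deviation matters.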
Group membership and topology services
These services maintain the operational status of nodes and communication links within
a system in the presence of processor failures/joins and communication
failures. These functions are provided by three related services:
heartbeat service, membership service, and topology
management. The heartbeat service is used by each node to determine
the operational status of other nodes. This service does not provide
any global consistency or agreement among processors. This is a
low-level service in the sense that it can use system-specific
mechanisms to detect node and link failures.
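A minimal sketch of a heartbeat monitor follows, assuming a fixed timeout after which a silent node is suspected. The class `HeartbeatMonitor` and the `timeout` parameter are hypothetical, and time is passed in explicitly so the example stays deterministic. Each node would run its own monitor, so the result is a purely local view with no agreement among processors.

```python
class HeartbeatMonitor:
    """Local, per-node view of which peers appear to be alive."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a node is suspected
        self.last_seen = {}      # node name -> time of most recent heartbeat

    def record(self, node, now):
        # Called whenever a heartbeat from `node` arrives.
        self.last_seen[node] = now

    def suspected(self, now):
        # Nodes whose last heartbeat is older than the timeout.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


mon = HeartbeatMonitor(timeout=3.0)
mon.record("n1", now=0.0)
mon.record("n2", now=0.0)
mon.record("n1", now=2.5)          # n1 keeps sending, n2 goes silent
suspects = mon.suspected(now=4.0)
```

Two monitors on different nodes may disagree about the same peer, which is why a separate membership service is needed to reach a consistent view.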
Topology management is a system-specific service that maintains
information about the connectivity of a system. This service does
not provide any global consistency or agreement. Changes in the
topology are propagated throughout the system in a lazy fashion.
The processor membership service uses the information provided by the
heartbeat and topology management services to ensure that all
functional nodes are reliably notified and have a consistent view of the
operational status of the processors in the presence of processor
failures/joins and communication failures. The processor membership
service is used by other services to trigger initialization and
recovery procedures.
Real-time data replication and caching services
These services provide the mechanisms for accessing and updating
replicated/distributed application
objects. The challenge is to develop replication schemes that provide
the required redundancy without sacrificing the timing predictability
of the system. An active replication scheme, based on the notion of
periodic process groups, and a novel passive replication scheme,
based on the notion of window consistency, will be provided as
part of the middleware services. The caching service supports
efficient access to frequently-accessed or slow-changing shared data.
The real-time replication service ensures that an update to a
data object is propagated to cached copies within bounded time.
The implementation of the replication and caching services
will exploit the bounded delivery
guarantees provided by the underlying real-time channel protocol.
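The bounded-time propagation requirement can be expressed as a simple check. This sketch assumes that window consistency means a replica never misses a primary update for longer than a window of `window` time units; the text above does not define the term, so this reading, the function name, and its parameters are all assumptions.

```python
def is_window_consistent(primary_update_times, replica_applied_time, now, window):
    """Check an assumed bounded-staleness condition for one replica.

    primary_update_times: times at which the primary updated the object.
    replica_applied_time: time of the latest primary update the replica holds.
    The replica is consistent if it is missing no update, or if the oldest
    update it is missing happened at most `window` time units ago.
    """
    missing = [t for t in primary_update_times
               if replica_applied_time < t <= now]
    if not missing:
        return True
    return now - min(missing) <= window


updates = [1.0, 2.0, 5.0]
fresh = is_window_consistent(updates, replica_applied_time=2.0, now=5.5, window=1.0)
stale = is_window_consistent(updates, replica_applied_time=2.0, now=6.5, window=1.0)
```

Under this reading, the delivery bound of the real-time channel has to be smaller than the window for the check to hold in the absence of failures.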
Failure detection and recovery management
These services are closely related to the replication services.
While the replication service provides the mechanism for maintaining
redundant copies of certain objects, an application-specific
recovery procedure may have to be invoked
when a failure is detected.
The mechanism for failure detection and policies for recovery management
are supported by this service.
Distributed synchronization services
These services provide fault-tolerant and
scalable synchronization protocols for serializing access to
shared/exclusive resources. These services include a distributed lock
manager and a facility for barrier synchronization. The distributed
synchronization services can recover from various types of failures
including lock holder/coordinator crash and communication failure.
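Recovery from a lock holder crash might look like the following sketch. It assumes a single lock coordinator that is told about node failures, for example by the processor membership service; the `LockManager` class and its method names are hypothetical, and coordinator crashes and communication failures are not modeled.

```python
from collections import deque


class LockManager:
    """Exclusive locks with FIFO waiters and cleanup on node failure."""

    def __init__(self):
        self.holder = {}    # lock name -> node currently holding it
        self.waiters = {}   # lock name -> deque of nodes waiting for it

    def acquire(self, lock, node):
        # Grant immediately if free, otherwise queue the request.
        if lock not in self.holder:
            self.holder[lock] = node
            return True
        self.waiters.setdefault(lock, deque()).append(node)
        return False

    def release(self, lock, node):
        if self.holder.get(lock) != node:
            return  # only the current holder may release
        queue = self.waiters.get(lock)
        if queue:
            self.holder[lock] = queue.popleft()   # hand over to the next waiter
        else:
            del self.holder[lock]

    def on_node_failed(self, node):
        # Drop the failed node's pending requests, then free its locks.
        for queue in self.waiters.values():
            while node in queue:
                queue.remove(node)
        for lock in [l for l, h in self.holder.items() if h == node]:
            self.release(lock, node)


lm = LockManager()
got_n1 = lm.acquire("L", "n1")
got_n2 = lm.acquire("L", "n2")
got_n3 = lm.acquire("L", "n3")
lm.on_node_failed("n1")            # holder crashes; lock passes to n2
holder_after_crash = lm.holder["L"]
```

Handing the lock to the next waiter inside the failure handler keeps a crashed holder from blocking the remaining nodes indefinitely.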
Resource tracking and migration services
These services provide the mechanisms for monitoring resources (or servers) in
a distributed system and for
properly supporting resource migration. These mechanisms include invocation and
reply forwarding, mechanisms for ensuring co-location of related
resources, and garbage collection.
Resource migration may be necessary to deal with failures or
to support load balancing/sharing in the presence of potential
system bottlenecks.
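Invocation forwarding after a migration can be sketched with forwarding pointers. The sketch assumes that a node leaves behind a pointer to the new location when a resource moves away; the `Node` class and its fields are hypothetical, and the reply is simply returned through the call rather than forwarded hop by hop.

```python
class Node:
    """A node that hosts resources and remembers where they migrated to."""

    def __init__(self, name):
        self.name = name
        self.resources = {}   # resource id -> callable implementing it
        self.forward = {}     # resource id -> Node it migrated to

    def migrate(self, rid, target):
        # Move the resource and leave a forwarding pointer behind.
        target.resources[rid] = self.resources.pop(rid)
        self.forward[rid] = target

    def invoke(self, rid, arg):
        # Follow forwarding pointers until the current host is found.
        node, hops = self, 0
        while rid not in node.resources:
            node = node.forward[rid]   # raises KeyError for an unknown resource
            hops += 1
        return node.resources[rid](arg), node.name, hops


a, b, c = Node("a"), Node("b"), Node("c")
a.resources["counter"] = lambda x: x + 1
a.migrate("counter", b)
b.migrate("counter", c)
result = a.invoke("counter", 41)   # caller still addresses the original host
```

Stale forwarding pointers are one kind of state that the garbage collection mechanism mentioned above would eventually need to reclaim.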