Detecting Member Failure

Members can fail for several reasons (this list is not exhaustive):

Process termination.
Process suspension (for example, UNIX Control-Z).
Software errors.
Hardware failure.
Network disconnect.

Rendezvous fault tolerance software does not distinguish between these failures. In each case, the failed member cannot fulfill its mission (or can fulfill it only locally), and another member must take its place.

Rendezvous fault tolerance software detects failure of an active member in two ways: heartbeat tracking and independent confirmation.

Heartbeat Tracking

The inactive members of a group listen for heartbeat messages from all of the active members. A steady stream of heartbeat messages is an important indicator of process health. While the heartbeat continues, no action is needed. If heartbeat messages from an active member stop arriving, Rendezvous fault tolerance software considers that member lost, and instructs the highest-ranking inactive member to activate, replacing the lost member.
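The tracking logic described above can be illustrated with a minimal sketch. This is not the Rendezvous API. The class name `HeartbeatTracker`, its methods, and the `timeout` value are hypothetical, chosen only to show how silence from an active member might be turned into a "lost" decision.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatTracker:
    """Tracks the last heartbeat seen from each active member (illustrative only)."""

    timeout: float  # hypothetical: seconds of silence before a member counts as lost
    last_seen: dict = field(default_factory=dict)

    def record(self, member: str, now: float) -> None:
        """Note that a heartbeat from `member` arrived at time `now`."""
        self.last_seen[member] = now

    def lost_members(self, now: float) -> list:
        """Return members whose heartbeats have been silent longer than the timeout."""
        return sorted(m for m, t in self.last_seen.items() if now - t > self.timeout)


tracker = HeartbeatTracker(timeout=3.0)
tracker.record("A", now=0.0)
tracker.record("B", now=0.0)
tracker.record("A", now=2.0)  # A keeps sending heartbeats; B falls silent
lost = tracker.lost_members(now=4.0)
```

In this sketch, member B has been silent for longer than the assumed timeout at time 4.0, so it is reported as lost, while A is still considered healthy.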

Independent Confirmation

Rendezvous fault tolerance software also detects a set of events that indicate the loss of an active member. Consider these example events:

An active member withdraws from the fault tolerance group.
An active member process terminates, or disconnects from its Rendezvous daemon.
A Rendezvous daemon process terminates; an active member that relies on that daemon is unable to function.
A network hardware failure separates the network into two or more disconnected parts.

When Rendezvous fault tolerance software detects such events, it restores the active goal (the number of members that should be active at one time) by directing inactive member processes to activate.
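As a rough illustration of restoring the active goal, the sketch below removes lost members from the active set and then activates inactive members in rank order until the goal is met again. The function `restore_active_goal`, its parameters, and the representation of ranking as an ordered list are assumptions made for this example. They do not reflect how Rendezvous implements ranking or activation internally.

```python
def restore_active_goal(ranked_members, active, lost, active_goal):
    """Return (new_active_set, newly_activated) after removing lost members.

    ranked_members: all group members, highest rank first (hypothetical ordering).
    active:         members that were active before the failure was detected.
    lost:           members reported lost by a detected event.
    active_goal:    number of members that should be active.
    """
    lost = set(lost)
    still_active = set(active) - lost
    activated = []
    for member in ranked_members:
        if len(still_active) >= active_goal:
            break  # goal already satisfied, nothing more to activate
        if member not in still_active and member not in lost:
            still_active.add(member)
            activated.append(member)
    return still_active, activated


ranked = ["A", "B", "C", "D"]
new_active, activated = restore_active_goal(
    ranked, active={"A"}, lost={"A"}, active_goal=1
)
```

Here the single active member A is lost, so the highest-ranking remaining member, B, is chosen to activate in its place.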