Copyright © Cloud Software Group, Inc. All Rights Reserved



Detecting Member Failure
Members can fail for several reasons (this list is not exhaustive):
 
Rendezvous fault tolerance software does not distinguish between these failures. In each case, the failed member cannot fulfill its mission (or can fulfill it only locally), and another member must take its place.
Rendezvous fault tolerance software detects failure of an active member in two ways—heartbeat tracking and independent confirmation.
Heartbeat Tracking
The inactive members of a group listen for heartbeat messages from all of the active members. A steady stream of heartbeat messages is an important indicator of process health. While the heartbeat continues, no action is needed. If the heartbeat messages from an active member cease to arrive, then Rendezvous fault tolerance software considers that member to be lost, and instructs the ranking inactive member to activate, replacing the lost member.
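The heartbeat logic described above can be illustrated with a minimal sketch. It is not the Rendezvous API: the class `HeartbeatTracker`, the function `replace_lost_members`, the `timeout` parameter, and the representation of inactive members as a list ordered by rank are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class HeartbeatTracker:
    """Hypothetical helper: remembers when each active member was last heard."""
    timeout: float  # assumed: seconds of silence after which a member is lost
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, member: str, now: float) -> None:
        # Called whenever a heartbeat message from an active member arrives.
        self.last_seen[member] = now

    def lost_members(self, now: float) -> List[str]:
        # A member is considered lost once its heartbeat has been silent
        # for longer than the timeout.
        return sorted(m for m, t in self.last_seen.items()
                      if now - t > self.timeout)


def replace_lost_members(tracker: HeartbeatTracker, active: Set[str],
                         inactive_ranked: List[str], now: float) -> List[str]:
    """Activate the ranking inactive member for each lost active member.

    inactive_ranked is assumed to be ordered with the ranking member first.
    Returns the members that were activated.
    """
    activated = []
    for member in tracker.lost_members(now):
        active.discard(member)
        del tracker.last_seen[member]
        if inactive_ranked:
            replacement = inactive_ranked.pop(0)
            active.add(replacement)
            tracker.record_heartbeat(replacement, now)
            activated.append(replacement)
    return activated


# Example: member A stops sending heartbeats after time 0.
tracker = HeartbeatTracker(timeout=3.0)
active = {"A"}
inactive_ranked = ["B", "C"]
tracker.record_heartbeat("A", now=0.0)

# At time 2.0 the heartbeat is still recent, so no action is needed.
print(replace_lost_members(tracker, active, inactive_ranked, now=2.0))
# At time 5.0 the silence exceeds the timeout, so B replaces A.
print(replace_lost_members(tracker, active, inactive_ranked, now=5.0))
```

In this sketch the caller supplies the clock value, which keeps the example deterministic. A real member process would instead be driven by timers and by the arrival of messages.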
Independent Confirmation
Rendezvous fault tolerance software also detects a set of events that indicate the loss of an active member. Consider these example events:
When Rendezvous fault tolerance software detects such events, it restores the active goal (the number of members that should be active) by directing member processes to activate.
