Fault Tolerance

In nearly every enterprise, mission-critical programs must continue to function properly despite sudden difficulties such as process termination, hardware failure, and network disconnection. Fault tolerance in a network environment is characterized by rapid recovery from such failures.

Some fault-tolerant distributed programs keep service interruptions to a minimum by using redundant processes that cooperate across the network. Rendezvous fault tolerance software facilitates the development of distributed programs that use redundant processes for fault tolerance.

Rendezvous fault tolerance software helps your program achieve fault tolerance by coordinating a group of redundant processes. Some processes actively fulfill the tasks of the program, while other processes wait in readiness. When one of the active processes fails, another process rapidly assumes active duty.
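The active and waiting roles described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions and not the Rendezvous API: the names Member, Group, and active_goal are hypothetical, and promoting waiting members in join order is a simplification chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    active: bool = False
    alive: bool = True


@dataclass
class Group:
    """Hypothetical coordinator that keeps `active_goal` live members active."""
    active_goal: int
    members: list = field(default_factory=list)

    def join(self, name):
        member = Member(name)
        self.members.append(member)
        self.rebalance()
        return member

    def fail(self, name):
        # Mark the named member as out of service, then fill the gap.
        for m in self.members:
            if m.name == name:
                m.alive = False
                m.active = False
        self.rebalance()

    def rebalance(self):
        # Activate waiting members, in join order, until the goal is met.
        active = sum(1 for m in self.members if m.alive and m.active)
        for m in self.members:
            if active >= self.active_goal:
                break
            if m.alive and not m.active:
                m.active = True
                active += 1

    def active_names(self):
        return [m.name for m in self.members if m.alive and m.active]


group = Group(active_goal=1)
for name in ("A", "B", "C"):
    group.join(name)
before = group.active_names()   # A is active; B and C wait in readiness
group.fail("A")
after = group.active_names()    # B has assumed active duty
print(before, after)
```

In this model a single call to fail() stands in for whatever mechanism detects the failure; the point is only that the group, not the individual process, decides which member is active.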

Rendezvous fault tolerance software supports any number of cooperating processes connected by a local or wide-area network. It monitors the health of cooperating processes, determines when a key process is no longer in service, and instructs another process to take its place.
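One common way to monitor process health is to track periodic heartbeat signals and presume failure after a period of silence. The sketch below assumes that approach for illustration; the text above does not specify the mechanism. HeartbeatTracker, choose_replacement, and the timeout value are hypothetical names and numbers, not part of Rendezvous. Timestamps are passed in explicitly so the example runs without waiting.

```python
class HeartbeatTracker:
    """Hypothetical health monitor; times are plain seconds supplied by the caller."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, name, now):
        # Record the most recent time this process reported in.
        self.last_seen[name] = now

    def failed(self, now):
        # A process is presumed out of service once it has been silent
        # for longer than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


def choose_replacement(standbys, failed):
    """Return the first standby that is not itself presumed failed, or None."""
    for name in standbys:
        if name not in failed:
            return name
    return None


tracker = HeartbeatTracker(timeout=3.0)
tracker.heartbeat("primary", now=0.0)
tracker.heartbeat("standby-1", now=0.0)
tracker.heartbeat("standby-2", now=0.0)

# Only the standbys keep reporting in.
tracker.heartbeat("standby-1", now=4.0)
tracker.heartbeat("standby-2", now=4.0)

down = tracker.failed(now=5.0)
replacement = choose_replacement(["standby-1", "standby-2"], down)
print(down, replacement)
```

A shorter timeout detects failures sooner but is more likely to misjudge a slow network as a dead process, which matters when the cooperating processes span a wide-area network.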

Rendezvous fault tolerance software is fast and compact, and it adds little overhead to programs.

You can use Rendezvous fault tolerance software to design fault-tolerant behavior into programs from the start, or to retrofit existing programs for fault tolerance.

Programs can passively monitor the number of active members in a fault tolerance group, whether or not the monitoring program is itself fault-tolerant.
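Passive monitoring can be pictured as an observer that receives status reports and notifies its owner whenever the count of active members changes, without ever joining the group. The following is a minimal sketch under that assumption; GroupMonitor, observe, and on_change are hypothetical names and do not correspond to Rendezvous calls.

```python
class GroupMonitor:
    """Hypothetical passive observer: it counts active members but never joins the group."""

    def __init__(self, on_change):
        self.on_change = on_change
        self.active = set()

    def observe(self, name, is_active):
        before = len(self.active)
        if is_active:
            self.active.add(name)
        else:
            self.active.discard(name)
        # Notify only when the number of active members actually changes.
        if len(self.active) != before:
            self.on_change(len(self.active))


counts = []
monitor = GroupMonitor(on_change=counts.append)
monitor.observe("A", True)
monitor.observe("B", True)
monitor.observe("B", True)    # duplicate report, count unchanged
monitor.observe("A", False)
print(counts)
```

A program could use such a callback to raise an alert when the count falls below the number of active members it expects.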

Fault Tolerance versus Distributed Queues

Fault tolerance usually requires that every member of a fault tolerance group receive each message. In contrast, each message to a distributed queue group is received by exactly one worker in the group. These mutually exclusive semantics cannot coexist in the same distributed application program. That is, a program cannot simultaneously be a member of a fault tolerance group and a member of a distributed queue.
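The difference between the two delivery semantics can be shown side by side. This sketch models only the message distribution and is not the Rendezvous API; the function names are hypothetical, and round-robin assignment of queue messages is an assumption made for the example.

```python
from itertools import cycle


def deliver_to_group(members, messages):
    """Fault tolerance style: every member receives every message."""
    return {m: list(messages) for m in members}


def deliver_to_queue(workers, messages):
    """Distributed queue style: each message goes to exactly one worker.

    Round-robin assignment is an assumption made for this sketch.
    """
    received = {w: [] for w in workers}
    for worker, message in zip(cycle(workers), messages):
        received[worker].append(message)
    return received


messages = ["m1", "m2", "m3"]
group_view = deliver_to_group(["A", "B"], messages)
queue_view = deliver_to_queue(["A", "B"], messages)
print(group_view)
print(queue_view)
```

In the first result both members hold all three messages, so either can take over for the other. In the second, no worker holds the full stream, which is why a single program cannot satisfy both contracts at once.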