Simple Fault Tolerance with Groups

Application programs can use the group facility to coordinate fault-tolerant operation.

For example, consider an application program that can operate in either of two roles, according to its application logic:

  • A1 is the active role: the program subscribes to messages, processes each message, and sends another message in response.
  • A2 is the standby role: the program subscribes to messages, but neither processes them nor sends responses.

    Simple Fault Tolerance, Role Behavior
    Ordinal         Role   Description
    1               A1     Actively process messages
    2 or greater    A2     Standby

When a process instance of the program starts, it joins group_A and receives its ordinal. The first process to start (P1) receives ordinal 1, so according to its application logic, it enters role A1: subscribing, receiving, processing, and sending messages. The second process to join the group (P2) receives ordinal 2, so it enters the A2 standby role. (Any additional processes that join the group receive ordinals 3, 4, 5, in sequence, and also enter the A2 standby role.) In the following Timeline table, time t1 describes this state.
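
The mapping from ordinal to role might be sketched as follows. This is a minimal illustration, not part of the group facility's API: the names ACTIVE, STANDBY, and role_for_ordinal are hypothetical, and rejecting ordinals below 1 is an assumption of this sketch.

```python
ACTIVE = "A1"    # subscribe, process each message, send a response
STANDBY = "A2"   # subscribe only; do not process or respond


def role_for_ordinal(ordinal: int) -> str:
    """Return the role for a member ordinal, per the Role Behavior table."""
    if ordinal < 1:
        # Assumption of this sketch: only member ordinals (1, 2, 3, ...) are valid here.
        raise ValueError(f"not a member ordinal: {ordinal}")
    return ACTIVE if ordinal == 1 else STANDBY
```

With this mapping, P1 (ordinal 1) takes role A1, and every process with a higher ordinal takes role A2.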

Simple Fault Tolerance, Timeline
Time  Process P1                 Process P2      Process P3      Process P4      Process P5
t1    Ord=1, Role=A1             Ord=2, Role=A2  Ord=3, Role=A2  Ord=4, Role=A2  Ord=5, Role=A2
t2    Ord=-1, Role=A2 (or exit)  Ord=1, Role=A1  Ord=2, Role=A2  Ord=3, Role=A2  Ord=4, Role=A2
t3    Ord=5, Role=A2             Ord=1, Role=A1  Ord=2, Role=A2  Ord=3, Role=A2  Ord=4, Role=A2

If P1 exits, or becomes disconnected from the group service, as at time t2, then the remaining group members receive new ordinals within the group, usually one less than their previous ordinals. In particular, process P2 receives ordinal 1, enters role A1, and begins processing messages and sending responses.
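
The transition from t1 to t2 can be modeled with a toy, in-memory group. This sketch is only an illustration: ToyGroup is a hypothetical stand-in, not the group service, and it assumes the simplest reassignment policy, in which every member behind the departed one moves down by one ordinal.

```python
class ToyGroup:
    """In-memory stand-in for a group service (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._members: list[str] = []   # index i holds ordinal i + 1

    def join(self, name: str) -> None:
        # A new member takes the next ordinal after the current members.
        self._members.append(name)

    def leave(self, name: str) -> None:
        # Removing a member shifts every later member down by one ordinal.
        self._members.remove(name)

    def ordinals(self) -> dict[str, int]:
        return {name: i + 1 for i, name in enumerate(self._members)}


group = ToyGroup()
for name in ["P1", "P2", "P3", "P4", "P5"]:
    group.join(name)
t1 = group.ordinals()   # state at time t1: P1 holds ordinal 1

group.leave("P1")       # P1 exits or is disconnected
t2 = group.ordinals()   # state at time t2: P2 now holds ordinal 1
```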

Meanwhile, if process P1 is still running while disconnected from the group service, then the group facility assigns it ordinal -1 and attempts to reconnect to the group service. The program can either exit, or enter the standby role A2, according to its program logic.
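
A program's reaction to an ordinal update, including the disconnected case, might look like the sketch below. The function next_action, the flag exit_when_disconnected, and the returned strings are all hypothetical names chosen for illustration; how a real program is notified of ordinal changes is not shown here.

```python
DISCONNECTED = -1   # ordinal reported while cut off from the group service


def next_action(ordinal: int, exit_when_disconnected: bool) -> str:
    """Decide what the process should do after an ordinal update."""
    if ordinal == DISCONNECTED:
        # Program logic chooses: stop entirely, or wait in standby (role A2).
        return "exit" if exit_when_disconnected else "standby"
    # Connected members follow the Role Behavior table.
    return "active" if ordinal == 1 else "standby"
```

In either branch of the disconnected case, the process stops acting in role A1, so it does not process messages in parallel with the new holder of ordinal 1.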

If P1 restarts, or reconnects to the group service, as at time t3, then it receives the lowest unassigned ordinal (ordinal 5 in the Timeline table, because ordinals 1 through 4 are already assigned), and operates in the corresponding role. Notice that P1 does not resume with ordinal 1: instead, P2 retains ordinal 1.
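
The "lowest unassigned ordinal" rule can be written out as a short function. This is a sketch of the rule as described above, not the group service's implementation; lowest_unassigned is a hypothetical name.

```python
def lowest_unassigned(assigned: set[int]) -> int:
    """Return the smallest positive ordinal not currently held by a member."""
    ordinal = 1
    while ordinal in assigned:
        ordinal += 1
    return ordinal


# At t3, P2 through P5 hold ordinals 1 through 4, so a rejoining P1 receives 5.
rejoin_ordinal = lowest_unassigned({1, 2, 3, 4})
```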