Simple Fault Tolerance with Groups
Application programs can use the group facility to coordinate fault-tolerant operation.
For example, consider an application program that can operate in either of two roles, according to its application logic:
- A1 is the active role: the program subscribes to messages, processes each message, and sends another message in response.
- A2 is the standby role: the program subscribes to messages, but neither processes them nor sends responses.
When a process instance of the program starts, it joins group_A and receives its ordinal. The first process to start (P1) receives ordinal 1, so according to its application logic, it enters role A1: subscribing, receiving, processing, and sending messages. The second process to join the group (P2) receives ordinal 2, so it enters the A2 standby role. (Any additional processes that join the group receive ordinals 3, 4, 5, and so on, in sequence, and also enter the A2 standby role.) In the following timeline table, time t1 describes this state.
Time | Process P1 | Process P2 | Process P3 | Process P4 | Process P5 |
---|---|---|---|---|---|
t1 | Ord=1, Role=A1 | Ord=2, Role=A2 | Ord=3, Role=A2 | Ord=4, Role=A2 | Ord=5, Role=A2 |
t2 | Ord=-1, Role=A2 (or exit) | Ord=1, Role=A1 | Ord=2, Role=A2 | Ord=3, Role=A2 | Ord=4, Role=A2 |
t3 | Ord=5, Role=A2 | Ord=1, Role=A1 | Ord=2, Role=A2 | Ord=3, Role=A2 | Ord=4, Role=A2 |
If P1 exits, or becomes disconnected from the group server, as at time t2, then the remaining group members receive new ordinals within the group, usually one less than their existing ordinals. In particular, process P2 receives ordinal 1, enters role A1, and begins processing messages and sending responses.
Meanwhile, if process P1 is still running while disconnected from the group server, then the group facility assigns it ordinal -1 and attempts to reconnect to the group server. The program can either exit, or enter the standby role A2, according to its program logic.
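The mapping from ordinal to role described so far can be sketched in a few lines of Python. This is only an illustration of the application logic, not the group facility's API: the names `Role`, `role_for_ordinal`, `Worker`, and `on_ordinal_update` are hypothetical, and the sketch assumes the program chooses to enter standby (rather than exit) when it receives ordinal -1.

```python
from enum import Enum

# Ordinal the text describes for a process that is disconnected
# from the group server.
DISCONNECTED = -1


class Role(Enum):
    ACTIVE = "A1"   # subscribe, process each message, send a response
    STANDBY = "A2"  # subscribe only; neither process nor respond


def role_for_ordinal(ordinal: int) -> Role:
    """Only ordinal 1 is active; every other ordinal, including -1, stands by."""
    return Role.ACTIVE if ordinal == 1 else Role.STANDBY


class Worker:
    """Hypothetical application object that tracks its current role."""

    def __init__(self) -> None:
        self.role = Role.STANDBY

    def on_ordinal_update(self, ordinal: int) -> Role:
        # Assumed to be called by the application whenever it learns
        # of a new ordinal; switches role only when the role changes.
        new_role = role_for_ordinal(ordinal)
        if new_role is not self.role:
            self.role = new_role
        return self.role


worker = Worker()
worker.on_ordinal_update(2)             # joins as the second member: standby
worker.on_ordinal_update(1)             # promoted after P1 leaves: active
worker.on_ordinal_update(DISCONNECTED)  # loses the group server: standby
```

A program that prefers to exit on disconnection would replace the standby branch for `DISCONNECTED` with its own shutdown logic.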
If P1 restarts, or reconnects to the group server, as at time t3, then it receives the lowest unassigned ordinal (5 in this example) and operates in the corresponding role. Notice that P1 does not resume with ordinal 1; instead, P2 retains ordinal 1 and remains in role A1.
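To make the timeline concrete, the following toy simulation reproduces the ordinals at t1, t2, and t3. It is a sketch under stated assumptions, not the group server's actual algorithm: `GroupSim` and its methods are hypothetical, and the renumbering rule (every member above the lost ordinal moves down by one) is a simplification of the "usually decrementing" behavior described above.

```python
class GroupSim:
    """Toy stand-in for a group server; not the real group facility."""

    def __init__(self) -> None:
        self.ordinals: dict[str, int] = {}  # member name -> ordinal

    def join(self, name: str) -> int:
        # A joining (or rejoining) member gets the lowest unassigned ordinal.
        taken = {o for o in self.ordinals.values() if o != -1}
        ordinal = 1
        while ordinal in taken:
            ordinal += 1
        self.ordinals[name] = ordinal
        return ordinal

    def disconnect(self, name: str) -> None:
        lost = self.ordinals[name]
        if lost == -1:
            return  # already disconnected; nothing to renumber
        self.ordinals[name] = -1
        # Assumed rule: members ranked after the lost one move up by one place.
        for member, ordinal in list(self.ordinals.items()):
            if ordinal > lost:
                self.ordinals[member] = ordinal - 1


sim = GroupSim()
for name in ("P1", "P2", "P3", "P4", "P5"):
    sim.join(name)
t1 = dict(sim.ordinals)  # P1 holds ordinal 1

sim.disconnect("P1")
t2 = dict(sim.ordinals)  # P1 holds -1; P2 now holds ordinal 1

sim.join("P1")
t3 = dict(sim.ordinals)  # P1 returns with ordinal 5; P2 keeps ordinal 1
```

The snapshots `t1`, `t2`, and `t3` correspond row by row to the timeline table.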