Step 4: Choose the Intervals
The Fault Tolerance Interval Parameters table summarizes the four interval parameters that regulate the behavior of Rendezvous fault tolerance software. It is important that you choose appropriate values for these interval parameters.
Choosing the intervals requires a balance among several considerations:
• The need for uninterrupted service.
Ideally, critical applications must run with only minimal interruptions in service. Realistically, it takes time to discover a service interruption. You can reduce this time to the minimum that your network can support, but at the cost of network capacity and computer time.
• Network transmission time.
It takes time for heartbeat messages to traverse the network, and that time varies with distance and network load. This fact limits the minimum achievable heartbeat interval, which in turn limits the minimum achievable activation interval.
• Finite network capacity.
The network that carries heartbeat messages also carries application data. Smaller heartbeat intervals imply more frequent heartbeats. Avoid cluttering the network with too-frequent heartbeat messages.
Parameter | Description
Heartbeat Interval | Each active member broadcasts a sequence of heartbeat messages to inform the other group members that it is still active. The heartbeat interval determines the time between heartbeat messages. Parameter to the member creation call.
Activation Interval | Inactive members track heartbeat messages from each active member. When the time since the last heartbeat from an active member reaches this activation interval, Rendezvous fault tolerance software instructs the ranking inactive member to activate. Parameter to the member creation call.
Preparation Interval | Some programs require advance notice to prepare before activation. When the time since the last heartbeat from an active member equals this preparation interval, Rendezvous fault tolerance software issues a hint to the ranking inactive member, so it can prepare to activate. Parameter to the member creation call.
Lost Interval | Monitor functions passively track heartbeat messages from active members of a fault tolerance group. When the time since the last heartbeat from an active member reaches this lost interval, Rendezvous fault tolerance software considers that member lost, and calls the monitor callback, passing it the current number of active members. Parameter to the monitor creation call.
First: Determine the Activation Interval
The activation interval influences the longest service interruption in two situations:
• When a new member joins a fault tolerance group, the initialization phase requires one activation interval before the new member can become active.
• In most failure situations, the maximum service interruption is identical to the activation interval (assuming an inactive member exists).
In each case, you must determine the amount of time that can elapse before interrupted service becomes a problem. Use an activation interval equal to that time.
Lower Bound
Use an activation interval no less than 3 seconds, though Rendezvous fault tolerance software accepts lower values. However, if your application is distributed across a WAN, use an activation interval no less than 10 seconds.
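The lower bounds above can be written as a small pre-flight check. The following is a minimal sketch; the constant and function names are hypothetical and are not part of the Rendezvous API.

```python
# Hypothetical helper for illustration; not part of the Rendezvous API.
LAN_MIN_ACTIVATION = 3.0   # seconds, recommended floor on a local network
WAN_MIN_ACTIVATION = 10.0  # seconds, recommended floor across a WAN


def meets_activation_floor(activation: float, wan: bool = False) -> bool:
    """Return True if the activation interval (seconds) meets the recommended floor."""
    floor = WAN_MIN_ACTIVATION if wan else LAN_MIN_ACTIVATION
    return activation >= floor
```

Because the software itself accepts lower values, a check like this would only warn about a value below the recommendation, not reject it.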
Second: Determine the Heartbeat Interval
Lower Bounds
Use a heartbeat interval no lower than 1 second, though Rendezvous fault tolerance software accepts lower values.
However, wide-area links transmit heartbeats more slowly (and at greater cost) than local networks. If your application is distributed across a WAN, use a heartbeat interval no less than 2 seconds.
Relationship between Activation and Heartbeat Interval
The heartbeat interval must be strictly less than the activation interval.
Our experience indicates that in most situations, the optimal heartbeat interval is slightly less than one third of the activation interval. For example, an activation interval of 10 seconds implies a heartbeat interval of 3 seconds.
However, messages traversing wide-area links show greater variability in arrival time than messages on local networks. If your application is distributed across a WAN, use a heartbeat interval that is no more than one fifth of the activation interval. For example, an activation interval of 30 seconds implies a heartbeat interval of 6 seconds or less.
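The heartbeat guidelines in this section can be collected into one check. This sketch uses a hypothetical function name and is not part of the Rendezvous API. It treats the WAN ratio as inclusive (one fifth or less), following the 30-second example above, and treats the one-third ratio as a strict upper bound.

```python
# Hypothetical checker for illustration; not part of the Rendezvous API.
def check_heartbeat(heartbeat: float, activation: float, wan: bool = False) -> list:
    """Return a list of problems; an empty list means the pair follows the guidelines."""
    problems = []

    # Hard rule: the heartbeat interval must be strictly less than the activation interval.
    if heartbeat >= activation:
        problems.append("heartbeat must be strictly less than activation")

    # Recommended floor: 1 second on a local network, 2 seconds across a WAN.
    floor = 2.0 if wan else 1.0
    if heartbeat < floor:
        problems.append(f"heartbeat is below the recommended {floor:g} s floor")

    # Recommended ratio: under one third locally, one fifth or less across a WAN.
    if wan:
        if heartbeat > activation / 5:
            problems.append("WAN heartbeat exceeds one fifth of activation")
    elif heartbeat >= activation / 3:
        problems.append("heartbeat is not less than one third of activation")

    return problems
```

With the examples from the text, `check_heartbeat(3, 10)` and `check_heartbeat(6, 30, wan=True)` both return an empty list.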
Conserving Network Capacity
It is important to conserve network capacity (bandwidth). Each heartbeat is a message. Each active member sends one message at every heartbeat interval. If the heartbeat interval is too small, then your program may overload the network with heartbeat messages.
Once you have established the activation and heartbeat intervals for your application, apply this reality check. Calculate the number of heartbeat messages that all the active members of your program will send. Does this figure still leave network capacity for other programs? If not, increase the heartbeat and activation intervals accordingly.
For example, if the heartbeat interval is 0.1 seconds, and an application requires one active member, then the network must carry 10 messages per second to sustain the heartbeat signal. If the application requires 50 active members, then the network must carry 500 messages per second to sustain the heartbeat signals.
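The arithmetic behind this reality check is simple enough to script. The sketch below assumes, as the text states, that each active member sends one message per heartbeat interval; the function name is hypothetical.

```python
# Hypothetical helper for illustration; not part of the Rendezvous API.
def heartbeat_messages_per_second(heartbeat_interval: float, active_members: int) -> float:
    """Aggregate heartbeat rate: one message per active member per heartbeat interval."""
    return active_members / heartbeat_interval


# The two cases from the text: a 0.1 second heartbeat with 1 and with 50 active members.
single_member_rate = heartbeat_messages_per_second(0.1, 1)
fifty_member_rate = heartbeat_messages_per_second(0.1, 50)
```

Comparing the result against the message rate your network can spare for heartbeats indicates whether the heartbeat and activation intervals need to grow.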
Third: Determine the Preparation Interval
The last step is to determine whether the program requires time to prepare before it can activate, and if so, the length of time it needs.
If the program needs no preparation time, then supply zero as the preparation interval.
If the program does need preparation time to complete set-up tasks, estimate the length of time needed. Subtract that time from the activation interval to obtain the preparation interval.
Relationship between Preparation and Activation Interval
If non-zero, the preparation interval must be strictly greater than the heartbeat interval, and strictly less than the activation interval. Choose a preparation interval that is greater than twice the heartbeat interval.
For programs that require preparation time, use a preparation interval no less than 75% of the activation interval. For example, an activation interval of 10 seconds implies a preparation interval of 7.5–9.5 seconds. Smaller preparation intervals may increase the rate of false-positive prepare-to-activate hints.
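The preparation rules can be sketched as two small helpers: one derives the preparation interval from an estimated set-up time, and the other checks it against the constraints above. Both names are hypothetical and are not part of the Rendezvous API.

```python
# Hypothetical helpers for illustration; not part of the Rendezvous API.
def preparation_from_setup_time(activation: float, setup_time: float) -> float:
    """Subtract the estimated set-up time from the activation interval.

    Returns zero when the program needs no preparation time.
    """
    if setup_time <= 0:
        return 0.0
    return activation - setup_time


def check_preparation(preparation: float, heartbeat: float, activation: float) -> list:
    """Return a list of problems; an empty list means the value follows the guidelines."""
    problems = []
    if preparation == 0:
        return problems  # zero means the program needs no preparation time

    # Hard rule: strictly between the heartbeat and activation intervals.
    if not (heartbeat < preparation < activation):
        problems.append("preparation must lie strictly between heartbeat and activation")

    # Recommendations: more than twice the heartbeat, at least 75% of activation.
    if preparation <= 2 * heartbeat:
        problems.append("preparation should be greater than twice the heartbeat")
    if preparation < 0.75 * activation:
        problems.append("preparation should be at least 75% of activation")
    return problems
```

For example, with an activation interval of 10 seconds, a heartbeat interval of 3 seconds, and an estimated set-up time of 2 seconds, the derived preparation interval is 8 seconds, which passes every check.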
For Monitors: Determine the Lost Interval
When monitoring a fault tolerance group, the lost interval argument must equal the activation interval of the group.
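One way to keep the lost interval in step with the group is to derive it from a single shared definition instead of configuring it separately. The class below is a hypothetical sketch of that idea, not a Rendezvous type.

```python
from dataclasses import dataclass


# Hypothetical container for illustration; not part of the Rendezvous API.
@dataclass(frozen=True)
class GroupIntervals:
    """Interval settings (in seconds) shared by the members and monitors of one group."""
    heartbeat: float
    activation: float
    preparation: float = 0.0  # zero means no preparation time is needed

    @property
    def lost(self) -> float:
        # Monitors use a lost interval equal to the group's activation interval.
        return self.activation
```

A monitor that reads `GroupIntervals(heartbeat=3, activation=10).lost` always receives the same value that the members use as their activation interval.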