Step 3: Plan Program Behavior

Copyright © Cloud Software Group, Inc. All Rights Reserved

Consider these issues early in the design phase of your program.

• Parallel Data State.
• Continuity—Track Active Backlog.
• Activation.
• Preparing to Activate.
• Deactivation.
• Serve It Once.
• Send it Once.

Parallel Data State

An inactive member must be ready to activate in the same data state as the formerly active member it replaces. In some situations data state is irrelevant. In other situations it is straightforward to duplicate the data state either by copying and reading a state file, or by completing a brief computation. However, in some situations the data state is complex, or the result of cumulative operations, so the best way to maintain readiness is to compute a parallel data state while inactive.

Example: Current Value Cache

The rvcache utility stores the most recent message for each subject name. Whenever a program queries for a cached subject, rvcache sends the program the current data corresponding to that subject. (For more information, see Current Value Cache in TIBCO Rendezvous Administration.)

Two or more rvcache processes can cooperate for fault-tolerant operation, with only one active process. All member processes (whether active or inactive) passively collect and store the same data—but only the active process responds by sending the current data when a program sends a query. Every inactive member always has all the cached data it needs to begin active duty; the data state of each inactive member is parallel to that of the active member.

Notice that in this application the inactive members are far from idle; they collect and store data just like the active member.

Furthermore, when starting a new rvcache process, the administrator can copy the store file from another fault-tolerant member, in order to initialize its database to contain the same data as existing member processes.
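The parallel data state idea can be illustrated with a minimal sketch. It does not use the Rendezvous API or reproduce how rvcache is implemented; the CacheMember class and its method names are hypothetical, and the subject name is invented for the example. Every member stores each message, but only the active member answers queries.

```python
class CacheMember:
    """One member of a fault-tolerant cache group (hypothetical sketch)."""

    def __init__(self):
        self.active = False
        self.latest = {}  # subject -> most recent message

    def on_message(self, subject, message):
        # Active and inactive members both store every message.
        self.latest[subject] = message

    def on_query(self, subject):
        # Only the active member answers; inactive members stay silent.
        if not self.active:
            return None
        return self.latest.get(subject)


primary, standby = CacheMember(), CacheMember()
primary.active = True
for member in (primary, standby):
    member.on_message("quotes.ACME", 101.5)

# The primary fails; the standby activates with the same data state.
primary.active, standby.active = False, True
answer = standby.on_query("quotes.ACME")
```

Because the standby collected the same messages while inactive, it needs no catch-up step before it answers its first query.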

Continuity—Track Active Backlog

Some applications depend on a continuous stream of data. They must receive all the data—even if they receive it late. For the programs that produce that data, it is essential to maintain continuity of the outbound data stream.

Although Rendezvous fault tolerance software quickly restores service, a finite service interruption always exists between the failure of an active member and the activation of an inactive member. When continuity is essential, it is the responsibility of the inactive members to maintain continuity across the service interruption.

Inactive members maintain continuity by tracking the backlog from the active member. That is, the inactive member retains enough information to reproduce the expected output of the active member during the longest service interruption. When it activates, it produces that backlog output before processing any new data. Although the backlog output is delayed, no holes appear in the output stream.

Example: Data Distribution

Many enterprises require access to prodigious amounts of data, which must flow to decision makers in a timely fashion, without interruption. Many organizations use data distribution software that receives data from a serial port, processes it, and broadcasts it across a network to numerous computer workstations.

To ensure continuous service, data distribution software can operate in fault tolerance pairs, with one active member and one inactive member. Each member receives the same data, and each member processes the data, but only the active member broadcasts the data.

Once the active member has broadcast a data item, it can discard that data item.

However, the inactive member must hold the data until it receives the corresponding broadcast item from the active member. To see why, consider the service interruption between the time that the active member fails and the time that the inactive member activates. During the service interruption data continues to arrive, but neither member is broadcasting that data. When the inactive member activates, it must broadcast that backlog data—filling the gap in the data stream. To support this behavior, the inactive member can discard a data item only after confirming that the active member has broadcast it.

Notice that in this application the inactive member does work that the active member does not do; in addition to processing the same set of data items, the inactive member must also retain data, and discard it only at the proper time.
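The hold-and-discard rule described above might be sketched as follows. This is an illustration under stated assumptions, not a Rendezvous implementation: the Distributor class, its method names, and the item ids are hypothetical, the upper-casing step stands in for real processing, and the sketch assumes each input item reaches the inactive member before the active member's broadcast of that item does.

```python
from collections import OrderedDict


class Distributor:
    """Data distribution member that tracks the active backlog (sketch)."""

    def __init__(self, send):
        self.active = False
        self.send = send              # callable that broadcasts one item
        self.pending = OrderedDict()  # item id -> processed item, in arrival order

    def on_input(self, item_id, raw):
        processed = raw.upper()       # stand-in for real processing
        if self.active:
            self.send(item_id, processed)      # active: broadcast, then discard
        else:
            self.pending[item_id] = processed  # inactive: hold until confirmed

    def on_peer_broadcast(self, item_id):
        # The active member has broadcast this item, so it is safe to discard.
        self.pending.pop(item_id, None)

    def activate(self):
        self.active = True
        # Broadcast the backlog, oldest first, before handling any new data.
        while self.pending:
            item_id, processed = self.pending.popitem(last=False)
            self.send(item_id, processed)


sent = []
standby = Distributor(lambda item_id, data: sent.append((item_id, data)))
standby.on_input(1, "a")
standby.on_input(2, "b")
standby.on_peer_broadcast(1)  # the active member sent item 1, then failed
standby.on_input(3, "c")      # arrives during the service interruption
standby.activate()
standby.on_input(4, "d")
```

After activation, sent holds items 2, 3, and 4 in order: the backlog is delayed, but the output stream has no hole and item 1 is not repeated.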

Activation

Consider the actions that your program takes to switch from inactive to active.

In some programs the state change in the callback function is as straightforward as toggling a flag; functions throughout program code can branch on the flag to determine inactive or active behavior. Other programs must open data files, open communication lines, allocate resources, begin listening to Rendezvous subjects, or set timers to trigger computations.

Remember, each step delays activation. Whenever resources permit, minimize the steps that wait until activation time; taking these steps at start time results in quicker activation.

If the program must maintain continuity after a service interruption, see Continuity—Track Active Backlog.

Arrange for any needed transition steps in the program’s fault tolerance callback function.
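The flag-toggling style of transition can be sketched as follows. The Action enumeration, the Member class, and ft_callback are hypothetical names chosen for this illustration; they only mirror the activate, deactivate, and prepare-to-activate instructions discussed in this chapter and are not the Rendezvous API.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical stand-ins for the instructions a fault tolerance
    # callback can receive.
    PREPARE_TO_ACTIVATE = 1
    ACTIVATE = 2
    DEACTIVATE = 3


class Member:
    def __init__(self):
        # Anything that can be done at start time is done here,
        # so that activation itself stays quick.
        self.active = False
        self.log = []

    def ft_callback(self, action):
        # The whole state change is a flag toggle.
        if action is Action.ACTIVATE:
            self.active = True
        elif action is Action.DEACTIVATE:
            self.active = False
        # PREPARE_TO_ACTIVATE needs no work in this simple program.

    def on_data(self, item):
        # Code elsewhere in the program branches on the flag.
        if self.active:
            self.log.append(("sent", item))
        else:
            self.log.append(("held", item))


member = Member()
member.on_data("x")
member.ft_callback(Action.ACTIVATE)
member.on_data("y")
member.ft_callback(Action.DEACTIVATE)
member.on_data("z")
```

Keeping the callback this small is only possible when the remaining setup work has already happened at start time, as recommended above.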

Preparing to Activate

Consider whether any of the activation steps are time-consuming. For example, delays are common when opening an ISDN line or opening a database connection.

If such steps might cause unacceptable delays when the program activates, consider separating those preparations from the actual activation sequence. Instead, do them when the fault tolerance callback function receives a prepare-to-activate hint.

Consider setting a duration limit for preparations. For example, if the program allocates a large block of storage when preparing to activate, set a timer to expire after two or three activation intervals. If the timer expires before an actual instruction to activate, then deallocate the storage. If the call to activate arrives first, cancel the timer.
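The duration limit can be sketched with an explicit clock so that the behavior is easy to follow. All names here (PreparedMember, on_prepare, on_activate, on_tick) are hypothetical, the 1024-byte buffer stands in for a large block of storage, and a real program would use its own timer facility rather than polling a tick method.

```python
class PreparedMember:
    """Holds a prepared resource for a limited time (hypothetical sketch)."""

    def __init__(self, activation_interval, limit_intervals=3):
        self.limit = activation_interval * limit_intervals
        self.buffer = None    # the expensive resource
        self.deadline = None  # None means no timer is pending
        self.active = False

    def on_prepare(self, now):
        # Prepare-to-activate hint: allocate early and start the timer.
        if self.buffer is None:
            self.buffer = bytearray(1024)
        self.deadline = now + self.limit

    def on_activate(self):
        # Instruction to activate: allocate if no hint came, cancel the timer.
        if self.buffer is None:
            self.buffer = bytearray(1024)
        self.deadline = None
        self.active = True

    def on_tick(self, now):
        # Timer expiry without activation: release the storage.
        if self.deadline is not None and now >= self.deadline:
            self.buffer = None
            self.deadline = None


member = PreparedMember(activation_interval=10.0)
member.on_prepare(now=0.0)
member.on_tick(now=29.0)
held_before_limit = member.buffer is not None
member.on_tick(now=30.0)
released_at_limit = member.buffer is None

member.on_prepare(now=40.0)
member.on_activate()
member.on_tick(now=100.0)
kept_after_activation = member.buffer is not None and member.active
```

The first hint expires after three activation intervals and the storage is released; the second hint is followed by activation, which cancels the timer so the storage is kept.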

Deactivation

Consider the actions that the program takes to deactivate. Usually these actions reverse the activation steps, but in some applications it might be more expedient to retain resources (anticipating the need to reactivate).

Arrange for any needed transition steps in the program’s fault tolerance callback function.

Serve It Once

For request server applications, ensure that each request receives service from only one active member. Duplicate service wastes server resources, and could result in incorrect behavior.

Consider whether distributed queues might be a better fit for such applications. See Distributed Queue.

Send it Once

For broadcast producer applications, ensure that members of a fault tolerance group cooperate to send each data item only once. If several processes can be active simultaneously, they must not send duplicate data.
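One possible way for several simultaneously active members to avoid duplicates is to partition the data items among them. The sketch below is only an illustration of that idea and is not something this chapter prescribes: it assumes each item carries a sequence number, that every active member knows how many members are active, and that each has a distinct rank from 0 upward. How a program would agree on those ranks is outside the scope of the sketch, and the function names are hypothetical.

```python
def owns(item_seq, my_rank, active_count):
    # An item belongs to exactly one rank: its sequence number modulo
    # the number of active members.
    return item_seq % active_count == my_rank


def items_sent_by(my_rank, active_count, item_seqs):
    # Each member sees every item but sends only the ones it owns.
    return [seq for seq in item_seqs if owns(seq, my_rank, active_count)]


item_seqs = range(10)
sent_by = [items_sent_by(rank, 3, item_seqs) for rank in range(3)]
all_sent = sorted(seq for part in sent_by for seq in part)
```

With three active members, every one of the ten items appears in exactly one member's list, so the combined output contains each item once.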