Distributed transaction failure handling

Distributed transaction failure handling
Prev	Chapter 6. Distributed computing	Next

Any communication failures to remote nodes detected during a global transaction before a commit sequence is started cause an exception that an application can handle (see the TIBCO ActiveSpaces® Transactions Java Developer's Guide). This allows the application to explicitly decide whether to commit or rollback the current transaction. If the exception is not caught, the transaction will be automatically rolled back.

Undetected communication failures to remote nodes do not impact the commit of the transaction. This failure scenario is shown in Figure 6.5, “Undetected communication failure”. In this case, Node 2 failed and was restarted after all locks were taken on Node 2, but before the commit sequence was started by the transaction initiator - Node 1. Once the commit sequence starts it continues to completion. The request to commit is ignored on Node 2 because the transaction state was lost when Node 2 restarted.

Figure 6.5. Undetected communication failure

Transaction initiator node failures are handled transparently using a transaction outcome voting algorithm. There are two cases that must be handled:

Transaction initiator fails before commit sequence starts.
Transaction initiator fails during the commit sequence.

When a node that is participating in a distributed transaction detects the failure of a transaction initiator, it queries all other nodes for the outcome of the transaction. If the transaction was committed on any other participating nodes, the transaction is committed on the node that detected the node failure. If the transaction was aborted on any other participating nodes, the transaction is aborted on the node that detected the failure. If the transaction is still in progress on the other participating nodes, the transaction is aborted on the node that detected the failure.

Transaction outcome voting before the commit sequence is shown in Figure 6.6, “Transaction initiator fails prior to initiating commit sequence”. In Figure 6.6, “Transaction initiator fails prior to initiating commit sequence” the initiating node, Node 1, fails before initiating the commit sequence. When Node 2 detects the failure it performs the transaction outcome voting algorithm by querying other nodes in the cluster to see if they are participating in this transaction. Since there are no other nodes in this cluster, the Transaction Status request is a noop and the transaction is immediately aborted on Node 2, releasing all locks held by the distributed transaction.

Figure 6.6. Transaction initiator fails prior to initiating commit sequence

Transaction outcome voting during a commit sequence is shown in Figure 6.7, “Transaction initiator fails during commit sequence”. In Figure 6.7, “Transaction initiator fails during commit sequence” the initiating node, Node 1, fails during the commit sequence after committing the transaction on Node 2, but before it is committed on Node 3. When Node 3 detects the failure it performs the transaction outcome voting algorithm by querying Node 2 for the resolution of the global transaction. Since the transaction was committed on Node 2 it is committed on Node 3.

Figure 6.7. Transaction initiator fails during commit sequence

To support transaction outcome voting each node maintains a history of all committed and aborted transactions for each remote node participating in a global transaction. The number of historical transactions to maintain is configurable and should be based on the time for the longest running distributed transaction. For example, if 1000 transactions per second are being processed from a remote node, and the longest transaction on average is ten times longer than the mean, the transaction history buffer should be configured for 10,000 transactions.

For each transaction from each remote node, the following is captured:

global transaction identifier
node login time-stamp
transaction resolution

The size of each transaction history record is 24 bytes.

Prev	Up	Next
Network error handling	Home	Chapter 7. High availability