Detecting failed nodes

ActiveSpaces® Transactions supports keep-alive messages between all nodes in a cluster. Keep-alive requests actively determine whether a remote node is still reachable. They are sent to remote nodes at the interval configured by keepAliveSendIntervalSeconds.

Figure 6.3, “Keep-alive protocol”, shows how a node is detected as being down. Every time a keep-alive request is sent to a remote node, a timer is started with a duration of nonResponseTimeoutSeconds. This timer is reset when a keep-alive response is received from the remote node. If a keep-alive response is not received within the nonResponseTimeoutSeconds interval, a keep-alive request is sent on the next network interface configured for the node (if any). If there are no other network interfaces configured for the node, or the nonResponseTimeoutSeconds interval has expired on all configured interfaces, all connections to the remote node are dropped, and the remote node is marked Down.

Connection failures to remote nodes are also detected by the keep-alive protocol. When a connection failure is detected, as opposed to a keep-alive response not being received, the connection to the remote node is reattempted once before the next configured network interface for the remote node (if any) is tried. This retry transparently handles transient network connectivity failures without reporting a false node down event.
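The interface failover and connection retry described above can be sketched as a small decision loop. This is only an illustration of the documented behavior, not the product's implementation: the names Outcome, probe_node, and send_keep_alive are hypothetical, and send_keep_alive stands in for "connect if necessary, send one keep-alive request, and wait up to nonResponseTimeoutSeconds".

```python
from enum import Enum


class Outcome(Enum):
    """Possible results of one keep-alive attempt (hypothetical names)."""
    RESPONSE = "response"                # response arrived before the timer expired
    NO_RESPONSE = "no_response"          # timer expired without a response
    CONNECT_FAILURE = "connect_failure"  # connection could not be established


def probe_node(interfaces, send_keep_alive):
    """Walk the configured interfaces in order and decide the node state.

    send_keep_alive is an assumed callable taking an interface and
    returning an Outcome. Returns ("Up", interface) on the first response,
    or ("Down", None) once every interface has been exhausted.
    """
    for interface in interfaces:
        outcome = send_keep_alive(interface)
        if outcome is Outcome.CONNECT_FAILURE:
            # A connection failure is retried once on the same interface
            # before moving on, to ride out transient network problems.
            outcome = send_keep_alive(interface)
        if outcome is Outcome.RESPONSE:
            return ("Up", interface)
        # No response, or the retry also failed: try the next interface.
    # All interfaces exhausted: connections are dropped, node is marked Down.
    return ("Down", None)
```

A missing response moves straight to the next interface, while a connection failure gets one retry first, which is why the two failure modes have different worst-case detection times.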

The total time before a remote node is marked Down depends on the failure mode. When keep-alive responses are not received, it is the number of configured interfaces multiplied by the nonResponseTimeoutSeconds configuration value. When connections fail, it can be as much as twice that (two times nonResponseTimeoutSeconds times the number of configured interfaces), if both connection attempts on each interface (the initial one and the retry) hang while connecting to the remote node.

For example, in the case of keep-alive responses not being received, if there are two network interfaces configured and the nonResponseTimeoutSeconds value is four seconds, it will be eight seconds before the node is marked Down. In the case of connection establishment failures, where each connection attempt hangs, the total time would be sixteen seconds before the node is marked Down.
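The arithmetic above can be captured in a short helper for estimating worst-case detection time when choosing configuration values. This is a minimal sketch. The function name and parameters are illustrative and not part of the product, and it assumes every attempt runs for the full nonResponseTimeoutSeconds.

```python
def worst_case_detection_seconds(interface_count,
                                 non_response_timeout_seconds,
                                 connection_failure=False):
    """Estimate the longest time before a remote node is marked Down.

    interface_count: number of network interfaces configured for the node.
    non_response_timeout_seconds: the nonResponseTimeoutSeconds value.
    connection_failure: True when connection attempts hang (initial attempt
        plus one retry per interface), False when keep-alive responses are
        simply not received (one timeout per interface).
    """
    attempts_per_interface = 2 if connection_failure else 1
    return interface_count * attempts_per_interface * non_response_timeout_seconds


# Two interfaces with a four-second timeout, as in the example above.
missed_responses = worst_case_detection_seconds(2, 4)                          # 8
hung_connections = worst_case_detection_seconds(2, 4, connection_failure=True)  # 16
```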

Figure 6.3. Keep-alive protocol