ActiveSpaces® Transactions supports keep-alive messages between all nodes in a cluster. Keep-alive requests are used to actively determine whether a remote node is still reachable. Keep-alive requests are sent to remote nodes at the configurable keepAliveSendIntervalSeconds interval.
Figure 6.3, “Keep-alive protocol” shows how a node is detected as being down. Every time a keep-alive request is sent to a remote node, a timer is started with a duration of nonResponseTimeoutSeconds. This timer is reset when a keep-alive response is received from the remote node. If a keep-alive response is not received within the nonResponseTimeoutSeconds interval, a keep-alive request is sent on the next network interface configured for the node (if any). If there are no other network interfaces configured for the node, or the nonResponseTimeoutSeconds interval has expired on all configured interfaces, all connections to the remote node are dropped, and the remote node is marked Down.
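
The detection sequence described above can be sketched as a small simulation. This is an illustration only, not the product's implementation: the function name detect_node_state, the interface names, the responds callback, and the "Up" state label are all hypothetical. Only nonResponseTimeoutSeconds and the Down state come from the text.

```python
from typing import Callable, List, Tuple


def detect_node_state(
    interfaces: List[str],
    non_response_timeout_seconds: float,
    responds: Callable[[str], bool],
) -> Tuple[str, float]:
    """Walk the configured interfaces in order, as the text describes.

    Returns the resulting node state and the simulated seconds spent
    waiting for keep-alive responses that never arrived.
    "Up" is a placeholder label; the text only names the Down state.
    """
    elapsed = 0.0
    for interface in interfaces:
        if responds(interface):
            # A keep-alive response arrived, so the node is reachable.
            return "Up", elapsed
        # No response within the timeout: fall through to the next interface.
        elapsed += non_response_timeout_seconds
    # The timeout expired on every configured interface.
    return "Down", elapsed
```

With two hypothetical interfaces and a four second timeout, detect_node_state(["eth0", "eth1"], 4.0, lambda _: False) returns ("Down", 8.0).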
Connection failures to remote nodes are also detected by the keep-alive protocol. When a connection failure is detected, as opposed to a keep-alive response not being received, the connection to the remote node is reattempted before the next configured network interface for the remote node (if any) is tried. This connection reattempt transparently handles transient network connectivity failures without reporting a false node down event.
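
The retry order for connection failures can be sketched as follows. This is a minimal illustration under stated assumptions: connect_with_retry and the try_connect callback are hypothetical names, and the single retry per interface is inferred from the description of an initial attempt and one reattempt.

```python
from typing import Callable, List, Optional


def connect_with_retry(
    interfaces: List[str],
    try_connect: Callable[[str], bool],
) -> Optional[str]:
    """Return the first interface that connects, or None if all fail.

    Each interface gets an initial attempt and one retry before the
    next configured interface is tried.
    """
    for interface in interfaces:
        for _attempt in ("initial", "retry"):
            if try_connect(interface):
                return interface
    # Every attempt on every interface failed.
    return None
```

A transient failure on the first attempt is absorbed by the retry on the same interface, so the caller never sees a failure for that node.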
It is important to understand that, when keep-alive responses are not received, the total time before a remote node is marked Down is the number of configured interfaces multiplied by the nonResponseTimeoutSeconds configuration value. In the case of connection failures, the total time could be twice the nonResponseTimeoutSeconds value multiplied by the number of configured interfaces, if both connection attempts to the remote node (the initial one and the retry) hang while attempting to connect.
For example, in the case of keep-alive responses not being received, if there are two network interfaces configured, and the nonResponseTimeoutSeconds value is four seconds, it will be eight seconds before the node is marked Down. In the case of connection establishment failures, where each connection attempt hangs, the total time would be sixteen seconds before the node is marked Down.
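
The arithmetic in this example can be checked with a short helper. The function name worst_case_seconds_until_down and its parameters are hypothetical and only restate the rule given in the text. They are not part of the product's API.

```python
def worst_case_seconds_until_down(
    interface_count: int,
    non_response_timeout_seconds: int,
    connection_failure: bool = False,
) -> int:
    """Worst-case time before a remote node is marked Down.

    Missing keep-alive responses cost one timeout per interface.
    Hanging connection attempts cost two timeouts per interface
    (the initial attempt plus the retry).
    """
    attempts_per_interface = 2 if connection_failure else 1
    return interface_count * attempts_per_interface * non_response_timeout_seconds


# Two interfaces, four second timeout, as in the example above.
print(worst_case_seconds_until_down(2, 4))                           # 8
print(worst_case_seconds_until_down(2, 4, connection_failure=True))  # 16
```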