Node Failure

A loss of network connectivity is treated as a node failure; that is, a heartbeat is missed and a TCP connection to the node cannot be established.

This can occur because of a hardware failure (the node is down), a network connection failure, a network partition, or a software error escalation. Every process running on the appliance is monitored, and a repeated software error condition can trigger an escalation that restarts the appliance, which in turn registers as a node failure for failover purposes.
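The detection rule described above can be sketched as follows. This is only an illustrative sketch, not the appliance's actual implementation: the function names, the timeout value, and the way the heartbeat result is passed in are all assumptions made for the example.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established.

    The 2-second default timeout is an arbitrary value chosen for this sketch.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def is_node_failure(heartbeat_missed: bool, tcp_ok: bool) -> bool:
    """Hypothetical rule mirroring the text: a node is considered failed
    when a heartbeat was missed AND a TCP connection cannot be established."""
    return heartbeat_missed and not tcp_ok
```

Under this rule, a missed heartbeat alone does not count as a node failure in the sketch if the peer still accepts a TCP connection.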

Ethernet Disconnection

Unplugging the Ethernet cable from the primary appliance in an HA pair triggers a failover.

The HA pair (cluster) membership fails, and the primary appliance eventually enters disabled mode. Plugging the Ethernet cable back in clears the failure state and allows the appliance to return to the running state, but it rejoins the cluster as the secondary node. The former secondary appliance is now the primary appliance.

If you want the two appliances to resume their original roles, you must initiate another failover. It is vital that the two appliances are allowed to synchronize before failover is triggered again; otherwise, additional log loss occurs. Note that a small amount of log loss always occurs in any failover while the VIP is migrated from the old primary to the new primary appliance.

To trigger the second failover, run the following command as the toor user on the new primary appliance (the appliance that became primary after the first failover). Once the second failover completes, that appliance becomes the secondary appliance again:

$ mtask -s engine_cluster_membership restart
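The role swap and the synchronization requirement described in this section can be modeled with a small sketch. Everything here is hypothetical and exists only to illustrate the sequence: the class, its method names, and the `synchronized` flag are not part of the appliance software, and raising an error when the pair is not synchronized is a choice made for the example (the appliance itself does not necessarily block the command).

```python
from dataclasses import dataclass


@dataclass
class HAPair:
    """Toy model of an HA pair; names and fields are illustrative only."""
    primary: str
    secondary: str
    synchronized: bool = True

    def failover(self) -> None:
        # The roles swap: the old secondary becomes the primary, and the
        # old primary later rejoins the cluster as the secondary node.
        self.primary, self.secondary = self.secondary, self.primary
        # After a failover the two appliances must synchronize again.
        self.synchronized = False

    def mark_synchronized(self) -> None:
        # Stand-in for "the appliances have finished synchronizing".
        self.synchronized = True

    def fail_back(self) -> None:
        # Restoring the original roles takes a second failover. This sketch
        # refuses to proceed before synchronization, because failing over
        # too early causes additional log loss.
        if not self.synchronized:
            raise RuntimeError("appliances not synchronized; wait before failing back")
        self.failover()


pair = HAPair(primary="appliance-a", secondary="appliance-b")
pair.failover()           # e.g. Ethernet cable unplugged on appliance-a
pair.mark_synchronized()  # wait until synchronization has completed
pair.fail_back()          # second failover restores the original roles
```

At the end of this sequence, `pair.primary` is `"appliance-a"` again, matching the original role assignment.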