Failover

The orchestrators can be deployed in a cluster domain. One member of the cluster acts as the cluster manager (CM) and the others act as workers (W). The cluster manager manages the workers and monitors all members of the cluster, so that pending events on failed nodes can still be processed.

After a node goes down, the cluster manager assigns the load of the failed node to one of the running nodes, which acts as its backup. The assignment considers the relative load of all nodes currently active in the cluster. The cluster manager fetches the number of plan items being executed by each node at that time and, based on this information, picks the least-loaded node as the backup for the failed node. The cluster manager then instructs this node to start handling the failed node, and the backup node starts the backup process.
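The selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary of plan-item counts, and the tie-breaking rule are hypothetical and are not part of the product.

```python
def pick_backup_node(plan_items_by_node, failed_node):
    """Return the running node with the fewest executing plan items.

    plan_items_by_node maps a node name to the number of plan items it is
    currently executing (a hypothetical in-memory view of the cluster load).
    Returns None when no running node is available.
    """
    # Exclude the failed node from the candidates.
    candidates = {
        node: count
        for node, count in plan_items_by_node.items()
        if node != failed_node
    }
    if not candidates:
        return None
    # Assumption: ties are broken by node name so the result is deterministic.
    return min(candidates, key=lambda node: (candidates[node], node))


load = {"node-a": 5, "node-b": 2, "node-c": 9}
backup = pick_backup_node(load, failed_node="node-c")
```

Here `backup` is `node-b`, because it is executing the fewest plan items among the nodes that are still running.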

You can limit the number of failed nodes in the cluster that are backed up by setting the com.tibco.fom.orch.cluster.backUpThreshold property. It defines the percentage of nodes in the cluster whose failure is handled by the cluster manager.

The com.tibco.fom.orch.cluster.backUpTimeout property terminates backup processing in the rare case where the backup node has not completed the backup within the allowed time. The value is specified in milliseconds and defaults to one hour.
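As a rough illustration of how these two properties might be interpreted, consider the sketch below. The function names and the rounding-down rule are assumptions made for the example; only the one-hour default and the millisecond unit come from the description above.

```python
# One hour expressed in milliseconds, matching the documented default
# of com.tibco.fom.orch.cluster.backUpTimeout.
DEFAULT_BACKUP_TIMEOUT_MS = 60 * 60 * 1000


def max_nodes_to_back_up(cluster_size, threshold_percent):
    """Number of failed nodes that may be backed up for a given threshold.

    Assumption: the percentage is applied to the cluster size and rounded
    down; the product may round differently.
    """
    return (cluster_size * threshold_percent) // 100


def backup_timed_out(started_ms, now_ms, timeout_ms=DEFAULT_BACKUP_TIMEOUT_MS):
    """Return True when a backup has been running for at least timeout_ms."""
    return now_ms - started_ms >= timeout_ms


allowed = max_nodes_to_back_up(cluster_size=10, threshold_percent=50)
```

With ten nodes and a hypothetical threshold of 50, `allowed` is 5 under the rounding assumption stated above.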

  1. The cluster manager uses the heartbeat generated by the workers to add members to the cluster and also to detect failures.

  2. At any given point in time, only one member acts as the cluster manager; the other members act as workers. When a node starts up, a unique sequence number is generated and assigned to it. This number is included in the heartbeat published by the member, so eventually every member learns the sequence numbers of all other members in the cluster. Each member can therefore identify the lowest sequence number and the node it belongs to. The member with the lowest sequence number starts as the cluster manager, and the other nodes start as workers. When the cluster manager fails, the member next in sequence takes over as the cluster manager.

  3. The unique sequence number is generated using the sequence SEQ_CLUSTER_SEQ_NUMBER in the OMS database.

  4. The cluster domain name and members of the cluster are configured in the OMS database. The tables' definitions are as follows:

    Table DOMAIN

    Column                     Description
    DOMAINID                   The name of the cluster domain.
    DESCRIPTION                A short description of the domain.
    BACKINGSTORE               The backing store (JMS) used for updating the status of the node.
    HEARTBEATINTERVAL          The interval between heartbeats, in milliseconds.
    MANAGERACTIVATIONINTERVAL  The interval, in milliseconds, between checks of whether the node can become the cluster manager and start executing cluster manager tasks.
    FTTHRESHOLDINTERVAL        The interval, in milliseconds, after which the cluster manager assumes a node has failed if it has not received a heartbeat from that node.
    HANDLEFAILEDNODEEVENTS     Allowed values are 1 and 0; the default is 1. If set to 1, the orchestrator transfers the events from failed nodes and starts processing them. If set to 0, events from failed nodes are not transferred.

    Table DOMAINMEMBERS

    Column                     Description
    MEMBERID                   The name of the cluster member.
    DESCRIPTION                A short description of the member.
    DOMAINID                   The domain that the member belongs to.
    CLUSTERID                  The cluster ID, uniquely generated by the node at runtime.
    ISCLUSTERMANAGER           True if the member is the cluster manager; false if it is a worker.
    SEQNUMBER                  The sequence number generated by the member.
    LASTUPDATETIMESTAMP        The timestamp of the last update.
    STATUS                     INIT or STARTED. INIT means the node is initializing. STARTED means the node has started.
    HEARTBEATTIMESTAMP         The heartbeat timestamp.
    BACKUP_MEMBERID            The ID of the member that is backing up the failed node.
    TRANSACTIONID              Used by the node finder when it allocates the member ID to an instance.
    TRANUPDATESTAMP            Indicates when the given member ID was allocated.
    IS_STATIC                  Indicates whether the member ID is available for static or dynamic allocation. A value of 1 indicates static and a value of 0 indicates dynamic.

  5. After a failure, the orchestrator restores its data from checkpoints. The orchestrator supports database checkpointing.

  6. On cluster failure, the orchestrator's listeners and throttling are disabled. The node joins the cluster again with a new sequence number.
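
The heartbeat-based failure detection and the sequence-number rule for choosing the cluster manager, described in the list above, can be sketched together. This is a simplified, hypothetical model: the Member class and the function names are invented for illustration, and the field names only mirror the SEQNUMBER, HEARTBEATTIMESTAMP, and FTTHRESHOLDINTERVAL columns.

```python
from dataclasses import dataclass


@dataclass
class Member:
    member_id: str               # mirrors MEMBERID
    seq_number: int              # mirrors SEQNUMBER
    heartbeat_timestamp_ms: int  # mirrors HEARTBEATTIMESTAMP


def failed_members(members, now_ms, ft_threshold_interval_ms):
    """Members whose last heartbeat is older than the failure threshold."""
    return [
        m for m in members
        if now_ms - m.heartbeat_timestamp_ms > ft_threshold_interval_ms
    ]


def elect_manager(members, now_ms, ft_threshold_interval_ms):
    """Return the live member with the lowest sequence number, or None."""
    failed_ids = {
        m.member_id
        for m in failed_members(members, now_ms, ft_threshold_interval_ms)
    }
    live = [m for m in members if m.member_id not in failed_ids]
    if not live:
        return None
    return min(live, key=lambda m: m.seq_number)


# Three members; the one with the lowest sequence number has stopped
# sending heartbeats, so the next member in sequence takes over.
cluster = [
    Member("m1", seq_number=3, heartbeat_timestamp_ms=1_000),
    Member("m2", seq_number=7, heartbeat_timestamp_ms=9_500),
    Member("m3", seq_number=12, heartbeat_timestamp_ms=9_800),
]
manager = elect_manager(cluster, now_ms=10_000, ft_threshold_interval_ms=5_000)
```

In this example `manager` is the member `m2`: the heartbeat of `m1` is 9,000 ms old, which exceeds the 5,000 ms threshold, so `m1` is treated as failed and the member next in sequence becomes the cluster manager.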