Mashery Local Failover Strategy Recommendations

Many TIBCO customers rely on TIBCO Mashery Local to manage and distribute their revenue-generating and business-critical API traffic. This usage demands that Mashery Local be always on, with no downtime. Proper planning for redundancy and failover is recommended when high availability is expected of a mission-critical system.

The current Mashery Local architecture relies on four entities (illustrated in the sketch after this list):
  1. Mashery Cloud - This is where you make all your service configuration changes, through the Mashery Control Center API dashboard.
  2. Mashery On-Prem Manager (MoM) - This is your Mashery Local's gateway to the Mashery Cloud. You will have received a secure key and secret, which gives each of your clusters its unique identity. You should always have a separate MoM key for each cluster, even if the clusters connect to the same Mashery area.
  3. A Mashery Master node that syncs with the Cloud for API Keys, OAuth Tokens, Service Configuration, User details, etc. The time taken to synchronize your configuration and token data is a function of the amount of data; each customer implementation is unique and each network topology is different, so there is no formula that can reliably project how long the sync will take.
  4. Slaves within the cluster that replicate the API Key, OAuth Token, and Service Configuration data from the Master. This replication happens within the cluster, in your own environment, so it is slightly faster than the Cloud sync, but it still depends on various environmental factors and the volume of data.
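
The relationships among these four entities can be captured in a small illustrative model. The Python sketch below is purely a conceptual aid: the class names and fields are hypothetical and do not correspond to any Mashery Local API; they only encode the topology described in the list above (one MoM identity per cluster, one Master syncing with the Cloud, and Slaves replicating from that Master).

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class MoMIdentity:
        """Hypothetical stand-in for the MoM key/secret that identifies one cluster."""
        key: str
        secret: str

    @dataclass
    class Node:
        """A Mashery Local node: the Master syncs with the Cloud, Slaves replicate from the Master."""
        hostname: str
        role: str            # "master" or "slave"
        in_rotation: bool    # whether the load balancer currently routes traffic to this node

    @dataclass
    class Cluster:
        """One Mashery Local cluster: a unique MoM identity, one Master, many Slaves."""
        mom: MoMIdentity
        nodes: List[Node] = field(default_factory=list)

        @property
        def master(self) -> Node:
            return next(n for n in self.nodes if n.role == "master")

        @property
        def slaves(self) -> List[Node]:
            return [n for n in self.nodes if n.role == "slave"]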

TIBCO Recommendations to Achieve High Availability for TIBCO Mashery Local Deployment

Failover and redundancy can be achieved at many levels and should be considered while building your high availability strategy. Customers should also maintain updated runbooks so that environment-specific nuances are captured for their internal teams, enabling faster deployment and recovery of systems. Failover systems should be tested and monitored at periodic intervals to ensure that they stay in sync with production; failure to do so will result in loss of traffic at the very moment failover is needed. The following are recommendations for two levels of redundancy: redundancy within a cluster and cross-datacenter redundancy.
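
As one way to automate those periodic checks, the sketch below polls a status endpoint on each node and flags Slaves whose data is lagging the Master. Everything environment-specific here is an assumption: the /status path, the last_sync_epoch field, and the 60-second threshold are placeholders for whatever monitoring hooks your deployment actually exposes.

    from typing import List
    import requests  # assumes the requests package is available

    STATUS_PATH = "/status"       # hypothetical per-node status endpoint
    LAG_THRESHOLD_SECONDS = 60    # illustrative alert threshold; tune for your environment

    def last_sync_epoch(hostname: str) -> float:
        """Fetch a node's last-sync timestamp from its (hypothetical) status endpoint."""
        response = requests.get(f"https://{hostname}{STATUS_PATH}", timeout=5)
        response.raise_for_status()
        return response.json()["last_sync_epoch"]

    def lagging_slaves(master_host: str, slave_hosts: List[str]) -> List[str]:
        """Return the Slaves whose data is more than LAG_THRESHOLD_SECONDS behind the Master."""
        master_epoch = last_sync_epoch(master_host)
        return [
            slave for slave in slave_hosts
            if master_epoch - last_sync_epoch(slave) > LAG_THRESHOLD_SECONDS
        ]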

Redundancy within a Cluster:

Each Cluster has two types of Nodes – a Master Node that syncs the cluster with the Cloud and many Slave Nodes that replicate from the Master. Though both types of Nodes are capable of serving traffic, TIBCO's recommendation is to keep the Master out of rotation in high-traffic, OAuth-heavy implementations.

Keep one more Slave than is needed for optimum capacity to achieve within-cluster redundancy.

If the Master runs into problems due to disk, VM, or network issues, you can easily promote the spare Slave to Master and point the rest of the Slaves to the new Master. This can be done in minutes and has a very low impact on traffic. Except for newly synced OAuth tokens, the Slaves should be able to serve traffic successfully while the promotion and repointing take place. Once the old Master Node is fixed and its VM re-imaged, you can bring it back into the cluster as a Slave.
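
Reusing the illustrative Cluster/Node model from the earlier sketch, the outline below shows that promotion sequence. Each step is a placeholder for the environment-specific runbook action (console command, automation job, etc.); none of these functions belong to any Mashery Local API.

    def repoint_slave(slave: Node, new_master: Node) -> None:
        """Placeholder: point a Slave's replication at a different Master node."""
        ...

    def promote_spare_slave(cluster: Cluster, spare_hostname: str) -> None:
        """Illustrative Master-failure recovery flow for the scenario described above."""
        old_master = cluster.master
        spare = next(n for n in cluster.slaves if n.hostname == spare_hostname)

        # 1. Take the failed Master out of rotation (it should already be out in
        #    high-traffic, OAuth-heavy deployments, per the recommendation above).
        old_master.in_rotation = False

        # 2. Promote the spare Slave so it becomes the node that syncs with the Cloud.
        spare.role = "master"

        # 3. Repoint the remaining Slaves at the new Master. They keep serving
        #    traffic during this step, except for newly synced OAuth tokens.
        for slave in cluster.slaves:
            repoint_slave(slave, new_master=spare)

        # 4. After re-imaging the old Master's VM, bring it back as a Slave.
        old_master.role = "slave"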

If a Slave runs into problems, take that Slave out of rotation at the load balancer level; that way, you will not experience any traffic loss. Fix the issue and bring the Slave back into rotation.
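
The same model can illustrate the Slave-failure procedure. The load-balancer calls below are placeholders for whatever API or console your load balancer actually provides.

    def remove_from_load_balancer(hostname: str) -> None:
        """Placeholder: detach a node from the load balancer pool."""
        ...

    def add_to_load_balancer(hostname: str) -> None:
        """Placeholder: reattach a node to the load balancer pool."""
        ...

    def drain_and_restore_slave(cluster: Cluster, hostname: str, repair_action) -> None:
        """Illustrative flow for repairing a problematic Slave without traffic loss."""
        slave = next(n for n in cluster.slaves if n.hostname == hostname)

        # 1. Stop routing new traffic to the Slave at the load balancer level.
        remove_from_load_balancer(slave.hostname)
        slave.in_rotation = False

        # 2. Fix the underlying issue (disk, VM, network, replication, ...).
        repair_action(slave)

        # 3. Return the Slave to rotation once it is healthy and caught up.
        add_to_load_balancer(slave.hostname)
        slave.in_rotation = True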

Cross Datacenter Redundancy:

If there is an issue with the datacenter, or if the whole cluster is having problems, an Active-Active or Active-Passive cluster strategy is very beneficial (a capacity-sizing sketch follows this list):
  • Active-Active Strategy: Both clusters, each with its own unique Mashery On-Prem Manager (MoM) key, connect to the same Mashery area and continue to sync. Nodes in both clusters serve traffic, but each cluster must have enough capacity (disk space, caching configuration, etc.) to serve the total traffic of both clusters combined and act as a failover if needed.
  • Active-Passive Strategy: Both clusters, each with its own unique Mashery On-Prem Manager (MoM) key, connect to the same Mashery area and sync. Nodes from only one cluster serve traffic. If needed, traffic can be routed to the other, non-serving cluster without any blip in service. Both clusters should be identical in configuration and capacity (disk space, caching configuration, etc.) so that either can serve the total traffic.
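
In both strategies the sizing rule is the same: each cluster must be able to carry the combined peak traffic on its own. A minimal sketch of that check, with purely illustrative numbers and an assumed 20% headroom margin:

    def can_absorb_failover(cluster_capacity_tps: float,
                            peak_tps_cluster_a: float,
                            peak_tps_cluster_b: float,
                            headroom: float = 0.2) -> bool:
        """True if a single cluster can carry both clusters' combined peak traffic.

        The headroom value is an illustrative safety margin, not a TIBCO-supplied
        figure; use whatever margin your own capacity planning dictates.
        """
        combined_peak = peak_tps_cluster_a + peak_tps_cluster_b
        return cluster_capacity_tps >= combined_peak * (1 + headroom)

    # Example: two Active-Active clusters each peaking at 4,000 calls/second.
    # Each cluster must be sized for the combined 8,000 calls/second plus headroom,
    # not just its own half of the traffic.
    print(can_absorb_failover(cluster_capacity_tps=10_000,
                              peak_tps_cluster_a=4_000,
                              peak_tps_cluster_b=4_000))   # True: 10,000 >= 9,600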

Please note that TIBCO Mashery's license policy is cluster-based, so please discuss this with your Sales or Mashery Customer Success team. Having cluster redundancy is absolutely essential to avoid any traffic loss: Master sync and Slave replication take time when done from scratch, and without cluster failover you will encounter traffic loss.