Failover and Failback of Distributed Cache Data
The object manager handles failover of cache data when a cache agent fails, and failback when the agent recovers.
When a node hosting a cache agent fails, the object manager uses backup copies to redistribute objects among the remaining cache agents, provided that enough cache agents remain to hold the configured number of backups and that they have enough memory to handle the additional load. However, because this is a memory-based system, data may be lost if a second cache agent fails before the data from the first failure has been redistributed. To avoid this risk, use a backing store.
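The two conditions for redistribution (enough remaining agents, enough memory) can be illustrated with a minimal Java sketch. It is not the object manager's actual logic or API: the class name RedistributionCheck, the method canRedistribute, and the sizing rule are assumptions made for illustration. The sketch assumes that each object needs one primary copy plus the configured backups, each on a different agent, and it compares only total capacity rather than per-agent placement.

```java
import java.util.List;

// Hypothetical model for illustration only; not part of any product API.
final class RedistributionCheck {
    private RedistributionCheck() {}

    /**
     * Returns true if the surviving agents could hold the full cache plus backups.
     *
     * @param survivingCapacities memory capacity, in bytes, of each surviving cache agent
     * @param primaryDataBytes    total size, in bytes, of one copy of all cached objects
     * @param backupCount         configured number of backup copies
     */
    static boolean canRedistribute(List<Long> survivingCapacities,
                                   long primaryDataBytes,
                                   int backupCount) {
        // Assumption: primary and backups of an object live on distinct agents,
        // so at least (backupCount + 1) agents must remain.
        int copies = backupCount + 1;
        if (survivingCapacities.size() < copies) {
            return false;
        }
        // Simplification: compare total capacity with total data across all copies.
        long totalCapacity = 0;
        for (long capacity : survivingCapacities) {
            totalCapacity += capacity;
        }
        return totalCapacity >= primaryDataBytes * copies;
    }
}
```

In this model, two surviving agents of 100 bytes each can hold 80 bytes of data with one backup (160 bytes in total), but a single surviving agent cannot, because the backup would have no separate agent to live on.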
If redistribution is successful, the complete cache of all objects, plus the specified number of backups, is restored. When the failed node starts again, the object management layer again redistributes cache data.
Specifically, when a cache agent JVM fails, the cache agent that maintains the backup of the failed JVM’s cache data objects takes over primary responsibility for that data. If two backup copies are specified, then the cache agent responsible for the second backup copy is promoted to primary backup. Additional backup copies are made according to the configuration requirements. When a new cache agent comes up, data is again redistributed across the cluster to make use of this new cache agent.
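The promotion sequence described above can be sketched in Java as follows. This is an illustrative model under stated assumptions, not the product's implementation: the OwnerChain class, the afterFailure method, and the representation of ownership as an ordered list of agent names (primary first, then first backup, then second backup) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; not part of any product API.
final class OwnerChain {
    private OwnerChain() {}

    /**
     * Computes the new owner list for a set of objects after one agent fails.
     *
     * @param owners     current owners in order: primary, first backup, second backup, ...
     * @param failed     name of the failed cache agent
     * @param liveAgents names of the cache agents that are still running
     * @return the new ordered owner list
     */
    static List<String> afterFailure(List<String> owners, String failed, List<String> liveAgents) {
        int targetSize = owners.size();
        List<String> result = new ArrayList<>(owners);
        // Removing the failed agent shifts every later owner up one position:
        // the first backup becomes primary, the second backup becomes first backup.
        result.remove(failed);
        // Create replacement backups on live agents that do not already hold a copy,
        // until the configured number of copies is reached or no agent is left.
        for (String agent : liveAgents) {
            if (result.size() >= targetSize) {
                break;
            }
            if (!agent.equals(failed) && !result.contains(agent)) {
                result.add(agent);
            }
        }
        return result;
    }
}
```

For example, with owners A (primary), B, and C, and live agents B, C, and D, the failure of A yields the order B, C, D: B takes over as primary, C becomes the first backup, and a new backup copy is created on D. If no spare agent is running, the list simply shrinks to B, C until a new cache agent joins.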
Because they store data in memory, cache-based systems are reliable only to the extent that enough cache agents with sufficient memory are available to hold the objects. If one cache agent fails, its objects are redistributed to the remaining cache agents, provided that they have enough memory. With a backup count of one, one cache agent can fail without risk of data loss. In the case of a total system failure, however, the cache is lost.