Cache OM and Fault Tolerance
Fault tolerance of the engine process refers only to inference agents.
In all cases it is assumed that dedicated cache agents are also running.
If you use multi-engine (multi-agent) features, fault tolerance is implicit. An agent group consists of instances of the same agent class. When all agents in an agent group are active and any one of them fails, the remaining agents in the group automatically handle the workload.
In all cases, in the event of total system failure, use of a backing store ensures recovery of data written to the backing store.
# PUs, # Agents | With Fault Tolerance Configuration | No Fault Tolerance Configuration |
---|---|---|
1 PU, 1 Agent | (N/A) | (N/A) |
1 PU, n Agents | (N/A) Each agent in the same PU is a different agent, not part of the same agent group. | (N/A) |
n PUs, 1 Agent | Fault tolerance is at the agent level. If one or more agents in a group fails, the load is distributed among remaining agents in that group. All agents can be active or some can be standbys. Configuration uses a MaxActive property and a Priority property. Cluster data is shared between agents across all PUs, using the cache cluster. If the number of cache object backups is one, one cache agent (at a time) can fail with no data loss. With two backups, two servers can fail, and so on. Caches exist in memory only, so recovery is not available in the case of a total system failure. In the event of total system failure, use of a backing store ensures recovery of data written to the backing store. | N/A. Fault tolerance is implicit. |
n PUs, n Agents | Same as n PUs 1 agent. Each of the agents in one PU is fault tolerant with the agents in the same agent group, which are deployed in other PUs. | Multi-agent mode: N/A. Fault tolerance is implicit. |
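
The agent-level failover and cache backup rules in the table can be illustrated with a small Python model. This is only a sketch, not product code: the `Agent` class, `active_agents`, and `data_survives` are hypothetical names, and it assumes that a lower Priority number is preferred for activation and that a MaxActive value of 0 means no cap. Check both assumptions against your product version.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str           # hypothetical label for the PU hosting this agent
    priority: int       # assumption: lower number is activated first
    alive: bool = True


def active_agents(group, max_active):
    """Pick which live agents of one agent group are active.

    Live agents are ordered by priority and capped at max_active.
    max_active == 0 is treated as "no cap" (an assumption of this sketch).
    """
    live = sorted((a for a in group if a.alive), key=lambda a: a.priority)
    return live if max_active == 0 else live[:max_active]


def data_survives(backup_count, failed_cache_agents):
    """With b cache object backups, up to b cache agents can fail at a time."""
    return failed_cache_agents <= backup_count


# One agent class deployed in three PUs: two active, one standby.
group = [Agent("PU1", 1), Agent("PU2", 2), Agent("PU3", 3)]
before = [a.name for a in active_agents(group, max_active=2)]

group[0].alive = False  # the agent in PU1 fails
after = [a.name for a in active_agents(group, max_active=2)]

print(before, "->", after)
print(data_survives(backup_count=1, failed_cache_agents=1))
print(data_survives(backup_count=1, failed_cache_agents=2))
```

In this model the standby agent in PU3 becomes active once the agent in PU1 fails, and a backup count of one tolerates the loss of a single cache agent but not two at once. Neither function models total system failure, where only data written to a backing store is recoverable.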