Preventing a Cluster from Ceasing Processing During Database Outages
This feature allows the Inference engine to keep running while the database is down. During the outage, all database updates are buffered by the ActiveSpaces caching mechanism. The maximum length of the outage is not fixed; it is limited by the memory available to the ActiveSpaces data grid (or by the buffer sizes in the case of Coherence).
When the database connections are restored, all buffered transactions are played back, so no data is lost.
Required Configuration Settings
For this feature to work, the following configuration settings are required. These settings ensure that all the data the Inference engine needs is already cached, so anything that is not found in the cache is not in the database either. Whether or not the database is available, if a 'read/get' does not find the requested entity in the cache, the database is never queried.
- Shared-All persistence with Write-Behind (Cache Aside=false)
- Persistence Mode is set to ASYNC
- Unlimited cache (both entity and object-table caches, even if the object table is disabled with useobjecttable=false)
- Preload and Recover all caches (both entity and object-table caches, even if the object table is disabled with useobjecttable=false)
- ObjectCacheFullyLoaded flag is set to true:
<property name="be.engine.cluster.isObjectCacheFullyLoaded" value="true"/>
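The effect of these settings on the read path can be modelled in a few lines. The following Python sketch is purely illustrative: `BackingStore`, `FullyLoadedCache`, and their methods are hypothetical names, not BusinessEvents or ActiveSpaces APIs. It assumes a cache that was preloaded at startup and shows that, when the cache is treated as fully loaded, a miss is answered as "not found" without touching the database, so reads keep working during an outage.

```python
class BackingStore:
    """Hypothetical stand-in for the database; counts every query it receives."""

    def __init__(self, rows=None, available=True):
        self.rows = dict(rows or {})
        self.available = available
        self.queries = 0

    def load(self, key):
        self.queries += 1
        if not self.available:
            raise ConnectionError("database is down")
        return self.rows.get(key)


class FullyLoadedCache:
    """Illustrative read path: with fully_loaded=True a miss never reaches the store."""

    def __init__(self, store, fully_loaded):
        self.store = store
        self.fully_loaded = fully_loaded
        self.entries = {}

    def preload(self):
        # Done once at startup, while the database is still reachable.
        self.entries.update(self.store.rows)

    def get(self, key):
        if key in self.entries:
            return self.entries[key]
        if self.fully_loaded:
            # Everything that exists is already cached, so a miss means "no such entity".
            return None
        value = self.store.load(key)  # read-through fallback; fails during an outage
        if value is not None:
            self.entries[key] = value
        return value


store = BackingStore(rows={"order-1": {"status": "open"}})
cache = FullyLoadedCache(store, fully_loaded=True)
cache.preload()
store.available = False  # simulate the database outage

hit = cache.get("order-1")   # served from the cache
miss = cache.get("order-2")  # answered without querying the database
```

In this model, the same lookup with `fully_loaded=False` would fall through to the store and raise `ConnectionError` while the database is down, which is the failure the required settings are meant to avoid.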
Suggested Settings
In addition to the required configuration settings, the following CDD settings are recommended:
- Cache Agent Quorum = 2 (or more)
- Number of Backup Copies = 1 (replication of 1 or more)
- Disable connection checking in Inference agents (this prevents exceptions in the inference logs and, with cache-aside, database connections are not needed except during startup):
<property name="be.backingstore.connection.retry.count" value="0"/>
- Clean up object-table entries marked for deletion during startup (manually, or automatically by setting the following property):
<property name="be.engine.cluster.cleanup" value="true"/>
- During testing, set the log-configuration roles as follows (this helps with debugging):
<roles>*:info runtime.service:info kernel.core:debug backingstore:all jdbcstore:all jdbcstore.impl:all sql.text:all sql.vars:all</roles>
Limitations
The Scheduler (DB Poller) is database dependent and does not continue executing while the database is down.
Testing Recommendations
- Simulate disconnects in several ways: shut down the database service and machine, disconnect the database server from the network, and use any other means available.
- Test very short outages (30 seconds), very long outages (hours), outages of in-between length, and repeated outages.
- Start with a project where entities are only created, then test the case where entities are only deleted, and finally test the case where entities are created, modified, and deleted as usual.
- With ActiveSpaces, if an entity is created and then deleted during the same database outage, the two operations cancel each other out. As a result, there is no related database transaction when the connection is restored.
- With ActiveSpaces, if an entity is modified multiple times during a database outage, only the last update is kept. As a result, there is only a single database transaction when the connection is restored.
- With ActiveSpaces, when the database is disconnected, exceptions appear in the cache engine logs. These exceptions should almost immediately suspend the 'Persister' involved, and the space enters the 'Persister State = offline' state. The same state can be reached by issuing the as-admin command "suspend persistence 'dist-unlimited-bs-Test--be_gen_Concepts_***'".
- When the database connections are re-established, ActiveSpaces resumes all 'Persisters', and the space enters the 'Persister State=replaying/online' state. The same state can be reached by issuing the as-admin command "resume persistence 'dist-unlimited-bs-Test--be_gen_Concepts_***'".
- The ActiveSpaces behavior can also be tested in isolation, without disconnecting the database, by issuing the suspend and resume commands for all the relevant spaces in the cluster.
- During the database outage, engine throughput may rise above normal because all database transactions are deferred (essentially, the system runs in cache-only mode with no persistence). When the connection is re-established, throughput decreases during the replay period.
- With ActiveSpaces, monitor the space "ToPersist" count by using the as-admin tool. This shows the in-flight updates in cache which are not yet persisted.
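The buffering behavior described in the list above (create and delete cancelling out, repeated updates collapsing into one, and a pending count that drains on replay) can be modelled with a small write-behind buffer. This Python sketch is a simplified illustration under stated assumptions, not the ActiveSpaces implementation: `WriteBehindBuffer` and all of its methods are hypothetical, and `to_persist()` is only a loose analogue of the "ToPersist" count reported by as-admin.

```python
_DELETED = object()  # marker for a pending delete of an already persisted entity


class WriteBehindBuffer:
    """Illustrative pending-write buffer keyed by entity id (hypothetical model)."""

    def __init__(self, persisted_ids=()):
        self.persisted = set(persisted_ids)  # ids that already exist in the database
        self.pending = {}                    # id -> latest value, or _DELETED

    def put(self, entity_id, value):
        # A later write to the same id replaces the earlier pending one.
        self.pending[entity_id] = value

    def delete(self, entity_id):
        if entity_id in self.persisted:
            self.pending[entity_id] = _DELETED  # the database row must be removed later
        else:
            # Never persisted: the buffered create and this delete cancel out.
            self.pending.pop(entity_id, None)

    def to_persist(self):
        # Number of buffered changes that have not reached the database yet.
        return len(self.pending)

    def replay(self):
        """Return the database operations to run once the connection is back."""
        ops = []
        for entity_id, value in self.pending.items():
            if value is _DELETED:
                ops.append(("DELETE", entity_id, None))
                self.persisted.discard(entity_id)
            elif entity_id in self.persisted:
                ops.append(("UPDATE", entity_id, value))
            else:
                ops.append(("INSERT", entity_id, value))
                self.persisted.add(entity_id)
        self.pending.clear()
        return ops


# Simulated outage: "c1" already exists in the database, "c2" does not.
buf = WriteBehindBuffer(persisted_ids={"c1"})
buf.put("c2", {"v": 1})
buf.delete("c2")                 # created and deleted during the outage
for version in (1, 2, 3):
    buf.put("c1", {"v": version})  # modified three times during the outage

pending_before = buf.to_persist()
ops = buf.replay()               # connection restored
pending_after = buf.to_persist()
```

In this model, `ops` holds a single `UPDATE` for `c1` carrying the last value, and nothing at all for `c2`. When testing a real cluster, the corresponding check is to compare the database contents after replay against the final state of the cache rather than against the number of operations the rules performed.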