Pre-Production Checklist
While developing or testing a new ActiveSpaces application, TIBCO recommends evaluating each item in the following checklist to confirm expected behavior before moving the application to production. For some items this means simply understanding how to perform the activity; for others it means evaluating how specific failure scenarios can lead to timeouts or other errors being returned to the application.
- Rolling Upgrade - At some point, an upgrade to a newer version of FTL and ActiveSpaces will be needed. Either stop the grid and upgrade all grid processes at the same time, or upgrade one process at a time in a rolling fashion, following the steps in the documentation.
- Monitoring - The messaging monitoring stack (InfluxDB/Grafana) provides dashboards and statistics collection for the grid processes. Deploy it and connect it to the FTL server used by the grid. The FTL server pushes the grid statistics it collects through the tibmongateway to InfluxDB.
- Single Node Failure - Test stopping the primary node in a copyset while live operations (gets/puts/deletes) are in progress. The expected behavior is that a live secondary node in the copyset takes over for the stopped primary. Client operations may be delayed or may receive a timeout exception during this transition, which the application must handle appropriately.
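A failover test like this can be driven with a small script that issues operations in a loop and records each outcome while an operator stops the primary node from another terminal. This is a sketch only: `./grid_client` and its flags are hypothetical stand-ins for whatever invokes your own application's operation path.

```shell
#!/bin/sh
# Issue n operations in a loop, recording outcome and latency for each.
# Run this, then stop the copyset's primary node from another terminal
# and watch for delayed or failed operations during the failover.
# NOTE: ./grid_client and its flags are hypothetical placeholders for
# your own client application's operation path.
drive_ops() {
  n=$1
  i=0
  while [ "$i" -lt "$n" ]; do
    start=$(date +%s)
    if ./grid_client put --table t1 --key "k$i" --value "v$i"; then
      result=ok
    else
      result="error($?)"   # e.g. a timeout surfaced as a nonzero exit
    fi
    end=$(date +%s)
    echo "op=$i result=$result seconds=$((end - start))"
    i=$((i + 1))
  done
}

# Example: drive_ops 1000
```

The same harness, pointed at a gets/deletes path, also serves for the proxy-failure test below.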
- Proxy Failure - Test stopping a proxy while live operations (gets/puts/deletes) are in progress. The expected behavior is that the client application re-binds to another running proxy. Client operations may be delayed or may receive a timeout exception during this transition, which the application must handle appropriately.
- Client Application Error Handling - Timeouts and other errors can occur in a distributed system, and the client application should have error handling in place for these scenarios (retrying, returning the error to the caller, and so on).
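Such handling often takes the form of a bounded retry with a delay. A minimal sketch, assuming the operation is invoked as a command; the grid call itself is whatever your client application performs:

```shell
#!/bin/sh
# Run a command up to max_attempts times, sleeping delay seconds between
# attempts; returns 0 on the first success, or the command's final exit
# status once the attempts are exhausted.
retry() {
  max_attempts=$1; delay=$2; shift 2
  attempt=1
  while :; do
    status=0
    "$@" || status=$?              # run the operation, capture its status
    if [ "$status" -eq 0 ]; then
      return 0                     # success: stop retrying
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return "$status"             # out of attempts: surface the error
    fi
    echo "attempt $attempt failed (exit $status); retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (hypothetical client command):
# retry 5 2 ./grid_client put --table t1 --key k1 --value v1
```

Bounding the attempts matters: an unbounded retry loop can mask a grid that is genuinely down, so the final error should still reach the caller.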
- Secondary Node Sync/Catchup - At times a node must be stopped for an extended period (for host maintenance, for example); it is then considered a dead secondary in its copyset while the primary node continues processing live operations. When the secondary node is restarted, it must complete a sync/catchup process in which it is sent any data it missed while stopped. This can involve more read and write activity (disk, network, CPU) than processing live operations alone, and it happens in parallel with ongoing live operations, so exercise this scenario on representative hardware prior to production.
- Redistribution To New Copysets - As a grid grows, it is often necessary to add a new copyset, which is then given some of the existing data in the grid. An administrative command begins the redistribution of data to the new copyset. This can involve more read and write activity (disk, network, CPU) than processing live operations alone, and it happens in parallel with ongoing live operations, so exercise this scenario on representative hardware prior to production.
- Live Backup/Restore or DR/Mirroring - Exercise features such as live backup and restore or DR/mirroring on representative hardware prior to production.
- Log/Status Collection - Diagnosing unexpected behavior often requires collecting the output of tibdg status, tibdg proxy status <proxy_name>, and tibdg node status <node_name>, all log files for the grid processes and the FTL server, and the LOG file from the node data directory. Automate this collection and exercise it prior to production.
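A collection script along these lines can gather everything in one pass. The tibdg subcommands are the ones named above; the FTL server URL (and the assumption that it is passed with -r), the proxy and node names (p1, cs1_n1), and the log/data paths are all placeholders to adjust for your deployment.

```shell
#!/bin/sh
# Gather grid diagnostics into a single timestamped archive.
# Placeholders (adjust for your deployment): the FTL server URL, the
# proxy/node names, and the log and node data directory paths.
REALM_URL="http://ftlserver:8080"
OUT="asgrid-diag-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"

# Status output for the grid, each proxy, and each node.
tibdg -r "$REALM_URL" status             > "$OUT/grid-status.txt"   2>&1 || true
tibdg -r "$REALM_URL" proxy status p1    > "$OUT/proxy-p1.txt"      2>&1 || true
tibdg -r "$REALM_URL" node status cs1_n1 > "$OUT/node-cs1_n1.txt"   2>&1 || true

# Log files for the grid processes and the FTL server, plus the LOG
# file from the node data directory.
cp /path/to/grid/logs/*.log      "$OUT/" 2>/dev/null || true
cp /path/to/ftlserver/logs/*.log "$OUT/" 2>/dev/null || true
cp /path/to/node/data/LOG        "$OUT/" 2>/dev/null || true

tar -czf "$OUT.tar.gz" "$OUT"
echo "diagnostics collected in $OUT.tar.gz"
```

Each command is followed by || true so that one unreachable process does not abort the rest of the collection; whatever could be gathered still lands in the archive.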