Planned Failback to the Primary Site
At a high level, to fail back to the primary site you simulate a failover to site 1. To ensure that no messages are lost, you can issue the "suspend" command to site 2 at the appropriate time (see the steps below). The "suspend" command causes the persistence services at site 2 to stop accepting messages from both clients and routes. This allows site 2 to finish replicating all data to site 1 before site 1 is activated. Then, once DNS is remapped, site 1 picks up where site 2 left off, accepting pending messages from clients and routes.
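The reason for suspending before activating is that site 1 must not be activated until site 2 has finished replicating. The following minimal Python sketch shows one way to express that wait as a polling loop. It assumes a caller-supplied function, here called `get_statuses`, that returns the current status strings of the standby persistence services; that function is hypothetical, and how you obtain the statuses (for example, by reading them from the user interface) is outside the sketch.

```python
import time
from typing import Callable, Iterable


def wait_until_suspended(
    get_statuses: Callable[[], Iterable[str]],  # hypothetical status source
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once every reported status is 'suspended', False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        statuses = list(get_statuses())
        # An empty list means nothing has reported yet, so keep waiting.
        if statuses and all(s.lower() == "suspended" for s in statuses):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

A runbook script could call this after the suspend step and refuse to continue to shutdown and activation if it returns False.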
-
Clear all data directories at site 1.
-
Start FTL servers at site 1, this time with "drfor" in the YAML file.
Reference files:
-
samples/yaml/satellite-dr/tibftlserver_primary_failback.yaml
-
Use the REST API to make site 2 recognize site 1 as a disaster recovery standby for FTL configuration (command "enable_dr"). Optionally set "drto" in the site 2 YAML files.
-
Update the realm configuration to re-enable disaster recovery replication for the affected persistence cluster. Verify disaster recovery replication of any pending messages (in the user interface).
Reference files:
-
samples/yaml/satellite-dr/satellite-dr-sample-dr-failback.json
-
When ready for a planned failback, suspend messaging at site 2. Use the REST API (command "suspend"). For details, see "POST cluster".
-
Wait for the standby persistence services at site 1 to report their status as suspended (in the user interface).
-
Shut down site 2.
-
Use the REST API to activate the FTL configuration at site 1 (command "activate_dr"). Optionally remove "drfor" from the site 1 YAML files.
-
Clients and satellites need to reconnect to site 1. Remap the DNS or restart them.
Reference files:
-
samples/yaml/satellite-dr/tibftlserver_sat_primary.yaml
-
samples/yaml/satellite-dr/tibftlserver_sat_dr.yaml
-
Update the realm configuration to activate messaging at site 1. This requires two changes to the persistence cluster: disable disaster recovery replication, and make site 1's server set the active server set. (Disaster recovery replication is re-enabled later once site 2 is brought back.)
Reference files:
-
samples/yaml/satellite-dr/satellite-dr-sample-primary-activate.json
-
Clear all data directories at site 2.
-
Start FTL servers at site 2. Make sure "drfor", and not "drto", is specified in the YAML file.
-
Use the REST API to make site 1 recognize site 2 as a disaster recovery standby for FTL configuration (command "enable_dr"). Optionally set "drto" in the site 1 YAML files.
-
Update the realm configuration to re-enable disaster recovery replication for the affected persistence cluster. Verify disaster recovery replication of any pending messages (in the user interface).
Reference files:
-
samples/yaml/satellite-dr/satellite-dr-sample.json
-
At this point, the system is ready for another failover.
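The three REST commands named in the steps above ("enable_dr", "suspend", "activate_dr") can be collected into a small script so that they are issued to the right site in the right order. The Python sketch below only builds the requests and does not send them. Everything about the wire format is an assumption for illustration: the host names, the resource paths in `SERVER_PATH` and `CLUSTER_PATH`, and the `{"cmd": ...}` JSON body shape are placeholders, so check the product's REST API reference (for example, the "POST cluster" page) for the actual URLs, bodies, and any required arguments.

```python
import json
import urllib.request
from typing import Optional

# Placeholder values, NOT taken from the product documentation.
SITE1 = "http://site1.example.com:8080"
SITE2 = "http://site2.example.com:8080"
SERVER_PATH = "api/v1/server"    # assumed resource for enable_dr / activate_dr
CLUSTER_PATH = "api/v1/cluster"  # assumed resource for suspend ("POST cluster")


def build_command(base_url: str, resource: str, cmd: str,
                  args: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one command."""
    body = {"cmd": cmd}  # assumed body shape
    if args:
        body.update(args)
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{resource.lstrip('/')}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Order mirrors the procedure: site 2 accepts site 1 as standby,
# site 2 is suspended, then site 1 is activated.
failback_commands = [
    build_command(SITE2, SERVER_PATH, "enable_dr"),
    build_command(SITE2, CLUSTER_PATH, "suspend"),
    build_command(SITE1, SERVER_PATH, "activate_dr"),
]

for req in failback_commands:
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Keeping the requests as data makes it possible to review them before anything is sent. The steps between the commands (clearing data directories, waiting for the suspended status, shutting down site 2, and updating the realm configuration) still have to happen in the order listed above.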