Planned Failback to the Primary Site

At a high level, to fail back to the primary site you simulate a failover to site 1. To ensure that no messages are lost, you can issue the “suspend” command to site 2 at the appropriate time (see the steps below). The “suspend” command causes the persistence services at site 2 to stop accepting messages from both clients and routes. This allows site 2 to finish replicating all data to site 1 before site 1 is activated. Then, once DNS is remapped, site 1 picks up where site 2 left off, accepting pending messages from clients and routes.
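
The ordering argument in the paragraph above can be illustrated with a toy model. This is only a sketch and not FTL code: the ToySite class, its outbox queue, and the planned_failback function are hypothetical names invented for illustration, and real persistence services replicate continuously rather than in a single drain call. The point it shows is that when intake stops before the final drain, the standby ends up holding every message the active site accepted.

```python
from collections import deque


class ToySite:
    """Toy stand-in for one site's persistence service (hypothetical, not FTL)."""

    def __init__(self, name):
        self.name = name
        self.store = []         # messages held at this site
        self.outbox = deque()   # accepted messages not yet replicated to the peer
        self.accepting = True

    def publish(self, msg):
        """Accept a message unless the site is suspended; return True if accepted."""
        if not self.accepting:
            return False
        self.store.append(msg)
        self.outbox.append(msg)
        return True

    def suspend(self):
        """Stop accepting new messages; replication of the backlog may continue."""
        self.accepting = False

    def replicate_to(self, peer):
        """Drain the whole backlog to the peer and return the number of messages sent."""
        sent = 0
        while self.outbox:
            peer.store.append(self.outbox.popleft())
            sent += 1
        return sent


def planned_failback(active, standby, pending):
    """Accept pending traffic, suspend intake, then drain before the standby takes over."""
    for msg in pending:
        active.publish(msg)
    active.suspend()              # no new messages can arrive after this point
    active.replicate_to(standby)  # so this drain is guaranteed to be the last one
    return list(standby.store)


site2 = ToySite("site2")
site1 = ToySite("site1")
replicated = planned_failback(site2, site1, ["m1", "m2", "m3"])
```

If the suspend call were moved after the drain, a message published between the two calls would remain in the outbox of the site being shut down, which is the loss the suspend step is meant to prevent.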

  1. Clear all data directories at site 1.

  2. Start FTL servers at site 1, this time with "drfor" in the YAML file.

    Reference files:

    • samples/yaml/satellite-dr/tibftlserver_primary_failback.yaml

  3. Use the REST API to make site 2 recognize site 1 as a disaster recovery standby for FTL configuration (command "enable_dr"). Optionally set "drto" in the site 2 YAML files.

  4. Update the realm configuration to re-enable disaster recovery replication for the affected persistence cluster. Verify disaster recovery replication of any pending messages (in the user interface).

    Reference files:

    • samples/yaml/satellite-dr/satellite-dr-sample-dr-failback.json

  5. When ready for a planned failback, suspend messaging at site 2. Use the REST API (command "suspend"). For details, see “POST cluster

  6. Wait for the standby persistence services at site 1 to report their status as suspended (in the user interface).

  7. Shut down site 2.

  8. Use the REST API to activate the FTL configuration at site 1 (command "activate_dr"). Optionally remove "drfor" from the site 1 YAML files.

  9. Clients and satellites need to reconnect to site 1. Remap the DNS, or restart the clients and satellites.

    Reference files:

    • samples/yaml/satellite-dr/tibftlserver_sat_primary.yaml
    • samples/yaml/satellite-dr/tibftlserver_sat_dr.yaml

  10. Update the realm configuration to activate messaging at site 1. This requires two changes to the persistence cluster: disable disaster recovery replication, and designate site 1's server set as the active server set. (Disaster recovery replication is re-enabled later, once site 2 is brought back.)

    Reference files:

    • samples/yaml/satellite-dr/satellite-dr-sample-primary-activate.json

  11. Clear all data directories at site 2.

  12. Start FTL servers at site 2. Make sure "drfor", and not "drto", is specified in the YAML file.

  13. Use the REST API to make site 1 recognize site 2 as a disaster recovery standby for FTL configuration (command "enable_dr"). Optionally set "drto" in the site 1 YAML files.

  14. Update the realm configuration to re-enable disaster recovery replication for the affected persistence cluster. Verify disaster recovery replication of any pending messages (in the user interface).

    Reference files:

    • samples/yaml/satellite-dr/satellite-dr-sample.json

  15. At this point, the system is ready for another failover.
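
The REST and wait steps above (3, 5, 6, 8, and 13) could be scripted along the following lines. This is a minimal sketch under stated assumptions: only the command names "enable_dr", "suspend", and "activate_dr" come from the procedure. The host names, the URL paths, and the {"cmd": ...} body shape are placeholders, not the documented FTL REST API, so check the REST API reference before sending anything. The helper only builds requests, and the status check is passed in as a function because this procedure reads the status from the user interface.

```python
import json
import time
import urllib.request


def build_command(base_url, path, cmd):
    """Build, but do not send, a POST request that carries one command.

    ASSUMPTION: the path and the {"cmd": ...} body are illustrative placeholders.
    """
    body = json.dumps({"cmd": cmd}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def wait_for_status(get_status, wanted="suspended", timeout_s=300.0,
                    interval_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns `wanted`; return False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if get_status() == wanted:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical endpoints; replace them with values from your own deployment.
SITE1 = "http://site1.example:8080"
SITE2 = "http://site2.example:8080"

enable_req = build_command(SITE2, "/hypothetical/server", "enable_dr")      # step 3
suspend_req = build_command(SITE2, "/hypothetical/cluster", "suspend")      # step 5
activate_req = build_command(SITE1, "/hypothetical/server", "activate_dr")  # step 8

# Step 6, simulated here with canned statuses instead of a live status query.
statuses = iter(["running", "running", "suspended"])
standby_ready = wait_for_status(lambda: next(statuses), sleep=lambda s: None)
```

Sending a request would be a separate urllib.request.urlopen call, made only after the standby reports suspended. Keeping the build and send steps apart lets the sequence be reviewed or dry-run before it touches either site.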