Planned Fail-back to the Active Satellite Site

At a high level, to fail back to the active satellite site, you simulate a failover to site 3. To ensure that no messages are lost, you can issue the suspend command to site 4 at the appropriate time (see the steps below). The suspend command causes the persistence services at site 4 to stop accepting messages from both clients and routes, which allows site 4 to finish replicating all data to site 3 before site 3 is activated. Then, once DNS is remapped, site 3 picks up where site 4 left off, accepting pending messages from clients and routes.

FTL configuration must still be managed at site 1 (or site 2, if site 2 happens to be active at the time).

Note: This procedure is similar to fail-back to site 1, with the exception that the FTL configuration does not need to fail over, since it is managed by site 1.
  1. Clear all data directories at site 3.

  2. Start FTL servers at site 3. No change to the YAML file is needed.

    Reference files:

    • samples/yaml/satellite-dr/tibftlserver_sat_primary.yaml
  3. Update the realm configuration to re-enable disaster recovery replication for the affected persistence cluster. Verify disaster recovery replication of any pending messages (in the user interface).

    Reference files:

    • samples/yaml/satellite-dr/satellite-dr-sample-satdr-failback.json
  4. When ready for a planned failback, suspend messaging at site 4. Use the REST API to issue the suspend command. For details, see POST cluster.

  5. Wait for the standby persistence services at site 3 to report their status as suspended (in the user interface).

  6. Shut down site 4.

  7. Clients and servers must reconnect to site 3. To make them reconnect, either remap the DNS or restart them.

  8. Update the realm configuration to activate messaging at site 3. This requires two changes to the persistence cluster: disable disaster recovery replication, and make site 3's server set the active server set. (Disaster recovery replication is re-enabled later once site 4 is brought back.)

    Reference files:

    • samples/yaml/satellite-dr/satellite-dr-sample-satpri-activate.json
  9. Clear all data directories at site 4.

  10. Start FTL servers at site 4. No change to the YAML file is needed.

    Reference files:

    • samples/yaml/satellite-dr/tibftlserver_sat_dr.yaml
  11. Update the realm configuration to re-enable disaster recovery replication for the affected persistence cluster. Verify disaster recovery replication of any pending messages (for example, in the user interface).

    Reference files:

    • samples/yaml/satellite-dr/satellite-dr-sample.json
  12. At this point, the system is ready for another failover.
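
Steps 4 and 5 (suspend, then wait for the suspended status) can be scripted. The following Python sketch is illustrative only and is not part of the product samples. The helper names `build_suspend_request` and `wait_until_suspended` are hypothetical, the JSON body shape `{"cmd": "suspend"}` is an assumption, and the cluster URL must be taken from the POST cluster documentation. How persistence service statuses are read is left to a caller-supplied function, since this procedure only describes checking them in the user interface.

```python
import json
import time
import urllib.request
from typing import Callable, Sequence


def build_suspend_request(cluster_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the suspend command.

    cluster_url: the persistence cluster endpoint, as given in the
    POST cluster documentation (not guessed here).
    The body shape {"cmd": "suspend"} is an ASSUMPTION; verify it first.
    """
    body = json.dumps({"cmd": "suspend"}).encode("utf-8")  # assumed body shape
    return urllib.request.Request(
        cluster_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def wait_until_suspended(
    get_statuses: Callable[[], Sequence[str]],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every reported status is "suspended" or the timeout expires.

    get_statuses: hypothetical caller-supplied function returning the current
    status string of each persistence service being watched.
    Returns True if all services reported suspended, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        statuses = get_statuses()
        # An empty list means no service reported, so keep waiting.
        if statuses and all(s.lower() == "suspended" for s in statuses):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A script would send the request with `urllib.request.urlopen(build_suspend_request(url))` and proceed to shut down site 4 (step 6) only after `wait_until_suspended` returns `True`. Treating a timeout as a hard stop, rather than continuing, keeps the guarantee that no messages are lost.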