Enabling Disaster Recovery for Routed Persistence Clusters

The sample YAML and JSON files at samples/yaml/satellite-dr demonstrate how to enable disaster recovery for routed persistence clusters. These samples create a full-mesh forwarding zone with two persistence clusters (c1 and c2) and one store (s1).

The FTL servers are distributed across four sites:

  • Site 1: The primary FTL site, consisting of ftlserver1-3. This site controls FTL configuration. It also runs persistence services pserver1-3, which make up the active set of cluster c1.

  • Site 2: The disaster recovery standby for the primary FTL site, consisting of ftlserver4-6. This site serves as a backup for the FTL configuration. It also runs persistence services pserver4-6, which make up the standby set of cluster c1.

  • Site 3: An active satellite site, consisting of ftlserver7-9. This site offers messaging only. It runs persistence services pserver7-9, which make up the active set of cluster c2.

  • Site 4: A standby satellite site for site 3, consisting of ftlserver10-12. This site runs persistence services pserver10-12, which make up the standby set of cluster c2.
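The layout above can also be written down as data, which may help when scripting host lists or sanity checks. The following Python sketch is illustrative only: the `SITES` structure and the `expand` and `describe` helpers are hypothetical names chosen for this example, and they are not part of FTL or of the sample files.

```python
# Hypothetical description of the four sites in this sample.
# This is NOT an FTL file format; it only mirrors the prose above.
SITES = {
    "site1": {"ftl": ("ftlserver", 1, 3), "pserver": ("pserver", 1, 3),
              "cluster": "c1", "role": "active"},
    "site2": {"ftl": ("ftlserver", 4, 6), "pserver": ("pserver", 4, 6),
              "cluster": "c1", "role": "standby"},
    "site3": {"ftl": ("ftlserver", 7, 9), "pserver": ("pserver", 7, 9),
              "cluster": "c2", "role": "active"},
    "site4": {"ftl": ("ftlserver", 10, 12), "pserver": ("pserver", 10, 12),
              "cluster": "c2", "role": "standby"},
}


def expand(prefix, first, last):
    """Expand a range such as ftlserver1-3 into explicit names."""
    return [f"{prefix}{i}" for i in range(first, last + 1)]


def describe(sites):
    """Return one (site, ftl servers, persistence services, cluster, role) row per site."""
    rows = []
    for name, site in sites.items():
        rows.append((name,
                     expand(*site["ftl"]),
                     expand(*site["pserver"]),
                     site["cluster"],
                     site["role"]))
    return rows


if __name__ == "__main__":
    for row in describe(SITES):
        print(row)
```

Each row pairs a site with its FTL servers, its persistence services, and the cluster role it plays in the sample.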

Follow these general guidelines before you begin:

    1. Ensure that all FTL servers and clients are at version 7.0 or later.

    2. Configure two server sets for each persistence cluster (excluding the default cluster). One server set is active and the other is standby.

    3. Each site (that is, each set of FTL core servers) hosts one persistence server set. As a result, each persistence cluster is split across two sites, with one site active and the other on standby.

      If desired, a site can host active or standby server sets from more than one persistence cluster. Do not mix active and standby server sets at a given site.

    4. If disaster recovery replication of messages is desired, enable disaster recovery replication for the persistence cluster. Otherwise, the standby set simply remains empty until it is activated.

    5. All persistence stores that require disaster recovery replication must also be replicated stores (ensure that the 'Replicated' checkbox is selected).

    6. If using auto transports, ensure that each server set has "externally reachable addresses" configured. Servers at other sites use these addresses when connecting to the server set.

      Often, these addresses are the same as the FTL core server addresses where the server set is running. Alternatively, the externally reachable address might be the address of a load balancer.

    7. Site 1 and site 2 are the only sites that manage FTL configuration. Site 1 needs "drto" in its YAML file, and site 2 needs "drfor".

    8. All other sites are satellites that provide messaging only. There is no need for "drfor"/"drto" in the satellite YAML files, but "satelliteof" must be present.
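Guidelines 2 and 3 lend themselves to a quick automated check before deployment. The following Python sketch assumes a hypothetical list of (cluster, site, role) tuples describing the planned server sets; neither the `SERVER_SETS` structure nor the `check_layout` function comes from FTL, and the check covers only the rules stated above (one active and one standby set per cluster, at different sites, with no site mixing active and standby sets).

```python
from collections import defaultdict

# Hypothetical plan for this sample: (cluster, site, role).
# The default cluster is excluded, as in guideline 2.
SERVER_SETS = [
    ("c1", "site1", "active"),
    ("c1", "site2", "standby"),
    ("c2", "site3", "active"),
    ("c2", "site4", "standby"),
]


def check_layout(server_sets):
    """Return a list of human-readable problems; an empty list means the plan passes."""
    roles_by_cluster = defaultdict(list)
    sites_by_cluster = defaultdict(set)
    roles_by_site = defaultdict(set)
    for cluster, site, role in server_sets:
        roles_by_cluster[cluster].append(role)
        sites_by_cluster[cluster].add(site)
        roles_by_site[site].add(role)

    problems = []
    for cluster, roles in sorted(roles_by_cluster.items()):
        # Guideline 2: exactly one active and one standby server set per cluster.
        if sorted(roles) != ["active", "standby"]:
            problems.append(
                f"cluster {cluster}: expected one active and one standby "
                f"server set, found {sorted(roles)}")
        # Guideline 3: the two server sets live at two different sites.
        elif len(sites_by_cluster[cluster]) != 2:
            problems.append(
                f"cluster {cluster}: active and standby sets must be at different sites")
    for site, roles in sorted(roles_by_site.items()):
        # Guideline 3 (note): do not mix active and standby sets at one site.
        if len(roles) > 1:
            problems.append(f"site {site}: mixes active and standby server sets")
    return problems


if __name__ == "__main__":
    for problem in check_layout(SERVER_SETS):
        print(problem)
```

For the sample layout the function returns an empty list. Adding a third cluster whose standby set runs at site 1 would be reported, because site 1 would then host both an active and a standby server set.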

Use the following sample files as a reference:

  • samples/yaml/satellite-dr/tibftlserver_primary.yaml
  • samples/yaml/satellite-dr/tibftlserver_dr.yaml
  • samples/yaml/satellite-dr/tibftlserver_sat_primary.yaml
  • samples/yaml/satellite-dr/tibftlserver_sat_dr.yaml
  • samples/yaml/satellite-dr/satellite-dr-sample.json

The procedure that follows demonstrates activating site 2, failing back to site 1, activating site 4, and failing back to site 3.