Disaster Recovery

When EMS uses FTL stores, the disaster recovery capabilities of TIBCO FTL extend to EMS. Setting up a Disaster Recovery (DR) site of operations can minimize EMS server downtime if the primary site of operations becomes disabled.

FTL’s disaster recovery implementation works as follows. Whenever the FTL server cluster at the primary site receives data that must be replicated among its constituent FTL servers, it also immediately forwards that data to an identical FTL server cluster running at a remote DR site. If the FTL server cluster at the primary site becomes unavailable, the cluster at the DR site, and thereby the EMS servers at the DR site, can resume from the last replicated state. The only information lost is in-flight data that the FTL server cluster was still replicating when the primary site went down. Data replication to the DR site is asynchronous and does not add latency to operations at the primary site.

Setting up the Disaster Recovery Site

FTL server clusters at both the primary site and DR site first need to be configured to support disaster recovery and then deployed.

Procedure 

1. Copy the YAML configuration of the FTL server cluster at the primary site to the DR site. If security is configured at the primary site, also copy over the trustfile, keystore, keystore_password_file, password_file, and users.txt file. Rename the FTL servers in the YAML configuration file such that there are no repeated FTL server names between the primary and DR sites. Alter any URLs in the core.servers list and -listens parameters that overlap with URLs in the primary site. Also modify any invalid paths present in the configuration.

Once the YAML has been modified for the DR site, add the drfor parameter. This parameter must be supplied with a pipe-separated list of URLs of the FTL servers in the primary site’s cluster. Each URL must be of the form <FTL server name>@<host>:<port>. If security is configured for the FTL servers at the primary site, the user and password parameters will also need to be added to the YAML configuration.

servers:
  <name of DR FTL server #1>
  # ...
  - realm
      drfor: <name of FTL server #1>@<host>:<port>|<name of FTL server #2>@<host>:<port>|<name of FTL server #3>@<host>:<port>
      user: admin
      password: file:<path to password_file>
  <name of DR FTL server #2>
  # ...
  - realm
      drfor: <name of FTL server #1>@<host>:<port>|<name of FTL server #2>@<host>:<port>|<name of FTL server #3>@<host>:<port>
      user: admin
      password: file:<path to password_file>
  <name of DR FTL server #3>
  # ...
  - realm
      drfor: <name of FTL server #1>@<host>:<port>|<name of FTL server #2>@<host>:<port>|<name of FTL server #3>@<host>:<port>
      user: admin
      password: file:<path to password_file>
2. Start up the FTL server cluster at the DR site.
3. If the FTL server cluster at the primary site is already running, first shut it down using the FTL admin tool.
tibftladmin --ftlserver <URL of any FTL server in the cluster> --shutdown_cluster
4. Add the drto parameter to the YAML configuration of the FTL server cluster at the primary site. This parameter must be supplied with a pipe-separated list of URLs of all the FTL servers belonging to the cluster at the DR site. Each URL must be of the form <FTL server name>@<host>:<port>. If security is configured for the FTL servers at the primary site, the user and password parameters will also need to be added to the YAML configuration.

servers:
  <name of FTL server #1>
  # ...
  - realm:
      drto: <name of DR FTL server #1>@<host>:<port>|<name of DR FTL server #2>@<host>:<port>|<name of DR FTL server #3>@<host>:<port>
      user: admin
      password: file:<path to password_file>
  <name of FTL server #2>
  # ...
  - realm:
      drto: <name of DR FTL server #1>@<host>:<port>|<name of DR FTL server #2>@<host>:<port>|<name of DR FTL server #3>@<host>:<port>
      user: admin
      password: file:<path to password_file>
  <name of FTL server #3>
  # ...
  - realm:
      drto: <name of DR FTL server #1>@<host>:<port>|<name of DR FTL server #2>@<host>:<port>|<name of DR FTL server #3>@<host>:<port>
      user: admin
      password: file:<path to password_file>
5. Start (or restart) the FTL server cluster at the primary site. The primary site’s FTL server cluster will connect to the DR site’s FTL server cluster at this point.
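
The drfor and drto values used in this procedure are easy to mistype, so it can help to assemble and check them before editing the YAML. The following is a minimal sketch, assuming a POSIX shell. The functions join_ftl_urls and names_are_disjoint are hypothetical helpers written for illustration and are not part of FTL or EMS, and the server names, hosts, and ports in the usage note are placeholders.

```shell
#!/bin/sh
# Hypothetical helpers (not part of FTL or EMS).

# Joins arguments of the form <name>@<host>:<port> into the
# pipe-separated value that the drfor and drto parameters expect.
join_ftl_urls() {
    result=""
    for url in "$@"; do
        case "$url" in
            *"|"*|*" "*)
                echo "invalid FTL server URL: $url" >&2
                return 1 ;;
            ?*@?*:?*) ;;   # non-empty name, host, and port parts
            *)
                echo "invalid FTL server URL: $url" >&2
                return 1 ;;
        esac
        port=${url##*:}
        case "$port" in
            ''|*[!0-9]*)
                echo "port is not numeric: $url" >&2
                return 1 ;;
        esac
        result="${result:+$result|}$url"
    done
    [ -n "$result" ] || { echo "no URLs given" >&2; return 1; }
    printf '%s\n' "$result"
}

# Succeeds only if no FTL server name appears in both pipe-separated
# lists ($1 = primary site list, $2 = DR site list).
names_are_disjoint() {
    primary_names=$(printf '%s\n' "$1" | tr '|' '\n' | sed 's/@.*//')
    for dr_url in $(printf '%s\n' "$2" | tr '|' '\n'); do
        dr_name=${dr_url%%@*}
        if printf '%s\n' "$primary_names" | grep -Fqx -- "$dr_name"; then
            echo "FTL server name used at both sites: $dr_name" >&2
            return 1
        fi
    done
}
```

For example, join_ftl_urls ftl1@host1:8085 ftl2@host2:8085 ftl3@host3:8085 prints ftl1@host1:8085|ftl2@host2:8085|ftl3@host3:8085, which can be pasted as the drfor value. Passing the primary and DR lists to names_are_disjoint checks the requirement from step 1 that no FTL server name is repeated between the two sites.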

Recovering After Disaster

In the event that the FTL server cluster – and thereby the EMS servers – at the primary site becomes unavailable, the FTL servers at the DR site will need to be notified that their site of operations has become the new primary.

This can be done by connecting the EMS admin tool to any one of the EMS servers and issuing the activate_dr_site command.

echo activate_dr_site > admin.script
tibemsadmin -server <EMS server URL> -script admin.script

Issuing this command will cause one of the EMS servers that is not configured as -standby_only to transition into active state.

Re-establishing a Disaster Recovery Site

Once the issues at the original primary site have been fixed, that site can be used as the new DR site for the current primary site. To do this, perform the following steps.

Procedure 

1. At the new DR site, start by deleting the general data directories and FTL store-specific data directories of the previous FTL server cluster to remove any residual artifacts.
2. Copy the YAML configuration from the FTL server cluster at the new primary site to the new DR site. If security is configured at the new primary site, also copy over the trustfile, keystore, keystore_password_file, password_file, and users.txt file. Rename the FTL servers in the YAML configuration file so that they have the exact same names that were used by the FTL servers in the original primary site. Alter any URLs in the core.servers list and -listens parameters that overlap with URLs in the new primary site. Also modify any invalid paths present in the configuration.

Once the above changes have been made, replace the value of the drfor parameter with a pipe-separated list of URLs of the FTL servers in the new primary site cluster. Each URL must be of the form <FTL server name>@<host>:<port>. If security is configured for the FTL servers at the new primary site, the user and password parameters will also need to be added to the YAML configuration.
servers:
  <name of DR FTL server #1>
  # ...
  - realm
      drfor: <name of FTL server #1>@<host>:<port>|<name of FTL server #2>@<host>:<port>|<name of FTL server #3>@<host>:<port>
      user: admin
      password: file:<path to password_file>
  <name of DR FTL server #2>
  # ...
  - realm
      drfor: <name of FTL server #1>@<host>:<port>|<name of FTL server #2>@<host>:<port>|<name of FTL server #3>@<host>:<port>
      user: admin
      password: file:<path to password_file>
  <name of DR FTL server #3>
  # ...
  - realm
      drfor: <name of FTL server #1>@<host>:<port>|<name of FTL server #2>@<host>:<port>|<name of FTL server #3>@<host>:<port>
      user: admin
      password: file:<path to password_file>
3. Start up the FTL server cluster at the new DR site.
4. The FTL server cluster at the new primary site will need to be informed that it needs to start replicating to a new DR site. This can be done by connecting the EMS admin tool to the active EMS server and issuing the setup_dr_site command. This command must be provided with a pipe-separated list of URLs of the FTL servers in the new DR site. Each URL must be of the form <FTL server name>@<host>:<port>.
echo setup_dr_site <name of DR FTL server #1>@<host>:<port>|<name of DR FTL server #2>@<host>:<port>|<name of DR FTL server #3>@<host>:<port> > admin.script
tibemsadmin -server <EMS server URL> -script admin.script
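
When the script file for step 4 is created from a POSIX shell, an unquoted | in the URL list would be interpreted as a pipeline rather than written to the file, so the list needs to be quoted. The following is a minimal sketch under that assumption. The function write_setup_dr_script is a hypothetical helper, not part of EMS, and the names, hosts, and ports in the example comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper (not part of EMS): writes a one-line admin script
# containing the setup_dr_site command.
#   $1 = pipe-separated list of DR FTL server URLs
#   $2 = output file (defaults to admin.script)
write_setup_dr_script() {
    dr_list=$1
    script_file=${2:-admin.script}
    [ -n "$dr_list" ] || { echo "missing DR server list" >&2; return 1; }
    # printf with a quoted argument keeps the '|' characters literal.
    printf 'setup_dr_site %s\n' "$dr_list" > "$script_file"
}

# Example (placeholder names, hosts, and ports):
#   write_setup_dr_script 'dr1@drhost1:8085|dr2@drhost2:8085|dr3@drhost3:8085'
#   tibemsadmin -server <EMS server URL> -script admin.script
```

Keeping the file creation separate from the tibemsadmin call makes it possible to inspect admin.script before it is run against the active EMS server.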