Disaster Recovery for Self-Hosted TIBCO Control Plane

This document outlines the steps required to set up a disaster recovery solution for an existing self-hosted TIBCO Control Plane cluster using Velero and AWS services. The process includes creating backups of cloud resources, Kubernetes runtime resources, and restoring them when necessary.

Considerations

  • This document refers to the AWS Backup service for backup and restore; similar services from other cloud vendors can be used as applicable.

  • Velero can also be used with other cloud providers, as it supports multiple providers for backup and recovery. For more information, see Velero documentation.

  • Initial testing of this procedure was successful, and the existing data planes were able to connect to the restored Control Plane.

Setup used for this disaster recovery test

Parameters used for this test:

  • Number of Clusters: 1 (1 Control Plane)

  • Number of Subscriptions: 3 Active subscriptions

  • Number of Data Planes: 2

  • Backup method: Manual backup for RDS and EFS (instead of scheduled backups)

  • Capabilities Deployed in data plane: TIBCO BusinessWorks Container Edition, TIBCO Flogo, and TIBCO Enterprise Message Service.

This information is provided as a reference, and results and processes may differ from use case to use case.

Backup

Back up the resources of the Control Plane namespace using Velero to perform disaster recovery. Additionally, you can use the AWS Backup service to ensure that cloud resources are protected and can be restored when needed.

Step 1: Create an S3 Bucket

Create an Amazon S3 bucket for Velero backups using either the AWS Management Console or the AWS Command-Line Interface (CLI). Replace <BUCKETNAME> and <REGION> with your chosen values.

# Replace <BUCKETNAME> and <REGION> with your own values below.
BUCKET=<BUCKETNAME>
REGION=<REGION>
aws s3 mb s3://$BUCKET --region $REGION
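
Optionally, confirm the bucket was created in the intended region:

# Optional: verify the bucket and its region
aws s3api get-bucket-location --bucket $BUCKET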

Step 2: Create an IAM Policy

Create a policy that grants Velero the necessary permissions to access AWS resources.

cat > velero_policy.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeVolumes",
                "ec2:DescribeSnapshots",
                "ec2:CreateTags",
                "ec2:CreateVolume",
                "ec2:CreateSnapshot",
                "ec2:DeleteSnapshot"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::${BUCKET}/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::${BUCKET}"
            ]
        }
    ]
}
EOF

Create the IAM policy in AWS:

aws iam create-policy \
    --policy-name VeleroAccessPolicy \
    --policy-document file://velero_policy.json
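
Optionally, confirm the policy exists and note its ARN for the next step:

# Optional: look up the ARN of the newly created policy
aws iam list-policies \
    --query "Policies[?PolicyName=='VeleroAccessPolicy'].Arn" \
    --output text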

Step 3: Create a Service Account for Velero

Set up a service account for Velero in your EKS cluster.

# Replace with the name of your existing EKS cluster
CLUSTER_NAME=eks-cluster
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

eksctl create iamserviceaccount \
    --cluster=$CLUSTER_NAME \
    --name=velero-server \
    --namespace=velero \
    --role-name=eks-velero-backup \
    --role-only \
    --attach-policy-arn=arn:aws:iam::$ACCOUNT:policy/VeleroAccessPolicy \
    --approve
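
Optionally, verify that the service account and role mapping were created:

# Optional: list the IAM service accounts in the velero namespace
eksctl get iamserviceaccount --cluster $CLUSTER_NAME --namespace velero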

Step 4: Install Velero in EKS Cluster

Add the VMware Tanzu Helm repository and create a values.yaml file for the Velero installation.

# Add Helm repository
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts

# Create values.yaml file
cat > values.yaml <<EOF
configuration:
  backupStorageLocation:
  - bucket: $BUCKET
    provider: aws
  volumeSnapshotLocation:
  - config:
      region: $REGION
    provider: aws
initContainers:
- name: velero-plugin-for-aws
  image: velero/velero-plugin-for-aws:v1.7.1
  volumeMounts:
  - mountPath: /target
    name: plugins
credentials:
  useSecret: false
serviceAccount:
  server:
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::${ACCOUNT}:role/eks-velero-backup"
EOF

# Install Velero
helm install velero vmware-tanzu/velero \
    --create-namespace \
    --namespace velero \
    -f values.yaml
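
After the installation, verify that the Velero server is running and that the backup location is reachable:

# Verify the Velero deployment and backup storage location
kubectl get pods -n velero
velero backup-location get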

Step 5: Backup Cloud Resources

Use the AWS Backup service to take on-demand backups of Amazon RDS (Relational Database Service) and Amazon EFS (Elastic File System). You can also create a backup plan for automated backups.
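
As a reference, an on-demand RDS backup can also be started from the CLI. The vault name, DB instance identifier, and IAM role below are placeholders for illustration; substitute the values from your environment:

# A hedged sketch: start an on-demand AWS Backup job for the RDS instance.
# The vault name, DB instance identifier, and IAM role are assumptions.
aws backup start-backup-job \
    --backup-vault-name Default \
    --resource-arn arn:aws:rds:$REGION:$ACCOUNT:db:<db-instance-id> \
    --iam-role-arn arn:aws:iam::$ACCOUNT:role/service-role/AWSBackupDefaultServiceRole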

Step 6: Backup Kubernetes Runtime Resources

Using Velero, take two sets of backups: one for the Custom Resource Definitions (CRDs) required by the router and another for all resources in the cp1-ns namespace. Velero also supports automated backups, allowing for scheduled backup operations to ensure that your Kubernetes resources are consistently protected.

# Backup cluster-level resources needed by the router
velero backup create cp1-ns-crds \
    --include-cluster-resources \
    --include-resources validatingwebhookconfigurations,mutatingwebhookconfigurations,customresourcedefinitions

# Backup all resources in cp1-ns namespace
velero backup create cp1-ns --include-namespaces cp1-ns

# Check backup status
velero get backups

# Verify backup details
velero backup describe <backup-name> --details
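
Since Velero also supports scheduled backups, a recurring backup of the namespace can be created; the daily 02:00 UTC cron expression below is only an example:

# Optional: schedule a daily backup of the cp1-ns namespace (example schedule)
velero schedule create cp1-ns-daily \
    --schedule "0 2 * * *" \
    --include-namespaces cp1-ns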

Restore

This section outlines the steps for restoration of cloud resources and Kubernetes runtime resources using Velero.

Cluster Setup

Follow the workshop to create the recovery cluster and install the third-party charts in the EKS cluster, including:

  • cert-manager

  • external-dns

  • metrics-server

  • AWS Load Balancer Controller

  • dp-config-aws chart for ingress creation
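
For reference, installing one of these charts follows the usual Helm pattern; the cert-manager example below is a sketch, and the chart version and flags should be matched to your environment:

# Example: install cert-manager from the Jetstack Helm repository
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager \
    --create-namespace \
    --set installCRDs=true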

Step 1: Restore Cloud Resources

Restore RDS and EFS

To restore Amazon RDS (Relational Database Service) and Amazon EFS (Elastic File System) using the AWS Backup service:

  • Security Group and Subnet Group: Before restoring the database, create the necessary AWS Security Group (SG) and database subnet group for RDS, as this information is required during the restore process. Refer to create-rds.sh for creation of the database subnet group and SG.

  • Confirm Security Group: Ensure that the Security Group for RDS allows incoming traffic from the cluster VPC CIDR on the RDS port; an example follows this list.
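
For example, the ingress rule can be added with the AWS CLI; the PostgreSQL port 5432 is an assumption, so adjust the port, Security Group ID, and CIDR for your setup:

# A hedged sketch: allow the cluster VPC CIDR to reach RDS
# (PostgreSQL port 5432 assumed)
aws ec2 authorize-security-group-ingress \
    --group-id <rds-sg-id> \
    --protocol tcp \
    --port 5432 \
    --cidr <cluster-vpc-cidr>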

After restoring EFS:

  • Set Mount Target and Security Group: Refer to create-efs.sh for creating the mount target and SG required for EFS. Confirm that the SG for EFS allows incoming traffic from the cluster VPC CIDR on NFS port 2049.

  • Create Storage Class: Create a storage class backed by the restored EFS. Refer to the create-storage-class documentation; a sketch follows below.
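
A minimal sketch of such a storage class, assuming the EFS CSI driver is installed in the cluster:

# A hedged sketch: storage class backed by the restored EFS
# (replace <efs-id> with the restored file system ID)
kubectl apply -f - <<EOF
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: <efs-id>
  directoryPerms: "700"
EOF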

Step 2: Create Service Account for Velero

Set up a service account for Velero in your EKS cluster.

RECOVERY_CLUSTER=<cluster-name>
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

eksctl create iamserviceaccount \
    --cluster=$RECOVERY_CLUSTER \
    --name=velero-server \
    --namespace=velero \
    --role-name=eks-velero-recovery \
    --role-only \
    --attach-policy-arn=arn:aws:iam::$ACCOUNT:policy/VeleroAccessPolicy \
    --approve

Step 3: Install Velero in EKS Cluster

Add the VMware Tanzu Helm repository and create a values_recovery.yaml file for the Velero installation.

# Add Helm repository
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts

# Create values_recovery.yaml file
cat > values_recovery.yaml <<EOF
configuration:
  backupStorageLocation:
  - bucket: $BUCKET
    provider: aws
  volumeSnapshotLocation:
  - config:
      region: $REGION
    provider: aws
initContainers:
- name: velero-plugin-for-aws
  image: velero/velero-plugin-for-aws:v1.7.1
  volumeMounts:
  - mountPath: /target
    name: plugins
credentials:
  useSecret: false
serviceAccount:
  server:
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::${ACCOUNT}:role/eks-velero-recovery"
EOF


# Install Velero
helm install velero vmware-tanzu/velero \
    --create-namespace \
    --namespace velero \
    -f values_recovery.yaml
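
Before restoring, confirm that Velero in the recovery cluster can see the backups stored in the S3 bucket:

# Verify the backup location and the backups taken from the original cluster
velero backup-location get
velero get backups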

Step 4: Restore Kubernetes Runtime Resources Using Velero

Using Velero, you can now restore the Kubernetes runtime resources. The S3 bucket that holds the backups and the VeleroAccessPolicy were created earlier, and the service account and Velero are already installed in the recovery cluster, so you can proceed with restoring the Kubernetes runtime contents.

Restore Backups

Restore CRDs and PVCs

Now restore the backup that contains the CRDs required by the router, followed by a partial restore of the cp1-ns namespace contents:

# Verify the backups
velero get backup

# Restore the CRDs first
velero restore create --from-backup cp1-ns-crds

# After the CRDs, restore only the PVCs and ConfigMaps so that the database
# endpoint can be updated and the data restored before the pods come up
velero restore create --from-backup cp1-ns \
    --include-resources namespaces,configmaps,VolumeSnapshotClass,VolumeSnapshotContents,VolumeSnapshots,PersistentVolumes,PersistentVolumeClaims
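
You can watch the progress of each restore and inspect any warnings or errors:

# Check restore progress and details
velero restore get
velero restore describe <restore-name> --details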

Update Database Configuration

Modify the provider-cp-database-config ConfigMap with the new database endpoint.

kubectl edit cm provider-cp-database-config -n cp1-ns

# Edit the DBHost, LocalReaderHost, and MasterWriterHost endpoints with the
# restored database URL, then save
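
As a non-interactive alternative, the same keys can be patched directly; this sketch assumes the DBHost, LocalReaderHost, and MasterWriterHost keys live under the ConfigMap's data section, so verify the structure before applying:

# A hedged sketch: patch the database endpoints in one step
# (assumes the keys live under .data; verify with
#  kubectl get cm provider-cp-database-config -n cp1-ns -o yaml)
kubectl patch cm provider-cp-database-config -n cp1-ns --type merge \
    -p '{"data":{"DBHost":"<restored-db-endpoint>","LocalReaderHost":"<restored-db-endpoint>","MasterWriterHost":"<restored-db-endpoint>"}}'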

Check PVC Name

Check the Persistent Volume Claim (PVC) name created by the Control Plane in this new cluster; it is needed when moving the EFS contents.
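
For example:

# List the PVCs created by the restored Control Plane
kubectl get pvc -n cp1-ns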

Access EFS and Move Contents

To access Amazon EFS (Elastic File System) and move contents, perform the following steps:

  1. Create a Persistent Volume (PV) for EFS: This defines the EFS as a storage resource in your Kubernetes cluster.

  2. Create a Persistent Volume Claim (PVC): This requests storage from the PV.

  3. Create a Pod: This pod mounts the PVC and allows you to access the EFS to move contents and change permissions.

# Set the EFS ID
export EFS_ID=<enter-efs-id>

# Create Persistent Volume (PV)
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 50Gi # Please adjust the storage as necessary.
  nfs:
    path: "/"
    server: ${EFS_ID}.efs.us-west-2.amazonaws.com # Adjust the region if your EFS is not in us-west-2
EOF

# Create Persistent Volume Claim (PVC)
kubectl apply -f - <<EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: efs-pvc
spec:
  volumeMode: Filesystem
  storageClassName: ""
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi # Please adjust the storage as necessary.
  volumeName: efs-pv
EOF

# Create Pod to access EFS
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
 name: checkefscontentroot
spec:
 restartPolicy: Never
 nodeSelector:
   kubernetes.io/os: linux
 containers:
   - name: cli
     image: alpine:3.20
     command: [ "/bin/sh", "-c" ]
     args:
     - "sleep 86500"
     volumeMounts:
       - name: store-vol
         mountPath: /private/cp
 volumes:
   - name: store-vol
     persistentVolumeClaim:
       claimName: efs-pvc
EOF

# Access the Pod and execute commands
kubectl exec -it checkefscontentroot -- sh

# Inside the Pod, execute the following commands:

# Navigate to the mounted directory
/ # cd private/cp/
/private/cp # ls

# Navigate to the backup directory
/private/cp # cd aws-backup-restore_2024-12-20T05-09-36-045105050Z/
/private/cp/aws-backup-restore_2024-12-20T05-09-36-045105050Z # ls

# Copy the contents into the new PVC directory
/private/cp/aws-backup-restore_2024-12-20T05-09-36-045105050Z # cp -r pvc-903b1ce3-ce28-4a29-8fb4-ac2475d98f78/* /private/cp/<<new-pvc>>

# Change ownership and permissions
/private/cp/aws-backup-restore_2024-12-20T05-09-36-045105050Z # chown -R 50000:50000 /private/cp/<<new-pvc>>
/private/cp/aws-backup-restore_2024-12-20T05-09-36-045105050Z # chmod -R 700 /private/cp/<<new-pvc>>
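
Once the contents are copied and the permissions are set, the helper pod and the temporary PV/PVC can be removed:

# Optional cleanup: remove the helper pod and temporary PVC/PV
kubectl delete pod checkefscontentroot
kubectl delete pvc efs-pvc
kubectl delete pv efs-pv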

Step 5: Restore Remaining Backups

In this step, restore all the runtime resources from the cp1-ns namespace.

## After updating the ConfigMap with the correct database details and restoring
## the PVC data, restore everything else except the CRs. The CRs can only be
## restored once the router has started successfully, as they need the router
## webhook to be ready.

velero restore create --from-backup cp1-ns \
    --exclude-resources tibcoclusterenvs,tibcoresourcesettemplates,tibcoresourcesets,tibcoroutes,tibcotunnelroutes

# Wait until the router pod has started successfully
## Once the router is running, restore the CRs
velero restore create --from-backup cp1-ns \
    --include-resources tibcoclusterenvs,tibcoresourcesettemplates,tibcoresourcesets,tibcoroutes,tibcotunnelroutes

Check the status of the pods in the cp1-ns namespace; all should be in the Running state.
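
For example:

# Check pod status in the Control Plane namespace
kubectl get pods -n cp1-ns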

Step 6: DNS Switching

Update the existing record in the Route 53 service with the DNS name of the newly created load balancer so that all traffic continues to be directed to the same domain name. This DNS switch in Route 53 completes the disaster recovery activity; all traffic now points to the new cluster's load balancer.
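
As a reference, the record update can also be scripted; the record name, type, TTL, and hosted zone ID below are placeholders, and an alias record may be preferable for AWS load balancers:

# A hedged sketch: repoint the Control Plane DNS record at the new
# load balancer using an UPSERT (values are placeholders)
cat > dns-update.json <<EOF
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "<cp-domain-name>",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [
          { "Value": "<new-load-balancer-dns-name>" }
        ]
      }
    }
  ]
}
EOF

aws route53 change-resource-record-sets \
    --hosted-zone-id <hosted-zone-id> \
    --change-batch file://dns-update.json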