Disaster Recovery for Self-Hosted TIBCO Control Plane
This document outlines the steps required to set up a disaster recovery solution for an existing self-hosted TIBCO Control Plane cluster using Velero and AWS services. The process includes creating backups of cloud resources, Kubernetes runtime resources, and restoring them when necessary.
Considerations
- This document refers to the AWS Backup and restore services, but similar services from other cloud vendors can be used as applicable.
- Velero supports multiple cloud providers for backup and recovery, so it can also be used outside AWS. For more information, see the Velero documentation.
- Testing of this procedure was successful: after the restore, the existing data planes were able to connect to the restored Control Plane.
Setup used for this disaster recovery test
Parameters used for this test:
- Number of Clusters: 1 (1 Control Plane)
- Number of Subscriptions: 3 active subscriptions
- Number of Data Planes: 2
- Backup method: manual backups for RDS and EFS (instead of scheduled backups)
- Capabilities deployed in the data planes: TIBCO BusinessWorks Container Edition, TIBCO Flogo, and TIBCO Enterprise Message Service
This information is provided as a reference, and results and processes may differ from use case to use case.
Backup
To prepare for disaster recovery, back up the resources of the Control Plane namespace using Velero. Additionally, use the AWS Backup service to ensure that cloud resources such as RDS and EFS are protected and can be restored when needed.
Step 1: Create an S3 Bucket
Create an Amazon S3 bucket for Velero backups using either the AWS Management Console or the AWS Command-Line Interface (CLI). Replace <BUCKETNAME> and <REGION> with your chosen values.
# Replace <BUCKETNAME> and <REGION> with your own values below.
BUCKET=<BUCKETNAME>
REGION=<REGION>

aws s3 mb s3://$BUCKET --region $REGION
Step 2: Create an IAM Policy
Create a policy that grants Velero the necessary permissions to access AWS resources.
cat > velero_policy.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeVolumes",
                "ec2:DescribeSnapshots",
                "ec2:CreateTags",
                "ec2:CreateVolume",
                "ec2:CreateSnapshot",
                "ec2:DeleteSnapshot"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::${BUCKET}/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::${BUCKET}"
            ]
        }
    ]
}
EOF
Create the IAM policy in AWS:
aws iam create-policy \
    --policy-name VeleroAccessPolicy \
    --policy-document file://velero_policy.json
Step 3: Create a Service Account for Velero
Set up a service account for Velero in your EKS cluster.
CLUSTER_NAME=eks-cluster
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

eksctl create iamserviceaccount \
    --cluster=$CLUSTER_NAME \
    --name=velero-server \
    --namespace=velero \
    --role-name=eks-velero-backup \
    --role-only \
    --attach-policy-arn=arn:aws:iam::$ACCOUNT:policy/VeleroAccessPolicy \
    --approve
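Optionally, verify that the IAM role from the previous step exists and has the Velero policy attached:

# Confirm that the backup role exists and lists VeleroAccessPolicy among its attached policies
aws iam get-role --role-name eks-velero-backup
aws iam list-attached-role-policies --role-name eks-velero-backup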
Step 4: Install Velero in EKS Cluster
Add the VMware Tanzu Helm repository and create a values.yaml file for the Velero installation.
# Add Helm repository
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts

# Create values.yaml file
cat > values.yaml <<EOF
configuration:
  backupStorageLocation:
    - bucket: $BUCKET
      provider: aws
  volumeSnapshotLocation:
    - config:
        region: $REGION
      provider: aws
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.7.1
    volumeMounts:
      - mountPath: /target
        name: plugins
credentials:
  useSecret: false
serviceAccount:
  server:
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::${ACCOUNT}:role/eks-velero-backup"
EOF

# Install Velero
helm install velero vmware-tanzu/velero \
    --create-namespace \
    --namespace velero \
    -f values.yaml
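After the installation, you can confirm that the Velero server is running and that the backup storage location is reachable:

# Check the Velero deployment and the configured backup storage location
kubectl get pods -n velero
velero backup-location get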
Step 5: Backup Cloud Resources
Utilize the AWS Backup service to take on-demand backups of Amazon RDS (Relational Database Service) and Amazon EFS (Elastic File System). You can also create a backup plan for automatic backups.
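For reference, an on-demand backup job can be started from the AWS CLI. This is a sketch: the backup vault name, the ARNs, and the default AWS Backup service role are placeholders to replace with your own values.

# Start an on-demand AWS Backup job for the RDS instance (placeholders shown)
aws backup start-backup-job \
    --backup-vault-name <BACKUP_VAULT_NAME> \
    --resource-arn arn:aws:rds:<REGION>:<ACCOUNT_ID>:db:<DB_INSTANCE_ID> \
    --iam-role-arn arn:aws:iam::<ACCOUNT_ID>:role/service-role/AWSBackupDefaultServiceRole

# Repeat for the EFS file system
aws backup start-backup-job \
    --backup-vault-name <BACKUP_VAULT_NAME> \
    --resource-arn arn:aws:elasticfilesystem:<REGION>:<ACCOUNT_ID>:file-system/<EFS_ID> \
    --iam-role-arn arn:aws:iam::<ACCOUNT_ID>:role/service-role/AWSBackupDefaultServiceRole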
Step 6: Backup Kubernetes Runtime Resources
Using Velero, take two sets of backups: one for the cluster-scoped resources required by the router (Custom Resource Definitions and webhook configurations) and another for all resources in the cp1-ns namespace. Velero also supports scheduled backups, so these operations can be automated to keep your Kubernetes resources consistently protected; see the schedule example after the commands below.
# Back up cluster-scoped resources needed by the router
velero backup create cp1-ns-crds --include-cluster-resources --include-resources validatingwebhookconfigurations,mutatingwebhookconfigurations,customresourcedefinitions

# Back up all resources in the cp1-ns namespace
velero backup create cp1-ns --include-namespaces cp1-ns

# Check backup status
velero get backups

# Verify backup details
velero backup describe <backup-name> --details
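If you prefer scheduled backups over on-demand ones, a Velero schedule can be created; the schedule name and cron expression below are illustrative.

# Create a daily backup of the cp1-ns namespace at 02:00 UTC (name and cron expression are examples)
velero schedule create cp1-ns-daily --schedule="0 2 * * *" --include-namespaces cp1-ns

# List configured schedules and the backups they have produced
velero schedule get
velero get backups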
Restore
This section outlines the steps for restoring the cloud resources and for restoring the Kubernetes runtime resources using Velero.
Cluster Setup
Follow the workshop to create the recovery cluster and install the third-party charts in the EKS cluster, including:
- cert-manager
- external-dns
- metrics-server
- AWS Load Balancer Controller
- dp-config-aws chart for ingress creation
Step 1: Restore Cloud Resources
Restore RDS and EFS
To restore Amazon RDS (Relational Database Service) and Amazon EFS (Elastic File System) using the AWS Backup service:
- Security Group and Subnet Group: Before restoring the database, create the necessary AWS Security Group (SG) and database subnet group for RDS, as this information is required during the restore process. Refer to create-rds.sh for creating the database subnet group and SG.
- Confirm Security Group: Ensure that the Security Group for RDS allows incoming traffic from the cluster VPC CIDR on the RDS port; an example ingress rule is shown after this list.
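As a minimal sketch, the ingress rule can be added with the AWS CLI. The Security Group ID and VPC CIDR are placeholders, and port 5432 assumes a PostgreSQL instance; use your actual RDS port.

# Allow the cluster VPC CIDR to reach RDS on the database port (5432 assumed for PostgreSQL)
aws ec2 authorize-security-group-ingress \
    --group-id <RDS_SECURITY_GROUP_ID> \
    --protocol tcp \
    --port 5432 \
    --cidr <CLUSTER_VPC_CIDR>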
After restoring EFS:
- Set Mount Target and Security Group: Refer to create-efs.sh for creating the mount target and SG required for EFS. Confirm that the SG for EFS allows incoming traffic from the cluster VPC CIDR on NFS port 2049.
- Create Storage Class: Create a storage class backed by the restored EFS. Refer to the create-storage-class documentation; a sample definition is shown after this list.
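The following is a minimal storage class sketch, assuming the AWS EFS CSI driver is installed in the cluster; the file system ID and directory permissions are placeholders to adjust for your environment.

# Create a storage class backed by the restored EFS file system (EFS CSI driver assumed)
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: <enter-efs-id>
  directoryPerms: "700"
EOF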
Step 2: Create Service Account for Velero
Set up a service account for Velero in your EKS cluster.
RECOVERY_CLUSTER=<cluster-name>
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

eksctl create iamserviceaccount \
    --cluster=$RECOVERY_CLUSTER \
    --name=velero-server \
    --namespace=velero \
    --role-name=eks-velero-recovery \
    --role-only \
    --attach-policy-arn=arn:aws:iam::$ACCOUNT:policy/VeleroAccessPolicy \
    --approve
Step 3: Install Velero in EKS Cluster
Add the VMware Tanzu Helm repository and create a values_recovery.yaml file for the Velero installation.
# Add Helm repository
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts

# Create values_recovery.yaml file
cat > values_recovery.yaml <<EOF
configuration:
  backupStorageLocation:
    - bucket: $BUCKET
      provider: aws
  volumeSnapshotLocation:
    - config:
        region: $REGION
      provider: aws
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.7.1
    volumeMounts:
      - mountPath: /target
        name: plugins
credentials:
  useSecret: false
serviceAccount:
  server:
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::${ACCOUNT}:role/eks-velero-recovery"
EOF

# Install Velero
helm install velero vmware-tanzu/velero \
    --create-namespace \
    --namespace velero \
    -f values_recovery.yaml
Step 4: Restore Kubernetes Runtime Resources Using Velero
Using Velero, you can restore the Kubernetes runtime resources. The S3 bucket that holds the backups and the VeleroAccessPolicy IAM policy were created earlier, and the service account and Velero are now installed in the recovery cluster, so you can proceed with restoring the Kubernetes runtime contents.
Restore Backups
Restore CRDs and PVCs
First restore the backup that contains the CRDs required by the router, and then restore only the namespace, config map, and volume-related resources from the cp1-ns backup, so that the database endpoint can be updated and the EFS data moved before the workload pods come back:
# Verify the backups
velero get backup

# Restore the CRDs first
velero restore create --from-backup cp1-ns-crds

# After the CRDs, restore only the PVCs and config maps so that the database endpoint
# can be updated and the data restored before the workload pods come back up
velero restore create --from-backup cp1-ns --include-resources namespaces,configmaps,VolumeSnapshotClass,VolumeSnapshotContents,VolumeSnapshots,PersistentVolumes,PersistentVolumeClaims
Update Database Configuration
Modify the provider-cp-database-config config map with the new database endpoint, either interactively with kubectl edit (below) or non-interactively with kubectl patch (shown after).
kubectl edit cm provider-cp-database-config -n cp1-ns
# Edit the DBHost, LocalReaderHost, and MasterWriterHost endpoints with the restored database URL and save.
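Alternatively, a non-interactive update is possible with kubectl patch. This is a sketch that assumes the three endpoint keys live directly under the config map's data section; replace the placeholder with the restored RDS endpoint.

# Update the database endpoints in one step (placeholder endpoint shown)
kubectl patch configmap provider-cp-database-config -n cp1-ns --type merge -p \
  '{"data":{"DBHost":"<restored-db-endpoint>","LocalReaderHost":"<restored-db-endpoint>","MasterWriterHost":"<restored-db-endpoint>"}}'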
Check PVC Name
Check the Persistent Volume Claim (PVC) name created by the Control Plane in the new cluster; it is needed when moving the contents in EFS in the next step.
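For example, the PVC and its bound volume can be listed with kubectl:

# List the PVCs created in the Control Plane namespace and note the bound volume name
kubectl get pvc -n cp1-ns
kubectl get pv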
Access EFS and Move Contents
To access Amazon EFS (Elastic File System) and move contents, perform the following steps:
- Create a Persistent Volume (PV) for EFS: This defines the EFS file system as a storage resource in your Kubernetes cluster.
- Create a Persistent Volume Claim (PVC): This requests storage from the PV.
- Create a Pod: This pod mounts the PVC and allows you to access the EFS file system to move contents and change permissions.
# Set the EFS ID
export EFS_ID=<enter-efs-id>

# Create Persistent Volume (PV)
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 50Gi # Please adjust the storage as necessary.
  nfs:
    path: "/"
    server: ${EFS_ID}.efs.us-west-2.amazonaws.com
EOF

# Create Persistent Volume Claim (PVC)
kubectl apply -f - <<EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: efs-pvc
spec:
  volumeMode: Filesystem
  storageClassName: ""
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi # Please adjust the storage as necessary.
  volumeName: efs-pv
EOF

# Create Pod to access EFS
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: checkefscontentroot
spec:
  restartPolicy: Never
  nodeSelector:
    kubernetes.io/os: linux
  containers:
    - name: cli
      image: alpine:3.20
      command: [ "/bin/sh", "-c" ]
      args:
        - "sleep 86500"
      volumeMounts:
        - name: store-vol
          mountPath: /private/cp
  volumes:
    - name: store-vol
      persistentVolumeClaim:
        claimName: efs-pvc
EOF

# Access the Pod and execute commands
kubectl exec -it checkefscontentroot -- sh

# Inside the Pod, execute the following commands:

# Navigate to the mounted directory
/ # cd private/cp/
/private/cp # ls

# Navigate to the backup directory
/private/cp # cd aws-backup-restore_2024-12-20T05-09-36-045105050Z/
/private/cp/aws-backup-restore_2024-12-20T05-09-36-045105050Z # ls

# Copy contents to the desired location, i.e. into the new PVC folder
/private/cp/aws-backup-restore_2024-12-20T05-09-36-045105050Z # cp -r pvc-903b1ce3-ce28-4a29-8fb4-ac2475d98f78/* /private/cp/<<new-pvc>>

# Change ownership and permissions
/private/cp/aws-backup-restore_2024-12-20T05-09-36-045105050Z # chown -R 50000:50000 /private/cp/<<new-pvc>>
/private/cp/aws-backup-restore_2024-12-20T05-09-36-045105050Z # chmod -R 700 /private/cp/<<new-pvc>>
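Once the contents are copied and the permissions adjusted, the helper pod and the temporary PV and PVC created above can be removed, assuming you no longer need direct access to the EFS root:

# Clean up the helper resources used for moving the EFS contents
kubectl delete pod checkefscontentroot
kubectl delete pvc efs-pvc
kubectl delete pv efs-pv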
Step 5: Restore Remaining Backups
In this step, restore the remaining runtime resources from the cp1-ns backup: first everything except the TIBCO custom resources, and then the custom resources once the router is running, because restoring them requires the router webhook to be ready.
# After updating the config map with the correct database details and restoring the PVC data,
# restore everything else except the CRs. The CRs can only be restored once the router has
# started successfully, as they need the router webhook to be ready.
velero restore create --from-backup cp1-ns --exclude-resources tibcoclusterenvs,tibcoresourcesettemplates,tibcoresourcesets,tibcoroutes,tibcotunnelroutes

# Wait until the router pod has started successfully

# Once the router is running, restore the CRs
velero restore create --from-backup cp1-ns --include-resources tibcoclusterenvs,tibcoresourcesettemplates,tibcoresourcesets,tibcoroutes,tibcotunnelroutes
Check the status of the pods in the cp1-ns namespace; all pods should reach the Running state.
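For example, the pod status and the restore results can be checked with:

# Verify that all Control Plane pods are up and the restores completed without errors
kubectl get pods -n cp1-ns
velero restore get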
Step 6: DNS Switching
Update the existing record in the Route 53 service with the DNS name of the newly created load balancer so that all traffic continues to be directed to the same domain name. This DNS switch in Route 53 finalizes the disaster recovery activity, and all traffic now points to the load balancer of the recovered cluster.
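As an illustration, assuming the Control Plane domain is served by a CNAME record in a Route 53 hosted zone, the record can be updated with the AWS CLI; the hosted zone ID, record name, and load balancer DNS name below are placeholders.

# Point the Control Plane DNS record at the new load balancer (values are placeholders)
aws route53 change-resource-record-sets \
    --hosted-zone-id <HOSTED_ZONE_ID> \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "<my-cp-domain.example.com>",
          "Type": "CNAME",
          "TTL": 300,
          "ResourceRecords": [{ "Value": "<new-load-balancer-dns-name>" }]
        }
      }]
    }'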