This whitepaper describes how to build a stand-alone disaster recovery (DR) cluster for SMAX, running on embedded Kubernetes in a different data centre, for cases where the primary cluster cannot run as a stretched cluster across three data centres.
This document outlines the steps required to create and set up a new cluster for SMAX (with or without NativeSACM) using a fresh installation of lightweight OMT, and to restore the applications from a Velero backup of the primary cluster. This approach is useful when the primary cluster cannot be stood up across three different data centres to provide high availability in the event of a data centre outage.
In contrast, if the SMAX cluster is stood up on a cloud platform (such as AWS), the underlying Kubernetes cluster provided by the cloud provider is generally stretched across multiple data centres within the same region, which gives it an inherent disaster recovery capability if one of the data centres becomes unavailable.
For the avoidance of doubt, the strategy described in this document applies to on-premises installations of the SMAX cluster (OMT, SMAX and CMS) using OMT with embedded Kubernetes.
The concepts described in this document have been tested on four components of the SMAX solution (OMT, SMAX, SAM and CMS) at version 2023.05. The concepts should largely apply to other applications running on OMT as well, although those applications may have specific nuances of their own to deal with.
This solution applies to both kinds of disaster recovery. In unplanned disaster recovery, the primary cluster has suddenly become unavailable and cannot be restored using the other methods described in the product documentation. In planned disaster recovery, the primary data centre must undergo a planned or routine outage, which allows enough time to fail over services to the secondary (DR) cluster.
Depending on the backup procedure, a sudden, unplanned disaster can result in minimal to no data loss if the primary cluster's NFS and database are backed up in real time to another data centre site. The MTTR (mean time to recovery) can also be short, provided the DR cluster is built beforehand so that only the recovery steps need to be followed during DR.
This document shows how to build the DR cluster while the primary cluster is up and running, with minimal to no impact on the services that are already live.