Why are resilient landscapes important?
SAP landscapes are at the core of large organisations. They hold the system of record for millions of transactions, across the majority of globally recognised household brands. An outage of these core systems can cost the organisation significant earnings and critical time. It’s no surprise that customers architecting SAP systems take topics like resilience, high availability, disaster recovery, backup and business continuity seriously.
Critical workloads: facing the new attitude towards cloud and changes in processes (physical and mental shifts)
When you decide to move critical SAP workloads to Microsoft Azure, there is a temptation to dive directly into discussions about clustering, replication and backups. To meet the needs of the business at the lowest cost, you need to take a step back and get agreement on the business expectations for service recovery. "Critical" can translate to different outcomes for different organisations.
We need to talk about recovery time objectives (RTO), recovery point objectives (RPO) and recovery level objectives (RLO) to make sure we don’t lose sight of the purpose of these technical solutions. When you move to Azure you have new concepts to consider, like Availability Sets, Availability Zones, multiple Regions and many more backup and storage options.
It’s really easy to over-engineer when you are presented with so much choice.
It’s important to challenge some of the assumptions that come from an on-prem or co-location design perspective, especially when you are also moving to an SAP HANA based implementation like S/4HANA, or from a physical to a virtual implementation. The solution you have today might not deliver what you need. Some customers find, when they take the time to discuss these needs with the business, that the requirements have changed since the systems were first implemented. Systems can be promoted or demoted in criticality without any changes to the underlying infrastructure, especially when you consider that SAP systems typically live for decades.
As you start to consider RTO, RPO and RLO values, you will need to map them against the potential disaster scenarios that exist in an Azure implementation, working through from the most likely to the least likely. Typically, the cost of protecting against a failure scenario rises as its probability falls. With Microsoft’s globally distributed data centres, it is possible to design protection against catastrophic regional events. This would have been difficult to achieve with the classic data centre model.
Your organisation will have to draw the line at an acceptable trade-off between cost and risk. There will also need to be discussions about expectations for development and testing projects. It’s common to assume that business continuity is about production systems only, but if an organisation has an aggressive digital transformation agenda with significant investment in development activities, the RTO/RPO/RLO for non-production systems needs to be considered with the same process.
For mission-critical systems, some customers will also consider what level of resilience they need whilst running in DR mode.
Customer example from Centiq
Companies like Centiq, which provide SAP HANA on Azure services to global customers, have a level of understanding that allows global brands to evaluate their options and make informed decisions. Here we look at an example from one of our customers and the considerations observed.
The table below shows the example output you might expect from the recovery objectives review.
Azure failure scenarios | Failure examples | RTO | RPO |
Loss of Region pair | A widespread national disaster that impacted multiple Azure regions within a geographical area. | N/A | N/A |
Loss of Single Node / Component whilst running in DR | A single node is accidentally terminated, or suffers HW failure whilst running on the DR solution | 24 hours | 20 min |
Loss of Region | Loss of all Availability Zones within a region that is unlikely to recover within RTO | 12 hours | 20 min |
Loss of Availability Zone | Fire, flood, power + cooling disruption to data centre operations | 5 hours | 20 min |
Loss of Data | Data corruption or deletion through human error or malicious attack | 5 hours | 20 min |
Loss of Availability Set | Power + cooling disruption to data centre operations within a region | 10 min | 0 min |
Loss of Single Node / Component | A single node is accidentally terminated, or suffers HW failure | 10 min | 0 min |
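As a minimal sketch, the objectives in the table can be captured as data so that the results of a DR test can be checked against them. The class and function names here are illustrative assumptions, not part of any Centiq or Azure tooling; the RTO/RPO values are taken from the table, converted to minutes, with N/A represented as None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Objective:
    scenario: str
    rto_min: Optional[int]  # recovery time objective in minutes, None = N/A
    rpo_min: Optional[int]  # recovery point objective in minutes, None = N/A


# Values transcribed from the table above (hours converted to minutes).
OBJECTIVES = [
    Objective("Loss of Region pair", None, None),
    Objective("Loss of Single Node / Component whilst running in DR", 24 * 60, 20),
    Objective("Loss of Region", 12 * 60, 20),
    Objective("Loss of Availability Zone", 5 * 60, 20),
    Objective("Loss of Data", 5 * 60, 20),
    Objective("Loss of Availability Set", 10, 0),
    Objective("Loss of Single Node / Component", 10, 0),
]


def find_objective(scenario: str) -> Objective:
    """Look up the objective for a scenario by its exact name."""
    for obj in OBJECTIVES:
        if obj.scenario == scenario:
            return obj
    raise KeyError(scenario)


def meets_objective(obj: Objective, recovery_min: int, data_loss_min: int) -> bool:
    """Return True if a measured recovery stays within the RTO and RPO.

    Scenarios marked N/A have no objective to breach, so they always pass.
    """
    if obj.rto_min is None or obj.rpo_min is None:
        return True
    return recovery_min <= obj.rto_min and data_loss_min <= obj.rpo_min


# Hypothetical test result: a regional failover that took 10 hours and lost
# 15 minutes of data is within the 12 hour / 20 min objective.
region = find_objective("Loss of Region")
print(meets_objective(region, recovery_min=600, data_loss_min=15))
```

Keeping the objectives in one place like this makes it easier to spot when a design choice, or a test result, no longer matches what the business agreed to.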
Choosing an expert who knows both SAP and Microsoft Azure is the right choice, as their focus is purely on delivering successful outcomes.
When to consider clustering and when not to?
The example above is similar to that of a customer that had tight, externally governed objectives to meet. It was an S/4HANA deployment on Azure for an organisation that was used to an IBM PowerHA solution on a DB2 database, split between two sites (co-location) with LVM-based synchronous mirroring. This is a classic deployment seen at many of the customers moving to Azure.
The RTO/RPO requirements for Production dictated a clustered solution. Big SAP HANA systems take time to load data into memory, so a recovery time of minutes can only be achieved through clustering. But that’s not the case for everyone. If you have an RTO measured in hours for hardware failure, or you have a small database that takes less time to load, you might be able to accept the standard SLAs offered by Azure VMs. It’s worth challenging the need for clustering, as it adds a level of complexity that should be avoided if not needed. Clustering needs careful management and can make some changes more difficult. Poorly managed clusters can create service outages through false detection of issues, and it can take specialist skills to detect and diagnose issues with clusters.
For this customer, the risks of complexity in clustering were mitigated through Infrastructure as Code (IaC) automation techniques, which improved the quality of build and maintenance and reduced the dependency on specialist skill sets. The ASCS cluster setup on Red Hat required over 200 configuration choices that would typically be managed by hand.
Introducing IaC reduced the opportunity for human error and brought Git version control to the configuration that changes the behaviour of the cluster. When it came to clustering HANA on Red Hat, careful management of the configuration settings was key to tuning the STONITH method to minimise the failover delay from “killing” the other node through the Microsoft Azure fencing agent.
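The reasoning above can be sketched as a simple check. This is an illustration only: the function name, the inputs and the assumption that restart-only recovery time is the VM restart time plus the HANA load time are ours, and the numbers in the usage lines are hypothetical rather than measured values.

```python
def needs_clustering(rto_min: float, vm_restart_min: float, hana_load_min: float) -> bool:
    """Return True if restart-only recovery cannot meet the RTO.

    Assumes recovery without a cluster means restarting the VM and then
    waiting for SAP HANA to load its data into memory, one after the other.
    """
    restart_only_recovery_min = vm_restart_min + hana_load_min
    return restart_only_recovery_min > rto_min


# Hypothetical large system: 15 min VM restart + 30 min HANA load vs a 10 min RTO.
print(needs_clustering(rto_min=10, vm_restart_min=15, hana_load_min=30))   # True

# Hypothetical system with an RTO of 4 hours: a restart is enough.
print(needs_clustering(rto_min=240, vm_restart_min=15, hana_load_min=30))  # False
```

In practice the inputs should come from timed tests of your own systems, and the decision will also weigh the operational cost of running a cluster.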
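One benefit of keeping cluster configuration in Git is that the intended settings can be compared with what is actually running. The sketch below shows the idea with plain dictionaries; the setting names and values are hypothetical placeholders, not real Pacemaker or Azure fencing agent parameters, and how you export the live configuration is left out.

```python
from typing import Dict, Optional, Tuple


def config_drift(
    desired: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return settings whose actual value differs from the desired one.

    Each entry maps the setting name to (desired, actual); a missing
    setting on either side is reported as None.
    """
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical settings: "desired" would come from the Git repository,
# "actual" from the running cluster node.
desired = {"fence_timeout": "900", "monitor_interval": "3600"}
actual = {"fence_timeout": "600", "monitor_interval": "3600"}
print(config_drift(desired, actual))  # {'fence_timeout': ('900', '600')}
```

Running a check like this against both nodes, or both sites, is one way to catch hand-made changes before they affect failover behaviour.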
The difference between HA & DR in Microsoft Azure
Customers coming from a co-location setup with high availability (HA) clustering often consider this a disaster recovery (DR) solution. We have some customers who have HA between halls on the same site and are happy to consider this DR. If you speak with an Azure architect, they will be reluctant to design an equivalent solution under the banner of DR. But if that meets your objectives, you can get the same levels of resilience by choosing Azure Regions that offer Availability Zones. They are the equivalent of co-located data centres, close enough to offer synchronous replication.
5 key things to consider when building resilience into your SAP deployment on Azure
- Start with the RTO/RPO/RLO needs of your business, don’t just replicate what you have today
- Think about RUN before you build – Think about the operational challenges for managing any complexity that creeps into your design.
- Benefit from consistent configuration between sites through IaC deployment
- Consider building runbook automation to improve the success of your DR process
- Build into the design the ability to test changes to HA and DR in a system other than Production
If you would like to discuss any of the content above please do contact us.