Finding windows of opportunity is just a walk in the park

Walking the dog

I find taking my little black Patterdale Terrier Freddie for a walk a great way to get the old brain cells churning, and I often manage to come up with ideas or solutions to troublesome problems whilst chasing him around the local parks and trails.

So, when I was faced with the task of coming up with a plan for updating all of the Operating Systems in one of my Managed Service Customer’s landscapes, Freddie got to go for a few extra walks.

But it wasn’t until they also asked about Disaster Recovery that a solution started to come together.

DR and BCP

Like many people at the time, I first came across Disaster Recovery and Business Continuity Plans in the aftermath of 9/11. It’s something we should all have been thinking about before then, but for most of us, knowing that we had an off-site tape backup was all we needed. Our backups worked fine for restoring accidentally deleted files, but we never really tested them in a full recovery scenario.

What’s the difference?

The planning and testing we subsequently had to do also highlighted the difference between DR and BCP, described to us quite simply: DR is about keeping IT going, whereas BCP is about keeping the Business going.

The two are closely linked, of course, but my focus was always on DR, even if it was interesting to be involved in the BCP too. It’s all well and good for the IT department to know how to get access to email, but how does the Sales Team host a client meeting if they can’t get into the office?

Joined up thinking

So, back to my Customer’s problems. By combining them and adding in a little hard-earned experience and Centiq expertise, I was able to start designing a set of procedures that would not only allow me to carry out the updates I needed to do, but could also be used in the event of a DR test or, indeed, a real disaster.

The only missing piece of the puzzle was identifying suitable downtime windows. Simple, yes?

Aiming high to keep downtime low

Achieving the industry-standard target of “five nines” (99.999%) of system availability is a good goal to aim for. But from my point of view as a Systems Administrator, the 5.26 minutes of downtime a year that it allows doesn’t give me much time to carry out reboots, updates, failovers or DR tests.
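
For anyone wondering where that 5.26 minutes comes from, here is a quick back-of-the-envelope sketch in Python. It isn’t part of the original planning work, just the arithmetic behind the figure and a couple of other common availability targets for comparison:

    # Back-of-the-envelope sketch: the yearly downtime budget implied by an availability target.
    # The 5.26-minute figure quoted above is simply (1 - 0.99999) of a year's worth of minutes.
    MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

    def downtime_budget_minutes(availability: float) -> float:
        """Allowed downtime per year, in minutes, for a given availability target."""
        return (1 - availability) * MINUTES_PER_YEAR

    for label, availability in [("three nines", 0.999),
                                ("four nines", 0.9999),
                                ("five nines", 0.99999)]:
        print(f"{label} ({availability:.3%}): {downtime_budget_minutes(availability):.2f} minutes/year")

    # five nines (99.999%): 5.26 minutes/year -- not much of a window for reboots or failovers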

So updating multiple servers, keeping downtime to a minimum and making sure we could cope with a disaster looked like it was going to be challenging.

So what next?

Well, the first stage was to map out their entire landscape to try and get a handle on what they actually had. I spoke to their IT department to help me identify all of the hardware components, networks and other links they had, along with the virtualisation that sits on top of it all, and captured the lot in a diagram. Armed with this diagram, I was then able to sit down with the Customer’s Product Owner, point to each part of it in turn and ask them a simple question:

What happens if I turn this bit off?

Extending the question a little further allowed us to start filling in some blanks and putting together a plan that could cope with both planned and unplanned downtime:

What happens if this bit goes down unexpectedly?
What has to happen if I want to take this bit down?

Interestingly, the answers were not always the same.

Fitting the pieces together

By posing these questions to the customer in this way, it was then quite easy to build up a picture of the links between systems and the knock-on effect of any single piece of the puzzle being unavailable. Quite often people will have their own way of coping with system downtime, and what might be perceived as a major problem by the Business could in fact be just a minor inconvenience if downtime is kept to a minimum.

However, it’s not until you actually pose these questions that people begin to realise there may be an alternative to trying to keep everything running 24×7.
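
To make that concrete, here is a minimal sketch of how the answers to those questions can be recorded as a simple dependency map and used to read off the knock-on effect of switching any one piece off. The component names and dependencies are invented for illustration, not taken from the customer’s landscape:

    # Illustrative only: invented components and dependencies, not the customer's landscape.
    # Each entry lists the things a component needs in order to keep running.
    DEPENDS_ON = {
        "web_frontend": ["app_server"],
        "app_server":   ["database", "file_share"],
        "reporting":    ["database"],
        "database":     [],
        "file_share":   [],
    }

    def knock_on_effect(component: str) -> set[str]:
        """Everything that directly or indirectly stops working if `component` is turned off."""
        affected = set()
        for name, deps in DEPENDS_ON.items():
            if component in deps and name not in affected:
                affected.add(name)
                affected |= knock_on_effect(name)
        return affected

    # Taking the database down also takes out the app server, the web frontend and reporting:
    print(knock_on_effect("database"))   # {'app_server', 'web_frontend', 'reporting'}

Even a rough map like this makes it obvious which bits can be switched off in isolation and which ones drag half the landscape down with them.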

One final question

But before finalising the update and DR plan, it was worth considering one further question:

How do we put it all back together again?

It’s all well and good having a plan which keeps Production downtime to a minimum when doing updates or DR tests, but if it takes several days to return to normal, there won’t be much enthusiasm for using it.

Understanding what we have and how to use it

So, by talking to the IT department to get an understanding of the infrastructure, and by speaking to the Business about their actual requirements rather than just their perceived ones, designing a plan that was simple, effective and usable for both updates and DR turned out to be easily achievable.

In the end, I was able to assure the customer that we could do the updates with minimal downtime, and that if there was ever a need to invoke DR, everything would be covered.

In the long run, they may have to sacrifice one of those five nines, but if the Business can cope with that, then it’s probably not a bad thing if it means we can keep their systems up to date and ready for any impending disaster.

Martin Sverdloff
Centiq Operations Team

Centiq Dog

“Come on Freddie, time for a walk – I’ve got another couple of problems to work on!”
Freddie has over 1k followers on Instagram as Fredsverds.