I recently got back from a wilderness camping trip in upstate New York. It’s a chance for me to disconnect and enjoy being alone in nature.
Wilderness camping means being off the grid for a few days: no phone signal, no public restrooms, no grocery stores, nothing. It also means planning and preparing for everything you’ll need while you’re in the middle of nowhere.
Like most people, I rely on my smartphone for the basics like communication and navigation. So I made sure my phone was fully charged before I got to the start of the trail, and I had two battery packs on standby. My goal was to minimize phone usage to conserve battery.
Before I started the hike, I took pictures of the trail maps and the landmarks I needed to watch for to get to my campsite. That way, I wouldn’t have to use GPS most of the time. I also wanted to make sure I made it to the campsite before sundown.
As I was getting close to the campsite, I realized I had accidentally turned on the camera while my phone was in my back pocket. I don’t know how long it had been running. Maybe an hour? Maybe more? But my phone only had 9% battery left.
No worries. I brought battery packs.
My heart started to race the minute I turned my battery packs on. What I thought was a fully charged pack only had 5% left. The other pack had a blinking red light. I had brought not one but two battery packs, and neither of them was useful.
This reminded me of a recent case I worked on. I was brought in to bring an Always On Availability Group back online – FAST.
The customer deployed an Availability Group for their mission-critical platform based on their service provider’s recommendation. They did this despite not having the internal resources to manage a complex high availability solution. That’s corporate speak for “no one knows how this works.”
The service provider took care of the design and deployment. When the platform went live, they also took care of managing it. And for a year and a half, everything went well.
Up until this point, that is. Otherwise, they would not have called me to intervene.
The customer had a massive power failure in the middle of a business day. That power failure took down their failover cluster. They managed to get the power and the failover cluster back as fast as they could.
But since they did not have the skills to manage the Availability Group, the service provider took over. That’s when the problem got worse.
The Goal of a High Availability Solution
Implementing a high availability solution such as an Always On Availability Group should not be taken lightly. You don’t just “do it” because your service provider said so. Or because your CIO heard a speaker talk about this solution at a recent conference he attended.
A high availability solution is implemented with one thing in mind: reducing downtime.
This means everyone is clear on what that acceptable downtime is. And when I say everyone, I mean everyone – executives, IT team, service providers, consultants, etc.
So, being clear on the acceptable downtime means knowing exactly how many minutes of downtime the business can tolerate, and putting a real number on it.
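To make that concrete, here’s a minimal sketch in T-SQL of what that number looks like – assuming a made-up 99.95% uptime target, not any particular customer’s SLA:

```sql
-- A minimal sketch: turn an uptime target into a concrete downtime budget.
-- The 99.95% target below is an assumed example, not a recommendation.
DECLARE @UptimeTarget DECIMAL(7,5) = 0.99950;   -- assumed example target
DECLARE @MinutesPerMonth INT = 30 * 24 * 60;    -- 43,200 minutes in a 30-day month

SELECT CAST(@MinutesPerMonth * (1 - @UptimeTarget) AS DECIMAL(10,1))
       AS AllowedDowntimeMinutesPerMonth;       -- about 21.6 minutes
```

Whatever your number turns out to be, that is the budget every decision about the solution gets measured against.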
When Your High Availability Solution Becomes a Disaster
Talking to the customer once things were back to business as usual, four things became clear:
- they do not have the skills internally to manage the solution
- they do not have a disaster recovery strategy for when their high availability solution fails
- they have not communicated to everyone involved what the acceptable downtime is
- they do not have a proper escalation process in place when dealing with such incidents
After the power came back on and the failover cluster was online, it was the service provider’s responsibility to bring the Availability Group – and the databases – online. Not having the internal resources and skills means you’re completely dependent on someone else.
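For context, bringing an Availability Group back after an outage usually starts with checking where each replica and database actually stands. A rough sketch of that first health check might look like the query below – it only reads the standard system DMVs and assumes nothing about this customer’s environment:

```sql
-- A rough first health check after the failover cluster is back online.
-- Run on a replica; it only reads the standard Availability Group DMVs.
SELECT ag.name                          AS AvailabilityGroup,
       ar.replica_server_name           AS Replica,
       ars.role_desc                    AS CurrentRole,
       ars.synchronization_health_desc  AS ReplicaHealth,
       DB_NAME(drs.database_id)         AS DatabaseName,
       drs.synchronization_state_desc   AS DatabaseSyncState
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
  ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_availability_replica_states AS ars
  ON ars.replica_id = ar.replica_id
LEFT JOIN sys.dm_hadr_database_replica_states AS drs
  ON drs.replica_id = ar.replica_id;
```

What you do next depends entirely on what those DMVs report – which is exactly the kind of judgment call that needs someone who actually knows the solution.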
There’s nothing wrong with relying on someone else. Especially if they have a very good track record of helping you achieve your goals. In fact, this is highly recommended as the company continues to grow. You need others who are very good at what they do so you can keep being good at what you do. This kind of relationship is beneficial and helps both parties grow at the same time.
But it’s dangerous to bring in someone external to manage something critical when they do not have a very good track record of achieving the main goal. Be it vendors, consultants, or service providers. In this case, managing an Always On Availability Group solution.
Four (4) hours passed from the time the failover cluster came back online to when the databases became accessible again.
That’s FOUR HOURS. Plus the time from when the power outage happened until the failover cluster was back online. That’s roughly five hours in total.
The platform generates an average revenue of US$15,000 per hour. That outage cost them US$75K.
It’s very dangerous to assume that your high availability solution is all you need to be protected. I’ve worked with customers who ditched database backups immediately after deploying an Always On Availability Group. Or those who do not have a disaster recovery plan for when their high availability solution fails.
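If you’re not sure whether that has happened in your own environment, a quick sanity check against msdb’s backup history shows when each database last had a full backup. This is just a sketch – adjust it to your own backup policy:

```sql
-- A minimal sketch: when was each user database last backed up in full?
-- 'D' in msdb.dbo.backupset means a full database backup.
SELECT d.name                    AS DatabaseName,
       MAX(b.backup_finish_date) AS LastFullBackup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name
      AND b.type = 'D'
WHERE d.database_id > 4          -- skip the system databases
GROUP BY d.name
ORDER BY LastFullBackup;         -- never-backed-up databases (NULL) sort first
```

If anything in the Availability Group shows NULL, or a date older than your policy allows, the high availability solution is quietly standing in for a disaster recovery plan it cannot replace.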
It’s also dangerous to rely on someone who does not have a very good track record of helping you reduce downtime. Because they could potentially cause more unnecessary downtime. Exactly like what happened with the service provider.
Even worse is when they don’t know what the acceptable downtime is. They’ll end up burning time on a Severity 1 issue without the urgency it demands.
When I realized I wouldn’t have my phone for the rest of the camping trip, I resolved to rely on what I did have – a pen and pocket notebook, my five senses, and my navigational skills.
I took notes on the trail markers I followed on my way to the campsite. I recalled landmarks from the photos of the trail maps and wrote them down in my notebook. I followed the path of the sun to keep track of east and west (I had parked my car to the east and was heading northwest towards the campsite).
I realized that while I can plan and prepare for anything, it’s impossible for me to anticipate everything. However, with proper training, I can quickly adapt to any challenge when the unexpected happens.
I also realized I can live without my smartphone for three days.
P.S. Make sure your mission-critical databases are protected from potential disasters. Take an inventory of your infrastructure and make sure the right high availability and disaster recovery solutions are in place.
If you need help with this, feel free to reach out. My company provides services to help you protect your SQL Server databases from unexpected disasters. Schedule a call with me using my calendar link.
Schedule a Call