Expensive Mistakes…and Why You Cannot Afford Them On Your HA/DR Solutions

I was looking at my phone, waiting for my Amazon order to arrive.

 

I heard a notification confirming that my delivery had arrived. The photo-on-delivery feature is a great way to confirm that the package was indeed delivered, and to the right address.

 

To my surprise, I saw an unfamiliar porch in the photo with my package. The house number said 1000. But it was definitely not my house.

 

For the next half-hour, I walked around the neighborhood, trying to find another house with the number 1000. When I finally found the house, I knocked on the door and showed the owner a copy of my ID, confirming that I owned the package. He looked a bit confused but eventually gave me my package.

 

I complained to Amazon about the incorrect delivery, telling them that I spent half an hour looking for the address where my package was delivered. They refunded my payment and apologized for the mistake.

 

Let’s Do The Math

 

I like using numbers when I talk to my customers about the cost of an outage. It gives them an idea of the risk the business takes on when a disaster strikes.

 

In the case of my Amazon delivery, a $50 refund is no big deal for a billion-dollar company.

 

But imagine Amazon doing 1000 daily deliveries. I’m being conservative just to simplify my math. I’m sure Amazon does 100X that all over the world. Just because I’m Asian doesn’t mean I’m very good at math. Quite the contrary, I should say.

 

At 1000 deliveries, let’s say 5% of the packages get delivered to the incorrect address. At a $50 refund each, that’s an average of $2,500 per day in refunds. Again, not a big deal for a behemoth like Amazon.

 

In a year, that’s $912,500. That’s almost a million dollars in refunds for incorrect package delivery.
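If you want to sanity-check that figure, here is the back-of-the-envelope math as a quick query, using the same assumptions as above:

```sql
-- Back-of-the-envelope math, using the assumptions above:
-- 1000 deliveries a day, 5% misdelivered, $50 refunded per misdelivered package
SELECT 1000 * 0.05 * 50       AS refunds_per_day,   -- $2,500
       1000 * 0.05 * 50 * 365 AS refunds_per_year;  -- $912,500
```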

 

In a previous blog post, I talked about a customer that lost an average of US$75K of potential revenue due to a 5-hour outage. I had another customer from years ago that lost US$120K from a 2-hour outage.

 

But where’s the mistake? Nobody can predict when an outage will happen.

 

Expensive Mistakes When Deploying HA/DR Solutions

 

I see these happen often enough in my customers’ environments that they are worth mentioning. Because as much as I love solving problems, I prefer avoiding them in the first place if I can.

 

1) Choosing the technology before being clear on the goal. Microsoft has done a great job of advertising Always On Availability Groups as the ultimate HA/DR solution since its introduction. It’s the reason why every new SQL Server deployment that needs HA/DR will almost certainly be on an Always On Availability Group. And it doesn’t help that “experts” will instinctively recommend it.

 

A technical solution is a means to an end. If you’re not clear on the “end”, no amount of “means” will help you get there.

 

That’s why I always start customer conversations with recovery point (RPO) and recovery time (RTO) objectives. Doing so steers them away from the technology so they can focus on the real goal.
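As a rough illustration of keeping the goal front and center: an RPO only means something if you can verify it. Here’s a minimal T-SQL sketch, assuming a hypothetical 15-minute RPO and a database I’m simply calling YourDatabase, that checks whether the transaction log backup schedule alone could even meet that target:

```sql
-- A minimal sketch, assuming a hypothetical 15-minute RPO and a database named [YourDatabase]
-- It looks for gaps between recent log backups that are larger than the RPO target
DECLARE @RPOMinutes int = 15;

WITH LogBackups AS
(
    SELECT backup_finish_date,
           LAG(backup_finish_date) OVER (ORDER BY backup_finish_date) AS previous_backup
    FROM msdb.dbo.backupset
    WHERE database_name = N'YourDatabase'
      AND type = 'L'  -- log backups only
      AND backup_finish_date >= DATEADD(DAY, -7, GETDATE())
)
SELECT backup_finish_date,
       DATEDIFF(MINUTE, previous_backup, backup_finish_date) AS gap_minutes
FROM LogBackups
WHERE DATEDIFF(MINUTE, previous_backup, backup_finish_date) > @RPOMinutes
ORDER BY backup_finish_date;
-- Any rows returned mean the current backup schedule cannot guarantee the stated RPO
```

If that check returns rows, the conversation needs to go back to the goal before it goes anywhere near a technology choice.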

 

Besides, starting with technology solutions before being clear on the goal will introduce other problems you don’t necessarily need. Like very expensive SQL Server licensing and additional hardware.

 

Avoid the expensive mistake of picking the technology solution before being clear on what the HA/DR goals are.

 

2) Deploying a technical solution without the proper process that will help achieve the goal. I’ve seen cases where a vendor or an external consultant gets hired to build the Always On Availability Group. The solution gets tested and everyone’s happy.

 

Until…

 

An outage happens, and the engineer assigned to on-call duty takes it upon himself to fix the problem on his own.

 

And because somebody else built the solution, the operational staff are sometimes unaware of what it really does. So, when an issue occurs, they treat it like everything else. This includes monitoring and escalation procedures.

 

HA/DR solutions need to have their own operational processes. Like patching or installing updates. I learned this the hard way when I forgot to inform the operations engineers not to reboot all of our domain controllers at the same time. A failover for a SQL Server failover clustered instance that’s supposed to take a few seconds took almost 15 minutes because the domain controllers were unavailable.

 

Once you’ve made a decision to deploy an HA/DR solution, create a process for properly managing it. Change management, monitoring, maintenance, troubleshooting, escalation, etc.
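Monitoring is the easiest place to start. As a sketch of what I mean, a simple check like the one below (the object names are whatever exists in your environment) can be scheduled and wired into your existing alerting and escalation process:

```sql
-- A minimal sketch of an Always On Availability Group health check to schedule and alert on
-- Returns one row per replica that is not reporting a HEALTHY synchronization state
SELECT ag.name                          AS availability_group,
       ar.replica_server_name,
       ars.role_desc,
       ars.synchronization_health_desc, -- HEALTHY / PARTIALLY_HEALTHY / NOT_HEALTHY
       ars.connected_state_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
       ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_availability_replica_states AS ars
       ON ars.replica_id = ar.replica_id
WHERE ars.synchronization_health_desc <> N'HEALTHY';
-- Any rows returned should feed straight into your escalation process
```

The point isn’t this particular query. The point is that whoever is on call knows what it means when it fires, and what to do next.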

 

And don’t get me started on DR strategies that look good on paper but have never been tested.

 

3) Deploying a technical solution without the proper people to support it. Whether that’s your in-house engineers or a managed services provider, make sure there are skilled people in place who know the solution and are available when they’re needed.

 

I emphasize having both the CAPABILITY and the AVAILABILITY. Because what good is a highly skilled engineer if they’re not available? Or an available engineer who isn’t capable of doing the work?

 

If you have very tight service level agreements (SLAs), have enough people on your team to make sure you meet them. And make sure those people are capable and skilled enough to resolve technical issues.

 

A few years ago, I got a call from a customer who “accidentally” took down an Always On Availability Group after adding another replica. In the process of adding the new node to the failover cluster, the engineer forgot to uncheck the box that says “Add all eligible storage to the cluster.” That was enough to take the Always On Availability Group offline for a good 10 hours.

 

Another case I worked on involved a sysadmin performing maintenance on the failover cluster. He didn’t know that failover for an Availability Group should be done from inside SQL Server and not from Failover Cluster Manager. When the failover didn’t succeed after several attempts, he assumed that the standby server/secondary replica had issues. So, he decided to reboot the machine.

 

Let’s just say that not knowing the failed attempts had tripped the cluster’s Maximum Failures in the Specified Period threshold, combined with the secondary replica not being failover ready, caused a 4-hour outage. I had to run a FORCE_FAILOVER_ALLOW_DATA_LOSS to bring the Always On Availability Group back online.
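For reference, a forced failover is a last resort precisely because it acknowledges possible data loss. This is roughly what it looks like (the availability group and database names below are just placeholders):

```sql
-- Run on the secondary replica that needs to become the primary
-- FORCE_FAILOVER_ALLOW_DATA_LOSS explicitly accepts the possibility of losing data
ALTER AVAILABILITY GROUP [YourAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;

-- After a forced failover, data movement on the remaining secondary replicas is suspended
-- and has to be resumed manually, per database, on each of those replicas
ALTER DATABASE [YourDatabase] SET HADR RESUME;
```

That is not a command you want an on-call engineer discovering for the first time in the middle of an outage.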

 

Mistakes are bound to happen. We cannot avoid them. We’re human.

 

But when you’re dealing with an HA/DR solution, mistakes can be very expensive.

 

The way to avoid expensive mistakes is to learn from other people’s experiences.


P.S. How sure are you that you’re not making these expensive mistakes with your HA/DR solutions? Take an inventory of your solutions, your processes, and your people.

 

If you need help with this, feel free to reach out. My company provides services to help you avoid these very expensive mistakes. Or, if you’ve already made them, to get them fixed ASAP. Let’s talk. Schedule a call with me using my calendar link.


 
