“I want zero downtime and zero data loss” – anonymous IT manager
Don’t we hear this a lot when implementing a new high availability and disaster recovery (HA/DR) solution? The stakeholders want zero downtime and zero data loss. As IT professionals, we struggle to find a solution that meets those requirements. We ask questions on forums, newsgroups, Facebook groups, Twitter, etc., trying to figure out how to make the most of the technology that we have. We end up frustrated when we realize that we can’t, and we start looking for a better way to explain that to the stakeholders. But then it still happens. Over and over again.
We’ve all believed in the mythical five 9’s, the 99.999% uptime in high availability requirements. That equates to about 5.26 minutes of downtime per year. We hear stories of how enterprise-level infrastructures can fail over to a standby in less than 5 seconds. I’m privileged to have been a part of teams that implement such solutions. It’s exciting, fun and challenging. But as engineers (and even as stakeholders), we fixate on the numbers. “Oh, if we can only reduce the failover time to 3 seconds from 5.” We always want to improve and become more efficient. And we use the numbers as our point of reference.
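To see where the 5.26 minutes comes from, here is a minimal sketch of the arithmetic. The function name is only illustrative, and the year length of 365.24 days is an assumption chosen to stay consistent with the rest of this post.

```python
# Assumption: an average year of 365.24 days.
DAYS_PER_YEAR = 365.24
MINUTES_PER_YEAR = DAYS_PER_YEAR * 24 * 60


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the minutes of downtime per year allowed by an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Print the yearly downtime budget for two through five 9's.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime allows {downtime_budget_minutes(pct):.2f} minutes of downtime per year")
```

Each extra 9 cuts the budget by a factor of ten, which is why the jump from 99.9% to 99.999% tends to be so expensive.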
But what good is increasing efficiency and meeting 99.999% uptime if it is not contributing to the bottom line?
The reason I say this is that your organization’s goals are different from those of other organizations. What impacts your organization’s revenue and profit is different, too. Even the systems within your organization have different business impact. The human resources (HR) system that is probably only used during normal business hours has a lot less business impact than your CRM or sales system. So, saying that you need 99.999% uptime is like saying “I need to keep up with the Joneses.”
The Realistic Approach To High Availability and Disaster Recovery
There is nothing wrong with wanting to achieve 99.999% uptime if that is what you really need. But we also need to set realistic goals for our HA/DR solutions that will contribute to the bottom line. Here are some guidelines on setting realistic goals for our HA/DR solutions:
- 1. Start with a business impact analysis. A business impact analysis (BIA) answers the question, “What are the risks of losing a specific process in the business?” As IT professionals, we are not responsible for defining the risks of losing a specific process in the business. However, we need to properly communicate the concept to the stakeholders so they, too, can start defining the business risks associated with losing those processes. Use the organization’s balance sheet as a reference point to correlate the revenue and profit associated with a process. Then, rank those processes in terms of their criticality to the business. For example, if your organization provides inbound call center solutions, your telecommunications system would probably be highest on the list. If you sell products in a physical store and supplement it with an online one, take a look at which of the two generates more revenue. The one with the higher revenue goes to the top of the list.
- 2. Define your own uptime. After identifying the different processes and systems that your business operations depend on, define what uptime means for those systems. For the organization that provides inbound call center solutions, that could mean keeping the telecommunications system available 24 X 7 if they are indeed providing services all across the globe. But what if they are simply providing services for customers in North America, roughly from 7 AM Eastern time to 7 PM Pacific time, Monday to Friday? That is a 15-hour service day, which means that they have a 9-hour maintenance window during the weekdays and more than 48 hours during the weekends. It also means that their uptime is not necessarily 99.999% based on 24 X 7 operations. They could basically redefine their uptime in terms of roughly 3,900 hours per year (15 service hours a day, five days a week) instead of about 8,766 (assuming 365.24 days in a year). And because you’ve redefined what uptime means to that system, you now have a realistic goal of implementing the appropriate HA/DR solution that will not break the bank.
- 3. Include recovery objectives and service level agreements. I’ve been advocating for these ideas ever since I can remember. We started off with an analysis of the system’s business impact and we’ve redefined its uptime. But we still need to include the recovery objectives and service level agreements. Just because the inbound call center solution isn’t operating 24 X 7 doesn’t mean that it isn’t necessary to provide high availability. The telecommunications system is deemed mission-critical during normal business hours. This means that downtime during normal business hours is considered a critical emergency compared to downtime happening over the weekend. The definition of recovery objectives and service level agreements should include what’s required during normal business hours and outside of business hours.
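The call center example from the second guideline can be worked through as a rough sketch. The function names are only illustrative, the 99.9% target is a placeholder rather than a recommendation, and the calculation assumes a 15-hour service day (7 AM Eastern to 7 PM Pacific is 7 AM to 10 PM Eastern) on five weekdays, with 365.24 days in a year.

```python
# Assumption: an average year of 365.24 days.
DAYS_PER_YEAR = 365.24
WEEKS_PER_YEAR = DAYS_PER_YEAR / 7


def service_hours_per_year(hours_per_day: float, days_per_week: int) -> float:
    """Return the hours per year during which the system is expected to be up."""
    return hours_per_day * days_per_week * WEEKS_PER_YEAR


def downtime_budget_hours(service_hours: float, availability_pct: float) -> float:
    """Return the hours of downtime allowed within the service hours."""
    return service_hours * (1 - availability_pct / 100)


# North America only: 15 service hours a day, Monday to Friday.
call_center = service_hours_per_year(hours_per_day=15, days_per_week=5)

# Global operations: 24 X 7.
around_the_clock = service_hours_per_year(hours_per_day=24, days_per_week=7)

# Illustrative target only; the right percentage comes from the business impact analysis.
target_pct = 99.9

print(f"Call center service hours per year: {call_center:,.0f}")
print(f"24 X 7 service hours per year: {around_the_clock:,.0f}")
print(f"Downtime budget at {target_pct}%: {downtime_budget_hours(call_center, target_pct):.2f} hours")
```

Under these assumptions the system is on the hook for fewer than half of the hours in a year, and everything outside the service window is available for planned maintenance without counting against the uptime target.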
In the good old days of bare metal machines, having a solution that would meet the ever-changing demands of the business would be very expensive. That’s the main reason why most HA/DR solutions nowadays focus more on what is applicable across the board. With virtualization and cloud computing, you have the flexibility of building solutions that will meet your objectives without costing you a lot. But the important thing is to sit down and think about setting realistic HA/DR goals that will contribute to the overall bottom line. Now, if you can do that, you can be proud to say that you have helped the company save or make money. Besides, everyone will be willing to pay for something that makes money, saves money or saves time.