With the continuous improvements Microsoft has invested in the WSFC technology over the years, implementing and managing it has become a lot easier. I started doing my Gone Clustering in 60 Minutes presentation back in 2010, and I’m still amazed at how much it has improved. But I still talk about the frustrating hardware compatibility list from the good old days of Windows Server 2003.
Windows Server 2008 changed the WSFC game. I was converted – from trying to avoid Microsoft Cluster Service (MSCS) since the Windows NT 4.0 days to completely immersing myself in it.
The Magic of the Failover Cluster Validation Wizard
Before Windows Server 2008, you needed to make sure that all of the servers and hardware configurations you would use for the WSFC were identical – CPU, memory, network adapter, firmware, etc. You also needed to make sure that all of the hardware was certified to run on the specific version of Windows Server you would use and that it appeared on the hardware compatibility list. That meant long procurement processes with your hardware vendors – a lot of back-and-forth emails and phone calls. And this is just the hardware part. You haven’t even done anything yet.
It’s no surprise that building a very simple two-node MSCS (yes, I am using the acronym from the past) running a single SQL Server failover clustered instance took three (3) months or more, depending on the organization’s processes.
Enter the Failover Cluster Validation Wizard.
Introduced in Windows Server 2008, this changed the game of deploying WSFC infrastructures. No more hardware compatibility lists, no more requirement that every server have exactly the same hardware (identical hardware is recommended but not required), and no more scratching-your-head-figuring-out-what-went-wrong situations when building a WSFC. You run the wizard against the set of servers that you want to use as nodes in your WSFC, and it tells you whether or not your configuration will be supported – even before you build it.
Smart Automatic High Availability Configuration
Windows Server 2012 (the first release) introduced the concept of dynamic quorum. In the past, the administrator was responsible for constantly monitoring the nodes in the WSFC and manually adjusting the quorum votes to maintain availability. From a high-level perspective, quorum is what dictates whether a WSFC stays online or not. The goal is to have a majority of votes – more than 50% – in order to keep the WSFC online. Drop to 50% or fewer of the votes and the WSFC shuts down, together with the workloads running on top of it. For example, if you have four (4) nodes in your WSFC and three (3) of them are powered off for maintenance, the WSFC goes offline as well.
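The majority rule is simple enough to express in a few lines. Here is a minimal illustrative sketch in Python (not WSFC code – just the arithmetic the cluster applies to its vote count):

```python
def has_quorum(online_votes: int, total_votes: int) -> bool:
    """A WSFC keeps quorum only with a strict majority: more than 50% of votes."""
    return online_votes > total_votes / 2

# Four-node cluster, three nodes powered off for maintenance:
# only 1 of 4 votes remains, so the cluster loses quorum and goes offline.
assert not has_quorum(online_votes=1, total_votes=4)

# Four-node cluster with one node down: 3 of 4 votes is still a majority.
assert has_quorum(online_votes=3, total_votes=4)

# Exactly 50% is NOT a majority: a 2-2 split also loses quorum.
assert not has_quorum(online_votes=2, total_votes=4)
```

Note the strict inequality: this is why an even number of votes is risky – a cluster split right down the middle leaves neither half with quorum.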
Windows Server 2012’s dynamic quorum feature removed that burden from the administrator. When a voting member – be it a node or a witness – goes offline, its vote is automatically excluded from the quorum calculation. You, the administrator, don’t have to do anything.
Windows Server 2012 R2 introduced the concept of a dynamic witness. A witness in the WSFC serves as a tie-breaker when you have an even number of nodes. But if you have five (5) votes and you lose one (1), you end up with an even number of votes toward quorum – and an even split can never produce a majority. Since when is 50% a majority anyway? Dynamic witness automatically adjusts the witness vote to keep the total number of votes odd: when the WSFC has an even number of voting nodes, the witness gets a vote; when it has an odd number of voting nodes, the witness vote is excluded.
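The dynamic witness rule can also be sketched in a few lines of illustrative Python. This is an assumption-laden simplification of the behavior described above (the real cluster service handles many more cases), but it shows how the witness vote keeps the total odd:

```python
def witness_vote_counts(voting_nodes: int) -> bool:
    """Dynamic witness: the witness vote is counted only when the number of
    voting nodes is even, so the total number of votes stays odd."""
    return voting_nodes % 2 == 0

def total_votes(voting_nodes: int) -> int:
    """Total quorum votes for a cluster of `voting_nodes` nodes plus a witness."""
    return voting_nodes + (1 if witness_vote_counts(voting_nodes) else 0)

# Four nodes + witness: the witness vote counts, giving 5 votes (odd).
assert total_votes(4) == 5

# One node fails and dynamic quorum removes its vote: three voting nodes
# remain, the witness vote is excluded, and the total stays odd at 3.
assert total_votes(3) == 3
```

An odd total matters because, with a strict-majority rule, an odd number of votes can never split evenly – some partition always holds a clear majority.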
Although the dynamic quorum and dynamic witness features are both available as of Windows Server 2012 R2, the reality is that quorum remains the most complex yet most important concept in a WSFC. Every administrator managing a WSFC needs to understand how quorum works in order to design for and prepare the infrastructure properly.
I know what’s going on in your head right now. You’re probably wondering, “These are really great features for keeping my SQL Server databases running on a WSFC highly available. I don’t understand why he said challenges.”
Learning about and implementing WSFC nowadays is much easier than it has ever been. So, what’s the biggest challenge?
When you’re dealing with a complex system like a WSFC, the only way to really understand it is to get your hands dirty and do the work. The biggest challenge is not the availability of learning resources but the ability to apply those concepts – and not just the ability to apply them, but having a safe environment where you can make mistakes and start all over again.
You’ve probably heard the phrase, “Production is the new development.” That’s because changes made to IT systems are “tested” and performed on live, production environments. It’s like taking driving lessons right in the middle of the busy rush-hour streets of Manhattan or Shanghai. It’s a disaster waiting to happen. Oh, have I told you about how I ended up deploying a Windows XP image on a machine running SQL Server?
That’s why you need a lab environment. This was my biggest challenge in mastering the concepts involved in dealing with a WSFC.
You need a place where you can safely apply the concepts that you’ve learned – one where it’s OK to make a mistake, not one where taking down a SQL Server failover clustered instance while changing its virtual IP address can get you a pink slip.
I know how valuable having a lab environment is in mastering a concept. I got my hands on a very expensive toy when I was still in university. In case you want to know what it was, take a look at this video. (We didn’t have YouTube and video recording on phones back in 1999, but I was in the 1999 World Skills Olympics in the Mechatronics division. You’ll get a gift if you can find anything on the Internet to prove this claim.)