SQL Server Availability Groups have been around for a while. A lot of organizations are still in the process of implementing them as they upgrade their database platforms to the latest version of SQL Server. As more and more organizations move away from older versions and into the newer ones, Availability Groups will become much more common. And with the Basic Availability Groups feature made available starting with SQL Server 2016 Standard Edition, it will eventually become a viable replacement for the already deprecated Database Mirroring feature.
In the first installation in this series, I talked about how Active Directory Domain Services (ADDS) authentication and replication affect a SQL Server Availability Group (AG) implementation in a disaster recovery (DR) site. The second installation in this series is about Domain Name System (DNS.) The third installation was all about the Windows Server Failover Cluster (WSFC) heartbeat and how intra-node communications affect the overall health of the WSFC. This blog post will focus on what I think is the most complex yet very important concept in a WSFC. It is so complicated that I actually created a dedicated online course just for it – the quorum.
What is Quorum?
With all the talks about the United States 2016 election, I might as well use this as an example. In a democratic government, the citizens get to decide thru election who will become their next political leader The results of the election will be the determining factor (side note: if Donald Trump gets the highest seat in the country, we know pretty well how he got there.) I’d like to simplify this process in a single phrase: “majority vote wins.”
The availability of any workload – be it a SQL Server failover clustered instance, AG, clustered file share, etc. – depend so much on the quorum and whether or not the WSFC can get a majority of votes. By default, all nodes in a WSFC (just like all citizens of a country that is eligible) can and will get a vote towards quorum. The goal is to make sure that we get as many votes as possible to keep the WSFC – and all the workloads running on top of it – online. You can think of majority as having an odd number of votes – losing half of the votes when you have an even number of voters does not constitute a majority because 50% still isn’t.
Different Quorum Configuration
It is important to understand the different quorum configuration available in order to properly design and configure the WSFC and ensure SQL Server AG availability. As far as I’m concerned, there are are only two (2) types of quorum configuration even though Microsoft documents four (4) of them. I find that the remaining two almost always get excluded in discussions primarily because organizations want to lower down overall operations cost.
- – Node and Disk Majority. This uses a combination of cluster nodes and a shared disk in the WSFC acting as a witness to achieve majority of votes. The shared disk gets to vote towards quorum
- – Node and File Share. This uses a combination of cluster nodes and a file share in the WSFC acting as a witness instead of a shared disk to achieve majority of votes. The file share gets to vote towards quorum
As I said, I don’t even want to mention No Majority: Disk Only in the list because a single point of failure (shared disk) does not make sense in a highly available configuration. Some organizations consider using Node Majority as their quorum configuration but I would argue that keeping operations and maintenance costs to a minimum is every organizations goal. Besides, you will need to patch, maintain, monitor, reboot, document, audit, etc. that extra node when all it does is simply providing an extra vote towards quorum.
Different Behavior Between Versions of Windows Server
As the Windows Server operating system evolved and improved throughout the years, how the quorum behaved has also changed. Quorum in Windows Server 2008 behaves similar to Windows Server 2003 failover cluster. Windows Server 2012 (not the R2 version) introduced the concept of dynamic quorum where the votes required to achieve majority depends on the availability of voting members. Windows Server 2012 R2 introduced the concept of dynamic witness where the witness vote – be it a file share or a disk – can be counted to achieve majority depending on whether or not the WSFC already has an odd number of votes.
How your SQL Server AG remains online when a voting member becomes unavailable will depend on the version of Windows Server operating system it is running on top of. It’s one of the reasons why, even if I don’t get commissions for selling Microsoft licenses to my customers, I strongly recommend being on the latest version of the Windows Server operating system.
Quorum Configuration for Stretched Clusters
Raise Your Hand If You Are In Favor Of A Disk Witness?
Most WSFC experts will argue that a Node and Disk Majority configuration should be recommended even when your WSFC is stretched out across different geographical locations. This becomes relevant when you now have a SQL Server AG replica in a disaster recovery site. Their main reason for this is because the cluster database is stored in the shared disk which becomes available to all of the nodes in the WSFC, regardless of where they are. My question for this type of configuration is: would you use a shared disk even if your workload doesn’t? Note that SQL Server AG does not require a shared disk. it wouldn’t make sense to provision a shared disk for the WSFC just to have an extra vote towards majority. Plus, you need to replicate that shared storage across the network from your production data center into the DR data center. If you start calculating the cost of the shared storage plus the storage replication plus the network bandwidth plus everything else that comes with it and put a dollar value beside it, I doubt that your finance manager would not even look at your request.
Raise Your Hand If You Are In Favor Of A File Share Witness?
On the other hand, a file share will not cost you as much especially if you already have a highly available file server. All you have to do is to create a new folder in your highly available file server and make it available to your WSFC to use as a file share witness. Nothing to replicate across the network and re-use what you currently have. That definitely sound a lot better in the ears of your finance manager.
But here’s the catch: the cluster database won’t be in the file share. That means cluster configuration may not be immediately available to all of the nodes in the WSFC. That could potentially pose a risk and create the possibility of having one WSFC node be out-of-sync from all of the available nodes. This is known as partition-in-time and would affect whether or not the WSFC can come online if the voting members. This is the main argument of those WSFC experts who are in favor of using a shared disk as a witness. And it is a valid argument. Imagine making a WSFC configuration change on your production environment and, before that change gets propagated over to the DR environment, the WSFC goes offline. That change will not be reflected in the DR environment and will not allow the workload in the WSFC to come online.
Where Should I Place My Witness?
Ideally, this should be on a third data center separate from your production and DR data center. Again, imagine having that conversation with your finance manager: “I’d like to request for a rented space in another data center that has network connectivity to both our production and DR data centers; and we’re only going to use it for an extra vote towards the WSFC quorum.” You’ll probably get a “NO” answer even before you finish your sentence. That’s because of the extra cost associated with that additional data center. So, while it is considered a recommended best practice, it is not practical.
What Would I Recommend?
Notice that I did not provide a hard-and-fast rule with regards to either using a shared disk or file share witness or where to store the witness. That’s because I don’t know your recovery objectives (RPO/RTO,) your service level agreements (SLAs,) and your budget. Every technical decision should be dictated by these three key things, not by any expert who knows more about WSFC than you do. The goal of this blog post is to explain what the WSFC quorum is, how it affects your SQL Server AG deployments when you have replicas in a DR site and how to use this information to make informed decisions that meet the business objectives. If I work for a data center like AWS, Microsoft Azure or Rackspace, I would convince you to host that witness in our data center. If I work for storage vendors like EMC, HP, or Hitachi, I would convince you to use a shared disk as a witness. But because only you know your environment and what your organization’s goals and objectives are, you will have to make that decision.
It’s like having to vote for the next President of the United States. I can’t make that decision for you nor can I make any recommendations. Without the right information, it would be difficult to make a decision. So be sure to gather as much information as you can. And be sure to make your vote count. Because it matters.
Additional Resources
- Understanding Quorum Configurations in a Failover Cluster
- Online Course: Deep Dive on Windows Server Failover Cluster Quorum
- Details of How Quorum Works in a Failover Cluster (Windows Server 2008/R2)
- Quorum configuration and dynamic quorum (Windows Server 2012)
- Dynamic Witness (Windows Server 2012 R2)
Schedule a Call
This is a great post – really great analogies and overview of different options.
What are they voting on?
They are voting on keeping the failover cluster alive.
If there’s not enough votes to keep the failover cluster alive, it will take itself offline, hence, taking the Availability Group with it offline