SQL Server Availability Groups have been around for a while. A lot of organizations are still in the process of implementing them as they upgrade their database platforms to the latest version of SQL Server. As more and more organizations move away from older versions and into the newer ones, Availability Groups will become much more common. And with the Basic Availability Groups feature made available starting with SQL Server 2016 Standard Edition, it will eventually become a viable replacement for the already deprecated Database Mirroring feature.
In the first installation in this series, I talked about how Active Directory Domain Services (ADDS) authentication and replication affect a SQL Server Availability Group implementation in a disaster recovery (DR) site. The second installation in this series is about Domain Name System (DNS.) Now, as a SQL Server professional, the first two items mentioned are things that we can consider outside of our job description. We could easily dismiss those as “someone else’s job.” But this one is something that we are responsible for the moment we start managing SQL Server Availability Groups. I consider this the lifeblood of the Windows Server Failover Cluster (WSFC): the heartbeat.
The cluster heartbeat is a private communications channel within the cluster used to monitor the status and health of all the nodes. I like to call it the cluster nodes’ “secret language” to pass messages around each other to make sure they have each others’ backs. In a WSFC where the nodes are in the same data center, server administrators barely paid attention to the heartbeat because the network connectivity between the nodes have relatively good performance or quality of service (QoS.) I bet that nobody ever configured the heartbeat for WSFC where the nodes are in the same data center.
Now, you might be thinking, “isn’t that the job of the network engineer?” Yes and No. Yes, because it’s the network engineers job to make sure that the network QoS meets the service level agreement of whatever applications use it. I mean, let’s face it. You do complain when your Internet connection is slow. And, no, because they don’t have any idea about the application other than the amount of data that is being sent and received thru the network. That makes it our responsibility.
How the Heartbeat Affects the WSFC
In a previous blog post, I talked about the WSFC properties that affect the heartbeat behavior for nodes in the same data center. This behavior also affects quorum (I’ll save this for part 4 of this series.) If the WSFC nodes are in the same data center, there wouldn’t be much of an issue communicating with each other to determine and decide about the health of the WSFC.
But if there are WSFC nodes in a remote location like that of a disaster recovery (DR) site, things would be slightly different. I use the example of sending a text message to your friend on a regular basis to determine how your friend is doing.
When The WSFC Nodes Are In The Same Data Center
Imagine that you and your friend send text messages to each other to check how each one is doing. You both agree that you need to send a text message every second (I know that could become annoying) and if you don’t receive any text messages after five consecutive tries, you panic and call the National Guard (OK, maybe not that extreme but you get the picture.) Sending text messages is similar to a WSFC heartbeat. The delay – every second – is similar to the SameSubnetDelay property of the WSFC; the amount of time it takes to send the heartbeat and ask the next “are you OK?” question. This property value is set in milliseconds where the default value is 1000 or one (1) second. That means the WSFC will ask the “are you OK?” question to all of the nodes every second. The number of times before you panic is similar to the SameSubnetThreshold property of the WSFC; it defines how many times the question wasn’t answered before concluding that there is something wrong. This property value is set in numeric type where the default value is five (5.) That means the WSFC will ask the “are you OK?” question five times and not get a response consecutively.
When The WSFC Nodes Are In Different Data Centers
If you are sending text messages within the same city, the response will be relatively fast. Imagine sending a text message to a different country in a different time zone. Responses will take a while. Would you still consider the same agreements you had with your friend about sending text messages every second and panic if you haven’t received any response after five attempts? I doubt it. You will reconsider the original agreement, taking into account the amount of time it takes to send and receive the text message. I’m sure it will be longer. You can change it to maybe sending a text message every three (3) seconds instead of every second and maybe only panicking after not getting any responses after seven consecutive tries.
These are similar to the SameSubnetDelay and SameSubnetThreshold properties. The CrossSubnetDelay property refers to the amount of time it takes to send the heartbeat on WSFC nodes in a different network subnet (or different data center) whereas the CrossSubnetThreshold property refers to how many times the heartbeat didn’t get a response from WSFC nodes in a different subnet (or different data center.) Unfortunately, the default values for the CrossSubnetDelay and CrossSubnetThreshold properties are the same as the SameSubnetDelay and SameSubnetThreshold properties, respectively.
This means that when your Availability Groups stretch across data centers, you need to pay attention to the network QoS so you can adjust these property values accordingly, You certainly don’t want the network latency between your production data center and DR data center dictating whether or not the WSFC stays online.
NOTE: Windows Server 2016 introduced two (2) new failover cluster parameters – CrossSiteDelay and CrossSiteThreshold – to make way for the site awareness feature for failover clusters. This makes it easier to manage failover cluster nodes in different sites. Additional information about site awareness in Windows Server 2016 Failover Cluster can be found in this blog post.
Redefining The Original Agreement
We can modify the CrossSubnetDelay and CrossSubnetThreshold properties accordingly but not without consulting your network engineers. They have a very good understanding of what is going on in your network and can advise you on the appropriate values for these two. To modify these properties, we need to use Windows PowerShell. Below is a sample code to change these properties.
#Assign the output of Get-Cluster PowerShell cmdlet to a variable $clust = Get-Cluster #Modify the CrossSubnetDelay property of the WSFC to 3000 milliseconds $clust.CrossSubnetDelay = 3000 #Modify the CrossSubnetThreshold property of the WSFC to 7 $clust.CrossSubnetThreshold = 7
I just used these values to illustrate the point. The appropriate values will depend on your network configuration. Also, be careful not to mask network latency issues by changing these property values. I’ve seen cases where adjusting properties like these (and other configuration settings that are influenced by network latency) are used as band-aid fixes to hide the real issue. If there are latency issues with the network, assist the network engineers in addressing it so it does not escalate into something more problematic.
Evaluate the WSFC that your existing SQL Server Availability Groups deployment are running on top of. Modifying the CrossSubnetDelay and CrossSubnetThreshold properties can help maintain a highly available database architecture.
- The Windows Server Failover Cluster Properties That We Almost Always Ignore
- Online Course Module: Understanding Heartbeat Settings for Multi-Subnet Windows Server Failover Cluster
- SQL Server 2012 Multi-Subnet Cluster Part 2 (Tuning Cluster heartbeat Settings)