The Lifeblood Across The Wire – Learn SQL Server High Availability & Disaster Recovery

SQL Server Availability Groups have been around for a while. A lot of organizations are still in the process of implementing them as they upgrade their database platforms to the latest version of SQL Server. As more and more organizations move away from older versions and into the newer ones, Availability Groups will become much more common. And with the Basic Availability Groups feature made available starting with SQL Server 2016 Standard Edition, it will eventually become a viable replacement for the already deprecated Database Mirroring feature.

This blog post is the third installation in a series that talks about the five things that you need to consider when deploying SQL Server Availability Group Replicas in a Disaster Recovery Site

In the first installation in this series, I talked about how Active Directory Domain Services (ADDS) authentication and replication affect a SQL Server Availability Group implementation in a disaster recovery (DR) site. The second installation in this series is about Domain Name System (DNS.) Now, as a SQL Server professional, the first two items mentioned are things that we can consider outside of our job description. We could easily dismiss those as “someone else’s job.” But this one is something that we are responsible for the moment we start managing SQL Server Availability Groups. I consider this the lifeblood of the Windows Server Failover Cluster (WSFC): the heartbeat.

The cluster heartbeat is a private communications channel within the cluster used to monitor the status and health of all the nodes. I like to call it the cluster nodes’ “secret language” to pass messages around each other to make sure they have each others’ backs. In a WSFC where the nodes are in the same data center, server administrators barely paid attention to the heartbeat because the network connectivity between the nodes have relatively good performance or quality of service (QoS.) I bet that nobody ever configured the heartbeat for WSFC where the nodes are in the same data center.

Now, you might be thinking, “isn’t that the job of the network engineer?” Yes and No. Yes, because it’s the network engineers job to make sure that the network QoS meets the service level agreement of whatever applications use it. I mean, let’s face it. You do complain when your Internet connection is slow. And, no, because they don’t have any idea about the application other than the amount of data that is being sent and received thru the network. That makes it our responsibility.

How the Heartbeat Affects the WSFC

In a previous blog post, I talked about the WSFC properties that affect the heartbeat behavior for nodes in the same data center. This behavior also affects quorum (I’ll save this for part 4 of this series.) If the WSFC nodes are in the same data center, there wouldn’t be much of an issue communicating with each other to determine and decide about the health of the WSFC.

But if there are WSFC nodes in a remote location like that of a disaster recovery (DR) site, things would be slightly different. I use the example of sending a text message to your friend on a regular basis to determine how your friend is doing.

When The WSFC Nodes Are In The Same Data Center

Imagine that you and your friend send text messages to each other to check how each one is doing. You both agree that you need to send a text message every second (I know that could become annoying) and if you don’t receive any text messages after five consecutive tries, you panic and call the National Guard (OK, maybe not that extreme but you get the picture.) Sending text messages is similar to a WSFC heartbeat. The delay – every second – is similar to the SameSubnetDelay property of the WSFC; the amount of time it takes to send the heartbeat and ask the next “are you OK?” question. This property value is set in milliseconds where the default value is 1000 or one (1) second. That means the WSFC will ask the “are you OK?” question to all of the nodes every second. The number of times before you panic is similar to the SameSubnetThreshold property of the WSFC; it defines how many times the question wasn’t answered before concluding that there is something wrong. This property value is set in numeric type where the default value is ten (10.) That means the WSFC will ask the “are you OK?” question ten times and not get a response consecutively.

When The WSFC Nodes Are In Different Data Centers

If you are sending text messages within the same city, the response will be relatively fast. Imagine sending a text message to a different country in a different time zone. Responses will take a while. Would you still consider the same agreements you had with your friend about sending text messages every second and panic if you haven’t received any response after ten attempts? I doubt it. You will reconsider the original agreement, taking into account the amount of time it takes to send and receive the text message. I’m sure it will be longer. You can change it to maybe sending a text message every three (3) seconds instead of every second and maybe only panicking after not getting any responses after ten consecutive tries.

These are similar to the SameSubnetDelay and SameSubnetThreshold properties. The CrossSubnetDelay property refers to the amount of time it takes to send the heartbeat on WSFC nodes in a different network subnet (or different data center) whereas the CrossSubnetThreshold property refers to how many times the heartbeat didn’t get a response from WSFC nodes in a different subnet (or different data center.)

To be sure, check the default values for the CrossSubnetDelay , CrossSubnetThreshold , SameSubnetDelay , and SameSubnetThreshold parameters of your failover clusters. Especially if you’re still running Windows Server versions older than Windows Server 2016.

This means that when your Availability Groups stretch across data centers, you need to pay attention to the network QoS so you can adjust these property values accordingly, You certainly don’t want the network latency between your production data center and DR data center dictating whether or not the WSFC stays online.

NOTE: Windows Server 2016 introduced two (2) new failover cluster parameters – CrossSiteDelay and CrossSiteThreshold – to make way for the site awareness feature for failover clusters. This makes it easier to manage failover cluster nodes in different sites. Additional information about site awareness in Windows Server 2016 Failover Cluster can be found in this blog post.

Redefining The Original Agreement

We can modify the CrossSubnetDelay and CrossSubnetThreshold properties accordingly but not without consulting your network engineers. They have a very good understanding of what is going on in your network and can advise you on the appropriate values for these two. To modify these properties, we need to use Windows PowerShell. Below is a sample code to change these properties.


#Assign the output of Get-Cluster PowerShell cmdlet to a variable
$clust = Get-Cluster

#Modify the CrossSubnetDelay property of the WSFC to 3000 milliseconds
$clust.CrossSubnetDelay = 3000

#Modify the CrossSubnetThreshold property of the WSFC to 7
$clust.CrossSubnetThreshold = 7

I just used these values to illustrate the point. The appropriate values will depend on your network configuration. Also, be careful not to mask network latency issues by changing these property values. I’ve seen cases where adjusting properties like these (and other configuration settings that are influenced by network latency) are used as band-aid fixes to hide the real issue. If there are latency issues with the network, assist the network engineers in addressing it so it does not escalate into something more problematic.

Next Steps

Evaluate the WSFC that your existing SQL Server Availability Groups deployment are running on top of. Modifying the CrossSubnetDelay and CrossSubnetThreshold properties can help maintain a highly available database architecture.

Additional Resources

The Windows Server Failover Cluster Properties That We Almost Always Ignore
Online Course Module: Understanding Heartbeat Settings for Multi-Subnet Windows Server Failover Cluster
SQL Server 2012 Multi-Subnet Cluster Part 2 (Tuning Cluster heartbeat Settings)

If you need help with your SQL Server Always On deployments, reach out and schedule a call with me using my online calendar.

Schedule a Call

2 comments on “The Lifeblood Across The Wire”

Josh Obudzinski says:

at

I also had to find a document published by VMware that documents these settings amongst others specifically pertaining to running SQL Server using AlwaysOn availability groups on VMware platform. Very good find. If I locate the url I will share it.

1. Edwin M Sarmiento says:
  
  at
  
  Thanks for reading, Josh.
  
  Do you mean this?
  https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1037959

Blog

PostThe Lifeblood Across The Wire

How the Heartbeat Affects the WSFC

When The WSFC Nodes Are In The Same Data Center

When The WSFC Nodes Are In Different Data Centers

Redefining The Original Agreement

Next Steps

Additional Resources

Schedule a Call

2 comments on “The Lifeblood Across The Wire”

Leave a Reply Cancel reply

How the Heartbeat Affects the WSFC

When The WSFC Nodes Are In The Same Data Center

When The WSFC Nodes Are In Different Data Centers

Redefining The Original Agreement

Next Steps

Additional Resources

Schedule a Call

Subscribe to my mailing list.

2 comments on “The Lifeblood Across The Wire”

Leave a Reply Cancel reply