Have you ever been made aware of a household item that seems common and familiar but could affect your safety? Take that little remote you tuck in the sun visor of your car to open your garage. In case you are not aware, someone might be able to hack your garage door open without you knowing.
Similarly, when you are deploying or managing a Windows Server Failover Cluster (WSFC) for either a traditional SQL Server Failover Clustered Instance (FCI) or an Availability Group (AG), there are properties that we almost always ignore or, even worse, are not even aware of. These properties affect the availability of your WSFC, which in turn affects all of the clustered applications running on top of it. I was not aware of them myself until I started designing and deploying multi-subnet WSFCs in Windows Server 2008. These properties are SameSubnetDelay and SameSubnetThreshold.
The WSFC Heartbeat
Inter-node communication is critical to the proper operation of a WSFC. This is where the concept of a heartbeat comes in. The heartbeat is the communication between nodes in a cluster that determines their status. This communication exists only within the cluster and is transported through the available network adapters in the nodes. I simply look at the heartbeat communication as a means for the cluster to know what is going on with its members. Imagine the cluster asking the nodes, “are you OK?” on a regular basis.
The SameSubnetDelay property is the interval between one “are you OK?” question and the next. The value is set in milliseconds, and the default is 1000, or one (1) second. That means the WSFC will ask the “are you OK?” question to all of the nodes every second. Now, if this was my friend asking the question, it could be very annoying to hear it every second. It’s like getting a text message on my phone every second and having to respond immediately.
The SameSubnetThreshold property defines how many times in a row the question can go unanswered before the WSFC concludes that there is something wrong. The value is a simple count, and the default is five (5). That means the WSFC will ask the “are you OK?” question five consecutive times without getting a response before it reacts, like getting a text message on my phone every second for 5 seconds but not responding to any of them. My friend would probably panic and assume that there’s something wrong. Now, you might be thinking, “how can that be so important?” I’m glad you asked.
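To make the arithmetic concrete, here is a minimal Python sketch of how the two properties combine. It is only a model of the description above, not how the cluster service is actually implemented; the function names are my own, and the defaults are the values mentioned in this post.

```python
def detection_window_ms(delay_ms=1000, threshold=5):
    """Approximate time a node can stay silent before it is considered down.

    delay_ms  -> models SameSubnetDelay (milliseconds between heartbeats)
    threshold -> models SameSubnetThreshold (consecutive misses tolerated)
    """
    return delay_ms * threshold


def node_considered_down(heartbeats_received, threshold=5):
    """Return True once `threshold` heartbeats in a row go unanswered.

    heartbeats_received is a sequence of booleans, one per heartbeat
    interval (True = the node answered, False = it did not).
    """
    missed = 0
    for received in heartbeats_received:
        # Any answer resets the counter; only consecutive misses count.
        missed = 0 if received else missed + 1
        if missed >= threshold:
            return True
    return False


print(detection_window_ms())  # 5000
```

With the defaults, a node that goes silent is flagged after roughly five seconds. A node that misses four heartbeats and then answers the fifth starts over with a clean slate.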
How the WSFC Heartbeat Affects the Quorum
In a previous blog post, I talked about why quorum matters and how it affects the availability of the WSFC. If a node in a WSFC does not respond within the configured SameSubnetDelay and SameSubnetThreshold values, it is considered to be unavailable and, therefore, cannot vote towards the quorum. Eventually, when the WSFC no longer has a majority of votes because of the unavailable nodes, it will take itself offline. Unfortunately, in a traditional 2-node WSFC configuration where both nodes in a cluster are in the same data center, we barely even notice these properties. In the past, it was common to use cross-over cables to connect two servers directly for dedicated heartbeat communications; for more than two nodes in a WSFC, a dedicated router/switch was used. Because the cluster heartbeat communication went through a dedicated network path, there were no interruptions or noticeable latency.
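The relationship between missed heartbeats and quorum can be sketched in a few lines of Python. This is a deliberately simplified model that assumes a plain node-majority vote: it ignores witnesses, dynamic quorum, and per-node vote weights, and the function name is hypothetical.

```python
def has_quorum(total_votes, responding_votes):
    """Return True if the responding nodes hold a strict majority of votes."""
    return responding_votes * 2 > total_votes


# A 3-node cluster survives one node failing its heartbeats...
print(has_quorum(total_votes=3, responding_votes=2))  # True
# ...but in this simplified model, not two.
print(has_quorum(total_votes=3, responding_votes=1))  # False
```

The point is that every node marked unavailable by the heartbeat check is one less vote, so overly aggressive heartbeat settings on a congested network can cost you the whole cluster, not just a node.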
The Appropriate Values For These Properties
In a perfect world, we don’t really need to change these default values. But as more components are added to your network infrastructure (virtualization, network routing, firewalls, etc.) on top of the traffic that is already going through it, the heartbeat communication might suffer. Imagine driving on a highway that has five lanes. Even on a very wide road, traffic congestion will not allow you to go at your usual speed. But on a single-lane road, if you are the only one using it, you are guaranteed to travel at the recommended speed. It is the same thing with heartbeat communications. While it is OK to accept the default values of one (1) second and five (5) for the SameSubnetDelay and SameSubnetThreshold properties, respectively, you may need to modify them to suit your environment.
Talk to your network engineers about the current traffic that goes through your network. They will have a profile of the network traffic: what time of the day the network is busy, which application is consuming most of the network bandwidth, etc. Measure the network latency between the nodes in your WSFC. If you currently only have two nodes in your WSFC, a cross-over cable can still be used for dedicated cluster heartbeat communications. You just need to document everything in case you decide to add nodes to your WSFC. Of course, in the modern data center setting, I doubt that you have access to the physical servers, or that they are even physical servers at all.
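Once you have latency measurements between nodes, one rough way to judge them against the threshold is to look for the longest streak of probes that were lost or too slow. The Python sketch below assumes you have collected round-trip times yourself (with `None` for a lost probe). The cutoff value and the sample data are made up for illustration, and a probe is not the same thing as a real cluster heartbeat.

```python
def longest_miss_run(samples_ms, cutoff_ms):
    """Longest streak of probes that were lost (None) or slower than cutoff_ms."""
    longest = current = 0
    for sample in samples_ms:
        if sample is None or sample > cutoff_ms:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a timely reply ends the streak
    return longest


# Hypothetical round-trip times in milliseconds, one probe per second.
samples = [2.0, 3.5, None, None, 1200.0, 2.1, None]
threshold = 5  # the SameSubnetThreshold value discussed in this post

run = longest_miss_run(samples, cutoff_ms=1000)
print(run)              # 3
print(run < threshold)  # True
```

If the longest streak you observe during busy hours gets close to the threshold, that is a good conversation starter with your network engineers before you touch the property values.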
Additional Resources
- Tuning Failover Cluster Network Thresholds
- Windows Server 2008 Failover Clusters: Networking (Part 1) (still applies to Windows Server 2012 and higher)
- SQL Server 2012 Multi-Subnet Cluster Part 2 (an article I wrote about how this relates to SQL Server workloads)