You don’t qualify!
Imagine hearing those words from someone telling you that you’re not enough. Insufficient. You didn’t meet the standard.
I’ve heard a lot of that when I started looking for a job after graduating from college. An engineering degree wasn’t enough to get hired. It didn’t help that I failed seventeen (17) courses in my undergraduate program. It’s devastating. Demoralizing.
So, how would you feel if you got an email from your manager telling you that the SQL Server Always On Availability Group solution that you recently deployed didn’t automatically failover? Yes, the infrastructure that you’ve built over the holiday weekend, one that should have been spent with your loved ones.
I get this question a lot when I work with customers who have experienced this very issue: How come our Always On Availability Group didn’t automatically failover? That followed by, “I thought it’s supposed to be highly available?”
And I totally understand why customers ask this question. It’s because of this excerpt from SQL Server Books Online.
It does clearly say “for HA“.
It’s like returning an item to a store because it doesn’t work as advertised. Except that this item costs almost a million dollars in SQL Server licences and hardware.
Conditions Required for an Automatic Failover
Automatic failover of an Always On Availability Group can only happen if these conditions are met. The Microsoft documentation highlights four bullet points on what these conditions are. In my opinion, it’s actually five, not four. You need to ask these five questions to see if your Always On Availability Group actually qualifies for an automatic failover. Oh, and you have to meet all conditions to qualify:
- Do you have a secondary replica configured in SYNCHRONOUS COMMIT mode? Sure you configured your Always On Availability Group replicas properly – both primary and secondary. You had your manager approved of a third replica as an additional insurance policy. But did you configure all of them to be in SYNCHRONOUS COMMIT mode? An ASYNCHRONOUS COMMIT replica will NEVER automatically failover. So, you better be sure that you properly configured your replicas with SYNCHRONOUS COMMIT. Also, you need to constantly monitor if any changes were made to the replication mode. Someone may have accidentally modified the synchronous commit replicas to asynchronous and forgot to change them back (hint: patching Always On Availability Groups).
- Have you configured your SYNCHRONOUS COMMIT replicas for AUTOMATIC FAILOVER? Just because your replicas are configured in SYNCHRONOUS COMMIT mode doesn’t mean they are automatically configured for AUTOMATIC FAILOVER. These are two different – but related – settings. You can have a replica in SYNCHRONOUS COMMIT mode but configured with MANUAL FAILOVER. But if you configured your replica to be in AUTOMATIC FAILOVER, it is guaranteed that the replication mode will automatically be set to SYNCHRONOUS COMMIT mode. That’s why I always check the failover mode first (AUTOMATIC or MANUAL).
- Is the SYNCHRONOUS COMMIT, AUTOMATIC FAILOVER replica failover-ready? A replica to be failover ready means that the server hosting the SQL Server instance is healthy. It also means that ALL – not one, not most, but ALL – databases in the Availability Group are in SYNCHRONIZED state. Whenever I ask this question to customers, I usually follow it up with, “Do the databases in the Availability Group share similar recovery objectives (RPO/RTO) and service level agreements?” I’ve seen cases where all of the databases in a SQL Server instance, regardless of their recovery objectives and SLAs, are combined in the same Availability Group. I’m sure the business will not treat the revenue-generating database that is running 24/7 the same way as the HR database that’s only being used during normal business hours.
- Does the Windows Server Failover Cluster (WSFC) have quorum? I can’t count the number of times my phone rang from customers asking for help in bringing the Availability Group online because the WSFC lost quorum. It usually takes me less than 5 minutes to manually bring the WSFC online by following the steps outlined in this article. But spend more than 2 hours explaining why the WSFC went offline. It doesn’t matter if your replicas are configured with SYNCHRONOUS COMMIT, AUTOMATIC FAILOVER and with all databases in SYNCHRONIZED state. If your WSFC is offline, it will the Availability Group down with it. So, be sure you are monitoring the health and state of your WSFC. That’s why I created an entire online course on this very topic alone: the Windows Server Failover Cluster Quorum.
- Have the failover conditions defined by the flexible failover policy been met? A flexible failover policy is what determines whether the WSFC (assuming that it is online and available) will trigger an automatic failover in the event that there was an issue with the primary replica. This is determined by the HealthCheckTimeout threshold and the FailureConditionLevel parameters. I’ll cover these in more detail in a future blog post. But suffice to say at this point that it is necessary to meet these policies in order for the WSFC to trigger an automatic failover. I usually don’t recommend modifying these parameters unless advised by the Microsoft Product Support team as part of a troubleshooting exercise. Messing around with these values without understanding the root cause of an outage can either trigger a false positive (unnecessary failing over the Availability Group causing more outages) or a false negative (ignoring an issue because the policies have not been met, also causing real outages).
While I did say five reasons, I wasn’t really telling you the whole truth. There are actually six (6) reasons that can cause your SQL Server Always On Availability Groups to not failover automatically. I just couldn’t find a picture that has six fingers. So, I opted for five.
I’ll add the sixth one in the next blog post. Stay tuned.