I started this series of blog post by providing a framework and a methodology for properly monitoring your IT assets. I then described why having a baseline helps provide a frame of reference for any monitoring solution. In this blog post, let’s apply those concepts in monitoring SQL Server Always On Availability Groups since more and more customers are deploying this solution for implementing high availability and disaster recovery for their SQL Server databases.
When I started working with SQL Server Always On Availability Groups several years ago, one of the most common questions I got was how to monitor the entire infrastructure. When it was a simple standalone SQL Server instance, monitoring was easy – you monitor if the SQL Server service was up or down, if the server was accessible, if the databases were accessible, etc. For monitoring performance, you simply included CPU and memory utilization on the server.
Because SQL Server Always On Availability Groups rely on Windows Server Failover Clustering (WSFC,) Active Directory (AD) and DNS, the state of the other replicas, and many more, the number of metrics that you need to monitor have increased. And, unless you have a monitoring framework that you can follow, you may end up monitoring everything which can become a management nightmare.
As mentioned in the previous blog post, let’s use the R.A.P. framework to define the metrics that we can use to monitor SQL Server Always On Availability Groups. There are some overlapping concepts on what is being monitored and where they fit in the framework.
- Reliability. Measuring reliability isn’t just a matter of defining a metric and monitoring it. It could be a combination of different metrics combined together to achieve a specific result. I used an example of the execution time of a stored procedure in the previous blog post to measure reliability. The key thing here is CONSISTENCY over time. To know if your SQL Server Always On Availability Group is consistent, you may want to monitor
- Connectivity. This could be as simple as a PING test on the hostnames of the replicas, the cluster name object (CNO) of the WSFC, the virtual computer object (VCO) or the Always On Availability Group listener name. You may need to discuss this with your network engineers because ICMP responses may be disabled across the network. You can take it a step further and monitor connectivity to the Availability Group listener name and run a simple SELECT @@SERVERNAME like how the old IsAlive check for SQL Server failover clustered instances. You run connectivity checks during certain intervals that would meet your recovery objectives, for example, every 5 minutes
- System Events. This is similar in concept to the Windows Reliability Monitor. You capture errors in both SQL Server and Windows that are related to Always On Availability Groups and analyze them over time to see if they will impact overall reliability.
- Performance. This can be considered a subset of reliability but I’ll spend more time in this later.
- Availability. Monitoring SQL Server Always On Availability Group availability is more than just monitoring connectivity. Because of its dependency on WSFC, we need to also monitor the following
- WSFC Cluster Service. Similar to how we monitor the SQL Server service, this is to check whether or not the WSFC Cluster Service is started or not. The behavior of the WSFC and the resources in the WSFC – Always On Availability Group, in this case – will depend on whether or not this service on all of the replicas is started. And because this service affects the behavior of the WSFC, we also need to monitor
- WSFC Quorum. The WSFC quorum dictates whether or not the WSFC stays online. I can spend an entire book chapter explaining what this is and why it is so important. All I can say is that we need to monitor the WSFC quorum to be alerted if it changes so we can further investigate what caused the change to prevent potential downtime. I’ll share a story about how one customer was bitten by this because they weren’t monitoring the WSFC quorum.
- Availability Group state. Because an Always On Availability Group resides as a cluster resource group in a WSFC, we need to monitor its overall state. Just because the WSFC is available and quorum working properly doesn’t mean that the Always On Availability Group is available. There could be a dozen different reasons why the Always On Availability Group is offline even though the WSFC is online – like a replica isn’t ready yet to take over as a primary, the Always On Availability Group is not configured for automatic failover, the WSFC has reached its maximum threshold for failover, the database is in a suspended state, etc. Again, this is such a broad topic in of its own that we are barely scratching the surface here. This in itself can contain different metrics that can determine the state of the Always On Availability Group. Refer to this list for more information.
- Performance. I bet you’ve been waiting to jump in to this section because this is what you’ve gotten used to in the past. Well, here it is. I’m only going to highlight those that I consider very important because they also tell me a lot about what else I can do to improve performance.
- Amount of transaction log records generated. Regardless of whether or not you have an Always On Availability Group configured, this metric is very important because it tells me the amount of transaction log records generated in my database. The amount of transaction log records generated are a by product of data types, the type of transactions performed, the volume of transactions performed, etc. The SQL Server:Database > Log bytes flushed\sec Performance Monitor counter can be used to measure this. In previous blog posts, I discussed how data types affect performance and HA/DR (this blog post shows an example of how data types affect the amount of transaction log records generated.) Reducing the amount of transaction log records can improve both performance and your recovery objectives
- Amount of transaction log sent to replica. In an Always On Availability Group configuration, the amount of transaction log generated is ideally the same as the amount of transaction log records sent to the replicas. The SQL Server:Availability Replica > Bytes Sent to Transport/sec Performance Monitor counter can be used to measure this. The reason why I’m concerned about this metric is because if this value increases over time, I may be (1) experiencing potential network performance issues, (2) a replica log disk could not keep up with the load, (3) I may be generating a ton of transaction log records from other possible causes other the main workload, (4) the replica configuration may no longer be appropriate for the increased workload, etc. Notice how one metric can lead me to different possible options for further investigation.
- Amount of transaction log records hardened on the replicas. This metric will tell me the amount of potential data loss. The SQL Server:Database > Log Bytes Flushed/sec Performance Monitor counter can be used to measure this. If this value continues to decrease on the replicas, the amount of potential data loss increases. Keep in mind that our goal for implementing Always On Availability Group is for both HA and DR which requires defining our recovery objectives (RPO and RTO) and service level agreements (SLA.) Of course, performance is part of that SLA. In a synchronous replica, this can impact the performance of the primary replica because transactions will wait to send the transaction confirmation to the client until the secondary replica has hardened the log to disk.
You’ve seen how you can use the R.A.P. Framework to get you started with implementing your monitoring solution. I used SQL Server Always On Availability Groups as an example in this blog post but you can use the framework to monitor your Microsoft Exchange servers, SharePoint farms, Oracle databases, etc.
Feeling helpless and confused when dealing with Windows Server Failover Clustering (WSFC) for your SQL Server databases?
You’re not alone. I’ve heard the same thing from thousands of SQL Server administrators throughout my entire career. These are just a few of them.
“How do I properly size the server, storage, network and all the AD settings which we do not have any control over?”
“I don’t quite understand how the Windows portion of the cluster operates and interacts with what SQL controls.”
“I’m unfamiliar with multi-site clustering.”
“Our servers are setup and configured by our parent company, so we don’t really get much experience with setting up Failover Clusters.“
If you feel the same way, then, this course is for you. It’s a simple and easy-to-understand way for you to learn and master how Windows Server Failover Clusters can keep your SQL Server databases highly available. Be confident in designing, building and managing SQL Server databases running on Windows Server Failover Clusters.
But don’t take my word for it. Here’s what my students have to say about the course.
“The techniques presented were very valuable, and used them the following week when I was paged on an issue.”
“Thanks again for giving me confidence and teaching all this stuff about failover clusters.”
“I’m so gladdddddd that I took this course!!”
“Now I got better knowledge to setup the Windows FC ENVIRONMENT (DC) for SQL Server FCI and AlwaysON.”
[callout]NOTE: Registration for my online course Windows Server Failover Clustering (WSFC) for the Smart SQL Server DBA will re-open in January 2018. But be sure you do not miss out. This will be the last time that the course will be offered. After this, you will no longer be able to register for the course.[/callout]