Guest on Brent Ozar’s Office Hours – Learn SQL Server High Availability & Disaster Recovery

I was privileged to join one of Brent Ozar’s Office Hours podcast. In case you’re not aware, Brent and his amazing team host weekly podcasts where you can ask your SQL Server-related questions. Since I was joining in, attendees were encouraged to ask SQL Server high availability and disaster recovery-related questions. Here’s the video recording for your reference.

I wish there was more time to answer the questions in more detail. So, I decided to personally respond to those who asked questions.

Q: We have a third-party application that is shrinking a database every hour. Then, our monitoring application alerts about log space usage for that database. I grew the files, but they were promptly shrunk the next hour. How would you handle this situation?

A: In addition to the responses provided during the Office Hours, I wanted to ask why the third-party application is shrinking the databases. Could it be that the application vendor isn’t aware of the side effects of database shrinking? Often times, it’s lack of awareness.

I wrote this blog post specifically for SharePoint administrators because, believe it or not, they instinctively just shrink databases when they create maintenance plans. I’ve included a video demo of how shrinking can immediately cause index fragmentation even after defragmenting the indexes. Trust me, every software vendor struggles with improving the performance of their application. Once they see this, they may change the way they deal with the databases.

Q: I’m getting a lot of alerts regarding “SSPI Handshake Failed: AcceptSecurityContext failed”. After investigation, many times this seems to be due to sessions that were destroyed or a user with an invalid password (we’re using Windows Auth on the SQL Server). I want to adjust my monitor to not send me emails about this, but I want to make sure I shouldn’t be more concerned about this error?

A: This might not seem to fall under the HA/DR category but I’ve been bitten by this a few times in the past to simply ignore it. As I mentioned during the Office Hours, in one of my troubleshooting calls with Microsoft, we found out that this error on our applications was caused by an Active Directory domain controller not being able to respond to authentication requests.

A bit of a background, I found out about this issue as I was migrating and upgrading a SharePoint farm. As part of user acceptance testing, I noticed that we were experiencing intermittent connections to the main SharePoint website. After further investigation, I requested that the network engineering team run a network trace to and from the client to the server, all the way to the Active Directory domain controller. Because the domain controller was running on a virtual machine with a single virtual network adapter, it became saturated that it started dropping network packets and caused the SSPI handshake errors. Had we not found out about the issue, it could have caused a major outage as majority of the servers are using this specific domain controller for authentication. They ended up moving the domain controller to a dedicated physical machine.

You might think that this isn’t something that DBAs need to be worried about because it’s the responsibility of either the network or the systems team. And that’s correct. But this shows the real value of highly skilled DBAs. You can’t just outsource that to a cheap service provider or to the cloud. And understanding your network prepares you to become a systems architect.

Q: How can we automate SQL patching while we have SSISDB part of Availability Group?

A: As I said during the Office Hours, you can use a combination of PowerShell with T-SQL to

1) Remove the SSISDB from the Availability Group

2) Install the service pack/cumulative update

3) Add the SSISDB back to the Availability Group after all of the replicas have been updated

The only challenge here is installing the update via command line. You need to make sure that this has been tested before automating it for production deployment. I don’t usually automate step #2 as I want to see the progress and check if there were any errors during the installation. Even with a proper change management process, there is no guarantee that UAT and staging would look exactly the same as production. I do step #2 via command line but also monitor for potential errors.

Q: Edwin, Do you have a documented punchlist so to speak of failure scenarios that you have tested with Always On. Knowing what can break makes it easier to deal with when it does…

A: As I mentioned during the Office Hours, it’s difficult to have a comprehensive list of failure scenarios. It’s like trying to predict all sorts potential disasters that can happen – you cannot plan for ALL of them.

There’s an old blog post from Microsoft on testing failover clusters. While it applies to old versions of both SQL Server and Windows Server, specifically for failover clustered instances, you can use it as a starting point to create your own series of tests. Plus, some tests are very specific to the type of hardware that you have. And while I love the idea of visiting a data center to pull out disks from hard drive bays or unplug a power cord, most servers nowadays are hosted in remote data centers and can only be accessible via Remote Desktop. The last time I’ve visited an enterprise data center – not counting the miniature one I have in my basement – was back in 2008. Some data centers don’t even allow iLO nor VNC access to the servers. So, you will be limited to performing what I call graceful failover tests.

Q: I like to know the nuts and bolt of how this works, what is the best document/video that would provide the best deep dive into how Always On actually operates?

A: A detailed explanation is provided in this Microsoft documentation. But that’s just the Availability Group. You still need to learn how the underlying platform – Windows Server Failover Clustering – works because that is what provides high availability. SQL Server just sits on top of it.

Like I said during the Office Hours, every technical documentation is confusing, be it from Microsoft or any other software company. Technical documentation was not designed to “teach” but rather to “limit liability”. That’s why it is very hard to comprehend. I try my best to simplify complicated concepts by leveraging the visual style in my presentations and training classes. Like the use of the “traffic light analogy” to explain how Windows Server Failover Clustering works.

Q: [Follow up to previous question] I figured Edwin’s class would be the best method to learn what I am looking for, however when the class was announced our budget was already set. Is there a plan to do this class again next year or will it be converted to video form for individual purchase?

A: I’m working with Brent on the scheduling of the training classes for this year and next year. I’m also in the process of creating an “online course” version of the training class for those who cannot afford to spend 3 straight days away from work. Besides, there’s only so much I can cover in 3 days. The online course allows me to add more lessons and topics as newer features are introduced and newer versions released. I’m very careful not to talk about things that aren’t publicly announced yet due to my NDA with Microsoft.

The live training classes are recorded. But only those who took the class can access the recordings.

Q: What’s a good way to measure transaction log usage volumes prior to implementing AWS Data Migration Services?

A: I did mention my blog posts on how to use the undocumented fn_dump_dblog() to read the transaction log backup and analyze the activities in the transaction log.

How To Analyze Database Activity If You Don’t Have A Monitoring Tool Part 1

How To Analyze Database Activity If You Don’t Have A Monitoring Tool Part 2

The process outlined in the blog posts are for tracking INSERT, UPDATE, DELETE or any other transactions. In your case, you can simply look at the backup history table to see the trend in the transaction log usage volume over time. Log backups contain transaction log records since the log backup was taken. Since log backups truncate the log (assuming that there are no active VLFs when the backup was taken), it represents the transaction log usage volume (excluding free space) over that period of time. This should give you an idea for capacity planning and moving to the cloud – be it on AWS or Azure.

Q: clusterless = none, how many secondary can be supported and can secondary be on linux

A: I’m guessing you meant Cluster Type=NONE. This applies to both Windows and Linux. The number of secondary replicas supported is still 8 (for Enterprise Edition) with one primary replica. The Cluster Type=NONE just tells SQL Server that there is no high availability solution to rely on.

Speaking of Linux + Always On Availability Groups, check out this series of articles that I wrote on MSSQLTips.com. I’m still waiting for the last article in this series to be published. But this should get you started on configuring Always On Availability Groups on a Linux machine using Pacemaker.

Q: Edwin, do you recommend using service broker with always on?

A: As I mentioned during the Office Hours, I only recommend something that is fully supported. Getting something to work doesn’t mean Microsoft will support it (one example is configuring SQL Server failover clustered instance with Storage Spaces Direct in Windows Server 2016 + Always On Availability Group with a standalone instance as a secondary replica; it works but it isn’t officially supported). In this case, Service Broker is supported with Always On Availability Groups. You have to be aware of the fact that when you restore a database with Service Broker enabled, Service Broker gets disabled. You would have to re-enable it first before adding it to an Always On Availability Group. Otherwise, you’ll get the impression that it doesn’t work.

Q: How would you configure quorum settings and vote in a cluster with 6 SQL nodes and a FSW where 3 nodes are in one region and 3 are in a different region as async replicas in the AG?

A: My response during the session was a bit unusual. You won’t hear a lot of experts say “I don’t know.” Quite frankly, I really don’t know because I don’t have enough information to provide an answer. For one, I do not know the version of the Windows Server operating system that you are using. Windows Server 2008/R2 differs from Windows Server 2012 and Windows Server 2012 R2 (and higher) when it comes to quorum behavior. Dynamic quorum was made available in Windows Server 2012 and dynamic witness was introduced in Windows Server 2012 R2. This difference in quorum behavior between versions of Windows Server makes it challenging to come up with a straightforward answer. The fact that the quorum is the most complex, complicated and convoluted yet most important concept in Windows Server Failover Clustering does not make it easier. I actually created an entire mini-course on this very topic alone.

I’m assuming that you configured the replicas in the other region in asynchronous replication mode because they are meant for disaster recovery. Is that correct?

Here’s a video recording of a presentation I did on how to design an Always On Availability Group topology with considerations on the quorum configuration. This should give you an idea on how to properly configure the quorum depending on the version of the Windows Server operating system that you are running.

I had a great time being on the Office Hours podcast. I hope I get the opportunity to do it again in the future.

If you have any SQL Server high availability and disaster recovery-related questions, feel free to use the Comments section and ask away.

Blog

PostGuest on Brent Ozar’s Office Hours

Leave a Reply Cancel reply