Imagine getting a phone call in the middle of the night. While still figuring out who it was on the other line, you get a text message from your manager. He’s telling you that the databases are offline and that the operations team have been working on it for the past hour. But they’re stumped and could not get it back online.
You knew this was very serious. Otherwise, your operations team wouldn’t be calling you. And it becomes worse every minute you spend not solving the problem. In the past, you simply looked at the Availability Group dashboard to get a sense of what’s going on and fix the problem. Except this time, it was really bad. The Availability Group is offline.
You find out that a secondary replica was turned off for maintenance and that the log file on one of the databases filled up the disk. You try to take a log backup of the database but it failed. You try removing it from the Availability Group. And that also failed.
It’s a terrifying feeling when you’re in an emergency situation and all the tricks you have up your sleeve aren’t working. And it doesn’t help that your manager keeps bugging you every minute asking for an update. You start to worry that this is going to take much longer that you thought. No more getting back to bed, your work day just begun. And it’s going to be a long one.
The Case of a Large Transaction Log File in an Availability Group
You’ll find thousands of blog posts and Microsoft articles that talk about how to deal with a database when the transaction log file has grown too large, even to a point of filling up the disk. Some will talk about shrinking the log file to free up some disk space. Others will tell you to take several transaction log backups to mark the VLFs as inactive so you can free up some space inside the transaction log. These are all valid strategies and they do work – to some extent.
But when you have a high availability solution like an Always On Availability Group, dealing with a large transaction log file for the databases is more than just freeing up space and regular maintenance. It can lead to an outage situation.
I REPEAT. It can lead to an outage situation.
I wish I was kidding. I’ve dealt with several cases in the past where one database – just one – in an Availability Group took the entire thing offline because the transaction log file filled the disk. Agonizing hours were spent trying to free up space on the disk, shrink the transaction log file, and all the other tricks that you have already thought of. This while answering numerous phone calls from end users, customers, your manager, etc. asking about an update.
There’s just ONE REASON why your transaction log file grew very large: it is not being truncated. Maybe someone decides to run a long running transaction and prevented the log from being truncated. Maybe you don’t have regular log backups (although a common misconception is that a log backup will truncate the log – not always).
This is intensified when you have a high availability solution like Always On Availability Groups or Database Mirroring (I’m including it here because it is still technically supported). I wouldn’t call Replication as a high availability solution because it was not designed to be one. But this also magnifies the problem with the transaction log file growing.
So, what do you do when you have huge transaction log files in your Availability Group databases?
- 1. Think outside the box. This is very hard to do when you’re in an emergency situation. That’s because you’re in a fight-or-flight mode, your brain is focused on surviving. Your default behavior will be to do what you already know and what you’re comfortable with. It’s the reason why trying up every trick up your sleeve isn’t working. You need to constantly train yourself to try new things and get comfortable with them. Practice. Practice. Practice. So, when emergency situation happens, you don’t need to think about these new “tricks” – you just do them.
I worked on a case for an Internet company several years ago with the exact scenario I described above (where do you think I got that from?) I’ll spare you the details of who the culprit was but it wasn’t their DBA team. There’s no log backup, no shrinking of log file, no removing of database from the Availability Group, nor removing the secondary replica that would solve the issue. So, I asked the Windows sysadmins if they can spare me a USB thumb drive and attach it to the server. I don’t recommend doing this but this is an emergency situation. I created an additional log file on the thumb drive, just large enough for me to remove the database from the Availability Group. It could have been an SMB file share. But we cannot get a hold of their Active directory domain admins so I settled with a USB thumb drive. I was able to bring the Availability Group back online and remove the offending secondary replica.
- 2. Prevent it from happening in the first place. We technology professionals love solving problems. It’s in our DNA. But it won’t hurt if we can prevent problems from happening in the first place.
Resize your database transaction log files accordingly.
Configure regular log backups.
Provision a sizeable amount of disk space.
And please, I’m begging you, MONITOR YOUR AVAILABILITY GROUPS! - 3. Get everyone involved. DBAs like yourself want to be in control of the database servers. So, when a problem occurs, you want to be the one to solve it. You step up to the plate and do what needs to be done.
It is very easy to fall into the trap of the Superhero Syndrome. But a football team doesn’t win championships with just the quarterback. Every team member has a contribution.Your Windows sysadmins need to be aware of the impacts of performing maintenance on the servers, be it for installing a security update or replacing hardware.
Your storage admins need to be aware of the effects of thin-provisioning storage for your databases will lead to both performance and availability issues.
And be brave enough to tell your manager to stop bugging you for updates when you’re dealing with an emergency. He wouldn’t bother a doctor performing a surgery on his kid every minute, would he?
Don’t take managing transaction log files for granted. Because when you have an Always On Availability Group solution, this could lead to an outage situation.
Schedule a Call