Your boss wants you to automate patching for your SQL Servers. Is that a good idea? How far should you take it? Find out a DBAs perspective.
This is a “listen-able” 17 minute video. Prefer a podcast instead? Find it at littlekendra.com/dearsqldba.
Show notes (transcript with links) are below the video.
Dear SQL DBA…
I’ve been tasked with getting a patching solution for our SQL servers that doesn’t require us to sit and watch the installer.
The installer provides a lot of good feedback, and the patch is going to take as long as it takes no matter what. If I don’t watch the installer (ie, I run the patch from a command line with /quiet /action=Patch, etc), I’m still going to have to review the summary.txt file for feedback on the patch success, etc.
Would you recommend cobbling together a command line/scheduled task scheme and hope it all goes fine with manual reviews of the txt/log files after the fact?
I’m just as lazy as any other DBA. Actually, I’m more lazy.
Patching isn’t fun work, but it’s really necessary. Organizations that don’t regularly apply Windows patches run the risk of security breaches and weird performance problems.
It involves a lot of waiting, it’s pretty boring, and it usually happens at a time where you’d rather be doing something else – like Friday night or Sunday morning.
I’m generally in favor of anything that makes patching less work. I don’t want it to introduce unacceptable risks for the data, based on business requirements, but I love smart automation that can help me do more, faster, and be safer.
It stinks when patching goes wrong
I used to spend a lot of time doing patching, and I had plenty of times when:
- Servers wouldn’t come back up after a reboot. Someone had to go into the iLo/Rib card and give them a firm shove
- Shutdown took forever. SQL Server can be super slow to shut down! I understand this better after reading a recent post on the “SQL Server According to Bob” blog. Bob Dorr explains that when SQL Server shuts down, it waits for all administrator (sa) level commands to complete. So, if you’ve got any servers where jobs or applications are running as sa, well…. hope they finish up fast.
- Patching accidentally interrupted something important. Some process was running from an app server, etc, that failed because patching rebooted the server, and it fired off alarms that had to be cleaned up.
- Something failed during startup after reboot. A service started and failed, or a database wasn’t online. (Figuring out “was that database offline before we started?” was the first step. Ugh.)
- Miscommunication caused a problem on a cluster. Whoops, you were working on node2 while I was working on node1? BAD TIMES.
I don’t believe Windows updates that say they don’t require a reboot.
I realize I sound like a crotchety retiree when I say, “In my day, we ALWAYS rebooted after applying patches.”
But patches that say they don’t require a reboot are dirty liars, and I don’t trust them.
If stuff starts getting weird later, and you skipped the reboot during the maintenance window, you know how that’s gonna look. Not good. Always have a maintenance window, and don’t skip the restart.
I love that you’re already planning patch verification!
You must have had that experience where run Windows Update and it says it worked, but then you find that it actually didn’t install one of the critical patches.
Maybe we’ve all had that experience.
Checking to make sure critical patch numbers all made it in during the maintenance window is super smart.
Separate servers into Pets and Cattle (sort of)
One of the concepts of managing cloud environments is for developers to create systems where servers are no longer special little flowers with unique names. If your servers cause lots of tragedy when they get sick, then you’re vulnerable.
Instead, the idea is that developers build systems where servers are more like cattle. They may be expensive, but if they get ill, you quickly “remove” them (OK, kill them) for the safety of the herd and replace them.
It’s a brutal metaphor, but they really mean it. Everything needs to be killable.
(I’m not actually sure who to credit for coming up with this comparison. Update: According to The Register, this came from a former Microsoftie named Bill Baker.)
For patching, I’ve always had groups of SQL Servers. The most critical are pets. The least critical are pretty darn close to cattle.
The general groups go like this:
- Development SQL Servers – cattle
- Fully automated patching
- Something goes wrong, you spin up a new VM and redeploy
- Staging SQL Servers – cattle
- Fully automated patching
- Manual verification that things look normal after the patching before patches are cleared for production SQL Servers
- Non-Critical Production SQL Servers – prize cattle that compete at the county fair. Can be down for 2+ hours during a scheduled maintenance and it’s no problem. Wouldn’t be a huge issue if they were down all weekend after a scheduled maintenance or lost some data.
- Often these get fully automated patching, depending on the tools in use and data loss requirements
- These are the first SQL Servers to be patched in production
- Critical Production SQL Servers – pets.
- Someone is around to run the patches.
- In a highly automated environment, that person may mostly monitor a dashboard while the patches deploy and respond if needed. (Someone who can actually respond meaningfully, not just the Network On Call center, because usually for these servers the business would rather not wait for the NOC to page out.)
It isn’t like this everywhere. I’ve had readers and clients where the development SQL Servers were NOT cattle, because they contained changes that hadn’t been checked into source control.
If that’s the case where you work, you want a separate project to make changes so Development environments are entirely cattle.
Automated patching: use System Center for cattle
System Center is really good at deploying patches. Servers that qualify as cattle are a great fit for system center.
The patches roll out. If there are issues, the Network On Call center, or whomever is on call, follows a documented process to get things back online. If there’s major issues, you restore from backup. Or possibly just redeploy from source, in the case of development servers.
If “restore from backup” sets off any alarm bells for anyone, that server isn’t cattle.
I’d look at PowerShell for the SQL Servers that need special attention (prize cattle and pets)
For the “pet” servers, I love the idea of automation for quickly allowing you to add bells and whistles
I’d consider PowerShell, because it’s a popular swiss army knife and you could work in lots of great features over time.
I would try to work in features like this.
- Validate the last backup times for critical databases and raise a flag if they aren’t what they should be prior to starting patching.
- Record the status of all databases on the instance for you (ARE any offline?)
- Check for long running jobs
- Stop the SQL Server services before initiating Windows restart. This helps demystify any slow shutdown experiences, as you know exactly when the SQL Services finish stopping without having to go into Event or SQL Logs.
- Install and validate Windows patches
- Run a few verification queries as a smoketest to make sure databases are properly online and available to users
- Validate that all SQL Services started as expected (and didn’t fail immediately after)
- Check Event Logs and SQL Logs for warnings after the reboot
Doing these things manually kinda stinks.
Do I know how to do all this stuff in PowerShell? Nope. I used to do all this stuff manually, back when PowerShell was a DOS prompt’s kid brother.
But if it was my job to do it now, I think PowerShell’s probably a good bet for being able to do a lot of it pretty well.
I’d work to get this working first in my pre-production and staging environments, then roll it out to the “prize cattle”, then to the “pet” SQL Servers last.
I’d leave development SQL Servers as pure cattle, though. Because I’m lazy.
If your company has developers, it’s possible that you could get some of their time to work with you to get this going. You’d help explain the requirements, they’d write the code. Someone would buy donuts.
Cluster Aware Updating (CAU)
Cluster Aware Updating is a feature introduced in Windows Server 2012 to relieve some of the pain from patching Windows Failover Clusters.
This helps automating:
- Changing nodes into maintenance mode
- Moving roles off the node
- Installing patches
- Restarting the node
- Changing the node out of maintenance mode
- Moving things around more
- Moving on to other nodes
There’s an option to have an administrator trigger the run from a remote server so they can monitor it real-time, or it can be set to run on a schedule on the cluster itself.
There are still outages during failovers. And it doesn’t cover those bells and whistles I talked about earlier about checking for running jobs.
Microsoft has published a whitepaper on using Cluster Aware Updating with SQL Server 2012 and higher for Failover Cluster Installs of SQL Server.
Be aware this doesn’t cover Availability Groups. My understanding is that those are still unsupported for CAU, even in SQL Server 2016.
Ideally, you’re not doing the patching yourself forever
Summing this up, as a DBA I really like automation for patching.
I may want someone to be around sipping their coffee while it runs in some cases. And I’d grow the automation so that it writes clear logs, and if things DO go wrong it outputs helpful diagnostics so the person can respond more quickly.
I’d even try to make it so good that a junior level person could run it.
DBAs never need to worry about automating themselves out of a job, oddly enough. There’s always something new that comes along that needs to be taken care of.
I know where you can find some PowerFriends
I can’t help you write PowerShell. It’s not that it isn’t cool, my current specialization just lies in teaching performance tuning and drawing cat pictures.
But there’s a SQL Server Slack community full of PowerShell loving people who do cool things with automation. And they are super helpful! You can join the Slack channel for free anytime (whether or not you wanna learn PowerShell).
Go to dbatools.io/slack to join sqlcommunity.slack.com
I wanna hear them! Like I said, patching hasn’t been my actual job for a long while now. If you know magical secrets, SPILL THEM.