Should I Automate my Windows Updates for SQL Server? (Dear SQL DBA Episode 10)

Your boss wants you to automate patching for your SQL Servers. Is that a good idea? How far should you take it? Get a DBA’s perspective.

This is a “listen-able” 17-minute video. Prefer a podcast instead? Find it at littlekendra.com/dearsqldba.

Show notes (transcript with links) are below the video.

Dear SQL DBA…

I’ve been tasked with getting a patching solution for our SQL servers that doesn’t require us to sit and watch the installer.

The installer provides a lot of good feedback, and the patch is going to take as long as it takes no matter what. If I don’t watch the installer (i.e., I run the patch from a command line with /quiet /action=Patch, etc.), I’m still going to have to review the summary.txt file for feedback on the patch’s success, etc.

Would you recommend cobbling together a command line/scheduled task scheme and hope it all goes fine with manual reviews of the txt/log files after the fact?

Sincerely,

Windows Updater
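
(For readers who haven’t done an unattended run before, here’s roughly what the setup the letter describes looks like. This is just a sketch: the patch file name, KB number, instance name, and log path are placeholders.)

```powershell
# Run the SQL Server patch silently against one instance (file name and KB are made up)
$patch = 'C:\Patches\SQLServer2016-KB1234567-x64.exe'

$run = Start-Process -FilePath $patch -Wait -PassThru -ArgumentList `
    '/quiet', '/action=Patch', '/instancename=MSSQLSERVER', '/IAcceptSQLServerLicenseTerms'
"Installer exit code: $($run.ExitCode)"

# Then review Summary.txt (in the version-numbered Setup Bootstrap folder) for the results
Get-Content 'C:\Program Files\Microsoft SQL Server\130\Setup Bootstrap\Log\Summary.txt' -TotalCount 30
```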

I’m just as lazy as any other DBA. Actually, I’m lazier.

Patching isn’t fun work, but it’s really necessary. Organizations that don’t regularly apply Windows patches run the risk of security breaches and weird performance problems.

It involves a lot of waiting, it’s pretty boring, and it usually happens at a time when you’d rather be doing something else – like Friday night or Sunday morning.

I’m generally in favor of anything that makes patching less work. I don’t want it to introduce unacceptable risks for the data, based on business requirements, but I love smart automation that can help me do more, faster, and be safer.

It stinks when patching goes wrong

I used to spend a lot of time doing patching, and I had plenty of times when:

  • Servers wouldn’t come back up after a reboot. Someone had to go into the iLO/RIB card and give them a firm shove.
  • Shutdown took forever. SQL Server can be super slow to shut down! I understand this better after reading a recent post on the “SQL Server According to Bob” blog. Bob Dorr explains that when SQL Server shuts down, it waits for all administrator (sa) level commands to complete. So, if you’ve got any servers where jobs or applications are running as sa, well… hope they finish up fast. (There’s a quick check for that in the sketch after this list.)
  • Patching accidentally interrupted something important. Some process running from an app server failed because patching rebooted the server, and it fired off alarms that had to be cleaned up.
  • Something failed during startup after reboot. A service started and failed, or a database wasn’t online.  (Figuring out “was that database offline before we started?” was the first step. Ugh.)
  • Miscommunication caused a problem on a cluster.  Whoops, you were working on node2 while I was working on node1? BAD TIMES.
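
(Here’s the quick check mentioned above: a sketch that lists sa-owned Agent jobs and active sa sessions before you shut down. It assumes the SqlServer module for Invoke-Sqlcmd, and the instance name is a placeholder.)

```powershell
# List Agent jobs owned by sa and any active sa sessions - the things most likely
# to be running as sa when you try to shut SQL Server down
Invoke-Sqlcmd -ServerInstance 'MYSERVER' -Query @"
SELECT j.name AS job_name, SUSER_SNAME(j.owner_sid) AS owner
FROM msdb.dbo.sysjobs AS j
WHERE SUSER_SNAME(j.owner_sid) = 'sa';

SELECT session_id, login_name, status, program_name
FROM sys.dm_exec_sessions
WHERE login_name = 'sa' AND is_user_process = 1;
"@
```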

I don’t believe Windows updates that say they don’t require a reboot.

I realize I sound like a crotchety retiree when I say, “In my day, we ALWAYS rebooted after applying patches.”

But patches that say they don’t require a reboot are dirty liars, and I don’t trust them.

If stuff starts getting weird later, and you skipped the reboot during the maintenance window, you know how that’s gonna look. Not good. Always have a maintenance window, and don’t skip the restart.

I love that you’re already planning patch verification!

You must have had that experience where you run Windows Update and it says it worked, but then you find that it actually didn’t install one of the critical patches.

Maybe we’ve all had that experience.

Checking to make sure critical patch numbers all made it in during the maintenance window is super smart.
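
A minimal sketch of that kind of verification, assuming plain PowerShell and made-up KB numbers and server names (note that Get-HotFix doesn’t see every kind of update, so your patching tool’s own reporting is still worth checking):

```powershell
# Flag any server in the list that is missing one of the critical KBs
$criticalKBs = 'KB5030214', 'KB5029242'     # placeholders
$servers     = 'SQL01', 'SQL02'             # placeholders

foreach ($server in $servers) {
    $installed = Get-HotFix -ComputerName $server | Select-Object -ExpandProperty HotFixID
    foreach ($kb in $criticalKBs) {
        if ($installed -notcontains $kb) {
            Write-Warning "$server is missing $kb"
        }
    }
}
```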

Separate servers into Pets and Cattle (sort of)

One of the concepts of managing cloud environments is for developers to create systems where servers are no longer special little flowers with unique names. If your servers cause lots of tragedy when they get sick, then you’re vulnerable.

Instead, the idea is that developers build systems where servers are more like cattle. They may be expensive, but if they get ill, you quickly “remove” them (OK, kill them) for the safety of the herd and replace them.

It’s a brutal metaphor, but they really mean it. Everything needs to be killable.

(I’m not actually sure who to credit for coming up with this comparison. Update: According to The Register, this came from a former Microsoftie named Bill Baker.)

For patching, I’ve always had groups of SQL Servers. The most critical are pets. The least critical are pretty darn close to cattle.

The general groups go like this:

  • Development SQL Servers – cattle
    • Fully automated patching
    • Something goes wrong, you spin up a new VM and redeploy
  • Staging SQL Servers – cattle
    • Fully automated patching
    • Manual verification that things look normal after patching, before patches are cleared for production SQL Servers
  • Non-Critical Production SQL Servers – prize cattle that compete at the county fair. Can be down for 2+ hours during a scheduled maintenance and it’s no problem. Wouldn’t be a huge issue if they were down all weekend after a scheduled maintenance or lost some data.
    • Often these get fully automated patching, depending on the tools in use and data loss requirements
    • These are the first SQL Servers to be patched in production
  • Critical Production SQL Servers – pets.
    • Someone is around to run the patches.
    • In a highly automated environment, that person may mostly monitor a dashboard while the patches deploy and respond if needed. (Someone who can actually respond meaningfully, not just the Network Operations Center, because usually for these servers the business would rather not wait for the NOC to page out.)

It isn’t like this everywhere. I’ve had readers and clients where the development SQL Servers were NOT cattle, because they contained changes that hadn’t been checked into source control.

If that’s the case where you work, you want a separate project to make changes so that development environments become entirely cattle.

Automated patching: use System Center for cattle

System Center is really good at deploying patches. Servers that qualify as cattle are a great fit for System Center.

The patches roll out. If there are issues, the Network Operations Center (or whoever is on call) follows a documented process to get things back online. If there are major issues, you restore from backup. Or possibly just redeploy from source, in the case of development servers.

If “restore from backup” sets off any alarm bells for anyone, that server isn’t cattle.

I’d look at PowerShell for the SQL Servers that need special attention (prize cattle and pets)

For the “pet” servers, I love the idea of automation that quickly lets you add bells and whistles.

I’d consider PowerShell, because it’s a popular Swiss Army knife and you could work in lots of great features over time.

I would try to work in features like these:

  • Validate the last backup times for critical databases and raise a flag if they aren’t what they should be prior to starting patching.
  • Record the status of all databases on the instance for you (ARE any offline?)
  • Check for long running jobs
  • Stop the SQL Server services before initiating Windows restart. This helps demystify any slow shutdown experiences, as you know exactly when the SQL Services finish stopping without having to go into Event or SQL Logs.
  • Install and validate Windows patches
  • Run a few verification queries as a smoke test to make sure databases are properly online and available to users
  • Validate that all SQL Services started as expected (and didn’t fail immediately after)
  • Check Event Logs and SQL Logs for warnings after the reboot

Doing these things manually kinda stinks.

Do I know how to do all this stuff in PowerShell? Nope. I used to do all this stuff manually, back when PowerShell was a DOS prompt’s kid brother.

But if it was my job to do it now, I think PowerShell’s probably a good bet for being able to do a lot of it pretty well.
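
To give a sense of the shape, here’s a very rough sketch of a few of those checks. Everything here is a placeholder – instance name, database names, default-instance service names, file paths – and it assumes the SqlServer module for Invoke-Sqlcmd:

```powershell
$instance    = 'MYSERVER'                     # placeholder
$criticalDbs = 'SalesDB', 'OrdersDB'          # placeholders

# 1. Pre-check: warn if a critical database's last full backup is older than expected
$backups = Invoke-Sqlcmd -ServerInstance $instance -Query @"
SELECT d.name, MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
    ON b.database_name = d.name AND b.type = 'D'
WHERE d.name IN ('SalesDB','OrdersDB')
GROUP BY d.name;
"@
$backups | Where-Object { $_.last_full_backup -lt (Get-Date).AddHours(-24) } |
    ForEach-Object { Write-Warning "$($_.name): last full backup is more than 24 hours old" }

# 2. Record database states before patching, so "was it already offline?" has an answer
Invoke-Sqlcmd -ServerInstance $instance -Query "SELECT name, state_desc FROM sys.databases;" |
    Export-Csv -Path 'C:\PatchLogs\db-state-before.csv' -NoTypeInformation

# 3. Stop the SQL Server services cleanly before the Windows restart (default instance names)
Stop-Service -Name 'SQLSERVERAGENT' -Force
Stop-Service -Name 'MSSQLSERVER' -Force

# ... install patches and reboot here, with whatever deployment tool you use ...

# 4. Post-checks: services running, databases online, and a quick smoke test
Get-Service -Name 'MSSQLSERVER', 'SQLSERVERAGENT' |
    Where-Object Status -ne 'Running' |
    ForEach-Object { Write-Warning "$($_.Name) is not running" }

$notOnline = Invoke-Sqlcmd -ServerInstance $instance `
    -Query "SELECT name FROM sys.databases WHERE state_desc <> 'ONLINE';"
if ($notOnline) { Write-Warning "Databases not ONLINE: $($notOnline.name -join ', ')" }

foreach ($db in $criticalDbs) {
    Invoke-Sqlcmd -ServerInstance $instance -Database $db -Query 'SELECT 1 AS smoke_test;' | Out-Null
}

# 5. Scan the Application event log for errors since the reboot
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Level = 2; StartTime = (Get-Date).AddHours(-1) } |
    Select-Object TimeCreated, ProviderName, Message -First 20
```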

I’d work to get this working first in my pre-production and staging environments, then roll it out to the “prize cattle”, then to the “pet” SQL Servers last.

I’d leave development SQL Servers as pure cattle, though. Because I’m lazy.

If your company has developers, it’s possible that you could get some of their time to work with you to get this going. You’d help explain the requirements, they’d write the code. Someone would buy donuts.

Cluster Aware Updating (CAU)

Cluster Aware Updating is a feature introduced in Windows Server 2012 to relieve some of the pain from patching Windows Failover Clusters.

It helps automate:

  • Putting nodes into maintenance mode
  • Moving roles off the node
  • Installing patches
  • Restarting the node
  • Taking the node out of maintenance mode
  • Moving roles back
  • Moving on to the other nodes

There’s an option to have an administrator trigger the run from a remote server so they can monitor it in real time, or it can be set to run on a schedule on the cluster itself.
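
If you want to see what driving it looks like, here’s a minimal sketch of both modes, assuming the ClusterAwareUpdating PowerShell module; the cluster name, thresholds, and schedule are placeholders:

```powershell
# Remote-updating mode: an administrator kicks off a run from another machine and watches it
Invoke-CauRun -ClusterName 'SQLCLUSTER01' -MaxFailedNodes 0 -MaxRetriesPerNode 1 `
    -RequireAllNodesOnline -Force

# Self-updating mode: add the CAU clustered role so the cluster patches itself on a schedule
Add-CauClusterRole -ClusterName 'SQLCLUSTER01' -DaysOfWeek Sunday -WeeksOfMonth 2 -Force
```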

There are still outages during failovers. And it doesn’t cover the bells and whistles I talked about earlier, like checking for running jobs.

Microsoft has published a whitepaper on using Cluster Aware Updating with SQL Server 2012 and higher for Failover Cluster Installs of SQL Server.

Be aware this doesn’t cover Availability Groups. My understanding is that those are still unsupported for CAU, even in SQL Server 2016. 

Ideally, you’re not doing the patching yourself forever

Summing this up, as a DBA I really like automation for patching.

I may want someone to be around sipping their coffee while it runs in some cases. And I’d grow the automation so that it writes clear logs, and if things DO go wrong it outputs helpful diagnostics so the person can respond more quickly.

I’d even try to make it so good that a junior level person could run it.

DBAs never need to worry about automating themselves out of a job, oddly enough. There’s always something new that comes along that needs to be taken care of.

I know where you can find some PowerFriends

I can’t help you write PowerShell. It’s not that it isn’t cool; my current specialization just lies in teaching performance tuning and drawing cat pictures.

But there’s a SQL Server Slack community full of PowerShell loving people who do cool things with automation. And they are super helpful! You can join the Slack channel for free anytime (whether or not you wanna learn PowerShell).

Go to dbatools.io/slack to join sqlcommunity.slack.com.

Got Opinions?

I wanna hear them! Like I said, patching hasn’t been my actual job for a long while now. If you know magical secrets, SPILL THEM.

 


Comments

I heard Jeffrey Snover from Microsoft talk about the pets and cattle comparison at Microsoft Ignite 2015 in a session about the Windows Server 2016 Nano Server installation option. That’s the first I heard of this metaphor.



Great script for Windows updates: https://community.spiceworks.com/scripts/show/82-windows-update-agent-force-script-email-results-version-2-7 – have a scheduled job on the server call the script from a file server. Set it to run every Saturday at 8 PM or whatever day/time you want. When you don’t want it to run, rename the file and the job will fail; when you want it to run again, rename it back to the correct name. The script gives you loads of information about your updates and provides email with logs.


Hi, Kendra!
It would be very interesting to see your post about SQL Server (not Windows OS) patching, like installing Service Packs and hotfixes. I think there are a lot of hidden pitfalls you could point out.

P.S. And thank you for your community input! 🙂


Hi Kendra,

You mention “database offline?” type of verbiage a couple of times.

Is it a best practice, when patching Windows servers that have SQL Server, to put databases offline prior to shutting down the SQL Service and launching the updates? It seems implied by the way you talk about it. Or am I misunderstanding, and it was a problem that the databases were offline, as opposed to wanting them to be offline?

Disclaimer: I’m not a DBA, I’m a BI Developer with no responsibility for updating the servers in my org.


    Great question! I don’t set databases offline for patching. In the cases in question, the databases were in the process of being archived. They were legacy databases, and they were *believed* to not be in use anymore, but the system was complex enough that we wouldn’t have been surprised if a monthly or quarterly process that nobody had remembered (or perhaps nobody realized was still running) queried them.

    For this reason, the removal process was to take a final backup, set the database offline (so it could quickly be brought back if needed), and then eventually verify that we still had the backup and that it still restored, and drop the database.

    The obvious problem that would happen is that sometimes the drop process would be delayed or forgotten. Usually people were pretty good about documenting in the runbook which databases were purposefully offline, but you know how that goes 🙂


In my experience there’s a significant point of contention between how management thinks every server is or *should be* cattle, and a DBA’s experience on the ground knowing that every server is more like someone’s fussy pet made unique with the scars of 5-10 years of patches, OS upgrades, SQL upgrades, tweaks, applications, and who knows what.

I apply about 100 SQL SPs/CUs per month. They can fail in twenty ways that I’ve documented so far, and that’s just the predictable stuff. So when it comes to scripting automation for it, it’s a massive project, and nobody seems to “understand” why.

Of course, if you have a smaller environment it’s probably not a big deal; maybe you won’t hit all those edge cases and you’ll get lucky. I suspect most people haven’t even tried to understand the Microsoft documentation on properly patching a cluster or AG or replication member – because what they suggest is crazy, out of date (written mostly for Windows Server 2008 R2), and borderline dangerous. Are you *really* going to automate removing node votes in the cluster during each patch session and then reapply it all automatically afterwards? That’s a resume-generating event waiting to happen.

And you have to make crazy decisions. Let’s say you’re super smart and automate querying the version from the SQL Server AND automate working out which patch files are which build and can be used. Now… what patch are you going to apply to a SQL 2008 R2 instance? You’d better know there are two completely different version number streams: GDR and QFE. That’s assuming you do not jump up to the latest TLS 1.2 update, which overrides both but has been fraught with problems from every angle. (Is anyone even using SQL Server TLS? Judging from my chats with security professionals and searching for information on how to do it, my answer would be probably not.)

WSUS isn’t really an answer. It requires very specific settings to pick up the right patches, and those settings often aren’t used in larger environments. SCCM uses WSUS under the covers and is better, but… now you have all of those same problems, plus one more: trying to integrate “your” lowly DBA stuff into the system administrator’s baby, and he/she is very protective of their baby!

Oh well, those are my thoughts 🙁


    Oooo, great comment. So with the issues with WSUS/SCCM, what do you use for your patching when it comes to Cumulative Updates and Service Packs? Just curious what ends up being most efficient for you at that volume.

    For other readers who are confused by the GDR and QFE comments (and it’s confusing), the SQL Sentry team has a great resource for figuring out what build you want to apply at: http://blogs.sqlsentry.com/category/sql-server-builds/

    When there are hotfixes released out of band (whether Quick Fix or General Distribution), they note them and then also do some of the legwork for you to figure out which QFEs/GDRs are in which Cumulative Updates, etc. What an acronym party 🙂

Aaron Gonzalez
June 4, 2018 1:25 pm

Okay, this is an old thread but I found this somewhat concerning…

“So, if you’ve got any servers where jobs or applications are running as sa, well…. hope they finish up fast.”

… because Ola Hallengren’s maintenance solution creates its jobs with sa as owner.

