Fail Over, Fail Again, Fail Better - Preparing for Disaster Recovery (Dear SQL DBA)

on June 23, 2016

You’re setting up SQL Server log shipping for disaster recovery. What else do you need to do to best prepare for a failure?

This is a “listen-able” 30 minute video. Prefer a podcast instead? Find it at littlekendra.com/dear-sql-dba-podcast.__

This Week’s Question: Dear SQL DBA…

We are going to start log shipping to reduce our time to recover from an outage. What can I do proactively to make things easier on myself when I want to switch over to the disaster recovery site (especially if the original system is unavailable)? I’m particularly worried about syncing user accounts, and dealing with SQL Agent jobs.

One of my favorite quotes is from Samuel Beckett. “Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.”

When things go terribly wrong, our job as DBAs is to make it the best failure possible.

Simply setting up logshipping, mirroring, clustering, or an availability group isn’t enough

The fact that you’re thinking about this is great!

You’re right, there are two major types of fail-overs that you have to think about:

Planned failover, when you can get to the original production system (at least for a short time)
Unplanned failover, when you cannot get to it

Even when you’re doing a planned failover, you don’t have time to go in and script out settings and jobs and logins and all that stuff.

Timing is of the essence, so you need minimal manual actions.

And you really should have documentation so that whomever is on call can perform the failover, even if they aren’t you.

Step 1: You need a safe test environment with the same security and network setup as production

One of the biggest risks in solving this problem is “whoops, I was was trying to plan for disaster, and accidentally messed up production.”

You need an environment where you can restore all the databases, set up tools, and test things out. I usually call this “prod test”, but it goes by “staging” sometimes as well.

You want it to be as similar as possible…

If your production environment has one domain and the DR environment has another, you want the same setup with the same trust relationship. It is safer for you for prod test to not be in the actual production and DR domains, though.
Same thing for firewall rules.

Don’t cut corners with security to make things easier for disaster failover.

Cutting corners with security could cause its OWN disaster
You do not want an attacker who compromises your DR site to have access to accounts that give them the production environment as well

Use color coding on registering the instance names in SSMS so you know what domain you’re connected to.

I like using ye olde BGINFO.exe (free tool from Microsoft’s Sysinternals team) to color code desktops for when you remote desktop, too.

Step 2: Decide on your balance between process and automation

If you just have a few SQL Server instances to manage for disaster recovery, you may prefer a manual process where:

You get the logins and jobs in sync at first
You change your release process so that every time security or a job changes, you deploy the change to both production and DR
You audit the two environments once every six weeks to make sure they’re in sync

This can be easier for things like creating a new domain security group in the environments, particularly if you don’t have the permissions to automate something like that yourself.

It can also be simpler for things like adding a linked server, where the linked server name is going to be different in the environments.

But if you’re managing more than a couple of instances, and if you have a lot of people who can deploy changes to the SQL Servers, doing it 100% manually just isn’t going to work. In that case you may want to have more automation:

Nightly scripts that look for new SQL Server Agent jobs and copy them to the DR environment (you probably want to leave them disabled until they are reviewed in case there’s any chance they can alter production in a negative way)
Nightly scripts that look for changes in logins and stage or execute changes appropriately
Nightly scripts that look for linked server changes and stage changes in DR or notify you
Audit the two environments every six weeks, because no automation is perfect. And weird things like Windows Scheduled Tasks can sneak in (even though they shouldn’t)

“Staging changes” in this case means generating and saving a script for you when you aren’t sure that it’s safe to just set something loose.

Step 3: Test out powershell tools for your automation needs

Good news: you don’t have to write everything from scratch. Check out free tools that others have created.

One example is dbatools.io: https://dbatools.io/getting-started/

Disclaimer: I didn’t write this, and I only know enough PowerShell to be dangerous. VERY dangerous. But you’ve got a test environment, right?

Consider testing this (or other tools) out and adapting them to your needs instead of writing all your code from scratch.

For SQL Server login setup, Microsoft has a KB with a script to transfer logins. I believe this is handled by dbatools.io if you use it (but test for your setup, because I don’t know my PowerShell from my Power Rangers)

Script for transferring logins using TSQL - KB 918992: https://support.microsoft.com/en-us/kb/918992

Step 4: Practice planned AND unplanned failovers in your test environment

You are totally right to worry more about unplanned failovers

But test planned failovers first, because it’s still hard!

4. A Planned failovers

With logshipping, it’s important to know how to take the last log backup and make sure that no changes can get into the database
You do a special last log backup with NO RECOVERY. You restore that to the secondary, and you can reverse the logshipping without re-setting up the whole thing.
That could actually come in handy for some situations, so it’s worth getting really comfortable with in your test environment.
You also need steps to turn on backups in the DR site after you fail over. Will you do that manually, or automate it?

After failover, test that the application tier can access the database. Even a basic minimal test will help, because it’ll help you find things that need to be reset or modified after databases fail over. Things like:

“Orphaned Users”. If you use SQL Authentication, you may need to use sp_change_users_login to fix this, and it’s always good to check for it.
Trustworthy database property: this restores as “off”. Not all databases require this, so don’t just set it on everywhere– you should only enable it for databases that need it.

Tip: Build a checklist for your planned failover. This will also work as a draft for your checklist for unplanned failovers.

4.B Unplanned failovers

When practicing unplanned failovers, turn off the logshipping publisher in your test environment entirely, so you can’t get to it, then run your practice.
One of the big questions you’ll be asked in a real scenario is “how much data did we lose, and can we get it back once production is back online?”
Practice estimating how many log backups may have been taken, but not copied to the secondary
Practice the scenario of bringing the secondary online, then getting access to the original production environment back after an hour
- If you don’t drop the original production databases, do you have space to make it a secondary again?
- If you DO drop the original production databases, how much data will be lost that never made it to the secondary, and what is the business impact?

Evaluating your risks in the DR environment…

How long will it be acceptable to run in the DR environment without logshipping set up to another failover environment?
How much space do you have in the DR environment for full / differential / log backups?

Usually practicing DR failures end up with you realizing you need more space and resources than you originally planned

That’s totally normal - just make sure you ask for it before the disaster.

Step 5: Practice planned failovers in your production environment

Companies that are serious about planning for failure practice failovers in production. They do it often.

The more often you fail over, the better you get at it. And if you don’t keep up at it, things sneak in that break your processes and automation.

You can’t do this by yourself. Failing over is about a lot more than the databases.

But once you’re practicing regularly outside of production and also getting your processes and production environment ready, too, it’s time to start suggesting planned production failovers

Even if the answer is “no”, keep bringing it up on a regular basis in a friendly, positive way.

Got an annual review? An annual review is a great time to bring this up as a place you think your team can improve, and a way you’d like to contribute with the entire IT team.

Recap - Failing better

Here’s the five steps toward disaster preparedness…

Step 1: Set up a safe test environment with the same security and network setup as production
Step 2: Decide on your balance between process and automation
Step 3: Test out powershell tools for your automation needs
Step 4: Practice planned AND unplanned failovers in your test environment
Step 5: Practice planned failovers in your production environment

Even if you can’t complete step 5 (because you can’t do that alone), don’t forget to brag on yourself.

Write up a summary of the work you’ve done for your manager. If you’re blocked due to time or outside resources, summarize what you’d like to do further and what can help you make that happen.

This is the kind of project that people outside your team might not think to ask about. But it’s fantastic work for your team to do, and it’s something you should make sure to talk about!