Tag Archives | Database Administration

Will the Cloud Eat My DBA Job? (Dear SQL DBA Episode 31)

Will cloud services render DBAs obsolete? Does the cloud mean that developers will use less of SQL Server? In this post I talk about the future of database administration and give tips on strategizing your career.

Watch this week’s 28 minute episode, or scroll down to read a written version of the answer. Us YouTubers love it when you subscribe to our YouTube channels. You can also get this by podcast (and I would love it if you left a review on iTunes).


I moved into the role of DBA about a year ago and still have a lot to learn, however I’m worried…

  • Cloud adoption may move more and more companies away from SQL Server (due to cost, for instance)
  • As the transition to the cloud becomes more commonplace the DBA job will become a service included in that offering

I responded with a few questions of my own:

  1. What made you decide to be a SQL Server DBA, specifically?
  2. What do you like about being a DBA? What parts of your job do you wish you could do all the time?
  3. Are you curious about cloud technologies at all? Or do they seem uninteresting to you?

We’re talking about career strategy here. In this context, it’s important to reflect a little on how you got to where you are now, what you enjoy doing, and what you’re interested in.

We’ll get back to how those questions were answered soon.

Lots of things have been reported to kill the DBA over the years

SQL Server 2005 was said to be “self-tuning”! Who needs a DBA when the instance tunes itself? (Apparently everyone.)

Outsourcing: All the DBA jobs are going to X location, then Y location, then Z location. Then back to X. DBA jobs have become more global, but “outsourcing” hasn’t gotten rid of DBA jobs in the United States. It has been part of the trend to make working remotely more normal and easy, which is generally good for DBAs.

DevOps! All the developers will manage everything. And somehow know to do so. I love DevOps, and I have seen it wipe out some QA departments, but I haven’t seen it wipe out DBAs. I think it’s fun to be a DBA working with a DevOps team.

One job is going away… slowly

“Old School Database Administrator”


  • Installs SQL Server
  • Sets up and monitors jobs for backups, CHECKDB, and index maintenance
  • Manages user permissions
  • Stays in cube, silently

I don’t think the cloud specifically is killing off this role. IT teams are being asked to communicate better, pay more attention to their customers, and be more dynamic and specialized.

Who’s a DBA’s #1 customer?

Developers wanna develop

The cloud enables developers to use a lot of technologies.

But that doesn’t necessarily mean less SQL Server will be used.

The cloud is also bringing new ways for SQL Server to be cost effective – like elastic pools.

Developers have a massive challenge: there’s an incredible amount for them to learn.

Microsoft has been changing

It’s not just cloud — it’s things like SQL Server on Linux in vNext of SQL Server.

Yeah, you can spin up a Linux VM in the cloud and run SQL Server on it. But you can also do it on prem. Or in a container. You can do it anywhere. That’s really cool.

Microsoft is also embracing lots of technology in their cloud. You can bring data from multiple data sources together– including SQL Server.

Don’t get threatened by the fact that it’s not all SQL Server. If customers can do more and build more, and if there’s many ways to include SQL Server in that, it’s going to get used. It has tons of cool features for performance, and it’s a proven relational database with a really solid foundation.

Developers are still gonna need help

  • “This looks weird, what’s happening in the database?”
  • “Why is this slow?”
  • “What can we do to make queries faster without changing the code?”
  • “Is there a way to do this more cheaply?”

And also…

  • “Can we roll our own solution?”

The complexity developers face is only expanding. For those teams who choose SQL Server, there’s plenty of need for specialists.

Business owners will still need help

They have their own questions:

  • “What are the developers not seeing?”
  • “Are we meeting our SLAs, RPOs, RTOs and other acronyms?”
  • “Is our data secure?”

And also…

  • “Should we be in the cloud, or running our own?”

Remember how once upon a time, virtualization was going to completely take over everything? It definitely did become very common, but there were still plenty of companies who kept their SQL Servers using physical hardware for performance reasons. There are plenty of others who virtualized, then reverted to physical as well. And now there’s a trend toward the new hotness of containers.

As the options increase, the need for a SQL Server DBA who’s aware of the pros and cons of each just goes up.

DBA Specializations

Sysadmin / Platform

  • Scripting and managing large private implementations
  • High Availability and Disaster Recovery
  • Storage specializations

Performance tuning / architecture

  • Performance tuning critical workloads
  • Specialized code design patterns for high transaction workloads
  • Index tuning and new technologies

Cloud is a part of both of these. Many companies will use the cloud for failover. Many will come to use it primarily. Even then, which cloud do you use? Which way do you deploy to best suit your cost requirements and service level agreements? How do you manage it as it evolves?

Operations can be its own specialization

I used to work with two men who came from the “punch-card” days. When I met them, they ran operations for a set of proprietary Hadoop-like clusters at a dot-com. They were doing NoSQL before I ever heard it called NoSQL.

They were really good at operations. They adapted to new things to manage, and they did it very well.

That made me realize that as long as I’m good at solving a problem for a business with technology, I’m going to be able to find a job. I just need to keep adapting as the technology evolves.

Business intelligence is another example

The rise of self-serve data sounded like it was going to put BI folks out of business at the beginning.

After all, PowerPivot used “The Engine of the Devil!”

Who needs to build cubes anymore if people can just do everything they need in an Excel spreadsheet that references the cloud?

This customer wants their BI team more than ever

BI consultants now

I don’t work in Business Intelligence, but people seem to be really productive, happy, and active in that field.

More options on how to get things done resulted in more work, and more demand for interesting data to work with from business users. People want to figure out how to make their data meaningful in new ways.

What if you’re just starting out?

If you are…

  • Fresh in the industry, no experience
  • Someone who loves working with data

Should you focus on just Microsoft SQL Server?

Well, no, but that has nothing to do with SQL Server, or even the cloud specifically. Don’t settle down too fast. Why marry one technology before even getting to know others?

If you’re in this position, it’s an amazing time to experiment with all the technologies out there. This definitely includes cloud technology, but give yourself freedom to explore and try as many new things as you can get your hands on.

Our questioner isn’t in that position

Here’s how they answered my questions…

1. What made you decide to be a SQL Server DBA?

  • Accidental DBA, support background

2. What do you like about being a DBA?

  • Inner workings of SQL Server, flagship technologies. (NOT writing reports.)

3. Are you curious about cloud technologies?

  • Yes

The cloud is part of your job security

Get to know different options to deploy SQL Server to the cloud.

  • Strengths
  • Weaknesses / limitations

Ask questions around these areas…

  • Monitoring
  • HA/DR
  • How to support it / what maintenance is required

Cloud is a great reason to get to go to conferences! You need to provide the curiosity and justification for attending the training. Ask, and keep asking. When you do get to go to training, take the initiative to schedule time and share what you learned with your team. (It really helps when it comes time to ask to go again.)

Will the robots take my job?

Robots may take EVERYONE’s job. I can’t change that.

However, I think as database people, we’re probably near the end of the list of jobs that will be replaced by Skynet. The robots need us to serve them until they are fully self-sustaining, and no longer need humans at all.

But for the foreseeable future, there’s tons of work for data professionals to answer the question…

“How do we get the most out of all this stuff?”

My #1 advice: Look for problems you want to solve at work, and go after them

Don’t get stuck on a job title. I did this at one point in my career, and it was a huge mistake. The job title “DBA” may disappear, or you might find you like a job with a different title.

That’s fine. It’s not like “administrator” is a sexy term.

You get one life. You may as well get to solve problems that you enjoy while you work. That’s way more important than how the HR department tries to classify you.

Got a question for Dear SQL DBA?

Ask away at LittleKendra.com/dearsqldba!

Should I Upgrade to SQL Server 2016? (Dear SQL DBA Episode 22)

It’s a big week for SQL Server! And it’s the perfect week to talk about this week’s question, which is about explaining to your management why it’s worth it to upgrade to SQL Server 2016, and which features you can use right away.

Watch the 24 minute video, scroll down to read the article, or subscribe to the podcast.

Three different people asked me a variation of this question recently at the SQL PASS Summit:


We recently got budget to upgrade our SQL Server. My manager wants us to go with SQL Server 2014, because 2016 is so new. How can I convince management that 2016 is better?

Lucky DBA

As of yesterday, you have more reasons than ever to go with SQL Server 2016.

SQL Server 2016 Service Pack 1 was released, and not only do Service Packs give managers warm fuzzy feelings, this one was full of goodies.

SQL Server 2016 Service Pack 1 adds more features to Standard Edition (plus Web, Express, and Local DB)

I know most of y’all care about Standard Edition.

With SP1, you can use all sorts of (formerly) super-expensive features like data compression, partitioning, Columnstore, Change Data Capture, Polybase, and more in Standard Edition.

A few features have scalability caps in “lower” editions. There are memory limits for In-Memory OLTP (aka Hekaton) and for Columnstore, and Columnstore also has restrictions on parallelism.

This isn’t every feature. Transparent Data Encryption and licensing for high-availability features haven’t changed. Memory and CPU limits for editions haven’t changed either: Enterprise Edition is still needed for scalability and HA/DR.

But overall, you get way more bang for your licensing buck in Standard Edition in SQL Server 2016 SP1.

Or, in the case of Express Edition, you get more for free.

Read the list of features by edition here: https://www.microsoft.com/en-us/sql-server/sql-server-editions 

Quick DBA wins to start using in Standard Edition

Your management may ask, “which of these features can you use without code changes? And how much will they help?”

That’s a great question.

  • Database snapshots can be immediately useful as part of your release process. You take a database snapshot before deploying a change, use it for verification if needed, and drop it after the change is done. You incur write overhead for modifications made while the snapshot exists, but it doesn’t copy the whole database.
  • Data compression can make better use of your storage and memory.
    • Fine print: don’t just compress all your indexes. You need to determine which tables might benefit, and which type of compression to use. There are CPU tradeoffs. Start with the Microsoft Whitepaper on Data Compression.
  • Table Partitioning – maybe. Technically, you can implement table partitioning without a bunch of code changes. But in practice, you frequently do need to tune or rewrite some queries because of quirks in how things like TOP or GROUP BY can behave differently against partitioned tables. For tables that are sensitive to read performance, it usually takes both developers and DBAs to implement partitioning. However, you may have some tables which are easier wins– like a table you almost exclusively write to for logging purposes, where table partitioning could be useful for truncating partitions instead of running deletes.
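For the snapshot step in the release process above, a minimal sketch might look like this (the database name, logical file name, and snapshot path are hypothetical; adjust for your environment):

```sql
-- Hypothetical database, logical file name, and snapshot path
CREATE DATABASE SalesApp_PreRelease
ON ( NAME = SalesApp_Data,   -- logical data file name in the source database
     FILENAME = 'S:\Snapshots\SalesApp_PreRelease.ss' )
AS SNAPSHOT OF SalesApp;

-- Verify a table's contents against the pre-release state...
SELECT COUNT(*) FROM SalesApp_PreRelease.dbo.Orders;

-- ...then drop the snapshot once the deployment is verified
DROP DATABASE SalesApp_PreRelease;
```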

Security can be pretty persuasive

When it comes to Standard Edition, the biggest change with 2016 SP1 is that Always Encrypted is now available.

If your database stores anything like credit card numbers or social security numbers, you want this. You REALLY want this if you’re the DBA, because your life is simpler if there’s no way you could exploit that data. Think of this as a really nice protective vault for your data kryptonite.

Note: there are other security features in 2016 that may also be attractive, like Dynamic Data Masking. They were available in Standard Edition as of RTM, but now are available in Web, Express, and Local DB.

Fine print: Transparent Data Encryption (TDE) is still an Enterprise Edition feature.

How I’d start talking about In-Memory OLTP

Your developers may ask, “does this mean we can start using that Hekaton thing everywhere?”

Well, you can, but you might break a few things.

In-Memory OLTP, aka Hekaton, has a memory cap in the “lower” editions of 25% of the Buffer Pool Memory Limit. For Standard Edition, you’ll be limited to 32GB of memory for In-Memory tables, and when you run out of memory, things get real dicey.

So you’ve got to be careful. But you’ve got to be careful even in Enterprise Edition, too– because your hardware doesn’t have an infinite amount of memory. Even in EE, you have to learn to set up monitoring of how much memory is being used by In-Memory tables, alert when it’s getting low, and learn how to adjust it.
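As a starting point for that monitoring, memory used by In-Memory tables can be checked per database with the sys.dm_db_xtp_table_memory_stats DMV. A rough sketch:

```sql
-- Run in the database that contains In-Memory tables
SELECT OBJECT_NAME(object_id) AS table_name,
       memory_allocated_for_table_kb / 1024. AS table_allocated_mb,
       memory_allocated_for_indexes_kb / 1024. AS indexes_allocated_mb
FROM sys.dm_db_xtp_table_memory_stats
ORDER BY memory_allocated_for_table_kb DESC;
```

You’d compare the totals against your edition’s cap and alert as usage approaches the limit.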

Having In-Memory tables in Standard Edition gives you a much better place to learn.

I would look for less critical applications that might have a good use case for In-Memory tables. Applications where it’s not a disaster if you have an outage. You need something that has a lot of writes to make it worth your while for the experiment.

Logging databases where you don’t have to retain a ton of history come to mind– I’ve worked with a bunch of apps that can flip verbose logging to a database on and off, and the app is designed to keep going even if the logging database goes off the farm.

Essentially, you now have room to be creative and cautious with this feature. That’s gold.

What about Columnstore?

For Columnstore, there’s the same memory limit as for In-Memory tables: 25% of the Buffer Pool Limit for that edition. Plus, Standard Edition is limited to 2 cores for parallel queries; Web and Express just get a single thread.

This is a bit easier to play with, as writes don’t stop when you run out of memory (unless it’s Clustered Columnstore on In-Memory OLTP). Reading Columnstore data from disk just isn’t as fast as reading it from memory.

You also have to balance out making sure the overheads of writes to a Columnstore index aren’t slowing you down (and that you can monitor this), and that you’re maintaining the index properly.

For Columnstore, look for relatively narrow tables that have many millions of rows. I say narrow because you’ve got limited memory and parallelism to burn for this. And Columnstore shines when you’ve got many millions of rows to compress.

It’s really not unusual to have many millions of rows in OLTP tables anymore, and to have a diverse set of queries hitting them. OLTP tables are often narrow as well, so even with the limits, it’s a big deal for Standard Edition to get this feature.

SQL Server 2016 has Query Store (all editions)

This hasn’t changed, but it was already awesome.

Query Store gives you a way to track your query plans along with metrics on query resource usage and execution time. So you can do things like see if adding a filtered nonclustered Columnstore index made things faster… or oops, if it made things slower.

Query Store is a fantastic feature for DBAs and Developers. It helps DBAs and Developers work better together, because they aren’t trying to save as many giant XML strings for query text and plans in spreadsheets anymore.

Before yesterday, it was one of my biggest reasons to argue that a SQL Server 2016 upgrade is far more attractive, even for Standard Edition. Now it’s just another item on the list.

You’ve still gotta test it

Things can go wrong in service packs. You’ve got to test any upgrade, even if it’s full of goodies.

2016 isn’t the new kid anymore. Meet SQL Server v.Next CTP 1

This train is picking up speed.

SQL Server v.Next CTP1 is now available for download on Windows, Linux, and even Mac (via Docker containers).

Check it out online here: https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-vnext-ctp

So it’s not like 2016 is the “new version” anymore, anyway.

Setting up Free Blocking Alerts and Deadlock Monitoring (Dear SQL DBA Episode 17)

What tools in SQL Server will notify you about blocking and help track the queries behind your toughest blocking and deadlocking problems?

Watch the 21 minute video, subscribe to the podcast, or read the episode notes and links below.


What is the best way to set up blocking and deadlock alerts on the server? I want to be notified automatically without any impact on the prod server.

I have tried alerts with SQL Server performance condition alerts with the SQL Server Agent. They do not show the queries or tables involved, etc.


All Blocked Up

Woo hoo, I love this question!

So first off, I’m going to answer this discussing the free, built-in tools with SQL Server. If you have a budget for custom monitoring tools, you can buy fancy tools that have customized notifications for blocking and which capture the queries and plans involved. If that’s the case, set up free trials against a test system.

But not everyone has budget for every single SQL Server instance. So it’s extremely useful to know what SQL Server offers to help you with this.

And by the way, if you’re going to be at the SQLPASS Summit in Seattle in just under a month, I’m giving a session that shows different blocking scenarios. Come see my session, The Great Performance Robbery: Locking Problems and Solutions on Thursday, Oct 27th at 10:45AM in room 6C.

Free, simple blocking notifications

I like to set up blocking notifications with a simple SQL Server Agent alert on the “Processes blocked” counter in the “SQLServer:General Statistics” performance object.

This will not give you the queries involved in the blocking — don’t worry, we’ll cover tracking that in the next step.

This alert is low impact and it will let you know when you need to look at the SQL Server.

To get this alert to work, you’ll need to:

  • Configure the SQL Server Agent service to auto start
  • Set up Database Mail, enable it on the SQL Server Agent, and then restart the SQL Server Agent
  • Configure an Operator in the SQL Server Agent
  • Create a new ‘performance’ style alert, base it on the “Processes Blocked” counter, and tell it who to notify
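Those steps can be scripted. Here’s a sketch of the alert itself, assuming an Operator named ‘DBA Team’ already exists (on a named instance, the performance object name starts with MSSQL$YourInstanceName instead of SQLServer):

```sql
-- Assumes an Operator named 'DBA Team' is already configured in msdb
EXEC msdb.dbo.sp_add_alert
    @name = N'Blocking Detected',
    @performance_condition = N'SQLServer:General Statistics|Processes blocked||>|0',
    @delay_between_responses = 600;   -- fire at most once per 10 minutes

EXEC msdb.dbo.sp_add_notification
    @alert_name = N'Blocking Detected',
    @operator_name = N'DBA Team',
    @notification_method = 1;         -- 1 = email
```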

A few things to note:

  1. The SQL Agent doesn’t poll counters constantly – and we want this to be lightweight, so that’s a good thing. It will only poll every 15-30 seconds, and there’s no published guaranteed SLA on that polling frequency.
  2. If you really need something more sensitive and reliable, you need a monitoring system fully independent / outside of the SQL Server to be polling in and checking it for availability and performance.
  3. You can configure the alert to only fire every X minutes. I highly recommend that, so you don’t get an inbox full of alerts every 20 seconds.

Create some blocking in a test database or in tempdb and make sure the alert works.

I have example code to create blocking and deadlocks for your dev environments in my post, Deadlock Code for the WorldWideImporters Sample Database.

For production databases, you can create a temp table and write similar code to create blocking in those.
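If you just want something quick, a sketch like this works in a test database (the table name is made up):

```sql
-- Setup: a tiny table to fight over
CREATE TABLE dbo.BlockingDemo (i int);
INSERT dbo.BlockingDemo (i) VALUES (1);

-- Session 1: take a lock and leave the transaction open
BEGIN TRAN;
UPDATE dbo.BlockingDemo SET i = i + 1;

-- Session 2 (separate window): this will be blocked until session 1 commits
SELECT i FROM dbo.BlockingDemo;

-- Back in session 1: release the lock
COMMIT;
```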


Finding the queries involved with the Blocked Process Report

OK, we’ve got notifications. We need SQL Server to give us more information on who is involved in the blocking.

I like to use the built-in Blocked Process Report for this. This has been in SQL Server for a long time, and it’s extremely useful.

The Blocked Process Report shows you the “input buffer” of the commands involved – it may be partial information and not the full text of the query. It will also show you the login name for who is running what, and the type of lock requests involved.

The Blocked Process Report is pretty lightweight, because it piggybacks on work SQL Server already does: it has to frequently wake up and look for blocking situations that can’t resolve themselves. By default, the deadlock monitor wakes up every 5 seconds and looks around to see if there is a deadlock which it needs to break. You can enable a feature called the ‘Blocked Process Report’ that tells SQL Server to additionally issue a report on any blocking which it finds.

To get this to work, you need to:

  • Enable the sp_configure option for the blocked process threshold. This defaults to ‘0’, which is off. You set the threshold to the number of seconds of blocking you want to allow before a report is issued. This should be a value of 5 or higher, because making the deadlock monitor run constantly could tank your performance. A good ‘starting’ value is 30 seconds.
  • You also need to set up a trace to collect an event called the ‘blocked process report’. Setting the threshold causes the event to be output, but SQL Server won’t collect it for you unless you start a SQL Trace or an Extended events trace that collects that event.
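Here’s a sketch of both steps, assuming SQL Server 2012+ for the Extended Events syntax (the session name and file path are hypothetical):

```sql
-- Step 1: emit a blocked process report after 30 seconds of blocking
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'blocked process threshold (s)', 30;
RECONFIGURE;

-- Step 2: collect the report with an Extended Events session
CREATE EVENT SESSION [blocked_process] ON SERVER
ADD EVENT sqlserver.blocked_process_report
ADD TARGET package0.event_file
    (SET filename = N'S:\XEvents\blocked_process.xel',
         max_file_size = 256, max_rollover_files = 4)
WITH (STARTUP_STATE = ON);

ALTER EVENT SESSION [blocked_process] ON SERVER STATE = START;
```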

Once you have the trace file, you can copy it off of the production server to interpret it.

Michael J Swart has written a great free tool called the Blocked Process Report Viewer to help interpret the blocking chains. It’s free at https://sqlblockedprocesses.codeplex.com.

The viewer makes it easier to see the root of the blocking chain and who was blocking whom.

This trace is pretty lightweight, but with any trace you want to make sure that you don’t add a bunch of events that you don’t need, and that you periodically clean up the files and don’t let it impact drive space.

When I talk about running traces, I don’t mean running Profiler

We’re entering dangerous territory here. Whenever you talk about tracing in SQL Server these days, someone gets offended.

Here’s what you need to know. There’s two main ways to run a trace:

  1. SQL Trace. This is the old school option. You can run this using…
    1. The Profiler client (I don’t like this option)
    2. A Server Side trace scripted out from Profiler (much better!). You can get up to speed on Server Side Traces by reading this article on generating a server side trace by Jen McCown. (Note that she wrote this article back in 2009. That’s fine; SQL Trace hasn’t changed since then.)
  2. Extended Events. This is much easier to use on SQL Server 2012 and higher than in previous versions because a GUI was introduced in Management Studio for it under Object Explorer.

I do not like leaving the Profiler application running because I’ve seen it do everything from slowing down performance to filling up drives over the years. And creating Server Side traces isn’t very hard if you do want to use SQL Trace.

I personally only like to have a trace running if I know I need it and am going to look at it. So I only enable this when I have a blocking problem. Whenever you choose to leave a trace running, you need to periodically check in on the files it’s created and clean up after it.

Detecting and handling deadlocks

What about locking problems where SQL Server has to step in and kill one of the queries?

You have a few built-in options for getting info on this. There are trace flags (1204 and 1222) that you can turn on which cause information about who is involved in the deadlock to be printed to the SQL Server Error Log. This isn’t my preferred option, because the information is very hard to parse through and read.

I find it more helpful to get a ‘deadlock graph’, which is a picture of how the locking fight went down.

On older versions of SQL Server, you can capture the deadlock graph with a server side trace.

On newer versions of SQL Server, you can capture this with an Extended Events trace.
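An Extended Events sketch for SQL Server 2012+ (session name and file path are hypothetical):

```sql
CREATE EVENT SESSION [deadlock_capture] ON SERVER
ADD EVENT sqlserver.xml_deadlock_report
ADD TARGET package0.event_file (SET filename = N'S:\XEvents\deadlocks.xel')
WITH (STARTUP_STATE = ON);

ALTER EVENT SESSION [deadlock_capture] ON SERVER STATE = START;
```

Note that the built-in system_health session on 2012+ already collects xml_deadlock_report events, so recent deadlock graphs may already be waiting for you there.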

A great resource for deciding how to capture deadlock information is Jonathan Kehayias’ excellent Simple Talk article, Handling Deadlocks in SQL Server.  He covers how to collect the graphs, shows examples of how they look, and gets you started tackling them.

If you get through this point and need to get really fancy with deadlocks, Michael J Swart recently wrote about using Event Notifications to collect execution plans related to deadlocks in his post, “Build Your Own Tools“. Just don’t try to run before you walk: this is pretty advanced stuff and you need to be comfortable using Service Broker, which is part of Event Notifications behind the scenes.

Updates, Oct 12, 2016:

  • Deadlock alerting: While it’s always better for the application itself to catch, handle, and log/alert on failed queries (not just deadlocks, but timeouts and anything else)… if your application can’t do that, and you can’t change it, check out this SQL Magazine article by Michael Campbell on alerting on deadlocks from SQL Server.
  • Also, thanks to Johan for pointing out that the free SQL Sentry Plan Explorer client tool is also able to display deadlock graphs. Just save them as an XDL file and you can open and view them.

Quick recap – getting more info on blocking

A fast rundown of the free, built-in tools we covered:

  • I like to use a simple, light, performance alert in the SQL Server agent for notification about blocking
  • I like the Blocked Process report to find out who’s blocking whom – collected by a server side SQL Trace or Extended events
  • I find collecting deadlock graphs with either a server side SQL Trace or Extended Events to be the most helpful way to figure out who’s involved in the nastiest blocking tangles.

Want to submit a Dear SQL DBA Question?

Want clarification on some of the basics? Got a question that jumped into your mind reading or listening to this? I’d love to hear it– asking is always free and easy!

Why an Upgrade can Cause Performance Regressions (Dear SQL DBA Episode 9)

You finally got approval to move to new hardware and a fresher version of SQL Server. After months of work, you do the migration and then… performance gets worse. What can cause this, and what do you look for?

This is a “listen-able” 27 minute video. Prefer a podcast instead? Find it at littlekendra.com/dearsqldba.

Show notes with clickable links are below the video. A free poster of the concepts is here.

Here’s this week’s question:


I recently went through a process of migrating databases on a SQL Server 2008 R2 stand alone instance to a SQL Server 2012 SP2 cluster.

The cluster’s servers and SQL Server configurations were built to be as close to identical as possible to the previous instance (memory, cores, disk, maxdop, CTP, etc).

After the migration, I noticed that CPU utilization jumped from the normal 25% to a consistent 75%.

I did several other migrations with similar server loads with no issues, so I’m a bit puzzled as to what might be going on here. Could the upgrade from SQL Server 2008 R2 to SQL Server 2012 simply be exposing bad queries that 2008 was handling differently?


Unexpected Performance Regression

I love this topic! I’m going to go a bit broader and talk about a variety of issues I’ve seen with migrations and upgrades. Let’s get to it!

Review your Max Degree of Parallelism and Cost Threshold for Parallelism settings (and other things from sys.configurations)

You’ve already done this one. I’m listing it for others who might hit the same situation, because it’s an important step.

SQL Server defaults “max degree of parallelism” to 0, which means when a query goes parallel, it can use all the cores. If you accidentally left that set at 0 on the new instance, your workload could have queries fighting over all CPUs all the time.

For more information on how to determine these two settings, check out my last episode, “Max Degree of Confusion.”

While you’re at it, compare all the settings in sys.configurations between the instances to check for differences, and look into each one. I’ve seen weird accidents cause performance issues, like people mistakenly setting minimum memory for a query on one instance.
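A quick way to do that comparison is to capture the output of this query on each instance and diff the results:

```sql
-- Run on both instances and compare line by line
SELECT name, value_in_use
FROM sys.configurations
ORDER BY name;
```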

The new hardware may have a bottleneck that only shows up under load

I’ve run into a couple of issues with hardware configuration that weren’t caught until after migration.

In one case, the server power wasn’t completely plugged in. This led to CPUs that operated at lower power, and actually showed higher percent utilization when examined in Windows task manager. They were just capable of doing less!

You can detect this by doing two things:

  • Check the Windows System log for problematic messages — in this case there were messages about power sources not being fully plugged in, but I look at all messages in the system log (some things are “informational” that are really important)
  • Run a free tool like CPU-Z to measure the clock speed that your processors are running at

In another case, following a migration we had one single core that was always at 100%! It turns out that our new NIC was pegging one core under load.

  • We saw SOS_SCHEDULER_YIELD in wait stats – when SQL Server was trying to use that one core, it had to wait a LOT, because it was being hogged by the NIC
  • This skewed overall utilization up
  • We were able to fix this by tweaking settings on the NIC so that it used multiple cores – required planned maintenance
  • This hadn’t happened on the old server, and our load testing pre migration just wasn’t chatty enough over the network to surface the issue

You may have power savings enabled on your processors

I wouldn’t expect this to make the entire difference between 25% and 75% usage, but it could be one of a few things contributing to the problem. It’s easy to fix, so it’s worth checking for.

Look at the BIOS on the new servers and make sure that power savings has been disabled.

Just changing the power savings settings in Windows Server does not always disable power savings; you need to look at a server management tool like Dell OpenManage or HP System Insight Manager.

Use your server model and look up the manufacturer’s guide to configuration for performance. These guides will list the power settings needed to get “high performance.”

Most servers these days ship with default settings that clock the processors down, and they can be very sluggish to clock up.

This typically takes a planned outage to fix: you need to make a change in the BIOS and then restart the server.

Your fresh new SQL Server version could be suffering from stack dumps or “15 second” storage errors

Stack dumps typically cause periodic “freezeups” that are followed by high load. 15 second storage errors mean the storage literally isn’t responding for 15 seconds.

Look in the SQL Server error log. Look pretty closely at everything for a few days for errors, then filter for the terms:

  • “Stack Dump”
  • “15 second”
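You can search the log from T-SQL with the (undocumented but widely used) xp_readerrorlog procedure:

```sql
-- First parameter: log file number (0 = current); second: 1 = SQL Server error log
EXEC sys.xp_readerrorlog 0, 1, N'Stack Dump';
EXEC sys.xp_readerrorlog 0, 1, N'15 second';
```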

For more on 15 second errors, check out my previous episode, “Outside the big SAN Box: Identifying Storage Latency in SQL Server.”

Licensing may not be allowing you to use all the CPUs / memory banks

If SQL Server can’t use all your CPUs, it can cause weird high CPU usage on some of them– and in some cases not allow SQL Server to use all the memory on the server, which can cause unexpected slow performance patterns.

I’ve seen this happen in a couple of different ways:

  • SQL Server 2012 has a weird Enterprise Install limited to 20 cores
    • There was an upgrade path for people using “seat based” or CAL licensing to use SQL Server 2012 Enterprise Edition, but the installation limited them to using 20 cores or less.
    • Some people didn’t understand the upgrade licensing completely, and just didn’t realize they hadn’t licensed all the cores. In these cases CPU affinity masking can at least be used to spread the cores in use evenly across all server sockets (be very careful, affinity masking can go wrong)
    • Once I found a case where someone had simply downloaded the incorrect install media from MSDN, and it was used on all the instances at their company by accident. These servers had a lot more than 20 cores, and the name of the download appeared very innocent
    • Aaron Bertrand wrote a post with lots of detail on this on SQLBlog
  • I’ve also seen this happen when people run SQL Server Standard Edition in a VM and accidentally configure the VM to present more than four CPU sockets with a single core each (Standard Edition is limited to four sockets)

Look in the SQL Server Error log after the last startup for the message telling you how many sockets and cores SQL Server detected and make sure it’s using all the cores it should be.
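You can also compare what SQL Server detected against what it's allowed to use. Schedulers in VISIBLE OFFLINE state are CPUs that SQL Server can see but won't run queries on, which often points at a licensing or affinity limit:

```sql
-- Count schedulers by state; VISIBLE OFFLINE = detected but unusable
SELECT [status], COUNT(*) AS scheduler_count
FROM sys.dm_os_schedulers
GROUP BY [status];

-- What SQL Server detected at startup
SELECT cpu_count, hyperthread_ratio
FROM sys.dm_os_sys_info;
```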

Something outside SQL Server may be using CPU

You’ve probably checked for this.

It never hurts to double-check, because I’ve found surprises on servers that I manage. Stuff just creeps in. Someone decides to use the server for testing before you go live and forgets to tell you, and also forgets to turn stuff off.

Or if you’re me, you did something in the middle of the night before the instance went live that you forgot about.

Check the Windows task manager for processes using CPU and memory, and make sure there’s not scheduled tasks or agent jobs that are running things that didn’t used to run on the old server.

You could be using a different version of the SQL Server Cardinality Estimator on the new instance

This is specifically for folks who have upgraded to SQL Server 2014 or higher.

Check your database compatibility level and related settings.

  • Database compatibility level 120 uses a new version of the “cardinality estimator”
  • This can lead to very different plans being generated
  • On SQL Server 2016, you can raise the database compatibility level to 120 or 130 and set LEGACY_CARDINALITY_ESTIMATION=ON for the database to use the old estimator

The new cardinality estimator is a good thing, but some folks have found it causing performance regressions, so it’s worth checking if this is part of your issue.
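To check where you stand, something like this works (the database name is a placeholder; the scoped configuration option requires SQL Server 2016+):

```sql
-- Which compatibility level is each database on?
SELECT [name], [compatibility_level]
FROM sys.databases;

-- Keep the new compatibility level, but use the legacy
-- cardinality estimator for this one database
USE [YourDatabase];  -- placeholder name
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;
```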

SQL Server may be optimizing the top CPU burning queries differently

This is one reason why, instead of moving the files and attaching, I prefer backup and restore (or log shipping for a fast cutover).

  • I like to keep the old copy of the database on the old hardware for at least a month
  • This means if a “slow query” issue comes up, I can run it against the old hardware and compare execution times, to see if it’s really slower or burns more resources
  • I can also compare query execution plans to see if SQL Server is making a different decision about how to run the query

You should be taking a final backup pre-migration no matter what, so you can restore that last backup and compare, even if you moved the database files to do the migration.

In any case, pull the top 10 queries by total CPU usage. Look for:

  • large amounts of reads / indexing opportunities
  • huge amounts of executions per minute
  • plan warnings of any sort (these show up with a little yellow warning sign, like for roadwork)

I like to pull this list prior to migration as well, so I can tell if the top queries afterward are the same or different.
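If you want to roll your own instead of using a tool, a rough version of the query looks like this. It's a sketch: plan cache stats reset on restart and when plans are evicted, so compare samples with care.

```sql
-- Top 10 statements by total CPU since they were cached
SELECT TOP (10)
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.execution_count,
    qs.total_logical_reads,
    SUBSTRING(st.[text], (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.[text])
            ELSE qs.statement_end_offset END
          - qs.statement_start_offset) / 2) + 1) AS statement_text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY qs.total_worker_time DESC;
```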

Free tools to pull top queries:

A new bottleneck may have emerged simply because the new hardware is faster

Even if you’re matching things like the number of CPUs, the CPUs themselves are hopefully faster. Because hardware is just faster.

And since memory is cheaper, we often have a bit more memory in the new SQL Server.

When you make one thing faster, sometimes that can lead to a new bottleneck. I’ve hit this situation:

  • Migrate to a server with faster CPUs and more memory
  • Amount of physical reads goes down, because SQL Server can cache more, and queries are faster
  • Suddenly, blocking problems get worse because query patterns changed

For migrations, I like to capture wait statistics samples for a couple of weeks prior to the migration. That way, after the migration I can compare waits and see if anything new pops up, and track it from there.
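A minimal way to sample waits by hand: the numbers in this DMV are cumulative since startup, so take two snapshots a few minutes apart and diff them.

```sql
-- Cumulative waits since startup; snapshot twice and compare
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count > 0
ORDER BY wait_time_ms DESC;
```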

Two free tools to sample waits:

You could just be hitting a weird wait on the new version

Good news, the wait stats check in the previous section will find this, too!

Sometimes you just get unlucky, and your code hits a weird issue on a new version of SQL Server that you didn’t have on the old one. In one case I had very high CMEMTHREAD waits under load on one specific version of SQL Server.

This is a pretty rare wait type and I had to do a bunch of research on it, and we eventually found a fix for our version.

I always look for weird wait types, and want to compare waits pre and post migration when possible.

You could be getting more load

Sometimes people migrate right before the busy season hits. They know some load is coming, so they buy new hardware.

And sometimes the person doing the DBA work doesn’t get to talk to the people who know when it’s a high or low period for load.

You might be able to see this looking at top queries by execution, but another simple way to check it is by collecting a few performance counters before and after migration.

The “SQL Statistics” Performance Object has three counters that can help measure how much work your instance is doing:

  • Batch requests/sec
  • Compilations/sec
  • Recompilations/sec

Compilations particularly burn CPU, so whenever CPU goes up I’m always interested to see if compilation rates have risen.
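You can sample these from inside SQL Server, too. The "/sec" counters are cumulative, so capture two samples and divide the difference by the elapsed seconds:

```sql
-- Cumulative values; sample twice to get a true per-second rate
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE N'%SQL Statistics%'
  AND counter_name IN (N'Batch Requests/sec',
                       N'SQL Compilations/sec',
                       N'SQL Re-Compilations/sec');
```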

Migration planning and troubleshooting checklists

Some planning notes (to make troubleshooting easier)

  • Keep the original server and database around as part of your plan. Do the migration with backup/restore (or possibly log shipping or mirroring to reduce downtime).
    • Lock accounts out of the old database and even shut off the instance if needed
    • Keep it around for a month in case it’s needed to troubleshoot performance / compare execution plans
  • Collect samples of wait stats during important periods for several weeks prior to migration, save them off
  • Collect top SQL Server queries with plans by CPU, logical reads, and execution count prior to migration
  • Collect performance counters prior to migration:
    • SQL Statistics – Batch requests/sec, Compilations/sec, Recompilations/sec
    • % Processor Time

Troubleshooting post migration

  • Review your Max Degree of Parallelism and Cost Threshold for Parallelism settings
  • Compare all the settings in sys.configurations between the instances to check for differences, and look into each one
  • Check the Windows System log for problematic messages — in this case there were messages about power sources not being fully plugged in, but I look at all messages in the system log (some things are “informational” that are really important)
  • Run a free tool like CPU-Z to measure the clock speed that your processors are actually getting
  • Look at the BIOS on the new servers and make sure that power savings has been disabled
  • Look in the SQL Server Error log after the last startup for the message telling you how many sockets and cores SQL Server detected and make sure it’s using all the cores it should be
  • Check task manager for processes using CPU and memory, and make sure there’s not scheduled tasks or agent jobs that are running things that didn’t used to run on the old server.
  • Review SQL Server Error log for stack dumps, 15 second IO latency warnings, or other errors
  • Check your database compatibility level and related settings to see if you’re using the new cardinality estimator (and weren’t before) – SQL Server 2014+
  • Identify top 10 CPU burning queries. Compare query plans against older instance. Look for large amounts of reads / indexing opportunities, or huge amounts of executions per minute.
  • Sample SQL Server wait stats when performance is poor to see what the top waits are — and if they’re surprising

Fail Over, Fail Again, Fail Better – Preparing for Disaster Recovery (Dear SQL DBA)

You’re setting up SQL Server log shipping for disaster recovery. What else do you need to do to best prepare for a failure?

This is a “listen-able” 30 minute video. Prefer a podcast instead? Find it at littlekendra.com/dearsqldba.


This Week’s Question: Dear SQL DBA…

We are going to start log shipping to reduce our time to recover from an outage. What can I do proactively to make things easier on myself when I want to switch over to the disaster recovery site (especially if the original system is unavailable)? I’m particularly worried about syncing user accounts, and dealing with SQL Agent jobs.

How can I be better prepared for disaster?

Mystified in Montana

One of my favorite quotes is from Samuel Beckett. “Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.”

When things go terribly wrong, our job as DBAs is to make it the best failure possible.

Simply setting up log shipping, mirroring, clustering, or an availability group isn’t enough

The fact that you’re thinking about this is great!

You’re right, there are two major types of fail-overs that you have to think about:

  • Planned failover, when you can get to the original production system (at least for a short time)
  • Unplanned failover, when you cannot get to it

Even when you’re doing a planned failover, you don’t have time to go in and script out settings and jobs and logins and all that stuff.

Timing is of the essence, so you need minimal manual actions.

And you really should have documentation so that whoever is on call can perform the failover, even if they aren’t you.

Step 1: You need a safe test environment with the same security and network setup as production

One of the biggest risks in solving this problem is “whoops, I was trying to plan for disaster, and accidentally messed up production.”

You need an environment where you can restore all the databases, set up tools, and test things out. I usually call this “prod test”, but it goes by “staging” sometimes as well.

You want it to be as similar as possible…

  • If your production environment has one domain and the DR environment has another, you want the same setup with the same trust relationship. It’s safer, though, for prod test not to be in the actual production and DR domains.
  • Same thing for firewall rules.

Don’t cut corners with security to make things easier for disaster failover.

  • Cutting corners with security could cause its OWN disaster
  • You do not want an attacker who compromises your DR site to have access to accounts that give them the production environment as well

Use color coding when registering instance names in SSMS so you know which domain you’re connected to.

I like using ye olde BGINFO.exe (free tool from Microsoft’s Sysinternals team) to color code desktops for when you remote desktop, too.

Step 2: Decide on your balance between process and automation

If you just have a few SQL Server instances to manage for disaster recovery, you may prefer a manual process where:

  • You get the logins and jobs in sync at first
  • You change your release process so that every time security or a job changes, you deploy the change to both production and DR
  • You audit the two environments once every six weeks to make sure they’re in sync

This can be easier for things like creating a new domain security group in the environments, particularly if you don’t have the permissions to automate something like that yourself.

It can also be simpler for things like adding a linked server, where the linked server name is going to be different in the environments.

But if you’re managing more than a couple of instances, and if you have a lot of people who can deploy changes to the SQL Servers, doing it 100% manually just isn’t going to work. In that case you may want to have more automation:

  • Nightly scripts that look for new SQL Server Agent jobs and copy them to the DR environment (you probably want to leave them disabled until they are reviewed in case there’s any chance they can alter production in a negative way)
  • Nightly scripts that look for changes in logins and stage or execute changes appropriately
  • Nightly scripts that look for linked server changes and stage changes in DR or notify you
  • Audit the two environments every six weeks, because no automation is perfect. And weird things like Windows Scheduled Tasks can sneak in (even though they shouldn’t)

“Staging changes” in this case means generating and saving a script for you when you aren’t sure that it’s safe to just set something loose.

Step 3: Test out PowerShell tools for your automation needs

Good news: you don’t have to write everything from scratch. Check out free tools that others have created.

One example is dbatools.io: https://dbatools.io/getting-started/

Disclaimer: I didn’t write this, and I only know enough PowerShell to be dangerous. VERY dangerous. But you’ve got a test environment, right?

Consider testing this (or other tools) out and adapting them to your needs instead of writing all your code from scratch.

For SQL Server login setup, Microsoft has a KB with a script to transfer logins. I believe this is handled by dbatools.io if you use it (but test for your setup, because I don’t know my PowerShell from my Power Rangers)
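If you go the dbatools route, the login and job sync might look something like this. This is a sketch: the server names are placeholders, and the cmdlet names and parameters should be verified against the current dbatools documentation (and tested in your prod-test environment) before you rely on them.

```powershell
# Placeholder server names -- substitute your own.
# Copy logins, keeping SIDs and password hashes intact so
# users aren't orphaned after failover
Copy-DbaLogin -Source PRODSQL01 -Destination DRSQL01 -ExcludeSystemLogins

# Copy Agent jobs, but leave them disabled on the DR side
# until they've been reviewed
Copy-DbaAgentJob -Source PRODSQL01 -Destination DRSQL01 -DisableOnDestination
```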

Step 4: Practice planned AND unplanned failovers in your test environment

You are totally right to worry more about unplanned failovers

But test planned failovers first, because it’s still hard!

4.A Planned failovers

  • With log shipping, it’s important to know how to take the last log backup and make sure that no changes can get into the database
  • You do a special last log backup WITH NORECOVERY. You restore that to the secondary, and you can reverse the log shipping without setting the whole thing up again.
  • That could actually come in handy for some situations, so it’s worth getting really comfortable with in your test environment.
  • You also need steps to turn on backups in the DR site after you fail over. Will you do that manually, or automate it?
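As a rough sketch of that tail-log step (the database name and backup path are placeholders; script and test this carefully for your own environment):

```sql
-- On the primary: a tail-log backup WITH NORECOVERY leaves the
-- database in RESTORING state, so no further changes can get in
BACKUP LOG [YourDatabase]
    TO DISK = N'\\backupshare\YourDatabase_tail.trn'
    WITH NORECOVERY;

-- On the secondary: restore the tail and bring the database online
RESTORE LOG [YourDatabase]
    FROM DISK = N'\\backupshare\YourDatabase_tail.trn'
    WITH RECOVERY;
```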

After failover, test that the application tier can access the database. Even a basic minimal test will help, because it’ll help you find things that need to be reset or modified after databases fail over. Things like:

  • “Orphaned Users”. If you use SQL Authentication, you may need to use sp_change_users_login to fix this, and it’s always good to check for it.
  • Trustworthy database property: this restores as “off”. Not all databases require this, so don’t just set it on everywhere; enable it only for databases that need it.
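To check for and fix orphaned users after failover, a sketch (the user name is a placeholder; ALTER USER ... WITH LOGIN is the newer replacement for sp_change_users_login's auto-fix):

```sql
-- List SQL-authentication users whose SIDs don't match any login
EXEC sys.sp_change_users_login @Action = 'Report';

-- Re-map an orphaned user to the login of the same name
ALTER USER [app_user] WITH LOGIN = [app_user];  -- placeholder name
```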

Tip: Build a checklist for your planned failover. This will also work as a draft for your checklist for unplanned failovers.

4.B Unplanned failovers

  • When practicing unplanned failovers, turn off the log shipping primary in your test environment entirely, so you can’t get to it, then run your practice.
  • One of the big questions you’ll be asked in a real scenario is “how much data did we lose, and can we get it back once production is back online?”
  • Practice estimating how many log backups may have been taken, but not copied to the secondary
  • Practice the scenario of bringing the secondary online, then getting access to the original production environment back after an hour
    • If you don’t drop the original production databases, do you have space to make it a secondary again?
    • If you DO drop the original production databases, how much data will be lost that never made it to the secondary, and what is the business impact?

Evaluating your risks in the DR environment…

  • How long will it be acceptable to run in the DR environment without log shipping set up to another failover environment?
  • How much space do you have in the DR environment for full / differential / log backups?

Usually, practicing DR failovers ends with you realizing you need more space and resources than you originally planned.

That’s totally normal – just make sure you ask for it before the disaster.

Step 5: Practice planned failovers in your production environment

Companies that are serious about planning for failure practice failovers in production. They do it often.

The more often you fail over, the better you get at it. And if you don’t keep up at it, things sneak in that break your processes and automation.

You can’t do this by yourself. Failing over is about a lot more than the databases.

But once you’re practicing regularly outside of production and getting your processes and production environment ready too, it’s time to start suggesting planned production failovers.

Even if the answer is “no”, keep bringing it up on a regular basis in a friendly, positive way.

Got an annual review? An annual review is a great time to bring this up as a place you think your team can improve, and a way you’d like to contribute with the entire IT team.

Recap – Failing better

Here’s the five steps toward disaster preparedness…

  • Step 1: Set up a safe test environment with the same security and network setup as production
  • Step 2: Decide on your balance between process and automation
  • Step 3: Test out PowerShell tools for your automation needs
  • Step 4: Practice planned AND unplanned failovers in your test environment
  • Step 5: Practice planned failovers in your production environment

Even if you can’t complete step 5 (because you can’t do that alone), don’t forget to brag on yourself.

Write up a summary of the work you’ve done for your manager. If you’re blocked due to time or outside resources, summarize what you’d like to do further and what can help you make that happen.

This is the kind of project that people outside your team might not think to ask about. But it’s fantastic work for your team to do, and it’s something you should make sure to talk about!

Next Door to Derpton – When Your Fellow DBA is a Danger to Databases (Dear SQL DBA)

What do you do when your fellow DBA is a ticking time-bomb of bad decisions, waiting to explode your production environment?

Note: This is a “listen-able” video. You can also listen to this as a podcast – learn how at littlekendra.com/dearsqldba.

Here’s Today’s Question

What do I do with a co-worker (who claims to have 20 years being a DBA) who puts all the production databases into Simple recovery mode?


Next Door to Derpton

This is a Tough One…

In this case, we’re assuming that SIMPLE recovery model isn’t appropriate for all those databases, and that losing all the data since the last full (or full + diff) backup might be big trouble for the business.

It’s a difficult situation when one of your peers makes decisions that you feel risk the availability and safety of the data. Going deeper, it’s tough being on a team with someone who you feel doesn’t have the knowledge and skills to do their job.

Especially if they might make more money than you do.

You Need Protection, and Change Management is that Protection

The biggest problem is that your data may be at risk. You need to stabilize the configuration of your environment, and make sure that changes to the configuration go through a good review and approval process.

This may sound like a drag, but it protects you as well. We all have those times where something that seems like a good idea backfires on us.

If you don’t have Change Management, you need to become its champion. There are a lot of ways you can champion this for the sake of good process, and management typically loves it.

If you do have Change Management, your mission is to make sure it’s being used well, and that when changes go wrong, you’re finding root cause.

Be Careful Spinning the Wheel of Blame

Should you tell your boss that your coworker doesn’t know their transaction log from their tempdb?

Usually, no.

If peer review is a part of your work system, it’s OK to be honest during that peer review framework. Make sure you’re being constructive.

In that case, pretend it was you: you’d want to know the extent of where you needed to improve, but you wouldn’t want your nose rubbed in it.

If your boss asks you what your impression is of your coworker’s skills in a private conversation, think through specific changes that have gone wrong and mention those incidents. Request that your boss keep your comments confidential.

Outside of private conversations with the team manager, change the subject. You’re a team. Team dynamics that turn against one team member are bad for the whole team.

If your team is having problems because of misconfigurations and changes that have gone wrong, look through those changes and make recommendations to processes to fix those.

  • Better change review
  • Better adherence to using change control
  • Improving documentation on how to do things / breaking down “knowledge silos”

It’s also OK to be honest about areas where you believe your team needs more training, but talk generally about the team.

It takes really hard work to stay positive and keep it from getting personal in this situation, but it’s absolutely the best thing you can do.

It’s bad to have a coworker who lacks skills and may put your environment at risk. It’s even worse to have them believe you’re out to get them!

What if Your Coworker Regularly Goes Off the Ranch and Doesn’t Use Change Control?

Don’t cover for them.

Ask them about it first to make sure there wasn’t an emergency change request you’re unaware of, but be honest about what happened when you’re asked.

In other words, treat them as an equal and a grown-up.

Sometimes in this situation, people sugarcoat things or cover for the person who makes mistakes. You need to treat them as an adult, though.

If you made mistakes, you would own up to what happened and work to not do it again, right? It’s just about respectfully allowing others to own their actions.

Mindset: Focus on Building Your Own Skills

It’s hard to stay positive in this situation. Your mindset is critical to navigating this successfully without having it drag you down.

As you grow your own skills, you’re likely to work with Junior DBAs more and more.

You’ll need to build strong processes, documentation, and change control to help them succeed.

After working with a peer with those issues, leading Junior DBAs will seem easy, so this is awesome training for a senior level position!

As often as you can, focus on your own learning and your ability to build resilient processes that help people make the right choices (and allow every change to get review and input).
Because after all, that’s good for you at 3 am when the pager goes off, too.