Lost in Performance Troubleshooting – Kendra’s Perf Triage for SQL Server (Dear SQL DBA)

The SQL Server is slow. What should you do? I answer a reader question and share my strategy for performance troubleshooting.

Note: This is a “listen-able” video. You can also listen to this as a podcast – learn how at littlekendra.com/dearsqldba.

This week, our question is about troubleshooting performance

Dear SQL DBA,

Is there any step-by-step script that you would use to start your troubleshooting? My production SQL Server has a problem with memory pressure. Page Life Expectancy is low and lazy writes are high. I heard that Microsoft engineers use SQLdiag and then SQL Nexus to troubleshoot, but I have not had success using those tools.

Lost in Performance Troubleshooting

You need a strategy

Tools change over time – for example, Query Store is a huge new tool in SQL Server 2016 (awesome!)

  • Develop a strong performance troubleshooting strategy
  • Plug tools into your strategy based on the SQL Server version, and whether you can afford a legit monitoring tool for the instance in question

My performance troubleshooting strategy

When things are slow, I want to know three things:

1) What queries are running, how long have they been running, and what are they waiting on?

2) What are the overall waits for the time period?

And never ever skip….

3) Sanity check: Am I missing anything obvious?

Sanity check examples:

  • Messages in the SQL Server Error Log indicating stack dumps, corruption, failed logins
  • Database settings where databases are automatically shrinking
  • SQL Server settings where someone forgot to raise the Max Server Memory, or thought it was set in GB, not MB
  • Messages in the Windows Event Log indicating power problems, failing RAM, etc
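
Two of those sanity checks are easy to script against the catalog views. Here's a minimal sketch (just an illustration, not an official checklist script): it looks at Max Server Memory and flags any databases set to auto-shrink.

    -- Is Max Server Memory still at the 2147483647 MB default,
    -- or set suspiciously low (say, someone thought the unit was GB)?
    SELECT name, value_in_use
    FROM sys.configurations
    WHERE name = N'max server memory (MB)';

    -- Are any databases quietly auto-shrinking?
    SELECT name AS database_name
    FROM sys.databases
    WHERE is_auto_shrink_on = 1;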

Why are wait stats so important?

I’m recommending looking at waits in two out of three steps:

  • What’s running now and what is it waiting on?
  • What are the overall waits?

Waits are critical to performance because they're SQL Server's way of answering the question, "What made this slow?"

Waits are overlooked because they can be confusing to decode, and it takes time to learn to interpret them.

Waits are worth the effort because…

  • They describe slowness, rather than "# of whatevers per whatever." Perhaps I just don't have any more whatevers to do right now.
  • When combined with "What's executing now?", waits point you toward the areas to dig into and help clarify: is the whole server slow? Are one or a few queries slowing the whole server down?
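
To make "What are the overall waits?" concrete: the cumulative numbers live in sys.dm_os_wait_stats. A minimal sketch follows. The view accumulates since the last restart (or since the stats were cleared), so for a specific time period you'd capture it twice and subtract, and a real script would filter out many more benign background waits than this one does.

    -- Top waits since the last restart (or since wait stats were cleared).
    SELECT TOP (20)
           wait_type,
           wait_time_ms,
           signal_wait_time_ms,
           waiting_tasks_count
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'BROKER_TASK_STOP',
                            N'XE_TIMER_EVENT', N'REQUEST_FOR_DEADLOCK_SEARCH') -- a few benign waits; exclude more in practice
    ORDER BY wait_time_ms DESC;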

Given the example, how would memory pressure look with this approach?

What is running? Look at the duration and waits

  • Could be a lot of shorter commands that finish quickly
    • If so: does a current 30-second sample of waits show much higher waits than normal?
  • Could be several longer commands
    • What are they waiting on? Is blocking making some queries slower than others?
    • Save off the queries, execution plans, and wait information
      • They'll disappear, and they're how you sort out whether the query needs an index, whether the storage is slow, etc.
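
Here's a minimal sketch of that "what is running, and what is it waiting on?" check. It's just an illustration; free tools like sp_WhoIsActive give a much richer view, and you'd want to save the output somewhere before it disappears.

    -- What's running right now, how long has it been running, what's it waiting on,
    -- and is anything blocked? Grab the text and plan while you can.
    SELECT r.session_id,
           r.start_time,
           DATEDIFF(SECOND, r.start_time, GETDATE()) AS duration_seconds,
           r.status,
           r.wait_type,
           r.wait_time AS wait_time_ms,
           r.blocking_session_id,
           t.text AS query_text,
           p.query_plan
    FROM sys.dm_exec_requests AS r
    OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    OUTER APPLY sys.dm_exec_query_plan(r.plan_handle) AS p
    WHERE r.session_id <> @@SPID
      AND r.session_id > 50; -- rough filter to skip most system sessions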

What are things waiting on? In a memory pressure situation, I’d expect to see…

  • PAGEIOLATCH waits (usually).
    • If data isn’t in memory, we have to go read from storage, and this wait indicates we’re reading from storage a lot.
  • LCK waits (possibly)
    • When queries get slow, sometimes they start keeping each other from making progress, creating a backlog that can take over memory.
  • RESOURCE_SEMAPHORE waits (sometimes) – a specific kind of memory pressure where queries can't even get the memory grants they need to start running
    • This may happen if there's a big backlog of queries, or if a process kicks off a bunch of queries that all need large memory grants at once
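
If you suspect RESOURCE_SEMAPHORE specifically, there's a DMV for that. A minimal sketch: rows where grant_time is NULL are queries still waiting in line for workspace memory.

    -- Queries waiting on (or holding) query workspace memory grants.
    -- A NULL grant_time means the query is still waiting: classic RESOURCE_SEMAPHORE.
    SELECT session_id,
           request_time,
           grant_time,
           requested_memory_kb,
           granted_memory_kb,
           wait_time_ms
    FROM sys.dm_exec_query_memory_grants
    ORDER BY requested_memory_kb DESC;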

What if I’m not at the keyboard when performance is bad?

You can’t always be there when the SQL Server is slow.

Plus, you need a baseline of waits. (Troubleshooting is hard enough when you do know what's normal, and much harder when you don't!)

Specialized SQL Server monitoring tools collect wait information.

There are also free tools out there you can use, but you need to make sure you don't impact the system too much.
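
One free, low-impact way to start a baseline is to snapshot the wait stats DMV into a table on a schedule and diff the rows later. A hypothetical sketch, assuming a table named dbo.wait_stats_history and something like an Agent job to run the INSERT every 15 minutes:

    -- Hypothetical baseline table: snapshot sys.dm_os_wait_stats on a schedule,
    -- then compare adjacent snapshots to see waits per interval.
    CREATE TABLE dbo.wait_stats_history
    (
        collected_at        DATETIME2(0) NOT NULL DEFAULT SYSDATETIME(),
        wait_type           NVARCHAR(60) NOT NULL,
        wait_time_ms        BIGINT       NOT NULL,
        signal_wait_time_ms BIGINT       NOT NULL,
        waiting_tasks_count BIGINT       NOT NULL
    );

    INSERT dbo.wait_stats_history (wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count)
    SELECT wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count
    FROM sys.dm_os_wait_stats;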

Free SQL Server performance troubleshooting tools

Several fine paid vendor tools provide all of these things for you in one interface. But I'm talking about the free ones here.

Waits Reference – Paul Randal’s SQL Server Wait Library: https://www.sqlskills.com/help/waits/

1) What queries are running, how long have they been running, and what are they waiting on?

2) What are the overall waits for the time period? (You only really need one of these tools, but try them out and see which you like best.)

3) Sanity check: Am I missing anything obvious? (You only really need one of these tools, but try them out and see which you like best.)

How will SQL Server 2016 change this?

The new Query Store feature doesn’t collect wait statistics

  • However, it does collect what was running, how long it was running, and what the plan was
  • This is very powerful
  • Wait statistics remain important to collect, to decode what's making things slow, but Query Store will become the "new best friend" of wait stats
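
For reference, turning Query Store on in SQL Server 2016 is a single database-level setting. A minimal sketch, assuming a database named YourDatabase (the options here are just reasonable-looking starting points, not recommendations):

    -- Enable Query Store on a SQL Server 2016 database (hypothetical database name).
    ALTER DATABASE [YourDatabase]
    SET QUERY_STORE = ON
        (OPERATION_MODE = READ_WRITE,
         QUERY_CAPTURE_MODE = AUTO,
         MAX_STORAGE_SIZE_MB = 1024);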

And hey, who knows if Query Store might start to collect waits as well in the future … or if it might have a Wait Store friend? (No secret knowledge here, just a wish… because that would be awesome.)


11 Responses to Lost in Performance Troubleshooting – Kendra’s Perf Triage for SQL Server (Dear SQL DBA)

  1. Tom V June 2, 2016 at 8:12 am #

    Dear DBA,

    I believe you copy pasted the wrong text in the href attribute of the query store link at the end.

    Regards,
    A Reader

  2. Bill June 6, 2016 at 11:06 am #

    All great suggestions. I am adding a little more about what Microsoft uses.

Microsoft uses an internal version of http://diagmanager.codeplex.com/. Diagmanager creates a self-extracting executable, which includes pssdiag.cmd, which calls sqldiag and uses a pssdiag.xml file for configuration of sqldiag.

Diagmanager offers two templates, “Standard” and “Detailed”. I typically choose “Standard” and use Diagmanager to manually add the events “Stored Procedures\SP:StmtStarting”, “Stored Procedures\SP:StmtCompleted”, and “Performance\Showplan Statistics Profile”. Yes – Diagmanager harnesses tracing, not extended events (for the most part). That means files should be extracted to a disk attached to your SQL Server box, preferably a very fast disk and (depending upon what events are chosen) preferably with XX GB of free disk space. And that means you will want to monitor disk space consumption while pssdiag is running, as you probably will collect XX GB in 20 minutes. It also means you will need to be very careful about the events you choose to trace on a very busy system (a system with low to mid thousands of batch requests/sec, or more). About every 10 seconds while pssdiag is running, it polls various DMVs, including dm_os_wait_stats, and it collects perfmon metrics (5 second polling, by default). Think of pssdiag as a glorified task scheduler or a juggler – it collects trace files, DMVs, perfmon logs, and event logs. It can be scheduled and it can keep just the latest N trace files. Even so, pssdiag is not going to collect all data for all performance problems (my crude guess is ~80%).

After pssdiag has been stopped, its data needs to be moved off the production box to another (non-production) server for resource-intensive processing by http://sqlnexus.codeplex.com/. SQL Nexus imports the trace files by calling readtrace.exe (which is the resource hog). Readtrace replaces every statement’s or batch’s scalars/literals/constants with tokens, so that results can be aggregated – it can run for tens of minutes. SQL Nexus also imports dm_os_wait_stats and a few other DMVs, plus a few of the more basic perfmon counters. The aggregated trace results are tabulated and graphed in SSRS, with the ability to choose a specific time period (within the whole pssdiag collection period) or compare results with another pssdiag collection (a baseline, for example). I usually go right to SQL Nexus’ Bottleneck Analysis page (which shows dm_os_wait_stats’ differential) and use that page to focus my thoughts when next reading the readtrace results page.

Pssdiag and readtrace (SQL Nexus) are well suited for Microsoft Support’s needs – what readtrace.exe does has been automated. When a customer uploads pssdiag output to Microsoft Support, a link to the readtrace results and SSRS report is emailed internally to the MS Support employee, who will inspect it and call the customer once that employee’s analysis is done.

For me, pssdiag and readtrace/SQL Nexus are just a couple of tools in my bucket of tools, ones that I keep in my back pocket for times when I believe my ad-hoc investigations are going nowhere. I expect Extended Events to replace pssdiag’s tracing sooner rather than later (its main author, Jack Li, is a busy person :)).

    • Kendra Little June 6, 2016 at 11:10 am #

      Thanks for your comment!

      I can’t recommend Performance\Showplan Statistics Profile, though. Tracing that event can slow down performance significantly, unfortunately, no matter how fast the disk that you’re writing to.

      • Bill June 6, 2016 at 5:04 pm #

        Yeah. Showplan stats is text, not XML, so it’s the smallest of the bunch, but it still generates enough data to cause tracing of my system to put out 50 GB in 20 minutes. I trace to SSDs, and my systems go up to about 2000 batches/sec, with an average of 1 and a fraction of a statement per batch.

SQL Nexus can parse showplan stats (not the newer and larger XML events) and, based upon what was parsed, will report on every operator in every plan that exhibits a cardinality estimate concern. So showplan stats does have a use, sometimes. Otherwise it bloats trace file throughput demands by a factor of about 10 to 1 :).

  3. Bjoern June 7, 2016 at 3:24 am #

    Hi,

    I prefer it the other way round…
    first have a look for any general issues in the error log or Event Viewer, then run Brent’s awesome scripts. If that doesn’t turn up any hint worth diving deeper on… then I’ll go deeper with “whoisactive” and see who is increasing the system load. My last step is to analyze some “heavy load” statements which might be the troublemakers…

    But many thanks to Kendra for writing down some advice on “free” tools, and for her nice blog post.

  4. Uday June 8, 2016 at 6:38 am #

    Thanks for the write-up… it’s really helpful.

  5. Red8Rain June 8, 2016 at 9:31 am #

    A good third-party tool I find useful is DPA by SolarWinds. Some issues still require digging, but DPA gives good insight.

  6. Ray Herring December 7, 2016 at 8:40 am #

    Some sanity checks are not so obvious, particularly when you only look within SQL.
    For example, I found out that by default many (all? most?) Windows Server installations have a scheduled Windows Task Scheduler (not SQL Agent) job to defrag disks. In my case it was scheduled for 2 AM every Tuesday. It took a while to track that down.
    If your server is virtual then there are some not so obvious external items that can seriously affect performance. (Over committed memory, poorly configured virtual disks, etc.)

    One of my favorite debugging axioms (from a grizzled old mentor) is “If the problem is not where you are looking, then LOOK SOMEWHERE ELSE”. His point was, if you don’t find a problem in your house after a reasonable look then it is time to at least consider the problem may be next door, or upstairs, or …

Trackbacks/Pingbacks

  1. Identifying Storage and SAN Latency in SQL Server (Dear SQL DBA) - by Kendra Little - June 16, 2016

    […] Lots of reads but very low latency is evidence that you’re using the storage, but things are going swimmingly. And that may be the case– be open to that!  The beauty of this DMV is that it can tell you if the problem is elsewhere. (If so, start here.) […]

  2. Administering COTS databases (ISVs / Third Party Vendors) - by Kendra Little - January 10, 2017

    […] performance tuning COTS databases (aka third party databases), I use the same performance tuning methodology I do for other databases — but I know that re-writing the queries is typically a last resort […]
