Lost in Performance Troubleshooting – Kendra’s Perf Triage for SQL Server (Dear SQL DBA)

The SQL Server is slow. What should you do? I answer a reader question and share my strategy for performance troubleshooting.

Note: This is a “listen-able” video. You can also listen to this as a podcast – learn how at littlekendra.com/dearsqldba.

This week, our question is about troubleshooting performance

Dear SQL DBA,

Is there any step-by-step script that you would use to start your troubleshooting? My production SQL Server has a problem with memory pressure. Page Life Expectancy is low and lazy writes are high. I heard that Microsoft engineers use SQLdiag and then SQL Nexus to troubleshoot, but I have not had success using those tools.

Lost in Performance Troubleshooting

You need a strategy

Tools change over time – for example, Query Store is a huge new tool in SQL Server 2016 (awesome!)

  • Develop a strong performance troubleshooting strategy
  • Plug tools into your strategy based on the SQL Server version, and whether you can afford a legit monitoring tool for the instance in question

My performance troubleshooting strategy

When things are slow, I want to know three things:

1) What queries are running, how long have they been running, and what are they waiting on?

2) What are the overall waits for the time period?

And never ever skip….

3) Sanity check: Am I missing anything obvious?

Sanity check examples:

  • Messages in the SQL Server Error Log indicating stack dumps, corruption, failed logins
  • Database settings where databases are automatically shrinking
  • SQL Server settings where someone forgot to raise the Max Server Memory, or thought it was set in GB, not MB
  • Messages in the Windows Event Log indicating power problems, failing RAM, etc
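
Two of those sanity checks are easy to script against the catalog views. Here's a minimal sketch (just an illustration, not an official checklist script): it looks at Max Server Memory and flags any databases set to auto-shrink.

    -- Is Max Server Memory still at the 2147483647 MB default,
    -- or set suspiciously low (say, someone thought the unit was GB)?
    SELECT name, value_in_use
    FROM sys.configurations
    WHERE name = N'max server memory (MB)';

    -- Are any databases quietly auto-shrinking?
    SELECT name AS database_name
    FROM sys.databases
    WHERE is_auto_shrink_on = 1;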

Why are wait stats so important?

I’m recommending looking at waits in two out of three steps:

  • What’s running now and what is it waiting on?
  • What are the overall waits?

Waits are critical to performance because they're SQL Server's way of answering the question, "What made this slow?"

Waits are overlooked because they can be confusing to decode, and it takes time to learn to interpret them.

Waits are worth the effort because…

  • They describe slowness, rather than "# of whatevers per whatever." Perhaps I just don't have any more whatevers to do right now.
  • When combined with "What's executing now?", waits point you toward the areas to dig into and help clarify: is the whole server slow? Are one or a few queries slowing the whole server down?
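
To make "What are the overall waits?" concrete: the cumulative numbers live in sys.dm_os_wait_stats. A minimal sketch follows. The view accumulates since the last restart (or since the stats were cleared), so for a specific time period you'd capture it twice and subtract, and a real script would filter out many more benign background waits than this one does.

    -- Top waits since the last restart (or since wait stats were cleared).
    SELECT TOP (20)
           wait_type,
           wait_time_ms,
           signal_wait_time_ms,
           waiting_tasks_count
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'BROKER_TASK_STOP',
                            N'XE_TIMER_EVENT', N'REQUEST_FOR_DEADLOCK_SEARCH') -- a few benign waits; exclude more in practice
    ORDER BY wait_time_ms DESC;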

Given the example, how would memory pressure look with this approach?

What is running? Look at the duration and waits

  • Could be a lot of shorter commands that finish quickly
    • If so: does a current 30-second sample of waits show much higher waits than normal?
  • Could be several longer commands
    • What are they waiting on? Is blocking making some queries slower than others?
    • Save off the queries, execution plans, and wait information
      • They'll disappear, and they're how you sort out whether the query needs an index, whether the storage is slow, etc.
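
Here's a minimal sketch of that "what is running, and what is it waiting on?" check. It's just an illustration; free tools like sp_WhoIsActive give a much richer view, and you'd want to save the output somewhere before it disappears.

    -- What's running right now, how long has it been running, what's it waiting on,
    -- and is anything blocked? Grab the text and plan while you can.
    SELECT r.session_id,
           r.start_time,
           DATEDIFF(SECOND, r.start_time, GETDATE()) AS duration_seconds,
           r.status,
           r.wait_type,
           r.wait_time AS wait_time_ms,
           r.blocking_session_id,
           t.text AS query_text,
           p.query_plan
    FROM sys.dm_exec_requests AS r
    OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    OUTER APPLY sys.dm_exec_query_plan(r.plan_handle) AS p
    WHERE r.session_id <> @@SPID
      AND r.session_id > 50; -- rough filter to skip most system sessions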

What are things waiting on? In a memory pressure situation, I’d expect to see…

  • PAGEIOLATCH waits (usually).
    • If data isn’t in memory, we have to go read from storage, and this wait indicates we’re reading from storage a lot.
  • LCK waits (possibly)
    • When queries get slow, sometimes they start keeping each other from making progress, creating a backlog that can take over memory.
  • RESOURCE_SEMAPHORE waits (sometimes) – a specific kind of memory pressure where queries can't even get the memory grants they need to start running
    • This may happen if there's a big backlog of queries, or if a process kicks off a bunch of queries that all need large memory grants at once
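
If you suspect RESOURCE_SEMAPHORE specifically, there's a DMV for that. A minimal sketch: rows where grant_time is NULL are queries still waiting in line for workspace memory.

    -- Queries waiting on (or holding) query workspace memory grants.
    -- A NULL grant_time means the query is still waiting: classic RESOURCE_SEMAPHORE.
    SELECT session_id,
           request_time,
           grant_time,
           requested_memory_kb,
           granted_memory_kb,
           wait_time_ms
    FROM sys.dm_exec_query_memory_grants
    ORDER BY requested_memory_kb DESC;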

What if I’m not at the keyboard when performance is bad?

You can’t always be there when the SQL Server is slow.

Plus, you need a baseline of waits. (Troubleshooting is hard enough when you do know what's normal, and much harder when you don't!)

Specialized SQL Server monitoring tools collect wait information.

There are also free tools out there you can use, but you need to make sure you don't impact the system too much.
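
One free, low-impact way to start a baseline is to snapshot the wait stats DMV into a table on a schedule and diff the rows later. A hypothetical sketch, assuming a table named dbo.wait_stats_history and something like an Agent job to run the INSERT every 15 minutes:

    -- Hypothetical baseline table: snapshot sys.dm_os_wait_stats on a schedule,
    -- then compare adjacent snapshots to see waits per interval.
    CREATE TABLE dbo.wait_stats_history
    (
        collected_at        DATETIME2(0) NOT NULL DEFAULT SYSDATETIME(),
        wait_type           NVARCHAR(60) NOT NULL,
        wait_time_ms        BIGINT       NOT NULL,
        signal_wait_time_ms BIGINT       NOT NULL,
        waiting_tasks_count BIGINT       NOT NULL
    );

    INSERT dbo.wait_stats_history (wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count)
    SELECT wait_type, wait_time_ms, signal_wait_time_ms, waiting_tasks_count
    FROM sys.dm_os_wait_stats;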

Free SQL Server performance troubleshooting tools

Several fine paid vendor tools provide all of these things for you in one interface. But I'm talking about the free ones here.

Waits Reference – Paul Randal’s SQL Server Wait Library: https://www.sqlskills.com/help/waits/

1) What queries are running, how long have they been running, and what are they waiting on?

2) What are the overall waits for the time period? (You only really need one of these tools, but try them out and see which you like best.)

3) Sanity check: Am I missing anything obvious? (You only really need one of these tools, but try them out and see which you like best.)

How will SQL Server 2016 change this?

The new Query Store feature doesn’t collect wait statistics

  • However, it does collect what was running, how long it was running, and what the plan was
  • This is very powerful
  • Wait statistics remain important to collect, to decode what's making things slow, but Query Store will become the "new best friend" of wait stats
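
For reference, turning Query Store on in SQL Server 2016 is a single database-level setting. A minimal sketch, assuming a database named YourDatabase (the options here are just reasonable-looking starting points, not recommendations):

    -- Enable Query Store on a SQL Server 2016 database (hypothetical database name).
    ALTER DATABASE [YourDatabase]
    SET QUERY_STORE = ON
        (OPERATION_MODE = READ_WRITE,
         QUERY_CAPTURE_MODE = AUTO,
         MAX_STORAGE_SIZE_MB = 1024);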

And hey, who knows if Query Store might start to collect waits as well in the future … or if it might have a Wait Store friend? (No secret knowledge here, just a wish… because that would be awesome.)


11 Responses to Lost in Performance Troubleshooting – Kendra’s Perf Triage for SQL Server (Dear SQL DBA)

  1. Tom V June 2, 2016 at 8:12 am #

    Dear DBA,

    I believe you copy pasted the wrong text in the href attribute of the query store link at the end.

    Regards,
    A Reader

  2. Bill June 6, 2016 at 11:06 am #

    All great suggestions. I am adding a little more about what Microsoft uses.

Microsoft uses an internal version of http://diagmanager.codeplex.com/. Diagmanager creates a self-extracting executable, which includes pssdiag.cmd, which calls sqldiag and uses a pssdiag.xml file for configuration of sqldiag.

Diagmanager offers two templates, “Standard” and “Detailed”. I typically choose “Standard” and use Diagmanager to manually add the events “Stored Procedures\SP:StmtStarting”, “Stored Procedures\SP:StmtCompleted”, and “Performance\Showplan Statistics Profile”. Yes – Diagmanager harnesses tracing, not extended events (for the most part). That means files should be extracted to a disk attached to your SQL Server box, preferably a very fast disk and (depending upon what events are chosen) preferably with XX GB of free disk space. And that means you will want to monitor disk space consumption while pssdiag is running, as you probably will collect XX GB in 20 minutes. It also means you will need to be very careful about the events you choose to trace on a very busy system (a system with low to mid thousands of batch requests/sec, or more). About every 10 seconds while pssdiag is running, it polls various DMVs, including dm_os_wait_stats, and it collects perfmon metrics (5 second polling, by default). Think of pssdiag as a glorified task scheduler or a juggler – it collects trace files, DMVs, perfmon logs, and event logs. It can be scheduled and it can keep just the latest N trace files. Even so, pssdiag is not going to collect all data for all performance problems (my crude guess is ~80%).

After pssdiag has been stopped, its data needs to be moved off the production box to another (non-production) server for resource-intensive processing by http://sqlnexus.codeplex.com/. SQL Nexus imports the trace files by calling readtrace.exe (which is the resource hog). Readtrace replaces every statement’s or batch’s scalars/literals/constants with tokens, so that results can be aggregated – it can run for tens of minutes. SQL Nexus also imports dm_os_wait_stats and a few other DMVs, plus a few of the more basic perfmon counters. The aggregated trace results are tabulated and graphed in SSRS, with the ability to choose a specific time period (within the whole pssdiag collection period) or compare results with another pssdiag collection (a baseline, for example). I usually go right to SQL Nexus’ Bottleneck Analysis page (which shows dm_os_wait_stats’ differential) and use that page to focus my thoughts when next reading the readtrace results page.

Pssdiag and readtrace (SQL Nexus) are well suited for Microsoft Support’s needs – what readtrace.exe does has been automated. When a customer uploads pssdiag output to Microsoft Support, a link to the readtrace results and SSRS report is emailed internally to the MS Support employee, who will inspect it and call the customer once that employee’s analysis is done.

For me, pssdiag and readtrace/SQL Nexus are just a couple of tools in my bucket of tools, ones that I keep in my back pocket for times when I believe my ad-hoc investigations are going nowhere. I expect Extended Events to replace pssdiag’s tracing sooner rather than later (its main author, Jack Li, is a busy person :)).

    • Kendra Little June 6, 2016 at 11:10 am #

      Thanks for your comment!

      I can’t recommend Performance\Showplan Statistics Profile, though. Tracing that event can slow down performance significantly, unfortunately, no matter how fast the disk that you’re writing to.

      • Bill June 6, 2016 at 5:04 pm #

        Yeah. Showplan stats is text, not XML, so it’s the smallest of the bunch, but it still generates enough data to cause tracing of my system to put out 50 GB in 20 minutes. I trace to SSDs, and my systems go up to about 2000 batches/sec, with an average of 1 and a fraction of a statement per batch.

SQL Nexus can parse showplan stats (not the newer and larger XML events) and, based upon what was parsed, will report on every operator in every plan that exhibits a cardinality estimate concern. So showplan stats does have a use, sometimes. Otherwise it bloats trace file throughput demands by a factor of about 10 to 1 :).

  3. Bjoern June 7, 2016 at 3:24 am #

    Hi,

    I prefer it the other way round…
    first have a look for any general issues in the error log or Event Viewer, then run Brent’s awesome scripts. If that doesn’t turn up any hint worth diving deeper on… then I’ll go deeper with “whoisactive” and see who is increasing the system load. My last step is to analyze some “heavy load” statements which might be the troublemakers…

    But many thanks to Kendra for writing down some advice on “free” tools, and for her nice blog post.

  4. Uday June 8, 2016 at 6:38 am #

    Thanks for the write-up… it’s really helpful.

  5. Red8Rain June 8, 2016 at 9:31 am #

    A good third-party tool I find useful is DPA by SolarWinds. Some issues still require digging, but DPA gives good insight.

  6. Ray Herring December 7, 2016 at 8:40 am #

    Some sanity checks are not so obvious, particularly when you only look within SQL.
    For example, I found out that by default many (all? most?) Windows Server installations have a scheduled Windows Task Scheduler (not SQL Agent) job to defrag disks. In my case it was scheduled for 2 AM every Tuesday. It took a while to track that down.
    If your server is virtual then there are some not so obvious external items that can seriously affect performance. (Over committed memory, poorly configured virtual disks, etc.)

    One of my favorite debugging axioms (from a grizzled old mentor) is “If the problem is not where you are looking, then LOOK SOMEWHERE ELSE”. His point was, if you don’t find a problem in your house after a reasonable look then it is time to at least consider the problem may be next door, or upstairs, or …

Trackbacks/Pingbacks

  1. Identifying Storage and SAN Latency in SQL Server (Dear SQL DBA) - by Kendra Little - June 16, 2016

    […] Lots of reads but very low latency is evidence that you’re using the storage, but things are going swimmingly. And that may be the case– be open to that!  The beauty of this DMV is that it can tell you if the problem is elsewhere. (If so, start here.) […]

  2. Administering COTS databases (ISVs / Third Party Vendors) - by Kendra Little - January 10, 2017

    […] performance tuning COTS databases (aka third party databases), I use the same performance tuning methodology I do for other databases — but I know that re-writing the queries is typically a last resort […]
