The Learner’s Guide to SQL Server Performance Triage
I’m introducing a series of “learner’s guides” — overviews of a given topic, chock full of links and references. For this first post, the information is based on what I learned when I was part of the team at Brent Ozar Unlimited who put together the original First Responder Kit and built a related consulting practice using those tools.
In this post…
- What is performance triage, and when do we do it?
- Who does performance triage?
- “DMVs” revolutionized the way performance triage is done
- Sources for information in performance triage
- When is performance triage difficult?
- SQL Server Performance Triage Checklist
- Common mistakes and pitfalls
- How is this evolving, and what does the future look like?
What is performance triage, and when do we do it?
Performance triage is done when a production database is suspected to be the cause of slowness or of loss of availability. Many times when perf triage is done it’s unclear what the source of the problem is, and the goal is to either identify what is happening in the database or instance to cause this problem, or to rule out the database as much as possible.
In other words, most applications aren’t written with enough built in observability to clearly tell if the problem is in the application layer, the database layer, the network or somewhere else. But the database is often an early suspect in an investigation.
Who does performance triage?
Database administrators (DBAs) have been the main people to do performance triage over the last 20 years. This is largely due to separation of duties: DBAs have typically been responsible for the performance and availability of production databases, while access has been limited to these environments for developers. Often DBAs are part of an IT group which is managed as a separate business unit from that of software development. DBAs also often are responsible for databases developed externally from the company, such as vendor databases.
In recent years, responsibility for production environments has been shifting, at least when it comes to in-house development. As organizations work to be able to deliver new features to customers more quickly in competitive environments, movements such as Agile development, DevOps, and Infrastructure as Code have been shifting roles: it is more common than ever for developers to need to maintain responsibility for their code throughout the entire deployment process, and for Database Administrators to be asked to participate in the development process earlier to help with quality design and testing.
Monitoring tooling also better enables communication to both of these teams, and automation is used to help manage permissions for different environments, allowing access to be granted quickly, but only when it is needed and approve.
“DMVs” revolutionized the way performance triage is done
Prior to the release of SQL Server 2005, “Profiling” using SQL Trace was the most widely used methodology, most often run by the SQL Server Profiler tool.
SQL Server 2005 changed methodologies dramatically with the introduction of Dynamic Management Views and functions, which are commonly referred to as “DMVs”. DMVs hold a rich amount of information regarding the present and recent aggregate performance of a SQL Server instance and its databases. The major benefits of DMVs include:
- Much lighter overhead than tracing
- Data on everything from internal performance of the SQL Server to which user queries have been run and how often
However, many DMVs in SQL Server are cleared when different events occur, such as an instance restarting, a database going offline, a query recompiling, a configuration changing, or memory pressure. For this reason, monitoring tools are very popular and are used to harvest information from DMVs (as well as the sources listed below), store them in a separate repository, and use them to provide historic performance information.
Sources for information in performance triage
In addition to dynamic management views, monitoring tools and folks performing manual triage often reference:
- Operating system error logs. Windows is most commonly used to run SQL Server in production environments, so this is usually the Windows Event log, both System and Application logs
- Windows Cluster Manager logs if Failover Clusters and / or Availability Groups are in use
- SQL Server Error Log – SQL Server maintains its own log of errors
- Virtualization metrics (if applicable)
- Performance counters – SQL Server performance counters can be accessed via Perfmon or queried via SQL Server in sys.dm_os_performance_counters
- There is some overlap between performance counters and DMVs – occasionally perf counters are either easier to use or hold information not available or easily attainable from DMVs
- OS performance counters are available via Perfmon and are often relevant as well
- SQL Trace (either Server Side SQL Trace or live SQL Trace via SQL Profiler) and Extended Events (XE) traces – these have more impact on the SQL Server and it’s easier to cause a performance problem with them, but they can be useful when handled with care
- Query execution plans and related query runtime metrics – execution plans may be harvested via DMVs, Query Store, or a trace (warning: these are expensive to collect in traces, even using Extended Events)
When is performance triage difficult?
Performance triage is most difficult when monitoring of the SQL Server either isn’t in place, or monitoring is only being done by a high level “platform” style monitoring system such as System Center Operations Manager (SCOM). This is because a significant amount of critical information about performance is cleared when failovers happen, restarts occur, configuration is changed, or memory pressure occurs.
While “platform” tools sometimes do have specialized management packs to attempt to gather some deeper information, in practice most teams find it difficult to harvest and use this information. For this reason, it’s quite common for teams to use both a platform level tool for high level monitoring across many types of databases, applications, and services, as well as a specialized SQL Server monitoring tool for monitoring databases and instances, both on-prem and in the cloud.
While it is possible to do performance triage in a reactive model and begin gathering information after the problem happens, in this mode you are left waiting for the problem to happen again before you can begin to diagnose it.
If you’d like to see an example of SQL Server specialized monitoring running against a live system, Redgate’s SQL Monitor has an online demo which runs against environments including the live SQL Server Central databases as well as cloud PAAS workloads. (Disclaimer: I work for Redgate.)
SQL Server Performance Triage Checklist
A frequent check: What’s happening now?
Whenever I begin triage, I often start with checking: what queries are running RIGHT NOW? No matter what monitoring system or tooling is in place I like doing this check with the free utility sp_WhoIsActive. I like this because sp_WhoIsActive is a stored procedure you may run against the instance, and like many operations people I like to see information in multiple tools — so a monitoring tool can help confirm what I see in sp_WhoIsActive and vice-versa.
Throughout triage I typically stop and run sp_WhoIsActive multiple times to confirm that the instance is responding quickly and that I don’t see signs that something is going wrong. After all, you never want to make a problem worse by triaging it, and you don’t want to be oblivious to what is going on while looking at logs of past behavior.
Phase 1 – look for big gotchas
When triaging performance I generally want to start big picture and focus on the following sources:
- Windows Error Log – How many times did I forget to check this at the beginning, only to realize there was an obvious error which was either the source of the problem or a major immediate clue? At least 10 times. Do a quick pass on this early on!
- Key virtualization performance counters or checks – Using a hypervisor? Check (or ask the managing team to check) if the balloon driver kicked in when you had the problem, the VM was migrated between hosts, or something else unusual occurred.
- SQL Server Error Log – Have there been stack dumps? Did the instance restart? Are there messages about super slow storage latency? Again, this is always worth a quick pass early on as it can save you loads of time.
- Do we have a ridiculous lack of resources? – Spending extensive time troubleshooting performance on a 1 vCPU instance or one with 4GB of memory is going to cost way more operationally than is worth it.
- I like to use the free sp_Blitz script from Brent Ozar Unlimited (note: I’m biased, I worked there!) to quickly check for the following. Just be sure to get used to the script so you don’t get lost in the details and can find the high level info quickly when you run it. (Again, I like to cross-check this information in a monitoring tool — I’m all about triangulation and multiple tools, though.)
- Bad settings which can harm performance — super low max memory, auto-close or auto-shrink on databases
- Signs of “poison wait stats” which can kill performance (I think I may be the person who coined that term! It me!)
- Signs of super-slow storage from storage wait stats DMVs like sys.dm_io_virtual_file_stats – this DMV stores an average since startup, so don’t jump to conclusions too fast, but sometimes this holds the answer
- Database backups / RPO and RTO – Although database backups / recovery models are unlikely to be the cause of your performance or availability problem, please do take the time to glance at these: because one problem is that often backups get cancelled when performance is bad in order to lighten load on the system. This can leave you exposed to MAJOR RISKS. You don’t want to be tweaking configuration and skipping backups. Well, you just don’t want to be skipping backups.
- Major missing indexes – SQL Server records when it thinks a nonclustered disk-based rowstore index will be useful to speed up a query in its DMVs and execution plans. Don’t assume that something in this list is the reason poor performance happens, but possibly this will inform your list of “suspects”
Phase 2 – Narrow down: is it the infrastructure, is it the workload, is it one query?
If I haven’t found a clear direction to go in so far and instead I’ve just ruled out the big gotchas, now I start narrowing in. Hopefully I have either a specific time period that I’m looking at where I know the problem occurred, OR I know it’s an ongoing problem which I can observe now. When it comes to anything that isn’t constant, having a specialized monitoring tool is critical.
If I don’t have a specialized SQL Server Monitoring tool, I need to start sampling and capturing information without trying to put too much pressure on the system. This can be done, but you can’t do everything and once and you have to be careful — so you must do a dance of watching what is running, sampling information, analyzing it, and then moving on to more samples. Data I sample or review in a monitoring tool includes things like:
- SQL Server wait statistics for the impacted time – sys.dm_os_wait_stats and sys.dm_os_waiting_tasks
- These are an invaluable tool to finding out why SQL Server is slow, but it’s quite an art to use them. You need to know what to ignore, which waits are available in different versions of SQL Server and what they mean, and which waits are more problematic than others.
- It’s incredibly useful to have a baseline of wait statistics, either collected manually or by your monitoring system. Erin Stellato gives a background on wait stats and explains why here.
- Storage waits / samples of sys.dm_io_virtual_file_stats and / or performance counter information for Physical Disk (Avg Disk sec/read and write, Disk Reads/sec and Writes/sec) for the impacted time
- How much are/were we using storage when there is a problem?
- How much storage latency do we have when there was a problem? (What latency is acceptable? As Paul Randal says, everyone has their own idea of what that is — here is his take.)
- Blocking and/or deadlock reports, if available
- If you see lock waits in the wait statistics, blocking may have been part of the issue
- If you have a monitoring tool, it will help you analyze the chain of blocking and view deadlock graphs to decode which queries are involved
- Query durations and runtimes in the problem period, along with query execution plans
- Both monitoring tool repositories and Query Store are rich sources of information for this
- If you don’t have either of those, you can reference sys.dm_exec_query_stats and related DMVs, but will need to play a guessing game about which of your suspects were running in the problem period, what their duration was then, and what execution plans they had
Common mistakes and pitfalls
The biggest mistake that I’ve seen with this is that folks rely too much on tracing. Not only can tracing slow down the SQL Server, but it slows down your triage process:
- You have to wait for the problem to happen again
- Traces generate a lot of data which are time consuming to sort through and analyze
- You have to be cautious about the amount of things you collect to avoid causing a whole new performance problem, so it’s not uncommon to find that you weren’t tracing the right things to troubleshoot the issue after it occurs, and to need to start all over again
My most common personal mistake is that I’ve forgotten to check for errors in the OS and SQL Server Logs early, and missed clear indications of the source of the problem, wasting loads of time. This is easy to do in a stressful situation even when you have practice, so I recommend having a checklist or runbook for your team to use when triaging.
Another common issue is that not all SQL Server performance counters are useful, and some have persistent myths and misinformation to be aware of.
I’ve found that one of the hardest problems to identify is parameter sniffing. It’s good to be familiar with this in advance if you need to troubleshoot performance in SQL Server. A great resource on this is Erland Sommarskog’s Slow in the Application, Fast in SSMS.
How is this evolving, and what does the future look like?
One of the most exciting pieces of tooling introduced into SQL Server regarding performance tuning recently is Automatic Plan Correction. This requires the use of Query Store and it helps identify when queries are sometimes fast and sometimes slow. In other words it can help with identifying that tricky problem of parameter sniffing.
- You may use this in an advisory mode, or let it test freezing different plans for you in production
- Note that “corrections” are intended as temporary fixes: for long term performance, queries that are “corrected” likely need either query tuning and/or index tuning
Automatic Index Tuning is available in Azure PAAS editions. I’m not crazy about trusting SQL Server’s missing index suggestions — I look at them but take them as a suggestion to look at the related queries directly and do my own analysis, as they’re often oversimplified. ( I also do NOT trust ye olde Database Tuning Advisor.) But automatic index tuning is an interesting feature and it involves an element of testing. I haven’t personally looked into this much, but I’m curious about it, and I can see it working well on the right workloads, probably workloads with relatively small data sizes. Automatic index tuning does present interesting challenges in terms of how teams want to manage schema in source control and what indexes they want to deploy to a new workload if using a single-tenant-per-database architecture.
APIs for Customization from Monitoring tools – “Did that deployment/maintenance/configuration change we did yesterday cause a problem? When exactly did it happen, anyway?” Figuring that out has always been a toughie. One of the cool things that SQL Monitor has done is to add an API for annotations so that teams can track significant events on the graph of performance metrics. This can be automated to happen right away when an event occurs, or it can be used by team members to note times of interest, perhaps when triaging performance and noting times when a problem occurred. As DBAs merge more into platform teams and/or combine with development teams, the use of APIs to both manage alerts and the ability customize how alerts are delivered will only increase, along with other custom features enabling teams to customize their responses to incidents. (Again, I work for Redgate, so I’m biased on my choice of tooling and examples.)