Overindexing: Missing Index DMVs and the Database Tuning Advisor

on March 8, 2017

SQL Server has tools that suggest indexes– and they’ll even auto-create the indexes for you.

I’m not a huge fan of these tools because they’ve got some notable flaws: they lead to creating more indexes than you need, and they aren’t super smart about the indexes they recommend.

Two queries that could use some index help

I’m using the free BabbyNames sample database (small version). All the code in this example is also available in a gist for easy access.

I run two queries, which each scan the clustered index of agg.FirstNameByYear:

USE BabbyNames;
GO

SET NOCOUNT ON;
GO

SELECT
    ref.FirstName,
    agg.NameCount
FROM agg.FirstNameByYear as agg
JOIN ref.FirstName as ref on 
    agg.FirstNameId=ref.FirstNameId
WHERE 
    Gender='F'
    and ref.FirstName = 'Calliope';
GO 10

SELECT TOP 100
    ref.FirstName,
    agg.NameCount
FROM agg.FirstNameByYear as agg
JOIN ref.FirstName as ref on 
    agg.FirstNameId=ref.FirstNameId
WHERE 
    Gender = 'M'
ORDER BY NameCount DESC;
GO 20

These two queries generate ‘Missing index requests’

SQL Server notices the clustered index scans when it optimizes these queries, and it quickly figures out that a nonclustered index would make these queries more efficient. It generates missing index requests.

You can see missing index requests in the execution plans as green hints (as seen in yesterday’s post). You can also query them from SQL Server’s Dynamic Management Views with a query like this:

SELECT 
    d.statement as table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.avg_total_user_cost as avg_est_plan_cost,
    s.avg_user_impact as avg_est_cost_reduction,
    s.user_scans + s.user_seeks as times_requested
FROM sys.dm_db_missing_index_groups AS g
JOIN sys.dm_db_missing_index_group_stats as s on
    g.index_group_handle=s.group_handle
JOIN sys.dm_db_missing_index_details as d on
    g.index_handle=d.index_handle
JOIN sys.databases as db on 
    d.database_id=db.database_id
WHERE db.database_id=DB_ID();
GO

* This simple query doesn’t sort or prioritize the requests, it just returns the raw info

Here’s what the missing index requests for agg.FirstNameByYear look like in this case:

The missing index feature has a few limitations. The limitations are documented, but the problem is that most people don’t know about the documentation. The feature doesn’t suggest key column order or fine tune your indexes.

But I think the biggest limitation is that the missing index feature doesn’t provide a way to learn about the queries making the request – and that link would be extremely helpful so that you can fine tune your indexes!

There are a couple of problems with the requests for these relatively simple queries:

  1. The second line shows that the index with KEY (Gender) and INCLUDE (FirstNameId, NameCount) has been requested 30 times. The first sample query, which runs 10 times, is asking for both of these indexes. There is no way to know from just looking at this list that the query would do very well with just one index on KEY (FirstNameId, Gender) INCLUDE (NameCount).
  2. The recommendation for an index with KEY(Gender) and INCLUDE (FirstNameId, NameCount) isn’t great for the TOP 100 query that executed 20 times, either. That query has an equality predicate on Gender, and wants the TOP 100 rows ordered by NameCount DESC: it would do very well with an index on KEY (Gender, NameCount DESC) INCLUDE (FirstNameId). With the suggested index, it will have to scan all the rows for every male baby – about half the index– then sort them all in memory by NameCount. Bummer.

We have confusing advice. If we don’t know better, we’ll end up over-indexing – and still not creating the best index for one of the queries.

Does the Database Tuning Advisor do better?

I don’t love running the DTA directly against SQL Servers that you care about: if it fails during its run, it leaves a bunch of temporary objects behind. I know that this happens a lot, because I’ve found those temporary objects in SQL Servers scattered around the world. It’s like space trash: it doesn’t pose an immediate threat, but it’s sloppy.

But, for science, I ran the DTA against this database, using the execution plan cache after I ran these queries.

The DTA suggested that I create four indexes:

We’ve got problems here too, it just looks fancier.

  1. For the TOP 100/ ORDER BY query which ran 20 times, this recommended KEY (NameCount DESC, Gender) INCLUDE (FirstNameId). We have an equality predicate in that query on Gender. With this recommended index, SQL Server will look at rows for both genders with the highest names until it finds the 100 rows for boy babies with the most frequent names (it’ll look about approximately twice the amount of rows it would need to). It’s not horrible, because the TOP operator is smart enough to stop after it ‘finds’ 100 rows, but it’s not brilliant.
  2. One index request on ref.FirstName is baffling to me. The ref.FirstName table has a clustered primary key on FirstNameId. We can already easily seek to any row based on FirstNameId, and access the FirstName column very efficiently. Yet a duplicate index is being recommended here.

What’s wrong with over-indexing?

Overindexing isn’t great: it wastes storage, wastes memory, prolongs index maintenance and corruption checking, and makes restores take longer.

This isn’t harmless. It’s like throwing junk in your closet. At first it’s no big deal. Eventually, you can barely open the door without things falling on your head.

What would I really do in this case?

The query that runs 10 times would do very well with either:

  • KEY (Gender, FirstNameId) INCLUDE (NameCount)
  • KEY (FirstNameId, Gender) INCLUDE (NameCount)

The query that runs 20 times would do very well with:

  • KEY (Gender, NameCount DESC) INCLUDE FirstNameId

We have a small conundrum here, because key order really matters.  There’s no “perfect” way to combine these indexes, because only one column can be second in the list of keys. And ‘Gender’ is not very selective as a leading key column.

I see this as a prioritization problem. What are the performance requirements for each query?

Once we know that, the next problem is, how long are the queries actually taking? It’s possible that I don’t need to do anything and that the performance of my queries is totally acceptable.

Remember: just because you see an index request, that doesn’t prove you have slow queries! All it means is that the optimizer guessed something could be faster when it was guessing about how to run a query, before the query even executed.

If the “TOP 100” query is more important and I create only the following index, it’s great for the ‘TOP 100" query. It’s good enough for the other query to result in CPU time of 45 milliseconds and elapsed time of 18 milliseconds.

CREATE INDEX ix_iamapersonnotamachine on agg.FirstNameByYear
    (Gender, NameCount) INCLUDE (FirstNameId);
GO

* Sorry about the terrible index name

I didn’t specify NameCount as DESC. It doesn’t make a difference in this case: SQL Server can do a backward scan. Although that zone of the plan will be serial, it’s feeding into a TOP operator anyway, which requires a serial zone. And it’s able to stop after it reads 100 rows.

If we frequently use predicates on the FirstName column on ref.FirstName, it makes sense to index it, even though it’s a tiny table. So I’d do this:

CREATE INDEX ix_FirstName_FirstName_INCLUDES on ref.FirstName
    (FirstName) INCLUDE (FirstNameId);
GO

Thoughts and Takeaways

The ‘Missing Indexes’ feature is like a two year old: it asks for what it wants in the moment. It often asks for too much, or not the right thing. That’s understandable: these index requests have to be generated super fast during query optimization. SQL Server doesn’t have time to sit around and ponder. It’s up to you to figure out why it’s asking for something and whether or not it should get it – and if you want to do that well, you’re going to have to track down the queries that are making the requests.

That means doing some legwork and identifying the query execution plans that are the most important to index. You can do that using the execution plan cache, Query Store, or a custom monitoring tool.

As for the Database Tuning Advisor, it doesn’t have the excuse of having to work super fast during query compilation. It’s still prone to over-indexing and littering in your database if it hits a hiccup. I’ve also known it to recommend some really wacky clustered indexes that people applied because they thought the DTA must follow best practices. I just can’t find a way to be a fan of the DTA.

There’s a bright side to this

The computers haven’t quite taken our jobs, yet.