Can I Use Statistics to Design Indexes? (Dear SQL DBA Episode 18)

on October 6, 2016

Should you look at automatically created statistics on your tables in SQL Server to help you design better indexes? Learn why in this 20 minute video, or subscribe to the Dear SQL DBA podcast.

No time to watch? Scroll on down, everything is written in article form below the video.

Here’s this week’s question:

Dear SQL DBA,

I’ve noticed that many indexes in my data warehouse aren’t used frequently. Is there a way to use the automatically generated statistics to make useful indexes?

… (insert witty pun about indexes)

I’ve been asked this question several times, and I even remember once wondering this myself.

There’s no good way to analyze the column based statistics that SQL Server automatically generates for the purpose of index design. Let’s talk about why.

First: Why do we have ‘statistics’, anyway?

Let’s say we have a table named dbo.Grades. It has columns for GradeId, ClassId, StudentId, Year, Term, and Grade.

We run a query looking for all StudentIds where Year=1990.

The table has a Clustered Primary Key on GradeId – so the whole table is stored sorted by GradeId. That doesn’t help our query at all, we want the StudentIds where Year = 1990.

The table has a nonclustered index on Year and GradeId. That nonclustered index does NOT contain StudentId, though.

So SQL Server has two main options:

Is it better for SQL Server to go look at the nonclustered index and then loop back into the base table and look up the StudentId using the GradeId?
Or it might be easier to just scan the whole dbo.Grades table and check the StudentId on each row. Maybe the school was only open for the year 1990, and all the rows have that value!

How does SQL Server know which is better?

SQL Server uses statistics to guess which way is the most efficient to execute your query.

Statistics vs Indexes (and their relationship)

Statistics are little, tiny pieces of metadata that describe the data in the table – things like an approximation of the number of rows equal to 1990. Statistics don’t take up a lot of space in the database.

When you create an index, a statistic is automatically created for the key columns in the index. This can be a multi-column statistic if you have multiple key columns, but the statistic describes the first column in the index in the most detail.

If you use a predicate like Year = 1990 against a column that isn’t the leading column in an index, SQL Server will create a single-column statistic. (Fine print: automatic stats creation can be disabled using a database setting.)

On the other hand, indexes are COPIES of the data itself for the columns defined in the index. The index on Year and GradeId on the dbo.Grades table takes up extra space on disk and on memory and has copies of all the rows for Year and GradeId.

SQL Server uses statistics to optimize an execution plan.

SQL Server uses indexes within the execution plan to find data.

What does the existence of a system created column statistic tell us?

We’ve talked a lot so far about how much statistics and indexes are related. This is why it seems like statistics might be useful for designing indexes!

But here’s the thing – SQL Server doesn’t track and report on how many times a statistic was used during optimization.

I didn’t write the optimizer, but I’m not sad about this at all, I think it’s fine, because:

Optimization needs to be very fast. SQL Server wants to start your query as soon as it can. Everything that has to be written out during that process costs CPU and time.
Just considering a column statistic in optimization doesn’t necessarily mean that a single column index on that column would be useful. There might be a multi-column index that would be useful. Or it might actually be better for it to be scanning the table! We wouldn’t know just by the fact that the statistic had been examined by the optimizer.

To have useful information about which single-column statistics might be needed in an index, SQL Server would have to do a lot of work– and it’s already got a feature in optimization for this at a higher level.

Whenever there’s more than one way to run a query, SQL Server thinks about whether an index would help the query run faster. If it thinks an index would help, it records this using the “Missing Indexes” feature.

I wouldn’t make too many assumptions on a column that lacks statistics, either

If a column does NOT have a statistic on it, that isn’t proof that the column is unused.

Statistics are automatically created on columns where you use joins and ‘where’ predicates.

Some columns may just be returned to the user without being filtered. They wouldn’t need a statistic generated for that. But the columns are still in use– and in some cases, using them as “included columns” in indexes might be useful.

So I wouldn’t use the existence OR non-existence of column statistics to make decisions about indexes.

And anyway, we have other options!

What’s the best way to design indexes?

There’s two main approaches you can take. After you’ve been doing this a while, you’ll probably mix the approaches.

Identify the most expensive queries that run in each database, and tune indexes for those. You can find the “most expensive” queries using any of the following tools: SQL Server Query Store, the SQL Server Execution Plan cache, or a monitoring tool that watches execution and persists data from the plan cache and memory for longer analysis (and doesn’t get cleared out by memory pressure or restarts).
Use SQL Server’s “Missing index requests” to see where SQL Server is asking for an index. This isn’t perfect – it has to make those requests in a hurry, and the requests get cleared out when you take the database offline (or rebuild an index on the table).