Why is this a problem? (7 minutes)


We normally think of indexes as a terrific tool to make our queries faster in SQL Server. That’s for good reason!

When we craft an index, we are giving the SQL Server optimizer a way to limit IO, limit CPU use, and dramatically speed up our queries.

When indexing works well…

Indexes are helpful to the optimizer, our queries become faster, and the indexes are very valuable.

Sometimes things go wrong

Sometimes you’ll create an index, and for either the query that you’re tuning or maybe for some other query entirely, the query uses the index and hits a regression and becomes slower.

We can have these cases where we create an index that — we’re putting it there, we’re allocating space to speed up performance – but it makes our performance worse.

Either we’re wasting space with the index, or we just look and feel kind of dumb.

There are reasons why this happens

The more indexes you tune, eventually this will happen to you– because you can’t index every query perfectly. If you try to– in most systems, we have so many different queries that run with different joins and different where clauses, and different columns in the SELECT, that we can’t create the perfect indexes tailored for each of these queries.

When we create new rowstore indexes in SQL Server, we’re creating with these disk based rowstore indexes – we’re creating a separate copy of the columns that we’re indexing. The columns that we specify as the keys and the included columns get their whole new beautiful structure.

This structure can be really useful, but the more we put in there, the more space it takes up

If we indexed every query, the size of our database would just explode with all those disk based rowstore indexes. Because this copy of data is going to take up storage when SQL Server reads it into memory in its buffer pool, it’s going to take up our memory.

We’re going to have to maintain them, too. We need to check them and make sure they don’t get corrupt. We need to deal with fragmentation., or they’ll just expand with all this empty space in them. And we’ve got to back them up so that we don’t lose data.

This means that…

  • Usually we are going to work hard to tailor really efficient indexes for our most important queries
  • But for our less important queries– and honestly everywhere we can get away with it– we’re going to let SQL Server use less than perfect indexes.

We’re not going to try to cover all the columns in every query or have the perfect index keys for every query

We’re going to often let the SQL Server optimizer make the best of what it has.

It can be pretty good at making the best at what it has!

I’m going to show you a case where SQL Server not that great at making the best of what it has

We’ll have a problem query where it’s not the fastest query in the world, but it’s performance suffers after we add a nonclustered index. SQL Server decides to use the nonclustered index and it’s not always great for our query.

You’ll get to take a challenge

You’ll get to look at the problem and, if you want to, you can then take a crack at it yourself. You can grab the TSQL and work on it and try to figure out: “Why is it getting slower, and can I change the TSQL to speed it up?”

Now, I say change the TSQL. This is not an indexing challenge!

You could absolutely take it that way and improve the index and say: “Okay, how can I make you faster this way?” But in this session what I’m focusing on is: “Okay, what if I have to deal with a less than perfect indexing scenario, and I can only change the TSQL? Can I tune it to at least get back to how fast I was before, even if it wasn’t perfect?” For a different solution, what are the pros and cons of those solutions?

There are trade offs when we are tuning our queries in SQL Server.

The data that we’ll be looking at is sample data of baby names

I got the original data set from Data.gov. The United States government releases each year, aggregate information about all the baby names that were born. And by aggregate I mean it gives you, for the year, the total number of babies with a given name. Let’s say Jacob, it gives you for Jacob, for a given gender what was the total count of names.

I’ve taken that data and I’ve made a much of fake data with it, because I wanted to create a detail table with a lot of rows.

The detail table is on the right here. It’s dbo.FirstNameByBirthDate_1966_2015. It has a row for every baby born that I have state level data from. It says which state in the United States it was from. Each of those babies has their own row that specifies the BirthYear, some fake columns based – I don’t know exactly the minutes they were born, so I faked it out. The StateCode they were in. The FirstNameId, an integer representing what their name is, and then their Gender.

We have another table, ref.FirstName, which is much smaller. Ref.FirstName is only .003 GB and 95,000 rows. It lists: okay what are the FirstNames? How long is the name, when was it first and last reported? And then it’s got the total NameCount.

The reason I have this detail table is it’s 3.7 GB and it’s 159 million rows

As the data in our tables grow, we get more and more interesting challenges with our queries for optimization. We have data actually going back to 1880, I just picked the data from 1966 to 2015 to give us a reasonably sized table to play around with. Next up: we’ll dive in and explore that problem query.