This is complex: let’s dig in
This isn’t a bad index — this is an optimization problem.
Think through how our query is written, both inside and outside of the stored procedure: the query joins to the little ref.FirstName table, and on that table we're saying WHERE FirstName = 'Matthew'.
SQL Server does have statistics available on that little ref.FirstName table.
But think about what that table is: a 0.003 GB table where each FirstName has one row.
The statistics on FirstName say, "Matthew has one row in this table." That doesn't tell SQL Server how many rows Matthew has in the LARGER table. On the larger table, dbo.FirstNameByBirthDate_1966_2015, the one that's 3.7 GB, we have statistics on FirstNameId.
So SQL Server knows how many rows there are.
It could use the histogram for a given FirstNameId, but it doesn't have the FirstNameId! We're joining on FirstNameId, but the predicate is on FirstName.
What is Matthew’s FirstNameId?
Well, SQL Server can figure it out, but to figure it out, it would have to actually query ref.FirstName. It would have to do data access.
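To make this concrete, the query shape being described looks roughly like the sketch below. The two table names and the FirstName / FirstNameId columns come from this discussion; everything else is a hypothetical stand-in:

```sql
-- Hypothetical sketch of the query shape described above.
-- Table names and join/predicate columns are from the text; the SELECT list is assumed.
SELECT COUNT(*)
FROM dbo.FirstNameByBirthDate_1966_2015 AS fnbd
JOIN ref.FirstName AS fn
    ON fnbd.FirstNameId = fn.FirstNameId
WHERE fn.FirstName = 'Matthew';
-- The predicate is on ref.FirstName.FirstName, but the join is on FirstNameId.
-- At optimization time, SQL Server doesn't know which FirstNameId 'Matthew' maps to,
-- so it can't seek into the histogram on the big table for that value.
```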
Optimization has to happen really, really fast
SQL Server wants to give you an execution plan for that query and get you going ASAP, because if it started running queries to optimize a query, well, that counts as part of your query’s duration.
That’s not how optimization works right now in SQL Server.
So, since it doesn't have the FirstNameId, SQL Server's best bet is to say, "Okay, I don't know which FirstNameId this will turn out to be, so I'll estimate for the average FirstNameId."
What's the most likely row count for any given FirstNameId? Multiplying the density vector by the table's row count gives you exactly that: the average number of rows per FirstNameId.
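As a sketch, you can see the numbers SQL Server works with using DBCC SHOW_STATISTICS. The statistic name and the row counts below are illustrative assumptions, not values from this example:

```sql
-- Inspect the density information on the big table's join column.
-- 'IX_FirstNameId' is a hypothetical statistic/index name.
DBCC SHOW_STATISTICS ('dbo.FirstNameByBirthDate_1966_2015', IX_FirstNameId)
    WITH DENSITY_VECTOR;

-- "All density" = 1 / (number of distinct FirstNameId values).
-- Estimate for an unknown FirstNameId:
--     estimated rows = all density * total rows in the table
-- e.g. with 100,000 distinct names and 10,000,000 rows (made-up numbers):
--     (1 / 100,000) * 10,000,000 = 100 rows
```

That's a perfectly reasonable guess for an *average* name, but it can be badly wrong for a very common name like Matthew.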
When I changed the query, essentially what I did was I ran a query ahead of time myself and looked up Matthew’s FirstNameId.
I did the data access for SQL Server and then I put the FirstNameId in the query.
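In other words, the rewrite looked roughly like this. The specific FirstNameId value is made up for illustration:

```sql
-- Step 1: do the data access myself, ahead of time, to find Matthew's FirstNameId
-- (the lookup SQL Server won't run during optimization).
SELECT FirstNameId
FROM ref.FirstName
WHERE FirstName = 'Matthew';  -- suppose this returns 28 (illustrative value)

-- Step 2: put the literal FirstNameId into the query, so the optimizer can use
-- the histogram on dbo.FirstNameByBirthDate_1966_2015.FirstNameId directly.
SELECT COUNT(*)
FROM dbo.FirstNameByBirthDate_1966_2015
WHERE FirstNameId = 28;
```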
Armed with this information, it said: oh hey, okay, I've got a statistic on that FirstNameId. I can look at the histogram, and I've got a great estimate for Matthew. We shouldn't do all those key lookups! It's faster to do a clustered index scan, and we also got that green index hint saying there could be an index that would make this better for us.
When we gave SQL Server more info, it was able to do a better job optimizing the query
It was able to use the histogram. This piece of information about why it’s slow and how providing the FirstNameId can even make it faster will play into some of the possible solutions that we’ll cover.
Not every solution relies on this, but it always helps to understand where SQL Server slowed down.
A quick note about the future
There are some exciting new features in SQL Server 2017, including the first version of a feature called Adaptive Joins. This feature is interesting. The 1.0 release isn't going to cover the use case I've talked about in this example, but it's just the first version.
The idea behind the Adaptive Join feature is having, in our execution plan, a join that, if it's not as fast as SQL Server thinks it is, might change and run differently on future executions. It might be able to adapt.

A couple of things to notice. This is about more than just that first execution, so if all your queries only run once, if you have a lot of ad hoc queries, that feature might not be that helpful. It may be a long time before it matures, and we can't be sure it will cover our scenario. But it is really interesting and really cool that the optimization/execution process is getting more interesting and more complicated, and the way a query is initially optimized isn't always going to be the whole story.
So let's dig in and look at different ways we can tune this query, since WE currently need to provide the adaptation for our T-SQL.