Meet your problem query (20 minutes)

Ready to tune our query?

Time Statistics / IO statistics in actual execution plans

Performance statistics for duration, CPU time, and IO appear in actual execution plans for SQL Server 2014 SP2 and higher.

To see this information in graphical execution plans in SQL Server Management Studio, you must be running a recent release of SSMS as your client tool. (Free download from Microsoft.)

Checkpoint

In the video I talk about checkpoint flushing dirty pages to disk. For you details-lovers out there, checkpoint does a little bit more as well – I just mentioned the part that’s most relevant to the demo. Get an overview of checkpoint here, if you’re interested in the nitty-gritty: https://technet.microsoft.com/en-us/library/ms189573.aspx

Transcript

Time to meet our problem query.

A bit of info on how my SQL Server instance is configured

The demo that you’ll see today, I’ve recorded using a virtual machine that has four cores and 6 GB of memory allocated. I’ve set SQL Server to use the max degree of parallelism of four. If a query qualifies to go parallel, it can use all four cores.

My max server memory is set to around 5 GB, and my cost threshold for parallelism is set to the default value of 5. That default is pretty darn low, but I’m not using the largest tables in the world. For the purposes of this demo, and for comparing plans with parallelism against plans without it, I’ve left it at that relatively low default of 5 query bucks – 5 estimated cost, which is measured in its own abstract unit in SQL Server.
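As a rough sketch, the settings described above could be applied with `sp_configure` like this (the exact values shown are from the video; whether you need them depends on your own test instance):

```sql
-- Sketch of the demo instance configuration described above.
-- These are instance-wide settings: only run this on a private test instance.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'max degree of parallelism', 4;      -- queries may use up to 4 cores
EXEC sp_configure 'max server memory (MB)', 5000;      -- "around 5 GB"
EXEC sp_configure 'cost threshold for parallelism', 5; -- the (low) default
RECONFIGURE;
```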

I have restored our sample database

It takes about 30 seconds to restore on my SQL Server instance, and my demo is ready to go.

The first thing I’m going to do is create an index on the little ref.FirstName table

This is the little table that has a unique list of all the first names in the database. I’ve made it really easy in that little table to go straight to a given first name and then return the FirstNameId column.
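The video doesn’t show the exact index definition, but a sketch that supports that lookup pattern – seek on the name, return the id – might look like this (the index name and the `INCLUDE` are assumptions):

```sql
-- Hypothetical sketch: a narrow index keyed on FirstName that also carries
-- FirstNameId, so a name lookup can be answered from the index alone.
CREATE INDEX ix_FirstName_FirstName
    ON ref.FirstName (FirstName)
    INCLUDE (FirstNameId);
```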

Here’s the query that we’ll be tuning

It’s in a stored procedure called NameCountByGender, which takes a @FirstName parameter that people can pass in.

Then, for the given FirstName, SQL Server is going to group by the Gender column and return a list of all the genders we have for that FirstName between 1966 and 2015, then the count of the number of rows we have for each of those genders.

It does this by joining that detail table, dbo.FirstNameByBirthDate to the ref.FirstName table. If we’re going to look for a given FirstName in a varchar field, we’ve got to plug that into the ref.FirstName table. We only have FirstNameId over in our larger table. We’ve normalized our data somewhat. So, we’ve got to join the two tables on FirstNameId. This Gender column is in the dbo.FirstNameByBirthDate table, and we’re grouping by that, then returning it with the count.
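The exact procedure text isn’t shown here, but based on the description above, a sketch might look like the following (the column types, aliases, and `CREATE OR ALTER` syntax are assumptions; `CREATE OR ALTER` requires SQL Server 2016 SP1 or higher):

```sql
-- Hypothetical sketch of the procedure described above.
CREATE OR ALTER PROCEDURE dbo.NameCountByGender
    @FirstName varchar(255)
AS
BEGIN
    SELECT
        fnbd.Gender,
        COUNT(*) AS NameCount
    FROM dbo.FirstNameByBirthDate_1966_2015 AS fnbd
    JOIN ref.FirstName AS fn
        ON fnbd.FirstNameId = fn.FirstNameId   -- join on the normalized key
    WHERE fn.FirstName = @FirstName            -- seek the varchar name here
    GROUP BY fnbd.Gender;
END;
GO
```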

Not a super complex query

I think the beauty of this is: there’s a lot of interesting things, even though this isn’t the most complex query in the world. I’m going to make sure that my procedure exists in my database.

For the purposes of our example, imagine that this isn’t the most important query in our database

It’s a query that runs occasionally, and when it runs, the tables aren’t guaranteed to be in memory. Imagine that this dbo.FirstNameByBirthDate table isn’t super frequently accessed, but we imagine that we do have other things on our instance and in our database that often are resident in memory.

When this query is run, it frequently has to go out to disk and bring the pages into the buffer pool.

We will be measuring our performance with a cold cache

In other words, every time we’ll be measuring it, we’ll be saying, dump out the memory on the whole instance so that queries that run have to populate the memory in the SQL Server. Just disregard what’s in memory and start over from disk.

Now, you’ve got to be really careful about where you run this…

…because saying disregard all the pages in the buffer pool means everybody’s going to have to start going to disk.

So I’m running this against a private VM test instance where I’m not going to impact anyone else, and that is really the only place where you should ever run a command like DBCC DROPCLEANBUFFERS, which says: okay, I don’t care what clean pages you’ve got in your buffer pool. Just disregard ‘em and start going to disk.

Because, I mean, in production, you’re going to immediately make everybody start reading from storage, and that can really slow them down.

Even in a test environment, if anyone else is looking at performance, they may start seeing weird things if you vanquish your buffer pool suddenly.

So be careful out there.

I’m going to clean out my clean buffers – or pages that haven’t been modified in memory – and say, just ignore them after this.
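The cold-cache reset pattern used throughout this demo is just these two statements (again: isolated test instance only):

```sql
-- Cold-cache reset. NEVER run this on a shared or production instance.
CHECKPOINT;              -- flush dirty pages in the current database to disk
DBCC DROPCLEANBUFFERS;   -- discard the clean pages from the buffer pool
```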

Your problem query is: you will be running dbo.NameCountByGender with @FirstName = ‘Matthew’ as the parameter

For my first run here, I don’t have execution plans on, but I am running these statements to turn on STATISTICS TIME and IO for my session, and then I do turn them off when I’m done after I run my query.
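That pattern looks roughly like this (the exact script isn’t shown in the text, so treat this as a sketch):

```sql
-- Measure the procedure with timing and IO statistics for this session.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

EXEC dbo.NameCountByGender @FirstName = 'Matthew';

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```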

We can see that there are Matthews of multiple genders. I’m not sure if there really are females named Matthew or if that’s a data entry issue, but it’s an interesting question.

Here is my output from STATISTICS TIME and IO

On this run, my elapsed time is just over five seconds. Usually for this query on my instance I see between 4.5 and 6.5 seconds when it has to do those physical reads, so this is right in the middle.

It used more CPU time than that elapsed time, though. When we look at the execution plan, I’ll show you why.

Sure enough, it did have to do a lot of reads against that big FirstNameByBirthDate_1966_2015 table.

We don’t have any non-clustered indexes on that thing yet, and we had to do 487,000 logical reads, and then three physical reads and 476,000 read-ahead reads. Huh.

STATISTICS IO separates out two different kinds of reading from storage

Physical reads are the smaller reads, where we’re either reading a single 8KB page, or maybe an extent of eight 8KB pages. The way I picture it, ‘physical reads’ means plucking individual pages out of storage and into memory.

But STATISTICS IO separates out read-ahead, which can read in a bigger group of pages at the same time. The pages reported by this were read in larger sections. I think of it kind of as a big vacuum cleaner.

One of the features of Enterprise Edition is that your vacuum cleaner is even larger: your read-aheads can read even larger chunks. I’m running Developer Edition here, which is free – awesome! – and has all of the features of Enterprise Edition, so I can take advantage of that on my instance too. Read-aheads pull the data in in a bigger, more efficient way. We’re doing a big scan, so SQL Server’s just like, “Bring it all in!”

I’m going to run this again, and this time before I run it, I’m going to run DBCC DROPCLEANBUFFERS so the query has to go out to disk again. I had CHECKPOINT in there before as well. CHECKPOINT says, for my current database: “take the dirty pages in memory and flush them to disk.” But I haven’t dirtied any pages in my database, right? I just ran a select query. So dropping the clean, unmodified pages is enough. I’m going to run it this time.

I’m going to turn on actual execution plans, so we can see how this query is being run

Because of my version of SQL Server, we will also be able to see information similar to STATISTICS TIME and IO in our execution plan.

I’m running SQL Server 2016 Service Pack 1. As of SQL Server 2014 Service Pack 2 and higher, we now have in our actual execution plans information about: How long was our elapsed time? How much CPU did we use? As well as IO info. I’ll show you how to see that.

If you open up your properties pane – I clicked around to make sure it’s showing the info for my select operator – you can scroll down and find QueryTimeStats in there. This says: on this run, the query took 4.7 seconds of elapsed time, and we used 18.6 seconds of CPU time.

Really cool info, and I didn’t have to turn on STATISTICS TIME and STATISTICS IO. This is just part of my actual execution plan.

I got less than five seconds of duration this time

Well, where did I spend my time? Looking at my operators, notice that I have this Gather Streams parallelism operator, and that my hash match and other operators have double arrows on them. These are indicators that they could use multiple cores. That’s why my CPU time can be higher than my elapsed time.

CPU time measures how much time was used by all the cores in total. They may have been running some of those operations simultaneously.

Let’s look a little bit at our plan

We’ve got a really efficient index seek against ref.FirstName. That little index we put on there to help it find the FirstNameIds quickly, it worked really well. But we’ve got a big old clustered index scan here for FirstNameByBirthDate_1966_2015, and it was parallel. How long did that take?

Well, let’s look at the properties pane for that guy. I can see– I think this is so cool– I can see the actual time statistics for the operator. This clustered index scan took 4.4 seconds. That is most of the duration of my query, and it took more than 17 seconds of the CPU time used by the query.

So, this makes sense.

I’ve got this big table that’s almost 4 GB. I’m having to scan it and look at every row and be like, are you Matthew? Are you Matthew? Are you Matthew? Are you Matthew?

This makes sense as taking most of our time.

How to interpret per-operator elapsed time and CPU time in ‘row mode’

It can be a little confusing when you’re looking at this info, though. If you go up to the hash match operator and look at its actual time statistics – the first time I looked at this, I was a little confused – it says it took 4.5 seconds of elapsed time.

Well, wait, but if they both took more than 4 seconds, then my total duration would be more than 8. But it wasn’t.

There’s a method to this madness. It actually makes sense. Look at the actual execution mode of the operator that you’re on. We are in ‘row’ mode for our hash match operator – in fact, we’re in ‘row’ mode for everything in this execution plan. I don’t have any batch mode in this execution plan.

For row mode, the actual time statistics are for that operator and its children. The clustered index scan here is considered a child of the hash match operator, so its time is reflected in the hash match operator’s numbers. When you’re trying to figure out what actually took how long, you need to check the execution mode, and if it’s row mode, start looking at the operator’s children.

Our clustered index scan doesn’t have any children, so we know that really it took that long. That does make good sense, that it would have taken that long.

So this is our baseline performance.

I’m not saying that 4.5 seconds is great performance, and in some environments, that would totally not be acceptable. But imagine that in our environment, that is a normal runtime for this query. The users are happy, and it’s totally acceptable.

Next up, I’m going to create an index

This index is keyed on the FirstNameId column only, on our large table that contains a row for every baby.

And imagine that this index is being created for different queries, and that it’s great for those other queries out there. We’re not making it any wider, because imagine there’s lots of insert activity on this table.

Not only do we not want to bloat up our database, but we want to make sure inserts, updates, and deletes are as fast as possible. We’re trying to keep our indexes perfect for the most critical queries, but still narrow. I’m going to create this index, and it takes a little while for it to create.
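A sketch of the narrow index described above (the index name is an assumption – the key is FirstNameId only, with no extra key or included columns):

```sql
-- Hypothetical sketch: the narrow non-clustered index described above.
-- Keyed on FirstNameId alone, to stay cheap for inserts, updates, and deletes.
CREATE INDEX ix_FirstNameByBirthDate_1966_2015_FirstNameId
    ON dbo.FirstNameByBirthDate_1966_2015 (FirstNameId);
```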

You guys get to fast forward while the index creates and meet me at the end

All right. It took about two minutes for that index to create, and another couple of minutes for the fans on my laptop to settle down. Since I did dirty some pages – and every time I dirty pages in an isolated test environment, I try to remember this – I run a CHECKPOINT to flush the dirty pages to disk, and then also clear the clean pages out of cache.

We are testing against cold cache again

This time, I’m going to turn off my actual execution plans for the first run and just use STATISTICS TIME and IO to test our procedure. Again, we’re running it for the first name Matthew.

Now, we did create an index against our table, and whenever we create an index against the table, SQL Server automatically recognizes, hey, I shouldn’t just reuse the execution plan. I should come up with a new one.

And in this case, right away, when we look at the timing in our STATISTICS TIME output, we can see that our elapsed time is way higher than 4.5 to 6.5 seconds.

We’re at 12 seconds elapsed time, and our CPU time’s actually lower

We did even more logical reads against our table, and our balance of physical reads versus read-ahead reads is different. It’s not all read-ahead reads anymore. We’re just doing more IO for this.

Well, let’s give this another run. I’m going to clear out my clean pages from cache, because I haven’t been dirtying pages. Just been running this query. I’m going to turn on my actual execution plan, and let’s take a look and see what’s different in our actual plan for Matthew after we added this index.

Again, we’re at nine seconds, 10 seconds

It is taking longer than it did before. So, it wasn’t just one run, right? That wasn’t just a fluke. It really is taking longer, and it really does have a different plan.

We still have our nice, efficient seek against ref.FirstName, but now we’re doing nested loops, and in our big table with a row for every baby, we’re doing an index seek.

That’s our new index on FirstNameId. And then for every row that comes back out of there, we’re doing more nested loops and doing a key lookup in the table itself. Let’s look at the properties of the key lookup.

What exactly are you looking up? I can find the output list in the properties of the operator, and when I expand that, I see that it’s having to look up the Gender column. I didn’t put Gender in my non-clustered index. SQL Server, for every row that it finds for Matthew has to go see in the base table, “Okay, what is the Gender column?” before it can group by that column, which in our query is being done by a hash match aggregate.

How much time is being taken by this key lookup?

Let’s look at the actual time statistics on that. We are now just on this one lookup operator spending 5.7 seconds of elapsed time looking up those individual genders, and we had to get 1.4 million rows, because there are a lot of Matthews, it turns out, in the United States between 1966 and 2015.

If I look at the properties of this nested loops operator, then by the time we finish processing it – remember, this number is cumulative for the operator and its children – we’re already at 12.7 seconds of elapsed time.

This is confirming, yeah, we did use that new index, but it’s not perfect for our query. We had to go do this key lookup.

But why did SQL Server fall for this?

Why did it choose that index, and what can we do about it? Here’s where the query tuning challenge comes in. You don’t have to, but you have the option to play around with this code and try your hand at figuring out: why did the non-clustered index make this slower?

What is it about this query that made SQL Server think it was a good idea? Can you speed up the query?

You don’t have to make it super fast, but can you at least get it back to that 4.5- to 6.5-second range when it has to read from disk? Or, since on different hardware or with a different number of cores you may see different timings, can you get it back to the original logical reads numbers or less?

So can you get it back to either the previous IO or CPU time, and what are the pros and cons of your solutions? I’ll see you back in future videos in this session to step through:

  • What solutions DON’T work?
  • Why is the query slow?
  • And what solutions DO speed things up, but what are the trade-offs for each one?