More fun with rowgroup elimination and predicate pushdown (12 minutes)

More, more about columnstore!

Why are modifications such a big deal to the performance of columnstore indexes

Hugo Kornelis gives a great discussion of this in “Stairway to Columnstore Indexes Level 6: Updating and Deleting Data in a Columnstore Index.” The other parts of this stairway are also a fantastic resource if you’re evaluating columnstore indexes for your environment.

Get a bigger scoop of info on string columns in columnstore indexes

Read Niko’s post on working with strings in columnstore.

Interested in learning about limitations of columnstore indexes?

Watch the free video, “Worst Practices & Less Known Limitations for Columnstore Indexes by Niko Neugebauer” from the online GroupBy conference.

Transcript

Once you start looking at this, I think it becomes so much fun, because there are so many questions you may ask.

You might wonder, “Do I have to choose between rowgroup elimination and partition elimination if I have a partitioned columnstore index?”

Well, let’s take a look!

We now are declaring our parameter value– or this is actually a local variable. We are now declaring our local variable with the correct precision of DATETIME2, matching the data type and precision in the table. So in this case, we should get successful partition elimination because, oh, thank goodness, we remembered the data type correctly this time.

I’m going to turn on my actual execution plans and I’m using STATISTICS IO and STATISTICS TIME so I get that rowgroup elimination information as well.
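As a sketch, the setup being described looks something like this. The table name dbo.FirstNameByBirthDate and the DATETIME2(7) precision are assumptions for illustration; only the FakeBirthDateStamp column is named later in the demo:

```sql
-- Declare the local variable with the SAME type and precision as the
-- partitioning column, so partition elimination isn't defeated by an
-- implicit conversion. DATETIME2(7) and the table name are assumptions.
DECLARE @FakeBirthDateStamp DATETIME2(7) = '2015-01-01';

SET STATISTICS IO, TIME ON;  -- surfaces "Segment reads N, segments skipped M"

SELECT COUNT(*)
FROM dbo.FirstNameByBirthDate  /* hypothetical partitioned columnstore table */
WHERE FakeBirthDateStamp >= @FakeBirthDateStamp;
```

If the variable were declared as plain DATETIME instead, the type mismatch could force an implicit conversion on the predicate and block partition elimination, which is the failure mode the demo hit “last time.”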

In my execution plan, I can look at the properties of the columnstore index scan, and I can see that because it’s an actual plan, hey, I did get partition elimination. I only had to look at six partitions: partition numbers 51 through 56.

I also got that predicate pushdown magic in here too on the 3.1 million rows, but just like last time, I had to push 819,000 rows over from the delta store to be counted by the hash match.

When I look at my STATISTICS IO output on the messages tab, now I see segment reads were four, segments skipped zero. Again, what it really means here is rowgroups: we actually had to read four rowgroups, and we skipped zero rowgroups.

The number of skipped rowgroups has changed because we eliminated a bunch of partitions

For the partitions that were eliminated, it doesn’t even think about the rowgroups, because it’s like, “oh, we eliminated those partitions, I’m only going to consider the rowgroups in the remaining partitions.” And then at that point, potentially it could eliminate rowgroups. In this case it did not have to.

Let’s go ahead and edit our query a little bit

What if we only wanted the things that were after January 15th? We are partitioned by year, so we may eliminate some partitions, but we now only want a subset of the data within the remaining partitions. We’ve got our actual plans on. When I look at my columnstore index scan, I was still– I used the correct data type, so I was still able to eliminate partitions. I only needed partitions 51 to 56.
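The tweaked query might look like the following, again with hypothetical table and variable names, and the year chosen here is an assumption:

```sql
-- Same correctly-typed variable, but now a mid-year date: the engine can
-- eliminate whole partitions first, then skip rowgroups inside the
-- surviving partitions using their min/max metadata.
DECLARE @FakeBirthDateStamp DATETIME2(7) = '2015-01-15';

SELECT COUNT(*)
FROM dbo.FirstNameByBirthDate  /* hypothetical table from this demo */
WHERE FakeBirthDateStamp > @FakeBirthDateStamp;
```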

When I look at my information here, the way my data is laid out is actually kind of funny. All of my births are really close to the beginning of the year, and it’s like, oh, I looked at the metadata and I was actually able to skip all four of them. Because we had a huge burst of babies born at the beginning of the year, and then it stopped. I’m not very good at distributing the fake birth dates for my babies. My babies are all born just right away. So I’ve kind of made the human species be a little weird in my database.

We’ll reset that demo, but it is able to use both.

It can eliminate partitions and then, in the remaining rowgroups, eliminate rowgroups when that is applicable.

If we look at the data that came back, it’s actually after that date. It’s only the babies born in 2016 that we got back after that date.

That’s great that we can have our partition elimination cake, and we can also eat our rowgroup elimination too.

You might wonder, “can we get rowgroup elimination on other columns?”

For example, our non-clustered columnstore index contains a FirstNameId column. That column is an integer.

We’re going to see something else that’s a little weird about these execution plans in here too, when zero rows is not really equal to zero rows

So I’m looking at: I want to know the count of babies who have the FirstNameId of four. I don’t actually know what the FirstNameId of four is, it was just a fun number to throw in there, and my table is partitioned based on FakeBirthDateStamp.

There could be babies born at any time between 1966 and 2015 with the FirstNameId of four, right? We can’t just eliminate– we’re partitioning by a totally different column. We can’t just eliminate partitions.

But, those compressed rowgroups do have metadata on them about the min and max values in that rowgroup, so potentially we could eliminate rowgroups depending on how frequent FirstNameId four is. I don’t actually know, so let’s take a look and see what we get for the count of babies with FirstNameId four.
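That query is just a count with an equality predicate on the integer column, something like this sketch (table name assumed):

```sql
-- FirstNameId is an INT, so each compressed rowgroup carries min/max
-- metadata for it. Any rowgroup whose [min, max] range excludes 4 can be
-- skipped without reading its segments at all.
SELECT COUNT(*)
FROM dbo.FirstNameByBirthDate  /* hypothetical table from this demo */
WHERE FirstNameId = 4;
```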

Well, it’s not a very popular name. There’s only 21 babies with that name, and when we look at our execution plan, here’s our columnstore index scan, and I look at the properties, sure enough it’s in batch mode, and our actual partition count is zero, of course.

We weren’t able to get partition elimination– again, I think of this as “not applicable.” But maybe I was able to get rowgroup elimination. All 21 rows were handled by the local aggregation; none of the FirstNameId equals four rows were in the delta store, it was all in the compressed columnstore index. An interesting thing here: the actual number of rows is zero.

This is what leads to the weird thing you may sometimes see in these execution plans. When we look in the plan and I highlight this line, we ask, OK, how many rows went into the hash match? Zero rows went into the hash match, because all of the rows were counted down in the columnstore. But then you look at the rows coming out of the hash match.

When you highlight the operator coming out of the hash match it says one, so you end up with this weird situation in an execution plan where zero rows went in, and one row came out. And it looks like the hash match, at first glance if you don’t look at everything else and know what’s going on, you just see a hash match operator who had zero rows go in but magically one row go out– and you’re like, did you just make it up? Like, did you just conjure this row from nowhere?

Well, what actually happened is, we counted all 21 rows in the columnstore index using that magic counting it can push down, and it’s sort of like SQL Server’s kind of trying to say that these don’t have to be counted by the hash match aggregator. It’s kind of like this secret curved line that just sort of joins up with it and goes out of it, because it’s trying to say– I mean it is fair– the hash match operator doesn’t have to count that row, or the 21, it was already counted in there, right?

And now I have this crazy drawing on the screen.

So you may end up seeing zero rows go into something like that and then one row come out, because this arrow only shows the rows going into the hash match operator– it only shows how many rows the hash match operator actually has to count, and in our case it doesn’t have to count any. It does have a meaning, it’s just a little weird.

Well, OK, were we able to– actually that was the other question we had about this!

Were we able to get rowgroup elimination on FirstNameId?

I almost said segment elimination, because of the weird naming thing, right? But it should be rowgroup elimination. Well, we WERE able to. It had to read from six rowgroups of compressed segments; it was able to skip 228 of them just by the magic of looking at the metadata.
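If you want to see that min/max metadata yourself, you can peek at the segment-level catalog view. This is a hedged sketch: the table name is assumed, and for dictionary-encoded columns min_data_id/max_data_id are dictionary IDs rather than raw values:

```sql
-- Per-segment metadata that powers rowgroup elimination.
-- sys.column_store_segments.column_id maps to the index column position,
-- so we join through sys.index_columns to filter to FirstNameId.
SELECT p.partition_number,
       seg.segment_id,
       seg.row_count,
       seg.min_data_id,
       seg.max_data_id
FROM sys.column_store_segments AS seg
JOIN sys.partitions AS p
  ON seg.partition_id = p.partition_id
JOIN sys.index_columns AS ic
  ON ic.object_id = p.object_id
 AND ic.index_id = p.index_id
 AND ic.index_column_id = seg.column_id
JOIN sys.columns AS c
  ON c.object_id = ic.object_id
 AND c.column_id = ic.column_id
WHERE p.object_id = OBJECT_ID('dbo.FirstNameByBirthDate')  /* assumed name */
  AND c.name = 'FirstNameId';
```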

Very, very cool.

How about other columns though? What about StateCode?

Well, a little bit of a spoiler: we currently do not have rowgroup elimination on string columns. Niko, who writes wonderful materials about columnstore indexes and shows many of the wonders and really cool things columnstore indexes can do, has written a blog post about this, and he points out that yes, we would hope to someday have rowgroup elimination on strings, but strings are a harder case, and right now columnstore indexes just don’t have it.

If you think about all the different collations that you could use when storing string data, it’s not necessarily a simple thing to code, right? It’s not nearly as easy– and I’m not saying any of this is easy– but it is a more complex case to provide rowgroup elimination for. We will see: I’m running this query for just everybody in Oregon. I have a CHAR(2) data type for StateCode. Since that is a string column, when we look at this we are not able to skip any of the rowgroups; we had to look at them all.
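The Oregon query is the same counting pattern, just with a string predicate; a sketch with the same assumed table name:

```sql
-- StateCode is CHAR(2). String predicates don't get rowgroup elimination
-- today, so every rowgroup is read, but aggregate/predicate pushdown can
-- still do most of the counting inside the columnstore scan itself.
SELECT COUNT(*)
FROM dbo.FirstNameByBirthDate  /* hypothetical table from this demo */
WHERE StateCode = 'OR';
```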

Now, our query was still pretty fast. It turns out we have– let’s see, let’s throw some commas in here– 1.6 million babies from Oregon, and of course we were not able to do partition elimination: we are partitioned on a totally different column.

We get that weird “actual partition count is zero” because we’re looking at a non-clustered columnstore index, but it was able to work that local aggregation magic on it. It only had to push 12,000 rows of those Oregon babies over to be counted by the hash match. So, it was still able to do this very, very quickly even though it didn’t get the rowgroup elimination.

Another string column is Gender

If I count all the male babies, I see that in that case I had to look at all the segments as well. That just makes sense: even if elimination were possible for this column, a lot of my segments are going to contain both male and female babies, right? Males may be a bit over 50%, but there’s a very high likelihood that any given segment contains both male and female babies, so I wouldn’t have a great opportunity for elimination because of the way my data was laid out when I created my columnstore index.

So segment elimination (rowgroup elimination, really!) is a very powerful tool.

We don’t get it for string columns at this point, but even in my case, when I’m doing these aggregations and counts on string columns, I am still able to get that cool predicate pushdown.

Now, there are some limits on when this works– which data types and sizes of things qualify as queries get more complex. Check out resources like Niko’s blog for that. He also has a great presentation on some of the limitations and the things to look out for. He emphasizes the power of this feature as well, but as with anything, there are little edge cases and places where the magic may not always be consistent as your use case gets more complex, especially when you have a wide variety of data types in play.