Why read committed is the default isolation level, and what to do next (16 minutes)

Links

One question from the end of the webcast: can you change the default isolation level for the instance? You can’t – Microsoft’s Nacho Portillo explains why in detail here.

Transcript

Why is read committed the default isolation level?

So why, with all of these problems (we can see data more than once, we can miss data entirely, we can get back combinations of data that never existed, and this happens with both disk-based rowstore and disk-based columnstore indexes), does SQL Server use read committed as the default, except for Azure SQL Database? Why do we have this?

Well, we didn’t always have all the options that we have today.

More options were introduced in SQL Server 2005, and they have been refined over the years with newer technologies. But when we talk about isolation levels in SQL Server, the default of read committed for most databases has been the default for a long time, since before SQL Server 2000 even. These are the classic isolation levels in SQL Server.

SQL Server didn’t invent isolation levels

There are lots of relational databases, lots of databases in general, and this concept of isolation, of how to handle modifications happening while people are reading the data, is a common thing. We have common definitions of the phenomena that can happen under different isolation levels. SQL Server’s specific implementation gives us a range of options, the classic options, and originally all of our isolation levels were very strongly based on locks.

Read uncommitted

The lowest kind of classic, lock-y isolation level in SQL Server is named read uncommitted.

I am willing to read data that hasn’t been committed yet. The phenomenon here is known as a dirty read. But even read uncommitted, whose hint name is NOLOCK, does use some locks. I have one lock on this rung, because it does at least have to take things like a schema stability lock: make sure the database isn’t dropped, make sure the table isn’t dropped while I’m doing my read. There’s some basic locking there.
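As a rough sketch (dbo.Accounts is just a made-up table name for illustration), reading under read uncommitted can be requested either way, and these behave the same:

-- Set it for the whole session...
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT SUM(Balance) FROM dbo.Accounts;

-- ...or hint a single table, which is equivalent for that table
SELECT SUM(Balance) FROM dbo.Accounts WITH (NOLOCK);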

But the problem with reading uncommitted data is it may never be committed. It may be rolled back, right?

There may be a batch of data that’s going into the database, and a problem happens and the entire thing is rolled back, and that data just isn’t there anymore. If I’m reading uncommitted data, I’m going to return information to my users that isn’t even a fact, because it hasn’t been committed.

Read committed is safer than that

But that’s one of the few things you can really say for read committed: it’s not as crazy as read uncommitted.

But read committed is still crazy. And it is using more locks because it’s protecting the data as it’s reading it. The problem with using more locks is you can introduce more blocking. Maybe I have more than one person updating data in my database.

If I’ve got a lot of readers and I have more and more writers, as I start to scale up, the blocking in read committed can increase.

Now because of the wackiness in read committed, I do have other options in the classic isolation levels in SQL Server.

Maybe I don’t want to have this problem of saying, okay I saw this row more than once…

I counted the money more than once. We can increase our isolation level. If we raise it to repeatable read, it starts to say: okay, that phenomenon of a non-repeatable read, where I read that this name is Aaabin, but if I were to go back and read it again in the same transaction it wouldn’t be there anymore…

Okay, we can’t have that anymore. Repeatable read still uses locks to protect those readers, and it holds those locks for longer, until the end of the transaction. So that’s good, but that’s good for my readers.
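For illustration (again, dbo.Accounts is a hypothetical table), asking for repeatable read for a transaction might look like this:

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRAN;
    -- The shared locks taken by this read are held until COMMIT,
    -- so reading the same row again returns the same value
    SELECT Balance FROM dbo.Accounts WHERE AccountId = 42;
    SELECT Balance FROM dbo.Accounts WHERE AccountId = 42;
COMMIT;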

But, what if I’ve got more and more writers?

I have more and more blocking: readers blocking writers, blocking readers, blocking writers. So raising my safety, going up to repeatable read, can mean that I have more blocking. It can be harder on performance.

But even repeatable read has problems that can happen in it, because there are still some scenarios where data is being inserted and it can suffer from something called the phantom phenomenon, where there are phantom rows that it didn’t see.

You can raise the isolation even farther in the classic isolation levels and go to serializable

And serializable not only says, okay, I want to make sure that if I read this row I can read it again, it also says, I want to protect a whole range of data, so that rows can’t be inserted into a range I already read, rows I would miss because that data wasn’t there when I read the range. That protects the data in another way, and it holds those locks for longer, so it can be an even bigger problem for blocking.
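A quick sketch of that range protection, with the same hypothetical table:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRAN;
    -- Key-range locks protect the range that was read, so other
    -- sessions can't insert rows into it until we commit
    SELECT COUNT(*) FROM dbo.Accounts WHERE LastName LIKE 'A%';
COMMIT;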

So the reason that read committed is the default isolation level in SQL Server…

…Isn’t that we want to have phantom rows. It isn’t that we want to have non-repeatable reads.

It’s not that we think non-repeatable reads are great. It’s that, with the way SQL Server classically implemented isolation levels using locks, the more safety we want, the harder it is to scale our performance, because the more blocking we have.

We have different types of locks, and they can be held for longer across our transactions.

With classic isolation levels in SQL Server, it’s a very difficult choice to be like, would you like more blocking or would you like more incorrect data? You can have wrong data, or you can have lots of blocking and deadlocks, which would ya like?

That’s not a real happy choice.

The default isolation level was set as read committed, but we did get more alternatives, and a long time ago at that. This wasn’t the happiest choice.

In SQL Server 2005, we got some new alternatives which are very, very cool

We got version-based isolation levels, and we got two of them: read committed snapshot and snapshot. These version-based isolation levels have a different cost. There are still locks involved; writers will still block writers, but the isolation for readers is handled differently.

When modifications happen in SQL Server, SQL Server tracks when each modification happened, so that it knows, based on when your transaction started reading data, what’s consistent with the time you started. In addition to tracking when your transaction started, if your data has been updated or deleted, it also keeps a copy of the previous version, what the row looked like before the change. It keeps that in tempdb.

If your data’s deleted, well, there it is: the current version says you were deleted, but the previous version is the copy of the data. So in this image we’ve got the bug, and the new version of the bug has a crown on it, but we have the old version of the data stored in tempdb, so anybody whose transaction started at a time that means they should get the old version of the data can see it from tempdb. And if the current row is locked in the process of that update, they don’t have to wait for that lock to be released.

The reader doesn’t get blocked by the writer, they just go look at the version in tempdb.
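If you want to play with these in a test environment, a minimal sketch of enabling them at the database level (YourDatabase is a placeholder) looks like this:

-- Plain read committed starts using row versions for readers.
-- Needs exclusive access to the database, or WITH ROLLBACK IMMEDIATE.
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;

-- Lets sessions explicitly request SNAPSHOT isolation
ALTER DATABASE YourDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;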

Read committed snapshot isolation (RCSI) vs Snapshot

Read committed snapshot is implemented at a statement level

Even if I do ‘begin transaction’ and I run ten queries, each statement in that transaction sees data consistent with the time that statement started.

With snapshot isolation it’s very similar, but it does provide higher isolation because it shows data consistent with the time that data was first accessed in the entire transaction.

And that is really great, because let’s say I’ve got a report for my CTO that contains lots of really sensitive information, and multiple queries are run serially to gather that data. Most of the time, I want them all to be consistent with one another. I don’t want my CTO to add up all the numbers in the different sections of the report and say, these numbers don’t add up.

I want her to see data that matches.
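As a sketch of that report scenario (these table names are made up), a snapshot transaction makes every section read from the same point in time:

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;   -- requires ALLOW_SNAPSHOT_ISOLATION ON
BEGIN TRAN;
    -- The snapshot is established at the first data access,
    -- so all three queries see data as of that moment
    SELECT SUM(Amount) FROM dbo.SalesSummary;
    SELECT SUM(Amount) FROM dbo.RegionalDetail;
    SELECT SUM(Amount) FROM dbo.Adjustments;
COMMIT;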

Timing, of course, is very important when it comes to snapshot transactions.

And of course there is overhead to these, but these are really exciting alternatives.

RCSI is the default in Azure SQL Database

They are so exciting and they are so useful that Azure SQL Database uses a different default isolation level than if you install SQL Server yourself. It uses read committed snapshot isolation as its default isolation level.
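You can check what any database is set to, in Azure SQL Database or on your own instance, with a quick look at sys.databases:

SELECT name,
       is_read_committed_snapshot_on,   -- 1 means RCSI is on
       snapshot_isolation_state_desc    -- ON means snapshot isolation is allowed
FROM sys.databases;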

And this, I think, was actually something very brave that the SQL Server team did. This is a really great thing and I’m really glad they did it. But it’s not necessarily an easy choice. The reason it’s not an easy choice is that if you’re migrating a database to read committed snapshot, whether or not you’re using Azure SQL Database, and you’ve been using read committed for a long time, there are certain race conditions you can hit that can cause incorrect data.

So by saying, okay, we are just going to use this in Azure SQL Database, they’re saying: you need to clean up your code, ‘cause this is better. This is more concurrent.

And in Azure SQL Database, you could be using a SQL Server instance that’s shared with other people. Performance is really important.

You don’t want a lot of blocking on one instance causing queries to pile up and use more resources. And yeah, there are other ways to control resources in there, right, but they want this to be a fast OLTP environment. Azure SQL Database is primarily designed for OLTP apps: lots of smaller inserts, updates, and deletes. Data warehousing has its own hosted offering.

The fact that this is the default in Azure SQL Database is really a statement that, when it comes to OLTP applications, it is worth taking the time to deal with those race conditions to get better performance and more consistent isolation. Because we can prove, and we do this in a different session, that read committed snapshot uses those versions to avoid counting rows twice, or missing rows, or producing combinations of data that never existed.

What should you do next?

I wish I could tell you to just turn on read committed snapshot isolation, but it’s not quite that simple. If you suddenly change your isolation level to snapshot or read committed snapshot, then depending on your modification rate and your environment, you could overwhelm tempdb with versioning if version cleanup isn’t allowed to happen. You could also hit some of those race conditions with read committed snapshot (RCSI) that cause incorrect results.
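One way to keep an eye on that is the version store DMV (this one is available in SQL Server 2017 and later; older builds have a Version Store Size performance counter instead):

-- How much space row versions are currently taking up, per database
SELECT DB_NAME(database_id) AS database_name,
       reserved_space_kb / 1024. AS version_store_mb
FROM sys.dm_tran_version_store_space_usage
ORDER BY reserved_space_kb DESC;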

You don’t want to just pull the trigger right away. You want to learn about that isolation level and figure out: is snapshot right for me, for some of my queries? Or is read committed snapshot right for me, and do I need to look for those race conditions? Or are there some places where I want to use the classic isolation levels, where I don’t want to do versioning and I want to actually use serializable? Now that’s a rarer thing. I’m not suggesting that you just go use serializable, you could just get a bunch of deadlocks, right?

You don’t want to just jump into these things, because we can make data wrong in different ways. Instead, I think the first step on this journey is recognizing: okay, read committed lets us return incorrect data. And read uncommitted is even worse. Like, even more garbage.

The first question when you look at your applications is…

How much truthiness is required? And I use the word truthiness with more seriousness than you might think here, because for some applications, I have actually had the CTO of a company say, “Oh, the data actually doesn’t have to be right for this. No, I will sign some paper that says for this, if the data is wrong, it is totally fine. We do not guarantee to our customers that the data is right at all; this is just something to put on the screen to make them happy, even though it could be wrong.” Now that’s fairly rare, but it happens sometimes. So what you want to know is, for this application, how important is it that the data is correct? Because we have a lot of applications, and we can’t just generally go changing the isolation levels on all of them; we want to work with the most critical ones first.

We also want to know when the data is modified, and whether it’s read while it’s being modified. Some applications only have modifications done in a window when nobody’s reading the data. Those are fewer and fewer; that used to be more common when people didn’t access websites 24 hours a day, but we do still have some of those applications. And all the problems that I’ve been showing you have to do with data that’s being modified while it’s being read. So some apps you can rule out, just ‘cause they aren’t going to have the problems. They are fewer and fewer, though.

And then research: what isolation levels are you currently using? This can take some digging around, because there may be a lot of NOLOCK hints in there. You may be using a lot of read uncommitted. There may be apps that are using serializable and it wasn’t clear to you. You can find this in the dynamic management views, looking at sessions and connections. It’s a little tricky, because I can see what isolation level a connection has set, but of course we can also use hints in our code. I can set my isolation level when I access specific tables using hints, so I have to look at multiple levels, and often look in the code itself, to figure out what isolation levels I’m currently using (a query like the sketch below can be a starting point). Then it’s time to prioritize.
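As a starting point for that digging, a sketch like this against sys.dm_exec_sessions shows what each session has set (it won’t catch hints buried in the code, which is why you still have to read the code):

SELECT session_id,
       program_name,
       CASE transaction_isolation_level
            WHEN 0 THEN 'Unspecified'
            WHEN 1 THEN 'Read Uncommitted'
            WHEN 2 THEN 'Read Committed'
            WHEN 3 THEN 'Repeatable Read'
            WHEN 4 THEN 'Serializable'
            WHEN 5 THEN 'Snapshot'
       END AS isolation_level
FROM sys.dm_exec_sessions
WHERE is_user_process = 1;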

Okay, if I’m using read committed, and I’m using read uncommitted and we’ve never thought about this before, we’re going to look at our most critical application, we’re going to figure out our data modification patterns and we’re going to start planning a change to figure out how to make sure that we are getting correct data to our users.