Index Types: Heaps, Primary Keys, Clustered and Nonclustered Indexes (Dear SQL DBA Episode 28)
Dear SQL DBA…
I see HEAP tables are found even when I know those tables have a clustered index, and I see a lot of forwarded records. This happens to 5 tables in my database. I can see the clustered and in some ones the non-clustered indexes… why are some scripts reporting them as heaps?
Here is an example of what I’m seeing from a script:
dbo.AnonymousTable (0) [HEAP] [RID] / 2,126,697 reads, 308,401 writes / 17,847 forwarded records fetched;
This puzzles a lot of people when they start working with indexes in SQL Server. The concepts here overlap and there are quite a few different ways you can do things.
Watch the 27 minute video discussing this, or scroll on down to read a written version of the video, complete with code samples.
First up, let’s clarify the concepts.
Concept 1- How a disk-based table is physically ordered
I say “disk-based” because we’re not talking about in-Memory tables here. I’m not getting into those today for the sake of simplicity. (For a high level overview of disk-based vs in-Memory tables, check out this post.)
Clustered Index: This will always have IndexID = 1
Clustered rowstore tables – Traditional clustered index: you choose clustering key column(s) that determine the sort order of the data
Clustered columnstore tables – Clustered columnstore indexes don’t have key columns. Every column in the table is stored in columnar format, which uses LOB (large-object) pages
The syntax to create a clustered index may look like one of these samples:
/* These samples create clustered indexes that are NOT also a primary key */ /* 1. Two step process... */ CREATE TABLE dbo.ClusterMeToo ( MakeMeAClusteredIndex BIGINT IDENTITY NOT NULL, Col2 NVARCHAR(100) ); GO CREATE UNIQUE CLUSTERED INDEX CX_ClusterMeToo_MakeMeAClusteredIndex ON dbo.ClusterMeToo (MakeMeAClusteredIndex); GO /* 2. Inline index create. This syntax works in SQL Server 2014+ */ CREATE TABLE dbo.ClusterMe ( MakeMeAClusteredIndex BIGINT IDENTITY NOT NULL, Col2 NVARCHAR(100), INDEX CX_ClusterMe_MakeMeAClusteredIndex UNIQUE CLUSTERED (MakeMeAClusteredIndex) ); GO /* 3. Clustered Columnstore Example. This exists in SQL Server 2014+ */ CREATE TABLE dbo.ClusteredColumnstore ( Col1 BIGINT IDENTITY NOT NULL, Col2 NVARCHAR(100), INDEX CX_ClusteredColumnstore_MakeMeAClusteredIndex CLUSTERED COLUMNSTORE ); GO
- You may want your clustered index to also be a primary key, in which case you want a different code sample. Keep reading!
- It’s a best practice to make rowstore indexes unique, but it’s not required.
- You can create a clustered columnstore index after the table has been created. You may even want to temporarily create a rowstore clustered index before you create the columnstore clustered index! Read the post, “Columnstore Index Performance: Rowgroup Elimination” from the SQL Server Tiger Team to learn more.
Heap: This will always have IndexID = 0
- When you don’t define a clustered index, SQL Server uses a secret Row Identifier (RID) behind the scenes. You can’t use the RID in queries.
- You may have nonclustered indexes (rowstore or columnstore) on a heap. Those are secondary physical structures. They’ll always have an IndexID greater than 1, because that’s reserved for clustered indexes.
The syntax to create a heap is something like this:
CREATE TABLE dbo.HeapExample ( IMightBeAPK BIGINT IDENTITY NOT NULL, Col2 NVARCHAR(100) ); GO /* It remains a heap if you do NOT run any commands like this: CREATE [UNIQUE] CLUSTERED INDEX [index name] on ( [Column Name(s) ] or ALTER TABLE ADD CONSTRAINT [constraint name] PRIMARY KEY CLUSTERED ( [Column Name(s)] ) CREATE CLUSTERED COLUMNSTORE INDEX [index name] on */
Concept 2- Primary Key: the column or columns that define a unique row for business purposes
A table may have only one primary key (PK). (You can enforce uniqueness in other ways with unique constraints and unique indexes, though.)
A primary key is secretly an index! It can be clustered or nonclustered.
Your primary key may technically be a “surrogate key”. That just means that it’s not a column that “naturally” identifies the data– it may be an INT, BIGINT, or UNIQUEIDENTIFIER column that was designed to uniquely identify the row, even though the number or uniqueidentifier itself isn’t meaningful to look at.
Clustered primary key: This will always have IndexID = 1 (it’s a clustered index behind the scenes, as well as a constraint)
- You may choose to make the clustered index ALSO the primary key when you create the index / constraint. This means the column or columns that uniquely identify a row also define the physical sort order of the table on disk.
- There’s no such thing as a clustered primary key on a clustered columnstore table. Clustered columnstore indexes don’t have key columns– every column in the table is stored in a columnar format.
The syntax to create a clustered primary key can look like this:
CREATE TABLE dbo.ClusteredPKExample ( MakeMeACXPK BIGINT IDENTITY NOT NULL, Col2 NVARCHAR(100), CONSTRAINT PK_ClusteredPKExample_MakeMeACXPK PRIMARY KEY CLUSTERED (MakeMeACXPK) ); GO /* Or a two step create... */ CREATE TABLE dbo.AnotherClusteredPKExample ( MakeMeACXPK BIGINT IDENTITY NOT NULL, Col2 NVARCHAR(100) ); GO ALTER TABLE dbo.AnotherClusteredPKExample ADD CONSTRAINT PK_AnotherClusteredPKExample_MakeMeACXPK PRIMARY KEY CLUSTERED (MakeMeACXPK); GO
Nonclustered primary key: This will always have IndexID > 1
- Nonclustered primary key constraints are nonclustered indexes behind the scenes
- A nonclustered primary key may be created on a heap, or a table with a clustered index
- Antipattern: sometimes people create a clustered index and a non-clustered primary key on the same column or columns. This means your table has to maintain TWO indexes on the same key column, when you could just have one. It’s more efficient to create a clustered primary key.
The syntax to create a nonclustered primary key may look something like this:
/* 1. This syntax works in SQL Server 2014+ */ CREATE TABLE dbo.NonclusteredPKExample ( MakeMeCX BIGINT IDENTITY NOT NULL, BusinessKey NVARCHAR(50) NOT NULL, Col2 NVARCHAR(100), INDEX CX_NonclusteredPKExample_MakeMeCX UNIQUE CLUSTERED (MakeMeCX), CONSTRAINT PK_NonclusteredPKExample_BusinessKey PRIMARY KEY NONCLUSTERED (BusinessKey) ); GO /* 2. Three step create... */ CREATE TABLE dbo.AnotherNonclusteredPKExample ( MakeMeCX BIGINT IDENTITY NOT NULL, BusinessKey NVARCHAR(50) NOT NULL, Col2 NVARCHAR(100) ); GO ALTER TABLE dbo.AnotherNonclusteredPKExample ADD CONSTRAINT PK_AnotherNonclusteredPKExample_BusinessKey PRIMARY KEY NONCLUSTERED (BusinessKey); GO CREATE UNIQUE CLUSTERED INDEX cx_AnotherNonclusteredPKExample_MakeMeCX on dbo.AnotherNonclusteredPKExample (MakeMeCX); GO /* 3. Clustered Columnstore with PK, two step version. This works in SQL Server 2016+ */ CREATE TABLE dbo.ClusteredColumnstoreWithPKExample ( BusinessKey NVARCHAR(50) NOT NULL, Col1 BIGINT IDENTITY NOT NULL, Col2 NVARCHAR(100), CONSTRAINT PK_ClusteredColumnstoreWithPKExample_BusinessKey PRIMARY KEY NONCLUSTERED (BusinessKey) ); GO CREATE CLUSTERED COLUMNSTORE INDEX cCx_ClusteredColumnstoreWithPKExample on dbo.ClusteredColumnstoreWithPKExample; GO
Back to the question: what happened?
Based on the initial email, I was pretty sure that the tables in question were accidentally created as heaps, each with a nonclustered primary key.
The big giveaways were that indexid zero only ever exists on a heap, and forwarded records can also only occur in a heap object. In a brief email conversation back and forth, we confirmed that this was the case.
I’ve run into this quite a few times in the wild. Sometimes the tables were a heap and someone later thought to add primary keys, and made them nonclustered without thinking. Sometimes people just accidentally use the wrong syntax at create time.
Which type of indexes should I use?
In the case of our questioner, they likely want to just recreate their nonclustered primary keys and clustered primary keys. That’s kind of a pain if you’ve got a lot of foreign keys or SQL Server replication set up.
While it’s generally a bad practice to have a unique clustered index and a non-clustered primary key on the same columns, because they’re duplicate indexes… if the tables are small and don’t have a lot of modifications, I’m not going to pin a scarlet letter on you for doing it a few times.
But for general use, let’s make some generalizations!
Clustered primary key: When the set of columns that uniquely identify a row are also very frequently used in joins and the ‘where’ clause of your query, ordering the table by those columns is usually a great fit. The clustered index automagically has direct access to all the in-row columns in a table without having to look it up in another structure.
Unique clustered index with a different nonclustered primary key (rowstore): Sometimes you have a table where it makes sense to physically sort the table on different columns than the ones that make up the primary key on the table:
- Maybe the most important use of the table is range scan on a different column or columns and those queries access lots of columns in the table. Using that as the clustered index can be very powerful.
- Maybe the primary key is a very wide set of columns on a large table. Having a wide set of columns in the clustered index bloats all your nonclustered indexes (because it’s secretly added to them). Depending on how the table is queried, another column or set of columns may work better as the clustered index.
Heap tables (possibly with a nonclustered PK, depending what you’re doing): You don’t always need a clustered index. Or any index for that matter. Heaps can have some weird problems, like those forwarded records, but that’s for another day. Heaps can be good for:
- Tables that you always scan, where you want all the columns. Scanning heaps can be really fast!
- Tables that you query in a very specific, controlled, targeted manner which is suited to non-clustered indexes
- Staging tables where you’re doing quick and dirty loads and queries. Sometimes it’s faster to not create a clustered index, depending on what you’re doing. Test and use what performs best while making sure your data is valid: no shame in that.
Columnstore indexes (clustered and nonclustered): These are extremely powerful when you need to scan a lot of rows to do aggregations.
- Data warehouse tables are the obvious fit here, and are where the feature started.
- Since data warehouses are all about analytics and massive tables, the natural patterns are updatable clustered columnstore indexes, likely with table partitioning
- Whether you should use primary and foreign keys in your data warehouse is something people fight about. I accept people of all key choices here.
- OLTP databases can be a good fit, also!
- Writeable nonclustered columnstore indexes are a big feature in SQL Server 2016, and are designed to help people with busy workloads which combine OLTP and analytic queries in the same database. That’s an increasingly common requirement: lots of businesses just cannot wait for an ETL to run before analysis.
- The natural pattern in an OLTP database using this would be a rowstore table with a Clustered PK and a nonclustered columnstore index on the columns you use for analytics
- Also in SQL Server 2016:
- Optimistic locking (snapshot / RCSI) is supported against columnstore indexes
- You can read columnstore indexes on Availability Group secondaries (which automagically use snapshot behind the scenes)
Is my excitement for columnstore indexes in SQL Server 2016 showing?
On a related note…
Sitting right next to me as I write this is the brand spanking new fifth edition of Louis Davidson’s Pro SQL Server Relational Database Design and Implementation, updated for SQL Server 2016. You can buy the book now from APress or Amazon.
Those are not affiliate links and this is not a sponsored post. I’m just excited to read the fifth edition, and if you got this far in this post then you’d probably like it, too.
If you’re in the process of modeling a SQL Server database, get Louis’ book. It will help you along the way, and in designing future projects as well.