
Are Bad Statistics Making My Query Slow? (Dear SQL DBA Episode 39)

An important query is suddenly slow. Is it because statistics are out of date? This is tricky to figure out, and updating statistics right away can make troubleshooting even harder. Learn how to use query execution plans to get to the heart of the question and find out if stats are really your problem, or if it’s something else.

In this 35 minute episode:

  • 00:39 SQL Server 2017 Announced
  • 01:10 New video from Microsoft’s Joe Sack demonstrating Adaptive Query Processing
  • 03:05 This week’s question: Are bad stats making my query slow?
  • 05:26 Demo of finding plan in cache and analyzing stats begins
  • 28:17 What to do when stats ARE the problem

Code samples: https://gist.github.com/LitKnd/f07848d59cedc61fd057d12ab966f703

Audio-only version (downloadable)

Video version

Related links

SQL Server 2017 Adaptive Query Processing video by Joe Sack

Michael J Swart on finding Dark Matter Queries

Slow in the Application, Fast in SSMS? An SQL text by Erland Sommarskog

Got a question for Dear SQL DBA? Ask!

Does Truncate Table Reset Statistics?

Short answer: the SQL Server optimizer will know that the table was truncated, but statistics might not update when you expect.

For the long answer, let's walk through an example using the WideWorldImporters sample database. I'll be using Trace Flags 3604 and 2363 to get SQL Server to print information about how it optimized my query out to the messages tab. (Thanks to Paul White for blogging about these trace flags.)

First, a fresh restore of WideWorldImporters

USE master;
GO

IF DB_ID('WideWorldImporters') IS NOT NULL
ALTER DATABASE WideWorldImporters SET OFFLINE WITH ROLLBACK IMMEDIATE

RESTORE DATABASE WideWorldImporters FROM DISK=
    'S:\MSSQL\Backup\WideWorldImporters-Full.bak'
    WITH REPLACE
GO

USE WideWorldImporters;
GO

Before we do anything, what do the statistics look like on Sales.OrderLines?

Here’s the query that I’m using to inspect the statistics:

SELECT 
    sp.last_updated,
    stat.name as stats_name,
    STUFF((SELECT ', ' + cols.name
        FROM sys.stats_columns AS statcols
        JOIN sys.columns AS cols ON
            statcols.column_id=cols.column_id
            AND statcols.object_id=cols.object_id
        WHERE statcols.stats_id = stat.stats_id and
            statcols.object_id=stat.object_id
        ORDER BY statcols.stats_column_id
        FOR XML PATH(''), TYPE
    ).value('.', 'NVARCHAR(MAX)'), 1, 2, '')  as stat_cols,
    sp.modification_counter,
    sp.rows,
    sp.rows_sampled
FROM sys.stats as stat
CROSS APPLY sys.dm_db_stats_properties (stat.object_id, stat.stats_id) AS sp
JOIN sys.objects as so on 
    stat.object_id=so.object_id
JOIN sys.schemas as sc on
    so.schema_id=sc.schema_id
WHERE 
    sc.name= 'Sales'
    and so.name='OrderLines'
ORDER BY 1 DESC
GO

Statistics were last updated on June 2, 2016. We’ll be mostly looking at the statistic on Quantity throughout the example, so I’ve highlighted it:

(Screenshot: the statistics on Sales.OrderLines before any changes, with the statistic on Quantity highlighted)

Let’s run a query that loads the statistic on Quantity

Before we truncate the table, let’s take a peek into how SQL Server optimizes a query that cares about rows in Sales.OrderLines with Quantity > 10. I’m using trace flags 3604 and 2363 to make SQL Server print information about how it used statistics to optimize this to my messages tab.

SELECT *
FROM Sales.OrderLines
WHERE Quantity > 10
    OPTION
(
    QUERYTRACEON 3604,
    QUERYTRACEON 2363,
    RECOMPILE
)
GO

Here’s the info on the messages tab:

Begin selectivity computation

Input tree:

  LogOp_Select

      CStCollBaseTable(ID=1, CARD=231412 TBL: Sales.OrderLines)

      ScaOp_Comp x_cmpGt

          ScaOp_Identifier QCOL: [WideWorldImporters].[Sales].[OrderLines].Quantity

          ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=10)

Plan for computation:

  CSelCalcColumnInInterval

      Column: QCOL: [WideWorldImporters].[Sales].[OrderLines].Quantity

Loaded histogram for column QCOL: [WideWorldImporters].[Sales].[OrderLines].Quantity from stats with id 7

Selectivity: 0.44231

Stats collection generated: 

  CStCollFilter(ID=2, CARD=102356)

      CStCollBaseTable(ID=1, CARD=231412 TBL: Sales.OrderLines)

End selectivity computation

Estimating distinct count in utility function

Input stats collection:

    CStCollFilter(ID=2, CARD=102356)

        CStCollBaseTable(ID=1, CARD=231412 TBL: Sales.OrderLines)

Columns to distinct on:QCOL: [WideWorldImporters].[Sales].[OrderLines].OrderLineID


Plan for computation:

  CDVCPlanUniqueKey

Result of computation: 102356


(102035 row(s) affected)

Highlights: one of the first things SQL Server thinks about is the number of rows in the table

Right at the beginning, we see: “CStCollBaseTable(ID=1, CARD=231412 TBL: Sales.OrderLines)”

That ‘CARD’ number is the optimizer thinking about how many rows are in this table. If you glance back up at the table statistics, the most recent statistic to be updated was on the ‘LastEditedWhen’ column. When that statistic was updated, there were 231,412 rows in the table.

SQL Server decides that it wants detail on the Quantity column to figure out how to run this query, so we see that it loads that statistic up to use: “Loaded histogram for column QCOL: [WideWorldImporters].[Sales].[OrderLines].Quantity from stats with id 7”

Alright, let’s truncate this table

I wipe out all the rows with this command:

TRUNCATE TABLE Sales.OrderLines;
GO

Now, I wouldn’t expect truncating the table to automatically update the statistics.

SQL Server updates statistics when they’re used to optimize a query — so if nobody queries this table for six months, I wouldn’t expect the stats to update for six months.

Let’s re-run our query, trace flags and all:

SELECT *
FROM Sales.OrderLines
WHERE Quantity > 10
    OPTION
(
    QUERYTRACEON 3604,
    QUERYTRACEON 2363,
    RECOMPILE
)
GO

The messages tab has less info this time. It's much more concise!

Begin selectivity computation

Input tree:

  LogOp_Select

      CStCollBaseTable(ID=1, CARD=1 TBL: Sales.OrderLines)

      ScaOp_Comp x_cmpGt

          ScaOp_Identifier QCOL: [WideWorldImporters].[Sales].[OrderLines].Quantity

          ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=10)

Plan for computation:

  CSelCalcFixedFilter (0.3)

Selectivity: 0.3

Stats collection generated: 

  CStCollFilter(ID=2, CARD=1)

      CStCollBaseTable(ID=1, CARD=1 TBL: Sales.OrderLines)

End selectivity computation


(0 row(s) affected)

SQL Server knows that we blew away all those rows

This time we see “CARD=1 TBL: Sales.OrderLines”

SQL Server doesn’t like to estimate 0 for empty tables. It likes to estimate 1. It knows this table is empty.

With this information, it chooses a different plan for computation. The plan doesn't require looking at the Quantity column this time; we don't have any lines about that at all.

But the statistics don’t look any different

You might expect to see that the statistic on Quantity had updated. I expected it, before I ran through this demo.

But SQL Server never actually had to load up the statistic on Quantity for the query above. So it didn’t bother to update the statistic. It didn’t need to, because it knows that the table is empty, and this doesn’t show up in our column or index specific statistics.

To verify, I just rerun my metadata query above, and things look the same:

(Screenshot: the statistics metadata for Sales.OrderLines looks the same as before the truncate)

What if the table has exactly one row?

Let’s insert one and find out:

INSERT INTO [Sales].[OrderLines] (OrderLineID, OrderID, StockItemID, Description, PackageTypeID, Quantity, UnitPrice, TaxRate, PickedQuantity, PickingCompletedWhen, LastEditedBy, LastEditedWhen)
     VALUES (1, 45, 164, '32 mm Double sided bubble wrap 50m', 7, 50, 112.00, 15.000, 50, '2013-01-02 11:00:00.0000000', 4, '2013-01-02 11:00:00.0000000')
GO

Now we run our familiar query, with all its merry trace flags:

SELECT *
FROM Sales.OrderLines
WHERE Quantity > 10
    OPTION
(
    QUERYTRACEON 3604,
    QUERYTRACEON 2363,
    RECOMPILE
)
GO

And here’s what SQL Server has to say about optimizing that…

Begin selectivity computation

Input tree:

  LogOp_Select

      CStCollBaseTable(ID=1, CARD=1 TBL: Sales.OrderLines)

      ScaOp_Comp x_cmpGt

          ScaOp_Identifier QCOL: [WideWorldImporters].[Sales].[OrderLines].Quantity

          ScaOp_Const TI(int,ML=4) XVAR(int,Not Owned,Value=10)

Plan for computation:

  CSelCalcColumnInInterval

      Column: QCOL: [WideWorldImporters].[Sales].[OrderLines].Quantity

Loaded histogram for column QCOL: [WideWorldImporters].[Sales].[OrderLines].Quantity from stats with id 7

Selectivity: 1

Stats collection generated: 

  CStCollFilter(ID=2, CARD=1)

      CStCollBaseTable(ID=1, CARD=1 TBL: Sales.OrderLines)

End selectivity computation


(1 row(s) affected)

One row is enough to use our column statistic

Looking at the beginning, CARD=1 for Sales.OrderLines, just like it did after we truncated the table. But SQL Server does something different this time, indicating that it now knows that the table isn’t really empty.

It goes back to the CSelCalcColumnInInterval plan to optimize. And it loads up the column stat for the Quantity column.

Since this statistic was loaded into memory, it should have auto-updated based on my database settings. Sure enough, it did:

(Screenshot: the statistic on Quantity has now been updated)

SQL Server knows when you've truncated a table

And the fact that the table has been truncated may mean that it doesn’t need to use statistics on the table when optimizing queries. After all, it’s an empty table, so it can take shortcuts!

So don’t get too confused if statistics look way out of date for a truncated table. Instead, ask yourself, “why am I querying a truncated table?” (Related disclaimer: I only tested this on SQL Server 2016.)

Want to learn more about statistics in SQL Server? Start here.

 

 

 

When Did SQL Server Last Update That Statistic? How Much Has Been Modified Since? And What Columns are in the Stat?

Whether I'm working as a DBA, a consultant, a teacher, or just answering questions in my inbox, I always end up needing a script to inspect statistics one way or another.

Here are some freshly written scripts for a classic DBA question: what’s going on in my stats?

How to get statistics details on SQL Server 2008 R2 and higher

For most modern versions of SQL Server, I like to join to sys.dm_db_stats_properties() — you can get a LOT of detail in a single query! (This works with SQL Server 2008 R2 SP2+ / SQL Server 2012 SP1+ / All higher versions)

Here’s the query, looking at a sample table in the WideWorldImporters database:

SELECT 
	stat.auto_created,
	stat.name as stats_name,
	STUFF((SELECT ', ' + cols.name
		FROM sys.stats_columns AS statcols
		JOIN sys.columns AS cols ON
			statcols.column_id=cols.column_id
			AND statcols.object_id=cols.object_id
		WHERE statcols.stats_id = stat.stats_id and
			statcols.object_id=stat.object_id
		ORDER BY statcols.stats_column_id
		FOR XML PATH(''), TYPE
	).value('.', 'NVARCHAR(MAX)'), 1, 2, '')  as stat_cols,
	stat.filter_definition,
	stat.is_temporary,
	stat.no_recompute,
	sp.last_updated,
	sp.modification_counter,
	sp.rows,
	sp.rows_sampled
FROM sys.stats as stat
CROSS APPLY sys.dm_db_stats_properties (stat.object_id, stat.stats_id) AS sp
JOIN sys.objects as so on 
	stat.object_id=so.object_id
JOIN sys.schemas as sc on
	so.schema_id=sc.schema_id
WHERE 
	sc.name= 'Warehouse'
	and so.name='StockItemTransactions'
ORDER BY 1, 2;
GO

The output looks like this:

(Screenshot: the query output, showing one row per statistic on Warehouse.StockItemTransactions)

Can you guess why the top row has NULL values for last_updated, modification_counter, rows, and rows_sampled? (Once you have your guess, the answer is here.)

What if my SQL Server is a dinosaur? Here’s a script for SQL Server 2005 and 2008.

If you're using SQL Server 2008 or prior, you don't have the luxury of sys.dm_db_stats_properties().

For these instances, getting the date statistics were last updated isn't a big deal: we can call the STATS_DATE() function. But when it comes to how many rows have been modified, we're really guessing. We have to use a column called rowmodctr in sys.sysindexes, which is only an estimate of how many rows have changed, and it was wildly inaccurate in some versions of SQL Server 2005.

But an estimate is better than no information, as long as you know it's a guess!

Here’s the query:

SELECT 
    stat.auto_created,
    stat.name as stats_name,
    STUFF((SELECT ', ' + cols.name
        FROM sys.stats_columns AS statcols
        JOIN sys.columns AS cols ON
            statcols.column_id=cols.column_id
            AND statcols.object_id=cols.object_id
        WHERE statcols.stats_id = stat.stats_id and
            statcols.object_id=stat.object_id
        ORDER BY statcols.stats_column_id
        FOR XML PATH(''), TYPE
    ).value('.', 'NVARCHAR(MAX)'), 1, 2, '')  as stat_cols,
    stat.filter_definition,
    stat.is_temporary,
    stat.no_recompute,
    STATS_DATE(stat.object_id, stat.stats_id) as last_updated,
    ISNULL(
    /* Index stats */
    (SELECT rowmodctr
        FROM sys.sysindexes as sysind
        JOIN sys.indexes as ind on 
            sysind.id=ind.object_id
            and sysind.indid=ind.index_id
        where sysind.id=stat.object_id
        and ind.name=stat.name 
    ),
    /* Column stats */
    (SELECT rowmodctr
        FROM sys.sysindexes as sysind
        where sysind.id=stat.object_id
        and sysind.indid in (0,1)
    )) 
        AS estimated_modification_counter
FROM sys.stats as stat
JOIN sys.objects as so on 
    stat.object_id=so.object_id
JOIN sys.schemas as sc on
    so.schema_id=sc.schema_id
WHERE 
    sc.name= 'Warehouse'
    and so.name='StockItemTransactions'
ORDER BY 1, 2;
GO

And here’s how the output looks:

(Screenshot: the query output, including the estimated_modification_counter column)

See how the estimated modification counter is totally off in the top row, and for the column statistic? That’s because this table has a clustered columnstore index — and ye olde rowmodctr column totally doesn’t understand what’s going on with that!

This script is going to be wrong about things like Columnstore indexes. But if you have Columnstore indexes, you’re on a version of SQL Server that lets you use the non-Dinosaur script above, which is more accurate.

TLDR;

Make sure you use the script for the correct version of SQL Server that you’re running. The “dinosaur” script is less accurate, but better than nothing when you need this.

Can I Use Statistics to Design Indexes? (Dear SQL DBA Episode 18)

Should you look at automatically created statistics on  your tables in SQL Server to help you design better indexes? Learn why in this 20 minute video, or subscribe to the Dear SQL DBA podcast.

No time to watch? Scroll on down, everything is written in article form below the video.

Here’s this week’s question:

Dear SQL DBA,

I’ve noticed that many indexes in my data warehouse aren’t used frequently. Is there a way to use the automatically generated statistics to make useful indexes?

… (insert witty pun about indexes)

I’ve been asked this question several times, and I even remember once wondering this myself.

There’s no good way to analyze the column based statistics that SQL Server automatically generates for the purpose of index design. Let’s talk about why.

First: Why do we have ‘statistics’, anyway?

Let’s say we have a table named dbo.Grades. It has columns for GradeId, ClassId, StudentId, Year, Term, and Grade.

We run a query looking for all StudentIds where Year=1990.

The table has a Clustered Primary Key on GradeId — so the whole table is stored sorted by GradeId. That doesn't help our query at all; we want the StudentIds where Year = 1990.

The table has a nonclustered index on Year and GradeId. That nonclustered index does NOT contain StudentId, though.
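
To make the example concrete, here's a minimal sketch of that hypothetical table and index. The data types and constraint names are my own guesses, not from the original scenario:

CREATE TABLE dbo.Grades (
    GradeId int NOT NULL,
    ClassId int NOT NULL,
    StudentId int NOT NULL,
    [Year] int NOT NULL,
    Term tinyint NOT NULL,
    Grade decimal(5,2) NULL,
    CONSTRAINT PK_Grades PRIMARY KEY CLUSTERED (GradeId)
);
GO

CREATE NONCLUSTERED INDEX ix_Grades_Year_GradeId
    ON dbo.Grades ([Year], GradeId);
GO

/* The query from the example: StudentId is not part of the nonclustered index */
SELECT StudentId
FROM dbo.Grades
WHERE [Year] = 1990;
GO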

So SQL Server has two main options:

  1. Is it better for SQL Server to go look at the nonclustered index and then loop back into the base table and look up the StudentId using the GradeId?
  2. Or it might be easier to just scan the whole dbo.Grades table and check the StudentId on each row. Maybe the school was only open for the year 1990, and all the rows have that value!

How does SQL Server know which is better?

SQL Server uses statistics to guess which way is the most efficient to execute your query.

Statistics vs Indexes (and their relationship)

Statistics are little, tiny pieces of metadata that describe the data in the table — things like an approximation of the number of rows where Year = 1990. Statistics don't take up a lot of space in the database.

When you create an index, a statistic is automatically created for the key columns in the index. This can be a multi-column statistic if you have multiple key columns, but the statistic describes the first column in the index in the most detail.

If you use a predicate like Year = 1990 against a column that isn’t the leading column in an index, SQL Server will create a single-column statistic. (Fine print: automatic stats creation can be disabled using a database setting.)

On the other hand, indexes are COPIES of the data itself for the columns defined in the index. The index on Year and GradeId on the dbo.Grades table takes up extra space on disk and on memory and has copies of all the rows for Year and GradeId.

SQL Server uses statistics to optimize an execution plan.

SQL Server uses indexes within the execution plan to find data.

What does the existence of a system created column statistic tell us?

We’ve talked a lot so far about how much statistics and indexes are related. This is why it seems like statistics might be useful for designing indexes!

But here’s the thing — SQL Server doesn’t track and report on how many times a statistic was used during optimization.

I didn't write the optimizer, but I'm not sad about this at all. I think it's fine, because:

  • Optimization needs to be very fast. SQL Server wants to start your query as soon as it can. Everything that has to be written out during that process costs CPU and time.
  • Just considering a column statistic in optimization doesn’t necessarily mean that a single column index on that column would be useful. There might be a multi-column index that would be useful. Or it might actually be better for it to be scanning the table! We wouldn’t know just by the fact that the statistic had been examined by the optimizer.

To have useful information about which single-column statistics might be needed in an index, SQL Server would have to do a lot of work– and it’s already got a feature in optimization for this at a higher level.

Whenever there’s more than one way to run a query, SQL Server thinks about whether an index would help the query run faster. If it thinks an index would help, it records this using the “Missing Indexes” feature.
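
If you want to see what that feature has recorded, you can query the missing index DMVs. Here's a rough sketch — the DMVs are real, but the weighting in the ORDER BY is just my own rule of thumb, and the data is cleared when the instance restarts:

SELECT
    DB_NAME(mid.database_id) AS database_name,
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.dm_db_missing_index_groups AS mig ON
    mid.index_handle = mig.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs ON
    mig.index_group_handle = migs.group_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;
GO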

I wouldn’t make too many assumptions on a column that lacks statistics, either

If a column does NOT have a statistic on it, that isn’t proof that the column is unused.

Statistics are automatically created on columns where you use joins and ‘where’ predicates.

Some columns may just be returned to the user without being filtered. They wouldn’t need a statistic generated for that. But the columns are still in use– and in some cases, using them as “included columns” in indexes might be useful.

So I wouldn’t use the existence OR non-existence of column statistics to make decisions about indexes.

And anyway, we have other options!

What’s the best way to design indexes?

There are two main approaches you can take. After you've been doing this a while, you'll probably mix the approaches.

  1. Identify the most expensive queries that run in each database, and tune indexes for those. You can find the “most expensive” queries using any of the following tools: SQL Server Query Store, the SQL Server Execution Plan cache, or a monitoring tool that watches execution and persists data from the plan cache and memory for longer analysis (and doesn’t get cleared out by memory pressure or restarts).
  2. Use SQL Server’s “Missing index requests” to see where SQL Server is asking for an index. This isn’t perfect — it has to make those requests in a hurry, and the requests get cleared out when you take the database offline (or rebuild an index on the table).
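
Following up on the first approach: if Query Store is enabled (SQL Server 2016+), a rough starting point for finding the heaviest queries looks something like this. This is only a sketch — adjust the metric and add a time-window filter to taste:

SELECT TOP (10)
    qt.query_sql_text,
    SUM(rs.count_executions) AS executions,
    SUM(rs.avg_duration * rs.count_executions) AS total_duration_microseconds
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q ON
    qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p ON
    q.query_id = p.query_id
JOIN sys.query_store_runtime_stats AS rs ON
    p.plan_id = rs.plan_id
GROUP BY qt.query_sql_text
ORDER BY total_duration_microseconds DESC;
GO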

Want to learn more?

Check out my article “Updating Statistics in SQL Server: Maintenance Questions and Answers,” or my article and podcast episode “Teach Yourself SQL Server Performance Tuning”.

Managing Statistics in SQL Server for DBAs (videos)

Want to learn more about managing statistics updates in SQL Server? Watch my free 27 minute presentation on managing statistics:

You can also watch 12 minutes of audience Q&A on statistics from when I presented this on a live Hangout on Air. Questions include:

  • Should I create single column statistics for non-leading key columns in an index?
  • What is asynchronous statistics update?
  • What are these statistics with “WA_sys” names?
  • I have a bunch of user created statistics and it's blocking a release. Is it safe to drop them?

Got more questions on statistics? Check out my reference post on Updating Statistics in SQL Server, and ask away in the comments of the post!

Updating Statistics in SQL Server: Maintenance Questions & Answers

I’ve been asked a lot of questions about updating statistics in SQL Server over the years. And I’ve asked a lot of questions myself! Here’s a rundown of all the practical questions that I tend to get about how to maintain these in SQL Server.

I don't dig into the internals of statistics and optimization in this post. If you're interested in that, head on over and read the fahhhbulous white paper, Statistics Used by the Query Optimizer in SQL Server 2008. Then wait a couple days and chase it with its charming cousin, Optimizing Your Query Plans with the SQL Server 2014 Cardinality Estimator.

I’m also not talking about statistics for memory optimized tables in this article. If you’re interested in those, head over here.


General Advice on Statistics Maintenance

Be moderate in thy statistics updates, but not in thy hair styling.

Back when I read philosophy, I found Aristotle a bit annoying because he talked so much about “moderation”. Doing too little is a vice. Doing too much is a vice. Where's the fun in that?

Unfortunately, Aristotle was right when it comes to statistics maintenance in SQL Server. You can get in trouble by doing too little. And you can get in trouble by doing too much.

⇒ Be a little proactive: If you have millions of rows in some of your tables, you can get burned by doing no statistics maintenance at all. Query performance starts to get slow, out-of-date statistics turn out to be the cause, and a little proactive work would have prevented it.

⇒ If you’re too proactive, you’ll eventually be sorry: If you set up statistics maintenance too aggressively, your maintenance windows can start to run long. You shouldn’t run statistics maintenance against a database at the same time you’re checking for corruption, rebuilding indexes, or running other IO intensive processes. If you have multiple SQL Servers using shared storage, that maintenance may hit the storage at the same time. And what problem were you solving specifically?

The moderate approach:

  • Start with a SQL Server Agent job that handles statistics updates and index maintenance as part of a single operation/script run. Most folks begin running this once a week. If you have a nightly maintenance window, that can work as well.
  • Only run statistics updates more frequently when it’s clear that you need to, and then customize the more frequent job to only update the specific statistics that are problematic. Only use FULLSCAN on individual stats if you must, and document which queries require it.
  • If your tables have millions of rows and you run into estimate problems for recent data, consider using Trace Flag 2371 to increase the frequency of statistics updates for these tables instead of manually updating statistics throughout the day. (SQL Server 2008R2 SP1 +, not needed in SQL Server 2016.)
  • If you have a batch process or ETL that loads significant amounts of data, consider having it update statistics on impacted tables at the end of the process. Exception: if the job creates or rebuilds indexes at the end of its run, the statistics related to those indexes are already updated with FULLSCAN and do not require any maintenance.
  • Beware statistics updates that need to run frequently throughout the day: if this is your only option to fix a problem, you are applying a reactive fix which consumes IO and the query is periodically still slow. This is rarely a solid long term solution. Investigate stabilizing execution plans via indexes and query tuning instead.

Which Free Third Party Tools Help Update Statistics?

One widely used free script is Ola Hallengren’s SQL Server Index and Statistics Maintenance script.

If you’re managing lots of SQL Server instances and want ultimate customization, there is a free version of Minion Reindex.

What are Statistics?

Statistics are small, lightweight objects that describe the distribution of data in a SQL Server table. The SQL Server query optimizer uses statistics to estimate how many rows will be returned by parts of your query. This heavily influences how it runs your queries.

For example, imagine you have a table named agg.FirstNameByYear and you run this query:

(Screenshot of the example query)
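
The screenshot showed a query roughly like this one — reconstructed from the description below, so the exact text and alias names may have differed:

SELECT Gender,
    SUM(NameCount) AS total_name_count
FROM agg.FirstNameByYear
WHERE FirstNameId = 74846
GROUP BY Gender;
GO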

SQL Server needs to estimate how many rows will come back for FirstNameId=74846. Is it 10 million rows, or 2? And how are the rows for Gender distributed?

The answers to both of these questions impact what it does to GROUP those rows and SUM the NameCount column.

Statistics are lightweight little pieces of information that SQL Server keeps on tables and indexes to help the optimizer do a good job.

What Creates Column Statistics?

If the agg.FirstNameByYear table was freshly created when we ran our query, it would have no column statistics.

By default, the SQL Server optimizer will see that no statistics exist, and wait while two column statistics are created on the FirstNameId and Gender columns. Statistics are small, and are created super fast: my query isn't measurably any faster when I run it a second time.

Here’s what the statistics look like on the table in Object Explorer. Notice the artisanally crafted names.

(Screenshot: the auto-created statistics listed under the table in Object Explorer)

If you want to verify which column is in each auto-created statistic, you can do that with this query:

SELECT s.name,
    s.auto_created,
    s.user_created,
    c.name as colname
FROM sys.stats AS s
JOIN sys.stats_columns as sc 
    on s.stats_id=sc.stats_id
    and s.object_id=sc.object_id
JOIN sys.columns as c on 
    sc.object_id=c.object_id
    and sc.column_id=c.column_id
WHERE s.object_id=OBJECT_ID('agg.FirstNameByYear')
    and s.auto_created = 1
ORDER BY sc.stats_column_id;
GO

Sure enough, here are our statistics, and they are on Gender and FirstNameId. These are not considered ‘user created’ even though our user query was the cause of them being auto-generated. (“User created” means someone ran a CREATE STATISTICS command.)

(Screenshot: the query output mapping each auto-created statistic to its column)

SQL Server can now use the statistic on Gender and the statistic on FirstNameId for future queries that run.

What are Index Statistics?

Whenever you create an index in SQL Server, it creates a statistic associated with that index. The statistic has the same name as the index. Our agg.FirstNameByYear table has a clustered primary key, and here is the statistic that was created along with that index:

(Screenshot: the index statistic created along with the clustered primary key)

If columns are important enough to index, SQL Server assumes that it’s also important to estimate how many rows would be returned by that index when you query it. You can’t drop statistics associated with indexes (unless you drop the index).

Do I Need to Create Statistics Manually?

Nope! SQL Server does a really good job creating single-column statistics automatically.

Statistics will continue to be created on single columns when queries run as long as the “Auto Create Statistics” database property remains on.  You can check that setting on your databases with the query:

SELECT is_auto_create_stats_on
FROM sys.databases;
GO

You should leave auto_create_stats_on set to 1 unless an application is specifically designed to manually create its own statistics. (That is pretty much limited to weirdoes like SharePoint.)
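
If you ever find a database where someone has switched this off without a good reason, turning it back on is a one-liner. (WideWorldImporters is just an example database name here.)

ALTER DATABASE WideWorldImporters SET AUTO_CREATE_STATISTICS ON;
GO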

In rare situations, manually creating a multi-column statistic or a filtered statistic can improve performance… but keep reading to find out what those are and why it’s rare to require them.

Do I Need to Worry About Statistics Taking Up Too Much Storage in My Database?

Auto-created statistics are incredibly small, and you only get one per column in a table. Even if you have a statistic on every column in the table, this is a very small amount of overhead.

Statistics take up a negligible amount of space compared to indexes.

How Does Auto-Update Work for Statistics?

With default database settings, the  SQL Server optimizer looks at how many changes have occurred for a given column statistic as part of query optimization. If it looks like a significant amount of rows in the column have changed, SQL Server updates the statistic, then optimizes the query. Because why optimize a query on bad data estimates?

The thresholds for when auto-update statistics kicks in are a little complicated.

For SQL Server 2005 – SQL Server 2014 (with no trace flags)

  • If the table has 1-500 rows, statistics are considered not fresh enough for optimization once 500 rows have changed
  • If the table has 500+ rows, the threshold is 500 rows + 20% of the total rows in the table
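
To put numbers on that: for a table with 1,000,000 rows, the pre-2016 threshold works out to 500 + (20% of 1,000,000) = 200,500 modifications before the statistic is considered stale.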

Note that the statistics don’t auto-update when the rows are changed. It’s only when a query is being compiled and the statistics would actually be used to make an optimization decision that the update is kicked off. Erin Stellato proves that in her post here.

Should I Turn On Trace Flag 2371 to Control Automatic Statistics Updates?

Trace Flag 2371 makes the formula for large tables more dynamic. When tables have more than 25,000 rows, the threshold for automatic statistics update becomes more dynamic, and this adjusts as the rowcount goes up. See a graph of the adjusting threshold in this post from the SAP team. (I think we know which team really felt some pain and wanted this trace flag to exist, because the trace flag was announced on their blog!)

Trace Flag 2371 is available from SQL Server 2008 R2 SP1 through SQL Server 2014.

SQL Server 2016 automatically uses this improved algorithm. Woot! So if you’re using SQL Server 2016, you don’t need to decide. Erik Darling tested out the behavior in 2016 and wrote about it here.

Prior to 2016, here’s a quick rundown of pros and cons of TF2371:

Benefits to turning on Trace Flag 2371

  • This trace flag is documented in KB 2754171
  • This has been used by SAP for a while, and has become the default behavior in SQL Server 2016. That seems like a big vote of confidence.

Risks of turning on Trace Flag 2371

  • The trace flag is instance-wide (“global”). You can’t change this behavior for a specific database.
  • The documentation in KB 2754171 covers its behind a little conspicuously. They advise you that “If you have not observed performance issues due to outdated statistics, there is no need to enable this trace flag.” (Unless you run SAP.)
  • This trace flag didn’t win the popularity contest to make it into the main Books Online list of trace flags.
  • If you have this trace flag on and you need to open a support ticket, you may have to spend time explaining why you have this one and jumping through extra hoops to get a problem figured out.

Overall, this is a low risk trace flag. But in general it does NOT pay off to enable trace flags “just in case” for most people.

Should I Turn on Trace Flag 7471 for Concurrent Statistics Update?

Trace Flag 7471 is a global trace flag released in SQL Server 2014 SP1 Cumulative Update 6.

This trace flag changes the locking behavior associated with updating statistics. With the trace flag on, if you set up jobs to concurrently update statistics on the same table, they won’t block each other on metadata related to statistics updates.

Jonathan Kehayias has a great post on TF 7471 on SQL Performance.com. He demos the trace flag in a video for the SQL Skills Insider Newsletter where he shows the changes in locking behavior this flag introduces. Download the video in WMV format or MOV format to watch.

Benefits to turning on Trace Flag 7471

  • Statistics update jobs that run concurrently against the same table won't block each other on metadata related to statistics updates

Risks of turning on Trace Flag 7471

  • Running concurrent statistics updates against the same table can use more IO and CPU resources. Depending on what you’re doing, that might not be awesome.
  • Running concurrent parallel statistics updates against very large tables can use more workspace memory resources than executing them serially. Running low on workspace memory can cause queries to have to wait for memory grants to even get started (RESOURCE_SEMAPHORE waits). (See more on this in Jonathan Kehayias’ post here.)
  • Microsoft’s testing showed “TF 7471 can increase the possibility of deadlock especially when creation of new statistics and updating of existing statistics are executed simultaneously.” (source)
  • The trace flag is instance-wide (“global”). You can’t change this behavior for a specific database.
  • The documentation in KB 3156157 is pretty minimal.
  • This is a new trace flag introduced in a cumulative update, and it isn’t very widely used yet.
  • This trace flag also didn’t win the popularity contest to make it into the main Books Online list of trace flags.
  • If you have this trace flag on and you need to open a support ticket, you may have to spend time explaining why you have this one and jumping through extra hoops to get a problem figured out.

Any trace flag can have unintended side effects. If I had a really good reason to run concurrent statistics updates against one table after exhausting other options to avoid the maintenance, I'd consider using this. I'd also only turn it on for that specific maintenance window, and turn it off afterward. (Edit: I wrote this before Microsoft blogged about the trace flag, but it turns out it's exactly what they recommend in their blog post, due to the deadlocking issue called out in “risks”.)
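
A sketch of that "only during the maintenance window" approach — DBCC TRACEON/TRACEOFF with -1 set the flag globally:

/* Before kicking off concurrent statistics update jobs against the big table */
DBCC TRACEON (7471, -1);
GO

/* ... run the concurrent UPDATE STATISTICS jobs here ... */

/* Once the maintenance window is over, put things back to normal */
DBCC TRACEOFF (7471, -1);
GO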

Should I Update Statistics Asynchronously?

Remember how I said that when SQL Server is optimizing a query, it smells the statistics to see if they’re still fresh, and then waits to update them if they’re smelly?

You can tell it not to wait to update them, and just use the smelly stats. It will then optimize the query. The stats will still be updated for any queries that come later, but boy, I hope those statistics were good enough for that first query!

You control this with the ‘Auto Update Stats Async’ setting. You can query this setting on your databases like this:

SELECT is_auto_update_stats_async_on
FROM sys.databases;
GO

Asynchronous statistics updates are usually a bad choice. Here’s why:

  • Statistics update is typically incredibly quick
  • It’s usually a greater risk that the first query will get a bad plan, rather than it having to wait
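
If you inherit a database where this has been switched on and nobody can explain why, going back to the default synchronous behavior is simple. (Again, WideWorldImporters is just a stand-in database name.)

ALTER DATABASE WideWorldImporters SET AUTO_UPDATE_STATISTICS_ASYNC OFF;
GO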

How Often Should You Manually Update Statistics?

If you know that a lot of data is changing in a table, you may want to manually update statistics. Possibly:

  • A lot of data is changing, but it’s below the 20%+500 rows limit of where auto-update kicks in because the table has hundreds of millions of rows
  • A small amount of data is changing in a large table, but it’s frequently queried

Data Loading: Update Statistics After Each Load

A classic example of a stats problem is “recent” data from an ETL. You have a table with 10 million rows. 500,000 rows of data are loaded for the most recent batch. Important queries are looking at the table for rows with LoadDate > two hours ago.

Statistics won’t update automatically for queries, because < 2,000,500 rows have changed.

Those queries will estimate that there is one row. (To be safe. Wouldn’t wanna estimate zero!)  That’s a huge mis-estimation from 500,000, and you might end up with some very bad query plans. Gail Shaw wrote a great post on this– it’s called the “ascending date” problem.

In this kind of data load situation, it is helpful to run UPDATE STATISTICS against the entire table where the data has loaded at the end of the load. This is typically a lightweight command, but in very large tables, UPDATE STATISTICS may be run against specific indexes or columns that are sensitive to recent data to speed up the process. (Exception: if the job creates or rebuilds indexes at the end of its run, the statistics related to those indexes are already updated with FULLSCAN and do not require any maintenance.)
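
As a sketch, the end-of-load step might look like this. The table and statistic names below are made up purely for illustration:

/* Refresh all statistics on the loaded table using the default sampling */
UPDATE STATISTICS dbo.OrderLineFact;
GO

/* Or, on a very large table, update only the statistic that's sensitive to recent data */
UPDATE STATISTICS dbo.OrderLineFact ix_OrderLineFact_LoadDate;
GO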

If you have found optimization problems and you can’t change your data load to manually update statistics, Trace Flag 2371 might help.

What if a Lot of Data Changes Through the Day Based on User Interactions?

Most databases are small. They don’t have millions and millions of rows. And most of those databases are just fine without any manual update of statistics.

If you’ve got tables with millions of rows, and you’ve found that statistics aren’t updating fast enough, you can consider:

  • Turning on Trace Flag 2371 to make auto-update statistics run more frequently on tables with large amounts of rows
  • Using a third party script to handle statistics maintenance along with your index maintenance (see the list above)

Should I Use a Maintenance Plan to Update Statistics?

No. You really shouldn’t.

Maintenance Plans are really dumb when it comes to statistics. If you run UPDATE STATISTICS against a table, index, or statistic yourself, by default it uses sampling: it won't look at every row.

Maintenance Plans don’t do that. Instead, they require you to either:

  • Scan every row in the table, for every statistic in the table. So if you’ve got a 100 million row table with 50 column statistics and no nonclustered indexes, it’ll scan the 100 million row table fifty times.
  • Specify an exact percentage of each table to scan. For every table. So if you specify 3%, then 100 row tables would…. yeah.

(Screenshot: the Update Statistics Task options in a Maintenance Plan)

Neither of these is a good option.

You could use a maintenance plan to kick off a third party script that’s smarter about this, but don’t use a maintenance plan with an Update Statistics task. I found cases where that was set up using the default fullscan and it was taking many hours to run against a tiny database.

Can Statistics be Updated Using Parallelism?

If you update statistics with FULLSCAN, SQL Server may choose a parallel plan since SQL Server 2005.

If you create or update statistics with sampling, including the default sampling used in automatic statistics creation, SQL Server may choose a parallel plan as of SQL Server 2016.

When Should You Update Statistics with FULLSCAN?

 As we just covered, if you update statistics on an entire table with FULLSCAN:

  • For any column statistic where there is not an index leading on that column, SQL Server will scan the whole table
  • For any column statistic where there IS an index leading on that column, or for an index statistic, SQL Server will scan the index

As tables grow, updating with FULLSCAN starts taking longer and longer. It uses IO and resources on the instance. You started updating statistics because you were concerned about performance, and now you’re impacting performance.

Only update statistics with FULLSCAN if you are solving a specific problem, and you don’t have a better way to handle it. If you need to do this:

  • Identify the column or index statistic that needs the special treatment, and only update that one with FULLSCAN
  • Run the update as rarely as possible. If you start having to run this throughout the day against multiple statistics, it becomes increasingly hard for you to diagnose performance regressions and manage the instance
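
For example, once you've narrowed it down to one statistic, the command is as simple as this (both names below are hypothetical):

UPDATE STATISTICS Sales.OrderLines [_WA_Sys_00000006_414EAC47] WITH FULLSCAN;
GO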

Manual statistics update with FULLSCAN can sometimes be used as a temporary measure to contain the pain of a problem while you research what query changes may stabilize the execution plans of the problem query, but this solution rarely satisfies users and business owners. Those folks are usually happier if something changes that guarantees the problem query will be consistently fast, even if a bunch of data changes — and that usually means index changes or query tuning.

The essence of the problem with using manual statistics updates for a performance problem is that this is a reactive measure, and almost never prevents the problem from occurring entirely.

Should I Update Statistics with sp_updatestats?

The built in sp_updatestats procedure is smarter than the “Update Statistics Task” in maintenance plans. It rolls through every table in the database, and is smart enough to skip tables if nothing has changed. It’s smart enough to use the default sampling.

It’s a smaller sledgehammer than the maintenance plan task, but arguably still wastes resources, particularly when it comes to statistics associated with indexes. Consider this scenario:

  • sp_updatestats is run
  • Nightly maintenance runs a script to rebuild all indexes which are more than 70% fragmented, and reorganize all indexes which are between 40% and 70% fragmented

The ALTER INDEX REBUILD command creates a beautiful new index structure. To do this, it has to scan every row in one big operation. And while it’s doing it, it updates statistics with FULLSCAN, meaning based on every row.

So our maintenance did a bunch of work updating stats with the default sampling. Then redid the same work for every index that got rebuilt.

Erin Stellato is not a big fan of sp_updatestats. Read why in her excellent article here.

If you’re going to the trouble to set this up, it’s a better use of your resources to use an indexing script that can handle that statistics update inline for columns and for indexes where REORGANIZE is done, and just skip it for REBUILD.

Why Doesn’t Updating Statistics Help if Queries Use a Linked Server?

Linked servers are special. Especially frustrating.

Prior to SQL Server 2012 SP1, any query using a linked server would not have permission to use statistics on the remote (target) database unless it used an account with db_owner or db_ddladmin permissions, or higher.  In other words, read-only permissions meant that SQL Server couldn’t use statistics to generate a good query plan.

Stated differently: prior to SQL Server 2012 SP1, you had to choose between better security and better performance. You couldn't have both.

Great reason to upgrade! Read more about this issue in Thomas LaRock’s article on linked server performance.

How do Duplicate Statistics Get Created?

Let’s revisit our example table. Two column statistics were created on this table, one on FirstNameId, and one on Gender:

(Screenshot: the two auto-created column statistics, as shown earlier)

Let’s say that later on we create an index on Gender named ix_Gender. It will have index statistics created for it! I now have a column statistic on Gender, and an index statistic on Gender.

Someone could also manually create another column statistic on Gender using the CREATE STATISTICS statement. That rarely happens, but never say never. These statistics are technically ‘duplicates’.

Do I Need to Drop Duplicate Statistics?

I’ve never found a case where dropping duplicate statistics improved performance in a measurable way. It is true that you don’t need them, but statistics are so small and lightweight that I wouldn’t bother writing and testing the code to clean them up unless I had a super good reason.

What are Multi-Column Statistics?

Multi-Column statistics are only sort of what the name sounds like. These can get created in a few different ways:

  • When you create an index with multiple columns in the key, the associated statistic is a multi-column statistic (based on the keys)
  • You can create a multi-column statistic with the CREATE STATISTICS statement
  • You can run the Database Tuning Advisor, which you can tell to apply a bunch of those CREATE STATISTICS statements. It seems to love them. A lot.
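
For reference, the manual route from the second bullet looks like this. It's a sketch against the example table from earlier, and the statistic name is made up:

CREATE STATISTICS st_Gender_FirstNameId
ON agg.FirstNameByYear (Gender, FirstNameId);
GO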

But multi-column statistics don’t contain complete information for all the columns. Instead, they contain some general information about the selectivity of the columns combined in what’s called the “Density Vector” of the index.

And then they contain a more detailed estimation of distribution of data for the first column in the key. Just like a single column statistic.

What Can Go Wrong if I Create Multi-Column Statistics?

It’s quite possible that nothing will change. Multi-column statistics don’t always change optimization.

It’s also possible that the query you’re trying to tune could slow down. The information in the density vector doesn’t guarantee better optimization.

You might even make a schema change fail later on, because someone tries to modify a column and you created a statistic on it manually. (This is true for any user created statistic, whether or not it’s multi-column.)

It’s pretty rare to make a query faster simply by creating a multi-column statistic, and there’s a very simple reason: if a multi-column statistic is critically important, an index probably is even more critical to speed up execution. So usually the big win comes in creating or changing an index. (And yes, that also creates the multi-column statistic.)

What are Filtered Statistics, and Do I Need Them?

Filtered statistics are statistics with a “where” clause. They allow you to create a very granular statistic. They can be powerful because the first column in the statistic can be different from the column or columns used in the filter.
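
The manual syntax looks like this. In this sketch the filter column (ReportYear) and the statistic name are hypothetical, but it shows how the filter can use a different column than the statistic's key:

CREATE STATISTICS st_FirstNameId_recent
ON agg.FirstNameByYear (FirstNameId)
WHERE ReportYear >= 2010;
GO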

Filtered statistics are automatically created when you create a filtered index. That’s most commonly where you’ll find a filtered statistic: helping a filtered index.

It's rare to need to create filtered statistics manually, and beware: if your queries use parameters, the optimizer may not trust the statistics when it optimizes plans — because the plan might be reused for parameters that fall outside of the filter. So you could potentially create a bunch of filtered statistics for nothing.

For loads of details on filtered stats, watch Kimberly Tripp’s free video, Skewed Data, Poor Cardinality Estimates, and Plans Gone Bad.

How Can I Tell How Many Times a Statistic Has Been Used to Compile a Query?

There is no dynamic management view that lists this. You can’t look at an instance and know which statistics are useful and which aren’t.

For an individual query, there are some undocumented trace flags you can use against a test system to see which statistics it’s considering. Read how to do it in Paul White’s blog here (see the comments for a note on the flag to use with the new cardinality estimator).

Can I Fake Out/ Change the Content of Statistics without Changing the Data?

You can, but don’t tell anyone I told you, OK?

Read how to do this undocumented / bad idea / remarkably interesting hack on Thomas Kejser’s blog.

What are Temporary Statistics, and When Would I Need Them?

Temporary statistics are an improvement added in SQL Server 2012 for read-only databases. When a database is read-only, queries can't create statistics in the database, because those require writes. As of SQL Server 2012, temporary statistics can be created in tempdb to help optimization of queries.

This is incredibly useful for:

  • Readable secondaries in an AlwaysOn Availability Group
  • Readable logshipping secondary databases
  • A point in time database snapshot which is queried (whether against a database mirror or live database)
  • Any other read-only database that has queries run against it

Prior to SQL Server 2012, if you use logshipping for reporting and the same workload does not run against the logshipping publisher, consider manually creating column level statistics. (Or upgrading SQL Server.)

Disambiguation: temporary statistics are unrelated to statistics on temporary tables. Words are confusing!

Do I Need to Update Statistics on Temporary Tables?

Table variables don’t have statistics, but temporary tables do. This is useful, but it turns out to be incredibly weird… because statistics on temporary tables can actually be reused by subsequent executions of the same code by different sessions.

Yeah, I know, it sounds like I’m delusional. How could different sessions use the same statistics on a (non-global) temporary table? Start by reading Paul White’s bug, “UPDATE STATISTICS Does Not Cause Recompilation for a Cached Temporary Table,” then read his post, “Temporary Tables in Stored Procedures.”

How Can I Tell if Statistics are Making a Query Slow?

If you run into problems with slow queries, you can test the slow query and see if updating statistics makes the query run faster by running UPDATE STATISTICS against a column or index statistic involved in the query.

This is trickier than it sounds. Don't just update the statistics first, because you can't go back!

  • First, get the execution plan for the ‘slow’ query and save it off. Make sure that if the query was run with parameters, you test each run of the query with the same parameters used to compile it the first time. Otherwise you may just prove that recompiling the query with different parameters makes it faster.
  • Do not run the UPDATE STATISTICS command against the entire table. The table likely has many statistics, and you want to narrow down which statistic is central to the estimation problem.

In many cases, queries are slow because they are re-using an execution plan that was compiled for different parameter values. Read more on this in Erland Sommarskog’s excellent whitepaper, Slow in the Application, Fast in SSMS  – start in this section on “Getting information to solve parameter sniffing problems.”