How Stale are my Statistics?

on April 22, 2009

Update: an improved, more recent version of the queries for this is here.

It can be pretty difficult to manage statistics in data warehouses, or even OLTP databases that have very large tables.

This is because, even with auto_update_statistics turned on, SQL is pretty conservative about when to update statistics due to the cost of the operation. For large tables, statistics are updated when “500 + 20% of the number of rows in the table when the statistics were gathered” have changed. (see BOL here)

So for a table with 50 million rows, statistics will auto-update only when more than 10,000,500 rows (500 + 20% of 50 million) have changed. I have a lot of tables with a lot of rows, and this can be a problem. Take a fact table, for instance, where the table is keyed and sorted by an integer representing a date. Every day a large number of new records is loaded, and suddenly there are a lot of rows for a new date. Typically, though, the number of rows changed in one day in these large fact tables is not enough to trigger an automatic update of statistics.
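
To see how that threshold plays out across your own tables, here's a minimal sketch (not part of the original script) that estimates the auto-update threshold per user table, assuming the "500 + 20% of rows" rule and using sys.partitions for row counts:

-- Rough sketch: estimated auto-update threshold per user table,
-- assuming the "500 + 20% of rows" rule described above.
SELECT
    tablename = OBJECT_NAME(p.object_id)
    , table_rows = SUM(p.rows)
    , approx_autoupdate_threshold = 500 + (0.20 * SUM(p.rows))
FROM sys.partitions p
JOIN sys.objects o ON p.object_id = o.object_id
WHERE o.type = 'U'            -- user tables only
    AND p.index_id IN (0, 1)  -- heap (0) or clustered index (1) carries the row count
GROUP BY p.object_id
ORDER BY table_rows DESC;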

You don’t want to update statistics more often than you have to, however: a schema stability lock is taken, and depending on what you’re doing it can cost a bit in IO. You want to learn how often you really need to manage statistics on a given set of tables, and also have a guideline you can use to make sure you’re staying where you need to be over time.

The Basics: What are statistics?

Statistics are a description of the distribution of data in a column or index.

What can happen when statistics are incorrect?

The query optimizer relies heavily on statistics, so if the stats don’t represent the underlying data well, the optimizer can produce an inefficient query plan. The plan is determined before execution, and once it’s made, execution must either complete or be cancelled. (In other words, there is no mechanism by which SQL can notice partway through that the plan it made is based on bad statistics and start over.)

Will knowing how fresh my statistics are solve all my performance problems?

No, but it can help solve some if you set up good maintenance, and knowing how to evaluate the freshness of your stats sure makes troubleshooting easier when problems occur.

What’s this query do?

The query below looks at all statistics in the current database and shows, for each one, when the statistics were last updated, along with the approximate number of rows that have changed since that update.

RowModCtr? Does that still work?

Rowmodctr only provides an estimate in SQL 2005 and later. From Books Online: “The new rowmodctr value will frequently help you determine when to update statistics because the behavior is reasonably close to the results of earlier versions.” The actual tracking is now done in a column called colmodctr, but that isn’t exposed to us through the DMVs or system views. So all we have to rely on is dear old rowmodctr, and I find it’s way better than nothing.

How high a rowmodctr value is too high?

Well, how high do you think is too high? :) The number varies. You may want to join to sys.partitions, count the number of rows in each object, and calculate a percentage (a rough sketch of that variant follows the main query below). I’ve considered editing the script to do that, but I’ve never gotten around to it, because it’s not quite as simple as a percentage. For many tables, you just want to update statistics after you load a significant amount of data (like after you insert a new day of fact data). On some tables you may want to update statistics less frequently but with a higher-than-default sampling rate, so a high percentage of change may be OK; an example of that kind of manual update follows this paragraph.
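
For example, a manual update with an explicit sampling rate looks like this (just an illustration of the syntax; dbo.FactSales and PK_FactSales are made-up names):

-- Update all statistics on a (hypothetical) table with a 25 percent sample.
UPDATE STATISTICS dbo.FactSales WITH SAMPLE 25 PERCENT;

-- Or update the statistics behind one index with a full scan, for cases
-- where sampling doesn't capture the distribution well.
UPDATE STATISTICS dbo.FactSales PK_FactSales WITH FULLSCAN;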

So like everything else, you have to get to know your system, its rate of change, and the performance of jobs/applications/user queries, and use the rowmodctr information and query plans to help tune along the way.

SELECT DISTINCT
    tablename=object_name(i.object_id)
    , o.type_desc
    ,index_name=i.[name]
    , statistics_update_date = STATS_DATE(i.object_id, i.index_id)
    , si.rowmodctr
FROM sys.indexes i (nolock)
JOIN sys.objects o (nolock) on
    i.object_id=o.object_id
JOIN sys.sysindexes si (nolock) on
    i.object_id=si.id
    and i.index_id=si.indid
where
    o.type <> 'S'  --ignore system objects
    and STATS_DATE(i.object_id, i.index_id) is not null
order by si.rowmodctr desc;
GO
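
And here is a rough sketch of the percentage variant mentioned earlier. It joins to sys.partitions for current row counts and expresses rowmodctr as a percentage of the table's rows; treat it as a starting point rather than a finished script, since (as noted above) a raw percentage isn't the whole story:

SELECT DISTINCT
    tablename = object_name(i.object_id)
    , index_name = i.[name]
    , statistics_update_date = STATS_DATE(i.object_id, i.index_id)
    , si.rowmodctr
    , table_rows = p.table_rows
    , percent_modified =
        CASE WHEN p.table_rows > 0
             THEN CAST(100.0 * si.rowmodctr / p.table_rows AS decimal(18,2))
             ELSE NULL
        END
FROM sys.indexes i
JOIN sys.objects o on
    i.object_id = o.object_id
JOIN sys.sysindexes si on
    i.object_id = si.id
    and i.index_id = si.indid
JOIN (
    SELECT object_id, table_rows = SUM(rows)
    FROM sys.partitions
    WHERE index_id IN (0, 1)  -- heap or clustered index carries the row count
    GROUP BY object_id
) p on i.object_id = p.object_id
where
    o.type <> 'S'  --ignore system objects
    and STATS_DATE(i.object_id, i.index_id) is not null
order by percent_modified desc;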

References

Statistical maintenance functionality (autostats) in SQL Server - Mostly about SQL 2000 but includes useful information on locks used by update statistics operations.

Statistics Used by the Query Optimizer in Microsoft SQL Server 2005

Statistics Used by the Query Optimizer in Microsoft SQL Server 2008