Filling in Data Potholes Redux: Tally Tables vs CTEs

In A Previous Installment


… our heroine (that’s me) rediscovered CTEs, specifically in the recursive style. That was in my post “Filling in Data Potholes with Recursive CTEs.”

To recap: I was working on a problem with gaps in temporal data. The basic scenario was:

Imagine that you are writing a script that looks at data grouped by the minute. You notice that there are no rows for some minutes, and you’d like to display a value when that is the case, probably showing a count of zero.

For the particular problem I was looking at, I was using small datasets and generating a list of all the valid dates with a recursive CTE performed well for me.

From the Comments

The best thing about blogging is not really sharing what you know: it’s getting to learn more. You get to learn from the process of writing the blog, and also if you are lucky enough to get comments from readers.

I had a couple of comments that mentioned performance, and Brad Shulz ( blog | sadly not on Twitter ) also linked me up to this most interesting post he did on recursion.

CTEs vs Tally Tables

Brad’s comment and post inspired me to do a little rewriting to put together a tally table style solution to the example I gave, scale it out in a few different ways, and collect and publish performance comparisons.

Why? Well, it’s interesting.

Creating Some Data

I didn’t do anything fancy for this. I took the same query I had previously to generate some data with gaps, and just modified it a bit so I could add in a small amount of data three months out, six months out, one year out, and two years out. This created datasets that were mostly “gap” for these future periods, but still made the query return a lot of rows.

--Run this with @monthOffset values: 0, 3, 6, 12, 24
--Run a trial each time.
DECLARE @monthOffset TINYINT = 0

--Let's create some data with some gaps

DECLARE @startDate DATETIME2(0)= DATEADD(MM, @monthOffset, '1/1/2011')

IF OBJECT_ID('dbo.MyImperfectData', 'U') IS NOT NULL
    AND @monthOffset = 0
        DROP TABLE dbo.MyImperfectData

IF OBJECT_ID('dbo.MyImperfectData', 'U') IS NULL
    CREATE TABLE dbo.MyImperfectData
          ItemDate DATETIME2(0) ,
          ItemCount SMALLINT ,
          CONSTRAINT cx_ItemDate_MyImperfectData UNIQUE CLUSTERED ( ItemDate )

INSERT  dbo.MyImperfectData
        ( ItemDate, ItemCount )
VALUES  ( DATEADD(mi, 0, @startDate), 12 ),
        ( DATEADD(mi, 1, @startDate), 3 ),
        ( DATEADD(mi, 2, @startDate), 6 ),
        ( DATEADD(mi, 3, @startDate), 12 ),
        ( DATEADD(mi, 4, @startDate), 24 ),
        ( DATEADD(mi, 5, @startDate), 1 ),
		-- Gap where 6 would be
        ( DATEADD(mi, 7, @startDate), 122 ),
        ( DATEADD(mi, 8, @startDate), 1 ),
        ( DATEADD(mi, 9, @startDate), 1244 ),
        ( DATEADD(mi, 10, @startDate), 23 ),
        ( DATEADD(mi, 11, @startDate), 12 ),
        ( DATEADD(mi, 12, @startDate), 24 ),
        ( DATEADD(mi, 13, @startDate), 27 ),
        ( DATEADD(mi, 14, @startDate), 28 ),
		--Gap where 15, 16, 17 would be
        ( DATEADD(mi, 18, @startDate), 34 ),
        ( DATEADD(mi, 19, @startDate), 93 ),
        ( DATEADD(mi, 20, @startDate), 33 ),
        ( DATEADD(mi, 21, @startDate), 65 ),
        ( DATEADD(mi, 22, @startDate), 7 ),
        ( DATEADD(mi, 23, @startDate), 5 ),
		--Gap where 24 would be
        ( DATEADD(mi, 25, @startDate), 4 ),
        ( DATEADD(mi, 26, @startDate), 6 ),
        ( DATEADD(mi, 27, @startDate), 7 ),
        ( DATEADD(mi, 28, @startDate), 77 ),
        ( DATEADD(mi, 29, @startDate), 94 ) ;

The Tally Table Approach

For this test I created a temporary tally table with two million rows. I used the method to populate the table which Jeff Moden recommends in his article, The ‘Numbers’ or ‘Tally’ Table: What it is and how it Replaces a Loop:

IF OBJECT_ID('tempdb..#tally') IS NOT NULL
    DROP TABLE #tally
    SELECT TOP 2000000
            IDENTITY( INT,1,1 ) AS N
    INTO    #Tally
    FROM    Master.dbo.SysColumns sc1 ,
            Master.dbo.SysColumns sc2


One thing to note with the tally table approach: you need to make sure you have enough numbers in the tally table to support your needs, or you’ll be missing rows in results. For production usage, you probably want to go well above 2 million if you’re using this for any scale.

To create the tally table on the fly with this number of rows, it took 4,438 ms of CPU time.

For fun, I used a CTE when I wrote my query to display the data for the Tally table, also. In this case, it’s just not a recursive CTE. A derived table could have been used just as well, but I think it’s more readable with the CTE.

WITH    TallyCTE
          AS ( SELECT   DATEADD(mi, t.N, limits.MinDate) AS dayDate
               FROM     #Tally t
               JOIN     ( SELECT    MIN(ItemDate) AS MinDate ,
                                    MAX(ItemDate) AS MaxDate
                          FROM      dbo.MyImperfectData AS imd
                        ) limits ON t.N <= DATEDIFF(mi, MinDate, MaxDate)
    SELECT  tcte.dayDate ,
            CASE WHEN Itemcount IS NULL THEN '[Missing Row]'
                 ELSE ''
            END AS ColumnDescription ,
            COALESCE(ItemCount, 0) AS ItemCount
    FROM    TallyCTE tcte
    LEFT OUTER JOIN dbo.MyImperfectData AS d ON tcte.dayDate = d.ItemDate
    ORDER BY tcte.dayDate

The Recursive CTE Approach

This was the same as last time– just recapping it here for anyone playing along at home:

DECLARE @startDate DATETIME2(0) ,
    @endDate DATETIME2(0) ;

SELECT  @startdate = MIN(ItemDate), @endDate = MAX(ItemDate)
FROM    dbo.MyImperfectData ;

          AS ( SELECT   @startDate AS MyCTEDate
               UNION ALL
               SELECT   DATEADD(mi, 1, MyCTEDate)
               FROM     MyCTE
               WHERE    MyCTEDate < DATEADD(mi, -1, @endDate) )
    SELECT  MyCTEDate, CASE WHEN Itemcount IS NULL THEN '[Missing Row]'
                            ELSE ''
                       END AS ColumnDescription,
            COALESCE(ItemCount, 0) AS ItemCount
    FROM    MyCTE
            LEFT OUTER JOIN dbo.MyImperfectData ld
                ON MyCTE.MyCTEDate = ld.ItemDate

Notes on Method

I tested this on a virtual machine on my laptop, which is (definitely) not a production environment! However, I wasn’t doing much else on the laptop at the time, and I ran each test a couple of times to make sure results were all clearly in the right ballpark.
I cleaned out the buffer pool (DBCC DROPCLEANBUFFERS) and the procedure cache (DBCC FREEPROCCACHE) before each query.

The Results

Here is what I saw:

Don’t forget that I was seeing 4,438 ms of CPU time to create my tally table itself– if you truly couldn’t persist objects, this might be a concern for you. However, the numbers in this table really justify creating persisting the tally object permanently.

CPU time is 8 times faster on average with the tally method when we get above 125K rows. The minimal logical reads, and the ability for read-ahead reads to be used, is also convincing.

What about the Execution Plans?

For the most part, I’m just going to link yet again to Brad’s post. Read about the Stack Spool operator!

But I will note this: when I looked at execution plans for this, I noticed that subtree cost estimates for the CTE query were .0148, and estimates for the tally table approach were 61.848. (This is from the two year batch.) The statistical estimates from the tally table approach were just far more accurate. This makes perfect sense, as the optimizer had statistics available to it to help it out: they were on the Tally table itself.

In Summary

I can’t really think of a situation in which I’ll use the recursive CTE for this specific issue. And after thinking about the stack spool operator, I’ll try to think differently about how to construct queries in general.

However, I really enjoyed writing these posts.

6 Comments. Leave new

  • […] This post was mentioned on Twitter by SoftArtisans Team. SoftArtisans Team said: Nice work! RT @kendra_little: [Blog] Filling in Data Potholes Redux: Tally Tables vs CTEs […]

  • I’m a big fan of tally tables. I used one here but generated it using a different method. Quite useful. As are CTEs.

    I haven’t done a lot of playing around with recursion in SQL but it doesn’t surprise me that it’s slower than the Tally table. Recursion is basically a loop and, as Jeff Moden shows in the article you link to, loops tend to be slow and even when they work well with small data sets they often don’t scale. The problem with recursion – and the same goes for correlated sub-queries! – is that they don’t look like loops. (I think Mr. Moden talks about that a little bit in this article.) I had a lot of problems with hidden loops in my code a couple of years back.

    Loops don’t always look like loops! That is burned into my brain now.

  • Terrific analysis, Kendra!

    The difference in Reads/CPU between the two methods was even more dramatic than I thought.

    Happy New Year!

    (LOVE your “Redux” picture, BTW… LOL).


  • Hi Kendra,

    Just attended your dates session in WIT 24 hours Pass session and liked it very much. Thanks.

  • I was searching for something else and ran across your article. Thank you for the leg up, Kendra, and nice article, indeed.

    –Jeff Moden


Share a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: