Query Store Size Based Cleanup Causes Performance Problems - How to Avoid It

By Kendra Little on April 23, 2024

I’ve got a backlog of learnings from stress-testing to blog, so if you like this post there’s more to come.

I’m a huge fan of SQL Server’s Query Store feature. Query Store collects query execution plans and aggregate query performance metrics, including wait stats. Having Query Store enabled makes troubleshooting performance issues such as bad parameter sniffing, much, much easier. Because Query Store is integrated into SQL Server itself, it also can catch query plans in a lightweight way that an external monitoring system will often miss.

When performance matters, it’s important to ensure that you’re managing Query Store so that Query Store cleanup does not run during high volume times. Query Store cleanup could slow your workload down significantly.

Look out, the cleanup hornets are coming

Cleanup isn’t easy. I know this from having needed to delete or update data in various circumstances over the years – it’s tough to clean things up safely without impacting performance.

Query Store has some issues with this currently as well.

If you collect wait stats in Query Store, Query Store size-based cleanup of wait stats runs a query:

DELETE sys.plan_persist_wait_stats WHERE plan_id IN 
    ( SELECT plan_id 
    FROM sys.plan_persist_plan 
    WHERE query_id = @query_id )
  • If you monitor your SQL Server or Azure SQL database from a third party monitoring system like Datadog or Redgate SQL Monitor, you will be able to see this query running in that system. You’ll also be able to see it in the plan cache.
  • When Query Store has a good amount of data– but not necessarily enormous (say, 4GB), this query can run repeatedly for a very long time when called by cleanup.
  • This query has significant impacts– more than I would have expected. In an environment with a synchronous availability group in a high activity period, this can depress the rate of transaction log flushes and notably raise the level of HADR_SYNC_COMMIT and HADR_GROUP_COMMIT waits.

I can produce this behavior/impact in a test environment with a synchronous Availability Group setup using Ostress.exe, after first building up a volume of data in Query Store.

I have not been able to produce similar impacts from other parts of Query Store cleanup, only wait stats cleanup, but it might be that they exist and my testing simply didn’t uncover them. Better safe than sorry.

When to consider not collecting wait stats in Query Store

If you have a performance critical database and you want to ensure this doesn’t happen, you don’t have to collect wait statistics in Query Store.

I find the wait stats in Query Store to be occasionally useful only. They are aggregate wait stats, so they can give a general idea for simple issues like whether or not a query often waits on lock waits, but they aren’t super granular. If you do have an external monitoring system that is collecting wait stats, and if it allows you to correlate those wait stats to queries, you might simply decide not to collect wait stats in Query Store.

Notes:

  • Modifying settings in Query Store will clear your execution plan cache, so don’t do this at a peak time if that’s an issue for you.
  • If you’ve been collecting wait stats in Query Store and you disable collection, it does not immediately truncate the wait stats. Cleanup will still need to take care of them, unless you clear Query Store.
    • Obviously, that will cause you to lose Query Store data. You might mitigate this by restoring a backup of the database to access that data if you need it, or you might be willing to start with a clean Query Store, up to you.

Size based cleanup - avoid this

Query Store has two types of cleanup: time based and size based.

If performance is critical, you should tune your query store so that only time-based cleanup runs.

Things to know about size based cleanup:

  • It is based on MAX_STORAGE_SIZE_MB when SIZE_BASED_CLEANUP_MODE is set to AUTO. “Size-based cleanup will be automatically activated when size on disk reaches 90 percent of max_storage_size_mb. This is the default configuration value. Size-based cleanup removes the least expensive and oldest queries first. It stops when approximately 80 percent of max_storage_size_mb is reached.” Docs
  • If you’re working to reproduce issues with sized based issues, you may hit scenarios where your Query Store goes into read-only mode before cleanup starts. I found that I could avoid this by doing a little dance: I ran a workload to grow Query Store to 90% of MAX_STORAGE_SIZE_MB. Then I stopped my workload briefly (cleanup seems to need a lull) and ran sp_query_store_flush_db. Then I started my stress test again immediately. The timing is a bit delicate, but this allowed me to run a test load at the same time size-based cleanup was running.

You may want to allow Query Store some more room in MAX_STORAGE_SIZE_MB so that sized-based cleanup doesn’t kick in. Be aware that you shouldn’t go completely wild with this – as the fantastic Erin Stellato mentions in Query Store Best Practices: MAX_STORAGE_SIZE_MB should be “set to 10GB at the absolute max, something less ideally”

Erin recommends a custom capture policy to help if you need to reduce the amount of space that Query Store data takes up.

Erin mentions in Remove Data from Query Store that…

ideally, size-based cleanup never kicks in…. This algorithm is not efficient, and it runs single-threaded. It looks for queries that are infrequently executed/less important and deletes those, one by one, until the size is less than 80% of MAX_STORAGE_SIZE_MB. Avoid this type of cleanup if at all possible.

My testing shows this to be very good advice.

To monitor size based cleanup, you can trace the query_store_size_retention_cleanup_started and query_store_size_retention_cleanup_finished Extended Events. Read more in Zikato’s answer on StackOverflow.

The mysteries of time based cleanup

Josh Darnell has written an excellent post on Timing of the Automatic Query Store Purge. I won’t rehash his whole post here– go read it!

The most important basic bits are:

  • Time based cleanup happens every 24 hours
  • It’s roughly based on startup time

That means that if you happen to have a failover in the middle of the day, you may end up with time-based cleanup running in the middle of the day. Not ideal. Depending on your tolerance of cleanup running at a high volume time, this might be worth a planned failover at a low volume time.

As much as I don’t love that, I still find Query Store to be useful enough in general to deal with that.

To monitor time based cleanup, you can trace two Extended Events: query_store_db_cleanup__started and query_store_db_cleanup__finished. Only time-based cleanup events are captured in these events.

DIY Cleanup

There are a few stored procedures you can do to add custom cleanup jobs to run on schedule if you like. Erin Stellato give examples of them in ‘Remove Data from Query Store’:

EXEC sp_query_store_remove_query @query_id = ZZZ; 
 
EXEC sp_query_store_remove_plan @plan_id = YYY;
 
EXEC sp_query_store_reset_exec_stats @plan_id = YYY;

Product improvements that would help

The following things would be super helpful:

  1. Allow configuration of when time-based cleanup runs
  2. Give more insight into which bits of Query Store are taking up space and what levers to configure to “fit” Query Store into your desired maximum size.

And one correction… I had suggested:

Increase visibility of when size-based cleanup runs. I would really prefer for this to be logged in a DMV rather than an extended event, as that’s easier to get into monitoring systems, but I’ll take whatever I can get.

Update: this already exists as an Extended Event. See Zikato’s answer on using query_store_size_retention_cleanup_started and query_store_size_retention_cleanup_finished on StackOverflow.

Happy performance tuning, and please remember to tip your Query Store (by not allowing size-based cleanup to run).

Thanks very much to Erin Stellato, Josh Darnell, and Tomáš Zíka

Erin is super helpful, so smart, and just one of the best folks around. All of her posts on Query Store are incredibly helpful, just search “Erin Stellato Query Store” when you have a question :) If you work with Erin, please tell her that her awesomeness is widely loved.

Thanks also to Josh Darnell for his excellent posts on Query Store Cleanup. This was also super helpful to me. And thanks to Tomáš Zíka (Zikato) for his answer detailing how to trace size-based cleanup.