Should release cadence be slowed if you don't have database load testing?

on April 23, 2019

I got a question recently about a panel discussion on Database Development Disasters at SQL in the City Streamed. I had framed a question as, “how fast should development go without load or performance testing?”

I got a follow-up question from my friend Chris Randvere at Redgate: he asked for more information about what the question meant. I realized that my wording had been pretty unclear. I had meant to ask the panelists what their thoughts were on release cadence when a team lacks tooling to do automated load and performance testing outside of production.

Should the lack of automated performance testing ability change the rate at which we deploy software?

In other words, if we can’t do performance and load testing, does that mean that we should or shouldn’t deploy a change to a database every weekday?

I don’t think we covered this super-well in the panel because I worded the question poorly. So I wanted to share my experience around this, and also talk about why it can be fairly common for teams to lack automated load testing ability outside of production.

Why doesn’t everyone have an environment where they can validate performance by replaying activity against an updated database before it ever gets released?

We have some built-in tools for this in SQL Server; the current incarnation is called Distributed Replay. These tools are not the most lovingly tended by Microsoft in terms of updates. A frequent complaint that I’ve heard about Distributed Replay is that the current version of the tooling still requires you to feed it old-style traces captured with SQL Trace.

You don’t necessarily have to have the Profiler app running while you do the trace, but old-style SQL Trace output is, in other words, what the tool takes as input. You can’t capture a more modern Extended Events trace and feed that in.

But that lack of updates isn’t the main reason why not everyone runs Distributed Replay.

Distributed replay is tricky to set up

The more complex your environment, the trickier it is to set it up. If you’ve got things like linked servers or SQL Server transactional replication creating interesting patterns in which your SQL Server communicates with other SQL Servers, that can make using Distributed Replay particularly tricky.

There are absolutely people out there who’ve configured Distributed Replay on complex systems, but they all say it wasn’t something they set up in just two hours. So one factor is the complexity.

Another factor: distributed replay is designed to replay, not to amplify

When we’re doing performance or load testing, we’re not always interested in how the system will perform under the current load. A lot of the time we’re interested in how it will perform under even more load: 150% of the load, or 200%, or more.

But with any replay tool – I’m not just dogging on Distributed Replay here – if we think about the nature of database replay, we can’t replay the exact same commands and expect to learn how performance will be at a higher load rate.

Take a delete command, for example. If the delete finds 10,000 rows to delete on the first run, that could be a fair amount of work.

If we replay that same delete command, depending on the criteria in the delete, it may well find zero rows to delete the second time, because the work has already been done. Similar things can happen with updates. We may also have constraints that mean we can’t just insert the same row twice, depending on the nature of the data.
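Here’s a minimal sketch of the problem, with a table name assumed purely for illustration:

-- Hypothetical cleanup delete: plenty of work on the first run, almost none on replay.
DELETE FROM dbo.SessionLog            -- assumed table name, for illustration only
WHERE CreatedDate < '2019-01-01';
-- First run: 10,000 rows deleted (lots of reads, writes, and log activity).
-- Replayed run: 0 rows affected, because the qualifying rows are already gone.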

So because of the way modifications work, simply amping up the rate of commands in a replay isn’t the same as adding true additional load.

Now, there are other ways you can do load testing. There are third-party tools that you can buy to get around this problem of repeated modifications.

They can be expensive. But they also require a fair amount of coding, because you’ve got to write commands that get the data into a state where you can say: hey, let’s turn the volume of activity up to 200%, or 300%.

So, some folks do that. But because of the cost and the effort involved, folks tend to take this on with third-party tools only when it’s really worth their while and their management is deeply invested in the idea of having load testing.

Even then, the load testing needs to be updated for some changes

Some changes can be load tested without touching the test setup at all. For example, if I refactor a function for performance tuning but don’t change any inputs or outputs, I could test that with an existing configuration of a load testing tool.

But what if I add a new parameter onto a stored procedure? If I don’t change the load testing, will that be appropriate, or not? Should I be running the load test with a variety of values for that?
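For instance, here’s a hypothetical sketch (the procedure and table names are assumed, not from any real system) of how adding an optional parameter leaves an old load test exercising only the default code path:

-- Adding an optional parameter is backwards compatible, but a load test built from
-- old traffic will only ever hit the default value of the new parameter.
ALTER PROCEDURE dbo.GetOrders          -- assumed procedure name, for illustration
    @CustomerId INT,
    @IncludeCancelled BIT = 0          -- new parameter; replayed calls never set it
AS
BEGIN
    SELECT OrderId, OrderDate, Status
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId
      AND (@IncludeCancelled = 1 OR Status <> 'Cancelled');
END;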

Or what if my change involves dropping one procedure and adding in another one? A replay system would have no idea what to do, and with a load testing system I’d need to adjust what gets executed, how often, etc.

When it comes to testing database changes, load testing tools are excellent, but I’d expect some human work to be required as well.

So, most people don’t have automated performance and load testing. Should that impact how frequently we deploy changes to production? 

What we’re really looking to find with load testing in the SDLC is regressions in performance. We should have other testing to catch functional defects, such as making sure that the right results are returned.

With database changes, there is a fair amount of work we can do to make sure that things perform well without load testing. Other due diligence can help: we can maintain and use a staging or pre-production environment with production-size datasets and the same data distribution as production, for example.

In that environment, we can run tests that confirm: which indexes is the modified code using, and are they optimal for it? How long are queries taking? We can make educated guesses about production performance instead of waiting until after a release to see what it’s like.
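As a rough sketch of what that manual check can look like (the procedure name and parameter here are assumptions, purely for illustration):

-- Run the modified code in staging with timing and IO statistics on, and check the
-- actual execution plan to see which indexes it uses.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
EXEC dbo.GetOrders @CustomerId = 42;   -- assumed procedure and parameter
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

-- Then sanity-check the heaviest recent queries from the plan cache.
SELECT TOP (10)
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    qs.total_elapsed_time  / qs.execution_count AS avg_elapsed_microseconds,
    SUBSTRING(st.text, 1, 200)                  AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_logical_reads DESC;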

This level of manual performance testing can work extremely well, and in my experience it frees us up to deploy frequently.

Without load testing, it’s best to frequently deploy small database changes

By small, I mean as small as possible. A lot of these database changes are going to be things that our customers shouldn’t notice at all. Like, hey, we add a new column to this table. We’re not actually using it, though. We’re going to use it in the future for a feature that’s coming out soon.

But we regularly trickle out the staging steps for a future change. Each of these steps is backwards compatible and deployed well ahead of the point at which we “turn on” the new feature for our customers, which is often handled via an application control called a feature flag.
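A typical staging step looks something like this (the table and column names are assumptions, for illustration only):

-- Backwards-compatible change: add a nullable column that nothing reads or writes yet.
-- Existing queries and application code keep working; a feature flag enables the new
-- behavior later, once the column is in place everywhere.
ALTER TABLE dbo.Customers              -- assumed table name
    ADD PreferredContactMethod VARCHAR(20) NULL;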

This regular stream of very small changes is helpful for speedily resolving any performance issues which may occur.

If performance does change, we haven’t released a big batch of 50 changes all at once – that’s a lot of things to go through to find the culprit. Instead, we’ve released maybe 7-10 changes in the last week, and we can look at the most recent ones first, and check if they could be related to the issue.

What do you mean by “frequent”?

By “frequent”, I mean that daily releases shouldn’t be a big deal.

But remember: batch size is critical. This isn’t like a race car; it’s like a steadily dripping faucet.

Complexity: sometimes performance problems don’t happen right after a change is released

There can be changes that we make where performance is fine for a while, and then something happens. Maybe it’s even three weeks after the change was released, but suddenly performance is terrible.

One particularly tricky issue in SQL Server that can cause this is what’s called parameter sniffing. If we’re using parameterized queries, a lot of how things perform depends on the execution plan that gets cached with the first set of parameters passed into the query or procedure when it’s compiled, because SQL Server reuses that execution plan until something causes it to recompile.

Maybe the thing that causes the query to recompile is enough data has changed that statistics automatically update. Maybe it’s that there’s a failover in an Availability Group. Maybe the SQL Server is restarted. Maybe someone manually clears the execution plan cache.

A very wide variety of things can trigger this, but if we happen to have a recompile, and the compilation happens with a set of parameters that leads to an execution plan that doesn’t work so well, then we can suddenly run into performance issues.
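Here’s a minimal, hypothetical illustration (object names assumed) of how the first compilation can set the plan for everyone who follows:

-- Whichever parameter value is passed at compile time shapes the plan that gets cached.
CREATE OR ALTER PROCEDURE dbo.GetOrdersByCustomer   -- assumed names, for illustration
    @CustomerId INT
AS
BEGIN
    SELECT OrderId, OrderDate
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId;
END;
GO

-- If the plan is first compiled for a customer with a handful of orders, a plan tuned
-- for a small row count is cached...
EXEC dbo.GetOrdersByCustomer @CustomerId = 101;

-- ...and reused for a customer with millions of orders, which may perform badly until
-- a stats update, failover, restart, or plan cache clear forces a recompile.
EXEC dbo.GetOrdersByCustomer @CustomerId = 202;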

But this risk of parameter sniffing doesn’t mean that we shouldn’t release changes frequently. We’re going to have the same risk whether we release a big batch of changes or regularly release small changes.

For issues like this, I like the recent auto-tuning features in SQL Server. Essentially, the automatic plan correction feature looks for cases where a query is sometimes fast and sometimes slow, and attempts to identify “bad” execution plans which are periodically cached for a query.

You have to have the built-in Query Store feature enabled, and when using on-prem SQL Server this feature requires Enterprise Edition for production (Developer Edition has all the features of Enterprise for non-production environments). But if you do have a SQL Server where performance is critical, this feature is a huge deal. It gives you the option to either let it temporarily correct the plan for you, or to notify you in a DMV that it’s spotted a problem that needs tuning.

This feature can either work alongside automated load testing – you can use it to spot parameter sniffing problems before deployment – or it can give you early warning of performance problems before your customers start complaining. It also gives you insight into how to reproduce these tricky problems outside of production.
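Enabling this is fairly compact; a minimal sketch, assuming a database named YourDatabase:

-- Query Store has to be on before automatic plan correction can do anything.
ALTER DATABASE [YourDatabase] SET QUERY_STORE = ON;

-- Let SQL Server force the last known good plan when it detects a regression...
ALTER DATABASE [YourDatabase] SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);

-- ...or leave correction off and just review the recommendations it surfaces.
SELECT reason, score, details
FROM sys.dm_db_tuning_recommendations;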

A quick recap

Automated load testing is fantastic, if you have the time and budget to take it on for your team.

But if you don’t have it, the lack of automated load testing shouldn’t block you from frequently deploying small changes to production. By combining targeted manual performance testing with properly planned changes, you can still deploy frequent changes safely into high-performance, high-uptime environments.