July 11, 2023
I was doing a bit of data analysis, and the resulting numbers didn’t quite add up.
I double-checked my queries. Had I goofed in my sql? Nope. Next, I looked into if some of the data was in an inconsistent state.
What I found was worse than what I’d imagined. As a data person, it made me feel sad and icky.
That’s because it’s usually not too hard to clean up bad data. It’s almost always much harder to fix a badly designed data model which is already established in production.
I’m anonymizing the schema here, because it doesn’t matter who did this. Badly designed relational databases are very, very common– this one made me feel a certain way because it’s uniquely bad, but it’s certainly not the worst thing anyone’s ever done to a database. It was simply pernicious.
So let’s say we have a table with many millions of rows, named Widgets. Widgets is primarily a table in an OLTP database, the but database is also used for lots of analytics.
Widgets has many columns, but among them are:
- WidgetId, a surrogate key that uniquely identifies every widget in the table.
- WidgetTypeId, which joins to WidgetTypes.
- WidgetOwnerId, which joins to WidgetOwner.
Here’s a mermaid.js entity relationship diagram:
What wasn’t adding up: ownerless widgets
Widgets was a very commonly used and core table for the OLTP database. What I found was that although the table had many millions of rows, all of them had a WidgetOwnerId… except for around six rows. Only six.
It smelled like a data cleansing problem at first, but when I had a look at the six rows I noticed something odd: they all had negative integers as the WidgetId, while every other row in the table had positive integer ids.
In other words:
- A handful of widgets had a negative WidgetId, and all of these rows had a null WidgetOwnerId
- Many millions of widgets each had a positive WidgetId, and all of these rows had a valid WidgetOwnerId
- All widgets had a WidgetTypeId.
This seemed like a clue. I began asking around about what the negative WidgetId values meant.
The answer was that they were actually a very different type of widget altogether. But when that type of widget was implemented, a new WidgetType wasn’t created… instead they re-used a different (let’s call it “incorrect”) WidgetType, and used a negative identifier to indicate that this wasn’t any ordinary widget.
In other words, WidgetType was not only determined by the WidgetTypeId column in the table. To actually know the WidgetType, you also had to look at whether WidgetId was a positive or negative value.
What’s so terrible about this? Let us count the ways
Essentially, we’ve got “hidden” business logic here, and it’s business logic which doesn’t follow the established conventions of normalization. Here’s the top three negative impacts:
Bad/breaking code. This is a weird pattern that is not easy to notice. Anyone not intimately familiar with the data can easily make mistakes when working with this data, which could have significant impacts on applications using the OLTP database.
Extra cognitive load. This pattern is not easy to get comfortable with or to remember, especially if you’re not a data person. Having to teach everyone who works with this object the nuances of how it is designed, and have them remember how to work with it adds a weird complexity that is tough to remember.
Knock-on effects downstream. Want to put this table in a data warehouse? The choices are to leave the structure as is, in which case you risk incorrect analytics / extra cognitive load for your analysts. Or you could build in complex transformations that may be very fragile depending on how the object changes in the future. Neither of these is a great option.
The root cause wasn’t dumb people… it was a cultural problem
To folks used to normalized data structures, this sounds pretty dumb. The tables were already normalized! Why make it weird?
Well, this didn’t happen because folks didn’t understand normalization. This happened as a “creative” solution to ship some changes faster by reusing code. That decision was driven by strong pressure from a leader to meet deadlines.
What became a technical problem was caused by a leader who created a problematic culture, in which folks felt like they didn’t have time to do the right thing. Instead they live in a perpetual “just force it to work NOW” moment. It turns out the codebase was full of weird hacks that had been made to meet shipping demands and schedules. And the hacks were growing. This was merely one example.
The result of a constant “just force it to work NOW” culture is:
- An extremely fragile production environment that breaks easily– and even small errors cause full outages.
- An engineering team that has a hard time hiring junior or even mid-level software developers because of the cognitive load required to understand and work with the complex code.
- Frequent incorrect analytics.
This is not an easy situation to dig out of while also building new features, so any changes need to be made carefully, bit by bit. And you also have to do something about all those fresh anti-patterns entering the architecture on the regular to get those new features shipped.
Read more stories of code that made folks feel a certain way
This post was inspired by Erik Darling’s TSQL Tuesday topic, “Code That Made You Feel A Way”.