Everything breaks eventually: data science lessons I’ve learned the hard way
A few years in data science will give you strong opinions about strange things: Microsoft Excel, daylight saving time, unit tests. The beauty of this job is that things can always go wrong in new and interesting ways, but I’ve gradually accumulated some hard-won instincts about what’s most likely to break first.
These are a few of the lessons I’ve learned the hard way on the technical side of the job. The human side deserves its own list (or possibly a thesis).
Dates will break you
29 February 2024 was one of the most humbling days of my career. I had multiple pipelines break because I had embarrassingly forgotten that leap years are a thing.
This isn’t even one of the more sophisticated ways that dates will mess you up. The many incorrect assumptions people make about dates have been documented thoroughly, and even after years as a data scientist I feel like I’m still discovering new ways that dates can ruin my day.
It’s rarely as simple as setting a timezone at the top of your script. Treat dates with a deep suspicion and daylight saving time as a personal affront.
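To make that suspicion concrete, here’s a minimal Python sketch of two of the classics: the leap day that doesn’t exist in the following year, and the day that isn’t 24 hours long. The dates and the timezone are mine, chosen purely for illustration.

```python
from datetime import date, datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Leap years: 2024-02-29 exists, 2023-02-29 does not, so naive
# "same day last year" logic falls over once every four years.
try:
    date(2024, 2, 29).replace(year=2023)
except ValueError:
    print("2023-02-29 does not exist; handle leap days explicitly")

# Daylight saving: the clocks go forward in London on 2024-03-31,
# so "one day later" at the same wall-clock time is only 23 hours
# of elapsed time.
london = ZoneInfo("Europe/London")
before = datetime(2024, 3, 30, 12, 0, tzinfo=london)  # GMT, UTC+0
after = datetime(2024, 3, 31, 12, 0, tzinfo=london)   # BST, UTC+1
elapsed = after.astimezone(timezone.utc) - before.astimezone(timezone.utc)
print(elapsed)  # 23:00:00
```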
Excel will break your data
Memorise these two numbers: 65,536 and 1,048,576. They are the maximum number of rows in old and current versions of Microsoft Excel. If a file arrives with exactly one of these row counts, someone has almost certainly opened it in Excel and saved it, truncating part of your data. No tabular format is safe: Excel will butcher even the most pristine CSV.
Just ask Public Health England, who “lost” 15,841 coronavirus cases to exactly this issue. The blame fell mostly on the old .xls format (65,536-row limit) rather than Excel itself, but even 1,048,576 rows isn’t much by data science standards. If you’re working with anything large, be very cautious about even opening it in Excel.
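None of this is hard to guard against once you’re paranoid enough. A hypothetical sanity check (the file name and the set of suspicious counts are mine) might look like this:

```python
import pandas as pd

# Row counts that smell like Excel truncation: the .xls and .xlsx
# sheet limits, minus one in case the first row held the headers.
EXCEL_LIMITS = {65_535, 65_536, 1_048_575, 1_048_576}

def check_for_excel_truncation(df: pd.DataFrame, name: str) -> None:
    if len(df) in EXCEL_LIMITS:
        raise ValueError(
            f"{name} has exactly {len(df):,} rows, which matches an Excel "
            "sheet limit. Check whether someone opened and saved it in Excel."
        )

cases = pd.read_csv("cases.csv")  # hypothetical file
check_for_excel_truncation(cases, "cases.csv")
```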
In my experience, receiving data in Excel format is a strong signal that trouble lies ahead. If the spreadsheet also has colours, brace yourself.
Part of this is because the people who save data in Excel generally aren’t people who work a lot with data, so mistakes slip through. But a large part of the problem is Excel itself. It maddeningly lets you mix data types freely within a single column, so you’ll see dates in the same column stored as text, numbers, or actual dates. Good luck parsing that.
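When I do have to read one of these files, I’ve learned to take nothing on trust: read every column as text, then convert each one deliberately and count whatever refuses to parse. A rough sketch, with the file and column names made up:

```python
import pandas as pd

# Read everything as text so Excel's mixed types aren't silently coerced.
df = pd.read_excel("export.xlsx", dtype=str)

# Convert deliberately; errors="coerce" turns anything unparseable into
# NaT/NaN so it can be counted and inspected rather than lost quietly.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce", dayfirst=True)
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

print(df["signup_date"].isna().sum(), "dates refused to parse")
print(df["amount"].isna().sum(), "amounts refused to parse")
```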
Excel is also troublingly overzealous in autoconverting what it thinks are dates. Reportedly, one in five published genetics papers has errors introduced by Excel, such as gene names like SEPT2 being converted to dates. It’s so bad that a number of genes have actually been renamed to prevent this.
At this point, I’m almost impressed by the sheer creativity of Excel’s failures.
Don’t try to be clever
Just about every time I’ve congratulated myself for turning something into a single elegant one-liner, disaster has followed. If code looks curiously labyrinthine, there’s usually a reason. Occasionally that reason is that the last person didn’t know any better, but more often it’s that your clever fix really isn’t one.
And especially don’t try to be clever, funny, or cute when naming things. It might seem hilarious now to take inspiration from pop culture or a meme when naming your functions and variables, but it won’t be when you’re explaining it in a code review or debugging it for the hundredth time. Clarity beats wit every time.
Never trust a join
More than once, a join has ruined my day by not doing quite what I expected it to. Sometimes I’ve unexpectedly lost rows because my join keys don’t match due to hidden whitespace, case differences, mismatched types, or missing values. And sometimes I gain rows I didn’t have before, because what I thought was unique wasn’t.
Then there are non-equi joins, where the possibilities for chaos multiply. Off-by-one logic and boundary conditions will find exciting new ways to defy your expectations.
A lot of these problems can be prevented by profiling your data carefully so you know where duplicates lurk, understand how tables relate, and identify the right keys. These days I always do a quick row count after a join to check whether I’ve mysteriously lost or gained rows. When joins go wrong, they don’t warn you; they just quietly hand you mangled data.
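In pandas, a lot of that checking can be baked in rather than done by eye. A small sketch with made-up tables:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [" 1", "2", "3"], "amount": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": ["1", "2", "4"],
                          "region": ["North", "South", "East"]})

# Normalise keys first: hidden whitespace and case differences are
# classic silent killers.
orders["customer_id"] = orders["customer_id"].str.strip()

# validate="many_to_one" makes pandas raise immediately if the
# right-hand keys turn out not to be unique, instead of quietly
# multiplying rows.
rows_before = len(orders)
merged = orders.merge(customers, on="customer_id", how="left",
                      validate="many_to_one")

assert len(merged) == rows_before, "join changed the row count"
print(merged["region"].isna().sum(), "orders found no matching customer")  # 1
```

The validate argument alone has saved me from more duplicate-key surprises than I’d like to admit.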
Know where your data came from
It’s always tempting to jump straight into the exciting modelling part of a project without spending much time getting to know your data. This is invariably a mistake.
What does a row actually represent? A customer, a transaction, a click? Are old records overwritten, or is a new row added for every change? If it’s the latter, how do you identify the most up-to-date record? Do you know what each field means? Are you sure? When and how is each field populated? What assumptions were made when the dataset was assembled? Who or what was left out? What was the dataset originally intended for?
If you can’t answer these questions, you have no business training a model. Some of them you can figure out by exploring the dataset, but others will require talking to people. Those conversations will likely also reveal useful undocumented tidbits about the data, like a field that changed definition in 2021, or the column everyone knows not to trust.
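The exploring part doesn’t need to be elaborate. A few one-off checks answer a surprising number of those questions; the dataset and column names here are entirely made up.

```python
import pandas as pd

df = pd.read_csv("transactions.csv")  # hypothetical dataset

# Is the supposed key actually unique, i.e. what does a row represent?
print(df["transaction_id"].is_unique)

# Overwritten or appended? Multiple versions per entity suggest a
# history table rather than a current snapshot.
print(df.groupby("customer_id")["updated_at"].nunique().describe())

# When and how is each field populated? Start with how often it isn't.
print(df.isna().mean().sort_values(ascending=False).head(10))
```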
Data is very often collected for reporting, and then someone later has the idea of using it for data science. Unfortunately, a heartfelt desire to be data-driven doesn’t make the data suitable. It’s better to discover that before you’ve built a model that confidently produces nonsense.
Just write the wretched unit tests
Chances are you’re already testing your code; you’re just doing it badly: print statements, eyeballing a few rows, pulling out a single example. It’s hacky and non-reproducible, but it’s still testing.
You may as well turn it into a real unit test. It’ll force you to think about what your code is actually supposed to do, what edge cases exist, and which failures should be caught instead of silently ignored. This almost always results in better code. Even when your code fails, it will at least do so in a vaguely helpful way that makes it easier to pin down the problem.
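Here’s roughly what that looks like in practice: a throwaway eyeball check promoted into a couple of pytest tests. The transformation itself is a made-up stand-in for whatever you’re actually writing.

```python
# test_cleaning.py -- run with `pytest`
import pandas as pd

def clean_amounts(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation: parse amounts, drop negatives."""
    out = df.copy()
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    return out[out["amount"] >= 0]

def test_drops_negative_and_unparseable_amounts():
    raw = pd.DataFrame({"amount": ["10.5", "-3", "oops", None]})
    assert list(clean_amounts(raw)["amount"]) == [10.5]

def test_empty_input_returns_empty_frame():
    assert clean_amounts(pd.DataFrame({"amount": []})).empty
```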
If you save time by skipping tests, expect to spend twice as long debugging later. If you’re lucky.
Assume things will break
Even when I’m not a hypocrite and follow my own rules, things still go wrong sometimes. Sometimes it’s an overlooked bug on my end; sometimes it’s a change in the data inputs to a pipeline. Either way, it’s crucial to think about what should happen when things break.
Some people seem to have a horror of pipelines failing, but do you really want them to run at any cost? Even when the input is missing and the output won’t make sense?
It’s worth explicitly deciding what you want to happen if something goes wrong. Say your model depends on data that usually lands by 7 a.m. — what do you want to happen if it’s late, or doesn’t arrive at all? If it’s absolutely crucial that a prediction be generated even if the data you need isn’t there, perhaps you need a fallback model.
You can’t anticipate every failure, but you can and should plan for what happens when one inevitably does. The goals are simple: prevent broken pipelines from quietly producing garbage, and make the underlying problems easy to locate.
Fail loudly, fail clearly, and fail as close to the source as possible. If a key dataframe is empty, kill the pipeline immediately. The longer you let something broken flow downstream, the harder it becomes to trace where it went wrong.
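In code, that usually amounts to a handful of blunt checks at the top of the pipeline. A sketch, with the file path, the fallback, and the flag all invented for illustration:

```python
from pathlib import Path
import pandas as pd

INPUT_PATH = Path("daily_feed.csv")   # hypothetical upstream file
ALLOW_FALLBACK = False                # an explicit, deliberate decision

def load_input() -> pd.DataFrame:
    """Fail loudly, clearly, and close to the source."""
    if not INPUT_PATH.exists():
        if ALLOW_FALLBACK:
            # e.g. reuse yesterday's snapshot or a simpler fallback model,
            # but only because someone decided that in advance.
            return pd.read_csv("yesterday_snapshot.csv")
        raise RuntimeError(f"{INPUT_PATH} has not arrived; aborting the run")
    df = pd.read_csv(INPUT_PATH)
    if df.empty:
        raise RuntimeError(f"{INPUT_PATH} is empty; refusing to pass it downstream")
    return df
```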
Things always break at the worst possible time. Make life easy for your future self, who’ll be the one debugging. It’s bad enough fixing things at 3 a.m. without also having to explain why no one noticed the problem for weeks.