The hidden work of data science projects
You’ve probably heard the phrase “80% of data science is data cleaning.” It’s not totally wrong, but it is incomplete. While I agree that data science often involves surprisingly little modelling, the 80% line doesn’t do justice to the full range of non-modelling work that makes data projects succeed. These are the things that tend to go unnoticed and unappreciated if everything goes well — but skip them at your peril.
To me, this hidden work falls into two broad categories:
- Work that’s invisible unless it goes wrong, and
- Work that’s visible but undervalued, including work dismissed as “soft skills” (as though the people stuff weren’t at least as complex as the technical stuff!)
Invisible work: getting things right before anyone notices
Some of the most important work happens before you even look at the data. Talking to stakeholders, understanding the problem they’re trying to solve, and clarifying their real needs are essential first steps. Too often, we make an assumption about what’s required and jump straight into a fascinating data science problem that unfortunately turns out to be somewhat adjacent to the one that actually needs solving. When that happens, the analysis might be technically elegant but strategically useless.
Data cleaning, of course, takes up a lot of our time and often bleeds into the only slightly more glamorous task of feature engineering. I’ve heard of teams where cleaning is done by one or two people who then hand it over to data scientists to do their magic. This baffles me. Cleaning is where I learn the oddities of the data and understand the data-generating process. It’s also where I start thinking not just about what’s in the data, but what’s missing. The questions that surface here often improve the modelling, while the conversations they spark with stakeholders reveal both their priorities and the data’s limits.
This is also the point where assumptions are most likely to trip you up. Are you sure that the date column is the date of the event, and not the date the row was last modified? Are those users with the ‘active’ flag currently active, or simply users who have ever been active? No amount of modelling can fix a misunderstanding like that. The best data scientists cultivate a healthy distrust of their data.
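To make that distrust concrete, here is the kind of quick sanity check I mean. It's a minimal sketch in pandas, and the file name and columns (`event_date`, `last_modified`, `is_active`, `last_seen`) are made up for illustration; substitute whatever your own data actually calls them.

```python
import pandas as pd

# Illustrative only: the file and column names are stand-ins for your real data.
df = pd.read_csv("events.csv", parse_dates=["event_date", "last_modified", "last_seen"])

# Is 'event_date' really the event date, or just the last time the row was edited?
share_identical = (df["event_date"] == df["last_modified"]).mean()
print(f"{share_identical:.0%} of rows have event_date equal to last_modified")

# Does the 'active' flag mean currently active, or ever active?
# Cross-check it against recent activity before trusting it.
cutoff = pd.Timestamp.now() - pd.Timedelta(days=30)
stale_but_flagged = df[df["is_active"].astype(bool) & (df["last_seen"] < cutoff)]
print(f"{len(stale_but_flagged)} 'active' users with no activity in the last 30 days")
```

Ten minutes of checks like these will either confirm your reading of the data or hand you a very useful question to take back to whoever owns it.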
Undervalued work: the care that keeps projects alive
Even once we’re past the modelling itself, there’s plenty more hidden work waiting if you want a project to succeed in the real world beyond your laptop or dev environment. This includes documentation, testing, ensuring reproducibility, and setting up monitoring. It’s easy enough to make a script run once on your machine; it’s much harder to make it run reliably every day (or every hour).
If you’re on vacation and something breaks, will one of your colleagues be able to understand your code well enough to track down the problem? More importantly, if something goes wrong upstream, will your code fail at all? We tend to treat failing code as the problem, but sometimes failure is the safety mechanism. If the data your pipeline relies on hasn’t landed as expected, do you want it to run? Probably not. Better to fail loudly than to quietly produce incorrect results for two months and have no one notice.
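As an illustration, here is a minimal fail-fast guard you might put at the top of a daily pipeline. The function name, column name, and thresholds are placeholders rather than a prescription; the point is simply to refuse to run when the inputs look wrong.

```python
from datetime import date, timedelta

import pandas as pd


def check_inputs(df: pd.DataFrame) -> pd.DataFrame:
    """Refuse to run if yesterday's data hasn't landed as expected."""
    expected_day = date.today() - timedelta(days=1)

    if df.empty:
        raise ValueError("Input data is empty; refusing to run the pipeline.")

    # Assumes 'event_date' has already been parsed as a datetime column.
    latest = df["event_date"].max().date()
    if latest < expected_day:
        raise ValueError(
            f"Latest event_date is {latest}, expected {expected_day}; "
            "the upstream load may have failed."
        )

    # Crude volume check: an unusually small day often means a partial load.
    if len(df) < 1_000:  # placeholder threshold, tune it to your data
        raise ValueError(f"Only {len(df)} rows landed; expected far more.")

    return df
```

A loud `ValueError` at 6am is annoying; a dashboard quietly built on half a day of data is far worse.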
Maintenance is one of the most neglected parts of data science. It’s true that you can’t anticipate every problem, but too many projects are treated as finished once the model is deployed. In fact, that’s usually when the real work starts. The numbers in a slide deck don’t mean much if the model quietly drifts off course in the real world.
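Noticing that drift doesn't have to be elaborate. Below is a minimal sketch that compares recent model scores against a baseline saved at training time; the two-sample Kolmogorov-Smirnov test is just one easy option, and the names and threshold are assumptions rather than a recommendation.

```python
import numpy as np
from scipy.stats import ks_2samp


def looks_drifted(baseline_scores: np.ndarray, recent_scores: np.ndarray,
                  alpha: float = 0.01) -> bool:
    """Flag possible drift if recent scores look unlike the training-time baseline."""
    statistic, p_value = ks_2samp(baseline_scores, recent_scores)
    if p_value < alpha:  # alpha is a placeholder; tune it to your tolerance for false alarms
        print(f"Possible drift: KS statistic {statistic:.3f}, p-value {p_value:.4f}")
        return True
    return False


# Run on a schedule, e.g. daily:
#   looks_drifted(scores_saved_at_training_time, scores_from_the_last_week)
```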
Think your analysis is a one-off you’ll run once and never look at again? Think again. Chances are you’ll run it again in six months, and future you won’t remember how anything works. Or someone else will ask to reuse your code, and you’ll have that sinking feeling as you open a file you promised was ready to share and realise it’s a mess of spaghetti code. Even if you’re 90% sure you won’t look at it again, save yourself time (and embarrassment) with at least minimal documentation and some basic steps to ensure reproducibility.
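And “minimal” really can be minimal. The sketch below is one possible starting point, with an arbitrary seed and placeholder file names: fix the randomness, record the environment, and leave a note about the data.

```python
# Low-effort reproducibility: fix the randomness and note what you ran with.
import random

import numpy as np

SEED = 42  # arbitrary, but write it down
random.seed(SEED)
np.random.seed(SEED)

# Then capture the environment alongside the code, for example:
#   pip freeze > requirements.txt
# and note in a README which data snapshot (a date or file hash) the analysis used.
```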
Communication: the bridge between technical and human
Even data scientists with strong technical skills often do themselves a disservice when it comes to communication. In most cases, your stakeholders aren’t going to pore over your code and admire its elegance. They care whether the results are correct, understandable, and relevant. Showing that you’ve met their requirements and giving them reasons to trust your work is a separate skill set from building the model itself, but just as important.
You can build a great model, but if stakeholders don’t understand it or trust it, it probably won’t be used. Worse, if they misunderstand it, they might use it in ways you never intended. When this happens, it’s often blamed on “non-technical stakeholders” (derogatory) failing to intuit what we meant but never communicated. In truth, that’s on us. Having invested all that effort in the technical side, why drop the ball when it comes to communicating what we’ve achieved?
Final thoughts
Data projects rarely fail because of one bad model. They fail because the less exciting work of scoping, cleaning, testing, documentation, maintenance, and communication didn’t get the same care and attention. These quieter parts of a project are what make the visible parts possible. Paying attention to them isn’t extra work — it is the work.