The case against propensity modelling

Author: Gabrielle Josling

Published: October 5, 2025

Propensity models look for patterns in past data to estimate how likely someone is to do something (buy, cancel, or upgrade). It’s a deceptively simple idea: train a model on who acted before and use it to predict who will act next.
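
In code, this usually amounts to something like the sketch below: a classifier fitted to historical outcomes and then used to rank customers by score. The data file and column names here are purely illustrative, not a reference implementation.

```python
# A minimal sketch of a "propensity" model as typically built in commercial
# settings: a classifier trained on who acted before, used to rank who might
# act next. The file and column names are made up for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

history = pd.read_csv("customer_history.csv")  # hypothetical historical data
features = ["recency", "frequency", "monetary"]

X_train, X_test, y_train, y_test = train_test_split(
    history[features], history["purchased"], test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Rank customers by predicted likelihood of purchasing and target the top 10%.
history["score"] = model.predict_proba(history[features])[:, 1]
top_decile = history.nlargest(int(len(history) * 0.1), "score")
```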

It sounds powerful, and in theory it should let you focus your efforts where they’ll have the most impact. In practice, the question propensity modelling answers is often the wrong one.

How propensity modelling lost its way

Data scientists working in commercial settings might be surprised to learn that propensity modelling was originally developed for causal inference. The propensity score itself was never intended to be the final output, but simply an intermediate step towards estimating the thing you actually care about. It’s still used in this way in fields such as epidemiology and economics, where it’s often necessary to estimate the effect of an action or intervention in the presence of many confounding factors.
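
For contrast, here is a minimal sketch of that original usage: model the probability of receiving a treatment given confounders, then use the scores (here via inverse-probability weighting) to estimate the average treatment effect. The data and column names are again illustrative.

```python
# Propensity scores as originally intended: an intermediate step towards a
# causal estimate, not the final output. Column names are illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("observational_data.csv")  # hypothetical observational data
confounders = ["age", "tenure", "prior_spend"]

# Step 1: model the probability of receiving the treatment given confounders.
ps_model = LogisticRegression(max_iter=1000).fit(df[confounders], df["treated"])
ps = ps_model.predict_proba(df[confounders])[:, 1]

# Step 2: use the scores for inverse-probability weighting to estimate the
# average treatment effect on the outcome, adjusting for confounding.
t, y = df["treated"].to_numpy(), df["outcome"].to_numpy()
ate = np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps))
print(f"Estimated average treatment effect: {ate:.3f}")
```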

Over time, the term ‘propensity modelling’ escaped the world of causal inference and took on a new life in marketing and customer analytics. The causal idea was replaced by a predictive one that ranks people by likelihood. The name stayed the same, but the goal completely changed.

This may sound like nitpicking, but it means that these models answer very different questions from the ones organisations assume they do. Because of this misunderstanding, propensity scores are used in ways they were never intended for, and often they aren’t evaluated robustly enough for anyone to even notice.

Suppose a retailer builds a so-called propensity model predicting who’s likely to use a discount voucher. They send offers to the top 10% of customers by score and see a jump in purchases. It looks like success. But how many of those people would have bought at full price had they not been sent a voucher? How many would have purchased but were put off by the marketing emails? And who would have purchased had they been sent a voucher, but never got one because they belong to a group the retailer has never targeted?

This shows how causal and predictive models answer fundamentally different questions, and how answering the wrong one leads to systematically poor decisions. The irony is that propensity modelling was invented to correct exactly this kind of bias. In its original form, it was a tool for causal reasoning, not a shortcut for prediction.

The invisible middle: who can actually be persuaded

Predictive models are good at finding people who have historically behaved in certain ways. But what most organisations actually need to know is who can be influenced; that is, who will change their behaviour because of an action you take.

Those working on customer retention have long recognised four broad groups:

  • Sure things, who are likely to stay regardless of what you do
  • Lost causes, who are likely to leave regardless of what you do
  • Persuadables, who are more likely to stay because you act
  • Sleeping dogs, who are more likely to leave because you act

Ideally, you want to identify the “persuadables”, because that’s where you can have the most impact. But predictive models excel at identifying “sure things”. If you contact “sure things”, they’ll probably stay; what you don’t see is that they probably would have stayed anyway.
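
One way to make the distinction concrete is to express the four groups in terms of two counterfactual probabilities: the chance someone stays if you contact them and the chance they stay if you leave them alone. The sketch below is purely illustrative; a standard predictive model only ever estimates one of these two numbers.

```python
# The four retention groups, expressed as counterfactual probabilities.
# p_if_contacted and p_if_left_alone are the probabilities of staying with
# and without the intervention; a purely predictive model never sees both.
def segment(p_if_contacted: float, p_if_left_alone: float, threshold: float = 0.5) -> str:
    if p_if_contacted >= threshold and p_if_left_alone >= threshold:
        return "sure thing"    # stays either way
    if p_if_contacted < threshold and p_if_left_alone < threshold:
        return "lost cause"    # leaves either way
    if p_if_contacted > p_if_left_alone:
        return "persuadable"   # stays *because* you act
    return "sleeping dog"      # leaves *because* you act

print(segment(0.9, 0.85))  # sure thing: contacting them changes little
print(segment(0.7, 0.3))   # persuadable: the intervention makes the difference
```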

The underlying problem is that the “persuadables,” or the invisible middle, are rarely visible in the training data. If you only ever see the outcomes of people you contacted, the model can’t learn what would have happened had you done nothing. And if you’ve historically neglected some groups of customers, the model will quietly learn and reinforce that pattern. It will keep recommending attention to the same groups and silence for the rest.

This leads to the worst kind of failure: one that looks like success and convinces you to keep doing what isn’t actually working.

The illusion of success

Organisations that misuse propensity modelling often compound the problem by evaluating it poorly. Without rigorous testing, they can’t even tell that it isn’t working.

There are a few common approaches organisations take to evaluate the impact of propensity models, all of them deeply problematic:

  • We did better than last time: A common benchmark is the previous campaign, or one arbitrarily declared to be similar and therefore comparable. Occasionally the new campaign actually performs worse, but this is explained away by the economy or other factors. The argument is then made that it would have been even worse without the model. This is a very embarrassing position for a data scientist to defend.

  • No control group: Only 10% of customers predicted to leave did so after being sent a discount voucher. Without a basis for comparison, that number is meaningless: maybe the voucher worked, or maybe the model was wrong and most of them were never going to leave anyway. A proper experiment would deliberately hold out some people predicted to leave, so you can see whether the voucher actually makes a difference (see the sketch after this list). This approach also tells you nothing about people who did cancel but weren’t targeted. Why did the model miss them? Could they have been persuaded?

  • Inappropriate control group: This is a slightly more sophisticated version of “no control group.” A retailer sends a discount voucher to customers predicted to leave and afterwards treats those who didn’t open the email as the control group. The problem here is that customers who don’t open marketing emails are systematically different from those who do, so the comparison is biased from the start. It’s hardly surprising you see a difference in behaviour between groups that differ in engagement from the outset.
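
For contrast, here is roughly what a sound comparison looks like once a randomised holdout exists: compare outcomes in the treated group against the held-out control drawn from the same scored population, with a simple significance test. The counts below are made up for illustration.

```python
# Evaluating a campaign against a randomised holdout, rather than against last
# time's numbers or a self-selected "control". Counts are illustrative.
from statsmodels.stats.proportion import proportions_ztest

# Among customers the model flagged as likely to leave, some were randomly
# held out and received no voucher.
treated_n, treated_cancelled = 9_000, 900   # sent a voucher
control_n, control_cancelled = 1_000, 115   # randomly held out

treated_rate = treated_cancelled / treated_n
control_rate = control_cancelled / control_n
print(f"Cancellation: {treated_rate:.1%} treated vs {control_rate:.1%} control")
print(f"Estimated reduction in cancellations: {control_rate - treated_rate:.1%} points")

# A simple two-proportion z-test for whether the difference is real.
stat, p_value = proportions_ztest(
    count=[treated_cancelled, control_cancelled], nobs=[treated_n, control_n]
)
print(f"p-value: {p_value:.3f}")
```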

Whether a model actually changes behaviour is an empirical question, not an article of faith. Impact has to be demonstrated for every model and in every context where it’s used. There are many ways to trick yourself into believing something is working when it isn’t. Fortunately, we have the field of statistics to fight against this tendency.

When you don’t evaluate performance robustly, you commit to wasting time and money on a model that doesn’t actually work. Worse still, you rob yourself of the chance to iterate and find something better.

If you never have failures, something’s probably wrong with how you evaluate your work.

The feedback loop trap

Beyond the confusion between prediction and causation, there’s a subtler danger: using predictions to change the very outcomes your model predicts.

In applied data science, we often build models to predict outcomes we want to change. That’s inherently risky. Once you act on a prediction to prevent it from coming true, you start eroding the very pattern the model depends on. This needs to be very carefully and thoughtfully managed, but in practice it’s often not considered at all.

Imagine a model predicts that a customer is about to cancel their subscription, so you send them a discount offer. They stay. Was the model wrong, or did the offer work? You can’t tell.

Even when the model works, it invalidates itself. In extreme cases, a perfect prevention system looks like a failing prediction system. The data that would prove the model’s accuracy disappears the moment you act on it. When you retrain the model later, it learns from outcomes that were shaped by its own influence. Over time, it drifts further from reality, optimising for its past interventions rather than actual patterns. Predicting a behaviour and then trying to prevent it risks breaking the link between data and reality that the model needs to learn.

In practice, this problem doesn’t trip up most organisations as much as it should, largely because their interventions aren’t all that effective. The model’s predictions survive not because they’re robust or useful, but because the actions based on them rarely move the needle.

Doing it properly

So if most propensity models are answering the wrong question, what can we do instead? The good news is that the solution isn’t mysterious. We already know how to find out what works. It’s called science.

The only reliable way to know whether an intervention changes behaviour is to test it experimentally. That means randomly holding out some people who don’t receive the intervention, even if it feels like leaving money on the table. The idea of not doing things you believe will make you money is understandably unpalatable, but it’s the only way to learn what works and what doesn’t.

You also need to test interventions across the full population, not just those your model believes are likely to respond. If certain groups have been ignored or under-targeted in the past, an experiment is your chance to discover whether they would have responded after all. Without that work, your model will keep reinforcing the same patterns of attention and neglect.
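
As a sketch of what that assignment step might look like, assuming you can randomise across the whole customer base rather than just the model’s favourites (the data and column names are illustrative):

```python
# Randomly assign the intervention across the full customer base, independent
# of the propensity score, so the experiment can surface effects in groups
# that have historically been ignored. File and column names are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
customers = pd.read_csv("customers.csv")  # hypothetical full customer base

# Half treated, half held out, independent of any model score. A stratified
# split (e.g. by score decile or segment) would keep the comparison balanced.
customers["treated"] = rng.random(len(customers)) < 0.5

# After the campaign, this treated/held-out flag plus the observed outcome is
# exactly the experimental data an uplift model needs.
```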

Once you have experimental data, you can go further. Uplift modelling uses that data to estimate the difference an intervention makes for each individual. It tells you not just whether someone acted, but whether they acted because of what you did. It’s a harder problem, but gets at the thing you actually care about: impact.
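
One common way to do this, among several, is a two-model (“T-learner”) approach: fit separate outcome models on the treated and held-out arms of the experiment, score everyone with both, and take the difference in predictions as the estimated uplift. The sketch below assumes illustrative column names.

```python
# A simple two-model ("T-learner") uplift estimate built on experimental data.
# Column names (treated, stayed, and the feature list) are illustrative.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

experiment = pd.read_csv("experiment_results.csv")  # randomised campaign data
features = ["recency", "frequency", "monetary"]

treated = experiment[experiment["treated"] == 1]
control = experiment[experiment["treated"] == 0]

# Fit one outcome model per arm of the experiment.
model_t = GradientBoostingClassifier().fit(treated[features], treated["stayed"])
model_c = GradientBoostingClassifier().fit(control[features], control["stayed"])

# Uplift = estimated probability of staying if treated minus if not treated.
# Positive values point towards "persuadables"; negative values towards
# "sleeping dogs".
experiment["uplift"] = (
    model_t.predict_proba(experiment[features])[:, 1]
    - model_c.predict_proba(experiment[features])[:, 1]
)
```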

Doing this well is as much a communication challenge as a technical one. Your stakeholders are unlikely to be across the finer points of predictive versus causal modelling, so you’ll need to help them understand why experiments sometimes mean accepting short-term losses for long-term knowledge. Part of the job is making the case that good science is good business, and advocating for doing things the right way.

Data science is about solving the actual problem, not a more convenient version of it. Have the integrity and discipline to do it properly.