If you’re trained in epidemiology or biostatistics, you likely think in terms of models, inference, and evidence. Now, with machine learning entering the scene, you’re probably hearing about algorithms that can “predict” disease, “detect” outbreaks, and “learn” from data. But while ML offers exciting possibilities, it’s important to understand how it differs from classical statistical approaches—especially when public health decisions depend on more than just prediction.
Let’s explore how statistics and machine learning differ—not just in technique, but in mindset, use case, and the all-important concept of causality.
How They Think
Statistics and machine learning begin with different goals.
Statistics is built to answer questions like: Does exposure X cause outcome Y? It aims to explain relationships, test hypotheses, and estimate effect sizes. It relies on assumptions—like random sampling, independence, and a specified model structure—to ensure that findings reflect the real world, not just the sample at hand.
Machine learning, in contrast, asks: Given this data, what outcome should I predict? It doesn’t aim to explain but to perform—minimising error and maximising predictive accuracy, even if the relationships are complex or difficult to interpret.
That’s a major shift. While statistics seeks truth about the population, ML seeks performance on unseen data.
How They Work
Statistical methods are grounded in probability theory and estimation. They involve fitting models with interpretable parameters: coefficients, confidence intervals, p-values. The analyst usually specifies the form of the model in advance, guided by theory and prior evidence.
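To make that concrete, here is a minimal sketch of the classical workflow in Python (assuming the statsmodels and numpy packages; the exposure, confounder, and effect sizes are all invented for illustration):

```python
# Classical workflow: specify the model form in advance, then read off
# interpretable parameters. All data below are simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
exposure = rng.binomial(1, 0.4, size=n)        # hypothetical binary exposure
age = rng.normal(50, 10, size=n)               # hypothetical confounder
true_logit = -2 + 0.8 * exposure + 0.02 * age  # the data-generating model
outcome = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(np.column_stack([exposure, age]))
fit = sm.Logit(outcome, X).fit()

# Coefficients, 95% confidence intervals, and p-values in one table
print(fit.summary())
# The exposure odds ratio: the quantity an epidemiologist would report
print("Exposure OR:", np.exp(fit.params[1]))
```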
Machine learning models are trained through algorithms, often using large datasets and iterative techniques to optimise performance. Models like decision trees, support vector machines, and random forests find patterns without assuming linearity or a particular distributional form. You don’t always know what the model is “looking at”—you just know whether it works.
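The ML workflow looks different: no model form is specified up front, and the fit is judged purely on held-out predictive accuracy. A minimal sketch, assuming scikit-learn and synthetic placeholder data:

```python
# ML workflow: no model form is assumed; the algorithm is judged on
# predictive accuracy in data it never saw. Features are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Evaluation happens on the held-out set, not on the training fit
pred = clf.predict_proba(X_test)[:, 1]
print("Held-out AUC:", round(roc_auc_score(y_test, pred), 3))
```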
There are also hybrid approaches—like regularised regression, ensemble models, and causal forests—that blend the logic of both.
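As one small illustration of that blend, an L1-regularised (lasso) logistic regression keeps interpretable coefficients while letting the algorithm shrink uninformative ones to exactly zero. A sketch, again assuming scikit-learn and synthetic data:

```python
# Hybrid example: L1-regularised logistic regression. The penalty shrinks
# uninformative coefficients to exactly zero, yet what survives is still
# a readable, linear model. Data are synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_logit.fit(X, y)

n_kept = (lasso_logit.coef_ != 0).sum()
print(f"{n_kept} of {X.shape[1]} predictors retained")
```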
What They Do Well
Statistics excels in clarity and rigour. It tells you not just whether something matters, but how much, and with what certainty. It’s ideally suited for:
- Identifying risk factors
- Estimating treatment effects
- Designing policy interventions
- Publishing findings with transparent reasoning
Machine learning is best when:
- Relationships are non-linear or unknown
- You have many predictors and large datasets
- You need fast, repeatable predictions (e.g. real-time risk scoring)
- The goal is performance, not explanation
In short, statistics helps you understand; ML helps you predict.
Where They Fall Short
Statistics can break down when data gets messy—especially when model assumptions are violated or the number of variables overwhelms the number of observations. It also isn’t built to handle unstructured data like images or free text.
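A toy illustration of the p > n problem: with more predictors than observations, ordinary least squares can fit the sample perfectly while telling you nothing, whereas a regularised fit remains usable. (Synthetic data; assumes scikit-learn and numpy; exact numbers will vary.)

```python
# Toy p > n example: 200 predictors, 50 observations. OLS interpolates the
# sample perfectly (R^2 = 1), which is meaningless; its out-of-sample fit
# typically collapses, while a regularised fit remains usable.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
y = X[:, 0] + rng.normal(scale=0.1, size=50)   # only one predictor matters

X_new = rng.normal(size=(50, 200))             # fresh draws from the same process
y_new = X_new[:, 0] + rng.normal(scale=0.1, size=50)

ols = LinearRegression().fit(X, y)
print("OLS  in-sample R^2:", round(ols.score(X, y), 3))             # exactly 1.0
print("OLS  out-of-sample R^2:", round(ols.score(X_new, y_new), 3)) # collapses

ridge = Ridge(alpha=10.0).fit(X, y)
print("Ridge out-of-sample R^2:", round(ridge.score(X_new, y_new), 3))
```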
Machine learning’s biggest limitation is often overlooked: it doesn’t care about causality. A model may predict hospitalisation risk with 95% accuracy, but it doesn’t tell you why. It might rely on variables that are associated, not causal. Worse, it might act on misleading proxies that look predictive but don’t offer actionable insight.
This matters deeply in public health. Predicting who dies is not the same as preventing death. Models that ignore cause can lead to misguided interventions or unjust decisions.
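A toy simulation makes the point. Below, a downstream marker caused by the disease predicts death well, yet “treating” the marker changes nothing, because the marker is a proxy, not a cause. (All names and numbers are invented; assumes scikit-learn and numpy.)

```python
# Toy simulation: a marker caused BY the disease predicts death well, but
# intervening on the marker does nothing, because it is not a cause.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 5000
disease = rng.binomial(1, 0.3, size=n)            # the true cause
marker = disease + rng.normal(0, 0.3, size=n)     # downstream proxy
death = rng.binomial(1, np.where(disease == 1, 0.6, 0.05))

clf = LogisticRegression().fit(marker.reshape(-1, 1), death)
auc = roc_auc_score(death, clf.predict_proba(marker.reshape(-1, 1))[:, 1])
print("AUC from the proxy alone:", round(auc, 2))  # strongly predictive

# "Treat" the marker by setting it to zero: predicted risk falls,
# but actual mortality is untouched, since disease is unchanged
treated = np.zeros_like(marker).reshape(-1, 1)
print("Predicted risk after intervention:",
      round(clf.predict_proba(treated)[:, 1].mean(), 2))
print("Actual mortality (unchanged):", round(death.mean(), 2))
```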
Another weakness of ML is its lack of interpretability. Many powerful algorithms (like gradient boosting or neural networks) are “black boxes”—hard to explain and harder to justify in policy decisions. While newer tools like SHAP (SHapley Additive exPlanations) can improve transparency, they still fall short of the clarity offered by traditional statistical models.
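For a sense of what that looks like in practice, here is a hedged sketch of a SHAP explanation (assuming the shap package and a gradient-boosted model on synthetic data; API details can vary across shap versions):

```python
# Post-hoc interpretation with SHAP: each prediction is decomposed into
# additive per-feature contributions. This describes the model's
# behaviour, not the underlying causal mechanism. Data are synthetic.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one contribution per feature, per row

print(shap_values.shape)                # (500, 10) for a binary GBM
```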
When to Use Each
Use statistics when:
- Your primary goal is inference or explanation
- You need to estimate effects or support causal conclusions
- You’re informing policy or making ethical decisions
- You want results that are interpretable and reportable
Use machine learning when:
- Your primary goal is prediction or classification
- You’re handling high-dimensional or complex data
- You need scalable automation (e.g. early warning systems)
- You can validate predictions with real-world data
Most importantly, if causality matters, don’t rely solely on ML—use statistical thinking or causal ML techniques that explicitly model counterfactuals and assumptions.
What You Should Expect
From statistics, expect:
- Clear models with interpretable outputs
- Transparent assumptions
- The ability to test hypotheses and quantify uncertainty
From machine learning, expect:
- High performance with minimal assumptions
- Useful predictions even when mechanisms are unknown
- Some loss of interpretability (unless addressed deliberately)
Just remember: good prediction doesn’t imply good understanding. And good models don’t always lead to good decisions—unless we interpret them wisely.
A Path Forward for Epidemiologists and Biostatisticians
Here’s the good news: your training in statistics and epidemiology is not a limitation—it’s your greatest asset. You already understand data, confounding, validity, and generalisability. You’re equipped to evaluate models critically and ask: Does this make sense? Is it actionable? Is it ethical?
Start small. Try ML approaches that are extensions of what you know—like regularised logistic regression, decision trees, or ensemble methods. Explore tools like caret, tidymodels, or scikit-learn. And when you’re ready to dive deeper, look into causal ML methods like the following (a brief sketch follows the list):
- Targeted maximum likelihood estimation (TMLE)
- Causal forests (grf)
- Double machine learning (EconML)
- DoWhy (for structural causal models)
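As a taste of this mindset, here is a hedged sketch of a DoWhy analysis on simulated data; the key point is that the confounding assumptions are declared explicitly rather than left implicit. (Assumes the dowhy package; method names and defaults can differ across versions.)

```python
# Hedged sketch of a DoWhy analysis on simulated data: age confounds both
# exposure and outcome, and that assumption is declared explicitly.
import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(0)
n = 2000
age = rng.normal(50, 10, size=n)
p_exposed = 1 / (1 + np.exp(-(age - 50) / 10))     # older people more exposed
exposure = rng.binomial(1, p_exposed, size=n)
p_outcome = 1 / (1 + np.exp(-(-2 + 0.8 * exposure + 0.02 * age)))
outcome = rng.binomial(1, p_outcome, size=n)
df = pd.DataFrame({"age": age, "exposure": exposure, "outcome": outcome})

model = CausalModel(data=df, treatment="exposure", outcome="outcome",
                    common_causes=["age"])         # the causal assumption
estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(
    estimand, method_name="backdoor.propensity_score_weighting")
print("Estimated average treatment effect:", round(estimate.value, 3))
```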
The best analysts of the future won’t just be statisticians or ML engineers—they’ll be methodologically bilingual, able to switch between explanation and prediction as the question demands.
Your role isn’t to replace one with the other, but to integrate both—so that public health remains not just data-driven, but wisely so.