Introduction
In Part I, we saw how healthcare is moving into a new era, from hospital to home, amid an unprecedented growth in data that will redefine how we understand health and disease.
However, this abundance has revealed a paradox:
More information does not equal knowledge.
While data is the foundational layer of personalised medicine, software (in the form of AI) is the avenue that turns these fragmented silos of information into meaning that can save and improve lives. To truly enable personalised medicine, health data needs to be turned into explanatory software systems that deeply understand the causal structure of health and disease in each human being.
So the central question for the future of medicine is not when AI will outperform doctors in diagnostic accuracy, but what will need to be true for us to trust AI with our health.
Current Landscape: The Superhuman Pattern Machines
In 2025, frontier AI models can already detect cancer, summarise 100-page charts, predict sepsis hours early, and design antibodies. These are superhuman pattern-recognition feats. But to truly personalise care, AI must do something far harder: predict, understand, and adapt to the health of each individual.
To achieve this, the raw capability of AI systems will need to improve on four unforgiving tests that define their performance:
Accuracy/AUC: The Comfort Metric
Accuracy measures how well a model performs on data that looks suspiciously similar to the data it has already seen. It is a measure of familiarity, not understanding. A model with a high AUC can collapse the moment you shift to a different population or dataset. In other words: AUC thrives locally but fails globally.
Generalisation: The Achilles’ Heel
Modern health AI systems fail the moment they leave the data bubble they were trained in, often losing 8–25% of their performance when deployed in a new clinical environment. The reason: AI models learn the statistical shadows of the data they were trained on, not a deep understanding of the underlying biological principles. The solution will not be more data, but AI architectures that more closely resemble the reasoning principles of the human brain.
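To make these first two tests concrete, here is a minimal sketch with fully synthetic data (not a clinical model): a classifier that leans on a site-specific artefact posts a stellar AUC at its home hospital and collapses at a new site where that artefact carries no signal. The cohort construction, features, and numbers are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_cohort(n, artefact_strength):
    # y is the outcome; x_bio is a weak but genuinely informative signal,
    # x_site is a site-specific artefact (e.g. which order set was used).
    y = rng.integers(0, 2, size=n)
    x_bio = y + rng.normal(scale=1.5, size=n)
    x_site = artefact_strength * y + rng.normal(scale=0.5, size=n)
    return np.column_stack([x_bio, x_site]), y

X_train, y_train = make_cohort(5000, artefact_strength=2.0)  # development hospital
X_local, y_local = make_cohort(2000, artefact_strength=2.0)  # same distribution
X_new, y_new = make_cohort(2000, artefact_strength=0.0)      # new site: artefact is pure noise

model = LogisticRegression().fit(X_train, y_train)
auc_local = roc_auc_score(y_local, model.predict_proba(X_local)[:, 1])
auc_new = roc_auc_score(y_new, model.predict_proba(X_new)[:, 1])
print(f"AUC at home hospital: {auc_local:.2f}")  # looks superb
print(f"AUC at new hospital:  {auc_new:.2f}")    # the statistical shadow evaporates
```

The model never learned the biology; it learned the shadow of one hospital’s workflow, which is exactly the failure mode that a headline AUC hides.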
Explainability & Calibration: The Illusion of Understanding
Explainability should tell us why a model predicted something; calibration should tell us how much to trust its confidence. The problem is that today both are still largely illusions: models generate confident “explanations,” of which 18–42% are hallucinations, and often express 99% certainty about predictions that are correct barely 60% of the time. In medicine, this is not a UI problem but a trust problem.
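A toy reliability check makes that gap tangible. Nothing here is tied to a real model; we simply simulate predictions that claim roughly 99% confidence while being right only about 60% of the time, then compare claimed confidence with observed accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
confidence = rng.uniform(0.95, 1.0, size=n)   # what the model claims
correct = rng.random(n) < 0.60                # how often it is actually right

# Bare-bones reliability table: claimed confidence vs observed accuracy per bin.
bins = np.linspace(0.95, 1.0, 6)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (confidence >= lo) & (confidence < hi)
    if mask.any():
        print(f"claimed {lo:.2f}-{hi:.2f}: "
              f"avg confidence {confidence[mask].mean():.3f}, "
              f"observed accuracy {correct[mask].mean():.3f}")

# The distance between the two columns is the calibration error a clinician
# or regulator would want reported and bounded before trusting the model.
gap = abs(confidence.mean() - correct.mean())
print("overall confidence-accuracy gap:", round(float(gap), 3))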
Learning Frequency: Frozen at Launch
Almost all FDA-cleared AI models are frozen at launch, and fewer than 2% are allowed to update. A system trained on 2023 medicine is already outdated by 2025. Personalised intelligence requires continuous learning, and the only path forward is regulatory frameworks that enable safe, controlled, ongoing model updates. Without this, health AI systems will only ever be as intelligent as they were on the day they shipped.
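There is no standard mechanism for this today, but a sketch helps show what “safe, controlled updates” could mean in practice: a retrained candidate model is only promoted if it stays within a pre-registered safety envelope on a locked clinical validation set. Every function name, metric, and threshold below is a hypothetical illustration, not an existing regulatory protocol.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateResult:
    promote: bool
    reason: str

def update_gate(
    evaluate: Callable[[object], dict],   # returns e.g. {"auc": ..., "calibration_gap": ...}
    current_model: object,
    candidate_model: object,
    max_auc_drop: float = 0.01,
    max_calibration_gap: float = 0.05,
) -> GateResult:
    # Both models are scored on the same locked validation set.
    old, new = evaluate(current_model), evaluate(candidate_model)
    if new["auc"] < old["auc"] - max_auc_drop:
        return GateResult(False, f"AUC regressed: {old['auc']:.3f} -> {new['auc']:.3f}")
    if new["calibration_gap"] > max_calibration_gap:
        return GateResult(False, f"calibration gap too large: {new['calibration_gap']:.3f}")
    return GateResult(True, "candidate within pre-registered safety envelope")
```

The point is not the specific thresholds but the shape of the process: updates become auditable events with recorded reasons, rather than silent weight changes.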
If these metrics tell us anything, it’s that accuracy is the least meaningful; the real constraints are generalisation, explainability, and calibration, and improving them requires rethinking how models are trained, updated, and validated. To understand where the next breakthroughs must happen, we need to look directly at those processes.
Frontier Processes: Where Today’s Health AI Actually Breaks
Behind every health AI sits the same elegant, brutal assembly line, nearly identical across Google, Meta, OpenAI, and every ambitious startup: a four-stage metamorphosis from raw internet intelligence into something that can (almost) pass for a doctor:
Stage 1: The Giant Awakens (Foundational Pre-Training)
Everything starts with a behemoth trained on trillions of tokens: PaLM 540B, Llama 3.1 405B, GPT-4o, or Claude 3.5. These models have read the entire public internet, every book, every line of code. This gives them breadth but not depth.
Stage 2: Medical Immersion (Continued Pre-training)
The giant model is drowned in biomedical reality: 37 million PubMed abstracts, de-identified clinical notes (when HIPAA or GDPR allows), hundreds of billions of new tokens. This reshapes the model’s worldview from memes and social chatter to diagnoses and clinical heuristics.
Stage 3: The Residency (Supervised Fine-Tuning)
Now the real schooling begins: humans feed the model tens of thousands of labelled medical tasks, like summarising a 40-page radiology report or extracting every diagnosis, medication, and lab value from messy EHR text. This stage teaches the model to behave like a clinician-in-training.
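For readers who want to see what this stage looks like mechanically, here is a minimal supervised fine-tuning sketch using the Hugging Face transformers library, assuming a small open placeholder model ("gpt2") and a hypothetical medical_sft.jsonl file of report/summary pairs. Real pipelines use far larger models, distributed training, and much stricter data governance.

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"              # placeholder open model, not a frontier one
DATA_PATH = "medical_sft.jsonl"  # hypothetical labelled task file

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

records = [json.loads(line) for line in open(DATA_PATH)]

model.train()
for example in records:
    # Instruction-style prompt plus the human-written target summary,
    # trained with the standard causal-LM objective (labels = input ids).
    text = (f"Summarise the report:\n{example['report']}\n"
            f"Summary: {example['summary']}{tokenizer.eos_token}")
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    loss = model(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Everything that makes this clinically useful or dangerous lives outside this loop: which examples are chosen, how they are labelled, and how the result is evaluated.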
Stage 4: The Digital Hippocratic Oath (Alignment & Safety)
Finally, the model is taught to be humble and safe: humans rank thousands of responses, and the model learns never to give definitive treatment advice without saying “consult your physician” and to refuse harmful requests outright. This creates safer behaviour, but also an overcautious, vague, liability-optimised model that sometimes refuses to provide clinically useful detail even when it is safe and appropriate.
Why does the pipeline break?
This four-stage pipeline creates astonishing models, but also encodes several structural weaknesses that all health AI systems inherit:
- Optimisation for sanitised data, not the real world
Models are trained on curated, harmonised datasets, then dropped into hospitals full of missing values and inconsistent coding. They perform beautifully under order but collapse under real-world clinical entropy.
- Training for behaviour, not understanding
Medical immersion gives vocabulary, and the residency gives pattern recognition. But neither guarantees biological reasoning or causal understanding. This is why frontier models still hallucinate drug dosages or fail to generalise across hospitals: they mimic thinking but lack the nuanced understanding of doctors.
- Training on averages, blindness to individuals
The entire pipeline is built around aggregated data: pooled cohorts, averages, mean risk scores. While this is a systemic issue of modern healthcare, the takeaway is simple: AI models don’t capture the nuance, context, and individuality of the patient the way a clinician can. But personalised medicine requires a deep understanding of the individual: baselines, trajectories, and local context mapped against population statistics. Today’s pipeline does the opposite: it guarantees population-level performance and individual-level fragility.
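A tiny numerical example (with made-up values) shows why population averages and individual baselines can disagree: the same resting heart rate sits comfortably inside the population reference range yet is a striking outlier against this one patient’s own history.

```python
import numpy as np

population_range = (60, 100)  # typical adult resting heart rate, bpm
personal_history = np.array([52, 54, 51, 53, 55, 52, 54])  # this patient's baseline
today = 78

in_population_range = population_range[0] <= today <= population_range[1]
personal_z = (today - personal_history.mean()) / personal_history.std()

print("within population range:", in_population_range)  # True -> looks "fine"
print("personal z-score:", round(float(personal_z), 1))  # very large -> worth a flag
```

A pipeline trained only on pooled cohorts sees the first answer; a personalised system needs to see the second.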
If these structural limits define today’s systems, then the next step is clear: we must rethink the assumptions behind the pipeline itself.
The Vision: The Next Five Years
If AI is ever going to deliver personalised medicine, the next five years will demand changes far deeper than simply scaling up today’s models. The answer lies in three uncomfortable, structural shifts that we believe will matter:
Prediction 1: Personalisation will emerge from extreme specialisation
If evolution is our guide, specialisation matters for organisms, within organisms, and within societies composed of organisms (human society being a prime example). In medicine, the same will hold. The next generation of AI health models will be super-specialised pipelines that understand a narrow domain of human health in incredible detail and then combine into an integrated understanding of the individual sitting in front of them. Think of an AI for genetics, an AI for biomarkers, and an AI for heart health working together to understand a patient.
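As a purely speculative sketch of what such an arrangement could look like, imagine each specialist emitting a structured finding that a thin integration layer assembles into one patient-level view. Every class, field, threshold, and example value below is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SpecialistFinding:
    domain: str        # e.g. "genetics", "biomarkers", "cardiology"
    summary: str       # human-readable interpretation
    risk_score: float  # 0-1, calibrated within that domain
    confidence: float  # how sure the specialist model is

def integrate(findings: list[SpecialistFinding]) -> dict:
    # Naive integration: keep every domain's view and surface findings that
    # are both high-risk and high-confidence for clinician review.
    flagged = [f for f in findings if f.risk_score > 0.7 and f.confidence > 0.8]
    return {
        "per_domain": {f.domain: f.summary for f in findings},
        "needs_review": [f.domain for f in flagged],
    }

patient_view = integrate([
    SpecialistFinding("genetics", "No high-penetrance variants detected", 0.10, 0.92),
    SpecialistFinding("biomarkers", "ApoB trending upward over 3 years", 0.75, 0.88),
    SpecialistFinding("cardiology", "Borderline LV hypertrophy on echo", 0.68, 0.71),
])
print(patient_view["needs_review"])  # ['biomarkers']
```

The hard problems are not in this glue code but in making each specialist trustworthy and the integration causally coherent, which is exactly what the next two predictions address.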
Prediction 2: Generalisation is the ultimate challenge
The real frontier closest to mimicking the human brain is generalisation. Humans possess a remarkable ability to learn new skills and knowledge and to find new causal explanations for the world. Models will need to do the same: maintain performance across demographics and individuals without diminishing AUC. This is the defining challenge of the field: we may need fundamentally different architectures capable of causal reasoning and the abstract pattern transfer that the human brain performs effortlessly.
Prediction 3: Causality and explainability will force a regulatory reset
Static black boxes are dangerous to approve, and regulators know this. To make continuous learning safe, models will need a causal structure (so updates don’t misfire) and meaningful explainability (so clinicians can audit them). If a model is continuously updating, regulators must know why it changed, how its reasoning shifted, and whether those shifts remain clinically safe. That means causal priors to prevent rogue updates, calibrated confidence so risks can be quantified, and explanations that can actually be interrogated. If the model cannot explain itself, no regulator, doctor, or patient will trust it with health.
Finally, we’d like to leave you with one valuable idea:
Medicine’s Airplane Test: Health AI must reach aviation-level reliability across individuals. Aviation doesn’t brag about 0.99999 reliability on sunny days at Heathrow; it delivers the same safety profile in a thunderstorm over Lagos with a rookie pilot. Health AI must pass the same test across hospitals, demographics, and real-world entropy, or it fails the only exam that matters.
When this is achieved, software that learns you becomes possible.
Conclusion: Towards a Health Operating System
We are moving toward a world where health AI is not a tool, a model, or a feature, but an infrastructure. The old paradigm (frozen algorithms trained on sanitised datasets) cannot survive in a world defined by the explosion of data diversity, constant change and an ageing population.
What comes next will be deeper: a Health Operating System.
A system grounded in real-world clinical variability, enriched by causal and mechanistic reasoning, and continuously updated through safe, regulated learning loops.
This is the inevitable direction of the field: intelligence that learns the world and ultimately learns you. And when that happens, personalised medicine will be the default operating layer of human health.
Sources: Nature Medicine, npj Digital Medicine, BMC Medical Informatics and Decision Making, NEJM AI, FDA AI/ML Program, P4SC4L Research, JMIR AI (2023–2025).

