AI and Evidence in Emergency and Critical Care: What Every Doctor Should Know About AI in Medical Research — full transcript

Introduction

Jeremy: Welcome back to The TIME Podcast — a space for clinicians who want to think carefully about medicine, systems, and the decisions that shape patient care.

Hamish: This podcast is created and produced by Clintix, who also host The TIME Conference. The goal here isn’t quick takes or surface-level summaries — it’s proper clinical discussion, the kind you’d expect in a room full of experienced emergency physicians, intensivists, and health-system leaders.

Jeremy: Today’s episode is about artificial intelligence in medical research and clinical practice. Not the marketing version. Not the press-release version. But the version that actually matters once AI starts influencing triage, imaging, deterioration scores, staffing decisions, and clinical pathways.

Hamish: Because AI is already in hospitals — sometimes visibly, sometimes quietly — and the uncomfortable truth is that clinicians often remain accountable for tools they didn’t design, didn’t choose, and aren’t given time or training to interrogate.

Jeremy: So we’re going to take this slowly and seriously. What AI really is. What it definitely isn’t. Where it performs well, where it fails, how bias creeps in, and how clinicians should be thinking about governance and safety — not as abstract ideas, but as practical responsibilities.

Hamish: Let’s start by clearing the fog. When people say “AI” in healthcare, they’re usually talking about very different things — and that lack of precision causes a lot of trouble.

Jeremy: Exactly. Because most of what’s currently deployed in hospitals isn’t “intelligence” in any human sense. It’s statistical pattern recognition — extremely powerful pattern recognition — but still fundamentally different from clinical reasoning.

Hamish: A useful way to frame it is this: Traditional software follows explicit rules. “If X happens, do Y.” AI systems, particularly machine-learning systems, infer rules implicitly from data.

Jeremy: And that distinction matters. Because instead of us deciding what’s important — heart rate, lactate, blood pressure — the model decides what’s important based on correlations in the training data.

Hamish: Which sounds reasonable until you realise the model has no concept of physiology, causality, or meaning. It doesn’t know what sepsis is. It knows what patterns tended to appear around cases labelled as sepsis.

Jeremy: That’s a critical difference. Clinicians reason causally: “This infection is driving inflammation, which is driving hypotension.” AI reasons associatively: “When I see pattern A, outcome B often follows.”

Hamish: And that means AI can be very good at prediction — sometimes better than humans — while being completely indifferent to whether the prediction makes clinical sense.

Jeremy: It also means AI has no built-in understanding of what should matter. If a non-clinical feature — like scanner type, documentation style, or workflow artefacts — is statistically useful, the model will happily use it.

Hamish: Which is why people get misled by statements like “the AI learned medicine.” It didn’t. It learned correlations in a dataset that happens to come from a medical environment.

Jeremy: And this is where expectations go wrong. People start asking AI to “think,” to “reason,” to “decide,” when what it’s actually doing is compressing vast amounts of historical data into a probability estimate.

Hamish: That doesn’t make it useless — far from it — but it does mean we need to treat AI more like a powerful diagnostic test than like a colleague.

Jeremy: Exactly. You wouldn’t accept a new blood test without understanding what it measures, what it confounds with, when it fails, and how it behaves in different populations.

Hamish: And you shouldn’t accept an AI system without the same level of scrutiny. Because unlike a blood test, AI can change behaviour at scale — quietly, consistently, and without obvious warning signs.

Jeremy: So the key takeaway for this section is simple but important: AI is not reasoning. It is not understanding. It is not neutral.

Hamish: It is pattern recognition operating inside complex healthcare systems — and whether that helps or harms patients depends entirely on how well clinicians understand what it’s actually doing.
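
The contrast drawn earlier in this section, between software that follows explicit rules and machine learning that infers its rules from data, can be sketched in a few lines. This is a toy under stated assumptions, not a clinical tool: the lactate values, the labels, the 2.0 cut-off, and the function names are all invented for the example.

```python
# Toy contrast: a hand-written rule versus a rule inferred from labelled data.
# All values and names are hypothetical and chosen only for illustration.

def rule_based_alert(lactate):
    """Traditional software: a human picked the threshold."""
    return lactate > 2.0


def learn_threshold(examples):
    """Minimal 'machine learning': pick the cut-off that best matches the labels.

    examples is a list of (lactate, labelled_as_sepsis) pairs.
    """
    def n_correct(threshold):
        return sum((value > threshold) == label for value, label in examples)

    candidates = sorted({value for value, _ in examples})
    return max(candidates, key=n_correct)


training_data = [
    (0.8, False), (1.1, False), (1.5, False), (1.7, False),
    (1.9, True), (2.4, True), (3.0, True),
]

learned = learn_threshold(training_data)
print("Human rule fires at lactate 1.9:", rule_based_alert(1.9))
print("Learned cut-off from this dataset:", learned)
```

The learned cut-off is simply whatever best fits the labels it was given. Change the labels and the rule changes, with no notion of why lactate matters.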

Hamish: Now that we’ve been clear about what AI isn’t, the next step is understanding how it actually shows up in medical practice. Because “AI” isn’t one thing — and different approaches behave very differently once you put them near patients.

Jeremy: Exactly. And this matters, because when AI fails, it usually fails in ways that are specific to the method being used. If you don’t understand the category of AI you’re dealing with, you won’t understand its risk profile.

Hamish: Broadly, in healthcare, we’re talking about three families: traditional machine learning, deep learning, and natural language processing. And each one has very different strengths — and very different failure modes.

Machine learning — structured data, fragile assumptions

Jeremy: Let’s start with traditional machine learning, because it is the quiet workhorse already embedded in many hospital systems.

Hamish: These are models that work on structured data — numbers in rows and columns. Vitals, lab values, demographics, timestamps. Think early warning scores, sepsis risk models, length-of-stay predictors, and ICU admission probability tools.

Jeremy: From a clinician’s perspective, these models feel very familiar. They’re basically doing what we do cognitively — weighing multiple weak signals together — but doing it mathematically and at scale.

Hamish: And that’s why they can be genuinely useful. A subtle rise in respiratory rate, a small lactate bump, mild hypotension, slightly abnormal electrolytes — none of that triggers alarm bells alone. Machine-learning models can aggregate those signals continuously without fatigue.

Jeremy: But here’s the catch: these models are only as good as the assumptions baked into the data. They don’t understand physiology — they assume stability of patterns.

Hamish: Which is fine until practice changes. New sepsis pathways, different fluid strategies, new antibiotics, and altered admission thresholds. Suddenly, the relationships the model learned no longer hold.

Jeremy: And that’s why traditional ML is particularly vulnerable to temporal drift — which we’ll come back to later. It’s also very sensitive to missing data, documentation habits, and local workflow quirks.
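
The idea described earlier in this section, of weighing several weak signals together so that their combination raises concern when no single one does, might look something like the sketch below. The weights, baselines, and intercept are hypothetical numbers picked for illustration; they are not taken from any validated early warning score.

```python
import math

# Hypothetical weights, baselines, and intercept, chosen only for illustration.
# They are NOT taken from any validated early warning score.
WEIGHTS = {"resp_rate": 0.12, "lactate": 0.60, "sbp_drop": 0.04}
BASELINE = {"resp_rate": 16.0, "lactate": 1.0, "sbp_drop": 0.0}
INTERCEPT = -3.0


def deterioration_risk(obs):
    """Logistic combination of each observation's deviation from its baseline."""
    z = INTERCEPT + sum(WEIGHTS[k] * (obs[k] - BASELINE[k]) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


normal = {"resp_rate": 16, "lactate": 1.0, "sbp_drop": 0}
one_mild_change = {"resp_rate": 22, "lactate": 1.0, "sbp_drop": 0}
several_mild_changes = {"resp_rate": 22, "lactate": 2.2, "sbp_drop": 15}

for name, obs in [("normal", normal),
                  ("one mild change", one_mild_change),
                  ("several mild changes", several_mild_changes)]:
    print(f"{name}: {deterioration_risk(obs):.2f}")
```

Each abnormality alone moves the score only slightly; together they move it much further. The fixed weights are also where the fragility lives: they encode relationships from the training period and do not update when practice changes.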

Deep learning — powerful perception, dangerous shortcuts

Hamish: Deep learning is a different beast entirely. This is where most of the excitement — and most of the risk — lives.

Jeremy: Deep learning models, especially convolutional neural networks, are exceptional at perception. Imaging, waveforms, video, audio. Anywhere raw data needs to be interpreted.

Hamish: And to be clear, this is why radiology and cardiology have been early adopters. CNNs can process imaging at a scale and speed that humans simply can’t.

Jeremy: But the key thing clinicians need to understand is how deep learning learns. These models don’t start with concepts like “lung,” “ventricle,” or “haemorrhage.”

Hamish: They start with edges, gradients, and textures. Then shapes. Then increasingly abstract patterns. Eventually, something correlates strongly enough with a label like “intracranial haemorrhage” that the model treats it as diagnostic.

Jeremy: The problem is that the model has no idea why that pattern matters. If the easiest path to accuracy is a non-clinical feature — a scanner signature, a portable X-ray marker, a monitoring artefact — the model will take it.

Hamish: Which is why deep learning is so prone to shortcut learning. It will always prefer the most statistically efficient signal, not the most clinically meaningful one.

Jeremy: That’s what makes these systems both incredibly powerful and potentially unsafe. They can be right for reasons that don’t generalise — and unless you actively look for that, you won’t know.

Natural language processing — meaning, but with embedded bias

Hamish: Then there’s natural language processing (NLP), which is becoming increasingly relevant in ED and ICU settings.

Jeremy: NLP models work on free text: triage notes, progress notes, discharge summaries, ambulance call transcripts. They convert language into numerical representations that capture semantic meaning.

Hamish: Which is why they can detect out-of-hospital cardiac arrest from emergency call audio or flag concerning triage notes faster than a human reader.

Jeremy: But language is messy. It’s subjective, culturally loaded, and deeply influenced by clinician behaviour.

Hamish: Exactly. NLP models don’t just learn patient characteristics; they learn how clinicians talk about patients: who gets described as “unwell,” who gets detailed notes, and who gets minimal documentation.

Jeremy: Which means NLP systems are particularly vulnerable to encoding bias related to gender, ethnicity, age, and socioeconomic status — not because the model is malicious, but because the language it’s trained on reflects real human bias.

Hamish: So when an NLP model predicts deterioration or triage urgency, it may actually be predicting documentation style rather than physiological risk.

Why this distinction matters

Jeremy: The reason we’re spending time on this is simple: you can’t evaluate an AI system unless you know what kind of AI it is.

Hamish: A deep-learning imaging model needs scrutiny for shortcut learning and black-box opacity. A traditional ML model needs scrutiny for drift, missing data, and changing practice. An NLP model needs scrutiny for embedded human bias and language effects.

Jeremy: Lumping all of this together as “AI” is how hospitals end up deploying tools without understanding where they’re fragile.

Hamish: And that’s the theme you’ll hear again and again in this episode: AI doesn’t fail randomly. It fails predictably — if you know what to look for.

Jeremy: Before we spend too much time on failure modes, it’s important to be fair. There are domains where AI is doing genuinely useful work — and understanding why it works in those domains is key to knowing where it doesn’t.

Hamish: Exactly. Because AI success isn’t random. The places where it performs well tend to share a few characteristics — even if people don’t always articulate them clearly.

Jeremy: High signal-to-noise data. Large, reasonably consistent datasets. Outcomes that are relatively well defined. And tasks that are perceptual or pattern-based rather than causal.

Hamish: Which is why imaging is always the first example people reach for — and for good reason.

Imaging — pattern recognition at scale

Hamish: Radiology is almost the ideal environment for deep learning. Images are standardised, labels are relatively stable, and the task is fundamentally perceptual.

Jeremy: Right. Detecting an intracranial haemorrhage, a pneumothorax, or a fracture is about recognising spatial patterns. CNNs are exceptionally good at that.

Hamish: And importantly, the model doesn’t need to understand why a haemorrhage is dangerous. It just needs to recognise that this pixel pattern correlates strongly with cases labelled “haemorrhage.”

Jeremy: Which is why AI performs well as a detector or triage layer in imaging. It can prioritise scans, flag urgent cases, and reduce time-to-review — especially out of hours.

Hamish: Stroke imaging is a great example. LVO detection — large vessel occlusion, typically ICA or proximal MCA — is a time-critical, pattern-recognition task.

Jeremy: And that’s where AI can add real system-level value. Not by replacing radiologists, but by accelerating workflows — alerting stroke teams earlier, reducing door-to-needle and door-to-groin times.

Hamish: The key point is that AI works well here because the task is narrow, the outcome is relatively binary, and the data is structured in a way that deep learning handles well.

Waveforms and continuous monitoring — seeing what humans can’t

Jeremy: Another area where AI shows promise is waveform analysis — ECGs, arterial lines, ventilator waveforms.

Hamish: Humans are actually quite bad at detecting subtle temporal patterns across long streams of data. We sample. We glance. We trend intermittently.

Jeremy: AI doesn’t. It can process every beat, every breath, every fluctuation — continuously.

Hamish: Which is why deep learning models can sometimes detect deterioration or pre-arrest states earlier than clinicians. Not because they’re “smarter,” but because they’re more persistent and less selective in what they pay attention to.

Jeremy: But again, this works best when the signal is physiological, and the outcome is close in time. The further you move away from the signal — or the more confounders you introduce — the weaker the model becomes.

Prediction in constrained domains — when the question is well posed

Hamish: Prediction models — sepsis risk, ICU admission likelihood, ED crowding — are another area where AI can be helpful, but only under specific conditions.

Jeremy: Yes. These models work best when the question is narrow and operational. Not “Will this patient survive?” But “Is this patient likely to deteriorate in the next six hours?”

Hamish: Short time horizons. Clear definitions. Limited scope. That’s where machine learning performs best.

Jeremy: ED operations are a good example. Forecasting bed demand, predicting access block, anticipating ambulance offload delays — these are system-level pattern problems.

Hamish: And importantly, the consequences of error are different. If an operational model is slightly wrong, it’s inconvenient. If a clinical risk model is wrong, it can be dangerous.

Why success here doesn’t generalise

Jeremy: This is the trap, though. People see success in imaging or operations and assume the same techniques will generalise to more complex clinical reasoning.

Hamish: Which is where things go wrong. Because the moment you move into causal reasoning, ethical trade-offs, or nuanced clinical judgement, AI’s strengths stop being strengths.

Jeremy: AI is excellent at recognising patterns that already exist. It is terrible at understanding why those patterns exist — or what to do when the pattern breaks.

Hamish: And that’s why understanding where AI works is just as important as understanding where it fails. Because most harm comes from applying the right tool to the wrong problem.

Jeremy: Which brings us neatly to the next section — because once we leave these relatively safe, constrained domains, the failure modes start to matter a lot more.

Hamish: So this is the section where things start to get uncomfortable. Because once AI moves outside those narrow, well-defined use cases we just talked about, the failure modes become much more subtle — and much more dangerous.

Jeremy: Exactly. And the problem is that many of these failures don’t look like failures at all. They look like success. High accuracy. Good AUROC. Impressive validation curves. Everything looks reassuring — right up until the model is deployed somewhere new.

Hamish: Which brings us to shortcut learning — probably the single most important concept clinicians need to understand about modern AI.

Shortcut learning — when the model solves the wrong problem

Jeremy: The classic example people use is the wolf-versus-dog experiment, but it’s worth unpacking properly rather than just name-checking it.

Hamish: In that experiment, a deep learning model was trained to distinguish wolves from dogs. Performance was excellent. But when researchers interrogated the model, they realised it wasn’t identifying animals at all — it was identifying snow in the background of the images.

Jeremy: The model wasn’t wrong statistically. In the training data, snow correlated strongly with wolves. The issue was that the model had learned a shortcut — a non-causal feature that happened to predict the label.

Hamish: And medicine is full of shortcuts exactly like these. Much more than most clinicians realise.

“Medicine is full of snow”

Hamish: Take portable chest X-rays. In the ED and ICU, portable films are much more common in sick patients. So if pneumonia cases in the dataset disproportionately come from portable films, the model may learn that the portable marker itself is predictive of pneumonia.

Jeremy: Not the lung fields. Not consolidation. Just the presence of “AP PORTABLE” in the corner of the image.

Hamish: Which is deeply uncomfortable, because the model can still perform extremely well on internal validation — it keeps seeing the same correlation.

Jeremy: Until you deploy it somewhere with different imaging practices. Suddenly, the accuracy collapses, and nobody knows why.

Hamish: Scanner-specific artefacts are another big one. Different CT scanners have subtly different noise textures, reconstruction kernels, and contrast behaviour.

Jeremy: If most haemorrhage scans in the training data came from a single scanner, the model may end up recognising the scanner rather than the bleed.

Hamish: So the AI becomes a “where was this scan acquired?” detector rather than a pathology detector.

Jeremy: We see the same thing with ECGs. Different monitors, filters, and sampling rates. Models end up learning device signatures instead of arrhythmias.

Hamish: And in NLP models, the shortcuts are often even more insidious. The model might learn that long, detailed triage notes predict deterioration, not because note length reflects physiology, but because clinicians write more when they’re worried.

Jeremy: So the AI ends up predicting clinician concern rather than patient physiology.
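
The portable chest X-ray example above can be reproduced with simulated data. In this sketch, which assumes made-up agreement rates and a deliberately crude "model" that keeps whichever single feature best matches the training labels, the non-clinical marker wins at the training site and then fails at a site where portable films are not tied to illness. The function names, feature names, and all numbers are hypothetical.

```python
import random

# Simulated data only. The agreement rates below are assumptions for illustration.

def make_site(n, marker_agreement, rng):
    """Simulate n films. Each feature agrees with the label at a fixed rate."""
    films = []
    for _ in range(n):
        pneumonia = rng.random() < 0.5
        # Genuine finding: an imperfect signal that agrees with the label 80% of the time.
        consolidation = pneumonia if rng.random() < 0.80 else not pneumonia
        # Non-clinical marker: its agreement with the label depends on local practice.
        portable = pneumonia if rng.random() < marker_agreement else not pneumonia
        films.append({"consolidation": consolidation,
                      "portable_marker": portable,
                      "label": pneumonia})
    return films


def accuracy(films, feature):
    """Fraction of films where the feature matches the label."""
    return sum(f[feature] == f["label"] for f in films) / len(films)


def fit_one_feature_model(films):
    """Keep whichever single feature best matches the training labels."""
    return max(("consolidation", "portable_marker"),
               key=lambda feature: accuracy(films, feature))


rng = random.Random(0)
training_site = make_site(2000, marker_agreement=0.95, rng=rng)
new_site = make_site(2000, marker_agreement=0.50, rng=rng)

chosen = fit_one_feature_model(training_site)
internal_accuracy = accuracy(training_site, chosen)
external_accuracy = accuracy(new_site, chosen)
print(f"feature used: {chosen}")
print(f"accuracy at training site: {internal_accuracy:.2f}")
print(f"accuracy at new site: {external_accuracy:.2f}")
```

Nothing in the training-site number reveals which feature was used. The genuine finding would have carried over to the new site, but the model had no reason to prefer it.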

Why shortcut learning is so hard to detect

Hamish: The reason shortcut learning is so dangerous is that it doesn’t announce itself. The model isn’t “confused.” It’s doing exactly what it was optimised to do.

Jeremy: And traditional performance metrics won’t save you. AUROC doesn’t tell you why the model is right. It just tells you how well the model’s scores rank labelled cases above labelled non-cases.

Hamish: Which is why teams often deploy models with enormous confidence — until they move outside the narrow context in which they were trained.

Jeremy: This is where clinicians need to shift their mindset. A model that performs extremely well may actually be more dangerous than a mediocre one, because it inspires trust.
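
To make the point about AUROC concrete: it can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below implements that pairwise definition directly, using invented labels and scores. It shows that the metric depends only on how cases are ordered, so it carries no information about which features produced the scores or whether the probabilities themselves are sensible.

```python
def auroc(labels, scores):
    """Probability that a randomly chosen positive case scores higher than a
    randomly chosen negative case (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented labels and scores for illustration.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.90, 0.80, 0.70, 0.40, 0.35, 0.10]
squashed = [s / 10 for s in scores]  # same ordering, very different "probabilities"

print(auroc(labels, scores))
print(auroc(labels, squashed))
```

Both calls return the same value, because dividing every score by ten leaves the ranking untouched. A model that ranks well for the wrong reasons is indistinguishable, on this metric alone, from one that ranks well for the right ones.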

Hidden feedback loops — when AI changes the data it learns from

Hamish: Another failure mode that doesn’t get enough attention is feedback loops.

Jeremy: Yes. Once an AI system influences clinical behaviour, it also begins to shape the data that future models will be trained on.

Hamish: For example, if an AI sepsis model flags certain patients as high risk, clinicians may investigate them more aggressively, order more labs, and document more concerns.

Jeremy: Those patients now generate “richer” data, reinforcing the model’s belief that those features are predictive — even if the original signal was weak or spurious.

Hamish: Over time, the model trains on its own consequences. The distinction between signal and artefact becomes blurred.

Jeremy: And unless this is explicitly monitored, the system drifts into a self-reinforcing loop that looks increasingly confident and increasingly detached from physiology.

When success in one context becomes failure in another

Hamish: This is also where issues of generalisability come in. Models trained in tertiary centres often fail in regional, rural, or resource-constrained settings.

Jeremy: Different patient populations. Different documentation practices. Different staffing patterns. Different thresholds for investigation.

Hamish: And yet the model output looks identical. Same probability scores. Same confidence.

Jeremy: Which is why “it worked where we trained it” is not a meaningful safety argument.

Hamish: So we’ve talked about shortcut learning and spurious success. But even if you imagine a model that isn’t relying on snow — even if it’s genuinely using physiological signals — we still have a major problem.

Jeremy: Yeah. And that problem is opacity. The black box. The fact that most modern AI systems can’t explain themselves in any clinically meaningful way.

Hamish: Which is a very strange thing to accept in medicine, when you think about it. Almost every clinical decision we make has to be justifiable — to colleagues, to patients, to families, and sometimes to courts.

Jeremy: Exactly. We don’t just act — we reason. “I ordered this CT because the patient had focal neurology.” “I started vasopressors because the MAP remained low despite fluids.”

Hamish: AI doesn’t do that. A deep learning model gives you a probability or a classification, but it can’t articulate why that output was generated in a way that maps onto clinical reasoning.

Jeremy: And to be clear, this isn’t because vendors are being secretive. It’s because of how these models work. Their “reasoning” is distributed across millions or billions of parameters. There is no single chain of logic to point to.

Hamish: Which means that even the engineers who built the model often can’t tell you exactly why a specific prediction was made.

Why opacity matters clinically

Hamish: Opacity becomes dangerous the moment an AI system begins to influence decisions.

Jeremy: Because if you can’t understand why a recommendation was made, you can’t meaningfully interrogate it. You can’t sanity-check it against the bedside picture.

Hamish: And that’s how automation bias creeps in. Humans have a well-documented tendency to trust confident, quantitative outputs — especially when they’re framed as probabilities.

Jeremy: Give someone a number like “92% risk of deterioration,” and it carries an authority that’s very hard to challenge, even when your clinical instincts disagree.

Hamish: What’s ironic is that we wouldn’t tolerate this from a human colleague. If a registrar said “I’m 92% sure,” the next question would be “based on what?”

Jeremy: But we often don’t ask that of AI. And that’s dangerous, because AI is the least accountable voice in the room.

Opacity and accountability

Hamish: There’s also a governance issue here. If an AI recommendation contributes to harm, who is responsible?

Jeremy: Right now, the answer is simple and uncomfortable: the clinician. The AI doesn’t sign the chart. The AI doesn’t attend the coroner’s court.

Hamish: So clinicians end up absorbing risk from systems they can’t fully interrogate. That’s not a technical problem — that’s a power imbalance.

Jeremy: And it becomes especially problematic when AI tools are embedded into workflows in a way that’s hard to bypass. When the output is constantly present, always visible, always nudging behaviour.

Hamish: That’s when opacity becomes coercive rather than merely inconvenient.

The performance–interpretability trade-off

Jeremy: There’s another uncomfortable truth here. Often, the most accurate models are the least interpretable.

Hamish: Yes. As models get deeper and more complex, performance improves — but transparency worsens. This is the performance–interpretability trade-off.

Jeremy: Which sets up an apparent choice between accuracy and understanding. And in medicine, that’s not a trade-off we can blindly accept.

Hamish: Because a slightly less accurate model that clinicians can understand and challenge may be safer than a marginally more accurate black box that silently drives behaviour.

Why this matters in emergency and critical care

Hamish: In ED and ICU, decisions are time-critical, context-rich, and often made with incomplete information.

Jeremy: Which is exactly the setting where blind trust in opaque systems is most dangerous.

Hamish: If an AI flags a patient as “low risk,” but you can’t see why, you may be falsely reassured. If it flags someone as “high risk,” you may over-escalate.

Jeremy: And because the model doesn’t explain itself, you can’t calibrate your trust appropriately.

The key takeaway

Hamish: So the black box problem isn’t an abstract philosophical issue. It’s a clinical safety issue.

Jeremy: Opacity undermines judgement, accountability, and trust — all of which are foundational to medical practice.

Hamish: And that’s why explainability isn’t a “nice to have.” It’s a prerequisite for responsible use.

Jeremy: Which brings us to the next question: if opacity is the problem, how do we mitigate it — and how do we actually inspect what a model is doing?

Hamish: If there’s one section clinicians should listen to twice, it’s this one. Because bias in AI isn’t an opinion — it’s a predictable technical consequence of how these systems are built.

Jeremy: And importantly, AI doesn’t invent bias. It absorbs it from the healthcare system it’s trained on, and then scales it with ruthless efficiency.

Hamish: Exactly. Whatever inequities exist in our data become inequities in the model — but now they’re faster, quieter, and harder to detect.

Dataset bias — who gets seen, and who doesn’t

Hamish: Let’s start with dataset bias. This is the simplest form to understand, but also one of the most harmful.

Jeremy: If certain patient groups are underrepresented in the training data, the model simply doesn’t learn how to recognise them properly.

Hamish: In Australia, the most obvious example is Indigenous patients. They’re underrepresented in many tertiary datasets, particularly those used to train AI models.

Jeremy: Which means an AI trained on metropolitan hospital data may systematically underperform for Indigenous patients — not out of malice, but because it’s never seen enough examples.

Hamish: And that underperformance isn’t random. It shows up in triage, risk prediction, and escalation decisions — exactly the places where early errors compound.

Label bias — when “ground truth” isn’t actually true

Hamish: Label bias is more subtle, and arguably more dangerous.

Jeremy: Because labels come from clinicians. And clinicians are not neutral measuring instruments.

Hamish: If women are underdiagnosed with sepsis, then the label “sepsis” appears less often in women — even when the physiology is the same.

Jeremy: The AI doesn’t know that. It assumes the label is true. So it learns that female physiology is less predictive of sepsis.

Hamish: Which means the model will under-predict sepsis in women going forward — reinforcing the original bias.

Jeremy: Same story with pain, myocardial infarction, trauma severity, and mental health. The AI inherits every diagnostic blind spot we’ve ever had.
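
A small worked example may help with the sepsis case above. It assumes, purely for illustration, that men and women have the same true sepsis rate but that true cases in women are labelled less often. The counts are invented, and the point is only that a model trained on chart labels sees the labelled rate, never the true one.

```python
# Hypothetical counts for illustration: both groups have the same true sepsis
# rate, but true cases in women are assumed to be labelled less often.
cohort = {
    "men":   {"patients": 1000, "true_sepsis": 100, "labelled_sepsis": 90},
    "women": {"patients": 1000, "true_sepsis": 100, "labelled_sepsis": 60},
}


def true_rate(group):
    """Rate of actual disease, which the training data does not record."""
    return cohort[group]["true_sepsis"] / cohort[group]["patients"]


def labelled_rate(group):
    """This is the only rate a model trained on chart labels ever sees."""
    return cohort[group]["labelled_sepsis"] / cohort[group]["patients"]


for group in cohort:
    print(f"{group}: true rate {true_rate(group):.2f}, "
          f"rate in the labels {labelled_rate(group):.2f}")
```

With identical physiology, the labels still tell the model that sepsis is less common in women, and that is the pattern it will reproduce.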

Measurement bias — same pathology, different inputs

Hamish: Measurement bias happens when the inputs themselves differ in systematic ways that aren’t related to disease.

Jeremy: Different CT scanners. Different ECG machines. Different ABG analysers. Even different documentation systems.

Hamish: If sicker patients are scanned on one machine and stable patients on another, the model can learn the machine rather than the illness.

Jeremy: And clinicians never see this happening, because from their point of view, the pathology is the same.

Modelling bias — mathematics favours the majority

Hamish: This is the part people often miss. Most AI models optimise for overall accuracy.

Jeremy: Which means they’ll happily sacrifice performance in small subgroups if it improves performance overall.

Hamish: From the model’s perspective, misclassifying a minority group that makes up 5% of patients is statistically acceptable if it improves accuracy for the other 95%.

Jeremy: From a clinical perspective, that’s unacceptable — because those 5% are real patients.
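
The arithmetic behind this is worth seeing once. In the sketch below, with invented counts, a minority group makes up 5% of 1,000 patients. The model does well for the majority and poorly for the minority, yet the headline accuracy barely registers the difference. The function name and group labels are placeholders.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records is a list of (group, prediction_was_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # group -> [number correct, number seen]
    for group, correct in records:
        totals[group][0] += int(correct)
        totals[group][1] += 1
    return {group: c / n for group, (c, n) in totals.items()}


# Invented counts: 950 majority patients and 50 minority patients.
records = (
    [("majority", True)] * 912 + [("majority", False)] * 38
    + [("minority", True)] * 30 + [("minority", False)] * 20
)

overall = sum(correct for _, correct in records) / len(records)
by_group = accuracy_by_group(records)
print(f"overall accuracy: {overall:.3f}")
for group, acc in by_group.items():
    print(f"{group} accuracy: {acc:.3f}")
```

Reporting only the overall figure would hide the gap entirely, which is one argument for asking to see performance broken down by subgroup.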

Deployment bias — when models leave home

Hamish: Even a well-trained, relatively unbiased model can fail when deployed in a new environment.

Jeremy: Different hospitals have different patient populations, staffing models, workflows, and intervention thresholds.

Hamish: A model trained in a tertiary trauma centre may behave unpredictably in a regional hospital.

Jeremy: And the output doesn’t tell you that. It looks just as confident.

Temporal bias — medicine moves, models don’t

Hamish: Temporal bias is the slowest and quietest failure mode.

Jeremy: Guidelines change. Antibiotics change. Documentation habits change. Populations change.

Hamish: COVID was the most obvious example, but drift happens all the time at lower intensity.

Jeremy: Sepsis pathways evolve. Imaging thresholds change. Staffing patterns shift.

Hamish: Which means the relationships the model learned slowly become less true.

Jeremy: And unless performance is actively monitored, no one notices until harm occurs.

Why bias is a patient safety issue

Hamish: All of this matters because bias doesn’t stay in the dataset.

Jeremy: It enters the workflow. It influences triage. It shapes escalation. It affects who gets seen first.

Hamish: And because it aligns with existing structural inequities, it’s often invisible unless you actively look for it.

Jeremy: AI doesn’t just reflect bias — it operationalises it.

The uncomfortable truth

Hamish: One biased clinician affects one patient at a time.

Jeremy: One biased AI system affects thousands, quietly, consistently, and without fatigue.

Hamish: Which is why bias in AI isn’t an ethics sidebar. It’s core patient safety work.

Jeremy: And it’s why clinicians must be involved in evaluating these systems — because if we’re not, bias becomes invisible.

Hamish: So far, we’ve talked about bias, which is about who the model works for. Now we need to talk about something slightly different: whether the model actually learned anything real at all.

Jeremy: Yeah. Because a model can be completely unbiased and still be unsafe — if it’s memorised noise, cheated during training, or gone stale over time.

Hamish: And the problem is that all three of these failure modes — overfitting, leakage, and drift — tend to produce models that look excellent in development.

Jeremy: Which is exactly why they get deployed.

Overfitting — when the model memorises instead of generalises

Hamish: Overfitting is probably the easiest conceptually, but one of the hardest to detect clinically.

Jeremy: At its core, overfitting means the model has learned the quirks of the training data rather than the underlying clinical signal.

Hamish: It’s like a trainee who memorises answers from the exam bank instead of understanding physiology. They look brilliant — until the question changes slightly.

Jeremy: And modern AI models are very good at memorisation. Deep learning systems have millions of parameters — more than enough capacity to remember noise.

Hamish: In ED and ICU data, noise includes artefacts from monitors, documentation practices, scanner signatures, and even time-of-day effects.

Jeremy: So a model might learn that scans done overnight are more likely to show pathology — not because disease is nocturnal, but because thresholds for scanning change after hours.

Hamish: And the model isn’t wrong statistically. It’s just learned the wrong thing.
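
Memorisation is easy to demonstrate. The sketch below uses simulated data in which the labels are coin flips, so there is nothing real to learn, and a nearest-neighbour "model" that simply returns the label of the closest training example. This is a swapped-in toy technique chosen for brevity, not a deep learning system, but the behaviour is the same in kind.

```python
import random

# Simulated data: the labels are coin flips, so there is no real signal to learn.
rng = random.Random(0)
train = [(rng.random(), rng.random() < 0.5) for _ in range(200)]
test = [(rng.random(), rng.random() < 0.5) for _ in range(200)]


def predict(x):
    """1-nearest-neighbour: return the label of the closest training example."""
    nearest_value, nearest_label = min(train, key=lambda row: abs(row[0] - x))
    return nearest_label


def accuracy(data):
    """Fraction of (x, label) pairs the memoriser gets right."""
    return sum(predict(x) == y for x, y in data) / len(data)


train_accuracy = accuracy(train)
test_accuracy = accuracy(test)
print(f"accuracy on the data it memorised: {train_accuracy:.2f}")
print(f"accuracy on new data: {test_accuracy:.2f}")
```

The model is perfect on the data it has already seen and no better than chance on anything new, which is the exam-bank trainee in code.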

Why overfitting survives peer review

Jeremy: This is where clinicians need to be especially careful when reading AI papers.

Hamish: Because overfitted models often perform beautifully on internal validation.

Jeremy: High AUROC. Tight confidence intervals. Impressive-sounding metrics.

Hamish: But if the validation data comes from the same hospital, the same scanners, the same workflows — you haven’t really tested generalisation at all.

Jeremy: Which is why external validation isn’t a “nice to have.” It’s the bare minimum.
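
One practical consequence for dataset design is to split by hospital rather than by random row, so the test set comes from a site the model has never seen. The sketch below assumes a hypothetical record format with a `site` field; the site names are placeholders. Holding out a site from your own dataset is a step towards external validation rather than a full substitute for testing at an independent institution.

```python
def split_by_site(records, held_out_site):
    """Hold out every record from one hospital as the external test set."""
    development = [r for r in records if r["site"] != held_out_site]
    external = [r for r in records if r["site"] == held_out_site]
    return development, external


# Hypothetical records; site names are placeholders.
records = [
    {"site": "tertiary_A", "patient_id": 1},
    {"site": "tertiary_A", "patient_id": 2},
    {"site": "regional_B", "patient_id": 3},
    {"site": "regional_B", "patient_id": 4},
    {"site": "tertiary_A", "patient_id": 5},
]

development, external = split_by_site(records, held_out_site="regional_B")
dev_sites = {r["site"] for r in development}
ext_sites = {r["site"] for r in external}
print("development sites:", dev_sites)
print("external test sites:", ext_sites)
```

A random row-level split would have put patients from both hospitals on both sides, so scanners, workflows, and documentation habits would be shared between training and testing.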

Data leakage — when the model cheats without you knowing

Hamish: Leakage is more dangerous than overfitting, because it creates an illusion of intelligence that completely disappears in real practice.

Jeremy: Leakage happens when the model is accidentally given information that wouldn’t be available at the time of prediction.

Hamish: Classic examples are using discharge destination as an input when predicting admission, or including “palliative care consult” in a mortality model.

Jeremy: But leakage can be much subtler than that.

Hamish: Timestamp leakage is a big one. If vitals or labs recorded hours after triage are included in the “early prediction” window, the model is effectively seeing the future.

Jeremy: And because hospital data pipelines are messy, this happens more often than people realise.

Hamish: The model looks miraculous. In reality, it’s just cheating.
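
One way to guard against timestamp leakage, sketched under stated assumptions, is to filter every input by the time it became available before it reaches the model. The field names (`recorded_at`, `name`, `value`), the 30-minute prediction window, and the sample observations are all hypothetical.

```python
from datetime import datetime, timedelta


def observations_available_at(observations, prediction_time):
    """Keep only observations recorded at or before the prediction time."""
    return [obs for obs in observations if obs["recorded_at"] <= prediction_time]


triage_time = datetime(2024, 1, 1, 10, 0)
# Assumed "early prediction" window: 30 minutes after triage.
prediction_time = triage_time + timedelta(minutes=30)

observations = [
    {"name": "heart_rate", "value": 112, "recorded_at": datetime(2024, 1, 1, 10, 5)},
    # Resulted hours after triage: using it would let the model see the future.
    {"name": "lactate", "value": 4.1, "recorded_at": datetime(2024, 1, 1, 13, 40)},
]

usable = observations_available_at(observations, prediction_time)
leaked = [obs for obs in observations if obs not in usable]
print("usable:", [obs["name"] for obs in usable])
print("excluded as leakage:", [obs["name"] for obs in leaked])
```

In a real pipeline the timestamp that matters is when the value became visible to clinicians, which may differ from when a sample was collected or when a row was written to the database.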

Why leakage is so hard to spot

Jeremy: Clinicians often assume leakage is obvious. It usually isn’t.

Hamish: Especially in EHR-derived datasets, where the temporal order of events isn’t always clean.

Jeremy: And once leakage is present, performance metrics become meaningless.

Hamish: You can’t “adjust” for leakage. You have to rebuild the dataset.

Model drift — when yesterday’s truth becomes today’s error

Hamish: Drift is different again. It’s not about training mistakes — it’s about time.

Jeremy: Clinical practice evolves. Antibiotics change. Pathogens change. Guidelines change. Populations change.

Hamish: COVID was the most obvious example, but drift happens all the time at lower intensity.

Jeremy: Sepsis pathways evolve. Imaging thresholds change. Staffing patterns shift.

Hamish: Which means the relationships the model learned slowly become less true.

Jeremy: And unless performance is actively monitored, no one notices until harm occurs.
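
Active monitoring can start very simply. The sketch below, which is an illustration rather than a recommended surveillance standard, tracks the ratio of observed events to model-expected events per month and flags months where that ratio moves away from 1. The record layout, the 0.25 tolerance, and the numbers are assumptions.

```python
from collections import defaultdict


def monthly_observed_expected(records):
    """Return {month: observed events / sum of predicted risks}.

    `records` is an iterable of (month, predicted_risk, outcome) tuples,
    where outcome is 1 if the event happened and 0 otherwise.
    """
    observed = defaultdict(int)
    expected = defaultdict(float)
    for month, predicted_risk, outcome in records:
        observed[month] += outcome
        expected[month] += predicted_risk
    return {m: observed[m] / expected[m] for m in sorted(expected) if expected[m] > 0}


def drift_alerts(ratios, tolerance=0.25):
    """List the months whose observed/expected ratio is further than `tolerance` from 1."""
    return [month for month, ratio in ratios.items() if abs(ratio - 1.0) > tolerance]


# Made-up monitoring data: same predicted risks, but more events by June.
records = [
    ("2024-01", 0.2, 0), ("2024-01", 0.2, 0), ("2024-01", 0.6, 1),
    ("2024-06", 0.2, 0), ("2024-06", 0.2, 1), ("2024-06", 0.6, 1),
]

ratios = monthly_observed_expected(records)
alerts = drift_alerts(ratios)
print(ratios)
print("months needing review:", alerts)
```

A ratio well above 1 means the model is now under-predicting risk; well below 1 means it is over-predicting. Either way, someone has to be assigned to look at the output.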

Why these failures matter clinically

Hamish: Overfitting means the model never understood medicine.

Jeremy: Leakage means it never understood the task.

Hamish: Drift means it no longer understands the world.

Jeremy: And in emergency and critical care, that combination is dangerous.

The key message

Hamish: When clinicians read AI studies, we need to stop asking only “How accurate is this?”

Jeremy: And start asking, “How could this fail?”

Hamish: Because if you understand overfitting, leakage, and drift, most AI hype collapses very quickly.

Jeremy: Which brings us to the next step: if models fail in these ways, how do we actually inspect and police them?

Hamish: So at this point, if people are feeling a bit uneasy about AI, that’s probably appropriate. We’ve talked about shortcuts, bias, overfitting, leakage, and drift, all very real problems.

Jeremy: Yeah, but the message isn’t “therefore never use AI.” It’s “don’t use AI blindly.” And this is where explainability and inspection come in.

Hamish: Exactly. Because the moment an AI system influences clinical decisions, clinicians need a way to ask: what is this model actually doing?

Jeremy: And importantly, explainability isn’t about making AI comforting or intuitive. It’s about making it auditable — in the same way we audit drugs, devices, and clinical pathways.

Why explainability matters more than raw accuracy

Hamish: One of the biggest mistakes organisations make is prioritising performance metrics over understanding.

Jeremy: High AUROC feels reassuring. But it tells you nothing about why the model is right — or wrong.

Hamish: And in medicine, “right for the wrong reason” is dangerous. It doesn’t generalise, and it breaks the moment conditions change.

Jeremy: Which is why explainability should be thought of as a safety feature, not a technical extra.

SHAP — opening the black box just enough

Hamish: The explainability method clinicians will hear about most is SHAP, short for SHapley Additive exPlanations.

Jeremy: The intuitive way to think about SHAP is this: it tells you how much each input contributed to a specific prediction.

Hamish: So instead of just saying “this patient has an 85% risk of deterioration,” the model can show that lactate, respiratory rate, and oxygen requirement raised the risk — while age or comorbidity lowered it.

Jeremy: That’s the difference between a black box and something you can actually reason with.

Hamish: And SHAP works at two levels that matter clinically. At the individual level — “why this patient?” — and at the population level — “what does this model tend to rely on overall?”

Jeremy: That second one is crucial. Because that’s where you catch snow.
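
To make the "why this patient?" idea concrete, here is a minimal sketch that computes exact Shapley values for a three-feature toy model. It is a simplification, not the SHAP library: features that are "absent" are set to a single baseline patient rather than averaged over a background dataset. The model, its coefficients, and the patient values are hypothetical and have no clinical validity.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, patient, baseline):
    """Exact Shapley attribution of model(patient) - model(baseline) to each feature."""
    features = list(patient)
    n = len(features)

    def value(subset):
        # Features in `subset` take the patient's values; the rest stay at baseline.
        mixed = {f: (patient[f] if f in subset else baseline[f]) for f in features}
        return model(mixed)

    contributions = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_f = value(set(subset) | {f})
                without_f = value(set(subset))
                total += weight * (with_f - without_f)
        contributions[f] = total
    return contributions


def toy_risk_score(x):
    # Hypothetical linear score for illustration only.
    return 0.10 + 0.05 * x["lactate"] + 0.01 * x["resp_rate"] + 0.002 * x["age"]


patient = {"lactate": 4.0, "resp_rate": 28, "age": 40}
baseline = {"lactate": 1.0, "resp_rate": 16, "age": 65}

contributions = shapley_values(toy_risk_score, patient, baseline)
for feature, contribution in contributions.items():
    print(f"{feature}: {contribution:+.3f}")
```

Here lactate and respiratory rate push the score up and the patient's younger age pulls it down, and the contributions add up to the difference between this patient's score and the baseline score. Averaging the absolute contributions over many patients gives the population-level view.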

Detecting snow and shortcuts in practice

Hamish: If a model is genuinely using physiology, SHAP plots should look clinically sensible.

Jeremy: Vitals, labs, imaging features — things you’d expect to matter.

Hamish: If, instead, you see the scanner brand, time of day, triage category, note length, or bed location driving predictions — that’s a red flag.

Jeremy: Because those aren’t causes. They’re artefacts of the system.

Hamish: And this is why clinicians need to be involved. A data scientist might not recognise that “resus bay” is snow. An ED physician will recognise it instantly.

Explainability isn’t perfect — and that matters too

Jeremy: We should be honest here: explainability tools aren’t magic.

Hamish: No. SHAP, for example, makes assumptions about how features interact. When variables are tightly correlated — like MAP and systolic BP — attribution can be messy.

Jeremy: Which means explainability tools themselves need to be interpreted carefully, not treated as ground truth.

Hamish: But imperfect visibility is still better than no visibility.

Beyond SHAP — other ways to interrogate models

Hamish: Explainability is only one part of inspection. Calibration is another.

Jeremy: Calibration asks a very practical question: when the model says “20% risk,” does that actually correspond to 20% of patients deteriorating?

Hamish: And crucially, does that hold true across different groups — women, men, Indigenous patients, older adults?

Jeremy: A model can be well calibrated overall and dangerously miscalibrated for specific populations.
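
A minimal sketch of that subgroup check, assuming a simple record layout of (group, predicted risk, outcome): compare the mean predicted risk with the observed event rate, overall and per group. The group labels and numbers are made up to show how an overall match can hide opposite errors in two groups.

```python
from collections import defaultdict


def calibration_by_group(records):
    """Return {group: (mean predicted risk, observed event rate, n)}."""
    sums = defaultdict(lambda: [0.0, 0, 0])  # [sum of risks, events, count]
    for group, predicted_risk, outcome in records:
        entry = sums[group]
        entry[0] += predicted_risk
        entry[1] += outcome
        entry[2] += 1
    return {g: (risk / n, events / n, n) for g, (risk, events, n) in sums.items()}


# Made-up data: every patient is given a 20% predicted risk.
records = (
    [("group_a", 0.2, 0)] * 5      # no events observed in group A
    + [("group_b", 0.2, 0)] * 3
    + [("group_b", 0.2, 1)] * 2    # 2 of 5 events observed in group B
)

by_group = calibration_by_group(records)
overall = calibration_by_group([("overall", r, y) for _, r, y in records])["overall"]

print("overall:", overall)
for group, summary in by_group.items():
    print(group, summary)
```

Overall, predicted and observed both sit at 20%, yet the model over-predicts for one group and under-predicts for the other. With real data the groups would need enough patients for the observed rates to be stable.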

Hamish: Then there’s counterfactual testing — one of the most powerful bias checks.

Jeremy: You take the same patient, change only one non-clinical attribute — like gender or ethnicity — and see if the prediction changes.

Hamish: If it does, you’ve found bias. Full stop.
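
The counterfactual test described here can be sketched in a few lines, assuming the model is available as a function that takes a patient record. Both toy models below are hypothetical; one has a deliberate gender adjustment built in so the check has something to find.

```python
def counterfactual_shift(model, patient, attribute, alternative):
    """Change in prediction when only `attribute` is swapped to `alternative`."""
    twin = dict(patient)          # copy, so the original record is untouched
    twin[attribute] = alternative
    return model(twin) - model(patient)


def biased_toy_model(x):
    # Hypothetical model that lowers risk for female patients for no clinical reason.
    risk = 0.10 + 0.05 * x["lactate"]
    if x["gender"] == "female":
        risk -= 0.05
    return risk


def neutral_toy_model(x):
    # Hypothetical model that ignores gender entirely.
    return 0.10 + 0.05 * x["lactate"]


patient = {"lactate": 3.0, "gender": "female"}

biased_shift = counterfactual_shift(biased_toy_model, patient, "gender", "male")
neutral_shift = counterfactual_shift(neutral_toy_model, patient, "gender", "male")
print(f"biased model shift: {biased_shift:+.2f}")
print(f"neutral model shift: {neutral_shift:+.2f}")
```

A non-zero shift on an otherwise identical patient is the signal to investigate. The check only covers the attribute that is swapped directly; it will not catch bias that enters through correlated proxies such as postcode or note length.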

Inspection as clinical governance

Hamish: This is the shift clinicians need to make. AI inspection isn’t research. It’s governance.

Jeremy: Just like we’d never deploy a new drug without pharmacovigilance, we shouldn’t deploy AI without model surveillance.

Hamish: Performance monitoring. Drift detection. Bias audits. Explainability review.

Jeremy: These are not “tech team responsibilities.” They’re shared clinical responsibilities.

The deeper message

Hamish: Explainability doesn’t make AI safe on its own.

Jeremy: But it makes unsafe AI detectable.

Hamish: And that’s the difference between a tool we can work with — and a black box that quietly shapes care.

Jeremy: Which leads us to the next section, because even with inspection tools, there are ethical limits to what we should delegate to machines.

Hamish: So here’s the point where I think we need to stop talking about models and start talking about people. Because once AI enters a clinical workflow, it doesn’t just change decisions — it changes behaviour.

Jeremy: I agree it changes behaviour, but I’m not convinced that’s inherently a bad thing. A lot of what we do in ED and ICU already relies on cognitive shortcuts. If AI can reduce noise and surface risk earlier, that can be a net positive.

Hamish: Maybe. But that assumes the AI is right often enough, and that clinicians remain psychologically independent of it. And that’s where I start to worry.

Jeremy: You’re talking about automation bias.

Hamish: Exactly. Humans are wired to trust confident systems — especially ones that present numbers with decimals. A risk score of “87%” feels objective in a way that a registrar saying “I’m worried” never does.

Jeremy: But we already trust scores. NEWS, qSOFA, Wells. This isn’t new.

Hamish: There’s a difference. Those scores are crude, transparent, and bounded. You can hold them in your head. You know when they fail. AI models don’t have that kind of intuitive failure surface.

Jeremy: That’s fair. But if we insist on perfect explainability, we’ll never deploy anything useful. Some opacity might be an acceptable trade-off if outcomes improve.

Hamish: That’s where I disagree. Because the moment you accept opacity, you also accept a shift in power. The model influences decisions, but the accountability doesn’t move with it.

Jeremy: Ultimately, the clinician still owns the decision.

Hamish: On paper, yes. In reality, it’s murkier. If an AI tool is embedded in the workflow, constantly visible, nudging escalation or de-escalation, it becomes very hard for an individual clinician to ignore it — especially in a busy department.

Jeremy: So your concern is that clinicians become the shock absorbers for system-level risk.

Hamish: Exactly. The hospital deploys the tool. The vendor builds it. The governance committee signs it off. But when something goes wrong, it’s the physician at the bedside who has to justify why they did or didn’t follow the AI recommendation.

Jeremy: That’s uncomfortable — but isn’t that already the case with protocols and pathways?

Hamish: To some extent. But protocols are static, visible, and negotiable. AI is dynamic, adaptive, and often inscrutable. You can’t meaningfully dissent from a recommendation you can’t interrogate.

A near-miss vignette

Hamish: Let me give you a realistic scenario. An ED uses an AI deterioration tool. A middle-aged patient with vague symptoms is flagged as “low risk.” The department is busy. The score provides reassurance. The patient waits longer than they otherwise might have.

Jeremy: And then they deteriorate.

Hamish: Exactly. Not because anyone was negligent — but because the AI quietly shifted the risk tolerance of the system. No alarm. No error message. Just a subtle behavioural nudge.

Jeremy: And retrospectively, everyone asks why the clinician didn’t escalate earlier.

Hamish: Yes — even though the system implicitly encouraged them not to.

Deskilling and cognitive load

Jeremy: One thing I do worry about is deskilling. If AI always pre-reads imaging, always flags sepsis, always predicts risk, clinicians stop practising those judgments themselves.

Hamish: And in emergency medicine, especially, judgment is a muscle. If you don’t exercise it, you lose it.

Jeremy: But there’s also the counterargument: clinicians are already overloaded. Alert fatigue is real. Cognitive bandwidth is finite.

Hamish: True — but adding another layer of probabilistic output doesn’t necessarily reduce cognitive load. Sometimes it just adds ambiguity.

Jeremy: More data, less clarity.

Hamish: Exactly. And clinicians adapt to noise in dangerous ways. They start ignoring alerts. Or trusting them blindly. Both are bad.

Governance and power

Jeremy: So let’s talk governance. Who actually owns these systems?

Hamish: That’s the question that rarely gets answered clearly. Who owns the data? Who owns the alert? Who is responsible for monitoring drift? Who decides when the model should be switched off?

Jeremy: And who carries liability when the AI contributes to harm?

Hamish: Right now, that responsibility is often diffuse by design. Which is a red flag.

Jeremy: From a leadership perspective, that’s uncomfortable — because it forces explicit decisions about accountability.

Hamish: And that’s exactly why it matters. If responsibility isn’t named, it doesn’t disappear — it just falls to the most junior or most exposed clinician in the room.

The unresolved tension

Jeremy: So here’s where I land, and I suspect we don’t fully agree. I think AI can improve care, even with some opacity, if governance is strong and clinicians are supported.

Hamish: And I think that unless governance, explainability, and accountability are explicit before deployment, AI will quietly make systems worse — not better — by masking staffing shortages, throughput pressure, and risk transfer.

Jeremy: That tension isn’t going away.

Hamish: No. And maybe it shouldn’t. Because the moment we get too comfortable with AI, we stop asking the hard questions.

Monday morning reality check

Jeremy: So here’s the real test for leaders listening. Monday morning, someone brings you an AI tool that “works,” shows great metrics, and promises efficiency.

Hamish: The question isn’t “Is it accurate?” It’s: Who carries the risk when it’s wrong? Who notices when it drifts? And who has the authority to say, “We’re turning this off”?

Jeremy: There’s no clean answer to that.

Hamish: And if that makes you uncomfortable — good. It should.

Jeremy: So after all of that — bias, black boxes, governance, power — the obvious question is: what does this actually mean for us, professionally?

Hamish: Yeah. Because if AI were just another gadget, we wouldn’t need a section like this. But it’s not. AI is already shaping decisions, priorities, and workflows — often without clinicians explicitly agreeing to it.

Jeremy: And that’s where professional responsibility comes in. Not in the sense of “learn to code,” but in the sense of knowing enough to recognise when a system is unsafe, or misaligned with clinical reality.

Hamish: Exactly. There’s a temptation to frame AI literacy as some kind of optional future skill. But that’s not really accurate anymore.

Jeremy: No. It’s more like ultrasound twenty years ago. At first, it was niche. Then useful. Then unavoidable. Now it’s just part of being competent in acute care.

Hamish: And importantly, no college in Australia is saying, “You must complete AI modules or lose your registration.” That’s not where we are.

Jeremy: But the Medical Board expects clinicians to be competent with the technologies they use. And hospitals are already deploying AI-enabled tools into clinical pathways.

Hamish: So the obligation is implicit. If an AI tool influences your decisions, you are expected to understand its limitations — just as you would with a drug, a ventilator mode, or a diagnostic test.

CPD as risk management, not box-ticking

Jeremy: This is where CPD actually matters — not as compliance, but as risk management.

Hamish: Yes. CPD isn’t about becoming an AI enthusiast. It’s about asking hard, practical questions.

Jeremy: Questions like: What data was this trained on? Who was missing? Has this been validated anywhere like our hospital? What happens when it’s wrong?

Hamish: And just as importantly: Who is monitoring it now? Who owns the decision to pause or withdraw it? And where does responsibility sit when outputs conflict with clinical judgement?

Jeremy: Those are not technical questions. They’re clinical governance questions.

Leadership and silence

Hamish: I think one of the most dangerous things right now is clinician silence.

Jeremy: Because if clinicians don’t engage, decisions still get made — just without clinical insight.

Hamish: Exactly. AI procurement, deployment, and governance will proceed regardless. The only question is whether experienced clinicians are in the room when those decisions are made.

Jeremy: And if they’re not, we end up with tools optimised for metrics that don’t reflect clinical reality — throughput, documentation completeness, nominal “risk scores.”

Hamish: Which then quietly reshape practice in ways nobody explicitly agreed to.

Professional identity in an AI-enabled system

Jeremy: There’s also a deeper issue here — identity. What does it mean to be a physician in a system where prediction engines are always running in the background?

Hamish: I think it makes our role more important, not less. Because AI can surface patterns, but it can’t take responsibility. It can’t hold uncertainty. It can’t justify values-based decisions.

Jeremy: And it can’t say, “This patient doesn’t fit the model.”

Hamish: Exactly. That moment — when data, probability, and the human in front of you don’t align — that’s where medicine still lives.

The unresolved ending

Jeremy: So I don’t think the question is whether AI belongs in healthcare. It’s already here.

Hamish: The real question is whether clinicians shape how it’s used — or whether we inherit systems designed without us in mind.

Jeremy: And that’s not something a policy document will solve.

Hamish: It’s something that requires attention, literacy, and a willingness to stay uncomfortable.

Jeremy: Which, honestly, has always been part of good medicine.

Jeremy: So that’s where we’ll leave it. Not with a neat conclusion, and not with a simple answer — because this isn’t a simple problem.

Hamish: AI in healthcare isn’t a question of belief or enthusiasm. It’s a question of responsibility. Who understands these systems, who governs them, and who carries the consequences when they fail?

Jeremy: If there’s one thing we hope this episode has done, it’s moved the conversation away from hype and toward literacy: the kind that lets clinicians ask better questions, not just accept better-looking dashboards.

Hamish: Because AI doesn’t remove uncertainty from medicine. It just changes where that uncertainty sits — and whether we notice it.

Jeremy: This has been The TIME Podcast, created and produced by Clintix, who also host The TIME Conference.

Hamish: Thanks for spending time with us, and for staying curious, critical, and engaged in a space where those qualities matter more than ever.

Jeremy: We’ll see you next time.