
Where Wong 2026 Sits in the ESM Evidence Timeline
To understand why the 2026 multicenter prospective validation of Epic Sepsis Model version 2 matters, it helps to trace the preceding evidence failures that made it necessary. ESM version 1 — a penalized logistic regression model trained on Epic's internal data — achieved mass deployment across U.S. hospitals before meaningful external validation existed.
The first major independent reckoning came with the Wong et al. 2021 external validation in JAMA Internal Medicine. At the University of Michigan, ESM v1 achieved an encounter-level AUROC of 0.63 — substantially below Epic's reported range of 0.76–0.83. At a score threshold of 6, sensitivity was 33%, meaning the model failed to detect sepsis in 67% of patients who developed it. At the same time, it generated alerts for 18% of all hospitalized patients, a false-positive burden that prompted widespread concern about alert fatigue.
A second methodological problem emerged with the Kamran et al. 2024 study published in NEJM AI. When ESM v1's performance was analyzed using only data collected before sepsis criteria were met, its AUROC dropped from 0.87 to 0.62. Restricted further to data before a blood culture was ordered, it fell to 0.53 — barely above chance. The label-leakage finding showed that ESM v1 was cueing on diagnostic and treatment orders that encode clinician suspicion — not detecting patterns that clinicians had missed.
Epic responded in 2022 with a redesigned model: ESM version 2 uses a gradient-boosted tree architecture, targets the Sepsis-3 outcome definition rather than a proprietary sepsis proxy, and supports site-level fine-tuning. These design changes were intended to address both the discrimination gap and the label-leakage vulnerability. By August 2025, Epic reported that 95 organizations covering 731 hospitals had deployed ESM v2.
Wong et al. 2026 is the first large-scale multicenter prospective external validation of ESM v2. Its design (prospective data collection at four health systems immediately after model implementation, guided by TRIPOD+AI reporting standards) is meaningfully stronger than either the 2021 retrospective single-site validation or the 2024 label-leakage analysis. For wider context, the Medical AI Research Radar for Q2 2026 situates this study within the emerging evidence landscape.
| Study | Year | Design | Model Version | AUROC | Key Limitation |
|---|---|---|---|---|---|
| Wong et al., JAMA Intern Med | 2021 | Retrospective, single-site (Michigan) | ESM v1 | 0.63 | Single center; retrospective; 67% of sepsis cases missed |
| Kamran et al., NEJM AI | 2024 | Retrospective, label-leakage analysis | ESM v1 | 0.62 (data before sepsis criteria met); 0.53 (data before blood culture order) | Showed model performance relied on downstream clinical actions |
| Wong et al., JAMA Network Open | 2026 | Prospective, multicenter (4 health systems) | ESM v2 | 0.82–0.92 | No mortality endpoint; fine-tuned models only; Sepsis-3 labeling circularity |
Study Design: What TRIPOD+AI Compliance Actually Guarantees
The Wong et al. 2026 study enrolled 227,091 inpatient encounters across four U.S. health systems: the University of Michigan, Oregon Health & Science University (OHSU), Emory Healthcare, and MetroHealth. Data collection began prospectively at each site immediately after ESM v2 was implemented — not retrospectively applied to historical records.
The study was reported in accordance with TRIPOD+AI guidelines, which extend the original TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) framework to address AI-specific methodological transparency requirements. TRIPOD+AI compliance requires explicit reporting of the model development data source, the external validation dataset, the prediction horizon, the outcome definition and ascertainment method, and the statistical analysis plan.
What prospective design guarantees: the data used to evaluate ESM v2 were not available during model development or fine-tuning. This eliminates the retrospective contamination problem where a model's training data and evaluation data overlap, inflating apparent performance. It also ensures that the EHR data feeding the model reflects the actual operational deployment environment — including the noise, missing values, and workflow patterns that retrospective analyses often smooth over.
The four participating health systems represent genuine institutional diversity: Michigan and OHSU are tertiary academic medical centers with complex, later-stage hospital-onset sepsis populations; Emory Healthcare and MetroHealth serve higher proportions of ED-onset sepsis cases. All four sites fine-tuned ESM v2 before deployment. This fine-tuning requirement is important for interpreting the published performance figures — a point addressed directly in the limitations analysis.
Primary Findings: Statistical Annotation by Site
The headline performance metrics from Wong et al. 2026 are best understood at the site level rather than as a single aggregate figure. The study reports encounter-level AUROC ranging from 0.82 to 0.92 across the four health systems — a range that is structurally meaningful, not a sign of measurement inconsistency (this is addressed in the next section). Below are the key metrics by site.

| Site | Encounter AUROC (95% CI) | PPV at 60% Sensitivity | NNE (12-hour horizon) | Median Lead Time vs. Sepsis Onset | Median Lead Time vs. Clinician Recognition | Sepsis Incidence | ED-Onset Proportion | First-Hour CBC Exclusion |
|---|---|---|---|---|---|---|---|---|
| University of Michigan | 0.82 (0.81–0.83) | 0.13 | 35 | 1.9 hours | 1.4 hours | ~2.2% | Not reported as dominant | 30.1% |
| OHSU | ~0.84–0.86 (range) | ~0.15–0.17 | ~28–32 | ~3–5 hours | ~2–4 hours | ~3–4% | Not dominant | ~15–20% |
| Emory Healthcare | 0.92 (0.92–0.93) | 0.26 | 21 | 10.3 hours | 7.1 hours | ~7.1% | 71.5% | 9.8% |
| MetroHealth | 0.92 | 0.24 | 22 | ~8–10 hours | ~5–7 hours | ~6–7% | 60.3% | ~12% |
ESM v2 outperformed ESM v1 at all four sites on all reported metrics. The improvement is most visible at Michigan, where v1 achieved AUROC 0.63 in the 2021 external validation versus v2's 0.82 in this study. The gradient-boosted tree architecture with Sepsis-3 outcome targeting appears to have meaningfully reduced the label-leakage problem: when Wong et al. 2026 restricted analysis to predictions made before clinician recognition of sepsis, the AUROC dropped only slightly (to approximately 0.80–0.90 across sites), compared to the dramatic collapse seen in the Kamran 2024 analysis of v1.
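The pre-recognition restriction described above can be made concrete with a toy calculation. The sketch below is an illustration only, not the study's analysis code: the encounters, scores, and recognition hours are invented, the helper names `auroc` and `encounter_score` are hypothetical, and it assumes the encounter-level score is the maximum score over eligible predictions. Non-sepsis encounters keep their full prediction window, which is a simplification.

```python
def auroc(pos_scores, neg_scores):
    """Mann-Whitney AUROC: share of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def encounter_score(predictions, cutoff_hour=None):
    """Highest score among predictions made before cutoff_hour (all predictions if None).

    Assumes at least one eligible prediction exists for the encounter.
    """
    eligible = [s for hour, s in predictions if cutoff_hour is None or hour < cutoff_hour]
    return max(eligible)


# Synthetic encounters: (list of (hour, score), hour of clinician recognition or None).
sepsis = [
    ([(1, 20), (5, 30), (9, 80)], 8),
    ([(2, 40), (6, 45), (10, 90)], 9),
]
non_sepsis = [
    ([(1, 10), (7, 25)], None),
    ([(3, 35), (8, 15)], None),
    ([(2, 5)], None),
]

neg = [encounter_score(preds) for preds, _ in non_sepsis]
full = auroc([encounter_score(preds) for preds, _ in sepsis], neg)
restricted = auroc([encounter_score(preds, cutoff) for preds, cutoff in sepsis], neg)
print(f"all predictions: {full:.2f}, pre-recognition only: {restricted:.2f}")
```

In this toy data the scores that separate the classes arrive after recognition, so discrimination falls once they are removed. A model with little leakage would show only a small drop, which is the pattern reported for v2.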
Why the AUROC Range Is a Signal, Not a Flaw
The 0.10 AUROC gap between Michigan (0.82) and Emory/MetroHealth (0.92) is one of the study's most instructive findings, and the one most likely to be misread as evidence of model inconsistency. It is not. It is a direct reflection of case-mix differences that any sepsis prediction model would face.
The key driver is the proportion of ED-onset versus hospital-onset sepsis. At Emory Healthcare, 71.5% of sepsis cases had ED onset; at MetroHealth, 60.3%. ED-onset sepsis tends to present with more pronounced, rapidly evolving physiologic abnormalities — the kind of signal pattern that a gradient-boosted tree model can detect with high confidence. Hospital-onset sepsis, which predominates at tertiary academic centers like Michigan and OHSU, develops more gradually against a background of complex comorbidities, post-surgical states, and ongoing treatments that create substantial noise in the feature space.
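Baseline sepsis incidence compounds this effect. Emory's 7.1% incidence gives the model a richer positive-case signal relative to overall encounter volume than Michigan's approximately 2.2%. A higher base rate raises PPV at any fixed sensitivity and specificity in a mathematically predictable way. AUROC is not a direct function of prevalence, so the higher AUROC at the high-incidence sites is better attributed to the ED-onset case mix that accompanies the higher incidence than to the base rate itself.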
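The dependence of PPV on base rate follows directly from Bayes' rule. The sketch below holds sensitivity at the 60% operating point used in the site table and assumes a specificity of 0.90 purely for illustration; that specificity is not a figure from the study, and the function name `ppv` is hypothetical.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


ASSUMED_SPECIFICITY = 0.90  # illustrative assumption, not reported in the study

for label, incidence in [("~2.2% incidence", 0.022), ("~7.1% incidence", 0.071)]:
    value = ppv(0.60, ASSUMED_SPECIFICITY, incidence)
    print(f"{label}: PPV = {value:.2f}")
```

With the classifier's sensitivity and specificity held fixed, moving from roughly 2% to roughly 7% incidence more than doubles PPV in this sketch. That is why PPV figures from one site cannot be carried over to another with a different base rate.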
Alert Burden in Operational Context: NNE, PPV, and the 8-Hour Silencing Strategy
The Number Needed to Evaluate (NNE) metric appears in two distinct forms in the Wong et al. 2026 study, and conflating them produces a misleading picture of operational alert burden.
The hospitalization-wide NNE — calculated across the entire encounter — ranges from approximately 4 to 8 across sites. This figure is useful for encounter-level watchlisting: it describes how many patients need to be flagged at any point during hospitalization before one sepsis case is identified. It is an appropriate metric for triage prioritization workflows where clinicians review a daily list of high-risk patients.
The 12-hour horizon NNE, calculated for alerts that fire within 12 hours of a sepsis event, ranges from 21 to 35 across sites. This is the operationally honest metric for real-time alert design. When an alert fires and a clinician must decide whether to act within the next 12 hours, an NNE of 21–35 means that 21 to 35 patients are evaluated for each true sepsis case identified at that time horizon, which is 20 to 34 false alerts per true positive.
| NNE Metric | Range Across Sites | Clinical Interpretation | Appropriate Use Case |
|---|---|---|---|
| Hospitalization-wide NNE | 4–8 | 4–8 flagged encounters per true sepsis case over the full hospitalization | Daily risk stratification lists; encounter-level watchlisting |
| 12-hour horizon NNE | 21–35 | 21–35 evaluated patients per true sepsis case at the real-time alert window | Real-time alert workflow design; clinician response protocols |
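The two NNE forms are easier to compare once NNE is read as the reciprocal of PPV, which is its standard definition. The sketch below uses the site PPVs at 60% sensitivity from the table above (OHSU is omitted because only a range is given). The reciprocals land in the 4–8 hospitalization-wide range, which suggests, though the text does not state it, that the PPV column is an encounter-level figure. The function names are hypothetical.

```python
def nne_from_ppv(ppv):
    """Number needed to evaluate: flagged patients per true case."""
    return 1.0 / ppv


def ppv_from_nne(nne):
    """Per-alert precision implied by a given NNE."""
    return 1.0 / nne


site_ppv = {"University of Michigan": 0.13, "Emory Healthcare": 0.26, "MetroHealth": 0.24}
for site, value in site_ppv.items():
    print(f"{site}: PPV {value:.2f} -> NNE {nne_from_ppv(value):.1f}")

# The 12-hour horizon NNE of 21-35 implies a much lower per-alert precision.
for nne in (21, 35):
    print(f"NNE {nne}: PPV {ppv_from_nne(nne):.3f}, false alerts per true case {nne - 1}")
```

Read this way, the 12-hour figures correspond to a per-alert precision of roughly 3% to 5%, compared with 13% to 26% at the encounter level.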
The study also examined an 8-hour silencing strategy — suppressing repeat alerts for a patient for 8 hours after an initial alert fires. The silencing strategy substantially reduces total alert volume, which addresses the alert fatigue concern that plagued ESM v1's 18% all-patient alert rate. However, the study's data show that 8-hour silencing does not improve PPV or NNE at the 12-hour horizon. The precision of each individual alert — the probability that a given alert corresponds to a true sepsis event — remains unchanged. Silencing reduces the number of times clinicians are interrupted, but does not make each interruption more likely to be actionable.
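The silencing rule can be sketched in a few lines. This is a minimal illustration, not Epic's or the study's implementation: it assumes the 8-hour window is measured from the last alert actually delivered to clinicians, and `apply_silencing` and the alert times are hypothetical.

```python
def apply_silencing(alert_hours, window_hours=8.0):
    """Return the alerts that are delivered when repeats inside the window are suppressed.

    Assumption: the window restarts only when an alert is actually delivered.
    """
    delivered = []
    for t in sorted(alert_hours):
        if not delivered or t - delivered[-1] >= window_hours:
            delivered.append(t)
    return delivered


# Hours since admission at which one patient's score crossed the alert threshold.
raw_alerts = [1.0, 2.0, 5.0, 9.5, 10.0, 18.0]
print(apply_silencing(raw_alerts))  # [1.0, 9.5, 18.0]
```

Six threshold crossings become three interruptions for this patient. Nothing in the rule changes whether the patient has sepsis, which is why fewer alerts does not translate into a more precise alert.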
Critical Limitations: What the Authors Acknowledge and What They Cannot Resolve
Wong et al. 2026 is notable for the transparency with which it presents its own methodological constraints. Five limitations deserve analytical depth beyond the acknowledgment section.
- Sepsis-3 labeling circularity. The study uses Sepsis-3 criteria — which require evidence of infection (typically blood culture ordering) combined with organ dysfunction — as the outcome definition. Once ESM v2 is deployed, its alerts may influence whether clinicians order blood cultures and initiate antibiotics. If alerts cause cultures to be ordered that would not otherwise have been ordered, those cultures can satisfy the Sepsis-3 infection criterion, creating a situation where the model's predictions partially construct the outcome labels used to evaluate the model. The authors argue that sepsis incidence rates remained consistent with historical pre-deployment rates and that the direct replacement of v1 with v2 mitigates this concern. These are reasonable arguments, but they do not eliminate the circularity — they bound its likely magnitude. Any reader citing the study's AUROC figures as evidence of model performance should understand that those figures were generated in an environment where the model was operationally active.
- First-hour CBC exclusion. ESM v2 requires a complete blood count (CBC) result to generate a score. Encounters where no CBC was available in the first hour were excluded from analysis. At Michigan, this exclusion removed 30.1% of sepsis encounters — nearly one in three. At Emory, the exclusion rate was 9.8%. The directional bias this introduces is upward: patients who present with severe enough illness to prompt immediate CBC ordering are likely to be more physiologically abnormal, making them easier to score and predict. Patients who develop sepsis more insidiously — without early laboratory workup — are systematically excluded from the denominator. The early-presentation failure mode is therefore invisible in the published performance figures.
- Fine-tuned models only. All four validation sites fine-tuned ESM v2 before deployment. The study provides no data on how the base (non-fine-tuned) model performs. This is not a minor caveat: fine-tuning recalibrates the model's score distributions and threshold mappings to local case mix, incidence rates, and EHR data patterns. An institution deploying the base model out of the box, without fine-tuning, cannot apply the published AUROC, PPV, or NNE figures to its own operational planning. The performance of the non-fine-tuned base model remains unknown from this study.
- No mortality endpoint. The study was designed to evaluate predictive performance — AUROC, sensitivity, specificity, lead time — not clinical outcomes. It cannot answer whether ESM v2 deployment reduces sepsis mortality, ICU length of stay, or time to appropriate antibiotic administration. This is not a flaw in the study's design; it was not designed to answer those questions. But it means the study's strong AUROC figures cannot be cited as evidence of survival benefit. No randomized controlled trial of any ESM version has demonstrated mortality benefit. For context on what a mortality-endpoint study would require, the broader AI research evidence review addresses this pattern across clinical AI models.
- No stratification by care setting. The study reports whole-hospitalization encounter-level performance without stratifying by clinical setting — emergency department, general medical ward, step-down unit, or ICU. Performance almost certainly varies across these settings because the physiologic presentation, monitoring intensity, and data completeness differ substantially. A model that performs well in the ED — where sepsis often presents acutely and laboratory data are collected rapidly — may perform differently on a general medical ward where sepsis develops more gradually over 24–48 hours. The absence of setting-stratified data limits the study's guidance for institutions designing setting-specific alert thresholds.
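The size of the first-hour CBC exclusion effect can be bounded with simple arithmetic. The sketch below treats the 60% operating point as the sensitivity among included sepsis encounters and asks what overall sensitivity would be under different assumed sensitivities in the excluded group. The excluded-group sensitivity is unknown and is varied here as an assumption; `overall_sensitivity` is a hypothetical helper.

```python
def overall_sensitivity(included_sens, excluded_fraction, excluded_sens):
    """Sensitivity over all sepsis encounters, weighting included and excluded groups."""
    return (1.0 - excluded_fraction) * included_sens + excluded_fraction * excluded_sens


exclusion_rate = {"University of Michigan": 0.301, "Emory Healthcare": 0.098}

for site, excluded in exclusion_rate.items():
    # Worst case: none of the excluded sepsis encounters would have been flagged.
    worst = overall_sensitivity(0.60, excluded, 0.0)
    # Neutral case: excluded encounters behave exactly like included ones.
    neutral = overall_sensitivity(0.60, excluded, 0.60)
    print(f"{site}: overall sensitivity between {worst:.2f} and {neutral:.2f}")
```

Under the worst-case assumption, a reported 60% sensitivity corresponds to about 42% of all sepsis encounters at Michigan and about 54% at Emory. The true figure lies somewhere between the two bounds and cannot be recovered from the published data.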
Fairness Audit: What the Absence of Major Disparities Does and Does Not Mean
Wong et al. 2026 includes a fairness audit examining whether ESM v2 performance varied systematically by patient demographic characteristics — age, sex, and race. The study found no major independent demographic performance disparities across the four sites. This is a meaningful finding: it suggests the model does not exhibit the kind of pronounced, independent bias against specific demographic groups that has been documented in some other clinical AI tools.
However, the fairness audit finding carries an important interpretive limit. The study found that model performance closely tracked baseline sepsis incidence by demographic subgroup. Sepsis incidence is not demographically uniform — it varies by age, comorbidity burden, socioeconomic factors, and access to care in ways that are themselves products of structural health inequities. When a model's performance mirrors population-level incidence patterns, it becomes structurally difficult to isolate independent model bias from the underlying incidence differences.
The fairness audit findings are also limited to the four validation sites. Demographic composition, baseline incidence by subgroup, and institutional care patterns differ substantially across U.S. health systems. A fairness finding at Michigan, OHSU, Emory, and MetroHealth cannot be generalized to community hospitals, critical access hospitals, or safety-net facilities with different patient populations.
What This Evidence Supports and What It Cannot Answer
Structured assessment of evidentiary boundaries is the most useful service a study appraisal can provide. Wong et al. 2026 is a strong study, and its findings are meaningful — but the boundaries of what they support are specific.
- ESM v2 is a genuine and measurable improvement over ESM v1. The AUROC improvement from 0.63 (v1, Michigan 2021) to 0.82 (v2, Michigan 2026) is not attributable to study design differences alone. The gradient-boosted tree architecture with Sepsis-3 outcome targeting reduced the label-leakage vulnerability that made v1's apparent performance misleading.
- ESM v2 provides a clinically meaningful lead-time advantage over unaided clinician recognition at these four sites. Median prediction lead times of 1.4–7.1 hours before clinician recognition of sepsis represent a real and measurable window for earlier intervention — not just earlier documentation of what clinicians already suspected.
- Institutional variability in performance is real and site-specific threshold calibration is mandatory. The AUROC range 0.82–0.92 and the threshold range 14–37 for 60% sensitivity are not noise — they are a direct signal that universal deployment parameters are operationally inappropriate.
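Site-specific threshold calibration, as described in the last point above, amounts to choosing the score cut-off that reaches the target sensitivity on local data. The sketch below is a minimal illustration with invented scores, chosen so that the two toy sites land on thresholds of 14 and 37; they are not study data, and `threshold_for_sensitivity` is a hypothetical helper.

```python
import math


def threshold_for_sensitivity(positive_scores, target=0.60):
    """Highest threshold t such that at least `target` of sepsis cases score >= t."""
    ranked = sorted(positive_scores, reverse=True)
    # Number of sepsis encounters that must be flagged; the epsilon guards against float error.
    k = math.ceil(target * len(ranked) - 1e-9)
    return ranked[k - 1]


# Synthetic peak scores for sepsis encounters at two hypothetical sites.
site_a = [10, 12, 14, 20, 30]
site_b = [20, 30, 37, 45, 60]

print(threshold_for_sensitivity(site_a))  # 14
print(threshold_for_sensitivity(site_b))  # 37
```

The same 60% sensitivity target yields very different cut-offs when local score distributions differ, so a threshold borrowed from another site would run at an unknown sensitivity.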
What the study cannot answer is equally important to state clearly:
- Whether ESM v2 reduces sepsis mortality. The study had no mortality endpoint. No RCT of any ESM version has demonstrated a survival benefit. Citing the 2026 AUROC figures as evidence of mortality impact would be a category error.
- How the base (non-fine-tuned) model performs. All four sites fine-tuned before deployment. The published performance figures cannot be applied to out-of-the-box ESM v2 deployment.
- How ESM v2 performs outside these four health system types. Community hospitals, critical access facilities, and safety-net institutions with different case mix, EHR data completeness, and staffing patterns are not represented in this validation.
- Whether COMPOSER or TREWS outcome evidence transfers to ESM v2. COMPOSER's demonstrated association with reduced in-hospital sepsis mortality in a before-and-after study at UC San Diego Health is specific to that model's architecture and deployment context. ESM v2's stronger AUROC does not inherit COMPOSER's outcome evidence. These are distinct models evaluated in distinct study designs.
Research Gaps and What a Stronger Evidence Base Would Require
Wong et al. 2026 was not designed to answer every question it leaves open. Identifying the gaps is not a criticism of the study; it is a map of what the research community needs next to move from validated discriminative performance to demonstrated clinical benefit.
- A randomized controlled trial with a mortality primary endpoint. The most important missing study is an RCT that randomizes patients or clinical units to ESM v2 alert exposure versus standard care, with in-hospital mortality or 30-day mortality as the primary endpoint. The lead-time advantage documented in Wong et al. 2026 is a necessary condition for mortality benefit — but it is not sufficient evidence. An RCT would need to be adequately powered for the expected effect size, pre-registered, and conducted at sites that have completed fine-tuning.
- Validation of the base (non-fine-tuned) model. A prospective or retrospective study evaluating ESM v2 performance at sites that deploy the model without fine-tuning would address one of the most practically important gaps for the 95 organizations currently using ESM v2. Many smaller institutions lack the data volume and informatics infrastructure to fine-tune effectively.
- Prospective stratification by care setting. A study reporting ESM v2 performance separately for ED presentations, general medical ward encounters, step-down unit encounters, and ICU encounters — with setting-specific NNE and lead-time figures — would provide the granularity needed for setting-specific alert threshold design.
- A prospective fairness audit designed to disentangle model bias from incidence variation. A fairness study with sufficient statistical power to detect moderate-magnitude independent demographic bias — controlling for baseline sepsis incidence by subgroup — would provide a more rigorous answer to the equity question than the current audit's absence-of-major-disparities finding. This would require pre-specified demographic subgroup analyses with appropriate sample size calculations.
- Validation at community and safety-net institutions. The four validation sites are all large, well-resourced health systems with mature EHR infrastructure and informatics teams capable of executing fine-tuning and prospective data collection. The generalizability of ESM v2's performance to community hospitals, critical access hospitals, and safety-net facilities — which represent the majority of U.S. inpatient encounters — is currently unknown.
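To give a sense of what "adequately powered" means for the mortality RCT in the first gap above, the sketch below applies the standard normal-approximation sample size formula for comparing two proportions. The mortality rates are placeholders chosen for illustration and are not taken from the study or from any trial. The function name `n_per_arm` is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1.0 - p_control) + p_treatment * (1.0 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2)


# Placeholder mortality rates (assumptions): 20% under standard care, 17% with alerts.
print(n_per_arm(0.20, 0.17))
```

Under these placeholder rates the formula calls for a few thousand sepsis patients per arm with individual randomization. Randomizing by clinical unit would need a larger sample to account for clustering.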
Wong et al. 2026 establishes a methodologically credible performance baseline for ESM v2 at four well-characterized health systems. That baseline is genuinely useful — it is the strongest external validation evidence available for any EHR-embedded sepsis prediction model. But the distance between a validated AUROC and a demonstrated mortality benefit is where the field's most important work remains.
