AI Sepsis Prediction: Epic Sepsis Model Evidence & FDA-Cleared Alternatives

[Figure: AI-assisted early sepsis detection. EHR data streams feed a continuous prediction model, generating a risk flag with a lead-time window before clinical deterioration.]

The Clinical Problem: Sepsis Burden and the Case for Early AI Detection

Sepsis affects approximately 1.7 million US adults annually and remains the leading cause of inpatient mortality in American hospitals. The Surviving Sepsis Campaign has consistently emphasized early identification and treatment as the primary lever for reducing mortality: each hour of delayed antibiotic administration is associated with a measurable increase in mortality. That clinical imperative is the direct reason EHR-embedded AI prediction tools have proliferated across US health systems over the past five years.

The appeal of AI-based prediction is straightforward: if a model can identify a patient's trajectory toward sepsis before clinical signs become overt, clinicians gain a treatment window that manual surveillance cannot reliably provide. That window — the lead time between an AI alert and either sepsis onset or clinician recognition — is the primary value proposition every deployed model is evaluated against.

What has proven more complicated is the gap between that value proposition and real-world deployment performance. The evidence base for AI sepsis prediction has matured substantially since 2021, but it has also revealed that performance metrics reported in development studies rarely transfer cleanly to new institutional settings. Understanding that gap — and what the evidence now says about how to close it — is the central task for any clinician, informaticist, or procurement team evaluating these tools in 2026.

For broader context on how AI decision support tools have become embedded in hospital EHRs across specialties, see "AI in Medicine: How It's Actually Reshaping Clinical Workflows." This brief focuses specifically on the sepsis prediction use case, with named models, named studies, and specific performance metrics.

How AI Sepsis Prediction Works in EHR-Embedded Settings

EHR-embedded sepsis AI tools operate by continuously ingesting structured data from the patient record — vital signs, laboratory values, medication orders, nursing documentation, and clinical notes — and recalculating a sepsis risk score at defined intervals, typically every one to four hours. When a patient's score crosses a predefined threshold, the system generates an alert delivered to the responsible clinician through the EHR workflow, most commonly as a Best Practice Advisory (BPA) or an inbox notification.

Most deployed models use Sepsis-3 criteria as the outcome definition: organ dysfunction attributable to infection, operationalized as a SOFA score increase of two or more points. Labeling training data with Sepsis-3 creates its own complications — Sepsis-3 is itself imperfect, and retrospective labeling from EHR data introduces noise that propagates into model performance estimates.

Prediction horizon is a critical operational parameter. A model predicting sepsis 12 hours in advance has a different clinical utility profile than one predicting 2 hours in advance — the former provides a longer action window but generates more false positives; the latter is more actionable but leaves less time to intervene. Most published evaluations report performance at multiple horizons, and the horizon at which a model is deployed affects every downstream metric.

AUROC (area under the receiver operating characteristic curve) measures how well the model discriminates sepsis from non-sepsis patients across all possible thresholds — it does not tell you how useful any specific alert will be in practice.
Positive predictive value (PPV) tells you what fraction of alerts correspond to actual sepsis cases at a chosen threshold — this is the metric that determines daily alert burden for clinicians.
Number needed to evaluate (NNE) is the inverse of PPV — how many alerts a clinician must review to identify one true sepsis case. An NNE of 25 means 24 of every 25 alerts are false positives at that threshold.
Lead time measures how far in advance the model alerts before sepsis onset or before the clinician would otherwise have recognized the condition — the operationally relevant benefit metric.
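As a rough illustration of how the first three metrics relate, the following Python sketch computes sensitivity, PPV, and NNE from alert counts, and AUROC from raw scores. The `AlertCounts` class, the function names, and every number in it are hypothetical and are not drawn from any study cited in this brief.

```python
from dataclasses import dataclass


@dataclass
class AlertCounts:
    """Confusion-matrix counts at one alert threshold (hypothetical data)."""
    true_pos: int   # alerted, sepsis confirmed
    false_pos: int  # alerted, no sepsis
    false_neg: int  # sepsis, never alerted
    true_neg: int   # no alert, no sepsis


def sensitivity(c: AlertCounts) -> float:
    # Fraction of sepsis cases that received an alert.
    return c.true_pos / (c.true_pos + c.false_neg)


def ppv(c: AlertCounts) -> float:
    # Fraction of alerts that correspond to true sepsis cases.
    return c.true_pos / (c.true_pos + c.false_pos)


def nne(c: AlertCounts) -> float:
    # Alerts reviewed per true sepsis case; algebraically equal to 1 / PPV.
    return (c.true_pos + c.false_pos) / c.true_pos


def auroc(scores, labels) -> float:
    """Pairwise AUROC: chance a sepsis case outscores a non-sepsis case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


counts = AlertCounts(true_pos=4, false_pos=96, false_neg=2, true_neg=898)
print(f"sensitivity={sensitivity(counts):.2f}")
print(f"PPV={ppv(counts):.2f}  NNE={nne(counts):.0f}")
print(f"AUROC={auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]):.2f}")
```

With these illustrative counts, 4 of 100 alerts are true positives, which reproduces the NNE of 25 described above. The pairwise AUROC loop is quadratic in sample size and is meant to show the definition, not to serve as a production implementation.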

Epic Sepsis Model v1: What the 2021 Michigan Medicine Validation Found

The most consequential early validation of a commercially deployed AI sepsis tool was published in JAMA Internal Medicine in 2021. Researchers at Michigan Medicine evaluated the Epic Sepsis Model (ESM) across 27,697 patients (38,455 hospitalizations) over ten months in 2018–2019. The findings were substantially worse than the developer's reported performance figures.

ESM v1 performance: developer-reported vs. Michigan Medicine external validation. Source: Wong et al., JAMA Intern Med. 2021;181(8):1065–1070; NHLBI summary.
Metric	Developer-Reported	Michigan Medicine Validation
AUROC	0.77–0.83	0.63
Sepsis cases detected	Not disclosed	33% (missed 67%)
Patients receiving an alert	Not disclosed	18% of all hospitalized patients
Study design	Internal validation	External retrospective validation

The Michigan team found the model missed 1,709 patients — 67% of the 2,552 sepsis hospitalizations in their cohort — while simultaneously generating alerts for 18% of all hospitalized patients. That combination — high miss rate and high false-alert burden — defined the core operational problem with ESM v1: it was both insensitive and imprecise, generating alert fatigue without reliable clinical signal.
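The arithmetic behind those figures can be checked with a short back-of-envelope sketch. The miss rate and sensitivity follow directly from the counts quoted above. The alert volume and the implied PPV are estimates only: they assume the 18% alert rate applies at the hospitalization level and that each detected case corresponds to one alerted hospitalization, neither of which is stated in this brief.

```python
sepsis_hospitalizations = 2552   # sepsis hospitalizations in the cohort
missed = 1709                    # sepsis cases the model did not flag
total_hospitalizations = 38455   # all hospitalizations in the cohort
alert_fraction = 0.18            # reported share receiving an alert

detected = sepsis_hospitalizations - missed
miss_rate = missed / sepsis_hospitalizations
sens = detected / sepsis_hospitalizations

# Assumption: the 18% figure is applied to hospitalizations, not patients.
approx_alerts = alert_fraction * total_hospitalizations
# Assumption: each detected case is one alerted hospitalization.
approx_ppv = detected / approx_alerts

print(f"detected={detected}, miss rate={miss_rate:.1%}, sensitivity={sens:.1%}")
print(f"approx. alerted hospitalizations={approx_alerts:.0f}")
print(f"implied PPV (estimate)={approx_ppv:.1%}")
```

Under those assumptions the implied PPV is on the order of one true case per eight alerts, which is the quantitative shape of the "insensitive and imprecise" problem.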

The NHLBI-supported study was the first major independent external validation of ESM and established the baseline against which all subsequent improvements to the model have been measured. It also prompted broader questions about how proprietary EHR-embedded models are validated before widespread deployment — questions that remain relevant in 2026.

Epic Sepsis Model v2: First Multicenter Prospective External Validation (2026)

In February 2026, JAMA Network Open published the first multicenter prospective external validation of ESM v2 — a substantially redesigned model using a gradient-boosted tree architecture trained on a larger dataset with site-specific fine-tuning capability. The study covered 227,091 inpatient encounters across four major US health systems: Michigan Medicine, Oregon Health & Science University (OHSU), Emory Healthcare, and MetroHealth in Cleveland.

The central finding is that ESM v2 outperforms ESM v1 across every metric at every site. But the equally important finding is that performance varies substantially across institutions — and that variation is not random noise. It reflects real differences in patient populations, sepsis onset patterns, and institutional characteristics that make site-specific validation and threshold calibration mandatory before deployment.

[Figure: ESM v1 vs. ESM v2. Improved discrimination and reduced alert burden, but substantial performance variation across the four validation sites reflects real institutional differences rather than measurement noise.]

ESM v2 multicenter validation results vs. ESM v1 Michigan Medicine baseline. Source: Wong et al., JAMA Network Open, February 2026 (PMC12949446); Wong et al., JAMA Intern Med. 2021.
Metric	ESM v2 Range Across Sites	ESM v1 (Michigan)
Encounter-level AUROC	0.82–0.92	0.63
PPV at 60% sensitivity	0.13–0.26	Not reported at comparable threshold
NNE at 12-hour horizon	21–35	Not reported
Alert threshold score (60% sensitivity)	14–37 across sites	Not applicable
Median lead time before sepsis onset	1.9–10.3 hours	Not reported
Median lead time ahead of clinician recognition	1.4–7.1 hours	Not reported
Number of encounters	227,091 across 4 sites	38,455 at 1 site

The AUROC range of 0.82–0.92 should not be collapsed into a single number; it reflects real institutional variability. Similarly, the PPV range of 0.13–0.26 means that at the best-performing site, roughly one in four alerts corresponds to a true sepsis case, while at the lowest-performing site, it is closer to one in eight. The NNE range of 21–35 is reported at the 12-hour horizon, so it is not the simple reciprocal of the PPV figures above (which would be roughly 4–8); at that horizon, it translates directly to the number of chart reviews a clinician must perform per true positive identified.

The lead time findings are the model's clearest clinical argument. Across sites, ESM v2 generated alerts a median 1.9–10.3 hours before sepsis onset, and 1.4–7.1 hours before the clinician would otherwise have recognized the condition. That advance warning window, if acted upon, is where the model's clinical value is realized — but realizing it requires workflow integration that enables rapid clinical response, not just alert delivery.
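For teams reproducing a lead-time figure on local data, the calculation itself is simple. The sketch below assumes one (alert time, sepsis onset time) pair per true-positive case; the timestamps and the `lead_time_hours` helper are hypothetical, and how "onset" is defined is a local labeling decision that this example does not address.

```python
from datetime import datetime
from statistics import median


def lead_time_hours(alert_time: datetime, onset_time: datetime) -> float:
    """Hours between first alert and sepsis onset (positive = alert came first)."""
    return (onset_time - alert_time).total_seconds() / 3600.0


# Hypothetical (first alert, sepsis onset) pairs for three true-positive cases.
cases = [
    (datetime(2025, 1, 1, 8, 0), datetime(2025, 1, 1, 12, 0)),
    (datetime(2025, 1, 2, 22, 30), datetime(2025, 1, 3, 1, 0)),
    (datetime(2025, 1, 3, 6, 0), datetime(2025, 1, 3, 16, 0)),
]

lead_times = [lead_time_hours(alert, onset) for alert, onset in cases]
median_lead = median(lead_times)
print(f"lead times (h): {lead_times}, median: {median_lead}")
```

Reporting the median rather than the mean keeps a handful of very early alerts from dominating the summary.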

As of August 2025, Epic reported 95 organizations comprising 731 hospitals using ESM v2. The February 2026 JAMA Network Open validation represents the first prospective multicenter external validation of that widely deployed model.

Why Performance Varies So Much Across Sites

The ESM v2 multicenter validation identified several structural drivers of institutional performance variation. These are not artifacts of study methodology — they reflect genuine differences in the patient populations and clinical environments where the model operates.

Community-onset versus hospital-onset sepsis proportion: ESM v2 performs better at sites with higher proportions of community-onset sepsis — patients who arrive at the ED already on a sepsis trajectory. Hospital-onset sepsis, where the patient deteriorates after admission, is harder to predict and involves more complex clinical signals.
Baseline institutional sepsis incidence rate: Model performance tracks closely with the underlying prevalence of sepsis in the patient population. Higher baseline incidence improves PPV at any fixed threshold. Sites with lower baseline sepsis rates will see more false positives at equivalent threshold settings.
Tertiary care complexity: Pure tertiary care centers — where most admitted patients already have significant comorbidities and higher baseline illness severity — show weaker ESM v2 performance. When the entire patient population is sicker, the model's ability to distinguish sepsis from other severe illness is reduced.
Site-specific fine-tuning: ESM v2 supports local calibration on site-specific historical data. Institutions that have not performed this fine-tuning before go-live will operate with a model that is not optimized for their patient population or EHR data patterns.
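The dependence of PPV on baseline incidence follows from Bayes' rule and can be illustrated in a few lines. In this sketch the 60% sensitivity matches the operating point used in the validation tables, while the 90% specificity and the three incidence values are purely hypothetical choices made for illustration.

```python
def ppv_from_prevalence(sens: float, spec: float, prevalence: float) -> float:
    """PPV via Bayes' rule for a fixed sensitivity and specificity."""
    true_pos = sens * prevalence
    false_pos = (1.0 - spec) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


SENS = 0.60   # operating point used in the tables above
SPEC = 0.90   # hypothetical specificity, for illustration only

for prevalence in (0.02, 0.05, 0.10):   # hypothetical site incidence rates
    value = ppv_from_prevalence(SENS, SPEC, prevalence)
    print(f"incidence={prevalence:.0%}  PPV={value:.2f}  NNE={1 / value:.1f}")
```

Holding the model's discrimination fixed, the lower-incidence site sees a markedly lower PPV, which is why the same threshold can feel very different in two hospitals.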

The practical implication is direct: the threshold score range of 14–37 across sites is not a reporting curiosity. It means that deploying ESM v2 with a threshold borrowed from a published study or another institution's configuration is operationally equivalent to deploying an uncalibrated tool. Local validation is not optional — it is the mechanism by which the model becomes appropriate for a specific clinical environment.
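One way to picture local threshold calibration is as a search over retrospective scores for the cutoff that reaches a target sensitivity. The following is a minimal sketch under stated assumptions, not Epic's calibration procedure: it assumes a site has the peak risk score for each historical sepsis encounter, and both the function name and the scores are hypothetical.

```python
def threshold_for_sensitivity(peak_scores_sepsis, target_pct: int = 60) -> float:
    """Highest cutoff at which at least target_pct% of sepsis encounters alert.

    peak_scores_sepsis: peak risk score per historical sepsis encounter.
    """
    ranked = sorted(peak_scores_sepsis, reverse=True)
    # Integer ceiling of target_pct% of the cases, avoiding float rounding.
    needed = -(-target_pct * len(ranked) // 100)
    return ranked[needed - 1]


# Hypothetical peak scores for ten retrospective sepsis encounters.
local_scores = [5, 12, 18, 22, 30, 35, 41, 9, 27, 16]
cutoff = threshold_for_sensitivity(local_scores, target_pct=60)
achieved = sum(s >= cutoff for s in local_scores) / len(local_scores)
print(f"cutoff={cutoff}, achieved sensitivity={achieved:.0%}")
```

A real calibration would also examine alert volume and PPV among non-sepsis encounters at the chosen cutoff, since sensitivity alone says nothing about alert burden.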

Regulatory Landscape: FDA-Cleared Sepsis AI Tools in 2026

Two AI sepsis tools have received FDA marketing authorization as of June 2026, through different pathways and with distinct clinical use cases.

[Figure: AI sepsis prediction regulatory landscape as of Q2 2026. Two FDA-cleared tools with distinct use cases, and the uncleared Epic Sepsis Model deployed at scale.]

Prenosis Sepsis ImmunoScore — De Novo, April 2024

In April 2024, the FDA granted De Novo marketing authorization (DEN230036) to the Prenosis Sepsis ImmunoScore — the first FDA-authorized AI diagnostic for sepsis. The tool is a Software as a Medical Device (SaMD) that combines 22 parameters, including blood biomarkers and clinical data, to output a risk score and four discrete risk categories for the presence of or progression to sepsis within 24 hours.

The ImmunoScore is a point-in-time diagnostic tool, not a continuous monitoring system. It requires a blood draw and is designed for use in the emergency department or hospital setting when sepsis is being considered. It integrates into the hospital EHR but does not generate continuous background alerts. This makes it operationally distinct from both ESM v2 and TREWS — it answers a different clinical question ("Does this patient have or is this patient developing sepsis?") rather than monitoring all patients for emerging deterioration.

Bayesian Health TREWS — 510(k), May 2026

In May 2026, the FDA granted 510(k) clearance (K250680) to Bayesian Health's Targeted Real-time Early Warning System (TREWS) — making it the first FDA-cleared continuous EHR-monitoring sepsis detection system. Developed by Suchi Saria at Johns Hopkins, TREWS analyzes EHR data including chief complaint, laboratory values, vital signs, procedures, and medications, generating a continuous sepsis risk flag within the health record.

TREWS received FDA Breakthrough Device Designation in 2023 prior to clearance. The 510(k) clearance positions the system for New Technology Add-on Payment (NTAP) eligibility under Medicare's inpatient payment system, a meaningful change in the economics of hospital deployment. TREWS is currently deployed at Cleveland Clinic, MemorialCare in California, and the University of Rochester School of Medicine.

Regulatory status comparison for the three most widely discussed AI sepsis prediction tools as of Q2 2026. ESM v2 deployment figures from August 2025 Epic communication. FDA clearance details should be verified against the CDRH database.
Tool	Regulatory Status	Pathway	Use Case Type	EHR Integration	Reimbursement
Epic Sepsis Model v2	Not FDA-cleared	None	Continuous EHR monitoring	Epic native	Standard inpatient billing
Prenosis Sepsis ImmunoScore	FDA De Novo authorized (DEN230036, April 2024)	De Novo	Point-in-time diagnostic	EHR-integrated	Standard diagnostic billing
Bayesian Health TREWS	FDA 510(k) cleared (K250680, May 2026)	510(k)	Continuous EHR monitoring	EHR-integrated	NTAP eligible

For a broader view of the companies developing these tools and their commercial contexts, see "AI Companies in Healthcare: A Structured Industry Reference," which provides company-level profiles of Prenosis, Bayesian Health, and other healthcare AI vendors.

Other Deployed Models: COMPOSER and TREWS Evidence

Beyond ESM v2 and the two FDA-cleared tools, two other models have published peer-reviewed outcome evidence that is relevant for institutions evaluating the sepsis AI landscape.

COMPOSER at UCSD (npj Digital Medicine, 2024)

COMPOSER is a deep-learning sepsis prediction model developed and deployed at UC San Diego Health. A before-and-after quasi-experimental study published in npj Digital Medicine evaluated the model across 6,217 adult septic patients at two UCSD emergency departments between January 2021 and April 2023. The study found a 1.9% absolute reduction (17% relative decrease) in in-hospital sepsis mortality, a 5.0% absolute increase in sepsis bundle compliance, and AUROC values of 0.938–0.945 in the ED setting.
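The absolute and relative figures can be reconciled with simple arithmetic. The sketch below backs out the baseline mortality implied by the two reported numbers; because both inputs are rounded, the result is approximate and should not be read as the study's reported baseline.

```python
absolute_reduction = 0.019   # 1.9 percentage points, as reported
relative_reduction = 0.17    # 17% relative decrease, as reported

# relative = absolute / baseline, so baseline = absolute / relative.
implied_baseline = absolute_reduction / relative_reduction
implied_post = implied_baseline - absolute_reduction

print(f"implied pre-deployment mortality  ~ {implied_baseline:.1%}")
print(f"implied post-deployment mortality ~ {implied_post:.1%}")
```

The exercise shows why both figures are worth quoting together: the same absolute change would be a much smaller relative change in a population with higher baseline mortality.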

COMPOSER uses conformal prediction to flag indeterminate cases, which the authors report reduces false alarm rates. The system is nurse-facing, delivered as a Best Practice Advisory within the EHR. COMPOSER is not FDA-cleared.

TREWS Adoption and Antibiotic Timing Evidence (Nature Medicine, 2022)

A companion study to the TREWS outcomes evidence examined provider adoption patterns and clinical timing across 9,805 retrospectively identified sepsis cases over two years at five hospitals. TREWS identified 82% of sepsis cases. Among all generated alerts, 89% were evaluated by a physician or advanced practice provider, and 38% of evaluated alerts were confirmed by the provider.
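Because the 38% confirmation rate is conditional on an alert being evaluated, the share of all alerts that end in provider confirmation is the product of the two rates. A short sketch, using only the two percentages quoted above and a hypothetical volume of 1,000 alerts:

```python
evaluated_rate = 0.89             # share of alerts evaluated by a provider
confirmed_given_evaluated = 0.38  # share of evaluated alerts confirmed

confirmed_overall = evaluated_rate * confirmed_given_evaluated


def funnel(total_alerts: int) -> dict:
    """Expected counts at each stage for a hypothetical alert volume."""
    evaluated = total_alerts * evaluated_rate
    confirmed = evaluated * confirmed_given_evaluated
    return {"alerts": total_alerts, "evaluated": evaluated, "confirmed": confirmed}


print(f"confirmed share of all alerts: {confirmed_overall:.1%}")
print(funnel(1000))
```

Provider confirmation is a workflow measure and is not the same thing as PPV against a retrospective sepsis label.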

The timing finding is the most clinically actionable result: patients whose TREWS alert was confirmed within three hours had a 1.85-hour reduction in median time to first antibiotic order compared to patients whose alert was dismissed, confirmed late, or never addressed (95% CI 1.66–2.00 hours). Emergency department providers and those with prior experience interacting with TREWS alerts were more likely to confirm alerts promptly.

The 18% relative reduction in sepsis mortality associated with TREWS — cited across multiple sources including the FDA clearance announcement — comes from a separate prospective observational multi-site study (Adams et al., Nature Medicine, 2022, covering 764,707 patient encounters across five hospitals). That study was prospective and observational, not randomized. The mortality association is meaningful but should not be interpreted as RCT-level causal evidence.

Study design and key findings for COMPOSER and TREWS. Study design is disclosed for every outcome claim. Neither COMPOSER nor TREWS has RCT mortality evidence.
Model	Study Design	Sites	Key Outcome Finding	FDA Status
COMPOSER	Before-after quasi-experimental	2 EDs, 1 health system (UCSD)	1.9% absolute mortality reduction; AUROC 0.938–0.945	Not cleared
TREWS (adoption)	Retrospective multi-site	5 hospitals	82% sensitivity; 89% alert evaluation rate; −1.85h to antibiotics when confirmed within 3h	510(k) cleared May 2026
TREWS (outcomes)	Prospective observational	5 hospitals, 764,707 encounters	18% relative reduction in sepsis mortality	510(k) cleared May 2026

Deployment Considerations: What Institutions Need Before Go-Live

The ESM v2 multicenter validation and the broader deployment literature converge on a consistent set of requirements that institutions should address before activating any AI sepsis prediction tool. The Saint Luke's Health System implementation in Kansas City — reported in an operator-authored case study published on EpicShare in January 2024 — illustrates how structured implementation can produce measurable operational improvements: the health system reported a 32% reduction in order-to-antibiotic turnaround time and a 16% reduction in sepsis mortality index after ESM v2 deployment. These figures are site-reported operational data from a non-peer-reviewed case study and should not be treated as externally validated clinical evidence, but they are consistent with the implementation principles the controlled literature supports.

Conduct local validation before go-live. Apply the model retrospectively to your own patient population to establish site-specific AUROC, PPV, and NNE before activating alerts. Do not assume published performance figures will transfer to your institution.
Calibrate the alert threshold to your patient population. The ESM v2 threshold score range of 14–37 across the four validation sites is evidence that no universal threshold exists. Threshold selection is a local clinical decision balancing sensitivity against alert burden.
Fine-tune the model on site-specific historical data. ESM v2 supports local fine-tuning. Deploying without fine-tuning means operating with a model not optimized for your EHR data patterns, patient mix, or documentation practices.
Implement alert silencing strategies. An 8-hour alert silencing window — suppressing repeat alerts for the same patient for eight hours after an initial alert — substantially reduces alert volume. The ESM v2 validation confirms this reduces clinician burden, though it does not improve PPV at the threshold level.
Design differentiated workflows for ED versus inpatient settings. Alert delivery mechanisms, response protocols, and escalation pathways differ between emergency and inpatient contexts. The Saint Luke's implementation explicitly customized workflows by care setting.
Invest in clinician education before and after activation. The TREWS adoption data show that prior experience with alert interaction significantly increases provider confirmation rates. Clinicians who understand what the model is detecting — and what it is not — respond more appropriately to alerts.
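The alert silencing step in the list above can be sketched as a simple filter over a stream of alerts. This is a minimal illustration, not any vendor's implementation: the `apply_silencing` function and the sample alerts are hypothetical, and it assumes the window is anchored at the last alert actually delivered, with an alert at exactly eight hours allowed through.

```python
from datetime import datetime, timedelta


def apply_silencing(alerts, window_hours: int = 8):
    """Suppress repeat alerts per patient within window_hours of the last kept alert.

    alerts: iterable of (patient_id, timestamp) tuples, in any order.
    """
    window = timedelta(hours=window_hours)
    last_kept = {}   # patient_id -> timestamp of last delivered alert
    kept = []
    for patient_id, ts in sorted(alerts, key=lambda a: a[1]):
        previous = last_kept.get(patient_id)
        if previous is None or ts - previous >= window:
            kept.append((patient_id, ts))
            last_kept[patient_id] = ts
    return kept


base = datetime(2025, 1, 1, 0, 0)
raw_alerts = [
    ("A", base),
    ("A", base + timedelta(hours=2)),
    ("B", base + timedelta(hours=1)),
    ("A", base + timedelta(hours=7)),
    ("A", base + timedelta(hours=9)),
]
delivered = apply_silencing(raw_alerts)
print(f"{len(raw_alerts)} raw alerts -> {len(delivered)} delivered")
```

Silencing reduces how often the same patient interrupts a clinician; it does not change which patients cross the threshold.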

Equity and Fairness: What the ESM v2 Audit Found

The 2026 multicenter ESM v2 validation included a fairness audit examining model performance across race, ethnicity, and sex subgroups. The audit found no major independent disparities in performance by any of these demographic dimensions — a more favorable finding than critics of AI sepsis tools had anticipated given the documented bias problems in other clinical prediction models.

The more nuanced finding is that model performance correlates strongly with baseline institutional sepsis incidence rate. Because sepsis incidence is not uniformly distributed across demographic groups — with higher rates documented in older patients, patients with certain comorbidities, and some racial and ethnic populations — apparent differences in outcomes across demographic groups may reflect population-level disease incidence rather than independent model bias. The audit design cannot fully separate these effects.
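The confounding described here can be demonstrated with synthetic data. In the sketch below, two hypothetical subgroups are constructed so that the model has identical sensitivity and an identical false-alert rate in both, yet PPV differs because incidence differs. The `subgroup_metrics` and `make_group` helpers and all counts are invented for illustration and do not represent the audit's method or data.

```python
from collections import defaultdict


def subgroup_metrics(records):
    """records: iterable of (subgroup, alerted, sepsis) tuples."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "n": 0})
    for group, alerted, sepsis in records:
        c = counts[group]
        c["n"] += 1
        if alerted and sepsis:
            c["tp"] += 1
        elif alerted:
            c["fp"] += 1
        elif sepsis:
            c["fn"] += 1
    results = {}
    for group, c in counts.items():
        cases = c["tp"] + c["fn"]
        alerts = c["tp"] + c["fp"]
        results[group] = {
            "incidence": cases / c["n"],
            "sensitivity": c["tp"] / cases if cases else None,
            "ppv": c["tp"] / alerts if alerts else None,
        }
    return results


def make_group(name, n, sepsis, detected, false_alerts):
    """Build synthetic encounter rows for one subgroup."""
    rows = [(name, True, True)] * detected
    rows += [(name, False, True)] * (sepsis - detected)
    rows += [(name, True, False)] * false_alerts
    rows += [(name, False, False)] * (n - sepsis - false_alerts)
    return rows


# Both groups: 60% sensitivity, 10% false-alert rate; incidence 10% vs 5%.
records = make_group("A", 100, 10, 6, 9) + make_group("B", 200, 10, 6, 19)
audit = subgroup_metrics(records)
for name, metrics in audit.items():
    print(name, metrics)
```

An audit that compares only PPV across groups would flag a disparity here even though the model behaves identically in both, which is why incidence needs to be reported alongside subgroup performance.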

Evidence Gaps, Limitations, and Active Research Directions

A December 2025 systematic review published in Critical Care Explorations examined 52 studies on AI and machine learning models for early sepsis detection in adult inpatients from 2015 to 2025. Reported AUC values ranged from 0.79 to 0.96 with a median near 0.88, and AI models consistently outperformed traditional scoring systems including SIRS, qSOFA, and MEWS. However, only approximately 40% of studies included external validation — and when models were tested on external data, performance dropped 5–10 AUC points on average.

The review identified four recurring implementation barriers: generalizability across sites, black-box interpretability concerns, EHR connectivity and alert fatigue, and insufficient prospective real-world validation. These barriers map directly onto the ESM v2 findings and the deployment considerations outlined above.

No RCT mortality evidence for the Epic Sepsis Model: As of June 2026, no randomized controlled trial has demonstrated a mortality benefit from any version of the Epic Sepsis Model. The ESM v2 multicenter validation is a prospective external validation study — not an interventional trial. The lead-time findings are real, but whether acting on those leads translates to mortality benefit in a controlled setting has not been tested.
COMPOSER and TREWS outcome data are not RCT evidence: The COMPOSER 1.9% absolute mortality reduction comes from a quasi-experimental before-after study at a single health system. The TREWS 18% relative mortality reduction comes from a prospective observational multi-site study. Both findings are meaningful signals, but neither constitutes the level of evidence that a randomized trial would provide.
Proprietary model opacity: ESM v2 is a proprietary model. Its feature weights, training data composition, and update history are not publicly disclosed. This limits independent reproducibility and makes it difficult for institutions to understand why the model generates specific alerts.
Sepsis-3 labeling imperfections: The models discussed here generally use Sepsis-3 criteria as the outcome label in training and validation. Sepsis-3 is the current clinical standard, but retrospective EHR labeling introduces noise: patients who meet Sepsis-3 criteria may not have received that diagnosis clinically, and patients who were clinically treated for sepsis may not meet the retrospective criteria. This affects performance measurement across all studies.
External validation gap: The systematic review finding that only ~40% of AI sepsis studies include external validation is a structural problem in the literature. ESM v2's February 2026 multicenter validation is a significant step, but most published AI sepsis models have been evaluated only at the institution where they were trained.

The systematic review identified several active research priorities that are likely to shape the next generation of sepsis AI tools: prospective multicenter randomized trials with mortality as the primary endpoint; explainable AI approaches that surface interpretable clinical reasoning alongside risk scores; federated learning frameworks that allow multi-institutional model training without sharing patient data; and sepsis phenotyping approaches that move beyond binary alert frameworks toward identifying clinically meaningful subtypes (four to seven have been proposed) that may respond differently to treatment.
