AI Triage CDS in the ED: FDA-Cleared Tools and Real-World Performance

The ED Triage Problem AI Is Meant to Solve

Emergency department crowding is a structural problem, not a staffing one. Even in well-resourced systems, the initial triage decision — which patient is seen first, which can wait, which is critically underprioritized — depends on a nurse's assessment conducted in minutes, often with incomplete information, under cognitive load, and against a backdrop of competing demands.

The consequences of getting that decision wrong are measurable. Mis-triage rates in emergency departments range from 15% to 33% globally — a figure that persists even in departments using validated triage instruments such as the Emergency Severity Index. The causes are well documented: human cognitive bias, inconsistent application of triage criteria, and the absence of real-time decision support at the point of assessment.

AI-based clinical decision support (CDS) is being positioned as a structural remedy. The argument is straightforward: if a machine learning model can synthesize a patient's vital signs, chief complaint, demographics, and active problems faster and more consistently than a clinician under time pressure, it can flag high-acuity patients before they deteriorate and reduce the rate of undertriage. Several tools have now received FDA clearance to operate in this space.

But regulatory clearance and clinical utility are not the same thing. The central problem this article addresses is the gap between what a tool demonstrates to achieve FDA clearance and what it actually delivers when deployed across real patient populations, care settings, and disease subtypes. That gap is not theoretical. It is documented, it is substantial, and it has direct implications for patient safety.

What Counts as FDA-Cleared AI Triage CDS

AI triage tools that influence clinical decision-making typically qualify as Software as a Medical Device (SaMD) under FDA's regulatory framework. SaMD is software intended to perform a medical purpose — such as predicting patient acuity or detecting a pathological finding on imaging — that is not part of a hardware medical device.

Most AI triage CDS tools reach the market through one of two premarket pathways. The 510(k) pathway, the most common route, requires a manufacturer to demonstrate substantial equivalence to a legally marketed predicate device. The De Novo pathway applies when no predicate exists and creates a new device classification. A third pathway, Premarket Approval (PMA), applies to the highest-risk devices and requires clinical trial evidence of safety and effectiveness; it is rarely used for AI CDS software.

FDA premarket pathways for AI SaMD. Most AI triage CDS tools are cleared via 510(k).
Pathway	What It Requires	What It Does Not Guarantee
510(k)	Substantial equivalence to a legally marketed predicate device	Real-world performance across all patient populations, subtypes, or care settings
De Novo	Reasonable assurance of safety and effectiveness for a new device type	External validation in diverse populations or post-market performance maintenance
PMA	Clinical trial evidence of safety and effectiveness (highest-risk devices)	Generalizability beyond the study population used in the submission

The CDS exemption is also relevant here. FDA policy distinguishes between software that supports clinical decisions by providing information for a clinician to independently review (often exempt from device regulation) and software that provides specific treatment or diagnostic outputs intended to replace clinician judgment (regulated as SaMD). AI triage tools that output risk scores, acuity recommendations, or condition-specific alerts with the intent to drive clinical action generally fall within the regulated SaMD category.

A Functional Taxonomy of Cleared AI Triage Tools

Not all AI tools described as "ED triage CDS" do the same thing. Conflating them leads to category errors in both clinical evaluation and procurement decisions. FDA-cleared AI tools relevant to emergency department triage fall into three distinct functional categories, each with different inputs, outputs, intended users, and clinical use cases.

Figure: Three functional categories of FDA-cleared AI triage CDS (triage acuity prediction, imaging-based condition alerting, and condition-specific risk stratification). Each operates at a different point in the ED workflow with different inputs, outputs, and intended users.

Functional taxonomy of FDA-cleared AI tools relevant to ED triage. Tools within each category are not interchangeable and should not be evaluated using the same criteria.
Category	What It Does	Primary Input	Primary User	Workflow Stage
Acuity Prediction / Patient Flow	Predicts patient acuity level and risk of critical care, emergency surgery, or admission based on triage data	Demographics, vital signs, chief complaint, active problems	Triage nurse	Initial triage assessment
Imaging-Based Condition Alerting	Detects a specific finding on medical imaging (e.g., intracranial hemorrhage on CT) and generates a priority alert	CT, MRI, or other imaging studies	Radiologist, ED physician	Post-imaging, pre-report
Condition-Specific Risk Stratification	Stratifies patients into risk categories for a specific condition (e.g., hemorrhage in trauma) using physiological signals	Continuous vital signs (heart rate, blood pressure)	Trauma team, prehospital providers	Prehospital or early ED assessment

The distinction between categories matters for evaluation. An imaging-based alerting tool is not a substitute for an acuity prediction system, and a trauma hemorrhage risk stratification tool designed for battlefield or prehospital contexts does not function as a general ED acuity scorer. Applying performance benchmarks or clinical expectations across categories produces misleading comparisons.

Key Cleared Tools: Clearance Evidence and Intended Use

Three cleared tools illustrate the range of functional categories, clearance evidence bases, and intended use contexts in this space. Each is described factually with its regulatory basis and the evidence on which clearance was granted.

Aidoc ICH Triage — Imaging-Based Intracranial Hemorrhage Alerting

Aidoc's intracranial hemorrhage (ICH) triage tool is an imaging-based alerting system that analyzes non-contrast head CT scans and generates priority notifications when hemorrhage is detected. It is designed to accelerate radiologist and emergency physician review by surfacing high-priority studies before they would otherwise reach the top of a reading queue.

The tool's FDA clearance was granted based on a dataset of 220 cases. The clearance submission reported sensitivity of 96.15% and specificity of 94.83%. Critically, the clearance dataset contained no stratification by hemorrhage subtype (acute vs. subacute vs. chronic), hemorrhage size, patient acuity, or care setting. This opacity in the clearance evidence base has direct consequences for real-world deployment, as discussed in the section that follows.

APPRAISE-HRI (K233249) — Hemorrhage Risk Stratification in Trauma

APPRAISE-HRI — the Automated Processing of the Physiological Registry for Assessment of Injury Severity Hemorrhage Risk Index — received 510(k) clearance (K233249) as the first FDA-cleared AI SaMD specifically for hemorrhage risk stratification in trauma patients. It uses continuous vital signs — heart rate and blood pressure — to stratify patients into three Hemorrhage Risk Index levels.

The tool was validated across 5,895 trauma patients (543 with hemorrhagic injuries, 5,352 controls) at 8 medical centers during ED stays or prehospital transport. Hemorrhagic patients were 6.88 times as likely as controls to be classified at level III (high risk) and 0.18 times as likely to be at level I (low risk). The tool was funded by the U.S. Army Medical Materiel Development Activity and designed explicitly for battlefield and prehospital trauma triage contexts.
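One way to read these ratios is as likelihood ratios applied to a pre-test probability. The sketch below assumes that interpretation and uses the validation cohort's own hemorrhage prevalence (543 of 5,895) as the pre-test probability. The function name is illustrative, and the resulting percentages are arithmetic on the reported numbers, not results stated by the study.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Bayes update in odds form: post-test odds = pre-test odds * LR."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)

# Cohort prevalence from the validation data: 543 hemorrhagic of 5,895 patients.
prevalence = 543 / 5895

# Assumption: the reported 6.88 and 0.18 are treated as likelihood ratios
# for a level III and a level I classification, respectively.
level_iii = post_test_probability(prevalence, 6.88)
level_i = post_test_probability(prevalence, 0.18)

print(f"pre-test probability of hemorrhage: {prevalence:.1%}")
print(f"after a level III classification:  {level_iii:.1%}")
print(f"after a level I classification:    {level_i:.1%}")
```

Under these assumptions, a level III classification moves the probability of hemorrhage from about 9% to about 41%, and a level I classification moves it to about 2%. In a population with a different hemorrhage prevalence, the same ratios would yield different post-test probabilities.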

Johns Hopkins / Radiometer AI Triage CDS — Acuity Prediction

The AI triage CDS system developed at Johns Hopkins and licensed to Radiometer (a Danaher subsidiary) represents the acuity prediction category. The system applies site-specific machine learning algorithms to triage data — including patient demographics, vital signs, chief complaint, and active problems — to generate predicted acuity levels (1 through 5) along with probability estimates for critical care need, emergency surgery, and hospital admission.

A distinctive feature of this system is its use of SHAP (SHapley Additive exPlanations) values to generate natural-language explanations accompanying each recommendation, identifying which input variables most influenced the output. Nurses retain full decision-making authority: the AI recommendation is presented as augmentative input, not a directive. This design choice is both a patient safety feature and a regulatory consideration, as it preserves the human-in-the-loop model required for CDS tools operating in high-stakes triage environments.
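To make the idea of contribution-based explanations concrete, the sketch below turns a set of per-feature contributions into a one-sentence explanation. It is not the vendor's implementation: the feature labels, the contribution values, and the `explain` function are hypothetical, and the contributions are hard-coded stand-ins for what a SHAP explainer would produce.

```python
from typing import Dict

def explain(contributions: Dict[str, float], top_k: int = 3) -> str:
    """Render the top_k contributions (by absolute size) as plain language."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    parts = []
    for name, value in ranked[:top_k]:
        direction = "raised" if value > 0 else "lowered"
        parts.append(f"{name} {direction} the predicted risk")
    return "; ".join(parts) + "."

# Hypothetical contributions for one patient (positive pushes risk up).
example = {
    "heart rate 128 bpm": 0.21,
    "chief complaint: chest pain": 0.14,
    "age 34": -0.06,
    "normal temperature": -0.01,
}

print(explain(example))
```

Ranking by absolute value keeps both risk-raising and risk-lowering factors in view, which is what lets a nurse check the model's reasoning against their own assessment.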

Real-World Performance: What Post-Deployment Studies Show

Figure: The clearance-to-deployment performance gap for Aidoc ICH Triage (clearance benchmark sensitivity 96.2% vs. real-world sensitivity 82.2%) across 101,944 head CT exams at a 17-facility health system. Subtype-level degradation is the primary driver of overall sensitivity reduction.

The most rigorous post-deployment evaluation of a cleared AI triage tool to date is a 2026 study published in npj Digital Medicine examining the Aidoc ICH Triage model across 101,944 non-contrast head CT exams from 74,142 patients at a 17-facility academic health system over a two-year period. The findings quantify the clearance-to-deployment gap with precision.

Overall real-world sensitivity was 82.2%, roughly 14 percentage points below the 96.15% clearance benchmark. Specificity held at 97.6%, modestly above the clearance figure of 94.83%. The gap in sensitivity is not uniform across the patient population: it is driven almost entirely by hemorrhage subtype, size, care setting, and compartment distribution.

Aidoc ICH Triage: clearance benchmark vs. real-world sensitivity breakdown by subtype, size, and care setting. Source: Chavoshi et al., npj Digital Medicine, 2026.
Condition / Subgroup	Sensitivity	Comparison Point
FDA clearance benchmark (overall)	96.2%	Clearance dataset: 220 cases, no subtype stratification
Real-world overall	82.2%	101,944 head CT exams, 17 facilities
Acute hemorrhage	86.2%	Best-performing subtype in real-world setting
Large hemorrhage (>10mm)	95.0%	Approaches clearance benchmark
Small hemorrhage (≤10mm)	74.8%	Substantially below clearance benchmark
Single-compartment hemorrhage	76.0%	Lower than multi-compartment cases
Subacute hemorrhage	45.5%	Severe degradation from clearance benchmark
Chronic hemorrhage	54.8%	Severe degradation from clearance benchmark
Outpatient setting	72.2%	Lowest care-setting performance
Emergency department setting	83.5%	Close to overall real-world average
Inpatient setting	82.7%	Close to overall real-world average
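
Sensitivity figures can understate how differently subgroups fare, because the clinically relevant quantity is the share of hemorrhages missed (1 minus sensitivity). The short calculation below restates part of the table as miss rates relative to the clearance benchmark. It uses only the sensitivities reported above; the variable names are illustrative.

```python
CLEARANCE_SENSITIVITY = 0.9615  # reported clearance benchmark

# Real-world sensitivities taken from the table above.
real_world = {
    "overall": 0.822,
    "large (>10mm)": 0.950,
    "small (<=10mm)": 0.748,
    "subacute": 0.455,
    "chronic": 0.548,
    "outpatient": 0.722,
}

benchmark_miss = 1.0 - CLEARANCE_SENSITIVITY

miss_ratio = {}
for subgroup, sensitivity in real_world.items():
    miss = 1.0 - sensitivity
    miss_ratio[subgroup] = miss / benchmark_miss
    print(f"{subgroup:16s} miss rate {miss:6.1%} "
          f"({miss_ratio[subgroup]:.1f}x the benchmark miss rate)")
```

On these figures the overall miss rate is roughly 4.6 times the benchmark's, and for subacute hemorrhage it is roughly 14 times.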

Multivariable analysis identified four independent predictors of model performance: hematoma size greater than 10mm (OR 3.82), acute rather than subacute or chronic acuity (OR 5.93), number of hemorrhage compartments (OR 3.25 per additional compartment), and presence of mass effect (OR 1.62). In plain terms: the model performs well on the cases it was most likely optimized for during development — large, acute, multi-compartment bleeds with mass effect — and substantially worse on the presentations that are clinically harder to identify and arguably more consequential to miss.
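
Odds ratios from a multivariable logistic model combine multiplicatively on the odds scale when no interaction terms are involved. The sketch below assumes that simple no-interaction reading (the text above does not describe the model's interaction structure) and treats each ratio as applying to the odds of detection. The helper name and the choice of reference case are hypothetical.

```python
import math

# Adjusted odds ratios as reported in the multivariable analysis.
ODDS_RATIOS = {
    "size_gt_10mm": 3.82,
    "acute": 5.93,
    "per_extra_compartment": 3.25,
    "mass_effect": 1.62,
}

def combined_odds_ratio(size_gt_10mm: bool, acute: bool,
                        extra_compartments: int, mass_effect: bool) -> float:
    """Odds relative to a small, non-acute, single-compartment bleed
    without mass effect, assuming no interaction between predictors."""
    log_or = 0.0
    if size_gt_10mm:
        log_or += math.log(ODDS_RATIOS["size_gt_10mm"])
    if acute:
        log_or += math.log(ODDS_RATIOS["acute"])
    log_or += extra_compartments * math.log(ODDS_RATIOS["per_extra_compartment"])
    if mass_effect:
        log_or += math.log(ODDS_RATIOS["mass_effect"])
    return math.exp(log_or)

large_acute = combined_odds_ratio(True, True, 0, False)
print(f"large acute bleed vs. reference case: OR ~ {large_acute:.1f}")
```

These are ratios of odds, not of probabilities: a combined odds ratio of about 22.7 does not mean sensitivity is 22.7 times higher.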

Importantly, the study found no statistically significant demographic disparities in model performance across age, sex, or race — a meaningful finding in the context of algorithmic bias concerns, though the authors note this result is specific to their population and should not be assumed to generalize universally.

The sensitivity gap is not an outlier. A broader systematic review cited in the study found that 81% of AI radiology models showed degraded performance when evaluated on independent datasets, suggesting the clearance-to-deployment gap is a structural feature of the current regulatory and development ecosystem, not a product-specific anomaly.

Why the Gap Exists: Root Causes of Clearance-to-Deployment Degradation

The performance gap documented in the Aidoc study is not primarily a technology failure. It reflects predictable structural mismatches between the conditions under which AI models are cleared and the conditions under which they are deployed.

Small and opaque clearance datasets. Aidoc's clearance relied on 220 cases with no stratification by hemorrhage subtype, size, or acuity. A dataset of this size cannot adequately represent the distribution of presentations a model will encounter across tens of thousands of real-world scans. The clearance benchmark reflects performance on a narrow, uncharacterized sample.
Subtype distribution mismatch. If a clearance dataset is disproportionately composed of acute, large hemorrhages — the easiest cases to detect — the model will appear to perform well on the metric that matters most (sensitivity) without being tested on the subtypes where failure is most consequential. Deployment populations include the full clinical spectrum.
Care-setting population differences. Outpatient head CT studies are ordered for different clinical indications than inpatient or ED scans. The patient population, imaging protocols, and prevalence of different hemorrhage types differ across care settings. A model developed and cleared primarily on ED or inpatient data can be expected to underperform in outpatient contexts, which is consistent with the 72.2% outpatient sensitivity figure.
Absence of subgroup reporting in clearance submissions. Without required subgroup reporting by hemorrhage type, size, and care setting, clearance submissions cannot reveal where a model's performance is weak. Clinicians and administrators have no basis for informed deployment decisions without this information.
Scanner and acquisition parameter variation. Real-world deployment involves heterogeneous imaging equipment and acquisition protocols across facilities. The npj Digital Medicine study authors note this as a limitation they could not fully analyze — a reminder that technical factors beyond patient population drive model performance.
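
The dataset-size point can be quantified. A sensitivity estimate is computed only from the hemorrhage-positive cases in a dataset, so its uncertainty depends on how many of the 220 clearance cases were positive, a number not given above. The sketch below computes a Wilson 95% confidence interval for several hypothetical positive-case counts that each yield 96.15%. The counts are assumptions chosen for illustration, not figures from the clearance submission.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical (detected, positives) pairs, each giving 96.15% sensitivity.
for detected, positives in [(25, 26), (50, 52), (100, 104)]:
    low, high = wilson_interval(detected, positives)
    print(f"{detected}/{positives}: sensitivity {detected / positives:.2%}, "
          f"95% CI {low:.1%} to {high:.1%}")
```

Even with about a hundred positives the interval spans several percentage points, and it widens as the positive count falls. Interval width is only part of the problem: no confidence interval corrects for a case mix that differs from the deployment population.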

Economic Evidence: What AI Triage CDS Implementation Data Shows

A 2026 economic evaluation published in JMIR examined the operational and financial impact of implementing the Johns Hopkins/Radiometer AI triage CDS system across 3 emergency departments (170,723 visits analyzed, 180-day pre/post periods). The study reported a 9.6% increase in ED visit volume following implementation, with a corresponding $15.4 million increase in revenue.

The translation of revenue to operating margin depended heavily on the cost modeling framework applied. Under a hospital management framework (assuming 70% fixed costs), $12.6 million of the $15.4 million revenue increase translated to operating margin gain. Under a public policy cost-per-visit framework, the margin gain was only $0.9 million. Break-even thresholds differed dramatically: $66.02 per visit under hospital management versus $4.69 per visit under the public policy framework.

Economic outcomes from AI triage CDS implementation across 3 EDs. Results vary substantially by cost modeling framework. Source: Levin et al., JMIR, 2026.
Metric	Hospital Management Framework	Public Policy Framework
ED visit volume increase	9.6%	9.6%
Revenue increase	$15.4M	$15.4M
Operating margin gain	$12.6M	$0.9M
Break-even threshold per visit	$66.02	$4.69
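
The two frameworks can be compared by backing out the incremental cost each one implies. The calculation below uses only the revenue and margin figures in the table. "Implied cost" here is simply revenue increase minus margin gain, a derived quantity and not a number reported by the study.

```python
REVENUE_INCREASE_M = 15.4  # $ millions, identical under both frameworks

margin_gain_m = {
    "hospital management": 12.6,
    "public policy": 0.9,
}

implied = {}
for framework, margin in margin_gain_m.items():
    cost = REVENUE_INCREASE_M - margin      # implied incremental cost, $M
    share = margin / REVENUE_INCREASE_M     # share of new revenue kept as margin
    implied[framework] = (cost, share)
    print(f"{framework:20s} implied cost ${cost:.1f}M, "
          f"margin share {share:.1%}")
```

The same revenue increase retains roughly 82% as margin under one framework and roughly 6% under the other, which is why the choice of cost model dominates the economic conclusion.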

Conflict of interest disclosure for this study is extensive and must be weighed in interpreting its findings. Lead authors Scott Levin and Jeremiah Hinson are employed by Danaher Diagnostics. The AI triage CDS technology was developed at Johns Hopkins and licensed to Radiometer, a Danaher subsidiary; Johns Hopkins University, Levin, and Hinson are entitled to royalty distributions from this licensing arrangement. Co-authors Andrew Taylor and Rohit Sangal received research funding from Beckman Coulter, also a Danaher subsidiary. These relationships create direct financial incentives aligned with favorable study outcomes. The economic findings should be interpreted as specific to 3 capacity-constrained EDs in one health system and should not be generalized to all AI triage CDS implementations.

Implementation Considerations: Workflow, Autonomy, and Automation Bias

AI triage CDS tools do not self-implement. Their clinical impact — positive or negative — depends on how they are integrated into existing nurse-led triage workflows, how their outputs are communicated, and whether the humans using them understand their limitations.

Figure: AI triage CDS operates as a decision-support layer within nurse-led triage workflows. The human-in-the-loop model is both a design requirement and a patient safety safeguard.

Preserve nurse decision authority. The Johns Hopkins/Radiometer system is designed so that nurses retain full decision-making autonomy — the AI recommendation is one input among several, not a directive. This design is critical: triage is a clinical judgment, not a classification task. Systems that present AI output as authoritative rather than advisory undermine the human oversight that catches model errors.
Use explainability features actively. SHAP-value-derived natural-language explanations — which identify the specific input variables driving each AI recommendation — give nurses a basis for evaluating whether the model's reasoning aligns with their clinical assessment. Explainability is not a cosmetic feature; it is the mechanism by which clinicians can identify when a model is wrong.
Monitor for automation bias in low-performance contexts. The settings where the Aidoc ICH model performed worst (outpatient imaging, subacute and chronic hemorrhage, small bleeds) are also settings where clinicians may be most likely to defer to AI output because the presentations are subtle. Automation bias, the tendency to accept AI recommendations uncritically, is most dangerous precisely where model performance is lowest.
Plan for ongoing post-deployment monitoring. A tool that performs adequately at deployment may degrade over time as patient population characteristics, imaging protocols, or disease prevalence shift. Post-deployment monitoring — including sensitivity tracking by subtype and care setting — is not optional. It is the only way to detect model drift before it causes harm.
Account for site-specific calibration requirements. The Johns Hopkins/Radiometer system uses site-specific ML algorithms — meaning the model is trained or calibrated on data from the deploying institution. This is a meaningful design choice that reduces distribution shift, but it also means performance at one site cannot be assumed at another. Multicenter validation of site-specific systems requires each site's data to be assessed independently.
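
Subgroup monitoring of the kind described above can be sketched in a few lines. The example assumes each adjudicated case is recorded with a subgroup label, a ground-truth flag, and the model's output. The record layout, the `subgroup_sensitivity` and `flag_low` names, and the 0.80 alert threshold are all hypothetical choices for illustration, not vendor features or recommended values.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (subgroup label, condition truly present, model flagged the case)
Case = Tuple[str, bool, bool]

def subgroup_sensitivity(cases: Iterable[Case]) -> Dict[str, float]:
    """Sensitivity per subgroup: detected positives / all true positives."""
    detected = defaultdict(int)
    positives = defaultdict(int)
    for subgroup, truth, flagged in cases:
        if truth:
            positives[subgroup] += 1
            if flagged:
                detected[subgroup] += 1
    return {g: detected[g] / positives[g] for g in positives}

def flag_low(sensitivity: Dict[str, float], threshold: float = 0.80) -> List[str]:
    """Subgroups whose sensitivity falls below the (assumed) alert threshold."""
    return sorted(g for g, s in sensitivity.items() if s < threshold)

# Tiny synthetic example; real monitoring would use adjudicated site data.
cases: List[Case] = (
    [("acute", True, True)] * 5
    + [("acute", True, False)]
    + [("acute", False, False)]
    + [("subacute", True, True), ("subacute", True, False)]
)

sens = subgroup_sensitivity(cases)
print(sens)
print("needs review:", flag_low(sens))
```

Cases without the condition are ignored here because they do not enter a sensitivity calculation; a fuller monitor would track specificity the same way and account for small subgroup counts before raising an alert.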

An Evaluation Framework: Questions to Ask Before Adopting a Cleared AI Triage Tool

FDA clearance is a regulatory floor, not a clinical ceiling. A tool cleared through the 510(k) pathway has demonstrated substantial equivalence to a predicate device under the conditions of its submission. It has not demonstrated that it will perform at its clearance benchmark in your patient population, your care setting, or across the full spectrum of presentations your clinicians will encounter.

The following questions provide a structured framework for evaluating any FDA-cleared AI triage CDS tool before adoption.

A structured evaluation framework for FDA-cleared AI triage CDS tools. Regulatory clearance should be treated as a starting point for evaluation, not a conclusion.
Evaluation Domain	Questions to Ask	Why It Matters
Clearance dataset	How many cases were in the clearance dataset? Were subtypes, sizes, acuity levels, and demographics reported?	Small, unstratified datasets produce clearance benchmarks that do not reflect real-world performance across the full case spectrum
External validation	Has the tool been validated on an independent external dataset? Was that dataset multicenter and demographically diverse?	Single-center or manufacturer-run validation studies have known generalizability limitations
Subgroup performance	Are sensitivity and specificity reported separately for relevant subgroups (e.g., hemorrhage subtype, lesion size, care setting, age, race)?	Aggregate metrics conceal subtype-level degradation that directly affects patient safety in specific clinical contexts
Care-setting match	Was the tool validated in the same care setting (ED, outpatient, inpatient) where you plan to deploy it?	Care-setting population differences drive substantial performance variation — outpatient sensitivity for Aidoc ICH Triage was 72.2% vs. 83.5% in the ED
Post-deployment evidence	Has peer-reviewed post-deployment evidence been published? By whom, and with what funding?	Manufacturer-sponsored studies without independent replication have known bias risk; independent multicenter studies carry more weight
Conflict of interest	Are there financial relationships between the study authors and the tool manufacturer or licensor?	Conflicts of interest in supporting studies affect the reliability of reported outcomes and should be disclosed and weighed
Clinical decision authority	Does the tool preserve full clinician decision-making authority, or does its interface design create pressure to accept AI recommendations?	Tools that present AI output as directive rather than advisory increase automation bias risk
Explainability	Does the tool provide explanations for its recommendations that clinicians can evaluate and override?	Unexplained outputs cannot be critically assessed; explainability is the mechanism by which clinicians catch model errors
Post-market monitoring plan	Does the vendor provide post-market performance monitoring? What triggers a performance review or alert?	Model drift can cause performance degradation over time; monitoring is the only way to detect it before it affects patient outcomes
Functional category fit	Is the tool designed for the specific clinical task you need — acuity prediction, imaging alerting, or condition-specific risk stratification?	Deploying a tool in a context outside its intended use (e.g., using a trauma hemorrhage tool as a general ED acuity scorer) produces unpredictable and potentially dangerous results

The evidence available as of mid-2026 supports a clear conclusion: FDA-cleared AI triage CDS tools are not interchangeable, their clearance benchmarks do not reliably predict real-world performance, and the gap between the two is driven by identifiable and addressable structural factors. Clinicians and administrators who treat clearance as sufficient validation for deployment are accepting risks that post-deployment evidence has already quantified. The tools in this space can add clinical value — but only when evaluated with the same rigor applied to any other diagnostic or decision-support technology entering the emergency department.
