The ED Triage Problem AI Is Meant to Solve
Emergency department crowding is a structural problem, not a staffing one. Even in well-resourced systems, the initial triage decision (which patient is seen first and which can wait) depends on a nurse's assessment conducted in minutes, often with incomplete information, under cognitive load, and against a backdrop of competing demands. When that assessment underestimates severity, a critically ill patient is underprioritized.
The consequences of getting that decision wrong are measurable. Mis-triage rates in emergency departments range from 15% to 33% globally — a figure that persists even in departments using validated triage instruments such as the Emergency Severity Index. The causes are well documented: human cognitive bias, inconsistent application of triage criteria, and the absence of real-time decision support at the point of assessment.
AI-based clinical decision support (CDS) is being positioned as a structural remedy. The argument is straightforward: if a machine learning model can synthesize a patient's vital signs, chief complaint, demographics, and active problems faster and more consistently than a clinician under time pressure, it can flag high-acuity patients before they deteriorate and reduce the rate of undertriage. Several tools have now received FDA clearance to operate in this space.
But regulatory clearance and clinical utility are not the same thing. The central problem this article addresses is the gap between what a tool demonstrates to achieve FDA clearance and what it actually delivers when deployed across real patient populations, care settings, and disease subtypes. That gap is not theoretical. It is documented, it is substantial, and it has direct implications for patient safety.
What Counts as FDA-Cleared AI Triage CDS
AI triage tools that influence clinical decision-making typically qualify as Software as a Medical Device (SaMD) under FDA's regulatory framework. SaMD is software intended to perform a medical purpose — such as predicting patient acuity or detecting a pathological finding on imaging — that is not part of a hardware medical device.
Most AI triage CDS tools reach the market through one of two premarket pathways. The 510(k) pathway, the most common route, requires a manufacturer to demonstrate substantial equivalence to a legally marketed predicate device. The De Novo pathway applies when no predicate exists and creates a new device classification. A third pathway, Premarket Approval (PMA), applies to the highest-risk devices and requires clinical trial evidence of safety and effectiveness; it is rarely used for AI CDS software.
| Pathway | What It Requires | What It Does Not Guarantee |
|---|---|---|
| 510(k) | Substantial equivalence to a legally marketed predicate device | Real-world performance across all patient populations, subtypes, or care settings |
| De Novo | Reasonable assurance of safety and effectiveness for a new device type | External validation in diverse populations or post-market performance maintenance |
| PMA | Clinical trial evidence of safety and effectiveness (highest-risk devices) | Generalizability beyond the study population used in the submission |
The CDS exemption is also relevant here. FDA policy distinguishes between software that supports clinical decisions by providing information for a clinician to independently review (often exempt from device regulation) and software that provides specific treatment or diagnostic outputs intended to replace clinician judgment (regulated as SaMD). AI triage tools that output risk scores, acuity recommendations, or condition-specific alerts with the intent to drive clinical action generally fall within the regulated SaMD category.
A Functional Taxonomy of Cleared AI Triage Tools
Not all AI tools described as "ED triage CDS" do the same thing. Conflating them leads to category errors in both clinical evaluation and procurement decisions. FDA-cleared AI tools relevant to emergency department triage fall into three distinct functional categories, each with different inputs, outputs, intended users, and clinical use cases.

| Category | What It Does | Primary Input | Primary User | Workflow Stage |
|---|---|---|---|---|
| Acuity Prediction / Patient Flow | Predicts patient acuity level and risk of critical care, emergency surgery, or admission based on triage data | Demographics, vital signs, chief complaint, active problems | Triage nurse | Initial triage assessment |
| Imaging-Based Condition Alerting | Detects a specific finding on medical imaging (e.g., intracranial hemorrhage on CT) and generates a priority alert | CT, MRI, or other imaging studies | Radiologist, ED physician | Post-imaging, pre-report |
| Condition-Specific Risk Stratification | Stratifies patients into risk categories for a specific condition (e.g., hemorrhage in trauma) using physiological signals | Continuous vital signs (heart rate, blood pressure) | Trauma team, prehospital providers | Prehospital or early ED assessment |
The distinction between categories matters for evaluation. An imaging-based alerting tool is not a substitute for an acuity prediction system, and a trauma hemorrhage risk stratification tool designed for battlefield or prehospital contexts does not function as a general ED acuity scorer. Applying performance benchmarks or clinical expectations across categories produces misleading comparisons.
Key Cleared Tools: Clearance Evidence and Intended Use
Three cleared tools illustrate the range of functional categories, clearance evidence bases, and intended use contexts in this space. Each is described factually with its regulatory basis and the evidence on which clearance was granted.
Aidoc ICH Triage — Imaging-Based Intracranial Hemorrhage Alerting
Aidoc's intracranial hemorrhage (ICH) triage tool is an imaging-based alerting system that analyzes non-contrast head CT scans and generates priority notifications when hemorrhage is detected. It is designed to accelerate radiologist and emergency physician review by surfacing high-priority studies before they would otherwise reach the top of a reading queue.
The tool's FDA clearance was granted based on a dataset of 220 cases. The clearance submission reported sensitivity of 96.15% and specificity of 94.83%. Critically, the clearance dataset contained no stratification by hemorrhage subtype (acute vs. subacute vs. chronic), hemorrhage size, patient acuity, or care setting. This opacity in the clearance evidence base has direct consequences for real-world deployment, as discussed in the section that follows.
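To make the small-sample concern concrete, the sketch below computes a Wilson score interval for a sensitivity estimated from a dataset of this size. The source reports only the 220-case total and the two percentages; the split of 104 hemorrhage-positive and 116 negative cases used here is an assumption, chosen because it reproduces the reported 96.15% and 94.83%, and it is not stated in the clearance summary.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# ASSUMPTION (not from the source): 104 positive cases, 100 of them detected.
# 100/104 = 96.15%, which matches the reported sensitivity.
ASSUMED_POSITIVES = 104
ASSUMED_DETECTED = 100

low, high = wilson_interval(ASSUMED_DETECTED, ASSUMED_POSITIVES)
print(f"sensitivity {ASSUMED_DETECTED / ASSUMED_POSITIVES:.2%}, "
      f"95% interval {low:.1%} to {high:.1%}")
```

Under that assumed split the interval runs from roughly 90.5% to 98.5%, so a point estimate from about a hundred positive cases carries several percentage points of uncertainty before any question of case mix arises.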
APPRAISE-HRI (K233249) — Hemorrhage Risk Stratification in Trauma
APPRAISE-HRI — the Automated Processing of the Physiological Registry for Assessment of Injury Severity Hemorrhage Risk Index — received 510(k) clearance (K233249) as the first FDA-cleared AI SaMD specifically for hemorrhage risk stratification in trauma patients. It uses continuous vital signs — heart rate and blood pressure — to stratify patients into three Hemorrhage Risk Index levels.
The tool was validated across 5,895 trauma patients (543 with hemorrhagic injuries, 5,352 controls) at 8 medical centers during ED stays or prehospital transport. Hemorrhagic patients were 6.88 times as likely as controls to be classified at level III (high risk) and 0.18 times as likely to be at level I (low risk). The tool was funded by the U.S. Army Medical Materiel Development Activity and designed explicitly for battlefield and prehospital trauma triage contexts.
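One way to read these ratios is as likelihood ratios for each risk level, which can then be combined with a pre-test probability using the odds form of Bayes' rule. The sketch below does this with the validation cohort's own prevalence (543 of 5,895). Treating the reported ratios as likelihood ratios is an interpretation, and the 2% alternative prevalence is a hypothetical value included only to show how strongly the result depends on the setting.

```python
def post_test_probability(pretest_prob: float, likelihood_ratio: float) -> float:
    """Odds form of Bayes' rule: post-test odds = pre-test odds * likelihood ratio."""
    if not 0.0 < pretest_prob < 1.0:
        raise ValueError("pretest_prob must be strictly between 0 and 1")
    pretest_odds = pretest_prob / (1.0 - pretest_prob)
    post_odds = pretest_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# From the validation cohort described above.
STUDY_PREVALENCE = 543 / 5895
LR_LEVEL_III = 6.88   # hemorrhagic vs. control patients classified high risk
LR_LEVEL_I = 0.18     # hemorrhagic vs. control patients classified low risk

# HYPOTHETICAL lower-prevalence setting, for comparison only.
HYPOTHETICAL_PREVALENCE = 0.02

for prevalence in (STUDY_PREVALENCE, HYPOTHETICAL_PREVALENCE):
    high_risk = post_test_probability(prevalence, LR_LEVEL_III)
    low_risk = post_test_probability(prevalence, LR_LEVEL_I)
    print(f"prevalence {prevalence:.1%}: level III -> {high_risk:.1%}, "
          f"level I -> {low_risk:.1%}")
```

At the cohort prevalence, a level III classification corresponds to a post-test probability of about 41% and a level I classification to about 2%; at lower prevalence the same ratios yield much lower absolute probabilities.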
Johns Hopkins / Radiometer AI Triage CDS — Acuity Prediction
The AI triage CDS system developed at Johns Hopkins and licensed to Radiometer (a Danaher subsidiary) represents the acuity prediction category. The system applies site-specific machine learning algorithms to triage data — including patient demographics, vital signs, chief complaint, and active problems — to generate predicted acuity levels (1 through 5) along with probability estimates for critical care need, emergency surgery, and hospital admission.
A distinctive feature of this system is its use of SHAP (SHapley Additive exPlanations) values to generate natural-language explanations accompanying each recommendation, identifying which input variables most influenced the output. Nurses retain full decision-making authority: the AI recommendation is presented as advisory input, not a directive. This design choice is both a patient safety feature and a regulatory consideration, as it preserves the human-in-the-loop model required for CDS tools operating in high-stakes triage environments.
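The idea behind Shapley-based explanations can be illustrated with a toy example. The sketch below computes exact Shapley values by enumerating feature orderings against a single reference patient, then renders the largest contributions as a sentence. Everything in it is hypothetical: the scoring rules, thresholds, baseline, and wording are invented for illustration and are not the Johns Hopkins/Radiometer model, and the SHAP library itself averages over a background dataset rather than one baseline.

```python
from itertools import permutations
from math import factorial


def toy_risk_score(features: dict) -> float:
    """HYPOTHETICAL rule-based score, used only to have something to explain."""
    tachycardic = features["heart_rate"] > 110
    hypotensive = features["systolic_bp"] < 90
    score = 0.0
    if tachycardic:
        score += 0.30
    if hypotensive:
        score += 0.40
    if tachycardic and hypotensive:
        score += 0.10  # interaction term shared between the two features
    if features["age"] >= 65:
        score += 0.10
    return score


# HYPOTHETICAL reference patient that contributions are measured against.
BASELINE = {"heart_rate": 80, "systolic_bp": 120, "age": 40}


def exact_shapley(model, patient: dict, baseline: dict) -> dict:
    """Average each feature's marginal contribution over all feature orderings."""
    names = list(patient)
    totals = {name: 0.0 for name in names}
    for order in permutations(names):
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = patient[name]
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / factorial(len(names)) for name, total in totals.items()}


def explain(contributions: dict, top_k: int = 2) -> str:
    """Turn the largest contributions into a short sentence."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    parts = [f"{name.replace('_', ' ')} ({value:+.2f})" for name, value in ranked[:top_k]]
    return "Main drivers of this score: " + ", ".join(parts)


patient = {"heart_rate": 128, "systolic_bp": 84, "age": 71}
values = exact_shapley(toy_risk_score, patient, BASELINE)
print(explain(values))
```

The contributions sum to the difference between the patient's score and the baseline score, which is the property that lets an explanation account for the whole prediction rather than a selected part of it.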
Real-World Performance: What Post-Deployment Studies Show

The most rigorous post-deployment evaluation of a cleared AI triage tool to date is a 2026 study published in npj Digital Medicine examining the Aidoc ICH Triage model across 101,944 non-contrast head CT exams from 74,142 patients at a 17-facility academic health system over a two-year period. The findings quantify the clearance-to-deployment gap with precision.
Overall real-world sensitivity was 82.2%, about 14 percentage points below the 96.15% clearance benchmark. Specificity held at 97.6%, modestly above the clearance figure of 94.83%. The gap in sensitivity is not uniform across the patient population: it is driven almost entirely by hemorrhage subtype, size, care setting, and compartment distribution.
| Condition / Subgroup | Sensitivity | Comparison Point |
|---|---|---|
| FDA clearance benchmark (overall) | 96.2% | Clearance dataset: 220 cases, no subtype stratification |
| Real-world overall | 82.2% | 101,944 head CT exams, 17 facilities |
| Acute hemorrhage | 86.2% | Best-performing subtype in real-world setting |
| Large hemorrhage (>10mm) | 95.0% | Approaches clearance benchmark |
| Small hemorrhage (≤10mm) | 74.8% | Substantially below clearance benchmark |
| Single-compartment hemorrhage | 76.0% | Lower than multi-compartment cases |
| Subacute hemorrhage | 45.5% | Severe degradation from clearance benchmark |
| Chronic hemorrhage | 54.8% | Severe degradation from clearance benchmark |
| Outpatient setting | 72.2% | Lowest care-setting performance |
| Emergency department setting | 83.5% | Close to overall real-world average |
| Inpatient setting | 82.7% | Close to overall real-world average |
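The sensitivities in the table above are easier to compare when restated as expected misses. The short sketch below converts a selection of them into missed cases per 1,000 true hemorrhages, using the relationship misses = (1 − sensitivity) × cases. The 1,000-case denominator is an arbitrary illustration, not a count from the study.

```python
# Sensitivities taken from the table above (clearance figure as reported: 96.15%).
SENSITIVITY = {
    "FDA clearance benchmark": 0.9615,
    "Large hemorrhage (>10mm)": 0.950,
    "Real-world overall": 0.822,
    "Small hemorrhage (<=10mm)": 0.748,
    "Outpatient setting": 0.722,
    "Chronic hemorrhage": 0.548,
    "Subacute hemorrhage": 0.455,
}


def expected_misses(sensitivity: float, true_cases: int = 1000) -> float:
    """Expected false negatives among `true_cases` hemorrhage-positive exams."""
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0 and 1")
    return (1.0 - sensitivity) * true_cases


for label, sens in sorted(SENSITIVITY.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<28} {sens:6.1%}  about {expected_misses(sens):.0f} missed per 1,000")
```

Framed this way, the clearance benchmark implies roughly 38 misses per 1,000 true hemorrhages, the real-world overall figure implies 178, and the subacute figure implies 545.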
Multivariable analysis identified four independent predictors of model performance: hematoma size greater than 10mm (OR 3.82), acute rather than subacute or chronic hemorrhage (OR 5.93), number of hemorrhage compartments (OR 3.25 per additional compartment), and presence of mass effect (OR 1.62). In plain terms: the model performs well on the cases it was most likely optimized for during development (large, acute, multi-compartment bleeds with mass effect) and substantially worse on the presentations that are clinically harder to identify and arguably more consequential to miss.
Importantly, the study found no statistically significant demographic disparities in model performance across age, sex, or race — a meaningful finding in the context of algorithmic bias concerns, though the authors note this result is specific to their population and should not be assumed to generalize universally.
The sensitivity gap is not an outlier. A broader systematic review cited in the study found that 81% of AI radiology models showed degraded performance when evaluated on independent datasets, suggesting the clearance-to-deployment gap is a structural feature of the current regulatory and development ecosystem, not a product-specific anomaly.
Why the Gap Exists: Root Causes of Clearance-to-Deployment Degradation
The performance gap documented in the Aidoc study is not primarily a technology failure. It reflects predictable structural mismatches between the conditions under which AI models are cleared and the conditions under which they are deployed.
- Small and opaque clearance datasets. Aidoc's clearance relied on 220 cases with no stratification by hemorrhage subtype, size, or acuity. A dataset of this size cannot adequately represent the distribution of presentations a model will encounter across tens of thousands of real-world scans. The clearance benchmark reflects performance on a narrow, uncharacterized sample.
- Subtype distribution mismatch. If a clearance dataset is disproportionately composed of acute, large hemorrhages — the easiest cases to detect — the model will appear to perform well on the metric that matters most (sensitivity) without being tested on the subtypes where failure is most consequential. Deployment populations include the full clinical spectrum.
- Care-setting population differences. Outpatient head CT studies are ordered for different clinical indications than inpatient or ED scans. The patient population, imaging protocols, and prevalence of different hemorrhage types differ across care settings. A model developed and cleared primarily on ED or inpatient data can be expected to underperform in outpatient contexts, which is consistent with the 72.2% outpatient sensitivity figure.
- Absence of subgroup reporting in clearance submissions. Without required subgroup reporting by hemorrhage type, size, and care setting, clearance submissions cannot reveal where a model's performance is weak. Clinicians and administrators have no basis for informed deployment decisions without this information.
- Scanner and acquisition parameter variation. Real-world deployment involves heterogeneous imaging equipment and acquisition protocols across facilities. The npj Digital Medicine study authors note this as a limitation they could not fully analyze — a reminder that technical factors beyond patient population drive model performance.
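The subtype distribution mismatch described above can be shown with a few lines of arithmetic: aggregate sensitivity is a case-mix-weighted average of the subtype sensitivities, so the same model produces different headline numbers on different mixes. The subtype sensitivities below are the real-world figures reported earlier; both case mixes are hypothetical, since neither the clearance dataset's mix nor the study's exact mix is given here.

```python
# Real-world subtype sensitivities reported in the post-deployment study.
SUBTYPE_SENSITIVITY = {"acute": 0.862, "subacute": 0.455, "chronic": 0.548}

# HYPOTHETICAL case mixes (shares of true hemorrhages by subtype).
ACUTE_HEAVY_MIX = {"acute": 0.95, "subacute": 0.03, "chronic": 0.02}
BROADER_MIX = {"acute": 0.70, "subacute": 0.15, "chronic": 0.15}


def aggregate_sensitivity(case_mix: dict, subtype_sensitivity: dict) -> float:
    """Case-mix-weighted average of per-subtype sensitivities."""
    if abs(sum(case_mix.values()) - 1.0) > 1e-9:
        raise ValueError("case mix shares must sum to 1")
    return sum(share * subtype_sensitivity[subtype]
               for subtype, share in case_mix.items())


for name, mix in (("acute-heavy", ACUTE_HEAVY_MIX), ("broader", BROADER_MIX)):
    print(f"{name:<12} mix -> aggregate sensitivity "
          f"{aggregate_sensitivity(mix, SUBTYPE_SENSITIVITY):.1%}")
```

With identical per-subtype performance, the acute-heavy mix yields about 84% and the broader mix about 75%, which is why an aggregate figure is hard to interpret without the subtype breakdown behind it.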
Economic Evidence: What AI Triage CDS Implementation Data Shows
A 2026 economic evaluation published in JMIR examined the operational and financial impact of implementing the Johns Hopkins/Radiometer AI triage CDS system across 3 emergency departments (170,723 visits analyzed, 180-day pre/post periods). The study reported a 9.6% increase in ED visit volume following implementation, with a corresponding $15.4 million increase in revenue.
The translation of revenue to operating margin depended heavily on the cost modeling framework applied. Under a hospital management framework (assuming 70% fixed costs), $12.6 million of the $15.4 million revenue increase translated to operating margin gain. Under a public policy cost-per-visit framework, the margin gain was only $0.9 million. Break-even thresholds differed dramatically: $66.02 per visit under hospital management versus $4.69 per visit under the public policy framework.
| Metric | Hospital Management Framework | Public Policy Framework |
|---|---|---|
| ED visit volume increase | 9.6% | 9.6% |
| Revenue increase | $15.4M | $15.4M |
| Operating margin gain | $12.6M | $0.9M |
| Break-even threshold per visit | $66.02 | $4.69 |
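The difference between the two frameworks is easiest to see as the share of new revenue that each one counts as margin. The sketch below derives that share, and the implied cost attributed to the additional visits, from the figures in the table. It is only arithmetic on the reported totals and does not reproduce the study's cost model.

```python
# Figures from the table above, in millions of US dollars.
REVENUE_INCREASE_M = 15.4
MARGIN_GAIN_M = {"hospital management": 12.6, "public policy": 0.9}


def margin_share(margin_gain: float, revenue_increase: float) -> float:
    """Fraction of the revenue increase that is counted as operating margin."""
    if revenue_increase <= 0:
        raise ValueError("revenue_increase must be positive")
    return margin_gain / revenue_increase


for framework, margin in MARGIN_GAIN_M.items():
    share = margin_share(margin, REVENUE_INCREASE_M)
    implied_cost = REVENUE_INCREASE_M - margin
    print(f"{framework:<20} margin share {share:.1%}, "
          f"implied added cost ${implied_cost:.1f}M")
```

The hospital management framework treats about 82% of the added revenue as margin, while the public policy framework treats about 6% as margin, so the choice of framework matters more than any other input to the business case.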
Implementation Considerations: Workflow, Autonomy, and Automation Bias
AI triage CDS tools do not self-implement. Their clinical impact — positive or negative — depends on how they are integrated into existing nurse-led triage workflows, how their outputs are communicated, and whether the humans using them understand their limitations.

- Preserve nurse decision authority. The Johns Hopkins/Radiometer system is designed so that nurses retain full decision-making autonomy — the AI recommendation is one input among several, not a directive. This design is critical: triage is a clinical judgment, not a classification task. Systems that present AI output as authoritative rather than advisory undermine the human oversight that catches model errors.
- Use explainability features actively. SHAP-value-derived natural-language explanations — which identify the specific input variables driving each AI recommendation — give nurses a basis for evaluating whether the model's reasoning aligns with their clinical assessment. Explainability is not a cosmetic feature; it is the mechanism by which clinicians can identify when a model is wrong.
- Monitor for automation bias in low-performance contexts. The settings where AI triage models perform worst — outpatient imaging, subacute and chronic hemorrhage, small bleeds — are also the settings where clinicians may be most likely to defer to AI output because the presentations are subtle. Automation bias (the tendency to accept AI recommendations uncritically) is most dangerous precisely where model performance is lowest.
- Plan for ongoing post-deployment monitoring. A tool that performs adequately at deployment may degrade over time as patient population characteristics, imaging protocols, or disease prevalence shift. Post-deployment monitoring — including sensitivity tracking by subtype and care setting — is not optional. It is the only way to detect model drift before it causes harm.
- Account for site-specific calibration requirements. The Johns Hopkins/Radiometer system uses site-specific ML algorithms — meaning the model is trained or calibrated on data from the deploying institution. This is a meaningful design choice that reduces distribution shift, but it also means performance at one site cannot be assumed at another. Multicenter validation of site-specific systems requires each site's data to be assessed independently.
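The monitoring point above, sensitivity tracking by subtype and care setting, can be sketched as a small stratified calculation. The record structure, field names, sample exams, and the 80% review floor below are all hypothetical; a real program would draw reference-standard positives from final radiology reports and would attach confidence intervals to small groups before acting on them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PositiveExam:
    """One exam the reference standard called positive (HYPOTHETICAL schema)."""
    care_setting: str
    subtype: str
    model_flagged: bool


def sensitivity_by(exams, attribute: str) -> dict:
    """Sensitivity within each value of `attribute` (e.g. 'subtype')."""
    hits, totals = defaultdict(int), defaultdict(int)
    for exam in exams:
        group = getattr(exam, attribute)
        totals[group] += 1
        hits[group] += int(exam.model_flagged)
    return {group: hits[group] / totals[group] for group in totals}


def groups_below(sensitivities: dict, floor: float) -> list:
    """Groups whose sensitivity falls under the review floor, sorted by name."""
    return sorted(group for group, value in sensitivities.items() if value < floor)


# HYPOTHETICAL sample of reference-standard positive exams.
SAMPLE = [
    PositiveExam("ED", "acute", True),
    PositiveExam("ED", "acute", True),
    PositiveExam("ED", "subacute", False),
    PositiveExam("outpatient", "chronic", False),
    PositiveExam("outpatient", "chronic", True),
    PositiveExam("outpatient", "acute", True),
]

by_subtype = sensitivity_by(SAMPLE, "subtype")
print(by_subtype)
print("review needed:", groups_below(by_subtype, floor=0.80))
```

Running the same function with "care_setting" gives the care-setting view, so one routine covers both of the stratifications named in the list.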
An Evaluation Framework: Questions to Ask Before Adopting a Cleared AI Triage Tool
FDA clearance is a regulatory floor, not a clinical ceiling. A tool cleared through the 510(k) pathway has demonstrated substantial equivalence to a predicate device under the conditions of its submission. It has not demonstrated that it will perform at its clearance benchmark in your patient population, your care setting, or across the full spectrum of presentations your clinicians will encounter.
The following questions provide a structured framework for evaluating any FDA-cleared AI triage CDS tool before adoption.
| Evaluation Domain | Questions to Ask | Why It Matters |
|---|---|---|
| Clearance dataset | How many cases were in the clearance dataset? Were subtypes, sizes, acuity levels, and demographics reported? | Small, unstratified datasets produce clearance benchmarks that do not reflect real-world performance across the full case spectrum |
| External validation | Has the tool been validated on an independent external dataset? Was that dataset multicenter and demographically diverse? | Single-center or manufacturer-run validation studies have known generalizability limitations |
| Subgroup performance | Are sensitivity and specificity reported separately for relevant subgroups (e.g., hemorrhage subtype, lesion size, care setting, age, race)? | Aggregate metrics conceal subtype-level degradation that directly affects patient safety in specific clinical contexts |
| Care-setting match | Was the tool validated in the same care setting (ED, outpatient, inpatient) where you plan to deploy it? | Care-setting population differences drive substantial performance variation — outpatient sensitivity for Aidoc ICH Triage was 72.2% vs. 83.5% in the ED |
| Post-deployment evidence | Has peer-reviewed post-deployment evidence been published? By whom, and with what funding? | Manufacturer-sponsored studies without independent replication have known bias risk; independent multicenter studies carry more weight |
| Conflict of interest | Are there financial relationships between the study authors and the tool manufacturer or licensor? | Conflicts of interest in supporting studies affect the reliability of reported outcomes and should be disclosed and weighted |
| Clinical decision authority | Does the tool preserve full clinician decision-making authority, or does its interface design create pressure to accept AI recommendations? | Tools that present AI output as directive rather than advisory increase automation bias risk |
| Explainability | Does the tool provide explanations for its recommendations that clinicians can evaluate and override? | Unexplained outputs cannot be critically assessed; explainability is the mechanism by which clinicians catch model errors |
| Post-market monitoring plan | Does the vendor provide post-market performance monitoring? What triggers a performance review or alert? | Model drift can cause performance degradation over time; monitoring is the only way to detect it before it affects patient outcomes |
| Functional category fit | Is the tool designed for the specific clinical task you need — acuity prediction, imaging alerting, or condition-specific risk stratification? | Deploying a tool in a context outside its intended use (e.g., using a trauma hemorrhage tool as a general ED acuity scorer) produces unpredictable and potentially dangerous results |
The evidence available as of mid-2026 supports a clear conclusion: FDA-cleared AI triage CDS tools are not interchangeable, their clearance benchmarks do not reliably predict real-world performance, and the gap between the two is driven by identifiable and addressable structural factors. Clinicians and administrators who treat clearance as sufficient validation for deployment are accepting risks that post-deployment evidence has already quantified. The tools in this space can add clinical value — but only when evaluated with the same rigor applied to any other diagnostic or decision-support technology entering the emergency department.