The phrase "AI in healthcare" covers a wide and uneven terrain. At one end sit FDA-cleared software devices with prospective trial data. At the other end sit ambient scribes and large language models operating in clinical settings without formal regulatory authorization. Between those poles, you find risk stratification algorithms embedded in EHRs, computer-aided detection tools that have been in radiology workflows for over a decade, and a growing class of pathology and ophthalmology tools that have moved from retrospective validation into real-world deployment.
This brief organizes what is actually known — and what remains genuinely uncertain — across the clinical AI applications that have accumulated the most evidence and regulatory activity as of mid-2026. The goal is not comprehensiveness but precision: each application domain is characterized by its regulatory status, the quality of supporting evidence, and the specific limitations practitioners should understand before considering adoption.
How AI Is Applied in Clinical Medicine: A Domain Map
AI applications in clinical medicine are not a single category — they span radically different task types, data inputs, and decision contexts. Collapsing them into a single discussion produces confusion rather than clarity.
| Clinical Domain | Primary AI Task | Regulatory Landscape | Evidence Maturity |
|---|---|---|---|
| Radiology (chest, breast, neuro) | Detection, triage, measurement | High FDA clearance density; 510(k) predominant | Prospective RCTs exist for select applications |
| Pathology (digital/WSI) | Classification, grading, tumor detection | Growing clearance base; De Novo for novel tasks | Mostly retrospective; prospective studies emerging |
| Cardiology (ECG, echo, imaging) | Arrhythmia detection, risk stratification | Several cleared devices; wearable-adjacent tools | Large retrospective and prospective cohort data for ECG-based AFib detection |
| Ophthalmology (retinal imaging) | Diabetic retinopathy screening, AMD detection | FDA De Novo authorization for autonomous systems | Prospective validation in screening programs |
| Sepsis prediction (ICU/ED) | Risk stratification, early warning | Mostly non-device CDS; limited formal clearance | Mixed; several prospective studies with conflicting results |
| Gastroenterology (colonoscopy) | Polyp detection (CADe) | 510(k) cleared devices commercially deployed | RCT evidence supports adenoma detection rate improvement |
| Ambient documentation / NLP | Transcription, note generation, coding | Generally not FDA-regulated as medical devices | Vendor-disclosed data; limited independent peer review |
Domains with the Strongest Evidence Base
Diabetic Retinopathy Screening
This is the application where autonomous AI has the clearest regulatory and evidence foundation in the US. The FDA granted De Novo authorization to IDx-DR (now marketed as LumineticsCore) in 2018 — the first AI device authorized for autonomous diagnostic use without requiring a clinician to interpret the result. The intended use is specific: detecting more-than-mild diabetic retinopathy in adults with diabetes who do not have a prior diagnosis of DR, using retinal images captured in primary care settings.
The pivotal study enrolled 900 patients across 10 primary care sites and reported a sensitivity of 87.2% and specificity of 90.7% for detecting more-than-mild DR. External validation studies in real-world screening programs have generally confirmed the technology works in primary care contexts, though performance varies with image quality and patient population. A 2020 prospective study in a predominantly Latino primary care population reported sensitivity above 90%, which is notable given historical concerns about performance variation across skin tones and fundus pigmentation.
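Sensitivity and specificity figures like those above are proportions estimated from a finite sample, so they are easier to interpret alongside an interval. The sketch below computes both from confusion-matrix counts and attaches a Wilson score interval. The counts are hypothetical and are not the pivotal study's data, and the function names are illustrative.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_interval(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_interval(tn, tn + fp),
    }


# Hypothetical counts for illustration only (not taken from any study).
result = sensitivity_specificity(tp=87, fn=13, tn=630, fp=70)
for name, value in result.items():
    print(name, value)
```

In this made-up example only 100 participants have the disease, so the interval around sensitivity is much wider than the one around specificity. The number of disease-positive participants matters as much as total enrollment when reading a screening study.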
AI-Assisted Colonoscopy (Computer-Aided Detection)
Computer-aided detection for colonoscopy — specifically polyp detection during live procedures — has accumulated more randomized controlled trial data than almost any other AI clinical application. Multiple RCTs published since 2019 have consistently shown that CADe systems increase adenoma detection rates (ADR), typically by 10–15 percentage points compared to standard colonoscopy. The effect is most pronounced for small, flat adenomas that endoscopists are prone to miss.
Several CADe devices have received 510(k) clearance in the US. The clinical question that remains open is whether higher ADR translates into reduced colorectal cancer incidence and mortality — that outcome data requires longer follow-up than most trials have completed. There is also a documented false-positive problem: CADe systems flag non-adenomatous lesions, which increases procedure time and can prompt unnecessary polypectomies. The tradeoff between sensitivity and specificity is a real operational consideration for endoscopy units.
ECG-Based Atrial Fibrillation Detection
AI algorithms applied to 12-lead and single-lead ECGs for atrial fibrillation detection represent one of the most clinically mature applications in cardiology. In 2019 the Mayo Clinic group published a large-scale study in The Lancet demonstrating that a deep learning model could identify patients with paroxysmal AFib from ECGs recorded during normal sinus rhythm, a capability that conventional ECG interpretation cannot replicate. The model was developed using ECGs from over 180,000 patients and achieved an AUC of 0.87 on the held-out test set.
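An AUC such as the 0.87 quoted above has a concrete reading: it is the probability that a randomly chosen patient with the condition receives a higher model score than a randomly chosen patient without it. The sketch below computes AUC directly from that definition, counting ties as half. The scores are hypothetical and the function name is illustrative; a production analysis would more likely use an established statistics library.

```python
def pairwise_auc(scores_positive: list[float], scores_negative: list[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    if not scores_positive or not scores_negative:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for pos in scores_positive:
        for neg in scores_negative:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical model scores for illustration only.
with_afib = [0.9, 0.8, 0.6, 0.4]
without_afib = [0.7, 0.5, 0.3, 0.2, 0.1]
example_auc = pairwise_auc(with_afib, without_afib)
print(f"AUC = {example_auc:.2f}")
```

AUC summarizes ranking quality across all thresholds. It says nothing about how many alerts a specific operating threshold will generate, which is why a model with a respectable AUC can still perform poorly once deployed.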
Consumer wearable tools (notably the Apple Watch's irregular rhythm notification and ECG features) have FDA marketing authorization and have been studied prospectively, including in the single-arm Apple Heart Study, which enrolled over 400,000 participants and evaluated the optical pulse-based notification. The clinical utility question, whether detection in asymptomatic patients leads to treatment that reduces stroke, is being examined in ongoing trials, and results are not yet definitive.
Domains with Active Evidence but Unresolved Questions
Sepsis Prediction
Sepsis prediction algorithms are among the most widely deployed AI tools in US hospitals, yet they are also among the most contested in the clinical literature. The Epic Sepsis Model (ESM) is embedded in the EHRs of hundreds of US hospitals and generates alerts for patients at elevated sepsis risk. A 2021 external validation study published in JAMA Internal Medicine, a retrospective cohort analysis at a large academic medical center, found an AUC of 0.63, sensitivity of 33%, and a positive predictive value of 12% at the alert threshold studied. In other words, most sepsis cases were missed and the large majority of alerts did not correspond to confirmed sepsis. Alert fatigue from false positives is a documented implementation problem.
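A low positive predictive value is partly a property of the model and partly a consequence of how rare the target condition is. The sketch below applies Bayes' rule to show how PPV moves with prevalence when sensitivity and specificity are held fixed. All numbers are illustrative assumptions chosen for the arithmetic and are not the figures from any study cited here.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV = P(disease | alert), from Bayes' rule."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    true_alerts = sensitivity * prevalence
    false_alerts = (1.0 - specificity) * (1.0 - prevalence)
    if true_alerts + false_alerts == 0.0:
        raise ValueError("the model never alerts at these settings")
    return true_alerts / (true_alerts + false_alerts)


# Illustrative operating point (assumed, not from any cited study).
SENSITIVITY = 0.80
SPECIFICITY = 0.85

ppv_by_prevalence = {
    prevalence: positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    for prevalence in (0.02, 0.05, 0.20)
}
for prevalence, ppv in ppv_by_prevalence.items():
    # 1 / PPV is the average number of alerts reviewed per true case found.
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, alerts per true case {1 / ppv:.1f}")
```

Even with sensitivity and specificity that look acceptable in isolation, a condition affecting a few percent of admissions leaves most alerts false. The same model can therefore produce very different alert burdens in units with different baseline sepsis rates.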
Other sepsis prediction tools, including those based on continuous vital sign monitoring and laboratory trends, have shown better performance in controlled settings. The challenge is generalizability: models trained on one health system's patient population frequently degrade when deployed elsewhere due to differences in patient demographics, documentation practices, and treatment protocols. This is a canonical example of the external validation problem that affects clinical AI broadly.
Mammography AI and Breast Cancer Screening
AI-assisted mammography has a large and growing body of literature, but the evidence picture is more complicated than early enthusiasm suggested. Multiple FDA-cleared AI tools exist for mammography triage and density assessment. A large Swedish RCT (the MASAI trial, whose interim safety analysis was published in The Lancet Oncology in 2023) found that AI-supported screening achieved a cancer detection rate similar to, and numerically higher than, standard double reading while reducing screen-reading workload by approximately 44%. This was an important result.
However, the MASAI trial used a specific AI tool (Transpara, by ScreenPoint Medical) in a specific screening context (population-based, single-reading with AI triage). Results are not automatically transferable to other AI tools, different screening protocols, or the US recall-rate environment, which differs structurally from European population screening programs. Several systematic reviews have flagged that many mammography AI studies lack external validation on diverse populations, and performance differences across race and breast density categories have been documented.
Radiology Triage: Intracranial Hemorrhage and Pulmonary Embolism
AI triage tools for time-sensitive radiological findings — intracranial hemorrhage on CT, pulmonary embolism on CT-PA, large vessel occlusion for stroke — have received substantial FDA clearance activity and have been deployed in emergency radiology workflows at scale. The clinical rationale is strong: these are high-acuity findings where faster notification meaningfully changes outcomes.
Real-world deployment studies have generally confirmed that AI triage tools reduce time-to-radiologist-notification for critical findings. A 2022 retrospective study of an ICH triage tool across multiple hospital sites found median notification time reduced from over 30 minutes to under 10 minutes. The limitation of most such studies is that they measure process metrics (time to notification) rather than patient outcomes (mortality, disability). The assumption that faster notification improves outcomes is clinically reasonable but has not been uniformly confirmed in prospective outcome studies.
Equity Considerations Across AI Clinical Applications
Algorithmic bias in clinical AI is not a theoretical concern — it has been documented across multiple application domains, with real implications for health equity. The mechanisms vary by application type.
- Skin tone and image quality: Dermatology AI models trained predominantly on lighter skin tones have shown reduced sensitivity for malignant lesions in patients with darker skin. Fundus photography-based tools for diabetic retinopathy screening can also be affected by image quality differences correlated with retinal pigmentation.
- Pulse oximetry and SpO2 estimation: A 2020 NEJM study documented that pulse oximetry missed low arterial oxygen saturation (occult hypoxemia) nearly three times as often in Black patients as in White patients, and AI tools trained on pulse oximetry data can inherit this measurement bias. This has downstream implications for any AI system using SpO2 as a feature.
- Sepsis prediction and social determinants: Sepsis models trained on EHR data from academic medical centers may not generalize to safety-net hospitals with different patient populations, documentation practices, and baseline vital sign distributions. Performance disparities by race have been documented in post-hoc analyses.
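- Chest X-ray pathology detection: A 2021 study in Nature Medicine documented that chest X-ray classifiers underdiagnosed disease more often in under-served patient groups, and a widely cited 2022 study in The Lancet Digital Health found that AI models could predict patients' self-reported race from chest X-rays and other medical images. Together these findings suggest that bias can be encoded in image features that appear race-neutral, so omitting race as an input does not guarantee equal performance.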
- Language and NLP tools: Clinical NLP systems trained on notes from English-speaking, well-resourced health systems may perform poorly on notes from patients with limited English proficiency or from institutions with different documentation cultures.
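Several of the mechanisms above surface only when performance is broken out by subgroup rather than reported as one pooled number. The sketch below computes sensitivity per subgroup from labeled records. The record layout, group labels, and counts are all hypothetical.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """records: iterable of (group, has_disease, model_flagged) tuples.

    Returns {group: sensitivity}, using only the disease-positive records.
    """
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for group, has_disease, model_flagged in records:
        if not has_disease:
            continue  # sensitivity is defined over disease-positive cases only
        if model_flagged:
            true_pos[group] += 1
        else:
            false_neg[group] += 1
    groups = sorted(set(true_pos) | set(false_neg))
    return {g: true_pos[g] / (true_pos[g] + false_neg[g]) for g in groups}


# Hypothetical data: two subgroups with 10 disease-positive cases each.
records = (
    [("group_a", True, True)] * 9 + [("group_a", True, False)] * 1
    + [("group_b", True, True)] * 6 + [("group_b", True, False)] * 4
    + [("group_a", False, False)] * 20 + [("group_b", False, False)] * 20
)
per_group = sensitivity_by_group(records)
pooled = sum(1 for _, d, f in records if d and f) / sum(1 for _, d, _ in records if d)
print(per_group, f"pooled sensitivity = {pooled:.2f}")
```

In this made-up example the pooled sensitivity of 0.75 hides a gap between 0.90 in one group and 0.60 in the other. That is the pattern subgroup reporting is meant to expose.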
What FDA Clearance Does and Does Not Mean
FDA clearance is frequently misread in both directions — either overstated as a quality endorsement or dismissed as a bureaucratic formality. The accurate framing is more specific.
A 510(k) clearance means the FDA has determined that a device is substantially equivalent to a legally marketed predicate device in terms of intended use and technological characteristics. It does not mean the FDA has independently validated clinical performance data, nor does it mean the device has been shown to improve patient outcomes in a prospective trial. The evidentiary bar for 510(k) clearance is substantially lower than for PMA approval.
De Novo authorization is the pathway for novel low-to-moderate-risk devices that have no legally marketed predicate. It creates a new device classification, and the authorized device can then serve as a predicate for later 510(k) submissions. It requires more rigorous review than 510(k) but is still not equivalent to PMA. PMA (premarket approval) requires valid scientific evidence, typically including clinical data demonstrating safety and effectiveness, and is reserved for Class III devices posing the highest risk. As of mid-2026, the large majority of FDA-cleared AI medical devices have received clearance via 510(k), with a smaller number authorized through De Novo.
| Pathway | Standard | Common AI Applications | Post-Market Requirements |
|---|---|---|---|
| 510(k) | Substantial equivalence to predicate | Radiology CAD, triage tools, CADe for colonoscopy | Medical Device Reporting (MDR); no mandatory RCT |
| De Novo | Novel device; reasonable assurance of safety/effectiveness | Autonomous DR screening, some pathology tools | Medical Device Reporting (MDR); special controls set by FDA |
| PMA | Valid scientific evidence of safety and effectiveness | Very few AI devices as of mid-2026 | Post-approval studies often required |
Generative AI in Clinical Settings: A Distinct Category
Large language models and multimodal generative AI systems occupy a different regulatory and evidence space from the diagnostic AI tools discussed above. As of mid-2026, no generative AI system has received FDA authorization as a medical device for clinical diagnostic use. Ambient documentation tools that use LLMs to generate clinical notes operate largely outside the FDA device framework, though the regulatory boundary remains an active area of policy discussion.
The clinical evidence base for LLMs in medicine is growing rapidly but remains dominated by retrospective evaluations on benchmark datasets rather than prospective clinical trials. Hallucination — the generation of plausible but factually incorrect content — is a documented failure mode with specific clinical risk implications when LLMs are used for clinical summarization, documentation, or patient communication. Institutions deploying ambient AI scribes should treat them as documentation assistance tools, not autonomous clinical decision-makers, and should have human review processes in place.
Deployment Realities: What Studies Don't Capture
Controlled studies of AI tools measure performance under conditions that rarely match the operational reality of clinical deployment. Several failure modes appear consistently in real-world implementation reports.
- Alert fatigue: High-sensitivity AI tools generate large volumes of alerts, many of which are false positives. Clinicians adapt by ignoring alerts — including true positives. This is documented in sepsis prediction, CDS alerts, and radiology triage tools.
- Model drift: Clinical AI models can degrade over time as patient populations, documentation practices, or imaging equipment change. A model that performed well at deployment may underperform two years later without retraining.
- Workflow integration friction: AI tools that require clinicians to exit their primary workflow to consult a separate interface see substantially lower adoption. Integration with the EHR at the point of care is a practical prerequisite for sustained use.
- Training and change management: Implementation studies consistently find that staff training and change management are as important as technical performance for determining whether an AI tool improves outcomes. Tools deployed without structured training programs often fail to change clinical behavior.
- Reimbursement gaps: FDA clearance does not guarantee reimbursement. Many cleared AI tools lack dedicated CPT codes, which means the cost of deployment falls on the health system without a direct revenue offset. This is a significant adoption barrier, particularly for smaller institutions.
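Model drift and alert fatigue in the list above can both be tracked with routine monitoring of adjudicated alerts. The sketch below flags months in which observed alert PPV falls below a baseline by more than a tolerance. The thresholds, the minimum-volume rule, and the monthly counts are illustrative assumptions, not a validated monitoring protocol.

```python
def flag_drift(monthly_counts, baseline_ppv, tolerance=0.05, min_alerts=50):
    """Return (month, ppv) for months whose PPV is below baseline_ppv - tolerance.

    monthly_counts: list of (month_label, confirmed_true_alerts, total_alerts).
    Months with fewer than min_alerts alerts are skipped as too noisy to judge.
    """
    flagged = []
    for month, confirmed, total in monthly_counts:
        if total < min_alerts:
            continue
        ppv = confirmed / total
        if ppv < baseline_ppv - tolerance:
            flagged.append((month, ppv))
    return flagged


# Hypothetical adjudicated alert counts for illustration only.
monthly_counts = [
    ("2024-01", 62, 200),
    ("2024-02", 58, 200),
    ("2024-03", 10, 30),   # too few alerts to evaluate
    ("2024-04", 44, 200),
    ("2024-05", 40, 200),
]
flagged_months = flag_drift(monthly_counts, baseline_ppv=0.30)
for month, ppv in flagged_months:
    print(f"{month}: PPV {ppv:.2f} is below the tolerated range")
```

A check like this depends on someone adjudicating alerts against confirmed outcomes, which is itself an implementation cost. Without that labeling step, drift goes unmeasured rather than absent.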
Active Clinical Trials and Gaps in the Evidence Base
Several high-priority evidence gaps remain across clinical AI applications as of mid-2026. Prospective trials are underway in a number of areas, but results are not yet available for many of the most consequential questions.
- Long-term cancer outcomes from AI-assisted screening programs (mammography, lung CT, colonoscopy) — most RCTs have measured process endpoints, not mortality.
- Head-to-head comparisons of competing AI tools within the same clinical task — most studies compare AI to human performance, not AI to AI.
- Prospective trials of AI clinical decision support tools in underserved and safety-net populations, where the equity implications are greatest.
- Post-market surveillance data on model performance degradation over time — most cleared devices lack mandatory post-market performance reporting requirements.
- Clinical outcome data for ambient documentation AI — current evidence is largely limited to documentation time reduction and physician satisfaction surveys.
Summary: Where the Evidence Is Solid, Where It Is Not
| Application | Evidence Strength | FDA Status | Key Caveat |
|---|---|---|---|
| Diabetic retinopathy screening (autonomous AI) | Strong: prospective pivotal trial, real-world validation | De Novo authorized | Performance varies with image quality; equity data limited for some populations |
| Colonoscopy polyp detection (CADe) | Strong: multiple RCTs showing ADR improvement | 510(k) cleared | Outcome data (cancer reduction) not yet available; false-positive burden is real |
| ECG-based AFib detection | Moderate-strong: large retrospective and prospective wearable data | FDA cleared or De Novo authorized (wearables) | Utility in asymptomatic patients not definitively proven |
| Mammography AI triage | Moderate: large RCT (MASAI) but specific tool/protocol | 510(k) cleared | Results not transferable across tools; equity gaps documented |
| Radiology triage (ICH, PE, LVO) | Moderate: deployment studies show process improvement | 510(k) cleared | Outcome data (mortality, disability) limited |
| Sepsis prediction | Weak-moderate: external validation shows low PPV in real settings | Not cleared as medical device (CDS) | Alert fatigue; poor generalizability across institutions |
| Ambient documentation / LLM scribes | Weak: limited peer-reviewed evidence; vendor-disclosed data dominant | Not FDA-regulated as devices | Hallucination risk; no outcome data; institutional policy governs use |
The practical implication of this landscape is that clinical AI is not a monolith. A procurement decision about a colonoscopy CADe tool sits on very different evidentiary ground than a decision about deploying an LLM-based clinical summarization system. Treating them as equivalent — either in enthusiasm or in skepticism — misrepresents what the evidence actually shows.