FDA-Cleared Radiology AI: The Clinical Evidence Gap

Figure: The gap between FDA authorization volume and devices with published clinical evidence. Through December 2025, 1,104 radiology AI devices held FDA authorization; in a study of 723 devices authorized through June 2024, fewer than 30% had undergone any clinical testing.

The Scale of FDA-Cleared Radiology AI Through 2025

Radiology has become the dominant domain of AI medical device authorization in the United States by a wide margin. The FDA's official AI-enabled medical device list, updated through December 30, 2025, records 1,451 total AI-enabled medical device authorizations since 1995. Of those, 1,104 — approximately 76% — are radiology devices. That concentration is not an artifact of a single year; in Q4 2025 alone, 55 of 72 newly cleared AI devices (76%) were radiology tools.
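The shares quoted above follow directly from the counts. A minimal Python check, using only the figures in this paragraph (the helper name `share_pct` is illustrative, not part of any FDA tooling):

```python
def share_pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


# Counts as quoted in the text from the FDA AI-enabled device list.
cumulative_share = share_pct(1104, 1451)  # radiology share of all authorizations
q4_2025_share = share_pct(55, 72)         # radiology share of Q4 2025 clearances

print(cumulative_share, q4_2025_share)
```

Both values round to 76%, which is why the cumulative and quarterly concentrations are described as the same.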

The pace of authorization has accelerated substantially. An industry analysis of 2025 510(k) clearances by Innolitics, which used AI-assisted identification and should be treated as directional rather than definitive, counted approximately 295 AI/ML device clearances in 2025 alone, with radiology accounting for 71.5% of those (roughly 211 devices). That single-year figure is larger than the entire cleared AI device inventory from 1995 through roughly 2018.

Among device makers, GE HealthCare leads with 120 radiology AI authorizations (a total that reflects acquisitions including Bay Labs, Caption Health, MIM Software, and icometrix), followed by Siemens Healthineers at 89 and Philips at 50. Canon, United Imaging, Aidoc, and DeepHealth round out the top tier. The same 2025 analysis found that 183 of 221 manufacturers that year had only a single clearance, pointing to a broad startup ecosystem feeding the market alongside the established imaging vendors.

How FDA Authorizes AI Medical Devices: Three Pathways, One Dominant Route

The FDA regulates AI medical devices as Software as a Medical Device (SaMD) under three premarket pathways. Understanding which pathway a device used is the first step toward understanding what evidence — if any — was required to obtain authorization.

FDA premarket authorization pathways for AI-enabled medical devices. Share figures are approximate, based on cross-sectional analyses through 2024–2025.
Pathway	Standard Required	Share of AI Devices	Clinical Trial Evidence Required?
510(k)	Substantial equivalence to a predicate device	~96.5%	No — predicate comparison is sufficient
De Novo	Novel device with low-to-moderate risk; establishes new device type	~3%	No — but risk controls and special controls apply
PMA (Premarket Approval)	Reasonable assurance of safety and effectiveness	<1%	Yes — typically requires clinical study data

The 510(k) pathway's dominance — covering roughly 96.5% of cleared AI devices across multiple analyses — has a structural consequence that shapes everything discussed in this article. The 510(k) standard asks whether a new device is substantially equivalent to a legally marketed predicate device, not whether it has been independently tested in patients. A manufacturer can clear a new AI diagnostic tool by demonstrating that it performs comparably to an older tool, even if that older tool was itself never subjected to a prospective clinical trial.

A review of AI/ML devices through May 2025 found that approximately one-third of AI/ML 510(k) predicates are themselves non-AI devices — meaning many AI tools are being compared to legacy software or imaging hardware that predates the machine-learning era entirely. The pathway was designed for incremental device modifications, not for a technology category that can behave in ways that have no historical analog.

The PMA pathway — which does require clinical evidence — covers fewer than 1% of cleared AI devices. De Novo, used for novel low-to-moderate risk devices without a predicate, covers roughly 3%. Both pathways impose higher regulatory burdens and longer review timelines, which partly explains why manufacturers overwhelmingly favor 510(k) even for genuinely novel AI applications.

Where Radiology AI Is Concentrated: Subspecialties and Task Types

The 1,104 radiology authorizations are not evenly distributed across the specialty. Within radiology, the largest share targets multi-system applications — tools designed to work across body regions or modalities rather than within a single imaging domain. Neurological imaging and oncologic imaging follow as the next largest subspecialty concentrations.

Distribution of FDA-cleared radiology AI devices by subspecialty, based on data through May 2025. Source: Loganathan et al., JMAI 2026.
Radiology Subspecialty	Share of Radiology AI Devices
Multi-system (cross-modality/body region)	35.4%
Neurological imaging	14.5%
Oncologic imaging	11.8%
Other subspecialties	~38.3%

By task type, the leading FDA product codes in the 2025 clearance cohort were QIH (automated radiological image processing software), QAS (radiological computer-assisted triage and notification), and LNH (MRI systems). Detection and triage tools, which flag findings for radiologist review rather than making autonomous diagnostic decisions, represent the majority of cleared applications. This task-type distribution matters for evidence evaluation: a triage tool that flags studies for priority review has different failure modes and clinical stakes than a tool that characterizes a lesion.

The breadth of subspecialty coverage is relevant context for the evidence gaps documented in the next section. The clinical testing shortfalls are not confined to a single niche like chest X-ray triage or mammography screening — they span multi-system tools, neurological imaging, and oncologic applications alike. The evidence problem is structural, not domain-specific.

What the Peer-Reviewed Evidence Actually Shows

Four peer-reviewed studies published between 2024 and 2025 converge on a consistent finding: the clinical evidence base for FDA-cleared radiology AI is substantially thinner than the device count suggests. Each study used a different analytical lens — clinical testing rates, recall risk, transparency scoring, benefit-risk reporting — but the quantitative results point in the same direction.

Figure: The evidence imbalance in radiology AI. Authorization volume far outweighs the number of devices with robust clinical proof, and that gap is independently linked to higher recall risk.

Clinical Testing Rates: Below 30%

A study by Sivakumar and colleagues, published in JAMA Network Open and covering 723 FDA-authorized radiology AI devices through June 2024, found that fewer than 30% of those devices had undergone any form of clinical testing. Only 5% underwent prospective testing — the study design most capable of detecting real-world performance problems — and only 8% included human-in-the-loop evaluation, meaning testing that assessed how the AI affected actual radiologist decision-making. Only 15 devices combined both prospective testing and clinical validation; only 6 used all three methods.

Benefit-Risk Reporting: RCT Evidence in 1.6% of Devices

Lin and colleagues analyzed 691 FDA-cleared AI/ML devices cleared through July 2023, of which 531 (76.9%) were radiology devices. Only 6 devices were supported by randomized controlled trial data, reported as 1.6%; since 6 of 691 would be about 0.9%, that share was presumably computed on a smaller denominator, such as the devices that reported a study design. Only 53 (7.7%) were supported by prospective studies of any design. Nearly half of all device summaries, 46.7%, did not report a study design at all. And 660 devices (95.5%) provided no demographic characteristics of the populations used to validate them.

The Lin et al. analysis identified a modest post-2021 improvement: devices cleared after 2021 were more likely to report study timing (52.4% vs. 39.4%) and efficacy data (35.8% vs. 23.8%). However, the same devices cleared after 2021 were less likely to be associated with peer-reviewed publications — a pattern that suggests manufacturers may be meeting minimum reporting requirements in regulatory submissions without producing independently verifiable scientific records.

Transparency Scoring: 3.3 Out of 17 Points

Mehta and colleagues reviewed 1,012 publicly available Summary of Safety and Effectiveness Data (SSED) documents for AI/ML devices cleared from 1970 through December 2024, scoring each against the AI/ML Clinical Trial Reporting (ACTR) transparency framework. The mean score was 3.3 out of a possible 17 points. Thirty percent of devices scored zero. Only 53.1% of summaries reported any clinical study at all; of those, 60.5% were retrospective. More than half — 51.6% — reported no performance metrics.

The FDA issued its Good Machine Learning Practice (GMLP) guiding principles in 2021. The Mehta et al. analysis found that post-2021 devices improved their ACTR scores by an average of 0.88 points, a statistically detectable but modest shift that left mean transparency well below half the available score. The authors also found that higher transparency scores did not correlate with faster FDA review times, so more thorough reporting earns manufacturers no apparent review-time advantage.

Evidence Gaps in Detail: Testing Rates, Demographics, and Transparency

The aggregate numbers above describe a structural pattern. The specific dimensions of that pattern matter for anyone evaluating a particular device.

Prospective testing: Only 5% of FDA-authorized radiology AI devices in the Sivakumar et al. cohort underwent prospective testing. The remainder relied on retrospective datasets — typically curated from existing imaging archives — which are prone to case selection bias and do not capture the variability of live clinical workflows.
Human-in-the-loop evaluation: Only 8% of devices were tested in a setting that assessed how the AI affected radiologist performance. This matters because AI assistance does not uniformly improve diagnostic accuracy — research cited by the Sivakumar study authors found that high-performing radiologists maintained strong performance with AI assistance, while lower-performing radiologists did not necessarily improve.
Demographic representation: The Lin et al. study found 95.5% of cleared AI/ML devices provided no demographic characteristics of their validation populations. A complementary study covering 903 devices through August 2024 found that only 28.7% reported performance data separately by sex, and only 23.2% by age. Devices validated on non-representative populations may perform differently in patient populations that differ from the training cohort.
Patient outcome reporting: Fewer than 1% of devices in the Lin et al. analysis reported any patient outcome data — meaning data connecting AI performance to whether patients were correctly diagnosed, received appropriate treatment, or experienced better health outcomes. Device summaries document algorithmic performance metrics, not clinical impact.
Predetermined Change Control Plans (PCCPs): A PCCP allows a manufacturer to update an AI algorithm after clearance without filing a new submission, provided the changes fall within pre-specified bounds. According to the Innolitics 2025 analysis, which is not peer-reviewed and used AI-assisted identification, approximately 10% of 2025 clearances included a PCCP. The Mehta et al. study found that only 1.5% of its full cohort through December 2024 reported a PCCP, reflecting how recently the mechanism became available. Without a PCCP, algorithm changes that could significantly affect safety or effectiveness generally require a new 510(k) submission, which creates pressure to keep models static even as performance may drift.
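
The demographic representation gap in the list above concerns whether performance is reported separately by subgroup. As a rough sketch of what such a breakdown involves, the Python below computes sensitivity per subgroup from case-level results. The function name, the record fields (`truth`, `flagged`, `sex`), and the six toy cases are all invented for illustration; they do not come from any of the cited studies or any vendor's data.

```python
from collections import defaultdict


def sensitivity_by_group(records, group_key):
    """Compute sensitivity (true positive rate) separately for each subgroup.

    records: iterable of dicts with 'truth' (bool, finding really present),
             'flagged' (bool, AI flagged the study), and the group_key field.
    Returns {group: sensitivity}, or None for a group with no positive cases.
    """
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    groups = set()
    for rec in records:
        group = rec[group_key]
        groups.add(group)
        if rec["truth"]:
            if rec["flagged"]:
                true_pos[group] += 1
            else:
                false_neg[group] += 1
    result = {}
    for group in groups:
        positives = true_pos[group] + false_neg[group]
        result[group] = true_pos[group] / positives if positives else None
    return result


# Hypothetical toy data, invented purely to exercise the function.
cases = [
    {"sex": "F", "truth": True, "flagged": True},
    {"sex": "F", "truth": True, "flagged": False},
    {"sex": "F", "truth": False, "flagged": False},
    {"sex": "M", "truth": True, "flagged": True},
    {"sex": "M", "truth": True, "flagged": True},
    {"sex": "M", "truth": False, "flagged": True},
]
by_sex = sensitivity_by_group(cases, "sex")
print(by_sex)
```

A pooled sensitivity on these toy cases would be 0.75, which hides the difference between the two groups. That is the kind of detail a summary without subgroup reporting cannot reveal.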

Taken together, these gaps describe an information environment in which a health system evaluating a specific FDA-cleared radiology AI tool will often find that the device's regulatory summary does not answer the most clinically relevant questions: Has it been tested in patients like ours? Does it perform consistently across demographic groups? Has it been validated at sites other than the developer's own institution? The answer to each of these questions is, for most devices, unknown.

Recall Patterns: When Evidence Gaps Translate to Device Failures

Lee and colleagues analyzed 950 FDA-cleared AI-enabled medical devices matched to recall records through November 2024. Among those, 60 devices (6.3%) had experienced 182 recall events. The timing pattern is notable: 79 of those recalls (43.4%) occurred within 12 months of clearance — approximately double the early-recall rate observed across all 510(k) devices, not only AI ones.

The most consequential finding from the Lee et al. study is the multivariable association between clinical validation status and recall risk. After controlling for other factors, lack of clinical validation was independently associated with recall at an odds ratio of 2.8 (95% CI 1.6–4.7). Devices without reported validation also had more recalls per device (mean 3.4 vs. 1.9–2.0) and larger recalls by unit count (mean 12,193 vs. 6,523 units per recall event). Public company status was also independently associated with recall (OR 5.9; 95% CI 2.4–14.6), a finding the authors attribute to higher device volume and broader deployment footprints.
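
For readers less familiar with odds ratios, the sketch below shows how an unadjusted odds ratio and a Wald 95% confidence interval are computed from a 2x2 table. It is only an illustration of the statistic: the figures reported by Lee et al. come from a multivariable model, which this does not reproduce, and the four cell counts here are invented rather than taken from the study.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed, event        b: exposed, no event
    c: unexposed, event      d: unexposed, no event
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this formula")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se_log_or)
    high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high


# HYPOTHETICAL counts, not from Lee et al.:
# "exposed" = no reported clinical validation, "event" = at least one recall.
or_est, ci_low, ci_high = odds_ratio_ci(a=40, b=160, c=20, d=180)
print(f"OR = {or_est:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

An interval that excludes 1.0, as the study's reported 1.6 to 4.7 does, is what supports describing the association as statistically distinguishable from no effect.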

Diagnostic and measurement errors were the leading cause category among recalled units, accounting for 109 recall events covering 935,063 units. At the time of the study's data cutoff, 108 recalls (59.3%) remained unresolved, and 20 had been unresolved for more than three years.

Questions Clinicians and Administrators Should Ask Before Deploying Radiology AI

The evidence gaps documented above are not arguments against deploying radiology AI. They are arguments for asking specific questions before doing so — questions that FDA clearance alone does not answer. The following framework maps directly to the documented gaps.

Was clinical testing conducted, and if so, was it prospective or retrospective? Ask the vendor for the device's Summary of Safety and Effectiveness Data (SSED) or 510(k) summary. Retrospective testing on curated archives is the norm; prospective testing in a live clinical environment is the exception. If the answer is retrospective only, ask which institution provided the data and whether it resembles your patient population.
What is the demographic composition of the training and test dataset? Given that 95.5% of cleared devices in the Lin et al. cohort provided no demographic characteristics, this information may not be publicly available. If the vendor cannot provide it, that absence is itself meaningful. Ask specifically about age range, sex distribution, race and ethnicity representation, and scanner manufacturer or model, since image characteristics vary by equipment.
Has the device been validated at external sites beyond the developer's institution? Internal validation datasets are prone to optimistic performance estimates. External validation — testing on data from institutions not involved in development — provides a more reliable performance estimate. Ask whether any peer-reviewed external validation studies exist and whether they are independent of the manufacturer.
Does the device carry a Predetermined Change Control Plan (PCCP)? A PCCP specifies in advance how the algorithm may be updated and what performance monitoring is required. Its presence indicates the manufacturer has committed to a structured post-market modification process. Its absence means any meaningful algorithm update will require a new regulatory submission, which may delay performance improvements or leave a drifting model in use longer than warranted.
Are post-market performance data available? Ask whether the vendor has published or shared post-market surveillance data from real-world deployments. Given that 43.4% of recalls in the Lee et al. cohort occurred within 12 months of clearance, early post-market performance is particularly important. Manufacturers must report certain adverse events and malfunctions under the FDA's Medical Device Reporting (MDR) regulation, and those reports are publicly searchable in the MAUDE database.
Has the device been evaluated in human-in-the-loop conditions? Only 8% of radiology AI devices in the Sivakumar et al. cohort were tested in a setting that measured how the AI affected radiologist performance. Ask whether any published studies assessed radiologist accuracy with and without the AI tool, and whether those studies included radiologists at different experience levels.
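
One way a procurement team might track these six questions per device is a small structured record that shows which ones a vendor has not yet answered. The Python below is a minimal sketch under that assumption; the class name, the question keys, and the example device and answers are hypothetical and not drawn from any cited source.

```python
from dataclasses import dataclass, field

# Short keys for the six questions above (names are illustrative).
QUESTIONS = (
    "clinical_testing_design",  # prospective vs. retrospective testing
    "dataset_demographics",     # composition of training and test data
    "external_validation",      # sites beyond the developer's institution
    "pccp",                     # Predetermined Change Control Plan
    "post_market_data",         # real-world performance after clearance
    "human_in_the_loop",        # effect on radiologist performance
)


@dataclass
class DeviceReview:
    device_name: str
    answers: dict = field(default_factory=dict)  # question key -> vendor answer

    def record(self, question: str, answer: str) -> None:
        """Store the vendor's answer to one of the known questions."""
        if question not in QUESTIONS:
            raise KeyError(f"unknown question: {question}")
        self.answers[question] = answer

    def open_questions(self) -> list:
        """Return the questions that still have no (non-blank) answer."""
        return [q for q in QUESTIONS if not self.answers.get(q, "").strip()]


# Hypothetical device and answers, for illustration only.
review = DeviceReview("Example triage tool")
review.record("clinical_testing_design", "retrospective, single site")
review.record("pccp", "none")
remaining = review.open_questions()
print(remaining)
```

Keeping unanswered items explicit reflects the point made above: when a vendor cannot supply an answer, that absence is itself information worth recording.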

None of these questions are hostile to AI adoption. They are the same questions that would be asked of any new diagnostic tool entering a clinical workflow — questions that the current regulatory framework does not require manufacturers to answer publicly before clearance. The evidence reviewed here suggests that for most of the 1,104 FDA-authorized radiology AI devices currently on the market, the answers are either unknown or unavailable. That is not a reason to avoid AI in radiology. It is a reason to treat the procurement and monitoring process with the same rigor applied to any other clinical decision-support technology.
