AI Mammography Screening: What the Systematic Review Evidence Shows

[Figure: dual-panel illustration of a radiologist reviewing AI-annotated mammograms alongside a three-tier evidence pyramid representing RCT, prospective, and retrospective study designs.] AI-assisted mammography screening positions the algorithm as a decision-support tool within a human-supervised workflow, not as a replacement for radiologist judgment.

Why This Systematic Review Matters Now

Breast cancer screening programs face a compounding pressure: rising examination volumes, radiologist workforce constraints, and a commercial AI market that has outpaced the evidence base. Over 20 FDA-authorized AI applications for breast imaging now exist, yet adoption in clinical screening programs remains variable and generally low — in part because the published evidence has been fragmented across single-center studies, enriched retrospective datasets, and interim trial analyses that do not individually support implementation decisions.

The BMJ Open 2025 systematic review (PMID 41475802) consolidates that fragmented evidence. By synthesizing 31 studies encompassing more than two million screening examinations across RCTs, prospective paired-reader designs, registry-based implementations, and retrospective simulations, it provides the most comprehensive published evaluation of AI performance in population-level breast cancer screening to date. Its narrative synthesis, organized by AI integration role rather than by study, allows clinicians and program administrators to evaluate what the evidence actually supports for each deployment configuration.

This digest uses the BMJ Open review as its primary anchor, supplemented by the highest-tier 2025–2026 prospective studies — the MASAI RCT (The Lancet), GEMINI (Nature Cancer), and AIMS (Nature Cancer) — to provide a structured synthesis of current evidence. Readers seeking broader context on AI performance across multiple clinical domains can consult the site's AI in the medical field evidence overview; this digest goes substantially deeper on a single clinical domain.

Study Design and Scope: How the BMJ Open 2025 Review Was Constructed

The systematic review searched the literature from 2012 through June 2025, capturing the period during which deep learning-based AI systems became technically viable for mammography interpretation. Thirty-one studies met inclusion criteria. The review used a narrative synthesis approach, organizing findings by the role AI played in the screening workflow rather than by study design — a methodological choice that allows more direct clinical translation.

Study quality was appraised using QUADAS-2 combined with an AI-specific critical appraisal tool. The QUADAS-2 findings were not merely procedural — they revealed concrete quality concerns that constrain how the performance data should be interpreted:

Enriched datasets: Many retrospective studies artificially inflated cancer prevalence beyond what population screening programs encounter, making sensitivity and specificity estimates non-transferable to real-world deployment.
Single-center designs: A substantial proportion of included studies drew from a single institution, limiting generalizability across screening program types, mammography equipment, and radiologist experience levels.
Incomplete interval-cancer follow-up: Interval cancers — those diagnosed between screening rounds — are a critical outcome measure for screening programs, but most included studies lacked the follow-up duration to capture them reliably. This gap is most consequential for evaluating standalone AI configurations.
Heterogeneity across AI systems: The review included multiple commercial and research AI systems with different algorithmic architectures, training datasets, and threshold calibrations, making direct cross-study performance comparisons unreliable.

The review's conclusion that transparent reporting, standardized evaluation frameworks, and long-term population studies are still required before standalone AI can be considered flows directly from these appraisal findings — not from general caution about AI, but from specific identified gaps in the existing evidence.
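The enriched-dataset concern can be made concrete with a short calculation. The sketch below is illustrative only: it holds sensitivity and specificity fixed and varies prevalence, using the MASAI AI-arm values reported later in this digest (80.5% and 98.5%) purely as example inputs. The 10% enriched prevalence is a hypothetical value, not a figure from any included study. Even with identical sensitivity and specificity, positive predictive value shifts sharply with prevalence.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Example inputs only (MASAI AI arm, as reported in this digest).
SENS, SPEC = 0.805, 0.985

scenarios = {
    "screening, 5 per 1,000": 0.005,
    "screening, 8 per 1,000": 0.008,
    "enriched, 10% (hypothetical)": 0.10,
}

for label, prevalence in scenarios.items():
    print(f"{label}: PPV = {ppv(SENS, SPEC, prevalence):.1%}")
```

Under these assumptions PPV is roughly 21% and 30% at screening prevalence, against roughly 86% in the hypothetical enriched set, which is why headline metrics from enriched datasets cannot be carried over to population programs unchanged.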

Population and Geographic Breadth: What 2 Million Screening Examinations Covered

The 31 included studies collectively represent a geographically diverse evidence base spanning Europe, Asia, North America, and Australia. This breadth is meaningful because breast cancer screening program structures differ substantially across these regions — including double-reading standards, recall rate thresholds, and mammography equipment ecosystems — meaning AI systems were evaluated against different baseline workflows.

Study design types represented in the BMJ Open 2025 systematic review and their key evidence-quality characteristics.
Study Design Type	Contribution to the Evidence Base	Primary Limitation
Randomized controlled trials	Highest-quality prospective evidence; MASAI is the anchor example	Single country, single device, single AI system in the largest trial
Prospective paired-reader studies	Real-world workflow testing without randomization	Susceptible to reader-order effects; variable blinding
Registry-based implementations	Large-scale real-world data; captures operational variability	Retrospective; no control arm for direct comparison
Retrospective simulations	Rapid, low-cost evaluation across diverse datasets	Enriched datasets inflate performance; no prospective validation

A consistent equity gap runs across the included evidence. No race or ethnicity data was collected in the MASAI trial — the largest and most methodologically rigorous study in the review. The AIMS and GEMINI studies similarly reported limited demographic diversity data. The review does not include any study that systematically evaluated AI performance stratified by race, ethnicity, or socioeconomic status, leaving the question of differential AI performance across demographic groups entirely unanswered in the breast screening context.

Key Performance Findings by AI Integration Role

The BMJ Open review's organizing framework — grouping findings by AI deployment role rather than by study — is the most clinically useful aspect of its synthesis. Performance characteristics differ substantially across the three primary configurations, and conflating them produces misleading conclusions about what AI can and cannot do in screening programs.

[Figure: infographic showing three AI roles in mammography screening (second reader, triage, and standalone) with sensitivity indicators and evidence-strength gradients.] Evidence strength varies substantially by AI integration role. Second-reader and triage configurations are supported by the current evidence base; standalone AI as sole reader is not.

AI integration roles in breast cancer screening programs and their corresponding evidence profiles, as synthesized in the BMJ Open 2025 systematic review.
AI Role	Key Performance Finding	Workload Impact	Evidence Certainty
Second reader	Sensitivity maintained or increased by up to +9 percentage points; specificity preserved or improved	Reduces or eliminates need for second human reader in double-reading programs	Moderate — supported by RCT and prospective data; heterogeneity limits pooling
Triage / decision-referral	Non-inferior cancer detection when thresholds are conservatively calibrated	Reading volume reduction of 40–90% across included studies	Moderate — prospective evidence supports non-inferiority at conservative thresholds
Standalone (sole reader)	AUC values comparable to radiologists in retrospective studies	Potential for full automation of initial read	Low — interval-cancer follow-up incomplete; no prospective RCT support; not recommended

The second-reader configuration has the strongest prospective evidence base. In double-reading programs — standard in many European national screening programs — replacing the second human reader with AI has been evaluated in both RCT and prospective paired-reader designs, with consistent findings of maintained or improved sensitivity. The +9 percentage point figure represents the upper bound across included studies; the central tendency is more modest but directionally consistent.

Triage configurations — where AI pre-screens examinations and routes a subset for expedited or reduced human review — show the largest workload reduction signals (40–90%), but the range is wide because threshold calibration has a dominant effect. Studies using conservative AI thresholds (routing only clearly negative cases away from full human reading) consistently show non-inferior cancer detection; studies using aggressive thresholds show larger workload reductions at the cost of detection sensitivity.
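The threshold trade-off described above can be sketched with a toy cohort. Everything in the example is hypothetical: the score scale, the counts per score, and the two thresholds are invented for illustration and do not come from any study in the review. The function simply routes every examination at or below a score threshold away from full human reading, then reports how much reading volume was removed and how many cancers went with it.

```python
# Hypothetical cohort for illustration only; NOT data from any study.
# score (1 = lowest AI risk, 10 = highest) -> (normal exams, cancers)
COHORT = {
    1: (300, 0), 2: (200, 0), 3: (150, 0), 4: (100, 1), 5: (80, 1),
    6: (60, 1), 7: (40, 2), 8: (30, 3), 9: (20, 5), 10: (15, 12),
}

def triage(threshold: int) -> tuple[float, int]:
    """Route exams scoring <= threshold away from full human reading.

    Returns (fraction of exams removed from the reading list,
             number of cancers among the removed exams).
    """
    total = sum(n + c for n, c in COHORT.values())
    removed = sum(n + c for s, (n, c) in COHORT.items() if s <= threshold)
    missed = sum(c for s, (_, c) in COHORT.items() if s <= threshold)
    return removed / total, missed

for t in (3, 6):  # conservative vs aggressive (both hypothetical)
    reduction, missed = triage(t)
    print(f"threshold {t}: workload -{reduction:.0%}, cancers routed away: {missed}")
```

In this toy cohort the conservative threshold removes about 64% of readings with no cancers routed away, while the aggressive threshold removes about 88% but routes away 3 of 25 cancers. The shape of the trade-off, not the specific numbers, is the point.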

Standalone AI — where the system functions as the sole reader without human review of AI-negative cases — achieves AUC values comparable to radiologists in retrospective analyses, but this comparison is methodologically weak. AUC in enriched retrospective datasets does not predict real-world interval-cancer rates, and no prospective study has followed a standalone AI cohort long enough to report interval-cancer outcomes. The review is explicit that this configuration is not yet supported by the current evidence base.

MASAI RCT: The Prospective Anchor

The MASAI trial, published in full in The Lancet in early 2026, is the largest randomized controlled trial of AI in cancer screening conducted to date, and the first prospective RCT specifically designed to evaluate AI-supported mammography in a population screening program. Its full results provide the most direct evidence for how AI-supported screening performs against standard double reading under real-world conditions.

The trial randomized 105,934 women 1:1 to AI-supported screening (using Transpara v1.7.0, ScreenPoint Medical) versus standard double reading at four Swedish screening sites. The AI system provided a risk score for each examination; radiologists used this score to inform their read, with the final recall decision remaining with the radiologist throughout.

MASAI RCT full results (The Lancet, 2026): 105,934 women randomized 1:1 across four Swedish screening sites. AI system: Transpara v1.7.0 (ScreenPoint Medical). Interval-cancer non-inferiority margin met; descriptive reductions in invasive and advanced cancers are hypothesis-generating.
Outcome Measure	AI-Supported Arm	Standard Double Reading	Statistical Result
Cancer detection rate (CDR)	6.4 per 1,000	5.0 per 1,000	p = 0.0021
Sensitivity	80.5%	73.8%	p = 0.031
Specificity	98.5%	98.5%	p = 0.88 (equivalent)
False-positive rate	1.5%	1.4%	Non-significant difference
Interval cancer rate	1.55 per 1,000	1.76 per 1,000	Proportion ratio 0.88 (95% CI 0.65–1.18); p = 0.41, non-inferior
Invasive interval cancers	75 cases	89 cases	16% fewer in AI arm (descriptive)
T2+ interval cancers	38 cases	48 cases	21% fewer in AI arm (descriptive)
Non-luminal A interval cancers	43 cases	59 cases	27% fewer in AI arm (descriptive)
Screen-reading workload	61,248 readings	109,692 readings	44% reduction

The 29% increase in cancer detection rate — from 5.0 to 6.4 per 1,000 — was statistically significant and driven predominantly by additional small, lymph-node negative invasive cancers (58 more T1 cancers) and more non-luminal A cancers including triple-negative tumors (16 vs 6). Notably, there was no increase in low-grade DCIS detection, addressing a common concern that AI-assisted screening would primarily increase detection of indolent disease.

The interval-cancer findings are particularly clinically meaningful. Interval cancers, those diagnosed between screening rounds, tend to be more aggressive. The 12% relative reduction in the AI arm (1.55 vs 1.76 per 1,000) did not reach statistical significance (the confidence interval crosses 1.0), but the descriptive data showing 16% fewer invasive, 21% fewer T2+, and 27% fewer non-luminal A interval cancers suggest a biologically plausible benefit that longer follow-up may substantiate. The authors note that 20–30% of interval cancers could have been identified at the preceding mammogram, and AI appears to be partially capturing this missed-cancer subset.
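The relative figures quoted for cancer detection and interval cancers can be re-derived from the table values. The sketch below uses only numbers reported in this digest. Two caveats are assumptions on our part: comparing raw case counts between arms presumes near-equal arm sizes (reasonable under 1:1 randomization), and the rounded detection rates give a ratio of about 1.28, so the reported 29% presumably reflects unrounded data that the digest does not reproduce.

```python
def relative_reduction(ai_arm: float, control_arm: float) -> float:
    """Fractional reduction in the AI arm relative to the control arm."""
    return 1.0 - ai_arm / control_arm

# Cancer detection rate per 1,000 (rounded table values).
cdr_ratio = 6.4 / 5.0

# Interval cancer rate per 1,000 and the reported 95% CI for the ratio.
interval_ratio = 1.55 / 1.76
ci_low, ci_high = 0.65, 1.18
ci_includes_one = ci_low <= 1.0 <= ci_high  # True means not statistically significant

# Interval cancer case counts: (AI-supported arm, standard double reading).
interval_counts = {"invasive": (75, 89), "T2+": (38, 48), "non-luminal A": (43, 59)}

print(f"CDR ratio from rounded rates: {cdr_ratio:.2f}")
print(f"Interval-cancer ratio: {interval_ratio:.2f}, CI includes 1.0: {ci_includes_one}")
for subtype, (ai, control) in interval_counts.items():
    print(f"{subtype}: {relative_reduction(ai, control):.0%} fewer in AI arm")
```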

The 44% workload reduction, from 109,692 to 61,248 screen readings, was confirmed in the Lancet Digital Health analysis of the full study population and represents a substantial operational finding for programs facing radiologist capacity constraints. Identical specificity (98.5% in both arms) and similar false-positive rates (1.5% vs 1.4%, a non-significant difference) mean the workload reduction was achieved without a significant increase in unnecessary recalls.
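A quick check of the workload figure, and of what it means per participant, is sketched below. The reading counts and the total number randomized come from this digest; the exact per-arm size is not given here, so the code assumes an even split of the 105,934 participants, which is an approximation.

```python
PARTICIPANTS = 105_934        # total randomized, from the trial description
ARM_SIZE = PARTICIPANTS / 2   # assumption: exactly equal arms under 1:1 randomization

readings = {"AI-supported": 61_248, "standard double reading": 109_692}

for arm, count in readings.items():
    print(f"{arm}: {count / ARM_SIZE:.2f} readings per participant")

reduction = 1 - readings["AI-supported"] / readings["standard double reading"]
print(f"workload reduction: {reduction:.1%}")
```

Under the equal-arm assumption, the standard arm works out to roughly two readings per participant, as double reading implies, against a little over one in the AI-supported arm.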

"Our study does not support replacing healthcare professionals with AI."

This statement from the MASAI authors is not a caveat appended to otherwise enthusiastic results — it is a direct conclusion from the trial design, which maintained radiologist decision authority throughout. The AI system in MASAI functioned as an informational input to the radiologist's read, not as an autonomous reader. The trial cannot speak to standalone AI performance because it was not designed to test it.

Supporting 2026 Prospective Evidence: GEMINI and AIMS

Two studies published simultaneously in Nature Cancer in March 2026 extend the prospective evidence base from different methodological directions. Both are important, and both carry conflicts of interest that must be weighed when interpreting their findings.

GEMINI: Prospective Workflow Modeling in NHS Grampian

The GEMINI study prospectively evaluated 10,889 women at NHS Grampian using live AI integration and simulation across 17 a-priori-specified workflow configurations. This multi-workflow design is methodologically distinctive: rather than testing a single AI deployment model, GEMINI systematically evaluated how different combinations of AI triage thresholds, reading sequences, and arbitration rules affected cancer detection, recall rates, and workload.

The primary AI workflow — triaging AI-negative cases at one threshold (OP3) combined with an AI-additional read at a second threshold (OP2) — improved CDR by 10.4% (approximately 1 additional cancer per 1,000 examinations) with a 0.8% recall rate reduction and up to 31% workload reduction. This primary workflow demonstrated superiority over routine double reading for CDR, sensitivity, PPV, and specificity simultaneously — a finding that is unusual in screening research where sensitivity and specificity gains typically trade off against each other. Three additional combined workflows showed superiority across all metrics with 33–36% workload savings.

GEMINI identified 11 additional cancers attributable to AI integration, of which 7 were invasive. Interval-cancer follow-up (3 years) was not yet available at the time of publication, which limits conclusions about the clinical significance of the additional detections.
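The "approximately 1 additional cancer per 1,000" figure is consistent with the raw counts quoted here, as a short calculation shows. It uses only the numbers in this section and makes no assumption about how the study itself computed the rate.

```python
WOMEN_SCREENED = 10_889
ADDITIONAL_CANCERS = 11
ADDITIONAL_INVASIVE = 7

per_1000 = ADDITIONAL_CANCERS / WOMEN_SCREENED * 1000
invasive_share = ADDITIONAL_INVASIVE / ADDITIONAL_CANCERS

print(f"additional cancers per 1,000 screened: {per_1000:.2f}")
print(f"share of additional cancers that were invasive: {invasive_share:.0%}")
```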

AIMS: AI as Second Reader in NHS Retrospective Cohort

The AIMS study evaluated 50,000 NHS women from two screening centers in a retrospective cohort design, testing whether replacing the second human reader with an AI system (Google's AI tool v1.2) was non-inferior to standard two-reader interpretation after arbitration.

After arbitration, the AI second-reader arm was non-inferior to two human readers for both sensitivity and specificity (P<0.001, 5% non-inferiority margin), with 46% fewer human screen readings overall and reading time reductions of 36–44% across the two centers. These are operationally significant findings for double-reading programs considering AI as a second-reader replacement.

However, the arbitration dynamics reveal an important nuance. Before arbitration, AI demonstrated higher sensitivity for interval cancers (32.4% vs 15.4% in the human arm) and next-round cancers (34.0% vs 12.8%). After arbitration, most of this advantage was lost: AI arm sensitivity for interval cancers fell to 8.8% and for next-round cancers to 8.1%, compared to 5.9% and 5.5% in the human arm. The arbitration process, while maintaining overall non-inferiority, appears to have substantially attenuated the AI's potential advantage for the most clinically significant cancer subtypes.
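The size of the arbitration effect is easier to see as a gap in percentage points between the two arms. The sketch below uses only the sensitivities quoted in the paragraph above; the helper name `gap` is ours.

```python
# (AI arm %, human arm %) sensitivities for AIMS as quoted in this digest.
before = {"interval": (32.4, 15.4), "next-round": (34.0, 12.8)}
after = {"interval": (8.8, 5.9), "next-round": (8.1, 5.5)}

def gap(pair: tuple[float, float]) -> float:
    """AI-minus-human sensitivity difference in percentage points."""
    ai, human = pair
    return round(ai - human, 1)

for cancer_type in before:
    print(
        f"{cancer_type}: gap {gap(before[cancer_type])} points before arbitration, "
        f"{gap(after[cancer_type])} points after"
    )
```

The AI-arm lead shrinks from about 17 and 21 percentage points to under 3 points for both cancer types.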

Arbitration volume itself increased substantially in the AI arm — by 142% at center 1 and 22% at center 2 — partially offsetting the workload reduction from eliminating the second human reader. Readers reported 'somewhat trusting' the AI tool; overcalling of calcifications and cases with prior images were identified as the primary reliability concerns. AI performance was also lower for Siemens-manufactured mammography images compared to Hologic, a device-specific performance variation that has direct implications for multi-vendor screening programs.

Evidence Limitations: What the Current Body of Research Cannot Yet Answer

Taken together, the BMJ Open 2025 systematic review and the three major prospective studies leave a set of clinically important questions unanswered. These are not peripheral concerns — they are the questions that program-level adoption decisions will ultimately turn on.

Interval-cancer outcomes at scale: MASAI's interval-cancer reduction did not reach statistical significance (p = 0.41). GEMINI and AIMS lack the 3-year follow-up required to report interval-cancer rates. The most important clinical question for screening programs, whether AI reduces interval cancers at population scale, remains unanswered because the available studies are underpowered for it.
Generalizability beyond high-resource European settings: MASAI is a single-country, single-device, single-AI-system trial conducted with moderately to highly experienced radiologists in a well-resourced national screening program. Performance in lower-resourced settings, with less experienced readers, different mammography equipment, or different AI systems, cannot be inferred from the MASAI data.
Demographic equity: No major AI mammography trial has prospectively evaluated performance stratified by race, ethnicity, or socioeconomic status. Device-specific performance variation (Siemens vs Hologic in AIMS) suggests that systematic disparities may exist and are not currently being measured.
Enriched-dataset inflation: The majority of studies in the BMJ Open review relied on enriched retrospective datasets where cancer prevalence was artificially elevated. Performance metrics from these studies are not directly applicable to population screening programs with typical prevalence rates of 5–8 cancers per 1,000 examinations.
Translation to health outcomes: A 2025 JMIR systematic review of 20 implementation studies found that cancer detection by AI 'has not been shown to translate into improved health outcomes' in most included studies. Detection rate improvements at screening are a surrogate endpoint; mortality reduction data for AI-assisted mammography does not yet exist.
Governance and implementation framework gaps: The same JMIR review identified reproducibility, evidentiary standards, technological concerns, trust issues, and post-adoption uncertainty as key implementation barriers, and proposed a governance framework using the Consolidated Framework for Implementation Research. Most current deployments lack the continuous monitoring infrastructure that the MASAI authors explicitly recommend.

For comparison, the diabetic retinopathy AI screening evidence base is more mature in one specific respect: FDA-cleared autonomous AI readers for diabetic retinopathy have real-world deployment data from primary care settings, including post-market performance monitoring. AI mammography has not yet reached equivalent deployment maturity despite a larger volume of clinical trial evidence.

Clinical and Policy Takeaways: When AI Is and Is Not Supported by the Evidence

The evidence synthesis supports a set of structured conclusions that distinguish between what AI can currently deliver in breast cancer screening and what remains outside the current evidence base.

What the current evidence supports:

AI as second reader in double-reading programs: Sensitivity maintained or improved by up to +9 percentage points; specificity preserved; workload reduced substantially. Supported by the MASAI RCT, GEMINI prospective evaluation, and multiple prospective paired-reader studies included in the BMJ Open review.
AI in triage and decision-referral roles with conservative threshold calibration: Reading volume reduction of 40–90% with non-inferior cancer detection. Supported by multiple prospective studies. Conservative threshold calibration is essential — aggressive thresholds that maximize workload reduction consistently show detection trade-offs.
Continuous monitoring as a condition of implementation: The MASAI authors explicitly state that introducing AI must be done with continuous monitoring to capture how AI influences regional and national screening programs. This is not an optional governance add-on — it is a requirement for responsible deployment given the single-country, single-system limitations of the current RCT evidence.

What the current evidence does not support:

Standalone AI as sole reader: No prospective RCT has evaluated this configuration. Interval-cancer follow-up is incomplete in all retrospective studies that have attempted to model it. The BMJ Open review, the MASAI authors, and the NEJM commentary all explicitly state that standalone AI is not supported by the current evidence base.
Replacement of radiologists: The MASAI trial maintained radiologist decision authority throughout and was not designed to test autonomous AI reading. The authors' statement that results 'do not support replacing healthcare professionals with AI' reflects the trial design, not merely editorial caution.
Claims of mortality reduction: No long-term mortality data exists for AI-assisted mammography screening. Detection rate improvements are a surrogate endpoint. Mortality benefit cannot be inferred from the current evidence.
Generalization to diverse populations without additional validation: The lack of race- and ethnicity-stratified results in the major trials, combined with documented device-specific performance variation (AIMS), means that AI performance in demographically diverse screening populations is not established by the existing evidence.

A July 2025 commentary in the New England Journal of Medicine (Hyams, Kerlikowske, and Redberg) raised a specific policy concern: AI mammography tools are being marketed directly to consumers with claims of improved detection despite the absence of evidence of clinical effectiveness at the time of marketing. The commentary's argument — that regulatory authorization of AI mammography tools does not constitute evidence of clinical effectiveness and should not be represented as such — is directly relevant to how clinicians and program administrators evaluate vendor claims.

The parallel evidence structure in AI-assisted colonoscopy polyp detection offers a useful comparison: that domain also has RCT-level evidence for AI as a detection-support tool, with similar constraints on standalone AI conclusions and comparable evidence gaps in long-term outcome data. The pattern across AI-assisted screening domains is consistent — RCT evidence supports augmentation roles; standalone replacement is not yet established in any major screening domain.

Structured Study Record

Structured research digest metadata for the BMJ Open 2025 systematic review of AI performance in breast cancer screening programs, with supplementary data from MASAI, GEMINI, and AIMS.
Field	Value
Study Design	Systematic review with narrative synthesis (anchor study: BMJ Open 2025); quality appraisal using QUADAS-2 combined with an AI-specific critical appraisal tool; 31 included studies spanning RCTs, prospective paired-reader studies, registry-based implementations, and retrospective simulations
Source Journal	BMJ Open (anchor systematic review); The Lancet (MASAI full results, 2026); The Lancet Digital Health (MASAI interim analysis, 2024/2025); Nature Cancer (GEMINI and AIMS, March 2026)
Publication Date	BMJ Open systematic review: 2025; MASAI full results (The Lancet): January/February 2026; GEMINI and AIMS (Nature Cancer): March 2026
Clinical Domain	Radiology / breast imaging — population-level breast cancer screening programs
AI Technique	Deep learning–based computer vision; multiple commercial and research systems evaluated across included studies; MASAI: Transpara v1.7.0 (ScreenPoint Medical); AIMS: Google AI tool v1.2; GEMINI: Mia v.3 (Kheiron Medical Technologies Ltd.)
Key Performance Metric	Second reader: sensitivity up to +9 percentage points vs. standard reading; MASAI RCT: CDR 6.4 vs. 5.0 per 1,000 (p=0.0021), sensitivity 80.5% vs. 73.8% (p=0.031), 12% relative interval-cancer reduction, 44% workload reduction; GEMINI: 10.4% CDR improvement, up to 31% workload reduction; AIMS: noninferior sensitivity/specificity after arbitration, 46% fewer human screen readings
Funding & Conflicts of Interest	BMJ Open systematic review: no competing interests declared. MASAI: funded by the Swedish Cancer Society, Confederation of Regional Cancer Centres, and Swedish governmental clinical research funding; no AI vendor funding disclosed. AIMS: Google LLC provided the AI tool; 10 Google employees are co-authors; two NHS clinicians are paid Google consultants; Royal Surrey NHS Foundation Trust received Google funding for the OPTIMAM database — material conflict of interest. GEMINI: three co-authors (Annie Ng, Georgia Fox, Cary Oberije) were employees of Kheiron Medical Technologies Ltd., developer of the evaluated AI system — conflict of interest disclosed.
Key Limitations	Heterogeneity across study designs and AI systems prevents meta-analytic pooling; reliance on enriched retrospective datasets in most included studies inflates performance estimates; incomplete interval-cancer follow-up in GEMINI and AIMS; MASAI generalizability limited by single-country (Sweden), single-device (GE), single-AI-system (Transpara) design and highly experienced radiologist cohort; no race or ethnicity data collected in any major trial; device-specific AI performance variation documented in AIMS (Siemens vs. Hologic); cancer detection improvements not yet shown to translate to mortality reduction; standalone AI not supported by prospective evidence
PMID / DOI	Anchor study (BMJ Open 2025 systematic review): PMID 41475802

For adjacent AI diagnostic evidence across multiple clinical domains, the Q2 2026 AI and Medical Diagnosis Research Radar tracks newly published studies and preprints in structured notice format.
