
The Clinical Problem AI Entered: Colonoscopy's Detection Gap
Colonoscopy remains the primary tool for colorectal cancer (CRC) prevention in the United States, but its effectiveness depends heavily on a variable that clinicians have long struggled to standardize: how reliably individual endoscopists detect adenomas. Adenoma detection rate (ADR) — the proportion of screening colonoscopies in which at least one adenoma is found — varies dramatically across endoscopists, and lower ADR is independently associated with higher rates of interval CRC, the cancers that develop between scheduled colonoscopies.
Adenoma miss rates in tandem colonoscopy studies have historically ranged from 20% to over 40%, with diminutive polyps (≤5 mm) accounting for the majority of missed lesions. This detection gap created a clear target for AI: if a computer vision system could flag suspicious mucosal areas in real time, endoscopists might catch lesions they would otherwise overlook. The logic was straightforward enough to attract substantial investment, regulatory attention, and clinical trial activity.
What followed was one of the most rapidly accumulated evidence bases in procedural medicine — more than 44 randomized controlled trials involving over 30,000 participants. Yet as of mid-2026, no major professional society has issued an unqualified recommendation in favor of AI-assisted colonoscopy. Understanding why requires working through the evidence carefully, not just the headline ADR numbers. This article provides that analysis, with particular focus on the 2025 AGA living guideline outcome and what next-generation platforms are attempting to address. For broader context on AI applications across medical specialties, see AI in Healthcare: Specialty Landscape Overview; this article provides the GI-specific depth that broader surveys do not.
First-Generation CADe: What the RCT Evidence Actually Shows
The aggregate RCT evidence for computer-aided detection (CADe) in colonoscopy is substantial by any conventional measure. The AGA's 2025 living guideline synthesized 44 RCTs involving more than 30,000 participants, finding that CADe increased ADR from 37.4% to 44.8% (relative risk 1.22, 95% CI 1.16–1.29). A complementary meta-analysis of 28 RCTs (n=23,861) reported a 20% relative increase in ADR (RR 1.20, 95% CI 1.14–1.27) and a 55% relative reduction in adenoma miss rate (RR 0.45, 95% CI 0.37–0.54). These are not trivial effect sizes.
The picture becomes more complicated when the data are examined by lesion type and clinical consequence. The ADR benefit is driven predominantly by diminutive polyp detection. Subgroup analysis shows an incidence rate ratio of 1.46 (95% CI 1.19–1.80) for diminutive lesions, while the improvement for small and large polyps is non-significant. Detection of advanced adenomas, the lesions most directly linked to CRC risk, shows no significant improvement across the RCT body (RR 1.08, 95% CI 0.95–1.22). Sessile serrated lesion detection shows mixed results: ENDO-AID studies show a 60% improvement, but the overall pooled estimate is non-significant.
The non-neoplastic resection rate increased by 39% (RR 1.39, 95% CI 1.23–1.57) in the same meta-analysis — a direct measure of procedural waste, representing polypectomies performed on tissue that carries no malignant potential. This figure is not a secondary concern; it is one of the central quantified harms that the AGA panel weighed against the detection benefit.
GI Genius (Medtronic), the most widely studied individual system, received CE Mark in 2019 and FDA De Novo clearance in 2021. A device-specific meta-analysis of 7 GI Genius RCTs (n=9,639) found a significant ADR improvement (RR 1.12, 95% CI 1.03–1.22) and significant gains in polyp detection rate, sessile serrated lesion detection, and per-procedure lesion counts. Critically, advanced adenoma detection rate showed no significant difference (RR 1.01, 95% CI 0.90–1.13). For a detailed device profile covering GI Genius's regulatory clearance history and technical architecture, see GI Genius AI Polyp Detection: FDA De Novo Clearance, RCT Evidence, and Real-World Deployment Considerations.
| Outcome | Effect Size (RR or IRR) | 95% CI | Significance | Source |
|---|---|---|---|---|
| Adenoma detection rate (44 RCTs) | RR 1.22 | 1.16–1.29 | Significant | AGA Guideline 2025 |
| Adenoma miss rate (44 RCTs) | RR 0.47 | 0.36–0.60 | Significant | AGA Guideline 2025 |
| ADR — 28 RCTs (GIE meta-analysis) | RR 1.20 | 1.14–1.27 | Significant | Makar et al., GIE |
| Adenoma miss rate — 28 RCTs | RR 0.45 | 0.37–0.54 | Significant | Makar et al., GIE |
| Advanced adenoma detection rate | RR 1.08 | 0.95–1.22 | Non-significant | Makar et al., GIE |
| Diminutive polyp detection (IRR) | IRR 1.46 | 1.19–1.80 | Significant | Makar et al., GIE |
| Non-neoplastic resection rate | RR 1.39 | 1.23–1.57 | Significant | Makar et al., GIE |
| GI Genius ADR (7 RCTs) | RR 1.12 | 1.03–1.22 | Significant | Sattar et al., PMC |
| GI Genius advanced adenoma detection | RR 1.01 | 0.90–1.13 | Non-significant | Sattar et al., PMC |
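The "Significance" column in the table follows a simple rule: a ratio estimate is treated as significant when its 95% confidence interval excludes the null value of 1.0. The sketch below applies that rule to the rows as printed above. The function names are illustrative, and the rule is the conventional reading of a ratio CI, not necessarily the exact test each meta-analysis used.

```python
# Rows copied from the summary table: (outcome, point estimate, CI lower, CI upper).
ROWS = [
    ("Adenoma detection rate (44 RCTs)", 1.22, 1.16, 1.29),
    ("Adenoma miss rate (44 RCTs)", 0.47, 0.36, 0.60),
    ("ADR, 28 RCTs", 1.20, 1.14, 1.27),
    ("Adenoma miss rate, 28 RCTs", 0.45, 0.37, 0.54),
    ("Advanced adenoma detection rate", 1.08, 0.95, 1.22),
    ("Diminutive polyp detection (IRR)", 1.46, 1.19, 1.80),
    ("Non-neoplastic resection rate", 1.39, 1.23, 1.57),
    ("GI Genius ADR (7 RCTs)", 1.12, 1.03, 1.22),
    ("GI Genius advanced adenoma detection", 1.01, 0.90, 1.13),
]


def excludes_null(lower, upper, null=1.0):
    """Return True when the interval lies entirely on one side of the null value."""
    return lower > null or upper < null


def classify(rows):
    """Map each outcome name to 'Significant' or 'Non-significant'."""
    return {
        name: "Significant" if excludes_null(lower, upper) else "Non-significant"
        for name, _estimate, lower, upper in rows
    }


if __name__ == "__main__":
    for name, label in classify(ROWS).items():
        print(f"{name}: {label}")
```

Run against the table, the two advanced adenoma rows are the only ones whose intervals cross 1.0, which is the pattern the surrounding text emphasizes.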
The AGA 2025 Living Guideline: No Recommendation and Why
In early 2025, the American Gastroenterological Association published a living clinical practice guideline on CADe-assisted colonoscopy — the first major professional society guideline to formally address the technology with a full GRADE evidence review. The outcome surprised many in the field: the panel made no recommendation for or against CADe use.
To understand why, it is necessary to understand how the GRADE framework orders outcomes. Not all outcomes carry equal weight in GRADE methodology. The AGA panel designated CRC incidence, CRC mortality, and post-colonoscopy CRC as critical outcomes — the endpoints that directly determine whether a patient benefits in a clinically meaningful way. ADR was designated as an important but not critical surrogate outcome. This distinction is not semantic. It means that even a robust, consistent ADR improvement cannot by itself justify a positive recommendation if the downstream effect on actual cancer incidence and death remains uncertain.
No CADe RCT has been conducted with CRC incidence or mortality as a primary endpoint. To bridge this gap, the AGA panel commissioned a microsimulation modeling study to translate the observed ADR improvement into projected long-term cancer outcomes. The modeled estimates: 11 fewer CRC cases per 10,000 individuals and 2 fewer CRC deaths per 10,000 individuals over 10 years — both rated as very low certainty. The same model projected 635 more surveillance colonoscopies per 10,000 individuals (low certainty), a quantified downstream harm driven by increased detection of low-risk diminutive polyps that generate follow-up procedures under current surveillance guidelines.
The panel's deliberation took an unusual turn. Members initially drafted a conditional recommendation favoring CADe use, then reversed course following the public comment period. The final vote produced no consensus for either a positive or negative recommendation, resulting in the no-recommendation outcome. The panel chair described the core challenge in accessible terms:
"How this translates into a reduction in CRC incidence or death is where we were uncertain. Our best effort at trying to translate the ADR and other endoscopy outcomes to CRC incidence and CRC death relied on the modeling study, which included a lot of assumptions, which also contributed to our overall lower certainty."
The AGA's position should be distinguished from that of the BMJ Rapid Recommendation panel, which reviewed the same underlying evidence and reached a different conclusion: a recommendation against routine CADe use. The divergence does not reflect a factual disagreement about the data. Both panels worked from the same RCT body and acknowledged the same uncertainty about long-term outcomes. The difference lies in how each panel weighted uncertain harms against uncertain benefits — a values-based judgment that GRADE explicitly acknowledges can produce different recommendations from the same evidence base. As one independent commentator noted in coverage of the AGA guideline, when evidence for benefit is uncertain, underlying values become the decisive variable, and different priorities could lead guideline bodies to recommend either for or against CADe.
| Dimension | AGA 2025 Living Guideline | BMJ Rapid Recommendation |
|---|---|---|
| Recommendation | No recommendation (no consensus) | Recommends against routine CADe |
| Evidence reviewed | 44 RCTs, >30,000 participants | Overlapping RCT body |
| Critical outcome emphasis | CRC incidence, CRC mortality | CRC incidence, CRC mortality |
| ADR classification | Important but not critical surrogate | Important but not critical surrogate |
| Modeling used | Yes — microsimulation (very low certainty) | Yes — similar modeling approach |
| Primary divergence | Uncertainty insufficient for recommendation either way | Uncertain harm weighed against uncertain benefit — tilts against |
| Panel reversal noted | Yes — conditional recommendation drafted then reversed after public comment | Not reported |
The RCT-to-Real-World Effectiveness Gap: Mechanisms and Evidence
The RCT evidence base for CADe is large and internally consistent. The real-world evidence is not. This divergence is not a minor statistical artifact — it is a reproducible finding with identifiable mechanistic explanations, and it was central to the AGA panel's reasoning.
The AGA guideline incorporated a systematic review of 8 nonrandomized real-world studies (n=9,782) that showed no significant ADR improvement with CADe (RR 1.11, 95% CI 0.97–1.28). A broader meta-analysis of 12 nonrandomized studies (n=11,660) found that while overall ADR appeared marginally higher with CADe (36.3% vs 35.8%; RR 1.13), this difference was attributable only to prospective studies and disappeared in retrospective analyses (RR 1.12, 95% CI 0.92–1.36). Among studies specifically evaluating GI Genius in real-world settings, ADR showed no significant difference with versus without CADe (RR 0.96, 95% CI 0.85–1.07).
A mixed-methods study of GI Genius implementation at Stanford provides the most granular mechanistic account of why real-world performance falls short. The pragmatic implementation trial showed null ADR results despite high endoscopist enthusiasm for the technology (82–87% favorable before and after use). Qualitative investigation identified several converging explanations:
- Alert fatigue from false positives: 61% of colonoscopists cited frequent false alerts triggered by bubbles and debris as a primary concern, and 57% reported muting the auditory alert — effectively disabling the system's real-time notification function.
- Mucosal exposure failure: Colonoscopists identified that AI can only detect polyps on mucosa the endoscopist has actually exposed. Inadequate mucosal visualization — caused by suboptimal withdrawal technique, bowel preparation quality, or patient anatomy — cannot be corrected by a detection algorithm. The AI has no mechanism to guide the endoscopist to examine mucosa they have not reached.
- Ceiling effects at high-ADR centers: Academic medical centers with expert endoscopists and high baseline ADR have less room for improvement. Studies conducted at these sites may show null results even if the technology genuinely helps lower-ADR endoscopists in community practice.
- Hawthorne effects in RCTs: Endoscopists who know they are being observed in a clinical trial may perform more carefully regardless of AI assistance, inflating the apparent benefit of CADe in controlled settings relative to routine practice.
The null real-world results are not universal. The COLO-DETECT pragmatic RCT — conducted across 12 NHS hospitals in England with 2,032 participants — found a significant ADR improvement of 8.3 percentage points (56.6% vs 48.4%; adjusted OR 1.47) and a 30% increase in mean adenomas per procedure (IRR 1.30). This trial is a meaningful counterpoint: it demonstrates that CADe can improve detection in routine NHS practice when implemented in a structured trial context. The key question the COLO-DETECT result raises is whether the trial context itself — the structure, monitoring, and protocol adherence it imposes — is doing some of the work that gets attributed to the AI.
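As a rough check on the COLO-DETECT figures, the unadjusted effect measures can be recomputed from the two reported proportions. This is a sketch using only the percentages quoted above, with hypothetical helper names. The raw difference comes out at about 8.2 points and the crude odds ratio at about 1.39, slightly below the reported 8.3 points and 1.47. The likely explanation is that the published estimates are adjusted, but that is an assumption here, not something the figures above confirm.

```python
def odds(p):
    """Convert a proportion in (0, 1) to odds."""
    return p / (1.0 - p)


def crude_comparison(p_cade, p_control):
    """Unadjusted effect measures from two detection proportions."""
    return {
        "risk_difference_points": (p_cade - p_control) * 100,
        "risk_ratio": p_cade / p_control,
        "odds_ratio": odds(p_cade) / odds(p_control),
    }


if __name__ == "__main__":
    # Proportions as quoted in the text: 56.6% with CADe vs 48.4% without.
    for measure, value in crude_comparison(0.566, 0.484).items():
        print(f"{measure}: {value:.3f}")
```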
Overdiagnosis and Surveillance Burden: Quantifying the Downstream Harm
The surveillance burden generated by CADe is not a theoretical concern — it is the primary quantified harm in the AGA's evidence-to-decision framework, and it deserves treatment as such rather than as a footnote to the detection benefit.
The microsimulation model used in the AGA guideline estimated 635 additional surveillance colonoscopies per 10,000 individuals over 10 years — compared to 11 fewer CRC cases and 2 fewer CRC deaths over the same period. Even accepting these estimates at face value (and the panel rated them as low to very low certainty), the ratio of surveillance procedures generated to cancers prevented is approximately 58:1. This is not a favorable tradeoff if the surveillance colonoscopies carry their own procedural risk, patient burden, and healthcare cost.
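The ratio quoted above is simple division on the modeled estimates, and it can be reproduced directly. The sketch below uses only the three figures from the microsimulation as reported in this article; the dictionary keys and function name are illustrative. The same division against modeled deaths averted gives a much larger ratio, and both inherit the low to very low certainty of the inputs.

```python
# Modeled estimates per 10,000 individuals over 10 years, as quoted in the text.
PER_10K = {
    "extra_surveillance_colonoscopies": 635,
    "crc_cases_averted": 11,
    "crc_deaths_averted": 2,
}


def surveillance_per_event(extra_procedures, events_averted):
    """Additional surveillance colonoscopies per modeled event averted."""
    if events_averted <= 0:
        raise ValueError("events_averted must be positive")
    return extra_procedures / events_averted


if __name__ == "__main__":
    extra = PER_10K["extra_surveillance_colonoscopies"]
    per_case = surveillance_per_event(extra, PER_10K["crc_cases_averted"])
    per_death = surveillance_per_event(extra, PER_10K["crc_deaths_averted"])
    print(f"Per CRC case averted:  {per_case:.1f}")
    print(f"Per CRC death averted: {per_death:.1f}")
```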
The mechanism driving this surveillance burden is the concentration of CADe benefit in diminutive polyps (≤5 mm). Current post-polypectomy surveillance guidelines generally do not recommend shortened surveillance intervals for patients with only diminutive polyps — these lesions have very low malignant potential. But under some guideline interpretations and in practice, detecting additional diminutive polyps can push patients into surveillance categories that generate follow-up colonoscopies. When AI detects more diminutive lesions that would otherwise have been missed, it may be generating surveillance obligation without proportionate CRC prevention benefit.
The 39% increase in non-neoplastic resection rates compounds this problem. Polypectomies performed on non-neoplastic tissue — hyperplastic polyps, normal mucosal folds, inflammatory changes — carry procedural risk (bleeding, perforation) and cost without any cancer prevention benefit. A 39% increase in this category across a large colonoscopy program represents substantial procedural waste.
- 635 more surveillance colonoscopies per 10,000 individuals (low certainty) — the primary modeled harm in the AGA framework.
- 39% increase in non-neoplastic resection rates — procedural waste with direct cost and patient burden implications.
- Benefit concentrated in diminutive polyps that do not typically require shortened surveillance intervals — generating follow-up without proportionate CRC prevention.
- No significant improvement in advanced adenoma detection — the lesions most directly linked to CRC risk show no consistent CADe benefit across the RCT body.
Next-Generation AI Colonoscopy: Beyond Detection to Characterization, Sizing, and Quality
The AGA guideline identified specific knowledge gaps that, if addressed, could shift the evidence calculus: no trials combining CADe with CADx characterization strategies; no long-term CRC outcome data; limited data on how AI performs across diverse patient populations. The next generation of AI colonoscopy platforms is structured as a direct response to these gaps — not by producing the long-term RCT data the field needs, but by expanding the AI scope beyond detection to address the overdiagnosis problem at its source.
On February 23, 2026, Medtronic announced CE Mark approval for ColonPRO, the fourth-generation software for the GI Genius platform. The CE Mark covers the European Union; as of this writing, US FDA clearance for ColonPRO has not been announced. ColonPRO introduces three functional modules beyond the core CADe detection capability:

- CADs (real-time polyp sizing): Provides automated polyp size estimation during the procedure. According to Medtronic's press release, the METER study validated CADs at 85.8% sizing accuracy, compared to less than 60% accuracy in routine practice. Accurate real-time sizing is clinically significant because polyp size is a key determinant of surveillance interval assignment — more accurate sizing could reduce both under- and over-surveillance.
- CADx (optical characterization for diagnose-and-leave): Supports real-time tissue characterization to distinguish neoplastic from non-neoplastic polyps. The clinical application is diagnose-and-leave strategies for diminutive polyps: if AI can reliably characterize a small polyp as non-neoplastic at the time of detection, the endoscopist can leave it in situ rather than resecting it — directly addressing the non-neoplastic resection rate problem. Medtronic's press release references the PRACTICE RCT as demonstrating non-inferiority of AI-assisted optical diagnosis for leaving diminutive polyps in situ. Note: the PRACTICE RCT full text was not directly accessed; this attribution is based on the Medtronic press release.
- CADq (automated procedural quality monitoring): Provides automated measurement of withdrawal time, cecal intubation confirmation, and bowel preparation cleanliness assessment. This module directly targets the mucosal exposure problem identified in the Stanford human-AI interaction research: CADe cannot detect polyps on mucosa the endoscopist has not examined, and CADq provides objective feedback on whether the examination has been thorough.
The updated GI Genius hardware accompanying ColonPRO introduces the AI Access platform — an architecture designed to support multiple simultaneous AI applications on a single hardware unit, enabling future module additions without hardware replacement. This is an infrastructure investment as much as a clinical capability expansion.
| Module | Function | Clinical Problem Addressed | Evidence Basis (per Medtronic press release) | Regulatory Status |
|---|---|---|---|---|
| CADe | Real-time polyp detection | Adenoma miss rate | 44+ RCTs, multiple meta-analyses | CE Mark 2019, FDA De Novo 2021 (GI Genius) |
| CADs | Real-time polyp sizing | Surveillance interval miscalculation; over-surveillance | METER study (85.8% accuracy vs <60% routine) | CE Mark Feb 2026 (EU only) |
| CADx | Optical polyp characterization | Non-neoplastic resections; overdiagnosis | PRACTICE RCT (non-inferiority for diagnose-and-leave) | CE Mark Feb 2026 (EU only) |
| CADq | Procedural quality monitoring | Mucosal exposure failure; withdrawal time; bowel prep | Referenced in Medtronic press release; no independent study cited | CE Mark Feb 2026 (EU only) |
Competitive Landscape: Other Cleared Systems and What Head-to-Head Data Show
GI Genius is not the only CADe system with regulatory clearance. ENDO-AID (Olympus), CAD EYE (Fujifilm), and EndoScreener have all been evaluated in RCTs and appear in the meta-analysis literature. The aggregate ADR effect sizes across these systems fall within a similar range (RR 1.14–1.27), which has two practical implications: first, the evidence for CADe as a category is more robust than the evidence for any individual system; second, ADR improvement alone cannot serve as a meaningful differentiator for procurement decisions.
A benchmarking study on 101 annotated colonoscopy videos comparing GI Genius v1, GI Genius v2, ENDO-AID (Types A and B), and EndoMind provides the most granular head-to-head technical comparison available. Key findings from this study:
- GI Genius v2 showed meaningfully higher per-frame sensitivity than v1 (67.85% vs 50.63%) and faster first detection time (607 ms vs 1,510 ms), demonstrating that iterative software improvement within a platform produces clinically relevant performance differences.
- GI Genius v2 also had a higher false-positive rate than v1 (3.80% vs 2.75%) — a tradeoff that directly connects to the alert fatigue problem documented in real-world studies.
- ENDO-AID Type B had the lowest false-positive rate of all systems tested (0.63%) but also lower sensitivity (52.95%) — representing a different point on the sensitivity/specificity tradeoff curve.
- All systems showed substantially reduced sensitivity for flat lesions (0-IIa morphology, approximately 51.7% median sensitivity) compared to pedunculated or sessile polyps — a consistent limitation across the CADe category that has direct implications for sessile serrated lesion detection.
| System | Per-Frame Sensitivity | First Detection Time | False-Positive Rate | Notable Characteristic |
|---|---|---|---|---|
| GI Genius v1 | 50.63% | 1,510 ms | 2.75% | Earlier generation; lower sensitivity but lower false-positive rate than v2 |
| GI Genius v2 | 67.85% | 607 ms | 3.80% | Higher sensitivity and speed; higher false-positive rate |
| ENDO-AID Type B | 52.95% | Not reported | 0.63% | Lowest false-positive rate; lower sensitivity |
| ENDO-AID Type A | ~60% (estimated from study) | Not reported | ~1.5% (estimated) | Intermediate profile |
| EndoMind | Reported in study | Not reported | Reported in study | Included in benchmarking; specific figures in source |
This benchmarking data illustrates a broader problem for the field: cross-trial comparisons of CADe systems are unreliable because studies use different software versions, different patient populations, and different study designs. A system that performed at one sensitivity level in a 2022 RCT may perform substantially differently in its current software version. The GI Genius v1-to-v2 sensitivity improvement of 17 percentage points is a concrete example of how quickly system performance can shift — and why ADR results from older trials may not reflect current system capabilities.
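The version-to-version shift can be made concrete by computing the deltas from the benchmarking figures in the table above. The sketch below is illustrative only: the `BenchmarkResult` class and `compare` function are hypothetical names, and only systems with exact figures in the table are included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkResult:
    name: str
    sensitivity_pct: float               # per-frame sensitivity
    false_positive_pct: float            # false-positive rate
    first_detection_ms: Optional[float]  # None where the table says "Not reported"


# Figures copied from the benchmarking table.
GI_GENIUS_V1 = BenchmarkResult("GI Genius v1", 50.63, 2.75, 1510.0)
GI_GENIUS_V2 = BenchmarkResult("GI Genius v2", 67.85, 3.80, 607.0)
ENDO_AID_B = BenchmarkResult("ENDO-AID Type B", 52.95, 0.63, None)


def compare(old, new):
    """Differences between two benchmark results (new relative to old)."""
    speedup = None
    if old.first_detection_ms and new.first_detection_ms:
        speedup = old.first_detection_ms / new.first_detection_ms
    return {
        "sensitivity_gain_points": new.sensitivity_pct - old.sensitivity_pct,
        "false_positive_change_points": new.false_positive_pct - old.false_positive_pct,
        "false_positive_ratio": new.false_positive_pct / old.false_positive_pct,
        "detection_speedup": speedup,
    }


if __name__ == "__main__":
    for key, value in compare(GI_GENIUS_V1, GI_GENIUS_V2).items():
        print(f"{key}: {value:.2f}")
```

On these figures, v2 gains about 17 points of per-frame sensitivity and reaches first detection roughly 2.5 times faster, while its false-positive rate rises by about 1 point, or roughly 38% in relative terms.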
Outstanding Evidence Gaps and What a Guideline Recommendation Requires
The AGA guideline was explicit about what would be required to revisit the no-recommendation outcome. The panel identified several specific knowledge gaps that, if addressed, could shift the evidence calculus toward a positive or negative recommendation:
- No CADe RCT has CRC incidence or mortality as a primary endpoint. This is the foundational gap. The entire GRADE downweighting of the ADR evidence stems from the absence of direct evidence on the outcomes that matter most to patients. A long-term RCT or prospective cohort study powered for CRC incidence would be transformative — but such a study would require tens of thousands of participants followed for a decade or more.
- No trials assess CADe plus CADx combination strategies. The AGA panel specifically noted this gap. If CADx characterization can reliably support diagnose-and-leave for diminutive polyps, it could directly reduce the surveillance burden that is the primary quantified harm — but this needs prospective RCT evidence, not just non-inferiority data from optical diagnosis studies.
- Limited data on patient values, cost-effectiveness, and equity implications. The AGA panel could not complete a cost-effectiveness analysis due to insufficient data. Patient preference data on the tradeoff between potential cancer prevention and increased surveillance burden are similarly sparse. Equity implications — whether CADe benefits are distributed equitably across patient populations of different racial, ethnic, and socioeconomic backgrounds — have not been systematically studied.
- Most RCT evidence comes from non-US settings. The majority of CADe RCTs were conducted in Italy, China, the UK, and Japan. US real-world pragmatic data (including the Stanford implementation study) tend to show attenuated or null effects. Generalizability to US community practice — where endoscopist expertise, case mix, and workflow differ from academic trial settings — remains a meaningful concern.
The AGA panel chair indicated plans to revisit the guideline in one to two years, with the expectation that newer software versions may perform better, particularly for sessile serrated polyp detection, a lesion category where current CADe systems show inconsistent results. This signals that the guideline is not a permanent no but a conditional pause pending better evidence.
This evidence challenge is not unique to gastroenterology. The gap between regulatory clearance and guideline-level clinical evidence is a recurring pattern across AI-assisted imaging. Readers interested in how analogous evidence challenges have played out in radiology AI can consult FDA-Cleared Radiology AI: Mapping the Landscape and the Clinical Evidence Gap for a comparative perspective.
Clinical Decision Framework: What Endoscopists and Health Systems Should Weigh Now
The AGA's no-recommendation outcome does not resolve the adoption decision — it transfers it back to individual endoscopists and health systems, who must weigh the available evidence against their specific clinical context. The following factors are relevant to that decision. This section describes factors to consider, not a recommendation for or against adoption.
Endoscopist-Level Considerations
- Baseline ADR: Endoscopists with high baseline ADR may be approaching a ceiling where CADe produces minimal additional detection. The RCT evidence shows consistent ADR benefit across expert and non-expert subgroups (19% improvement in expert-only subgroup), but real-world null results at high-ADR academic centers suggest ceiling effects are operationally meaningful. Endoscopists with lower baseline ADR may see greater benefit.
- Alert management: The Stanford pragmatic study found that 57% of endoscopists muted the auditory alert due to false-positive burden. If alert configuration cannot be adjusted to reduce false-positive frequency, the system may be effectively disabled in practice. Understanding a specific system's false-positive rate profile — and whether alert thresholds can be tuned — is relevant to adoption planning.
- Withdrawal technique and mucosal exposure: AI cannot detect polyps on unexposed mucosa. Endoscopists whose detection gap is primarily attributable to technique rather than visual recognition may not benefit from CADe and may benefit more from quality monitoring tools like CADq.
Health System-Level Considerations
- Surveillance capacity: The modeled 635 additional surveillance colonoscopies per 10,000 individuals represents a real capacity and cost implication. Health systems with constrained endoscopy capacity or long scheduling backlogs should explicitly model the downstream referral volume before adopting CADe at scale.
- Software version and system selection: Performance differs meaningfully between software versions and between systems. RCT evidence from a 2021 trial may not reflect the current version of the system being evaluated. Procurement decisions should account for current software capabilities, not historical trial data.
- Implementation structure: The COLO-DETECT pragmatic RCT suggests that structured implementation in a well-monitored context can produce real-world ADR gains. The Stanford pragmatic study suggests that unstructured deployment does not. Training protocols, alert configuration, and staff engagement are not secondary operational details — they may determine whether the technology produces any benefit at all.
- Reimbursement: As of mid-2026, specific CADe billing codes or CMS coverage decisions for AI-assisted colonoscopy have not been established. The absence of dedicated reimbursement coding is a practical adoption consideration for systems evaluating cost recovery.
- Next-generation module readiness: For US-based systems, ColonPRO's CE Mark is EU-only as of February 2026. Health systems evaluating the full CADe+CADx+CADs+CADq platform should verify current US FDA regulatory status before procurement planning.
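For the surveillance-capacity consideration above, a first-pass projection can scale the modeled per-10,000 estimate to a program's own screening volume. The sketch below assumes the estimate scales linearly with the number of individuals screened and that the extra procedures are spread evenly across the 10-year horizon. Both are simplifying assumptions, not properties of the AGA model. The 25,000-person program is a hypothetical example, and the function names are illustrative.

```python
# Modeled estimate quoted in the text: 635 extra surveillance colonoscopies
# per 10,000 individuals over 10 years (low certainty).
EXTRA_SURVEILLANCE_PER_10K = 635
HORIZON_YEARS = 10


def projected_extra_surveillance(screened_individuals, per_10k=EXTRA_SURVEILLANCE_PER_10K):
    """Linearly scale the per-10,000 estimate to a program's screened population."""
    if screened_individuals < 0:
        raise ValueError("screened_individuals must be non-negative")
    return screened_individuals * per_10k / 10_000


def average_per_year(total_procedures, horizon_years=HORIZON_YEARS):
    """Spread the total evenly over the modeling horizon (a simplifying assumption)."""
    return total_procedures / horizon_years


if __name__ == "__main__":
    program_size = 25_000  # hypothetical number of individuals screened with CADe
    total = projected_extra_surveillance(program_size)
    print(f"Extra surveillance colonoscopies over 10 years: {total:.1f}")
    print(f"Average per year: {average_per_year(total):.2f}")
```

A real capacity model would also need local surveillance-interval practice and the timing of follow-up procedures, which this sketch does not represent.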
The AI colonoscopy field is at a genuine inflection point. The evidence base is larger than for almost any other AI application in procedural medicine, yet it has not produced the guideline clarity that evidence volume might suggest. The reason is not that the RCTs are poorly designed or that the ADR improvement is illusory. It is that ADR improvement, however consistently demonstrated, does not directly answer the question GRADE methodology treats as critical: does this technology reduce the number of patients who develop or die from colorectal cancer? Until that question is answered with direct evidence or high-certainty modeling, the field will continue to operate in the space between substantial RCT data and unresolved guideline uncertainty.
