AI-Assisted Colonoscopy Polyp Detection: What the RCT Evidence Actually Shows

Systematic review

A structured analysis of randomized controlled trial evidence for AI-assisted colonoscopy polyp detection, covering adenoma detection rates, study design limitations, population representativeness gaps, and what the current evidence base does and does not support for clinical adoption.

Colorectal cancer remains one of the most preventable cancers when caught early, and colonoscopy is the primary screening instrument. The problem is well-documented: endoscopist adenoma detection rates (ADR) vary substantially across practitioners — sometimes by a factor of two or more — and missed polyps are a recognized contributor to interval cancers. AI-assisted polyp detection entered this gap as a computer-aided detection (CADe) layer that flags suspicious regions in real time during the procedure.

The volume of RCT evidence for this application is now larger than for almost any other AI-assisted endoscopic tool. That density of trials makes it worth examining carefully — not just to confirm that AI raises ADR in controlled settings, but to interrogate what the trials actually measure, where they diverge, and what they leave unresolved.

The Core Clinical Claim and How Trials Test It

The primary endpoint across most colonoscopy AI RCTs is the adenoma detection rate: the proportion of colonoscopies in which at least one adenoma is found. ADR is used as a proxy for screening quality because higher ADR is associated with lower interval cancer risk in longitudinal data. A secondary metric used in many trials is adenomas per colonoscopy (APC), which captures whether AI helps find additional lesions in the same patient rather than just flipping the binary detection flag.
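The difference between the two metrics is easiest to see on a small example. The sketch below is illustrative only: the per-procedure adenoma counts are hypothetical, and the function names are not drawn from any trial's analysis code.

```python
from typing import Sequence


def adr(adenoma_counts: Sequence[int]) -> float:
    """Adenoma detection rate: share of colonoscopies with >= 1 adenoma."""
    if not adenoma_counts:
        raise ValueError("need at least one colonoscopy")
    return sum(1 for c in adenoma_counts if c >= 1) / len(adenoma_counts)


def apc(adenoma_counts: Sequence[int]) -> float:
    """Adenomas per colonoscopy: mean adenoma count per procedure."""
    if not adenoma_counts:
        raise ValueError("need at least one colonoscopy")
    return sum(adenoma_counts) / len(adenoma_counts)


# Hypothetical adenoma counts for ten procedures in each arm.
control = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0]
with_ai = [0, 0, 2, 0, 3, 0, 0, 2, 0, 0]

print(f"control: ADR={adr(control):.2f}, APC={apc(control):.2f}")
print(f"with AI: ADR={adr(with_ai):.2f}, APC={apc(with_ai):.2f}")
```

In this made-up data the same three patients are adenoma-positive in both arms, so ADR is identical while APC rises. That is the situation APC is meant to capture.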

Most trials use a parallel-group design: patients are randomized to colonoscopy with AI assistance or to standard colonoscopy, with participating endoscopists performing procedures in both arms. Some trials use a tandem design, where the same patient undergoes two sequential colonoscopies, one with AI and one without, allowing direct comparison of miss rates. Each design has trade-offs: parallel-group trials are more pragmatic but make it harder to separate the AI effect from endoscopist variation; tandem designs isolate the miss rate but introduce the confound of a second procedure in the same colon.
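For tandem designs, the miss rate is commonly defined as the adenomas found only on the second pass divided by all adenomas found across both passes. The sketch below assumes that definition, which this article does not spell out; the counts are hypothetical.

```python
def adenoma_miss_rate(first_pass: int, second_pass_only: int) -> float:
    """Share of all detected adenomas that the first pass missed."""
    total = first_pass + second_pass_only
    if total == 0:
        raise ValueError("no adenomas detected in either pass")
    return second_pass_only / total


# Hypothetical tandem arms: which modality went first determines
# whose misses the second pass reveals.
ai_first = adenoma_miss_rate(first_pass=85, second_pass_only=15)
standard_first = adenoma_miss_rate(first_pass=70, second_pass_only=30)

print(f"miss rate, AI first:       {ai_first:.0%}")
print(f"miss rate, standard first: {standard_first:.0%}")
```

The second pass is itself imperfect, so this quantity is a lower bound on the true miss rate. That limitation is part of the tandem-design confound described above.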

Key RCTs: Design and Reported Outcomes

Several large RCTs have been published in high-impact journals and provide the primary evidence base. The table below summarizes the most frequently cited trials, their populations, and their primary ADR findings. Note that these figures are from the original study populations and cannot be assumed to generalize to different settings.

Selected studies of AI-assisted colonoscopy polyp detection, primary ADR outcomes. All rows are RCTs except Ladabaum et al., which is a retrospective comparison. Figures are from original publications; pp = percentage points. Populations and endoscopy practices differ substantially across settings.
| Trial / Journal | N (patients) | Setting | AI System | ADR: AI arm | ADR: Control | Absolute difference |
|---|---|---|---|---|---|---|
| Wang et al., Lancet (2019) | 1,058 | China, single-center | Deep-learning CADe (ENDOANGEL) | 29.1% | 20.3% | +8.8 pp |
| Repici et al., Gastroenterology (2020) | 685 | Italy, multicenter | GI Genius (Medtronic) | 54.8% | 40.4% | +14.4 pp |
| Ladabaum et al., Gastroenterology (2020) | N/A | US, retrospective comparison | GI Genius | Varied by endoscopist | Varied | Mixed results |
| Gong et al., Gut (2020) | 1,010 | China, multicenter | ENDOANGEL | 34.2% | 28.0% | +6.2 pp |
| Biffi et al., Gut (2021) | 660 | Italy, multicenter | GI Genius | 47.8% | 37.1% | +10.7 pp |
| Su et al., Gastroenterology (2021) | 2,112 | China, multicenter | Deep-learning CADe | 29.6% | 23.1% | +6.5 pp |
| Ahmad et al., Lancet Gastroenterology (2023) | 1,440 | UK/Europe, multicenter | GI Genius | 47.1% | 40.0% | +7.1 pp |
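
The absolute differences in the table, and the corresponding relative increases over control, can be recomputed from the two ADR columns. A minimal Python sketch using only the percentages listed above; the short trial labels and the function name are illustrative, and the Ladabaum row is omitted because it reports no single ADR pair.

```python
# (label, ADR in AI arm %, ADR in control arm %) as listed in the table.
trials = [
    ("Wang 2019", 29.1, 20.3),
    ("Repici 2020", 54.8, 40.4),
    ("Gong 2020", 34.2, 28.0),
    ("Biffi 2021", 47.8, 37.1),
    ("Su 2021", 29.6, 23.1),
    ("Ahmad 2023", 47.1, 40.0),
]


def compare(ai_adr: float, control_adr: float) -> tuple[float, float]:
    """Return (absolute difference in pp, relative increase in %)."""
    absolute_pp = ai_adr - control_adr
    relative_pct = 100 * absolute_pp / control_adr
    return absolute_pp, relative_pct


results = {name: compare(ai, ctl) for name, ai, ctl in trials}

for name, (abs_pp, rel_pct) in results.items():
    print(f"{name:<12} +{abs_pp:.1f} pp  (+{rel_pct:.0f}% relative)")
```

The same absolute gain is a larger relative gain when the control-arm ADR is low, which is one reason to report both.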

Where the Evidence Is Consistent — and Where It Diverges

The consistent finding: AI raises ADR in controlled RCTs

Across the published RCT evidence, AI-assisted colonoscopy consistently increases adenoma detection rate compared to standard colonoscopy. This finding holds across multiple systems (GI Genius, ENDOANGEL, and others), across European and Asian trial settings, and across both parallel-group and tandem designs. Meta-analyses pooling these trials have reported statistically significant pooled ADR gains, typically in the range of 5–10 percentage points.

The signal is particularly consistent for small and diminutive polyps (under 10 mm), which are the lesions most likely to be missed during standard colonoscopy and which AI detection layers are specifically trained to flag. Non-polypoid lesions — flat adenomas — show more variable AI benefit across trials.

The divergence: procedure time, false positives, and experienced endoscopists

Not all outcomes move in the same direction. Several trials have documented increases in procedure time in the AI arm — endoscopists spend more time examining flagged regions, some of which are false positives. This has downstream implications for throughput in high-volume endoscopy units.

The false positive rate is a meaningful operational concern. Current CADe systems generate alerts for non-adenomatous findings, including hyperplastic polyps and normal mucosal folds. In some trials, endoscopists reported alert fatigue — a pattern where frequent false alarms reduce the attentional weight given to each flag. This is not a trivial concern: if alert fatigue is real, the ADR benefit measured in a controlled trial setting may erode in sustained routine deployment.

Subgroup analyses in several trials suggest that the ADR benefit is largest for lower-performing endoscopists — those with baseline ADR below the median. High-performing endoscopists with already elevated ADR show smaller or sometimes negligible gains. This is a plausible finding: AI adds most where the human baseline leaves the most room for improvement. But it also means that aggregate ADR gains in a trial population may not predict gains in a unit staffed predominantly by experienced, high-ADR endoscopists.

Study Design Limitations That Affect Interpretation

  • Single-center or limited-center enrollment. Several of the most-cited trials were conducted at one or two high-volume academic centers. Endoscopy practice, bowel prep protocols, and patient mix at academic centers often differ from community and safety-net settings.
  • Geographic concentration. A large share of published RCTs were conducted in China or Italy. US-based prospective RCT evidence remains limited. Whether findings from those settings transfer to US community practice — with different dietary risk profiles, bowel prep standards, and endoscopist training pathways — is not established.
  • Hawthorne effect. Endoscopists in RCTs know they are being observed and measured. Performance in a monitored trial context typically exceeds routine practice. ADR gains measured under trial conditions may be partially attributable to heightened attention rather than the AI system alone.
  • Short follow-up. No published RCT has yet demonstrated that AI-assisted colonoscopy reduces interval colorectal cancer incidence or mortality. ADR is a validated surrogate, but it is a surrogate. Long-term outcomes data does not yet exist for this intervention.
  • Limited demographic reporting. Most trials report sex and age distributions but provide limited data on race/ethnicity, BMI, or comorbidity burden. Whether AI detection performance is consistent across demographic subgroups is largely uncharacterized in the published RCT literature.
  • Industry involvement. Several trials received funding or device support from AI system manufacturers. Conflict of interest disclosures vary in specificity. Independent replication of manufacturer-supported findings is important and remains incomplete for some systems.

The Systematic Review Picture

Multiple systematic reviews and meta-analyses have pooled the RCT data. The pooled findings generally confirm the ADR benefit — typically a relative increase of 20–40% over control — and document the false positive and procedure time trade-offs. However, meta-analyses of colonoscopy AI trials face a specific methodological challenge: heterogeneity across trials is high.

Baseline ADR in control arms ranges from roughly 20% to over 40% across trials. Pooling these into a single effect estimate obscures the fact that AI is being tested in very different endoscopy environments. I-squared statistics in published meta-analyses are frequently high (often above 60%), indicating that a single pooled estimate should be interpreted cautiously rather than as a universal prediction.
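
To make the heterogeneity statistic concrete, the sketch below computes an inverse-variance pooled risk difference, Cochran's Q, and I-squared for three trials. The event counts are hypothetical, chosen only to span a wide range of baseline ADR, and are not taken from any published trial. Published meta-analyses often use random-effects models or risk ratios instead, so treat this as an illustration of the arithmetic rather than a reproduction of any analysis.

```python
from math import fsum


def risk_difference(e_ai: int, n_ai: int, e_ctl: int, n_ctl: int):
    """Risk difference (AI minus control) and its large-sample variance."""
    p_ai, p_ctl = e_ai / n_ai, e_ctl / n_ctl
    variance = p_ai * (1 - p_ai) / n_ai + p_ctl * (1 - p_ctl) / n_ctl
    return p_ai - p_ctl, variance


def heterogeneity(effects, variances):
    """Fixed-effect pooled estimate, Cochran's Q, and I-squared (%)."""
    weights = [1 / v for v in variances]
    pooled = fsum(w * y for w, y in zip(weights, effects)) / fsum(weights)
    q = fsum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared is truncated at zero when Q falls below its degrees of freedom.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical trials: (adenoma-positive AI, N AI, adenoma-positive control, N control)
hypothetical = [
    (150, 500, 100, 500),  # low-baseline setting
    (270, 500, 200, 500),  # high-baseline setting
    (170, 500, 140, 500),  # intermediate setting
]

pairs = [risk_difference(*t) for t in hypothetical]
effects = [rd for rd, _ in pairs]
variances = [var for _, var in pairs]
pooled, q, i2 = heterogeneity(effects, variances)

print(f"pooled risk difference: {pooled * 100:.1f} pp")
print(f"Cochran's Q: {q:.2f}, I-squared: {i2:.0f}%")
```

A high I-squared means much of the spread in trial results reflects real differences between settings rather than sampling noise, which is why the pooled number should not be read as a prediction for any one unit.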

FDA Clearance Status and the Evidence Relationship

GI Genius (Medtronic), one of the most studied systems in the RCT literature, received FDA 510(k) clearance for use as a computer-aided detection device for colorectal polyps. Other systems have received clearance under similar pathways. FDA clearance for these devices was based on performance data from specific study populations — not on the pooled RCT evidence base as a whole.

FDA clearance establishes that a device meets a regulatory standard for safety and effectiveness as defined by the submission. It does not certify that the device will produce the same ADR gains observed in any particular published RCT. The clinical evidence and the regulatory authorization are related but distinct tracks.

What Is Not Yet Answered by the RCT Evidence

  • Interval cancer reduction: No RCT has demonstrated that AI-assisted colonoscopy reduces interval colorectal cancer incidence. This is the outcome that ultimately matters, and it requires multi-year follow-up. ADR is a surrogate with strong epidemiological support but remains a surrogate.
  • Sustained deployment performance: RCTs measure performance during a defined trial period. Whether ADR gains persist over months and years of routine deployment — as alert fatigue accumulates and the novelty of the AI system fades — is not well-characterized. Model drift is a related concern if the AI system is not updated as colonoscopy equipment or practice patterns change.
  • Performance in underserved populations: Safety-net hospitals and community health centers serving predominantly Black, Hispanic, and low-income patients are underrepresented in the published RCT evidence. Colorectal cancer incidence and polyp morphology differ across demographic groups, and AI systems trained predominantly on images from academic or European centers may perform differently in these populations.
  • Optimal endoscopist-AI interaction design: How the AI alert is displayed, how it is integrated into the endoscopist's visual field, and how endoscopists are trained to respond to it all affect real-world performance. These workflow factors vary across systems and are rarely the primary focus of RCT designs.

Comparing the Two Most-Studied Systems

Comparison of the two most frequently studied AI colonoscopy CADe systems across key evidence dimensions. RCT counts and ADR ranges are approximate and based on peer-reviewed publications through June 2026.
| Dimension | GI Genius (Medtronic) | ENDOANGEL (Wision AI) |
|---|---|---|
| FDA clearance | Yes (510(k)) | Not cleared in US as of June 2026 |
| Primary trial geography | Europe (Italy, UK) | China |
| RCT count (major) | 4+ | 3+ |
| ADR gain range (RCTs) | +7 to +14 pp | +6 to +9 pp |
| False positive characterization | Reported in multiple trials | Reported in major trials |
| US community-setting data | Limited | Absent |
| Demographic subgroup reporting | Partial | Limited |

Implications for Evidence Interpretation

The colonoscopy AI RCT literature is among the most developed in clinical AI — more so than most imaging AI applications, and far more so than ambient documentation or clinical decision support tools. That relative maturity is meaningful. The ADR benefit is real in the populations studied. The question is how to apply that evidence responsibly.

Endoscopy units considering AI-assisted colonoscopy should look at their own baseline ADR distribution. If the unit's endoscopists already perform at or above the 45th percentile ADR benchmark, the marginal gain from AI may be smaller than what published trials report. If there is substantial within-unit ADR variation — some endoscopists well above average, others below — AI may provide a floor-raising effect that aggregate ADR statistics will understate.

The false positive and procedure time data deserve equal weight in that decision. A system that adds 7 percentage points to ADR while increasing procedure time by 3 minutes per case has a different operational profile than one that adds 5 points with no time penalty. Published trials vary in how carefully they measure and report these trade-offs.
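
The comparison in the paragraph above becomes simple arithmetic once a procedure volume is fixed. The sketch below assumes a hypothetical unit performing 1,000 colonoscopies; the volume, class, and function names are illustrative, and the two systems are the stylized examples from the text, not real products.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    extra_adenoma_positive_cases: float  # additional procedures with >= 1 adenoma
    extra_procedure_hours: float         # additional scope time across the volume


def operational_profile(adr_gain_pp: float,
                        extra_minutes_per_case: float,
                        volume: int) -> Profile:
    """Translate an ADR gain and a per-case time penalty into unit-level totals."""
    return Profile(
        extra_adenoma_positive_cases=volume * adr_gain_pp / 100,
        extra_procedure_hours=volume * extra_minutes_per_case / 60,
    )


VOLUME = 1000  # hypothetical number of colonoscopies

system_a = operational_profile(adr_gain_pp=7, extra_minutes_per_case=3, volume=VOLUME)
system_b = operational_profile(adr_gain_pp=5, extra_minutes_per_case=0, volume=VOLUME)

print("System A:", system_a)
print("System B:", system_b)
```

Under these assumptions the first system yields 20 more adenoma-positive procedures per 1,000 than the second, at a cost of 50 additional procedure hours. Whether that exchange is worthwhile depends on local capacity.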

Active Trials and Evidence Gaps to Watch

As of mid-2026, several trials are addressing the gaps identified above. US-based prospective RCTs in community endoscopy settings are underway, with results expected in the next 12–24 months. Trials specifically examining AI performance in patients with inflammatory bowel disease — where mucosal changes complicate polyp detection — are also in progress. Long-term follow-up studies tracking interval cancer rates in AI-assisted versus standard colonoscopy cohorts have been initiated but will not report primary outcomes for several years.

The demographic representation gap is beginning to receive explicit attention. At least two registered trials list race and ethnicity as pre-specified subgroup analyses rather than as secondary afterthoughts. Whether those subgroup analyses will be adequately powered to detect differential AI performance across groups remains to be seen — subgroup analyses in colonoscopy AI trials have historically been underpowered.
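
A rough sample-size calculation shows why subgroup analyses tend to be underpowered. The sketch below uses the standard normal-approximation formula for comparing two proportions and asks how many patients per arm would be needed to detect a 40% versus 47% ADR difference within a single subgroup. The 15% subgroup share is a hypothetical assumption, the 1,440 total mirrors the largest European trial in the table above, and testing whether the AI effect differs between subgroups (an interaction) would generally require more patients still.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(p_control: float, p_ai: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Patients per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_ai) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_control * (1 - p_control) + p_ai * (1 - p_ai))
    ) ** 2
    return ceil(numerator / (p_ai - p_control) ** 2)


needed = n_per_arm(p_control=0.40, p_ai=0.47)

trial_size = 1440        # total patients in the trial
subgroup_share = 0.15    # hypothetical share belonging to the subgroup
available = round(trial_size * subgroup_share / 2)  # per arm, equal allocation

print(f"needed per arm within the subgroup: {needed}")
print(f"available per arm in the subgroup:  {available}")
```

With these inputs the subgroup supplies far fewer patients per arm than the calculation calls for, so a null subgroup result would say little about whether AI performance truly differs.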
