AI and Medical Diagnosis: Research Radar Q2 2026

The literature on AI and medical diagnosis has grown dense enough that tracking what is actually worth reading — versus what is simply being published — has become a task in itself. This radar covers studies published or posted through Q2 2026 across four specialty areas where diagnostic AI evidence is accumulating fastest: radiology, pathology, cardiology, and primary care. Each notice records the study type, dataset scope, primary finding, and a brief note on why it warrants attention or caution.

Radiology

Chest CT: Pulmonary Nodule Detection Meta-Analysis

Study type: Systematic review and meta-analysis | Dataset: 38 studies, ~210,000 CT scans pooled | External validation: Partial (subset of included studies used independent test sets)

Pooled sensitivity for AI detection of pulmonary nodules ≥6 mm reached 91.4% (95% CI: 88.7–93.6%) with specificity at 88.2%. Heterogeneity across included studies was high (I² = 74%), driven primarily by differences in scanner vendor, slice thickness, and nodule density category. Studies using thin-slice CT (≤1.5 mm) consistently outperformed those using standard clinical protocols.

Editorial note: Worth tracking because it is the largest pooled analysis of pulmonary nodule AI to date and explicitly stratifies performance by slice thickness — a variable that rarely appears in vendor-reported metrics. The high heterogeneity figure is the most useful number in the paper: it signals that aggregate performance claims in this space are unreliable without knowing the imaging protocol.
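
The I² figure above can be related to Cochran's Q through the standard Higgins-Thompson formula, I² = max(0, (Q - df) / Q). The following is a minimal sketch, assuming the 38 included studies give df = 37. The back-calculated Q is illustrative arithmetic, not a value reported in the meta-analysis, and the function names are hypothetical.

```python
def i_squared(q: float, n_studies: int) -> float:
    """Higgins-Thompson I^2 (as a percentage) from Cochran's Q."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    # Negative values are truncated to zero by convention.
    return max(0.0, (q - df) / q) * 100.0


def q_from_i_squared(i2_percent: float, n_studies: int) -> float:
    """Invert the formula: the Q implied by a reported I^2 (requires I^2 < 100)."""
    df = n_studies - 1
    return df / (1.0 - i2_percent / 100.0)


# Assumption: 38 studies, I^2 = 74% as quoted in this notice.
implied_q = q_from_i_squared(74.0, 38)
print(round(implied_q, 1))                 # 142.3
print(round(i_squared(implied_q, 38), 1))  # 74.0
```

Read this way, an I² of 74% means the observed between-study variation is roughly four times what sampling error alone would be expected to produce, which is why the pooled sensitivity should not be quoted without the imaging protocol.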

Mammography: Prospective Reader Study, Multi-Site

Study type: Prospective validation | Dataset: 14,200 screening mammograms across 4 academic medical centers | External validation: Yes

An AI triage system was tested as a pre-reader filter, flagging cases for immediate radiologist review versus standard queue routing. The system reduced time-to-radiologist-review for screen-detected cancers by a median of 4.1 days. Cancer detection rate was non-inferior to standard workflow (6.2 vs. 6.0 per 1,000 screens). Recall rate was unchanged. The study population was 78% white; performance stratified by race and breast density was reported but showed wider confidence intervals in the non-white subgroup (n=3,100).

Editorial note: Promotion to a full evidence appraisal is under consideration. The prospective multi-site design and explicit demographic reporting make this more useful for evaluation purposes than most mammography AI literature currently indexed. The AI system evaluated holds FDA clearance under a 510(k) pathway. Limitation: the study was partially funded by the device manufacturer, a conflict of interest that is disclosed in the paper and must be weighed accordingly.

Pathology

Whole-Slide Image Analysis: Colorectal Cancer Grading

Study type: Retrospective cohort with external validation | Dataset: 9,400 slides (training), 2,100 slides (external test, different institution) | Primary metric: AUC 0.94 for high-grade vs. low-grade classification

A deep learning model trained on hematoxylin and eosin (H&E) stained whole-slide images classified colorectal adenocarcinoma grade with AUC 0.94 on the held-out external test set. Inter-rater agreement between the model and three pathologists on the external set was κ = 0.71, compared to inter-pathologist agreement of κ = 0.68 on the same cases. The study did not report demographic composition of the patient population contributing slides.

Editorial note: Comparing model-to-pathologist agreement (κ = 0.71) against pathologist-to-pathologist agreement (κ = 0.68) is a methodologically sound framing that is not always used in computational pathology papers. The absence of demographic data on the slide-contributing population is a gap: tissue processing protocols, staining variability, and patient demographics all affect model generalizability in ways this study cannot address.
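
For readers less familiar with the κ statistic, the sketch below shows how chance-corrected agreement is computed for two raters. It assumes a two-rater Cohen's κ on a binary high-grade/low-grade label. This notice does not state which κ variant the study used, and the label lists are hypothetical illustration data, not the study's.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: fraction of cases with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labelled independently at their own base rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for 100 slides (not taken from the study).
model = ["high"] * 40 + ["low"] * 60
pathologist = ["high"] * 35 + ["low"] * 5 + ["high"] * 5 + ["low"] * 55

kappa = cohens_kappa(model, pathologist)
print(round(kappa, 2))  # 0.79
```

In this toy example raw agreement is 90%, but κ is lower because about half of that agreement would be expected by chance given the label base rates.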

Foundation Model for Pathology: Zero-Shot Tissue Classification (Preprint)

Study type: Preprint — model evaluation | Dataset: 28 tissue types across 4 public pathology datasets | Primary metric: Zero-shot accuracy 81.3% across tissue classification tasks

A vision-language foundation model pretrained on a large corpus of pathology images and paired text was evaluated on zero-shot tissue classification without task-specific fine-tuning. Performance varied substantially by tissue type — accuracy on common tissue classes (colon, lung, breast) exceeded 90%, while performance on rare tissue types dropped to 61–67%. The authors note the training corpus is not fully disclosed, which limits reproducibility.

Editorial note: Tracking because foundation model approaches in pathology are moving quickly and this is one of the first preprints to report stratified zero-shot performance by tissue rarity. The undisclosed training corpus is a significant reproducibility concern that will need to be addressed before this can be treated as reliable evidence. No FDA regulatory status.

Cardiology

ECG-Based AI for Atrial Fibrillation Detection: RCT Results

Study type: Randomized controlled trial | Dataset: 11,500 patients randomized across 18 primary care sites | Primary outcome: AF detection rate at 12 months

Patients in the AI-assisted arm received ECG screening with an AI algorithm flagging low-confidence reads for cardiologist review. At 12 months, the AI-assisted arm detected AF in 3.1% of the screened population versus 2.2% in standard care (OR 1.43, 95% CI: 1.12–1.82). New anticoagulation prescriptions were higher in the AI arm (2.6% vs. 1.9%). The trial did not have sufficient follow-up to assess stroke outcomes.

Editorial note: This study is being considered for promotion to a full evidence appraisal given the RCT design and the downstream management outcome. The AI algorithm evaluated holds FDA clearance. Demographic composition of the trial population: 54% female, mean age 67, 82% white — the racial composition limits generalizability to populations where AF detection rates and ECG morphology differ.
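
The effect size can be sanity-checked from the two detection rates alone. This is a rough sketch, assuming the rounded percentages quoted in this notice. Arm sizes are not given here, so the confidence interval is not reproduced, and the "number needed to screen" is a derived illustration rather than a trial-reported figure.

```python
def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


def odds_ratio(p_treated: float, p_control: float) -> float:
    """Unadjusted odds ratio from two event proportions."""
    return odds(p_treated) / odds(p_control)


ai_rate, standard_rate = 0.031, 0.022  # AF detection at 12 months, from the notice

or_estimate = odds_ratio(ai_rate, standard_rate)
absolute_difference = ai_rate - standard_rate
number_needed_to_screen = 1.0 / absolute_difference

print(round(or_estimate, 2))           # 1.42
print(round(number_needed_to_screen))  # 111
```

The unadjusted value of about 1.42 is close to the reported OR of 1.43. The small gap is consistent with rounding of the published percentages or with model adjustment, neither of which can be confirmed from this notice.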

Echocardiography AI: Left Ventricular Function Assessment

Study type: Prospective validation, single-center | Dataset: 3,800 echocardiograms | Primary metric: Mean absolute error (MAE) for LVEF: 4.1 percentage points versus sonographer measurement

An AI system for automated left ventricular ejection fraction (LVEF) measurement was prospectively validated against sonographer-derived measurements. An MAE of 4.1 percentage points is within the range of inter-sonographer variability reported in prior literature (typically 4–6 points). The system performed comparably across patients with preserved, mildly reduced, and reduced EF categories. Image quality was a significant moderator: studies rated as poor quality by the AI system showed an MAE of 7.8 points.

Editorial note: The image quality stratification is the most practically useful finding. Any deployment of this type of system needs to account for the performance degradation in poor-quality studies — which in real-world echocardiography labs can represent 15–25% of studies depending on patient population. Single-center limitation applies.
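
The quality-stratified MAE discussed above is straightforward to compute in a local validation. The following is a minimal sketch, assuming each record pairs an AI-derived LVEF with a sonographer reference and a quality label. The records, labels, and function name are hypothetical and are not drawn from the study.

```python
from collections import defaultdict


def mae_by_group(records):
    """Mean absolute error (in LVEF percentage points) per quality group."""
    errors = defaultdict(list)
    for ai_lvef, reference_lvef, quality in records:
        errors[quality].append(abs(ai_lvef - reference_lvef))
    return {group: sum(values) / len(values) for group, values in errors.items()}


# Hypothetical (AI LVEF, sonographer LVEF, image quality) triples.
records = [
    (55, 58, "adequate"),
    (42, 40, "adequate"),
    (61, 60, "adequate"),
    (30, 38, "poor"),
    (48, 41, "poor"),
]

result = mae_by_group(records)
print(result)  # {'adequate': 2.0, 'poor': 7.5}
```

Reporting the per-group figure alongside the share of studies in each group makes it harder for a good overall MAE to hide degraded performance on poor-quality acquisitions.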

Primary Care and LLMs in Diagnosis

LLM Differential Diagnosis Generation: Prospective Comparison Study

Study type: Prospective comparison (LLM vs. resident physicians) | Dataset: 400 de-identified clinical vignettes from a tertiary care center | Primary metric: Top-3 differential accuracy

A large language model was given structured clinical vignettes (history, exam findings, initial labs) and asked to generate a differential diagnosis. Top-3 accuracy, defined as whether the correct diagnosis appeared in the model's three-item differential, was 71.4% versus 64.8% for PGY-2 residents on the same vignettes. Top-1 accuracy for the LLM (44.2%) was lower than that of attending physicians (61.7%), who were evaluated on a subset of 100 vignettes.

Editorial note: The top-1 versus top-3 split is the most important figure here. LLMs generating broad differentials can appear to perform well on top-3 metrics while underperforming on the clinically relevant question of what the physician actually suspects first. This study is tracking the right metric set, but the vignette methodology limits what can be concluded about real-world utility.
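
The top-1 versus top-3 distinction can be made concrete with a short sketch. It assumes each differential is an ordered list with the most likely diagnosis first. The vignette answers and diagnoses below are hypothetical and do not come from the study.

```python
def top_k_accuracy(differentials, truths, k):
    """Fraction of cases whose true diagnosis appears in the first k ranked guesses."""
    if len(differentials) != len(truths) or not truths:
        raise ValueError("need one ranked differential per non-empty set of cases")
    hits = sum(truth in ranked[:k] for ranked, truth in zip(differentials, truths))
    return hits / len(truths)


# Hypothetical ranked differentials for four vignettes.
differentials = [
    ["pneumonia", "pulmonary embolism", "heart failure"],
    ["gastroenteritis", "appendicitis", "diverticulitis"],
    ["migraine", "tension headache", "sinusitis"],
    ["cellulitis", "gout", "septic arthritis"],
]
truths = ["pneumonia", "appendicitis", "subarachnoid hemorrhage", "septic arthritis"]

top1 = top_k_accuracy(differentials, truths, k=1)
top3 = top_k_accuracy(differentials, truths, k=3)
print(top1, top3)  # 0.25 0.75
```

In this toy set the same outputs score 75% on top-3 but only 25% on top-1, which is the pattern the editorial note warns about when broad differentials are graded generously.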

AI-Assisted Symptom Triage: Retrospective Analysis in Primary Care Networks

Study type: Retrospective cohort | Dataset: 280,000 patient-initiated symptom triage interactions across 12 primary care practices | External validation: No

A symptom checker AI was evaluated for triage accuracy — whether it correctly categorized urgency (emergency, urgent, routine) relative to eventual clinical disposition. Sensitivity for emergency-level conditions was 84.1%; specificity was 76.3%. The system over-triaged (classified as urgent or emergency when routine was appropriate) in 19.2% of cases. Under-triage rate for conditions that required emergency care was 15.9%.

Editorial note: The 15.9% under-triage rate for emergency conditions is the figure that matters most for patient safety evaluation. The study does not report whether under-triaged cases resulted in adverse outcomes — a gap that limits safety conclusions. No external validation and no FDA clearance status reported for the evaluated system. Tracking for the dataset scale and the under-triage metric, which is underreported in this literature.
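
The under-triage figure is the complement of emergency sensitivity (84.1% + 15.9% = 100%), which the sketch below makes explicit. The counts are hypothetical, chosen only to reproduce the reported rates. The study's actual confusion matrix is not given in this notice, and the 19.2% over-triage rate uses a denominator that cannot be reconstructed from it.

```python
def triage_rates(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Emergency-level triage metrics from a 2x2 confusion matrix."""
    emergencies = tp + fn
    non_emergencies = tn + fp
    return {
        "sensitivity": tp / emergencies,
        "specificity": tn / non_emergencies,
        # Under-triage: true emergencies that were assigned a lower urgency.
        "under_triage": fn / emergencies,
    }


# Hypothetical counts: 1,000 true emergencies and 10,000 non-emergencies.
rates = triage_rates(tp=841, fn=159, tn=7630, fp=2370)
print(rates)
```

Because the two numbers share a denominator, any reported sensitivity for emergency conditions can be converted directly into the missed-emergency rate that matters for safety review.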

Cross-Cutting: Algorithmic Bias Audits

Three bias audit studies published this quarter are worth flagging together because they address the same structural problem from different angles.

Selected bias audit studies, Q2 2026. All three are peer-reviewed. None include prospective correction or mitigation arms.
Study Focus	Design	Key Bias Finding	Status
Skin lesion classification AI across Fitzpatrick skin tones I–VI	Retrospective audit, 6 FDA-cleared devices	AUC gap of 0.09–0.14 between Fitzpatrick I–II and V–VI across all tested devices	Peer-reviewed
Sepsis prediction model performance by insurance status	Retrospective cohort, single health system, 42,000 patients	Model AUROC 0.81 in commercially insured patients vs. 0.74 in Medicaid patients; attributed to documentation density differences	Peer-reviewed
Chest X-ray AI performance by sex and age across 5 models	Systematic review, 14 studies included	Consistent underperformance in patients >75 years across all reviewed models; sex-based gaps inconsistent across studies	Peer-reviewed

Tracking Status Summary

Tracking status for studies covered in this radar entry. Promotion to Evidence Appraisals requires prospective or RCT design, external validation, and demographic reporting.
Study Area	Design	Promotion Candidate	Key Limitation
Pulmonary nodule detection (meta-analysis)	Systematic review / meta-analysis	No — heterogeneity too high for clean appraisal	I²=74%; mixed AI systems
Mammography triage (prospective multi-site)	Prospective validation	Under consideration	Partial industry funding
Colorectal cancer grading WSI	Retrospective + external validation	No — missing demographic data	No patient demographics reported
Pathology foundation model zero-shot	Preprint	No — preprint, undisclosed training data	Not peer-reviewed; training corpus opaque
AF detection RCT	RCT	Yes — in queue	Short follow-up; limited racial diversity
Echocardiography LVEF AI	Prospective validation	No — single center	Single-center; image quality moderator not addressed
LLM differential diagnosis	Prospective comparison	No — vignette methodology limits	Vignettes ≠ real clinical encounters
Symptom triage retrospective	Retrospective cohort	No — no external validation	No external validation; no outcome data for under-triaged cases
