The literature on AI and medical diagnosis has grown dense enough that tracking what is actually worth reading — versus what is simply being published — has become a task in itself. This radar covers studies published or posted through Q2 2026 across four specialty areas where diagnostic AI evidence is accumulating fastest: radiology, pathology, cardiology, and primary care. Each notice records the study type, dataset scope, primary finding, and a brief note on why it warrants attention or caution.
Radiology
Chest CT: Pulmonary Nodule Detection Meta-Analysis
Study type: Systematic review and meta-analysis | Dataset: 38 studies, ~210,000 CT scans pooled | External validation: Partial (subset of included studies used independent test sets)
Pooled sensitivity for AI detection of pulmonary nodules ≥6 mm reached 91.4% (95% CI: 88.7–93.6%) with specificity at 88.2%. Heterogeneity across included studies was high (I² = 74%), driven primarily by differences in scanner vendor, slice thickness, and nodule density category. Studies using thin-slice CT (≤1.5 mm) consistently outperformed those using standard clinical protocols.
Editorial note: Worth tracking because it is the largest pooled analysis of pulmonary nodule AI to date and explicitly stratifies performance by slice thickness — a variable that rarely appears in vendor-reported metrics. The high heterogeneity figure is the most useful number in the paper: it signals that aggregate performance claims in this space are unreliable without knowing the imaging protocol.
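For readers who want to see where a heterogeneity figure like I² = 74% comes from, the following is a minimal sketch of the standard calculation from Cochran's Q under inverse-variance weighting. The function name and the five effect estimates are hypothetical illustration values, not data from the meta-analysis, and diagnostic accuracy reviews often fit bivariate models rather than this univariate simplification.

```python
def i_squared(effects, variances):
    """Return (Q, I2 in percent) for per-study effects and their variances."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    # Inverse-variance weights: more precise studies count for more.
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I2 is the share of total variation beyond what chance alone would give.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2


# Hypothetical logit-sensitivity estimates from five studies (illustration only).
effects = [2.1, 2.6, 1.8, 2.9, 2.3]
variances = [0.02, 0.03, 0.025, 0.04, 0.02]
q, i2 = i_squared(effects, variances)
print(f"Q = {q:.2f}, I2 = {i2:.1f}%")
```

Because I² is floored at zero and scales with how far Q exceeds its degrees of freedom, a value in the 70s indicates that most of the observed spread reflects real between-study differences such as imaging protocol, not sampling noise.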
Mammography: Prospective Reader Study, Multi-Site
Study type: Prospective validation | Dataset: 14,200 screening mammograms across 4 academic medical centers | External validation: Yes
An AI triage system was tested as a pre-reader filter, flagging cases for immediate radiologist review versus standard queue routing. The system reduced time-to-radiologist-review for screen-detected cancers by a median of 4.1 days. Cancer detection rate was non-inferior to standard workflow (6.2 vs. 6.0 per 1,000 screens). Recall rate was unchanged. The study population was 78% white; performance stratified by race and breast density was reported but showed wider confidence intervals in the non-white subgroup (n=3,100).
Editorial note: Under consideration for promotion to a full evidence appraisal. The prospective multi-site design and explicit demographic reporting make this more useful for evaluation purposes than most mammography AI literature currently indexed. Limitation: the study was partially funded by the device manufacturer, a conflict of interest that is disclosed in the paper and must be weighed accordingly. Regulatory status: the AI system evaluated holds FDA clearance under the 510(k) pathway.
Pathology
Whole-Slide Image Analysis: Colorectal Cancer Grading
Study type: Retrospective cohort with external validation | Dataset: 9,400 slides (training), 2,100 slides (external test, different institution) | Primary metric: AUC 0.94 for high-grade vs. low-grade classification
A deep learning model trained on hematoxylin and eosin (H&E) stained whole-slide images classified colorectal adenocarcinoma grade with AUC 0.94 on the held-out external test set. Inter-rater agreement between the model and three pathologists on the external set was κ = 0.71, compared to inter-pathologist agreement of κ = 0.68 on the same cases. The study did not report demographic composition of the patient population contributing slides.
Editorial note: The κ comparison between AI and pathologist agreement versus pathologist-to-pathologist agreement is a methodologically sound framing that is not always used in computational pathology papers. The absence of demographic data on the slide-contributing population is a gap — tissue processing protocols, staining variability, and patient demographics all affect model generalizability in ways this study cannot address.
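As a reference point for the agreement figures above, here is a minimal sketch of Cohen's kappa for two raters. It assumes a pairwise two-rater statistic; this notice does not state whether the paper used pairwise Cohen's kappa or a multi-rater variant such as Fleiss' kappa. The function name and the ten example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of cases")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    # Agreement expected by chance, from each rater's marginal label rates.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical grades for ten slides (illustration only).
model = ["high", "high", "low", "low", "high", "low", "low", "high", "low", "low"]
pathologist = ["high", "low", "low", "low", "high", "low", "high", "high", "low", "low"]
print(f"kappa = {cohens_kappa(model, pathologist):.2f}")
```

In this toy example the raters agree on 8 of 10 slides, but kappa is well below 0.8 because about half of that agreement would be expected by chance given the label frequencies. That correction is why comparing model-to-pathologist kappa against pathologist-to-pathologist kappa on the same cases is more informative than raw agreement.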
Foundation Model for Pathology: Zero-Shot Tissue Classification (Preprint)
Study type: Preprint — model evaluation | Dataset: 28 tissue types across 4 public pathology datasets | Primary metric: Zero-shot accuracy 81.3% across tissue classification tasks
A vision-language foundation model pretrained on a large corpus of pathology images and paired text was evaluated on zero-shot tissue classification without task-specific fine-tuning. Performance varied substantially by tissue type — accuracy on common tissue classes (colon, lung, breast) exceeded 90%, while performance on rare tissue types dropped to 61–67%. The authors note the training corpus is not fully disclosed, which limits reproducibility.
Editorial note: Tracking because foundation model approaches in pathology are moving quickly and this is one of the first preprints to report stratified zero-shot performance by tissue rarity. The undisclosed training corpus is a significant reproducibility concern that will need to be addressed before this can be treated as reliable evidence. No FDA regulatory status.
Cardiology
ECG-Based AI for Atrial Fibrillation Detection: RCT Results
Study type: Randomized controlled trial | Dataset: 11,500 patients randomized across 18 primary care sites | Primary outcome: AF detection rate at 12 months
Patients in the AI-assisted arm received ECG screening with an AI algorithm flagging low-confidence reads for cardiologist review. At 12 months, the AI-assisted arm detected AF in 3.1% of the screened population versus 2.2% in standard care (OR 1.43, 95% CI: 1.12–1.82). New anticoagulation prescriptions were higher in the AI arm (2.6% vs. 1.9%). The trial did not have sufficient follow-up to assess stroke outcomes.
Editorial note: This study is being considered for promotion to a full evidence appraisal given the RCT design and the downstream management outcome. The AI algorithm evaluated holds FDA clearance. Demographic composition of the trial population: 54% female, mean age 67, 82% white — the racial composition limits generalizability to populations where AF detection rates and ECG morphology differ.
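The reported odds ratio can be sanity-checked against the detection rates with a short calculation. The sketch below computes an unadjusted odds ratio with a Wald 95% interval from a 2×2 table. Arm sizes are not given in this notice, so the code assumes equal allocation (5,750 per arm) and back-calculates approximate event counts from the reported percentages; those counts and the function name are assumptions, and the trial itself may have used an adjusted model.

```python
import math


def odds_ratio_wald(events_a, n_a, events_b, n_b, z=1.96):
    """Unadjusted odds ratio (arm A vs. arm B) with a Wald confidence interval."""
    a, b = events_a, n_a - events_a
    c, d = events_b, n_b - events_b
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for a Wald interval")
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper


# ASSUMPTION: 11,500 patients split evenly; counts approximate 3.1% and 2.2%.
N_PER_ARM = 5750
AI_ARM_AF = 178        # about 3.1% of 5,750 (hypothetical count)
STANDARD_ARM_AF = 127  # about 2.2% of 5,750 (hypothetical count)

estimate, lower, upper = odds_ratio_wald(AI_ARM_AF, N_PER_ARM, STANDARD_ARM_AF, N_PER_ARM)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Under these assumptions the unadjusted estimate comes out at roughly 1.41, in line with the reported 1.43. The small difference is what one would expect from rounding in the published percentages or from covariate adjustment.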
Echocardiography AI: Left Ventricular Function Assessment
Study type: Prospective validation, single-center | Dataset: 3,800 echocardiograms | Primary metric: Mean absolute error (MAE) for LVEF: 4.1% versus sonographer measurement
An AI system for automated left ventricular ejection fraction (LVEF) measurement was prospectively validated against sonographer-derived measurements. MAE of 4.1% is within the range of inter-sonographer variability reported in prior literature (typically 4–6%). The system performed comparably across patients with preserved, mildly reduced, and reduced EF categories. Image quality was a significant moderator — studies rated as poor quality by the AI system showed MAE of 7.8%.
Editorial note: The image quality stratification is the most practically useful finding. Any deployment of this type of system needs to account for the performance degradation in poor-quality studies — which in real-world echocardiography labs can represent 15–25% of studies depending on patient population. Single-center limitation applies.
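Sites evaluating a system like this may want to reproduce the quality-stratified error on their own data. The following is a minimal sketch of MAE computed per image-quality stratum from paired AI and reference LVEF values. The record layout, stratum labels, and measurements are hypothetical and not drawn from the study.

```python
from collections import defaultdict


def mae_by_stratum(records):
    """records: iterable of (stratum, ai_lvef, reference_lvef) tuples.

    Returns a dict mapping each stratum to its mean absolute error,
    in LVEF percentage points.
    """
    errors = defaultdict(list)
    for stratum, ai_value, reference_value in records:
        errors[stratum].append(abs(ai_value - reference_value))
    return {stratum: sum(vals) / len(vals) for stratum, vals in errors.items()}


# Hypothetical paired measurements (illustration only).
records = [
    ("adequate", 55.0, 58.0),
    ("adequate", 40.0, 38.0),
    ("poor", 35.0, 45.0),
    ("poor", 60.0, 54.0),
]
for stratum, mae in mae_by_stratum(records).items():
    print(f"{stratum}: MAE = {mae:.1f} points")
```

Reporting the strata separately avoids the situation where a single pooled MAE looks acceptable while the poor-quality subset, which may be a sizeable share of a lab's volume, falls outside usable accuracy.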
Primary Care and LLMs in Diagnosis
LLM Differential Diagnosis Generation: Prospective Comparison Study
Study type: Prospective comparison (LLM vs. resident physicians) | Dataset: 400 de-identified clinical vignettes from a tertiary care center | Primary metric: Top-3 differential accuracy
A large language model was given structured clinical vignettes (history, exam findings, initial labs) and asked to generate a differential diagnosis. Top-3 accuracy — whether the correct diagnosis appeared in the model's three-item differential — was 71.4% versus 64.8% for PGY-2 residents on the same vignettes. Top-1 accuracy was lower for the LLM (44.2%) than for attendings (61.7%) evaluated on a subset of 100 vignettes.
Editorial note: The top-1 versus top-3 split is the most important figure here. LLMs generating broad differentials can appear to perform well on top-3 metrics while underperforming on the clinically relevant question of what the physician actually suspects first. This study is tracking the right metric set, but the vignette methodology limits what can be concluded about real-world utility.
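The gap between top-1 and top-3 performance is easy to see in a small worked example. The sketch below computes top-k accuracy over ranked differentials. The function name and the four vignettes are hypothetical, and exact string matching is a simplification, since real evaluations typically need clinician adjudication of whether two diagnoses are equivalent.

```python
def top_k_accuracy(differentials, truths, k):
    """Fraction of cases whose true diagnosis appears in the first k ranked items."""
    if len(differentials) != len(truths) or not truths:
        raise ValueError("need one ranked differential per non-empty truth list")
    hits = sum(truth in ranked[:k] for ranked, truth in zip(differentials, truths))
    return hits / len(truths)


# Hypothetical ranked differentials for four vignettes (illustration only).
differentials = [
    ["pneumonia", "pulmonary embolism", "heart failure"],
    ["GERD", "acute coronary syndrome", "costochondritis"],
    ["migraine", "tension headache", "subarachnoid hemorrhage"],
    ["urinary tract infection", "pyelonephritis", "appendicitis"],
]
truths = [
    "pneumonia",
    "acute coronary syndrome",
    "subarachnoid hemorrhage",
    "cholecystitis",
]

print("top-1:", top_k_accuracy(differentials, truths, 1))
print("top-3:", top_k_accuracy(differentials, truths, 3))
```

Here a list that ranks the dangerous diagnosis second or third still counts as a top-3 hit, which is how a broad differential can score well on top-3 while trailing on top-1.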
AI-Assisted Symptom Triage: Retrospective Analysis in Primary Care Networks
Study type: Retrospective cohort | Dataset: 280,000 patient-initiated symptom triage interactions across 12 primary care practices | External validation: No
A symptom checker AI was evaluated for triage accuracy, defined as whether it correctly categorized urgency (emergency, urgent, routine) relative to eventual clinical disposition. Sensitivity for emergency-level conditions was 84.1%; specificity was 76.3%. The system over-triaged (classified as urgent or emergency when routine was appropriate) in 19.2% of cases. The under-triage rate for conditions that required emergency care was 15.9%, the complement of the 84.1% sensitivity.
Editorial note: The 15.9% under-triage rate for emergency conditions is the figure that matters most for patient safety evaluation. The study does not report whether under-triaged cases resulted in adverse outcomes — a gap that limits safety conclusions. No external validation and no FDA clearance status reported for the evaluated system. Tracking for the dataset scale and the under-triage metric, which is underreported in this literature.
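To make the relationship between these triage figures concrete, here is a minimal sketch that derives emergency sensitivity, under-triage, and over-triage from (predicted, actual) urgency pairs. The function name and the example interactions are hypothetical. The over-triage denominator used here (cases whose appropriate level was routine) is an assumption, since the notice does not specify how the study defined it.

```python
def triage_rates(pairs):
    """pairs: iterable of (predicted, actual) labels from
    {"routine", "urgent", "emergency"}."""
    pairs = list(pairs)
    emergencies = [pred for pred, actual in pairs if actual == "emergency"]
    routines = [pred for pred, actual in pairs if actual == "routine"]
    if not emergencies or not routines:
        raise ValueError("need at least one emergency and one routine case")
    sensitivity = sum(pred == "emergency" for pred in emergencies) / len(emergencies)
    over_triage = sum(pred != "routine" for pred in routines) / len(routines)
    return {
        "emergency_sensitivity": sensitivity,
        # Every missed emergency was triaged too low, so this is the complement.
        "emergency_under_triage": 1 - sensitivity,
        "routine_over_triage": over_triage,
    }


# Hypothetical interactions (illustration only).
pairs = [
    ("emergency", "emergency"), ("emergency", "emergency"),
    ("emergency", "emergency"), ("emergency", "emergency"),
    ("urgent", "emergency"),
    ("routine", "routine"), ("routine", "routine"),
    ("routine", "routine"), ("urgent", "routine"),
    ("urgent", "urgent"),
]
for name, value in triage_rates(pairs).items():
    print(f"{name}: {value:.1%}")
```

Framing under-triage as one minus sensitivity makes clear that an 84.1% sensitivity and a 15.9% under-triage rate are the same finding stated two ways, and that the safety question turns on what happened to the missed cases.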
Cross-Cutting: Algorithmic Bias Audits
Three bias audit studies published this quarter are worth flagging together because they address the same structural problem from different angles.
| Study Focus | Design | Key Bias Finding | Status |
|---|---|---|---|
| Skin lesion classification AI across Fitzpatrick skin tones I–VI | Retrospective audit, 6 FDA-cleared devices | AUC gap of 0.09–0.14 between Fitzpatrick I–II and V–VI across all tested devices | Peer-reviewed |
| Sepsis prediction model performance by insurance status | Retrospective cohort, single health system, 42,000 patients | Model AUROC 0.81 in commercially insured patients vs. 0.74 in Medicaid patients; attributed to documentation density differences | Peer-reviewed |
| Chest X-ray AI performance by sex and age across 5 models | Systematic review, 14 studies included | Consistent underperformance in patients >75 years across all reviewed models; sex-based gaps inconsistent across studies | Peer-reviewed |
Tracking Status Summary
| Study Area | Design | Promotion Candidate | Key Limitation |
|---|---|---|---|
| Pulmonary nodule detection (meta-analysis) | Systematic review / meta-analysis | No — heterogeneity too high for clean appraisal | I²=74%; mixed AI systems |
| Mammography triage (prospective multi-site) | Prospective validation | Under consideration | Partial industry funding |
| Colorectal cancer grading WSI | Retrospective + external validation | No — missing demographic data | No patient demographics reported |
| Pathology foundation model zero-shot | Preprint | No — preprint, undisclosed training data | Not peer-reviewed; training corpus opaque |
| AF detection RCT | RCT | Yes — in queue | Short follow-up; limited racial diversity |
| Echocardiography LVEF AI | Prospective validation | No — single center | Single-center; accuracy degrades on poor-quality images (MAE 7.8%) |
| LLM differential diagnosis | Prospective comparison | No — vignette methodology limits | Vignettes ≠ real clinical encounters |
| Symptom triage retrospective | Retrospective cohort | No — no external validation | No external validation; no outcome data for under-triaged cases |