AI Clinical Decision Support in Primary Care: Evidence and Applications

[Image: a primary care physician in conversation with a patient, with a secondary monitor showing a retinal scan AI result in the background.] AI clinical decision support in primary care functions as a secondary layer: the physician's attention remains on the patient, not the algorithm.

Clinical Context: Primary Care Under Strain and the AI-CDS Opportunity

Primary care is operating under compounding pressure. Workforce shortages, clinician burnout, rising rates of multimorbidity, and administrative burden have narrowed the time available for clinical reasoning at the point of care. These conditions have created genuine demand for tools that can extend clinician capacity — and AI clinical decision support (AI-CDS) is entering this space rapidly, in some cases ahead of robust real-world evaluation.

The pace of adoption is notable. In 2024, one in five UK general practitioners reported using generative AI clinically. In the United States, a 2023 American Hospital Association survey found that 65% of hospitals reported using AI or predictive models, though the regulatory status of most of those tools was unknown. The commercial activity is real; the evidence base is uneven.

This article does not treat AI-CDS as a monolithic category. The evidence supporting point-of-care screening tools for diabetic retinopathy is structurally different from the evidence behind LLM-based copilots or EHR-based preventive care algorithms. Conflating these application classes produces a misleading picture of both the opportunity and the risk. The analysis that follows is organized by application class, with evidence quality, regulatory status, and deployment stage treated as distinct dimensions for each.

A Taxonomy of AI-CDS Applications in Primary Care

A 2025 scoping review published in JMIR identified 73 empirical studies on AI in primary care, grouped into early intervention and decision support, chronic disease management, operations and patient management, and acceptance and implementation. Drawing on that review and the Lancet Primary Care 2025 analysis, four application classes structure the evidence review in this article:

Point-of-care screening AI: autonomous or near-autonomous tools for diabetic retinopathy detection and ECG-based cardiac condition identification.
Diagnostic decision support AI: skin cancer classifiers and general symptom-checker or diagnostic support tools.
LLM-based copilots and general-purpose CDS: large language model tools that assist clinicians with differential diagnosis, treatment selection, or clinical reasoning in real time.
Preventive care and population health AI: EHR-based NLP models and social determinants of health (SDOH)-aware algorithms that identify at-risk patients or support care coordination.

[Image: four-column infographic showing AI-CDS application classes in primary care with evidence quality indicators.] The four AI-CDS application classes in primary care differ substantially in evidence maturity, regulatory status, and deployment readiness.

AI-CDS application classes in primary care mapped to evidence maturity, regulatory clearance availability, and deployment stage as of mid-2026.
Application Class	Primary Use Cases	Evidence Maturity	FDA Clearance Available	Deployment Stage
Point-of-care screening AI	Diabetic retinopathy, low ejection fraction, atrial fibrillation	Strongest — prospective RCT and multi-site validation data	Yes (multiple cleared tools)	Routine deployment in some settings; real-world uptake variable
Diagnostic decision support AI	Skin cancer classification, general symptom checking	Variable — accuracy 39–89% in primary care; spectrum effect documented	Limited; most not cleared as autonomous primary care diagnostics	Proof-of-concept to pilot
LLM-based copilots / general CDS	Differential diagnosis support, treatment decision assistance	Early — one large preprint study; independent safety review published	No specific clearance; regulatory status under January 2026 FDA CDS guidance	Pilot deployment (primarily outside US)
Preventive care / population health AI	Prediabetes identification, SDOH-aware care coordination, AF risk stratification	Mostly retrospective or in-silico; prospective multi-site RCT evidence largely absent	Minimal; most tools not classified as medical devices	Proof-of-concept to limited pilot

Point-of-Care Screening AI: Diabetic Retinopathy and ECG-Based Cardiac Detection

This application class holds the strongest evidence in primary care AI-CDS. Both diabetic retinopathy screening and ECG-based cardiac detection have prospective trial data, FDA clearance, and documented real-world deployment — though uptake and clinical impact vary.

Diabetic Retinopathy AI Screening

IDx-DR was the first autonomous AI diagnostic system cleared by the FDA in any medical field, authorized via De Novo pathway (DEN180001) in April 2018. Its prospective pivotal trial reported 87.2% sensitivity and 90.7% specificity for detecting more-than-mild diabetic retinopathy. Subsequent 510(k) clearances (K203629, June 2021; K213037, June 2022) extended the product line under Digital Diagnostics. EyeArt (Eyenuk) followed with 510(k) clearance K200667 in August 2020 and K223357 in June 2023, with prospective study data showing sensitivity of 87–100% and specificity of 89–99% across multiple studies.

The clinical rationale is straightforward: diabetic retinopathy screening requires specialist-level fundus image interpretation, and most primary care offices lack on-site ophthalmology access. Autonomous AI screening addresses this gap directly, enabling point-of-care testing for patients who would otherwise go unscreened.

Despite FDA clearance and established CPT billing codes, real-world uptake of these tools remains low. The translational barrier here is not algorithmic performance — the prospective evidence is solid — but workflow integration, equipment costs, and implementation support.

[Image: clinical graphic showing EyeArt AI diabetic retinopathy screening accuracy metrics alongside low real-world adoption data.] EyeArt's diagnostic accuracy in prospective studies contrasts with persistently low real-world deployment rates, illustrating the translational gap between clearance and clinical adoption.

ECG-Based Cardiac AI Detection

AI analysis of standard 12-lead ECGs for detecting low ejection fraction (LEF) has generated the most clinically meaningful randomized evidence in primary care AI-CDS. A pragmatic RCT at Mayo Clinic (Yao et al., Nature Medicine, 2021) found that new low ejection fraction diagnoses occurred in 2.1% of patients in the AI-ECG intervention arm versus 1.6% in the usual-care arm; a follow-up analysis of frequent tool users found a twofold detection rate. These are modest absolute gains, but they represent clinically actionable identification of a condition that is otherwise systematically underdetected in primary care.
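To make "modest absolute gains" concrete, the two diagnosis rates reported above (1.6% and 2.1%) can be turned into an absolute difference, a relative increase, and an approximate number of patients screened per additional diagnosis. This is a minimal arithmetic sketch; the function name is illustrative, and it uses the unadjusted headline rates rather than the trial's own adjusted estimates.

```python
def detection_gain(comparator_rate: float, intervention_rate: float):
    """Return (absolute difference, relative increase, patients per extra diagnosis).

    Rates are proportions (0.016 means 1.6%). Uses unadjusted headline rates only.
    """
    if not (0 < comparator_rate < 1 and 0 < intervention_rate < 1):
        raise ValueError("rates must be proportions strictly between 0 and 1")
    absolute_diff = intervention_rate - comparator_rate
    relative_increase = absolute_diff / comparator_rate
    # Reciprocal of the absolute difference; infinite if there is no gain.
    per_extra_diagnosis = 1 / absolute_diff if absolute_diff > 0 else float("inf")
    return absolute_diff, relative_increase, per_extra_diagnosis


abs_diff, rel_inc, n_screened = detection_gain(0.016, 0.021)
print(f"absolute difference: {abs_diff * 100:.1f} percentage points")
print(f"relative increase:   {rel_inc:.0%}")
print(f"patients screened per additional diagnosis: about {n_screened:.0f}")
```

On these figures the absolute difference is 0.5 percentage points, which corresponds to roughly one additional diagnosis per 200 patients screened, even though the relative increase is about 31%.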

FDA clearances for ECG-AI tools have continued to accumulate. Anumana received 510(k) clearance for its Low Ejection Fraction AI-ECG Algorithm in September 2023 (K232699), followed by a second clearance in July 2025 (K250652). Tempus AI received clearances for ECG-AF, for atrial fibrillation detection (K233549, June 2024), and for ECG-Low EF (K250119, July 2025).

For atrial fibrillation detection, the PULsE-AI algorithm demonstrated favorable cost-effectiveness in modeling studies — estimated at approximately £3,994 per quality-adjusted life year, with projections that wider rollout could prevent 3,299 strokes and reduce undiagnosed AF by 27%. However, the clinical trial evidence for direct impact on AF diagnosis rates in primary care settings showed mixed results, a pattern consistent with the broader challenge of translating algorithmic performance into workflow-integrated clinical outcomes.

FDA-cleared AI tools relevant to primary care point-of-care screening as of mid-2026. Sources: FDA AI-Enabled Medical Devices list.
Tool	Condition	Clearance	Pathway	Date	Key Evidence
IDx-DR	Diabetic retinopathy	DEN180001	De Novo	April 2018	87.2% sensitivity, 90.7% specificity (prospective trial)
IDx-DR	Diabetic retinopathy	K203629 / K213037	510(k)	2021 / 2022	Subsequent clearances; same product line
EyeArt	Diabetic retinopathy	K200667	510(k)	August 2020	87–100% sensitivity, 89–99% specificity (prospective studies)
EyeArt v2.2.0	Diabetic retinopathy	K223357	510(k)	June 2023	Updated version clearance
Anumana ECG-AI LEF	Low ejection fraction	K232699	510(k)	September 2023	Based on Mayo Clinic RCT evidence (Yao et al., Nat Med 2021)
Anumana ECG-AI LEF	Low ejection fraction	K250652	510(k)	July 2025	Second-generation clearance
Tempus ECG-Low EF	Low ejection fraction	K250119	510(k)	July 2025	Independent clearance for same indication
Tempus ECG-AF	Atrial fibrillation	K233549	510(k)	June 2024	AF detection; the PULsE-AI cost-effectiveness data cited above (mixed clinical impact) concern a separate algorithm

Diagnostic Decision Support AI: Skin Cancer Classifiers and General Diagnostic Tools

Diagnostic AI for skin cancer classification and general symptom checking represents a more variable evidence class — with algorithmic performance that frequently does not translate to primary care settings.

Skin Cancer AI Classifiers

Across primary care-focused studies, skin cancer AI classifiers show accuracy ranging from 39% to 89%, with some achieving sensitivities above 90% for specific lesion types. These figures are not uniformly reassuring, and the methodological context matters significantly.

A systematic review of AI algorithms for early skin cancer detection in primary care found that only 2 of 272 included studies used training data from clinical settings with low disease prevalence — the kind of prevalence analogous to a primary care population. The remaining studies trained on specialist or secondary care datasets where malignant lesions are more common and more severe. This is the spectrum effect: a model trained on high-prevalence, high-severity data will systematically underperform when deployed in a setting where most lesions are benign and presentations are earlier and less differentiated.
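A related effect can be shown with simple arithmetic. Even if sensitivity and specificity stayed exactly the same between settings, positive predictive value would fall sharply as prevalence drops. The sketch below holds both fixed at 90% and compares two prevalences; all three numbers are hypothetical, chosen only for illustration. The spectrum effect described above comes on top of this, because sensitivity and specificity themselves tend to shift with case mix.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    for value in (sensitivity, specificity, prevalence):
        if not 0 < value < 1:
            raise ValueError("inputs must be proportions strictly between 0 and 1")
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical settings: same classifier performance, different prevalence.
settings = {"specialist clinic (30% prevalence)": 0.30,
            "primary care (2% prevalence)": 0.02}
for label, prevalence in settings.items():
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"{label}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

Under these assumed inputs, PPV drops from roughly 79% to roughly 16%, meaning most positive calls in the low-prevalence setting would be false positives before any loss of sensitivity or specificity is counted.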

There is an additional equity concern: most skin cancer AI training datasets have been compiled predominantly from light-skinned populations. Performance on darker skin tones is less well characterized, which is a documented limitation for primary care deployment across diverse patient populations.

General Diagnostic AI and Symptom Checkers

General-purpose diagnostic AI tools — symptom checkers and differential diagnosis generators — show accuracy in the range of 30–60% across studies. These tools face a structural challenge in primary care: undifferentiated presentations, high comorbidity burden, and the need to integrate social context into clinical reasoning. Most current tools are not designed to handle this complexity adequately.

The npj Digital Medicine perspective on AI and clinical reasoning makes a relevant distinction: the success stories in AI diagnostics — retinopathy detection, skin lesion classification, lymph node metastasis identification — tend to involve self-contained visual tasks where broader clinical context can be set aside. Primary care operates with heterogeneous, undifferentiated presentations where the signal-to-noise ratio is fundamentally different. Tools that present ready-made conclusions rather than genuinely supporting clinical reasoning are poorly suited to this environment.

LLM-Based Copilots and General-Purpose Clinical Decision Support

Large language model-based clinical copilots represent the fastest-moving and most contested category in primary care AI-CDS. The evidence base is early, geographically specific, and carries documented safety concerns that must be presented alongside any effectiveness data.

The Penda Health / GPT-4o Deployment

The most detailed published data on LLM copilot performance in a primary care setting comes from a pragmatic cluster-assigned study of 39,849 patient visits across 15 primary care clinics in Nairobi, Kenya (Korom et al.). Clinicians using an AI Consult tool built on GPT-4o made significantly fewer errors as rated by blinded independent physician reviewers: a 16% relative reduction in diagnostic errors (NNT 18.1) and a 13% relative reduction in treatment errors (NNT 13.9). The proportion of visits where clinicians missed a critical issue on first pass dropped from 45% to 35% in the AI group, while remaining flat in the control group.
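Relative reductions and NNTs are easier to interpret alongside the absolute rates they imply. The sketch below backs those out from the reported pairs, assuming the textbook identities NNT = 1 / ARR and relative reduction = ARR / comparator rate. The resulting rates are implied by that arithmetic only; the study's own adjusted estimates may differ, and the function name is illustrative.

```python
def implied_error_rates(relative_reduction: float, nnt: float) -> dict:
    """Back out absolute figures from a relative reduction and an NNT.

    Assumes NNT = 1 / ARR and relative_reduction = ARR / comparator_rate.
    """
    if not 0 < relative_reduction <= 1 or nnt <= 0:
        raise ValueError("expected 0 < relative_reduction <= 1 and nnt > 0")
    arr = 1.0 / nnt                              # absolute risk reduction
    comparator_rate = arr / relative_reduction   # implied error rate without AI
    return {
        "arr": arr,
        "comparator_rate": comparator_rate,
        "ai_rate": comparator_rate - arr,        # implied error rate with AI
    }


reported = [("diagnostic errors", 0.16, 18.1), ("treatment errors", 0.13, 13.9)]
for label, rrr, nnt in reported:
    r = implied_error_rates(rrr, nnt)
    print(f"{label}: ARR {r['arr']:.1%}, "
          f"implied rate {r['comparator_rate']:.1%} -> {r['ai_rate']:.1%}")
```

Read this way, an NNT of 18.1 corresponds to an absolute reduction of about 5.5 percentage points, which is the quantity most directly comparable with the encounter-level harm rates reported separately for the same deployment.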

The Independent Safety Review

An independent safety review of the same Penda Health deployment (Agweyu et al., Nature Health, 2026) examined 1,469 encounters and found that clinical management guidance aligned with local guidelines in 99% of cases, and hallucinations were uncommon (3.4%). However, the review identified actively harmful AI recommendations in 7.8% of encounters — 115 of 1,469 cases — with 67 of those appearing in final clinical documentation. Clinicians did not modify AI-generated content in 62% of encounters, a rate that the reviewers characterized as raising serious concerns about automation bias.

These two data sources — the effectiveness study and the safety review — describe the same deployment and must be read together. Error reduction at the population level does not preclude harmful recommendations at the individual encounter level, and a 62% rate of uncritical acceptance of AI-generated content represents a significant patient safety concern regardless of aggregate performance metrics.
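The encounter counts in the safety review allow a quick check of how precisely the harm rate is estimated. The sketch below computes a 95% Wilson score interval for 115 of 1,469 encounters, plus the share of harmful recommendations that reached final documentation (67 of 115). It assumes encounters can be treated as independent, which a clinic-clustered sample may not satisfy, so the interval is likely somewhat too narrow.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


harmful, documented, encounters = 115, 67, 1469
low, high = wilson_interval(harmful, encounters)
print(f"harmful recommendations: {harmful / encounters:.1%} "
      f"(95% interval {low:.1%} to {high:.1%})")
print(f"share reaching final documentation: {documented / harmful:.1%}")
print(f"as a share of all encounters: {documented / encounters:.1%}")
```

The interval runs from about 6.6% to 9.3%, so the headline 7.8% is reasonably stable for a sample of this size, and roughly 58% of the harmful recommendations carried through into the record.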

Effectiveness and safety findings from the Penda Health GPT-4o primary care deployment in Nairobi, Kenya. Deployment context: clinical officers, local epidemiological prompting, 15 clinics.
Finding	Source	Evidence Level
16% relative reduction in diagnostic errors (NNT 18.1)	Korom et al. (preprint, 2025)	Preprint — not yet peer-reviewed
13% relative reduction in treatment errors (NNT 13.9)	Korom et al. (preprint, 2025)	Preprint — not yet peer-reviewed
No statistically significant difference in patient-reported outcomes at 8 days	Korom et al. (preprint, 2025)	Preprint — not yet peer-reviewed
Actively harmful AI recommendations in 7.8% of encounters (115/1,469)	Agweyu et al., Nature Health (2026)	Peer-reviewed independent safety review
Clinicians accepted AI content uncritically in 62% of encounters	Agweyu et al., Nature Health (2026)	Peer-reviewed independent safety review
Hallucinations uncommon (3.4%) but present	Agweyu et al., Nature Health (2026)	Peer-reviewed independent safety review

Regulatory Status of LLM Copilots

The January 2026 FDA final guidance on Clinical Decision Support Software (FDA-2017-D-6569) establishes a four-criteria framework for excluding CDS software from device regulation. Software that computes a risk probability or generates alerts for life-threatening conditions in a time-sensitive manner fails criterion 3 and remains regulable. How 'explainability' applies to generative AI under criterion 4 — which requires that clinicians can independently review the basis of recommendations — remains an open regulatory question for LLM-based copilots.
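The logic of the framework can be sketched as a checklist in which all four criteria must be met for software to fall outside device regulation, and an unanswered criterion leaves the status open. The field names below are paraphrases introduced for illustration, not the guidance's wording, and the two example tools are hypothetical. This is a reading aid under those assumptions, not a regulatory determination.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class CdsAssessment:
    """One answer per criterion: True (met), False (not met), None (unresolved)."""
    c1_no_image_or_signal_processing: Optional[bool]
    c2_displays_or_analyzes_medical_information: Optional[bool]
    c3_supports_clinician_recommendations: Optional[bool]
    c4_basis_independently_reviewable: Optional[bool]


def classify(assessment: CdsAssessment) -> str:
    answers = [getattr(assessment, f.name) for f in fields(assessment)]
    if any(answer is False for answer in answers):
        return "device"          # failing any one criterion is enough
    if any(answer is None for answer in answers):
        return "unresolved"      # nothing failed, but a criterion is still open
    return "non-device CDS"      # all four criteria met


# Hypothetical time-critical alerting tool: fails criterion 3 as described above.
urgent_alert_tool = CdsAssessment(True, True, False, True)
# Hypothetical LLM copilot: criterion 4 left open, mirroring the question above.
llm_copilot = CdsAssessment(True, True, True, None)

print(classify(urgent_alert_tool))
print(classify(llm_copilot))
```

Modeling criterion 4 as a three-valued answer rather than a yes or no reflects the point made above: for generative tools the outcome currently depends on how independent review of the basis for a recommendation is interpreted.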

Preventive Care and Population Health AI

EHR-based NLP and SDOH-aware predictive models represent a category with technically promising results but the weakest prospective trial evidence of the four application classes.

NLP analysis of EHR clinical notes has demonstrated the ability to identify undisclosed prediabetes discussions with 98% sensitivity and 96% specificity (Tseng et al.) — performance that, if replicated in prospective deployment, would represent a meaningful contribution to diabetes prevention in primary care. Reinforcement learning models designed to account for social determinants of health in care coordination have shown reductions in acute care events of 12 percentage points in pilot programs, with an NNT of 8.3 (Basu et al., 2025).

These figures come from retrospective analyses or single-site pilots. Prospective, multi-site, randomized controlled trial evidence for preventive care AI in primary care is largely absent. The gap between in-silico performance and clinical trial evidence is widest in this application class.

NLP for prediabetes identification: 98% sensitivity / 96% specificity in EHR note analysis (Tseng et al.) — retrospective, single-site.
SDOH-aware care coordination AI: 12 percentage point reduction in acute care events, NNT 8.3 (Basu et al., 2025) — pilot-stage, limited generalizability.
AF risk stratification (PULsE-AI): favorable cost-effectiveness modeling (£3,994/QALY; 3,299 strokes prevented at scale) — mixed clinical trial evidence on actual diagnosis rates.
Most tools in this class remain at proof-of-concept or limited pilot stage. Multi-site prospective RCT evidence is the missing link.

Known Limitations Across Application Classes

The limitations of AI-CDS in primary care are not generic. They are specific, documented, and vary by application class. Understanding which limitation applies to which tool type is essential for clinical evaluation.

Documented limitations of AI-CDS tools in primary care, mapped to application class and supporting evidence.
Limitation Type	Description	Most Relevant Application Class	Documented Evidence
Spectrum effect	Models trained on high-prevalence specialist datasets underperform in primary care due to lower disease prevalence and less severe presentations	Skin cancer AI, general diagnostic AI	Systematic review: only 2/272 skin cancer AI studies used primary care-appropriate training data
Automation bias	Clinicians over-rely on AI outputs, particularly when presented in conversational or authoritative formats	LLM copilots, general CDS	Agweyu et al.: 62% uncritical acceptance of AI content in Penda Health deployment
Hallucinations	LLMs produce factually incorrect or clinically inappropriate outputs	LLM copilots	EUDF consensus; Agweyu et al.: 3.4% hallucination rate, 7.8% harmful recommendations
EHR integration friction	Alert fatigue, workflow misalignment, and usability barriers impede adoption	All classes	Recurring finding across 73-study JMIR scoping review
Dataset bias — skin type	Training data skewed toward light-skinned populations; performance on darker skin tones less characterized	Skin cancer AI	Documented in systematic reviews; equity concern for diverse primary care populations
Dataset bias — race/ethnicity	Racial bias documented in diabetes prediction models and other risk stratification tools	Preventive care AI	Noted in Lancet Primary Care 2025 review and related literature
Training-deployment mismatch	Models developed in secondary or tertiary care settings do not reflect primary care patient mix or presentation severity	All classes trained on specialist data	Lancet Primary Care 2025 review; JMIR scoping review

Regulatory Status: What Is Cleared, What Is Not, and What Remains Unresolved

FDA clearance authorizes a device for marketing in a defined intended use. It does not certify clinical superiority, real-world effectiveness, or suitability for deployment contexts beyond those studied in the clearance submission. These distinctions matter for procurement decisions in primary care.

The FDA AI-Enabled Medical Devices list reflects decisions through December 2025. The primary care-relevant cleared tools are concentrated in ophthalmology (diabetic retinopathy) and cardiology (ECG-based detection). The regulatory landscape for LLM copilots and preventive care AI is substantially less defined.

The January 2026 FDA final guidance on CDS software introduced a four-criteria framework for determining whether CDS software qualifies as a non-device (and thus escapes device regulation). The critical unresolved question is criterion 4: explainability. For LLM-based copilots, it is not clear how a clinician can independently review the basis of a generative AI recommendation in the way the guidance envisions. This ambiguity is acknowledged in the Lancet Digital Health regulatory analysis, which calls for radical transparency and public disclosure of CDS software to align regulation with clinical practice.

For health systems evaluating governance frameworks for AI-CDS tools, the NIST AI Risk Management Framework, as applied to healthcare, provides a structured approach to the gap between FDA clearance and algorithmic accountability. It is particularly relevant for post-deployment monitoring of AI-CDS tools in primary care workflows.

Real-World Deployment Context and Clinical Adoption

The dominant barrier to AI-CDS in primary care is not algorithmic performance. Across all four application classes, the recurring finding from the 73-study JMIR scoping review is that workflow alignment, clinician trust, and organizational implementation support determine whether a tool gets used — not sensitivity or specificity alone.

Diabetic retinopathy AI is the clearest example. The prospective evidence is strong, CPT billing codes exist, and the FDA clearance pathway is well established. Yet real-world uptake remains low. The barriers documented in implementation literature include equipment acquisition costs, staff training requirements, integration with existing EHR workflows, and the absence of dedicated implementation support for primary care practices — particularly independent and safety-net practices.

The AAFP adopted an official policy on AI in family medicine in July 2023. It states that AI must preserve and enhance the patient-physician relationship, support the four Cs of primary care (first contact, continuity, comprehensiveness, coordination), and be evaluated with the same rigor as any other healthcare tool. In April 2025, the AAFP joined the Health IT End-Users Alliance Consensus Statement calling for common principles balancing AI innovation with regulatory oversight and responsible AI development and monitoring.

The Penda Health deployment identified three factors that supported clinician adoption: a capable underlying model, clinically aligned implementation design, and active deployment strategies including peer champions, measurement dashboards, and positive incentives. These factors are not specific to LLM copilots — they describe the conditions for any technology adoption in primary care.

Routine deployment (some settings): FDA-cleared diabetic retinopathy AI (IDx-DR, EyeArt); ECG-AI for low ejection fraction in health system cardiology-primary care integration programs.
Pilot deployment: LLM-based copilots (primarily outside US; Penda Health Nairobi deployment); SDOH-aware care coordination models in select safety-net health systems.
Proof-of-concept: Most EHR-based preventive care NLP tools; general symptom-checker AI; skin cancer classifiers in primary care settings.

"How tools are integrated into the workflow is more important than the tool itself."

This framing — from a Stanford Medicine presentation on AI in primary care workflow — reflects the consistent finding across implementation literature. The npj Digital Medicine perspective makes the same point more critically: AI systems that present ready-made conclusions compete with and curtail clinician judgment rather than genuinely supporting it. The design question — whether a tool supports reasoning or substitutes for it — is as important as the performance question for primary care deployment.

For a broader view of how AI tools behave in real clinical environments — including implementation failures and adoption patterns — the structured analysis of real clinical AI deployments provides relevant context across specialties.

Clinical Implications and Research Priorities

The evidence landscape for AI-CDS in primary care is not uniformly promising or uniformly concerning. It is stratified — and the stratification matters for clinical decision-making about adoption.

FDA-cleared point-of-care screening tools for diabetic retinopathy have the strongest evidence and the clearest clinical use case: extending specialist-level screening to primary care settings that lack ophthalmology access. The barrier to adoption is implementation, not evidence. ECG-AI for low ejection fraction has meaningful RCT support and is entering health system integration in cardiology-adjacent workflows.

LLM-based copilots present a different calculus. Early effectiveness data show fewer diagnostic and treatment errors among clinicians using the tool, but the documented automation bias risk and the 7.8% rate of harmful recommendations in the only large-scale independent safety review available should give pause to health systems considering deployment ahead of prospective validation in their own clinical context. The preprint status of the primary effectiveness study (Korom et al.) and the non-US deployment context are material limitations.

Preventive care and population health AI is the area where the evidence gap is largest relative to the commercial activity. Retrospective performance figures for NLP-based prediabetes identification and SDOH-aware care coordination are promising but cannot substitute for prospective multi-site randomized evidence. Health systems deploying these tools should treat them as pilots with structured evaluation protocols, not as validated clinical interventions.

Prospective multi-site RCT evidence is the priority gap for EHR-based preventive care AI. In-silico and retrospective performance data are not sufficient for deployment decisions in safety-net or high-risk settings.
Workflow integration design should be evaluated alongside algorithmic performance. Tools that present conclusions rather than supporting reasoning are likely to generate automation bias regardless of underlying accuracy.
Spectrum effect testing should be a standard requirement for any AI diagnostic tool considered for primary care deployment, particularly if the tool was developed or validated in specialist or secondary care settings.
The regulatory status of AI tools in use at a given health system should be verified against the FDA AI-Enabled Medical Devices list. FDA clearance and clinical effectiveness are distinct — both matter, and neither substitutes for the other.
AAFP implementation guidance: start with a clearly defined clinical problem, involve frontline clinicians in selection and design, look to EHR vendors for integration pathways first, and exercise particular caution with prototype tools in safety-net settings where patient vulnerability is highest.
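The verification step in the priorities above can be sketched as a simple inventory audit. The lookup below is transcribed from the clearance numbers cited in this article and is only a stand-in for the FDA AI-Enabled Medical Devices list, which should be the actual reference. The inventory entries, including the clearance number K999999, are hypothetical placeholders.

```python
# Clearance numbers cited in this article; a stand-in for the FDA list, not a copy of it.
ARTICLE_CLEARANCES = {
    "DEN180001": "IDx-DR",
    "K203629": "IDx-DR",
    "K213037": "IDx-DR",
    "K200667": "EyeArt",
    "K223357": "EyeArt v2.2.0",
    "K232699": "Anumana ECG-AI LEF",
    "K250652": "Anumana ECG-AI LEF",
    "K250119": "Tempus ECG-Low EF",
    "K233549": "Tempus ECG-AF",
}


def audit_inventory(inventory):
    """Return (tool, status) pairs for (tool, clearance_number_or_None) entries."""
    findings = []
    for tool, number in inventory:
        key = number.strip().upper() if number else None
        if key is None:
            status = "no clearance number recorded"
        elif key in ARTICLE_CLEARANCES:
            status = f"matches {ARTICLE_CLEARANCES[key]}"
        else:
            status = "not in lookup; verify against the FDA list"
        findings.append((tool, status))
    return findings


# Hypothetical health-system inventory.
inventory = [
    ("Retinal screening AI", "k200667"),
    ("LLM visit copilot", None),
    ("Vendor risk score", "K999999"),
]
for tool, status in audit_inventory(inventory):
    print(f"{tool}: {status}")
```

A match here establishes only that a clearance number exists for a named product. Whether the local use falls within the cleared intended use, and whether the tool is effective in that setting, are separate questions.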

For procurement teams and health IT professionals evaluating specific AI-CDS vendors, the structured landscape of active healthcare AI developers provides factual company and product profiles organized by application area and regulatory status.
