LLM Ambient AI Scribe Accuracy: What 2025–2026 Studies Show

Figure: A physician maintains eye contact with a patient while an ambient recording device sits on the desk and a draft clinical note is visible on a secondary screen. Ambient AI scribes are designed to shift clinician attention back toward patients, but clinician review of generated notes is a design requirement, not an optional step.

The Documentation Burden Problem Ambient Scribes Are Designed to Address

Physicians in the United States now spend more than half of their working day interacting with the electronic health record — and less than a quarter of that time face-to-face with patients. Clinical documentation, the process of converting a patient encounter into a structured, billable, legally defensible note, is the primary driver of that imbalance. The consequence is not merely inefficiency. Sustained documentation burden is a recognized antecedent of physician burnout, which in turn is associated with reduced care quality, increased medical error rates, and accelerating workforce attrition.

Ambient AI scribes represent the most direct technological response to this problem yet deployed at scale. Rather than asking clinicians to dictate, type, or navigate structured templates after the encounter, ambient scribes passively capture the spoken conversation during the visit and convert it into a draft clinical note that the clinician reviews and approves. The promise is a reduction in the time and cognitive load of documentation without sacrificing note quality or clinical accuracy.

Whether that promise holds in practice — and under what conditions, for which clinicians, in which care settings, and with what residual safety risks — is what the emerging peer-reviewed literature is beginning to answer. This analysis synthesizes the best available evidence through mid-2026.

How LLM-Based Ambient Scribes Work: A Technical Overview

LLM-powered ambient scribes are architecturally distinct from earlier generations of documentation assistance. Previous tools relied on physician-directed dictation — the clinician spoke to a microphone after the encounter, and automatic speech recognition (ASR) converted speech to text, often requiring manual cleanup. Rule-based natural language processing (NLP) tools could extract structured data from existing text but could not generate coherent narrative notes from conversational audio.

Current ambient scribes operate through a four-stage pipeline. First, a microphone device or application passively records the patient-clinician conversation during the visit — no clinician action is required during the encounter itself. Second, ASR converts the audio to a transcript in near real time. Third, a large language model processes the transcript and generates a draft structured clinical note, typically organized into standard sections such as chief complaint, history of present illness, assessment, and plan. Fourth, the clinician reviews the draft, edits as needed, and approves it for insertion into the EHR.

Passive ambient recording during the encounter — no clinician action required mid-visit.
Real-time ASR transcription of the patient-clinician conversation.
LLM summarization of the transcript into a structured draft note.
Clinician review, editing, and approval before EHR insertion.
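The four stages can be sketched as a small Python pipeline. This is a minimal illustration under stated assumptions, not any vendor's implementation: the names `DraftNote`, `run_scribe_pipeline`, and the three stub functions are hypothetical, and the ASR and LLM stages are stand-ins that return canned text. The one design point it tries to show is that approval is a hard gate before anything reaches the EHR.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DraftNote:
    """Draft note keyed by section name (hypothetical structure)."""
    sections: Dict[str, str]
    approved: bool = False


def run_scribe_pipeline(
    audio: bytes,                                  # stage 1: passively captured audio
    transcribe: Callable[[bytes], str],            # stage 2: ASR
    summarize: Callable[[str], Dict[str, str]],    # stage 3: LLM summarization
    review: Callable[[DraftNote], DraftNote],      # stage 4: clinician review
) -> DraftNote:
    transcript = transcribe(audio)
    draft = DraftNote(sections=summarize(transcript))
    reviewed = review(draft)
    # Refuse to hand back a note for EHR insertion unless a clinician approved it.
    if not reviewed.approved:
        raise PermissionError("draft note was not approved by a clinician")
    return reviewed


# Stand-in stages for illustration only; real ASR and LLM calls would go here.
def stub_asr(audio: bytes) -> str:
    return "Patient reports a dry cough for three days."


def stub_llm(transcript: str) -> Dict[str, str]:
    return {
        "chief complaint": "Dry cough",
        "history of present illness": transcript,
        "assessment": "",
        "plan": "",
    }


def stub_review(draft: DraftNote) -> DraftNote:
    draft.sections["plan"] = "Edited by clinician during review."
    draft.approved = True
    return draft


note = run_scribe_pipeline(b"\x00", stub_asr, stub_llm, stub_review)
```

Passing a review function that returns the draft without setting `approved` makes the pipeline raise instead of returning a note, which mirrors the requirement that review is not optional.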

This generation of tools is meaningfully different from prior ASR and NLP documentation aids because the LLM step introduces both new capability — generating coherent, contextually appropriate clinical prose from unstructured conversation — and new failure modes, including hallucination of clinical details not present in the conversation and omission of clinically relevant information that was discussed.

Efficacy Evidence: What Peer-Reviewed Studies Report on Burden and Burnout

Four peer-reviewed studies published between October 2025 and April 2026 provide the primary efficacy evidence for LLM-based ambient scribes. They vary substantially in study design, sample size, setting, and outcome measurement — and their findings, while generally consistent in direction, differ enough in magnitude to caution against treating any single result as representative.

The NEJM AI Pragmatic RCT (Afshar et al., November 2025)

The strongest evidence to date comes from a 24-week stepped-wedge individually randomized pragmatic trial published in NEJM AI in November 2025. Conducted across ambulatory clinics in two US states, the trial enrolled 66 health practitioners and captured 71,487 notes, of which 27,092 (38%) were authored using ambient AI. The trial was funded by UW Health and NIH; no funding was provided by the ambient AI vendor.

The primary outcome, practitioner well-being as measured by the Stanford Professional Fulfillment Index (PFI), showed a significant reduction in work exhaustion and interpersonal disengagement (−0.44 points; 95% CI −0.62 to −0.25; P<0.001). Documentation time decreased by 0.36 hours per day (difference −0.36; 95% CI −0.55 to −0.17). Diagnostic billing code accuracy improved with ambient AI use (P<0.001). Documentation quality, assessed with the PDSQI-9 instrument, showed no decline across the study period, with mean domain scores ranging from 3.97 to 4.99 on a five-point scale.

This is the most methodologically rigorous study in the current evidence base. Its randomized design and multi-week follow-up distinguish it from the quality improvement and observational studies that dominate the literature. Its sample size (66 practitioners) remains modest, and the setting — ambulatory clinics — limits generalizability to other care environments.

The JAMA Multisite Observational Study (Rotenstein et al., April 2026)

The largest efficacy study to date is a multisite observational study published in JAMA in April 2026, co-led by Mass General Brigham researchers. The study tracked ambient scribe adoption across five US academic medical centers over more than two years, covering more than 1,800 clinicians.

AI scribe adoption was associated with a reduction of 13.4 minutes per day in total EHR time and 16.0 minutes per day in documentation time, representing relative decreases of approximately 3% and 10%, respectively. Productivity increased modestly, with adopters completing approximately 0.49 additional patient visits per week.

The benefits were heterogeneous. Greatest reductions were observed in primary care clinicians, female clinicians, and those who used the scribe in at least 50% of visits — a threshold only 32% of adopters reached. Among high-frequency users, total EHR time reduction was approximately double and documentation time reduction approximately triple compared to lower-frequency users.

The modest reductions in documentation time we observed are unlikely to fully account for changes in burnout, underscoring the need to understand how these tools change how clinicians approach care delivery. — Rebecca Mishuris, senior author, Mass General Brigham press release on the JAMA 2026 study

The JAMA Network Open Multicenter QI Study (Olson et al., October 2025)

A multicenter quality improvement study published in JAMA Network Open in October 2025 enrolled 263 ambulatory physicians and advanced practice practitioners across six US health systems. After 30 days of ambient scribe use, the proportion of clinicians meeting burnout criteria dropped from 51.9% to 38.8%; in the adjusted model, this corresponded to 74% lower odds of burnout (adjusted OR 0.26; 95% CI 0.13–0.54; P<0.001). Cognitive task load related to note-writing decreased by 2.64 points on a 10-point scale (P<0.001), and after-hours documentation decreased by 0.90 hours per week.
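The relationship between the two burnout proportions and the odds ratio is easy to misread, so a short worked calculation may help. The sketch below computes a crude odds ratio from the two published proportions and compares it with the reported adjusted value. It assumes the before and after groups can be treated as two independent proportions, which is a simplification: the study's own estimate comes from an adjusted model, so the crude figure is only a rough reference point. The helper name `odds` is illustrative.

```python
def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


burnout_before = 0.519   # proportion meeting burnout criteria at baseline
burnout_after = 0.388    # proportion after 30 days of scribe use

# Crude (unadjusted) odds ratio from the two proportions alone.
crude_or = odds(burnout_after) / odds(burnout_before)

# Reported adjusted odds ratio and the "percent lower odds" derived from it.
adjusted_or = 0.26
percent_lower_odds = (1.0 - adjusted_or) * 100

print(f"crude OR ~ {crude_or:.2f}")                      # crude OR ~ 0.59
print(f"adjusted OR = {adjusted_or:.2f}")
print(f"lower adjusted odds = {percent_lower_odds:.0f}%")  # 74%
```

The crude ratio (about 0.59) is considerably weaker than the adjusted 0.26, so the "74% lower odds" figure reflects the adjusted model rather than the raw drop of 13.1 percentage points.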

The Singapore General Hospital Time-Motion Study (Tan et al., March 2026)

The only published direct-observation study outside the United States is a prospective within-clinician time-motion study conducted at Singapore General Hospital, published in JMIR Medical Informatics in March 2026. Five trained observers directly observed 169 consultations involving 9 experienced clinicians across 7 specialties. Documentation time per consultation decreased by 15.0% (from 5.3 to 4.5 minutes; P=.04), and the proportion of consultation time spent in eye contact with patients rose from 69.6% to 77.1% (a gain of 7.5 percentage points, reported as a 10.6% increase; P=.009). Consultation duration and total cycle time did not change significantly. Among 39 surveyed patients, 69.2% agreed their physician focused on them more during the visit.
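Relative percent changes and absolute percentage-point changes are easy to conflate when reading these figures. The sketch below recomputes both from the rounded before-and-after values quoted above. The helper name `relative_change` is illustrative, and small differences from the published percentages are expected because the inputs here are rounded and the published estimates may be model-adjusted.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change as a fraction of the starting value."""
    return (after - before) / before


# Documentation time per consultation, in minutes.
doc_relative = relative_change(5.3, 4.5)

# Share of consultation time spent in eye contact, in percent.
eye_absolute_pp = 77.1 - 69.6                  # percentage points
eye_relative = relative_change(69.6, 77.1)     # relative change

print(f"documentation time: {doc_relative * 100:.1f}%")      # -15.1%
print(f"eye contact: +{eye_absolute_pp:.1f} points")         # +7.5 points
print(f"eye contact: +{eye_relative * 100:.1f}% relative")   # +10.8% relative
```

On the rounded inputs, the eye-contact gain is 7.5 percentage points in absolute terms and roughly 11% in relative terms, which is the distinction to keep in mind when comparing against the published 10.6 figure.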

Primary efficacy studies on LLM-based ambient AI scribes, organized by study design strength. All settings are ambulatory or outpatient. No inpatient or high-acuity RCT evidence exists as of mid-2026.
Study	Design	Sample	Setting	Key Efficacy Finding
Afshar et al., NEJM AI, Nov 2025	Pragmatic RCT (stepped-wedge)	66 practitioners; 71,487 notes	Ambulatory, 2 US states	−0.44 PFI exhaustion score; −0.36 hr/day documentation time; no PDSQI-9 quality decline
Rotenstein et al., JAMA, Apr 2026	Multisite observational	1,800+ clinicians; 5 academic centers	Ambulatory, US academic medical centers	−13.4 min/day EHR time; −16.0 min/day documentation time; +0.49 visits/week
Olson et al., JAMA Network Open, Oct 2025	Multicenter quality improvement	263 ambulatory clinicians; 6 health systems	Ambulatory, US	Burnout 51.9%→38.8% (adj. OR 0.26); −2.64 cognitive task load; −0.90 hr/week after-hours
Tan et al., JMIR Med Inform, Mar 2026	Prospective time-motion (direct observation)	9 clinicians; 169 consultations	Singapore General Hospital, 7 specialties	−15.0% documentation time/consultation; eye contact 69.6%→77.1% of consultation time (reported as +10.6%); favorable patient acceptance

Accuracy and Safety Evidence: Error Rates, Severity, and the Unedited-Note Problem

The most detailed accuracy and safety data available comes from a pragmatic prospective pilot conducted at UC Davis Health, published in JMIR Medical Informatics in April 2026. Taylor and colleagues evaluated 356 AI-generated notes — representing 4.7% of 7,545 total notes produced by 31 volunteer physicians across multiple ambulatory specialties over a two-month period — for error type, frequency, and clinical severity.

Figure: Error taxonomy from the UC Davis Health pilot (Taylor et al., JMIR Med Inform, April 2026), covering omissions, hallucinations, accidental inclusions, and bias, with severity graded from mild to serious harm risk. Omissions were the most frequent error type; 5.3% of evaluated notes contained errors rated as posing serious or imminent harm risk if uncorrected.

The Error Taxonomy

Error rates from the UC Davis Health pilot (Taylor et al., JMIR Med Inform, April 2026). Rates are based on physician evaluation of 356 notes (4.7% of 7,545 total AI-generated notes). Evaluators were volunteer early adopters.
Error Type	Rate in Evaluated Notes	Description
Accidental omissions	18.0%	Clinically relevant information discussed during the encounter that was absent from the generated note
Hallucinations	11.5%	Clinical details present in the note that were not discussed during the encounter or present in the source conversation
Accidental inclusions	9.3%	Information included in the note that was not intended to be documented (e.g., incidental conversational content)
Bias	1.1%	Notes containing language reflecting demographic or other bias

The majority of errors — 83.8% — were rated as mild to moderate in clinical severity (severity grades 1–3 on a five-point scale). However, 5.3% of evaluated notes contained errors rated as posing serious or imminent risk of patient harm (severity grades 4–5) if the note were accepted without correction.

The Unedited-Note Problem

Among the 960 AI-generated notes for which vendor editing data were available, 14.9% were accepted by physicians without any modification. Individual physician editing behavior varied enormously: the proportion of words changed per note ranged from 1.9% to 69.3% across physicians, with a median of 9.0% of words changed.

This variation matters because the safety architecture of ambient scribes depends entirely on the clinician review step. If a meaningful fraction of notes is accepted unreviewed — or reviewed only superficially — then the error rates observed in the evaluation sample translate directly into documentation entering the medical record uncorrected. The UC Davis authors conclude that careful clinician review remains imperative and recommend pre-deployment piloting with standardized error monitoring protocols.
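To make the stakes concrete, the two UC Davis figures can be combined in a back-of-envelope calculation. This is an illustration only, not a finding of the study. It assumes, loudly, that a note's chance of containing a serious error is independent of whether the physician edits it, and that rates measured on two different samples (356 evaluated notes and 960 notes with editing data) can be multiplied. Neither assumption has been tested, and an unedited note is not necessarily an unreviewed one.

```python
serious_error_rate = 0.053   # share of evaluated notes with grade 4-5 errors
unedited_rate = 0.149        # share of notes accepted without modification

# ASSUMPTION: the two events are independent (not established by the study).
joint_rate = serious_error_rate * unedited_rate
per_10k_notes = joint_rate * 10_000

print(f"joint rate ~ {joint_rate * 100:.2f}%")         # ~ 0.79%
print(f"per 10,000 notes ~ {per_10k_notes:.0f}")       # ~ 79
```

Under those assumptions, roughly 79 of every 10,000 notes would both carry a serious error and be filed without any edit. The true figure could be lower if physicians preferentially edit flawed notes, or higher if superficial review misses errors in edited notes as well.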

Study Design Quality and Evidence Limitations

Evaluating the evidence base as a whole requires assessing not just what individual studies found but how much confidence those findings warrant. Several methodological patterns recur across the literature and constrain how far any finding can be generalized.

Volunteer and early-adopter bias. Most studies recruited participants who self-selected into ambient scribe use. Early adopters are systematically more favorable toward technology, more motivated to make it work, and less representative of the broader clinician population.
Absence of control groups in QI and observational studies. Only the NEJM AI trial used randomization. The Olson et al. QI study and the JAMA observational study cannot establish causal relationships between scribe use and observed outcomes.
EHR timestamp measurement limitations. The JAMA multisite study used EHR log data to estimate documentation time. EHR timestamps capture when the record was open, not when the clinician was actively writing — a known overestimation problem. The Singapore direct-observation study avoided this limitation but covered only 9 clinicians.
Ambulatory-only sample concentration. All four primary efficacy studies were conducted in outpatient or ambulatory settings. No peer-reviewed RCT evidence exists for inpatient, emergency department, or ICU deployment.
Short follow-up windows. Burnout outcomes in the Olson et al. study were measured at 30 days. Whether improvements persist at 6 or 12 months — or whether novelty effects attenuate — is unknown.
Heterogeneous evaluation instruments. Studies variously used PDQI-9, PDSQI-9, NASA-TLX, and single-item burnout scales, which are not interchangeable. This heterogeneity limits direct cross-study comparison of documentation quality and cognitive burden outcomes.

High rates of omissions and hallucinations found in some studies underscore the need to evaluate for potential diagnostic errors or other long-term safety risks. — Leung, Coristine, and Benis, JMIR Medical Informatics editorial, August 2025

The JMIR editorial launching a new section on ambient AI scribe evidence explicitly characterizes the current body of literature as dominated by small-scale, short-term pilot studies that often have volunteer participants who may be biased toward technology. That characterization is accurate as of mid-2026.

Evidence Gaps: What the Current Literature Does Not Cover

The gaps in the evidence base are as important as the findings. For health system decision-makers evaluating deployment beyond ambulatory primary care, the following domains have either no peer-reviewed RCT evidence or evidence too limited to support generalization.

Inpatient, ICU, and Emergency Department Settings

High-acuity settings present distinct barriers that ambulatory studies do not address: multi-speaker conversations involving care teams rather than dyadic patient-clinician exchanges, acoustic environments with competing noise sources, rapid-pace encounters where documentation timing differs fundamentally from scheduled appointments, and higher-stakes documentation where errors carry more immediate clinical consequence. A perspective published in npj Digital Medicine in March 2026 by teams from Mayo Clinic and Singapore General Hospital explicitly identifies these settings as presenting barriers not evaluated in existing studies, with no published RCT evidence for any of them.

Non-Physician Clinicians

Nurses, pharmacists, therapists, and other non-physician clinicians contribute substantially to clinical documentation but are largely absent from the current study populations. Whether ambient scribe performance, error rates, and workflow integration patterns differ for these groups is unknown.

Downstream Patient Clinical Outcomes

No published study has demonstrated an improvement in patient clinical outcomes attributable to ambient scribe use. All reported outcomes are process measures (documentation time, EHR log time) or clinician-experience measures (burnout, cognitive load, professional fulfillment). Whether reduced documentation burden translates into better diagnostic accuracy, fewer missed follow-ups, or improved patient safety is a critical unanswered question.

Health Equity and Underrepresented Populations

LLM training data is known to reflect demographic disparities in the source corpora used for pretraining. The UC Davis pilot identified bias errors in 1.1% of evaluated notes — a small fraction, but one that is clinically meaningful if it systematically affects documentation for specific patient populations. The npj Digital Medicine perspective explicitly identifies LLM training data disparities as a systemic concern for underrepresented groups and calls for targeted evaluation. No existing published study has conducted a formal equity analysis of ambient scribe error rates stratified by patient race, ethnicity, language, or socioeconomic status.

Multilingual and Non-English-Language Environments

The Singapore General Hospital study is the only published direct observational study to include multilingual encounters (English, Mandarin, and Malay). It used an in-house tool not commercially available, required manual EHR transfer, and studied only 9 clinicians. No commercially deployed ambient scribe has peer-reviewed published evidence specifically addressing non-English-language encounter accuracy.

Long-Term Sustainability and Cognitive Debt

No study has followed ambient scribe users beyond several months. Whether clinicians maintain active review habits over time — or whether repeated exposure to high-quality AI drafts gradually reduces the cognitive engagement applied to each note — is unknown. The JMIR editorial raises the risk of "cognitive debt": a gradual atrophy of documentation skills and clinical reasoning engagement as clinicians defer to AI-generated text.

Regulatory and Governance Status

The regulatory classification of ambient AI scribes with LLM-based summarization capability is unsettled as of June 2026. This matters for health systems making deployment decisions, because regulatory classification determines liability frameworks, post-market surveillance obligations, and the evidentiary standards that manufacturers must meet.

The npj Digital Medicine perspective published in March 2026 — the most current peer-reviewed synthesis of regulatory status available — states that the US FDA's Software as a Medical Device (SaMD) framework and the EU Medical Device Regulation have not yet provided clear, harmonized guidance on the classification and risk categorization of ambient AI scribes with summarization capability. The NHS in England is identified as the first regulatory body to release guidance specifically addressing these tools, requiring that ambient AI scribes with summarization capability undergo regulatory scrutiny.

Beyond formal regulatory classification, the npj Digital Medicine perspective identifies accountability gaps that precede regulatory resolution: liability structures for documentation errors in AI-assisted notes are undefined in most jurisdictions, institutional policies governing ambient recording and patient consent vary substantially across health systems, and international consensus on evaluation standards — including the PDQI-9, SCRIBE, and CRAFT-MD frameworks — has not been reached.

Deployment Implications: What the Evidence Supports and What It Does Not

Health systems evaluating ambient AI scribe deployment should distinguish between what the current evidence base affirmatively supports and what it does not — and should resist the tendency to extend findings from the studied population (ambulatory primary care, early adopters, academic medical centers) to settings and populations not yet studied.

Summary of what the 2025–2026 peer-reviewed evidence supports and does not support for ambient AI scribe deployment decisions. Based on studies available as of June 2026.
Domain	What the Evidence Supports	What the Evidence Does Not Support
Documentation burden	Modest reduction in documentation time in ambulatory primary care (16.0 min/day documentation time and 13.4 min/day total EHR time in JAMA study; 0.36 hr/day, about 22 min/day, in NEJM AI RCT)	Large or transformative time savings; savings equivalent across all adopter types or care settings
Clinician burnout	Short-term burnout reduction in early adopters (30-day, self-reported, no control group)	Durable burnout improvement; causal attribution to documentation time reduction alone
Documentation quality	No detected decline in note quality in the NEJM AI RCT (PDSQI-9)	Autonomous note acceptance without structured clinician review; quality equivalence to physician-authored notes
Patient experience	Improved patient-facing attention (eye contact, perception of engagement) in Singapore study	Improvement in patient clinical outcomes; generalizability beyond observed settings
Setting generalizability	Ambulatory and outpatient primary care in US academic medical centers	Inpatient, ICU, emergency department, or high-acuity settings; non-physician clinicians
Safety	Most errors are mild to moderate; serious errors appear in a minority of notes (5.3%) and rely on clinician review to be caught	Autonomous deployment without error monitoring; safety equivalence to fully physician-authored notes

The UC Davis authors' recommendation for pre-deployment piloting with standardized error monitoring is the most operationally concrete guidance the evidence base currently provides. Institutions deploying ambient scribes without a structured mechanism to track error rates, editing behavior, and near-miss events are operating without the safety feedback loops that the current evidence suggests are necessary.
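One piece of such a feedback loop, tracking editing behavior, can be sketched with the Python standard library. The function below estimates the fraction of words changed between an AI draft and the signed note using `difflib.SequenceMatcher`. The metric definition is an assumption made for illustration: the UC Davis study relied on vendor editing data whose exact formula is not described here, and the names `fraction_words_changed` and `accepted_unedited` are hypothetical.

```python
import difflib


def fraction_words_changed(draft: str, final: str) -> float:
    """Share of words that differ between draft and final note.

    ASSUMPTION: 'changed' is defined as 1 minus the matched-word count
    divided by the longer note's word count; other definitions are possible.
    """
    draft_words, final_words = draft.split(), final.split()
    if not draft_words and not final_words:
        return 0.0
    matcher = difflib.SequenceMatcher(a=draft_words, b=final_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / max(len(draft_words), len(final_words))


def accepted_unedited(draft: str, final: str) -> bool:
    """Flag notes signed with no word-level change at all."""
    return fraction_words_changed(draft, final) == 0.0


draft_note = "Patient denies chest pain and reports mild cough"
final_note = "Patient denies chest pain and reports severe cough"

changed = fraction_words_changed(draft_note, final_note)   # 1 of 8 words differs
unedited = accepted_unedited(draft_note, draft_note)
```

Aggregating these two values per clinician over time would give a simple view of how often notes are signed unchanged and how heavily drafts are edited, which could then be reviewed alongside sampled error audits.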

Limitations of This Analysis

This synthesis has several limitations that readers should weigh when applying its conclusions.

Narrow primary study base. The evidence synthesized here is drawn from four primary efficacy studies and one accuracy study, concentrated in US ambulatory care and largely at academic medical centers, with one study from Singapore General Hospital. The conclusions reflect this sample and should not be extended to other settings.
Paywalled primary source. The JAMA 2026 multisite study (Rotenstein et al., DOI: 10.1001/jama.2026.2253) is behind a paywall. Key metrics cited in this analysis were verified through the Mass General Brigham institutional press release and a secondary summary; they are consistent across sources but were not verified against the full published text.
UC Davis error rate scope. The error rates reported from the UC Davis Health pilot (18% omissions, 11.5% hallucinations) derive from physician evaluation of only 4.7% of all AI-generated notes, by volunteer early adopters. These rates may not represent the full distribution of errors across all notes, all tools, or all care settings.
Self-reported burnout outcomes. All burnout findings cited are short-term, self-reported, and from populations predisposed toward technology adoption. They should not be treated as durable or causally established outcomes.
Regulatory status currency. Regulatory status claims in this analysis reflect the evidence available as of June 2026. FDA SaMD classification guidance for ambient AI scribes with summarization capability is an active and evolving area; the status described here may change.
