NLP in Clinical Documentation: AI Scribes, Coding, and CDI

Definition and Scope

Natural language processing (NLP) in clinical documentation refers to a family of AI sub-techniques that extract, interpret, and transform unstructured clinical text — and spoken language — into structured, computable data. In healthcare contexts, NLP is not a single technology but a layered pipeline of discrete computational tasks, each addressing a specific challenge in converting human-generated clinical language into formats usable by EHR systems, coding engines, and documentation review tools.

The scope of clinical NLP spans three operationally distinct workflow domains: ambient AI scribing (converting encounter audio into structured clinical notes), computer-assisted coding (mapping note text to ICD-10, CPT, and related code sets), and clinical documentation integrity and improvement (identifying documentation gaps that affect diagnosis-related group assignment, case mix index, and quality metrics). Each domain draws on overlapping but differently weighted components of the NLP pipeline.

A terminological note applies throughout this entry: the abbreviation CDI carries two related but distinct meanings in active professional use. Clinical Documentation Improvement is the process-focused framing, emphasizing workflow interventions and query practices that improve the completeness of documentation at the point of care. Clinical Documentation Integrity is the quality-focused framing, emphasizing the accuracy, validity, and compliance of the documentation record over time. Vendor materials, professional organizations such as AHIMA and ACDIS, and peer-reviewed literature use both framings; this entry uses CDI to encompass both and distinguishes them where the distinction is operationally relevant.

Why Clinical NLP Differs from General NLP

Clinical text presents a set of structural and linguistic properties that distinguish it sharply from the general-domain text on which most foundational NLP models are trained. These properties are not incidental — they directly affect model accuracy, generalizability, and the risk of consequential errors in downstream clinical or administrative tasks.

High abbreviation density. Clinical notes routinely compress multi-word concepts into institution-specific or specialty-specific abbreviations. "SOB" may mean shortness of breath or a surgical order bundle depending on context. "MS" resolves differently in neurology, orthopedics, and pharmacy. Models trained on general corpora systematically misinterpret these contractions.
Specialty-specific vocabulary. Cardiology, oncology, and psychiatry each maintain terminology, diagnostic criteria, and procedural language that diverges substantially from general English and from each other. A model performing well on general medical text may underperform significantly on subspecialty operative or pathology reports.
Negation and uncertainty patterns. Clinical notes routinely document what is absent, ruled out, or uncertain: "no chest pain," "PE unlikely," "family history of colon cancer." These negated and qualified references must be correctly scoped to avoid erroneous code assignment or documentation flagging. Negation in clinical text follows patterns that general NLP models handle poorly without domain-specific training.
Inconsistent inter-institution terminology. Documentation conventions vary across EHR platforms, health systems, and individual clinicians. A model trained on Epic-generated notes from one academic medical center may not transfer reliably to Oracle Health notes from a community hospital, even when the underlying clinical content is similar.
Implicit temporal and contextual structure. A single note may reference current diagnoses, historical conditions, family history, and hypothetical differential diagnoses without explicit structural markers separating them. Correctly attributing a clinical concept to the right temporal and contextual frame requires assertion and temporality detection capabilities that go beyond standard text classification.
Cross-institutional portability failures. NLP models trained and validated on data from one institution frequently underperform when deployed at a different site. This is a documented and persistent limitation in the clinical NLP literature, with direct implications for vendor claims about generalized accuracy benchmarks.

These properties collectively mean that clinical NLP is not simply a domain adaptation of general NLP — it requires purpose-built training data, specialized sub-task architectures, and ongoing validation in the specific deployment environment. Performance metrics from one institution or EHR context should not be assumed to apply elsewhere.
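The abbreviation problem described above can be illustrated with a very small lookup. This is a minimal sketch, not a description of how any deployed system disambiguates: the `SENSES` table and the `expand` function are hypothetical names introduced here, and a real system would infer the sense from surrounding text rather than from a specialty label alone.

```python
from typing import Optional

# Hypothetical sense inventory for illustration only. A production system
# would learn senses from annotated clinical text, not a hand-written table.
SENSES = {
    "MS": {
        "neurology": "multiple sclerosis",
        "orthopedics": "musculoskeletal",
        "pharmacy": "morphine sulfate",
    },
}


def expand(abbrev: str, specialty: str) -> Optional[str]:
    """Return the expansion for this specialty, or None when no sense is known."""
    return SENSES.get(abbrev.upper(), {}).get(specialty.lower())


print(expand("MS", "neurology"))    # multiple sclerosis
print(expand("MS", "pharmacy"))     # morphine sulfate
print(expand("MS", "dermatology"))  # None: unknown context, so no guess is made
```

Returning `None` for an unseen context, rather than a default expansion, mirrors the point made above: a sense that is correct at one site or specialty should not be assumed to hold elsewhere.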

Core NLP Sub-Technique Taxonomy

Clinical NLP pipelines are composed of discrete sub-techniques that operate in sequence, each transforming the output of the previous step. Understanding which sub-technique does what — and where it can fail — is foundational to evaluating vendor claims and interpreting study findings. The following taxonomy covers the principal components found in deployed clinical NLP systems.

A left-to-right pipeline diagram showing eight clinical NLP sub-technique stages: ASR, Tokenization, NER, Negation Detection, Relation Extraction, Assertion/Temporality, Coreference Resolution, and ClinicalBERT/LLM, each connected by arrows. — The clinical NLP sub-technique pipeline: each stage transforms the previous stage's output, building toward structured clinical data. LLMs operate primarily at the generation and reasoning stages, not as a replacement for earlier pipeline components.

Core NLP sub-techniques in clinical documentation pipelines, with clinical examples and primary workflow domains. LLMs are one component of this pipeline, not a synonym for clinical NLP.
Sub-Technique	What It Does	Clinical Example	Primary Domain(s)
Automatic Speech Recognition (ASR)	Converts spoken language to raw text; clinical ASR is trained on medical vocabulary including drug names, ICD codes, and specialty terminology	Transcribes a physician's spoken assessment during a patient encounter into editable text	Ambient AI scribing
Tokenization	Segments raw text into discrete units (tokens) — words, subwords, or characters — that downstream models process	Splits "Pt c/o SOB x3d" into processable token units before entity recognition	All domains (foundational step)
Named Entity Recognition (NER)	Identifies and classifies clinically meaningful spans of text — diagnoses, medications, procedures, anatomical locations, lab values	Identifies "type 2 diabetes mellitus" as a diagnosis entity and "metformin 500 mg" as a medication entity in a progress note	Computer-assisted coding; CDI; AI scribing
Relation Extraction	Identifies semantic relationships between entities within a sentence or document	Links "metformin" to "type 2 diabetes mellitus" as a treatment relationship, or links a dosage to its corresponding medication	Computer-assisted coding; CDI
Negation Detection	Determines whether an identified entity is asserted as present or negated (absent, ruled out, or unlikely)	Correctly scopes "no evidence of pulmonary embolism" so that PE is not coded as a present diagnosis	Computer-assisted coding; CDI; AI scribing
Assertion and Temporality Detection	Classifies the assertion status (current, historical, hypothetical, family history) and temporal frame of a clinical concept	Distinguishes "history of cervical cancer, s/p hysterectomy with no evidence of recurrence" from an active cervical cancer diagnosis	Computer-assisted coding; CDI
Coreference Resolution	Links pronouns and noun phrases across a document to the same underlying entity	Resolves "she" and "the patient" to the same individual when tracking symptom attribution across a multi-paragraph note	AI scribing; CDI
ClinicalBERT / Fine-Tuned LLMs	Transformer-based language models pre-trained or fine-tuned on clinical text; perform generation, summarization, and complex reasoning tasks	Generates a SOAP note summary from a diarized encounter transcript; suggests ICD-10 codes from a discharge summary	AI scribing (generation); computer-assisted coding (code suggestion)
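As one illustration of how a single stage in the taxonomy above behaves, the following is a minimal rule-based sketch of negation detection in the spirit of trigger-and-scope approaches. The trigger list, the scope terminators, and the six-token window are assumptions chosen for the example; deployed systems use much larger lexicons or learned models.

```python
import re

# Illustrative trigger phrases only (assumed, not an authoritative lexicon).
NEGATION_TRIGGERS = ["no evidence of", "negative for", "denies", "without", "no"]
# Tokens that end a negation scope in this sketch.
SCOPE_TERMINATORS = {"but", "however", "except", ".", ";"}
WINDOW = 6  # assumed maximum number of tokens a trigger can negate


def tokenize(text):
    """Lowercase the text and split it into word tokens and basic punctuation."""
    return re.findall(r"[a-z0-9/]+|[.;,]", text.lower())


def negated_indices(tokens):
    """Return the set of token positions that fall inside a negation scope."""
    triggers = sorted((t.split() for t in NEGATION_TRIGGERS), key=len, reverse=True)
    negated = set()
    i = 0
    while i < len(tokens):
        # Prefer the longest trigger that matches at this position.
        match = next((t for t in triggers if tokens[i:i + len(t)] == t), None)
        if match is None:
            i += 1
            continue
        start = i + len(match)
        for j in range(start, min(start + WINDOW, len(tokens))):
            if tokens[j] in SCOPE_TERMINATORS:
                break
            negated.add(j)
        i = start
    return negated


def is_negated(text, entity):
    """True if the first occurrence of entity lies entirely inside a negation scope.

    Returns False when the entity does not occur in the text.
    """
    tokens, ent = tokenize(text), tokenize(entity)
    negated = negated_indices(tokens)
    for i in range(len(tokens) - len(ent) + 1):
        if tokens[i:i + len(ent)] == ent:
            return all(k in negated for k in range(i, i + len(ent)))
    return False


note = "No evidence of pulmonary embolism, but pneumonia is present."
print(is_negated(note, "pulmonary embolism"))  # True
print(is_negated(note, "pneumonia"))           # False
```

The scope terminator "but" is what keeps "pneumonia" from being negated here. Rules like this are brittle across documentation styles, which is one reason the table lists negation detection as a distinct component with its own failure profile.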

Domain 1 — Ambient AI Scribing

Ambient AI scribing systems convert the spoken language of a clinical encounter into a structured draft note without requiring the clinician to dictate explicitly or interact with the EHR during the visit. The end-to-end pipeline involves several sequential NLP components, each of which introduces its own accuracy profile and failure modes.

ASR with medical vocabulary training captures the encounter audio and produces a raw transcript. Medical-domain ASR systems achieve substantially lower word error rates than general-purpose ASR — approximately 2–5% versus 8–15% — by incorporating training data that includes drug names, anatomical terms, and specialty-specific phrasing.
Speaker diarization separates the transcript into physician and patient speech turns. This is a prerequisite for correct attribution of symptoms, history, and clinical assessments to the appropriate speaker.
NLP-based clinical summarization processes the attributed transcript using NER, relation extraction, and LLM-based generation to identify and organize clinical entities — chief complaint, history of present illness, review of systems, physical exam findings, assessment, and plan — into the structured sections of a clinical note format.
SOAP or APSO note draft generation uses the extracted and organized clinical entities to produce a formatted draft note in the documentation style configured for the deployment site. Negation detection at this stage ensures that ruled-out conditions in the review of systems are not documented as positive findings.
EHR integration via HL7/FHIR APIs delivers the draft note into the appropriate EHR encounter for clinician review, editing, and attestation before it becomes part of the legal medical record.
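The word error rate figures quoted for the ASR step are computed as word-level edit distance divided by the number of words in the reference transcript. The sketch below shows that calculation under simple assumptions (whitespace tokenization, case-insensitive comparison); the function name `word_error_rate` and the sample sentences are introduced here for illustration and are not drawn from any cited study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "patient started on metformin 500 mg twice daily"
hypothesis = "patient started on metformin 500 mg twice a day"
print(word_error_rate(reference, hypothesis))  # 0.25
```

A single aggregate rate treats every word equally. An error on a drug name or dose counts the same as an error on a filler word, so a low rate does not by itself show that the clinically important tokens were transcribed correctly.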

A diagrammatic illustration showing a shared NLP processing layer feeding into three parallel workflow lanes: Ambient AI Scribing (audio to SOAP note), Computer-Assisted Coding (note to ICD-10 codes), and CDI/Documentation Integrity (document gaps to CDI alert panel). — The three clinical documentation domains served by NLP, each drawing on a shared set of sub-techniques but applying them at different points in the documentation lifecycle.

The peer-reviewed evidence base for ambient AI scribing has grown substantially since 2022 but remains limited in scope and consistency. A 2024 systematic review of 129 peer-reviewed studies (Perkins et al.) found that AI tools improve clinical documentation by structuring data, annotating notes, and evaluating quality. Studies of AI speech recognition reported documentation time reductions ranging from 19% to 56%, but results were inconsistent: four studies reported increases in documentation time of 13.4–50%. The review concluded that while current AI tools offer targeted improvements, "moderately high error rates preclude the broad use of a comprehensive AI documentation assistant," and that a comprehensive, highly accurate end-to-end assistant has not yet been validated in the peer-reviewed literature.

A 2025 systematic review of 8 AI scribe intervention studies (Sasseville et al.) found positive trends in documentation efficiency. One peer-matched cohort study (Haberle et al.) showed that documentation time per patient decreased from 5.3 minutes to 4.54 minutes for users of one ambient documentation system, and that 24-hour documentation deficiency rates fell from 8.6% to 6.3%. However, the review found limited impact on burnout: one included study reported no statistically significant change in burnout scores (p=0.081) despite improved perceptions of documentation time. The review authors explicitly cautioned that the evidence remains limited and heterogeneous, and called for broader real-world pragmatic evaluations.

Domain 2 — Computer-Assisted Coding

Computer-assisted coding (CAC) systems apply NLP to clinical documentation — primarily discharge summaries, operative reports, and outpatient encounter notes — to suggest ICD-10-CM, ICD-10-PCS, CPT, or HCPCS codes for human review and assignment. The pipeline moves from unstructured text to structured code suggestions through a sequence of NLP sub-tasks.

NER identifies diagnoses, medications, procedures, and anatomical locations as candidate entities within the clinical note text. This is the foundational step; errors here propagate through all subsequent stages.
Concept normalization maps the identified entity spans to standardized terminology — typically UMLS (Unified Medical Language System) concept unique identifiers or SNOMED CT codes — resolving the many surface forms in which a single clinical concept may appear across different notes and documentation styles.
Negation and assertion filtering removes entities that are negated, historical, hypothetical, or attributed to a family member from the active coding candidate list. This step is operationally critical: coding a ruled-out condition as present, or coding a family history as a patient diagnosis, constitutes a coding error with direct billing and compliance consequences.
Code suggestion generation maps the normalized, assertion-filtered entities to the appropriate code set — ICD-10-CM for diagnoses, ICD-10-PCS for inpatient procedures, CPT for outpatient procedures — and presents ranked code suggestions to a human coder for review.
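The four steps above can be sketched end to end with toy lookup tables. This is an illustrative sketch only: the `Entity` class, the `CONCEPT_*` identifiers (placeholders, not real UMLS CUIs), and the dictionary contents are assumptions made for the example, and the assertion labels are assumed to come from an upstream NER and assertion model.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    text: str       # surface form found by the NER step
    assertion: str  # "present", "negated", "historical", "family", or "hypothetical"


# Hypothetical normalization table: many surface forms, one concept.
# Concept identifiers are placeholders, not real UMLS CUIs.
SURFACE_TO_CONCEPT = {
    "type 2 diabetes mellitus": "CONCEPT_T2DM",
    "t2dm": "CONCEPT_T2DM",
    "dm2": "CONCEPT_T2DM",
    "pulmonary embolism": "CONCEPT_PE",
    "pe": "CONCEPT_PE",
}

# Toy concept-to-code map. Defaulting to unspecified codes is a simplification;
# real code selection depends on documented specificity and coding guidelines.
CONCEPT_TO_ICD10CM = {
    "CONCEPT_T2DM": "E11.9",
    "CONCEPT_PE": "I26.99",
}


def suggest_codes(entities):
    """Return de-duplicated ICD-10-CM suggestions for entities asserted as present."""
    suggestions = []
    for ent in entities:
        if ent.assertion != "present":  # negation and assertion filtering
            continue
        concept = SURFACE_TO_CONCEPT.get(ent.text.lower())  # concept normalization
        code = CONCEPT_TO_ICD10CM.get(concept)               # code suggestion
        if code and code not in suggestions:
            suggestions.append(code)
    return suggestions


note_entities = [
    Entity("T2DM", "present"),
    Entity("PE", "negated"),  # e.g. "no evidence of PE"
    Entity("type 2 diabetes mellitus", "present"),
]
print(suggest_codes(note_entities))  # ['E11.9']
```

Because the filter runs on labels produced upstream, an assertion error made earlier in the pipeline passes straight through to the suggestion list, which is why the output is presented to a human coder rather than submitted directly.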

A 2025 study published in Nature (Hou et al.) evaluated fine-tuning LLMs (GPT-4o mini and Llama variants) on the complete ICD-10 code set (74,260 code-description pairs). Fine-tuning increased exact code matching from under 1% to 97% in controlled base scenarios. However, when applied to full real-world clinical notes from the MIMIC-IV dataset (discharge summaries), exact ICD code matching reached only 69.20% at the top-1 prediction level, with category-level matching at 87.16%.
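The gap between exact and category-level matching is easier to interpret with the two metrics written out. The sketch below assumes that "category-level" means agreement on the three-character ICD-10 category (the part before the decimal point); the study's own definition may differ, and the sample codes are invented for illustration rather than taken from the study.

```python
def top1_match_rates(gold, predicted):
    """Return (exact_rate, category_rate) for paired top-1 code predictions.

    Category is taken as the part of the code before the decimal point,
    which is an assumption about how category-level matching is defined.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    exact = sum(g == p for g, p in zip(gold, predicted))
    category = sum(g.split(".")[0] == p.split(".")[0] for g, p in zip(gold, predicted))
    return exact / len(gold), category / len(gold)


gold = ["E11.9", "I50.32", "N17.9", "J18.9"]
predicted = ["E11.9", "I50.30", "N18.30", "J18.9"]
exact_rate, category_rate = top1_match_rates(gold, predicted)
print(exact_rate, category_rate)  # 0.5 0.75
```

In this toy data, the second prediction lands in the right category but misses the required specificity, while the third confuses an acute condition with a chronic one. Only the first kind of miss is recovered by category-level scoring.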

The study identified four primary failure modes on real-world clinical notes:

Failure modes identified in LLM-based ICD-10 coding on real-world MIMIC-IV clinical notes (Hou et al., 2025). Clinical context misinterpretation, which includes negation and assertion errors, accounts for only about 8% of errors, but its consequences for coding accuracy and compliance are disproportionate to its frequency.
Failure Mode	Description	Share of Errors	Clinical Example
Information absence	The note lacks sufficient clinical detail to support the specific code	~42%	A diagnosis is mentioned without the specificity required to select between closely related ICD-10 codes
Diagnostic criteria insufficiency	The model applies a code without the clinical criteria required to justify it	~38%	Assigning a code for a specific severity level of a condition when the note does not document the criteria for that severity
Clinical context misinterpretation	The model misreads assertion, temporality, or negation context	~8%	Coding "history of cervical cancer s/p hysterectomy with no evidence of recurrence" as an active cervical cancer diagnosis
Coding rule violations	The model violates official ICD-10 coding guidelines or sequencing rules	~12%	Assigning a code that is designated as a manifestation code as a principal diagnosis

Domain 3 — Clinical Documentation Integrity and Improvement (CDI)

CDI programs use NLP to perform concurrent or prospective review of clinical documentation, identifying gaps in specificity, underdocumented secondary diagnoses, and missed complication or comorbidity (CC) and major complication or comorbidity (MCC) opportunities that affect diagnosis-related group (DRG) assignment, case mix index, and quality reporting metrics.

In traditional CDI workflows, a CDI specialist manually reviews inpatient encounters to identify documentation gaps and initiates physician queries to clarify or expand documentation. NLP-powered CDI tools partially automate the chart review process, scanning free-text notes for clinical concepts that suggest underdocumented diagnoses — for example, laboratory values and medication orders that are consistent with acute kidney injury but where no AKI diagnosis has been explicitly documented.
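The acute kidney injury example above amounts to a gap check: a clinical indicator is present in structured data, but the matching diagnosis is absent from the documentation. The following is a simplified sketch under stated assumptions. The `Encounter` fields and the `aki_query_opportunity` function are hypothetical, and the 1.5x baseline creatinine ratio echoes a commonly used relative-rise criterion while ignoring time windows, urine output, and medication orders. It is not suitable for clinical use.

```python
from dataclasses import dataclass, field


@dataclass
class Encounter:
    encounter_id: str
    baseline_creatinine: float                              # mg/dL
    creatinine_values: list = field(default_factory=list)   # mg/dL, in time order
    documented_diagnoses: set = field(default_factory=set)  # lowercase strings from the NLP layer


def aki_query_opportunity(enc: Encounter, ratio_threshold: float = 1.5) -> bool:
    """Flag the encounter for CDI specialist review.

    True when any creatinine value reaches ratio_threshold times baseline
    and no acute kidney injury diagnosis appears in the documented diagnoses.
    """
    if enc.baseline_creatinine <= 0:
        raise ValueError("baseline creatinine must be positive")
    indicator = any(
        value >= ratio_threshold * enc.baseline_creatinine
        for value in enc.creatinine_values
    )
    documented = any(
        "acute kidney injury" in dx or dx == "aki"
        for dx in enc.documented_diagnoses
    )
    return indicator and not documented


enc = Encounter("enc-001", 1.0, [1.1, 1.7], {"pneumonia"})
print(aki_query_opportunity(enc))  # True
```

The function returns a flag for specialist review rather than a suggested diagnosis. Whether and how to query the physician remains a human decision.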

A 2025 observational study at a 726-bed U.S. hospital analyzed 43,597 patient encounters over 11 months. An NLP- and rule-based AI tool presented hospitalists with a draft assessment and plan and suggested documentation specificity improvements intended to reduce CDI queries. Overall CDI query rates decreased from 9.2% to 7.8% — a 14.8% relative reduction (p=0.02). Among clinicians classified as active full users, query rates fell from 10.1% to 8.1%, a 20.4% relative reduction (p=0.04). The tool maintained a human-in-the-loop design requiring clinicians to manually transfer AI-suggested content into the EMR.
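Relative reduction is the absolute change divided by the starting rate. The short calculation below applies that formula to the rounded rates quoted above. It gives roughly 15.2% and 19.8%, close to but not identical with the reported 14.8% and 20.4%. The likely explanation is that the study computed its figures from unrounded or adjusted rates, but that is an assumption rather than something stated in the text above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in percent: (before - after) / before * 100."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# Rates as quoted (already rounded to one decimal place).
print(round(relative_reduction(9.2, 7.8), 1))   # 15.2
print(round(relative_reduction(10.1, 8.1), 1))  # 19.8
```

The same check is worth applying to any vendor or study figure that reports a relative change alongside its underlying rates.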

CDI program scope is expanding. The CMS proposed phase-out of the Medicare Inpatient-Only (IPO) list, beginning with 285 primarily musculoskeletal procedures in CY 2026, shifts high-revenue surgical MS-DRGs toward outpatient reimbursement structures, compressing case mix index for affected service lines. CDI teams are extending NLP-assisted concurrent review into outpatient settings to support accurate comorbidity capture under the Medicare Two-Midnight Rule and to justify inpatient admission criteria for conditions such as morbid obesity, chronic diastolic heart failure, and CKD stage ≥3a.

"Technology like NLP software, which can help interpret the patient encounter and identify conditions with query opportunities, is paramount to making a CDI team more efficient. Partially automating the chart review process allows for more reviews during a CDI specialist's day."

This practitioner framing from the CDI specialist community reflects the operational logic of NLP-assisted CDI: the technology increases the throughput of concurrent review; it does not eliminate specialist judgment. Physician queries generated by AI-assisted CDI tools must remain compliant with AHIMA and ACDIS guidelines: they must be non-leading, must not reference reimbursement impact, and must not steer providers toward specific diagnoses.

Known Failure Modes and Accuracy Limitations

Clinical NLP systems fail in documented, patterned ways. Understanding these failure modes is necessary for evaluating vendor accuracy claims, designing deployment oversight, and interpreting study findings that report aggregate performance metrics without disaggregating error types.

LLM hallucination. Generative components of clinical NLP pipelines — particularly LLMs used for note summarization and code suggestion — can produce plausible-sounding but clinically incorrect output. Hallucinated diagnoses, fabricated medication dosages, or invented procedure details in a generated note represent a direct patient safety risk if the clinician reviewing the draft fails to catch the error.
Negation and assertion errors. Incorrectly scoping a negated or historical clinical concept as currently present is one of the most consequential error types in both coding and CDI applications. The Hou et al. study found that clinical context misinterpretation — which includes negation failures — produced errors such as coding a historical cancer as an active diagnosis.
Population and demographic bias. Clinical NLP training datasets are not demographically representative. Models trained predominantly on documentation from academic medical centers or specific geographic regions may underperform on documentation from safety-net hospitals, rural settings, or patient populations whose clinical presentations, comorbidity profiles, or documentation styles differ from the training distribution.
EHR portability failures. NLP models validated on one EHR platform's documentation structure frequently underperform when deployed on a different platform. Template structures, note section ordering, field naming conventions, and auto-populated text patterns vary enough across EHR systems to meaningfully affect model performance.
Clinician overreliance. When AI-generated note drafts or code suggestions are accepted without adequate review, errors in the AI output propagate into the legal medical record or the coded claim. The human-in-the-loop design requirement present across CDI and coding evidence reflects this risk — it is a mitigation strategy, not a feature preference.
Publication lag. LLM-based clinical NLP tools are deployed commercially well ahead of peer-reviewed evaluation. The Perkins et al. systematic review noted a 46% decrease in peer-reviewed AI CDI studies per month following the release of large general-purpose language models — suggesting that the research community has not kept pace with commercial deployment. Accuracy benchmarks from current peer-reviewed literature may understate the capabilities of the most recently deployed systems, but they also cannot yet confirm whether those improvements hold in real-world clinical environments.

Regulatory and Oversight Context

The regulatory classification of clinical NLP tools is not uniform. Some AI scribing and computer-assisted coding systems meet the FDA's Software as a Medical Device (SaMD) threshold and require 510(k) clearance or De Novo authorization before commercial deployment in the United States. Others are structured to qualify for the clinical decision support software exemption under the 21st Century Cures Act, which excludes from FDA device regulation software that displays, analyzes, or prints medical information for a clinician who can independently review the basis of the recommendation — provided the software is not intended to replace clinical judgment for serious or life-threatening conditions.

The distinction between regulated SaMD and exempt clinical decision support software is consequential for procurement. A tool that generates ICD-10 code suggestions for coder review may qualify for the CDS exemption; a tool that generates a clinical diagnosis or drives a treatment recommendation may not. Buyers should verify the regulatory pathway — or confirmed exemption basis — for any NLP-based clinical documentation tool before deployment.

Patient privacy considerations are also relevant to ambient AI scribing specifically. Continuous ambient recording of clinical encounters raises consent framework requirements that vary by state and institution. The Sasseville et al. systematic review explicitly identified patient privacy concerns about ambient recording as an ethical consideration requiring structured consent processes at the point of deployment.

Related Terms and Cross-References

The following terms appear frequently in vendor materials, research literature, and policy documents alongside clinical NLP. Brief definitions are provided here; readers requiring full entries should consult the ClinicalMind Glossary.

Key terms appearing in clinical NLP vendor materials, research literature, and policy documents. Full glossary entries for regulatory terms (SaMD, 510(k), De Novo) and AI/ML fundamentals (hallucination, model drift, AUROC) are available in the ClinicalMind Glossary.
Term	Brief Definition	Primary Context
Ambient Clinical Intelligence	A broader category encompassing AI systems that passively capture, process, and structure clinical information from the care environment — of which ambient AI scribes are the most common current application	AI scribing; clinical workflow
Computer-Assisted Coding (CAC)	Software systems that use NLP to suggest diagnostic and procedural codes from clinical documentation for review by a human coder	Revenue cycle; health information management
Diagnosis-Related Group (DRG)	A patient classification system used by CMS to determine inpatient reimbursement; DRG assignment is directly affected by the specificity and completeness of coded diagnoses, making CDI a financial and compliance function	CDI; inpatient reimbursement
CC / MCC	Complication or Comorbidity / Major Complication or Comorbidity — secondary diagnoses that, when coded, increase DRG weight and associated reimbursement; NLP-assisted CDI tools specifically target missed CC and MCC documentation opportunities	CDI; DRG optimization
Case Mix Index (CMI)	The average DRG weight across a hospital's inpatient discharges; a proxy for patient acuity and resource intensity; CDI program effectiveness is often measured by its impact on CMI	CDI; hospital finance
UMLS (Unified Medical Language System)	A compendium of biomedical vocabularies and their inter-relationships maintained by the National Library of Medicine; NLP concept normalization pipelines commonly map extracted entities to UMLS concept unique identifiers	Computer-assisted coding; NLP infrastructure
SNOMED CT	Systematized Nomenclature of Medicine Clinical Terms — a comprehensive clinical terminology used as a normalization target in NLP pipelines and as the basis for EHR problem list coding in many health systems	NLP concept normalization; EHR interoperability
ICD-10-CM / ICD-10-PCS	International Classification of Diseases, 10th Revision, Clinical Modification (diagnoses) and Procedure Coding System (inpatient procedures) — the primary code sets targeted by computer-assisted coding NLP systems in U.S. inpatient settings	Computer-assisted coding; CDI
CPT (Current Procedural Terminology)	The AMA-maintained procedure code set used for outpatient and professional services billing; a target code set for NLP-based coding in outpatient and ambulatory contexts	Computer-assisted coding; outpatient CDI
HL7 FHIR	Health Level 7 Fast Healthcare Interoperability Resources — the API standard through which AI scribing systems and other NLP tools exchange structured data with EHR platforms	AI scribing; EHR integration
SaMD (Software as a Medical Device)	FDA's regulatory category for software intended to diagnose, treat, cure, mitigate, or prevent a disease or condition; some clinical NLP tools fall within this category and require premarket authorization	Regulatory classification
ClinicalBERT / BioBERT	Transformer-based language models pre-trained on clinical or biomedical text (MIMIC-III, PubMed) that serve as the foundation for many clinical NLP fine-tuning tasks; predecessors and complements to general-purpose LLMs in clinical NLP pipelines	NLP infrastructure; AI scribing; coding
Hallucination	In the context of generative AI and LLMs, the production of plausible-sounding but factually incorrect output; a documented risk in clinical NLP generation tasks including note drafting and code suggestion	LLM risk; AI scribing; computer-assisted coding
HCPCS	Healthcare Common Procedure Coding System — a code set extending CPT to cover Medicare and Medicaid services, supplies, and equipment; relevant to NLP coding applications in payer and outpatient contexts	Computer-assisted coding
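The relationship between CC/MCC capture and case mix index in the table above can be made concrete with a small calculation. CMI is the mean DRG relative weight across discharges, so documenting a secondary diagnosis that moves one discharge into a higher-weighted DRG raises the average. The weights below are hypothetical values chosen for the arithmetic, not actual MS-DRG weights.

```python
def case_mix_index(drg_weights):
    """Case mix index: the mean DRG relative weight across discharges."""
    if not drg_weights:
        raise ValueError("at least one discharge is required")
    return sum(drg_weights) / len(drg_weights)


# Hypothetical relative weights for four discharges.
before = [0.7, 0.7, 1.0, 1.5]
# Same discharges after one is regrouped because a CC was documented.
after = [0.7, 1.0, 1.0, 1.5]

print(f"{case_mix_index(before):.3f}")  # 0.975
print(f"{case_mix_index(after):.3f}")   # 1.050
```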
