AI and Healthcare: What Real Clinical Deployments Look Like

There is a persistent gap between how AI performs in a validation study and how it behaves when it goes live in a 600-bed community hospital at 2 a.m. on a Tuesday. The controlled study population is clean, the dataset is curated, the radiologists are primed. The real environment has shift changes, alert fatigue, a decade-old EHR that wasn't designed for AI outputs, and nurses who were told about the new tool in a 15-minute orientation.

That gap — between benchmark performance and operational reality — is what clinical deployment reports are meant to document. This piece synthesizes patterns across several major deployment categories: ambient documentation, radiology workflow AI, sepsis prediction, and prior authorization automation. For each, the picture is more complicated than either the vendor press release or the skeptic's dismissal suggests.

Ambient Documentation: The Deployment That Actually Moved Fast

Ambient AI scribes — tools that listen to a clinical encounter and generate a structured note draft — have had the fastest real-world uptake of any AI category in healthcare over the past two years. That speed is partly explained by the low regulatory bar (most ambient scribe products are not classified as medical devices and don't require FDA clearance) and partly by the genuine pain point they address: physician documentation burden.

Documented deployments at large health systems have reported meaningful reductions in after-hours documentation time — some peer-reviewed implementation studies cite 25–40% reductions in time spent on notes per encounter. Physician satisfaction scores in these studies tend to improve. That's the headline number.

The operational complications are less frequently foregrounded. Several health systems have reported that note quality degraded over time as physicians began accepting AI-generated drafts with minimal review — a phenomenon sometimes called "automation complacency" in the implementation literature. When the ambient tool misidentifies a medication dose or attributes a symptom to the wrong patient in a shared room, a physician who has normalized clicking "accept" may not catch it.

Staff adoption patterns vary significantly by specialty. Emergency medicine and primary care have shown higher adoption rates in documented deployments, likely because note structure in those settings is more formulaic and the AI output is easier to verify quickly. Subspecialties with complex, highly individualized documentation — oncology, neurology — have shown slower adoption and higher edit rates, which is arguably the safer pattern.
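
One way a health system might track the edit rates mentioned above is to compare each AI-generated draft with the note the physician finally signed. The sketch below is illustrative only: the function name `edit_rate` and the sample notes are hypothetical, and it uses Python's standard `difflib` as a rough word-level similarity measure, not any vendor's actual metric.

```python
from difflib import SequenceMatcher


def edit_rate(ai_draft: str, signed_note: str) -> float:
    """Return the share of the note that changed between draft and signature.

    0.0 means the draft was accepted verbatim; values near 1.0 mean the
    physician rewrote almost everything. Word-level comparison is a rough
    proxy (an assumption of this sketch), not a clinical quality measure.
    """
    draft_words = ai_draft.split()
    signed_words = signed_note.split()
    if not draft_words and not signed_words:
        return 0.0
    matcher = SequenceMatcher(None, draft_words, signed_words, autojunk=False)
    return 1.0 - matcher.ratio()


# Hypothetical notes, invented for illustration.
draft = "Patient reports chest pain for two days. Start metoprolol 50 mg daily."
signed = "Patient reports chest pain for two days. Start metoprolol 25 mg daily."

unchanged = edit_rate(draft, draft)      # draft accepted as-is
one_word_fixed = edit_rate(draft, signed)  # a single dose correction
```

A persistently near-zero edit rate across a physician's notes does not prove complacency, but it could reasonably prompt a spot review. Note also that a small edit rate can hide a clinically important change, as with the dose correction in the example.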

EHR Integration Is the Actual Bottleneck

Most ambient scribe products operate as a layer on top of the EHR rather than natively within it. The practical consequence: physicians are often toggling between a separate ambient AI interface and their EHR workflow, then copying or syncing generated notes. Several health systems have reported that this friction — not the AI quality — was the primary reason for low adoption in early rollouts. Products with tighter Epic or Oracle Health integrations have shown better sustained usage in the documented cases.

Radiology AI: Cleared Devices, Uneven Deployment

Radiology has the largest concentration of FDA-cleared AI devices of any clinical specialty. By early 2026, the FDA's AI/ML-enabled device list included well over 1,000 authorized devices, with radiology and cardiology accounting for the substantial majority. But FDA clearance and active clinical deployment are different things.

Documented deployments cluster around a few use cases: chest X-ray triage for incidental findings, intracranial hemorrhage detection for stroke workflow, and pulmonary embolism flagging. These are the areas where time-to-treatment matters most and where the AI output slots cleanly into an existing urgent-notification workflow. The AI flags the scan; the radiologist reads it; the workflow is accelerated.
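
The flag-then-read pattern amounts to reordering a worklist, not replacing the read. A minimal sketch, assuming a simplified in-memory worklist: the `Study` fields and the `prioritize` function are hypothetical, and real PACS integrations work through vendor-specific interfaces that are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    accession: str        # hypothetical study identifier
    minutes_waiting: int  # time since the scan was acquired
    ai_flagged: bool      # True if the AI flagged a suspected critical finding


def prioritize(worklist: list[Study]) -> list[Study]:
    """Order studies so AI-flagged scans come first, then longest-waiting first.

    Every study stays on the list: the AI changes reading order only, and a
    radiologist still reads each scan.
    """
    return sorted(worklist, key=lambda s: (not s.ai_flagged, -s.minutes_waiting))


# Hypothetical worklist, invented for illustration.
worklist = [
    Study("A100", minutes_waiting=40, ai_flagged=False),
    Study("A101", minutes_waiting=5, ai_flagged=True),
    Study("A102", minutes_waiting=12, ai_flagged=False),
    Study("A103", minutes_waiting=9, ai_flagged=True),
]
ordered = [s.accession for s in prioritize(worklist)]
```

In this toy ordering, the unflagged study that has already waited 40 minutes drops behind both flagged scans. When flag volume is high, that displacement is the cost side of faster critical reads.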

Selected radiology AI deployment patterns based on peer-reviewed implementation studies and documented health system reports. Outcomes are site-specific and should not be generalized.
Use Case	Integration Method	Reported Operational Benefit	Documented Friction Points
Intracranial hemorrhage detection	PACS worklist prioritization	Faster time-to-read for critical scans; some sites report 15–20 min reductions in door-to-CT-read time	Alert fatigue when sensitivity tuned high; radiologist override rates not consistently reported
Chest X-ray incidental finding triage	Automated flag in radiology workflow	Increased incidental nodule follow-up rates in prospective implementation studies	False positive burden; downstream ordering variation across sites
Pulmonary embolism detection	EHR-linked urgent notification	Earlier treatment initiation documented in single-site studies	Limited multi-site validation; workflow varies by hospital size
Mammography AI assist	Integrated into reading workflow	Some sites report reduced double-read burden	Regulatory status varies by product; not all cleared tools have published real-world evidence (RWE)

The harder operational question — one that most deployment reports don't answer clearly — is what happens to radiologist behavior over time. Does AI-assisted prioritization change how radiologists allocate attention to non-flagged scans? A handful of implementation studies have begun tracking this, but the data is early and the findings are mixed.

The Reimbursement Problem

One underreported constraint on radiology AI deployment is reimbursement. Most AI-assisted radiology reads are billed at the same rate as standard reads — there is no established CPT pathway that reliably captures the AI component. Health systems that have invested in radiology AI infrastructure have done so primarily on the basis of workflow efficiency and liability risk reduction, not direct revenue. That calculus affects which tools get deployed and which get shelved after a pilot.

Sepsis Prediction: The Most Studied, Most Contested Deployment

Sepsis prediction algorithms have been deployed at more U.S. hospitals than almost any other clinical AI application, and they have also generated the most documented controversy. The core tension: high-sensitivity sepsis alerts reduce missed cases but generate enormous false positive burdens that erode clinical trust and trigger unnecessary interventions.

The Epic Sepsis Model (ESM) is the most widely deployed example. An external validation study published in JAMA Internal Medicine in 2021, a retrospective cohort covering more than 27,000 patients at a large academic medical center, found the ESM had an AUROC of 0.63 and a positive predictive value of 12% at the alert threshold studied. Meaning: roughly 7 out of 8 alerts were false positives. The paper's authors also reported that the model generated alerts on 18% of all hospitalized patients, an alert volume that is difficult for clinical staff to act on.

Epic has updated the ESM since that study, and other vendors have deployed competing models. But the structural problem — that sepsis prediction in real populations involves a base rate that makes high PPV difficult to achieve — hasn't changed. Some health systems have responded by raising alert thresholds, accepting lower sensitivity in exchange for better specificity and reduced alert fatigue. Others have moved toward tiered alert systems that route high-confidence alerts differently from borderline ones.
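
The base-rate effect follows directly from Bayes' rule. The sketch below computes PPV from sensitivity, specificity, and prevalence. The input values are illustrative assumptions chosen for round numbers, not figures from any study or vendor, and the function name is hypothetical.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV = true positives / all positive alerts, per unit of population."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    if true_positives + false_positives == 0:
        raise ValueError("model produces no alerts at these settings")
    return true_positives / (true_positives + false_positives)


# Illustrative settings only (assumed, not taken from any published model).
# Same 5% prevalence, two different alert thresholds.
low_threshold = positive_predictive_value(sensitivity=0.80, specificity=0.80, prevalence=0.05)
high_threshold = positive_predictive_value(sensitivity=0.50, specificity=0.95, prevalence=0.05)

print(f"low threshold PPV:  {low_threshold:.2f}")   # about 0.17
print(f"high threshold PPV: {high_threshold:.2f}")  # about 0.34
```

Under these assumed numbers, a model that is right 80% of the time in both directions still produces alerts that are wrong about five times out of six, because 95% of patients do not have the condition. Raising the threshold roughly doubles PPV, but only by missing half of the true cases.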

What's notable about the sepsis prediction story is that it's one of the few areas where documented failure modes have driven visible policy responses. Several health systems have published their own post-deployment audits — something that remains rare in healthcare AI — and at least a few have publicly reduced or modified their alert configurations based on internal data.

Prior Authorization AI: Administrative Deployment With Clinical Consequences

AI-assisted prior authorization has been adopted rapidly by payer organizations and, in some cases, integrated into health system revenue cycle operations. The deployment context here is administrative — the AI reviews claims data and clinical documentation to generate authorization recommendations — but the downstream effects are clinical.

The documented controversy around payer-side prior authorization AI centers on denial rates. Several large payers have faced regulatory scrutiny and litigation over AI-assisted denial processes, with allegations that automated systems were denying claims at rates inconsistent with individual clinical review. CMS issued guidance in 2024 addressing AI use in Medicare Advantage prior authorization, requiring that coverage determinations involving AI still be reviewed by qualified personnel.

On the provider side, health systems have deployed prior authorization AI primarily to reduce administrative burden on clinical staff — pre-populating authorization requests, predicting approval likelihood, and flagging cases likely to require peer-to-peer review. Documented outcomes in this category tend to be operational: reduced time-to-authorization, lower denial rates for outpatient procedures, reduced staff hours on appeals.

Cross-Cutting Patterns: What Deployment Data Consistently Shows

Across these categories, several patterns appear consistently enough to be worth stating plainly rather than embedding in individual case summaries.

Pilot-to-scale failure is common. Many AI tools that perform well in a structured pilot — often with selected users, dedicated support, and close monitoring — show degraded adoption and outcomes when scaled to full health system deployment without the same infrastructure.
Alert fatigue is the most frequently documented failure mode across clinical decision support AI. Tools tuned for high sensitivity generate alert volumes that clinical staff cannot sustainably act on, which eventually causes systematic override behavior that negates the tool's purpose.
Model drift is undermonitored in practice. Most health systems do not have formal processes for detecting when a deployed AI model's performance has degraded due to changes in patient population, EHR configuration, or clinical practice patterns. Drift is documented in the academic literature but rarely surfaces in health system operational reports.
Equity audits at deployment are rare. Most deployment reports do not include demographic breakdowns of AI performance across patient subgroups. The gap between what is technically possible (subgroup analysis) and what is operationally standard (none) remains wide.
Physician trust is built or destroyed in the first weeks. Implementation studies that track adoption longitudinally consistently find that early false positive experiences — especially high-confidence predictions that turn out to be wrong — have disproportionate effects on long-term clinician trust in the tool.
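
The drift and equity gaps above come down to the same missing computation: performance broken out by group instead of reported as one number. A minimal sketch, assuming each alert outcome is available as a record with a grouping field. The record layout, field names, and sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Optional


def metrics_by_group(records: list[dict], group_key: str) -> dict[str, dict[str, Optional[float]]]:
    """Compute sensitivity and PPV separately for each value of group_key.

    Each record needs boolean 'alerted' and 'had_condition' fields plus the
    grouping field. Grouping by a demographic field gives a subgroup audit;
    grouping by calendar month gives a crude drift check.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for record in records:
        c = counts[record[group_key]]
        if record["alerted"] and record["had_condition"]:
            c["tp"] += 1
        elif record["alerted"]:
            c["fp"] += 1
        elif record["had_condition"]:
            c["fn"] += 1
        # True negatives are not needed for sensitivity or PPV.

    results = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        alerts = c["tp"] + c["fp"]
        results[group] = {
            # None marks a metric that is undefined for this group.
            "sensitivity": c["tp"] / positives if positives else None,
            "ppv": c["tp"] / alerts if alerts else None,
        }
    return results


# Hypothetical alert log, invented for illustration.
records = [
    {"month": "2024-01", "alerted": True, "had_condition": True},
    {"month": "2024-01", "alerted": True, "had_condition": False},
    {"month": "2024-01", "alerted": False, "had_condition": False},
    {"month": "2024-06", "alerted": True, "had_condition": False},
    {"month": "2024-06", "alerted": True, "had_condition": False},
    {"month": "2024-06", "alerted": False, "had_condition": True},
    {"month": "2024-06", "alerted": True, "had_condition": True},
]
by_month = metrics_by_group(records, "month")
```

Real audits would need far larger samples and confidence intervals before a difference between groups means anything. The point of the sketch is only that the breakdown itself is cheap once alert outcomes are logged with the relevant fields.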

What Gets Left Out of Deployment Reports

The deployment literature has structural gaps that are worth naming. Negative results are underreported — health systems that quietly discontinue an AI tool rarely publish a post-mortem. Vendor contracts often include confidentiality provisions that prevent health systems from publishing performance data. And the most consequential deployment failures — cases where an AI tool contributed to a patient safety incident — are largely invisible in the public record, surfacing only occasionally through malpractice litigation or regulatory disclosure.

This means the published deployment literature is systematically skewed toward successful or at least neutral outcomes. The operational picture across healthcare AI is almost certainly messier than what peer-reviewed implementation studies and health system press releases reflect.

Source and Verification Notes

The deployment patterns described in this report draw on peer-reviewed implementation studies, publicly available health system communications cross-referenced with independent records, and documented conference proceedings from AMIA, HIMSS, and specialty society meetings. Specific deployment records with traceable sources are catalogued individually in this site's deployment report index.

Performance figures cited — including the ESM AUROC and PPV values — are sourced from published peer-reviewed literature and apply to the specific study populations described in those papers. They are not generalizable benchmarks. Readers evaluating a specific AI tool for their institution should consult the tool's FDA authorization record (if applicable), the supporting clinical evidence, and ideally post-market data from populations similar to their own.
