AI and Healthcare: Core Concepts for Clinicians and Administrators

The phrase "AI in healthcare" covers an enormous range of things — from a radiology algorithm that flags pulmonary nodules on CT scans to a large language model summarizing discharge notes. Those are fundamentally different technologies, governed by different regulatory frameworks, with different evidence standards and different failure modes. Treating them as a single category is one of the most common sources of confusion for clinicians and administrators trying to evaluate whether a specific tool is ready for their setting.

This entry maps the conceptual terrain: what clinical AI actually is, how it gets classified and regulated in the US, how performance is measured and what those numbers mean in practice, and where the evidence base is solid versus where it remains genuinely thin.

What Counts as AI in a Clinical Context

Not every piece of software used in a hospital qualifies as "clinical AI" in any meaningful regulatory or evidentiary sense. The term is used loosely in vendor materials, but for evaluation purposes, the distinction that matters most is whether a tool makes or informs a clinical decision — and whether that function is governed by FDA oversight.

The FDA's operative framework is Software as a Medical Device (SaMD): software intended to be used for one or more medical purposes that performs those purposes without being part of a hardware medical device. A deep learning model that detects diabetic retinopathy from fundus photographs is SaMD. A scheduling algorithm that optimizes OR block time is not. The boundary matters because SaMD falls under FDA device regulation, which for most clinical AI tools means premarket review; software outside the definition does not.

Within SaMD, the FDA distinguishes between tools that are locked (the algorithm does not change after deployment) and those designed to adapt or retrain over time. Adaptive algorithms raise distinct post-market surveillance questions that locked algorithms do not — a point the FDA's Predetermined Change Control Plan (PCCP) framework is specifically designed to address.

Regulatory Pathways: 510(k), De Novo, and PMA

The three primary FDA pathways for AI/ML-based medical devices differ substantially in what they require from a manufacturer and what they signal to a potential adopter.

FDA premarket review pathways for AI/ML-enabled medical devices
Pathway	Basis for Clearance	Risk Level	What It Signals
510(k)	Substantial equivalence to a predicate device	Low to moderate	The device performs similarly to something already cleared — does not require independent proof of clinical benefit
De Novo	Novel device with no predicate; FDA establishes new device type	Low to moderate	First-of-kind classification; FDA has reviewed the specific technology category
PMA (Premarket Approval)	Independent evidence of safety and effectiveness	High	Highest evidentiary bar; required for Class III devices with significant patient risk

The practical implication: the vast majority of FDA-cleared AI devices have gone through 510(k). That clearance confirms the device is substantially equivalent to a predicate — it does not mean the device has been shown to improve patient outcomes in a prospective trial. Clinicians and procurement staff who conflate 510(k) clearance with proven clinical efficacy are making a category error.

How Performance Is Measured — and What the Numbers Actually Mean

Performance metrics for clinical AI tools appear in FDA submissions, published studies, and vendor materials — often without adequate context. The same AUC figure can represent a robust finding or a nearly meaningless one depending on the study design, dataset, and population.

Sensitivity, Specificity, and the Tradeoff

Sensitivity measures how often the model correctly identifies a positive case (true positive rate). Specificity measures how often it correctly identifies a negative case (true negative rate). They move in opposite directions as you adjust the decision threshold — increasing sensitivity typically decreases specificity, and vice versa.
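
The tradeoff can be seen by sweeping a threshold over a handful of model scores. The sketch below uses invented toy scores and labels (not data from any real device), and the helper name `sens_spec` is hypothetical.

```python
# Toy scores and labels, invented purely for illustration.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]  # 1 = disease present


def sens_spec(scores, labels, threshold):
    """Return (sensitivity, specificity), calling score >= threshold positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


# Raising the threshold lowers sensitivity and raises specificity.
for t in (0.3, 0.5, 0.7):
    sens, spec = sens_spec(scores, labels, t)
    print(f"threshold={t:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this toy data, sensitivity falls from 1.00 to 0.50 as the threshold rises from 0.3 to 0.7, while specificity rises from 0.50 to about 0.83.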

For a screening tool, high sensitivity is usually prioritized — you want to catch as many true cases as possible, accepting more false positives that downstream workup will resolve. For a confirmatory or triage tool, the calculus may differ. A tool with 95% sensitivity and 60% specificity will generate a large number of false alarms in a low-prevalence population, even if those numbers look impressive in isolation.
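
To make the false-alarm point concrete, positive predictive value (PPV) can be computed from sensitivity, specificity, and prevalence using Bayes' rule. The sketch below uses the 95% and 60% figures from the paragraph above; the prevalence values are assumptions chosen for illustration, and `predictive_values` is a hypothetical helper name.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied to a population with given prevalence."""
    tp = sensitivity * prevalence                # true positives per person screened
    fn = (1 - sensitivity) * prevalence          # missed cases
    fp = (1 - specificity) * (1 - prevalence)    # false alarms
    tn = specificity * (1 - prevalence)          # correct negatives
    return tp / (tp + fp), tn / (tn + fn)


# Same tool, three assumed prevalence levels.
for prevalence in (0.01, 0.10, 0.30):
    ppv, npv = predictive_values(0.95, 0.60, prevalence)
    print(f"prevalence={prevalence:.0%}  PPV={ppv:.1%}  NPV={npv:.2%}")
```

At an assumed 1% prevalence, only a little over 2% of positive flags correspond to true cases; at 30% prevalence the same tool's PPV is about 50%.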

AUROC and Its Limits

The Area Under the Receiver Operating Characteristic curve (AUROC, or AUC) summarizes model discrimination across all possible thresholds. An AUC of 1.0 is perfect; 0.5 is no better than chance. AUC is widely reported because it is threshold-independent, but it has real limitations in clinical contexts.

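AUC does not reflect class imbalance or prevalence. In a dataset where 2% of cases are positive, a model that always predicts negative achieves 98% accuracy but provides no clinical value (its AUC would be 0.5, no better than chance). The subtler problem runs the other way: a model with a respectable AUC can still have a very low positive predictive value at that prevalence, because most of its positive calls are false alarms.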
AUC is computed on the full test set. It does not tell you how the model performs specifically on subgroups — older patients, patients with comorbidities, or patients from demographics underrepresented in training data.
AUC from internal validation (tested on held-out data from the same institution) is almost always higher than AUC from external validation on a different institution's data. Studies reporting only internal validation results should be interpreted with caution.
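
How accuracy and AUC diverge under class imbalance can be checked directly. The sketch below computes AUC from its rank-based definition (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one). The dataset is invented, and `auroc` is a hypothetical helper rather than any library's API.

```python
def auroc(scores, labels):
    """AUC as P(positive score > negative score); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy dataset with 2% prevalence: 2 positives, 98 negatives.
labels = [1] * 2 + [0] * 98

# A "model" that gives every patient the same score, i.e. always predicts negative.
constant_scores = [0.0] * 100
accuracy = sum(1 for y in labels if y == 0) / len(labels)

print(accuracy)                        # 0.98
print(auroc(constant_scores, labels))  # 0.5
```

Accuracy rewards the constant model for the 98 easy negatives; AUC does not, which is why accuracy alone is uninformative at low prevalence. Neither number says anything about subgroup performance or external validity.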

Calibration: The Metric That Often Gets Skipped

Calibration measures whether a model's predicted probabilities match observed outcomes. If a well-calibrated model assigns a 30% risk of sepsis to a group of patients, about 30% of those patients should go on to develop sepsis. A poorly calibrated model might discriminate well (high AUC) but systematically over- or underestimate risk, which matters when clinicians are using numeric outputs to make threshold decisions.
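
One simple way to inspect calibration is a reliability table: group predictions into probability bins and compare the mean predicted risk with the observed event rate in each bin. The sketch below is a minimal version on invented data; `calibration_table` is a hypothetical helper, and the two "models" are constructed so that they rank patients identically (and so have the same AUC) while differing in calibration.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Return rows of (mean predicted risk, observed event rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((mean_pred, observed, len(members)))
    return rows


# Toy cohort: 10 lower-risk patients (3 events) and 10 higher-risk patients (7 events).
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1] * 7 + [0] * 3

calibrated = [0.30] * 10 + [0.70] * 10     # predictions match observed rates
overconfident = [0.65] * 10 + [0.95] * 10  # same ranking, inflated risks

for name, probs in (("calibrated", calibrated), ("overconfident", overconfident)):
    for mean_pred, observed, n in calibration_table(probs, outcomes):
        print(f"{name:14s} predicted={mean_pred:.2f} observed={observed:.2f} n={n}")
```

The second model overstates risk by 25 to 35 percentage points in both groups, even though its discrimination is unchanged.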

Most published AI studies report AUC. Calibration statistics are reported far less frequently, a gap that reporting guidelines for prediction models such as TRIPOD+AI are designed to close.

Algorithmic Bias and Health Equity

Algorithmic bias in clinical AI is not primarily a technical curiosity — it is a patient safety and equity issue. When a model trained predominantly on data from one population is deployed in a different one, its performance can degrade in ways that are not visible from aggregate metrics.

The mechanisms are well-documented. Training data that underrepresents certain demographic groups (by race, sex, age, or socioeconomic status) produces models that perform less reliably for those groups. Imaging AI trained largely on data from academic medical centers may perform differently in community hospital settings with different scanner hardware, imaging protocols, and patient demographics. Pulse oximetry readings carry known measurement errors for patients with darker skin tones, and those errors propagate into any model trained on SpO2 values.

The FDA's AI/ML action plan and subsequent guidance documents have increasingly emphasized the need for manufacturers to characterize performance across demographic subgroups. But disclosure requirements remain inconsistent, and many cleared devices have limited published evidence on subgroup performance. This is an active gap in the current regulatory framework.
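
A stratified check shows how an aggregate metric can hide a subgroup gap. The sketch below computes sensitivity overall and per group on invented records; the group labels "A" and "B" and the helper `sensitivity_by_group` are hypothetical.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """records: iterable of (group, true_label, predicted_label). Returns {group: sensitivity}."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for group, y, pred in records:
        if y == 1:
            if pred == 1:
                tp[group] += 1
            else:
                fn[group] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}


# Invented data: 90 true cases in group A (85 detected), 10 in group B (5 detected).
records = (
    [("A", 1, 1)] * 85 + [("A", 1, 0)] * 5
    + [("B", 1, 1)] * 5 + [("B", 1, 0)] * 5
)

positives = [pred for _, y, pred in records if y == 1]
overall = sum(positives) / len(positives)
per_group = sensitivity_by_group(records)

print(f"overall sensitivity: {overall:.2f}")  # 0.90
for group in sorted(per_group):
    print(f"group {group}: {per_group[group]:.2f}")
```

Here the pooled sensitivity is 0.90, but group B, which contributes only a tenth of the cases, sees 0.50. The aggregate figure is dominated by the larger group.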

Model Drift and Post-Market Surveillance

A model validated in 2021 may perform differently in 2025 — not because the algorithm changed, but because the clinical environment did. Changes in patient demographics, disease prevalence, imaging equipment, documentation practices, or treatment protocols can all shift the distribution of inputs a model receives, degrading its performance without any modification to the model itself. This is model drift, and it is one of the least-discussed risks in clinical AI deployment.

Locked algorithms — the majority of FDA-cleared AI devices — cannot adapt to drift without going back through FDA review. The Predetermined Change Control Plan (PCCP) framework allows manufacturers to pre-specify what kinds of modifications they can make without a new submission, but PCCP adoption remains limited as of mid-2026. Most health systems deploying AI tools have no systematic process for monitoring whether a tool's real-world performance is degrading over time.

Drift is usually broken down into three types:

Data drift: the statistical properties of input data change (e.g., scanner upgrades alter image characteristics)
Concept drift: the relationship between inputs and outcomes changes (e.g., a new treatment changes the natural history of a disease the model was trained to predict)
Label drift: the definitions or coding practices used to label training data shift over time

Health systems that deploy AI tools without a monitoring plan are effectively running an uncontrolled experiment. Some vendors provide performance dashboards; many do not. This is an area where institutional governance — not just vendor contracts — determines whether drift is caught.
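
As one illustration of what a basic input-monitoring check might look like, the sketch below computes a population stability index (PSI) between a baseline sample and a recent sample of a single input feature. PSI is a general-purpose data drift heuristic swapped in here for illustration; it is not a method named by the FDA or prescribed by this entry, it only addresses data drift (not concept or label drift), and alert thresholds vary by organization. The feature values and bin edges are invented.

```python
import math


def psi(expected, actual, edges):
    """Population stability index between two samples, binned at the given edges."""
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # index of the bin containing v
            counts[idx] += 1
        # Floor at a small value so empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), 1e-6) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


# Invented feature values: baseline period vs. a later period shifted upward by 20 units.
baseline = list(range(100))
recent = [v + 20 for v in baseline]
edges = [25, 50, 75]

print(psi(baseline, baseline, edges))  # 0.0, identical distributions
print(round(psi(baseline, recent, edges), 3))
```

A check like this flags that inputs have shifted; it does not show whether model performance has degraded, which still requires comparing predictions against outcomes.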

Types of AI Methods Used in Clinical Applications

Clinical AI tools use a range of underlying methods. The method matters for understanding what kinds of errors a tool makes and what evidence is needed to validate it.

Selected AI methods and their clinical deployment characteristics
Method	Common Clinical Applications	Characteristic Limitations
Convolutional neural networks (CNN)	Medical image analysis: radiology, pathology, dermatology	Sensitive to image quality and acquisition protocol; limited interpretability
Recurrent / transformer architectures	Time-series data (ECG, vitals), EHR sequence modeling	Requires large longitudinal datasets; can overfit to documentation patterns
Gradient boosting (XGBoost, LightGBM)	Tabular clinical data: risk scores, early warning systems	Performs well on structured data; less suited to unstructured text or images
Large language models (LLMs)	Clinical note summarization, documentation, Q&A	Hallucination risk; no FDA-authorized clinical LLM as of mid-2026; evidence base is early-stage
Federated learning	Multi-site model training without centralizing patient data	Adds complexity; does not eliminate bias if participating sites are not representative

Study Design and Why It Matters for Evaluating Evidence

The evidence base for clinical AI tools is uneven. A substantial portion of published studies are retrospective — the model is trained and tested on historical data from one or two institutions, with no prospective component and no external validation. These studies can demonstrate that a model is technically feasible; they cannot demonstrate that it improves outcomes when deployed prospectively in a real clinical workflow.

The hierarchy of evidence applies to AI just as it does to drugs and devices:

Prospective randomized controlled trials with clinical outcome endpoints — the highest bar; rare in AI
Prospective non-randomized studies with external validation — increasingly common in imaging AI
Retrospective studies with external validation on independent datasets — useful for initial assessment
Retrospective internal validation only — adequate for hypothesis generation; insufficient for deployment decisions
Regulatory submission data only (no published peer-reviewed evidence) — the situation for many cleared devices

External validation — testing a model on data from an institution that was not involved in training — is the single most important quality signal in an AI study. A model that performs well internally but has never been tested externally may be overfitted to the quirks of one institution's data, equipment, or documentation practices.

Interpretability and the Black Box Problem

Deep learning models — particularly convolutional neural networks used in imaging — are difficult to interpret. They do not produce an explanation for their output in the way a decision tree does. A model that flags an abnormality on a chest X-ray cannot, by default, tell you which features drove that flag.

Techniques like saliency maps and Grad-CAM can highlight which regions of an image most influenced the model's output, but these are post-hoc approximations, not ground-truth explanations. They can be clinically useful and also misleading — a saliency map that highlights the correct anatomical region does not guarantee the model is reasoning about that region for the right reasons.

For clinical adoption, interpretability matters more in some contexts than others. A tool used as a triage flag ("prioritize this scan for urgent read") may not require the clinician to understand its internal reasoning. A tool that purports to explain a diagnosis to a patient must clear a much higher bar.

Data Governance and Training Data Quality

The quality of a clinical AI model is bounded by the quality of its training data. This is not a technical abstraction — it has direct clinical consequences.

Training labels derived from billing codes rather than chart review can introduce systematic misclassification. Datasets drawn from a single academic health system reflect that system's patient demographics, documentation culture, and treatment practices. Models trained on data from before a major treatment change may not generalize to current practice.

Federated learning is one approach to training on multi-site data without centralizing it — each site trains locally and shares model parameters rather than raw data. This can improve generalizability and reduce privacy risk, but it does not automatically produce a representative dataset. If all participating sites are large academic centers, the resulting model still may not perform well in rural or community settings.
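
The parameter-sharing step can be sketched as a sample-weighted average of each site's locally trained parameters, the aggregation rule commonly called federated averaging. This entry does not specify an aggregation rule, so that choice is an assumption here, as are the site sizes, parameter values, and the helper name `federated_average`.

```python
def federated_average(site_updates):
    """site_updates: list of (n_samples, parameters). Returns the sample-weighted mean parameters."""
    total = sum(n for n, _ in site_updates)
    dim = len(site_updates[0][1])
    return [
        sum(n * params[i] for n, params in site_updates) / total
        for i in range(dim)
    ]


# Two hypothetical sites share parameters only; raw patient data never leaves either site.
site_updates = [
    (1000, [0.2, 1.0]),  # smaller site
    (3000, [0.4, 2.0]),  # larger site, weighted three times as heavily
]

global_params = federated_average(site_updates)
print(global_params)
```

Because the average is weighted by sample count, the global model leans toward the largest contributors, which is one reason a federation of similar large centers does not by itself yield a representative model.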

What This Means for Clinical Evaluation Decisions

Clinicians and administrators evaluating an AI tool for their institution need to ask a different set of questions than a vendor's sales deck is designed to answer. The relevant questions are not "what is the AUC?" in isolation, but:

Is this device FDA-cleared, and under which pathway? What was the predicate device?
What is the published evidence base — prospective or retrospective, internal or external validation?
Does the study population match our patient population in terms of demographics, disease prevalence, and care setting?
Were subgroup performance metrics reported? If not, why not?
What is the plan for monitoring performance after deployment? Who is responsible if performance degrades?
How does this tool integrate into the clinical workflow, and what happens when it is wrong?

The last question is underrated. A model with 90% sensitivity will miss roughly 10% of true positive cases, and its specificity separately determines how many false alarms it raises. In a high-volume setting, that means a predictable number of missed cases and false alarms every day. The clinical impact depends entirely on whether the workflow is designed to catch those errors, and most are not.
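
The expected daily error burden can be estimated with simple arithmetic. In the sketch below, the 90% sensitivity comes from the paragraph above; the daily volume, prevalence, and specificity are assumptions chosen for illustration, and `expected_daily_errors` is a hypothetical helper.

```python
def expected_daily_errors(studies_per_day, prevalence, sensitivity, specificity):
    """Return (expected missed cases, expected false alarms) per day."""
    positives = studies_per_day * prevalence
    negatives = studies_per_day - positives
    missed = positives * (1 - sensitivity)         # false negatives
    false_alarms = negatives * (1 - specificity)   # false positives
    return missed, false_alarms


# Assumed setting: 500 studies per day, 5% prevalence, 85% specificity.
missed, false_alarms = expected_daily_errors(500, 0.05, 0.90, 0.85)
print(f"missed cases per day: {missed:.1f}")
print(f"false alarms per day: {false_alarms:.1f}")
```

Under these assumptions the tool misses about 2.5 true cases and raises about 71 false alarms per day, so the workflow needs an explicit path for handling both kinds of error.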

Scope of This Entry

This entry covers foundational concepts. For device-specific regulatory records, see the FDA-Cleared AI Device Registry. For evidence appraisals of specific studies, see the Research Study Analyses group. For clinical application-specific summaries — including FDA clearance status, key studies, and known limitations by use case — see the Clinical Application Briefs. Regulatory guidance changes are tracked in the Regulatory & Policy Tracker.