AI and Health: Core Concepts for Clinicians and Administrators

"AI and health" is a phrase that covers an enormous range of things — from a deep learning model detecting pulmonary nodules on chest CT to a large language model drafting a discharge summary. These are not the same technology, they are not regulated the same way, and the evidence standards that apply to them differ substantially. Before evaluating any specific tool or study, it helps to have a clear map of the terrain.

This entry organizes the core concepts that recur across clinical AI evaluation: what AI is actually doing in clinical settings, how tasks are categorized, what regulatory frameworks apply, how performance is measured, and where the known failure modes live. Readers who want to go deeper on any individual concept will find dedicated entries linked throughout.

What AI Is Actually Doing in Clinical Settings

Most deployed clinical AI systems perform one of a small number of well-defined tasks. Understanding which task a system performs matters because it shapes what evidence is required, what regulatory pathway applies, and what failure modes to watch for.

Primary AI task types in clinical deployment, with representative examples and regulatory status as of Q2 2026.
Task Type	What It Does	Common Clinical Examples	Typical Regulatory Status
Detection	Flags presence or absence of a finding in an image or signal	Pulmonary nodule detection, diabetic retinopathy screening, ECG arrhythmia identification	Many FDA-cleared devices exist
Segmentation	Delineates anatomical structures or lesions within an image	Tumor boundary delineation in radiology, organ segmentation for surgical planning	Cleared tools exist; often bundled with detection
Risk stratification	Assigns a probability score for a future event or severity tier	Sepsis prediction, readmission risk, deterioration early warning	Regulatory status varies widely; many are clinical decision support
NLP extraction	Pulls structured data from unstructured clinical text	Extracting diagnoses from notes, coding assistance, prior authorization review	Often not classified as a medical device; ONC oversight may apply
Triage / prioritization	Reorders worklists or escalates cases by predicted urgency	Radiology worklist prioritization for suspected stroke, ICH triage	Several FDA-cleared tools; pathway depends on clinical claim
Ambient documentation	Transcribes and structures clinical encounters in real time	AI scribes in outpatient visits, automated SOAP note generation	Generally not FDA-regulated as a device; no cleared generative AI tools as of Q2 2026

The distinction between task types is not academic. A detection algorithm that flags a finding for radiologist review carries different liability, different evidence requirements, and different workflow implications than a risk score that influences whether a patient is admitted. Conflating them leads to misapplied evaluation criteria.

Software as a Medical Device (SaMD): The Regulatory Category That Matters

Not all health AI is regulated as a medical device. The FDA's classification framework distinguishes software that is intended to diagnose, treat, mitigate, cure, or prevent disease — Software as a Medical Device, or SaMD — from software that supports administrative functions or general wellness.

If a system's output is intended to inform a clinical decision about a specific patient — particularly one that could cause serious harm if the output is wrong — it is likely to be classified as SaMD and require FDA authorization before US commercial deployment. The FDA has cleared over a thousand AI/ML-enabled devices under this framework, the majority in radiology.

The Three Authorization Pathways

Most AI medical devices reach the US market through one of three FDA pathways. Understanding them helps clinicians and procurement teams read clearance records accurately.

FDA authorization pathways for AI-enabled medical devices. Most cleared AI tools use 510(k).
Pathway	What It Requires	Typical Use Case	Evidence Bar
510(k)	Substantial equivalence to a legally marketed predicate device	Majority of cleared AI imaging tools	Moderate — bench testing and often retrospective data; no RCT required
De Novo	Novel device with no predicate; establishes a new device classification	First-in-class AI tools where no predicate exists	Moderate to high — FDA sets the standard for the new class
PMA (Premarket Approval)	Valid scientific evidence that the device is safe and effective	High-risk devices; rare for AI tools currently	Highest — typically requires prospective clinical data

A separate mechanism, the Predetermined Change Control Plan (PCCP), is not an authorization pathway in its own right. It is submitted as part of a 510(k), De Novo, or PMA application and lets manufacturers pre-specify how a device's algorithm may be updated post-clearance without requiring a new submission for each change. This is increasingly relevant as AI tools are retrained on new data after deployment.

How Performance Is Measured — and Where Metrics Can Mislead

Clinical AI performance is typically reported using a small set of statistical measures. These metrics are not interchangeable, and each has failure modes that matter in clinical contexts.

Sensitivity, Specificity, and the Tradeoff

Sensitivity is the proportion of patients who truly have a disease or finding that the model correctly flags (true positives divided by all actual positives). Specificity is the proportion of patients without the finding that the model correctly leaves unflagged (true negatives divided by all actual negatives). High sensitivity means fewer missed cases; high specificity means fewer false alarms.

These two metrics trade off against each other as the operating threshold moves: lowering the threshold generally raises sensitivity at the cost of specificity, and raising it does the reverse. A screening tool tuned for high sensitivity will generate more false positives. A tool tuned for high specificity will miss more true cases. Which tradeoff is acceptable depends entirely on the clinical context: a cancer screening tool and an ICU deterioration alert have different tolerance for each type of error.
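The tradeoff can be made concrete with a small sketch. The scores and labels below are invented for illustration (they are not from any real model), and the function names are hypothetical; the point is only that moving one threshold shifts both metrics in opposite directions.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when scores at or above the threshold are flagged."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) at a single operating threshold."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up model scores for 10 patients (label 1 = finding present).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.60, 0.35, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
# threshold=0.3  sensitivity=1.00  specificity=0.40
# threshold=0.5  sensitivity=0.80  specificity=0.80
# threshold=0.7  sensitivity=0.60  specificity=1.00
```

On this toy data, the lowest threshold catches every true case but raises three false alarms, while the highest threshold raises none and misses two cases.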

AUC / AUROC

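The Area Under the Receiver Operating Characteristic Curve (AUROC, often shortened to AUC) summarizes a model's discriminative ability across all possible thresholds. It can be read as the probability that the model scores a randomly chosen positive case higher than a randomly chosen negative case. An AUC of 1.0 is perfect discrimination; 0.5 is no better than chance. Most published clinical AI studies report AUC as the headline metric, but because it averages over every threshold, a high AUC does not by itself guarantee acceptable sensitivity and specificity at the threshold used in practice.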
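One way to see what AUC measures is to compute it as a ranking statistic: the fraction of (positive, negative) patient pairs in which the positive case receives the higher score, with ties counted as half. The sketch below uses that pairwise definition on invented scores; the function name `auroc` is hypothetical, and production code would normally rely on a tested statistics library rather than this quadratic-time loop.

```python
def auroc(scores, labels):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up model scores for 10 patients (label 1 = finding present).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.60, 0.35, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

print(f"AUROC = {auroc(scores, labels):.2f}")  # AUROC = 0.92
```

Note that the single number says nothing about which threshold was chosen: the same 0.92 is compatible with quite different sensitivity and specificity pairs depending on where the cutoff is set.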

Positive and Negative Predictive Value

Positive Predictive Value (PPV) is the probability that a patient flagged by the model actually has the condition. Negative Predictive Value (NPV) is the probability that a patient not flagged is actually free of it. Unlike sensitivity and specificity, PPV and NPV depend on disease prevalence — a model with excellent sensitivity and specificity can have a very low PPV when applied to a low-prevalence population.

This is why PPV reported on a high-prevalence development dataset often does not hold up when the same model is deployed in a general screening population, even if its sensitivity and specificity are unchanged. It is one of the most common sources of real-world performance degradation.
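The prevalence effect follows directly from Bayes' rule. The sketch below holds sensitivity and specificity fixed at 0.90 (illustrative values, not taken from any real device) and recomputes PPV and NPV at two assumed prevalences; the function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by Bayes' rule for a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same hypothetical model, two assumed populations.
for label, prevalence in (("enriched study set", 0.30), ("screening population", 0.01)):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"{label}: prevalence={prevalence:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
# enriched study set: prevalence=0.30  PPV=0.794  NPV=0.955
# screening population: prevalence=0.01  PPV=0.083  NPV=0.999
```

Under these assumptions, roughly four in five flags are correct in the enriched set, but only about one in twelve is correct at 1% prevalence, even though the model itself has not changed.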

Algorithmic Bias and Health Equity

Algorithmic bias in clinical AI refers to systematic performance differences across patient subgroups — typically defined by race, ethnicity, sex, age, socioeconomic status, or geographic origin. These disparities usually originate in training data that underrepresents certain populations, but they can also be introduced by how outcomes are labeled, how features are selected, or how the model is validated.

The concern is not hypothetical. Documented examples include dermatology models that perform worse on darker skin tones, sepsis prediction tools that underperform in patients from lower-income zip codes, and chest X-ray classifiers with measurable accuracy gaps by sex. These gaps matter because deploying a biased tool at scale can systematically worsen care for already-disadvantaged groups.

Common sources of bias include:
Training data bias: The model learns from historical clinical data that reflects existing disparities in care access, documentation quality, and diagnostic patterns.
Label bias: Outcome labels (e.g., "disease present") may themselves reflect biased clinical decisions rather than ground truth.
Feature proxy bias: A model may use a feature that correlates with race or socioeconomic status as a proxy, encoding bias indirectly.
Deployment distribution shift: A model trained on one hospital's population may perform differently when deployed at an institution serving a different demographic mix.
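A basic subgroup audit is one way to surface such gaps. The sketch below computes sensitivity separately for each group and compares it with the pooled figure. The records, group names, and function name are all invented for illustration; a real audit would also examine specificity, calibration, and confidence intervals, which small subgroups make wide.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Return {group: sensitivity} from (group, true_label, predicted_label) tuples."""
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for group, truth, predicted in records:
        if truth == 1:
            if predicted == 1:
                true_pos[group] += 1
            else:
                false_neg[group] += 1
    groups = set(true_pos) | set(false_neg)
    return {g: true_pos[g] / (true_pos[g] + false_neg[g]) for g in groups}


# Made-up audit data: (group, true label, model prediction).
records = (
    [("group_a", 1, 1)] * 18 + [("group_a", 1, 0)] * 2 + [("group_a", 0, 0)] * 80
    + [("group_b", 1, 1)] * 6 + [("group_b", 1, 0)] * 4 + [("group_b", 0, 0)] * 40
)

per_group = sensitivity_by_group(records)
pooled_tp = sum(1 for _g, truth, pred in records if truth == 1 and pred == 1)
pooled_pos = sum(1 for _g, truth, _p in records if truth == 1)

print(f"pooled sensitivity: {pooled_tp / pooled_pos:.2f}")
for group in sorted(per_group):
    print(f"{group}: sensitivity={per_group[group]:.2f}")
```

In this toy example the pooled sensitivity of 0.80 hides a gap between 0.90 in one group and 0.60 in the other, which is the pattern an overall metric without demographic breakdowns cannot reveal.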

Model Drift: What Happens After Deployment

A model that performs well at validation can degrade over time without any change to its code. This is model drift: the distribution of the model's inputs, or the statistical relationship between those inputs and the outcome it was trained to predict, shifts in the real world.

Drift can be caused by changes in patient population (a hospital expands to serve a new geographic area), changes in clinical practice (a new imaging protocol changes how CT scans look), changes in documentation patterns (a new EHR affects how notes are written), or seasonal disease prevalence shifts. The COVID-19 pandemic demonstrated this dramatically: sepsis prediction models trained on pre-pandemic data showed significant performance degradation when applied to COVID patients whose clinical presentations differed from the training distribution.

Post-market surveillance — monitoring a deployed model's performance on an ongoing basis — is the standard mitigation. The FDA's PCCP framework is partly designed to accommodate the retraining that drift monitoring may require. In practice, many deployed tools lack systematic drift monitoring, which is a known gap in the field.
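As one illustration of what input-drift monitoring can look like, the sketch below computes a Population Stability Index (PSI) for a single input feature, comparing a reference sample with a recent one. PSI is a heuristic chosen here for illustration, not a method the FDA or this entry prescribes; the ages, bin edges, and function name are made up, and the commonly quoted rule of thumb that values above roughly 0.25 indicate a major shift is a convention rather than a standard.

```python
import math


def population_stability_index(reference, current, edges, eps=1e-6):
    """PSI between two samples, using shared bin edges.

    Empty bins are floored at eps so the logarithm stays defined.
    """
    def bin_proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            index = sum(1 for edge in edges if v >= edge)  # bin the value falls in
            counts[index] += 1
        return [max(c / len(values), eps) for c in counts]

    ref = bin_proportions(reference)
    cur = bin_proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Made-up patient ages at validation time and after a service-area change.
reference_ages = [30, 40, 50, 60, 70, 35, 45, 55, 65, 75]
recent_ages = [60, 70, 80, 65, 75, 85, 62, 72, 82, 90]
age_bin_edges = [40, 60, 80]

psi = population_stability_index(reference_ages, recent_ages, age_bin_edges)
print(f"PSI for age: {psi:.2f}")
if psi > 0.25:  # conventional rule-of-thumb cutoff, assumed here
    print("Input distribution has shifted; re-check model performance.")
```

A check like this only detects a change in inputs. Confirming that accuracy has actually degraded still requires outcome labels, which often arrive with a delay in clinical settings.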

Generative AI in Clinical Settings: A Distinct Category

Large language models and multimodal generative AI systems occupy a different part of the clinical AI landscape from the discriminative models described above. They generate text, synthesize information, and produce outputs that can look authoritative regardless of whether they are accurate.

The defining risk is hallucination — the production of plausible-sounding but factually incorrect content. In clinical contexts, a hallucinated drug interaction, a fabricated lab value in a summarized note, or an incorrect diagnosis suggestion carries direct patient safety implications. As of Q2 2026, no generative AI system has received FDA authorization as a medical device for clinical decision-making tasks.

External Validation: The Evidence Quality Marker That Matters Most

A model that performs well on held-out data from the same institution where it was trained has demonstrated internal validity. External validation — testing on data from a different institution, health system, or country — is a substantially higher bar and is the most reliable indicator of whether a model will generalize.

The clinical AI literature has a well-documented pattern: models that achieve impressive AUC on internal test sets frequently show meaningful performance drops when externally validated. A 2022 systematic review in The BMJ found that the majority of published clinical prediction models had not been externally validated, and those that were validated typically showed reduced performance. This is not a minor methodological quibble — it is the difference between a model that works in one hospital and one that works in yours.

Validation evidence can be read in three tiers:
Internal validation only: The model was tested on data from the same institution or dataset used for training (even if split into train/test sets). Lowest generalizability evidence.
Partial external validation: Tested at one or two external sites, often with similar patient populations. Moderate generalizability evidence.
Prospective external validation: Tested prospectively at multiple independent sites with diverse populations. Strongest generalizability evidence for clinical deployment.

Federated Learning: Why It Matters for Multi-Site Training

One approach to improving model generalizability without pooling patient data across institutions is federated learning. In a federated setup, each participating site trains a local model on its own data, and only the model parameters — not patient records — are shared with a central aggregator. The aggregated model benefits from the diversity of multiple training populations without requiring data to leave any institution.

Federated learning is particularly relevant in healthcare because of the practical and regulatory barriers to centralizing patient data across health systems. It does not eliminate all privacy risks: membership inference attacks can sometimes reveal whether a particular patient's record was used in training, and related attacks can recover information about training data from shared model parameters. But it substantially reduces the data-sharing surface compared to centralized training.
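The train-locally, share-parameters loop can be sketched with federated averaging, one widely used aggregation scheme, applied to a deliberately tiny model. Everything here is an assumption made for illustration: the three sites, their data (all generated from y = 2x + 1), the one-feature linear model, and the function names. Real deployments add secure aggregation, differential privacy, and handling for sites whose data distributions differ.

```python
def local_update(weights, data, learning_rate=0.1, epochs=20):
    """Run a few gradient steps of least-squares fitting on one site's data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def federated_average(site_weights, site_sizes):
    """Average site parameters, weighted by each site's number of records."""
    total = sum(site_sizes)
    w = sum(sw[0] * n for sw, n in zip(site_weights, site_sizes)) / total
    b = sum(sw[1] * n for sw, n in zip(site_weights, site_sizes)) / total
    return w, b


# Made-up (x, y) records held at three hypothetical sites; they never leave the site.
sites = {
    "site_a": [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
    "site_b": [(1.0, 3.0), (3.0, 7.0)],
    "site_c": [(0.5, 2.0), (1.5, 4.0), (2.5, 6.0), (3.5, 8.0)],
}

global_weights = (0.0, 0.0)
for _ in range(50):  # communication rounds
    updates = [local_update(global_weights, data) for data in sites.values()]
    sizes = [len(data) for data in sites.values()]
    global_weights = federated_average(updates, sizes)  # only parameters are shared

print(f"global model: w={global_weights[0]:.3f}, b={global_weights[1]:.3f}")
```

Only the `(w, b)` pairs cross site boundaries in this sketch, which is the property that reduces the data-sharing surface; those shared parameters are also what the privacy attacks mentioned above would target.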

Model Interpretability: What It Is and What It Is Not

Interpretability refers to the degree to which a model's predictions can be explained in terms that clinicians can evaluate. Techniques like SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) generate feature importance scores that indicate which inputs most influenced a particular prediction.

These explanations are approximations, not ground truth. A SHAP value tells you which features the model weighted heavily for a given prediction — it does not tell you whether those features are causally related to the outcome, or whether the model's reasoning is clinically valid. A model can produce a correct prediction for the wrong reasons, and interpretability tools may not reveal that.
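To make the idea of an attribution concrete, the sketch below computes exact Shapley values for a three-feature toy score by averaging each feature's marginal contribution over every ordering, relative to a baseline patient. This is the quantity that SHAP estimates; it is not the `shap` library, and exact enumeration is only feasible for a handful of features. The scoring function, feature values, and baseline are invented for illustration.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley attributions, switching features from baseline to instance."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            current[i] = instance[i]  # reveal feature i
            output = model(current)
            totals[i] += output - previous_output  # marginal contribution
            previous_output = output
    return [t / len(orderings) for t in totals]


def toy_risk_score(x):
    """Hypothetical score with an interaction between lactate and age over 65."""
    lactate, heart_rate, age = x
    return 0.5 * lactate + 0.01 * heart_rate + 0.2 * lactate * (age > 65)


patient = [4.0, 110.0, 72.0]   # made-up lactate, heart rate, age
baseline = [1.0, 80.0, 50.0]   # made-up reference patient

names = ["lactate", "heart_rate", "age"]
attributions = shapley_values(toy_risk_score, patient, baseline)
for name, value in zip(names, attributions):
    print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the patient's score and the baseline's score, so they describe how the model arrived at its output. They say nothing about whether lactate or age causes the outcome, and a different choice of baseline would yield different numbers.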

Reading the Evidence: What to Look For in a Clinical AI Study

When a vendor cites a study, or when a peer-reviewed paper crosses your desk, a short checklist helps separate substantive evidence from noise.

What is the study design? Retrospective analyses on historical data are the most common and the least generalizable. Prospective validation on independent data is more meaningful. RCTs measuring patient outcomes are rare but represent the strongest evidence.
Was the model externally validated? If not, performance figures may not translate to your setting.
What is the reference standard? How was the ground truth determined — radiologist consensus, pathology, long-term outcome? Weak reference standards inflate apparent model performance.
Are subgroup results reported? Overall AUC without demographic breakdowns leaves bias questions unanswered.
What is the operating threshold? A model's sensitivity and specificity at the threshold used in practice matter more than its AUC across all thresholds.
Are conflicts of interest disclosed? Vendor-funded studies show a consistent pattern of more favorable results than independent studies on the same tools.
Is the tool FDA-cleared for this specific use? Clearance status and intended use scope should be verified against the FDA database directly, not taken from vendor materials.

Where These Concepts Appear Across This Site

The concepts described here recur throughout the site's structured records. Clinical application briefs apply sensitivity, specificity, and equity considerations to specific tasks. FDA device records document clearance pathway and intended use scope. Evidence appraisals assess external validation status and report operating-point performance. Regulatory tracker entries cover how FDA guidance on SaMD, PCCP, and transparency requirements has evolved.

Each of those record types links back to the concepts defined here. If a term appears in a device record or appraisal without context, this entry is the intended reference point.