Medical Artificial Intelligence: A Structured Learning Path for Clinicians and Students

A module-based learning sequence guiding medical students, residents, and clinicians through the foundational concepts, regulatory frameworks, clinical evidence standards, and deployment realities of artificial intelligence in healthcare — in a logical reading order that builds working knowledge from the ground up.

Medical artificial intelligence is not a single technology. It is a collection of methods — statistical, computational, and probabilistic — applied to clinical problems ranging from reading chest X-rays to flagging sepsis risk in an ICU. Understanding it well enough to evaluate a specific tool, interrogate a study, or participate in a procurement decision requires building knowledge in a specific order.

This learning sequence does that. It moves from definitional foundations through regulatory mechanics, then into evidence evaluation, clinical deployment realities, and equity considerations. Each module identifies what you should be able to do after completing it — not just what you will have read.

Who This Sequence Is For

This path is built for three audiences who often encounter AI in healthcare without a structured way to make sense of it:

  • Medical students and residents who encounter AI-assisted tools in clinical rotations and want to understand what they are actually doing and how their outputs should be interpreted.
  • Practicing clinicians — in any specialty — who are being asked to adopt, evaluate, or simply use AI tools in their workflow and need a working vocabulary to engage critically.
  • Health administrators and IT staff responsible for procurement or deployment decisions who need to understand regulatory status, evidence quality, and implementation risk without needing a data science background.

The sequence assumes basic clinical knowledge — familiarity with how diagnosis works, what a clinical trial is, and how hospitals are organized. It does not assume any background in statistics, programming, or machine learning.

Sequence Overview

Eight-module learning path for medical artificial intelligence. Modules are designed to be read in order.
Module | Topic | Primary Task After Completion
1 | What medical AI actually is, and what it is not | Distinguish AI-enabled devices from clinical decision support software from generative AI tools
2 | How the FDA regulates AI in medicine | Identify whether a given tool is cleared, under what pathway, and for what intended use
3 | Reading the evidence: study design and performance metrics | Evaluate a published AI study for design quality, metric interpretation, and generalizability
4 | Imaging AI: the most mature application domain | Understand what FDA-cleared imaging AI tools do, their limitations, and how they integrate with radiologist workflow
5 | Clinical workflow AI: scribes, CDS, and EHR integration | Assess ambient documentation tools and embedded clinical decision support for evidence quality and regulatory status
6 | Algorithmic bias and health equity in AI systems | Identify how training data gaps propagate into performance disparities and what questions to ask vendors
7 | Generative AI in clinical settings: a distinct risk profile | Distinguish generative AI from predictive AI, understand hallucination risk, and apply appropriate skepticism to LLM-based clinical tools
8 | Putting it together: evaluating a real AI tool | Apply the full framework to assess a specific tool's regulatory status, evidence base, deployment context, and equity considerations

Module 1: What Medical AI Actually Is

The term "artificial intelligence" is used loosely in healthcare — sometimes to mean a deep learning model trained on millions of images, sometimes to mean a rules-based alert that fires when a lab value crosses a threshold. These are not the same thing, and conflating them leads to poor evaluation decisions.

In a regulatory and clinical context, medical AI typically refers to software that uses statistical models trained on data to generate outputs that inform clinical decisions. The FDA's framework for software as a medical device (SaMD) is the primary regulatory lens in the US. A tool that meets the SaMD definition falls under FDA device regulation, and whether it requires premarket review depends on its intended use and risk classification; one that falls outside the definition (e.g., a general wellness app) is not subject to that review.

Three Categories Worth Distinguishing

Three categories of AI-related software in clinical settings. Regulatory status differs substantially.
Category | How It Works | Regulatory Status | Example
Predictive AI / ML | Trained statistical model generates a probability or classification from input data | May require FDA clearance as SaMD depending on intended use and risk | AI reading a chest CT for pulmonary nodules
Rule-based CDS | Explicit if-then logic coded by clinicians or informaticists | Generally exempt from FDA device regulation under CDS guidance | Alert when creatinine rises above 2.0 mg/dL
Generative AI (LLMs) | Large language model generates text responses from prompts | No FDA-authorized generative AI medical devices as of Q2 2026 | LLM summarizing a patient's chart or answering a clinical question

The concept entry on Software as a Medical Device (SaMD) covers the regulatory definition in detail and explains how the FDA's risk-based classification framework determines which software requires premarket review.
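The difference between the first two categories can be made concrete with a minimal Python sketch. The rule-based function hard-codes the creatinine threshold used as the example in the table. The predictive function only stands in for a trained model: its coefficients, the age input, and both function names are hypothetical placeholders chosen for illustration, not values from any real device.

```python
import math


def creatinine_alert(creatinine_mg_dl: float, threshold: float = 2.0) -> bool:
    """Rule-based CDS: explicit if-then logic, no training data involved."""
    return creatinine_mg_dl > threshold


def predicted_risk(creatinine_mg_dl: float, age_years: float) -> float:
    """Predictive model stand-in: returns a probability between 0 and 1.

    In a real tool the coefficients would be learned from training data.
    The numbers below are made-up placeholders for illustration only.
    """
    intercept, w_creatinine, w_age = -6.0, 1.2, 0.04
    z = intercept + w_creatinine * creatinine_mg_dl + w_age * age_years
    return 1.0 / (1.0 + math.exp(-z))


# The rule gives a yes/no answer that anyone can trace to the threshold.
print(creatinine_alert(2.4))   # True
print(creatinine_alert(1.1))   # False

# The model gives a probability whose meaning depends on how it was trained.
print(round(predicted_risk(2.4, age_years=70), 3))
```

The rule's behavior can be read directly from its source. The model's behavior depends on data the reader cannot see, which is part of why the two categories are treated differently.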

Module 2: How the FDA Regulates AI in Medicine

FDA clearance is the single most important credential to verify before adopting an AI tool in a clinical setting. It does not guarantee effectiveness, but it does mean the agency reviewed the device's intended use, the evidence submitted by the manufacturer, and the risk profile.

Three pathways matter for AI devices:

  • 510(k): The most common pathway. The manufacturer demonstrates "substantial equivalence" to a legally marketed predicate device. Most cleared AI devices use this pathway. Clearance does not require proof of clinical superiority over the predicate.
  • De Novo: Used for novel low-to-moderate-risk devices when no predicate exists. The FDA establishes a new device classification, and the authorized device can then serve as a predicate for future 510(k) submissions. More rigorous than 510(k) for novel device types.
  • PMA (Premarket Approval): The highest bar. Required for Class III devices: those that support or sustain life, or present a potential unreasonable risk of illness or injury. Very few AI devices have gone through PMA.

A critical concept introduced in recent FDA guidance is the Predetermined Change Control Plan (PCCP). An AI model that "learns" or updates after deployment would normally require a new submission for each significant change. A PCCP, if authorized as part of the original marketing submission, allows the manufacturer to make specified modifications within defined bounds without re-submitting. This matters for evaluating whether a tool you deploy today is the same tool it will be in 18 months.

Module 3: Reading the Evidence

FDA clearance and clinical effectiveness are separate questions. A device can be cleared based on a retrospective study from a single institution with a homogeneous patient population — and then perform differently in your hospital's patient mix.

Study Design Hierarchy

Not all studies are equal. For AI in medicine, the study design determines how much weight you should give the reported performance:

Study design hierarchy for AI clinical evidence. Most cleared AI devices have retrospective evidence only.
Design | Strength | Common Limitation in AI Studies
Prospective RCT | Highest: controls for confounding, measures real clinical outcomes | Rare; expensive; few AI tools have undergone RCT evaluation
Prospective cohort | Strong: real patients, real workflow, but no randomization | Selection bias; single-site performance may not generalize
Retrospective cohort | Moderate: faster, cheaper, but data collected for other purposes | Label quality issues; population shift; overfitting to local data patterns
Systematic review / meta-analysis | Depends on included studies: can aggregate evidence or aggregate bias | Heterogeneity in AI tasks and metrics makes pooling difficult
Regulatory submission only | Lowest: not peer-reviewed, methodology not fully disclosed | No independent replication; performance may reflect curated test sets

Performance Metrics: What They Mean and What They Hide

AUC (area under the ROC curve) is the most commonly reported metric in AI studies. An AUC of 0.90 sounds impressive but tells you almost nothing about how the model performs at any specific operating threshold — and operating threshold is what matters clinically. A model with AUC 0.90 can have sensitivity of 70% or 95% depending on where the threshold is set, with corresponding tradeoffs in false positives.

Always look for sensitivity and specificity reported at the operating threshold the tool actually uses in deployment. If the study only reports AUC, the evidence is incomplete for clinical adoption decisions.
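A small worked example may help show why AUC alone is incomplete. The Python sketch below uses ten made-up scores and labels (toy data, not from any study) and two hypothetical thresholds. It computes one AUC for the model, then the sensitivity and specificity at each threshold. The function names are illustrative.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """AUC as the chance a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data: 5 patients with the condition (1) and 5 without (0).
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.75, 0.55, 0.35, 0.65, 0.45, 0.25, 0.15, 0.05]

model_auc = auc(scores, labels)
strict = sensitivity_specificity(scores, labels, threshold=0.7)
lenient = sensitivity_specificity(scores, labels, threshold=0.3)

print(f"AUC: {model_auc:.2f}")                     # AUC: 0.88
print(f"threshold 0.7 -> sens/spec: {strict}")     # (0.6, 1.0)
print(f"threshold 0.3 -> sens/spec: {lenient}")    # (1.0, 0.6)
```

The same model, with the same AUC, misses two of five true cases at one threshold and none at the other, at the cost of two false positives among five negatives. Which point the deployed tool actually operates at is the clinically relevant question.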

Module 4: Imaging AI

Radiology accounts for the large majority of FDA-cleared AI devices, with pathology and other image-based specialties contributing a much smaller share. This is not because imaging AI is inherently more effective than AI in other domains. It is because imaging data is relatively standardized (DICOM format, established label conventions), which made it tractable for early deep learning applications and for FDA review.

Cleared imaging AI tools span a wide range of tasks: detecting pulmonary nodules on chest CT, flagging intracranial hemorrhage on non-contrast head CT, identifying diabetic retinopathy on fundus photography, classifying lesions on mammography. Each has a specific intended use — and most are designed to assist, not replace, the reading radiologist or pathologist.

Workflow Integration Is Not Automatic

A common deployment failure mode is purchasing a cleared imaging AI tool and discovering it does not integrate cleanly with the existing PACS or RIS. Integration requires DICOM conformance, worklist routing configuration, and often custom IT work. Some tools require a dedicated AI platform layer between the imaging system and the AI algorithm.

Radiologist acceptance is a separate variable. Studies consistently show that AI outputs influence radiologist decisions — sometimes in the direction of overcalling findings the AI flags, sometimes in the direction of anchoring on AI outputs and missing findings the AI missed. Neither effect is benign. Deployment without structured training on how to use AI outputs appropriately is a known risk.

Module 5: Clinical Workflow AI

Outside imaging, the fastest-growing category of deployed clinical AI is ambient documentation — tools that listen to physician-patient conversations and generate structured clinical notes. These tools are widely deployed across health systems as of 2025–2026, driven primarily by physician demand for documentation burden relief.

The evidence base for ambient AI scribes is almost entirely vendor-disclosed or based on single-institution implementation studies. Peer-reviewed RCT evidence for patient outcome improvements does not yet exist at scale. What does exist is evidence of documentation time reduction — typically 30–50% in vendor-reported studies — and physician satisfaction improvements.

Clinical decision support (CDS) embedded in EHRs occupies a different regulatory space. Under the 21st Century Cures Act's CDS provisions, software that supports a clinician's decision while letting the clinician independently review the basis for its recommendations, so that the clinician is not expected to rely primarily on the software, may be exempt from FDA device regulation. The boundary between exempt CDS and regulated SaMD is actively contested and has been the subject of multiple FDA guidance documents.

Module 6: Algorithmic Bias and Health Equity

Algorithmic bias in medical AI is not a theoretical concern. It is a documented pattern with traceable causes. When a model is trained on data from a specific patient population — demographically, geographically, or institutionally — its performance tends to degrade on populations underrepresented in that training set.

Several mechanisms drive this. Training data from academic medical centers over-represents certain demographics and under-represents others. Labels in training data may reflect historical clinical biases — for example, if a disease was historically underdiagnosed in a population, the training labels will encode that underdiagnosis. Image quality differences across imaging equipment can interact with race or socioeconomic status in ways that degrade performance.

What to Ask Before Adopting a Tool

  1. What was the demographic composition of the training dataset? Was it disclosed in the FDA submission or published literature?
  2. Was the model validated on a population similar to your patient population? If not, what is the expected performance gap?
  3. Does the manufacturer provide subgroup performance data by race, sex, age, or socioeconomic status?
  4. Has the tool been independently audited for disparate performance, or is the only evidence vendor-provided?
  5. Does your institution have a process for monitoring performance disparities post-deployment?

FDA guidance increasingly expects manufacturers to address algorithmic bias in submissions, but the requirements are not yet standardized. Subgroup performance reporting is inconsistent across cleared devices. In practice, the burden of identifying potential equity concerns often falls on the adopting institution.
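When subgroup data is not supplied, an institution with locally adjudicated cases can compute it. The Python sketch below assumes a simple record format of (subgroup, true label, tool output) and uses made-up counts for two hypothetical subgroups, "A" and "B". It is an illustration of the idea, not a validated audit method.

```python
from collections import defaultdict


def sensitivity_by_subgroup(records):
    """Per-subgroup sensitivity from (subgroup, true_label, predicted_label) records.

    Only cases with true_label == 1 contribute, since sensitivity is
    the fraction of true cases the tool detects.
    """
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for group, truth, predicted in records:
        if truth != 1:
            continue
        if predicted == 1:
            true_pos[group] += 1
        else:
            false_neg[group] += 1
    groups = sorted(set(true_pos) | set(false_neg))
    return {g: true_pos[g] / (true_pos[g] + false_neg[g]) for g in groups}


# Made-up example: 10 confirmed cases in each hypothetical subgroup.
records = (
    [("A", 1, 1)] * 9 + [("A", 1, 0)] * 1
    + [("B", 1, 1)] * 6 + [("B", 1, 0)] * 4
)

overall = sum(1 for _, t, p in records if t == 1 and p == 1) / len(records)
by_group = sensitivity_by_subgroup(records)

print(f"overall sensitivity: {overall:.2f}")   # overall sensitivity: 0.75
print(by_group)                                # {'A': 0.9, 'B': 0.6}
```

In this toy data the pooled figure of 0.75 hides a gap between 0.9 and 0.6. With real data, small subgroup counts make such estimates noisy, so confidence intervals would be needed before drawing conclusions.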

Module 7: Generative AI — A Distinct Risk Profile

Large language models occupy a genuinely different risk category from predictive AI. A chest CT nodule detection model produces a structured output — a probability, a bounding box, a classification — that can be evaluated against a ground truth. An LLM generating a clinical summary produces free text that can be plausible, internally consistent, and factually wrong simultaneously.

Hallucination — the generation of confident, well-formed statements that are factually incorrect — is a structural property of current LLM architectures, not a bug that will be patched. In a clinical context, a hallucinated drug interaction, a fabricated lab value in a summary, or an incorrect medication dose in an AI-drafted note can reach the medical record.

Institutional policies governing LLM use in clinical settings vary widely. Some health systems have explicit policies prohibiting LLM-generated content from entering the medical record without physician review and attestation. Others have no formal policy. As a clinician, knowing your institution's policy — and the rationale behind it — is part of practicing responsibly in this environment.

Module 8: Evaluating a Real AI Tool

The final module is applied. Given a specific AI tool — one being considered for adoption, one already in use, or one encountered in a vendor demonstration — work through the following evaluation framework:

  1. Regulatory status. Is it FDA-cleared? Under what pathway? For what specific intended use? Does the intended use match the clinical task you are evaluating it for? Check the FDA database directly using the submission number.
  2. Evidence quality. What peer-reviewed studies exist? What was the study design? Was external validation performed? What were the sensitivity and specificity at the operating threshold, not just AUC?
  3. Population fit. Does the training/validation population resemble your patient population? Is subgroup performance data available? Are there documented performance disparities?
  4. Deployment context. How does it integrate with your EHR or PACS? What does the workflow look like in practice? What happens when it fails — is the failure mode silent or visible?
  5. Ongoing monitoring. Does the vendor provide post-market performance data? Is there a mechanism for your institution to detect model drift or performance degradation over time?

This framework does not guarantee a correct adoption decision — no framework does. But it ensures the decision is made with the right questions answered, rather than on the basis of a vendor demonstration or a single benchmark number.
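The ongoing-monitoring step can be sketched in a few lines of Python. The example assumes the institution periodically reviews a sample of cases and records how many true cases the tool caught and missed each month. The function name, the baseline value, the tolerance, and the monthly counts are all hypothetical.

```python
def flag_drift(monthly_counts, baseline_sensitivity, tolerance=0.05):
    """Return months whose observed sensitivity falls below baseline minus tolerance.

    monthly_counts: list of (month, true_positives, false_negatives)
    taken from locally reviewed cases.
    """
    flagged = []
    for month, tp, fn in monthly_counts:
        if tp + fn == 0:
            continue  # no confirmed cases reviewed that month
        sensitivity = tp / (tp + fn)
        if sensitivity < baseline_sensitivity - tolerance:
            flagged.append((month, round(sensitivity, 3)))
    return flagged


# Made-up review data: 50 confirmed cases per month.
counts = [
    ("2026-01", 46, 4),   # 0.92
    ("2026-02", 45, 5),   # 0.90
    ("2026-03", 40, 10),  # 0.80
]

print(flag_drift(counts, baseline_sensitivity=0.92))  # [('2026-03', 0.8)]
```

A fixed tolerance is the simplest possible rule. A production process would likely account for sample size and also track specificity, but even a check this basic turns a silent degradation into a visible one.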

How to Use This Site Alongside the Sequence

Each module in this sequence connects to specific content groups on Clinical AI Record. As you work through the modules, the site's structured records give you concrete examples to apply the concepts against:

Module-to-content-group mapping for applying this sequence on Clinical AI Record.
Module | Relevant Site Content
1. What medical AI is | Concepts & Methods entries on SaMD, foundation models, and CDS definitions
2. FDA regulation | FDA-Cleared AI Device Registry: look up specific clearance records by specialty or pathway
3. Reading evidence | Research Study Analyses: structured appraisals with disclosed study design, metrics, and limitations
4. Imaging AI | Medical Imaging AI: evidence briefs and device records for radiology and pathology tools
5. Workflow AI | Clinical Workflow AI: profiles of ambient scribes and EHR-embedded CDS tools
6. Bias and equity | Clinical Application Briefs: equity considerations documented per application
7. Generative AI | Generative AI in Medicine Watch: deployment evaluations and regulatory status tracking
8. Tool evaluation | All content groups: use the cross-reference structure to trace a tool from device record to evidence to deployment report

