Canonical Definitions: Transfer Learning, Fine-Tuning, and Their Relationship
Transfer learning and fine-tuning are related but distinct concepts. Conflating them leads to imprecise clinical AI evaluation and governance. The distinction matters for regulatory classification, deployment planning, and safety assessment.
The FDA's Digital Health and Artificial Intelligence Glossary provides the regulatory-authoritative definition: transfer learning is "a strategic approach within ML wherein a model developed for a particular task is adapted for a second task. This approach leverages the knowledge and patterns acquired from a previously solved problem (source task) to boost the performance and learning efficiency of a model on a subsequent, often similar, problem (target task)." The FDA's own example: a model trained to identify tumors in lung X-ray images might leverage learned patterns to improve identification of abnormalities in liver ultrasound images.
Four terms form the core vocabulary of this concept family:
- Pretraining: The initial phase in which a model is trained on a large source dataset—such as ImageNet for vision models, PubMed text corpora for biomedical language models, or large medical imaging archives. The model learns general representations that encode patterns transferable to related tasks.
- Transfer learning: The overarching strategy of applying knowledge encoded in a pretrained model to a new target task. It encompasses all adaptation methods, from frozen feature extraction to full parameter updates.
- Fine-tuning: A specific transfer learning implementation in which a pretrained model's weights are updated through continued gradient-based training on task-specific labeled clinical data. The degree of updating varies by strategy—from a single output head to all model parameters.
- Domain adaptation: A subtype of transfer learning that specifically addresses distribution shift between source and target data. It is relevant when a model trained on general or external datasets is deployed in a specific institutional or patient population context, and is covered in detail in the Taxonomy section below.
For a definition of foundation models—the large pretrained models that increasingly serve as the source of pretrained weights in clinical AI adaptation pipelines—see Foundation Models in Healthcare: Definition, Architecture, and Clinical Scope. That article addresses what foundation models are; this entry addresses how they and other pretrained models are adapted.
The Clinical AI Adaptation Pipeline: Pretraining → Adaptation → Deployment
Clinical AI models built on transfer learning move through a three-phase pipeline. Understanding this arc clarifies what is actually transferred, why it matters for healthcare applications, and where things can go wrong.

Phase 1 — Source pretraining: A model is trained on a large, often general-purpose dataset. For vision models, this has historically meant ImageNet (over 14 million labeled natural images). For biomedical language models, this means large text corpora such as PubMed abstracts and clinical notes. For multimodal clinical foundation models, this may include millions of paired image-report examples. The model encodes learned representations (patterns, features, and relationships) into its weights.
Phase 2 — Target adaptation: The pretrained model is adapted to a specific clinical task using one of the strategies described in the Taxonomy section. What is transferred is not raw data but the weight representations—the model's learned internal structure—which encode generalizable patterns that can be redirected toward a new clinical objective. The adaptation step requires a labeled clinical dataset, but far smaller than would be needed to train from scratch.
Phase 3 — Deployment: The adapted model is integrated into a clinical workflow. Deployment introduces its own distribution shift challenges, as real-world patient populations, imaging equipment, and clinical documentation practices may differ from both the source and adaptation datasets. This is where model drift—the deployment-time manifestation of distribution shift—becomes a monitoring concern.
Taxonomy of Adaptation Strategies
Adaptation strategies exist on a spectrum from minimal parameter modification to full model retraining. The appropriate choice depends on the degree of domain shift between source and target, the size of the available labeled clinical dataset, computational resources, and the acceptable risk of catastrophic forgetting.

Feature Extraction (Frozen Backbone)
In feature extraction, all pretrained weights are frozen—held fixed—and only a new task-specific output head (typically a classification or regression layer) is trained on the clinical dataset. The pretrained model functions as a fixed feature extractor, transforming input data into representations that the new head learns to classify.
This approach eliminates catastrophic forgetting risk entirely, since pretrained weights are never modified. It is appropriate when the source and target domains are similar and labeled clinical data is very scarce. The trade-off is limited expressivity: because the backbone cannot adapt, performance is constrained by how well source-domain representations generalize to the clinical target. On the LUNA25 lung nodule detection dataset, linear probing (head-only training) achieved 59.5% AUC compared to 90.0% for full fine-tuning, illustrating the performance ceiling of this approach in high-domain-shift settings.
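The frozen-backbone setup can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a clinical pipeline: `W_backbone` is a random matrix standing in for pretrained weights, the data is synthetic, and the names `extract_features` and `train_head` are hypothetical. The point it shows is that only the head's parameters receive gradient updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained backbone: a fixed projection + ReLU.
W_backbone = rng.normal(size=(16, 8))

def extract_features(x):
    """Frozen backbone: maps inputs to features; W_backbone is never updated."""
    return np.maximum(x @ W_backbone, 0.0)

def train_head(x, y, steps=500, lr=0.1):
    """Train only a logistic-regression head on top of the frozen features."""
    feats = extract_features(x)
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        grad = p - y  # gradient of cross-entropy with respect to the logits
        w -= lr * feats.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Synthetic toy data (not clinical data): the label depends on the input sum.
x = rng.normal(size=(200, 16))
y = (x.sum(axis=1) > 0).astype(float)

backbone_before = W_backbone.copy()
w_head, b_head = train_head(x, y)
head_params = w_head.size + 1  # head weights plus one bias

print("backbone unchanged:", np.array_equal(backbone_before, W_backbone))
print("trainable parameters:", head_params, "of", W_backbone.size + head_params)
```

Because the backbone is never written to, forgetting cannot occur in this setup; the cost is that the head can only use whatever the fixed features already expose.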
Partial Fine-Tuning and Gradual Unfreezing
Partial fine-tuning selectively updates a subset of layers while keeping others frozen. The underlying logic reflects the layered structure of deep neural networks: lower layers encode general, broadly transferable features (edges, textures, basic linguistic patterns), while upper layers encode task-specific, domain-dependent features. Updating only upper layers allows task-specific adaptation while preserving general representations.
Gradual unfreezing is a related technique in which layers are progressively unfrozen from the top down during training, often with discriminative learning rates (different learning rates assigned to different layer groups). This approach has demonstrated effectiveness across radiology, cardiology, and gastroenterology imaging classification tasks. It offers a practical middle ground between feature extraction and full fine-tuning.
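One possible way to express gradual unfreezing with discriminative learning rates is as a schedule that maps an epoch to the layer groups that are trainable and the learning rate each one receives. The sketch below is framework-agnostic; the function name `unfreezing_schedule`, the group names, and the halving factor `decay` are illustrative assumptions rather than values taken from any cited study.

```python
def unfreezing_schedule(layer_groups, epoch, base_lr=1e-3, decay=0.5):
    """Return {group_name: learning_rate} for the groups trainable at `epoch`.

    `layer_groups` is ordered from input (lowest) to output (head).
    At epoch 0 only the last group trains; each later epoch unfreezes one
    more group from the top down. Lower groups get smaller learning rates.
    """
    n = len(layer_groups)
    n_unfrozen = min(epoch + 1, n)
    trainable = layer_groups[n - n_unfrozen:]
    rates = {}
    for depth_from_top, name in enumerate(reversed(trainable)):
        rates[name] = base_lr * (decay ** depth_from_top)
    return rates

# Hypothetical layer groups for a small CNN.
groups = ["stem", "block1", "block2", "head"]
for epoch in range(4):
    print(epoch, unfreezing_schedule(groups, epoch))
```

Groups absent from the returned mapping stay frozen for that epoch. In a real training loop the mapping would be translated into per-group optimizer settings.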
Full Fine-Tuning
Full fine-tuning updates all model parameters on the clinical target dataset. This maximizes task-specific adaptation and is most appropriate when domain shift between source and target is large and sufficient labeled clinical data is available. It carries the highest risk of catastrophic forgetting and requires the most computational resources—and, for large models, may be practically infeasible without specialized hardware.
A study evaluating eight fine-tuning strategies across three CNN architectures (ResNet-50, DenseNet-121, VGG-19) and five medical imaging domains found that no single fine-tuning approach uniformly outperforms others across all datasets—the choice of architecture and fine-tuning method must be tailored to the specific clinical domain and dataset characteristics.
Domain Adaptation
Domain adaptation is a subtype of transfer learning specifically designed to address distribution shift between source and target data. It is not simply a fine-tuning strategy but a distinct objective: aligning model behavior across domains where the data-generating processes differ. In clinical AI, domain adaptation is relevant whenever a model trained on one institution's data, one imaging protocol, or one patient population is applied to a different clinical context.
Domain-adaptive pre-training (DAPT) is one implementation: continued unsupervised or self-supervised pre-training on large domain-specific unlabeled text—for example, millions of clinical notes or specialty-specific medical documents—before task-specific fine-tuning. This is particularly relevant for clinical NLP models that need to internalize specialty abbreviations, clinical terminology, and documentation conventions that differ from general biomedical text.
Parameter-Efficient Fine-Tuning (PEFT): LoRA, QLoRA, and Adapter Modules
Parameter-efficient fine-tuning methods update only a small, structured subset of model parameters rather than the full weight set. This makes large model adaptation feasible in low-data and resource-constrained clinical environments.
Low-Rank Adaptation (LoRA) freezes the original pretrained weights and injects trainable low-rank matrices alongside them. Formally, for a pretrained weight matrix W₀ of size d × k, the adapted weight becomes W = W₀ + AB, where A is d × r, B is r × k, and the rank r is much smaller than d and k. Only A and B are updated during training; W₀ remains fixed. This enables fine-tuning with as little as 0.01–0.2% of total model parameters.
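The update W = W₀ + AB can be made concrete with a small NumPy sketch. The sizes below (512 × 512 with rank 4) are illustrative assumptions, the weight values are random stand-ins, and starting one factor at zero is a common convention so that the adapted layer initially matches the pretrained one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 4  # illustrative sizes; r is much smaller than d and k

W0 = rng.normal(size=(d, k))          # frozen pretrained weight (stand-in values)
A = rng.normal(size=(d, r)) * 0.01    # trainable low-rank factor
B = np.zeros((r, k))                  # trainable, zero-initialized so W == W0 at start

def lora_forward(x):
    """Compute x @ (W0 + A @ B) without materializing the full d x k update."""
    return x @ W0 + (x @ A) @ B

x = rng.normal(size=(2, d))
assert np.allclose(lora_forward(x), x @ W0)  # no behavior change before training

trainable = A.size + B.size
fraction = trainable / W0.size
print(f"trainable: {trainable}, frozen: {W0.size}, ratio: {fraction:.4%}")
```

For this single toy matrix the trainable share is about 1.6%. The much smaller percentages quoted for full models arise because real weight matrices are far larger relative to r and because typically only some of a model's matrices receive LoRA factors.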
In medical imaging benchmarks, LoRA has demonstrated competitive performance relative to full fine-tuning in low-data regimes. For ViT-based classification tasks, LoRA achieved up to 6% absolute F1-score gains while tuning fewer than 0.2% of parameters. Applied to the Segment Anything Model (SAM) for medical image segmentation, LoRA fine-tuning of the image encoder produced a 13.93% absolute increase in average Dice Similarity Coefficient for multi-organ segmentation tasks, according to a 2024 review of foundation model adaptation strategies in medical imaging.
QLoRA extends LoRA with 4-bit quantization of the base model weights, dramatically reducing GPU memory requirements. Full 16-bit fine-tuning of LLaMA 65B requires more than 780 GB of GPU memory; QLoRA reduces this to under 48 GB, making large language model fine-tuning accessible on hardware available in research and clinical informatics settings. QLoRA has been applied to clinical NLP tasks including echocardiography report generation (EchoGPT, fine-tuned on 95,506 echocardiography reports).
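A back-of-the-envelope calculation shows where figures of this magnitude come from. The per-parameter byte counts below are an assumed accounting (16-bit weights and gradients plus 32-bit Adam optimizer state), not numbers stated in this article, and the calculation ignores activations.

```python
PARAMS = 65e9  # LLaMA 65B parameter count, as cited above

# Assumed accounting for 16-bit full fine-tuning with Adam:
# 2 bytes weights + 2 bytes gradients + 8 bytes optimizer state per parameter.
full_ft_bytes_per_param = 2 + 2 + 8
full_ft_gb = PARAMS * full_ft_bytes_per_param / 1e9

# 4-bit quantized frozen base weights: 0.5 bytes per parameter.
base_4bit_gb = PARAMS * 0.5 / 1e9

print(f"full fine-tuning, weights + gradients + optimizer: {full_ft_gb:.1f} GB")
print(f"4-bit frozen base weights: {base_4bit_gb:.1f} GB")
```

Under these assumptions the frozen 4-bit base occupies 32.5 GB, which leaves the remainder of the 48 GB figure cited above for the LoRA factors, their optimizer state, quantization constants, and activations.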
Adapter modules insert lightweight bottleneck layers within the frozen pretrained network. Only these inserted layers are trained. Like LoRA, adapters achieve parameter efficiency while preserving the pretrained backbone's representations.
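A bottleneck adapter can be sketched as a down-projection, a nonlinearity, an up-projection, and a residual connection around the whole unit. The sizes, the ReLU choice, and the omission of bias terms are simplifying assumptions for illustration; `h` stands in for the output of a frozen pretrained layer.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, bottleneck = 768, 32  # illustrative sizes

W_down = rng.normal(size=(hidden, bottleneck)) * 0.01  # trainable
W_up = np.zeros((bottleneck, hidden))                  # trainable, zero-initialized

def adapter(h):
    """Bottleneck adapter with a residual connection: h + up(relu(down(h)))."""
    z = np.maximum(h @ W_down, 0.0)
    return h + z @ W_up

h = rng.normal(size=(4, hidden))  # stand-in for a frozen layer's output
out = adapter(h)
adapter_params = W_down.size + W_up.size

print("identity at initialization:", np.allclose(out, h))
print("adapter parameters:", adapter_params)
```

With the up-projection starting at zero, the adapter initially passes the frozen representation through unchanged, so training begins from the pretrained model's behavior.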
Federated Learning as a Transfer Learning Variant
Federated learning enables multi-institutional collaborative fine-tuning without centralizing patient data. Each participating institution trains a local model on its own data; only model weight updates (gradients or parameters) are shared and aggregated centrally. The FDA defines federated learning as "a decentralized approach to training ML models... designed to preserve data privacy, as raw data remain at the local sites and are not centralized."
In the transfer learning context, federated fine-tuning allows a pretrained foundation model to be adapted across multiple health systems' data distributions simultaneously, improving generalizability while respecting HIPAA and GDPR constraints that restrict direct data sharing. This is particularly relevant for rare disease applications where no single institution has sufficient labeled cases for effective fine-tuning.
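The central aggregation step can be illustrated with a sample-size-weighted average of per-site parameters, in the style of federated averaging. This is a minimal sketch: the function name `federated_average`, the three sites, and their parameter vectors are hypothetical, and real deployments add secure communication and often further privacy protections.

```python
import numpy as np

def federated_average(site_weights, site_sizes):
    """Weighted average of per-site parameter vectors (FedAvg-style aggregation).

    Only parameters and sample counts cross institutional boundaries;
    no patient-level records are passed to this function.
    """
    sizes = np.asarray(site_sizes, dtype=float)
    stacked = np.stack(site_weights)
    return (stacked * sizes[:, None]).sum(axis=0) / sizes.sum()

# Hypothetical local parameters after one round of fine-tuning at three sites.
site_weights = [np.array([1.0, 0.0]), np.array([3.0, 2.0]), np.array([5.0, 4.0])]
site_sizes = [100, 100, 200]

global_weights = federated_average(site_weights, site_sizes)
print(global_weights)  # [3.5 2.5]
```

Weighting by local sample count gives the larger site more influence on the aggregated model; other weighting schemes are possible when site sizes are very unequal.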
| Strategy | Parameters Updated | Catastrophic Forgetting Risk | Labeled Data Requirement | Compute Requirement | Primary Clinical Use Case |
|---|---|---|---|---|---|
| Feature extraction (frozen backbone) | Output head only | None | Very low | Low | High domain similarity, very scarce labels |
| Partial fine-tuning / gradual unfreezing | Upper layers + head | Low to moderate | Low to moderate | Moderate | Moderate domain shift; radiology, cardiology, GI |
| Full fine-tuning | All parameters | High | Moderate to high | High | Large domain shift; sufficient labeled data available |
| Domain adaptation (DAPT) | All or partial parameters | Moderate | Low (unlabeled) + moderate (labeled) | Moderate to high | Cross-institutional, cross-protocol deployment |
| LoRA / QLoRA (PEFT) | < 0.2% of parameters | Low | Low | Low (QLoRA: ~48 GB for 65B model) | Large LLM fine-tuning; low-data clinical NLP |
| Adapter modules (PEFT) | Adapter layers only | Low | Low | Low | Modular task adaptation; NLP and imaging |
| Federated fine-tuning | Varies (local updates aggregated) | Varies | Distributed across institutions | Distributed | Multi-institutional adaptation; rare disease; privacy-constrained settings |
Healthcare-Specific Drivers: Why Transfer Learning Is Central to Clinical AI
Four structural constraints in healthcare make transfer learning not merely useful but often the only viable path to effective clinical AI model development.
- Labeled data scarcity: Expert annotation of clinical data—radiologist-labeled imaging studies, pathologist-annotated slides, physician-coded clinical notes—is inherently limited. Unlike natural image datasets that can be crowd-labeled, clinical annotation requires domain expertise, time, and often institutional review. Transfer learning allows effective model development from datasets that would be insufficient for training from scratch.
- Development and annotation cost: Stanford HAI has estimated that developing, deploying, and maintaining a classifier for a single clinical task under the conventional paradigm can exceed $200,000. Shared pretrained models and transfer learning reduce this cost by allowing multiple task-specific adaptations from a single pretrained base.
- Privacy and regulatory constraints on data sharing: HIPAA in the U.S. and GDPR in the EU restrict the centralization of patient data across institutions. This makes it difficult to aggregate the large labeled datasets that would otherwise be needed for training from scratch. Federated fine-tuning directly addresses this constraint by enabling collaborative adaptation without data movement.
- Class imbalance and rare disease underrepresentation: Many clinically important conditions—rare cancers, uncommon arrhythmias, atypical presentations of common diseases—are underrepresented in any single institution's dataset. Transfer learning from models pretrained on broader datasets provides a richer feature foundation from which to detect rare patterns, partially compensating for the scarcity of positive examples in the fine-tuning dataset.
Key Failure Modes: Catastrophic Forgetting, Negative Transfer, and Overfitting
Transfer learning and fine-tuning introduce failure modes that are distinct from those of models trained from scratch. Each has clinical safety implications that clinicians, health IT teams, and AI governance bodies need to understand.
Catastrophic Forgetting
Catastrophic forgetting occurs when fine-tuning on a narrow clinical dataset causes the model to overwrite general representations learned during pretraining. The model becomes highly optimized for the fine-tuning task but loses the broader representational capacity that made transfer learning valuable in the first place.
In clinical AI, this is most consequential when a model is sequentially fine-tuned on multiple tasks or when a model cleared for one clinical indication is subsequently adapted for a related but distinct indication. Full fine-tuning on small, homogeneous clinical datasets carries the highest forgetting risk. PEFT methods (LoRA, adapters) and partial fine-tuning strategies are specifically designed to mitigate this risk by preserving frozen pretrained weights.
Negative Transfer
Negative transfer occurs when the source domain knowledge encoded in a pretrained model is misaligned with the target clinical task—and that misalignment degrades rather than improves performance on the target. This happens when source and target domains have incompatible feature distributions, conflicting assumptions, or fundamentally different data-generating processes.
An illustrative case: a model trained on imaging data from urban academic medical centers, with standardized equipment, protocols, and patient demographics, may perform poorly when deployed at rural clinics with different imaging hardware, acquisition parameters, and patient populations. The transferred representations reflect the source distribution, not the deployment context. This is also a pathway through which pretrained model biases are inherited and potentially amplified in the fine-tuned clinical model. For a detailed taxonomy of bias inheritance and amplification mechanisms, see Algorithmic Bias in Healthcare AI: Definition, Taxonomy, and Mitigation Frameworks.
Negative transfer and the distribution shift problems that cause it are also the training-time antecedents of model drift—the deployment-time performance degradation that occurs when a deployed model encounters data distributions that differ from its training context. See Model Drift in Deployed Clinical AI: Definition, Types, Causes, Detection, and Monitoring for the deployment-time dimension of this problem.
Overfitting in Low-Data Clinical Regimes
When a large pretrained model is fine-tuned on a small clinical dataset—particularly with full parameter updates—it may overfit: memorizing the specific examples in the fine-tuning set rather than learning generalizable clinical patterns. An overfit model will report strong performance on the fine-tuning dataset but perform poorly on new patients, new institutions, or edge cases not represented in the adaptation data.
Overfitting risk is highest when the fine-tuning dataset is small, homogeneous, or not representative of the target deployment population. PEFT methods, regularization-based fine-tuning, and frozen backbone approaches all reduce overfitting risk relative to full fine-tuning in low-data settings.
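The overfitting signature described above, training loss that keeps falling while held-out loss rises, can be detected with a simple early-stopping rule. Early stopping is not named in this article; it is used here only as one familiar form of regularization, and the loss values are made-up numbers for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping once validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Hypothetical per-epoch losses from fine-tuning on a small dataset.
train = [0.90, 0.60, 0.40, 0.25, 0.12, 0.05]
val = [0.95, 0.70, 0.62, 0.66, 0.75, 0.90]

best = early_stopping_epoch(val)
gap = val[-1] - train[-1]  # generalization gap if training had run to the end
print("best epoch:", best)
print("final train/validation gap:", round(gap, 2))
```

The validation set only serves this purpose if it resembles the deployment population; a held-out split drawn from the same small, homogeneous fine-tuning set can hide the problem.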
Clinical Applications Across Modalities
Transfer learning and fine-tuning have been applied across two major modality tracks in clinical AI, medical imaging and clinical language, with structured EHR prediction as a third. The two major tracks have substantial published evidence bases and distinct technical characteristics.
Medical Imaging: CNN and Vision Transformer Architectures
CNN-based transfer learning for medical imaging represents the older and more extensively validated paradigm in clinical AI. Standard architectures—ResNet, DenseNet, VGG, and more recently Vision Transformers (ViT)—are pretrained on large natural image datasets (principally ImageNet) and then fine-tuned on labeled clinical imaging data.
This approach has been applied across radiology (chest X-ray classification, CT lesion detection), pathology (whole-slide image analysis, tumor grading), ophthalmology (diabetic retinopathy grading from fundus photographs), dermoscopy (melanoma detection), and endoscopy (polyp detection, lesion characterization). The transfer learning rationale is consistent across these modalities: labeled clinical imaging datasets are orders of magnitude smaller than ImageNet, but the low-level visual features learned from natural images—edges, textures, shapes—generalize to medical imaging tasks.
A comprehensive evaluation of fine-tuning strategies across X-ray, MRI, histology, dermoscopy, and endoscopy datasets found that combining linear probing with subsequent full fine-tuning (LP-FT) produced notable improvements in over 50% of evaluated cases, and that adaptive learning rate methods (Auto-RGN) led to performance enhancements of up to 11% for specific modalities. The finding that no single strategy dominates across all imaging domains has practical implications: clinical AI developers must evaluate strategy selection empirically for each target task rather than applying a universal approach.
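The LP-FT recipe, linear probing followed by full fine-tuning, can be sketched on a toy linear model. Everything here is an assumption made for illustration: the data is synthetic, the "backbone" is a small random matrix standing in for pretrained weights, and the step counts and learning rates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 6))
y = x @ rng.normal(size=6)          # synthetic regression target

W = rng.normal(size=(6, 4)) * 0.5   # stand-in for pretrained backbone weights
v = np.zeros(4)                     # new task head

def step(W, v, lr, train_backbone):
    """One gradient step on squared error for pred = (x @ W) @ v."""
    feats = x @ W
    err = feats @ v - y
    grad_v = feats.T @ err / len(y)
    grad_W = x.T @ np.outer(err, v) / len(y)
    v = v - lr * grad_v
    if train_backbone:
        W = W - lr * grad_W
    return W, v

def loss(W, v):
    return float(np.mean(((x @ W) @ v - y) ** 2))

W_pretrained = W.copy()
initial_loss = loss(W, v)

# Phase 1: linear probing (backbone frozen, head only).
for _ in range(200):
    W, v = step(W, v, lr=0.05, train_backbone=False)
loss_after_lp = loss(W, v)
backbone_unchanged_after_lp = np.array_equal(W, W_pretrained)

# Phase 2: full fine-tuning from the probed head, with a smaller learning rate.
for _ in range(200):
    W, v = step(W, v, lr=0.005, train_backbone=True)
loss_after_ft = loss(W, v)

print(initial_loss, loss_after_lp, loss_after_ft)
```

Starting phase 2 from a trained head means the first backbone updates are driven by a sensible error signal rather than by a randomly initialized output layer.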
Without fine-tuning, foundation models struggle with the variation in real-world imaging data that arises from differences in imaging modalities, patient demographics, and clinical acquisition context. For example, a foundation model fine-tuned on CT scans from patients with comorbidities is better positioned to detect tumors in the presence of confounding findings that a generically trained model may miss.
Clinical NLP and Large Language Models
Domain-specific language models fine-tuned for clinical tasks represent the second major application track. Models such as BioBERT (pretrained on PubMed abstracts and PMC full-text), GatorTron (pretrained on over 90 billion words of clinical text from the University of Florida Health system), and BioMedLM have been fine-tuned for tasks including named entity recognition in clinical notes, EHR mining, clinical coding, and information extraction.
More recently, large general-purpose LLMs have been fine-tuned for clinical applications using instruction tuning and RLHF (reinforcement learning from human feedback). Applications include radiology report generation, ambient clinical documentation (AI scribes), patient-facing chatbots, and clinical decision support. For applied context on fine-tuned LLMs in documentation workflows, see NLP in Clinical Documentation: A Reference Guide for AI Scribes, Clinical Coding, and CDI.
A third modality—structured EHR-based prediction models—uses transfer learning to adapt models trained on broad patient record patterns to specific prediction tasks such as sepsis risk scoring, readmission prediction, or deterioration alerts. These models typically operate on tabular or time-series EHR data rather than imaging or text, and fine-tuning strategies are adapted accordingly.
| Modality Track | Common Architectures | Pretrained On | Clinical Applications | Key Fine-Tuning Considerations |
|---|---|---|---|---|
| Medical imaging (radiology) | ResNet, DenseNet, VGG, ViT | ImageNet; large medical imaging corpora | Chest X-ray classification, CT lesion detection, lung nodule assessment | Domain shift from natural to medical images; no single strategy dominates across modalities |
| Medical imaging (pathology) | ResNet, DenseNet, ViT | ImageNet; pathology-specific corpora | Whole-slide image analysis, tumor grading, Ki-67 scoring | Very high-resolution inputs; multiple instance learning approaches |
| Medical imaging (ophthalmology, dermoscopy) | ResNet, VGG, ViT | ImageNet; specialty imaging datasets | Diabetic retinopathy grading, melanoma detection | Equipment and protocol variation across sites creates negative transfer risk |
| Clinical NLP (domain-specific) | BERT-based (BioBERT, GatorTron) | PubMed, PMC, clinical notes | Named entity recognition, EHR mining, clinical coding | Domain-adaptive pretraining before task fine-tuning improves terminology handling |
| Clinical LLMs (large models) | GPT-based, LLaMA, PaLM variants | General web + biomedical text | Ambient documentation, radiology report generation, patient interaction | PEFT (LoRA, QLoRA) for resource efficiency; hallucination risk persists post-fine-tuning |
| Structured EHR prediction | Transformer, LSTM, gradient boosting with transfer | Broad patient record datasets | Sepsis prediction, readmission risk, deterioration alerts | Tabular and time-series data require task-specific architecture choices |
Regulatory and Governance Dimensions
Transfer learning and fine-tuning have direct regulatory implications for AI-enabled medical devices cleared by the FDA. Two regulatory classifications and one governance mechanism are central to understanding how fine-tuned clinical AI is governed.
Locked vs. Adaptive (Continual Learning) Model Classification
The FDA distinguishes between two fundamental model types based on whether the model changes after deployment. A Locked Model "provides the same output each time the same input is applied to it and does not change with use." A Continual Machine Learning (Adaptive) Model has "a defined learning process to change its behavior" and "model changes are implemented such that for a given set of inputs, the output may be different before and after the changes are implemented."
Most fine-tuned clinical AI models deployed today are locked at the point of clearance: fine-tuning occurs before regulatory submission, and the cleared model is fixed. Any subsequent fine-tuning—for example, adapting a cleared model to a new patient population or imaging protocol—constitutes a modification that may require a new regulatory submission or, if pre-specified, governance under a Predetermined Change Control Plan.
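One way a health IT team might check that a deployed model is still byte-identical to the version that was validated is to compare a cryptographic fingerprint of its weights. This is a hypothetical engineering sketch, not an FDA-specified procedure; the function names and the three-element weight list are placeholders.

```python
import hashlib
import struct

def weights_fingerprint(weights):
    """SHA-256 over a flat list of float weights, packed as little-endian doubles."""
    payload = struct.pack(f"<{len(weights)}d", *weights)
    return hashlib.sha256(payload).hexdigest()

def is_locked(deployed_weights, reference_hash):
    """True if the deployed weights are byte-identical to the reference version."""
    return weights_fingerprint(deployed_weights) == reference_hash

cleared_weights = [0.12, -0.5, 3.0]  # hypothetical weights of the cleared model
cleared_hash = weights_fingerprint(cleared_weights)

print(is_locked([0.12, -0.5, 3.0], cleared_hash))     # unchanged model
print(is_locked([0.12, -0.5, 3.0001], cleared_hash))  # any fine-tuning changes the hash
```

Any fine-tuning step, however small, alters the fingerprint, which makes unplanned post-clearance changes easy to detect in a deployment audit.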
Predetermined Change Control Plan (PCCP) and Planned Fine-Tuning
The Predetermined Change Control Plan (PCCP) is the FDA mechanism that allows manufacturers to specify in advance the types of algorithmic modifications—including fine-tuning updates—they intend to make post-clearance, and the performance monitoring and validation protocols that will govern those modifications. A PCCP-covered fine-tuning update does not require a new 510(k) or De Novo submission if it falls within the pre-specified scope and the manufacturer follows the agreed protocols.
For clinical AI developers planning iterative fine-tuning of cleared models—for example, periodic retraining on new institutional data to address model drift—the PCCP is the primary governance pathway. See Predetermined Change Control Plan (PCCP): The FDA Mechanism for Iterative AI/ML Medical Device Updates for the full regulatory framework. For the broader FDA AI/ML SaMD policy context, see FDA AI/ML SaMD Action Plan (2021): Five Commitments, Key Deliverables, and Implementation Status Through Q2 2026.
Model Shelf-Life and Governance Considerations
A practical governance consideration for health systems evaluating fine-tuned clinical AI: the competitive performance advantage of a fine-tuned model over the base pretrained model may erode within 4–6 months as newer, more capable base models become available. This affects the return on investment calculation for fine-tuning investments and underscores the importance of building institutional processes for periodic model evaluation and, where appropriate, re-fine-tuning under a PCCP framework rather than treating a fine-tuned model as a permanent solution.
- Fine-tuning updates to cleared AI devices that fall outside the original cleared intended use require a new regulatory submission, not just internal validation.
- PCCP pre-specification must occur before clearance—it cannot be added retroactively to an already-cleared device.
- Federated fine-tuning across institutions does not eliminate the regulatory requirement to validate the updated model before clinical deployment; it changes the data governance structure, not the validation obligation.
- Bias inherited from the pretrained model may be amplified rather than corrected during fine-tuning if the fine-tuning dataset is not representative of the target deployment population—a consideration that applies to both regulatory submissions and ongoing monitoring.
Related Terms and Cross-References
The following ClinicalMind entries extend the concepts defined in this article. Each covers a distinct part of the topic, so following the relevant cross-reference is the most efficient path to a specific concept.
- Foundation Models in Healthcare: Definition, Architecture, and Clinical Scope — Upstream definition article. Covers what foundation models are, their pretraining architecture, and their clinical scope. This transfer learning article addresses the adaptation step that follows.
- Predetermined Change Control Plan (PCCP): The FDA Mechanism for Iterative AI/ML Medical Device Updates — The regulatory governance framework for planned post-clearance fine-tuning updates to cleared AI-enabled medical devices.
- Model Drift in Deployed Clinical AI: Definition, Types, Causes, Detection, and Monitoring — The deployment-time manifestation of distribution shift; the monitoring counterpart to domain adaptation addressed at training time.
- Algorithmic Bias in Healthcare AI: Definition, Taxonomy, and Mitigation Frameworks — Bias taxonomy and mitigation frameworks, including bias amplification through fine-tuning on non-representative clinical data.
- Hallucination in Clinical LLMs: Definition, Causes, Detection, and Deployment Implications — Defines the hallucination risk that persists in fine-tuned clinical LLMs and its implications for ambient documentation and clinical decision support applications.
- NLP in Clinical Documentation: A Reference Guide for AI Scribes, Clinical Coding, and CDI — Applied NLP context for fine-tuned LLM use cases in ambient documentation, EHR mining, and clinical coding.
- FDA AI/ML SaMD Action Plan (2021): Five Commitments, Key Deliverables, and Implementation Status Through Q2 2026 — The broader FDA regulatory framework for AI/ML software as a medical device, including the locked vs. adaptive model classification that applies to fine-tuned clinical AI.