Foundation Models in Healthcare: Med-PaLM, Architecture & Clinical Scope

What Is a Foundation Model? Definition and Conceptual Origin

A foundation model is a large AI system pre-trained on broad data at scale and designed to be adapted — with minimal additional effort — across many downstream tasks and domains. The defining characteristic is the separation between pre-training and deployment: a single base model acquires general capabilities from an enormous and diverse corpus, and that base is then adapted for specific tasks rather than being retrained from scratch for each one.

This framing was applied explicitly to large language models in medicine by the Med-PaLM research team in their landmark Nature (2023) paper, which described LLMs as "foundation models — large pre-trained AI systems that can be repurposed with minimal effort across numerous domains and diverse tasks." This framing distinguishes the foundation model paradigm from the dominant prior approach in clinical AI: narrow, single-task models trained specifically for one predefined clinical problem.

Narrow clinical AI models — such as a convolutional neural network trained exclusively to detect diabetic retinopathy in fundus photographs — are purpose-built and validated for a single, tightly scoped task. They do not generalize beyond that task without retraining. A foundation model inverts this relationship: the base model is general, and specialization is achieved through a targeted adaptation step applied on top of the pre-trained weights.

[Figure: schematic of a single base model branching to six clinical domains (chest X-ray, genomics, clinical text, dermatology, pathology, physician), with isolated single-task modules shown separately.] Foundation models operate from a single base that branches to multiple clinical tasks and data modalities, in contrast to isolated narrow AI modules, each trained for one specific task.

Architecture and Domain Adaptation: How Foundation Models Are Built for Medicine

The production of a medical foundation model follows a staged technical lifecycle. Understanding these stages is important for evaluating what a given model has and has not been exposed to, and where its capabilities and gaps originate.

Large-scale pre-training. The base model is trained on an extremely large, broad corpus — typically spanning web text, books, code, and scientific literature — using self-supervised objectives such as next-token prediction. This stage produces a model with general language understanding and reasoning capabilities, but no specific clinical alignment. The PaLM 540B model, which underlies the original Med-PaLM, exemplifies this stage.
Instruction tuning. The pre-trained base is further trained on a large collection of tasks formatted as instructions and desired outputs. This improves the model's ability to follow natural-language instructions and generalize to unseen task formats. Flan-PaLM — the instruction-tuned version of PaLM — is the product of this step and the immediate precursor to Med-PaLM.
Domain alignment via instruction prompt tuning. This is a parameter-efficient technique in which a small set of soft prompt tokens, rather than the full model weights, is optimized on a curated collection of medical question-and-answer exemplars. This is how Med-PaLM was produced from Flan-PaLM. Instruction prompt tuning differs from traditional fine-tuning: rather than updating all or most model parameters on a domain-specific dataset, it learns a short prefix of soft prompt vectors that condition the model's behavior at inference. The underlying model weights stay frozen.
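The idea behind the third stage can be illustrated with a deliberately tiny toy. The sketch below is not the Med-PaLM training code: the "model" is a fixed weight vector with mean pooling and a logistic readout, and every name in it (FROZEN_W, predict, tune_prompt, EXAMPLES) is a hypothetical stand-in. It only shows the mechanism, namely that gradient updates touch the prefix tokens while the base weights are never written to.

```python
import math

# Frozen "base model": a fixed weight vector standing in for pre-trained weights.
# The toy model mean-pools all token embeddings and applies a logistic readout.
FROZEN_W = [0.8, -0.5, 0.3]


def predict(soft_prompt, input_tokens, w=FROZEN_W):
    """Probability of the positive label given prompt + input token embeddings."""
    tokens = soft_prompt + input_tokens            # learned tokens are prefixed
    pooled = [sum(t[i] for t in tokens) / len(tokens) for i in range(len(w))]
    logit = sum(wi * pi for wi, pi in zip(w, pooled))
    return 1.0 / (1.0 + math.exp(-logit))


def loss(soft_prompt, examples):
    """Mean binary cross-entropy over (input_tokens, label) exemplars."""
    total = 0.0
    for tokens, y in examples:
        p = predict(soft_prompt, tokens)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(examples)


def tune_prompt(soft_prompt, examples, lr=0.5, steps=200):
    """Gradient descent on the soft prompt only; FROZEN_W is never modified."""
    prompt = [list(tok) for tok in soft_prompt]    # work on a copy
    for _ in range(steps):
        grads = [[0.0] * len(FROZEN_W) for _ in prompt]
        for tokens, y in examples:
            n = len(prompt) + len(tokens)
            err = predict(prompt, tokens) - y      # dLoss/dlogit for cross-entropy
            for g in grads:
                for i, wi in enumerate(FROZEN_W):
                    g[i] += err * wi / n / len(examples)
        for tok, g in zip(prompt, grads):
            for i in range(len(tok)):
                tok[i] -= lr * g[i]
    return prompt


# Made-up exemplars: one input token each, with a binary label.
EXAMPLES = [
    ([[0.2, 1.0, 0.1]], 1),
    ([[0.1, 0.8, 0.0]], 1),
    ([[1.0, 0.1, 0.3]], 0),
]
INITIAL_PROMPT = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # two soft prompt tokens
tuned = tune_prompt(INITIAL_PROMPT, EXAMPLES)
print("loss before:", round(loss(INITIAL_PROMPT, EXAMPLES), 3))
print("loss after: ", round(loss(tuned, EXAMPLES), 3))
```

In this toy the prompt can only shift the model's output in one direction for all inputs, which is a fair reminder that a small learned prefix steers a frozen model rather than teaching it new knowledge.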

The distinction between instruction prompt tuning and traditional fine-tuning matters for clinical evaluation. Traditional fine-tuning modifies the model substantially and may overfit to a narrow dataset; instruction prompt tuning is more parameter-efficient and preserves the base model's general capabilities while steering its outputs toward the target domain. Neither approach guarantees clinical accuracy or safety — those properties require empirical evaluation against clinical benchmarks and, ultimately, prospective validation.
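To give a sense of what "parameter-efficient" can mean in practice, here is a back-of-the-envelope calculation. The soft prompt length and embedding width below are illustrative assumptions, not figures stated in this article; only the 540B parameter count comes from the model's name.

```python
# Illustrative assumptions (not taken from this article):
SOFT_PROMPT_LENGTH = 100        # assumed number of learned prompt tokens
EMBEDDING_DIM = 18_432          # assumed embedding width of the base model
BASE_MODEL_PARAMS = 540e9       # "540B", from the model's name

# Each soft prompt token is one learned vector of size EMBEDDING_DIM.
prompt_params = SOFT_PROMPT_LENGTH * EMBEDDING_DIM
fraction = prompt_params / BASE_MODEL_PARAMS

print(f"trainable prompt parameters: {prompt_params:,}")   # 1,843,200
print(f"share of base model size:    {fraction:.2e}")      # 3.41e-06
```

Under these assumptions the learned prefix amounts to a few millionths of the base model's size, which is why the adaptation step is cheap relative to full fine-tuning.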

For Med-PaLM 2, the adaptation strategy was extended: the base model was upgraded to PaLM 2, medical domain fine-tuning was applied more extensively, and ensemble refinement and chain-of-retrieval techniques were incorporated to improve accuracy on complex clinical questions. These additions produced the substantial benchmark improvement documented in the subsequent Nature Medicine publication.
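One plausible reading of ensemble refinement is sketched below: sample several candidate answers, show them back to the model and ask for a refined answer, then take a plurality vote over the refined outputs. This is an assumption-laden illustration, not the published implementation. The generate callable, the prompt wording, and the stub model are all hypothetical.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical model interface: takes a prompt string, returns an answer string.
Generate = Callable[[str], str]


def ensemble_refine(question: str, generate: Generate,
                    n_samples: int = 5, n_refinements: int = 3) -> str:
    # Stage 1: collect several independent candidate answers.
    candidates: List[str] = [generate(question) for _ in range(n_samples)]

    # Stage 2: condition the model on its own candidates and ask it to refine.
    context = "\n".join(f"Candidate {i + 1}: {c}" for i, c in enumerate(candidates))
    refine_prompt = f"{question}\n{context}\nRefined answer:"
    refined = [generate(refine_prompt) for _ in range(n_refinements)]

    # Final answer: plurality vote over the refined outputs.
    return Counter(refined).most_common(1)[0][0]


def make_stub_model(answers: List[str]) -> Generate:
    """Deterministic stand-in for a sampling LLM, for demonstration only."""
    state = {"i": 0}

    def generate(prompt: str) -> str:
        lines = [ln for ln in prompt.splitlines() if ln.startswith("Candidate ")]
        if lines:
            # Refinement call: return the answer the candidates mention most.
            seen = [ln.split(": ", 1)[1] for ln in lines]
            return Counter(seen).most_common(1)[0][0]
        answer = answers[state["i"] % len(answers)]
        state["i"] += 1
        return answer

    return generate


stub = make_stub_model(["B", "B", "C", "B", "A"])
final_answer = ensemble_refine("Which option is most appropriate: A, B, or C?", stub)
print(final_answer)  # B
```

The design point is that the second stage lets the model weigh its own varied attempts rather than relying on a single sample, at the cost of several extra model calls per question.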

The Med-PaLM Progression: From USMLE Threshold to Expert-Level Performance

The Med-PaLM research lineage spans three distinct model generations, each representing a measurable advance in benchmark performance and evaluation methodology. The progression is documented across peer-reviewed publications in Nature (2023) and Nature Medicine (2025).

Med-PaLM model progression: benchmark performance and key evaluation findings. Research-setting performance does not imply prospective clinical readiness.
Model	Base LLM	MedQA Accuracy	Key Evaluation Finding	Publication
Flan-PaLM (pre-alignment)	PaLM 540B + instruction tuning	67.6%	First AI system to surpass the USMLE-style passing threshold (>60%) on MedQA	Nature, 2023
Med-PaLM	Flan-PaLM + instruction prompt tuning	67.6%	Reduced likelihood-of-harm ratings from 29.7% to 5.9%; improved scientific consensus alignment to 92.6%	Nature, 2023
Med-PaLM 2	PaLM 2 + medical fine-tuning + ensemble refinement + chain of retrieval	86.5%	Preferred by physicians over physician answers on 8 of 9 clinical utility axes in pairwise evaluation of 1,066 consumer medical questions	Nature Medicine, 2025

The improvement of roughly 19 percentage points from Med-PaLM to Med-PaLM 2 (67.6% to 86.5% on MedQA) reflects multiple simultaneous changes: a more capable base model (PaLM 2), more extensive medical domain fine-tuning, ensemble refinement across multiple model outputs, and chain-of-retrieval augmentation that allows the model to draw on relevant passages when formulating answers. The pairwise physician preference study covered 1,066 consumer medical questions; physicians preferred Med-PaLM 2 answers over answers written by other physicians on eight of nine clinical utility axes, and the reported preference was statistically significant (p < 0.001).
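The size of the MedQA gain can be checked directly from the two accuracy figures in the table above. The short calculation below also expresses the gain as a relative reduction in error rate, which is an additional framing introduced here for illustration and not a figure reported in the source.

```python
med_palm_medqa = 67.6      # % accuracy, from the progression table
med_palm2_medqa = 86.5     # % accuracy, from the progression table

point_gain = med_palm2_medqa - med_palm_medqa

# Error rate = share of questions answered incorrectly.
error_before = 100 - med_palm_medqa
error_after = 100 - med_palm2_medqa
relative_error_reduction = (error_before - error_after) / error_before

print(f"absolute gain: {point_gain:.1f} percentage points")          # 18.9
print(f"relative error reduction: {relative_error_reduction:.1%}")   # 58.3%
```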

A separate bedside consultation pilot compared Med-PaLM 2 responses against those of both generalist physicians and specialist physicians. Specialists preferred Med-PaLM 2 over generalist physician answers approximately 65% of the time. However, specialist physician answers remained the overall preferred response in that pilot — Med-PaLM 2 did not surpass specialist-level performance in that setting. Both specialists and generalists rated Med-PaLM 2 responses as comparably safe to physician responses.

Evaluation Methodology: Why Benchmark Accuracy Is Insufficient

The Med-PaLM research team introduced MultiMedQA as a composite benchmark specifically designed to assess medical AI systems across a wider range of clinical question types than any single exam dataset can cover. MultiMedQA spans seven datasets representing distinct question sources and formats.

MedQA (USMLE-style multiple-choice questions from US medical licensing exams)
MedMCQA (multiple-choice questions from Indian medical entrance exams)
PubMedQA (biomedical research literature questions requiring yes/no/maybe reasoning)
MMLU clinical topics (college medicine, clinical knowledge, medical genetics, anatomy, professional medicine)
LiveQA (consumer medical questions submitted to the National Library of Medicine)
MedicationQA (medication-related consumer questions)
HealthSearchQA (medical questions commonly searched online — introduced by the Med-PaLM team)

Multiple-choice accuracy on these datasets provides a useful signal but is structurally limited: it measures whether a model selects the correct answer from a predefined option set, not whether it generates safe, accurate, and clinically useful responses in open-ended contexts. To address this, the Med-PaLM team developed a human evaluation framework in which both physicians and lay readers assessed model responses across multiple axes.
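The structural limit of multiple-choice scoring is easy to see once the metric is written down. The sketch below uses a hypothetical item format and placeholder questions; it is not the MultiMedQA evaluation harness. Note that the scoring function compares option letters only and never inspects any explanation the model produced.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MultipleChoiceItem:
    question: str
    options: Dict[str, str]   # option letter -> option text
    answer_key: str           # letter of the correct option


def multiple_choice_accuracy(items: List[MultipleChoiceItem],
                             predictions: List[str]) -> float:
    """Fraction of items where the predicted letter equals the answer key."""
    correct = sum(1 for item, pred in zip(items, predictions)
                  if pred == item.answer_key)
    return correct / len(items)


# Placeholder items; the content is deliberately non-clinical.
placeholder_options = {"A": "option A", "B": "option B",
                       "C": "option C", "D": "option D"}
items = [
    MultipleChoiceItem("Placeholder question 1", placeholder_options, "A"),
    MultipleChoiceItem("Placeholder question 2", placeholder_options, "C"),
    MultipleChoiceItem("Placeholder question 3", placeholder_options, "B"),
    MultipleChoiceItem("Placeholder question 4", placeholder_options, "D"),
]
predictions = ["A", "C", "D", "D"]   # third prediction is wrong

accuracy = multiple_choice_accuracy(items, predictions)
print(accuracy)  # 0.75
```

A model that reached the right letter through unsound reasoning scores the same as one that reasoned correctly, which is the gap the human evaluation axes are meant to cover.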

Human evaluation axes used in the Med-PaLM assessment framework, beyond multiple-choice benchmark accuracy.
Evaluation Axis	Assessed By	What It Measures
Factuality	Physicians	Whether the response accurately reflects established medical knowledge
Comprehension	Physicians	Whether the model correctly understood the question being asked
Reasoning	Physicians	Whether the logical steps leading to the answer are sound
Possible harm	Physicians	Whether the response could cause harm if followed by a patient or clinician
Bias	Physicians	Whether the response reflects demographic, cultural, or other systematic bias
Lay helpfulness	Lay readers	Whether the response is understandable and useful to a non-specialist

This multi-axis approach revealed improvements that single-number accuracy scores would obscure. Instruction prompt tuning (Med-PaLM) increased alignment with scientific consensus from 61.9% (Flan-PaLM without prompt tuning) to 92.6%, and reduced the proportion of responses rated as likely to cause harm from 29.7% to 5.9%. These are meaningful clinical safety signals, but they were measured on a curated research dataset, by physician raters, under controlled conditions. They do not establish how the model performs in live clinical environments on the full distribution of patient queries.
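Percentages like these are per-axis proportions over individual rater judgments. A minimal sketch of that aggregation follows, assuming a simple (axis, flagged) record format. The axis names, record layout, and ratings are made up for illustration and do not reproduce the study's data or rubric.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def axis_rates(ratings: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Percentage of ratings marked True for each evaluation axis."""
    counts = defaultdict(lambda: [0, 0])     # axis -> [true_count, total]
    for axis, flagged in ratings:
        counts[axis][0] += int(flagged)
        counts[axis][1] += 1
    return {axis: 100.0 * true / total for axis, (true, total) in counts.items()}


# Hypothetical ratings: four judgments per axis.
ratings = [
    ("possible_harm", False), ("possible_harm", True),
    ("possible_harm", False), ("possible_harm", False),
    ("consensus_aligned", True), ("consensus_aligned", True),
    ("consensus_aligned", True), ("consensus_aligned", False),
]

rates = axis_rates(ratings)
print(rates)  # {'possible_harm': 25.0, 'consensus_aligned': 75.0}
```

Keeping the axes separate, rather than averaging them into one score, is what lets a safety regression show up even when accuracy is unchanged.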

For clinicians and administrators evaluating any foundation model, the practical implication is this: when a vendor or publication cites a benchmark accuracy figure, that figure describes performance on a specific dataset under specific conditions. The multi-axis human evaluation framework used by the Med-PaLM team is a more informative — though still research-setting — approach. Neither replaces prospective real-world validation. For a broader discussion of AI evidence standards in healthcare, see what the clinical evidence actually shows for generative AI in 2026.

Multimodal Extension: Med-PaLM M and the Generalist Biomedical AI Approach

Med-PaLM M extends the foundation model concept from text-only clinical question answering to a single model capable of processing and generating outputs across multiple biomedical data types simultaneously. Built on the PaLM-E architecture, Med-PaLM M uses one set of model weights to handle 14 diverse biomedical tasks evaluated under the MultiMedBench benchmark framework.

MultiMedBench tasks span clinical language (medical question answering, clinical note summarization), imaging interpretation (chest X-ray report generation, mammography classification, dermatology image interpretation, radiology visual question answering, pathology slide classification), and genomics (genomic variant calling). The model reaches performance competitive with or exceeding prior state-of-the-art on all MultiMedBench tasks as reported in the Med-PaLM M preprint.

Three Architectural Approaches to Multimodal Medical AI

Building a model that handles both language and imaging requires architectural decisions with significant trade-offs. The Google Research team identified three distinct approaches in their multimodal medical AI overview:

[Figure: three-column schematic comparing tool use (API connections to specialist subsystems), model grafting (an adapter bridge layer), and a single generalist model with multiple input types.] The three principal architectural approaches to multimodal medical AI: tool use (modular, high auditability), model grafting (adapter-based, reuses validated encoders), and generalist systems (single unified model, highest flexibility and compute cost).

Architectural trade-offs across the three approaches to multimodal medical AI. Med-PaLM M exemplifies the generalist system approach.
Architecture	How It Works	Key Advantages	Key Trade-offs
Tool use	LLM calls specialist subsystems (e.g., an imaging classifier) via API; results are returned as text	High auditability; each component can be validated independently; modular updates	Information loss at API interface; coordination complexity; dependent on subsystem quality
Model grafting	A specialist encoder (e.g., a radiology image encoder) maps its output through an adapter layer into the LLM's embedding space	Reuses existing validated specialist models; modest additional compute cost	Communication between components is not human-readable; adapter introduces a new failure point
Generalist system	A single unified model natively processes all modalities using one set of weights (e.g., Med-PaLM M on PaLM-E)	Maximum flexibility; no interface bottlenecks; tasks can share learned representations	Highest computational cost; debuggability trade-offs; harder to isolate errors by modality
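The difference between the first two rows of the table can be sketched in a few lines. Everything below is hypothetical: the function names, the stub classifier and LLM, and the tiny adapter matrix are placeholders chosen to show where information crosses the interface, not a real system's API.

```python
from typing import Callable, List


def tool_use_answer(question: str, image: bytes,
                    classify_image: Callable[[bytes], str],
                    llm: Callable[[str], str]) -> str:
    """Tool use: the specialist's result crosses the interface as plain text."""
    finding = classify_image(image)                    # human-readable label
    prompt = f"Imaging finding: {finding}\nQuestion: {question}"
    return llm(prompt)


def graft_embedding(encoder_output: List[float],
                    adapter: List[List[float]]) -> List[float]:
    """Model grafting: a linear adapter maps encoder features into the
    LLM's embedding space (a matrix-vector product, not readable text)."""
    return [sum(w * x for w, x in zip(row, encoder_output)) for row in adapter]


# Stubs standing in for real components.
stub_classifier = lambda image: "placeholder-finding"
stub_llm = lambda prompt: f"[draft] {prompt}"

draft = tool_use_answer("What does the image show?", b"\x00", stub_classifier, stub_llm)
print(draft)

# A 2x3 adapter projecting a 3-dimensional encoder output to 2 dimensions.
adapter = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 1.0]]
projected = graft_embedding([0.5, 1.0, 2.0], adapter)
print(projected)  # [0.5, 3.0]
```

In the tool-use path the intermediate string can be logged and audited, but anything the classifier knew beyond that label is lost. In the grafting path richer features reach the language model, but the intermediate vector is not something a reviewer can read.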

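In a retrospective study of 246 chest X-ray cases, clinicians expressed a pairwise preference for Med-PaLM M-generated reports over radiologist-generated reports in up to 40.5% of comparisons. This finding is frequently cited as a headline result, but its framing requires precision: 40.5% is the upper end of the reported range, so the model's report was not the preferred one in the majority of comparisons, and the study was retrospective rather than a prospective test in clinical practice.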

Commercial Deployment: MedLM on Google Cloud — Intended Uses and Hard Prohibitions

MedLM is the commercially deployed product powered by Med-PaLM 2, available through Google Cloud Vertex AI. According to the most recent available documentation, it has general-availability status but is offered only to a limited group of customers in the United States, Brazil, and Singapore; it is not open to all Google Cloud users.

The MedLM documentation defines both the permitted use cases and the explicit prohibitions. Administrators and procurement teams evaluating MedLM should treat these as hard operational boundaries, not advisory guidance.

Explicitly Permitted Uses

Long-form question answering (e.g., detailed responses to clinical knowledge questions)
Multiple-choice question answering (e.g., exam-style or structured clinical queries)
Summarization of existing documentation (e.g., after-visit summaries, history and physical notes, discharge summaries)

All permitted uses require that a qualified human review, edit, and approve outputs before any use. MedLM is explicitly positioned as a drafting and synthesis tool, not an autonomous decision-maker.

Explicitly Prohibited Uses

Clinical diagnosis of any kind
Direct patient use (patients may not interact with MedLM outputs without clinician review)
Deployment as Software as a Medical Device (SaMD)
Any use that requires clearance or approval from a medical device regulatory agency
Providing medical advice to patients
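For teams that want to enforce these boundaries in software rather than by policy memo alone, the lists above can be encoded as a pre-flight check. The sketch below is a hypothetical gate an institution might write for itself. It is not part of MedLM or any Google Cloud API, and the use-case identifiers are invented labels for the categories listed above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Invented identifiers mirroring the permitted and prohibited lists above.
PERMITTED_USES = {
    "long_form_qa",
    "multiple_choice_qa",
    "documentation_summarization",
}
PROHIBITED_USES = {
    "clinical_diagnosis",
    "direct_patient_use",
    "samd_deployment",
    "regulated_device_use",
    "patient_medical_advice",
}


@dataclass
class DraftRequest:
    use_case: str
    reviewer: Optional[str] = None   # clinician who will review, edit, approve


def check_request(request: DraftRequest) -> Tuple[bool, str]:
    """Return (allowed, reason). Unknown use cases are denied by default."""
    if request.use_case in PROHIBITED_USES:
        return False, "prohibited use"
    if request.use_case not in PERMITTED_USES:
        return False, "use case not on the permitted list"
    if not request.reviewer:
        return False, "no qualified human reviewer assigned"
    return True, "ok"


print(check_request(DraftRequest("documentation_summarization", reviewer="dr_example")))
print(check_request(DraftRequest("documentation_summarization")))
print(check_request(DraftRequest("clinical_diagnosis", reviewer="dr_example")))
```

Denying anything not explicitly permitted, and refusing permitted uses that lack a named reviewer, reflects the requirement that a qualified human approves every output.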

An important architectural context: Med-PaLM and Med-PaLM 2 were built on Google's PaLM model family. As of mid-2026, Google has transitioned its primary emphasis toward Gemini-based successors, and MedLM product documentation has begun migrating toward the Gemini Enterprise Agent Platform. The Med-PaLM lineage remains the primary published research case study for the foundation model paradigm in medicine, and its architectural concepts remain foundational — but the active commercial product ecosystem has evolved beyond the PaLM-based models specifically.

Unresolved Clinical Limitations and Readiness Gaps

The gap between research-setting benchmark performance and clinical deployment readiness is not a marketing caveat — it is a substantive technical and regulatory reality. The following limitations apply to current foundation models in healthcare, including the Med-PaLM family, and are documented in the primary research publications and deployment documentation.

Hallucination of plausible medical misinformation. Foundation models generate fluent, confident-sounding text that may contain factual errors, fabricated citations, or clinical recommendations inconsistent with current evidence. In a medical context, a plausible but incorrect statement about drug interactions, dosing, or contraindications carries direct patient safety implications. Human review before any clinical use is not optional — it is the minimum operational safeguard.
Demographic bias amplification. Foundation models trained on existing biomedical literature and clinical data inherit the demographic skews of those sources. MedLM documentation acknowledges that performance may differ across demographic groups. This concern is particularly acute for Med-PaLM M's dermatology modality, given documented underrepresentation of darker skin tones in dermatology AI training datasets. For a detailed examination of this issue, see the site's analysis of AI dermatology tools and skin tone bias.
Inability to reflect time-varying medical consensus. A foundation model reflects the state of medical knowledge at the time its training data was assembled. It does not update as clinical guidelines change, new drug safety data emerges, or treatment protocols are revised. This is noted explicitly in both the original Nature (2023) Med-PaLM paper and the MedLM product documentation. The model cannot warn a user that its knowledge on a given topic may be outdated.
Absence of prospective real-world clinical validation. All published Med-PaLM performance data, including the MedQA benchmarks, physician preference rankings, and the Med-PaLM M chest X-ray study, come from retrospective analyses or controlled research settings. No prospective randomized clinical trial has established the safety or efficacy of any Med-PaLM model in a live clinical environment. The evidence standards applicable to clinical AI tools require prospective validation; the Med-PaLM family has not yet met that bar. For a broader discussion of evidence standards, see what the clinical evidence shows for AI in healthcare.
No SaMD regulatory clearance. None of Med-PaLM, Med-PaLM 2, Med-PaLM M, or MedLM holds FDA clearance as a Software as a Medical Device. MedLM explicitly prohibits use in any context that would require such clearance. Clinical deployment for diagnostic or treatment decision support would require regulatory authorization that does not currently exist for these products.

Further work on validation and alignment to human values is necessary as the technology finds broader uptake in real-world applications. — Med-PaLM 2 research team, Nature Medicine (2025)

For clinicians and administrators evaluating whether foundation model-based tools fit specific clinical workflows, the practical question is not whether benchmark performance is impressive — it is whether the specific intended use falls within documented permitted boundaries, whether human review is operationally guaranteed, and whether the institution has mechanisms to detect and correct model errors before they affect patient care. For a workflow-level perspective on how AI capabilities map onto specific clinical settings, see the site's clinical applications overview across key medical domains.

The following terms appear frequently in literature and documentation related to foundation models in healthcare. Each is distinct from "foundation model" in a way that matters for clinical and regulatory evaluation.

Large language model (LLM). A model class defined by scale (parameter count, training data volume) and text-based input/output. All current medical foundation models based on language are LLMs, but not all LLMs are foundation models in the sense of being designed for broad cross-task reuse. The terms are related but not interchangeable.
Instruction tuning. A training procedure that exposes a pre-trained model to a large collection of instruction-formatted tasks to improve its ability to follow natural-language directions. Flan-PaLM is the instruction-tuned version of PaLM; instruction tuning precedes domain alignment in the Med-PaLM production chain.
Fine-tuning. A training procedure that updates model weights on a domain-specific dataset. Traditional fine-tuning updates all or most model parameters and typically requires more domain-specific training data, whereas instruction prompt tuning learns only a small soft prompt and leaves the model weights in place. Med-PaLM 2 incorporated medical domain fine-tuning as one of several adaptation steps.
Software as a Medical Device (SaMD). An FDA regulatory classification for software intended to perform a medical function without being part of a hardware medical device. MedLM explicitly prohibits use as SaMD. Any clinical AI tool that informs, supports, or makes diagnostic or treatment decisions may require SaMD classification and FDA clearance.
MultiMedQA. The composite benchmark introduced by the Med-PaLM team spanning seven medical question-answering datasets. Distinct from MultiMedBench, which is the multimodal task suite used to evaluate Med-PaLM M across imaging, genomics, and language tasks.
Hallucination. In the context of LLMs, the generation of factually incorrect, fabricated, or internally inconsistent content presented with apparent confidence. In clinical settings, hallucination is a patient safety concern, not merely an accuracy problem. It is not fully preventable by prompt engineering or retrieval augmentation in current systems.
Multimodal AI. AI systems capable of processing and generating content across multiple data types — text, images, structured data, audio — within a single model or coordinated system. Med-PaLM M is the primary multimodal foundation model reference case in biomedical AI research as of this writing.