Why Skin Tone Bias in AI Dermatology Matters Clinically

Dermatology AI tools are increasingly positioned as a solution to a genuine access problem: the global shortage of dermatologists means that many patients — particularly in lower-resource settings — will encounter algorithmic triage before they encounter a specialist. In that context, a tool that performs well for some patients and poorly for others does not simply underperform. It actively redistributes diagnostic quality in ways that compound existing disparities.

The clinical stakes are concrete. CDC data covering approximately 2011 to 2015 show five-year melanoma survival in the United States of about 66% for Black patients, compared with about 90% for non-Hispanic White patients. That gap is not primarily explained by biology. It reflects later-stage diagnosis, differential access to dermatologic care, and documented underrecognition of melanoma presentations on darker skin. These are problems that an AI tool trained predominantly on lighter-skin images will replicate and may amplify at scale.

Non-Hispanic White populations do have higher melanoma incidence overall, but incidence and outcome are separate questions. Darker-skinned patients who develop melanoma face a meaningfully worse prognosis, and the mechanism most amenable to intervention is earlier, more accurate diagnosis. An AI dermatology tool that fails on darker skin tones at the detection stage is therefore failing at the precise point where clinical intervention has the most impact.

This article synthesizes the peer-reviewed evidence quantifying the AI diagnostic performance gap across skin tones, examines the structural causes rooted in training dataset composition, and evaluates the remediation strategies that have published support. For readers who want broader context on how AI performs in clinical diagnosis generally, the site's analysis of AI in medical diagnosis provides the wider evidentiary landscape.

Figure: A conceptual representation of the AI diagnostic performance gradient across the Fitzpatrick skin tone spectrum, drawn as five dermoscopic examination zones running from light to dark. The dimming of the scanning arc over darker skin zones reflects the documented AUROC gap between Fitzpatrick I–III and IV–VI populations.

Quantifying the Performance Gap: What the Evidence Shows

Two primary sources anchor the quantitative evidence base. A 2025 PRISMA-compliant systematic review and meta-analysis by Tjiu and Lu, published in Medicina and registered on PROSPERO (CRD420251184280), synthesized data from 18 studies covering more than 70,000 test images published between 2020 and 2025. The overall pooled AUROC across those studies was 0.88 (95% CI 0.87–0.90), with GRADE certainty of evidence rated moderate.

The skin-tone-stratified subgroup analysis was possible for only 6 of the 18 included studies, because the remaining 12 did not report outcomes by Fitzpatrick skin type. Among those 6 studies, the pooled AUROC was 0.89 for Fitzpatrick types I–III and 0.82 for types IV–VI, a difference of −0.07 that reached statistical significance (p<0.01). The gap figure therefore rests on the 6-study stratified subset, not on the full 18-study pool, and moderate heterogeneity was present across those studies.
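To make concrete what "stratified by Fitzpatrick skin type" means at the level of a single study, the following is a minimal sketch that computes AUROC separately per skin-tone group. It is not the meta-analysis's pooling method. The function names, the group labels, and the toy records are all assumptions made for illustration, and the records are invented, not drawn from any cited study.

```python
from collections import defaultdict

def auroc(labels, scores):
    """Rank-based AUROC: chance that a random positive case scores higher
    than a random negative case, with ties counted as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def stratified_auroc(records):
    """records: iterable of (fst_group, label, score). Returns {group: AUROC}."""
    by_group = defaultdict(lambda: ([], []))
    for group, label, score in records:
        by_group[group][0].append(label)
        by_group[group][1].append(score)
    return {g: auroc(ys, ss) for g, (ys, ss) in by_group.items()}

# Invented toy records (group, 1 = malignant, model score); illustration only.
records = [
    ("I-III", 1, 0.9), ("I-III", 1, 0.8), ("I-III", 0, 0.3), ("I-III", 0, 0.2),
    ("IV-VI", 1, 0.7), ("IV-VI", 1, 0.4), ("IV-VI", 0, 0.5), ("IV-VI", 0, 0.1),
]
result = stratified_auroc(records)
gap = result["IV-VI"] - result["I-III"]  # negative means worse on darker skin
print(result, gap)
```

A study that reports only the AUROC over all records combined cannot produce the per-group figures, which is why 12 of the 18 studies could not contribute to the subgroup comparison.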

The second anchor is the 2022 Diverse Dermatology Images (DDI) benchmark study published in Science Advances by Daneshjou and colleagues. The DDI dataset consists of 656 biopsy-confirmed images from Stanford Clinics spanning Fitzpatrick types I through VI — at the time, one of the few publicly available dermatology benchmarks with verified skin-tone metadata. The performance disparities it revealed were substantial.

DDI benchmark performance by Fitzpatrick skin type. Sensitivity figures are for malignancy detection; AUC figures are from the HAM10000-trained model evaluated on DDI. Source: Daneshjou et al., Science Advances 2022.
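Model | Sensitivity FST I–II | Sensitivity FST V–VI | AUC FST I–II | AUC FST V–VI
DeepDerm | 0.69 | 0.23 | not listed | not listed
ModelDerm | 0.41 | 0.12 | not listed | not listed
HAM10000 | not listed | not listed | 0.72 | 0.57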

The DeepDerm sensitivity figure for FST V–VI (0.23) is particularly striking: a tool detecting fewer than one in four malignant lesions on dark skin would fail the basic clinical threshold for a screening tool. The Fisher's exact p-value for the DeepDerm sensitivity difference was 5.65×10⁻⁶, and for ModelDerm 0.0025, indicating these gaps are not statistical artifacts.
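For readers who want to see how a Fisher's exact p-value of this kind is obtained, here is a minimal standard-library sketch of the two-sided test on a 2x2 table of detected versus missed malignancies. The counts below are hypothetical and chosen only for illustration; they are not the DDI study's counts, and the function name is an assumption.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum over every table that is no more likely than the observed one;
    # the small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical counts: rows are skin-tone groups, columns are detected / missed.
lighter_detected, lighter_missed = 9, 1
darker_detected, darker_missed = 2, 8
p_value = fisher_exact_two_sided(lighter_detected, lighter_missed,
                                 darker_detected, darker_missed)
print(f"two-sided p = {p_value:.4f}")
```

The exact test is a natural choice for this kind of comparison because subgroup counts in skin-tone-stratified benchmarks are often small, where chi-squared approximations become unreliable.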

A 2025 conference abstract from the Journal of Investigative Dermatology (Kadam et al., Stony Brook Medicine/UVA, JID abstract LB1196) corroborates the DDI AUC values in a scoping review context, reporting FST V–VI versus FST I–II scores of 0.55 versus 0.64 for DeepDerm and 0.50 versus 0.61 for ModelDerm. This source should be read as corroborating rather than independently establishing these figures — it is a conference abstract, not a full peer-reviewed publication.

The 2025 meta-analysis also provides setting-stratified data showing how skin-tone bias can compound with deployment context. AUROC was 0.90 in specialist settings, 0.85 in community care, and 0.81 for smartphone-based deployment. Patients with darker skin tones who reach care through community or mobile settings therefore face both performance penalties at once.

Root Causes: How Training Datasets Became Skin-Tone Biased

The performance gap documented above is not a consequence of inherent limitations in AI technology applied to dark skin. It is a consequence of specific, traceable decisions — and non-decisions — in how dermatology training datasets were assembled, labeled, and reported.

Figure: A schematic representation of training dataset imbalance, drawn as a stacked bar with approximately 80% of training data from lighter skin tones and approximately 20% from darker skin tones. The skewed composition of major dermatology AI datasets, with lighter-skin images dominating, directly produces the performance disparities documented in the DDI benchmark and the 2025 meta-analysis.

Dataset Composition: ISIC and HAM10000

The International Skin Imaging Collaboration (ISIC) dataset — with more than 50,000 images and used in the majority of dermatology AI training pipelines — contains no skin type metadata in its standard form. When researchers have applied Individual Typology Angle (ITA)-based inference to estimate skin tone distribution, the results show a heavily skewed distribution toward lighter skin tones. The ISIC collection was assembled predominantly from dermatology centers in Europe and Australia, populations that are disproportionately lighter-skinned relative to the global patient population that AI tools will eventually serve.
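For context on what ITA-based inference computes, the sketch below applies the standard Individual Typology Angle formula, ITA = arctan((L* − 50) / b*) × 180/π, to CIELAB colour values and bins the result using the category cut-offs commonly quoted in the skin-colour literature. It assumes L* and b* have already been measured from a patch of non-lesional skin. How that patch is selected, and how ITA categories are mapped onto Fitzpatrick types, varies between studies and is not shown here. The function names are illustrative.

```python
import math

# Commonly quoted ITA cut-offs in degrees: (lower bound, category name).
ITA_CATEGORIES = [
    (55.0, "very light"),
    (41.0, "light"),
    (28.0, "intermediate"),
    (10.0, "tan"),
    (-30.0, "brown"),
]

def individual_typology_angle(l_star, b_star):
    """ITA in degrees from CIELAB L* and b*.

    atan2 equals atan((L* - 50) / b*) whenever b* > 0, which is the usual
    case for skin, and it avoids a division by zero when b* is 0.
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))

def ita_category(ita_degrees):
    """Map an ITA value to its category; anything at or below -30 is 'dark'."""
    for lower_bound, name in ITA_CATEGORIES:
        if ita_degrees > lower_bound:
            return name
    return "dark"

# Illustrative CIELAB values, not measurements from any dataset.
for l_star, b_star in [(65.0, 15.0), (50.0, 20.0), (35.0, 15.0)]:
    ita = individual_typology_angle(l_star, b_star)
    print(f"L*={l_star}, b*={b_star}: ITA={ita:.1f} deg, {ita_category(ita)}")
```

Because the angle depends entirely on pixel colour, it is sensitive to lighting, camera calibration, and which pixels are sampled, which is one reason inferred skin-tone distributions should be read as estimates.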

HAM10000, another widely used training dataset, similarly lacks skin type metadata and is dominated by Fitzpatrick I–III images. The absence of metadata is not a minor administrative gap — it means that researchers cannot audit these datasets for skin-tone representation without applying indirect inference methods, and it means that training pipelines built on these datasets proceed without any mechanism for detecting or correcting the imbalance.

The Reporting Gap

The dataset composition problem is compounded by a near-total absence of demographic reporting in published dermatology AI research. A scoping review of 136 dermatology AI studies by Guo and colleagues (cited in Kleinberg et al., Journal of Biomedical Research 2022) found that only 8.82% (12 of 136) disclosed the race or ethnicity of source image participants, and only 4.41% (6 of 136) disclosed skin type information. Among all 136 studies, only 2 explicitly included Hispanic individuals, only 1 explicitly included Black patients, and only 1 explicitly included American Indian or Alaska Native patients.

A separate review cited in Badrie (RCSIsmj, 2025) — drawing on Wen and colleagues' 2022 analysis in Lancet Digital Health — found that only approximately 10% of dermatology AI studies reported skin tone data at all. These two figures (8.82% and 10%) come from different review populations and methodologies, but they converge on the same conclusion: the overwhelming majority of published dermatology AI research provides no basis for assessing skin-tone equity.

Label Noise and the Propagation of Human Bias

A less-discussed root cause is the quality of the human-generated labels used to train AI models. The DDI study measured dermatologist visual consensus sensitivity — the accuracy of the human labels that AI models learn from — and found it varied by skin tone. Dermatologist sensitivity for malignancy was 0.72 for FST I–II images versus 0.59 for FST V–VI images (p = 8.8×10⁻⁶). This means that the training signal AI models receive for darker-skin images is measurably noisier and less accurate than the signal they receive for lighter-skin images.

The Stanford STAR-ED (Skin Tone Analysis for Representation in EDucational materials) framework, developed by Roxana Daneshjou and colleagues at Stanford HAI, provides upstream context for this labeler bias. Using machine learning to audit dermatology textbooks, lecture slides, and journal articles, STAR-ED found that only 1 in 10 images in these materials falls in the black-brown range on the Fitzpatrick scale. Physicians who are not adequately trained on darker-skin presentations produce less accurate labels for darker-skin images — and those less accurate labels propagate directly into AI training pipelines.

Clinical Consequences: Delayed Diagnosis and Compounding Disparities

The performance figures above translate directly into clinical harm when these tools are deployed. A tool with DeepDerm-level sensitivity on FST V–VI skin (0.23) operating as a triage or screening tool would miss approximately three in four malignant lesions in darker-skinned patients. In a setting where the tool is positioned as reducing unnecessary specialist referrals, that failure mode would delay diagnosis for the patients who already face the worst melanoma outcomes.
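The arithmetic behind the "three in four" figure is simply the complement of sensitivity. The short sketch below works it through for the two DeepDerm sensitivities reported in the DDI benchmark; the cohort of 100 malignant lesions per group is a hypothetical round number chosen for readability, and the function name is illustrative.

```python
def expected_missed(n_malignant, sensitivity):
    """Expected number of malignant lesions a tool fails to flag."""
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0 and 1")
    return n_malignant * (1.0 - sensitivity)

N_MALIGNANT = 100  # hypothetical cohort size per group

# DeepDerm sensitivities from the DDI benchmark, as quoted in this article.
missed_fst_i_ii = expected_missed(N_MALIGNANT, 0.69)
missed_fst_v_vi = expected_missed(N_MALIGNANT, 0.23)

print(f"FST I-II: about {round(missed_fst_i_ii)} of {N_MALIGNANT} missed")
print(f"FST V-VI: about {round(missed_fst_v_vi)} of {N_MALIGNANT} missed")
```

Under these assumptions the same tool would miss about 31 malignant lesions per 100 on the lightest skin types and about 77 per 100 on the darkest.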

A systematic review by Montoya and colleagues (arXiv 2024, an unreviewed preprint) examined 18 major melanoma detection datasets and found that only 3 include skin tone identification, all using the Fitzpatrick scale, and only 7 include ethnicity metadata. The same review highlighted the pivotal trial for a recently FDA-cleared AI melanoma detection device (DermaSensor, DERM-SUCCESS trial), which enrolled 1,005 patients of whom 97.1% were reported as white and only 1.8% were FST VI. The trial's demographic composition means the regulatory clearance provides no validated performance data for the patient population most at risk of being harmed by a biased tool.

The equity concern is not that AI tools perform imperfectly — all diagnostic tools have performance limits. The concern is that the performance limits are systematically worse for the patient populations that already face the most significant access and outcome disparities, and that current deployment frameworks provide no mechanism for detecting or disclosing this.

Remediation Strategies: What the Evidence Supports

Three categories of remediation have been tested with published results. Their evidence quality varies significantly, and the null result, that one common approach does not close the gap, is as informative as the positive findings.

Dataset Diversification and Fine-Tuning

The DDI benchmark study did not only document the performance gap — it also tested whether fine-tuning on a diverse, skin-tone-balanced dataset could close it. When DeepDerm was fine-tuned on DDI data (656 biopsy-confirmed images spanning FST I–VI), AUROC performance converged substantially: AUROC for FST I–II moved to 0.73–0.77, and AUROC for FST V–VI moved to 0.76–0.78. The gap that had been clinically significant in the baseline model was reduced to within the margin of measurement uncertainty. Notably, fine-tuned models outperformed dermatologist labelers on FST V–VI images (p = 9.33×10⁻⁵), suggesting that a model trained on better-balanced data can exceed the performance of the human experts whose labels it was originally trained on.

This finding has a direct implication: the barrier to improvement is not algorithmic complexity. It is data. A relatively small, carefully curated, biopsy-confirmed dataset spanning the full Fitzpatrick range produced measurable convergence in model performance. The DDI dataset is publicly available and represents a model for how diverse benchmark curation should be approached.

Synthetic Data Augmentation

A 2025 preprint from Munia and Imran at the University of Kentucky introduces DermDiff, a latent diffusion model conditioned on text prompts specifying skin tone (mapped to FST I–II, III–IV, and V–VI) and disease type (benign or malignant). Trained on the Fitzpatrick17k dataset, DermDiff generated 60,000 balanced synthetic dermoscopic images across skin tone and disease class combinations. When downstream classifiers were trained on combined real and synthetic data, they outperformed real-data-only models in F1-score and AUC for darker skin tones on the DDI test set.

Algorithmic Fairness Approaches: A Critical Null Finding

The DDI study tested three robust training methods — GroupDRO, CORAL, and CDANN — applied to DeepDerm with the goal of reducing skin-tone performance disparities through algorithmic means rather than data changes. None of them closed the gap. The study authors concluded that this result "further suggests that the performance limitations lie with the lack of diverse training data" rather than with the training algorithm itself.

This null finding matters for resource allocation. Institutions and researchers investing in algorithmic fairness techniques as a primary remediation strategy for skin-tone bias in dermatology AI should be aware that the published evidence does not support this approach as sufficient. The evidence redirects remediation effort toward dataset diversity as the primary intervention.

Comparison of tested remediation strategies for skin-tone bias in dermatology AI, with evidence basis and key caveats.
Remediation Approach | Evidence Basis | Outcome | Caveats
Fine-tuning on diverse dataset (DDI) | Peer-reviewed (Daneshjou et al., Science Advances 2022) | AUROC gap reduced to within 0.03–0.05; fine-tuned model outperforms dermatologist labelers on FST V–VI | DDI dataset is small (656 images); results may not generalize to all model architectures
Synthetic augmentation (DermDiff) | Preprint (Munia & Imran, arXiv 2025) | Improved F1 and AUC for darker skin tones on DDI test set with real+synthetic training | Unreviewed preprint; single-institution evaluation; limited test sets
Algorithmic fairness methods (GroupDRO, CORAL, CDANN) | Peer-reviewed (Daneshjou et al., Science Advances 2022) | Did NOT close the skin-tone performance gap | Tested on DeepDerm only; may differ for other architectures, but data diversity is the supported primary solution

Reporting and Regulatory Gaps

The performance disparities documented above persist in part because the field lacks the reporting infrastructure to detect them routinely. When roughly 10% or fewer of published dermatology AI studies report skin tone data, and under 9% report patient race or ethnicity, the true scope of bias across the literature is unknown. Studies that do not stratify outcomes by skin type cannot identify a gap even if one exists.
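The way an unstratified metric hides a subgroup gap can be shown with a weighted average. In the sketch below, aggregate sensitivity is the case-weighted mean of the subgroup sensitivities. The 97% / 3% split of malignant cases is a hypothetical, heavily skewed cohort, and pairing it with the DeepDerm sensitivities from the DDI benchmark is purely illustrative; no cited study reports this combination.

```python
def aggregate_sensitivity(groups):
    """groups: list of (share_of_malignant_cases, sensitivity).

    Shares must sum to 1. The result is the sensitivity a study would
    report if it pooled all cases without stratifying by skin tone.
    """
    total_share = sum(share for share, _ in groups)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(share * sens for share, sens in groups)

# Hypothetical cohort composition (share of malignant cases per group).
light_share, dark_share = 0.97, 0.03
# Subgroup sensitivities reused from the DDI benchmark for illustration.
light_sens, dark_sens = 0.69, 0.23

overall = aggregate_sensitivity([(light_share, light_sens),
                                 (dark_share, dark_sens)])
print(f"aggregate sensitivity: {overall:.3f}")
print(f"darker-skin subgroup sensitivity: {dark_sens:.2f}")
```

Under these assumptions the pooled figure comes out at about 0.68, almost indistinguishable from the lighter-skin value of 0.69, while the darker-skin subgroup sits at 0.23. A reader of the aggregate number alone would have no way to detect the disparity.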

The regulatory pathway does not currently require correction. There is no mandatory Fitzpatrick-stratified subgroup reporting requirement for FDA SaMD clearance of AI-enabled dermatology devices. A tool can be cleared based on aggregate performance metrics that mask substantial disparities in subgroup performance — as the DermaSensor pivotal trial enrollment data illustrates. For readers who need the full regulatory context, the site's overview of current regulatory frameworks governing AI in healthcare covers the current state of FDA and international SaMD guidance in detail.

Existing reporting frameworks could close this gap if applied. The CONSORT-AI extension to the CONSORT reporting standard for randomized trials, and the SPIRIT-AI extension for trial protocols, both include requirements for reporting demographic composition and subgroup outcomes. CONSORT-AI compliance would, at minimum, make skin-tone underrepresentation visible in published research. The STAR-ED framework provides a practical tool for auditing skin-tone representation in educational and training materials, and is open-source and scalable to PDFs, slides, and journal articles.

  • Approximately 10% of dermatology AI studies report skin tone data (Wen et al., Lancet Digital Health 2022, cited in Badrie).
  • Only 8.82% of 136 dermatology AI studies disclosed race or ethnicity of source image participants (Guo et al., cited in Kleinberg et al.).
  • No current FDA SaMD clearance pathway mandates Fitzpatrick-stratified subgroup performance reporting for dermatology AI devices.
  • CONSORT-AI and SPIRIT-AI provide existing reporting standards that would surface demographic gaps if consistently applied.
  • STAR-ED (Stanford HAI) provides an automated open-source framework for auditing skin-tone representation in educational and training materials.

What Clinicians, Researchers, and Institutions Should Require

The evidence base reviewed here is sufficient to support concrete requirements — not aspirational guidelines — for anyone evaluating, deploying, or developing AI dermatology tools. For readers new to the underlying metrics, the site's overview of core AI and health concepts covers AUROC, sensitivity, specificity, and model fine-tuning in clinical context.

For Clinicians Evaluating or Deploying AI Dermatology Tools

  • Require skin-tone-stratified performance data (AUROC and sensitivity by Fitzpatrick subgroup) as a condition of evaluation. Aggregate performance metrics are insufficient.
  • Ask vendors specifically about training dataset composition: what proportion of training images were Fitzpatrick IV–VI? Was skin type metadata collected prospectively or inferred post-hoc?
  • Treat the absence of stratified performance data as a disqualifying gap, not a minor omission — particularly in settings serving populations with a significant proportion of darker-skinned patients.
  • Be aware that smartphone-based deployment (AUROC 0.81 in the meta-analysis) compounds skin-tone bias. Tools used in community or mobile settings carry a double performance penalty for darker-skinned patients.

For Researchers Developing or Validating Dermatology AI

  • Follow CONSORT-AI reporting standards and treat skin-tone-stratified subgroup outcomes as a minimum reporting requirement, not an optional supplementary analysis.
  • Use or contribute to diverse benchmark datasets with prospectively collected Fitzpatrick metadata and biopsy confirmation — the DDI dataset is the current reference standard for this.
  • Do not rely on ITA-based skin tone inference as a substitute for prospectively collected Fitzpatrick labels; the method has documented reliability problems when compared against human-annotated ground truth.
  • Do not invest primarily in algorithmic fairness training methods (GroupDRO, CORAL, CDANN) as the solution to skin-tone bias — the published evidence shows these approaches do not close the gap. Dataset diversification is the supported intervention.
  • Disclose the race, ethnicity, and skin type composition of source images in full, alongside funding sources and conflicts of interest; the 8.82% race/ethnicity disclosure rate in published dermatology AI literature reflects a reporting culture that needs to change.

For Institutions and Procurement Teams

  • Require dataset transparency from AI dermatology vendors as part of procurement due diligence: training dataset size, FST distribution, metadata collection method, and independent validation status.
  • Audit institutional dermatology educational materials using the STAR-ED framework to assess whether physician training is contributing to labeler bias in any in-house AI development or annotation pipelines.
  • Evaluate whether the patient population served by the institution is adequately represented in the training data of any tool under consideration — particularly for institutions with significant Black, Hispanic, or South Asian patient populations.

For a synthesis of AI diagnostic evidence across specialties beyond dermatology, the site's research digest on AI in the medical field provides the broader evidentiary context within which dermatology-specific findings should be interpreted. Readers seeking a companion overview of AI health evidence across clinical domains may also find value in the site's overview of what the clinical evidence on AI in health actually shows.