AI Skin Lesion Detection for Melanoma: Evidence & Limits

The Clinical Problem: Melanoma Burden and the Access Gap

Melanoma is among the most survivable cancers when caught early and among the most lethal when it is not. The five-year survival rate for localized melanoma is approximately 99%, but that figure declines sharply once the disease reaches distant organs. This survival gradient makes early detection the single most consequential clinical intervention available — and it makes the current access gap a patient safety problem, not merely a scheduling inconvenience.

Average dermatology wait times in the United States run approximately 35 days. In rural and underserved regions, specialist access is substantially worse, with many patients routed through primary care for initial lesion assessment. Primary care physicians are trained to identify suspicious lesions, but their sensitivity for early melanoma is meaningfully lower than that of dermatologists — a gap that has direct consequences for stage at diagnosis.

Localized melanoma: ~99% five-year survival rate.
Metastatic melanoma: survival drops sharply with distant spread.
Average U.S. dermatology wait time: approximately 35 days.
Rural specialist shortages mean primary care physicians handle a disproportionate share of initial lesion evaluation.
Primary care physician sensitivity for melanoma is substantially lower than expert dermatologist sensitivity in comparative studies.

This is the clinical context that motivates AI-assisted detection: not a technology solution looking for a problem, but an access and sensitivity gap with a measurable mortality consequence. Whether AI tools close that gap — and under what conditions — is what the evidence section addresses.

AI Approaches in Skin Lesion Detection: Architectures and Image Modalities

Most FDA-authorized and research-stage AI tools for skin lesion detection use convolutional neural networks (CNNs) or, increasingly, Vision Transformers (ViTs) and hybrid CNN-Transformer architectures. CNNs remain the dominant approach in published clinical studies, particularly those using dermoscopy images. ViTs and hybrid models show promise in benchmark evaluations but have a thinner prospective clinical evidence base as of mid-2026.

The four primary input modalities used in AI dermatology research and clinical tools (three image-based, one spectroscopic) differ substantially in their evidence base and clinical applicability:

AI input modalities for skin lesion detection, ranked by clinical evidence base. AUROC figure from the PMC Medicina 2025 equity meta-analysis (Tjiu et al.).
Modality	Evidence Base	Clinical Context	Key Limitation
Dermoscopy	Strongest — majority of prospective and meta-analytic studies	Specialist and trained primary care settings	Requires dermoscope; operator-dependent image quality
Clinical photography	Moderate — used in some prospective trials and real-world deployments	Primary care, teledermatology	Lower resolution; lighting variability degrades AI performance
Elastic scattering spectroscopy (ESS)	Limited but growing; underpins DermaSensor, the only FDA De Novo-authorized primary care device	Point-of-care primary care settings	Requires proprietary handheld device; not image-based
Smartphone capture	Weakest prospective evidence; AUROC ~0.81 in setting-stratified analyses	Consumer and community settings	Image quality highly variable; not equivalent to regulated medical devices

DermaSensor, the only AI-enabled dermatology device currently authorized by the FDA for non-specialist use, uses elastic scattering spectroscopy rather than image capture. This distinction matters: its performance data are not directly comparable to those of dermoscopy-based AI tools, and it should not be conflated with consumer smartphone applications.

Figure: AI-assisted skin lesion evaluation in two modalities: dermoscopy with AI risk overlay (left) and point-of-care spectroscopy in a primary care setting (right). Both represent adjunctive tools supporting clinician decision-making.

Evidence Quality: What the Literature Actually Shows

Before citing any accuracy figure, the study design producing it must be understood. The dominant evidence base for AI skin lesion detection is retrospective, uses internal validation on curated benchmark datasets, and is not representative of real clinical populations. This context is not a footnote — it is the essential frame for interpreting every performance number in this field.

AI vs. Clinicians: The Core Comparative Findings

The npj Digital Medicine 2024 meta-analysis (53 studies) provides the largest comparative dataset currently available. Its headline findings break down along three clinician comparison groups, each with distinct clinical implications:

AI vs. clinician diagnostic performance from the npj Digital Medicine 2024 meta-analysis (Salinas et al., 53 studies, predominantly retrospective). Sensitivity and specificity figures are pooled estimates.
Comparison	AI Sensitivity	AI Specificity	Clinician Sensitivity	Clinician Specificity	Clinical Implication
AI vs. all clinicians	87.0%	77.1%	79.8%	73.6%	AI modestly outperforms the pooled clinician group across both metrics
AI vs. expert dermatologists	86.3%	78.4%	84.2%	74.4%	Performance is comparable; AI does not substantially exceed expert dermatologists
AI vs. generalist physicians	92.5%	—	64.6%	—	Largest gap — AI substantially outperforms generalists on sensitivity; highest-value deployment context

The generalist comparison is the most clinically actionable finding: a 27.9 percentage point sensitivity gap between AI (92.5%) and generalist physicians (64.6%) points directly to primary care triage as the highest-value deployment context. This is where AI assistance has the greatest potential to change clinical outcomes.
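The gap arithmetic can be checked directly from the pooled sensitivities in the table. The following is a minimal Python sketch; the dictionary and function names are illustrative assumptions, and only the percentages come from the meta-analysis figures quoted above.

```python
# Pooled sensitivities (%) as quoted in the comparison table:
# (AI sensitivity, clinician sensitivity) for each comparison group.
POOLED_SENSITIVITY = {
    "all clinicians": (87.0, 79.8),
    "expert dermatologists": (86.3, 84.2),
    "generalist physicians": (92.5, 64.6),
}


def sensitivity_gap(ai: float, clinician: float) -> float:
    """Return the AI-minus-clinician gap in percentage points, to one decimal."""
    return round(ai - clinician, 1)


gaps = {
    group: sensitivity_gap(ai, clinician)
    for group, (ai, clinician) in POOLED_SENSITIVITY.items()
}

for group, gap in gaps.items():
    print(f"AI vs. {group}: {gap:+.1f} percentage points")
```

Laid side by side, the gaps show that the generalist comparison is an order of magnitude larger than the expert comparison, which is the basis for the triage framing.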

Prospective Evidence: The JAMA Dermatology 2024 Meta-Analysis

The strongest prospective signal comes from a 2024 systematic review and meta-analysis published in JAMA Dermatology (Laiouar-Pedari et al.), which applied strict inclusion criteria — prospective design only — and identified 11 qualifying studies involving more than 2,500 patients and more than 50 dermatologists. This is a substantially smaller evidence base than the retrospective literature, reflecting how few prospective trials have been conducted.

Key prospective findings from this meta-analysis:

Dermatologist pooled sensitivity: 78.6%; specificity: 75.2%.
AI system pooled sensitivity: 80.9%; specificity: 75.6% — comparable to dermatologists prospectively.
AI-assisted dermatologists (one study): sensitivity 91.9%, specificity 83.7% — the strongest result in the prospective literature, and the only configuration that meaningfully outperforms either AI or clinicians alone.
Authors note frequent risk of bias from patient preselection and binary classification frameworks, and call for broader validation in unselected real-world populations before AI is considered ready for routine clinical use.

Overall Performance Envelope: The PMC Medicina 2025 Equity Meta-Analysis

A 2025 systematic review and meta-analysis published in PMC Medicina (Tjiu et al., 18 studies, >70,000 test images) provides the broadest current performance envelope: pooled AUROC 0.88 (95% CI 0.87–0.90), pooled sensitivity 0.91, pooled specificity 0.64. These figures include stratified analyses by clinical setting and skin tone that are essential for deployment planning — covered in the limitations section below.

Regulatory Status: From PMA-Only Devices to the First Primary Care Authorization

Three AI-enabled dermatology devices have received FDA marketing authorization, and their regulatory history illustrates both the maturation of the field and the persistence of its core clinical problem — low specificity.

FDA-authorized AI dermatology devices as of June 2026. Source: FDA AI-Enabled Medical Devices list (DEN230008, P150046, P090012).
Device	Pathway	Authorization Date	Indicated Users	Key Limitation at Authorization
MelaFind	PMA (P090012)	November 1, 2011	Dermatologists only	Low specificity; high false-positive rate in specialist settings
Nevisense	PMA (P150046)	June 28, 2017	Dermatologists only	Specificity concerns persisted; specialist-only scope maintained
DermaSensor	De Novo (DEN230008)	January 17, 2024	Non-specialist physicians (primary care)	Sp 20.7% in pivotal trial; postmarket requirements imposed for underrepresented skin tones

DermaSensor De Novo Authorization: What It Establishes

The FDA's January 17, 2024 De Novo authorization of DermaSensor (DEN230008) is the first FDA marketing authorization for an AI-enabled dermatology device indicated for use by non-specialist physicians. Its scope covers detection of melanoma, basal cell carcinoma (BCC), and squamous cell carcinoma (SCC) in patients aged 40 and older.

The De Novo pathway carries regulatory significance beyond this single device: it establishes product classification code QZS, which enables future manufacturers of substantially equivalent devices to seek clearance via the 510(k) pathway rather than De Novo. This creates a regulatory precedent that could accelerate market entry for subsequent primary care AI dermatology tools.

Real-World Deployment Evidence: DermaSensor in Primary Care

Three distinct studies provide real-world performance data for DermaSensor. They differ in design, patient population, and endpoint, and must not be conflated. The specificity figures in particular vary significantly across studies — reflecting different clinical populations and study conditions, not measurement error.

DERM-SUCCESS Pivotal Trial

The DERM-SUCCESS trial was the pivotal study supporting the DermaSensor De Novo application. It enrolled 1,005 patients with 1,579 lesions across 22 primary care centers. Against biopsy-confirmed outcomes:

DermaSensor sensitivity: 95.5%
Primary care physician (PCP) sensitivity: 83.0%
Negative predictive value (NPV): 96.6%
Specificity: 20.7%, meaning approximately four of every five benign lesions evaluated by the device were flagged for referral unnecessarily
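The NPV reported above is not a fixed property of the device: it depends on how common malignancy is among the lesions evaluated. The following is a minimal Python sketch, assuming the standard Bayes relationship between sensitivity, specificity, prevalence, and NPV. The sensitivity and specificity are the reported DERM-SUCCESS figures; the prevalence values are hypothetical and chosen only to show the trend, not taken from the trial.

```python
def npv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Negative predictive value via Bayes' rule; all inputs are fractions in [0, 1]."""
    true_negative_mass = specificity * (1.0 - prevalence)
    false_negative_mass = (1.0 - sensitivity) * prevalence
    return true_negative_mass / (true_negative_mass + false_negative_mass)


SENS, SPEC = 0.955, 0.207  # reported DERM-SUCCESS device sensitivity and specificity

# Hypothetical malignancy prevalences among evaluated lesions (assumptions).
for prevalence in (0.02, 0.05, 0.10, 0.20):
    print(f"prevalence {prevalence:.0%}: NPV {npv(SENS, SPEC, prevalence):.1%}")
```

NPV falls as prevalence rises, so a negative result is more reassuring in a low-prevalence primary care population than in a preselected high-risk one.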

Clinical Utility Study (108 PCPs, >10,000 Lesions)

A subsequent clinical utility study evaluated DermaSensor use by 108 primary care physicians across more than 10,000 lesions. This study assessed management decisions rather than biopsy-confirmed diagnoses, and produced a different specificity profile:

Management sensitivity with device: 91.4% vs. 82.0% without — a meaningful improvement.
Diagnostic sensitivity with device: 81.7% vs. 71.1% without.
Specificity decreased from 44.2% to 32.4% with device use — a significant referral cascade risk in high-volume primary care settings.

Rural Primary Care Real-World Study (3 PCPs, 155 Patients)

A smaller real-world study published in JABFM evaluated 3 primary care physicians in a rural setting, assessing 178 lesions from 155 patients. The study cohort was 92.2% self-identified White. Against biopsy or dermatologist consensus:

DermaSensor diagnostic sensitivity: 90% vs. standard-of-care PCP sensitivity: 40%
NPV: 98.6% — high reliability for 'Monitor' (negative) results
Device specificity: 60.7% vs. clinician standard-of-care specificity: 84.8%
Specificity by skin tone: 53.2% for Fitzpatrick I–III; 69.1% for IV–VI (specificity was not lower in the IV–VI group, though this small sample limits what can be concluded)

DERM System in NHS Clinical Pathways (Separate Device)

A prospective real-world deployment study of DERM across two NHS hospitals (July 2021–October 2022) assessed 8,571 lesions with confirmed outcomes across all Fitzpatrick skin types I–VI. Results by device version:

DERM system performance in live NHS clinical pathways (Frontiers in Medicine, 2023). Specificity improved substantially between versions while sensitivity was maintained. All Fitzpatrick types I–VI represented.
DERM Version	Melanoma Sensitivity	Benign Specificity	Discharge Eligibility
DERM-vA	95.0–100.0%	40.7–49.4%	15–31% of cases
DERM-vB	95.0–100.0%	70.1–73.4%	15–31% of cases

The DERM data are notable on two counts. First, they show that a regulated AI device can maintain its pre-market sensitivity targets in live clinical service, without the performance decay commonly observed when AI moves from curated research datasets to real-world deployment. Second, they show that specificity can improve substantially through iterative model development.

Known Limitations: Specificity, Skin-Tone Bias, and Setting-Dependent Performance

The following limitations are documented in peer-reviewed literature and regulatory filings. They are not theoretical concerns — each has quantified evidence. Clinicians and health IT professionals evaluating adoption should treat these as active constraints, not background caveats.

1. Low Specificity Across Cleared Devices

Specificity has been the persistent clinical weakness of every FDA-authorized AI dermatology device. In the DermaSensor DERM-SUCCESS pivotal trial, specificity was 20.7% — meaning approximately 79% of benign lesions evaluated by the device generated a 'refer' recommendation. In the clinical utility study (108 PCPs, >10,000 lesions), specificity fell from 44.2% to 32.4% with device use.

In a primary care setting handling high lesion volumes, this specificity profile creates a substantial downstream referral burden on dermatology services that are already constrained. The high NPV (96.6% in the pivotal trial) makes the device useful for ruling out malignancy, but the low specificity limits its utility for ruling it in.
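To make the referral burden concrete, the specificity figures can be converted into expected false-positive referrals. The following is a minimal Python sketch under stated assumptions: the caseload of 1,000 benign lesions is hypothetical, the function name is illustrative, and the three specificities are the reported figures from the pivotal trial and the clinical utility study.

```python
def unnecessary_referrals(benign_lesions: int, specificity: float) -> int:
    """Expected number of benign lesions flagged for referral (false positives)."""
    return round(benign_lesions * (1.0 - specificity))


# Reported specificities quoted in this article.
REPORTED_SPECIFICITY = {
    "pivotal trial, device": 0.207,
    "utility study, PCP with device": 0.324,
    "utility study, PCP without device": 0.442,
}

BENIGN_LESIONS = 1_000  # hypothetical caseload, not a figure from any study

burden = {
    label: unnecessary_referrals(BENIGN_LESIONS, specificity)
    for label, specificity in REPORTED_SPECIFICITY.items()
}

for label, count in burden.items():
    print(f"{label}: about {count} of {BENIGN_LESIONS} benign lesions referred")
```

Multiplying these counts by a site's actual benign-lesion volume gives a first estimate of the extra load placed on local dermatology services.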

2. Skin-Tone Bias: Quantified and Persistent

The PMC Medicina 2025 equity meta-analysis (Tjiu et al., 18 studies, >70,000 images) documents a clinically significant and statistically significant fairness gap: pooled AUROC for Fitzpatrick IV–VI skin types was 0.82, compared to 0.89 for Fitzpatrick I–III. The ΔAUROC of −0.07 (p<0.01) is not a marginal rounding difference — it represents a systematic performance disadvantage for patients with darker skin tones.

Figure: AI skin lesion detection AUROC by Fitzpatrick skin tone group. Pooled AUROC 0.89 for types I–III vs. 0.82 for types IV–VI (ΔAUROC −0.07, p<0.01). Source: Tjiu et al., PMC Medicina 2025 (18 studies, >70,000 images).

The DermaSensor pivotal trial enrolled 97.1% White participants — a dataset composition that directly limits the generalizability of its performance data to patients with darker skin tones. The FDA's imposition of post-market performance testing requirements for underrepresented populations acknowledges this gap explicitly, but does not resolve it at the time of authorization.
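A site that wants to check for a similar gap in its own data can compute AUROC separately per skin-tone group. The following is a minimal Python sketch, assuming each record carries a Fitzpatrick group label, a biopsy-confirmed outcome (1 = malignant, 0 = benign), and a model risk score. The records are made-up toy values, not study data, and the function names are illustrative rather than any published method.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Rank-based AUROC: chance that a malignant case outscores a benign one (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one malignant and one benign case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def auroc_by_group(records):
    """Compute AUROC per group from (group, label, score) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for group, label, score in records:
        grouped[group][0].append(label)
        grouped[group][1].append(score)
    return {group: auroc(ys, ss) for group, (ys, ss) in grouped.items()}


# Toy records for illustration only (assumed values).
records = [
    ("I-III", 1, 0.9), ("I-III", 1, 0.8), ("I-III", 0, 0.3), ("I-III", 0, 0.2),
    ("IV-VI", 1, 0.7), ("IV-VI", 1, 0.4), ("IV-VI", 0, 0.5), ("IV-VI", 0, 0.1),
]

result = auroc_by_group(records)
delta = result["IV-VI"] - result["I-III"]
print(result, f"delta AUROC = {delta:+.2f}")
```

With real data, each subgroup needs enough malignant and benign cases for the estimate to be stable; a confidence interval (for example, by bootstrap) would be needed before drawing conclusions from any observed difference.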

3. Setting-Dependent Performance Gradient

AI diagnostic accuracy for skin lesion detection is not a fixed property of a model — it is a function of the clinical setting in which it is used. The PMC Medicina 2025 meta-analysis documents a clear gradient:

Specialist settings (dermoscopy, controlled conditions): AUROC 0.90
Community care settings: AUROC 0.85
Smartphone or consumer environments: AUROC 0.81

Figure: AI skin lesion detection AUROC by clinical setting. Performance degrades as image quality and clinical conditions move away from specialist environments. Source: Tjiu et al., PMC Medicina 2025.

This gradient has a direct implication for benchmark interpretation: AUROC figures from dermoscopy studies on curated datasets (e.g., ISIC/HAM10000) systematically overstate the performance clinicians will observe in community or smartphone-based deployment.

4. Dataset Homogeneity

The majority of AI skin lesion models have been trained on a small number of benchmark datasets, primarily ISIC and HAM10000, which are predominantly composed of images from patients with lighter skin tones and from specialist dermatology settings. These datasets have been repeatedly reused across published studies, creating a risk that models overfit to benchmark conditions and fail to transfer to diverse real-world populations.

Leading models, including those trained on HAM10000, show significant performance drops on the Diverse Dermatology Images (DDI) dataset, particularly for Fitzpatrick skin types V–VI. Rare presentations such as amelanotic melanoma and acral lentiginous melanoma are systematically underrepresented in training data, which limits AI reliability for exactly the presentations most likely to be missed clinically.

5. Metadata Absence

Current AI skin lesion models — including cleared devices — operate primarily on image or spectroscopy data. They do not integrate patient history, age, personal or family history of melanoma, lesion evolution over time, or palpation findings. A dermatologist evaluating a suspicious lesion uses all of this information; the AI does not. This structural gap means AI performance figures in published studies reflect image-only classification, not the full clinical evaluation that determines management decisions.

6. Prospective Evidence Remains Limited

The JAMA Dermatology 2024 prospective meta-analysis (Laiouar-Pedari et al.) identified only 11 prospective studies meeting inclusion criteria — from a field that has published hundreds of AI dermatology papers. The authors explicitly note that frequent risk of bias from patient preselection and binary classification frameworks limits generalizability, and that broader validation in unselected real-world populations is needed before AI can be considered ready for routine clinical use.

Deployment Stage: Adjunctive Triage Support, Not Autonomous Diagnosis

The evidence base as of mid-2026 supports one deployment framing for AI-enabled skin lesion detection: adjunctive decision support for non-specialist clinicians, particularly in primary care triage settings where the sensitivity gap between AI and generalists is largest. This is not a conservative interpretation — it is the conclusion of the prospective meta-analysis literature, the equity meta-analysis, and the FDA's own authorization language.

The strongest prospective result, AI-assisted dermatologists achieving sensitivity 91.9% and specificity 83.7%, comes from a single study. It suggests that human-AI collaboration can outperform either the clinician or the AI system operating independently, but it has not yet been replicated. This is the appropriate performance target for deployment planning, not AI-alone accuracy from retrospective benchmark studies.

For health IT professionals and procurement teams, the deployment decision matrix should account for three variables the aggregate literature does not resolve: the local patient population's skin-tone distribution relative to the training data, the clinical setting's image quality and operator training, and the downstream capacity of dermatology services to absorb the referral volume generated by a low-specificity device. None of these can be resolved from published trial data alone; each requires site-specific assessment.
