Clinical trials of AI interventions have a reporting problem. A study can describe an AI system so vaguely — "a deep learning model trained on institutional data" — that no reader can determine whether the results would replicate in a different hospital, on a different patient population, or even on the same system after a software update. CONSORT-AI was developed specifically to close that gap.
Published simultaneously in Nature Medicine, The BMJ, and The Lancet Digital Health in September 2020, CONSORT-AI is an extension of the Consolidated Standards of Reporting Trials (CONSORT) 2010 statement. It adds 14 items addressing the reporting challenges that arise when the intervention being evaluated is an AI system rather than a drug, device, or procedure.
Why Standard CONSORT Falls Short for AI Trials
CONSORT 2010 was designed for interventions that can be described with reasonable completeness in a few sentences — a drug dose, a surgical technique, a behavioral protocol. AI systems don't fit that mold.
An AI diagnostic tool is defined not just by its architecture but by its training data, preprocessing pipeline, decision threshold, software version, and the hardware environment it runs on. Any of these can shift performance materially. A model trained on chest radiographs from one scanner manufacturer may degrade noticeably when applied to images from another. A threshold calibrated on a predominantly male dataset may systematically underperform on female patients.
Standard CONSORT has no line items for any of this. Reviewers and readers evaluating AI trials under CONSORT 2010 alone were left to guess at details that are essential for assessing generalizability, reproducibility, and safety.
The 14 CONSORT-AI Extensions: What Each Requires
CONSORT-AI maps its additions onto the existing CONSORT structure — title, abstract, introduction, methods, results, discussion — rather than creating a separate document. The 14 new items fall across five areas.
AI System Description
The standard requires that trials identify the AI system with enough specificity that an independent party could locate or reproduce it. This includes the version number or release date, the developer or source, and — where applicable — a reference to a publicly accessible model or code repository. For FDA-cleared devices, this typically means citing the 510(k) or De Novo submission number.
This matters because AI systems are updated continuously. Results reported for "version 2.1" of a system may not apply to version 3.0 deployed six months later. Without version specificity, the published evidence cannot be reliably linked to the deployed product.
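As a rough illustration of what "enough specificity" could look like in structured form, the sketch below records the identifying details named above and lists the ones a report leaves unstated. The class name, field names, and the `unreported_items` helper are hypothetical and chosen for this example; they are not part of CONSORT-AI itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AISystemIdentity:
    """Hypothetical record of the identifying details a trial report could state."""
    name: str
    version: Optional[str] = None               # version number, e.g. "2.1"
    release_date: Optional[str] = None          # ISO date, usable when no version exists
    developer: Optional[str] = None             # developer or source
    repository_url: Optional[str] = None        # public model or code reference, if any
    regulatory_reference: Optional[str] = None  # e.g. a 510(k) or De Novo number


def unreported_items(identity: AISystemIdentity) -> List[str]:
    """Return the identifying details that the report leaves unstated."""
    missing = []
    if identity.version is None and identity.release_date is None:
        missing.append("version or release date")
    if identity.developer is None:
        missing.append("developer or source")
    return missing


# A description like "a deep learning model" identifies almost nothing.
vague = AISystemIdentity(name="deep learning model")
print(unreported_items(vague))
# ['version or release date', 'developer or source']

# Made-up product and vendor names, for illustration only.
specific = AISystemIdentity(name="ExampleCAD", version="2.1", developer="Example Vendor")
print(unreported_items(specific))
# []
```

The repository and regulatory fields are left optional here because the text above treats them as applicable only in some cases.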
Training Data Disclosure
CONSORT-AI requires disclosure of the dataset used to train the AI — including its source, size, demographic composition where known, and any preprocessing steps applied. This is one of the most frequently omitted items in practice. Many published AI trials describe the training data only as "a large retrospective dataset" or "proprietary institutional data," which provides essentially no basis for assessing whether the model's training distribution matches the trial population.
Decision Threshold and Output Handling
Most AI diagnostic systems produce a continuous probability score that is converted to a binary output (positive/negative, flag/no flag) at a chosen decision threshold. CONSORT-AI requires that trials report this threshold explicitly, explain how it was chosen, and describe what happens to borderline or indeterminate outputs.
This matters because threshold selection directly trades off sensitivity against specificity. A trial reporting sensitivity of 94% is reporting sensitivity at a particular threshold — move the threshold and that number changes. Without knowing the threshold, readers cannot compare results across trials or assess whether the operating point chosen for the trial matches what would be used in clinical practice.
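The trade-off can be seen on a toy example. The sketch below uses ten made-up scores (five patients with the condition, five without) and assumes that a score at or above the threshold counts as a positive call; the function name and data are illustrative only.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Call score >= threshold positive; return (sensitivity, specificity)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up model outputs: label 1 = condition present, 0 = absent.
scores = [0.95, 0.80, 0.65, 0.45, 0.30, 0.70, 0.50, 0.35, 0.20, 0.10]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

for threshold in (0.25, 0.40, 0.60):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
# threshold=0.25 sensitivity=1.00 specificity=0.40
# threshold=0.40 sensitivity=0.80 specificity=0.60
# threshold=0.60 sensitivity=0.60 specificity=0.80
```

The same model and the same patients yield three different sensitivity figures, which is why a sensitivity reported without its threshold cannot be compared across trials.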
Human-AI Interaction and Workflow Integration
CONSORT-AI requires a description of how the AI output was integrated into clinical decision-making. Was the AI output shown to clinicians before or after their initial assessment? Could clinicians override the AI? Were there protocol rules governing how disagreements between the AI and the clinician were resolved?
This is not a minor procedural detail. The same AI system producing the same outputs can have dramatically different effects on clinical decisions depending on whether it is presented as a second opinion, a first-pass filter, or a mandatory checkpoint. A trial that doesn't describe this interaction structure cannot be generalized to a deployment context where the workflow is different.
Failure Analysis and Indeterminate Outputs
CONSORT-AI requires that trials report how the AI system handled cases it could not process — images of insufficient quality, data outside the model's expected input range, or outputs below a confidence threshold. What happened to those cases in the trial? Were they excluded from analysis, routed to human review, or counted as errors?
Excluding low-confidence cases from reported metrics is a common source of performance inflation. A model that declines to classify 15% of cases and reports 97% accuracy on the remaining 85% is not performing at 97% accuracy on the full population.
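The arithmetic behind that example can be checked directly. The sketch below assumes a hypothetical trial of 2,000 cases in which the model declines 300 (15%) and is correct on 1,649 of the remaining 1,700 (97%). The `declined_accuracy` parameter is an assumption introduced here: it stands for whatever accuracy is eventually achieved on the declined cases, which the trial report would need to state.

```python
def full_population_accuracy(n_classified, n_correct, n_declined, declined_accuracy=0.0):
    """Accuracy over all cases, given an assumed accuracy on the declined ones.

    declined_accuracy=0.0 counts every declined case as an error.
    """
    n_total = n_classified + n_declined
    return (n_correct + declined_accuracy * n_declined) / n_total


n_classified, n_correct, n_declined = 1700, 1649, 300

reported = n_correct / n_classified
as_errors = full_population_accuracy(n_classified, n_correct, n_declined)
coin_flip = full_population_accuracy(n_classified, n_correct, n_declined, declined_accuracy=0.5)

print(f"reported on classified cases: {reported:.4f}")   # 0.9700
print(f"declined counted as errors:   {as_errors:.4f}")  # 0.8245
print(f"declined resolved at 50%:     {coin_flip:.4f}")  # 0.8995
```

Which figure applies depends on what actually happened to the declined cases, which is the detail the trial report has to supply.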
CONSORT-AI vs. Related Reporting Standards
CONSORT-AI is one of several AI-specific reporting guidelines that have emerged since 2019. Understanding where it sits relative to the others matters for both study authors and readers interpreting published work.
| Standard | Scope | Study Type | AI-Specific Focus |
|---|---|---|---|
| CONSORT-AI | Randomized controlled trials of AI interventions | RCT only | AI system description, training data, threshold, human-AI interaction, failure handling |
| TRIPOD+AI | Prediction model development and validation | Observational / validation studies | Model development transparency, validation methodology, calibration |
| STARD-AI | Diagnostic accuracy studies | Diagnostic test evaluation | Reference standard selection, reader variability, AI vs. human comparison |
| SPIRIT-AI | Clinical trial protocols | Trial registration and protocol documents | Pre-specified AI system details before trial begins |
| MI-CLAIM | Clinical AI modeling studies | Observational and clinical ML studies | Reproducibility, code/data availability, dataset characteristics |
SPIRIT-AI is worth highlighting separately because it is the prospective complement to CONSORT-AI. Where CONSORT-AI governs how completed trials are reported, SPIRIT-AI governs what must be specified in a trial protocol before the trial begins, including the AI system version, the decision threshold, and the handling of indeterminate outputs. Trials whose protocols follow SPIRIT-AI and whose results are reported under CONSORT-AI provide the strongest basis for evaluating whether the published results reflect what was actually pre-planned.
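One way to picture the protocol-versus-report comparison is as a simple diff over the pre-specified items. The sketch below is illustrative only: the dictionary keys, the example values, and the `protocol_deviations` helper are hypothetical and not defined by either guideline.

```python
# Items the text above names as pre-specified in the protocol (key names are hypothetical).
PRE_SPECIFIED_ITEMS = ("system_version", "decision_threshold", "indeterminate_handling")


def protocol_deviations(protocol, report):
    """Return {item: (pre_specified, reported)} for items that differ or are missing."""
    return {
        item: (protocol.get(item), report.get(item))
        for item in PRE_SPECIFIED_ITEMS
        if protocol.get(item) != report.get(item)
    }


# Made-up values for illustration.
protocol = {
    "system_version": "2.1",
    "decision_threshold": 0.50,
    "indeterminate_handling": "route to human review",
}
report = {
    "system_version": "2.1",
    "decision_threshold": 0.35,
    "indeterminate_handling": "route to human review",
}

print(protocol_deviations(protocol, report))
# {'decision_threshold': (0.5, 0.35)}
```

A non-empty result does not by itself mean the trial is flawed, but it flags a change the report should explain.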
Adoption in Practice: Where the Gaps Are
Since its 2020 publication, CONSORT-AI has been endorsed by a number of journals, including The Lancet, JAMA, and Nature Medicine. But endorsement and compliance are different things. Audits of published AI trials in the years following the standard's release have consistently found incomplete adherence, with training data disclosure and failure analysis being the most commonly omitted items.
Several factors drive this gap. Peer reviewers are not always equipped to assess AI-specific reporting completeness. Authors from clinical backgrounds may not have access to training data details if the AI system was developed by a commercial vendor. And journals that endorse CONSORT-AI don't always enforce it at the submission stage — endorsement often means "we encourage compliance" rather than "we require it."
How CONSORT-AI Connects to FDA Regulatory Submissions
CONSORT-AI is a voluntary reporting standard — it carries no regulatory force. The FDA does not require CONSORT-AI compliance as a condition of 510(k) or De Novo clearance. However, several of the items CONSORT-AI requires overlap substantially with what FDA expects in a premarket submission for AI/ML-based Software as a Medical Device (SaMD).
FDA's guidance on AI/ML-based SaMD — including the 2021 action plan and subsequent draft guidances — emphasizes training data transparency, performance across demographic subgroups, and clear documentation of the model's intended operating conditions. These parallel CONSORT-AI's requirements closely enough that a trial conducted under CONSORT-AI standards will generally produce the kind of evidence that supports a regulatory submission, even if the mapping is not exact.
The connection matters for procurement and clinical adoption decisions too. An AI tool supported by a trial that meets CONSORT-AI standards is much easier to evaluate than one supported by a trial that doesn't. The report lets a reader assess whether the trial population resembles their own patient population, whether the workflow described matches their intended deployment, and whether the reported performance metrics were measured at a clinically appropriate operating point.
Limitations of CONSORT-AI Itself
CONSORT-AI improves on standard CONSORT for AI trials, but it has its own gaps worth acknowledging.
- Model drift is not addressed. CONSORT-AI describes the AI system at the time of the trial. It does not require any specification of how performance should be monitored after deployment, or how software updates that change the model's behavior will be handled. This is a significant gap for continuously learning systems.
- Proprietary systems create a structural problem. When the AI system is a commercial product, trial authors often cannot disclose training data, model architecture, or software internals because these are trade secrets. CONSORT-AI does not resolve this tension — it requires disclosure that manufacturers may refuse to provide.
- Subgroup reporting requirements are limited. CONSORT-AI requires that training data demographic composition be described, but it does not mandate that performance metrics be reported stratified by demographic subgroup. A trial can technically comply with CONSORT-AI while reporting only aggregate accuracy figures that obscure differential performance across race, sex, or age groups.
- Generative AI and LLM-based interventions are a poor fit. CONSORT-AI was developed with supervised learning systems in mind. The framework's assumptions — a defined input, a defined output, a fixed decision threshold — map poorly onto large language models used for clinical summarization, documentation, or decision support, where outputs are variable and thresholds don't apply in the same way.
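The subgroup-reporting gap in the list above can be made concrete with a small example. The sketch below uses made-up counts for two hypothetical subgroups, "A" and "B", to show how an aggregate accuracy figure can hide a large difference between them; the function name and data are illustrative only.

```python
from collections import defaultdict


def accuracy_by_subgroup(records):
    """records: iterable of (subgroup, is_correct) pairs. Returns {subgroup: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subgroup, is_correct in records:
        totals[subgroup] += 1
        correct[subgroup] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}


# Made-up trial results: 80 cases in subgroup A, 20 in subgroup B.
records = (
    [("A", True)] * 76 + [("A", False)] * 4
    + [("B", True)] * 14 + [("B", False)] * 6
)

overall = sum(is_correct for _, is_correct in records) / len(records)
print(f"aggregate accuracy: {overall:.2f}")  # 0.90
print(accuracy_by_subgroup(records))         # {'A': 0.95, 'B': 0.7}
```

A report giving only the 0.90 aggregate would be silent on the 25-point gap between the two subgroups.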
Using CONSORT-AI When Evaluating Evidence on This Site
Across the site's research study analyses and clinical application briefs, CONSORT-AI compliance is treated as one dimension of evidence quality — not a binary pass/fail, but a factor that affects how much weight a trial's performance claims can carry.
A trial that fully specifies its AI system version, training data, decision threshold, and human-AI workflow allows a reader to ask: does this evidence apply to my context? A trial that omits these items produces performance numbers that cannot be meaningfully transferred. That distinction — between transferable and non-transferable evidence — is one of the core analytical questions this site applies when summarizing the evidence base for specific AI applications.