Study Overview
| Field | Detail |
|---|---|
| Authors | Mackin S, Major VJ, Chunara R, Newton-Dame R |
| Journal | npj Digital Medicine |
| Citation | npj Digit. Med. 8, 335 (2025) |
| DOI | 10.1038/s41746-025-01732-w |
| Publication Date | June 5, 2025 |
| Study Design | Retrospective analysis of real-world EMR data from two live binary classification models |
| Institution | NYC Health + Hospitals (H+H), in collaboration with NYU Grossman School of Medicine and NYU Center for Health Data Science |
| Population — Asthma Model | n = 53,099 patients |
| Population — Readmission Model | n = 139,132 patients |
| Funding | Schmidt Futures |
| IRB Status | Exempt (BRANY) |
| Open-Source Repository | GitHub: shainamackin/nyc-hh_algorithmic-bias-mitigation |
Clinical and Equity Context: Why Safety-Net Systems Face Disproportionate Bias Risk
NYC Health + Hospitals is the largest public health system in the United States, serving more than one million New Yorkers annually. Approximately 90% of its patients are people of color, and roughly 70% are covered by Medicaid or have no insurance. This demographic profile makes H+H a critical test case for algorithmic bias: the patients most likely to be harmed by a biased clinical AI tool are precisely the patients this system serves.
Algorithmic bias in clinical AI does not require malicious intent to cause harm. When a model is trained predominantly on data from populations that differ demographically from the patients it will later encounter, it can systematically produce higher false-negative rates for underrepresented groups — meaning those patients are more likely to be missed by automated outreach or risk-stratification. At a safety-net institution, that missed identification translates directly into missed care for patients who already face structural barriers.
The practical challenge for health systems like H+H is that most commercially deployed AI tools are black boxes: the institution uses the model's output but has no access to the training data, the model architecture, or the internal weighting of variables. This is the norm, not the exception, for health system consumers of vendor AI. Bias mitigation strategies that require retraining the model or modifying its internals are therefore unavailable to most institutions.
Study Design: Model Selection, Patient Populations, and Fairness Metric
The research team began with eight EMR-based binary classification models in active use at H+H. From these, five had at least one year of paired prediction and outcome data available — the minimum needed to conduct a meaningful bias evaluation. After screening on model discrimination (AUROC and precision-recall AUC), two models were retained for full bias analysis and mitigation testing.
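The discrimination screen described above can be illustrated with a minimal sketch. It is not the study's code (the team worked in R). It computes AUROC as the probability that a randomly chosen positive case outscores a randomly chosen negative one. The function names and the `min_auroc` cutoff are hypothetical, since the source does not state the screening threshold, and PR-AUC is omitted for brevity.

```python
def auroc(y_true, scores):
    """AUROC via pairwise comparison: share of (positive, negative) pairs
    in which the positive case has the higher score; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def passes_screen(y_true, scores, min_auroc=0.7):
    """Hypothetical screen: min_auroc is an assumed cutoff, not from the study."""
    return auroc(y_true, scores) >= min_auroc


# Toy data: 3 of the 4 positive/negative pairs are ranked correctly.
labels = [0, 0, 1, 1]
risk = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, risk))
```

The pairwise loop is quadratic in the number of patients, so a rank-based implementation would be preferable at the scale of the populations studied here.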
| Model | Clinical Task | AUROC | Primary Patient Population | Key Sensitive Classes |
|---|---|---|---|---|
| Asthma Model | Predict acute care utilization (ED visit or hospitalization) for asthma patients | 0.756 | n = 53,099 | Race/ethnicity, sex, preferred language, insurance status |
| Readmission Model | Predict unplanned 30-day readmission | 0.719 | n = 139,132 | Race/ethnicity, sex, preferred language, insurance status |
Both models were used to drive automated patient outreach programs — identifying patients who should receive proactive contact from care teams. In this context, a false negative (a patient who should be flagged but is not) is the critical error: that patient receives no outreach and may miss an intervention. This clinical use case shaped the choice of fairness metric.
The team used Equal Opportunity Difference (EOD) as the primary fairness metric. EOD measures the difference in false-negative rates between a demographic subgroup and a designated referent group. A positive EOD indicates that the subgroup has a higher false-negative rate — meaning the model is more likely to miss patients in that group relative to the referent.
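As a rough illustration of the metric as defined here, the sketch below computes per-group false-negative rates and then each subgroup's EOD against a referent. The function names, group labels, and toy data are assumptions for illustration and do not come from the study's repository.

```python
from collections import defaultdict


def false_negative_rates(y_true, y_pred, groups):
    """FNR per group: missed positives divided by all actual positives."""
    missed = defaultdict(int)
    positives = defaultdict(int)
    for y, y_hat, g in zip(y_true, y_pred, groups):
        if y == 1:
            positives[g] += 1
            if y_hat == 0:
                missed[g] += 1
    return {g: missed[g] / positives[g] for g in positives}


def equal_opportunity_difference(fnr_by_group, referent):
    """EOD per subgroup: subgroup FNR minus referent FNR.
    Positive values mean the subgroup is missed more often than the referent."""
    ref = fnr_by_group[referent]
    return {g: fnr - ref for g, fnr in fnr_by_group.items() if g != referent}


# Toy data: group "A" is the referent (hypothetical labels).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
groups = ["A"] * 4 + ["B"] * 4 + ["A", "B"]

fnr = false_negative_rates(y_true, y_pred, groups)
print(equal_opportunity_difference(fnr, referent="A"))
```

In this toy case group A misses 1 of 4 positives and group B misses 3 of 4, so the EOD for B is 50 percentage points, far above the 5pp limit used in the study.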
The team pre-defined three criteria for a mitigation approach to qualify as a success:
- Absolute EOD below 5 percentage points (pp) for all subgroups within a sensitive class
- Accuracy loss less than 10% relative to the baseline model
- Alert rate change less than 20% relative to the baseline (to preserve operational feasibility for outreach teams)
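The three criteria above can be expressed as a single check, sketched below. This is an illustration, not the study's implementation. It assumes that accuracy loss and alert rate change are both measured relative to baseline (the alert rate figures reported later in this summary are consistent with a relative measure), and the function name is hypothetical.

```python
def meets_success_criteria(eod_by_group, base_accuracy, new_accuracy,
                           base_alert_rate, new_alert_rate):
    """Evaluate the three pre-defined criteria for one sensitive class.

    eod_by_group maps each non-referent subgroup to its EOD as a fraction
    (0.05 == 5 percentage points). Changes are treated as relative to baseline,
    which is an assumption of this sketch.
    """
    max_abs_eod = max(abs(v) for v in eod_by_group.values())
    accuracy_loss = (base_accuracy - new_accuracy) / base_accuracy
    alert_change = abs(new_alert_rate - base_alert_rate) / base_alert_rate
    result = {
        "eod_ok": max_abs_eod < 0.05,
        "accuracy_ok": accuracy_loss < 0.10,
        "alert_rate_ok": alert_change < 0.20,
    }
    result["success"] = all(result.values())
    return result


# Toy inputs: EOD and accuracy pass, but the alert rate falls too far.
print(meets_success_criteria(
    eod_by_group={"B": 0.03, "C": -0.04},
    base_accuracy=0.80, new_accuracy=0.76,
    base_alert_rate=0.124, new_alert_rate=0.081,
))
```

Taking the absolute value of each EOD matters: a subgroup with an EOD of -0.08 would pass a signed check but fails the absolute one.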
Asthma Model Results: Racial and Ethnic Bias at Baseline and After Threshold Adjustment
At baseline, the asthma model's EOD, computed against the referent group, exceeded the 5pp threshold for every race/ethnicity subgroup evaluated. False-negative rates ranged from 0.51 for Black/African American patients to 0.828 for White patients, so the likelihood of being missed by the model's outreach trigger differed substantially by group.

Custom threshold adjustment — implemented in R by the H+H team — brought all race/ethnicity absolute EODs below the 5pp threshold. Critically, this was achieved while keeping accuracy loss under 10% and alert rate change under 20%, satisfying all three pre-defined success criteria.
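One simple way to picture group-specific threshold adjustment is sketched below. It is not the H+H team's R implementation, whose search procedure is not described in this summary. The sketch keeps the referent group at the baseline threshold and, for each other group, picks the candidate threshold whose false-negative rate is closest to the referent's, breaking ties in favor of the threshold nearest the baseline. All names and data are hypothetical.

```python
def fnr_at(scores, labels, threshold):
    """FNR when patients with score >= threshold are flagged for outreach."""
    pos_scores = [s for s, y in zip(scores, labels) if y == 1]
    if not pos_scores:
        raise ValueError("group has no positive cases")
    missed = sum(1 for s in pos_scores if s < threshold)
    return missed / len(pos_scores)


def adjust_thresholds(data, referent, base_threshold, candidates):
    """data maps group -> (scores, labels). Returns group -> threshold."""
    target = fnr_at(*data[referent], base_threshold)
    chosen = {referent: base_threshold}
    for group, (scores, labels) in data.items():
        if group == referent:
            continue
        # Closest FNR to the referent first, then closest to the baseline threshold.
        chosen[group] = min(
            candidates,
            key=lambda t: (abs(fnr_at(scores, labels, t) - target),
                           abs(t - base_threshold)),
        )
    return chosen


# Toy data: at a shared threshold of 0.5, group "B" is missed twice as often.
data = {
    "ref": ([0.2, 0.4, 0.7, 0.9], [0, 1, 1, 1]),
    "B": ([0.1, 0.3, 0.4, 0.8], [0, 1, 1, 1]),
}
print(adjust_thresholds(data, "ref", 0.5, [0.1, 0.2, 0.3, 0.4, 0.5]))
```

Here lowering group B's threshold from 0.5 to 0.4 brings its false-negative rate in line with the referent's. A real deployment would also need to confirm the accuracy and alert rate criteria after adjustment, because lowering thresholds flags more patients.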
Aequitas, the Python-based open-source bias mitigation library, achieved a non-absolute EOD below 5pp but did not meet the absolute EOD criterion for all subgroups. The distinction matters when the goal is to ensure that no subgroup is left with a substantially higher false-negative rate than the referent. Reject option classification (ROC), implemented via AIF360 and Sony nnabla, failed on the alert rate criterion: it reduced the alert rate from 0.124 to 0.081, a relative decrease of more than 30% (about 35%), which would have made the outreach program operationally unsustainable.
Readmission Model Results: Insurance-Based Bias and Mitigation Outcomes
The 30-day readmission model showed insurance-based bias across all insurance subgroups at baseline. False-negative rates ranged from 0.382 for Medicare patients to 0.797 for Self-Pay patients — a gap of more than 40 percentage points between the best- and worst-served groups. Insurance status is a sensitive class that frequently correlates with race, income, and access to primary care, making this pattern particularly consequential at a safety-net institution.
| Insurance Subgroup | Baseline FNR | EOD vs. Referent |
|---|---|---|
| Medicare | 0.382 | Lowest (referent or near-referent) |
| Medicaid | Not specified individually | >5pp at baseline |
| Commercial | Not specified individually | >5pp at baseline |
| Self-Pay | 0.797 | Highest disparity |
Custom threshold adjustment again succeeded where the alternatives did not. It brought all insurance-class absolute EODs below 5pp while keeping accuracy and alert rate within the pre-defined bounds. ROC failed the alert rate criterion for the readmission model as well, decreasing the alert rate from 0.255 to 0.174 — a reduction exceeding 30% that would have materially impaired the outreach program's reach.
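The alert rate figures reported for ROC can be checked with a few lines of arithmetic. The sketch below uses only the before and after rates quoted in this summary for the two models. The helper name is hypothetical.

```python
def relative_change(baseline, adjusted):
    """Signed relative change from baseline (-0.25 means a 25% drop)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (adjusted - baseline) / baseline


# Alert rates before and after ROC, as reported in the text.
reported = {
    "asthma": (0.124, 0.081),
    "readmission": (0.255, 0.174),
}

for model, (before, after) in reported.items():
    change = relative_change(before, after)
    within_limit = abs(change) < 0.20  # pre-defined alert rate criterion
    print(f"{model}: {change:+.1%} (within 20% limit: {within_limit})")
```

This prints a change of about -34.7% for the asthma model and about -31.8% for the readmission model, so both fall outside the 20% limit.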
Mitigation Method Comparison: Custom Thresholding, Aequitas, and Reject Option Classification
The H+H team tested three distinct post-processing mitigation approaches across both models. The comparison reveals not only which methods worked but also why existing open-source tooling presents practical barriers for resource-limited implementers.

| Method | Implementation | Asthma Model Outcome | Readmission Model Outcome | Alert Rate Impact | Practical Barrier |
|---|---|---|---|---|---|
| Custom threshold adjustment | In-house R code | Met all success criteria; all absolute EODs <5pp | Met all success criteria; all absolute EODs <5pp | Within 20% threshold | Requires in-house R coding capacity; no major library dependency |
| Aequitas (balanced group thresholding) | Python library | Partial success — non-absolute EOD <5pp but not absolute | Not specified as fully successful | Within threshold | Outdated documentation; limited adaptability to institution-specific criteria |
| Reject option classification (ROC) | AIF360 + Sony nnabla | Failed: alert rate criterion not met | Failed: alert rate criterion not met | Dropped >30% in both models | Poor documentation; alert rate reduction operationally unsustainable |
The team found that the existing open-source libraries, Aequitas and AIF360, were difficult to adapt to their specific institutional criteria and sensitive class definitions, in part because of outdated or sparse documentation. Custom R code, while requiring more upfront development effort, was ultimately more transparent and efficient for this use case. The released code repository is intended to reduce that development burden for other institutions.
Key Limitations
- Single-system generalizability: The study was conducted at one urban safety-net institution. Results may not transfer directly to academic medical centers, rural health systems, or institutions with substantially different patient demographic compositions.
- EOD-only optimization: Mitigation was optimized specifically for Equal Opportunity Difference. Using a different fairness metric — such as demographic parity or predictive parity — would likely produce different thresholds and potentially different success or failure outcomes for each method.
- No intersectional bias analysis: The study evaluated sensitive classes (race/ethnicity, sex, language, insurance) separately. Intersectional combinations — such as Black women on Medicaid — were not analyzed. Threshold adjustment for one class may not address compounding disparities across multiple dimensions simultaneously.
- Missing data not imputed: Patients with missing values for sensitive class variables were excluded from subgroup analyses. This may undercount bias in groups where data completeness is itself unequal, a common issue in safety-net EMR data.
- No real-world outcome measurement: The study demonstrates that threshold adjustment reduces EOD disparity in model predictions. It does not yet measure whether the adjusted thresholds translate into equitable clinical outcomes for patients — that requires prospective implementation research.
- Model-specific thresholds: Adjusted thresholds are calibrated to the specific models, patient populations, and time periods studied. Thresholds will require recalibration if models are updated, if patient population composition shifts, or if the models are deployed at a different institution.
Clinical Relevance: Implementation Pathway and Open-Source Playbook
The H+H study's most immediately actionable contribution is not a theoretical framework — it is a transferable process. The team released both an open-source R code repository and a Supplementary Playbook specifically designed for replication by other low-resource health systems. The repository (GitHub: shainamackin/nyc-hh_algorithmic-bias-mitigation) provides the custom threshold adjustment code used in the study, along with documentation structured for institutions that may not have dedicated data science teams.
For health system administrators and AI implementation teams, the study outlines a replicable sequence:
- Inventory deployed AI models and identify those with at least one year of paired prediction and outcome data.
- Screen models for minimum discrimination performance (AUROC/PR-AUC) before investing in bias evaluation — low-performing models should be addressed at the model level, not just the threshold level.
- Define sensitive classes relevant to your patient population and institutional context (race/ethnicity, insurance status, language, sex, and others as appropriate).
- Select a fairness metric aligned with the clinical use case — EOD is appropriate when false negatives are the primary harm; other use cases may warrant different metrics.
- Pre-specify success thresholds before running mitigation — the H+H team defined absolute EOD <5pp, accuracy loss <10%, and alert rate change <20% in advance, preventing post-hoc rationalization of results.
- Apply custom threshold adjustment using the released R code as a starting point, then validate that all three criteria are met before deploying adjusted thresholds in production.
- Plan for ongoing monitoring — adjusted thresholds require recalibration as models, patient populations, or clinical workflows change.
This study's approach aligns with the broader policy direction established by the STANDING Together 2025 consensus recommendations (published in NEJM AI), a Delphi-based international consensus developed with input from more than 350 representatives across 58 countries. STANDING Together addresses both documentation of health datasets and the identification and mitigation of algorithmic biases — framing bias evaluation not as an optional quality improvement activity but as a standard expectation for responsible AI deployment.
The broader literature context reinforces why this matters. A 2025 scoping review of clinical AI fairness found that post-processing methods remain the least-studied category of bias mitigation, and that real-world implementation studies — as opposed to controlled model-development experiments — are rare. The H+H study is notable precisely because it was conducted on live models serving real patients, with operational constraints (alert rate limits, accuracy requirements) that reflect the actual conditions health systems face.