Mitigating Algorithmic Bias in Safety-Net Clinical AI: NYC H+H Study

Study Overview

Citation and study metadata for Mackin et al. (2025), npj Digital Medicine.
Field	Detail
Authors	Mackin S, Major VJ, Chunara R, Newton-Dame R
Journal	npj Digital Medicine
Citation	npj Digit. Med. 8, 335 (2025)
DOI	10.1038/s41746-025-01732-w
Publication Date	June 5, 2025
Study Design	Retrospective, real-world EMR — two live binary classification models
Institution	NYC Health + Hospitals (H+H), in collaboration with NYU Grossman School of Medicine and NYU Center for Health Data Science
Population — Asthma Model	n = 53,099 patients
Population — Readmission Model	n = 139,132 patients
Funding	Schmidt Futures
IRB Status	Exempt (BRANY)
Open-Source Repository	GitHub: shainamackin/nyc-hh_algorithmic-bias-mitigation

Clinical and Equity Context: Why Safety-Net Systems Face Disproportionate Bias Risk

NYC Health + Hospitals is the largest public health system in the United States, serving more than one million New Yorkers annually. Approximately 90% of its patients are people of color, and roughly 70% are covered by Medicaid or have no insurance. This demographic profile makes H+H a critical test case for algorithmic bias: the patients most likely to be harmed by a biased clinical AI tool are precisely the patients this system serves.

Algorithmic bias in clinical AI does not require malicious intent to cause harm. When a model is trained predominantly on data from populations that differ demographically from the patients it will later encounter, it can systematically produce higher false-negative rates for underrepresented groups — meaning those patients are more likely to be missed by automated outreach or risk-stratification. At a safety-net institution, that missed identification translates directly into missed care for patients who already face structural barriers.

The practical challenge for health systems like H+H is that most commercially deployed AI tools are black boxes: the institution uses the model's output but has no access to the training data, the model architecture, or the internal weighting of variables. This is the norm, not the exception, for health system consumers of vendor AI. Bias mitigation strategies that require retraining the model or modifying its internals are therefore unavailable to most institutions.

Study Design: Model Selection, Patient Populations, and Fairness Metric

The research team began with eight EMR-based binary classification models in active use at H+H. Of these, five had at least one year of paired prediction and outcome data available, the minimum needed to conduct a meaningful bias evaluation. After screening on model discrimination (AUROC and precision-recall AUC), two models were retained for full bias analysis and mitigation testing.
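As a rough illustration of the discrimination screen, AUROC can be computed directly from paired predictions and outcomes. The Python sketch below is illustrative only: the function name and toy data are hypothetical, the study summary does not state the screening cutoff, and PR-AUC is left out for brevity.

```python
def auroc(y_true, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation.

    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, y_true))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie block
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


# Toy data (hypothetical): 1 = outcome occurred, 0 = it did not.
toy_outcomes = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_outcomes, toy_scores)
```

A value near 0.5 indicates no better than chance ranking; the two retained models had AUROCs of 0.756 and 0.719.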

The two models retained for bias analysis after AUROC/PR-AUC screening. Both were used for automated patient outreach, making false negatives the primary harm of concern.
Model	Clinical Task	AUROC	Primary Patient Population	Key Sensitive Classes
Asthma Model	Predict acute care utilization (ED visit or hospitalization) for asthma patients	0.756	n = 53,099	Race/ethnicity, sex, preferred language, insurance status
Readmission Model	Predict unplanned 30-day readmission	0.719	n = 139,132	Race/ethnicity, sex, preferred language, insurance status

Both models were used to drive automated patient outreach programs — identifying patients who should receive proactive contact from care teams. In this context, a false negative (a patient who should be flagged but is not) is the critical error: that patient receives no outreach and may miss an intervention. This clinical use case shaped the choice of fairness metric.

The team used Equal Opportunity Difference (EOD) as the primary fairness metric. EOD measures the difference in false-negative rates between a demographic subgroup and a designated referent group. A positive EOD indicates that the subgroup has a higher false-negative rate — meaning the model is more likely to miss patients in that group relative to the referent.
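The EOD definition above can be sketched in a few lines of Python. This is a minimal illustration, not the study's released R code; the function names, group labels, and toy data are assumptions.

```python
from collections import defaultdict


def fnr_by_group(y_true, y_pred, groups):
    """False-negative rate per group: missed positives / all positives."""
    positives = defaultdict(int)
    misses = defaultdict(int)
    for y, p, g in zip(y_true, y_pred, groups):
        if y == 1:
            positives[g] += 1
            if p == 0:
                misses[g] += 1
    return {g: misses[g] / positives[g] for g in positives}


def eod_vs_referent(fnr, referent):
    """EOD as defined in the text: subgroup FNR minus referent FNR."""
    return {g: rate - fnr[referent] for g, rate in fnr.items() if g != referent}


# Toy data (hypothetical): group "A" is the referent.
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1]
groups = ["A"] * 5 + ["B"] * 5

fnr = fnr_by_group(y_true, y_pred, groups)
eod = eod_vs_referent(fnr, referent="A")
```

In this toy data group B has a positive EOD, meaning the model misses a larger share of group B's true positives than of group A's.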

The team pre-defined three criteria for a mitigation approach to qualify as a success:

Absolute EOD below 5 percentage points (pp) for all subgroups within a sensitive class
Accuracy loss less than 10% relative to the baseline model
Alert rate change less than 20% relative to the baseline (to preserve operational feasibility for outreach teams)
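The three criteria above can be expressed as a single check. The Python sketch below assumes that accuracy loss and alert rate change are both measured relative to baseline and that EODs are expressed as fractions (0.05 = 5pp); the function name and the accuracy values in the example are hypothetical.

```python
def check_success_criteria(eods, base_accuracy, new_accuracy,
                           base_alert_rate, new_alert_rate):
    """Return a pass/fail flag per criterion plus an overall verdict."""
    accuracy_loss = (base_accuracy - new_accuracy) / base_accuracy
    alert_change = abs(new_alert_rate - base_alert_rate) / base_alert_rate
    result = {
        "eod_ok": all(abs(e) < 0.05 for e in eods.values()),
        "accuracy_ok": accuracy_loss < 0.10,
        "alert_rate_ok": alert_change < 0.20,
    }
    result["success"] = all(result.values())
    return result


# Hypothetical mitigation run that satisfies all three criteria.
passing = check_success_criteria(
    eods={"B": 0.03, "C": -0.04},
    base_accuracy=0.80, new_accuracy=0.76,
    base_alert_rate=0.124, new_alert_rate=0.130,
)

# Same hypothetical EODs and accuracy, but with an alert rate that
# falls from 0.124 to 0.081, which breaks the 20% bound.
failing = check_success_criteria(
    eods={"B": 0.03, "C": -0.04},
    base_accuracy=0.80, new_accuracy=0.76,
    base_alert_rate=0.124, new_alert_rate=0.081,
)
```

Checking the absolute value of each EOD is what distinguishes the absolute criterion from a signed one: a subgroup at -0.08 would pass a signed test but fail this one.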

Asthma Model Results: Racial and Ethnic Bias at Baseline and After Threshold Adjustment

At baseline, the asthma model showed racial and ethnic bias exceeding the 5pp EOD threshold across all race/ethnicity subgroups evaluated. False-negative rates ranged from 0.51 for Black/African American patients to 0.828 for White patients, a spread of roughly 32 percentage points, with each subgroup's EOD measured against the designated referent group. Some groups were therefore substantially more likely than others to be missed by the model's outreach trigger.

Figure: Conceptual representation of the threshold adjustment effect. Subgroup false-negative rates diverge at baseline and re-equalize after custom threshold correction. Source: Medica Intelligence editorial illustration.

Custom threshold adjustment — implemented in R by the H+H team — brought all race/ethnicity absolute EODs below the 5pp threshold. Critically, this was achieved while keeping accuracy loss under 10% and alert rate change under 20%, satisfying all three pre-defined success criteria.
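One way to picture group-specific threshold adjustment is a search, per subgroup, for the cutoff whose false-negative rate best matches the referent's. The Python sketch below is an assumption about how such a search could work, not a port of the H+H team's R code; the function names, the candidate grid, the tie-break rule, and the toy data are all hypothetical.

```python
def fnr_at(scores, labels, threshold):
    """FNR when a patient is flagged if score >= threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("no positive outcomes in this group")
    return sum(s < threshold for s in positives) / len(positives)


def fit_group_thresholds(scores, labels, groups, referent, base_threshold, grid):
    """Pick, for each non-referent group, the grid threshold whose FNR is
    closest to the referent's FNR at the baseline threshold. Ties go to
    the candidate nearest the baseline threshold."""
    by_group = {}
    for s, y, g in zip(scores, labels, groups):
        group_scores, group_labels = by_group.setdefault(g, ([], []))
        group_scores.append(s)
        group_labels.append(y)

    target = fnr_at(*by_group[referent], base_threshold)
    thresholds = {referent: base_threshold}
    for g, (s, y) in by_group.items():
        if g != referent:
            thresholds[g] = min(
                grid,
                key=lambda t: (abs(fnr_at(s, y, t) - target),
                               abs(t - base_threshold)),
            )
    return thresholds


# Toy data (hypothetical): group "A" is the referent, baseline cutoff 0.5.
scores = [0.2, 0.6, 0.7, 0.8, 0.3, 0.1, 0.3, 0.4, 0.6, 0.2]
labels = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
groups = ["A"] * 5 + ["B"] * 5
thresholds = fit_group_thresholds(
    scores, labels, groups, referent="A",
    base_threshold=0.5, grid=[0.1, 0.2, 0.3, 0.4, 0.5],
)
```

A search like this only targets EOD. Accuracy and alert rate still have to be re-checked at the chosen thresholds, because lowering a group's cutoff flags more patients in that group.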

Aequitas, an open-source Python library, brought the signed (non-absolute) EOD below 5pp but did not meet the absolute EOD criterion for all subgroups. The distinction matters when the goal is to ensure that no subgroup is left with a false-negative rate that differs substantially from the referent's. Reject option classification (ROC), implemented via AIF360 and Sony nnabla, failed on the alert rate criterion: it reduced the alert rate from 0.124 to 0.081, a relative decrease of about 35%, which would have made the outreach program operationally unsustainable.

Readmission Model Results: Insurance-Based Bias and Mitigation Outcomes

The 30-day readmission model showed insurance-based bias across all insurance subgroups at baseline. False-negative rates ranged from 0.382 for Medicare patients to 0.797 for Self-Pay patients — a gap of more than 40 percentage points between the best- and worst-served groups. Insurance status is a sensitive class that frequently correlates with race, income, and access to primary care, making this pattern particularly consequential at a safety-net institution.

Baseline false-negative rates by insurance subgroup in the readmission model. All subgroups exceeded the 5pp EOD threshold before mitigation. Specific intermediate values were not individually reported in the accessible study summary.
Insurance Subgroup	Baseline FNR	EOD vs. Referent
Medicare	0.382	Lowest (referent or near-referent)
Medicaid	Not specified individually	>5pp at baseline
Commercial	Not specified individually	>5pp at baseline
Self-Pay	0.797	Highest disparity

Custom threshold adjustment again succeeded where the alternatives did not. It brought all insurance-class absolute EODs below 5pp while keeping accuracy and alert rate within the pre-defined bounds. ROC failed the alert rate criterion for the readmission model as well, decreasing the alert rate from 0.255 to 0.174 — a reduction exceeding 30% that would have materially impaired the outreach program's reach.
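The alert rate figures reported for ROC on both models can be checked with simple arithmetic. In the Python sketch below the helper name is hypothetical; the four rates are the ones quoted in the text.

```python
def relative_change(before, after):
    """Signed change relative to the starting value."""
    return (after - before) / before


asthma_change = relative_change(0.124, 0.081)       # asthma model under ROC
readmission_change = relative_change(0.255, 0.174)  # readmission model under ROC

# Both reductions exceed the 20% bound on alert rate change.
roc_within_bound = all(
    abs(change) < 0.20 for change in (asthma_change, readmission_change)
)
```

The asthma model's alert rate falls by roughly 35% and the readmission model's by roughly 32%, so ROC misses the alert rate criterion in both cases.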

Mitigation Method Comparison: Custom Thresholding, Aequitas, and Reject Option Classification

The H+H team tested three distinct post-processing mitigation approaches across both models. The comparison reveals not only which methods worked but also why existing open-source tooling presents practical barriers for resource-limited implementers.

Figure: Schematic comparison of the three post-processing mitigation approaches evaluated by NYC H+H. Only custom threshold adjustment met all three pre-defined success criteria for both models.

Head-to-head comparison of three post-processing bias mitigation methods tested on two live EMR models at NYC H+H. 'Qualified success' means all three pre-defined criteria (absolute EOD <5pp, accuracy loss <10%, alert rate change <20%) were met.
Method	Implementation	Asthma Model Outcome	Readmission Model Outcome	Alert Rate Impact	Practical Barrier
Custom threshold adjustment	In-house R code	Qualified success: all absolute EODs <5pp	Qualified success: all absolute EODs <5pp	Within 20% threshold	Requires in-house R coding capacity; no major library dependency
Aequitas (balanced group thresholding)	Python library	Partial success: signed EOD <5pp but absolute EOD criterion not met	Not specified as fully successful	Within threshold	Outdated documentation; limited adaptability to institution-specific criteria
Reject option classification (ROC)	AIF360 + Sony nnabla	Failed: alert rate criterion not met	Failed: alert rate criterion not met	Dropped >30% in both models	Poor documentation; alert rate reduction operationally unsustainable

The team found that existing open-source libraries — Aequitas and AIF360 — had documentation that was difficult to adapt to their specific institutional criteria and sensitive class definitions. Custom R code, while requiring more upfront development effort, was ultimately more transparent and efficient for this use case. The released code repository is intended to reduce that development burden for other institutions.

Key Limitations

Single-system generalizability: The study was conducted at one urban safety-net institution. Results may not transfer directly to academic medical centers, rural health systems, or institutions with substantially different patient demographic compositions.
EOD-only optimization: Mitigation was optimized specifically for Equal Opportunity Difference. Using a different fairness metric — such as demographic parity or predictive parity — would likely produce different thresholds and potentially different success or failure outcomes for each method.
No intersectional bias analysis: The study evaluated sensitive classes (race/ethnicity, sex, language, insurance) separately. Intersectional combinations — such as Black women on Medicaid — were not analyzed. Threshold adjustment for one class may not address compounding disparities across multiple dimensions simultaneously.
Missing data not imputed: Patients with missing values for sensitive class variables were excluded from subgroup analyses. This may undercount bias in groups where data completeness is itself unequal — a common issue in safety-net EHR data.
No real-world outcome measurement: The study demonstrates that threshold adjustment reduces EOD disparity in model predictions. It does not yet measure whether the adjusted thresholds translate into equitable clinical outcomes for patients — that requires prospective implementation research.
Model-specific thresholds: Adjusted thresholds are calibrated to the specific models, patient populations, and time periods studied. Thresholds will require recalibration if models are updated, if patient population composition shifts, or if the models are deployed at a different institution.
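To make the intersectional limitation concrete, the Python sketch below computes false-negative rates per combination of sensitive attributes rather than per attribute. It is a hypothetical extension, not something the study performed; the record layout, field names, and `min_positives` cutoff are assumptions.

```python
from collections import defaultdict


def intersectional_fnr(records, keys, min_positives=1):
    """FNR for each combination of the attributes named in `keys`.

    Each record is a dict with 'y_true', 'y_pred', and one field per key.
    Cells with fewer than `min_positives` true positives are dropped,
    since their rates would be too noisy to interpret.
    """
    positives = defaultdict(int)
    misses = defaultdict(int)
    for record in records:
        if record["y_true"] != 1:
            continue
        cell = tuple(record[k] for k in keys)
        positives[cell] += 1
        if record["y_pred"] == 0:
            misses[cell] += 1
    return {
        cell: misses[cell] / positives[cell]
        for cell in positives
        if positives[cell] >= min_positives
    }


# Toy records (hypothetical attribute values).
records = [
    {"y_true": 1, "y_pred": 0, "sex": "F", "insurance": "Medicaid"},
    {"y_true": 1, "y_pred": 1, "sex": "F", "insurance": "Medicaid"},
    {"y_true": 1, "y_pred": 1, "sex": "M", "insurance": "Medicaid"},
    {"y_true": 1, "y_pred": 1, "sex": "M", "insurance": "Medicaid"},
    {"y_true": 0, "y_pred": 1, "sex": "F", "insurance": "Medicare"},
]
cell_fnr = intersectional_fnr(records, keys=("sex", "insurance"))
```

Intersectional cells shrink quickly as attributes are combined, which is one practical reason a minimum cell size is needed before comparing rates.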

Clinical Relevance: Implementation Pathway and Open-Source Playbook

The H+H study's most immediately actionable contribution is not a theoretical framework — it is a transferable process. The team released both an open-source R code repository and a Supplementary Playbook specifically designed for replication by other low-resource health systems. The repository (GitHub: shainamackin/nyc-hh_algorithmic-bias-mitigation) provides the custom threshold adjustment code used in the study, along with documentation structured for institutions that may not have dedicated data science teams.

For health system administrators and AI implementation teams, the study outlines a replicable sequence:

Inventory deployed AI models and identify those with at least one year of paired prediction and outcome data.
Screen models for minimum discrimination performance (AUROC/PR-AUC) before investing in bias evaluation — low-performing models should be addressed at the model level, not just the threshold level.
Define sensitive classes relevant to your patient population and institutional context (race/ethnicity, insurance status, language, sex, and others as appropriate).
Select a fairness metric aligned with the clinical use case — EOD is appropriate when false negatives are the primary harm; other use cases may warrant different metrics.
Pre-specify success thresholds before running mitigation — the H+H team defined absolute EOD <5pp, accuracy loss <10%, and alert rate change <20% in advance, preventing post-hoc rationalization of results.
Apply custom threshold adjustment using the released R code as a starting point, then validate that all three criteria are met before deploying adjusted thresholds in production.
Plan for ongoing monitoring — adjusted thresholds require recalibration as models, patient populations, or clinical workflows change.
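The monitoring step at the end of the sequence above could take the form of a periodic re-check of subgroup EODs against the pre-specified limit. The Python sketch below is one hypothetical shape for that check; the function name, the quarterly cadence, and the EOD values are illustrative assumptions.

```python
def periods_needing_recalibration(eod_by_period, limit=0.05):
    """Return (period, worst absolute EOD) for every monitoring period in
    which any subgroup's absolute EOD reaches or exceeds the limit."""
    flagged = []
    for period, eods in eod_by_period.items():
        worst = max(abs(e) for e in eods.values())
        if worst >= limit:
            flagged.append((period, worst))
    return flagged


# Hypothetical quarterly EODs after adjusted thresholds went live.
eod_by_period = {
    "2025-Q1": {"B": 0.02, "C": -0.03},
    "2025-Q2": {"B": 0.07, "C": 0.01},
}
flagged = periods_needing_recalibration(eod_by_period)
```

A flagged period would be a prompt to refit the thresholds and re-validate all three success criteria before continuing to use them.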

This study's approach aligns with the broader policy direction established by the STANDING Together 2025 consensus recommendations (published in NEJM AI), a Delphi-based international consensus developed with input from more than 350 representatives across 58 countries. STANDING Together addresses both documentation of health datasets and the identification and mitigation of algorithmic biases — framing bias evaluation not as an optional quality improvement activity but as a standard expectation for responsible AI deployment.

The broader literature context reinforces why this matters. A 2025 scoping review of clinical AI fairness found that post-processing methods remain the least-studied category of bias mitigation, and that real-world implementation studies — as opposed to controlled model-development experiments — are rare. The H+H study is notable precisely because it was conducted on live models serving real patients, with operational constraints (alert rate limits, accuracy requirements) that reflect the actual conditions health systems face.
