Evaluating equity in performance of an electronic health record-based 6-month mortality risk model to trigger palliative care consultation: a retrospective model validation analysis
  1. Stephanie Teeple (1,2)
  2. Corey Chivers (3)
  3. Kristin A Linn (1)
  4. Scott D Halpern (2,4)
  5. Nwamaka Eneanya (2,4)
  6. Michael Draugelis (5)
  7. Katherine Courtright (2,4)

Author affiliations
  1. Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
  2. Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
  3. Proscia, Inc, Philadelphia, Pennsylvania, USA
  4. Department of Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
  5. Hackensack Meridian Health, Edison, New Jersey, USA

Correspondence to Stephanie Teeple, Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA; stephanie.teeple{at}


Objective Evaluate predictive performance of an electronic health record (EHR)-based, inpatient 6-month mortality risk model developed to trigger palliative care consultation among patient groups stratified by age, race, ethnicity, insurance and socioeconomic status (SES), which may vary due to social forces (eg, racism) that shape health, healthcare and health data.

Design Retrospective evaluation of prediction model.

Setting Three urban hospitals within a single health system.

Participants All patients ≥18 years admitted between 1 January and 31 December 2017, excluding observation, obstetric, rehabilitation and hospice (n=58 464 encounters, 41 327 patients).

Main outcome measures General performance metrics (c-statistic, integrated calibration index (ICI), Brier Score) and additional measures relevant to health equity (accuracy, false positive rate (FPR), false negative rate (FNR)).

Results For black versus non-Hispanic white patients, the model’s accuracy was higher (0.051, 95% CI 0.044 to 0.059), FPR lower (−0.060, 95% CI −0.067 to −0.052) and FNR higher (0.049, 95% CI 0.023 to 0.078). A similar pattern was observed among patients who were Hispanic, younger, with Medicaid/missing insurance, or living in low SES zip codes. No consistent differences emerged in c-statistic, ICI or Brier Score. Younger age had the second-largest effect size in the mortality prediction model, and there were large standardised group differences in age (eg, 0.32 for non-Hispanic white versus black patients), suggesting age may contribute to systematic differences in the predicted probabilities between groups.

Conclusions An EHR-based mortality risk model was less likely to identify some marginalised patients as potentially benefiting from palliative care, with younger age pinpointed as a possible mechanism. Evaluating predictive performance is a critical preliminary step in addressing algorithmic inequities in healthcare, which must also include evaluating clinical impact, and governance and regulatory structures for oversight, monitoring and accountability.

  • evaluation methodology
  • decision support, computerized
  • information technology

Data availability statement

No data are available.



  • Clinical prediction models may vary in their predictive performance across sociodemographic groups due to social forces (eg, racism) that shape health, healthcare and health data.

  • Inequities in predictive performance are rarely examined empirically, and no consensus guidelines exist about the best way to do so.


  • We identified disparities in the predictive performance of an electronic health record-based 6-month mortality risk model across patient sociodemographic groups, where it underpredicted mortality risk for some marginalised patients. Thus, it was less likely to identify marginalised patients as likely to benefit from palliative care services; actual impact to care delivery and patient outcomes has yet to be evaluated.

  • These disparities occurred despite the fact that no ‘sensitive’ social predictors were included in the model (beyond age and binary sex) and no consistent pattern appeared focusing on general performance metrics alone.


  • Algorithmic inequity in healthcare is complex, and structures (for evaluation, governance and regulation) are urgently needed in both research and practice to protect patient safety.


Interest in machine learning (ML) and/or artificial intelligence (AI) for clinical decision support has exploded in recent years. The number of biomedical journal articles mentioning ML/AI increased by 1984% over the past decade,1 the market value of AI in healthcare is projected to reach $31.3 billion by 2025,2 and Food and Drug Administration approvals of ML/AI-based technologies have steadily increased.3 However, there is increasing recognition that models are likely to have unequal performance across patient subgroups.4–7 Yet, the rapid uptake of ML/AI tools in healthcare has outpaced the necessary assessment of the potential for such algorithms to entrench or exacerbate health inequities.8–11

The potential for ‘algorithmic bias’ in clinical prediction models, whether ML-based or regression-based, emerges in part via the use of electronic health record (EHR) data. Racism and other social forces not only cause differential disease distribution among oppressed groups,12–16 but also fundamentally shape the delivery of healthcare in the USA.17 Racism, for example, is subsequently encoded in the EHR in myriad ways, including data missingness due to barriers to care, differential ordering of tests or treatment, implicit or explicit bias in documentation, and organisation-level and policy-level factors.17–23

Health systems are increasingly using EHR-based prediction models to identify patients most likely to benefit from specific interventions. A recurrent example of this is the use of prognostic models to predict risk of death or other undesirable outcomes in an effort to improve targeted delivery of supportive or palliative care interventions for serious illness,24–31 long a pressing national priority.31–34 Yet, no studies to date have rigorously evaluated the myriad published EHR-based prognostic models for potential differential predictive performance among patient subgroups, particularly for structurally marginalised patients with reduced access to high-quality serious illness care at baseline.35 Such evaluations are needed to help ensure these models do not exacerbate inequities in access to high-quality serious illness care when incorporated into daily practice. Thus, to demonstrate an approach for comparative evaluation of predictive performance across marginalised sociodemographic patient groups, we use an existing EHR-based prognostic model that was developed to improve inpatient palliative care delivery.30 For this study, we understand differences in predictive performance across social categories as not due to innate differences between people, but rather due to social context which impacts health and healthcare (eg, living in a racist society as a root cause of illness, rather than an individual’s racial identity).8 10 14


Prediction model

This study evaluates a previously published mortality risk model, Palliative Connect, developed at the University of Pennsylvania Health System (UPHS). The model was designed to predict probability of death within 6 months on the second day of an acute care hospital admission, and was used to promote inpatient palliative care consultation for patients with a risk score above a selected threshold.30 The model is a logistic regression model that was fit using backward stepwise selection in a split-sample approach (85% of the total sample was used for a training set and 15% for a test set); see online supplemental appendix table 1 for a full list of predictors and their coefficients. Predictors included comorbidities from the previous decade, lab values from the index admission and admission type (eg, elective or emergent). Two patient demographic variables, age and binary sex, were also included.30 The outcome for the prediction model was death within 6 months, defined by <180 days between hospital admission and death dates.
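As an illustration of how a logistic regression risk score and the 6-month outcome definition fit together, the following is a minimal Python sketch. It is not the authors' implementation (the study was conducted in R), and the function names, the predictor names and the coefficient values are hypothetical placeholders rather than the published Palliative Connect coefficients.

```python
import math
from datetime import date

def predicted_risk(intercept, coefficients, features):
    """Logistic regression prediction: inverse logit of the linear predictor."""
    linear = intercept + sum(coefficients[name] * features[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-linear))

def died_within_6_months(admission_date, death_date):
    """Outcome as defined in the text: <180 days between admission and death."""
    if death_date is None:
        return False
    return (death_date - admission_date).days < 180

# Hypothetical coefficients and patient values, for illustration only.
toy_coefficients = {"age_decades": 0.3, "emergent_admission": 0.8}
toy_patient = {"age_decades": 7.0, "emergent_admission": 1.0}
risk = predicted_risk(-4.0, toy_coefficients, toy_patient)
```

The resulting `risk` is a probability between 0 and 1 that would then be compared with a chosen threshold.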

Data sources

The data sources used for this study include the EHR from three urban hospitals within UPHS, the Social Security Death Master File (SSDMF)36 and the American Community Survey (ACS), all from 2017.37 We collected the mortality risk model predictors, patients’ zip code, race, ethnicity, insurance type and death date from the EHR. The ACS is an annual survey administered by the Census Bureau to a random sampling of all US households. We merged the ACS with EHR data to generate zip code level estimates of household income and educational attainment. Finally, we merged the EHR data with the SSDMF using social security number and date of birth to determine vital status and death date. Among those who died, EHR death date was preferred if there was a missing or conflicting date in the SSDMF.

Study population

The original Palliative Connect training cohort was constructed via an 85/15 training/test split stratified by patient (ie, if a patient is selected for the training set, all their encounters are included in the training set). Inclusion criteria were all admissions in the 2016 calendar year for patients ≥18 years, excluding observation, obstetric, rehabilitation and hospice admissions (n=55 500 encounters corresponding to 40 000 unique patients). The test cohort for this evaluation project included all admissions from 2017 that met the same inclusion criteria (n=58 464 encounters corresponding to 41 327 unique patients).

Patient variables

We identified patient subgroups of interest based on existing health disparities evidence, our hypotheses stemming from a social constructivist framework (eg, individual-level measures of socioeconomic status (SES) are related to health via larger mechanisms like privatisation of healthcare),8 10 14 38 and which had sufficient sample size to support our analyses (eg, ≥10 occurrences in both Palliative Connect outcome categories—10 patients who died within 6 months of an index hospital encounter and 10 patients who survived).

EHR variables

The binary sex variable contained two categories (male, female). Sex in EHR data refers to a person’s biological and physiological characteristics, is assigned at birth, and is distinct from gender identity and sexual orientation.39 This EHR data source did not include a category for intersex or people of other sexes. Patient age was defined at the time of admission and categorised into quartiles to evaluate model performance among older versus younger patients. Insurance status was categorised as Medicaid, Medicare, managed care and private. Medicare is a federal insurance programme in the USA for people 65 years and older; Medicaid is a US federal-state assistance programme for low-income people; private insurance is sold by health insurance companies, as are managed care plans. Missing insurance data were considered a proxy for being uninsured.40 The variable for patient race contained eight categories (American Indian/Alaskan Native, Asian, Black or African-American, Native Hawaiian/Pacific Islander, white, mixed, other, unknown). Discrepancies between EHR racial categorisations and self-reported racial identity data are well-documented, with related limitations from the use of a small number of a priori categories determined by the Office of Management and Budget.41–43 Thus, we understand that the patient race variable best reflects how a patient is racialised by healthcare institutions, and therefore a patient’s experience of racism, both structural and interpersonal, in healthcare delivery (patients with race coded as ‘Black or African-American’ are assumed to be racialised as black).8 14 15 44 Similarly, given the enormous heterogeneity of people labelled as ‘Hispanic’ and the significant limitation of a single ethnic category,45 we use the ethnicity variable (‘Hispanic’ vs ‘non-Hispanic’) as distinct from race and as a proxy for position within society (eg, systematic exclusion from jobs with adequate sick leave policies) rather than sociocultural characteristics (eg, referring to a specific diet or language).44–47

ACS variables

The two SES measures were zip code level median household income and zip code level educational attainment, defined as the proportion of residents >25 years of age who completed a bachelor’s degree or higher. Both SES variables were categorised into quartiles to enable comparisons of higher to lower levels.


The primary outcomes for this study were six performance metrics used to evaluate Palliative Connect predictions: c-statistic (or area under the curve), integrated calibration index (ICI), Brier Score, accuracy, false positive rate (FPR) and false negative rate (FNR).

Statistical analyses

We compared the model’s predictive performance across selected strata of age, sex, race, ethnicity, insurance status, zip code level household income and zip code level educational attainment, using Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines and additional performance metrics for validation. The reference group for each variable was selected based on existing evidence and our hypotheses regarding the most structurally advantaged groups in US healthcare generally and palliative care services specifically. The reference groups for the stratifying variables were, respectively: age (oldest quartile); race (non-Hispanic white patients); ethnicity (non-Hispanic white patients); insurance (patients with Medicare); household income (patients residing in zip codes in the highest quartile of household income); educational attainment (patients residing in zip codes in the highest quartile of educational attainment).48

We used a nonparametric pairwise bootstrapping approach. Provided that the sample size is sufficient, sampling bias is minimal and the quantity being estimated is not an extreme value (eg, the maximum), nonparametric bootstrapping is a flexible way to compare two statistics without making distributional assumptions.49 Estimating predictive performance for patients in all racial categories is important. However, there was little variation in mortality in the test data set for patients coded as American Indian/Alaskan Native, Native Hawaiian/Pacific Islander or mixed race (<10 patients in each of these categories died during the study period), resulting in performance estimates that were undefined and/or implausibly optimistic. Thus, these subgroups were excluded from analysis. Furthermore, studying and theorising inequities in predictive performance and/or healthcare delivery for patient populations coded as ‘other’ or ‘unknown’ race is critically important, but outside the scope of this study.

First, six identical copies of the test data, corresponding to the six stratifying variables, were partitioned into strata defined by the subgroups of interest. We used the previously published coefficients of the Palliative Connect model to generate predictions of 6-month mortality risk for each encounter in this test data.30 Then, each subgroup data set was resampled with replacement 500 times at the patient level. For each iteration, we calculated six predictive performance metrics: Brier Score, c-statistic, ICI, accuracy, FPR and FNR.
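The patient-level resampling step can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the authors' R code: the function name `bootstrap_metric`, the `(patient_id, predicted_risk, died)` tuple layout and the toy data are all hypothetical.

```python
import random
from collections import defaultdict

def bootstrap_metric(encounters, metric, n_boot=500, seed=0):
    """Resample patients with replacement and recompute a metric each time.

    encounters: list of (patient_id, predicted_risk, died) tuples (assumed layout).
    metric: function that takes a list of (predicted_risk, died) pairs.
    """
    rng = random.Random(seed)
    by_patient = defaultdict(list)
    for patient_id, risk, died in encounters:
        by_patient[patient_id].append((risk, died))
    patient_ids = sorted(by_patient)
    estimates = []
    for _ in range(n_boot):
        # Draw patients, not encounters, so a patient's encounters stay together.
        sample = rng.choices(patient_ids, k=len(patient_ids))
        rows = [row for pid in sample for row in by_patient[pid]]
        estimates.append(metric(rows))
    return estimates

def brier(rows):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((risk - died) ** 2 for risk, died in rows) / len(rows)

# Toy subgroup data set, for illustration only.
toy = [("a", 0.8, 1), ("a", 0.6, 1), ("b", 0.1, 0), ("c", 0.4, 0), ("d", 0.2, 1)]
draws = bootstrap_metric(toy, brier, n_boot=500)
```

Running the same function on a subgroup and on its reference group, then subtracting the paired draws, gives a bootstrap distribution for the difference in a metric.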

Performance metrics

The TRIPOD guidelines state evaluations should report discrimination and calibration, with ‘overall’ performance measures common but optional.50 We used the c-statistic, or area under the receiver operating characteristic (ROC) curve, as a measure of discrimination. For calibration, we used the ICI, where a lower number indicates better model calibration.51 Finally, we use the Brier Score as an overall measure of how close probabilistic predictions are to the actual outcome.
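For readers less familiar with these measures, the c-statistic and Brier Score can be computed directly from predicted risks and observed outcomes. The sketch below is a plain Python illustration with hypothetical function names and toy data; the ICI is omitted because it requires a smoothed (loess-type) calibration curve that is beyond a short example.

```python
def c_statistic(risks, outcomes):
    """Share of (death, survivor) pairs in which the death has the higher risk.

    Ties count as half. Returns NaN when a group has no deaths or no survivors,
    in which case the statistic is undefined.
    """
    deaths = [r for r, y in zip(risks, outcomes) if y == 1]
    survivors = [r for r, y in zip(risks, outcomes) if y == 0]
    if not deaths or not survivors:
        return float("nan")
    concordant = 0.0
    for d in deaths:
        for s in survivors:
            if d > s:
                concordant += 1.0
            elif d == s:
                concordant += 0.5
    return concordant / (len(deaths) * len(survivors))

def brier_score(risks, outcomes):
    """Mean squared error of probabilistic predictions (lower is better)."""
    return sum((r - y) ** 2 for r, y in zip(risks, outcomes)) / len(risks)

# Toy predictions and outcomes, for illustration only.
toy_risks = [0.9, 0.4, 0.35, 0.1]
toy_outcomes = [1, 0, 1, 0]
auc = c_statistic(toy_risks, toy_outcomes)
bs = brier_score(toy_risks, toy_outcomes)
```

The pairwise loop is quadratic in sample size; for large cohorts a rank-based formula or an established library routine would be the practical choice.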

We examined additional performance metrics salient to health equity and clinical decision-making: accuracy (% correctly classified), FPR (FP/(FP+TN), where FP is false positives and TN is true negatives) and FNR (FN/(FN+TP), where FN is false negatives and TP is true positives). While these are not proper scoring rules (it is possible to obtain a perfect score with a model that makes errors),52 they provide meaningful comparisons between and within prediction models for purposes of examining equity.53 54 For example, these metrics facilitate a cost-asymmetrical analysis at a chosen, clinically relevant risk threshold.55–58
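A minimal Python sketch of these three threshold-based metrics follows. The function name and toy data are hypothetical; the default threshold of 0.30 mirrors the ≥30% risk threshold used in this study.

```python
def classification_metrics(risks, outcomes, threshold=0.30):
    """Accuracy, FPR and FNR when risks at or above the threshold are flagged."""
    tp = fp = tn = fn = 0
    for risk, died in zip(risks, outcomes):
        flagged = risk >= threshold
        if flagged and died:
            tp += 1
        elif flagged and not died:
            fp += 1
        elif not flagged and died:
            fn += 1
        else:
            tn += 1
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "fpr": fp / (fp + tn) if (fp + tn) else nan,  # undefined with no survivors
        "fnr": fn / (fn + tp) if (fn + tp) else nan,  # undefined with no deaths
    }

# Toy predictions and outcomes, for illustration only.
toy_risks = [0.9, 0.4, 0.35, 0.1, 0.2]
toy_outcomes = [1, 0, 1, 0, 1]
at_30 = classification_metrics(toy_risks, toy_outcomes, threshold=0.30)
at_50 = classification_metrics(toy_risks, toy_outcomes, threshold=0.50)
```

In the toy data, raising the threshold lowers the FPR and raises the FNR, which is the trade-off a cost-asymmetrical analysis weighs.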

For this analysis, we used the same risk threshold of ≥30% mortality as was done in the small clinical pilot study.30 In a sensitivity analysis, we used a higher threshold (≥50%) since the pilot study results suggested that ≥30% may be overly sensitive relative to patients’ actual palliative care needs and/or practical limitations of the palliative care team. All performance metrics were calculated at the encounter level. For classification metrics (accuracy, FPR, FNR), we further summarised at the patient level over multiple encounters (if applicable) to align model performance assessment with the clinical use-case.30 Specifically, some patients have multiple encounters within the span of 6 months before their death. In practice, patients only need to be flagged for consultation once, after which the palliative care team will follow as appropriate.30 Thus, if a patient had at least one encounter with a corresponding model prediction above the selected threshold in the 6 months before death, they were considered a TP (and an FN if they had zero encounters with a predicted risk above threshold). If the patient survived the entire study period and had no encounters with a predicted mortality risk above the threshold, they were considered a TN (and an FP if they had at least one encounter above the threshold). If a patient appeared in the data for longer than 6 months (and then died), they contributed two classifications (TN/FP for the first time period and TP/FN for the 6 months directly prior to their death). The percentile method was used to generate 95% CIs for each metric.59 Results were considered statistically significant if the CIs of the bootstrapped difference (subgroup minus reference group) did not cross zero. All analyses were conducted in R V.3.6.1. The analytical workflow, with the race variable as an exemplar, can be found in the appendix (online supplemental appendix figure 1).
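The patient-level classification rules and the percentile interval can be sketched as follows. This is a Python illustration under stated assumptions rather than the authors' R workflow: the function names and data layout are hypothetical, the 180-day window reuses the outcome definition, and the handling of earlier encounters for patients who later died is one possible reading of the two-classification rule.

```python
from datetime import date, timedelta
from statistics import quantiles

WINDOW = timedelta(days=180)  # 6-month window, per the <180 day outcome definition

def classify_patient(encounters, death_date=None, threshold=0.30):
    """Return patient-level labels from (admission_date, predicted_risk) pairs."""
    if death_date is None:
        flagged = any(risk >= threshold for _, risk in encounters)
        return ["FP" if flagged else "TN"]
    recent = [risk for adm, risk in encounters if death_date - adm < WINDOW]
    earlier = [risk for adm, risk in encounters if death_date - adm >= WINDOW]
    labels = []
    if earlier:  # assumption: encounters before the final window are scored as survival
        labels.append("FP" if any(r >= threshold for r in earlier) else "TN")
    if recent:
        labels.append("TP" if any(r >= threshold for r in recent) else "FN")
    return labels

def percentile_ci(draws):
    """95% percentile interval: the 2.5th and 97.5th percentiles of the draws."""
    cuts = quantiles(draws, n=40, method="inclusive")
    return cuts[0], cuts[-1]

def excludes_zero(interval):
    """True when the interval for a bootstrapped difference does not cross zero."""
    low, high = interval
    return low > 0 or high < 0

survivor = classify_patient([(date(2017, 3, 1), 0.10), (date(2017, 9, 1), 0.35)])
long_stay = classify_patient(
    [(date(2017, 1, 10), 0.20), (date(2017, 10, 15), 0.45)],
    death_date=date(2017, 12, 1),
)
interval = percentile_ci(list(range(101)))
```

Here `survivor` is flagged once and never dies, so it counts as a single FP, while `long_stay` contributes one label for the earlier period and one for the final 6 months.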

We conducted a secondary analysis to identify potential mechanisms of predictive performance differences. We performed bivariate analyses between the original model’s predictor coefficients and the standardised mean difference in each predictor variable between the reference group and subgroup of interest. Standardised mean difference is defined as the difference between the two group means divided by the SD of the variable, and can be a positive or negative value. If a predictor in the model has either (1) a large positive effect size and a large positive standardised mean difference or (2) a large negative effect size and a large negative standardised mean difference, then that predictor likely contributes to systematic differences in the predicted probabilities between the two groups.
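A minimal Python sketch of this calculation is given below. The function names and toy ages are hypothetical, and the use of the SD of the variable across both groups combined is an assumption, since the text does not specify which SD was used.

```python
from statistics import mean, pstdev

def standardised_mean_difference(reference, subgroup):
    """Reference-group mean minus subgroup mean, divided by the variable's SD.

    Assumption: the SD is taken over both groups combined.
    """
    sd = pstdev(list(reference) + list(subgroup))
    if sd == 0:
        return 0.0
    return (mean(reference) - mean(subgroup)) / sd

def may_drive_difference(coefficient, smd):
    """True when effect size and standardised mean difference share a sign."""
    return coefficient * smd > 0

# Toy ages, for illustration only.
reference_ages = [70, 80]
subgroup_ages = [50, 60]
smd = standardised_mean_difference(reference_ages, subgroup_ages)
```

A sign check alone does not capture magnitude; in practice both the coefficient and the standardised mean difference would need to be large for the predictor to matter.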

Patient and public involvement

Patients or the public were not directly involved in the design, conduct, reporting or dissemination of this research.


The test data included 58 464 encounters among 41 327 patients (table 1). In the test data, the median patient age was 60.8 years (IQR 47.8–71.2) and 20 511 (49.6%) were male. The majority of patients in the test data were categorised as white (22 962, 55.6%) or black (14 428, 34.9%); the majority were insured through Medicare (18 360, 44.4%) or a managed care plan (11 077, 26.8%). The median zip code level household income was $58 784 (IQR $33 177 to $80 363) and the median proportion of adults ≥25 years of age who completed high school as the highest level of educational attainment was 31.9% (IQR 22.8%–37.8%). Of the patients in the test data, 8.9% died within the study period. The test cohort was broadly comparable to the training cohort, but patients were older (median age 60.8 years vs 58.5 years), and the cohort included more male patients (49.6% vs 45.7%) and more Medicare patients (44.4% vs 39.2%). See table 1 for additional study cohort characteristics.

Table 1

Characteristics of the Palliative Connect training cohort versus study test cohort at the patient level

TRIPOD performance metrics

For the test cohort overall, the c-statistic was 0.816 (95% CI 0.811 to 0.821), the Brier Score 0.087 (95% CI 0.085 to 0.089) and the ICI 0.014 (95% CI 0.012 to 0.015) (online supplemental appendix table 3); see online supplemental appendix table 2 for point estimates of the TRIPOD performance metrics by subgroup. The c-statistic was significantly higher (bootstrapped difference 0.075, 95% CI 0.052 to 0.093) and Brier Score (−0.109, 95% CI −0.114 to −0.104) and ICI (−0.035, 95% CI −0.042 to −0.028) were significantly lower in the youngest versus the oldest patients (table 2). This pattern of better Brier Score, discrimination and calibration was consistent for the second and third younger quartiles compared with the oldest, and for non-Medicare (except for those missing insurance information) versus Medicare (figure 1). For black versus non-Hispanic white patients and female versus male patients, the Brier Score and discrimination were significantly better; calibration results were non-significant. In contrast, for Hispanic versus non-Hispanic white patients and for Asian versus non-Hispanic white patients, discrimination and Brier Score did not significantly differ, and calibration was significantly worse. For patients in the lowest quartile of household income, all three measures were significantly lower versus the patients in the highest quartile of household income; for patients in the second-lowest quartile, only the ICI was significantly higher. For the lowest quartile of educational attainment, the Brier Score and ICI were significantly lower; for the second quartile the Brier Score was significantly higher and c-statistic lower, and for the third quartile, the c-statistic was again significantly lower.

Table 2

Differences in model predictive performance by metric, patient subgroup minus corresponding reference group

Figure 1

Model predictive performance for each subgroup, TRIPOD-recommended metrics. Age quartiles comprised: youngest (18.1–47.8 years), second quartile (47.8–60.8 years), third quartile (60.8–71.2 years) and oldest (71.2 to ≥90 years). Zip code level median household income quartiles comprised: lowest quartile ($11 269 to $33 117), second quartile ($33 117–$58 784), third quartile ($58 784–$80 363) and highest quartile ($80 363–$225 598). Zip code level educational attainment (proportion of residents ≥25 years old who completed at least a bachelor’s degree, inclusive of all higher levels) quartiles comprised: lowest quartile (0%–21.9%), second quartile (21.9%–28.6%), third quartile (28.6%–48.7%) and highest quartile (48.8%–100%). ICI, integrated calibration index; TRIPOD, Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis.

Health equity performance metrics

For the test cohort overall, the accuracy was 0.839 (95% CI 0.835 to 0.844), the FPR 0.128 (95% CI 0.123 to 0.131) and the FNR 0.419 (95% CI 0.406 to 0.435) (online supplemental appendix table 3); see online supplemental appendix table 2 for point estimates of the health equity performance metrics by subgroup. For the following patient subgroups relative to their reference, the accuracy of the prediction model was significantly higher, FPR significantly lower and FNR significantly higher: younger, black, Hispanic, Medicaid or missing insurance information, lower median household income, lower educational attainment (table 2). This same pattern was seen in patients with private insurance and for female patients, except for FNR which was not significantly different (figure 2). For Asian compared with non-Hispanic white patients, model accuracy and FPR did not differ, and the FNR was significantly lower. This general trend remained the same when the threshold was raised to ≥50% mortality risk (online supplemental appendix table 4).

Figure 2

Model predictive performance for each subgroup, health equity-relevant metrics. Age quartiles comprised: youngest (18.1–47.8 years), second quartile (47.8–60.8 years), third quartile (60.8–71.2 years) and oldest (71.2 to ≥90 years). Zip code level median household income quartiles comprised: lowest quartile ($11 269 to $33 117), second quartile ($33 117 to $58 784), third quartile ($58 784 to $80 363) and highest quartile ($80 363 to $225 598). Zip code level educational attainment (proportion of residents ≥25 years old who completed at least a bachelor’s degree, inclusive of all higher levels) quartiles comprised: lowest quartile (0%–21.9%), second quartile (21.9%–28.6%), third quartile (28.6%–48.7%) and highest quartile (48.8%–100%). FPR, false positive rate; FNR, false negative rate.

Potential drivers of difference

In our analysis of model predictors as potential drivers of the differences detected in model performance, we found that age had the second-largest (positive) effect size and large standardised mean differences across most subgroups. Urgent admission type had a larger effect size, but negligible standardised mean difference (figure 3). Uncomplicated hypertension and female sex had moderate, negative effect sizes and moderate standardised mean differences for select subgroups, including Hispanic and black patients, and patients in the lowest educational attainment and income quartiles (online supplemental appendix figures 2–6).

Figure 3

Original mortality risk model predictor coefficients versus the standardised mean difference in predictors, non-Hispanic white versus black and Asian patients. All 34 predictors included in the original EHR-based mortality risk model are represented in this plot. Variable coefficient estimates are represented on the x-axis; standardised mean difference in predictors (difference between the two group means divided by the SD of the variable) is represented on the y-axis. The standardised mean differences were all calculated via reference group minus selected subgroup (eg, non-Hispanic white patient mean of a selected predictor − black patient mean of a selected predictor). The predictor contributes to predictive performance disparities if (1) the effect size is large and positive and the standardised mean difference is large and positive or (2) the effect size is large and negative and the standardised mean difference is large and negative. EHR, electronic health record.


In this retrospective model validation analysis, we identified a number of differences in the predictive performance of an EHR-based 6-month mortality risk model in terms of TRIPOD-designated metrics, but these differences did not consistently advantage or disadvantage marginalised groups. For some marginalised groups, all three metrics were markedly improved; for others there were statistically significant differences but of negligible magnitude (eg, c-statistic of 0.812 vs 0.823) or these metrics were worse. For equity-relevant metrics, a more consistent pattern emerged: among patients categorised as black or ‘Hispanic’, younger patients, and patients with Medicaid or missing insurance or living in low SES zip codes, the model had greater accuracy, a lower FPR and a higher FNR. This reflects more conservative (that is, lower probability) predictions for these patients. For example, the difference observed in the FNR by income suggests that the model underpredicted risk for 49.9% of patients from the lowest-income zip codes who died in the subsequent 6 months compared with 31.2% of patients from the highest-income zip codes. If the model were applied deterministically in clinical care (eg, without clinicians deviating from its recommendations), 69.8% of highest-income patients who died during the study period would have been connected to palliative care in the last 6 months of life versus only 51.1% of lowest-income patients. Both of these quartiles had a similar mortality rate (8.5% vs 8.2%, respectively). These differences appear to be driven, at least partially, by younger age distributions among marginalised subgroups.

Strengths and weaknesses of the study

Despite recognition that performance of clinical prediction models is likely to vary across marginalised patient subgroups,4–7 a comprehensive evaluation of a serious illness EHR-based model using TRIPOD and other recommended performance metrics60–62 has not previously been reported to our knowledge. Limitations of the present study include examining patient subgroups using EHR data and public ecological databases, which are variably collected as self-report, surrogate-report or ascribed, and require assumptions that such data serve as sufficient proxies for more complex social relations. Furthermore, we were not able to estimate performance for several subgroups (patients coded as American Indian/Alaskan Native, Native Hawaiian/Pacific Islander or mixed race) because very few or no deaths occurred in our sample. Estimating performance for these groups is critically important and future work could leverage larger cohorts or techniques such as oversampling or Bayesian estimation to do so. We also examined patient characteristics separately, but individuals hold multiple intersecting social identities potentially impacted by a given model’s predictive performance.63 Furthermore, attributing patients’ SES as those identifiable at the zip code level is error-prone, given evidence that considerable individual variation exists within zip codes.64 In addition, this study does not lay out a holistic evaluation framework for the equity impacts of EHR-based clinical decision support tools. These results only attend to performance across patient subgroups cross-sectionally and ‘in silico’, and do not elucidate the mortality model’s actual impact (if any) on clinical processes or patient outcomes. This approach is intended for empirically exploring potential impacts to marginalised patients prior to model implementation; significance findings should always be contextualised with group effect size and sample size (eg, not discounting impacts in small samples with large effect sizes and marginal significance).65 66 Finally, patient populations, healthcare practices and data inputs may differ over time and across institutions, thus limiting generalisability.67

Comparison to other studies

These findings have important implications for the growing use of EHR-based prediction models for clinical decision support in general, and for palliative care delivery specifically. There is a critical shortage of palliative care services in the USA.68–70 Prognostic triggers for palliative care are one way in which many hospitals are responding to this challenge.24 25 27 28 30 However, little guidance exists on how to quantitatively evaluate disparities in these and other clinical prediction tools, either in current guidelines or in forthcoming ones.71–73 An important next step is to develop and implement rigorous procedures for evaluating equity of prediction performance throughout the model development process (of which the approach presented here could be one part). By better understanding mechanisms by which a predictive model can exacerbate inequities in healthcare (eg, palliative care), there is an opportunity to reduce potential harms from deploying the model in clinical practice and its associated new workflows.74 Still, it is critical to acknowledge the limitations of optimising prediction models to address the broader questions of algorithmic inequity and healthcare inequity.9 11 75–78 Resource scarcity, in addition to evidence suggesting that existing inequities in palliative care are driven in part by hospital-level variation in the availability of resources, highlights the critical need for structural interventions beyond clinical decision support tools to advance palliative care equity. This includes policies to improve coverage and payment for these services and to expand, diversify and improve equity education for the palliative care workforce.35 79 80

In this study, we found that differences in predictive performance persisted across patient subgroups despite the model containing no ostensibly ‘sensitive’ predictors (eg, race, insurance status). This anticlassification approach to algorithmic fairness, whereby sensitive predictors are removed, often fails because the variation captured by these variables is still encoded in the remaining predictors. Efforts to remove ‘race correction’ are a critically important first step, but this finding highlights that additional model specification changes may be needed to target predictive performance equity.81 82 Furthermore, with prognostic models, ‘self-fulfilling prophecies’ are a concern, that is, when clinical models trigger interventions that impact the outcome they seek to predict and/or are based on data, such as EHR data, that contain existing disparities.83 The former concern is unlikely to arise with clinical implementation of the model evaluated in this paper, given consistent prior evidence that palliative and supportive care interventions neither hasten death nor affect mortality rates.83–86 Moreover, the clinical use-case for this model is to trigger a palliative care consultation, which the treating clinician, patient or their family may decline and which, unlike hospice care, can be provided concurrently with curative or restorative interventions. In contrast, given the well-documented disparities in provision of palliative care among marginalised patients and their families,35 48 implementation of the model evaluated here could reproduce biased clinical decision making by other means (eg, by reinforcing clinicians’ explicit or implicit beliefs).

While age may seem to be an innocuous predictor to include in a mortality prediction model, it is important to be wary of several potential equity-related problems. First, inequities in life expectancy between black and white people in the USA have persisted for decades due to racism,15 resulting in different population-based age distributions, and contributing to the inequities seen in model performance in this study. Independently, deployment of a model which systematically underpredicts probability of death among young individuals, even if supported by sufficient system design,74 could entrench misperceptions that palliative care is only appropriate for older individuals. To the extent that age is then correlated with other characteristics of relevance to health equity, such as ethnicity, race, sex, insurance or SES, differences in age could drive underprediction of mortality for these marginalised groups, making it less likely they are identified by the model as likely to benefit from palliative care.

In the USA, patients of advanced age with chronic serious illness account for the majority of palliative care need; age will likely remain an important predictor in mortality models. Future work on model specification and preliminary validation could explore whether such differences in these models’ predictive performance can be mitigated by incorporating interaction terms between certain subgroups and age, further examine FNRs/sensitivity (aligned with equity concerns regarding marginalised patients’ experience of delayed/denied care), or incorporate additional metrics that compare model predictions to current clinical decision making (eg, number-needed-to-treat or number-needed-to-harm, net benefit). There is an arguable theoretical basis for including proxy measures of marginalisation (eg, structural racism (red-lining), interpersonal racism (discrimination at point of care), internalised racism (self-report on attitudes and mental health))15 in predictions of individuals’ mortality risk, as these forces certainly affect patients’ health and well-being. This is different from using an individual’s race as a predictor for a physiological function like estimated glomerular filtration rate, which is then used to define a ‘normal’ range of values and to determine patients’ eligibility for kidney-related treatments,87 implicitly premised on a false ideology of black people’s biological inferiority. However, merging such social data with the EHR is practically and ethically fraught.88 Still, for any algorithm where social predictors are used, there is the risk of reifying extant beliefs about innate, biological differences among scientists and clinicians who build, circulate, interpret and use such models, and subsequently the broader public.8 9 89
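As a concrete illustration of the subgroup metrics discussed above, the following minimal sketch computes FNR, FPR and net benefit by group at a single risk threshold. The data, group labels, threshold and function names are all hypothetical and are not drawn from this study. Net benefit here follows the standard decision-curve definition, TP/n − (FP/n) × pt/(1 − pt), where pt is the risk threshold.

```python
from collections import defaultdict

def error_rates(y_true, y_prob, threshold):
    """FNR, FPR and net benefit when patients with risk >= threshold are flagged."""
    tp = fp = tn = fn = 0
    for died, risk in zip(y_true, y_prob):
        flagged = risk >= threshold
        if flagged and died:
            tp += 1
        elif flagged and not died:
            fp += 1
        elif not flagged and died:
            fn += 1
        else:
            tn += 1
    n = tp + fp + tn + fn
    return {
        "fnr": fn / (fn + tp) if (fn + tp) else float("nan"),
        "fpr": fp / (fp + tn) if (fp + tn) else float("nan"),
        # Decision-curve net benefit; requires 0 < threshold < 1.
        "net_benefit": tp / n - (fp / n) * threshold / (1 - threshold),
    }

def error_rates_by_group(y_true, y_prob, groups, threshold):
    """Apply error_rates separately within each group label."""
    by_group = defaultdict(lambda: ([], []))
    for died, risk, group in zip(y_true, y_prob, groups):
        by_group[group][0].append(died)
        by_group[group][1].append(risk)
    return {g: error_rates(ys, ps, threshold) for g, (ys, ps) in by_group.items()}

# Invented example: predicted risks are systematically lower in the younger group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_prob = [0.6, 0.5, 0.1, 0.45, 0.3, 0.2, 0.5, 0.1]
groups = ["older"] * 4 + ["younger"] * 4

rates = error_rates_by_group(y_true, y_prob, groups, threshold=0.4)
for group, metrics in rates.items():
    print(group, metrics)
```

In this invented example both groups have the same FPR, but every death in the younger group is missed at the chosen threshold. This is the pattern of underprediction that a subgroup FNR comparison is meant to surface, and that an overall metric can mask.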

Conclusion and future work

An EHR-based 6-month inpatient mortality risk model developed for triggered palliative care delivery had similar discrimination and calibration, yet differential accuracy, FPR and FNR among marginalised patient groups. This resulted in underprediction of risk of mortality for marginalised patients, which could result in fewer being identified for palliative care services when deployed in clinical practice. However, rigorous, equity-oriented quantitative evaluations of predictive performance are just one part of a multifaceted approach required to address broader questions of algorithmic inequity in healthcare. To most effectively protect patient safety, future work must move beyond bias mitigation efforts with individual EHR-based clinical decision support tools towards developing and implementing governance and regulatory structures that pertain to equity. Although US federal regulation has been slow to emerge,90 91 myriad frameworks regarding clinical algorithms and equity have been proposed that are appropriate for healthcare systems.78 91–94 These frameworks vary, but core recommendations include transparency in documentation, stakeholder engagement and accountability to those most impacted (including patients), and prospective and ongoing evaluation and monitoring. They also highlight that the decision to implement any clinical decision support tool should not be a foregone conclusion.

Data availability statement

No data are available.

Ethics statements

Patient consent for publication




  • Contributors ST, CC, KL, SH, NE, MD and KC contributed to the conception and design of the study, acquisition of data, interpretation of results, and manuscript drafting and substantive revision. ST and CC had full access to all the data in the study and conducted statistical analyses. All authors approved the final draft. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted. ST acts as guarantor.

  • Funding This study was funded by the US National Library of Medicine (Grant number: F31LM013403).

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

