INTRODUCTION

Pay for performance (P4P) refers to the use of financial incentives to stimulate improvements in healthcare efficiency and quality. P4P belongs to a collection of financing schemes known as alternative payment models (APMs), which are designed to replace fee-for-service (FFS) payment systems. Whereas FFS payment rewards volume of services, APMs are designed to incentivize better outcomes and value. This is typically achieved by ensuring that providers and systems are financially vested in patient health status and efficient care delivery. In addition to P4P, prominent models include bundled payments and medical homes. Although P4P had previously been implemented by private payers on a small scale, there has been an increase in large-scale ambulatory and hospital P4P programs over the last decade both in the United States and internationally.

The Veterans Health Administration (VHA) instituted its performance pay program in 2004 after passage of the VA Health Care Personnel Enhancement Act.1 The amount of performance pay awarded to each provider is determined by the degree to which they achieve a set of performance goals, which may include measures of care processes (e.g., ordering periodic hemoglobin A1c tests in diabetic patients), health outcomes, or fulfillment of work responsibilities (e.g., timely completion of training activities). There is also a managerial performance pay program for administrators. The VHA performance pay program allows medical centers and regional networks autonomy in choosing the measures that comprise the performance goals for different types of providers. In 2011, approximately 80% of VA providers received performance pay, at an average of $8,049 per provider.2

In recent years, an increasing number of studies have examined the effects of these and other large-scale P4P programs. As experience with these programs and the evidence evaluating them have grown, questions have arisen about their effectiveness, and concerns have been voiced about the potential for negative unintended consequences.3,4 Financial incentive programs, however, are complex interventions that vary widely in their implementation. They differ in the characteristics of the measures chosen, such as the number of measures incentivized and the types of measures (e.g., structural, cost/efficiency, clinical processes, patient/intermediate outcomes, patient experience), and in features of the incentive structure, such as whom the incentive targets (e.g., providers, groups, managers, administration), the incentive amount, whether incentives take the form of rewards (e.g., fee differentials, bonuses) or penalties (e.g., withholding payment, repayments to payers), and incentive frequency. Added to this complexity are differences in the contexts in which programs are implemented, such as the type of setting (e.g., ambulatory settings, hospitals, nursing homes), the organizational culture within the setting, and other factors, including the patient population. The positive and negative effects associated with any given P4P program likely depend in part on the combination of all of these factors.

This paper, which is part of a larger report commissioned by the VHA, reports the results of a systematic review and key informant (KI) interviews focused on how implementation features influence the effectiveness of P4P programs.

METHODS

Data Sources and Strategy

A recent report on value-based purchasing published by the RAND Corporation included an examination of P4P programs.5 We modified their search strategy and conducted an updated search of the PubMed, PsycINFO, and CINAHL databases from the end of their search date through April 2014. We searched the grey literature, targeting websites of both organizations known to conduct systematic reviews and those known to have experience or data related to P4P programs. In addition, we performed searches of PubMed and Google, targeting the names of larger P4P programs, and also searched for studies examining programs not included in the RAND report (e.g., the UK Quality and Outcomes Framework [QOF]) from database inception through April 2014 (Appendix 1, available online). We obtained additional articles from systematic reviews, reference lists of pertinent studies, reviews, and editorials, and by consulting experts.

Study Selection

We included English-language trials and observational studies examining direct pay-for-performance programs targeting healthcare providers at the individual, group, managerial, or institutional level. We excluded studies examining patient-targeted financial incentives, as well as payment models other than direct pay-for-performance, such as managed care, capitation, bundled payments, and accountable care organizations. Only studies examining systems and patient populations similar to those of the VHA were included; we therefore excluded studies conducted in countries with healthcare systems that differ widely from U.S. or VHA settings, studies not conducted in hospital or ambulatory settings, and studies of child patient populations (Appendix 2, available online). Two investigators independently assessed each study for inclusion based on these criteria (Appendix 3, available online). We used a “best evidence” approach to guide study design criteria, according to the question under consideration and the literature available.6

Data Extraction and Quality Assessment

We abstracted data from each included study on study design, sample size, country, relevance to the VHA, program description, incentive structure, incentive target (e.g., provider, management, administration), comparator, outcome measures, and results. Given the wide variation in study designs and large number of observational studies, we used the Newcastle-Ottawa Quality Assessment Scale to appraise study quality.7 Both study data and data related to risk of bias were abstracted by one investigator, and were reviewed for accuracy by at least one additional investigator.

Discussions with Key Informants

We engaged experienced P4P researchers as key informants to gain insight into issues related to implementation and unintended consequences. Key informants were identified as individuals with expertise in healthcare pay-for-performance programs through a review of relevant literature and through consultation with our stakeholders and Technical Expert Panel. We conducted hour-long semi-structured interviews with KIs to understand their perceptions of the implementation factors that positively and negatively influence P4P programs. Five investigators conducted independent inductive open coding of interview notes. One investigator with qualitative research experience (KK) reviewed the investigators’ codes and identified common themes.

Data Synthesis

We qualitatively synthesized the results of included studies according to an implementation framework based on the Consolidated Framework for Implementation Research (CFIR),8 modified for the topic in collaboration with our panel of technical experts (Fig. 1). The framework applies to P4P in healthcare generally and describes how the features of P4P programs, external factors, implementation factors, and provider cognitive/affective and behavioral responses relate to one another and to processes of care and patient outcomes. This paper focuses on the relationships among implementation factors (implementation processes, features of the inner and outer settings, and provider characteristics); program design features; provider cognitive/affective responses; provider behavioral responses; and effects on processes of care and patient outcomes. Table 1 describes each category included in the framework. Because of the large number of observational studies and the heterogeneity among studies, meta-analysis was not performed.

Figure 1. Conceptual framework.

Table 1. Description of Implementation Framework Categories

RESULTS

We reviewed 1363 studies, with 509 examined at the full-text level. Forty studies met inclusion criteria, with an additional study identified by a peer reviewer, for a total of 41 (Fig. 2; see Table 2 for study characteristics; study details provided in Appendices 4 and 5, available online). Of 45 individuals invited, 14 participated in KI interviews (Appendix 6, available online).

Figure 2. Literature flow.

Table 2. Study Characteristics

Program Design Features (13 Studies)

We identified one prospective cohort study,9 two retrospective cohort studies,10,11 one pre-post study,12 six cross-sectional surveys,13–17 one economic analysis,18 and two simulation studies.19,20 With regard to measure development, studies found that an emphasis on clinical quality and patient experience criteria was related to increased coordination of care, improved office staff interaction, and provider confidence in providing high-quality care.11,14 Conversely, an emphasis on productivity and efficiency measures was associated with poorer provider and office staff communication.11 In addition, one study that surveyed administrators and managers about the overall effectiveness of a P4P program found that factors predictive of the program’s perceived effectiveness included both the communication of goal alignment and the alignment of individual goals with institutional goals, while another found that providers believed the P4P program increased clinicians’ focus on issues related to quality of care.12,15 Finally, one study examined different statistical methods of constructing composite measures and found latent variable methods to be more reliable than raw sum scores.19
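To make the contrast between composite-scoring approaches concrete, the following Python sketch compares a raw sum score with a one-factor latent-variable composite on simulated provider-level data. This is an illustration only, not the method used in the cited study; the indicator names, data, and model settings are hypothetical assumptions for the example.

```python
# Illustrative sketch only (not the cited study's method): compare a raw sum
# composite with a one-factor latent-variable composite for hypothetical,
# simulated provider-level quality indicators.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate 200 providers measured on 5 hypothetical quality indicators
# (e.g., screening or process-of-care rates) driven by one latent "quality" trait.
n_providers, n_indicators = 200, 5
latent_quality = rng.normal(size=(n_providers, 1))
loadings = rng.uniform(0.4, 0.9, size=(1, n_indicators))
noise = rng.normal(scale=0.5, size=(n_providers, n_indicators))
indicators = latent_quality @ loadings + noise

# Composite 1: raw sum score (simple, but weights every indicator equally).
sum_score = indicators.sum(axis=1)

# Composite 2: latent-variable score from a one-factor model, which weights
# indicators by how strongly they reflect the common factor.
fa = FactorAnalysis(n_components=1, random_state=0)
factor_score = fa.fit_transform(indicators).ravel()

# Compare each composite with the (simulated) true latent quality.
print("corr(sum score, latent quality):    %.3f"
      % np.corrcoef(sum_score, latent_quality.ravel())[0, 1])
print("corr(factor score, latent quality): %.3f"
      % np.corrcoef(factor_score, latent_quality.ravel())[0, 1])
```

Under these simulated assumptions, the two composites can rank providers somewhat differently; the latent-variable score down-weights noisier indicators, which is the intuition behind the reliability advantage reported for latent variable methods.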

Related to incentive structures, one study examined the extent to which incentive size influenced the decision to participate in P4P programs, and found that no clear amount determined whether providers participated, but rather that participation was positively related to the potential for reward.10 Similarly, another study found that after controlling for covariates, perceived financial salience was significantly related to a high degree of performance.13 Another study found that the underlying payment structure influenced performance, and that higher incentives may be necessary when the degree of cost sharing is lower.9 Finally, a study examining the relationship between P4P and patient experience in California over a 3-year period found that, compared with larger incentives (>10%), smaller incentives were associated with greater improvement in provider communication and office staff interaction measures.11 These findings were contrary to the authors’ hypotheses; they concluded that the results may have been influenced by the tendency of practices with smaller incentives to incentivize clinical quality and patient experience measures (vs. productivity measures), which were also associated with improvements in office staff interaction.

Findings from Key Informant Interviews

Key informants stressed that P4P programs should include a combination of measures addressing processes of care and patient outcomes, and that while measures should cover a broad range, having too many measures increases the likelihood of negative unintended consequences. KIs also agreed that measures should reflect organizational priorities and should be realistically attainable, evidence-based, clear, simple, and linked to clinically significant rather than data-driven outcomes, with systems in place for evaluation and modification as needed. In addition, they suggested that improvement should be incentivized; that incentives should be large enough to motivate but not so large as to encourage gaming; that penalties may be more effective than rewards; and that team-based incentives may be effective for increasing buy-in and professionalism among both clinical and non-clinical staff. Similarly, payments should be frequent enough to reinforce the link between measure achievement and the reward; however, frequency must be balanced against payment size, as the reward must be sufficient to reinforce behavior.

Implementation Processes (8 Studies)

We identified seven cohort studies, one prospective21 and six retrospective,22–27 and one simulation study.28 Three included studies25,26,28 examined threshold changes in the QOF and found that quality continued to increase after maximum thresholds were raised, with lower-performing providers improving significantly more than those who had been performing at a high level under the previous threshold.25,26 In addition, we identified three studies examining what happened when incentives for clinical process and patient outcome measures were removed. One study of the QOF found that the level of performance achieved prior to incentive withdrawal was generally maintained, with some differences by indicator and disease condition.27 Two studies examined changes in incentives within the VHA. Benzer et al. (2013) evaluated the effect of incentive removal and found that all improvements were sustained for up to 3 years.22 Similarly, Hysong and others (2011) evaluated changes in measure status, that is, the effect on performance when measures shift from being passively monitored (i.e., no incentive) to actively monitored (i.e., incentivized), and vice versa.23 Regardless of whether a measure was incentivized, all measures remained stable or improved over time. Quality did not deteriorate for any of the measures from which incentives were removed, and of the six measures that changed from passive to active monitoring, only two (HbA1c and colorectal cancer screening) improved significantly after the change.

Findings from Key Informant Interviews

Similar to the findings reported in the literature, key informants believed that measures should be evaluated regularly (e.g., yearly) to enable continued increases in quality. Once achievement rates are high, those measures should be evaluated, with the possibility of increasing thresholds, if relevant, or replacing them with others representing areas in need of quality improvement.

KIs stressed that implementation processes should be transparent and should provide resources that encourage and enable provider buy-in, including information that allows providers to link each measure to clinical quality and guidance on how to achieve success. To achieve buy-in, KIs urged the engagement of stakeholders at all levels, recommended a “bottom-up” approach to program development, and strongly supported clear performance feedback to providers at regular intervals, accompanied by suggestions for and examples of how to achieve high levels of performance.

Outer Setting (6 Studies)

We identified five retrospective cohort studies29–33 and one cross-sectional survey17 related to the outer setting. Studies provided no clear evidence related to factors associated with region, population density, or patient population. One short-term study of the QOF reported better performance associated with a larger proportion of older patients.33 Findings related to performance in urban compared with rural settings were inconsistent, with two studies reporting better performance by providers in rural settings,29,32 and one finding no difference.31

Findings from Key Informant Interviews

Key informants discussed the importance of taking patient populations into account when designing P4P programs, stressing the importance of flexibility in larger multi-site programs to allow for targets that are realistic and that meet the needs of local patient populations.

Inner Setting (18 Studies)

We identified 15 retrospective cohort studies30,32–45 and three cross-sectional surveys15,46,47 related to the inner setting. Studies of the QOF found that larger practices in the UK performed better in the short term,33–35 particularly with respect to total QOF points;37 however, results varied when examining subgroups by condition or location and by indicator.36,44,45 In addition, two studies found that group practice and training practice status were associated with higher quality of care,33,34 while two others found no significant effect of training practice status after controlling for covariates.35,44 Studies in the United States and other countries indicate that factors related to higher quality or greater quality improvement include culture change interventions introduced along with P4P46 and clinical support tools.42 Results were mixed regarding quality improvement visits/groups and training.15,47 In contrast to the QOF findings, however, differences in quality associated with P4P in independent versus group practices,48 by type of hospital (e.g., training, public, private),30 and by patient panel size/volume are less clear, with studies reporting conflicting results.30,43

Findings from Key Informant Interviews

KIs stressed that P4P is just one piece of an overall quality improvement program, alongside other important factors such as a strong infrastructure and ongoing infrastructure support (particularly with regard to information technology and electronic medical records), organizational culture around P4P and its associated measures, alignment and allocation of resources with P4P measures, and public reporting. Public reporting was described by many of our KIs as a strong motivator, particularly for hospital administrators, but also for individual providers operating within systems in which quality achievement scores are shared publicly.

Provider Characteristics (5 Studies)

We identified three retrospective cohort studies29,34,43 and two cross-sectional surveys.13,49 These studies found no strong evidence that provider characteristics (e.g., gender, age) were related to performance in P4P programs.13,29,34,43,45

Table 3. Evidence and Policy Implications by Implementation Framework Category

DISCUSSION

We identified 41 studies examining factors related to the implementation of P4P programs. These studies targeted implementation features associated with effects on processes of care and short-term patient outcomes, as well as on provider cognitive, affective, and behavioral responses. Implementation features included those related to program design, such as characteristics of the incentivized measures; implementation processes, such as updating or retiring measures; the inner and outer settings; and provider characteristics. The studies we examined differed widely by health system and patient population, and evaluated a range of P4P programs that varied substantially in both the measures prioritized and the incentive structure. Despite numerous examples of P4P programs, the heterogeneity across health systems and organizations and the challenges of evaluating complex interventions such as P4P preclude us from drawing firm conclusions that can be broadly applied.

While the literature does not provide strong evidence to definitively guide the implementation of P4P programs, several themes from KI interviews were consistent with evidence from the published literature (Table 3). First, programs that emphasize process-of-care or clinical outcome measures that are transparently evidence-based and viewed as clinically important may inspire more positive change than programs that use measures targeting efficiency or productivity, or that do not explicitly engage providers from the outset. Findings from both the literature examining physician perceptions and the KI interviews support the use of evidence-based measures that are congruent with provider expectations for clinical quality, and there was strong agreement among KIs that provider buy-in is crucial.

Second, the design of the incentive structure requires careful consideration of several factors, including incentive size, frequency, and target. In general, the QOF, with its larger incentives, has been more successful than programs in the U.S. Key informants attributed this to incentives that are large enough to motivate behavior, but also cautioned that larger incentives may not be cost-effective and may result in gaming. KIs also stressed the importance of clear attribution of the incentive to provider behavior, and noted that incentivized measures must be congruent with institutional priorities, must address the needs of the institution at the local level, and must be designed to best serve the local patient population.

Third, P4P programs should have the capacity to change over time in response to ongoing measurement of data and provider input. Key informants strongly agreed that P4P programs should be flexible and should be evaluated on an ongoing and regular basis. They pointed to the QOF, which is evaluated annually, and which since its inception has undergone numerous adjustments, including changes to the measures incentivized and the thresholds associated with payments.

Finally, and related, P4P programs should target areas of poor performance and consider de-emphasizing areas that have achieved high performance. Findings from studies of both the QOF and the VHA and our KI interviews support the notion that improvements associated with measures achieving high performance can be sustained after the measure has been de-incentivized. Consistent evaluation of the performance of and adjustments to incentivized measures will allow institutions to shift focus and attention to areas in greatest need of improvement.

Limitations

Our review has a number of limitations. Due to the recent report on pay-for-performance programs published by the RAND Corporation and commissioned by CMS, which focused largely on programs in the United States, and our inclusion of studies examining the UK Quality and Outcomes Framework, our review and subsequent conclusions are weighted heavily towards programs targeting ambulatory care. In addition, given the heterogeneity among P4P programs, and our goal of better understanding the important factors related to implementation, we included studies that utilized less rigorous methodology, some of which had small samples. The breadth of topics and outcomes related to implementation characteristics made it difficult to restrict our criteria by study design. Given these factors, along with the inclusion of studies examining primarily observational data, we did not formally assess strength of evidence. To better inform an understanding of implementation factors important to the success of P4P programs, we interviewed 14 key informants. As our goal was not to conduct primary research, our key informants were experienced P4P researchers in the United States and the United Kingdom. While their knowledge and experience provided us with insight into implementation processes and unintended consequences, and although they were particularly well positioned to speak to future research needs, we recognize that conversations with other stakeholders, including policymakers, program officials, hospital administrators and managers, providers and other clinical and non-clinical staff, and patients, are necessary to more fully understand the issues related to P4P.

Future Research

Despite numerous P4P programs in the United States, the United Kingdom, and elsewhere, higher-quality evidence is needed to determine whether these programs are effective in improving the quality of healthcare and to identify the implementation factors that contribute to their success. Studies examining P4P have been largely observational and primarily retrospective, or have lacked well-matched comparison groups, and research examining implementation characteristics has often been conducted with small samples. One of the fundamental challenges in evaluating complex multi-component interventions such as P4P is disentangling the individual effect of each component. In the case of P4P, the challenge is even greater, as contextual and implementation factors must also be considered, with programs differing widely in their measures and incentive structures, in the overarching health systems and organizations in which they are applied, and in the patient populations they are designed to serve. There is an urgent need to examine the implementation factors that may mediate or moderate program effectiveness, including the influence of public reporting, the number and focus of measures, and incentive size, structure, and target. Finally, KIs stressed their belief that the VHA as a system is uniquely positioned to conduct much-needed rigorous and methodologically strong P4P research, not only to examine P4P’s effect on processes of care and patient outcomes directly, but also to better understand and clarify the implementation characteristics important in achieving higher quality of care and in mitigating unintended consequences.