Approach to systematically examine the usefulness of quality measures in practice: Minnesota’s nursing home quality indicators and scoring approach
Dongjuan Xu (1), Teresa Lewis (2), Marissa Rurka (3), Greg Arling (1)

  1. School of Nursing, Purdue University, West Lafayette, Indiana, USA
  2. Nursing Facility Rates and Policy Division, Minnesota Department of Human Services, Saint Paul, Minnesota, USA
  3. Department of Sociology, Purdue University, West Lafayette, Indiana, USA

Correspondence to Dr Dongjuan Xu, School of Nursing, Purdue University, West Lafayette, IN 47907, USA; xu976{at}purdue.edu

Abstract

Background Healthcare quality measurement systems, which use aggregated patient-level quality measures to assess organisational performance, have been introduced widely. Yet, their usefulness in practice has received scant attention. Using Minnesota nursing home quality indicators (QIs) as a case example, we demonstrate an approach for systematically evaluating QIs in practice based on: (a) parsimony and relevance, (b) usability in discriminating between facilities, (c) actionability and (d) construct validity.

Methods We analysed 19 risk-adjusted, facility-level QIs over the 2012–2019 period. Parsimony and relevance of QIs were evaluated using scatter plots, Pearson correlations, literature review and expert opinions. Discrimination between facilities was assessed by examining facility QI distributions and the impact of the distributions on scoring. Actionability of QIs was assessed through QI trends over time. Construct validity was assessed through exploratory factor analysis of domain structure for grouping the QIs.

Results Correlation analysis and qualitative assessment led to redefining one QI, adding one improvement-focused QI, and combining two highly correlated QIs to improve parsimony and clinical relevance. Ten of the QIs displayed normal distributions which discriminated well between the best and worst performers. The other nine QIs displayed poor discrimination; they had skewed distributions with ceiling or floor effects. We recommended scoring approaches tailored to these distributions. One QI displaying substantial improvement over time was recommended for retirement (physical restraint use). Based on factor analysis, we grouped the 18 final QIs into four domains: incontinence (4 QIs), physical functioning (4 QIs), psychosocial care (4 QIs) and care for specific conditions (6 QIs).

Conclusion We demonstrated a systematic approach for evaluating QIs in practice by arriving at parsimonious and relevant QIs, tailored scoring to different QI distributions and a meaningful domain structure. This approach could be applied in evaluating quality measures in other health or long-term care settings.

  • Nursing homes
  • Quality measurement
  • Evaluation methodology

Data availability statement

Data may be obtained from a third party and are not publicly available.

WHAT IS ALREADY KNOWN ON THIS TOPIC

  • Healthcare quality measurement systems, which use aggregated patient-level quality measures to assess organisational performance, have been introduced widely into practice. Yet, scant attention has been given to the evaluation of their continued usefulness in practice.

WHAT THIS STUDY ADDS

  • This study, evaluating a nursing home quality measurement system, demonstrates a general approach based on well-known criteria, including parsimony and relevance, usability, actionability and validity. For example, this study shows how to group quality indicators into meaningful clinical domains and to tailor quality scores so that they better discriminate between good and poor performing organisations.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

  • Findings from the study are being applied to improving the nursing home quality measurement system that was the focus of the evaluation. In addition, the general approach could be emulated by researchers with other quality measurement and scoring systems.

Introduction

Healthcare quality measurement systems, which use aggregated patient-level quality measures to assess organisational performance, have been introduced widely. Yet, their usefulness in practice has received scant attention. There is abundant literature on the development of quality measures, criteria for evaluation and related frameworks. Expert panels have been the primary means for evaluating and then endorsing healthcare quality measures through organisations such as the National Quality Forum (NQF)1 in the USA and the Organization for Economic Cooperation and Development’s (OECD) Health Care Quality Indicators Project.2 The NQF applied five criteria for evaluation, including importance, scientific acceptability, feasibility, usability, and related and competing measures.1 The OECD applied six criteria—relevance, reliability, validity, actionability, international feasibility and international comparability—to revise its framework and quality measures.2 However, empirical approaches based on these criteria for systematically and rigorously evaluating quality measures’ usefulness in practice are lacking.

We propose a general approach for evaluating a quality measurement system’s usefulness in practice with a rigorous empirical analysis of a case example. Minnesota’s nursing home (NH) quality indicators (QIs), an established quality measurement system, serve as an excellent example. First, in many respects, the system is prototypical of other quality measurement systems in the USA, including Nursing Home Compare, Hospital Compare, Doctors and Clinicians Compare, and systems in other settings,3 as well as internationally for acute and long-term care systems.2 4–7 These quality measurement systems typically rely on measures taken at the person level (eg, resident or patient) and then aggregated to the organisational level (eg, NH or hospital) as incidence or prevalence rates, oftentimes after risk adjustment. To facilitate comparison between organisations and across measures, rates are frequently converted into standardised scores, organisations are ranked, and then they are assigned to categories from worst to best quality. Second, the NH quality measurement system has been extensively developed and widely applied for multiple purposes, such as for quality improvement, public reporting and reimbursement. These measures have had a long history of refinement, beginning with Zimmerman et al 8 and evolving through the NQF (2004)9 process to arrive at the Nursing Home Compare quality measures.10 11 Most previous research on NH quality measures has focused on application and measurement issues for individual measures at the person level, including their reliability and validity,12–14 stability and sensitivity,15 16 risk adjustment,17 18 setting meaningful thresholds19 and comparisons between statistical models to improve quality measure assessment.20 However, little is known about their practical application at the organisational level and across systems of care.

In our evaluation of Minnesota’s NH QIs, we address issues crucial to the application of quality measures in practice. First, we examine their parsimony and relevance. Large numbers of indicators, particularly if they are redundant or extraneous, can burden providers and complicate quality improvement. Second, we assess usability of QIs in discriminating between facilities. Failure to discriminate between facilities in their care quality undermines the basis for peer comparison in prioritising quality improvement efforts and it threatens the fairness of value-based payment incentives. Third, we consider actionability by evaluating facility performance over time. Care quality for a QI may have improved to the point where it can be retired. If system-wide organisational performance is unacceptably poor and it is remaining the same or declining, then policymakers and providers may need to devote more improvement effort or try different strategies. Finally, we examine construct validity by determining if QIs can be organised into domains of care that reflect inter-related care processes and outcomes. By assessing performance and then intervening at the domain level, facilities may be more efficient in the use of resources and better able to affect multiple QIs through a coordinated effort.

Methods

Data source and setting

This is a secondary analysis of QIs in Minnesota calculated from the Minimum Data Set (MDS) V.3.0 assessment data on NH residents from 2012 to 2019.21 The MDS is a comprehensive assessment tool covering health, functioning and care processes. The MDS is performed by facility nursing staff on all NH residents at admission, every 90 days thereafter, after a significant change in status and on discharge. Facilities must upload completed MDS assessments into a national database maintained by the Centers for Medicare and Medicaid Services (CMS). The CMS sends copies of assessments to Minnesota and other states. This relatively seamless process results in an up-to-date, state-level MDS database for generating QIs every calendar quarter. All Minnesota NHs were included in the analysis with the exception of a small number of facilities (an average of 11 over the study period) with fewer than 20 residents. The number of NHs in each quarter ranged from 366 to 382.

Measures: Minnesota’s NH QIs and the scoring approach

Minnesota developed and implemented the QIs in 2006, several years prior to the introduction of Medicare’s NH quality measures.22 The QIs are currently applied in Minnesota at the micro (resident care), meso (facility) and macro (policy and system) levels.23 Facility scores on individual QIs are one component of a comprehensive set of performance measures reported publicly on Minnesota’s NH Report Card.24 The QIs are also a key part of the state’s value-based reimbursement initiatives,25 26 as well as individual quality improvement efforts.27 28 Although the Minnesota QIs and federal quality measures are defined in a similar manner, the Minnesota QIs have more indicators and they are more extensively risk adjusted.29

Each facility receives an individual QI rate, ranging from 0 to 1, representing the proportion of residents eligible for a QI who either pass or fail. Facilities are assigned points for each QI according to where they stand on a QI distribution.29 This approach assumes that the facility distribution on each QI will be normal with sufficient variance to discriminate in facility performance. The percentile ranks of the facilities are used to set thresholds. Facilities with the best 20% of QI rates receive full points for the QI; facilities in the worst 10% receive no points; and facilities in between receive partial points determined through linear interpolation between the QI rates for the worst 10% and best 20%. The original 19 QIs cover 10 clinical domains with multiple domains containing a single indicator (online supplemental table 1). Each domain receives 10 points regardless of the number of indicators, which are summed to arrive at a total score for a facility with a maximum of 100 points.
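The percentile-based point assignment described above can be illustrated with a short sketch. This is not the state's scoring programme: the function names (`percentile`, `score_qi`), the linear-interpolation percentile rule, and the example rates are assumptions made for illustration, and the sketch covers a negatively framed QI only (lower rate is better, so the best 20% sit at or below the 20th percentile and the worst 10% at or above the 90th).

```python
def percentile(values, p):
    """Percentile (p in 0..100) using linear interpolation between ranks.

    The interpolation rule is an assumption; the source does not say
    how percentile cut points are computed.
    """
    xs = sorted(values)
    if not xs:
        raise ValueError("need at least one value")
    k = (len(xs) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def score_qi(rate, best_cut, worst_cut, max_points=1.0):
    """Points for a negatively framed QI (lower rate = better care).

    Full points at or below best_cut, no points at or above worst_cut,
    and linear interpolation in between.
    """
    if rate <= best_cut:
        return max_points
    if rate >= worst_cut:
        return 0.0
    return max_points * (worst_cut - rate) / (worst_cut - best_cut)


# Hypothetical facility rates for one QI: 0.0, 0.1, ..., 1.0
rates = [i / 100 for i in range(0, 101, 10)]
best_cut = percentile(rates, 20)    # threshold for the best 20%
worst_cut = percentile(rates, 90)   # threshold for the worst 10%

for r in (0.10, 0.55, 0.95):
    print(r, round(score_qi(r, best_cut, worst_cut), 3))
```

For a positively framed QI the comparison would be reversed (higher rate is better). The sketch also makes the distributional assumption visible: when most facilities cluster near one end, `best_cut` and `worst_cut` fall very close together and small differences in rate translate into large differences in points.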

Evaluation steps and analysis

Parsimony and relevance

We began with a qualitative assessment of the QIs that was followed by an empirical analysis. We conducted a scoping literature review of NH QIs and their applications.30 We also elicited expert opinions through four 1-hour meetings with 27 nurses and quality improvement experts from the NH industry, as well as state agency staff that managed the QI system.31 We conducted quantitative analysis through Pearson correlations and scatter plots to identify highly correlated QIs. If a Pearson correlation coefficient between two QIs was near 0.70 or above, and the dots in scatter plots closely clustered on or near a straight line, we defined it as highly correlated and considered combining QIs. Based on the correlation analysis, literature review and expert opinions, decisions were made on combining, dropping, adding or changing the QI definitions.
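The quantitative screen for redundant QIs could look roughly like the sketch below. The analysis in this study was run in Stata; this Python version is only an illustration. The function names, the example rates, and the numeric cutoff of 0.65 (one possible reading of "near 0.70 or above") are assumptions, and the visual check of the scatter plot is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of rates."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant series")
    return sxy / sqrt(sxx * syy)


def flag_correlated(qi_rates, cutoff=0.65):
    """Return (qi_a, qi_b, r) for every pair with r at or above cutoff.

    qi_rates maps a QI name to facility rates listed in the same
    facility order. The cutoff is an assumed stand-in for "near 0.70".
    """
    flagged = []
    for a, b in combinations(sorted(qi_rates), 2):
        r = pearson(qi_rates[a], qi_rates[b])
        if r >= cutoff:
            flagged.append((a, b, r))
    return flagged


# Hypothetical facility-level rates for three QIs
example = {
    "bladder": [0.10, 0.20, 0.30, 0.40, 0.50],
    "bowel": [0.12, 0.18, 0.33, 0.41, 0.48],
    "pain": [0.30, 0.10, 0.25, 0.15, 0.20],
}
for a, b, r in flag_correlated(example):
    print(a, b, round(r, 2))
```

Flagged pairs are only candidates for combination; the final decision in the study also drew on the literature review and expert opinion.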

Usability of QIs to discriminate between facilities

We examined facility QI distributions and the impact of the distribution on the scoring approach, that is, assignment of points. Too little variance or a skewed distribution could distort the QI scoring with facilities very close in their QI rates receiving substantially different points. We examined facility QI distributions for skewness, minimal variation (less than 0.001), and ceiling or floor effects.
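A distribution screen along these lines can be sketched as follows. Only the variance cutoff of 0.001 comes from the text; the function name `describe_qi`, the moment-based skewness formula, and the 0.05 margin used to count facilities near the floor or ceiling are assumptions chosen for illustration.

```python
def describe_qi(rates, min_var=0.001, edge=0.05):
    """Summarise one QI's facility distribution.

    min_var follows the 'variance < 0.001' rule in the text; edge is a
    hypothetical margin for counting facilities near 0 or near 1.
    """
    n = len(rates)
    if n < 2:
        raise ValueError("need at least two facilities")
    mean = sum(rates) / n
    var = sum((r - mean) ** 2 for r in rates) / (n - 1)  # sample variance
    m2 = sum((r - mean) ** 2 for r in rates) / n         # 2nd central moment
    m3 = sum((r - mean) ** 3 for r in rates) / n         # 3rd central moment
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {
        "variance": var,
        "skewness": skew,                # > 0 means a long right tail
        "low_variance": var < min_var,
        "share_near_floor": sum(r <= edge for r in rates) / n,
        "share_near_ceiling": sum(r >= 1 - edge for r in rates) / n,
    }


# Hypothetical rates for a rare-event QI clustered near zero
infection_like = [0.0, 0.0, 0.01, 0.01, 0.02, 0.02, 0.03, 0.08]
print(describe_qi(infection_like))
```

A QI with positive skewness, a high share of facilities near the floor and variance below the cutoff would fall into the "floor effect with minimal variance" group; the same summary with a high share near the ceiling would point to the opposite problem.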

Actionability: trends in QI rates

To examine trends in facility performance, we conducted a descriptive analysis (line graphs) comparing trends in the mean QI rate of the best performing 20% of facilities, the median QI rate and the mean QI rate of the worst performing 10% of facilities. If a QI showed substantial improvement resulting in a very low prevalence (eg, median QI rate <1.0%), it would be a candidate for retirement. If performance declined over time or the majority of facilities performed poorly on a QI (eg, median QI rate >50%), then it likely suffers from measurement error or a system-wide failure to address that quality dimension. System-wide failure would be pervasive poor performance on the QI across the NH system.
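The three trend lines and the screening rules above can be expressed compactly. In this sketch the 1% and 50% cutoffs are the examples given in the text, while the function names, the rounding used to size the best and worst groups, and the sample data are assumptions. It handles a negatively framed QI, where the lowest rates are the best performers.

```python
from statistics import mean, median


def quarter_summary(rates, best_share=0.20, worst_share=0.10):
    """Mean rate of the best 20%, median, and mean rate of the worst 10%."""
    xs = sorted(rates)  # ascending: best (lowest) rates first
    n_best = max(1, round(len(xs) * best_share))
    n_worst = max(1, round(len(xs) * worst_share))
    return {
        "best_mean": mean(xs[:n_best]),
        "median": median(xs),
        "worst_mean": mean(xs[-n_worst:]),
    }


def classify(median_rate, retire_below=0.01, systemic_above=0.50):
    """Apply the example cutoffs from the text to a median QI rate."""
    if median_rate < retire_below:
        return "retirement candidate"
    if median_rate > systemic_above:
        return "review: measurement error or system-wide problem"
    return "keep monitoring"


# Hypothetical restraint-like QI in one quarter (10 facilities)
quarter = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.02, 0.06]
summary = quarter_summary(quarter)
print(summary, classify(summary["median"]))
```

Running `quarter_summary` on each calendar quarter and plotting the three values over time would give line graphs of the kind described; a sustained rise in the median would need to be checked separately, since `classify` looks at a single quarter only.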

Construct validity of QIs

We conducted exploratory factor analyses (EFA) to identify the domains underlying the QI rates using principal component factor methods with orthogonal varimax rotation. We used the scree plot and eigenvalues greater than 1 to determine the number of factors or domains. When determining to which domain a QI should be allocated, the following two criteria were weighed: the relative sizes of the factor loadings and the perceived validity of each domain. The QI was assigned to the domain with the highest factor loading. In instances in which the difference between the largest and the second largest factor loading was less than 0.2,32 the QI was placed in the domain with the smaller factor loading if the QI fit better conceptually with the other QIs in that domain. Cronbach’s alpha was calculated to assess domain consistency and reliability. To handle the independence assumption for EFA, we conducted sensitivity analyses in each calendar quarter to examine whether the same factors or domains were retained.
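Two of the steps above, the domain allocation rule and Cronbach's alpha, are simple enough to sketch directly. The factor extraction and varimax rotation themselves are not reproduced. The function names, the use of absolute loadings for ranking, and the example numbers are assumptions; the study's own computations were done in Stata.

```python
from statistics import variance


def assign_domain(loadings, conceptual_fit=None, gap=0.2):
    """Pick a domain for one QI from its factor loadings.

    loadings maps domain name -> loading. The QI goes to the domain with
    the largest loading, unless the runner-up is within `gap` and is the
    domain judged to fit better conceptually (conceptual_fit).
    """
    ranked = sorted(loadings, key=lambda d: abs(loadings[d]), reverse=True)
    top = ranked[0]
    if len(ranked) > 1:
        second = ranked[1]
        close = abs(loadings[top]) - abs(loadings[second]) < gap
        if close and conceptual_fit == second:
            return second
    return top


def cronbach_alpha(items):
    """Cronbach's alpha for k items, each a list with one value per facility.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    totals = [sum(vals) for vals in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))


# Hypothetical loadings for one QI and hypothetical item scores
print(assign_domain({"physical": 0.25, "conditions": 0.35},
                    conceptual_fit="physical"))
print(cronbach_alpha([[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]))
```

When a domain mixes positively and negatively framed QIs, the rates would need to be put on a common direction before computing alpha; that step is left out of the sketch.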

Following our evaluation of the above criteria, we discussed and proposed a revised set of QIs, new scoring methods tailored to the facility QI distributions and a new domain structure. All statistical analyses were performed using Stata V.16.1 (StataCorp, College Station, Texas, USA).

Results

Parsimony and relevance

Our correlation analysis and qualitative assessment of the individual QIs for parsimony and relevance pointed to the need for combining, dropping, adding and changing QI definitions. The correlation analysis suggested redundancy in the incontinence QIs. The ‘incidence of worsening or serious bladder incontinence’ and ‘incidence of worsening or serious bowel incontinence’ were highly correlated (r=0.66). Therefore, we combined the two QIs into one: ‘incidence of worsening or serious bladder or bowel incontinence’. We also expanded the scope of the fall-related QI. Previous studies have suggested that the overall number of falls, as opposed to only falls with injury, is an important indicator of adverse events with a close connection to quality of care and quality of life.33 34 Even minor injuries due to falls can have devastating outcomes for older adults including later injury, severe injury and mortality.35 36 In the guidance for prevention of falls, a fall risk assessment is recommended for all older adults with a fall history.37 Given this body of research and expert opinions from our qualitative assessment, we replaced ‘prevalence of falls with major injury’ with a new QI ‘prevalence of any fall’. Moreover, the current QIs, with the exception of the improved walking QI, focus on avoiding poor care practices or outcomes. These negatively framed QIs convey a message of avoiding harm, essentially penalising facilities for poor care. Positively framed QIs are intended to reward facilities for better care, with better care processes and outcomes. Continence is an area where improvement can have a positive impact on quality of life. To motivate better performance in this area, we added the ‘incidence of improved or maintained bladder or bowel continence’ QI. The resulting 19 QIs are listed in table 1.

Table 1

The characteristics of current and new risk-adjusted facility QI rates over the 2012–2019 period

Usability in discriminating between facilities: facility QI distributions and impact on QI scoring

Table 1 presents characteristics of the 19 facility QI rates over the 2012–2019 period. Of these, ‘incidence of walking as well or better than on previous assessment’ and ‘incidence of improved or maintained bladder or bowel continence’ are the positively framed QIs, with an average rate of 69.1% and 57.3%, respectively. Among the 17 negatively framed QIs, on average, facilities performed worst on the two QIs about incontinence without a toileting plan (bowel incontinence: 84.8%; bladder incontinence: 74.9%) and did best on the physical restraint QI with an average rate of 0.6%. Six QIs display minimal variation in rates (variance <0.001) including physical restraints, pressure ulcers in high-risk residents, indwelling catheters, infections, urinary tract infections and unexplained weight loss.

When we examined facility QI distributions, we identified four groups (online supplemental figures 1-4): QIs with an approximately normal distribution (10 QIs); QIs with a skewed distribution, floor effect and too little variance (5 QIs); QIs with a skewed distribution, floor effect and relatively large variance (2 QIs); and QIs with a skewed distribution, ceiling effect and a systematic problem of pervasive, poor facility performance (2 QIs).

For the 10 QIs with an approximately normal distribution, the current scoring approach discriminated well between facilities. The QI ‘prevalence of moderate to severe pain’, shown in figure 1, serves as an example of QIs with an approximately normal distribution. The best performing 20% of facilities (receiving full points) performed much better than facilities in the middle or below in the distribution, and the worst performing 10% of facilities (receiving no points) performed worse than facilities in the middle or above. The other nine normally distributed QIs are shown in online supplemental figure 1, and the facility QI rates according to points assigned are shown in online supplemental table 2.

Figure 1

Distributions of the pain QI and the infection QI, respectively. QI, quality indicator.

In contrast, for the nine QIs displaying a skewed distribution, the current scoring method distorts or exaggerates the relationship between points assigned and the distribution in facility QI rates. Five QIs (physical restraints, indwelling catheters, infections, urinary tract infections and pressure ulcers) had a floor effect with minimal variance. For example, the QI ‘prevalence of infections’, shown in figure 1, is right skewed with most facilities having infection rates close to zero. Nonetheless, these tightly grouped facilities with low infection rates would receive highly varying points under the current scoring system. The other four QIs displaying similar distributions and scoring patterns are shown in online supplemental figure 2 and online supplemental table 2.

Two QIs (antipsychotic medications and depressive symptoms) exhibited a skewed distribution and floor effect, but they also had relatively large overall variance. The antipsychotic QI, shown in figure 2, had more than half of the facilities concentrated at the lower end of the distribution; they were able to achieve a QI rate less than 7%. In contrast, among the worst performing 10% of facilities, the antipsychotics QI rate ranged from 16% to 70%. The current scoring approach exaggerates the differences in facilities at the tails of the distribution. The facility distributions and scoring patterns for these two QIs are shown in online supplemental figure 3 and online supplemental table 2.

Figure 2

Distributions of the antipsychotics QI and the bladder incontinence without a toileting plan QI, respectively. QI, quality indicator.

Finally, the two QIs for bladder and bowel incontinence without a toileting plan exhibited a troubling pattern of ceiling effects, where a large percentage of facilities had residents without toileting plans. Figure 2 shows the facility distribution for the QI measuring bladder incontinence without a toileting plan. The current scoring approach assigns a relatively high number of points to even the worst performing facilities, including facilities with a QI rate above 90% of residents without a toileting plan. A similar pattern occurs for the QI for bowel incontinence without a toileting plan (online supplemental figure 4 and online supplemental table 2).

Actionability of QIs: trends in QI rates

The majority of facilities showed relatively steady or modestly improving quarterly QI rates from 2012 to 2019. However, some QIs stood out as having substantially improving or declining performance. The trend in the ‘prevalence of physical restraints’ QI displayed a steady decline from 6% in 2012 to just over 1% in 2019 among the worst performing 10% of facilities (online supplemental figure 5). By the fourth quarter of 2019, nearly all facilities (94.7%) had completely eliminated restraint use. As a result, this QI is considered for retirement. Another two QIs, antipsychotics and depressive symptoms, also showed substantial improvement, especially for the worst performing 10% of facilities (online supplemental figures 6-7). The average inappropriate use of antipsychotics QI declined from 28.8% in quarter 1, 2012 to 20.3% in quarter 4, 2019; depressive symptoms QI declined from 20.2% in quarter 1, 2012 to 12.2% in quarter 4, 2019. In contrast, the two toileting plan QIs displayed a trend of worsening performance. Not only did most facilities perform poorly on the two QIs during the entire 2012–2019 period, but they also showed a disturbing upward trend. For example, the bladder incontinence without a toileting plan QI had an increase in the median from 62.4% to 88.4% (online supplemental figure 8).

Construct validity: domain structure

Table 2 shows the EFA results for the 18 facility-level QI rates pooled over 2012–2019. The initial analysis yielded six factors with eigenvalues >1, which explained 51.3% of the variance in the QIs. Upon examining the scree plot, the results support four underlying domains, with a clear ‘elbow’ seen at four factors. The eigenvalues for these four factors were 2.32, 2.00, 1.48 and 1.22. So, we recommend reducing the QI domains to four: incontinence (four QIs), physical functioning (four QIs), psychosocial care (four QIs) and care for specific conditions (six QIs). The signs and scores of factor loading were consistent and as expected in each domain, although the two QIs ‘incidence of worsening or serious range of motion limitation’ and ‘prevalence of infections’ had loading scores less than 0.3. We placed the range of motion limitation QI in the physical functioning domain because of a better fit conceptually. Moreover, we put the ‘prevalence of infections’ QI on the same domain as the ‘prevalence of urinary tract infections’ QI, because both are related to infections. The ‘prevalence of any fall’ QI loaded together with the antipsychotics QI, depressive symptoms QI and behavioural problems QI. This is not unexpected, since based on the 2019 updated Beers Criteria,38 prescription medications such as antipsychotics and antidepressants have the potential to increase older adults’ risk of falls and fractures. The Cronbach’s alpha values for the four domains are 0.67, 0.54, 0.37 and 0.37, respectively. Researchers have suggested 0.4 is acceptable for the reliability of MDS items.21 39 Although the alpha values for the last two domains are relatively low, they are close to 0.4. In a sensitivity analysis of QI rates in each quarter over the 2012–2019 period, the EFA results showed that the majority of quarters had the same patterns, suggesting the domain structure was robust.

Table 2

Results of exploratory factor analysis for 18 risk-adjusted facility QI rates over the 2012–2019 period

Discussion

Revised QI set, domain structure and tailored scoring approach

After combining the evaluation results of our major criteria, we recommended a more parsimonious and relevant set of QIs, a simplified domain structure and a scoring system tailored to different QI distributions, as shown in online supplemental table 3. In arriving at the final 18 QIs, we combined two highly related QIs into a single measure (bowel and bladder incontinence), expanded the scope of a QI to increase its sensitivity (prevalence of any fall) and introduced an improvement-focused QI (improvement in bowel or bladder continence) to achieve greater balance with problem-focused QIs. In addition, we recommended retiring the restraint QI because of sustained improvement and a very low prevalence rate.

The two toileting plan QIs were included in our final set but they were not scored. They displayed a troubling pattern of very poor performance across the majority of facilities. Nonetheless, we felt the need to retain the QIs because of strong clinical evidence that a well-designed and implemented toileting plan such as habit training, timed toileting and prompted toileting can effectively help residents manage incontinence.40–42 However, an effective toileting plan may be difficult to implement because it is resource-intensive and requires considerable skill and management support.43 The QIs might also be confounded by inadequate documentation of the MDS item for toileting.12 44 The toileting plan QIs require extensive study to determine the root cause of poor facility performance.

The four domains we used to organise the NH QIs are supported clinically and empirically which offers evidence for their validity. The simplified domain structure also contributed to more balanced scoring with four to six QIs in each domain. We recommended an equal allocation of points across the four domains, and points spread uniformly among the QIs within a domain. An alternative weighting scheme, which could be pursued with stakeholder input, would be to assign different weights to domains or QIs reflecting their relative clinical importance, intervention potential or other criteria.

We recommended alternative methods for QI point allocation that were tailored to different QI distributions. We applied a scoring approach to the 10 QIs with a normal distribution by assigning full points to the top performers at one tail of the distribution, no points for the worst performers at the other tail of the distribution and the remaining points assigned proportionately to facilities in between. The other QIs required different approaches. We recommended that the four QIs with skewed distribution, floor effect and restricted variance should be assigned points with an all-or-nothing approach.45 To receive full points, a facility must have no resident triggering on a QI. This strategy recognises that while every facility may not achieve the zero-problem target, all facilities should be striving to achieve it. Using this approach, facilities with a zero QI rate receive full points, the worst performing 10% of facilities receive no points and facilities in between receive points proportionally.
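The zero-target scoring for the floor-effect QIs can be sketched as below. The text states only that facilities between zero and the worst 10% "receive points proportionally"; the straight-line rule from a zero rate down to the worst-10% cut point used here is an assumed reading of that sentence, and the function name and example numbers are hypothetical.

```python
def score_zero_target(rate, worst_cut, max_points=1.0):
    """Points for a floor-effect QI under a zero-problem target.

    Full points only at a zero rate, no points at or above the cut point
    for the worst 10% of facilities, and a straight-line share of points
    in between (an assumed interpretation of 'proportionally').
    """
    if rate <= 0:
        return max_points
    if rate >= worst_cut:
        return 0.0
    return max_points * (worst_cut - rate) / worst_cut


# Hypothetical cut point: the worst 10% of facilities have rates >= 4%
for r in (0.0, 0.01, 0.05):
    print(r, score_zero_target(r, worst_cut=0.04))
```

Compared with percentile thresholds at both ends, this rule keeps tightly grouped low-rate facilities from receiving widely different points purely because of their rank order.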

Because of their unique distributions, we recommended a more stringent floor for the antipsychotic medications and depressive symptoms QIs. With the long rightward tail of the facility distribution, setting a lower threshold at the 10th percentile would be permissive of poor quality care. Instead, we recommend the median QI rate as the lower threshold. By setting the lower threshold at the median, we set expectations that all facilities should be able to meet the quality standards of the top 50%.

Applications of evaluation approach

From our case example of the Minnesota NH QIs, we demonstrated an empirical approach for evaluating a quality measurement system’s usefulness in practice. We applied well-known evaluation criteria of parsimony, relevance, usability in discrimination, actionability and validity. We have tried to present sufficient detail for other researchers to replicate the approach, and we believe that it can be applied to other quality performance systems.

Parsimony and relevance are challenges facing any healthcare quality measurement system because of time and other resource constraints on healthcare providers. We carried out a literature review, elicited expert opinions, and performed correlation analysis to arrive at an optimum number of measures representing areas of care and outcomes that could best inform quality improvement as well as inform consumers and policymakers. The QIs we selected through our qualitative assessment were most applicable to an NH setting and they depended on the findings of our literature search and the experts we consulted. Nonetheless, the approach of triangulating between research literature, expert opinion, and empirical analysis could be applied to measurement development and evaluation in any healthcare setting.

Probably the most controversial issue in a quality measurement system, particularly if it is tied to financial incentives, is the allocation of points for organisation performance. Although some quality measure rates may be normally distributed with enough variation to clearly discriminate between organisations, inevitably some quality measures will be skewed and have other features that make discrimination difficult. Rates of low frequency sentinel events, for example, tend to be highly skewed in hospitals and other healthcare settings as well as in NHs. Scores assigned purely by percentile rank can be misleading if facilities are concentrated at either end of the quality measure distribution. We examined distributional patterns with a variety of common issues such as skewed distribution with floor or ceiling effects, and too little variance. Our recommended scoring approach is tailored to distributions with these non-normal properties. Our approach is applicable to Medicare’s Nursing Home Compare Five-Star performance-based NH ratings which are assigned according to percentile rank without regard to underlying facility distributions.46 Similar distributional and scoring issues arise with Medicare’s Hospital Compare and Doctors and Clinicians Compare47; in ratings of patient experiences, such as the Hospital Consumer Assessment of Healthcare Providers and Systems survey used widely for hospitals in the USA and other countries48; and in drawing international comparisons of healthcare system quality.49

Another challenge to quality measurement systems is responding to changes over time in performance or standards used to measure performance.50 We found a variety of time trends in the average facility QI rates: some improving, a few declining, but most remaining about the same. A quality measure might be retired if it steadily improved to the point where few facilities evidenced a quality problem. At the other extreme, pervasive and ongoing poor facility performance on a quality measure requires careful examination of its properties and of the care delivery system. Moreover, underlying the average trends can be considerable movement of individual facility rates as some facilities improve and others decline. For the scoring system to be effective in motivating quality improvement, it should avoid making the QI thresholds a moving target. For example, if thresholds are re-based each year as the percentile distribution changes, then individual increases in performance may go unrewarded because thresholds have risen as other facilities improved their performance. Overall improvement in care quality, a major goal of the system, can be motivated by establishing benchmarks or fixed quality measure thresholds based on the rates for best performers or goals set by stakeholders. At least in theory, this approach allows all facilities to achieve higher points as performance improves. On the other hand, thresholds should not be stagnant; better overall performance over time should be recognised by higher standards.
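The "moving target" problem can be made concrete with a small numerical sketch. All values are hypothetical, as are the function names and the interpolation rule for percentiles. Every facility improves by five percentage points between two years, and one facility's score is compared under thresholds that are re-based each year and under thresholds fixed at the first year's values.

```python
def percentile(values, p):
    """Percentile (p in 0..100) with linear interpolation between ranks."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def score(rate, best_cut, worst_cut):
    """Share of points (0..1) for a negatively framed QI."""
    if rate <= best_cut:
        return 1.0
    if rate >= worst_cut:
        return 0.0
    return (worst_cut - rate) / (worst_cut - best_cut)


def cuts(rates):
    """Best-20% and worst-10% cut points for one year's facility rates."""
    return percentile(rates, 20), percentile(rates, 90)


year1 = [0.10, 0.20, 0.30, 0.40, 0.50]
year2 = [r - 0.05 for r in year1]   # every facility improves by 5 points

base = score(0.30, *cuts(year1))    # middle facility, year 1
rebased = score(0.25, *cuts(year2)) # year 2, thresholds re-based
fixed = score(0.25, *cuts(year1))   # year 2, year-1 thresholds kept

print(round(base, 3), round(rebased, 3), round(fixed, 3))
```

Under re-basing the facility's improvement earns no additional points because the thresholds moved with the whole distribution; under fixed thresholds the same improvement is rewarded. Periodic, announced updates to fixed thresholds would be one way to reconcile this with the need to raise standards over time.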

Finally, we addressed the key measurement property of construct validity by evaluating the underlying domain structure of the quality measures. A clinically and empirically supported domain structure can make the quality measures more meaningful and actionable, and it can lead to more effective targeting for quality improvement.51 For example, strategies for quality improvement aimed at one quality measure in a domain can have spillover effects on other quality measures in the same domain. If the scoring approach reflects the domain structure, the scores may be more intuitive clinically and more easily interpreted by consumers or other stakeholders.
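
As a rough sketch of how a scoring approach might reflect a domain structure, the example below averages QI points within each domain to give one score per domain. The QI labels (`qi_a` to `qi_d`), their assignment to domains and the point values are placeholders invented for illustration; only the domain names are taken from the analysis, and simple averaging is one assumed option among many.

```python
from collections import defaultdict


def domain_scores(qi_points, qi_domain):
    """Average the points of all QIs that belong to the same domain."""
    buckets = defaultdict(list)
    for qi, pts in qi_points.items():
        buckets[qi_domain[qi]].append(pts)
    return {domain: sum(vals) / len(vals) for domain, vals in buckets.items()}


# Hypothetical mapping of placeholder QIs to two of the four domains.
qi_domain = {
    "qi_a": "incontinence",
    "qi_b": "incontinence",
    "qi_c": "physical functioning",
    "qi_d": "physical functioning",
}

# Hypothetical points earned by one facility on each QI.
qi_points = {"qi_a": 4, "qi_b": 2, "qi_c": 5, "qi_d": 3}

scores = domain_scores(qi_points, qi_domain)
print(scores)
```

Reporting a score per domain, rather than only a single total, would let a reader see at a glance which area of care is driving a facility's overall result.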

Limitations

The generalisability of our approach needs further exploration because it is based on a case study of NH QIs in a single state in the USA. However, in theory, a similar approach could be used elsewhere using the same general concepts and steps we have taken here. Also, the focus of our analysis was on selected methodological issues affecting the implementation of healthcare quality measurement in practice. Other issues affecting a measurement system’s fitness for use, such as managerial fitness, interpretability, organisational context, and political or regulatory environment, were outside the scope of the study.2 23 Nonetheless, any scoring system that is based on percentile rank, such as widely used star ratings, is likely to be subject to problems similar to those addressed in our analysis.

Conclusion

Using a systematic empirical approach, we comprehensively evaluated 19 long-stay QIs used in the Minnesota NH Report Card and made recommendations to reform the current scoring system and adopt a new domain structure. More informative and rigorous QIs in public reporting will enable prospective residents and their families to make informed decisions when selecting a facility, and allow policymakers to better assess, benchmark and monitor NH clinical care quality. Moreover, our evaluation approach could be emulated by researchers with other quality measurement and scoring systems.

Data availability statement

Data may be obtained from a third party and are not publicly available.

Ethics statements

Patient consent for publication

Ethics approval

The study was approved by the Purdue University Institutional Review Board (IRB-2020-1207).

References

Footnotes

  • Contributors All authors have read and approved the submission of this manuscript. DX and GA contributed to the study concept and design, acquisition, analysis and interpretation of data, and preparation of the manuscript. TL and MR contributed to data interpretation and manuscript preparation. DX was responsible for statistical analysis and the overall content as the guarantor.

  • Funding This study was funded by an evaluation contract with the Minnesota Department of Human Services.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.
