Objective Administrative data systems are used to identify hospital-based patient safety events; few studies evaluate their accuracy. We assessed the accuracy of a new set of patient safety indicators (PSIs), designed to identify in-hospital complications.
Study design Prospectively defined analysis of registry data (1 April 2010–29 February 2016) in a Canadian hospital network. Assignment of complications was by two methods independently. The National Surgical Quality Improvement Programme (NSQIP) database was the clinical reference standard (primary outcome=any in-hospital NSQIP complication); PSI clusters were assigned using International Classification of Diseases, 10th Revision (ICD-10) codes in the discharge abstract. Our primary analysis assessed the accuracy of any PSI condition compared with any complication in the NSQIP; secondary analysis evaluated accuracy of complication-specific PSIs.
Patients All inpatient surgical cases captured in NSQIP data.
Analysis We assessed the accuracy of PSIs (with NSQIP as reference standard) using positive and negative predictive values (PPV/NPV), as well as positive and negative likelihood ratios (±LR).
Results We identified 12 898 linked episodes of care. Complications were identified by PSIs and NSQIP in 2415 (18.7%) and 2885 (22.4%) episodes, respectively. The presence of any PSI code had a PPV of 0.55 (95% CI 0.53 to 0.57) and NPV of 0.93 (95% CI 0.92 to 0.93); +LR 6.41 (95% CI 6.01 to 6.84) and −LR 0.40 (95% CI 0.37 to 0.42). Subgroup analyses (by surgery type and urgency) showed similar performance. Complication-specific PSIs had high NPVs (range 0.92–0.99), but low to moderate PPVs (0.13–0.61).
Conclusion Validation of the ICD-10 PSI system suggests applicability as a first screening step, integrated with data from other sources, to produce an adverse event detection pathway that informs learning healthcare systems. However, accuracy was insufficient to directly identify or rule out individual-level complications.
- Adverse events, epidemiology and detection
- Chart review methodologies
- Healthcare quality improvement
- Incident reporting
Surgical adverse events are common,1 contribute to poor short-term and long-term patient outcomes2 and increase healthcare costs.3 After major non-cardiac surgery, 10%–20% of patients experience a serious complication,4 and up to 40% of surgical patients experience at least one patient safety event during their hospitalisation.5 Accordingly, surgical complications must be monitored to identify improvement opportunities and track progress. Monitoring of surgical complications can be achieved using clinical surveillance. A leading example of this method is the National Surgical Quality Improvement Programme (NSQIP), which is widely considered the gold standard for surgical outcome ascertainment. However, programmes such as NSQIP require additional financial and human resources to prospectively review patient outcomes and medical records. Therefore, monitoring of surgical patients in these programmes is typically limited to random subsamples or to targeted procedural groupings.
As an alternative, most health systems routinely abstract health administrative data from all hospitalisations to generate hospital discharge abstract data or episode statistics. These routinely collected data contain coded information describing the interventions and diagnoses occurring during a hospital encounter. As such, health administrative data can provide a consistent source of data for all surgical inpatients and could be used to identify surgical complications. However, because health administrative data are not abstracted prospectively or by clinical experts, the accuracy of these data for identifying complications cannot be presumed. To address this important limitation, several sets of administrative data codes have been assembled in attempts to provide accurate surveillance and ascertainment of adverse patient outcomes in hospital.6–8 Many of these systems, however, rely on International Classification of Diseases, Ninth Revision (ICD-9) codes, which may lack the ability to differentiate a pre-existing diagnosis from one arising in hospital. A further limitation is that accuracy validation has often focused only on positive predictive values (PPVs; ie, the ability to rule in a complication) without also considering negative predictive values (NPVs; ie, the ability to rule out a complication).
Southern et al recently developed a new set of administrative data code clusters indicative of complications potentially related to quality of care and arising after admission, using Canadian discharge abstracts to identify ‘patient safety indicators’ (PSIs).9 Key features of these new PSIs were the combination of data-driven code identification with an expert-driven consensus process, as well as taking advantage of diagnosis timing indicators that confined the codes to diagnoses occurring after hospital admission (ie, adverse events arising during a hospital stay and potentially related to safety and quality of care). These indicators can be used (and are at our centre) to monitor rates of healthcare-associated complications during hospitalisation. While this method is logical and feasible, its accuracy has not been assessed beyond an ecological design that only compared population-level crude incidences between different data sets (as opposed to an individual-level analysis using a clinical reference standard). Therefore, we performed a validation study using a gold standard clinical reference (NSQIP) to determine the accuracy of the PSIs developed by Southern et al at the individual level. This study will inform the use of PSIs for monitoring surgical safety.
Study design and setting
We performed a prospectively defined analysis of registry data from a single academic health sciences network in Canada. A study protocol was created, finalised and reviewed by the research ethics board. This protocol was used to direct the study analyst prior to any data manipulation; however, the protocol was not published or registered. Clinical data were from NSQIP patient files, while administrative data were from the Discharge Abstract Database (DAD) of the Canadian Institute for Health Information. All patients had surgery at a 900-bed tertiary care academic health sciences network serving a population of approximately 1.2 million people. The hospital network consists of three geographically distinct campuses, including two inpatient hospitals and a free-standing ambulatory surgery centre. The network is the sole regional provider for trauma care, neurosurgery, thoracic surgery and vascular surgery and is the regional cancer treatment centre. Our investigation is reported using the Standards for the Reporting of Diagnostic Accuracy Studies (STARD initiative).10
Our unit of analysis was an episode of inpatient surgical care (meaning that a patient could be enrolled more than once if they were admitted and had surgery more than once during the study period). All consecutive patients having surgery at our hospital were eligible to be randomly included in the NSQIP database (our hospital enrolled a random sample of one out of eight surgical cases according to standard NSQIP protocols during the study period), with exclusions from consideration occurring only for people <18 years or (for people having relevant procedures) if their procedure would have been the fourth inguinal herniorrhaphy, breast lumpectomy, laparoscopic cholecystectomy or transurethral resection of prostate or bladder in an 8-day inclusion period (ie, up to three of each could be included in a given period). Patients who had surgery between 1 April 2010 (the start of our hospital’s enrolment in the NSQIP programme) and February 2016 (the latest date at which all data sets were complete at the time of analysis) were identified. Using anonymised unique patient identifiers, we directly linked each patient’s NSQIP record to the corresponding DAD record. We included all adult patients who had an NSQIP record for inpatient surgery. To link the NSQIP data (which is based on a specific surgery date) to the DAD (which represents an individual hospitalisation), we identified the DAD record that overlapped with the date of surgery.
Admission, patient and surgical characteristics were extracted for each episode of care. At the admission level, we identified whether the admission was elective or urgent and the total length of hospital stay from the DAD. At the patient level (from the DAD), we measured age, biological sex, all Elixhauser comorbidities present on admission using ICD, 10th Revision (ICD-10) codes,11 and each patient’s Elixhauser Comorbidity Index Score;12 from the NSQIP, we measured whether individuals had systemic inflammatory response syndrome or sepsis before surgery, whether they were ventilator dependent, their functional status, receipt of a preoperative transfusion and their American Society of Anesthesiologists (ASA) score. The primary surgical service for each episode was also identified.
Definition of references and comparators
Data comprising the NSQIP record were generated through standard review procedures established by the NSQIP programme. Specifically, a trained and certified surgical clinical reviewer reviewed medical records and contacted patients to populate the NSQIP record. This process was subject to regular review and was supported by decision-support mechanisms through the NSQIP programme. The reviewers at the study hospital have consistently met established NSQIP data quality standards, including <5% disagreement rates. While NSQIP data are considered the gold standard for surgical perioperative data, it must also be recognised that NSQIP participation does not include systematic laboratory and diagnostic screening. Therefore, laboratory values to aid in diagnosis of subclinical presentations of myocardial infarction or renal failure (eg, troponin and creatinine), for example, are not available for all enrolled patients and may result in some degree of outcome misclassification.
From the NSQIP record, we identified all complications for each patient during each episode of care. Because the NSQIP covers a period of 30 days after surgery, regardless of a patient being in or out of hospital, whereas the DAD only contains data generated during a hospitalisation, we used the date of each NSQIP complication to determine if the outcome occurred in-hospital or after discharge. Outcomes occurring on the day of discharge were considered to be in-hospital events. Only in-hospital complications were used to define an NSQIP complication as present for the purposes of this study. Using PSI criteria9 (online supplementary appendix 1) applied to the DAD, we identified all PSIs that occurred during the index hospitalisation. For both the NSQIP complications and PSIs, each variable was coded in a dichotomous fashion (present or absent). NSQIP complications were considered the reference standard against which PSI accuracy was tested. Outcomes defined by the NSQIP and PSI were generated independently (ie, NSQIP results were not known to administrative data extractors or vice versa).
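The in-hospital versus post-discharge distinction described above reduces to a date comparison. The sketch below illustrates the rule in Python (the study analyses were performed in SAS; the function and field names here are hypothetical):

```python
from datetime import date

def in_hospital_complication(complication_date: date, discharge_date: date) -> bool:
    """Classify an NSQIP complication as in-hospital if it occurred on or before
    the discharge date; events on the day of discharge count as in-hospital."""
    return complication_date <= discharge_date
```

Only episodes for which this returns True would contribute to the reference-standard definition of an in-hospital NSQIP complication.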
Mapping NSQIP outcomes to PSI clusters
Following a structured review of NSQIP publications,13 we found that 70% of reports specified ‘any complication’ as an outcome of interest (as opposed to specifying a single type of complication as the study outcome). Therefore, our primary objective was to determine the diagnostic accuracy of the presence of any PSI in correctly identifying a patient who suffered any NSQIP complication while in hospital. For each episode of care, a PSI was classified as a true positive (any PSI present, any NSQIP complication present), false positive (any PSI present, all NSQIP complications absent), true negative (all PSIs absent, all NSQIP complications absent), or false negative (all PSIs absent, any NSQIP complication present).
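The episode-level cross-classification above can be expressed compactly; the following is an illustrative Python sketch (not the study’s SAS code):

```python
def classify_episode(any_psi: bool, any_nsqip_complication: bool) -> str:
    """Cross-classify one episode of care for the primary analysis:
    presence of any PSI code versus any in-hospital NSQIP complication."""
    if any_psi:
        return "TP" if any_nsqip_complication else "FP"
    return "FN" if any_nsqip_complication else "TN"
```

Tabulating these labels across all episodes yields the 2×2 table from which the accuracy measures are derived.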
Our secondary objective was to determine the diagnostic accuracy of specific PSIs in identifying related NSQIP complications. To accomplish this, we mapped PSI clusters to corresponding NSQIP complications by consensus among all authors. First, two authors (DM and GH) mapped relevant NSQIP complications to each PSI based on clinical content overlap, review of all ICD-10 codes pertinent to each PSI, and clinical descriptions of NSQIP outcomes. Only 7/17 PSI domains specifically overlapped with a corresponding NSQIP complication. In some cases, one NSQIP complication was assigned to more than one PSI due to possible clinical content or code-related overlap. Next, the initial list was circulated to co-authors for comment and suggested changes to the PSI–NSQIP map. These changes were incorporated, and the mapping was finalised through consensus by all investigators (online supplementary appendix 2). Each specific PSI was classified as a true positive (PSI present, NSQIP complication(s) present), false positive (PSI present, NSQIP complication(s) absent), true negative (PSI absent, NSQIP complication(s) absent) or false negative (PSI absent, NSQIP complication(s) present) for each episode of care.
Descriptive statistics were calculated for our full study population, as well as for patients who were identified to have an NSQIP complication (NSQIP+) and for those with a PSI (PSI+). Absolute standardised differences (ASD) were calculated to assess possible differences in episode characteristics between NSQIP+ vs PSI+ patients; ASDs of >0.10 are generally considered to represent a substantial difference. Characteristics between PSI/NSQIP concordant and discordant individuals were also calculated.
During peer review, we specified PPVs and NPVs as our main measures of accuracy to support interpretability and comparison with related studies. In this study, PPVs represent the proportion of patients with a PSI diagnosis who actually had a complication in the NSQIP; NPVs represent the proportion of patients without a PSI diagnosis who also did not have a complication in the NSQIP. Confidence intervals (CIs) were calculated using the binomial distribution. Positive and negative likelihood ratios (+LR, −LR) were initially specified as our primary measures of accuracy. LR-based cut-offs have been suggested to guide the assessment of diagnostic accuracy: test results with a +LR of 10 or greater and a −LR of 0.1 or less are considered very useful; those with a +LR between 5 and 10 and a −LR between 0.1 and 0.2, moderately useful; and those with a +LR between 2 and 5 and a −LR between 0.2 and 0.5, somewhat useful. Tests with a +LR less than 2 or a −LR greater than 0.5 are considered essentially useless.14 The 95% CIs for our LRs were calculated according to the method of Simel et al.15 We also calculated sensitivity and specificity with 95% CIs using the binomial distribution.
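These measures all derive from the 2×2 table of true/false positives and negatives. A sketch of the calculations in Python (the study used SAS; the normal-approximation CIs for proportions and the log-scale LR CIs of Simel et al are shown, with illustrative counts in the test, not the study's data):

```python
import math

def diagnostic_accuracy(tp: int, fp: int, fn: int, tn: int, z: float = 1.96) -> dict:
    """PPV, NPV, sensitivity, specificity and likelihood ratios with 95% CIs.
    Proportion CIs use a normal approximation; LR CIs use the log-scale
    standard error of Simel et al for a ratio of two proportions."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)

    def prop_ci(p: float, n: int) -> tuple:
        se = math.sqrt(p * (1 - p) / n)
        return (p - z * se, p + z * se)

    pos_lr = sens / (1 - spec)
    neg_lr = (1 - sens) / spec
    # SE of log(LR): sqrt(1/a - 1/n1 + 1/b - 1/n2) for the ratio of proportions
    se_pos = math.sqrt(1 / tp - 1 / (tp + fn) + 1 / fp - 1 / (fp + tn))
    se_neg = math.sqrt(1 / fn - 1 / (tp + fn) + 1 / tn - 1 / (fp + tn))
    pos_ci = (math.exp(math.log(pos_lr) - z * se_pos),
              math.exp(math.log(pos_lr) + z * se_pos))
    neg_ci = (math.exp(math.log(neg_lr) - z * se_neg),
              math.exp(math.log(neg_lr) + z * se_neg))
    return {"ppv": (ppv, prop_ci(ppv, tp + fp)),
            "npv": (npv, prop_ci(npv, tn + fn)),
            "sens": sens, "spec": spec,
            "pos_lr": (pos_lr, pos_ci), "neg_lr": (neg_lr, neg_ci)}
```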
We performed several prespecified subgroup analyses. First, we performed separate analyses in elective and emergency surgery patients. Next, we performed separate analyses in orthopaedic and general surgery patients (our two highest volume surgical services). Finally, because the NSQIP programme collects data only up to 30 days after surgery, whereas PSI diagnoses could occur at any point during a prolonged hospitalisation, we tested the diagnostic accuracy of PSIs in patients with a length of stay >30 days to determine whether prolonged length of hospital stay contributed to changes in predictive accuracy of the PSI system. All analyses were performed in SAS V.9.4 for Windows.
We identified 12 898 episodes of inpatient surgery during our study period that were enrolled in NSQIP data collection and could be linked to the DAD (figure 1). Characteristics of the cohort, as well as NSQIP+ and PSI+ patients, are provided in table 1; characteristics of NSQIP/PSI concordant/discordant individuals are provided in the online supplementary appendix 3. The average age of patients was 60 years, and the majority were female (55%). Most patients had surgery on an elective basis. The mean Elixhauser comorbidity score was 2.5, and most patients were categorised as an ASA score of 3 or higher. Patients who experienced a complication, either in NSQIP or based on PSI definitions, were more likely to have had emergency surgery, were older, had higher comorbidity burdens, had higher ASA scores and were more likely to have general and less likely to have gynaecologic surgery than the average patient in the study cohort. Almost all demographic, comorbid and surgical factors were similar between PSI+ and NSQIP+ patients.
Accuracy of any PSIs in identifying any NSQIP complication
The NSQIP database identified 2885 (22.4%) patients who experienced a postoperative complication; at least one PSI was identified in 2415 (18.7%) patients. The full 2×2 table delineating true and false positives and true and false negatives for the primary (ie, any complication) comparison is provided in table 2, while for the specific matched PSI cluster to NSQIP complications and subgroups, the 2×2 tables are provided in the online supplementary appendix 4.
The PPV for any PSI code was 0.55 (95% CI 0.53 to 0.57) and the NPV was 0.93 (95% CI 0.92 to 0.93). The presence of any PSI was associated with the presence of any complication in the NSQIP with a +LR of 6.41 (95% CI 6.01 to 6.84). The absence of any PSI was associated with the absence of any NSQIP complication with a −LR of 0.40 (95% CI 0.37 to 0.42). Measures of accuracy are provided in table 3. The accuracy of PSIs in subgroups was similar to overall results (table 3). In elective cases, PPV was lower and NPV higher than in the full cohort, while in emergency cases, PPV was higher and NPV lower. Results in orthopaedics were similar to the overall population, while accuracy in general surgery was slightly higher, with both PPV and NPV increasing compared with the full population. When patients stayed in hospital beyond 30 days (the longest follow-up period covered by the NSQIP), PPV increased substantially; however, NPV decreased by a greater margin.
Accuracy of specific PSI clusters identifying specific NSQIP complications
The most common complication per PSI methods was infection (7.1%), followed by surgical complications (5.6%), gastrointestinal (4.6%), haemorrhagic (3.9%) and cardiac (2.1%) (ICD-10 codes present in at least 1% of positive cases are provided in the online supplementary appendix 5). The frequency and type of complications identified by the NSQIP methods were similar. Infectious complications (composed of septic shock, sepsis, pneumonia and surgical site infections) occurred in 7.7% of patients. Blood transfusions, a marker for haemorrhage, were required in 9.8% of cases. Cardiac complications (consisting of postoperative myocardial infarction and cardiac arrest) occurred in 0.7% of patients. No gastrointestinal complications were identified by NSQIP methods (online supplementary appendix 6). For the individual PSI categories, PPVs were lower (0.13 for fluid-related complications to 0.61 for infectious events) and NPVs were substantially higher (0.92 for haemorrhagic events to 0.99 for venous-thromboembolic, fluid and cardiac-related events). Positive LRs and specificities were high (range 7.7–131), while −LRs and sensitivities were noticeably worse (range 0.39–0.90) (table 3).
PSIs based on clusters of ICD-10 diagnostic codes, designated by timing flags as having arisen during a hospital stay, can be used to identify surgical patients who have experienced a prospectively identified postoperative complication during their hospitalisation with high NPV (0.93) and moderate PPV (0.55). However, despite addressing the limitations of previously derived and tested sets of PSI codes based on ICD-9 frameworks, which typically lack a diagnosis timing indicator, the addition of diagnosis timing indicators does not appear to increase accuracy to an extent where PSIs based on administrative data could completely replace prospective clinical review to identify complications and guide quality improvement.
Currently, most safety indicator systems based on administrative data codes are used for reporting health system and institution-level performance, or as flags to guide more detailed incident review. Most systems, such as the Agency for Healthcare Research and Quality (AHRQ) PSI system, are based on ICD-9 codes, which may lack diagnosis timing indicators and suffer from previously documented issues with misclassification bias.16 Specifically, safety indicator systems most often identify complications with high specificity and negative predictive values, but with low sensitivity and positive predictive values.7 16 Most studies validating codes in administrative data have also focused only on positive accuracy measures (ie, the reference standard includes only people with the target condition); therefore, negative predictive values are not available and positive predictive values may be inaccurately inflated due to falsely high disease prevalence in the reference population.17
The ICD-10-based PSIs derived by Southern et al,9 combined with a gold standard validation study design that includes people with and without postoperative complications, had the potential to address these limitations. First, the ICD-10 system includes a diagnosis timing indicator to clarify whether a diagnosis documented in a hospital record was present before, or arose during, hospitalisation. This could help to decrease the rates of misclassification that affect ICD-9 PSIs. Additionally, a variety of designs exist for validating administrative data codes.17 In Southern et al’s initial derivation study of PSIs,9 an ecological design (which compares incidence rates between different data sources) was used to support the initial accuracy and face validity of the PSI clusters. However, an ecological design can only provide a crude measure of accuracy because the analysis is not performed at the individual level. In contrast, the current study used a gold standard approach (considered the strongest design for code validation17) to measure the association of PSI clusters with clinically applied criteria and definitions. Furthermore, unlike most gold standard code validation studies, which contain only people with the true diagnosis,17 we employed a clinical reference standard that contained individuals with and without true complications. This approach both decreased bias in our sample and allowed us to define the accuracy of the code in predicting the presence or absence of a complication.17
Our approach to validating the presence of any PSI versus any NSQIP complication reflects typical uses of complication systems in the literature. This stands in contrast to validation studies of other sets of administrative PSI codes, which have focused on specific diagnoses only. Selected individual complications of the AHRQ system have been validated against the NSQIP system in a manner similar to our study. These data demonstrate that, for specific complications, true case identification (ie, PPVs and +LRs) by the AHRQ system16 is higher than with Southern et al’s method. However, the sensitivity and NPVs of Southern et al’s method appear superior. Extrapolating from available data comparing the AHRQ system with the NSQIP, the sensitivity of the Southern PSIs is higher than that of the AHRQ system, in particular for identifying any complication (28% for the AHRQ vs 64% for the Southern PSIs).7
By validating this new set of PSIs, systematically designed clusters of ‘post-admission’ diagnostic codes, this study provides an important contribution that supports more efficient monitoring of patient safety outcomes in surgical patients. In keeping with Southern et al’s derivation study, the PSI codes appear to have validity at the ecological level (ie, the prevalence of PSI codes and NSQIP complications was similar) in surgical patients. This stands in contrast to the AHRQ system, which has been found to identify less than one in three NSQIP complications in surgical patients.7 At the individual level, with a PPV of 0.55 and NPV of 0.93, it appears that, despite additional expert review and diagnosis timing indicators in the ICD-10 PSI system, the degree of misclassification precludes such administrative data capture systems from being considered reliable sources of data to compare individual-level quality outcomes or to drive pay-for-performance programmes. These findings are consistent with recent reviews specific to identification of hospital-acquired infections in administrative data.18 Furthermore, when considering the individual clusters of diagnoses within Southern et al’s PSI system, a lack of direct overlap with complications typically measured in prospective data systems, such as the NSQIP, could be a barrier to use for initiatives aimed at specific medical and surgical complications. Therefore, the primary utility of both the ICD-9 and ICD-10 systems in isolation will likely continue to be as screening tools as opposed to definitive diagnostic measures.
Moving forward, although both the new PSIs developed by Southern et al and the AHRQ PSIs have clear utility in helping to identify cases for review that represent true occurrences of any complication or of specific complications, both systems require refinement to improve their ability to rule in or rule out the occurrence of complications on their own. Ultimately, however, the fundamental question is now less about which administrative data indicator system is best (eg, AHRQ PSIs vs Southern PSIs) and more about how to position such indicator systems within an optimal health information screening pathway that helps healthcare organisations to effectively and efficiently detect adverse events, so that a learning healthcare system paradigm can be achieved. To support such integration, future efforts should consider how data sources beyond administrative data codes (such as clinical data from linked electronic health records, as these become more widely available) or different approaches to using administrative data codes to define risk (such as multivariable probability models19) could be incorporated. Either way, well-conducted validation studies that consider both positive and negative predictive accuracy will be foundational to the advancement of efficient, accurate and widespread patient safety monitoring.
We evaluated the accuracy of openly available code to identify PSIs against the NSQIP database in a data set external to the PSIs’ initial derivation cohort. Furthermore, while the use of a PSI system based on ‘arising after admission’ timing flags ensures that diagnoses arising in elective surgery patients were truly postoperative events, we were unable to ascertain whether documented PSI events occurred before or after surgery for emergency surgery patients, which could explain the slightly higher false positive rate in emergency cases. As a study performed in surgical patients at a single health sciences network, we are unable to ascertain whether predictive accuracy would generalise to other hospitals (as external influences such as reimbursement incentives can impact coding accuracy20 21) or to other types of hospitalised patients, such as internal medicine, maternal or mental health patients. Further validation in these populations will be required. Additionally, the clusters within the ICD-10 PSI framework do not clearly differentiate surgical (eg, surgical site infection) from medical (eg, pneumonia) complications. Lastly, we evaluated the PSIs developed by Southern et al in the context of an administrative data system that has diagnosis timing flags permitting distinction of diagnoses arising during a hospital stay from those present on admission. Currently, this means that such PSI analyses are only feasible in the three countries (Canada, Australia and the USA) that have such timing flags in their national hospital discharge data systems. Future upgrades to international data systems may make diagnosis timing information routinely available in more countries.
In an external validation study using a clinical reference standard, the PSIs developed by Southern et al showed moderate PPVs and high NPVs relative to prospectively identified surgical complications collected using NSQIP methodologies. Future work is needed to refine these administrative data-based PSI systems and to integrate them into health information screening pathways that combine PSI screening algorithms with other clinical data, supporting a learning healthcare system paradigm.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Patient consent for publication Not required.
Provenance and peer review Not commissioned; externally peer reviewed.
Data availability statement No additional data are available.