Reliability of diagnoses coding with ICD-10

doi:10.1016/j.ijmedinf.2006.11.005

International Journal of Medical Informatics

Volume 77, Issue 1, January 2008, Pages 50-57

Abstract

Objective

Reliability of diagnoses coding is essential for the use of routine data in a national health care system. The present investigation compares reliability of diagnoses coding with ICD-10 between three groups of coding subjects.

Method

One hundred and eighteen students coded 15 diagnoses lists, 27 medical managers from hospitals coded 34 discharge letters, and 13 coding specialists coded 12 discharge letters. Agreement on the principal diagnosis was assessed using Cohen's Kappa and the fraction of coincidences over the number of pairs; agreement on the full set of diagnoses was assessed with a previously developed measure, p_om.

Results

Kappa values for terminal codes were fair for managers (0.27) and moderate for coders (0.42), with agreement of 29.2% versus 46.8%; at the chapter level they were substantial, at 0.71 and 0.72 (agreement 78.3% versus 80.8%). p_om was lower for the full set of diagnoses than for principal diagnoses, for example 0.21 versus 0.29 for managers at the terminal-code level. The best results were achieved by students coding diagnoses lists. In summary, the results are markedly lower than in earlier publications.
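The verbal labels used for the Kappa values (fair, moderate, substantial) match the widely used Landis and Koch bands. The following sketch assumes that scale; the abstract does not name the scale explicitly, and the function name is hypothetical.

```python
def landis_koch_label(kappa):
    """Map a Kappa value to the Landis and Koch verbal label (assumed scale)."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("Kappa must lie in [-1, 1]")
    if kappa < 0.0:
        return "poor"
    # Upper bounds of the bands, in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label


# Kappa values reported in the abstract.
labels = {k: landis_koch_label(k) for k in (0.27, 0.42, 0.71, 0.72)}
```

Under this assumption, 0.27 falls in the fair band, 0.42 in the moderate band, and 0.71 and 0.72 in the substantial band, in line with the wording of the Results.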

Conclusion

The refinement of the ICD-10, accompanied by innumerable coding rules, has established a complex environment that leads to significant uncertainties even for experts. Use of coded data for quality management, health care financing, and health care policy requires a considerable simplification of ICD-10 to obtain a valid picture of health care reality.

Introduction

The use of classified and coded medical entities for reimbursement, quality management, and health care policy has increased enormously in the last 30 years. The usefulness of these data depends fundamentally on identical coding of the same entity, independent of the coding person and the time of coding. Thirty years ago the Institute of Medicine (IOM) analyzed the reliability of diagnoses coding from hospital discharge abstracts with the 8th Revision of the International Classification of Diseases (ICD) [1]. An independent re-coding of the principal diagnoses confirmed 65.2% of the original codes. Since then, various studies have raised issues such as whether hospitals systematically use wrong codes to increase reimbursement [2] or whether administrative data include the necessary elements for quality management [3]. Many studies have been published concerning the validity of coded data [4], [5]. But it is still not clear whether diagnoses coding with ICD is more than a matter of chance.

Some established problems raise concerns about the present reliability of diagnoses coding with ICD:

• The ICD includes ambiguities and inconsistencies [6].
• Coding of abstracts and medical reports is influenced by different conclusions about existing diagnoses [7].
• Refinement of ICD for reimbursement and a high number of rules constitute a complex coding system, which is quite difficult to understand, even for coding experts.

Coding of medical entities with classifications is a hot topic in Germany. The codes are used for reimbursement and system design of the German Diagnosis Related Groups (G-DRGs), introduced on a mandatory basis to hospitals in 2004. Obligatory public quality reports from hospitals include performance statistics comprising codes. These reports were published first in 2005 for 2004. A system for risk compensation is in progress. Health insurance companies will establish morbidity scores derived from coded data.

We conducted an investigation on the reliability of diagnoses coding from discharge letters with the German modification of the ICD-10 for health care financing (ICD-10-GM) [8]. The ICD-10-GM succeeds a merger of an earlier German adaptation of WHO's 10th revision with the ICD-10 Australian Modification (ICD-10-AM), Version 1. Because the Australian Refined DRGs (AR-DRGs) were adopted in 2003, compatibility with the ICD-10-AM was required. ICD-10-GM is revised each year according to requirements from the G-DRGs. For coding of procedures, a national classification, abbreviated as OPS, is used. It is based on WHO's International Classification of Procedures in Medicine (ICPM) and has also been adapted to the Australian DRGs. The ICD-10-GM 2004 included 12,983 terminal codes.

We aimed to calculate the reliability of diagnoses coding. Reliability measures the agreement of different persons coding the same case (inter-rater reliability) or the agreement of one person coding the same case at different times (intra-rater reliability). Reliability is different from validity, which measures agreement with a gold standard. On the one hand, it is possible to have high reliability but weak validity, if all raters agree in their wrong decisions. On the other hand, low reliability can have two explanations. It can be the consequence of insufficient education and training, and of inadequate standardization of the coders and the coding scenario. But it can also indicate the weaknesses, mentioned above, of the classification used for coding. In the latter case, low reliability indicates poor quality of a coding system and should lead to a major revision.
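The distinction between reliability and validity can be made concrete with a small Python sketch. It assumes a single case, a known gold-standard code, and invented rater codes; the function names are hypothetical and the measures are deliberately simple proportions, not the statistics used in the study.

```python
from itertools import combinations


def inter_rater_agreement(codes):
    """Reliability proxy: share of rater pairs assigning the same code."""
    pairs = list(combinations(codes, 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(a == b for a, b in pairs) / len(pairs)


def agreement_with_gold(codes, gold_code):
    """Validity proxy: share of raters matching the gold-standard code."""
    if not codes:
        raise ValueError("need at least one rater")
    return sum(code == gold_code for code in codes) / len(codes)


# Invented scenario: the gold standard is E66.0, but every rater chose E66.9.
gold = "E66.0"
consistent_but_wrong = ["E66.9", "E66.9", "E66.9"]
reliability = inter_rater_agreement(consistent_but_wrong)
validity = agreement_with_gold(consistent_but_wrong, gold)

# Invented scenario: raters disagree, and only one matches the gold standard.
scattered = ["E66.0", "E66.9", "E66.8"]
low_reliability = inter_rater_agreement(scattered)
partial_validity = agreement_with_gold(scattered, gold)
```

In the first scenario the raters agree perfectly with each other while none matches the gold standard, which is the case of high reliability with weak validity described above.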

The investigation was split into three studies: medical students coding diagnoses lists from discharge letters, physicians working in medical management in hospitals coding from discharge letters, and specialists in medical documentation also coding from discharge letters. Results from the first study with medical students were published previously [9]. The objectives of our study were to learn about the ICD-10, to find arguments for the discussion of who should code, and to obtain information on the quality of data coded in routine care.

Section snippets

Materials and methods

Discharge letters were used as the basis for coding. The letters originate from a department of internal medicine of a medium-sized municipal hospital and had been written by one physician in the early 1990s. They cover a full range of medical problems with special emphasis on nephrology. Personal data had been deleted, including any date stamps concerning rare events, rare diseases, or pathognomonic information. The length of the letters ranged from 1 to 4 pages (cf. Fig. 1). Participants were

Results

Table 2 gives an overview of the study groups. One hundred and eighteen student forms from 15 discharge letters include 516 codes with a mean of 4.4 codes per form. The most frequent code was I10 “essential (primary) hypertension” (38 forms). One hundred and eighteen different codes were used. One hundred and thirty-five manager forms include 751 codes with a mean of 5.6 codes per form. The most frequent code was E66.0 “Obesity due to excess calories” (23 forms). Three hundred and twelve

Discussion

The coding of diagnoses with ICD-10-GM is of great importance for hospitals in Germany today. Their revenue depends mainly on the coding of diagnoses and procedures that build the definition for DRGs. Appropriateness of care is systematically monitored by a timely communication with health insurance companies using the same codes. In questionable cases an assessment of the correct coding, the appropriateness of admissions and the appropriateness of medical decisions is carried out analyzing the

Conclusions

We argue that the stated fair reliability is caused by the extensive refinement of the ICD-10 in Germany, accompanied by the introduction of complex and numerous coding rules. It is obvious to all coding experts that it is impossible to obtain reliable data on such a basis. It is surprising that re-coding studies as presented by Dixon et al. [15] did not recognize the role of the classification itself, even if they conclude a “low level of agreement between coders over main diagnosis and

Acknowledgements

We are very grateful to all the voluntary participants in our study, who received no additional fee for coding.

References (17)

G. Surján, Questions on validity of international classification of diseases-coded diagnoses, Int. J. Med. Inform. (1999)
Institute of Medicine, Reliability of hospital discharge abstracts, Report of a study, National Academy of Sciences,...
D.C. Hsia et al., Accuracy of diagnostic coding for Medicare patients under the prospective-payment system, N. Engl. J. Med. (1988)
L.I. Iezzoni, Assessing quality using administrative data, Ann. Intern. Med. (1997)
P.F. Brennan et al., Assessing data quality: from concordance, through correctness and completeness, to valid manipulatable representations, J. Am. Med. Inform. Assoc. (2000)
W.R. Hogan et al., Accuracy of data in computer-based patient records, J. Am. Med. Inform. Assoc. (1997)
C.P. Friedmann et al., Exploring the boundaries of plausibility: empirical study of a key problem in the design of computer-based clinical simulations
Deutsches Institut für Medizinische Dokumentation und Information (Hrsg.) ICD-10-GM Systematisches Verzeichnis. Version...

There are more references available in the full text version of this article.
