Background Misinterpretation of radiological examinations is an important contributing factor to diagnostic errors. Consultant radiologists in Norwegian hospitals frequently request second reads by colleagues in real time. Our objective was to estimate the frequency of clinically important changes to radiology reports produced by these prospectively obtained double readings.
Methods We retrospectively compared the preliminary and final reports from 1071 consecutive double-read abdominal CT examinations of surgical patients at five public hospitals in Norway. Experienced gastrointestinal surgeons rated the clinical importance of changes from the preliminary to final report. The severity of the radiological findings in clinically important changes was classified as increased, unchanged or decreased.
Results Changes were classified as clinically important in 146 of 1071 reports (14%). Changes to 3 reports (0.3%) were critical (demanding immediate action), 35 (3%) were major (implying a change in treatment) and 108 (10%) were intermediate (requiring further investigations). The severity of the radiological findings was increased in 118 (81%) of the clinically important changes. Important changes were made less frequently when abdominal radiologists were first readers, more frequently when they were second readers, and more frequently to urgent examinations.
Conclusion A 14% rate of clinically important changes made during double reading may justify quality assurance of radiological interpretation. Using expert second readers and a targeted selection of urgent cases and radiologists reading outside their specialty may increase the yield of discrepant cases.
- Diagnostic errors
- Healthcare quality improvement
- Audit and feedback
- Continuing education
- continuing professional development
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Surgeons often rely on radiology as a source of diagnostic information in the work-up and follow-up of their patients. Because the radiologists who interpret the examinations are human beings, they are not exempt from discrepancies or even error. The reports ‘To err is human’ and ‘An Organization with a Memory’ increased awareness of medical errors and the importance of learning from them.1,2 An autopsy study of patients dying in hospital showed that radiological misinterpretation caused 8% and contributed to another 33% of diagnostic errors in patients with relevant imaging.3 In a recent report, the Institute of Medicine finds that the occurrence of diagnostic errors has been largely unappreciated in efforts to improve the quality and safety of healthcare.4
Double reading, a practice in which two readers interpret the same imaging examination, reduces errors and increases sensitivity.5 Although the concept is simple, double reading can be conducted in several ways. There are large variations in the reported effect of double reading in different settings, and the cost effectiveness is not well established.6–8 Applied prospectively, it may be used for quality assurance of radiology reports, and it is routine in the education of residents.9,10 Some mammography screening programmes conduct independent double reading, in which the readers are blinded to the interpretation of their colleague.11
In the USA, it is a requirement for department credentialing by the Joint Commission on Accreditation of Healthcare Organizations that all staff participate in continuous peer review of 5% of randomly selected cases.12 In order to meet this standard, and to minimise its impact on workflow, peer review programmes such as RADPEER use retrospective double reading (review) of previous examinations when they are compared with the current ones being interpreted.13 The reviewing radiologist selects the examinations, and the goals are quality improvement through shared learning from discrepancies and benchmarking of performance, rather than quality assurance of the individual report.
Similarly, in the UK, The Royal College of Radiologists recommends that all radiology departments aim to implement ‘peer feedback’ with a systematic review of 5% of reports by December 2018, and that this effort should be coupled with regular ‘Learning from Discrepancies meetings’.14,15
In Norway, the approach to double reading in clinical radiology is somewhat different. When reading an examination, a consultant radiologist may choose to finalise the report directly or to request a second reading.9 The decision is based on the consultant's judgement of whether this quality assurance is warranted or not. The request may be explicit by directly contacting a specific colleague, or implicit by choosing not to sign the report, in which case the examination is routed to a queue for second reading. Fellow consultants at the same hospital carry out the second readings, and most consultants contribute as second readers, usually within their own field of expertise. Second readers have access to the preliminary report and updated information in the electronic patient record. The preliminary report, which is available in the electronic patient record, is substituted by the final report when the second reading is completed.
Consultant radiologists in Norwegian hospitals submit 39% of CT examinations for a second reading in this manner.16 For all examination techniques together, the practice consumes 20%–25% of consultant working hours.16 The main goal is quality assurance of the report before it is finalised. Less than 10% of departments record discrepancy rates or engage in benchmarking of radiologist performance.16
The objective of this study was to estimate the proportion of radiology reports that were changed during prospective double reading of current abdominal CT examinations of surgical patients and to assess the potential clinical impact of these changes. We also aimed to explore whether characteristics of examinations or radiologists were associated with a higher proportion of clinically important changes.
In this retrospective multicentre study, preliminary and final radiology reports from 1071 consecutive double-read abdominal CT examinations were collected and compared for changes (figure 1). Experienced gastrointestinal surgeons rated the clinical importance of the changes made to radiology reports following double reading. In order for the clinical raters to act within their area of expertise, all patients were inpatients or outpatients from the department of surgery and were aged 18 years or older. We only included examinations of the entire abdominal cavity (excluding isolated examinations of the liver). Repeated examinations on the same patient were not included.
Data were collected from the Radiology Information System and Electronic Patient Records at five public hospitals with a combined catchment population of 1.2 million. The number of reports collected from each hospital was in relative proportion to the number of consultant full-time equivalents in the radiology department. All included examinations were conducted between 1 September 2011 and 27 March 2013, and had been double read by two consultant radiologists as routine quality assurance. The first reader selected which examinations to submit for this quality assurance according to their own judgement, as there are no established selection criteria. Accordingly, the reasons for submitting and the number of examinations submitted vary among radiologists. Approval for the study and waiver of informed consent were obtained from the Regional Ethics Committee and the Data Protection Officer.
Patient and examination data
We collected data on patient gender and age, inpatient/outpatient status, urgency of examination (routine or urgent, defined as requested within 24 h), referral information, the identities of the first reader and second reader and the time of examination, time of preliminary and final reports (during working hours: 7:00 to 16:00, or out of working hours).
The pairs of preliminary and final reports were compared using ‘Diff Doc Professional’ (Softinterface, Los Angeles, California, USA), document comparison software, which labelled deletions, additions and changes in the reports by colour coding.
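The comparison step can be illustrated with a minimal sketch. It uses Python's standard `difflib` module as a stand-in and is not the Diff Doc Professional software used in the study; the function name `label_changes` and the word-level granularity are assumptions made for illustration.

```python
import difflib


def label_changes(preliminary, final):
    """Return (tag, removed_text, added_text) for each word-level difference.

    tag is 'insert', 'delete' or 'replace', as produced by
    difflib.SequenceMatcher.get_opcodes(); unchanged runs are skipped.
    """
    old_words = preliminary.split()
    new_words = final.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed = " ".join(old_words[i1:i2])
        added = " ".join(new_words[j1:j2])
        changes.append((tag, removed, added))
    return changes
```

An empty result corresponds to a report pair with no changes; any non-empty result would still need manual review to separate spelling and layout corrections from changes in content.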
All radiology reports with changes in content beyond simple corrections of misspelling and layout were submitted for clinical rating (figure 1). Two gastrointestinal surgeons independently rated the clinical importance of changes in content to the reports on an ordinal five-point scale. We designed the scale so that it could be dichotomised in the statistical analysis (figure 2).
Report changes given discrepant ratings of two or lower by both raters were classified as ‘clinically not important’ and not resolved further. All discrepancies rated three or higher by at least one rater were resolved by obtaining a clinical rating from a third surgeon, and clinical importance was classified according to the median of the three ratings.
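The resolution rule can be written out as a short sketch. It assumes that ratings are integers from 1 to 5 and that a final rating of 3 or higher counts as clinically important; the threshold is inferred from the text rather than taken from figure 2, and the function name is hypothetical.

```python
from statistics import median

IMPORTANT_THRESHOLD = 3  # assumption: ratings of 3, 4 and 5 are 'clinically important'


def is_clinically_important(rating_1, rating_2, rating_3=None):
    """Apply the rating-resolution rule to one changed report."""
    # Concordant ratings need no resolution.
    if rating_1 == rating_2:
        return rating_1 >= IMPORTANT_THRESHOLD
    # Discrepant ratings that are both 2 or lower: not important, no third rater.
    if max(rating_1, rating_2) < IMPORTANT_THRESHOLD:
        return False
    # Discrepant with at least one rating of 3 or higher: a third rating is required.
    if rating_3 is None:
        raise ValueError("a third rating is required to resolve this discrepancy")
    return median([rating_1, rating_2, rating_3]) >= IMPORTANT_THRESHOLD
```

With three integer ratings the median is always one of the ratings given, so the final class never falls between two scale points.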
The three raters were specialists in gastrointestinal surgery, all with >10 years of surgical experience. They made their rating based on the radiology report with colour-coded changes, the referral and the patients’ age and gender. To reduce bias, the source hospital of the reports was not disclosed to the raters and reports from the five hospitals were presented in a mixed sequence.
In addition to the rating, the surgeons made written comments about the assumed consequences of the changes they rated clinically important. With the aid of these comments we classified clinically important changes according to the clinical issues concerned. We also distinguished between increased, unchanged and decreased severity of the radiological findings resulting from clinically important changes. Changes considered an increase in severity were additional pathological findings or diagnostic suggestions leading to more comprehensive investigations or treatment. Changes considered a decrease in severity were removal or downgrading of initially reported pathological findings. Some changes could not be classified as either and were labelled unchanged severity.
We wished to explore the impact of reasons for referral on the frequency of clinically important changes. The first author reviewed referrals, and classified reasons for referral into four groups: acute presentations, non-acute presentations, follow-up and investigations after surgery or invasive procedures.
We classified the involved consultant radiologists based on experience as a consultant and subspecialty into four groups: inexperienced (<3 years as a consultant), general radiologist (≥3 years, not working within a limited field of expertise), abdominal radiologist (≥3 years, working predominantly with abdominal imaging) and other subspecialist (≥3 years, working within any other limited field of expertise).
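The grouping rule can be expressed as a small function. This is only a sketch: the `field` encoding ('general', 'abdominal', or any other label) is a hypothetical representation, not the coding used in the study dataset.

```python
def classify_reader(years_as_consultant, field):
    """Assign a consultant radiologist to one of the four reader groups.

    field is a hypothetical label: 'general' for no limited field of expertise,
    'abdominal' for predominantly abdominal imaging, anything else for
    another limited field of expertise.
    """
    if years_as_consultant < 3:
        # Experience takes precedence over subspecialty.
        return "inexperienced"
    if field == "general":
        return "general radiologist"
    if field == "abdominal":
        return "abdominal radiologist"
    return "other subspecialist"
```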
The inter-rater agreement for the five-point scale was assessed using raw agreement and weighted κ.17 We used a weight of 1−[(i−j)/(k−1)]², where ‘i’ and ‘j’ index the rows and columns of the ratings by the two raters, and ‘k’ is the maximum number of possible ratings. Differences in ratings between the two initial raters were tested with a related samples Wilcoxon signed-rank test. Agreement and Cohen's κ were calculated for the dichotomised ratings.
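A minimal sketch of the weighted κ calculation follows, assuming quadratic agreement weights w = 1 − ((i − j)/(k − 1))^2 and ratings coded as integers from 1 to k. The function names are illustrative, and the sketch does not compute the confidence intervals.

```python
def quadratic_weight(i, j, k):
    """Agreement weight for categories i and j on a k-point scale."""
    return 1.0 - ((i - j) / (k - 1)) ** 2


def weighted_kappa(ratings_a, ratings_b, k):
    """Weighted kappa for two raters; ratings are integers in 1..k."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed joint proportions: rows are rater A, columns are rater B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a - 1][b - 1] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    cells = [(i, j) for i in range(k) for j in range(k)]
    # Weighted observed agreement and agreement expected by chance.
    p_obs = sum(quadratic_weight(i, j, k) * observed[i][j] for i, j in cells)
    p_exp = sum(quadratic_weight(i, j, k) * row[i] * col[j] for i, j in cells)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_obs - p_exp) / (1.0 - p_exp)
```

With k = 2 the off-diagonal weight is 0, so the same function gives the unweighted Cohen's κ used for the dichotomised ratings.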
Exploratory analysis of associations between clinical importance of changes and characteristics of patients, examinations and readers was performed with univariate logistic regression. Variables whose univariate test had a p value of <0.25 were entered as candidate variables in a multivariate logistic regression model. Subsequently, a stepwise removal of the candidate variable with highest p value was performed until only statistically significant variables remained.
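The variable selection procedure can be sketched independently of any statistics package. In this sketch the model fit is abstracted as a caller-supplied function, `fit_p_values`, which is a hypothetical helper that takes a list of variable names and returns a p value for each; in practice it would wrap a logistic regression fit. The thresholds of 0.25 and 0.05 come from the text.

```python
def screen_univariate(univariate_p, threshold=0.25):
    """Keep variables whose univariate p value is below the screening threshold."""
    return [name for name, p in univariate_p.items() if p < threshold]


def backward_eliminate(candidates, fit_p_values, alpha=0.05):
    """Drop the variable with the highest p value until all remaining are significant.

    fit_p_values(variables) must refit the multivariate model on the given
    variables and return a dict mapping each variable name to its p value.
    """
    remaining = list(candidates)
    while remaining:
        p_values = fit_p_values(remaining)
        worst = max(remaining, key=lambda name: p_values[name])
        if p_values[worst] < alpha:
            break  # every remaining variable is statistically significant
        remaining.remove(worst)
    return remaining
```

The model is refitted after each removal because p values for the remaining variables can change once a correlated variable is dropped.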
Associations between reasons for referral and clinically important changes were explored by univariate logistic regression. The classification of reasons for referral is not a readily available parameter in a quality assurance setting, and we expected considerable overlap with more robust patient parameters such as urgency, admission status and examination time. Therefore we decided not to enter reasons for referral into the multivariate model.
We constructed two random effects logistic regression models to assess a possible association between readings of separate examinations by the same radiologist. The models tested whether there was clustering of clinically important changes in reports that were made or reviewed by individual radiologists. The significant variables from the multivariate analysis were included as fixed effects coefficients, and the random effects coefficients in the two models were the identity of the first reader and second reader, respectively.
Statistical analysis was done using IBM SPSS Statistics (V.22; IBM Corp, Somers, New York, USA) and Stata (V.12.1; StataCorp, College Station, Texas, USA). All p values are two-sided. A p value of <0.05 indicates statistical significance.
A total of 7838 abdominal CT examinations were conducted at the five hospitals in the time span from which we collected the reports. Of these, 4102 were referred from the departments of surgery, of which 1970 (48%) were read by residents. We included pairs of reports from the 1071 examinations (26%) that were read by two consultant radiologists consecutively. Descriptive statistics regarding examinations, patients, hospitals and radiologists are shown in tables 1 and 2. The median delay between the preliminary and final reports was 19 h and 56 min. Details of report turnaround times are shown in online supplementary appendix 1.
Changes to reports
There were no changes made to 435 reports (41%). There were simple orthographical corrections or changes in layout for 237 reports (22%). In 399 reports (37%), the content had been changed, and these were submitted for clinical rating. A flow chart depicting this is shown in figure 1.
On the five-point scale, the two raters were in agreement on 245 ratings (61%), and the weighted κ score for the inter-rater agreement was 0.60 (95% CI 0.53 to 0.66). Rater 2 gave lower ratings than rater 1 for 91 reports and gave higher ratings for 63 reports (p=0.049). On the dichotomised scale, there was agreement on 297 ratings (74%), and the κ score was 0.50 (95% CI 0.42 to 0.58).
The 154 discrepant ratings were resolved as follows: 10 reports with a mean rating of 1.5 were considered unequivocally ‘not clinically important’ and were not resolved further. A total of 144 reports with discrepant ratings were submitted for a third rating. In the final classification, changes to 146 reports (14%, 95% CI 11.6% to 15.8%) from 1071 double-read examinations were clinically important. Changes to 108 reports (10%, 95% CI 8.3% to 12.0%) were intermediate, 35 (3%, 95% CI 2.3% to 4.5%) were major and 3 (0.3%, 95% CI 0.06% to 0.8%) were critical.
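The source does not state how the confidence intervals for these proportions were computed. As an illustration only, the sketch below computes an exact (Clopper-Pearson) binomial interval by bisection on the binomial tail probabilities; the choice of method and the function names are assumptions.

```python
import math


def _log_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials (0 < p < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def _cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.exp(_log_pmf(i, n, p)) for i in range(k + 1))


def _bisect(f, target, increasing, iterations=60):
    """Solve f(p) = target for p in (0, 1), where f is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided confidence interval for a proportion x/n."""
    if n <= 0 or not 0 <= x <= n:
        raise ValueError("require n > 0 and 0 <= x <= n")
    # Lower limit: the p at which P(X >= x) equals alpha/2.
    lower = 0.0 if x == 0 else _bisect(
        lambda p: 1.0 - _cdf(x - 1, n, p), alpha / 2, increasing=True)
    # Upper limit: the p at which P(X <= x) equals alpha/2.
    upper = 1.0 if x == n else _bisect(
        lambda p: _cdf(x, n, p), alpha / 2, increasing=False)
    return lower, upper
```

For example, `clopper_pearson(146, 1071)` gives the interval for the overall rate of clinically important changes under this assumed method.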
The clinical issues concerned in changes classified as clinically important are presented in table 3. Among the 146 clinically important changes, the severity of the radiological findings was increased in 118 (81%), decreased in 11 (8%), and unchanged in 17 (12%). All three critical changes implied an increase in severity. In one of the critical changes, the preliminary reported normal postoperative findings were changed to suspected anastomotic leakage.
Among changes classified as major, 30 (86%) implied an increase in severity, and in 5 (14%) the severity was unchanged. In one of the major changes, the preliminary reported possible (but unlikely) large bowel obstruction was changed to large bowel obstruction caused by a constricting tumour of the sigmoid colon with suspected metastases.
Among the changes classified as intermediate, 85 (79%) implied an increase in severity, 12 (11%) implied unchanged severity and 11 (10%) implied a decrease in severity. In one of the intermediate changes, the preliminary reported normal imaging findings were changed to a suspected cystadenoma in the head of the pancreas. More examples of report changes with description of clinical presentation and corresponding classification of clinical importance and change in severity are shown in online supplementary appendix 2.
The distribution of reasons for referral (n=1069) was acute presentations 349 (33%), non-acute presentations 211 (20%), follow-up 204 (19%) and investigations after surgery or invasive procedures 305 (29%). There was an association (p <0.01) between reasons for referral and clinically important change, with changes made less frequently to reports in a follow-up setting (OR: 0.4, p<0.001) than in the setting of acute presentations.
Factors associated with clinical importance
Associations between clinical importance of changes and characteristics of patients, examinations and readers are shown in table 4. The multivariate analysis showed that more clinically important changes were made to urgent referrals. Subspecialties of both first and second readers were associated with the rate of clinically important changes. Important changes were made less frequently when abdominal radiologists were first readers and more frequently when they were second readers.
Examination and first reading out of working hours and inpatient status were associated with higher rates of clinically important changes in the univariate model, but not in the multivariate model. The random effects logistic regression models did not show a significant clustering effect with regard to the identity of either the first reader (p=0.3) or the second reader (p=0.1).
We found that prospective double reading of radiologist-selected examinations produced clinically important changes to 14% of radiology reports. Although our data stem from a different approach both to double reading and rating of discrepancies, the results are not significantly different from a previously reported 11.8% pooled total discrepancy rate for CT of the abdomen and pelvis, suggesting that some quality assurance of radiological interpretation may be justified.18
Changes to 10% of reports were rated intermediate, necessitating added controls or a change in investigations or prognosis. Although the results of these investigations are not known, they are not inconsequential, either for the patients or for resource consumption. Changes to 4% of reports were rated major or critical, implying changes in conservative or invasive treatment.
We rated discrepancies based on their potential clinical consequences, and used experienced gastrointestinal surgeons as raters. This is logical as surgeons have superior clinical knowledge, are the typical recipients of these reports and are accustomed to making clinical decisions partly founded on their content. Traditionally, radiologists have rated discrepancies of interpretation according to the magnitude of the error in question.13 Such rating is subjective and may be perceived as punitive.19 Previously reported inter-rater agreement is slight to fair with a κ of 0.17–0.2.17,20,21 The clinical rating system in the present study was more reliable, achieving a moderate to substantial inter-rater agreement, with a κ of 0.5–0.6.17 From a quality assurance perspective, there might be mutual benefits from bringing clinicians into the feedback loop. It may increase awareness among clinicians of the limitations of radiology, and among radiologists of the discrepancies that matter most to clinicians and patients.19
Our data result from routine quality assurance as it is practised, and the results should be representative of everyday clinical practice in these departments. The first reader selected the cases for double reading, but we do not know their reasons or thresholds for doing so. One might expect complex cases to be selected more frequently, which might increase the rate of interpretation discrepancies. However, this is not necessarily the case. Autopsy studies have shown that in almost half of autopsies requested by clinicians they were ‘fairly certain’ of the main diagnosis, and that the degree of clinical confidence was an inadequate predictor of diagnostic errors.22–25
Less-experienced consultants submitted more cases for double reading, and more experienced radiologists tended to conduct the second reading, indicating that the task was not randomly assigned. The higher rate of clinically important changes made by abdominal radiologists as second readers may therefore partly be due to intentional routing of complex cases to these readers as well as their competence in detection, interpretation and reporting. Similarly, the lower rate of clinically important changes made to the reports of abdominal radiologists as first readers may result from higher performance or from a tendency by the second readers to put more trust in their judgement and apply less scrutiny to their work.
The non-random selection of cases and readers renders our data unsuitable for benchmarking of performance, and the outcomes may not pertain to all abdominal CTs performed. However, retrospective peer review systems, which are frequently used for this purpose, are also vulnerable to selection bias due to radiologists’ intentional avoidance of cases taking more time to review and conscious selection of less-time-intensive cases.26 A similar reluctance has been reported in physicians failing to participate in adverse events reporting due to risk of liability exposure or professional embarrassment, burdensome reporting methods, time required for reporting, perceptions of the clinical import of adverse events and lack of sense of ownership in the process.27
The median delay between the preliminary and final reports was approximately 20 h. In the meantime, the discrepancy may be discovered on clinical grounds, or the opportunity to intervene may be missed. However, for most findings the information will still be relevant, and patient treatment may still be corrected. This opportunity to prevent patient harm directly may facilitate a more wholehearted participation by radiologists, and may also reduce concerns over medico-legal issues.
Clinically important changes were made more often to the reports from urgent investigations. This may be attributed to a higher frequency of new findings in these examinations or to a less favourable working environment of the on-call radiologist. Regardless, urgent examinations merit particular consideration for quality assurance.
This study was limited to the preliminary and final radiology reports, and did not consider any supplementary communication between radiologists and clinicians. Since there is a delay between the first and second reading, second readers may have gained information on patient development through clinical conferences or subsequent investigations, and some report changes may not result from the second reading only.
Another limitation of our study is that the actual impact of the report changes is unknown. It is questionable whether patient records can be relied on to establish this retrospectively. Records may be incomplete regarding decisions and their justifications, and courses of action may change before they are recorded. In the absence of a gold standard we cannot confirm that the second reading was the correct one. There are studies in which discrepancies between preliminary interpretations of residents and final interpretations of staff radiologists have been compared with those of consensus reference panels. The panels confirmed the second reading in 64%–85% of cases, and were more likely to confirm a second reading pointing out false-positive than false-negative and false indeterminate preliminary reports.28–30 Accordingly, in some cases report changes may have resulted in increased costs or even harm without benefit to the patient. This underlines the importance of establishing a feedback system involving the first and second readers and of course the clinicians.
We conclude that a 14% rate of clinically important changes made during double reading suggests that some quality assurance of radiological interpretation is justified. Using expert second readers, and targeting urgent cases and radiologists reading outside their specialty, may increase the yield of discrepant cases. Establishing additional objective selection criteria would require further studies.
This study was made possible by research grants from the Norwegian Medical Association's Fund for Patient Safety and Quality Improvement and from the Norwegian Society of Radiology. We would like to thank Øystein Andreas Ødegaard for his contribution as third clinical rater.
Contributors PML: Substantial contributions towards concept, design, data retrieval, data processing/analysis, statistical analysis, interpretation, drafting manuscript, revising and approval. JGA, MVS and ALT: Substantial contributions towards design, data retrieval, interpretation, revising manuscript and approval. RA and TH: Substantial contributions towards design (rating scale), data processing, data analysis, revising manuscript and approval. PH and GS: Substantial contributions towards concept, design, interpretation, revising manuscript and approval. FAD: Substantial contributions towards design, data processing/analysis, statistical analysis, interpretation, revising manuscript and approval. PG: Substantial contributions towards concept, design, data processing/analysis, statistical analysis, interpretation, drafting manuscript, revising and approval. Øystein Andreas Ødegaard: contribution as the third clinical rater (see Acknowledgement).
Funding The Norwegian Society of Radiology and The Norwegian Medical Association (Fund for Patient Safety and Quality Improvement) (KSF1: 12/1318).
Competing interests PML reports research grants from the Norwegian Medical Association and the Norwegian Society of Radiology for the conduct of this study.
Ethics approval South-East Regional Ethics Committee, Norway (ref: 2012/1986) and the Data Protection Officer at each hospital.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement Provided relevant permissions are obtained, an anonymised version of the dataset will be made available on request should anyone wish to inspect our work or conduct further studies.