Continuous quality improvement in statistical code: avoiding errors and improving transparency
Thomas S Valley (1,2,3), Neil Kamdar (2), Wyndy L Wiitala (4), Andrew M Ryan (2,5), Sarah M Seelye (4), Akbar K Waljee (2,4,6), Brahmajee K Nallamothu (2,4,7)

  1. Division of Pulmonary and Critical Care Medicine, University of Michigan, Ann Arbor, Michigan, USA
  2. Institute for Healthcare Policy and Innovation, University of Michigan, Ann Arbor, Michigan, USA
  3. Center for Bioethics and Social Sciences in Medicine, University of Michigan, Ann Arbor, Michigan, USA
  4. Center for Clinical Management Research, VA Ann Arbor Healthcare System, Ann Arbor, Michigan, USA
  5. School of Public Health, Department of Health Management and Policy, University of Michigan, Ann Arbor, Michigan, USA
  6. Division of Gastroenterology and Hepatology, University of Michigan, Ann Arbor, Michigan, USA
  7. Division of Cardiovascular Medicine, University of Michigan, Ann Arbor, Michigan, USA

Correspondence to Dr Thomas S Valley, Division of Pulmonary and Critical Care Medicine, University of Michigan, Ann Arbor, Michigan 48105, USA; valleyt{at}


Clear communication of statistical approaches can ensure healthcare research is well understood, reduce major errors and promote the advancement of science. Yet in contrast to the increasing complexity of data and analyses, published methods sections are at times insufficient for describing necessary details. Therefore, ensuring the quality, transparency and reproducibility of statistical approaches in healthcare research is essential.1 2

Such concerns are not just theoretical and have direct implications for research in the quality and safety field. For example, the Hospital Readmissions Reduction Program was instituted in 2012 by the US Centers for Medicare & Medicaid Services (CMS) and imposed financial penalties on hospitals with high readmission rates. Subsequent studies sought to determine the extent to which this programme was successful in reducing readmissions without promoting unintended consequences, such as increased mortality. Clearly defining the success or failure of this programme is essential, but in 2018 two prominent articles using the same CMS data set presented opposing results.3 4 These conflicts are undoubtedly due to differences in analytical choices, but the specific differences are difficult to reconcile because the statistical code was unavailable to readers. In another recent example, a major article was retracted after the discovery of a statistical coding error that reversed the categorisation of treatment and control groups.5 This clinical trial examined a support programme for hospitalised patients with chronic obstructive pulmonary disease. It originally reported a lower risk of hospitalisation and emergency department visits, when in fact the support programme was associated with harm. Both cases demonstrate how better practices for statistical code sharing at the time of publication may improve the quality of research.

While the utility of statistical code sharing may seem self-evident, it occurs less frequently than one would expect.2 We believe a principal barrier to statistical code sharing is unfamiliarity with the systematic approach necessary for its success. Using a strategy similar to the Plan-Do-Study-Act (PDSA) cycle commonly used in healthcare for continuous quality improvement,6 7 we describe our recent efforts to establish routine statistical code sharing practices within our research teams as well as key lessons that we have learnt.

We focus on three integrated cycles of code sharing: (1) Code development, (2) Code review, and (3) Code release (figure 1). Code development refers to the iterative process of data preparation, code refinement and data analysis critical to any research process. Code review refers to a systematic approach to ensure that statistical code and its generated outputs are clear, valid and align with the study design. Code release refers to the practical tools we and others have used for publicly distributing statistical code. We believe that integration of these three steps within the analytical process is most likely to promote effective statistical code sharing and increase the rigour and quality of statistical code.

Code development

Code development consists of three key phases: data preparation, code refinement and data analysis (figure 1A). While managing data, the analyst must make important decisions regarding cohort selection, linkage to other data sets, aggregation of data and the creation of variables. Throughout this process, the analyst iteratively conducts their own PDSA cycles, at times independently and at times with guidance from others on the research team. Similarly, the process of analysing data typically requires refinements to ensure data are appropriately structured and analyses are accurately specified. Important decisions regarding both data preparation and analysis must be made throughout code development. Through the process of systematic code sharing, we learnt that simply knowing that code review would occur created an environment in which the analyst recognised the need to annotate these decisions clearly in the statistical code, so that others who access the code can understand them.
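As an illustration of annotating analytical decisions within the code itself, the following is a minimal Python sketch rather than code from any of our studies. The records, field names, exclusion criteria and the `apply_exclusion` helper are all hypothetical assumptions chosen for the example. Each cohort selection decision is stated in a comment beside the line that implements it, and its effect on the cohort size is logged so that a reviewer can trace every step.

```python
"""Cohort selection for a hypothetical readmission analysis.

Each analytical decision is recorded next to the code that implements it,
so that a reviewer can follow the logic without access to the manuscript.
"""

# Hypothetical records; field names are illustrative assumptions.
admissions = [
    {"patient_id": 1, "age": 67, "los_days": 4, "died_in_hospital": False},
    {"patient_id": 2, "age": 17, "los_days": 2, "died_in_hospital": False},
    {"patient_id": 3, "age": 80, "los_days": 9, "died_in_hospital": True},
    {"patient_id": 4, "age": 54, "los_days": 1, "died_in_hospital": False},
]

decision_log = []  # human-readable record of each exclusion and its effect


def apply_exclusion(rows, keep, reason):
    """Keep rows where keep(row) is true and log how many were dropped."""
    kept = [row for row in rows if keep(row)]
    decision_log.append(
        f"{reason}: dropped {len(rows) - len(kept)}, kept {len(kept)}"
    )
    return kept


# DECISION 1: adults only, because the (hypothetical) protocol defines the
# population as patients aged 18 years or older at admission.
cohort = apply_exclusion(admissions, lambda r: r["age"] >= 18, "Exclude age < 18")

# DECISION 2: patients who died in hospital cannot be readmitted, so they are
# removed from the denominator for the readmission outcome.
cohort = apply_exclusion(
    cohort, lambda r: not r["died_in_hospital"], "Exclude in-hospital deaths"
)

for entry in decision_log:
    print(entry)
```

A log of this kind can be saved alongside the released code, giving an independent reader the sequence of exclusions even when the underlying data cannot be shared.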

Figure 1

An overview of statistical code sharing.

Code review

Code review is a strategy that we have used to ensure developed statistical code is readily shareable. Code review has three core goals. First, it ensures statistical code can be easily followed by others. Second, it assesses for coding errors at a high level. Third, it assures that the code clearly articulates the study design to enhance reproducibility by others.

In our group, an independent analyst formally reviews the statistical code to examine and describe what the code is doing after the primary analyst has concluded the analysis (figure 1B). We performed code review only with what was annotated within the code, blinded from the study design or additional information about the study itself (eg, in the form of a manuscript).

After code review is completed, the principal and senior investigators, primary analyst, and independent analyst meet to discuss the independent analyst’s understanding of the code and its intent. It is critical that this step is performed in a constructive fashion. We found that code review in this manner was often iterative, prompting changes and ongoing discussion. We viewed these sessions as providing ‘incident reports’ of common errors to improve group practices systematically, much like the Study stage of PDSA cycles is used in healthcare.6 In this framing, the benefits are meant both to improve the immediate research and to shift the culture of the research process, ideally with long-term advantages for future work.

We also defined several boundaries for code review. First, we felt that the independent analyst should not be a scientific judge of the appropriateness of the research question and that code review was not intended to change the underlying study design or analyses. Second, code review was not intended to criticise the efficiency of code, primarily because of time constraints (although suggestions were welcome). The foremost principle of code review was to ensure that the statistical code could be easily followed and that it aligned with the intended methodology.

Studies in medicine often have limitations on data sharing due to patient privacy concerns that make full reproducibility impossible without direct access to source data. To mimic real-life situations and to demonstrate the value of code review independent of the primary data, we conducted our code review without access to the underlying data. We believe code review promotes scientific reproducibility, primarily by enhancing transparency and ensuring that adequate annotation and clarity are present within statistical code. Reproducibility then becomes simpler for any external investigator with access to source data.

As an example, we published an article in The BMJ that performed an instrumental variable analysis to evaluate the effect of intensive care unit admission for patients with ST-elevation myocardial infarction on 30-day mortality.8 Code review was conducted after publication, as it was a relatively new practice that was still under development for our group. The primary goal of the code review process was to ensure clarity and intent of statistical code.

The independent analyst was unaffiliated with the study. Prior to reading the study or reviewing its written methods, the independent analyst conducted a line-by-line examination of the code to determine whether the intent of each line could be deciphered. If not, a notation was made to indicate that the code was unclear. We found that there were several areas where our statistical code was unclear, which would have limited transparency of openly shared code. For example, within this study, US Medicare data were linked to the American Hospital Association annual survey to obtain hospitals’ postal ZIP codes. However, the file names were insufficiently labelled within the statistical code file, preventing an independent reader from identifying the data sources or the code’s intent. A key lesson learnt was that annotation must be written during code development and must be sufficient for an independent reader to understand. This lesson led us to establish best practices in code annotation.
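The labelling problem described above can be illustrated with a short Python sketch. It is not the code from the study: the file names, field names, identifiers and ZIP code values are placeholders, and `link_hospital_zip` is a hypothetical helper. Each data source is named and described where it is declared, the purpose of the linkage is stated in the function's documentation, and records that fail to link are reported rather than silently dropped.

```python
# Data sources, labelled so that a reader can tell what each file is.
# File names and field names are hypothetical, not those used in the study.
DATA_SOURCES = {
    "medicare_admissions": {
        "file": "medicare_admissions.csv",
        "description": "US Medicare hospital admissions, one row per admission",
    },
    "aha_survey": {
        "file": "aha_annual_survey.csv",
        "description": "American Hospital Association annual survey, "
                       "one row per hospital",
    },
}

# Small in-memory stand-ins for the two files (placeholder values).
medicare_admissions = [
    {"admission_id": "A1", "hospital_id": "H10"},
    {"admission_id": "A2", "hospital_id": "H20"},
    {"admission_id": "A3", "hospital_id": "H99"},  # hospital absent from survey
]
aha_survey = [
    {"hospital_id": "H10", "zip_code": "00001"},
    {"hospital_id": "H20", "zip_code": "00002"},
]


def link_hospital_zip(admissions, survey):
    """Attach each hospital's ZIP code to its admissions (left join).

    Purpose: obtain hospital ZIP codes from the survey file.
    Returns the linked rows and the admission IDs that did not match, so
    that unmatched records are reported rather than silently dropped.
    """
    zip_by_hospital = {row["hospital_id"]: row["zip_code"] for row in survey}
    linked, unmatched = [], []
    for row in admissions:
        zip_code = zip_by_hospital.get(row["hospital_id"])
        if zip_code is None:
            unmatched.append(row["admission_id"])
        linked.append({**row, "hospital_zip": zip_code})
    return linked, unmatched


linked_rows, unmatched_ids = link_hospital_zip(medicare_admissions, aha_survey)
print(f"Linked {len(linked_rows)} admissions; unmatched: {unmatched_ids}")
```

Keeping the source descriptions in one place means a reviewer without access to the data can still identify what is being linked and why.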

We shared our code using GitHub and now also include examples of our initial code, independent analyst notes and final code.

Code release

Some publications within BMJ Quality & Safety have released code in supplementary appendices;9 10 however, beyond supplementary material, most journals have no other mechanism for the public release of statistical code, even for investigators who are interested in the process. As a result, many investigators ask interested readers to contact them for their code.11 12 While seemingly useful, this process poses two major challenges for code release as demonstrated by our empirical work.12 First, over time investigators may shift positions and even engaged investigators may lose access to code as analysts and team members move on to different work. Second, versions of statistical software change frequently, leading to hurdles when attempting to replicate studies even a few years after publication. We have found that creating a system to deposit code, along with documentation of the particulars of the statistical code (eg, versions of software and key modules employed), at the time of publication is critical to maintaining a record of what was performed (figure 1C).
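One way to capture software versions at the time of release is to generate the record programmatically. The sketch below uses only the Python standard library and is offered as an example under stated assumptions, not as the procedure our group used. The `describe_environment` function is hypothetical, and the package names passed to it are placeholders for whichever modules an analysis actually relies on.

```python
import json
import platform
from datetime import date
from importlib import metadata


def describe_environment(packages):
    """Return interpreter, operating system and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of failing.
            versions[name] = "not installed"
    return {
        "recorded_on": date.today().isoformat(),
        "python_version": platform.python_version(),
        "operating_system": platform.platform(),
        "packages": versions,
    }


# Package names are placeholders; list the modules the analysis actually uses.
environment = describe_environment(["numpy", "pandas"])
print(json.dumps(environment, indent=2))
```

The printed record can be saved as a text file and deposited with the code. Analysts working in SAS or Stata would record the equivalent information, such as the software version and any user-written modules, by the means those packages provide.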

This process requires several systematic steps. First, one must choose a platform for publicly releasing statistical code. Fortunately, there are a growing number of platforms available. Chosen platforms should ideally adhere to the guiding principles of ‘FAIR’ for software stewardship: Findability, Accessibility, Interoperability and Reusability.13 Some prominent examples include journals’ online supplemental material, Open Science Framework (OSF),14 Dataverse,15 Dryad,16 GitHub17 or Code Ocean (table 1).18 Online supplemental material, OSF, Dataverse, Dryad or GitHub provide straightforward presentation of statistical code, dependent on access to data and analytical software. At a more advanced level, Code Ocean allows investigators to release code and data while also permitting code to be run via cloud-based computing on open-source software, MATLAB or Stata.18

Table 1

Selected platforms for statistical code release

It is important that any released code include the ‘readme’ documents necessary to open software and understand each file. Investigators also must consider the software license by which they plan to release code. These software licenses are important to establish a formal set of permissions from the code’s creator that range from ‘proprietary’ to ‘free and open source’. While commonly thought to be used to restrict access, these permissions can also be used to facilitate access to code.19

As an example, we constructed a large data set of acute hospitalisations in the US Veterans Affairs (VA) healthcare system—the VA Patient Database 2014–2017 (VAPD)—and published this process in BioMed Central Medical Research Methodology.20 The VAPD consists of daily clinical data from across the nationwide VA healthcare system that were drawn from the VA’s electronic medical records system, including data on laboratory tests, vital signs, antibiotics and microbiology cultures. Data extraction and data cleaning for the VAPD resulted in 57 unique SQL, SAS V.9.4, and Stata programs.

Our database was created in part to help guide other investigators in standardising clinical data across multihospital systems like the VA. Linking data from numerous sources was a complex task, and the methods would be applicable to diverse healthcare questions. To assist other investigators in replicating our processes, we uploaded our programmes to a GitHub site and included two ‘readme’ files that detailed the steps for processing the code.

Within months of publication, we received feedback from a team of investigators at another institution who were in the process of replicating our database. The investigators discovered a supplemental documentation file that was referenced in the programme that we had failed to upload to GitHub. The investigators also sought clarification on the data extraction process that we described in our code and the ‘readme’ file. Through communication with these external investigators, we were able to identify and upload the missing documentation file and clarify our code and database construction process, ensuring these steps were not missed in future code development.
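An omission of this kind can in principle be caught before release with an automated check. The following Python sketch is a hypothetical illustration, not a tool we used: it scans the programmes and readme files in a release folder for quoted file names and lists any that are not present in the folder. The regular expression, the file extensions and the example file names are simplifying assumptions.

```python
import re
import tempfile
from pathlib import Path

# Matches quoted file names with a few common extensions.
# The pattern and the extension list are simplifying assumptions.
FILE_REFERENCE = re.compile(
    r"[\"']([\w./-]+\.(?:csv|txt|pdf|sas|do|sql|xlsx))[\"']"
)
SCANNED_SUFFIXES = {".sas", ".do", ".sql", ".md", ".txt"}


def missing_references(release_dir):
    """List files named in programmes or readmes but absent from the release."""
    release_dir = Path(release_dir)
    files = [p for p in release_dir.rglob("*") if p.is_file()]
    present = {p.name for p in files}
    missing = set()
    for path in files:
        if path.suffix not in SCANNED_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for reference in FILE_REFERENCE.findall(text):
            if Path(reference).name not in present:
                missing.add(Path(reference).name)
    return sorted(missing)


# Demonstration on a throwaway folder: one readme, one programme and one
# referenced lookup file that was never added to the release.
with tempfile.TemporaryDirectory() as folder:
    release = Path(folder)
    (release / "README.md").write_text(
        "Run 'step1_extract.sql' first.\n", encoding="utf-8"
    )
    (release / "step1_extract.sql").write_text(
        "-- lab units are listed in 'lab_units_lookup.xlsx'\nSELECT 1;\n",
        encoding="utf-8",
    )
    result = missing_references(release)
    print(result)
```

A check like this only finds file names written in quotes, so it supplements rather than replaces feedback from external users of the code.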


Statistical code sharing is growing in popularity as a topic of discussion, but practical strategies to share code within healthcare research have yet to be outlined. We continue to aspire towards early and regular code sharing practices, and we hope that our experiences might serve to start a conversation within the healthcare research community.

For the investigators and analysts, this systematic process provided opportunities for professional growth and improved long-term coding practices through iterative quality improvement. Anecdotally, we have found that this may also increase job satisfaction and retention, particularly among analysts who often work independently and in isolation, receiving limited technical feedback on their code. Statistical code sharing improved the clarity and readability of developed code, which, in turn, may enhance re-creation, re-purposing and growth of healthcare research. Ultimately, the underlying science of a study can be best gauged when its methods are clearly understood and transparent.

At the same time, we recognise a number of barriers to regular code sharing. First, code review is resource intensive. Our independent analyst spent 3 hours for the initial review and 2 hours to meet with the investigators and primary analyst. This time was unfunded by the study and largely done as a means of quality improvement within our team. Thus, we propose that health systems and funders have an important role in ensuring that code sharing is fully supported within healthcare research. Sustainability hinges on operational and grant budgets that support costs related to code sharing practices. Certain funders, like the US Agency for Healthcare Research and Quality, have recognised and pledged to support this need.21

Second, code review requires an independent analyst with appropriate technical expertise. Analysts who serve as reviewers must be adaptable to different coding languages and variability in coding practices, understand diverse statistical methods and communicate their review in a constructive manner aimed at fostering growth. Third, code sharing may be impractical in scenarios where code is proprietary or commercialised, requiring protection of intellectual property. Fourth, code release carries start-up costs (whether financial or time-related) that depend on the chosen platform.

A final challenge that we have not discussed is the release of data, which is essential to verify the accuracy of statistical code and its outputs but may or may not accompany statistical code sharing. Data release in healthcare has been quite limited and controversial, often due to the need to maintain patient privacy and because many studies use patient data collected for non-research purposes in secondary analyses. Potential solutions are needed to share data without breaching confidentiality.


We have presented one proposed approach to statistical code sharing that our group has strived towards, although other models may work as well or better. We have revealed some of the challenges we encountered and hope these suggestions may help others move towards iteratively developing their own practices for code sharing in their work. We feel strongly that, within healthcare research, we must come together to establish code sharing as a standard and believe this will only occur through careful discussion and the guidance of investigative leaders, scientific journals, healthcare systems and funding agencies.



  • Twitter @tsvalley, @Andy_Ryan_dydx, @bnallamo

  • Disclaimer This manuscript does not necessarily represent the view of the U.S. Government or the Department of Veterans Affairs.

  • Competing interests All authors have completed the ICMJE uniform disclosure form. TSV declares support from the NIH (K23 HL140165).

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; internally peer reviewed.

  • Data availability statement There are no data in this work.