
The Surgical Safety Checklist and Teamwork Coaching Tools: a study of inter-rater reliability
  1. Lyen C Huang1,2,
  2. Dante Conley3,4,
  3. Stu Lipsitz5,
  4. Christopher C Wright6,
  5. Thomas W Diller6,
  6. Lizabeth Edmondson1,
  7. William R Berry1,
  8. Sara J Singer7,8,9
  1. 1Ariadne Labs: a joint center for health system innovation at the Brigham and Women's Hospital and Harvard School of Public Health, Boston, Massachusetts, USA
  2. 2Department of Surgery, Stanford University School of Medicine, Stanford, California, USA
  3. 3Department of Health Policy and Management, Harvard School of Public Health, Boston, Massachusetts, USA
  4. 4Tanana Valley Clinic, Fairbanks, Alaska, USA
  5. 5Center for Surgery and Public Health, Brigham and Women's Hospital, Boston, Massachusetts, USA
  6. 6Greenville Health System, Greenville, South Carolina, USA
  7. 7Department of Health Policy and Management, Harvard School of Public Health, Boston, Massachusetts, USA
  8. 8Department of Medicine, Harvard Medical School, Boston, Massachusetts, USA
  9. 9Mongan Institute for Health Policy, Massachusetts General Hospital, Boston, Massachusetts, USA
  1. Correspondence to Dr Sara J Singer, Department of Health Policy and Management, Harvard School of Public Health, 677 Huntington Avenue, Kresge Building 3, Room 317, Boston, MA 02115, USA; ssinger{at}


Objective To assess the inter-rater reliability (IRR) of two novel observation tools for measuring surgical safety checklist performance and teamwork.

Summary background data Surgical safety checklists can promote adherence to standards of care and improve teamwork in the operating room. Their use has been associated with reductions in mortality and other postoperative complications. However, checklist effectiveness depends on how well the checklists are performed.

Methods Authors from the Safe Surgery 2015 initiative developed a pair of novel observation tools through literature review, expert consultation and end-user testing. In one South Carolina hospital participating in the initiative, two observers jointly attended 50 surgical cases and independently rated surgical teams using both tools. We used descriptive statistics to measure checklist performance and teamwork at the hospital. We assessed IRR by measuring percent agreement, Cohen's κ, and weighted κ scores.

Results The overall percent agreement and κ between the two observers were 93% and 0.74 (95% CI 0.66 to 0.82), respectively, for the Checklist Coaching Tool and 86% and 0.84 (95% CI 0.77 to 0.90) for the Surgical Teamwork Tool. Percent agreement for individual sections of both tools was 79% or higher. Additionally, κ scores for six of eight sections on the Checklist Coaching Tool and for two of five domains on the Surgical Teamwork Tool achieved the desired 0.7 threshold. However, teamwork scores were high and variation was limited. There were no significant changes in the percent agreement or κ scores between the first 10 and last 10 cases observed.

Conclusions Both tools demonstrated substantial IRR and required limited training to use. These instruments may be used to observe checklist performance and teamwork in the operating room. However, further refinement and calibration of observer expectations, particularly in rating teamwork, could improve the utility of the tools.

  • Checklists
  • Patient safety
  • Surgery
  • Teamwork



Surgical safety checklists promote adherence to standards of care in surgery. When used to enhance communication, they can improve teamwork and prevent communication failures in the operating room.1–4 Their use has been associated with reductions in mortality and other postoperative complications.5–7

However, the effectiveness of surgical safety checklists depends on how well surgical team members perform them.8 For example, surgical team members may skip items on the checklist or treat it as a box-ticking exercise.6,8–11 Research suggests that poorly implemented checklists can have adverse effects on team function.12 Previous experience and research suggest that these challenges can be overcome with methodical implementation programmes.13

The Safe Surgery 2015 initiative has spent the last 3 years collaborating with a diverse group of 67 hospitals in South Carolina with the goal of implementing surgical safety checklists in a way that promotes adherence and teamwork. Individual and team-based coaching on checklist performance has been endorsed by experts (including those who developed the WHO Surgical Safety Checklist) as an essential practice in achieving this goal.14 The Safe Surgery 2015 initiative encourages participating hospitals to select and train clinical staff from their ORs to serve as coaches. To provide a framework for coaching surgical teams, Safe Surgery 2015 investigators developed a pair of observational tools that could be distributed widely and used by the coaches with limited training to assess and guide discussion about checklist performance and surgical teamwork. The aim was to assess not only whether the surgical checklist is being used, but also how it is being used. These tools were also designed to measure the impact of the Safe Surgery 2015 initiative. At the time, existing tools for measuring surgical checklist performance failed to address all three stopping points recommended by the WHO Surgical Safety Checklist (preanaesthesia processes of care, preincision briefing, and postoperative debriefing), important communication features, surgical team member buy-in for using checklists, and expected impacts of checklist use (like reducing the number of times the circulating nurse needs to leave the room to find instruments or equipment). Similarly, tools for observing surgical teamwork were not tailored to aspects of teamwork that checklists might affect.

The goal of the present study was to pilot test the two tools using observers like those expected to use them in a ‘real world’ setting (ie, not extensively trained and not conducted in a simulation lab) and to determine their inter-rater reliability (IRR) under these circumstances. Measuring the IRR is a critical step in assessing whether we have developed tools that are sufficiently clear and easy to use that the results will be consistent regardless of observer biases or training.


Development of the coaching tools

We developed a pair of tools to measure checklist performance and teamwork in the operating room (see online Appendix for sample South Carolina Checklist Template as well as the Checklist and Teamwork Coaching Tools). We created two tools rather than a single unified tool to allow hospitals to focus on specific areas of improvement and to avoid common method bias in future intertool analysis.

The Surgical Safety Checklist Coaching Tool measures the key behaviours and processes contained on the surgical safety checklist template developed in collaboration with participating South Carolina hospitals. The majority of the items on the tool document the extent to which surgical teams performed the key teamwork and communication elements of the checklist. It was designed this way to reinforce the concept of the checklist as a process for improving teamwork and communication rather than a series of tasks to be checked off. We also included items that assess whether surgical team members follow checklist best practices (eg, reading all checklist items aloud, without reliance on memory) and exhibit appropriate behaviours (eg, ‘Buy-In’, see table 1) while performing the checklist. In order to measure the effect of checklist performance for the observed cases, we also incorporated items measuring operating room efficiency, the avoidance of errors and adherence to existing surgical standards of care (eg, antibiotic re-dosing for operations >2 h in duration). For certain processes, which are not applicable in every case (eg, antibiotic prophylaxis, compression boots), we allowed observers to indicate if they were not applicable to the case.

Table 1

Selected sections and domains from the Checklist Performance and Surgical Teamwork Tools

The companion Surgical Teamwork Tool measures teamwork among surgical team members in the operating room. In developing the tool, we started with the conceptual model already in use by the Safe Surgery 2015 monitoring programme. Members of the Safe Surgery 2015 team derived the conceptual model from previous models of teamwork,15 understanding of intended checklist effects, and experience observing checklist use. Specifically, we defined five measurable domains of teamwork considered particularly applicable to the operating room: clinical leadership, team communication, assertiveness, coordination and respect. We emphasise aspects of teamwork, such as coordination, rather than hallmarks of high reliability,16 including situational awareness, in order to focus on the central construct, and because coordination applies specifically to teams while situational awareness can apply to individuals. We separately consider aspects of teamwork others have consolidated, such as clinical leadership and assertiveness, to distinguish behaviours of those with authority from those without it. We also include elements like respect that, while often absent in surgical teamwork tools, are prominent in the teamwork literature more generally15 and often problematic in surgical teams.17 Definitions and examples of the behaviours measured in each of the five domains are shown in table 1. Safe Surgery 2015 team members (WB, DC, LE, SS) generated potential items for the Surgical Teamwork Tool, with reference to previous teamwork observation tools and climate assessments, including the Teamwork in Multidisciplinary Care Teams tool,18 the Oxford NOTECHS System,19 the High Reliability Surgical Teamwork tool by Thomas et al for Kaiser Permanente (unpublished), the Behaviour Marker Risk Index,20 the case-based version of the Safety Attitudes Questionnaire (‘ORBAT’),21 and the Observational Teamwork Assessment for Surgery (‘OTAS’) tool.22

The tools were further refined in consultation with experts in teamwork and medical simulation from the Center for Medical Simulation. We developed 19 items that best measured behaviours within the five teamwork domains. All but one of the items on the tool describe what is considered an optimal teamwork behaviour, for example, ‘Discussions took place in a calm, learning-oriented fashion.’ These items use a 5-point frequency scale, where 1 indicates the behaviour never occurred, 2 indicates the behaviour occurred about 25% of the time, 3 corresponds with a behaviour that occurred about half the time, 4 with a behaviour occurring about 75% of the time, and 5 indicates that a behaviour always occurred. By estimating the proportion of instances in which the optimal behaviour occurred in the case, a rater assesses potentially varying quality of teamwork among surgical teams. A ‘N/A’ option was provided for four items that referenced behaviours unlikely to occur in every case. The last item on the tool asked for an overall rating of surgical teamwork during the procedure on a scale of 1–5, with 1 indicating poor surgical teamwork and 5 indicating excellent surgical teamwork.
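
The frequency scale described above can be written down as a small lookup, shown here as a minimal sketch in Python. The names `FREQUENCY_SCALE` and `describe_rating` are hypothetical and are not part of the tool itself, and representing ‘N/A’ as `None` is an assumption made only for this illustration.

```python
# Hypothetical encoding of the 5-point frequency scale described above.
# Keys are scale points; values are the approximate share of instances
# in which the optimal behaviour occurred.
FREQUENCY_SCALE = {1: 0.00, 2: 0.25, 3: 0.50, 4: 0.75, 5: 1.00}


def describe_rating(score):
    """Describe a rating in words; None stands in for the 'N/A' option."""
    if score is None:
        return "not applicable"
    if score not in FREQUENCY_SCALE:
        raise ValueError(f"score must be 1-5 or None, got {score!r}")
    return f"behaviour occurred about {FREQUENCY_SCALE[score]:.0%} of the time"
```

For example, `describe_rating(4)` reports a behaviour that occurred about 75% of the time, which is the reading a rater is asked to give a score of 4.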

Common to both tools is a section capturing case demographics (patient age and gender, surgeon's specialty, and procedure performed) and observer information (age, gender, role and tenure). This information enables users to match observations from the two tools for the same case in order to examine associations between checklist performance and teamwork. We also included more detailed case characteristic information, such as case duration measured as time of incision to surgical end time, whether the case was urgent/emergent or delayed, and patient disposition in order to study the relationship between these characteristics, checklist performance and teamwork.

Design and case sample

We conducted a prospective observational study. Two nurse observers from the study hospital (both of whom had experience in quality management and observational data collection) used the two coaching tools to rate checklist performance and teamwork in the operating room. Their training on use of the instruments resembled the training that we believe could be reasonably offered to personnel in any healthcare facility. First, the two observers reviewed the tools and accompanying written instructions. Then, each observer completed a supplementary web-based training course on the use of the Surgical Teamwork Tool. For each of the five teamwork domains included in the teamwork instrument, the web-based training provides a definition, lists the set of related items, and shows two short video vignettes. The video vignettes depict scenarios carefully designed to demonstrate positive and negative forms of each behaviour that the observers will use the teamwork instrument to judge. A short quiz follows each vignette to assess the user's ability to distinguish positive and negative teamwork behaviours. The training provides automated feedback on users' responses and allows the opportunity for review. The web-based training is self-paced and generally takes about 15 min to complete.

After this preliminary training was complete, the two observers together trialled the coaching tools in a single case, without involvement of investigators. We followed this first observation with a conference call allowing the observers to debrief with the primary investigators (LH, SS). During this call, we discussed discrepancies in observer ratings and provided an opportunity for the observers to ask questions about the instruments. Observers requested clarification regarding subjective measures, for example, what constituted ‘significant’ disruption and ‘repeatedly’ leaving the OR. The remaining cases were then completed without additional discussion with the research team. Additionally, the observers did not discuss their ratings with each other during the study period.

The onsite project coordinator randomly selected 50 surgical cases for observation. Selection criteria included an expected case duration between 30 min and 2 h. We expected the minimum duration to provide adequate time for observers to evaluate teamwork during the procedure, and the maximum duration to allow the observers to see both the briefing and debriefing portions of the same case. Additionally, selected cases were elective. This criterion was designed to minimise disruption and distraction associated with the presence of observers that might affect patient care. We also excluded cases involving study investigators to reduce potential for observer bias. Study investigators sent a letter to all surgical personnel prior to the start of the study offering the opportunity to decline to participate. None did so. The observers attended cases over a 3-month period from November 2012 to January 2013.

We obtained ethical approval for the study from institutional review boards at the Greenville Health System and the Harvard School of Public Health.

Data collection and statistical analysis

The project coordinator at the study hospital sent electronic copies of completed paper-based observation forms to Safe Surgery 2015 team members every 2–3 weeks during the data collection period. Investigators entered these data for analysis, checking accuracy of data entry by reviewing data that seemed inconsistent or unusual and double checking 10% of all data entered.

We performed all analyses using SAS V.9.3 (SAS Institute, Cary, North Carolina, USA). To begin our analysis, we calculated descriptive statistics for both tools. To assess IRR, we calculated the percent absolute agreement and Cohen's κ score for each section of the Checklist Coaching Tool and each domain of the Surgical Teamwork Tool. κ is considered more robust than simple percent agreement because it accounts for agreement occurring by chance. For questions using Likert scales, we used a weighted κ score, which assigns partial credit for near, but not exact, agreement. For Likert scale questions also including an N/A option, the N/A was assigned a value of 6 in order to maintain the same continuum as the scale. For the Checklist Coaching Tool, we calculated an overall average κ coefficient as an average of the κ coefficients for the individual sections weighted by the number of items in each section.23 We generated an overall κ coefficient for the Surgical Teamwork Tool by calculating an average of the κ coefficients across all 19 items. The 95% CIs for the κ coefficients were calculated using a jackknife technique.24 We considered κ coefficients to be statistically significant if the 95% CI excluded 0, and the p value was less than or equal to 0.05. In sections where there was a very high prevalence of the same answer (ie, an item was always rated a 5 by both observers), we did not estimate κ coefficients because the probability of selecting the answer by chance was so high that the κ coefficient is considered an inappropriate measure of reliability.25 These sections were omitted from the overall κ calculations.
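
As an illustration of the agreement statistics described above, the following is a minimal sketch in Python rather than the SAS code used in the study. The function names are hypothetical, and linear weights are assumed for the weighted κ because the weighting scheme is not stated here. Ratings are coded 1 to 5, with 6 standing for N/A as described above.

```python
from collections import Counter


def percent_agreement(r1, r2):
    """Share of cases on which the two raters gave the identical rating."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2, categories, weighted=False):
    """Cohen's kappa for two raters.

    `categories` is the ordered list of possible ratings. With
    weighted=True, linear weights (an assumption of this sketch) give
    partial credit for near agreement; otherwise only exact matches count.
    """
    n = len(r1)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    observed = Counter((index[a], index[b]) for a, b in zip(r1, r2))
    m1 = Counter(index[a] for a in r1)  # marginal counts, rater 1
    m2 = Counter(index[b] for b in r2)  # marginal counts, rater 2

    def weight(i, j):
        if weighted and k > 1:
            return 1 - abs(i - j) / (k - 1)
        return 1.0 if i == j else 0.0

    # Observed and chance-expected (weighted) agreement.
    p_obs = sum(weight(i, j) * c for (i, j), c in observed.items()) / n
    p_exp = sum(weight(i, j) * m1[i] * m2[j]
                for i in range(k) for j in range(k)) / (n * n)
    if p_exp == 1:
        # Both raters always chose the same single answer.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical ratings for eight cases on one Likert item (6 = N/A).
observer_1 = [5, 5, 4, 3, 5, 6, 4, 5]
observer_2 = [5, 4, 4, 3, 5, 6, 5, 5]
scale = [1, 2, 3, 4, 5, 6]
agreement = percent_agreement(observer_1, observer_2)
kappa_w = cohen_kappa(observer_1, observer_2, scale, weighted=True)
```

The `ValueError` branch corresponds to the situation described above in which every rating is identical and κ cannot be estimated.
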
We interpreted the κ coefficients using Landis and Koch's scale: <0.20 was considered ‘slight agreement’; 0.20–0.40 ‘fair’, 0.40–0.60 ‘moderate’, 0.60–0.80 ‘substantial’, and >0.80 ‘almost perfect’.26 For this study, we set our threshold for considering the tools sufficiently reliable for widespread use at 0.70, which is the midpoint for the ‘substantial’ category.27
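
The interpretation bands and the study threshold can be expressed as a short lookup, sketched here in Python. The function names are hypothetical. Because the quoted ranges share their end points, this sketch assigns 0.20, 0.40 and 0.60 to the higher band and 0.80 to ‘substantial’, which is an assumption rather than something stated in the scale.

```python
RELIABILITY_THRESHOLD = 0.70  # study threshold for widespread use


def landis_koch_label(kappa):
    """Map a kappa coefficient to the descriptive bands quoted above."""
    if kappa < 0.20:
        return "slight"
    if kappa < 0.40:
        return "fair"
    if kappa < 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


def meets_threshold(kappa, threshold=RELIABILITY_THRESHOLD):
    """True when kappa reaches the reliability threshold set for this study."""
    return kappa >= threshold
```

Under these assumptions, a coefficient of 0.63 would be labelled ‘substantial’ yet would still fall short of the 0.70 threshold.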

To evaluate the possibility of an experience effect, that is, that ratings would change with experience in using the tools, we compared percent agreement and κ scores of the first 10 cases (excluding the first case that was observed prior to the debriefing call with investigators) and the last 10 cases.
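
The case selection for this comparison can be sketched as follows. This is an illustration only: the function name is hypothetical, and it assumes that ‘the first 10 cases’ means cases 2 to 11 in chronological order once the initial practice case is set aside.

```python
def first_and_last_windows(cases, window=10, skip_initial=1):
    """Split chronologically ordered cases into early and late windows.

    The first `skip_initial` cases (here, the single case observed before
    the debriefing call) are left out of the early window.
    """
    usable = cases[skip_initial:]
    if len(usable) < 2 * window:
        raise ValueError("not enough cases for two non-overlapping windows")
    return usable[:window], usable[-window:]


# Case numbers 1..50 stand in for the paired observation records.
early, late = first_and_last_windows(list(range(1, 51)))
```

Percent agreement and κ would then be computed separately on `early` and `late` and compared.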


Case characteristics

Both observers attended all 50 cases to which they were jointly assigned. Information about case characteristics is presented in table 2. The median age of the patients in the cases observed was 38.5 years (IQR 14.0–56.0), and the median case duration was 43.5 min (IQR 34.0–72.0). A significant non-clinical disruption, as judged by the observers, occurred in only one case. However, there were significant delays of greater than 30 min in 5 (10%) of the cases. In 6 (12%) of the cases, the patients were admitted to the hospital postoperatively (versus discharged home). The most commonly observed specialty was general surgery (36% of all cases), followed by gynaecology (18%).

Table 2

Observed case characteristics (n=50)

Checklist performance and teamwork in the operating room

The observation tools provide a snapshot about checklist performance and surgical teamwork at Greenville Memorial Hospital at the time cases were observed (table 3). Compliance with the Joint Commission's Surgical Care Improvement Project (SCIP) measures was very high according to both observers, with 40 teams providing antibiotics within 1 h of incision in the 41 cases that called for prophylactic administration. Additionally, in the 36 cases where compression boots were not contraindicated, all the teams provided them. For appropriate placement of warmers in cases with an expected duration of more than 1 h, both observers identified high rates of compliance but differed in their assessment of case duration. One observer reported that teams placed warmers in 38 of 38 cases where warmers were required. The other observer reported that teams complied in 41 of 42 applicable cases.

Table 3

Checklist performance in the 50 observed cases

Compliance with performing the briefing and debriefing portions of the checklist was less consistent. According to both observers, teams introduced themselves by name and role, or had done so earlier in the day in all 50 cases. However, the first observer noted introductions in 44 of the 50 cases and the second observer in 43 of the cases. Surgeons discussed the operative plan less than half the time (in only 22 of the 50 cases according to the first observer, and in 20 cases according to the second observer). The surgeon stated the expected duration of the procedure in 31 or 33 of the 50 cases according to the first and second observers, respectively. By contrast, nurses discussed sterility, equipment and other concerns more frequently (in 45 of 50 cases according to the first observer and 41 cases according to the second observer). There was more disagreement regarding how often the anaesthesia provider discussed the anaesthesia plan. The first observer reported a briefing by an anaesthesia provider in 43 of 50 cases while the second observer reported it in just 32 cases. Most teams did not perform the checklist as intended, by reading every item aloud without reliance on memory. The two observers reported that the checklist was performed properly in only 22 or 23 of the cases, respectively. For the debriefings, teams discussed specimen labelling in 25 cases according to both observers. However, the first observer identified 34 cases with specimens while the second noted 33. Teams discussed equipment and other problems in 46 of 50 or 44 of 49 cases according to the two observers, respectively. Finally, observers reported that 40 of 49 teams (41 of 50 according to the second observer) discussed key concerns for patient recovery and postoperative management.

Buy-in to the checklist process among the surgical team members was uniformly rated as high, with mean buy-in scores of 4.78–4.88 among the different professional roles. A notable proportion of cases experienced equipment issues. The two observers reported that in 16 or 17 cases, nurses had to leave the OR repeatedly to find instruments or equipment. Equipment was available and functioning throughout the case in only 23 or 24 of the 50 cases according to the first and second observers, respectively. Both observers also noted that antibiotic re-dosing was not discussed in the one case where the expected case duration of longer than 2 h warranted such discussion.

With regard to teamwork in the operating room, the observers uniformly rated cases highly (table 4). The mean overall teamwork rating was 4.74 (SD 0.49) according to the first observer, and 4.98 (SD 0.14) by the second observer. Scores ranged from 3 to 5 on the 5-point scale. The reverse-scored item ‘Team members referred to each other by role instead of name’ (Q13) was rated the highest by both observers (mean 4.98, SD 0.14 by observer 1; mean 5.00, SD 0.00 by observer 2), indicating team members almost never referred to others by role. The teamwork item ‘Verbal communication among team member was easy to understand’ (Q5) was rated the lowest by the observers (mean 4.54, SD 0.58 by observer 1; mean 4.62, SD 0.57 by observer 2). When items were aggregated by teamwork domain, assertiveness was, on average, the highest rated domain (mean 4.85, SD 0.16 by observer 1; mean 4.97, SD 0.01 by observer 2), while communication was rated the lowest (mean 4.84, SD 0.20 by observer 1; mean 4.83, SD 0.15 by observer 2).

Table 4

Surgical teamwork in the 50 observed cases

Inter-rater reliability

For the Checklist Coaching Tool, the overall percent agreement was 93%, and the overall κ coefficient was 0.74 (95% CI 0.66 to 0.82) (table 5). Percent agreement within sections ranged from 83% for SCIP measures and surgical team member buy-in to 100% for the surgical best practices section. κ coefficients ranged from 0.44 for buy-in to 0.94 for adherence to the Joint Commission Timeout.
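
An overall coefficient and interval of this kind could be obtained along the following lines. This is a hedged sketch in Python, not the study's SAS code: the function names are hypothetical, and a leave-one-case-out jackknife with a normal-approximation interval is assumed, since the exact jackknife variant used is not described here.

```python
from math import sqrt


def weighted_mean_kappa(section_kappas, items_per_section):
    """Average section kappas, weighting each by its number of items."""
    total = sum(items_per_section)
    return sum(k * n for k, n in zip(section_kappas, items_per_section)) / total


def jackknife_ci(cases, statistic, z=1.96):
    """Leave-one-case-out jackknife estimate with an approximate 95% CI.

    `cases` is a list of per-case records and `statistic` maps such a
    list to a single number (for example, an overall kappa).
    """
    n = len(cases)
    full = statistic(cases)
    # Pseudo-values: n * theta_all - (n - 1) * theta_without_case_i
    pseudo = [n * full - (n - 1) * statistic(cases[:i] + cases[i + 1:])
              for i in range(n)]
    estimate = sum(pseudo) / n
    variance = sum((p - estimate) ** 2 for p in pseudo) / (n * (n - 1))
    half_width = z * sqrt(variance)
    return estimate, estimate - half_width, estimate + half_width
```

In use, `statistic` would be a function that recomputes the overall κ from the paired ratings of the retained cases, so that each case is dropped in turn.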

Table 5

Surgical Safety Checklist Coaching Tool inter-rater reliability and percent agreement (n=50 cases)

Within the SCIP measures, percent agreement for antibiotics being given within 1 h (Q1) was 58% when only unprompted administration was considered to be proper performance, but increased to 92% when the responses for ‘Yes w/o prompting’ and ‘Yes, prompted by the checklist’ were combined. There was no change in the percent agreements for compression boot use (Q2) or warmer use (Q3) when the ‘Yes’ responses were combined. The overall percent agreement increased to 95% though, when the ‘Yes’ responses were combined.
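
The effect of combining the two ‘Yes’ responses can be illustrated with a small sketch in Python. The ratings below are invented for illustration and are not study data; the ‘No’ label and the helper names are assumptions.

```python
def collapse(responses, mapping):
    """Recode responses, leaving any value absent from `mapping` unchanged."""
    return [mapping.get(r, r) for r in responses]


def percent_agreement(r1, r2):
    """Share of cases on which the two raters gave the identical response."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


# Hypothetical ratings for five cases; not data from the study.
rater_1 = ["Yes w/o prompting", "Yes, prompted by the checklist",
           "Yes w/o prompting", "No", "Yes w/o prompting"]
rater_2 = ["Yes, prompted by the checklist", "Yes, prompted by the checklist",
           "Yes w/o prompting", "No", "Yes, prompted by the checklist"]

merge_yes = {"Yes w/o prompting": "Yes",
             "Yes, prompted by the checklist": "Yes"}

before = percent_agreement(rater_1, rater_2)
after = percent_agreement(collapse(rater_1, merge_yes),
                          collapse(rater_2, merge_yes))
```

In this invented example the raters disagree only about whether the checklist prompted the action, so agreement rises from 60% to 100% once the two ‘Yes’ responses are merged.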

The overall percent agreement for the Surgical Teamwork Tool was 86% and the κ score was 0.84 (95% CI 0.77 to 0.90). The assertiveness domain had the lowest κ score at 0.63 (95% CI 0.45 to 0.82) and percent agreement at 79%. The respect domain had the highest κ score at 0.92 (95% CI 0.84 to 1.00) and percent agreement at 92%. The κ score for the communication domain was not statistically significant (κ 0.66, 95% CI −0.76 to 0.99).

Training effect

Percent agreement and κ scores for the first 10 cases (excluding the initial case done prior to phone training) versus the last 10 cases are shown in tables 5 and 6. For the Checklist Coaching Tool, κ coefficients improved in the sections measuring case characteristics, Joint Commission timeout, debriefing, buy-in, and in the first additional data section. κ coefficients decreased or remained unchanged in the SCIP, briefing, active participation and surgical best practices sections. When comparing the first 10 cases with the last 10 cases for the Checklist Coaching Tool, the change in the overall κ coefficient was non-significant, 0.60 (95% CI 0.27 to 0.93) for the first 10 cases and 0.83 (95% CI 0.54 to 0.99) in the last 10 (p=0.319). Percent agreement changed from 92% to 94%. For the Surgical Teamwork Tool, comparison of the first 10 cases with the last 10 cases found a non-significant change in κ: 0.65 (95% CI 0.37 to 0.93) for the first 10 cases and 0.89 (95% CI 0.77 to 0.99) for the last 10 cases (p=0.13). Percent agreement changed from 76% to 94%.

Table 6

Surgical Teamwork Tool inter-rater reliability and percent agreement (n=50 cases)


This paper describes the development, pilot testing and IRR of a pair of novel tools for measuring surgical safety checklist performance and teamwork in the operating room. To our knowledge, this is the first test of paired checklist performance and teamwork observation tools conducted without highly trained observers. Unlike prior studies that have required substantial training,28–31 we found that two observers with limited training and no previous operating room experience were able to effectively use the tools almost immediately. Observers achieved IRR scores considered substantial by standard statistical criteria26,27 in the first 10 cases, and maintained this level of reliability throughout the study period. We also did not find any significant changes in reliability between the first 10 and last 10 cases.

Consistent with prior studies,2,11,32 the observers identified numerous opportunities for improvement in checklist performance, with certain checklist items being performed in fewer than half the observed cases. For example, more surgeons relied on memory than read the checklist aloud. Surgeons also often failed to discuss the operative plan, expected duration, or expected blood loss. Whether these omissions were conscious or unconscious is unclear. However, the tendency to rely on memory suggests that educating surgeons to read from a printed checklist could lead to improved checklist performance.

By contrast with studies of teamwork using observational tools based on behaviourally anchored rating scales,19,20,22 the observers in this study tended to score elements of teamwork as exceptionally high, with no cases rated below 3 on a 5-point scale. While these results do not reflect the full range of response options on the tool, they do not indicate that poor behaviour never occurred, but rather that on average optimal behaviours were performed at least half the times they were observed. They are also sufficiently varied to reveal opportunities for improvement.

Our findings suggest that it is possible to develop coaching and measurement tools to support large-scale implementation efforts like the Safe Surgery 2015 initiative. Previous studies of teamwork assessment tools have generally relied on highly trained and experienced observers.19,33 However, the scale of the Safe Surgery 2015 initiative meant that extensive training was not feasible, so the tools had to be usable with limited training. The training, which included detailed instructions, video vignettes of optimal and suboptimal behaviour, an assessment for establishing accurate and aligned observations, and collective debriefing following an initial trial, could be applied by hospitals in any environment. In the authors’ opinion, the debriefing to resolve questions was important but did not require expert facilitation. Rather, the important feature of the debriefing was to create an opportunity for the observers to discuss between themselves how they would apply the tools and to come to an agreement about how to proceed. Though local observers may not rate teamwork and checklist performance as would highly trained observers,34 the more practical approach applied in this study demonstrated the ability of observers to discern variance that would provide opportunity for coaching and improvement, and it enjoys the distinct advantage of potentially broad dissemination.

The lack of significant change in IRR between the first 10 cases and the last 10 cases suggests, in contrast with studies with other surgical teamwork observation tools,35 that there is a minimal learning curve for using our instruments. The ability to use the tool almost immediately is important, as most participating hospitals do not have the resources to train observers extensively. Also, many hospitals want to observe cases periodically to ensure consistent checklist performance. An easy-to-use tool with high IRR increases the likelihood that these periodic observations will be comparable and useful. The limited number of observers and their experience in quality improvement and clinical observation likely contributed to their ability to conduct observations reliably. Their familiarity with norms for interaction due to long tenure at the hospital also probably played a role. However, these conditions could be replicated by most hospitals.

Despite the strong overall performance of the observation tools, our findings highlight opportunities for improvement. First, the κ coefficients for several specific subsections (ie, SCIP measures and checklist buy-in on the Checklist Coaching Tool; communication, coordination and assertiveness on the Surgical Teamwork Tool) fell below our preferred threshold of 0.70. Additionally, we could not calculate a meaningful κ coefficient for two sections of the Checklist Coaching Tool due to the very high prevalence of a single answer. One possible reason for the low κ in the SCIP measures and checklist buy-in sections was that both had scales with 4 and 5 points, respectively. The observers in particular appeared to have difficulty agreeing on whether the antibiotics being given on time was due to prompting by the checklist or not. This is evidenced by the improvement in κ and percent agreement when the two ‘Yes’ responses were combined. However, the two observers had little difficulty agreeing that the checklist did not play a role in compression boot or warmer use in the cases. The agreement (or lack thereof) on these tasks may reflect the fact that antibiotic prophylaxis is a cognitively more difficult task for surgical team members compared with the other two items, and may benefit from better team communication as prompted by the checklist.
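
The influence of a highly prevalent answer on κ can be seen in a small worked example, sketched here in Python with invented counts that are not study data. The function name is hypothetical, and the two agreement tables are chosen only to hold raw agreement fixed while changing prevalence.

```python
def agreement_and_kappa(both_yes, only_r1_yes, only_r2_yes, both_no):
    """Percent agreement and unweighted Cohen's kappa from a 2x2 table."""
    n = both_yes + only_r1_yes + only_r2_yes + both_no
    p_obs = (both_yes + both_no) / n
    r1_yes = (both_yes + only_r1_yes) / n
    r2_yes = (both_yes + only_r2_yes) / n
    # Agreement expected by chance, given each rater's marginal rates.
    p_exp = r1_yes * r2_yes + (1 - r1_yes) * (1 - r2_yes)
    if p_exp == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return p_obs, (p_obs - p_exp) / (1 - p_exp)


# Two hypothetical tables of 50 cases, both with 92% raw agreement.
balanced = agreement_and_kappa(23, 2, 2, 23)  # answers evenly split
skewed = agreement_and_kappa(46, 2, 2, 0)     # one answer dominates
```

With the evenly split table, κ is 0.84. With the skewed table, the same 92% agreement yields a κ slightly below zero, because almost all of that agreement is expected by chance.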

Second, consistently high teamwork scores suggest opportunities for improving the tool. One potential explanation for the high scores is that the observers missed, or were reluctant to report, non-optimal behaviour among surgical teams in their own hospital. Another potential explanation is that the observers’ presence influenced the behaviours of surgical teams. Potential approaches for expanding the range of teamwork scores reported include additional training for the observers and the use of calibration (eg, observers jointly observe a case and agree to score the observed behaviours at the midpoint of the 5-point Likert scale). This could also help improve the IRR of these sections.35 Additionally, those instructing the observers should emphasise the non-punitive, learning-oriented nature of the exercise, and the hospital’s desire to achieve the best possible teamwork to make observers feel safe to assign less-than-perfect ratings. The combined effect could be to encourage observers to set a high bar and not to be afraid to identify ways in which surgical teams have not yet achieved it. Users may also need to consider revising the instructions, question prompts, or increasing the number of answer choices to increase response variance.

Additional limitations of this study are worthy of note. First, results reflect the experience of a single institution, and the experience of other institutions may differ. However, both tools are being used broadly by hospitals participating in the Safe Surgery 2015 initiative. Other hospitals are reporting similar scores, suggesting that the experience of the hospital in this study was in no way extraordinary. Second, the two observers in this study had prior experience in conducting quality management projects, including the use of audit tools. While other institutions may not have such experienced observers, most hospitals do have people who have experience auditing for compliance and certification programmes. Such auditors would likely be similar to the observers in this study in that they did not have prior experience in the operating room or in measuring surgical teamwork. Third, our observers only attended elective cases shorter than 2 h, and none of the cases involved a significant risk of blood loss. The 2 h criterion was chosen to ensure that the observers could reasonably watch the entire case and thus get an accurate reflection of teamwork. However, it is possible that observing longer cases, or cases with a higher stress level due to higher risk of blood loss, might result in discrepancies between the observers' responses and lead to lower IRR.


Limitations notwithstanding, the observation tools used as part of the Safe Surgery 2015 initiative appear to be reliable. Moreover, these tools provide an example of how observational tools can be integrated into large-scale implementation efforts. The tools and training materials are now publicly available and in use by other hospitals, and we look forward to learning from their experiences. Future research from the Safe Surgery 2015 initiative will explore how checklist performance and teamwork in ORs vary across diverse hospitals, and how teamwork in the OR relates to performance of safety interventions like the checklist.


The authors wish to thank Martin Zammert, Jeff Cooper, Dan Raemer, Ashley Kay Childers and Mathew Kiang for their help in developing and refining the observation tools along with the web-based training modules; April Howell and Michael Rose for pilot testing the tools at McLeod Regional Medical Center; Danielle McElveen for organising the study logistics at Greenville Health System; Karen Hudson and Lynda Bingham for conducting the observations; and Lorri Gibbons, Rick Foster, Thornton Kirby and the staff of the South Carolina Hospital Association along with the South Carolina State Leadership Team for their critical role in fostering the collaboration between the Safe Surgery 2015 initiative and the hospitals of South Carolina.



  • Funding This study was supported by a grant from the Agency for Healthcare Research & Quality (R18:HS019631-01). The Safe Surgery 2015 initiative is supported by a grant from the Branta Foundation.

  • Competing interests None.

  • Ethics approval IRB at Harvard School of Public Health.

  • Provenance and peer review Not commissioned; externally peer reviewed.