BMJ Qual Saf 21:78-82 doi:10.1136/bmjqs-2011-000296
  • Original research

Determination of the psychometric properties of a behavioural marking system for obstetrical team training using high-fidelity simulation

  1. Ken Milne7
  1. Department of Anesthesia, Women's College Hospital and Sunnybrook Health Sciences Centre, University of Toronto, Toronto, Ontario, Canada
  2. School of Nursing, York University, Toronto, Ontario, Canada
  3. Department of Obstetrics and Gynecology, Sunnybrook Health Sciences Centre, University of Toronto, Toronto, Ontario, Canada
  4. Department of Anesthesia, Sunnybrook Health Sciences Centre, University of Toronto, Toronto, Ontario, Canada
  5. Centre for Health Education Scholarship, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
  6. Department of Anesthesia, University of Toronto, Toronto, Ontario, Canada
  7. Salus Global Corporation, London, Ontario, Canada
  • Correspondence to Dr Pamela J Morgan, Department of Anesthesia, Women's College Hospital, 76 Grenville Street, Toronto, ON M5S 1B2, Canada; pam.morgan{at}
  • Accepted 6 September 2011
  • Published Online First 12 October 2011


Background To determine the effectiveness of high-fidelity simulation for team training, a valid and reliable tool is required. This study investigated the internal consistency, inter-rater reliability and test–retest reliability of two newly developed tools to assess obstetrical team performance.

Methods After research ethics board approval, multidisciplinary obstetrical teams participated in three sessions separated by 5–9 months and managed four high-fidelity simulation scenarios. Two tools, an 18-item Assessment of Obstetric Team Performance (AOTP) and a six-item Global Assessment of Obstetric Team Performance (GAOTP), were used.5 Eight reviewers rated the DVDs of all teams' performances.

Results Two AOTP items were not consistently completed and were omitted from the analyses. Cronbach's α for the 16-item AOTP was 0.96, and 0.91 for the six-item GAOTP. The eight-rater α for the GAOTP was 0.81 (single-rater intra-class correlation coefficient, 0.34), indicating acceptable inter-rater reliability. The ‘four-scenario’ α for the 12 teams was 0.79 for session 1, 0.88 for session 2, and 0.86 for session 3, suggesting that performance is not being strongly affected by the context specificity of the cases. Pearson's correlations of team performance scores between sessions 1 and 2 for the four scenarios were 0.59, 0.35, 0.40 and 0.33, and for the total score across scenarios the correlation was 0.47, indicating moderate test–retest reliability.

Conclusions The results from this study indicate that the GAOTP would be a sufficient assessment tool for obstetrical team performance using simulation provided that it is used to assess teams with at least eight raters to ensure a sufficiently stable score. This could allow the quantitative evaluation of an educational intervention.


Communication problems have been identified by the Confidential Enquiry into Maternal and Child Health and the Joint Commission in the USA as root causes of maternal and neonatal morbidity and mortality.1 2 In the UK, maternity services account for a significant proportion of the cost of claims identified to the NHS Litigation Authority and approximately 60–70% of sums paid out result from litigation arising from maternity services.3 Annual ‘skill drills’ have been recommended by the Royal College of Midwives and the Royal College of Obstetricians and Gynaecologists in the UK and are one of the requirements in the new Maternity Clinical Negligence Scheme for Trusts.3

The Institute of Medicine report entitled Crossing the Quality Chasm states that ‘health care organisations should establish interdisciplinary team training programs for clinicians to incorporate the proven team training strategies used in the aviation industry’.4 High-fidelity simulation is one way to bring multidisciplinary teams together to practise, reflect on and repeat the case management of critical obstetrical events. However, to determine whether such an educational intervention has any effect on the quality of team performance, a valid, reliable tool to evaluate performance is required. At the initiation of this study, there were no available reliable tools to assess obstetric team performance. The purpose of this study was to establish the psychometric properties of two behavioural tools developed in a previous study (the Assessment of Obstetric Team Performance (AOTP) and the Global Assessment of Obstetric Team Performance (GAOTP)).5


The venue for this study was a fully equipped simulation facility with a patient mannequin, SimMan (Laerdal Medical Canada, Ltd, Toronto, Ontario, Canada), which can speak, can be catheterised and reacts in an appropriate physiological and pharmacological manner to interventions. Our research team developed an obstetrical abdomen that fits over SimMan and can be prepped and draped; through it a caesarean section can be performed, a baby or babies delivered and massive obstetrical haemorrhage simulated. A fetal heart rate simulator provides visual and audible feedback and was programmed to reflect the findings of the respective scenario. Four simulation scenarios, previously used to develop our behavioural tools (AOTP, GAOTP), were included:

  1. need for caesarean section under general anaesthesia, difficult airway, cannot intubate/cannot ventilate, hypoxaemia, pulseless electrical activity;

  2. severe pre-eclampsia, epidural in situ, non-reassuring fetal heart rate trace, urgent caesarean section, pulmonary oedema;

  3. 34-week twin gestation, prolapsed cord, emergency caesarean section, amniotic fluid embolism, asystole;

  4. prolonged fetal bradycardia, emergent caesarean section, occult abruptio placentae, massive blood loss.

Two performance tools were developed from first principles in a previous study by our research group5 (see online appendix). The AOTP has six themes and 18 subthemes and the GAOTP only the six themes, as follows: communication with patient/partner; task/case management; teamwork; situational awareness; communication with team members; and environment of the room. Each item is rated on a five-point rating scale (1=poor performance, 5=excellent performance). Each theme and subtheme has descriptors for poor and excellent performance.

After research ethics board approval, multidisciplinary obstetrical team members from six hospitals were invited to participate. Each team consisted of an obstetrician, an anaesthetist, three registered nurses and in some cases a family physician. The composition of each hospital's team (ie, the presence or absence of a family physician) reflected the usual practice in that organisation. Each team attended three sessions separated by a period of 5–9 months. After informed consent and an orientation to the centre, teams managed the four simulation scenarios. The first responder received the pertinent patient information. He/she then entered the operating room and could interview the patient before the 30-min scenario began. Each member of the team could personally interact with the patient upon entering the room and a complete chart with full documentation was also available. Teams were informed that they could call for help and an appropriate person would respond (eg, neonatology, respiratory therapist, porter). All four scenarios were recorded for later evaluation using the AOTP and the GAOTP. No formal debriefing was performed.

At the second session, 5–9 months later, participants managed the same four scenarios in a different order, all of which were recorded. The teams were unaware that the same scenarios would be presented. Once completed, six of the 12 teams received a standardised anaesthesia crisis resource management debriefing using the DVD of the team performance to discuss human factor issues that occurred.

At the third session, 5–6 months after the second, each team managed the same four scenarios once more, again presented in a different order. All performances were recorded. Upon completion, teams were offered debriefing if they had not previously received it.

Eight DVD reviewers (three nurses, one midwife, two anaesthetists, two obstetricians), who attended an 8 h workshop to familiarise themselves with the tools, independently reviewed the DVDs of 136 performances (48 from session 1, 48 from session 2 and 40 from session 3). Reviewers were blinded to session order (session 1, 2 or 3) and, to minimise the threat of a standardised learning effect among reviewers, scenario order was randomised for each reviewer.

Statistical analyses

To assess the internal consistency of items, the 1088 completed forms were analysed using Cronbach's α. To establish the inter-rater reliability, the eight-rater Cronbach's α coefficient was calculated across the 136 performances using the total AOTP and GAOTP scores generated by each rater. To establish the ‘inter-scenario’ reliability, a four-scenario Cronbach's α was calculated for the 34 team sessions (12 from session 1, 12 from session 2 and 10 from session 3) using the average AOTP/GAOTP team score for each scenario across the eight raters. Finally, to establish the test–retest reliability of the measures, a Pearson product moment correlation was calculated between the 12 overall team scores (across four scenarios and eight raters) of session 1 and session 2.
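
The two statistics named above can be sketched in a few lines of Python. This is an illustration only, not the authors' analysis code: the function names `cronbach_alpha` and `pearson_r` and the toy ratings are hypothetical, and the table layout (one row per rated unit, one column per item, rater or scenario) is an assumption about how the scores would be arranged.

```python
from statistics import mean, pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table with one row per unit and one column per 'item'.

    The columns are scale items for internal consistency, raters for the
    eight-rater alpha, or scenarios for the four-scenario alpha.
    """
    k = len(scores[0])  # number of columns (items, raters or scenarios)
    if k < 2:
        raise ValueError("alpha needs at least two columns")
    columns = list(zip(*scores))
    sum_of_column_variances = sum(pvariance(col) for col in columns)
    variance_of_row_totals = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum_of_column_variances / variance_of_row_totals)


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical toy data: three performances, each scored by two raters.
toy_ratings = [[2, 1], [4, 3], [6, 2]]
print(cronbach_alpha(toy_ratings))
```

The same function serves all three reliability questions because only the meaning of the columns changes; the test–retest question uses `pearson_r` on the session 1 and session 2 team scores instead.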


Twelve multidisciplinary obstetrical teams from six hospitals (two teams from four hospitals, three from one hospital and one from one hospital with the same team members for each session) were recruited and 10 teams completed all four scenarios in all three sessions. Because of a serious illness in one team member and the relocation of a member of a second team, the remaining two teams completed all four scenarios in sessions 1 and 2 only, for an overall total of 136 performances for review. Two items on the AOTP were not consistently completed by the reviewers because they were not relevant to every scenario and were therefore deleted from the analyses. This left a 16-item AOTP and a six-item GAOTP for analyses. There was no statistically significant difference in team performance between teams that did and did not receive debriefing after the second session (p=0.60); these results are therefore not included in the analyses.
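
The sample sizes quoted in this paper (136 performances, 1088 completed forms and 34 team sessions) follow directly from the design. The short check below reproduces that arithmetic; the variable names are illustrative only.

```python
SCENARIOS_PER_SESSION = 4
RATERS = 8

# Teams that completed each session, as reported in the text.
teams_per_session = {1: 12, 2: 12, 3: 10}

performances = {s: n * SCENARIOS_PER_SESSION for s, n in teams_per_session.items()}
total_performances = sum(performances.values())  # recorded scenarios to review
total_forms = total_performances * RATERS        # one form per rater per performance
team_sessions = sum(teams_per_session.values())  # units for the four-scenario alpha

print(performances, total_performances, total_forms, team_sessions)
# prints: {1: 48, 2: 48, 3: 40} 136 1088 34
```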

Across the 1088 completed evaluations (136 performances × 8 evaluators), the internal consistency (Cronbach's α) for the 16-item AOTP was 0.96, and for the six-item GAOTP was 0.91. The correlation between the two scales was 0.97 and when the two scales were treated as a single 22-item rating scale, Cronbach's α was 0.97, suggesting that they are collectively measuring a single dimension (overall performance). Thus, the remaining analyses present only the data from the GAOTP. The eight-rater α for the GAOTP was 0.81 (single-rater intra-class correlation coefficient, 0.34), indicating acceptable inter-rater reliability when eight raters are used. After averaging team scores across raters for each scenario, the ‘four-station’ α for the 12 teams was 0.79 for session 1, 0.88 for session 2 and 0.86 for session 3, suggesting that performance is not being strongly affected by the ‘situation specificity’ of the scenarios. That is, unlike many performance-based examinations (such as the objective structured clinical examination), a team's performance on one scenario in a session was strongly predictive of the team's performance on other scenarios such that the average of four stations was a stable measure of a team's performance on a given day. Pearson's correlations of team performance scores from session 1 to session 2 for the four scenarios were 0.59, 0.35, 0.40 and 0.33, and for the total score across scenarios the correlation was 0.47, indicating moderate test–retest reliability.
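
The relationship between the single-rater coefficient (0.34) and the eight-rater α (0.81) can be illustrated with the Spearman-Brown formula, which projects the reliability of an average of n raters from the reliability of one rater. The paper does not state how the single-rater value was derived, so applying this formula here is an assumption, and the function name `spearman_brown` is hypothetical.

```python
def spearman_brown(single_rater_reliability, n_raters):
    """Projected reliability of the mean of n_raters parallel raters."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# Reported single-rater intra-class correlation coefficient for the GAOTP.
projected = spearman_brown(0.34, 8)
print(round(projected, 3))
```

Under this assumption the projection comes out at approximately 0.80, close to the reported 0.81; a gap of that size is what rounding the single-rater coefficient to two decimal places could produce.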


A recent systematic review of the literature examining multidisciplinary team training in a simulation setting for acute obstetric emergencies reported the results of eight studies.6 One retrospective cohort study demonstrated an improvement in perinatal outcome as measured by 5 min Apgar scores and the incidence of hypoxic-ischaemic encephalopathy.7 Seven studies reported an improvement in knowledge, practical skills, communication and team performance in the management of critical obstetric events. Four of the studies from the UK presented data from the Simulation and Fire-drill Evaluation study commissioned by the Department of Health of England and Wales.8–11 Participants included only obstetricians and midwives. None of the remaining studies involved anaesthetists and none of the studies evaluated team behavioural performance using a validated scale for obstetric team performance. The results of this systematic review reinforce the need for valid, reliable tools to measure team performance to assess the effect of simulation-based training.6

Rosen et al present a set of ‘best practices’ in the development of a team performance measurement tool in simulation-based training.12 Within the category of Best Practice #3: ‘Capture Competencies’, the authors warn against adopting ‘generic’ measurement tools. Although a tool may have been found to be valid and reliable in one setting, it is not necessarily transferable to another setting, underscoring the need to target the measurement tool to the specific competencies being trained. Similar to the Anaesthetists' Non-Technical Skills tool,13 the Non-Technical Skills for Surgeons,14 and the Ottawa Crisis Resource Management Global Rating Scale (Ottawa GRS),15 our tools, the AOTP and GAOTP, were developed from first principles to assess non-technical skills of a specific professional group and clinical context. The development of the AOTP and GAOTP involved 13 reviewers who reviewed the DVDs of 12 multidisciplinary obstetrical teams managing four obstetrical scenarios.5 The reviewers generated a list of behavioural aspects judged to negatively or positively affect the teams' performances. Data were collated and analysed using qualitative methodology and common themes and subthemes identified, and anchored descriptors for ‘excellent’ and ‘poor’ team performance developed. The themes and subthemes were compiled to create a prototype of the AOTP tool, incorporating a five-point Likert scale (1=poor performance and 5=excellent performance). The GAOTP was also created using only the themes.5

The AOTP and GAOTP differ from the assessment scales reported in the literature. The Anaesthetists' Non-Technical Skills and Non-Technical Skills for Surgeons address the behaviours of one professional group only, while the AOTP and GAOTP are used to assess multidisciplinary obstetrical team performance in the context of high-fidelity simulation. Tools such as the Mayo teamwork scale16 are used by teams to conduct self-assessments, rather than for objective assessments by external reviewers. Tools such as the Observational Teamwork Assessment for Surgery17 are completed by two real-time raters, with one focusing on intra-operative tasks and equipment and the other on team behaviours. Another major difference is that the Observational Teamwork Assessment for Surgery assesses teamwork separately in surgical, anaesthesia and nursing subteams rather than the performance of the team as a whole.

When we began this study there were no valid, reliable tools to assess multidisciplinary obstetrical non-technical skills performance. Since that time, a tool entitled the Clinical Teamwork Scale (CTS) has been developed by the State Obstetric and Paediatric Research Collaborative OB Safety Initiative (STORC).18 The CTS contains 15 items in five teamwork domains. Two of the domains, communication and situational awareness, are the same as two of the themes in the AOTP/GAOTP. The STORC CTS uses a 0–10 rating scale and, similar to our tool, the scale values are anchored by qualitative descriptors. The CTS was validated using a single 5–6 min scenario (shoulder dystocia), and the recordings were made using four scripted actors acting out the same scenario three times to reflect a poor, an average and a good/perfect teamwork performance. The roles the actors were assuming (registered nurse, obstetrician, anaesthesiologist) were not described. Three raters (an obstetrician, a perinatologist and a nurse–midwife) reviewed the performances. The results demonstrated that the raters tended to agree fairly consistently about the team scores in the three recordings of ‘poor’, ‘average’ and ‘good/perfect’ performances. Inter-rater reliability of the three raters was 0.98. A comparison of the teamwork behaviours of the CTS and the AOTP/GAOTP is found in table 1.

Table 1

Comparison of the clinical teamwork scale (CTS)18 and the assessment of obstetric team performance (AOTP)/global assessment of obstetric team performance (GAOTP)5

Our study differs from the STORC study in several ways. First, our scenarios involved unscripted real-life multidisciplinary obstetric teams managing four 30-min scenarios involving critical events rather than standardised poor, average or good performances. The viewing time also differed in that our raters viewed 1020 h of DVDs and evaluated 136 scenarios whereas the STORC study reviewers evaluated 15 min of recordings of three teams managing one scenario. In addition, our raters included anaesthetists, obstetricians, labour and delivery nurses and a midwife, potentially adding more variance in rater responses. These methodological distinctions may account for the fact that we required eight raters to achieve acceptable inter-rater reliability. The differences between the two studies make it difficult to compare the psychometric properties of the two instruments directly, and future studies using both scales together will likely be necessary to make appropriate claims regarding the relative properties of the CTS and the GAOTP as measures of team functioning in naturalistic contexts.

Since previous studies have found that team members tend to rate themselves and their colleagues favourably,16 19 one of the goals of this research was to develop a team assessment tool for use by independent reviewers. Previous work by our research team indicated the need for eight to nine raters to achieve an acceptable reliability.20 To help ensure reliability, we decided to use eight reviewers in the current study. To further ensure reliability, the reviewers attended an 8 h workshop to learn about the themes and subthemes of team performance and to practise using the tool on five DVDs that were not part of the current study. Ultimately, our decision to use eight raters is supported by the eight-rater α for the GAOTP of 0.81, and eight is therefore the number of raters that should be used in future studies. Although the need for eight raters may not seem practical, assessments obtained using this tool with eight raters will yield reliable results. Use of the GAOTP alone will significantly shorten the time to score the performances.
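
The number of raters implied by a given single-rater reliability can be estimated by inverting the Spearman-Brown formula. The sketch below is a hedged illustration rather than the authors' method: the function name `raters_needed` is hypothetical, and the 0.80 target is an assumed threshold for ‘acceptable’ reliability that the paper does not state explicitly.

```python
import math


def raters_needed(single_rater_reliability, target):
    """Smallest number of raters whose averaged score reaches the target reliability.

    Inverts the Spearman-Brown formula: n = target * (1 - r) / (r * (1 - target)).
    """
    r = single_rater_reliability
    n = target * (1 - r) / (r * (1 - target))
    # Subtract a tiny tolerance so floating-point noise cannot push an exact
    # integer result up to the next whole rater.
    return math.ceil(n - 1e-9)


# Reported single-rater coefficient for the GAOTP, with an assumed 0.80 target.
print(raters_needed(0.34, 0.80))
```

With the reported single-rater value of 0.34 and the assumed target of 0.80, the calculation gives eight raters, which agrees with the number used in this study.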

This study has demonstrated that the previously developed GAOTP is a reliable tool to assess the non-technical skills of multidisciplinary obstetric team performance using simulation, provided that eight raters are used to ensure a sufficiently stable score. The GAOTP can now be used to determine whether simulation-based education and teamwork training improve obstetric team performance in the behavioural skills reflected by this tool.


  • Funding This work was supported by a research grant from the Canadian Patient Safety Institute (CPSI), Edmonton, Alberta, Canada. Research equipment support was received from Hedstrom Canada, Cambridge, Ontario, Canada.

  • Competing interests Drs Morgan, Tregunno, Pittini, Regehr, Kurrek and Ms DeSousa have no conflict of interest related to this study. Dr Tarshis is a shareholder in CAE Inc. and Dr Ken Milne is a salaried employee in the position of President and CEO of Salus Global Corporation.

  • Ethics approval Ethics approval was provided by Sunnybrook Health Sciences Centre.

  • Provenance and peer review Not commissioned; externally peer reviewed.