Observer-based tools for non-technical skills assessment in simulated and real clinical environments in healthcare: a systematic review
Helen Higham (1), Paul R Greig (1), John Rutherford (2), Laura Vincent (1), Duncan Young (1), Charles Vincent (3)

  1. Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, UK
  2. Department of Anaesthetics, Dumfries and Galloway Royal Infirmary, Dumfries, UK
  3. Department of Experimental Psychology, University of Oxford, Oxford, UK

Correspondence to Dr Helen Higham, Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford OX3 9DU, UK; helen.higham{at}

Abstract


Background Over the past three decades multiple tools have been developed for the assessment of non-technical skills (NTS) in healthcare. This study was designed primarily to analyse how they have been designed and tested but also to consider guidance on how to select them.

Objectives To analyse the context of use, method of development, evidence of validity (including reliability) and usability of tools for the observer-based assessment of NTS in healthcare.

Design Systematic review.

Data sources Search of electronic resources, including PubMed, Embase, CINAHL, ERIC, PsycNet, Scopus, Google Scholar and Web of Science. Additional records identified through searching grey literature (OpenGrey, ProQuest, AHRQ, King’s Fund, Health Foundation).

Study selection Studies of observer-based tools for NTS assessment in healthcare professionals (or undergraduates) were included if they: were available in English; were published between January 1990 and March 2018; assessed two or more NTS; were designed for simulated or real clinical settings; and had provided evidence of validity, with or without evidence of usability. In total, 11,101 articles were identified. After limits were applied, 576 were retrieved for evaluation and 118 articles were included in this review.

Results One hundred and eighteen studies describing 76 tools for assessment of NTS in healthcare met the eligibility criteria. There was substantial variation in the method of design of the tools and in the extent of validity and usability testing. There was considerable overlap in the skills assessed and in the contexts of use of the tools.

Conclusion This study suggests a need for rationalisation and standardisation of the way we assess NTS in healthcare and greater consistency in how tools are developed and deployed.

  • team training
  • performance measures
  • medical education


Introduction

Evidence that errors in non-technical skills (NTS) are common in adverse incidents in healthcare has been accruing over the past two decades.1–5 NTS have been defined as ‘the cognitive, social, and personal resource skills that complement technical skills, and contribute to safe and efficient task performance.’6 They include such attributes as communication, teamwork, situation awareness, decision-making, task allocation and stress and fatigue management. It is worth highlighting that concern exists around the use of the term NTS7 to describe such important aspects of professional clinical practice; however, while there is currently no universally agreed substitute8 the term NTS will be used for this study.

Interest in evaluating and enhancing NTS in multiprofessional teams of healthcare workers has been increasing in line with concerns highlighted in studies of error in healthcare. A number of tools are now available for measuring NTS, with many of the early examples adapted from the civil aviation field.9–12 Concerns about the measurement properties of these tools (including their validity and reliability) have been raised by educational and research communities.13–17 Assessment of healthcare professionals, particularly in high stakes settings such as examinations or interviews, requires rigorous attention to the quality of the tool being used to make that assessment if it is to be objective and fair. Furthermore, the choice of an appropriate tool for NTS assessment may be hampered by the large number available for different settings in healthcare.

This systematic review of the NTS assessment tools in healthcare seeks to provide a clearer understanding of the range, purpose, evidence of validity and usability of published tools.

Objectives


The objectives were:

  • To provide an overview of observer-based assessment tools for performance of NTS in healthcare professionals or students in simulated or clinical environments.

  • To describe the methods used in developing the tools.

  • To explore the evidence provided for the validity and usability (including training required) of the tools.

Methods


This systematic review was registered with PROSPERO (Ref No: CRD42017055445). Peer-reviewed studies were identified by search of the electronic bibliographic databases Medline, Embase, CINAHL, PsycINFO, Scopus and ERIC. A search of the grey literature was made via Google Scholar, ProQuest and OpenGrey. A manual search of the reference list of identified relevant articles was also conducted. No further searches were conducted after March 2018.

All reviewed articles were assessed using criteria defined by Hawker et al for mixed qualitative and quantitative research studies.18 The inclusion and exclusion criteria are included below, and the assessment questionnaire (as per Hawker) and a detailed search strategy are included as online supplementary appendix 1.

Inclusion criteria

Papers were eligible for inclusion where:

  • They were published in the English language, or translation was available.

  • The population studied comprised healthy adults working in healthcare settings.

  • The publication date was between January 1990 and March 2018.

  • They described a tool designed to assess NTS and included more than one of the following domains: communication, teamwork, situation awareness, decision-making and task allocation/management.

  • They described a tool designed for use by direct observation or review of audiovisual files in a simulated or real clinical setting.

  • Peer-reviewed papers were preferred but if a tool had been developed and only published as, for example, a thesis, this was highlighted.

Exclusion criteria

Papers were excluded where:

  • Ethical approval of the study or informed consent from participants was not described.

  • No data describing evidence of the tool’s validity or reliability were available.

  • The tool was designed for self-assessment only.

  • The tool did not analyse performance under more than one of the key non-technical domains of: communication, situation awareness (sometimes described as vigilance), decision-making or task allocation/management.

  • They described a tool used for the study of technical skills only.

Synthesis of results

Papers with potential for inclusion in the review on the initial search were first screened for relevance, by review of the title and by abstract review (see figure 1 for the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) review process). Papers with a relevant title and abstract were retained for full review. Papers without any assessment of validity or reliability for the NTS tool being used were discarded. Where papers were not retained for review, their reason for non-inclusion was recorded.

Figure 1

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) diagram for non-technical skills (NTS) assessment tools.

The first stage of the screening process was conducted for all papers in pairs (HH and PRG; HH and JR; or PRG and JR). Where any disagreement was encountered, a decision was made by the reviewer who was not a member of the original pair. Full-text articles were acquired for all abstracts put forward for further analysis. These were divided between the three reviewers for initial assessment, and any ambiguities arising regarding inclusion were discussed and agreed together. The final in-depth analysis was then undertaken by HH and PRG with JR acting as final arbiter. All first authors were contacted by email, on two separate occasions, to seek additional unpublished information.
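The paired screening rule described above can be sketched in a few lines of Python. This is only an illustration of the stated protocol (two reviewers screen, and the remaining reviewer decides when they disagree); the function names and the boolean include/exclude encoding are assumptions made for the example, not part of the review's actual workflow.

```python
# Illustrative sketch of paired screening with third-reviewer arbitration.
# Function names and the True/False vote encoding are hypothetical.
REVIEWERS = ("HH", "PRG", "JR")


def third_reviewer(pair):
    """Return the one reviewer who is not a member of the screening pair."""
    remaining = [r for r in REVIEWERS if r not in pair]
    if len(pair) != 2 or len(remaining) != 1:
        raise ValueError("pair must contain two distinct known reviewers")
    return remaining[0]


def screen(pair, votes):
    """Decide inclusion (True) or exclusion (False) for one paper.

    votes maps reviewer initials to that reviewer's include/exclude vote.
    The arbiter's vote is only consulted when the pair disagrees.
    """
    first, second = pair
    if votes[first] == votes[second]:
        return votes[first]
    return votes[third_reviewer(pair)]


# The pair agrees, so no arbitration is needed.
agreed = screen(("HH", "PRG"), {"HH": True, "PRG": True})
# The pair disagrees, so the third reviewer (here PRG) decides.
arbitrated = screen(("HH", "JR"), {"HH": True, "JR": False, "PRG": False})
```

Keeping the arbiter outside the original pair means the tie-breaking vote is always independent of the two initial judgements.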

Most of the tools had already been given a name (eg, Team Emergency Assessment Measure—TEAM19) and, if not, we devised a name based on an approximation of the purpose of the tool (eg, anaesthetic trainee NTS20). A list of acronyms for all the tools in this review can be found in the online supplementary appendix 2.

The NTS assessed by the tools were usually described in categories, for example, communication, teamwork, leadership, and so on, which were underpinned by behavioural markers (eg, TEAM, Observational Teamwork Assessment for Surgery (OTAS), Oxford NOnTECHnical Skills (Oxford NOTECHS), Non-Technical Skills for Surgeons (NOTSS) and Ottawa CRM Global Rating Scale19 21–24) but some described an inventory of behaviours relevant to the context or professional group being analysed (eg, University of Texas Behavioural Markers for Neonatal Resuscitation, Mayo High Performance Teamwork Scale (MHPTS), Teamwork Behavioural Rater (TBR)11 25 26). We classified NTS into the five most commonly occurring categories: communication, leadership and/or teamwork, situation awareness, decision-making and task management. We also included an ‘other’ section to capture elements not ascribable to one of these categories. Examples where additional behaviours were assessed included: professionalism,27 28 ‘environment in the room’29 and stress and distractors.30 Where descriptors of behaviour were essentially a subcategory of one of the five domains they were included under the relevant heading, for example, cooperation was included under teamwork and vigilance under situation awareness.

Studies were analysed over three broad domains: method of development, the applicability and context of use of the tool and the evidence provided for validity of the tool (including any assessment of usability and training requirements). Where the original development and evidence of validity of a tool was described in more than one publication the data from all relevant papers were analysed, as long as at least one member of the original research team was involved.

Evidence of validity was classified (where possible) into domains described by the American Educational Research Association31 which consider all forms of validity under the overarching term, ‘construct validity’:

  • Content (ie, test items are representative of the construct of interest).

  • Relations to other variables such as the ability to discriminate between learner characteristics (eg, between a good or a poor performance, or between levels of experience or professional groups) or relationships with separate measures (eg, that results from the assessment tool are related to those from a tool measuring another, similar construct, often called concurrent or convergent validity in the studies in this review).

  • Internal structure (including: rater reliability and item correlations).

  • Response process (ie, evidence of data integrity including methods for scoring and data entry).

  • Consequences (intended or unintended consequences of an assessment—rarely reported).

Cook et al 17 have highlighted the difficulty of applying instruments used for clinical studies such as Standards for Reporting Diagnostic Accuracy32 and Guidelines for Reporting Reliability and Agreement Studies33 in the context of assessing tools for educational assessment. To provide some assistance to educators in selecting tools for NTS assessment we have categorised tools in terms of context of use, method of design, evidence of validity and assessment of usability (see table 1). The attributes we assessed were developed by the authors and informed by: the initial study assessment questionnaire (see above and online supplementary appendix 1); the iterative analysis of 118 studies; our experience as clinicians and educators; and guidance on design of educational assessment tools34 (including validity and reliability35–39 and team training assessments40).

Table 1

Attributes assessed during analysis of 76 tools for the measurement of NTS in 118 papers

Risk of bias

Data analysis and interpretation was undertaken with an awareness of the risk of bias. Repeated reflection on potential sources of bias in the context of personal beliefs and values (researcher reflexivity41) was integral to the iterative review of the studies. Study selection bias was minimised through use of a systematic search method.

Potential bias for the authors in reviewing the assessment tools included:

  • Familiarity bias: four of the authors are active educators in simulation-based education (JR was the author of one of the tools (Anaesthetic NTS-Anaesthetic Practitioners42) and CV has been involved in the development of other tools for NTS assessment43–45). The lead authors (HH, PRG and JR) have been trained to use the Anaesthetists’ Non-Technical Skills (ANTS) assessment tool.

  • Availability heuristics: the lead authors (HH, PRG and JR) are practising anaesthetists; as such, our training and clinical experience is largely in theatre and intensive care unit settings.

  • Anchoring bias: the order in which we reviewed the papers and the organisation of information presented in each study may influence decisions made in assessing the tools.

Mitigations for these risks included development of a list of attributes for analysis of tools (to provide a more objective framework for describing them, see table 1), review by more than one author and repeated re-examinations of the papers in random order.

Results


The screening process is described in figure 1 as per PRISMA guidance. All articles included for review were observational studies of healthcare professionals or students in simulated or real clinical settings.

We identified 76 unique tools for the assessment of NTS in healthcare that were suitable for inclusion in the review. These were described in 118 papers. The first tool was developed by Gaba et al 9 in North America. Subsequently, most tools have been developed in North America (35 tools), followed by Europe (31 tools) and Australasia (8 tools). One tool was developed in Colombia46 and one in Israel47 (country of origin is shown in table 2 and the online supplementary appendix 3).

Table 2

Description of environment, context of use and scoring for 30 tools for the assessment of NTS in healthcare

Most tools were developed de novo, but some were explicitly based on tools developed by other groups48–51 and some relied on data gathered for the original tool. Self-assessment tools were excluded because, while they may be useful in formative settings, self-assessment of NTS is inaccurate and unsuitable for use in high stakes settings.52

Considerable variability was found in method of tool development, applicability, context of use and evidence of validity in this study, in line with previous systematic reviews of assessment.17 53 54

Methods of tool design and context of use

Methods of reporting observations varied. For example, studies differed in the number of observations made using the tool (eg, Behavioural Marker System - Neurosurgical Non-Technical Skills (BMS-NNTS)55 and Explicit Professional Oral Communication (EPOC)56 include an assessment of frequency of interactions) and in the number of participants or teams observed (some had large numbers of observations or participants56–58 and others fewer49 59 60). Some tools were individual assessments, some were team assessments and some were both, as shown in table 2. Consequently, it was difficult to make meaningful comparisons between the studies.

The largest group of assessment tools (37 (49%)) had been designed for use with multidisciplinary teams; 27 (36%) were for single specialty postgraduate healthcare professionals; 8 (10%) were for the assessment of healthcare students; and 4 (5%) were for multispecialty postgraduate doctors (see table 2 and online supplementary appendix 3).

The environments in which the tools were designed and tested varied but fell under two broad domains: simulated or real clinical settings. Context of use included seven clinical domains: adult inpatient (7 tools (9%)); adult intensive/emergency care (21 tools (28%)); obstetrics (4 tools (5%)); operating theatres (adult and paediatric; 25 tools (33%)); paediatric intensive/emergency care (5 tools (7%)); prehospital care (3 tools (4%)); and generic healthcare settings (3 tools (4%)). Tools for the assessment of NTS in undergraduates (8 tools (10%)) were placed in a separate category from postgraduate tools (following the distinction made by the tools' authors), but there were not enough to warrant further subdivision by clinical domain.
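As a quick cross-check of the counts reported above, the tally can be reproduced in a short Python sketch. It assumes each of the 76 tools is counted in exactly one category (the seven clinical domains plus the separate undergraduate category); the dictionary keys are shorthand labels chosen for this example.

```python
# Cross-check of the context-of-use counts quoted in the text.
# Assumption: every tool appears in exactly one category.
TOTAL_TOOLS = 76

context_counts = {
    "adult inpatient": 7,
    "adult intensive/emergency care": 21,
    "obstetrics": 4,
    "operating theatres": 25,
    "paediatric intensive/emergency care": 5,
    "prehospital care": 3,
    "generic healthcare settings": 3,
    "undergraduate": 8,
}


def share(count, total=TOTAL_TOOLS):
    """Return the percentage of all tools represented by count."""
    return 100 * count / total


# The categories should account for every tool exactly once.
category_total = sum(context_counts.values())

for name, count in context_counts.items():
    print(f"{name}: {count} tools ({share(count):.1f}%)")
```

Printing one decimal place makes it easier to see how each figure relates to the whole-number percentages quoted in the text.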

NTS categories assessed were also variable. Communication was assessed in every tool although not always as an isolated category (eg, Oxford NOTECHS and ANTS). Teamwork and leadership were the next most commonly included categories (74 (97%) of tools), situation awareness was assessed in 66 (87%), task management in 61 (80%) and decision-making in 36 (47%).

Data for 30 tools grouped by context of use as described above are shown in table 2 (tools are listed chronologically, and if there were more than one tool from the same year, in alphabetical order of the author’s name). Space constraints prevent all 76 being shown here. Those that are shown presented more detail on method of development and the greatest amount of evidence for validity (including reliability), requirements for training and usability. Data for the remaining 46 tools are available as online supplementary appendix 3, and the references are shown below, categorised by context of use (all papers describing tools are included):

Evidence of validity and description of training requirements and usability

The argument-based approach to validity35 103 104 was used to assess the tools but this was limited by the variability in the provision of evidence and because the majority of papers referred to validity using more traditional terms. Validity was classified (where possible) into domains described by the American Educational Research Association31: content, response process, internal structure, relations to other variables and consequences. All tools assessed content validity in some form and the next most common assessment was relation to learner characteristics such as experience or educational level of participants (47 tools (62%)). Tests of relationships with separate measures, including tools measuring similar, related constructs (25 (33%)), were more common than those testing tools against others measuring the same construct (11 (14%); frequently these tests were termed convergent or concurrent validity), and only three groups considered predictive validity in the sense of ability to predict future performance.105–107 Some tools contained a technical as well as an NTS assessment but not all of them assessed the relationship between the technical and NTS items (see table 3).

Table 3

Evidence of validity, training requirements and assessment of usability for 30 tools for the assessment of NTS in healthcare

Reliability was most commonly assessed with inter-rater testing (61 tools (80%)) or internal consistency (41 tools (54%)). Only 11 tools (14%) were assessed for intra-rater or test–retest reliability.
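For readers less familiar with these reliability measures, the sketch below shows two statistics that are commonly used for them: Cohen's kappa for agreement between two raters and Cronbach's alpha for internal consistency. The choice of these two statistics is an assumption made for illustration; the reviewed studies used a range of methods, and the example ratings are invented.

```python
# Illustrative reliability statistics; not taken from any reviewed study.
from collections import Counter
from statistics import variance


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's category frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def cronbach_alpha(scores):
    """Internal consistency of a multi-item tool.

    scores holds one row per observed performance and one column per item.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two performances")
    item_variances = [variance(column) for column in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented ratings on a 1-4 scale from two hypothetical observers.
kappa_identical = cohen_kappa([1, 2, 3, 4], [1, 2, 3, 4])
kappa_chance = cohen_kappa([1, 1, 2, 2], [1, 2, 1, 2])
# Two items that rise and fall together across three performances.
alpha_consistent = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
```

Kappa corrects raw agreement for the agreement expected by chance, which is why two raters who agree on half of the cases can still score zero.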

Some authors went to great lengths to analyse usability and generated qualitative and quantitative data from questionnaires or interviews (which informed the development and deployment of their assessment tools).10 29 30 42 108–114

Recommendations for training were described in very different ways, from those who have designed bespoke courses for their tools (eg, NOTSS,115 OTAS116 and Multiprofessional Inventory for Non-Technical Skills in the Delivery Room (MINTS-DR)92) to those where a tool was designed with a specific remit of not requiring much training to use it (TEAM,19 MHPTS,25 Perinatal Emergency Team Response Assessment (PETRA)109 and Clinical Teamwork Scale (CTS)117). Table 3 provides an overview of validity evidence, training requirements and usability assessments for the same 30 tools in table 2 (the same information for the remaining 46 tools is found in the online supplementary appendix 3).

Discussion


We have analysed the growing array of NTS assessment tools in healthcare since the first was developed in 1998 by Gaba et al. 9 Box 1 highlights what this study adds to the field.

Box 1

What this study adds

  • There are 76 published tools for the measurement of non-technical skills (NTS) in healthcare across seven clinical areas with widely differing methods of scoring for either individuals or teams.

  • The methods of development and rigour of assessments of validity vary widely among these tools.

  • Recommendations for training also vary greatly and pragmatic assessment of usability is scarce.

  • A standardised approach to the development and testing of tools for the measurement of NTS would assist both educators and researchers.

  • There is currently no pre-eminent tool for the measurement of NTS which we can recommend.

Method of development

The importance of measures which assess whole team performance has been highlighted by several authors.40 118 119 The training and assessment of NTS in individuals is also important,120 and some tools allowed more flexibility (ie, they could be used for more than one profession or environment).

Instruments varied in their intended purpose: some assessed routine teamwork while others focused on management of crisis scenarios. Simulated settings allow control of scenarios and reliable depiction of behaviours (often by actors). However, it has been suggested that simulation is not truly representative of a real clinical environment, where there may be long periods of relative calm with short bursts of intense activity, whereas a video of a simulated crisis will only focus on the 15 min or so of high pressure.121 It would, therefore, seem desirable to develop tools that might be used in both settings to provide meaningful assessments during training and real clinical practice and in routine as well as emergency situations.

The NTS domains assessed were broadly similar across all the tools, suggesting that they are relevant in a wide variety of clinical settings with the appropriate context-specific adaptations, which raises the question: why are there so many? Authors frequently stated that the reason for the development of a new tool was the lack of one relevant to their specific need. The answer may also be found, to a degree, in the necessity for compromise highlighted by van der Vleuten,34 who described five key components in considering the utility of assessment methods: educational impact, validity, reliability, cost and acceptability (both to examiners and examinees). He stressed that ‘choosing an assessment method inevitably entails compromise and the type of compromise varies for each specific assessment context’ and ‘perfect utility is a utopia.’

Usability


The issue of usability and cost of NTS assessment tools is not trivial, and has been brought into sharp relief by the current staff shortages in healthcare and difficulties in releasing staff to train.122

A formative training event may benefit from the use of a tool which requires little training to implement and brings additional richness to the debriefing. However, in high stakes settings evidence of validity and reliability for an assessment tool must be robust and those using it must be trained and experienced in so doing.

Most of the in-depth analysis of usability has occurred in tools developed in the past 5 years, suggesting a heightened awareness of the need to consider the practical use of such assessments.

Training requirements

The challenges of assessing NTS accurately and reliably have been enumerated by Flin et al 120 and Smith-Jentsch et al 123 (eg, difficulty seeing and hearing all the relevant information; difficulty interpreting cognitive skills; and rare but important behaviours that may be missed because they are not categorised). Many of the research teams who have designed these tools pointed out the challenges of using them, and suggestions for best practice have been put forward by an expert group from aviation and healthcare.124 Furthermore, Gaba et al,9 Moorthy et al 125 and Schraagen et al 126 highlight the value of reducing the number of NTS domains analysed by a tool in order to improve the reliability of the observers.

While this approach may be more cost-effective, Sevdalis et al showed the value of psychologist or human factors expert raters in using OTAS127 but also recognised the resource implications. A later paper using OTAS showed that it was possible to train clinical staff to assess behaviours reliably in a short space of time.116 Guidelines for the training of faculty in NTS assessment have since been published128 and they stress the importance of training to ensure reliability, particularly for high stakes settings. The authors suggest a minimum requirement of 2 days’ training and a robust process of revalidation, which has clear cost implications in practice.

Choosing an NTS assessment tool

This review has revealed the multiplicity of NTS assessment tools available in healthcare, highlighting clear challenges for the educator in healthcare in trying to choose which is most appropriate for their training purposes. The process of categorising the tools in this review highlighted three initial decisions to be made:

  • Is the training for a multidisciplinary team or for a single group, for example, medical students?

  • Is the training in a real or simulated environment?

  • What is the setting for the training, for example, ward based, critical care or obstetrics?

Table 2 has been configured to highlight these key features with the aim of providing a means of selecting a tool for a particular setting. It is hoped that the additional information provided in table 3, where practical issues such as training required to use the tool are described, will further support the selection process for educators in healthcare.
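The three selection questions above could be applied programmatically to a catalogue of tools, as in the following sketch. Everything in it is hypothetical: the record fields, the placeholder tool names ("Tool A" and so on) and their attributes are invented for illustration and do not describe any tool in tables 2 or 3.

```python
# Hypothetical shortlist filter based on the three selection questions.
# Tool names and attributes are placeholders, not data from this review.
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ToolRecord:
    name: str
    audience: str                 # eg "multidisciplinary team"
    environments: FrozenSet[str]  # any of "simulated", "real"
    setting: str                  # eg "obstetrics", "critical care"


CATALOGUE: List[ToolRecord] = [
    ToolRecord("Tool A", "multidisciplinary team",
               frozenset({"simulated", "real"}), "obstetrics"),
    ToolRecord("Tool B", "medical students",
               frozenset({"simulated"}), "ward based"),
    ToolRecord("Tool C", "multidisciplinary team",
               frozenset({"real"}), "critical care"),
]


def shortlist(tools, audience, environment, setting):
    """Return tools matching the intended audience, environment and setting."""
    return [
        tool for tool in tools
        if tool.audience == audience
        and environment in tool.environments
        and tool.setting == setting
    ]


# Example query: a simulated obstetric session for a multidisciplinary team.
matches = shortlist(CATALOGUE, "multidisciplinary team", "simulated", "obstetrics")
```

A shortlist produced this way would still need to be weighed against the validity evidence and training requirements for each candidate tool.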

Study limitations

The authors recognise the difficulty of excluding bias and that using the techniques described above can mitigate but not remove it.

Some of the variability described in this review can be ascribed to the following issues:

  • Tools which were published in the early days of NTS research in healthcare were often based on tools from aviation and provided less evidence of validity due to lack of available reference points.

  • Tools only recently published may not have had time to undertake rigorous reliability testing.

  • Tools based on those developed earlier (eg, for use in a different language/culture) did not describe method of design as they relied on data from the original work.

This study was designed to provide an objective analysis of the observer-based tools for assessment of NTS in healthcare, including evidence of validity and an assessment of ease of use. The analysis of attributes allowed for some discrimination between tools but the variability described throughout the review precluded meaningful analysis of, for example, quality of method of design or how long it took before a tool could be used reliably. This is an area deserving further analysis.

Although we contacted authors via email to ask for further information it is possible that we do not have a complete data set for each tool.

We restricted the study to considering only papers that were contiguous with the original development of the tool and did not include data from groups who had used the tools in different settings.

Conclusion


This review has shown that there is variability in the method of design and testing of tools for the assessment of NTS and that consideration of these features is not always complete. Recommendations for designing and training to use tools for the assessment of NTS made by Klampfer et al 124 and Hull et al 128 may be regarded as the gold standard but acceptability and cost implications remain a considerable barrier. Similarities between systems have also been highlighted49 129—strengthening support for a more unified approach to NTS teaching and a rationalisation of assessment tools.

Finally, previous reviews of NTS tools have provided an overview of available assessment techniques in different areas but have not provided a means of discriminating between them.130–134

We have devised a system for categorising tools for the assessment of NTS which could be useful to both novice and expert educators in simulation-based education.

The ideal tool for NTS assessment in healthcare does not yet exist. Further research is required to determine whether a more generic tool for use in any healthcare context, with appropriate subject matter expertise to guide assessment of validity and reliability, task analysis and deployment, is feasible and would bring us closer to that goal.

Acknowledgments


Grateful thanks go to Neal Thurley at Bodleian libraries for his generous assistance in structuring and refining the search criteria for this review. The authors do not seek to criticise the extensive research and effort which has gone into designing these tools, merely to highlight that we may now need to consider how best to rationalise and standardise future work in this field. We are also very grateful to the researchers who took the time to reply to our requests for additional information about the tools they had designed, particularly those who spoke to HH in person. The information they provided assisted in the analysis of their own tools and added richness to the discussion section of the paper. Finally, I dedicate this article to Dr Denis O’Leary whose insights and gentle wisdom added enormously to this study and so much more.



  • Contributors HH was responsible for the study concept and drafting the initial review manuscript. HH, PRG, JR, LV, DY and CV were responsible for design, analysis, interpretation and preparation of the completed manuscript.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement All data relevant to the study are included in the article or uploaded as supplementary information.
