Purpose A3 problem solving is part of the Lean management approach to quality improvement (QI). However, few tools are available to assess A3 problem-solving skills. The authors sought to develop an assessment tool for problem-solving A3s with an accompanying self-instruction package and to test agreement in assessments made by individuals who teach A3 problem solving.
Methods After reviewing relevant literature, the authors developed an A3 assessment tool and self-instruction package over five improvement cycles. Lean experts and individuals from two institutions with QI proficiency and experience teaching QI provided iterative feedback on the materials. Tests of inter-rater agreement were conducted in cycles 3, 4 and 5. The final assessment tool was tested in a study involving 12 raters assessing 23 items on six A3s that were modified to enable testing a range of scores.
Results The intraclass correlation coefficient (ICC) for overall assessment of an A3 (rater’s mean on 23 items per A3 compared across 12 raters and 6 A3s) was 0.89 (95% CI 0.75 to 0.98), indicating excellent reliability. For the 20 items with appreciable variation in scores across A3s, ICCs ranged from 0.41 to 0.97, indicating fair to excellent reliability. Raters from two institutions scored items similarly (mean ratings of 2.10 and 2.13, p=0.57). Physicians provided marginally higher ratings than QI professionals (mean ratings of 2.17 and 2.00, p=0.003). On average, raters completed the self-instruction package in 1.5 hours and then rated six A3s in 2.0 hours.
Conclusion This study provides evidence of the reliability of a tool to assess healthcare QI project proposals that use the A3 problem-solving approach. The tool also demonstrated evidence of measurement, content and construct validity. QI educators and practitioners can use the free online materials to assess learners’ A3s, provide formative and summative feedback on QI project proposals and enhance their teaching.
- continuing education
- continuing professional development
- graduate medical education
- healthcare quality improvement
- health professions education
- lean management
Data availability statement
All data relevant to the study are included in the article or uploaded as supplementary information.
Improving the quality of healthcare is a universal goal for healthcare practitioners and administrators. A3 problem solving is a structured approach to continuous quality improvement (QI) first employed by Toyota and now widely used by healthcare practitioners and organisations that have adopted the Lean thinking approach to improvement.1–4 Key elements include understanding the reason for action, defining the current state and performance gap, setting a goal, identifying root causes, choosing countermeasures, formulating action plans and establishing a follow-up plan to measure results. QI efforts are more likely to succeed when these elements are employed.
QI is now a required competency for medical students, residents, practising physicians, nurses, pharmacists and other healthcare professionals worldwide.5–10 A common approach to developing QI skills involves participation in a QI project (QIP) designed around a gap in local healthcare quality. The use of A3 problem solving as an instructional framework for QI skill development has been described in manufacturing and more recently in healthcare.11–13 Instruction may occur in formal courses or informally in work settings. While numerous experiential QI curricula have been described, few skills-based assessment tools are available.14–16 None of the existing QIP assessment tools is specific to the A3 problem-solving approach, nor do they provide an easily replicable method to train educators to assess A3 skills.17–19
We combined efforts at our two academic healthcare centres to develop an A3 assessment tool and test its reliability through a series of iterative development cycles. In order for the A3 assessment tool to be easily learnt and widely used, we wanted to develop and test it as the central component of a self-instruction package for learning to assess A3s reliably. Development would necessarily include exploring raters’ experiences in using the assessment tool and self-instruction package. Ultimately, the resulting A3 assessment tool and self-instruction package should guide QI educators in assessing learners’ A3s, providing consistent formative and summative feedback on QIP proposals and teaching A3 problem solving.
Development cycles for an A3 assessment tool and self-instruction package
We developed an A3 assessment tool and a self-instruction package to help educators assess proposal A3s as part of their QI teaching or advising and to enhance their teaching of A3 problem solving (online supplemental digital content). Components of the self-instruction package are described in table 1. The five development cycles for the assessment tool and self-instruction package are summarised in the top of table 2. In each cycle, we sought feedback from our raters. In cycles 3–5, we formally tested inter-rater agreement. We used feedback and reliability performance on items at the end of one cycle to refine concepts, improve language precision and enhance presentation of information during the next cycle. Examples of changes across cycles are presented in the bottom of table 2.
We began the first development cycle in 2017 by working with biomedical and business librarians, who performed a systematic literature search using the keywords “A3 thinking”, “A3 problem solving” and “A3 template”. They searched eight databases covering health sciences, business and engineering (PubMed, Embase, Cochrane Library, Scopus, Web of Science, Compendex, ABI and Business Sources Complete) and publication types (eg, white papers) produced outside of traditional academic publishing channels. We found only one example of an A3 assessment tool, in the engineering literature,11 and noted that several types of A3s exist, reflecting the stage of improvement work.2 We focused on a problem-solving A3 because our institutions currently teach learners to develop them to analyse a QI problem and propose interventions. A problem-solving A3 includes all the dimensions of problem investigation (background, current state, problem statement, goal, analysis), then proposes recommendations (countermeasures, action plan, follow-up plan) based on the findings. We refer to a problem-solving A3 as simply an ‘A3’ throughout this paper.
The next step in cycle 1 was to create initial drafts of the A3 template, content guide and assessment tool. We reviewed commonly used A3 templates including ones in use at our institutions.1–3 We created an A3 template that included key sections of A3s with elements described more clearly and operationally than in existing templates. The content guide provided additional descriptive information and illustrations. The assessment tool addressed each element in the template and characteristics across sections. Each item in the assessment tool has response options that range from 0 to 3. General verbal anchors for the options are 0=not addressed, 1=unclear, 2=general and 3=specific, with phrasing modified to reflect an item’s content. We realised that items differed in the information that needed to be assessed. The initial assessment tool had 27 items that could be answered directly from information in an A3 document (eg, How specific is the goal?) and 7 items that required additional knowledge of the local problem context (eg, extent to which important root causes are identified). We decided that individuals unfamiliar with the problem context need only rate items that can be determined from the A3 alone. An experienced QI trainer at each institution reviewed and used the materials, then provided feedback.
Cycle 2 incorporated feedback from cycle 1. Then two external Lean experts reviewed the materials with two of the authors (JEB, JMK). In cycle 3, suggestions from the experts were incorporated and formal tests of agreement began. Each test included raters from our two academic healthcare centres. Four individuals (two physicians with QI teaching experience and two non-physician QI professionals) rated four A3s. Their feedback and performance indicated that agreement in assessments would be enhanced through more detailed definitions and guided experience in applying them. In cycle 4, we added a ‘description of ratings’ document that elaborated operational definitions of individual rating options. We also added examples of exemplary and deficient A3s with rating explanations and the opportunity to assess an A3 and compare ratings against a standard for immediate feedback on performance. The test of agreement expanded the number of raters from 4 to 12 and the number of A3s from 4 to 6. In cycle 5, we added another deficient A3 with rating explanations to compare against a standard. Automated functions were added to the assessment tool to facilitate referencing definitions and totaling scores.
In cycles 3 through 5, we developed exemplary and deficient A3 training examples and A3s used to test inter-rater agreement. First, the authors (JSM, JMK) reviewed examples of A3s submitted by learners in QI methods courses for healthcare professionals (eg, physicians, nurses, other healthcare team members) in training (eg, medical students, residents, graduate nursing students) at our institutions. We used course evaluations of A3s to identify examples of excellent, good and poor A3s. Then, we modified most of the A3s by improving some elements (eg, adding completion dates for action plan items) and making other elements worse (eg, adding a countermeasure that did not correspond to a listed root cause) to provide a range on items across the A3s. The three training A3s addressed evidence-based treatment for epilepsy, patient congestion in a clinic and improving the accessibility of cardiac catheterisation films. The six A3s assessed in cycle 5 addressed patient throughput in a psychiatric emergency room (ER), time to decision-making for chest pain patients in the ER, access to care for patients with diabetes after renal transplant, unnecessary phlebotomy in the hospital and equipment waste in the operating room.
Check on cycle 5 of the assessment tool and self-instruction package
Cycle 5 was the culmination of our work. Its check had two objectives: (1) assess inter-rater agreement among raters using the assessment tool and self-instruction package and (2) learn about the raters’ experiences and views in using the self-instruction package and performing assessments.
The final A3 template is presented in figure 1. The final A3 assessment tool (online supplemental digital content) has 23 items that can be assessed from the A3 document itself and an additional 10 items that require knowledge of the local context.
Our sample size to test inter-rater agreement was based on practical feasibility for the number of raters and the number of A3s assessed.20 We felt that 4 hours was the maximum time commitment that we could reasonably request of volunteer raters. Cycle 4 demonstrated that raters could go through the self-instruction package and rate six A3s in approximately 4 hours. We recruited 12 raters for cycle 5 knowing that the increased number of raters would increase precision in estimating inter-rater agreement. The design of 12 raters rating 23 items on 6 A3s produced 72 ratings per item and 1656 ratings overall.
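The rating counts quoted for this design follow directly from the numbers of raters, A3s and items. A minimal arithmetic sketch in Python (the variable names are illustrative and not part of the study materials):

```python
# Design size for the cycle 5 agreement test, using the counts stated in the text.
raters = 12
a3s = 6
items = 23

# Each rater scores a given item once on every A3.
ratings_per_item = raters * a3s

# The same number of ratings is collected for each of the 23 document-based items.
total_ratings = ratings_per_item * items

print(ratings_per_item)  # 72
print(total_ratings)     # 1656
```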
We identified 12 individuals from our two academic healthcare centres (6 from each) and invited them by email to participate as raters. All raters were at least proficient in QI. We selected raters with some, but varying QI teaching experience to reflect the types of individuals most commonly involved in teaching QI in healthcare. Four raters were non-physician QI professionals who routinely led QI initiatives and taught QI as part of their work. The other eight raters were physicians with experience teaching and/or advising students, residents and fellows in QIP work. Four of the eight had been teaching QI for >2 years while the other four had been teaching QI for <2 years.
One of the authors (JSM, JMK, RVH) had a 10 min phone conversation with each rater, orienting the individual to the study and confirming access to the online self-instruction materials. Raters had 1 month to complete the self-instruction package, rate the six A3s, and submit their ratings.
We created a structured feedback form and distributed it to raters at the time of the orientation phone call (see online supplemental digital content, last section). The form had 19 open-ended items addressing: study orientation, the self-instruction package, the A3 assessment tool and their overall experience with the tool and self-instruction package. Raters provided written feedback when they submitted their A3 ratings and participated in a short debriefing phone call led by one of the investigators. During the call raters could clarify and elaborate upon their comments.
We used intraclass correlation coefficients (ICCs) as the primary method to quantify inter-rater agreement. The three variables are rater, A3 and item rating. Values range from 0 to 1. The value is 1 if raters give similar ratings (low variation) to an item within an A3, but ratings differ (high variation) between A3s. The value is 0 if ratings vary within an A3 item as much as they vary between A3s. While guidelines for interpreting ICCs vary, a frequently quoted interpretation is: <0.40 is poor, 0.40–0.59 is fair, 0.60–0.74 is good and 0.75–1.0 is excellent.21 Lower ICCs reflect greater variation in ratings for an A3 item, so as ICC values decrease the width of an ICC’s CIs increases. For our design of 12 raters and 6 A3s, examples of the decreasing precision (95% CI) with which an ICC is measured for an item are: 0.90 (0.77–0.98, within ‘excellent’), 0.75 (0.44–0.95, ‘fair’ to ‘excellent’) and 0.50 (0.23–0.87, ‘poor’ to ‘excellent’).
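The interpretation bands quoted above can be written as a small lookup. The sketch below is illustrative only: the function names are hypothetical, and the cut-offs at 0.40, 0.60 and 0.75 assume ICCs are reported to two decimal places, as they are in this paper.

```python
def interpret_icc(icc: float) -> str:
    """Map an ICC to the verbal bands quoted in the text."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


def interpret_ci(lower: float, upper: float) -> str:
    """Describe the bands spanned by a confidence interval."""
    low_band, high_band = interpret_icc(lower), interpret_icc(upper)
    if low_band == high_band:
        return low_band
    return f"{low_band} to {high_band}"


# The three precision examples given in the text for 12 raters and 6 A3s.
print(interpret_ci(0.77, 0.98))  # excellent
print(interpret_ci(0.44, 0.95))  # fair to excellent
print(interpret_ci(0.23, 0.87))  # poor to excellent
```

Applying the same bands to both ends of an interval makes visible how quickly the verbal interpretation widens as the point estimate falls.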
We calculated ICCs for each of the 23 rating items. To reflect a rater’s overall assessment of an individual A3, for each A3 we calculated each rater’s mean assessment on the 23 items. A rater’s mean rating for an A3 was treated as an additional item for which the ICC was calculated. The 95% CIs for ICCs were also calculated. The ICCs and CIs were calculated using ‘R’ software for statistical computing based on a single rater, absolute agreement, two-way random effects model.22
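For readers who want to see what a single rater, absolute agreement, two-way random effects ICC involves, the following is a minimal Python sketch of the standard mean-squares formula. It is not the authors' R script: the function name and the example ratings are hypothetical, and the calculation of 95% CIs is omitted.

```python
def icc_single_absolute(scores):
    """ICC for a single rater, absolute agreement, two-way random effects.

    scores[i][j] is the rating given to target i (for example, an A3)
    by rater j, with every rater rating every target.
    """
    n = len(scores)       # number of targets (A3s)
    k = len(scores[0])    # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # Sums of squares for targets, raters and the residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative ratings only (6 targets by 4 raters), not data from this study.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_single_absolute(example), 2))  # 0.29
```

In this example the raters rank the targets fairly consistently but differ systematically in how high they score, and because the absolute agreement form penalises such rater offsets, the coefficient is low.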
The ICC is less appropriate as a measure of inter-rater agreement when ratings are similar across A3s. In that case, the small variation in ratings within an A3 is comparable to the small variation between A3s, resulting in an artificially low ICC even though raters actually agree and provide similar rating values for an item on all of the A3s. To check whether a limited range of scores on an item across A3s might have methodologically lowered an ICC, we first calculated, within each of the six A3s, an item’s mean score over the 12 raters. We then used these six means to calculate the overall item mean and the SD of the item means across the A3s. A low SD of item means across the six A3s indicates a limited range (little variation) in scores between A3s. For these items, we reviewed the actual scores across A3s to confirm that raters agreed in providing similar rating values across A3s.
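The two-step check described above (per-A3 means over raters, then the mean and SD of those means) can be sketched as follows. The function name and the ratings are hypothetical, and the use of the sample SD is an assumption, because the text does not state which SD formula was applied.

```python
from statistics import mean, stdev


def item_mean_spread(ratings_by_a3):
    """Return (overall item mean, SD of per-A3 means) for one item.

    ratings_by_a3 holds one list per A3, each containing the scores
    that the raters gave to this item on that A3.
    """
    per_a3_means = [mean(scores) for scores in ratings_by_a3]
    # Assumption: sample SD (n - 1 denominator) across the A3 means.
    return mean(per_a3_means), stdev(per_a3_means)


# Hypothetical scores on the 0 to 3 scale: 3 A3s, each rated by 3 raters.
narrow_item = [[3, 3, 3], [3, 2, 3], [3, 3, 2]]  # similar scores on every A3
wide_item = [[0, 0, 1], [2, 2, 2], [3, 3, 3]]    # scores differ between A3s

narrow_mean, narrow_sd = item_mean_spread(narrow_item)
wide_mean, wide_sd = item_mean_spread(wide_item)
print(round(narrow_sd, 2), round(wide_sd, 2))
```

An item like `narrow_item` would be flagged for a manual review of the raw scores, because an ICC cannot demonstrate agreement across a range of scores when the A3s barely differ on that item.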
In addition to analysing the raters’ assessments of items on A3s, we collated qualitative information from raters’ feedback forms and debriefing calls and reviewed responses for illustrative themes.
The ICCs and 95% CIs for agreement over a range of scores for the 12 raters across the six A3s are shown in table 3 for the overall A3 rating and the ratings for each of the 23 individual items.
For overall A3 assessment (mean of ratings on an A3’s 23 items), the ICC is 0.89 (95% CI 0.75 to 0.98), indicating excellent reliability across raters over a range of scores. For individual items, the ICCs for 17 items ranged from 0.57 to 0.97, indicating fair to excellent reliability; the ICCs for three items (#2, #16, #17) ranged from 0.41 to 0.46, indicating marginally fair reliability.
For the remaining three items (#1, #11, #14), the ICCs range from 0.10 to 0.39, suggesting poor reliability across a range of scores. However, these items did not have a wide range of scores. As shown in table 3, these three items have the lowest SDs (0.28 to 0.55) of the 23 items. For these items, raters generally agreed on the items’ scores, but the scores were similar across the six A3s. For example, for item #11 with an ICC of 0.10, with possible ratings ranging from 0 to 3, the means of the 12 rating scores on each of six A3s were 2.9, 2.9, 2.8, 2.7, 2.6 and 2.2. While the raters agreed closely in rating this item on each A3, the variability of scores across A3s was insufficient to demonstrate agreement across a range of scores using an ICC. For items #1, #11 and #14, the lack of variation across A3s methodologically lowered ICCs, limiting our ability to confirm agreement across a range of scores. However, the low SDs for these items demonstrate substantial agreement among raters on the items’ scores across the six A3s.
For the 20 items with more variation across A3s, the items with higher ICCs tend to have simpler content that focuses on only one element of the A3. For example, the item with the highest ICC is item #20, ‘Are estimated completion dates identified for each action item (ie, ‘when’)?’ (ICC=0.97). In contrast, items with ICCs in the ‘fair’ inter-rater agreement range (ICCs 0.40–0.59) require raters to relate multiple elements of information simultaneously, for example, item #17, ‘How many of the proposed countermeasures are linked to identified root causes?’ (ICC=0.46).
The six raters from each of the two institutions used the rating scales similarly (mean ratings of 2.10 and 2.13, p=0.57). Across institutions, the eight physicians provided slightly higher ratings than the four QI professionals (mean ratings of 2.17 and 2.00, p=0.003), but the small difference is not practically meaningful.
On the feedback forms, raters reported that the work took an average of 3.5 hours: the self-instruction package took 1.5 hours (range 1.0–3.0 hours) and rating the six A3s took 2.0 hours (range 1.0–3.5 hours). Illustrative comments about their learning and rating experience are presented in table 4. Overall, raters reported that the self-instruction package and assessment tool were easy to learn and worthwhile to use. For example, “I thought it was easy. I think this tool is going to be a great way to set expectations and give feedback about student A3s”. One rater noted “but [I] had to make sure I wasn’t inferring information and only evaluated what was on the A3”.
This study developed and demonstrated the reliability of a tool to assess the quality of learners’ investigations and recommendations for QI problems in healthcare using the A3 approach. The assessment tool was developed as part of a self-instruction package to assist a broad range of educators in efficiently learning how to reliably assess and provide feedback on learners’ A3 documents. We found that 12 raters using the assessment tool and self-instruction package could reliably rate items across six A3s, with excellent agreement across raters over a range of scores on the overall rating of an A3 and with fair to excellent agreement on 20 items. For the remaining three items, raters agreed in item scoring, but the limited range of scores across A3s precluded confirming agreement across a range of scores. Ratings were similar for raters from different institutions and functionally similar for physician and QI professional raters. The self-instruction package allowed raters to learn to use the assessment tool in about 1.5 hours. Raters found the package and tool easy to learn and worthwhile to use.
Three other studies reported developing assessment tools for QIP. Leenstra et al developed the Quality Improvement Project Assessment Tool (QIPAT-7) in 2007, Rosenbluth et al developed the Multi-Domain Assessment of Quality Improvement Projects (MAQIP) in 2017 and Steele et al developed the Quality Improvement Project Evaluation Report (QIPER) in 2019.17–19 Our study adds to this body of literature. Rather than develop a new conceptual framework, we built on the widely recognised Lean A3 problem-solving approach to QI, which an increasing number of healthcare organisations have adopted. For these institutions, our materials facilitate integration of QI operations and QI education for healthcare professionals, educators and learners at all levels. This integration supports high-quality patient care and is now an expectation for healthcare systems that sponsor graduate medical education programmes in the USA.23 Building on the established A3 framework, we identified specific aspects of A3s to assess and provide educators with a visual template that embeds common QI tools, a companion content guide for the template, examples, practice with feedback and links to resources. Our package of materials is the first to provide training examples of assessments of completed proposals, providing external benchmarks for teachers (and learners). We have gone beyond previous work by demonstrating consistency across raters who are at different institutions, are physicians and QI professionals and are not members of the research team. While we tested the materials on individuals with some experience performing and teaching QI, we anticipate that the self-instruction materials will assist novice QI educators. The assessment tool and instructional package are available online at no cost and require only about 1.5 hours to learn, facilitating their broad use.24
The process of developing and testing the reliability of the assessment tool also demonstrated several aspects of its measurement validity—the extent to which it measures what it claims to measure. The first step in establishing content validity was to review the literature on A3 content and templates, assemble and refine the model A3 template and have experts and teachers of A3 problem solving agree that this was the appropriate content to measure. Experts and teachers also agreed that the rating tool represents the content of the A3 template and the logic underlying it. As a component of content validity, ‘face’ validity is evident in most statements in the template being quoted in items to be rated. Construct validity is demonstrated through items performing in conceptually expected ways, such as items asking about the presence or absence of one element of information being rated more reliably than items involving simultaneous consideration of multiple elements.
Our sequence of development cycles and refinements identified insights that are useful for the QI education and assessment efforts of others. One insight is to distinguish between assessments based on the A3 document alone and assessments based on additional knowledge of the local problem context. Assessments based on the A3 document alone should be consistent among raters. Assessments based on knowledge of the local problem will vary with the assessor’s knowledge. Another insight is to help learners differentiate between the QI problem (‘what is the specific performance gap’) and consequences of the problem (‘why the problem is important’). Both learners and raters may use previous knowledge to assume that a problem is important with no explicit statement of why it is important. More precise wording and examples help both learners and raters realise that consequences of a problem are separate from the problem being addressed. Another insight from examining previously developed A3s is that having a plan for monitoring whether the proposed actions are actually implemented (‘intervention fidelity’) is frequently overlooked.25 Including this concept in the A3 template and assessment tool helps ensure that this important step is addressed.
Our study has several limitations. The assessment tool does not address actual outcomes of QIPs that have been completed. We focused on the proposal stage because development of well-researched, well-analysed and well-considered proposals for interventions is the foundation for carrying out successful QI efforts. Some healthcare settings may not use the A3 framework on which our materials are based. However, use of the framework is sufficiently widespread that teachers and learners should be aware of this approach to developing QIPs. Including only 6 A3s and 12 raters limited the ranges sampled and ICC precision but reasonable evidence of inter-rater agreement was demonstrated. The generalisability of the results to other settings and professional roles is uncertain. Our raters were from one country and two academic centres, which possibly provided some common contexts regarding views of QI and the QI training available. The tool would likely not perform as well with individuals inexperienced in QI or with no experience teaching QI. However, within groups likely to be responsible for teaching and assessing A3s, the results potentially apply to a range of settings, personnel and training levels because our study included raters from different professions (physicians, QI professionals) with experience ranging from some to extensive proficiency in performing QI and teaching QI, and because the A3s that were the basis for testing agreement were authored by different professional student groups (eg, physicians, nurses, pharmacists). Finally, the raters typically knew one of the authors personally, potentially biasing feedback towards being more favourable. However, in our preliminary cycles, similarly chosen raters provided critical feedback that prompted changes. Since previous feedback included negative comments that were addressed, the favourable feedback in the final cycle appears to reflect reasonably unbiased views.
The A3 assessment tool and self-instruction package can be used for future research. The effect of being better trained to assess A3s has yet to be explored for subsequent outcomes such as providing better feedback or teaching effectiveness. Also to be explored is the impact of the assessment tool and self-instruction package on the quality of learners’ A3s and actual QIP outcomes. Assessments and feedback could be provided prospectively to learners to determine the impact of longitudinal formative feedback on A3s. The materials could also be provided to learners to determine the extent to which learners on their own can improve their A3s and those of peers. Future research could also expand studies of reliability of agreement among raters across institutional settings and individuals with different levels of QI knowledge and skills. Finally, supplementing the documents in the current self-instruction package with materials in video format may enhance learning efficiency and effectiveness.
In summary, this study provides evidence of the reliability and validity of a tool to assess the quality of A3 project proposals in healthcare. The assessment tool was developed as the focus of a self-instruction package to assist a broad range of QI educators and practitioners to assess learners’ A3s, to provide consistent formative and summative feedback on QIP proposals and to enhance their teaching of A3 problem solving. We demonstrated that after using the self-instruction package, raters from different institutions and professional backgrounds who are proficient in QI and have some experience teaching QI can reliably assess A3s. Raters learnt to use the assessment tool in about 1.5 hours and found the package and tool to be easy to learn and worthwhile to use. The materials are available on our institutional website at no charge.24 The minimal investment required to use the materials facilitates their widespread use by individuals teaching QI to healthcare professionals and by individuals performing QI in healthcare.
The authors would like to thank the following individuals who participated as raters in this study: Amber-Nicole Bird, Ryan Buckley, Debbie Paliani Burke, Caitlin Clancy, Kevin DeHority, Tammy Ellies, Sara Figueroa, Laurel Glaser, Kevin Gregg, Katie Grzyb, Katy Harmes, Jessica Hart, Michael Heung, Elena Huang, Chloe Hill, Christopher Klock, Jamie Lindsay, Erin Lightheart, Rosalyn Maben-Feaster, Patricia Macolino, Neha Patel, Anita Shelgikar, Elizabeth Valentine, Kimberly Volpe, Jason Wagner, Sarah Yentz. The authors would also like to thank Eric Ethington and John Shook, well known Lean thought leaders, who reviewed and provided feedback on an early version of the materials; the librarians Maylene Kefeng Qiu, Mia Wells, Melanie Cedrone and Sherry Morgan who assisted with the literature review and Larry Gruppen, who provided comments on the manuscript draft.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.