Development and validation of an international appraisal instrument for assessing the quality of clinical practice guidelines: the AGREE project
- The AGREE Collaboration*
- Correspondence to: Dr F Cluzeau, Department of Public Health Sciences, St George’s Hospital Medical School, Cranmer Terrace, London SW17 0RE, UK
- Accepted 26 June 2002
Background: International interest in clinical practice guidelines has never been greater but many published guidelines do not meet the basic quality requirements. There have been renewed calls for validated criteria to assess the quality of guidelines.
Objective: To develop and validate an international instrument for assessing the quality of the process and reporting of clinical practice guideline development.
Methods: The instrument was developed through a multi-staged process of item generation, selection and scaling, field testing, and refinement procedures. 100 guidelines selected from 11 participating countries were evaluated independently by 194 appraisers with the instrument. Following refinement the instrument was further field tested on three guidelines per country by a new set of 70 appraisers.
Results: The final version of the instrument contained 23 items grouped into six quality domains (scope and purpose, stakeholder involvement, rigour of development, clarity and presentation, applicability, editorial independence), with a 4 point Likert scale to score each item. 95% of appraisers found the instrument useful for assessing guidelines. Reliability was acceptable for most domains (Cronbach’s alpha 0.64–0.88). Guidelines produced as part of an established guideline programme had significantly higher scores on editorial independence and, after the publication of a national policy, had significantly higher quality scores on rigour of development (p<0.005). Guidelines with technical documentation had higher scores on that domain (p<0.0001).
Conclusions: This is the first time an appraisal instrument for clinical practice guidelines has been developed and tested internationally. The instrument is sensitive to differences in important aspects of guidelines and can be used consistently and easily by a wide range of professionals from different backgrounds. The adoption of common standards should improve the consistency and quality of the reporting of guideline development worldwide and provide a framework to encourage international comparison of clinical practice guidelines.
Clinical practice guidelines are now a common feature of clinical practice and are of interest worldwide. They are expected to facilitate more consistent, effective and efficient medical practice, and improve health outcomes.1 Governments, professional associations, and healthcare organisations are increasingly sponsoring the development and dissemination of clinical guidelines.2 There is also a growing number of guidelines developed by European or international groups.
Although the principles for the development of sound guidelines are well established,3–5 many published guidelines fall short of the basic quality criteria identified in two recent studies.6,7 Defining the quality of guidelines is not straightforward. In principle a “good” guideline is one that eventually leads to improved patient outcome. It needs to be scientifically valid, usable, and reliable. However, this evidence is rarely available. Often the best that can be expected is some information on whether the guideline producers have attempted to minimise all the biases that can occur in the complex process of creating a guideline and how well this is reported.
As the number of published guidelines proliferates, there have been calls for the establishment of internationally recognised standards to improve the development and reporting of clinical guidelines.6 Moreover, there is a pressing need for internationally recognised criteria that are valid, reliable, and useful for various assessment purposes in different countries, both for guideline developers and clearing houses as well as individual users of guidelines.
In response, an international group of researchers from 13 countries—the Appraisal of Guidelines, REsearch and Evaluation (AGREE) Collaboration—has developed and validated a generic instrument that can be used to appraise the quality of clinical guidelines. The AGREE instrument is designed to assess the process of guideline development and how well this process is reported. It does not assess the clinical content of the guideline nor the quality of evidence that underpins the recommendations. In this paper we report the development and validation of the AGREE instrument.
A multi-staged approach was used that included an item generation, selection and scaling process, and field testing and refinement procedures.
Item generation, selection, and scaling
To develop the framework for the instrument, quality was defined as the confidence that the biases linked to the rigour of development, presentation, and applicability of a clinical practice guideline have been minimised and that each step of the development process is clearly reported. We considered the following five theoretical quality domains:
scope and purpose;
stakeholder involvement;
rigour of development;
clarity and presentation;
applicability.
A small working group (FC, JB, RG, PL) generated an initial list of 82 items from validated appraisal instruments and relevant literature6,8–12 that addressed these domains. The working group examined the list for coverage, overlap and content validity, and reduced it to 34 items. The list and a user guide describing the items were pretested on two Dutch and two English guidelines and refinements were made in response to the comments received.
The refined list and user guide were then circulated to all the AGREE partners and to 15 international experts for their views on the clarity, comprehensiveness, relevance, and ease of use. In addition, the AGREE partners were asked to apply the instrument to two guidelines each. The feedback from this process led to reformulation of ambiguous items and removal of overlapping and value laden items. The result was the first draft instrument comprising 24 items grouped into the five domains identified in the development phase. We also modified the user guide to reflect changes made to the items. A 4 point Likert scale was used to score each item (1=strongly disagree, 2=disagree, 3=agree, 4=strongly agree). A 3 point scale (1=not recommend, 2=recommend with provisos or modifications, 3=strongly recommend) was used to score an overall judgement on whether the guideline ought to be recommended for use.
Field testing and refinement
The AGREE collaborators field tested the instrument following a research protocol that covered selection criteria for the guidelines, methods for recruiting appraisers, and time scales (box 1). Each country coordinated the appraisal of at least seven guidelines. Each guideline was assessed independently by four appraisers and, where possible, each appraiser assessed two guidelines. The appraisers received a standard letter with instructions on how to complete the instrument. Most used an English version of the draft AGREE instrument. If necessary, the materials or the user guide only were translated to ensure appraisers’ understanding of the items. Feedback on the instrument, user guide, and the appraisal process was solicited with a standard letter, translated into a national language where necessary.
Box 1 Participating countries, and selection criteria for guidelines and appraisers
Participating countries: Canada, Denmark, England, Finland, France, Germany, Italy, The Netherlands, Scotland, Spain, Switzerland (England and Scotland were considered separately because they have independent guideline programmes).
Selection criteria for guidelines:
guidelines published between 1992 and 1999
preferred disease areas: asthma, breast cancer, and diabetes
documents that contain specific recommendations for clinical practice (excluding systematic reviews or service documents)
Selection criteria for appraisers:
broad range of professions including clinical experts, nurses, researchers and policy makers
different healthcare settings including primary care, secondary care, teaching hospitals
excluding members from guideline development group
The field test was conducted in winter 1999–2000 with the 24-item draft instrument. For this phase, 100 guidelines from 11 countries (mode=8, range 7–22) were evaluated by 194 appraisers. The results of this field test were reviewed at an AGREE workshop in spring 2000 and the instrument and user guide were refined in response to the results. The final version of the instrument underwent further field testing in autumn 2000. In this phase a random sample of three guidelines per country from the original 100 were assessed by 70 newly recruited appraisers.
Mean item scores for each guideline were calculated by averaging the scores across the four appraisers. Standardised domain scores for each guideline were calculated by summing scores across the four appraisers and standardising them as a percentage of the possible maximum score a guideline could achieve. Mean item and standardised domain scores were used in the analyses unless otherwise noted below.
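The two summary scores described above can be sketched in a few lines of Python. This is an illustration only: the function names and the ratings are hypothetical, and the standardisation follows the wording of the paragraph literally (summed score as a percentage of the maximum possible score on the 4 point scale).

```python
from statistics import mean


def mean_item_scores(ratings):
    """Average each item across appraisers.

    ratings: one list of item scores (1-4) per appraiser, all for one guideline.
    """
    return [mean(item) for item in zip(*ratings)]


def standardised_domain_score(ratings, scale_max=4):
    """Summed domain score as a percentage of the maximum possible score."""
    obtained = sum(sum(appraiser) for appraiser in ratings)
    n_items = len(ratings[0])
    maximum = scale_max * n_items * len(ratings)
    return 100.0 * obtained / maximum


# Hypothetical ratings: four appraisers scoring a three-item domain.
ratings = [
    [4, 3, 2],
    [3, 3, 2],
    [4, 4, 1],
    [3, 2, 2],
]
print(mean_item_scores(ratings))
print(standardised_domain_score(ratings))
```

A variant that first subtracts the minimum possible score from both numerator and denominator would map uniformly lowest ratings to 0%; the paragraph does not describe that step, so it is left out here.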
To guide the refinement of the instrument from the draft version to the final version, a principal components analysis was undertaken with data from the first field test. The mean item scores for each of the 100 guidelines were included in the analysis, with the eigenvalue limit set at 1 and the criterion for the minimum loading score set at 0.52.13,14
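The retention rule described above (keep components whose eigenvalue exceeds the limit, then flag items whose loading reaches the minimum) can be illustrated with a small self-contained sketch. It makes several assumptions: it starts from a correlation matrix rather than raw mean item scores, it uses a plain Jacobi eigenvalue routine instead of a statistics package, it applies no factor rotation, and the correlation matrix and function names are hypothetical.

```python
import math


def jacobi_eigen(matrix, sweeps=100, tol=1e-12):
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) element.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of A and V
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):  # rotate rows p and q of A
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return [a[i][i] for i in range(n)], v


def principal_components(corr, eigen_limit=1.0, min_loading=0.5):
    """Retain components above the eigenvalue limit and flag salient loadings."""
    values, vectors = jacobi_eigen(corr)
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    kept = [i for i in order if values[i] > eigen_limit]
    # Unrotated loading = eigenvector entry * sqrt(eigenvalue).
    loadings = [[vectors[j][i] * math.sqrt(values[i]) for i in kept]
                for j in range(len(corr))]
    salient = [[abs(x) >= min_loading for x in row] for row in loadings]
    return [values[i] for i in kept], loadings, salient


# Hypothetical correlation matrix: items 1-2 and items 3-4 form two clusters.
corr = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]
values, loadings, salient = principal_components(corr, eigen_limit=1.0, min_loading=0.52)
print(values)
print(salient)
```

In this toy matrix two components pass the eigenvalue limit and each item loads on exactly one of them. Real item data would rarely separate this cleanly, which is why a rotated solution is usually inspected.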
Final instrument properties
Two measures of reliability were conducted:
Using mean item scores, the Cronbach α coefficient was calculated to measure internal consistency of each domain of the final instrument.15
Intraclass correlations (ICC) were calculated to assess the reliability within each domain. ICCs based on single appraisers’ ratings and the means of two, three, and four appraisers were calculated.16
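As an illustration of the internal consistency measure listed above, the following sketch computes Cronbach's α for one domain from the standard formula α = k/(k−1) × (1 − Σ item variances / variance of the summed score). The function name and the data are hypothetical; rows stand for guidelines and columns for the items of a single domain.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table with one row per guideline and one column per item."""
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical item scores for a two-item domain across four guidelines.
domain_scores = [
    [1, 2],
    [2, 3],
    [3, 3],
    [4, 4],
]
print(round(cronbach_alpha(domain_scores), 3))
```

Because the factor k/(k−1) and the variance ratio both depend on the number of items, domains with very few items tend to give lower or less stable coefficients.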
Several measures of validity were considered:
Face validity: appraisers’ attitudes about the instrument and user guide were collected by questionnaire and used to assess face validity.
Construct validity: three hypotheses were considered for tests of construct validity:
– (a) Established guideline programmes have opportunities to compose and refine guideline development methodologies, create efficiencies of process, and access committed funds. It was therefore hypothesised that guidelines originating from established programmes would have higher domain scores than those produced outside an established system. To test this hypothesis, a series of one way ANOVAs on quality scores was undertaken for each domain with type of guideline programme (established/not established) as the between subject factor.
– (b) It can be argued that guidelines supported by well documented technical information—either within the guideline itself or as part of supporting reports or publications—will have domain scores higher than those without this documentation. To test this notion, Kendall’s tau B rank correlation tests on quality scores for each domain were undertaken.
– (c) Guidelines developed as national policies should be particularly robust because of the authority conferred on them. It was therefore predicted that guidelines created on a national level should be of higher quality than regional or local ones. To test this notion a series of one way ANOVAs on quality scores was undertaken for each domain with level status (national/other guidelines) as the between subject factor.
Criterion validity: as there is no gold standard in this area, participants’ overall assessment scores were used as a proxy measure. Criterion validity was assessed by calculating Kendall’s tau B rank correlation coefficients between the appraisers’ domain scores and the overall assessment scores.
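Kendall's tau B is suited to this comparison because the overall assessment uses only three categories, so tied ranks are frequent and the coefficient corrects for them. A minimal sketch of the standard pairwise definition follows; the function name and the scores are hypothetical, and no significance test is included.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau B: (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1          # pair tied on the first variable
        if dy == 0:
            ties_y += 1          # pair tied on the second variable
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n0 = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / math.sqrt((n0 - ties_x) * (n0 - ties_y))


# Hypothetical data: standardised domain scores and overall assessments (1-3).
domain = [40, 55, 55, 70, 85]
overall = [1, 2, 1, 2, 3]
print(round(kendall_tau_b(domain, overall), 3))
```

The pairwise loop is quadratic in the number of observations, which is acceptable for a few hundred appraisals.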
The median time for appraising a guideline was 1.5 hours in both field studies. This included reading the guideline and completing the instrument. All appraisals were completed and returned.
Refinement of instrument
Principal components analysis of the draft instrument items yielded a five-factor solution that generally supported the domains of quality identified in the development phase. Table 1 shows the list of items and their loading (correlation) coefficients on each of the five domains from the rotated factor matrix.
Editorial independence appeared to load on several domains. In response, it was shifted to a sixth domain in the final version of the instrument and a new item addressing conflicts of interest was included. Two items—“The guideline is clearly structured” and “The potential problems with changes of attitude or behaviour of health care professionals in applying the guidelines have been considered”—were removed from the final version of the instrument because of failure to establish adequate reliability in the first field test. Finally, 10 items were reworded slightly in the final version of the instrument in response to feedback received from the appraisers (see Face validity below). The refined instrument in its final form contained 23 items grouped into six domains with the 4 point Likert scale to score each item (table 1).
Final instrument properties
Internal consistency ranged between 0.64 and 0.88 and was acceptable for most domains (table 2). The lower α coefficient found for domain 6 (editorial independence) was not surprising as this domain was composed of only two items. Table 2 also shows the intraclass correlations for each domain as a function of the number of raters. As would be expected, the number of appraisers evaluating a guideline affected reliability; increasing the number of raters resulted in substantially higher ICCs.
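The dependence of reliability on the number of raters can be illustrated with a one way random effects intraclass correlation and the Spearman-Brown relation for the mean of several raters. The paper does not state which ICC model was used, so the choice of the one way model is an assumption, as are the function names and the scores.

```python
from statistics import mean


def icc_single(scores):
    """One way random effects ICC for a single rating.

    scores: one row per guideline, one column per appraiser.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(scores, row_means)
                    for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def icc_mean_of(single, raters):
    """Spearman-Brown step-up: reliability of the mean of several raters."""
    return raters * single / (1 + (raters - 1) * single)


# Hypothetical domain scores: two guidelines rated by two appraisers each.
scores = [
    [1, 3],
    [3, 5],
]
single = icc_single(scores)
for raters in (1, 2, 3, 4):
    print(raters, round(icc_mean_of(single, raters), 3))
```

For any positive single-rater ICC the stepped-up value rises with each additional rater, with diminishing gains.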
Results from the first field test indicated that the appraisers found the instrument useful to assess guidelines (95%) and the user guide helpful (98%). However, almost half of the participants reported having difficulties with at least one item of the instrument (49%). The most commonly reported problem was that guidelines lacked the detailed information necessary to assign a score. After refinement of the instrument, results from the second field test showed that the percentage of appraisers reporting difficulties with at least one item in the instrument decreased to 29%.
Tests of the first hypothesis showed that guidelines produced as part of a guideline programme had significantly higher scores on domain 6 (editorial independence) than those published outside a programme (p<0.05). Tests of the second hypothesis showed that guidelines with technical documentation had higher scores on domain 3 (rigour of development) than those published without documentation (p<0.01). Finally, tests of the third hypothesis revealed that guidelines produced after the publication of a national policy had significantly higher quality scores on domain 3 (rigour of development) than did their counterparts (p<0.05). No other significant differences emerged on any of the other domains for any of the contrasts (table 3).
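The one way ANOVAs used for the first and third hypotheses compare between-group and within-group variation in domain scores. The sketch below computes only the F statistic and its degrees of freedom; the p value would come from the F distribution in a statistics library. The function name and the scores are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [x for group in groups for x in group]
    grand = mean(all_scores)
    ss_between = sum(len(group) * (mean(group) - grand) ** 2 for group in groups)
    ss_within = sum((x - mean(group)) ** 2 for group in groups for x in group)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical standardised domain scores for two types of guideline programme.
established = [50, 60, 70]
not_established = [20, 30, 40]
print(one_way_anova_f([established, not_established]))
```

With only two groups, as in the established/not established contrast, the F statistic equals the square of the corresponding two sample t statistic.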
Kendall’s tau B rank correlation coefficients between the appraisers’ domain scores and their overall assessments were all highly significant (p<0.001), providing some evidence of criterion validity using this proxy measure. Table 4 shows the correlation matrix of the six quality domains. With one exception, the domains tended to be more highly correlated with overall judgement than with each other.
This is the first time an appraisal instrument for clinical practice guidelines has been developed and tested at an international level. Created through a rigorous and iterative process by a collaboration of international experts in clinical guidelines, the instrument was applied to 100 guidelines by over 260 appraisers from 11 countries. Previous studies on similar instruments have been limited to appraisers working in the same institution and from the same country.3,7 This study resulted in a rigorously developed set of criteria for appraising guidelines (box 2) that can be helpful for clinical practice in two ways: (1) to help clinicians to differentiate between guidelines from different sources, and (2) as a support to the development of high quality guidelines for medical practice.
Box 2 Criteria of high quality clinical practice guidelines
1. Scope and purpose
Contain a specific statement about the overall objective(s) and clinical questions, and describe the target population.
2. Stakeholder involvement
Provide information about the composition, discipline, and relevant expertise of the guideline development group and involve patients in their development. They also clearly define the target users and have been piloted prior to publication.
3. Rigour of development
Provide detailed information on the search strategy, the inclusion and exclusion criteria for selecting the evidence, and the methods used to formulate the recommendations. The recommendations are explicitly linked to the supporting evidence and there is a discussion of the health benefits, side effects, and risks. They have been externally reviewed before publication and provide detailed information about the procedure for updating the guideline.
4. Clarity and presentation
Contain specific recommendations on appropriate patient care and consider different possible options. The key recommendations are easily found. A summary document and patients’ leaflets are provided.
5. Applicability
Discuss the organisational changes and cost implications of applying the recommendations and present review criteria for monitoring the use of the guidelines.
6. Editorial independence
Include an explicit statement that the views or interests of the funding body have not influenced the final recommendations. Members of the guideline group have declared possible conflicts of interest.
Our results show that the instrument is sensitive to differences in important aspects of clinical practice guidelines, and it can be used consistently by a wide range of professionals from different cultural backgrounds. Health professionals, policy makers, and consumers were all able to appraise guidelines with the AGREE questions and user guide. The appraisers found the instrument easy to apply and perceived it to be useful for judging the quality of guidelines.
When interpreting the results, several considerations must be kept in mind. Firstly, the factor analysis confirmed our conceptual framework, lending support to the assumption that the quality of clinical guidelines is composed of distinct domains, each assessing key quality attributes. However, the concept of guideline quality is still grounded in assumptions that need testing empirically, and we do not know the relative contribution of each domain to the overall quality of a guideline. Construct validity, based on three a priori hypotheses, was not strong. It was somewhat surprising to observe that national (as opposed to local) development and established (as opposed to more recent) programmes supporting production did not predict quality more strongly. The high correlations found between the domain scores and the overall assessment corroborated the modest criterion validity, although the effect may be inflated by the fact that the appraisers made their global ratings after assessing the guidelines.
Secondly, the reliability of the domains is directly affected by the number of appraisers assessing one guideline. Thus, using four appraisers will yield a more reliable assessment than using a single appraiser.17 In this study average ratings of four raters provided the most reliable assessment and we recommend that at least four raters should be used when using the instrument.
Finally, we were not able to demonstrate conclusively the validity of our instrument. The instrument assesses the methodological quality of a guideline and this relies heavily on how well documented the guideline development process is.18 However, explicit reporting does not guarantee optimal recommendations. A well reported guideline may contain flawed recommendations and, conversely, an unsystematically constructed one may provide sound evidence.19 Nevertheless, the criteria we used are accepted as key determinants of valid and effective guidelines among methodologists, and the domains are quite clear. Validation of the instrument is a challenging task. We are currently undertaking detailed content analysis of the appraised guidelines as part of our research programme. This will provide a separate measure of construct validity.
AGREE has considerable implications for research and policy. These standards for the development and reporting of clinical practice guidelines can be used by guideline producers worldwide. The adoption of such standards can improve the consistency and quality of the reporting process.20 The sharing of standards across countries will facilitate international comparison of guidelines and can provide a framework for studies aimed at understanding why guidelines for the same condition may produce differing recommendations.21,22
As the number of clinical practice guidelines submitted for publication increases, there is a need to ensure that they satisfy certain minimum requirements. AGREE can be adopted by editors of peer reviewed journals as a framework to assess the quality of clinical guidelines in the same way that CONSORT and QUOROM are used to judge the quality of randomised controlled trials and meta-analyses.23,24
Given the expansion of national guideline programmes, governments and other agencies must ensure the guidelines are of the highest quality before they endorse them or promote their use in practice. Furthermore, as international cooperation between countries grows there is a strong incentive for policy makers to develop a concerted approach to quality management initiatives, including clinical practice guidelines. The AGREE instrument can enhance this process. This is already taking place as several agencies—such as the National Institute for Clinical Excellence (NICE) in the UK, the National Federation of Cancer Centres (FNCLCC) in France, The Agency for Quality in Medicine in Germany (ÄZQ), and the Scottish Intercollegiate Guidelines Network (SIGN)—are using AGREE in the context of their guidelines programme. The World Health Organisation has adopted the AGREE instrument to assess its guidelines.
In conclusion, the AGREE collaboration has developed an instrument for guideline appraisal using a rigorous methodology. The instrument has been applied to different clinical practice guidelines in 11 countries by a large number of appraisers from a variety of backgrounds. We recommend that guideline producers use this instrument while planning their programmes, and potential guideline users use it to evaluate the quality of guidelines before adopting them.
The AGREE instrument is available on the AGREE website (www.agreecollaboration.org).
Clinical practice guidelines are used increasingly by government agencies and professional organisations around the world to improve patient care, but many published guidelines do not meet the basic quality criteria. There is a pressing need for valid, reliable, and internationally recognised criteria to assess guidelines.
What this study adds
An international collaboration, the AGREE Collaboration, has developed an instrument for assessing the process of guideline development that is reliable and is acceptable in European and non-European countries.
It was not possible to confirm the validity of the instrument.
The instrument provides common standards to improve the quality process and reporting of guideline development worldwide.
These standards can be used for the planning, execution, and monitoring of guideline programmes and for comparing guidelines internationally.
The authors would like to thank the 264 appraisers from the 11 countries who participated in the study and the following colleagues for their valuable comments on the first draft of the instrument: Richard Baker, Martin Eccles, Roeland Geijer, Trisha Greenhalgh, Allen Hutchinson, Nick Hicks, Chris Silagy, Siep Thomas, Richard Thomson, Michel Wensing and Steven Woolf.
Writing group: Françoise Cluzeau (FC), St George’s Hospital Medical School, London, UK; Jako Burgers (JB), University of Nijmegen, The Netherlands; Melissa Brouwers (MB), McMaster University and Cancer Care Ontario, Hamilton, Ontario, Canada; Richard Grol (RG), University of Nijmegen/University of Maastricht, The Netherlands; Marjukka Mäkelä (MM), Finnish Office for Health Care Technology Assessment, Finland; Peter Littlejohns (PL), National Institute for Clinical Excellence, London, UK; Jeremy Grimshaw (JG), Health Services Research Unit, University of Aberdeen, UK; Claire Hunt (CH), Institute of Psychiatry, London, UK.
FC, JB, RG and PL developed the first draft of the instrument and designed the field study. FC and JB drafted the paper and undertook the analyses with CH. MB, MM, RG, PL and JG helped write the final draft.
Contributors: The following individuals provided input into the design and field testing of the AGREE instrument and commented on earlier drafts of the paper: José Asua, Basque Office for Health Technology Assessment, Spain; Anne Bataillard, Fédération Nationale des Centres de Lutte Contre le Cancer, Paris, France; George Browman, Hamilton Regional Cancer Centre, Hamilton, Canada; Bernard Burnand, Institut Universitaire de Médecine Sociale et Préventive, Lausanne, Switzerland; Pierre Durieux, Hôpital Européen Georges Pompidou, Paris, France; Béatrice Fervers, Fédération Nationale des Centres de Lutte Contre le Cancer, Paris, France; Roberto Grilli, Agenzia Sanitaria Regionale, Bologna, Italy; Steven Hanna, McMaster University, Hamilton, Ontario, Canada; Pieter ten Have, Utrecht, The Netherlands; Albert Jovell, Fundacio Biblioteca Josep Laporte, Barcelona, Spain; Niek Klazinga, Academisch Medisch Centrum University of Amsterdam, The Netherlands;
Funding: The research was funded by a grant from the EU BIOMED2 Programme (BMH4-98-3669). The work in Switzerland was funded from the Swiss Federal Office for Education and Science (OFES 97.0447). The Health Services Research Unit, University of Aberdeen is funded by the Chief Scientist Office of the Scottish Executive Department of Health. The views expressed are those of the authors and not the funders.
Conflict of interest: none.
* See end of article for contributors