
The quality of clinical practice guidelines over the last two decades: a systematic review of guideline appraisal studies
  1. Pablo Alonso-Coello1,2,
  2. Affan Irfan3,
  3. Ivan Solà1,
  4. Ignasi Gich1,2,
  5. Mario Delgado-Noguera4,5,
  6. David Rigau1,
  7. Sera Tort1,
  8. Xavier Bonfill1,2,
  9. Jako Burgers6,7,
  10. Holger Schunemann8
  1. 1Iberoamerican Cochrane Centre. Clinical Epidemiology and Public Health Department. Institute of Biomedical Research (IIB Sant Pau), Spain
  2. 2CIBER de Epidemiología y Salud Pública (CIBERESP), Spain
  3. 3Interactive Research and Development, The Indus Hospital, Korangi Crossing, Karachi, Pakistan
  4. 4Department of Pediatrics, University of Cauca, Popayán, Colombia
  5. 5Pediatrics, Obstetrics and Gynecology and Preventive Medicine Department, Universidad Autónoma de Barcelona, Barcelona, Spain
  6. 6Dutch Institute for Healthcare Improvement CBO, Utrecht, The Netherlands
  7. 7Scientific Institute for Quality of Healthcare (IQ healthcare), Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands
  8. 8Department of Clinical Epidemiology & Biostatistics, McMaster University Health Sciences Centre, Hamilton, Canada
  1. Correspondence to Dr Pablo Alonso-Coello, Iberoamerican Cochrane Centre. Clinical Epidemiology and Public Health Department. Institute of Biomedical Research (IIB Sant Pau), Barcelona, Spain; palonso{at}


Background Despite the increasing number of manuals on how to develop clinical practice guidelines (CPGs) there remain concerns about their quality. The aim of this study was to review the quality of CPGs across a wide range of healthcare topics published since 1980.

Methods The authors conducted a literature search in MEDLINE to identify publications assessing the quality of CPGs with the Appraisal of Guidelines, Research and Evaluation (AGREE) instrument. For the included guidelines in each study, the authors gathered data about the year of publication, institution, country, healthcare topic, AGREE score per domain and overall assessment.

Results In total, 42 reviews were selected, including a total of 626 guidelines, published between 1980 and 2007, with a median of 25 CPGs. The mean scores were acceptable for the domain ‘Scope and purpose’ (64%; 95% CI 61.9 to 66.4) and ‘Clarity and presentation’ (60%; 95% CI 57.9 to 61.9), moderate for domain ‘Rigour of development’ (43%; 95% CI 41.0 to 45.2), and low for the other domains (‘Stakeholder involvement’ 35%; 95% CI 33.9 to 37.5, ‘Editorial independence’ 30%; 95% CI 27.9 to 32.3, and ‘Applicability’ 22%; 95% CI 20.4 to 23.9). From those guidelines that included an overall assessment, 62% (168/270) were recommended or recommended with provisos. There was a significant improvement over time for all domains, except for ‘Editorial independence.’

Conclusions This review shows that despite some increase in quality of CPGs over time, the quality scores as measured with the AGREE Instrument have remained moderate to low over the last two decades. This finding urges guideline developers to continue improving the quality of their products. International collaboration could help increase the efficiency of the process.

  • Clinical Practice Guidelines (CPG)
  • AGREE instrument
  • appraisal
  • quality
  • healthcare quality



Clinical practice guidelines (CPGs) are ‘systematically developed statements to assist practitioner and patient decisions about appropriate healthcare for specific clinical circumstances’.1 CPGs are instruments intended to reduce the gap between research and practice and should be based on the best available scientific evidence to improve the quality of patient care.

Since the 1980s, the number of CPGs has increased rapidly (figure 1), as has the number of publications on guidelines. In the late 1990s, growing concern about variations in guideline recommendations and quality became evident.2 ,3 Therefore, formal methods became necessary to assess the quality of CPGs. The Appraisal of Guidelines, Research and Evaluation (AGREE) Collaboration, an international group of researchers from 13 countries, developed a tool, the AGREE Instrument, which can be used to appraise and compare the quality of CPGs.4 It provides a systematic framework for assessing key components of the methodological quality of guidelines, including the process of development and the quality of reporting, and is intended to support both guideline developers and users. The AGREE Instrument is the only appraisal tool that has been developed and validated internationally.5 Since its publication in 2003, the AGREE Instrument has been translated into many languages, formally endorsed by several organisations (eg, WHO Advisory Committee on Health Research) and used by many guideline development groups.6

Figure 1

Number of guidelines in PubMed.

Despite the validation and dissemination of the AGREE Instrument, concerns about suboptimal quality, the lack of supporting evidence and the applicability of guidelines continue to exist.7–10 This lack of quality may account for limited effects of guidelines on health outcomes. The aim of this study was to review the quality of CPGs across a wide range of healthcare topics published since the 1980s, to analyse the time trends in the quality of guideline development and to evaluate the potential impact of the availability of the AGREE Instrument on guideline quality.


Literature search

We conducted a literature search in MEDLINE (accessed via PubMed) to identify reviews that focused on clinical practice guidelines. We limited the search time frame to articles published from 2003, the year in which the AGREE Collaboration published the AGREE Instrument,4 until July 2008. We searched for the name of the instrument and its acronym in the title and/or abstract fields. Additionally, we used a series of terms related to guideline assessment to complete the identification of relevant studies (eg, evaluation, assessment, quality, appraisal, guidelines). We excluded reviews that used fewer than two appraisers to assess quality, and reviews that used appraisal tools other than the AGREE Instrument. There was no language or country restriction. Two reviewers (AI, MDN) checked all abstracts and full text articles for inclusion, and resolved their disagreements by consensus.

AGREE Instrument

The AGREE Instrument contains 23 items grouped in six domains—(1) scope and purpose; (2) stakeholder involvement; (3) rigour of development; (4) clarity and presentation; (5) applicability; and (6) editorial independence—and one overall assessment item, judging whether the guideline ought to be recommended for its use in clinical practice (Appendix 1). To evaluate each item within the domains, a four-point Likert scale is used, ranging from strongly disagree to strongly agree (1 to 4 respectively). For the overall judgement, a three-point scale is used, ranging from not recommended to strongly recommended. At least two appraisers, but preferably four, are needed. The score of each domain is calculated by summing the scores across the appraisers and standardising them as a percentage of the possible maximum score (thus ranging from 0 to 100%). The internal consistency (Cronbach alpha),11 as measured in the international validation study, varied from 0.64 to 0.88, which means that the reliability is acceptable for most domains.4
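The standardisation described above can be sketched in a few lines of Python. This is a minimal illustration, not the Collaboration's own software: it assumes that the minimum possible score is subtracted before dividing by the range (which is what lets the result span 0 to 100%), and the function name and example ratings are hypothetical.

```python
def agree_domain_score(ratings):
    """Standardised AGREE domain score as a percentage (0 to 100).

    `ratings` holds one list per appraiser, each containing that
    appraiser's item scores (1 to 4) for a single domain.
    """
    if len(ratings) < 2:
        raise ValueError("at least two appraisers are needed")
    n_items = len(ratings[0])
    if n_items == 0 or any(len(r) != n_items for r in ratings):
        raise ValueError("every appraiser must score the same items")
    if any(not 1 <= score <= 4 for r in ratings for score in r):
        raise ValueError("item scores must lie between 1 and 4")

    obtained = sum(sum(r) for r in ratings)
    # Assumption: scale by the achievable range, not the raw maximum.
    min_possible = 1 * n_items * len(ratings)
    max_possible = 4 * n_items * len(ratings)
    return 100.0 * (obtained - min_possible) / (max_possible - min_possible)


# Hypothetical example: two appraisers rating the three items of
# 'Scope and purpose' for one guideline.
score = agree_domain_score([[3, 4, 2], [2, 3, 3]])
print(round(score, 1))  # 61.1
```

Summing across appraisers before standardising means that a domain score reflects the pooled judgement of the panel rather than an average of individual percentages.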

Data collection

One reviewer (AI) collected data on the following characteristics of each included study: healthcare topic, type of search for retrieving the included guidelines, number of evaluators and interobserver agreement. For the included guidelines in each study, we gathered data on the year of publication, institution, country, AGREE score per domain (scope and purpose, stakeholder involvement, rigour of development, clarity of presentation, applicability and editorial independence) and whether the study provided an overall assessment (recommended, recommended with provisos or not recommended). All data were checked for accuracy by a second reviewer (MDN). When any of these data were not available, we contacted the authors of the articles and asked them to send the missing information.


We summarised AGREE domain scores as means with 95% CIs, and categorical variables as numbers of cases with corresponding percentages. The overall assessment was dichotomised into recommended and not recommended. The correlation between the different domains and overall assessment was analysed using the Pearson coefficient. The scores between different variables (date of publication, type of institution, health topic and country of publication) were compared by analysis of variance (ANOVA) with a post-hoc test (Duncan test) when appropriate. For the analysis of the trend over time, the date of publication was grouped into four periods of 5 years, and analysed using a non-parametric test (Kruskal–Wallis) and a post-hoc test (Mann–Whitney). We explored the potential influence of the publication of the AGREE instrument on the quality of guidelines by comparing the quality scores of guidelines published before and after 2003. We analysed the data with SPSS (Version 15.0; SPSS, Chicago, Illinois).
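To make the descriptive summary concrete, the sketch below shows one way a mean with a 95% CI and the dichotomised overall assessment could be computed. It is an illustration only: the analysis itself was run in SPSS, the normal approximation (z = 1.96) is an assumption, the grouping of 'recommended with provisos' with 'recommended' is inferred from how the results are reported, and the function names and scores are hypothetical.

```python
import math
import statistics


def mean_ci95(scores, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation 95% CI.

    Assumption: z = 1.96; statistical packages typically use the
    t distribution, which gives slightly wider intervals for small samples.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - z * standard_error, mean + z * standard_error


def is_recommended(overall_assessment):
    """Dichotomise the three-level overall assessment.

    Assumption: 'recommended with provisos' counts as recommended.
    """
    return overall_assessment in ("recommended", "recommended with provisos")


# Hypothetical domain scores (%) for three guidelines.
mean, lower, upper = mean_ci95([40.0, 50.0, 60.0])
print(f"{mean:.1f}% (95% CI {lower:.1f} to {upper:.1f})")
```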


The literature search yielded 1745 references, of which 908 were considered potentially relevant for our study. Of these articles, 833 were excluded after screening the titles and abstracts. The full text version of the remaining 75 articles was assessed, which led to the exclusion of another 33 articles following our inclusion/exclusion criteria (figure 2). We contacted 14 authors for more details, and five provided additional data and complementary information. In total, 42 reviews were selected, including a total of 626 guidelines published between 1980 and 2007. Agreement in the inclusion process was high (kappa 0.92; SE 0.033).
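For readers unfamiliar with the agreement statistic, the following sketch computes a kappa coefficient for two reviewers' include/exclude decisions. It assumes the reported value is Cohen's kappa for two raters, which the text does not state explicitly; the function name and the decisions are hypothetical and are not the study data.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set")
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)

    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal rates.
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical screening decisions (1 = include, 0 = exclude).
reviewer_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
reviewer_2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(round(cohen_kappa(reviewer_1, reviewer_2), 2))  # 0.78
```

In this example the reviewers agree on 9 of 10 articles, but kappa is lower than 0.90 because part of that agreement would be expected by chance alone.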

Characteristics of reviews and guidelines

A median of 25 guidelines were assessed in each review (range: 2–57).9 ,12–53 In 57% (24/42) of the reviews, the guidelines were assessed with three or more appraisers. The interobserver agreement was provided in 39% (16/41) of the reviews. In total, 626 individual guidelines were evaluated with 270 including an overall assessment. Most guidelines were published in the last 10 years (table 1), and the majority were developed in Europe (41%) and North America (41%). More than half of the guidelines (62%) were developed by medical societies, followed by governmental agencies (20%).

Table 1

Basic characteristics of guidelines reviewed (n=626)*

Guideline quality scores

The mean scores were acceptable for domains ‘Scope and purpose’ (64%; 95% CI 61.9 to 66.4) and ‘Clarity and presentation’ (60%; 95% CI 57.9 to 61.9), moderate for domain ‘Rigour of development’ (43%; 95% CI 41.0 to 45.2) and low for domains ‘Stakeholder involvement’ (35%; 95% CI 33.9 to 37.5), ‘Editorial independence’ (30%; 95% CI 27.9 to 32.3) and ‘Applicability’ (22%; 95% CI 20.4 to 23.9). Of those guidelines that included an overall assessment, 62% (168/270) were recommended or recommended with provisos. Over the last 5 years (2003–2007) the mean scores were 71% (95% CI 67.6 to 74.4) for Scope and purpose, 37% (95% CI 33.8 to 40.2) for Stakeholder involvement, 44% (95% CI 40.2 to 47.8) for Rigour of development, 68% (95% CI 65.0 to 71.0) for Clarity of presentation, 23% (95% CI 19.9 to 26.1) for Applicability and 33% (95% CI 28.6 to 37.4) for Editorial independence. Regarding overall assessment, 73% of guidelines were recommended or recommended with provisos over the last 5 years, and 27% were not recommended. Overall assessment was highly correlated with all six domains (Pearson coefficient (r)=0.70–0.85).

There was a significant improvement over time for all domains, except for ‘Editorial independence’ (appendix 2 and appendix 3). We observed significantly higher scores for those guidelines published in 2003 or later, compared with those developed before, on three domains (Scope and Purpose, Stakeholder involvement, Clarity and presentation) but not on the other domains (Rigour of development, Applicability and Editorial independence). The number of guidelines recommended or recommended with provisos also increased over time and was significantly higher after 2002 (table 2).

Table 2

Appraisal of Guidelines, Research and Evaluation domain scores of guidelines over time (total sample=608)

CPGs developed in North America and Australia scored significantly lower in the domains ‘Scope and purpose’ (p<0.001) and ‘Clarity of presentation’ (p=0.003) compared with guidelines developed in Europe. There were no differences in the other four domains or in the overall assessment. Guidelines on oncology, internal medicine, musculoskeletal and paediatrics scored significantly higher than those on other topics on four domains (Scope and purpose, Rigour of development, Clarity and presentation and Editorial independence). Medical societies scored lower than governmental and international institutions on all domains except ‘Editorial independence,’ and a significantly lower proportion of their guidelines were recommended.


Our review shows that while there has been a significant increase in overall assessment over time, the overall AGREE scores over the last two decades were moderate to low. In particular, the low score for rigour of development is worrying, as this domain may be a stronger indicator of quality than any of the other domains of the instrument. It could be argued that guideline developers have improved guideline presentation, but that there is much room for improvement for the methodology and other domains of quality.

Our review has several strengths. First, it is the first systematic review of guideline quality over a wide range of topics covering the last 20 years. Our structured and explicit approach increases the validity of the findings. Second, two independent evaluators achieved a high degree of agreement when they assessed the articles. Third, quality control of the extracted data involved two reviewers, which further enhances confidence in our results. Fourth, we contacted a large number of authors to increase the size of the sample and to check the integrity of data that may not have been included in the full text article. Fifth, working from reviews rather than searching for individual guidelines de novo allowed us to include documents that are no longer available on the corresponding institutions' web pages because they have been updated.

A potential limitation is that our review relied on published reviews that appraised guidelines rather than on our own appraisal of individual guidelines. This approach allowed us to include guidelines from many countries and healthcare areas. Nevertheless, our study represents a non-random sample of guidelines developed over the last two decades, which might bias the findings. We selected only reviews that used the AGREE Instrument in accordance with the recommendations of the AGREE Collaboration, such as two or more observers reviewing the guidelines independently. However, the majority of the reviews did not report the interobserver agreement, which could be a source of potential bias. An overoptimistic picture of the actual quality of guidelines is possible for several reasons: (1) under-representation of healthcare topics in areas of poor-quality evidence and where organisations are less likely to produce guidelines; (2) inclusion of the most recent versions of the guidelines, whereas older versions are more likely to be of lower quality and are often neither published in indexed journals nor evaluated in the included reviews; (3) selection bias of guidelines in the reviews, based on language and country preferences.

Our review showed that the quality domains with acceptable scores are ‘Scope and purpose’ and ‘Clarity of presentation.’ The scores of these domains can be improved further by providing specific information and clear summaries. To further enhance and expedite the process, the number of questions that guidelines address could be limited by focussing only on the most patient-important issues.54 This is closely related to the need for more limited scopes and for moving from classic textbook-like guidelines to more succinct documents.

The domain ‘Rigour of development’ had a low mean score (40%) and, surprisingly, has not improved in the last 5 years. This is one of our most worrying findings, as this domain could be argued to have the most direct effect on guideline quality. The low scores may be due to a lack of methodological expertise in guideline development teams or a lack of the resources needed to perform a well-documented systematic literature search. A strong methodological background is paramount for developing guidelines, and institutions should embark on developing de novo guidelines only if they meet the minimal requirements to do so. Another explanation for low scores on rigour of development is that the methods used are poorly reported in the guidelines. This could be improved by using addenda to include search strategies, the literature selection process or evidence tables. In electronic documents, hyperlinks to these addenda and to methodology sections can be helpful.

Similarly, the scores of the domain ‘Stakeholder involvement’ did not improve during the last 5 years. Low scores reflect the lack of multidisciplinary teams and the failure to consider patients' views and experiences during the development of the guideline. Including professionals from other related disciplines may be perceived as complicating the process. Nevertheless, to improve patient care concerning the topic of the guideline, all relevant disciplines should be involved in the guideline development group. Guideline developers should also actively contact relevant patients' organisations at an early stage to define the scope of the guideline and to include patient representatives in their panels. They should also consider searching for qualitative and other types of research on patients' values and preferences. Few guidelines are pilot-tested among target users before publication, as it can be time-consuming and not easy to organise. In the next version of the AGREE Instrument (AGREE II), this item has been deleted and integrated in the item that addresses facilitators and barriers to guideline implementation.54 ,55

The low scores on the domain ‘Applicability’ could be due to considering guideline development and guideline implementation as separate activities. Guideline development groups may feel that they are not competent to discuss potential organisational barriers and cost implications in applying the recommendations. Criteria for monitoring and/or audit purposes are often developed after publication of the guidelines with a substantial delay. This also needs special expertise which may not be available during the process of guideline development. Institutions developing guidelines need to gradually integrate these issues in the development phase. Relevant professionals with the relevant expertise should be incorporated in the development group in early stages. Alternatively, guidelines should inform users about the need to consider these issues (barriers, costs, indicators or criteria for monitoring) locally when implementing or adapting a particular guideline.

Finally, the low scores in the ‘Editorial Independence’ domain may be due to a lack of information about funding sources and conflict of interests. It would be relatively easy to raise the scores by providing more information on these items. New approaches for dealing with financial and intellectual conflicts of interest are being implemented.56–58

Most guidelines around the world have been developed in high-income countries. Latin America, Asia and Africa produce guidelines, or protocols, which may be less likely to be published in indexed journals because of language problems and are therefore frequently not included in PubMed and similar databases. If these guidelines are of lower quality than those included in our review, the overall quality observed would probably overestimate the real quality of guidelines around the world. This would stress the need for further improvement of guidelines across the globe. Higher scores in Europe compared with North America are probably due to a greater involvement and funding from public institutions in the former, with more guidelines being developed by medical societies in the latter. The main topics of the guidelines included in the reviews were those related to internal medicine (including critical care and geriatrics), oncology and musculoskeletal disorders. These topics are related to the most prevalent chronic conditions, and to areas where the research tends to be of a higher quality due to a longer research tradition and larger investments from industry and other sources. If we had taken a random sample with equal representation of each topic, lower quality scores would probably have been observed. This further strengthens our conclusions on the modest quality of guidelines.

Our review confirms the findings of earlier studies that the quality of practice guidelines developed by specialty societies is moderate to low.3 ,18 Grilli et al reviewed the quality of 431 guidelines published between 1988 and 1998 by assessing the reporting of the type of professionals and stakeholders involved, the literature search strategy used and the explicit grading of recommendations according to the quality of supporting evidence.3 All three criteria for quality were met in only 5% of the guidelines. Burgers et al assessed 86 guidelines published between 1992 and 1999 with the AGREE Instrument and found that the quality of guidelines developed by government-funded organisations was higher than that of guidelines developed by professional and specialist societies.18 However, the differences observed were not large. In our review, we also found that CPGs developed by medical societies scored lower in all domains except for ‘Editorial Independence.’ This phenomenon might be explained by poor reporting and lack of transparency about the methods. Guideline users should be critical about the guideline organisation before adopting guidelines in practice.

Guidelines that meet the criteria of the AGREE instrument are likely better than others. Nevertheless, the use of the AGREE instrument has some potential drawbacks.4 First, most criteria are based on theoretical assumptions rather than on empirical evidence. Second, it assesses the likelihood of achieving an intended outcome, but it does not determine the clinical validity of the recommendations. Third, the validity of the overall assessment may be limited, as there were no clear rules on how to weigh the different domain scores in making a decision about whether or not to recommend the guidelines. A new version of the AGREE Instrument will be released soon with more specific scoring instructions.

Implications of our research

The findings of our systematic review urge guideline developers to improve the quality of their products. Guideline methodology is a fast-evolving science,59 ,60 and developers may be lagging behind. Large institutions with sufficient resources are more likely to be able to keep up to date with methodology. We provide a series of suggested actions to improve guideline quality (table 3). We believe that the potential solutions should be global and coordinated.

Table 3

Shortcomings and actions to improve guideline quality (Appraisal of Guidelines, Research and Evaluation domains)

Adaptation of high-quality existing guidelines could be an option as an alternative to de novo guideline development, which may increase the efficiency of guideline development. The ADAPTE Collaboration has developed a generic adaptation process that aims to foster valid and high-quality adapted guidelines.61 The ADAPTE process respects evidence-based principles for guideline development and takes into consideration the organisational and cultural context to ensure relevance for local practice.61 While institutions could benefit greatly from adapting sound guidelines,61 the efficacy of the ADAPTE method over de novo guideline development still needs to be established.

Globally, there is an urgent need to minimise the duplication of efforts. International collaboration on specific disease topics may also reduce redundancy.62 Guideline developers should join efforts by specialty, topic or condition as appropriate, and start sharing their resources and initiatives. More energy should be spent on forming networks or collaborations to avoid duplication in evaluating the available evidence. A uniform format for evidence tables could help in sharing evidence worldwide (N Mlika-Cabanne, submitted). Standardised global databases could be developed of existing evidence gaps, and these could feed the agenda of the Cochrane Collaboration, globally or by review group. These research gaps could inform funding institutions across the globe about the needs for making sound recommendations about a whole array of topics. Organisations such as WHO, the Guidelines International Network and the Cochrane Collaboration should play a major role in supporting this work.


We would like to thank the following authors who provided additional information about their reviews which undoubtedly improved the quality of our work: R Voellinger-Pralong, G Nuki, T-L Appleyard and N Puerto. We are also extremely thankful to S Qureshi for her comments on earlier drafts of our review.

Appendix 1 Domains and items of the Appraisal of Guidelines, Research and Evaluation Instrument4

  • Area 1: Scope and purpose

    • 1. The overall objective of the guideline is specifically described

    • 2. The clinical question covered by the guideline is specifically described

    • 3. The patients to whom the guideline is meant to apply are specifically described

  • Area 2: Stakeholder involvement

    • 4. The guideline development group includes individuals from all the relevant professional groups

    • 5. The patient's views and preferences have been sought

    • 6. The target users of the guideline are clearly described

    • 7. The guideline has been piloted among target users

  • Area 3: Rigour of development

    • 8. Systematic methods were used to search for evidence

    • 9. The criteria for selecting the evidence are clearly described

    • 10. The methods used for formulating the recommendations are clearly described

    • 11. The health benefits, side effects and risks have been considered in formulating the recommendations

    • 12. There is an explicit link between the recommendations and the supporting evidence

    • 13. The guideline has been externally reviewed by experts prior to its publication

    • 14. A procedure for updating the guideline is provided

  • Area 4: Clarity and presentation

    • 15. The recommendations are specific and unambiguous

    • 16. The different options for management of condition are clearly presented

    • 17. Key recommendations are easily identifiable

    • 18. The guideline is supported with tools for application

  • Area 5: Applicability

    • 19. The potential organisational barriers in applying the recommendations have been discussed

    • 20. The potential cost implications of applying the recommendations have been considered

    • 21. The guideline presents key review criteria for monitoring and/or audit purposes

  • Area 6: Editorial independence

    • 22. The guideline is editorially independent from the funding body

    • 23. Conflicts of interest of guideline development members have been recorded

Appendix 2 Quality of clinical practice guidelines over time evaluated with the Appraisal of Guidelines, Research and Evaluation instrument


Appendix 3 Quality of clinical practice guidelines over time evaluated with the Appraisal of Guidelines, Research and Evaluation instrument





  • Competing interests HS is a cochair of the GRADE Working Group, an international collaboration of methodologists, epidemiologists, clinicians and guideline developers, and supports the work of GRADE internationally. He receives no direct personal payments for work as a member of the GRADE Working Group. JB is a member of the AGREE Research Trust (ART) that aims to facilitate the distribution, maintenance and improvement of the AGREE Instrument and to encourage its development through collaborative research projects. ART is a charitable trust that does not receive any income. PA-C is a member of the GRADE Working Group. He receives no direct personal payments for work as a member of this group.

  • Provenance and peer review Not commissioned; externally peer reviewed.