Introduction

The randomized controlled trial (RCT) is considered the highest level of evidence available for evaluating new therapies. Results of adequately powered RCTs are more definitive than any other type of clinical research information. As such, RCTs represent one of the most reliable sources of evidence to guide clinical practice. However, the methodological quality of an RCT can influence the validity, accuracy and reproducibility of its results [1]. Flaws in the methodological quality of an RCT have been associated with biased estimates of treatment effect and efficacy [2–7]. Methodological quality is defined as “the confidence that the trial design, conduct, and analysis have minimized or avoided biases in its treatment comparisons” [1], whereas reporting quality is defined as “the provided information about the design, conduct and analysis of the trial” [1]. Inadequate reporting makes the interpretation of studies difficult or impossible. Since the quality of an RCT can be judged only on the basis of what has been reported, quality of reporting has been used as a measure of methodological quality. Quality is judged inadequate unless information to the contrary is reported (a “guilty until proven innocent” approach), a practice often justified by the observation that faulty reporting generally reflects faulty methods [8]. Inadequate or inaccurate reporting is common in medical journals. Deficiencies have been documented in the reporting of the methods used to randomly assign participants to comparison groups, to analyze the data and to ensure blinding of outcome evaluation, as well as in the reporting of primary and secondary endpoints and of the sample size calculation. However, the quality of reporting may not necessarily reflect the methodological quality of the study, because well-conducted trials may be reported badly [9–12].

Specific scales, such as the Jadad scale, have been developed to evaluate the methodological quality of clinical trials [9]. Presence and appropriateness of randomization, blinding and reporting of withdrawals, which are key indicators of the quality of RCTs [8], are all included in the Jadad scale [9]. However, other features not included in the Jadad scale can influence the quality of RCTs, including correct sample size calculation, allocation concealment and intention-to-treat analysis. Therefore, although summary scales provide a useful synthetic representation of RCT quality, all relevant methodological components should also be individually evaluated.

A sample size calculation is crucially important to ensure that a study has adequate power to detect the treatment effect and to minimize the risk of false negative findings, a major concern in clinical research. Sample size calculations must be performed carefully, since incorrect calculations can be misleading, and a new indicator of possible bias, referred to as “delta inflation”, has recently been proposed to assess this [13]. Specification of the expected frequency of the outcome in each study group is an important step in the sample size calculation of an RCT, since it defines the clinically relevant and scientifically plausible treatment effect targeted by the study. Delta is the predicted effect size of the treatment under study compared with the control treatment on a pre-specified outcome, chosen as the one of greatest importance to relevant stakeholders. The other elements of the sample size calculation are the significance level required for rejection of the null hypothesis and the statistical power. Delta inflation represents an overestimation of the expected treatment effect size [13]. Compared with misspecification of the other variables, delta inflation has a larger impact on the required sample size, and it occurs commonly in RCTs investigating therapies for critical illness published in high impact journals [13]. Delta inflation may result in RCTs with inadequate sample size to detect genuine differences between the investigated treatments, which can lead to false negative findings.
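To see why delta dominates the calculation, consider a standard normal-approximation formula for the per-group sample size of a two-arm trial with a binary outcome (one of several common approximations), where $p_c$ and $p_t$ are the anticipated event proportions in the control and treatment groups, $\delta = p_c - p_t$, $\alpha$ is the significance level and $1-\beta$ the power:

$$
n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\left[p_c(1-p_c) + p_t(1-p_t)\right]}{\delta^{2}}
$$

Because $n$ scales with $1/\delta^{2}$, overestimating delta by a factor of two cuts the computed sample size roughly fourfold, whereas comparable misspecification of the significance level or power changes $n$ far less.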

Another recently proposed indicator of possible bias is the “spin strategy”, defined as the use of specific reporting strategies that distort the interpretation of results and mislead readers [14]. Spin strategies include [14]: (1) focusing on secondary statistically significant results, such as statistically significant results from within-group comparisons, analyses of secondary outcomes, subgroup analyses, or modified analyses (e.g., per protocol analysis); (2) interpreting statistically non-significant results as demonstrating treatment equivalence or comparable effectiveness when the study had not been designed to assess equivalence or non-inferiority, designs that require a different statistical approach and larger sample sizes than classical superiority RCTs [15]; and (3) claiming or emphasizing a beneficial effect of the treatment despite statistically non-significant results.

In a previous review assessing the quality of reporting of RCTs published in Intensive Care Medicine, from its inception in 1975 to December 2000, the percentage of adequately reported RCTs according to the Jadad scale was only 25 % [16]. Intensive Care Medicine is now recognized as one of the leading journals in the intensive care field, with a well-defined identity [17]. Its articles are widely cited in the medical literature, and its impact factor has risen since 2001 to rank second among intensive care journals.

The aim of this study was to compare the quality of reporting of RCTs published in Intensive Care Medicine from 2001 to 2010 with that described in our previous review for RCTs published from 1975 to 2000, using individual components of methodological quality of reporting as well as the Jadad scale.

For RCTs published from 2001 to 2010, we also evaluated the frequency of spin and delta inflation as further indicators of methodological quality. Finally, we tested the hypothesis that RCTs with higher Jadad scores are cited more often than those with lower scores.

Methods

In line with our previous study [16], this review includes all published RCTs that evaluated the efficacy of a treatment. RCTs evaluating diagnostic, management or educational strategies were excluded. Studies were identified by two independent assessors consulting the online archive of the Journal and Springer’s website, using the following search terms in article titles and abstracts: “randomized controlled trial”, “controlled clinical trial”, “randomized”, “trial”, “randomly assigned”, “random order”, “randomization”, “placebo”, “drug therapy”.

Two independent reviewers assessed the studies using a standardized form, and discrepancies were resolved by discussion with a third reviewer until consensus was reached.

Assessment of individual methodological components

The quality of reporting of RCTs was assessed by evaluating three major methodological components: the randomization process, blinding, and reporting of participant flow. Key elements of the randomization process were the description of the method used to generate an unpredictable sequence and of its concealment until assignment (e.g., computerized random generation of the sequence and its concealment in sealed, opaque, sequentially numbered or coded envelopes). Blinding was analyzed in terms of the strategy used to withhold information about the assigned interventions and to protect the randomization sequence after allocation. Explicit statements about the blinding status of the patients and of study personnel involved in the RCT, such as clinicians, researchers, statisticians, or outcome assessors, were recorded. Double blinding of ICU personnel and patient was judged not feasible for certain types of intervention (e.g., supine vs. prone positioning, different ventilator modalities, and use of devices) [18], based on assessment of the type of intervention by two expert intensivists (F.A.R., N.L.). Finally, the key elements of participant flow were the reporting of the number of patients randomly assigned to a treatment and of those who actually received the intended treatment, the number of patients analyzed for the primary outcome, and the number of patients excluded after randomization or lost to follow-up.

The Jadad scale

We used the Jadad scale for a synthetic representation of RCT quality and for comparison with the previous study period. The scale consists of five yes/no questions assessing three key items: randomization (two questions: 1. Was the study described as randomized? 2. Was the randomization scheme described and appropriate?), double blinding (two questions: 3. Was the study described as double-blind? 4. Was the method of double blinding described and appropriate?), and dropouts and withdrawals (one question: 5. Was there a description of dropouts and withdrawals?) [9]. Total scores range from 0 to 5, with scores ≥3 indicating good quality RCTs [8]. We maintained the distinction between RCTs with a Jadad score <3 and those with a score of ≥3 to allow comparison of the overall methodological quality between the two study periods.
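As an illustration, a minimal sketch of the five-item scoring described above (identifiers are ours; this simple additive form tracks the five yes/no questions and does not model the point deductions for inappropriate methods specified in the published scale):

```python
def jadad_score(described_as_randomized: bool,
                randomization_described_and_appropriate: bool,
                described_as_double_blind: bool,
                blinding_described_and_appropriate: bool,
                dropouts_and_withdrawals_described: bool) -> int:
    """Sum of five yes/no items; total 0-5, with >=3 taken as good quality."""
    return sum([described_as_randomized,
                randomization_described_and_appropriate,
                described_as_double_blind,
                blinding_described_and_appropriate,
                dropouts_and_withdrawals_described])

# Example: randomized with an appropriate scheme, unblinded,
# withdrawals described -> score of 3 (good quality)
score = jadad_score(True, True, False, False, True)
```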

Analysis of spin and delta inflation

The RCTs reporting statistically non-significant results were examined for the presence of spin, and information was extracted on the specific spin strategy used by the authors [14].

We evaluated delta inflation only in RCTs with mortality as the primary outcome, for consistency with the original publication [13], with delta representing the treatment effect (difference in mortality between treatment and control groups). The difference between predicted and observed delta was defined as the delta-gap. Delta inflation was considered present if the predicted delta was outside the 95 % confidence interval of the observed delta.
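A sketch of this check for a trial with mortality as the primary outcome (the paper does not state how the 95 % CI of the observed delta was computed; a Wald normal-approximation interval for a difference of two proportions is assumed here, and all identifiers are illustrative):

```python
import math

def delta_inflated(predicted_delta: float,
                   deaths_ctrl: int, n_ctrl: int,
                   deaths_trt: int, n_trt: int,
                   z: float = 1.96) -> bool:
    """True if the predicted delta falls outside the Wald 95% CI
    of the observed mortality difference (control minus treatment)."""
    p_c = deaths_ctrl / n_ctrl
    p_t = deaths_trt / n_trt
    observed_delta = p_c - p_t
    se = math.sqrt(p_c * (1 - p_c) / n_ctrl + p_t * (1 - p_t) / n_trt)
    lower, upper = observed_delta - z * se, observed_delta + z * se
    return not (lower <= predicted_delta <= upper)

# Example: a trial powered for a 10-point mortality reduction
# that observed only ~2 points -> delta inflation present (True)
print(delta_inflated(0.10, deaths_ctrl=80, n_ctrl=300,
                     deaths_trt=74, n_trt=300))
```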

Other information extracted

The following information was also extracted: sex and age of participants; industry support (classified as total industry funding, in-kind contribution from industry, or duality of interest) [19]; presence of parallel groups (yes/no); number of intervention groups (two or more); characteristics of the control group (placebo or active treatment); number of patients included; blinding status (with specification of who was blinded); pre-specified primary outcome; “a priori” calculation of the sample size; and type of outcome considered (mortality or specific outcomes). Outcomes were further classified as objectively or subjectively assessed, according to the extent to which outcome assessment could be influenced by the investigators’ judgment. Objectively assessed outcomes included all-cause mortality, outcomes based on a laboratory measurement (e.g., pH, PaO2, cardiac index), and outcomes based on other objective measures (e.g., duration of ICU or hospital stay). Subjectively assessed outcomes included physician-assessed outcomes (e.g., ventilator-associated pneumonia, acute respiratory distress syndrome), outcomes based on a combination of several measures (e.g., multiple hemodynamic or respiratory parameters), and patient-reported outcomes (e.g., post-traumatic stress disorder-related symptoms, pain scoring) [12].

We also obtained the total cumulative citation counts for each paper included in our review from three different sources, Web of Science (Thomson Reuters. ISI Web of Knowledge Web site. http://www.isiwebofknowledge.com), Scopus (Elsevier. Scopus Web site. http://www.scopus.com) and Google Scholar (Google. Google Scholar beta Web site. http://scholar.google.com), to test the hypothesis that RCTs with a Jadad score ≥3 are cited more often than RCTs with lower scores. In August 2012, two of us (S.P., C.M.) independently determined the total number of citations to date for all articles in the Web of Science’s Science Citation Index, Scopus, and Google Scholar, using the Digital Object Identifier (DOI) to uniquely identify each article. No discrepancies were found in the citation counts retrieved by the two investigators. The maximum difference in time between assessments of any of the three databases was 7 days for all articles.

Data presentation and statistical analysis

We expressed continuous variables as means (standard deviation, SD) or medians (interquartile range, IQR) and discrete variables as counts (percentages), unless otherwise stated. Differences between groups were analyzed by Student’s t test, Mann–Whitney U test, or chi-square test (or Fisher exact test), as appropriate. The presence of a time trend in the use of spin strategies was investigated by logistic regression, testing the association of spin (any strategy) with year of publication. The association between number of citations and RCT quality was tested using Poisson regression with robust standard errors, with the model adjusted for year of publication (a categorical variable with five levels, each corresponding to two years over the 10-year period).
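A minimal sketch of the citation model on synthetic stand-in data (the original analysis was run in Stata; the column names, data, and effect size below are illustrative assumptions, not the study’s):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic stand-in data: one row per RCT (column names are illustrative).
rng = np.random.default_rng(0)
n = 221
df = pd.DataFrame({
    "high_jadad": rng.integers(0, 2, n),   # 1 if Jadad score >= 3, else 0
    "period": rng.integers(0, 5, n),       # five two-year publication periods
})
df["citations"] = rng.poisson(30 * (1 + 0.3 * df["high_jadad"]))

# Poisson regression of citation counts on quality, with robust (sandwich)
# standard errors, adjusted for publication period as a 5-level factor.
fit = smf.glm("citations ~ high_jadad + C(period)", data=df,
              family=sm.families.Poisson()).fit(cov_type="HC0")

# exp(beta) is the adjusted citation rate ratio for Jadad >= 3 vs < 3;
# a ratio of 1.32 would correspond to a 32 % increase in citations.
print(f"Rate ratio: {np.exp(fit.params['high_jadad']):.2f}")
```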

Tests were two-tailed, and P ≤ 0.05 was considered significant. The data were analyzed with Stata 9.0 (StataCorp, College Station, TX, USA).

Results

From January 2001 to December 2010, 233 RCTs were published in Intensive Care Medicine, of which 221 (95 %) were included in the analysis (Fig. 1; supplemental e-Table). The design characteristics are reported in Table 1. Mortality was the primary outcome in 17 RCTs (8 %). The mean number of RCTs published yearly was 22 (range 14–30), significantly higher than in the previous period between 1975 and 2000 (22 vs. 9; t test: P < 0.001). Sample size was also significantly larger than in the previous period (median 42, IQR 20–100, absolute range 5–1,101, versus median 30, IQR 20–64; Mann–Whitney test: P = 0.048). Yet, one-third of RCTs had 20 patients or fewer and 10 % had 10 patients or fewer.

Fig. 1 Flow chart of the inclusion and exclusion of studies in the review

Table 1 Design characteristics of published randomized controlled trials (RCTs)

The reporting of the individual methodological components is presented in Table 2, where findings are compared with those from our previous review. Studies with statistically non-significant results were more common than in the previous period (52 vs. 17 %; χ² test: P < 0.001). Reporting of the rationale for sample size estimation and of allocation concealment increased significantly, but reporting of other important individual methodological components did not change substantially compared with the previous period and remained low, varying from 12 % for description of the method used to ensure blinding to 57 % for description of withdrawals.

Table 2 Reporting for individual methodological components in RCTs published in Intensive Care Medicine in the two study periods

Among the 69 RCTs (31 %) reporting blinding, 4 studies reported triple blinding (patient, researcher and assessor blinded in 3 studies; patient, researcher and statistician blinded in 1 study), 32 reported double blinding (patient and researcher blinded in 25 studies; researcher and statistician blinded in 3 studies; outcome assessor and statistician blinded in 1 study; double blinding not further specified in 3 studies), and 33 reported single blinding (blinding of the patient in 11 studies, the researcher in 16 studies, the data analyst in 5 studies, and the outcome assessor in 1 study) (supplemental e-Figure).

Among 152 RCTs (69 %) not reporting blinding, 81 did not report the primary outcome; of the 71 that reported it, the primary outcome was objectively assessed in 42 (mortality: 14; other outcomes: 28). Double blinding of the ICU personnel and patient was judged as not feasible in 110 (supplemental e-Figure).

Among the 151 RCTs not reporting allocation concealment, 83 did not report the primary outcome; of the 68 RCTs that reported it, the primary outcome was objectively assessed in 30 (mortality: 5; other outcomes: 25).

Among RCTs published between 2001 and 2010, the proportion of studies with a Jadad score ≥3 was not significantly higher than among RCTs published between 1975 and 2000 (30 vs. 26 %; χ² test: P = 0.40), and it increased only slightly (to 37 %) after exclusion of RCTs in which double blinding of ICU personnel and patient was judged not feasible. Among RCTs in which double blinding was not feasible, blinding of the data analyst was reported in only one study.

Spin was evaluated among the 111 RCTs (50 %) published in the period 2001–2010 that reported statistically non-significant results. A spin strategy was used in 69 (62 %): 43 interpreted statistically non-significant results for the primary outcomes as a demonstration of treatment equivalence or comparable effectiveness, 21 focused on secondary statistically significant results, and 5 claimed or emphasized a beneficial effect of the treatment despite statistically non-significant results. Logistic regression showed no association between the presence of spin and year of publication (P = 0.35).

Delta inflation was evaluated in the 11 RCTs published in the period 2001–2010 that had survival as the primary outcome and reported both predicted and observed delta. Figure 2 shows evidence of delta inflation in 7 of these RCTs (64 %), where the predicted deltas are consistently higher than the observed deltas and fall outside their 95 % CIs.

Fig. 2 Plot of the observed versus the predicted delta for the 11 RCTs for which the information was available. Vertical lines correspond to the 95 % confidence interval of the observed delta, and the diagonal represents the line of equality. Delta inflation is present in the 7 RCTs where the 95 % CI does not cross the line of equality, i.e. where the predicted delta lies outside the 95 % CI of the observed delta

Among RCTs published in the period 2001–2010, the number of citations was higher for RCTs with a Jadad score ≥3 than for those with a Jadad score <3. The Poisson regression model adjusted for year of publication showed that a Jadad score ≥3 was associated with an increase in citations of 32 % (95 % CI: 1–71 %; P = 0.04) for Web of Science, 31 % (1–69 %; P = 0.04) for Scopus, and 32 % (2–71 %; P = 0.04) for Google Scholar. We found no relationship between the presence of spin and the number of citations in Web of Science, Scopus, or Google Scholar (P values of 0.42, 0.50, and 0.42, respectively).

Discussion

We analyzed the quality of reporting of RCTs published in Intensive Care Medicine from 2001 to 2010 and compared it with a previous analysis of RCTs published in the Journal from 1975 to 2000 [16]. The total number of RCTs increased significantly in the last 10 years compared with the previous 25 years. However, we did not observe a similar trend in the quality of reporting. Reporting of the rationale for sample size estimation and of allocation concealment increased significantly, but other important quality indicators, such as randomization, blinding, and specification of the primary outcome, were reported in only one-third to one-half of published RCTs, with no substantial difference from the previous period. Sample size increased significantly, yet one-third of RCTs had 20 patients or fewer and 10 % had 10 patients or fewer. We also documented distorted presentation of results and inflated predicted treatment effect sizes in a considerable number of RCTs published from 2001 to 2010.

Our results agree with findings from previous studies that found low rates of reporting of important indicators of methodological quality in RCTs published in various specialty and general medical journals [20–28]. A recent Cochrane review described the inadequate reporting of essential elements of the methodological quality of published RCTs as a serious endemic problem hindering research utilization in clinical practice and further research [29]. RCTs are the gold standard for evaluating health care interventions, but rigorous methodology is of crucial importance to ensure unbiased comparisons [2]. To assess a trial accurately, readers of a published report need complete, clear, and transparent information on its methodology and findings. Lack of adequate reporting prompted the development of the CONSORT (Consolidated Standards of Reporting Trials) statement in 1996 and its later revisions [30], as well as an increasing number of reviews assessing the quality of reporting of published RCTs [31]. Our findings show that the quality of reporting of RCTs published in Intensive Care Medicine in the “after-CONSORT” period did not differ substantially from that in the earlier period. Similar results have been described in other general and specialty journals [28, 31], suggesting that specific measures are needed to increase the adoption of the CONSORT recommendations by authors, reviewers, and editors [17, 32].

RCTs with inadequate or unclear random-sequence generation, inadequate or unclear allocation concealment, or absent or unclear blinding tend to exaggerate estimates of treatment effects, especially when assessing subjective outcomes [4, 6, 7, 12]. In our analysis, among RCTs not reporting allocation concealment or blinding that had a primary outcome specified, more than half used subjectively assessed outcomes, making inflated estimates of treatment effect likely. Importantly, blinding of caregivers and patients to treatment may sometimes not be feasible in RCTs exploring health care interventions in critical care medicine. In such cases, an uncritical evaluation of the quality of reporting may generate over-pessimistic estimates of methodological quality. However, blinding of outcome assessors, data collectors, or data analysts is always possible, and it is crucial to ensure unbiased ascertainment of the outcome and unbiased estimates of treatment effect [33]. Lack of blinding can introduce bias if knowledge of the treatment received affects patient care or outcome assessment [12]. Blinding can reduce bias in RCTs, particularly in those with subjective outcomes [34]. Therefore, “blinding of as many individuals as is practically possible” should always be pursued [33]. Among the 110 RCTs in which double blinding of ICU personnel and patient was not feasible, only one study reported blinding of the data analyst. In such cases, it would also be desirable to specify at least one objectively assessed outcome, even if the outcome of primary interest is subjective [12].

Reporting of the sample size calculation has greatly increased in recent decades, from 4 % in 1980 to 83 % in 2002 [35–37]. In a recent review of general medical journals with high impact factors, only 5 % of RCTs did not report any sample size calculation [35]. Despite this, calculations were frequently based on inaccurate assumptions about the control group and were often erroneous [35]. A priori calculation of sample size is intended to provide a sample size large enough to detect a postulated treatment effect with reasonable confidence [38]. We found that the methods and assumptions used for sample size estimation were reported in 43 % of RCTs published between 2001 and 2010. Although sample sizes were significantly larger than in the period 1975–2000, they remained small. As expected, overestimation of the event rate in the treatment group, a major determinant of the sample size calculation, was commonplace; in fact, we documented an inflated predicted treatment effect in 7 of 11 RCTs (64 %) reporting both predicted and observed treatment effects on mortality. Aberegg et al. [13] found a similar proportion (68 %) among 38 RCTs published in high impact factor journals evaluating the effect of therapies on critical care mortality. Underpowered trials are often viewed as a major problem in clinical research. RCTs that are too small can be misleading, either by missing realistically moderate treatment effects that would be clinically important [38], or by over-estimating the size of a treatment effect and finding it statistically significant purely by chance [8, 38]. Moreover, neglecting to report sample size calculations suggests methodological weakness [39, 40]. In contrast, others view underpowered trials as a potential resource, since they may still convey relevant information that can be incorporated into systematic reviews and meta-analyses, provided that bias is avoided and reporting is exhaustive [4, 9, 35, 38–41].

In assessing the quality of RCTs, we relied mostly on the analysis of individual components of methodological quality rather than on the Jadad scale. This scale is the only one developed using established standards, and low scores have been associated with exaggerated effect estimates [42]. However, despite its thorough development and validation, the scale is problematic for several reasons, including the fact that it penalizes research areas where blinding to treatment may not be feasible, such as critical care medicine or surgery [42]. In our study, RCTs in which double blinding was not feasible received an over-pessimistic evaluation on the Jadad scale, which should caution against using this scale as the sole method to synthesize the methodological quality of reporting of RCTs. With this limitation in mind, we found no improvement over time in Jadad scores.

A spin strategy was used in a considerable proportion of RCTs with statistically non-significant primary outcomes published in the period 2001–2010, with no change over time. Reasons for spin remain speculative, possibly including the pursuit of personal interests or corporate economic gain, which may increase public distrust in science [43]. Authors might use spin strategies simply to increase the chance of publication, but our findings suggest that this may be futile as well as wrong. In fact, we observed a significant increase over time in the proportion of RCTs with statistically non-significant findings published in Intensive Care Medicine, in line with current recommendations for journal publication policies [44]. Sometimes, however, it may be difficult to distinguish spin from genuine mistakes in interpreting studies with “negative results”. For example, a common error in the medical literature is the interpretation of absence of evidence as evidence of absence [45], which might explain the interpretation of non-significant findings as a demonstration of treatment equivalence by the authors of some of the RCTs. Indeed, establishing therapeutic equivalence between treatments of proven efficacy requires specifically designed RCTs with predefined equivalence margins [46]. Although the specific impact of spin on the interpretation of RCT findings by peer reviewers and readers is unclear, “the fairness of results reporting” has recently been shown to play an essential role in physicians’ evaluation of a trial’s validity, influencing their willingness to believe and act on the trial findings [47, 48].

Intensive Care Medicine’s impact factor more than doubled over the last decade, and yet the quality of reporting of its RCTs did not improve substantially. The impact factor is often used as an indicator of the quality of the science published in a journal, but it in fact depends on a number of elements beyond scientific merit [49, 50], including the role of industry-supported research [51–53] and editorial strategies that influence its calculation (e.g., reduction of citable articles) [53, 54]. Evidence on the quality of published articles and their methodological weaknesses is therefore important for ensuring a journal’s improvement over time through targeted actions, such as the introduction or maintenance of specific methodological requirements for authors and reviewers.

Citation of scientific articles by other researchers is an important indicator of the dissemination of research findings and reflects the impact of an article within the scientific community. However, whether the number of citations reflects the methodological quality of a paper has been questioned, because other factors, including journal reputation [55–57], the country where the research was conducted [58], and the reporting of positive results [56, 57], may count more than the design merits of the study. We found that better reporting was associated with a statistically significant increase in citations of about one-third in all three databases considered: Web of Science, Scopus, and Google Scholar.

In conclusion, our analysis shows that the quality of reporting of RCTs published in Intensive Care Medicine between 2001 and 2010 has not substantially improved compared with previous years. We recommend adherence to the CONSORT recommendations, with special emphasis on accurate description of randomization and blinding, and on correct reporting and discussion of results in RCTs with statistically non-significant findings. Improved methodological reporting may help in selecting articles with greater scientific impact and may have beneficial effects on Journal citations.