AIP questions 1–3 (engagement, insight and action) and question 4 (global assessment) are rated on a 1–7 scale; question 5 is a binary yes/no recommendation on revalidation.

| Raters | Questions 1–3: internal consistency (G) | Questions 1–3: inter-rater (G)† | Question 4: inter-rater (G) (ICC)* | Question 4: 95% CI‡ | Question 5: inter-rater (G) (ICC)* | Question 5: 95% CI‡ |
|--------|------------------------------------------|----------------------------------|-------------------------------------|---------------------|-------------------------------------|---------------------|
| 1 | **0.94** | 0.71 | 0.66 | – | 0.54 | – |
| 2 | **0.96** | **0.83** | 0.79 | 0.68 to 0.88 | 0.70 | 0.54 to 0.83 |
| 3 | **0.96** | **0.88** | **0.85** | 0.78 to 0.91 | 0.78 | 0.69 to 0.86 |
| 4 | **0.97** | **0.91** | **0.89** | 0.84 to 0.93 | **0.83** | 0.75 to 0.89 |
| 5 | **0.97** | **0.92** | **0.91** | 0.87 to 0.94 | **0.86** | 0.80 to 0.91 |
| 6 | **0.97** | **0.94** | **0.92** | 0.89 to 0.95 | **0.88** | 0.83 to 0.92 |
Reliabilities greater than 0.8, as required for high-stakes assessment, are given in bold.9
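The way the inter-rater reliabilities climb as raters are added follows the standard Spearman–Brown relationship for the average of k raters, R_k = k·R_1 / (1 + (k − 1)·R_1). As a quick consistency check (not the authors' stated computation, which reports G coefficients/ICCs), the sketch below projects each single-rater coefficient from the table to 2–6 raters; the projections agree with the multi-rater figures above to within about 0.01, the small discrepancies being attributable to the single-rater inputs being rounded to two decimal places.

```python
# Sketch: Spearman-Brown projection of a single-rater reliability to k raters.
# Illustrative only. The single-rater coefficients below are the one-rater row
# of the table; projecting them forward is a consistency check, not the paper's
# stated method of calculation.

def spearman_brown(r1: float, k: int) -> float:
    """Reliability of the mean of k raters, given single-rater reliability r1."""
    return k * r1 / (1 + (k - 1) * r1)

single_rater = {
    "Questions 1-3 inter-rater": 0.71,
    "Question 4 inter-rater": 0.66,
    "Question 5 inter-rater": 0.54,
}

for label, r1 in single_rater.items():
    projected = " ".join(f"{spearman_brown(r1, k):.2f}" for k in range(1, 7))
    print(f"{label}: {projected}")

# Output for raters 1 to 6:
# Questions 1-3 inter-rater: 0.71 0.83 0.88 0.91 0.92 0.94
# Question 4 inter-rater: 0.66 0.80 0.85 0.89 0.91 0.92
# Question 5 inter-rater: 0.54 0.70 0.78 0.82 0.85 0.88
```

Run against the one-rater row, the projection reproduces the table's trajectory, which is why the 0.8 threshold is first crossed with two raters for questions 1–3 but only with four raters for the binary question 5.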
* Intraclass correlation coefficients (ICCs) are equivalent to G coefficients in a one-facet (rater) design.
† Inter-rater reliability is the extent to which one rater's assessments (or, when based on multiple raters, the average of the raters' assessments) are predictive of another rater's assessments.
‡ 95% CIs for reliabilities (ICCs) were calculated using Fisher's Z_r transformation, which depends on the number of raters (k) through a denominator of (k − 1) and therefore cannot be calculated when there is only one rater.9
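Footnote ‡ names the transformation but not the working. The sketch below uses one commonly cited Fisher-type variance-stabilising transformation for intraclass correlations, in which the number of raters k enters through a (k − 1) term, matching the footnote's explanation of why no interval exists for a single rater. The specific standard-error formula and the number of rated subjects n are assumptions for illustration, not values from the paper, so the resulting intervals will not necessarily reproduce those in the table.

```python
# Sketch: approximate 95% CI for an ICC via a Fisher-type z transformation.
# Assumed formulae (not taken from the paper): z = 0.5*ln((1 + (k-1)*icc) / (1 - icc))
# with approximate standard error sqrt(k / (2*(k - 1)*(n - 2))), where k is the
# number of raters and n the number of rated subjects (n = 50 below is hypothetical).
import math

def icc_fisher_ci(icc: float, k: int, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for an ICC based on k raters and n subjects."""
    if k < 2:
        raise ValueError("Interval undefined for one rater: the (k - 1) term is zero.")
    z = 0.5 * math.log((1 + (k - 1) * icc) / (1 - icc))   # transform to the z scale
    se = math.sqrt(k / (2 * (k - 1) * (n - 2)))            # approximate SE on the z scale

    def back(zz: float) -> float:
        # Invert the transformation to return to the ICC scale.
        return (math.exp(2 * zz) - 1) / (math.exp(2 * zz) + k - 1)

    return back(z - z_crit * se), back(z + z_crit * se)

# Illustrative call: an ICC of 0.79 from 2 raters across a hypothetical 50 portfolios.
print(tuple(round(x, 2) for x in icc_fisher_ci(0.79, k=2, n=50)))  # (0.66, 0.88)
```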