
The ability of a behaviour-specific patient questionnaire to identify poorly performing doctors
  1. Bård Fossli Jensen1,
  2. Fredrik A Dahl1,
  3. Dana Gelb Safran2,
  4. Andrew M Garratt3,
  5. Edward Krupat4,
  6. Arnstein Finset5,
  7. Pål Gulbrandsen1,6
  1. HØKH Research Centre, Akershus University Hospital, Lørenskog, Norway
  2. Department of Medicine, Tufts University School of Medicine; and Blue Cross Blue Shield of Massachusetts, Boston, Massachusetts, USA
  3. National Resource Centre for Rehabilitation in Rheumatology, Diakonhjemmet Hospital, Oslo, Norway
  4. Center for Evaluation, Harvard Medical School, Boston, Massachusetts, USA
  5. Department of Behavioural Sciences, Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway
  6. Institute of Clinical Medicine, Campus Ahus, University of Oslo, Oslo, Norway
  1. Correspondence to Dr Bård Fossli Jensen, Akershus University Hospital, HØKH Research Centre, Postboks 95, 1478 Lørenskog, Norway; b.f.jensen{at}


Background Doctors' ability to communicate with patients varies. Patient questionnaires are often used to assess doctors' communication skills.

Objective To investigate whether the Four Habits Patient Questionnaire (4HPQ) can be used to assess the different skill levels of doctors.

Design A cross-sectional study of 497 hospital encounters with 71 doctors. Encounters were videotaped and patients completed three post-visit questionnaires.

Setting A 500-bed general teaching hospital in Norway.

Main outcome The proportion of video-observed between-doctor variance that could be predicted by 4HPQ.

Results There were strong correlations between all patient-reported outcomes (range 0.71–0.80 at the doctor level, p<0.01). 4HPQ correlated significantly with video-observed behaviour at the doctor level (Pearson's r=0.42, p<0.01) and the encounter level (Spearman's ρ=0.27, p<0.01). The proportion of between-doctor variance not detectable by 4HPQ was 88%. The reason for this discordance was large within-doctor between-encounter variance observed in the videos, and small between-patient variance in patient reports. The maximum positive predictive value for the identification of poorly performing doctors (92%) was achieved with a cut-off score for 4HPQ of 82% (ie, patient assessments were concordant with expert observers for these doctors).

Conclusion Using a patient-reported questionnaire of doctors' communication skills, favourable assessments of doctors by patients were mostly discordant with the views of expert observers. Only very poor performance identified by patients was in agreement with the views of expert observers. The results suggest that patient reports alone may not be sufficient to identify all doctors who need communication skills training.

  • Communication
  • patient-centred care
  • patient satisfaction
  • healthcare quality
  • improvement
  • medical education


The communication between doctors and patients is an important factor contributing to the quality of healthcare and patient safety.1 2 In 1982 Mumford et al consolidated the growing evidence that interviewing and related skills had a significant impact on a wide range of clinical outcomes.3 The main principles of good communication have now been established in systematic reviews of the literature and in consensus statements.4–7

Different approaches are available for studying doctors' communication skills.8–11 The two most common are observation of audiotaped or videotaped consultations and patient surveys; a variety of less commonly used methods also exist.12 Patient questionnaires have the advantage of being simple to administer and less costly than other methods such as observation. Observation has several advantages: once encounters are recorded or transcribed, they can be analysed without the time restriction of real-time observations, and the results can be checked or reproduced by others. However, collecting observations is expensive and time consuming, resources are required to store them securely, and trained personnel are needed to code or interpret them. Observation may also be stressful for both patients and doctors because sensitive information is recorded and evaluated by others. On the other hand, evaluation of communication by direct observation may, through the Hawthorne effect, have a stronger effect on behaviour than patient surveys.

Several instruments widely used to assess patient experiences and satisfaction with healthcare include dimensions of communication and information giving.12–17 The development of these instruments is usually based on a literature review, patient interviews and expert group opinion. Following development, psychometric properties are assessed, including construct validity, which involves comparisons with related variables and scores from instruments that assess related constructs. While patient experiences and satisfaction are recognised as important components of care quality, to the authors' knowledge no research has been published comparing scores from patient questionnaires and video-coding using instruments with virtually identical content. One large US study that combined the analysis of audiotapes with performance data concluded that patients' perceptions of their doctors as measured by surveys are highly susceptible to unmeasured patient confounding, for example, when patients' reports on a doctor's communication style were positively associated with the number of glycohemoglobin tests the doctor ordered.18

In the pilot preparation of a randomised controlled trial designed to evaluate the effect of a communication skills training program, the authors developed a patient questionnaire with highly specific items directly related to the skills taught, the Four Habits Patient Questionnaire (4HPQ).19 In the trial all encounters were videotaped, and patients completed this questionnaire along with the communication-specific and information-specific items of the OutPatient Experiences Questionnaire (OPEQ)13 and the global satisfaction item of the Consumer Assessment of Healthcare Providers and Systems (CAHPS).14 OPEQ and CAHPS have been used as measures of care quality in national surveys of patients in Norway and the USA.13 14 These questionnaires have undergone validity testing, but in the absence of a gold standard measure of patient experiences, comparison with a more direct measure of communication skills will further contribute to evidence for their construct validity. The aim of this study was to establish the ability of 4HPQ to distinguish doctors who are poor communicators from those who are good communicators, and to report correlations between the patient reports and video-based observations.

Material and methods

The randomised controlled trial took place in a 500-bed general teaching hospital in the capital area of Norway between April 2007 and June 2008. Participating doctors were representative of all doctors in the hospital according to age, gender and position, but surgeons were under-represented. The researcher ensured in advance that the doctor would be present, whether this was in the outpatient clinic, at the ward for rounds, or on call for the emergency room. The researcher then recruited patients consecutively at each location, right before they were due to meet the doctor. The trial excluded 77 of 574 (13%) available encounters—34 (6%) because the patient declined participation—leaving 497 of 574 (87%) encounters. Details of recruitment are given in previous papers.20 21 Real encounters from different medical specialties were videotaped. Encounters included outpatient appointments, bedside visits on rounds and inpatient encounters as part of diagnostic or therapeutic procedures. The doctors had up to eight encounters with different patients.

Study design

Although the study was a randomised controlled trial, all collected data were treated as cross-sectional under the assumption that any change in skills throughout the observation period should not affect the within-encounter association between observations and patient-reported outcomes.


The videotapes were evaluated with the Four Habits Coding Scheme (4HCS), which includes 23 items rated on five-point scales ranging from not very effective to highly effective behaviour (table 1, figure 1). 4HCS relates directly to the communication skills taught in the training program ‘the Four Habits approach to effective clinical communication’, developed at Kaiser Permanente, a program that adheres to established principles for good communication skills training.4–7 22 In the training program, important elements of good communication are divided into four easily remembered groups (habits) for didactic purposes. The items in 4HCS can be recognised within these habits: Habit I (Invest in the beginning to create rapport quickly and plan the visit; six items), Habit II (Elicit the patient's perspective; three items), Habit III (Demonstrate empathy; four items), and Habit IV (Invest in the end to focus on effective decision making and information sharing; 10 items). 4HCS has been used in several studies23–25 and has been validated against the Roter Interaction Analysis System (RIAS),26 one of the most widely used instruments for describing doctor–patient communication.27 Four experienced psychology students were trained to use 4HCS; inter-rater reliability as measured by the intraclass correlation was >0.71.28 The individual items were summed and transformed to a 0–100 scale, with 100 being the best possible score.

Table 1

Four Habits Coding Scheme

Figure 1

Scatter plot of doctors' communication skills assessed with expert coding and patient questionnaire. Doctors with patient score below best possible cut-off score marked in red.

4HPQ was developed from a 23-item patient questionnaire directly related to the Four Habits and items of 4HCS and it has evidence for validity and reliability.19 The items have a four-point scale of ‘definitely yes’, ‘somewhat yes’, ‘somewhat no’ or ‘definitely no’. The analyses revealed that 10 items satisfied formal criteria for inclusion, including missing values below 10% and Cronbach's α of the resulting scale above the much-used criterion of 0.7. Another five items were retained for the main study because these items were considered crucial elements of doctor–patient communication and therefore important for content validity. The 15 items, covering Habit I (four items), Habit II (two items), Habit III (three items), and Habit IV (six items), were completed by the patients after the encounter (table 2). Scores were calculated in the same way as 4HCS.

Table 2

Questions in the Four Habits Patient Questionnaire (4HPQ)

OPEQ is a 24-item instrument with evidence for reliability and validity following a Norwegian national survey of patients.13 This study included the six-item OPEQ scale that is related to the doctor's communication: ‘Was the doctor well prepared for this encounter?’ (Item 1), ‘Was it clear to you how you should care for yourself after the encounter?’ (Item 2), ‘Did the doctor speak to you so you could understand him/her?’ (Item 3), ‘Did you have confidence in the doctor's competence?’ (Item 4), ‘Did you feel that the doctor cared for you?’ (Item 5), and ‘Did you get the opportunity to express your most important concerns?’ (Item 6). Following evidence that a five-point scale outperformed the 10-point scale,29 the five-point scale from ‘not at all’ to ‘a very large extent’ was used for the six items. The items were summed and transformed to a 0–100 scale, with 100 being the best possible experience of care.

CAHPS is a 39-item questionnaire with evidence for reliability and validity.14 The single global item ‘Using any number from 0 to 10, where 0 is the worst doctor possible and 10 is the best doctor possible, what number would you use to rate this doctor?’ was included in this study as a measure of global satisfaction. For the purposes of this study, patient responses were transformed to a 0–100 scale, with 100 being the best possible.
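The scoring used for all three instruments (items summed and transformed to a 0–100 scale, with 100 being best) can be sketched as follows. This is an illustrative sketch, not the authors' code; the function name and the exact linear transformation are assumptions.

```python
def scale_0_100(item_responses, item_min, item_max):
    """Sum item responses and rescale linearly to 0-100,
    where 100 is the best possible score.

    item_min/item_max: the lowest and highest response category
    for a single item (e.g. 1 and 4 for the four-point 4HPQ items,
    1 and 5 for the five-point 4HCS and OPEQ items).
    """
    n = len(item_responses)
    total = sum(item_responses)
    lowest, highest = n * item_min, n * item_max
    return 100.0 * (total - lowest) / (highest - lowest)

# A patient answering 'definitely yes' (4) on all 15 4HPQ items:
best = scale_0_100([4] * 15, item_min=1, item_max=4)  # 100.0
```

The same function covers the CAHPS global item by treating it as a single 0–10 item.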


A commonly used technique when few data are missing is mean value imputation.30 In this case it is preferable to preserve the ordinal property of the response variable, so for all missing values, the mode value was used instead. The difference is negligible because the percentage of missing values is very low, and the frequency of the mode value dominates the other values. Correlation was assessed at the encounter and doctor levels between the 4HCS total score and those for 4HPQ, OPEQ and CAHPS global satisfaction using Pearson's r when the underlying data were normally distributed and Spearman's ρ when they were not. Correlations were weighted according to the number of encounters each doctor had.
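The mode imputation and encounter-weighted correlation described above can be illustrated as below. This is a minimal sketch under stated assumptions (the paper does not give formulas), and all names are hypothetical.

```python
from collections import Counter

def impute_mode(item_columns):
    """Replace missing values (None) in each item column with that
    column's mode, preserving the ordinal response categories."""
    imputed = []
    for col in item_columns:
        observed = [v for v in col if v is not None]
        mode = Counter(observed).most_common(1)[0][0]
        imputed.append([mode if v is None else v for v in col])
    return imputed

def weighted_pearson(x, y, w):
    """Pearson correlation with observation weights, e.g. weighting
    each doctor by the number of encounters contributed."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
    return cov / (vx * vy) ** 0.5
```

With the mode dominating and under 1% of values missing, mode and mean imputation give virtually identical results, as noted above.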

With several observations per doctor, the authors were able to calculate the precision of 4HCS (the video observation score) for each doctor, and the corresponding patient reports. The proportion of the between-doctor variance that would not be detectable, even with an unlimited number of patient questionnaires, could also be estimated. Details of this analysis are given in the online supplementary appendix. Data were analysed with PASW Statistics 18.0.
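The estimate above rests on separating between-doctor variance from within-doctor (between-encounter) variance. A minimal one-way random-effects sketch of that decomposition is given below; the study's actual model is in the online supplementary appendix, so this illustration and its function name are assumptions.

```python
def variance_components(groups):
    """One-way random-effects decomposition: estimate between-doctor
    and within-doctor (between-encounter) variance from per-doctor
    lists of encounter scores (balanced or unbalanced design)."""
    k = len(groups)
    n_i = [len(g) for g in groups]
    N = sum(n_i)
    grand = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, means))
    ss_between = sum(n * (m - grand) ** 2 for n, m in zip(n_i, means))
    ms_within = ss_within / (N - k)
    ms_between = ss_between / (k - 1)
    # effective group size for unbalanced designs
    n0 = (N - sum(n ** 2 for n in n_i) / N) / (k - 1)
    var_within = ms_within
    var_between = max(0.0, (ms_between - ms_within) / n0)
    return var_between, var_within
```

Large within-doctor variance relative to between-doctor variance is exactly the pattern reported in the Results: even an unlimited number of patient reports per doctor cannot then recover most of the video-observed between-doctor differences.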

Ethics and privacy

The study was approved by the Regional Committee for Medical Research Ethics of Southeast Norway (1.2007.356), and privacy measures accepted by the Privacy Ombudsman for Research in Norwegian universities (NSD approval 16423/2007).


Response rates

Table 3 describes the sample characteristics. There were no missing items for 4HCS. All 497 patients (100%) responded to the questionnaire; 0.6% (45 of 7455) of the items in 4HPQ and 0.8% (25 of 2982) of the items in OPEQ were missing and all patients completed the single item from the CAHPS. Missing responses were fairly evenly distributed across items. For 4HPQ the range was from zero (0.0%) missing for items 7 and 13 to eight (1.6%) missing for item 14. For OPEQ the range was from two (0.4%) missing for items 3 and 4 to seven (1.6%) missing for items 2 and 6.

Table 3

Descriptive data

Correlations and predictive ability

The internal consistency of the instruments mapping Four Habits specific behaviour, as measured by Cronbach's α, was 0.85 for 4HCS and 0.87 for 4HPQ. For 4HCS the sum scores were normally distributed at both the encounter and doctor levels, with a mean sum score at the doctor level of 40% for all doctors (internal medicine 44%, surgery 33%, other 43%; one-way ANOVA p=0.001). For 4HPQ, OPEQ and CAHPS the sum scores were normally distributed at the doctor level, but skewed in the positive direction at the encounter level. Scores for 4HPQ, OPEQ and the global satisfaction measure from CAHPS were highly correlated (Pearson's r) at the doctor level (4HPQ vs OPEQ 0.79; 4HPQ vs CAHPS global satisfaction 0.80; OPEQ vs CAHPS global satisfaction 0.71; all p<0.01). At the encounter level, the correlation (Spearman's ρ) between 4HPQ and 4HCS was 0.27 (p<0.01): 0.20 (p<0.01) with prior knowledge of the doctor, and 0.30 (p<0.01) without. At the doctor level, the statistically significant correlations (Pearson's r) between the patient questionnaires and 4HCS were low to moderate (table 4). The estimated proportion of the between-doctor variance that would not be detectable by 4HPQ was 88% for all doctors. The reason for this high proportion was large within-doctor between-encounter variance observed in the videos, and small between-patient variance in the patient reports. The proportions for internal medicine, surgery and other were 95%, 58% and 88%, respectively. There were only minor differences related to types of encounter.
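Cronbach's α, reported above as 0.85 for 4HCS and 0.87 for 4HPQ, is computed from the item variances and the variance of the total score. A small illustrative implementation (hypothetical name, not the authors' code):

```python
def cronbach_alpha(item_columns):
    """Cronbach's alpha from item columns: a list of lists, one inner
    list per item, aligned across respondents.

    alpha = k/(k-1) * (1 - sum(item variances) / variance of totals)
    """
    k = len(item_columns)
    n = len(item_columns[0])

    def sample_var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = sum(sample_var(col) for col in item_columns)
    totals = [sum(col[i] for col in item_columns) for i in range(n)]
    return k / (k - 1) * (1 - item_vars / sample_var(totals))
```

Values above the much-used 0.7 criterion mentioned in the Methods indicate that the items of a scale covary enough to be summed into a single score.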

Table 4

The results and correlations between patient reports and the Four Habits Coding Scheme—standardised values (0–100)

The upper left rectangle in the scatter plot (figure 1) shows that a large proportion of the doctors with scores below the mean on 4HCS achieved scores above the mean on 4HPQ. There is no agreed limit between acceptable and unacceptable communication skills. However, regardless of the choice of 4HPQ cut-off score, sensitivity for poor communication skills in this material was low. The highest positive predictive value (92%) was found when the cut-off was set at 82% of the maximum achievable 4HPQ score. Twelve doctors (three internal medicine, seven surgery, two other) had mean scores below this level, but as the SEs of the means were large, several of their patients gave them excellent ratings. The outlier in the lower left represents a doctor videotaped once and weighted accordingly; it is included in the scatter plot for completeness of the dataset. For this outlier, 4HPQ corresponds well with 4HCS, and its exclusion would decrease the correlation between 4HPQ and 4HCS, changing the results slightly in favour of the conclusion.
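The positive predictive value at a given cut-off, as used above to flag poorly performing doctors, can be sketched as follows; the function and thresholds are illustrative assumptions, not the study's analysis code.

```python
def ppv_at_cutoff(patient_scores, expert_scores, cutoff, expert_threshold):
    """Positive predictive value: among doctors flagged by the patient
    questionnaire (mean 4HPQ score below `cutoff`), the fraction whom
    expert coders (mean 4HCS score below `expert_threshold`) also
    rate as poor. Returns None if no doctor is flagged."""
    flagged = [(p, e) for p, e in zip(patient_scores, expert_scores)
               if p < cutoff]
    if not flagged:
        return None
    true_pos = sum(1 for _, e in flagged if e < expert_threshold)
    return true_pos / len(flagged)
```

Sweeping `cutoff` over the observed score range and keeping the value that maximises this quantity corresponds to the 82% cut-off reported above; sensitivity (flagged poor doctors as a fraction of all poor doctors) would be computed analogously and, per the Results, remains low at every cut-off.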


The 15-item patient questionnaire (4HPQ) that assesses specific elements of communication demonstrates good validity. The moderate correlation at the doctor level with the video-observed behaviour (4HCS, r=0.42) was highly significant. Highly significant correlations were also found with the US-developed global satisfaction score (CAHPS, r=0.80) and the communication and information items of a Norwegian patient experiences questionnaire that has been used in national surveys (OPEQ, r=0.79). The two patient questionnaires that are specific to communication, 4HPQ and OPEQ, were more strongly correlated with the video-observed behaviour than the CAHPS global measure of satisfaction. The components of these two questionnaires included in the present study assess doctor behaviour as distinct from more general patient experiences and satisfaction. Although 4HPQ was specifically designed to detect patient observations of the skills evaluated by expert observers, its 15 items correlated only slightly more strongly with 4HCS than the six OPEQ items did, which demonstrates that increasing the precision of the patient questionnaire is not by itself sufficient to improve the correlation.

The authors conclude that even an established instrument (OPEQ) and a more specifically focused instrument (4HPQ) may be insufficient to identify all doctors who communicate poorly. One reason is that, according to expert judgement, doctors perform differently from patient to patient; if expert judgement were used as the gold standard for characterising doctor performance, that standard would itself carry substantial variance. The authors found a low (but still statistically significant) correlation (0.27) at the encounter level, which suggests that the reports of single patients do not capture the qualities of communication evaluated by experts. That the doctor-level correlation exceeds the encounter-level correlation is most likely explained by within-doctor consistency of behaviour being reflected in the expert judgements: there were only four experts, with high inter-rater reliability, whereas each doctor was assessed by many different patients who were given no pre-visit instruction on how to complete the questionnaires. Further, patient reports were strongly skewed with low variance. This contributes to the discussion of whether patients' experiences or expert judgement should be considered the gold standard for evaluation of communication skills.

Why would patient and expert observers differ in their assessment of doctor communication? Part of the answer may be linked to the fact that patients' prior knowledge or experience with the doctor is likely to bear on the encounter. Thirty-six per cent of patients in our study indicated at least some prior experience with the doctor. Patients' assessments of their experiences are almost certainly influenced by their prior experiences with the doctor (and possibly, in comparative terms, their prior experiences with other doctors). Expert observers, however, are trained to adhere to an objective scale in evaluating each encounter individually, and the top score is extremely hard to achieve. Consequently, as observed, correlation between expert judgement and patient report is lower when the doctor and patient have met before.

But perhaps too much is asked of patients when they are requested to evaluate, in detail, the communication skills of doctors (how they do things) while trying to focus on the content of the conversation at an emotionally important moment in their life. Patients might simply not be able to detect, and hold doctors accountable for, specific communication behaviours; that is, they may miss nuances of communication that expert observers detect. First, patients are likely to differ in their ability to observe a doctor's specific behaviour. Second, as noted by others,12 18 exogenous patient characteristics may conflate reports about the healthcare experience with reports about the doctor. Third, some evidence suggests that doctors adapt their communication behaviour from encounter to encounter.31 This could be an adjustment in response to the patient's needs, but it could also reflect negative emotions due to the nature of the task, patient behaviour or external stressors. Fourth, patient experiences, and hence their assessments of care, may depend heavily on the doctor's ability to solve the medical problem, which may contaminate responses to other, more specific questions. Patient experiences and satisfaction scores are often skewed towards positive ratings, and consistent with this, low variability was observed on the communication-specific and information-specific questions. The combination of low between-patient variability and high within-doctor variability resulted in low sensitivity for doctor performance, at least for the patient questionnaires used in this study. A recent study in British general practice made a similar observation: using multilevel modelling with data from 4573 patients visiting 150 doctors, that study found that only 6.3% of the variance in doctors' communication skills was due to differences between doctors, while 92.4% of the variance occurred at the patient level.32 However, importantly, the present study found that very poor communication as reported by a doctor's patients was highly concordant with the views of expert observers: at the optimal cut-off, patients' assessments were highly predictive of poor communication. The same was not true for good performance; several doctors who were found to communicate poorly by expert observers were evaluated favourably by their patients.

4HCS builds on a teaching program that is based on established principles of good patient-centred communication.22 The significant correlation between video-observed behaviour as measured by 4HCS and global patient satisfaction suggests that the Four Habits approach includes elements of patient-centred communication that are of importance to patients in the hospital setting as well. Our material is heterogeneous, involving doctors from different clinical specialities observed during outpatient appointments (including technical procedures such as echocardiography and electromyography), emergency room evaluations, and bedside encounters on ward rounds. Not all items are equally relevant to all of these situations, but the authors evaluated them in the same way in all videos in order to maintain high reliability. Differences were not observed among outpatient visits, rounds or emergency room encounters. 4HCS has previously been validated against the widely used communication behaviour coding system RIAS,26 27 and has also been used in several other studies.23–25 Satisfactory inter-rater reliability was achieved in all of these studies and was above 0.7 in the present study.28 The authors have no reason to think that the evaluation method is flawed, although the use of four coders introduces estimation errors; however, videotapes were distributed randomly to coders, so no systematic bias should have been introduced.28 Nevertheless, some 4HCS items, particularly those pertaining to the beginning of the consultation and elicitation of the patient's perspective, could contribute to an artificially low score when things go unsaid because the patient is well known to the doctor. The variance not detected by 4HPQ was lower for surgeons; one possible reason might be the lower mean score achieved by this subgroup.

There is no agreed definition or criterion relating to acceptable and unacceptable communication skills; the optimal cut-off point was added to the scatter plot merely for illustrative and descriptive purposes. Assuming that the method of evaluation is valid, there is a large potential for improvement for most of the participating doctors. Perhaps the criteria applied are too severe, an ideal that doctors have no fair chance of living up to; the favourable evaluations by patients would support this view. But poor communication has been identified as a key reason for things that go wrong in diagnosis and management,1 2 and from that perspective ambitious goals are necessary. Given that the survey instruments evaluated in this study had limited ability to identify all poorly performing doctors, as judged by expert observers, the authors suggest that hospitals should consider developing strategies to improve the communication skills of all doctors.

Strengths and limitations

This is the first large-scale study to investigate the correspondence of a patient questionnaire and objective coding using instruments with virtually identical content derived from the same conceptual model. Previous studies relating to patient experiences and satisfaction measurement have largely focused on questionnaire development and evaluation, including tests of construct validity,17 and associations with aspects of structure, process, outcome and other patient data including sociodemographic variables.33 The study had high participation rates for doctors20 and patients,21 and data were collected from a wide variety of 497 hospital encounters, suggesting that findings are not likely to be affected by selection bias.

There are a large number of questionnaires available for assessing patient experiences and satisfaction with healthcare,16 17 and this study included just three: the OPEQ scale of communication,13 one item relating to overall patient satisfaction from the CAHPS14 and the 4HPQ developed for this study.19 The inclusion of other questionnaires assessing patient experiences, which have good evidence for reliability and validity, may have served to improve the correlation with 4HCS. OPEQ was included because it has good evidence for data quality, reliability and validity in Norway and has been used in a national survey.13 The development of the questionnaire followed a literature review, patient interviews and expert group input. This process was designed to lend OPEQ content validity, and the literature review took account of the content of existing questionnaires. Therefore the content of the OPEQ communication scale is reflective of patient experiences and satisfaction questionnaires that assess communication more generally. For reasons of acceptability to patients, additional questions could not be included from nationally used surveys in other countries.

The finding that a large proportion of doctors had high scores representing good communication for the patient questionnaires follows existing evidence that scores from such questionnaires are typically skewed towards good experiences or high levels of satisfaction with care.29 However, this study shows that high scores derived from patient surveys do not always equate with good performance as assessed by experts. This suggests the value of further study involving doctors for whom patient versus expert assessment of communications skills diverge—perhaps employing outcome measures such as adherence to doctor advice—to gain further insight into the validity of each assessment approach.


An extensive comparison between patient reports of doctors' communication skills and video-observed behaviour found a significant positive correlation. In particular, doctors assessed to be poor communicators by their patients were similarly viewed by expert observers. However, the large proportion of doctors who were assessed by patients to have good communication skills who were assessed unfavourably by expert observers suggests that patient reports alone may not be sufficient to identify all doctors who would benefit from communication skills improvement training.


We are indebted to the video coders Wenche Moastuen, Tonje L. Stensrud, Evelyn Andersson, and Anneli Mellblom, who did the substantial work of rating the videos. We also thank Jurate Saltyte Benth for designing the inter-rater reliability worksheets, Erik Holt for digitising the videotapes, and Haldor Husby for scanning and checking all questionnaires.


Supplementary materials

  • Supplementary Data



  • Linked article 000323.

  • Disclosures: Bayer Pharma and the Norwegian Chiropractor Association have paid Pål Gulbrandsen and Arnstein Finset/Bård Fossli Jensen respectively for giving lectures on the Four Habits model. The Norwegian Association for General Practitioners paid Pål Gulbrandsen and Bård Fossli Jensen for running a communication skills course during their annual meeting.

  • Funding The study was funded by the Regional Health Enterprise for Southeast Norway. The funding body did not influence any part of the scientific process.

  • Competing interests None.

  • Ethics approval The study was approved by the Regional Committee for Medical Research Ethics of Southeast Norway (1.2007.356).

  • Provenance and peer review Not commissioned; externally peer reviewed.
