Interrater Reliability

The concept of interrater reliability refers to the relative consistency of the judgments made of the same stimulus by two or more raters. In survey research, interrater reliability relates to the observations that in-person interviewers may make when they gather observational data about a respondent, a household, or a neighborhood in order to supplement the data gathered via a questionnaire. It also applies to judgments an interviewer may make about the respondent after the interview is completed, such as recording on a 0 to 10 scale how interested the respondent appeared to be in the survey. Another example occurs whenever a researcher has interviewers complete a refusal report form immediately after a refusal takes place; here the issue is how reliable the data are that the interviewer records on that form. The concept likewise applies to the reliability of the coding decisions that coders make when they turn open-ended responses into quantitative scores.

Interrater reliability is rarely quantified in these survey examples because of the time and cost it would take to generate the necessary data, but if it were measured, it would require that a group of interviewers or coders all rate the same stimulus or set of stimuli. Instead, interrater reliability in applied survey research is more like an ideal that prudent researchers strive to achieve whenever data are being generated by interviewers or coders.

An important factor that affects the reliability of ratings made by a group of raters is the quantity and the quality of the training they receive. Their reliability can also be impacted by the extent to which they are monitored by supervisory personnel and the quality of such monitoring.

A common method for statistically quantifying the extent of agreement between raters is the intraclass correlation coefficient, often denoted by the Greek letter rho (ρ). In all of the examples mentioned above, if rating data are not reliable, that is, if the raters are not consistent in the ratings they assign, then the value of the data to researchers may well be nil.
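
To make the intraclass correlation concrete, the following is a minimal Python sketch of ICC(2,1), the two-way random-effects, single-rater, absolute-agreement form described by Shrout and Fleiss (1979) in the Further Readings. The function name icc_2_1 and the sample ratings are illustrative assumptions for demonstration, not material from this entry, and the sketch assumes a fully crossed design in which every rater scores every subject.

    import numpy as np

    def icc_2_1(ratings):
        # ratings: n subjects (rows) by k raters (columns);
        # every rater must score every subject (fully crossed design)
        ratings = np.asarray(ratings, dtype=float)
        n, k = ratings.shape
        grand = ratings.mean()
        ss_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2)  # between subjects
        ss_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2)  # between raters
        ss_err = np.sum((ratings - grand) ** 2) - ss_rows - ss_cols  # residual
        ms_rows = ss_rows / (n - 1)
        ms_cols = ss_cols / (k - 1)
        ms_err = ss_err / ((n - 1) * (k - 1))
        # ICC(2,1) per Shrout & Fleiss (1979)
        return (ms_rows - ms_err) / (
            ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

    # Hypothetical example: 4 interviewers each rate the same 6
    # respondents' apparent interest in the survey on a 0-10 scale.
    ratings = [[9, 8, 9, 8],
               [6, 5, 6, 7],
               [8, 8, 7, 8],
               [3, 2, 4, 3],
               [7, 6, 7, 6],
               [5, 5, 4, 5]]
    print(round(icc_2_1(ratings), 3))  # values near 1 indicate high agreement

In this illustrative data set the raters assign very similar scores to each respondent, so the coefficient comes out close to 1; widely inconsistent ratings would drive it toward 0, signaling data of little value to the researcher.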

Paul J. Lavrakas

Further Readings

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428. http://dx.doi.org/10.1037/0033-2909.86.2.420