Interrater Reliability

Sarah Boslaugh

doi:10.4135/9781412953948


Edited by:
Sarah Boslaugh
In: Encyclopedia of Epidemiology
Chapter DOI: https://doi.org/10.4135/9781412953948.n239
Subject: Epidemiology & Biostatistics, Public Health (general), Public Health Research Methods
Keywords: instruments; measurement


Researchers and practitioners rely on a variety of instruments for measurement, such as scales, surveys, and recordings. For an instrument to be useful, it must be both reliable (i.e., measurements made using it are consistent and can be replicated) and valid (i.e., it measures what the researcher thinks it is measuring). This entry discusses one aspect of reliability, interrater reliability, in the general context of reliability in measurement.

Reliability essentially refers to repeatability. In asking how reliable an instrument is, we are asking whether we would get the same results if we were to take the same measurement of the same entity over and over again. Reliability does not require that an instrument deliver perfect measurements; rather, it assumes that some error occurs when we use an instrument repeatedly but that the error is random rather than systematic. As a result, after multiple measures using the same instrument, the random errors should cancel each other out and we would have an estimate of the true quantity of whatever we were measuring. Reliability also means that the quality being measured does not change from one occasion of measurement to another; thus, for example, we could evaluate the reliability of a scale in measuring children's heights by taking several measurements within the same day but perhaps not by taking several measurements over a period of months (because their actual height could have changed during that time).
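The distinction between random and systematic error can be illustrated with a small simulation. The sketch below is not part of the original entry: the true height, the error spread, the size of the bias, and the helper name simulate_measurements are all hypothetical values chosen for illustration. It assumes normally distributed random error and shows that averaging many measurements recovers the true quantity only when the error is random.

```python
import random
import statistics


def simulate_measurements(true_value, n, random_sd, bias=0.0, seed=0):
    """Return n simulated readings: true value + constant bias + random noise.

    Hypothetical helper for illustration; noise is assumed to be Gaussian.
    """
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, random_sd) for _ in range(n)]


TRUE_HEIGHT_CM = 120.0  # assumed true height of one child

# Instrument with random error only: the errors tend to cancel when averaged.
random_only = simulate_measurements(TRUE_HEIGHT_CM, n=10_000, random_sd=1.0)

# Instrument that also reads 2 cm too high every time (systematic error).
with_bias = simulate_measurements(TRUE_HEIGHT_CM, n=10_000, random_sd=1.0, bias=2.0)

mean_random_only = statistics.mean(random_only)
mean_with_bias = statistics.mean(with_bias)

print(f"mean with random error only: {mean_random_only:.2f} cm")
print(f"mean with systematic bias:   {mean_with_bias:.2f} cm")
```

Under these assumptions the first mean lands very close to 120 cm, while the second stays about 2 cm too high no matter how many measurements are averaged.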

Reliability is a prerequisite for validity. If an instrument yields wildly different results for a presumably unchanging entity on different occasions, it is not possible to evaluate the validity of the instrument because the unreliable nature of the scores precludes interpreting their meaning. On the other hand, if a survey or screening instrument yields essentially the same results over and over again, then we can go on to evaluate whether the instrument is valid, that is, whether it is in fact measuring what we are hoping it is measuring. However, assessing reliability is not always as simple as taking repeated measurements with the same instrument because other factors may cause repeated measurements to be invalid. For instance, someone who is asked the same question repeatedly may change his or her answer between the first and second asking because the question prompted reflection on the issue, or may tire of answering the same questions and stop giving honest answers. Because true reliability is impossible to measure without the impact of this survey effect, a variety of alternative means have been developed to capture the idea of repeated measurements.

Interrater reliability is just one possible way of assessing the reliability of a measurement instrument. As its name suggests, interrater reliability is a method of comparing the observations of multiple ‘raters’ or judges. It is often used in psychological and behavioral evaluations (e.g., in judging if a child engages in disruptive behavior in a classroom setting) and in evaluating the accuracy of medical procedures such as reading X-rays or evaluating mammograms. Rather than making one individual perform multiple ratings of the same person or event, interrater reliability uses multiple people to observe a single set of responses or actions of an individual and then examines the extent to which different judges agree. If the ratings of the judges do not agree, then the measure is not reliable and so cannot be shown to be valid; the instrument used to collect the information may need revision, or the judges may need more training to use it correctly. If the judges do agree, this is supportive evidence that the measure may be a valid one. Of course, we do not expect perfect agreement, and several statistical methods have been developed to evaluate interrater reliability.
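As one illustration of such a statistical method, the sketch below computes simple percent agreement and Cohen's kappa for two raters. The visible text of the entry does not name a specific statistic, so the choice of kappa is an assumption made here, and the ratings are invented data for ten hypothetical classroom observations. Kappa adjusts observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b):
    """Proportion of items on which the two raters gave the same rating."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters assigning categorical ratings."""
    n = len(ratings_a)
    p_observed = percent_agreement(ratings_a, ratings_b)
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    categories = counts_a.keys() | counts_b.keys()
    # Chance agreement: product of each rater's marginal proportions, summed.
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented example: did each of ten children show disruptive behavior?
rater_a = ["yes", "yes", "no", "no", "yes", "no", "no", "no", "yes", "no"]
rater_b = ["yes", "no", "no", "no", "yes", "no", "yes", "no", "yes", "no"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"percent agreement: {agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
```

For this invented data the raters agree on 8 of 10 children (0.80), chance agreement is 0.52, and kappa is about 0.58. Kappa is lower than raw agreement because some matches would be expected even if the raters were rating at random.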

...
