
Interrater Reliability

The use of raters or observers as a method of measurement is prevalent in many disciplines and professions (e.g., psychology, education, anthropology, and marketing). For example, in psychotherapy research, raters might categorize verbal (e.g., a paraphrase) and/or nonverbal (e.g., a head nod) behavior in a counseling session. In education, three different raters might need to score an essay response for Advanced Placement tests. Reliance on raters is also present in other facets of modern society. For example, medical diagnoses often require a second or even third opinion from physicians, and competitions such as Olympic figure skating award medals based on quantitative ratings provided by a panel of judges.

Data recorded on a rating scale are based on the subjective judgment of the rater. Thus, the generality of a set of ratings is always a concern: generality shows that the obtained ratings are not the idiosyncratic result of one person's subjective judgment. Procedural questions include the following: How many raters are needed to be confident in the results? What is the minimum level of agreement the raters need to achieve? Must the raters agree exactly, or is it acceptable for them to differ from one another as long as the differences are systematic? Are the data nominal, ordinal, or interval? What resources are available to conduct the interrater reliability study (e.g., time, money, and technical expertise)?

Interrater (or interobserver; the terms are interchangeable) reliability assesses the degree to which different raters or observers make consistent estimates of the same phenomenon. Interrater reliability estimates are therefore also called consistency estimates: it is not necessary for raters to share a common interpretation of the rating scale, as long as each judge is consistent in classifying the phenomenon according to his or her own view of the scale. Interrater reliability estimates are typically reported as correlational or analysis of variance indices. Thus, the interrater reliability index represents the degree to which the ratings of different judges are proportional when expressed as deviations from their means.

This is not the same as interrater agreement (also known as a consensus estimate of reliability), which represents the extent to which judges make exactly the same decisions about the rated subject. When judgments are made on a numerical scale, interrater agreement generally means that the raters assigned exactly the same score when rating the same person, behavior, or object. However, agreement does not have to be defined as an all-or-none phenomenon: the researcher might define agreement as identical ratings, as ratings that differ by no more than one point, or as ratings that differ by no more than two points (if the interest is in judgment similarity). If a discrepancy of one or two points is included in the definition of agreement, the chi-square value for identical agreement should also be reported. It is possible to have high interrater reliability but low interrater agreement, and vice versa. The researcher must determine which form of rater reliability is most important for the particular study.
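The contrast between the two estimates can be made concrete with a small computation. The following sketch (plain Python, using invented ratings purely for illustration) computes a consistency estimate as the Pearson correlation between two raters and a consensus estimate as the proportion of ratings that agree within a chosen tolerance; the hypothetical data are constructed so that reliability is perfect while exact agreement is zero.

    from statistics import mean

    def pearson_r(x, y):
        # Consistency estimate: how proportional the two raters' scores
        # are when expressed as deviations from their own means.
        mx, my = mean(x), mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    def agreement(x, y, tolerance=0):
        # Consensus estimate: proportion of paired ratings that differ
        # by no more than `tolerance` points (0 = identical ratings).
        return sum(abs(a - b) <= tolerance for a, b in zip(x, y)) / len(x)

    # Hypothetical ratings: rater B scores every subject exactly two
    # points higher than rater A, so each judge is internally consistent
    # but the two never assign the same score.
    rater_a = [1, 2, 3, 4, 5, 4, 3, 2]
    rater_b = [3, 4, 5, 6, 7, 6, 5, 4]

    print(pearson_r(rater_a, rater_b))     # 1.0 -> perfect reliability
    print(agreement(rater_a, rater_b))     # 0.0 -> no exact agreement
    print(agreement(rater_a, rater_b, 2))  # 1.0 -> agree within two points

The final line corresponds to the judgment-similarity definition above, in which ratings differing by no more than two points still count as agreement.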

...
