
The kappa statistic is a measure of agreement, corrected for chance, for a categorical variable. For example, if two radiologists each assess the results for the same set of patients, the kappa is one way to measure how well their conclusions agree. The kappa may be used when the rating scale for grading each patient is binary or categorical. With either a large number of ordinal categories (such as a scale from 0 to 20) or a continuous rating scale, Pearson's correlation coefficient would provide a better assessment of agreement than the kappa.

The formula for the kappa is κ = (p_o − p_e) / (1 − p_e), where p_o is the proportion of observed agreement (the sum of the observed values of the cells on the diagonal over the total number of observations), and p_e is the proportion of agreement expected by chance (the sum of the expected values of the same cells on the diagonal over the total number of observations). Notice that the denominator shows the difference between perfect agreement and the amount of agreement expected by chance, representing the best possible improvement of the raters over chance alone. This is contrasted with the numerator, the difference between the observed proportion of agreement and that expected by chance. As a result, the kappa statistic may be interpreted as the proportion of agreement beyond that which is expected just by chance, and kappa values range from less than 0 (less agreement than expected by chance) to 1 (perfect agreement). Kappa values between 0 and 0.4 represent marginal reproducibility or agreement, values between 0.4 and 0.75 show good agreement, and values more than 0.75 indicate excellent agreement.
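
To make the arithmetic concrete, here is a minimal Python sketch that computes p_o, p_e, and the kappa from a 2 × 2 table of two raters' binary calls; the counts are invented for illustration only.

```python
import numpy as np

# Hypothetical confusion matrix: rows are Rater A's calls, columns Rater B's.
# For example, both raters called 40 cases "positive" and 30 "negative".
table = np.array([[40, 10],
                  [20, 30]], dtype=float)

n = table.sum()
p_o = np.trace(table) / n        # observed agreement: diagonal over total
row = table.sum(axis=1) / n      # Rater A's marginal proportions
col = table.sum(axis=0) / n      # Rater B's marginal proportions
p_e = np.dot(row, col)           # agreement expected by chance
kappa = (p_o - p_e) / (1 - p_e)

print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, kappa = {kappa:.3f}")
```

With these counts, p_o = 0.70 and p_e = 0.50, giving κ = 0.40, right at the boundary between marginal and good agreement under the benchmarks above.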

As the number of possible categories for each rating increases, the associated kappa values tend to decrease. Fortunately, if the categories are ordinal (such as a score from 1 to 10), this can be offset by using the weighted kappa. In the weighted kappa, the most weight is given to observations with identical ratings, less weight is given to ratings one unit apart, still less weight to ratings two units apart, and so on. The user defines how much weight is allotted to each possibility (identical ratings, one unit apart, two units apart, and so on), as well as how far apart ratings can be and still contribute toward the weighted agreement.

In the ordinary kappa statistic, only perfect agreement between Raters A and B counts toward agreement. In the weighted kappa statistic, the most weight is given to perfect agreement (dark gray cells in Table 1), with less weight given to cells with near-perfect agreement (light gray). Specifically, the five observations that Rater A coded as ‘II’ and Rater B coded as ‘I’ would count as disagreement in the ordinary kappa statistic but as partial agreement in the weighted kappa statistic.

Table 1 Comparison of the Unweighted and Weighted Kappa Statistics

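As a sketch of the weighted kappa in practice, scikit-learn's cohen_kappa_score supports linear and quadratic weighting schemes; the ordinal ratings below are hypothetical and chosen only to show the effect of weighting.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal ratings (grades I-IV coded 1-4) for ten patients.
rater_a = [1, 2, 2, 3, 4, 1, 3, 2, 4, 3]
rater_b = [1, 1, 2, 3, 3, 1, 3, 2, 4, 4]

# Unweighted: any disagreement counts fully, however small.
print(cohen_kappa_score(rater_a, rater_b))

# Linear weights: credit decreases with the distance between ratings,
# so the (2, 1) and (4, 3) pairs count as partial agreement.
print(cohen_kappa_score(rater_a, rater_b, weights="linear"))
```

Note that cohen_kappa_score offers only the common linear and quadratic schemes rather than the fully user-defined weights described above; quadratic weights penalize large disagreements more heavily than linear ones.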

The usual versions of the kappa and weighted kappa statistics allow only two raters to assess each observation. The multirater kappa is an alternative when more than two raters assess each observation. The weighted kappa and the multirater kappa may be interpreted in the same way as the generic kappa statistic.
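
One widely used multirater statistic is Fleiss' kappa. The following is a minimal sketch assuming statsmodels is installed; the ratings from three raters are invented for illustration.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical data: rows are subjects, columns are three raters,
# entries are the assigned category (0, 1, or 2).
ratings = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [2, 2, 2],
    [0, 1, 1],
    [1, 1, 1],
    [2, 0, 2],
])

# Convert to a subjects-by-categories count table, then compute Fleiss' kappa.
table, _ = aggregate_raters(ratings)
print(fleiss_kappa(table, method="fleiss"))
```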

...
