
An outlier is an observation in a set of data that is inconsistent with the majority of the data. An observation (i.e., score) is typically labeled an outlier if it is substantially higher or lower than most of the observations. Because, among other things, the presence of one or more outliers can dramatically alter the values of both the mean and variance of a distribution, it behooves a researcher to determine, if possible, what accounts for their presence. An outlier can represent a valid score of a subject who just happens to represent an extreme case of the variable under study, it can result from failure on the part of a subject to cooperate or follow instructions, or it can be due to recording or methodological error on the part of the experimenter. If the researcher has reason to believe that an outlier is due to subject or experimenter error, he or she is justified in removing the observation from the data. On the other hand, if there is no indication of subject or experimental error, the researcher must decide which of the following is the most appropriate course of action to take:

  • Delete the observation from the data. One disadvantage of removing one or more outliers from a set of data is the reduction in sample size, which in turn reduces the power of any statistical test conducted. In the case of commonly employed inferential parametric tests (e.g., t test and analysis of variance), however, this loss of power may be counteracted by a decrease in error variability.
  • Retain the observation, and by doing so risk obtaining values for one or more sample statistics that are not accurate representations of the population parameters they are employed to estimate.
  • Retain one or more observations that are identified as outliers and evaluate the data through the use of one of a number of statistical procedures that have been developed to deal with outliers, some of which are discussed in the last part of this entry.
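The trade-off between deleting and retaining an extreme observation can be illustrated with a minimal Python sketch. The scores below are hypothetical values invented for illustration (they do not come from this entry), and the last value plays the role of the outlier.

```python
from statistics import mean, variance

# Hypothetical scores (an assumption for illustration); 40 is the extreme value.
scores = [10, 12, 11, 13, 12, 11, 40]

# The same data with the extreme observation deleted.
trimmed = [s for s in scores if s != 40]

# Sample mean and sample variance (n - 1 denominator) with the outlier retained.
print("retained:", round(mean(scores), 2), round(variance(scores), 2))

# The same statistics after deletion.
print("deleted: ", round(mean(trimmed), 2), round(variance(trimmed), 2))
```

With these made-up numbers, a single extreme score shifts the mean by several points and inflates the variance by roughly two orders of magnitude, which is why the choice among the options above matters.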

Strategies for Identifying Outliers

One strategy for identifying outliers employs a standard deviation criterion that is based on Chebyshev's inequality, which states that, regardless of the distribution of a set of data, at least [1 − (1/k²)](100) percent of the observations will fall within k standard deviations of the mean (where k is any value greater than 1). Because these minimum percentages are quite high when k is equal to 2, 3, 4, and 5 standard deviation units (respectively, 75%, 88.89%, 93.75%, and 96%), scores that are more than two standard deviations from the mean are relatively uncommon, and scores that are three or more standard deviations from the mean are relatively rare. However, employing a standard deviation criterion (e.g., declaring any observation that is three or more standard deviations from the mean an outlier) can sometimes be misleading, especially with small sample sizes and data sets with anomalous configurations.
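The Chebyshev bound and the standard deviation criterion can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `chebyshev_min_percent` and `flag_by_sd` and the sample scores are assumptions made for this example, and the sample standard deviation (n − 1 denominator) is used.

```python
from statistics import mean, stdev

def chebyshev_min_percent(k):
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return (1 - 1 / k**2) * 100

def flag_by_sd(values, k=3.0):
    """Return the values lying k or more sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)
    return [v for v in values if abs(v - m) >= k * s]

# Minimum percentages for k = 2, 3, 4, 5.
for k in (2, 3, 4, 5):
    print(k, round(chebyshev_min_percent(k), 2))

# Hypothetical small sample (an assumption for illustration).
scores = [10, 12, 11, 13, 12, 11, 40]
print(flag_by_sd(scores, k=3.0))
print(flag_by_sd(scores, k=2.0))
```

In this hypothetical sample of seven scores, the value 40 is not flagged by the three standard deviation rule, because the extreme score itself inflates the standard deviation; it is flagged only when k is lowered to 2. This is one way a standard deviation criterion can mislead with small samples.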

Another tool that can be employed for identifying outliers is a box-and-whisker plot (also referred to as a boxplot), which is a visual method for displaying data developed by the statistician John Tukey. Typically, in a box-and-whisker plot, any value that falls more than one and one-half hinge-spreads outside a hinge can be classified as an outlier, and any value that falls more than three hinge-spreads outside a hinge can be classified as a severe outlier. The value of a hinge-spread is the difference between the upper and lower hinges of the distribution. In a large sample, the lower and upper hinges of a distribution will approximate the 25th and 75th percentiles, respectively.
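The hinge-spread rule can be sketched in Python as follows. The sketch assumes one common convention for computing hinges (the medians of the lower and upper halves of the sorted data, with the middle value shared by both halves when the sample size is odd); other conventions exist and can give slightly different results. The function names and the sample scores are hypothetical.

```python
from statistics import median

def hinges(values):
    """Lower and upper hinges: medians of the lower and upper halves of the data.

    Assumed convention: when n is odd, the middle value belongs to both halves.
    """
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2
    return median(xs[:half]), median(xs[n - half:])

def classify(values):
    """Group values into outliers and severe outliers using hinge-spreads."""
    lower, upper = hinges(values)
    spread = upper - lower  # the hinge-spread
    result = {"outlier": [], "severe": []}
    for v in values:
        # Distance outside the nearer hinge (negative if v lies between hinges).
        distance = max(lower - v, v - upper)
        if distance > 3 * spread:
            result["severe"].append(v)
        elif distance > 1.5 * spread:
            result["outlier"].append(v)
    return result

# Hypothetical small sample (an assumption for illustration).
scores = [10, 12, 11, 13, 12, 11, 40]
print(hinges(scores))
print(classify(scores))
```

For these made-up scores the hinges are 11 and 12.5, so the hinge-spread is 1.5, and the value 40 lies far more than three hinge-spreads above the upper hinge and is classified as a severe outlier. Because the hinges are based on medians, they are barely affected by the extreme score itself.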

...
