
An outlier, as the term suggests, is an observation in a sample that lies outside the “bulk” of the sample data. For example, the value 87 is an outlier in the following set of numbers: 2, 5, 1, 7, 11, 9, 5, 6, 87, 4, 0, 9, 7. This original meaning has been expanded to include observations that are influential in the estimation of a population quantity. The influence of an observation is intuitively understood as the degree to which the presence or absence of that observation affects the estimate in terms of the variance.
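As a rough illustration of influence, the sketch below recomputes the sample mean and sample variance of the numbers above with and without the value 87. The leave-one-out comparison and the variable names are illustrative assumptions, not a formal influence measure.

```python
import statistics

# The example values from the text; 87 is the suspected outlier.
data = [2, 5, 1, 7, 11, 9, 5, 6, 87, 4, 0, 9, 7]
without_87 = [x for x in data if x != 87]

# Estimates with the outlier present.
mean_all = statistics.mean(data)
var_all = statistics.variance(data)  # sample variance (n - 1 denominator)

# Estimates with the outlier removed.
mean_trimmed = statistics.mean(without_87)
var_trimmed = statistics.variance(without_87)

print(f"mean:     {mean_all:.2f} with 87, {mean_trimmed:.2f} without")
print(f"variance: {var_all:.2f} with 87, {var_trimmed:.2f} without")
```

A single observation roughly doubles the mean and inflates the sample variance by more than an order of magnitude, which is the sense in which it is influential.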

The notion of outliers is common to all statistical disciplines. However, it has a distinctive meaning in sample surveys, mainly for two reasons: (1) sample surveys mostly deal with finite populations, often without assuming a parametric distribution; and (2) sample surveys often employ complex sample designs with unequal inclusion probabilities. Moreover, the meaning and handling of outliers also differ depending on the stage of the survey process at hand: the sample design stage, the editing stage, or the estimation stage.

The occurrence of outliers is frequently unavoidable when a multi-purpose survey is conducted. It may be nearly impossible to make the design efficient for all variables of interest in a large-scale multi-purpose survey. The outliers that have the most impact come from sample units that have a large sample value coupled with a large sampling weight. Probability proportional to size (PPS) sampling or size stratification is often used in the design stage to prevent such a situation from occurring. If the measure of size (MOS) is not reliable, it is difficult to eliminate outliers entirely, unless a census or a sample survey with a high sampling rate is used. This problem is especially pronounced when dealing with a volatile population such as businesses in economic surveys. A typical situation is that a unit with a small MOS, and thus assigned a small selection probability and a correspondingly large sampling weight, has grown to have a medium or large value by the time of observation, resulting in a huge weighted value.
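The typical situation described above can be sketched numerically. The example assumes a fixed-size PPS design in which unit i has inclusion probability n × MOS_i / (total MOS) and a sampling weight equal to the inverse of that probability. The function name `pps_design_weights` and all numbers are hypothetical.

```python
def pps_design_weights(mos, n):
    """Inverse-inclusion-probability weights under an assumed PPS design.

    Inclusion probability of unit i is taken to be n * mos[i] / sum(mos),
    so the weight is sum(mos) / (n * mos[i]).
    """
    total = sum(mos)
    for m in mos:
        if m <= 0 or n * m > total:
            raise ValueError("each inclusion probability must lie in (0, 1]")
    return [total / (n * m) for m in mos]


# Hypothetical frame: measure of size recorded when the sample was designed.
mos = [5, 50, 200, 245]
# Hypothetical observed values: unit 0 has grown from about 5 to 300,
# while the other units still match their measure of size.
observed = [300, 50, 200, 245]

weights = pps_design_weights(mos, n=2)
weighted_values = [w * y for w, y in zip(weights, observed)]

for m, y, w, wy in zip(mos, observed, weights, weighted_values):
    print(f"MOS={m:>3}  value={y:>3}  weight={w:6.2f}  weighted value={wy:9.1f}")
```

When the observed value is proportional to the MOS, every unit's weighted value is the same (250 here), which is what makes PPS efficient. The unit whose MOS was out of date contributes 15,000 instead.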

In the editing stage of a survey process, outlier detection is performed to find extreme values that may be due to some survey error (for example, a response error or a keying error during data entry). Such outliers are detected by comparing individual data values with others using a standardized distance measure, defined as the absolute difference of the value from the center of the data (location) divided by a dispersion measure (scale). Using the sample mean and sample standard deviation to define the distance tends to mask outliers, because both statistics are themselves pulled toward the extreme values. To avoid the masking effect, estimates that are robust (insensitive) to outliers should be used. For example, one may use the median to estimate location, and for scale either the interquartile range (IQR), which is the difference between the third quartile (i.e., the 75th percentile) and the first quartile (i.e., the 25th percentile), or the median absolute deviation (MAD), which is the median of the observations' absolute differences from the sample median. Weighted versions of these quantities rather than unweighted ones may be used if the sampling weights are available. Once the standardized distance is defined, the detection criterion is set as a tolerance interval for the standardized distances; if an observation falls outside the interval, it is declared an outlier. The interval can be symmetric, nonsymmetric, or one-sided, depending on the underlying population distribution.
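A minimal sketch of this detection rule, using the median for location and the median of absolute differences from the median for scale, is given below. It is unweighted, uses a symmetric tolerance interval, and the cutoff of 3.5 is an arbitrary assumption chosen for illustration; the function names are hypothetical.

```python
import statistics


def robust_distances(values):
    """Standardized distance |x - median| / MAD for each value."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    if mad == 0:
        raise ValueError("scale estimate is zero; distance is undefined")
    return [abs(v - center) / mad for v in values]


def flag_outliers(values, cutoff=3.5):
    """Return values whose robust distance exceeds the (assumed) cutoff."""
    distances = robust_distances(values)
    return [v for v, d in zip(values, distances) if d > cutoff]


data = [2, 5, 1, 7, 11, 9, 5, 6, 87, 4, 0, 9, 7]

# Classical distance based on the mean and standard deviation, for comparison.
mean = statistics.mean(data)
sd = statistics.stdev(data)
classical = [abs(v - mean) / sd for v in data]

robust = robust_distances(data)
idx = data.index(87)
print(f"classical distance of 87: {classical[idx]:.2f}")
print(f"robust distance of 87:    {robust[idx]:.2f}")
print("flagged:", flag_outliers(data))
```

For these data the median is 6 and the scale estimate is 3, so the value 87 has a robust distance of 27, while its distance based on the mean and standard deviation is only a little above 3 because the value itself inflates both statistics.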

...
