
In statistics, we assume that our data come from some probability model, a hypothetical or idealized model that makes mathematical statistical analysis possible. In the real world, unfortunately, the data in our hands often contain outliers: outlying observations discordant from the hypothesized model. It is well known that outliers in a data set can severely distort the results of statistical inference.

Generally speaking, there are two strategies for dealing with outliers. One is to use robust methods that accommodate outliers in the data, such as using the sample median (instead of the sample mean) to estimate a population mean. The other is to identify the outliers in the sample and then modify or simply delete them before further data analysis. Our topic here is the identification of outliers in the sample data.
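The robustness contrast between the two strategies can be seen in a tiny numeric sketch (hypothetical data, not from the source):

```python
from statistics import mean, median

# A small sample with one gross outlier (hypothetical data).
data = [1, 2, 3, 4, 100]

print(mean(data))    # 22 -- the mean is dragged toward the outlier
print(median(data))  # 3  -- the median is barely affected
```

Deleting the outlier before computing the mean would give a similar protection, which is exactly the identification-then-deletion strategy discussed next.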

Suppose that our sample data x1, x2, …, xn come from a population or distribution of interest. Arrange the sample observations x1, x2, …, xn in ascending order: x(1) ≤ x(2) ≤ … ≤ x(n); these are called the order statistics. Suppose that we have k suspicious lower outliers x(1), x(2), …, x(k) or upper outliers x(n−k+1), …, x(n−1), x(n) in the sample (the case of having both lower and upper outliers is too complicated to be considered here), where k, the number of suspicious outliers, is much smaller than the sample size n. We want to test whether they are significantly discordant from the rest of the sample observations.
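In code, the order statistics are simply the sorted sample; a minimal illustration with made-up numbers:

```python
# Sorting a hypothetical sample yields its order statistics
# x(1) <= x(2) <= ... <= x(n).
x = [10.3, 9.8, 14.2, 10.0]
order_stats = sorted(x)
print(order_stats)  # [9.8, 10.0, 10.3, 14.2]
```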

Hypothesis Test for Outliers

To test the discordancy of suspicious outliers, we need to propose statistical hypotheses. For example, to test the discordancy of k suspicious upper outliers x(n−k+1), …, x(n−1), x(n) in a normal sample, suppose that the underlying distribution of the sample is a normal distribution N(μ, σ2) with mean μ and variance σ2, where both μ and σ2 are unknown.

Under the null hypothesis, the sample data x1, x2, …, xn are a random sample from N(μ, σ2). Under the alternative hypothesis, the nonsuspicious observations x(1), x(2), …, x(n−k) belong to N(μ, σ2), but the suspicious upper outliers x(n−k+1), …, x(n−1), x(n) belong to N(μ + a, σ2) with a > 0, which has a larger mean μ + a shifted to the right of the original mean μ.

In other words, we need to test the null hypothesis

H0: a = 0

against the mean-shifted alternative hypothesis

H1: a > 0.

The likelihood-ratio statistic for testing H0: a = 0 against H1: a > 0 is

T = [x(n−k+1) + x(n−k+2) + … + x(n) − kX¯] / s,

where X¯ and s stand for the sample mean and sample standard deviation. Large values of the test statistic lead to rejection of the null hypothesis H0, identifying x(n−k+1), …, x(n−1), x(n) as outliers or discordant observations.
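As an illustrative sketch, such a statistic can be computed by summing the k largest deviations from the sample mean and scaling by s. The data and the exact form used below are assumptions for illustration, not taken from the source:

```python
from statistics import mean, stdev

def upper_outlier_stat(x, k):
    """Likelihood-ratio-type statistic for k suspected upper outliers:
    sum of the k largest observations minus k times the sample mean,
    scaled by the sample standard deviation (illustrative form)."""
    xs = sorted(x)
    xbar = mean(xs)
    s = stdev(xs)  # sample standard deviation (n - 1 in the denominator)
    return (sum(xs[len(xs) - k:]) - k * xbar) / s

# Hypothetical sample with one suspicious upper value.
data = [9.8, 9.9, 10.0, 10.1, 10.3, 14.2]
T = upper_outlier_stat(data, k=1)
print(T)  # large values are evidence of discordancy
```

In practice the observed T would be compared against a critical value obtained from published tables or by simulation under the null hypothesis.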

If the underlying distribution of the sample is a nonnormal distribution, say, a gamma distribution (which includes the exponential distribution as a special case), then the likelihood-ratio test statistic will be

T′ = [x(n−k+1) + x(n−k+2) + … + x(n)] / (x1 + x2 + … + xn).
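A ratio-of-sums statistic of this kind, comparing the k largest observations with the whole sample, can be sketched as follows (the function name and data are assumptions for illustration):

```python
def gamma_upper_outlier_stat(x, k):
    """Ratio of the sum of the k largest observations to the total sum,
    a discordancy measure sometimes used for gamma/exponential samples."""
    xs = sorted(x)
    return sum(xs[len(xs) - k:]) / sum(xs)

# Hypothetical positive-valued sample with one large value.
ratio = gamma_upper_outlier_stat([1.0, 1.0, 2.0, 2.0, 10.0], k=1)
print(ratio)  # 0.625
```

A ratio close to 1 means the suspected values carry most of the total, evidence that they are discordant from the rest of the sample.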

Dixon Test

There are many varieties of statistical tests for detecting outliers in the sample. Generally speaking, powerful tests are based on sophisticated test statistics, such as the likelihood-ratio test statistics discussed before.

The Dixon test is a very simple test that is often used to detect outliers in a small sample. The general form of the Dixon statistic in the literature is defined as

...
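Although the general definition is not reproduced here, one widely cited member of the Dixon family is the r10 ratio for a single suspected upper outlier, which compares the gap between the two largest observations with the sample range. A minimal sketch (illustrative, not the source's definition):

```python
def dixon_r10(x):
    """Dixon's r10 ratio for a single suspected upper outlier:
    the gap between the two largest observations divided by the range."""
    xs = sorted(x)
    return (xs[-1] - xs[-2]) / (xs[-1] - xs[0])

# Hypothetical small sample; a ratio near 1 suggests x(n) is discordant.
r = dixon_r10([1.0, 2.0, 3.0, 4.0, 100.0])
print(round(r, 3))  # 0.97
```

As with the likelihood-ratio statistics above, the observed ratio is compared against tabulated critical values that depend on the sample size and significance level.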
