Influence Statistics

Influence statistics measure the effects of individual data points or groups of data points on a statistical analysis. The effect of individual data points on an analysis can be profound, and so the detection of unusual or aberrant data points is an important part of nearly every analysis. Influence statistics typically focus on a particular aspect of a model fit or data analysis and attempt to quantify how the model changes with respect to that aspect when a particular data point or group of data points is included in the analysis. In the context of linear regression, where the ideas were first popularized in the 1970s, a variety of influence measures have been proposed to assess the impact of particular data points.

The popularity of influence statistics soared in the 1970s because of the proliferation of fast and relatively cheap computing, which made it easy to examine the effects of individual data points on an analysis, even for relatively large data sets. Seminal works by R. Dennis Cook; David A. Belsley, Edwin Kuh, and Roy E. Welsch; and R. Dennis Cook and Sanford Weisberg led the way for an avalanche of new techniques for assessing influence. Along with these new techniques came an array of names for them: DFFITS, DFBETAS, COVRATIO, Cook's D, and leverage, to name but a few of the more prominent examples. Each measure was designed to assess the influence of a data point on a particular aspect of the model fit: DFFITS on the fitted values from the model, DFBETAS on each individual regression coefficient, COVRATIO on the estimated covariance matrix of the regression coefficients, and so on. Each measure can be readily computed using widely available statistical packages, and these measures are very commonly used as part of an exploratory analysis of data.

This entry first discusses types of influence statistics, then describes the calculation and limitations of influence statistics, and concludes with an example.

Types

Influence measures are typically categorized by the aspect of the model to which they are targeted. Some commonly used influence statistics in the context of linear regression models are discussed and summarized next. Analogs are also available for generalized linear models and for other more complex models, although these are not described in this entry.

Influence with respect to the fitted values of a model can be assessed using a measure called DFFITS, a scaled difference between the fitted values for the ith observation from models fit with and without that data point:

$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{\mathrm{MSE}_{(i)}\, h_{ii}}}$$

where the notation in the numerator denotes fitted values for the response for models fit with and without the ith data point, respectively, $\mathrm{MSE}_{(i)}$ is the mean square for error in the model fit without data point i, and $h_{ii}$ is the ith leverage; that is, the ith diagonal element of the hat matrix, $H = X(X^{T}X)^{-1}X^{T}$. Although DFFITS resembles a t statistic, it does not have a t distribution, and the size of $\mathrm{DFFITS}_i$ is judged relative to a cutoff proposed by Belsley, Kuh, and Welsch. A point is regarded as potentially influential with respect to fitted values if $|\mathrm{DFFITS}_i| > 2\sqrt{p/n}$, where n is the sample size and p is the number of estimated regression coefficients.
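As a rough illustration of the definition above, the following Python sketch computes DFFITS for an ordinary least squares fit using NumPy. It is not taken from the entry: the function name `dffits`, the simulated data, and the planted aberrant point are all assumptions made for the example. Rather than refitting the model n times, it relies on the standard leave-one-out identities for linear regression, and it assumes every leverage is strictly less than 1.

```python
import numpy as np

def dffits(X, y):
    """Return (DFFITS, leverages) for an OLS fit of y on X.

    X is the n-by-p design matrix and must already include an
    intercept column if one is wanted. Assumes all h_ii < 1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Leverages h_ii: diagonal of H = X (X^T X)^{-1} X^T, via thin QR (H = Q Q^T).
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)

    # Full-data fit and residuals e_i.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = resid @ resid

    # MSE_(i) without refitting: SSE_(i) = SSE - e_i^2 / (1 - h_ii).
    mse_loo = (sse - resid**2 / (1.0 - h)) / (n - p - 1)

    # Numerator: yhat_i - yhat_i(i) = h_ii * e_i / (1 - h_ii).
    diff = h * resid / (1.0 - h)
    return diff / np.sqrt(mse_loo * h), h

# Hypothetical data: 20 points near a line plus one planted aberrant point.
rng = np.random.default_rng(0)
x = np.append(np.linspace(0.0, 1.0, 20), 3.0)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)
y[-1] = 0.0  # far from the line, and at an extreme x value
X = np.column_stack([np.ones_like(x), x])

d, h = dffits(X, y)
n, p = X.shape
cutoff = 2.0 * np.sqrt(p / n)      # the 2*sqrt(p/n) rule of thumb
flagged = np.abs(d) > cutoff
print(np.flatnonzero(flagged))     # indices of potentially influential points
```

The shortcut formulas avoid n separate refits; a brute-force loop that deletes each row and refits should give the same values and is a useful check on small data sets.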

...
