Influential Data Points

Neil J. Salkind

doi:10.4135/9781412961288

Edited by:
Neil J. Salkind
In: Encyclopedia of Research Design
Chapter DOI: https://doi.org/10.4135/9781412961288.n187
Subject: Research Design

Influential data points are observations that exert an unusually large effect on the results of regression analysis. Influential data might be classified as outliers, as leverage points, or as both. An outlier is an anomalous response value, whereas a leverage point has atypical values of one or more of the predictors. It is important to note that not all outliers are influential.

Identification and appropriate treatment of influential observations are crucial in obtaining a valid descriptive or predictive linear model. A single, highly influential data point might dominate the outcome of an analysis with hundreds of observations: It might spell the difference between rejection and failure to reject a null hypothesis or might drastically change estimates of regression coefficients. Assessing influence can reveal data that are improperly measured or recorded, and it might be the first clue that certain observations were taken under unusual circumstances. This entry discusses the identification and treatment of influential data points.

Identifying Influential Data Points

A variety of straightforward approaches is available to identify influential data points on the basis of their leverage, outlying response values, or individual effect on regression coefficients.

Graphical Methods

In the case of simple linear regression (p = 2), a scatterplot of the response versus predictor values might disclose influential observations, which will fall well outside the general two-dimensional trend of the data. Observations with high leverage as a result of the joint effects of multiple explanatory variables, however, are difficult to reveal by graphical means. Although simple graphing is effective in identifying extreme outliers and nonsensical values, and is valuable as an initial screen, visual inspection might not correctly discern less obvious influential points, especially when the data are sparse (i.e., small n).

Leverage

Observations whose influence is derived from explanatory values are known as leverage points. The leverage of the ith observation is defined as $h_i = x_i (X'X)^{-1} x_i'$, where $x_i$ is the ith row of the $n \times p$ design matrix $X$ for sample size n and p columns (counting the intercept, so that simple linear regression has p = 2). Larger values of $h_i$, where $0 \le h_i \le 1$, are indicative of greater leverage. For reasonably large data sets ($n - p > 50$), a value of $h_i$ greater than $2p/n$ is a standard criterion for classification as a leverage point. This cutoff is twice the mean leverage: because $\sum_{i=1}^{n} h_i = p$, the mean of $h_i$ is $p/n$.
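The leverage calculation can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `leverages` and the data (59 evenly spaced predictor values plus one distant value) are assumptions made up for the example, chosen so that n − p > 50 and the 2p/n criterion applies.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix, h_i = x_i (X'X)^{-1} x_i', for an n x p design matrix."""
    X = np.asarray(X, dtype=float)
    # Solve (X'X) B = X' instead of forming the inverse explicitly.
    B = np.linalg.solve(X.T @ X, X.T)        # shape (p, n)
    # Row i of X times column i of B gives h_i.
    return np.einsum("ij,ji->i", X, B)

# Hypothetical data: predictor values 0..58 plus one value far from the rest.
x = np.append(np.arange(59.0), 200.0)
X = np.column_stack([np.ones_like(x), x])    # intercept column + one predictor, so p = 2
n, p = X.shape

h = leverages(X)
cutoff = 2 * p / n                           # twice the mean leverage p/n
flagged = np.flatnonzero(h > cutoff)         # zero-based indices of leverage points

print("sum of leverages:", h.sum())          # equals p up to rounding error
print("cutoff 2p/n:", cutoff)
print("flagged indices:", flagged)
```

Only the distant predictor value exceeds the cutoff here. Note that the response never enters the calculation: leverage depends on the predictors alone.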

Standardized Residuals

An objective test for outliers is available in the form of standardized residuals. The Studentized deleted residuals,

$$e_i^{*} = \frac{e_i}{s_{(i)}\sqrt{1 - h_i}},$$

where $e_i = y_i - \hat{y}_i$ is the ordinary residual, $h_i$ is the leverage, and $s_{(i)}^2$ is the mean square estimate of the residual variance $\sigma^2$ with the ith observation removed, have a Student's t distribution with $n - p - 1$ degrees of freedom (df) under the assumption of normally distributed errors. An equivalent expression might be constructed in terms of $\hat{y}_{i(i)}$, which is the fitted value for observation i when the latter is not included in estimating regression parameters:

$$e_i^{*} = \frac{\left(y_i - \hat{y}_{i(i)}\right)\sqrt{1 - h_i}}{s_{(i)}}.$$

As a rule of thumb, an observation might be declared an outlier if $|e_i^{*}| > 3$. As mentioned, however, classification as an outlier does not necessarily imply large influence.
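The following sketch computes Studentized deleted residuals and applies the |e*| > 3 rule of thumb. It is an illustration under stated assumptions: the function name and the data (a straight line with a small alternating disturbance and one shifted response) are invented for the example. The deleted variance is obtained from the standard identity s²(i) = (SSE − e²ᵢ/(1 − hᵢ))/(n − p − 1), which avoids refitting the model n times, and the result for one observation is cross-checked against an explicit refit that uses the deleted fitted value.

```python
import numpy as np

def studentized_deleted_residuals(X, y):
    """Return (e*, h), where e*_i = e_i / (s_(i) * sqrt(1 - h_i))."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                                   # ordinary residuals
    h = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
    # Residual variance with observation i removed, without refitting.
    s2_del = (e @ e - e**2 / (1.0 - h)) / (n - p - 1)
    return e / np.sqrt(s2_del * (1.0 - h)), h

# Hypothetical data: a linear trend, a small alternating disturbance,
# and one response (index 10) shifted well away from the trend.
idx = np.arange(20)
x = idx.astype(float)
y = 1.0 + 2.0 * x + 0.5 * (-1.0) ** idx
y[10] += 8.0
X = np.column_stack([np.ones_like(x), x])

t, h = studentized_deleted_residuals(X, y)
outliers = np.flatnonzero(np.abs(t) > 3)               # rule of thumb from the text

# Cross-check observation 10 by actually refitting without it.
i = 10
keep = np.arange(len(y)) != i
beta_del, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
resid_del = y[keep] - X[keep] @ beta_del
s_del = np.sqrt(resid_del @ resid_del / (len(y) - 1 - X.shape[1]))
t_refit = (y[i] - X[i] @ beta_del) * np.sqrt(1.0 - h[i]) / s_del

print("outlier indices:", outliers)
print("shortcut vs. refit:", t[i], t_refit)
```

The two values printed on the last line agree up to rounding error, which is the equivalence between the two expressions for the Studentized deleted residual.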

Estimates of Influence

Several additional measures assess influence on the basis of effect on the model fit and estimated regression parameters. The standardized change in fit,

$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{s_{(i)}\sqrt{h_i}},$$

provides a standardized measure of effect of the ith observation on its fitted (predicted) value. It represents the change, in units of standard errors (SE), in the fitted value brought about by omission of the ith point in fitting the linear model. DFFITS and the Studentized residual are closely related:

$$\mathrm{DFFITS}_i = e_i^{*}\sqrt{\frac{h_i}{1 - h_i}}.$$

The criteria for large effect are typically $|\mathrm{DFFITS}| > 2\sqrt{p/n}$ for large data sets or $|\mathrm{DFFITS}| > 1$ for small data sets. This measure might be useful where prediction is the most important goal of an analysis.
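As a rough sketch of how DFFITS combines leverage and the Studentized deleted residual, the code below computes it from that relationship and then checks one observation against the definition by refitting without it. The function name `dffits` and the data (a high-leverage point whose response is shifted off the trend) are assumptions for illustration only.

```python
import numpy as np

def dffits(X, y):
    """DFFITS_i = e*_i * sqrt(h_i / (1 - h_i)) for every observation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    h = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
    s2_del = (e @ e - e**2 / (1.0 - h)) / (n - p - 1)  # deleted residual variance
    t = e / np.sqrt(s2_del * (1.0 - h))                # Studentized deleted residuals
    return t * np.sqrt(h / (1.0 - h)), h

# Hypothetical data: 59 points on a linear trend with a small alternating
# disturbance, plus one distant predictor value with an off-trend response.
x = np.append(np.arange(59.0), 200.0)
idx = np.arange(len(x))
y = 1.0 + 2.0 * x + 0.5 * (-1.0) ** idx
y[-1] -= 40.0
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

d, h = dffits(X, y)
cutoff = 2 * np.sqrt(p / n)                            # large-data-set criterion
flagged = np.flatnonzero(np.abs(d) > cutoff)
most_influential = int(np.argmax(np.abs(d)))

# Cross-check the last observation against the definition
# (yhat_i - yhat_i(i)) / (s_(i) * sqrt(h_i)) using an explicit refit.
i = n - 1
keep = idx != i
beta_all, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_del, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
resid_del = y[keep] - X[keep] @ beta_del
s_del = np.sqrt(resid_del @ resid_del / (n - 1 - p))
d_refit = (X[i] @ beta_all - X[i] @ beta_del) / (s_del * np.sqrt(h[i]))

print("cutoff:", cutoff, "flagged:", flagged)
print("shortcut vs. refit:", d[i], d_refit)
```

The point with both high leverage and an off-trend response has by far the largest |DFFITS| in this example, which illustrates why leverage and outlyingness are assessed together when prediction is the goal.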

...
