
Shrinkage refers to the bias between sample statistics and the population parameters they estimate. Multiple regression generally overestimates population values: the sample multiple correlation coefficient (R) and the coefficient of multiple determination (R2) tend to exceed their population counterparts. A common correction for this overinflation is the shrunken or adjusted R2, which accounts for the amount of shrinkage between the sample R2 and the population squared multiple correlation (ρ2). Similarly, the fit of a model estimated in one sample typically overestimates how well that model would fit a separate sample from the same population (i.e., a cross-validation sample), and such results also often need to be adjusted for shrinkage.

This entry begins by explaining why regression overestimates the population parameters. Next, the entry provides an example of shrinkage and discusses its use in cross-validity. The entry ends with a brief discussion of subsequent knowledge in this area.

Why Regression Overestimates Population Parameters

When working with a random sample of data from a larger population, it is expected that the sample mean will not exactly match the true population value. Sometimes the mean might be a little higher, and sometimes it might be a little lower. This fluctuation is generally attributed to sampling error. Sampling error is also present when estimating other quantities, including regression parameters. Regression analyses (in fact, all analyses that use the least squares solution) do not account for these positive and negative fluctuations from the true population value when computing the multiple correlation coefficient (R). Multiple R is the product-moment correlation between the dependent variable and a linear combination of the set of independent variables. Because least squares maximizes the correlation between the set of independent variables and the dependent variable, and because multiple R cannot be negative (thus all chance fluctuations work in the positive direction), R is overfitted to the sample from which it was estimated. Each sample has its own idiosyncratic characteristics, and ordinary least squares capitalizes on these, thus inflating the estimate of R.

Increasing the number of predictors also results in an artificially higher multiple R value. Because all chance fluctuations are positive, adding a variable to the model can increase the multiple R through sampling error variance alone—a situation typically referred to as capitalization on chance. An overestimate of multiple R leads to an overestimate of the coefficient of multiple determination (R2)—an estimate of the proportion of the variance of the dependent variable accounted for by the predictor variables. This positive inflation is easy to observe in any statistical program: Add more predictor variables—even ones not significantly related to the dependent variable—and watch both R and R2 increase. Positive inflation becomes even greater for small samples. Because of this bias, statisticians recommend estimating the amount of shrinkage that would occur and adjusting the R2 accordingly. In many computer programs, this more appropriate measure of the population value is labeled “adjusted R2.”
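Capitalization on chance is easy to demonstrate by simulation. The following sketch (a hypothetical illustration, not part of the original entry; it assumes NumPy is available) fits ordinary least squares models to an outcome that is pure noise, adding purely random predictors one at a time. The sample R2 climbs steadily even though no predictor has any true relationship to the dependent variable:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
y = rng.normal(size=n)             # outcome is pure random noise
X_full = rng.normal(size=(n, 20))  # 20 purely random "predictors"

r2_by_k = {}
for k in (1, 5, 10, 20):
    # Fit OLS with an intercept and the first k random predictors.
    X = np.column_stack([np.ones(n), X_full[:, :k]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2_by_k[k] = 1 - np.sum(resid ** 2) / ss_tot
    print(f"k = {k:2d}  sample R^2 = {r2_by_k[k]:.3f}")
```

Because the models are nested, the sample R2 can never decrease as predictors are added, so the printed values rise with k despite the population R2 being exactly zero.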

Example

Following is a common R2 shrinkage adjustment formula that is often used in statistical packages:

adjusted R2 = 1 − (1 − R2)[(n − 1) / (n − k − 1)]

where n = sample size and k = number of independent variables.
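The adjustment above is straightforward to compute directly. The sketch below (a hypothetical helper, not from the original entry) applies the formula to a sample R2 of .50 obtained from n = 30 cases and k = 5 predictors:

```python
def adjusted_r2(r2, n, k):
    """Shrinkage-adjusted R^2: penalizes the sample R^2 for the
    number of predictors (k) relative to the sample size (n)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A sample R^2 of .50 shrinks noticeably after adjustment.
print(round(adjusted_r2(0.50, 30, 5), 3))  # → 0.396
```

Note that the adjusted value can be negative when the sample R2 is small relative to the number of predictors, which signals that the model explains less variance than would be expected by chance alone.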

...
