
Stepwise Regression

Stepwise, also called stagewise, methods in fitting regression models have been extensively studied and applied in the past 50 years, and they still remain an active area of research. In many study designs, one has a large number K of input variables, and the number n of input-output observations (xi1,…, xiK, yi), 1 ≤ i ≤ n, is often of the same or smaller order of magnitude than K. Examples include gene expression studies, where the number K of genomic locations is typically larger than the number n of subjects, and signal or image reconstruction, where the number K of basis functions to be considered exceeds the number n of measurements or pixels. Stepwise methods are perhaps the only computationally feasible way to tackle these problems, and certain versions of these methods have recently been shown to have many desirable statistical properties as well.

Stepwise regression basically carries out two tasks sequentially to fit a regression model

yi = β1xi1 + … + βKxiK + εi,  i = 1, …, n,   (1)

where β = (β1,…, βK)T is a vector of regression parameters, xi = (xi1,…, xiK)T is a vector of regressors (input variables), εi represents unobservable noise, and yi is the observed output. The first task is to choose regressors sequentially, and the second task is to refit the regression model by least squares after a regressor has been added to the model. For notational simplicity, assume that the yi and xij in Equation 1 have been centered at their sample means so that Equation 1 does not have an intercept term.
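As a minimal sketch of this centered, no-intercept setup (the function name is illustrative and not from the article), the centering and least squares refit can be written as:

```python
import numpy as np

def center_and_fit(X, y):
    """Center the inputs and output at their sample means, then fit the
    no-intercept model y_i = x_i^T beta + eps_i by least squares.
    A minimal sketch of the setup in Equation 1."""
    Xc = X - X.mean(axis=0)          # center each regressor column
    yc = y - y.mean()                # center the output
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    return Xc, yc, beta
```

Centering removes the intercept from the fitted model, so the estimated coefficients refer only to the regressors themselves.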

To begin, stepwise regression chooses the regressor that is most correlated with the output variable (i.e., such that the [sample] correlation coefficient between yi and xij is the largest among the K regressors). One then performs least squares regression of yi on the selected regressor xij, yielding the least squares fit ŷi and the residuals ei = yi − ŷi. A variable selection criterion is then applied to determine whether the chosen regressor should indeed be included. If the criterion accepts the chosen regressor, then the researcher applies the stepwise procedure to the remaining regressors but with ei in place of yi. More generally, after the regressors labeled j1,…, jk have been included in the model and the residuals ei have been computed, the researcher chooses jk+1 such that the correlation coefficient between ei and xi,jk+1 is the largest among the remaining K − k input variables, and performs least squares regression of ei on xi,jk+1, yielding a new set of residuals ẽi, which are used in the criterion to determine whether the regressor labeled jk+1 should be included. If the criterion rejects the regressor, then it is not included in the model, and the stepwise regression procedure terminates with the set of input variables xi,j1,…, xi,jk.
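The selection loop described above can be sketched as follows. This is an illustrative implementation, not code from the article: a fixed number of steps stands in for the variable selection criterion, and X and y are assumed to be already centered.

```python
import numpy as np

def forward_stepwise(X, y, n_steps):
    """Greedy forward selection: at each step, pick the remaining regressor
    most correlated with the current residuals, then refit the model on all
    selected regressors by least squares. A sketch assuming centered data;
    a fixed step count replaces the article's selection criterion."""
    n, K = X.shape
    selected = []
    residuals = y.copy()
    for _ in range(n_steps):
        remaining = [j for j in range(K) if j not in selected]
        # absolute sample correlation of each candidate with the residuals
        corrs = [abs(np.corrcoef(X[:, j], residuals)[0, 1]) for j in remaining]
        j_new = remaining[int(np.argmax(corrs))]
        selected.append(j_new)
        # refit by least squares on all selected regressors, then recompute residuals
        Xs = X[:, selected]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        residuals = y - Xs @ beta
    return selected, residuals
```

Refitting on all selected regressors at each step (rather than only regressing the residuals on the newest one) is the "second task" the article describes.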

A traditional variable selection criterion is based on the F test of H0: βj = 0 in the regression model

yi = βj1xi,j1 + … + βjkxi,jk + βjxij + εi

to determine whether the regressor labeled j should be added to the model that already contains the regressors labeled j1,…, jk. If the F test rejects H0 at significance level α, which is often chosen to be 5%, then the regressor labeled j is included in the model. Otherwise, βj is deemed to be not significantly different from 0, and therefore the corresponding regressor is excluded from the model. Note that because this test-based procedure carries out a sequence of F tests, the overall significance level can differ substantially from α. Such an F test of whether a particular regressor in a larger set of input variables has regression coefficient 0 is called a partial F test, and the corresponding test statistic is called a partial F statistic.
