
Multiple Imputation

Multiple imputation (MI) is something of a misnomer. The phrase is best understood as the name for a post-imputation variance estimation tool that involves repetitions of the imputation process. The father of multiple imputation, Donald Rubin, originally envisioned MI as a tool for the preparation of public use files (PUFs). He advocated that data publishers use MI in order to simplify and improve the analyses conducted by PUF consumers. So far, few data publishers have adopted MI. It has seen more use in highly multivariate analyses with complex missing data structures, such as the scoring of standardized tests with adaptive item sampling. In that literature, the multiple imputations are most often referred to as plausible values.

Motivation

MI is most commonly used in conjunction with Bayesian imputation methods, in which samples drawn from the posterior distribution of the missing data given the observed data are used to fill in the missing values. However, as long as there is some element of randomness in the imputation process, one can imagine executing the process multiple times and storing the answers from each application (i.e. replication). The variance of a statistic of interest across these replications can then be calculated. This variance can be added to the “naive” estimate of variance (obtained by treating all imputed data as if they were observed) to produce a variance estimate for the statistic that reflects the uncertainty due to both sampling and imputation. That is the essence of multiple imputation.
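The replicate-and-store idea above can be sketched in code. The example below is illustrative and not from the original article: it uses stochastic regression imputation (regression prediction plus a random residual draw) as the imputation method with a random element, repeats it m times, and computes the between-replication variance of the sample mean. All variable names and the simulated data are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative): y depends linearly on x, ~30% of y missing.
n, m = 200, 5
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
miss = rng.random(n) < 0.3
y_obs = np.where(miss, np.nan, y)

def impute_once(x, y_obs, rng):
    """One stochastic imputation: fit OLS on the observed cases, then
    fill each missing y with its prediction plus a random residual draw."""
    obs = ~np.isnan(y_obs)
    X = np.column_stack([np.ones(obs.sum()), x[obs]])
    beta, *_ = np.linalg.lstsq(X, y_obs[obs], rcond=None)
    resid = y_obs[obs] - X @ beta
    sigma = resid.std(ddof=2)
    y_filled = y_obs.copy()
    n_mis = int((~obs).sum())
    y_filled[~obs] = (beta[0] + beta[1] * x[~obs]
                      + rng.normal(scale=sigma, size=n_mis))
    return y_filled

# m completed data sets; the statistic of interest (here, the mean of y)
# varies across replications, and that between-replication variability is
# what MI converts into an imputation-variance term.
completed = [impute_once(x, y_obs, rng) for _ in range(m)]
means = np.array([c.mean() for c in completed])
between = means.var(ddof=1)  # between-imputation variance of the mean
```

Because the residual draws differ across replications, the m completed data sets agree on the observed values but disagree on the imputed ones, which is exactly what makes the between-replication variance informative.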

Controversy

There is a long-standing heated debate within the community of survey research statisticians about the utility of MI for analyses unanticipated by the data publisher. It is easy to find examples where mechanical application of MI results in over- or undercorrection. Rubin has a theorem that identifies the class of imputation procedures that can be used in conjunction with MI to obtain asymptotically valid inferences for a given statistic. He labels such imputation procedures as “proper.” However, a series of debates in the 1990s, culminating in a trio of 1996 papers, demonstrated that proper imputation methods are difficult to construct. Moreover, an imputation procedure that is proper for one analysis might be improper for another analysis.

Basic Formulae

Suppose that the entire imputation process of choice is repeated m times and that all m imputed values are stored along with the reported data. Conceptually, the process produces m completed data sets representing m replicates of this process. If there were originally p columns with missing data, then there will be mp corresponding columns in the new multiply imputed dataset. The user then applies his or her full-sample analysis procedure of choice m times, once to each set of p columns. Suppose that $\hat{\theta}_{I,k}$ is the point estimate of some parameter, $\theta$, based on the kth set of p columns. (The subscript, I, indicates employment of imputed data.) Also suppose that $\hat{V}_{I,k}$ is the variance estimate for $\hat{\theta}_{I,k}$ provided by the standard complex survey analysis software when applied to the kth set of p columns.

Assuming that the imputation method of choice has a stochastic component, such as imputation based on a linear regression model that predicts imputed values from covariates, the multiple imputations can be used to improve the point estimate and to provide better leverage for variance estimation. Rubin's point estimate is

$$\bar{\theta}_I = \frac{1}{m}\sum_{k=1}^{m} \hat{\theta}_{I,k},$$

and his variance estimate is

$$T = \bar{V}_I + \left(1 + \frac{1}{m}\right)B, \quad \text{where } \bar{V}_I = \frac{1}{m}\sum_{k=1}^{m} \hat{V}_{I,k} \text{ and } B = \frac{1}{m-1}\sum_{k=1}^{m}\left(\hat{\theta}_{I,k} - \bar{\theta}_I\right)^2.$$
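Rubin's combining rules can be sketched directly in code. The function below is an illustrative implementation, not from the article: it takes the m point estimates and their naive (within-imputation) variance estimates and returns the MI point estimate and total variance. The function name and the example replicate values are assumptions.

```python
import numpy as np

def rubin_combine(theta, v):
    """Combine m replicate estimates via Rubin's rules.

    theta : the m point estimates, one per completed data set
    v     : the m naive variance estimates from the same analyses
    """
    theta = np.asarray(theta, dtype=float)
    v = np.asarray(v, dtype=float)
    m = theta.size
    theta_bar = theta.mean()             # MI point estimate (mean of replicates)
    w = v.mean()                         # within-imputation variance
    b = theta.var(ddof=1)                # between-imputation variance
    t = w + (1.0 + 1.0 / m) * b          # Rubin's total variance
    return theta_bar, t

# Made-up replicate estimates for illustration:
est, var = rubin_combine([4.9, 5.2, 5.0, 5.1, 4.8],
                         [0.04, 0.05, 0.04, 0.05, 0.04])
# est ≈ 5.0, var ≈ 0.044 + 1.2 * 0.025 = 0.074
```

Note the (1 + 1/m) factor on the between-imputation term: it inflates the variance to account for using only a finite number m of imputations.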