
Imputation, also called ascription, is a statistical process that statisticians, survey researchers, and other scientists use to replace data that are missing from a data set because of item nonresponse. Researchers use imputation to improve the accuracy of analyses based on their data sets.

Missing data are a common problem in most databases, and there are several approaches to handling them. Imputation fills in the missing values, and the resulting completed data set is then analyzed as if it were complete. Multiple imputation reflects the added uncertainty that arises because imputed values are not actual values, yet it still allows complete-data methods to be used to analyze each data set completed by imputation. In general, multiple imputation can lead to valid inferences from imputed data. Valid inferences are those that satisfy three frequentist criteria:

  • Approximately unbiased estimates of population estimands (e.g., means, correlation coefficients)
  • Interval estimates with at least their nominal coverage (e.g., 95% intervals for a population mean should cover the true population mean at least 95% of the time)
  • Tests of significance that reject at their nominal level or less frequently when the null hypothesis is true (e.g., a 5% test of a zero population correlation should reject at most 5% of the time when the population correlation is zero)

Among valid procedures, those that give the shortest intervals or most powerful tests are preferable.
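
The coverage criterion can be checked by simulation. The sketch below is illustrative only and is not taken from the entry: it assumes standard normal data, a fixed number of values deleted completely at random, and a normal-approximation multiplier of 1.96. The function name naive_interval_coverage and all parameter values are hypothetical. It compares a complete-case interval with an interval computed after single mean imputation, where the completed data set is analyzed as if every value were real.

```python
import math
import random
import statistics


def naive_interval_coverage(n=100, n_missing=40, reps=2000, seed=0):
    """Return (complete-case coverage, mean-imputation coverage) for nominal 95% intervals."""
    rng = random.Random(seed)
    true_mean, z = 0.0, 1.96  # z is the normal-approximation multiplier (an assumption)
    hits_cc = hits_imp = 0
    for _ in range(reps):
        y = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        missing = set(rng.sample(range(n), n_missing))  # values deleted at random
        observed = [v for i, v in enumerate(y) if i not in missing]
        obs_mean = statistics.fmean(observed)

        # Complete-case interval: uses only the observed values.
        se_cc = statistics.stdev(observed) / math.sqrt(len(observed))
        hits_cc += abs(obs_mean - true_mean) <= z * se_cc

        # Single mean imputation, then analysis as if all n values were real.
        completed = observed + [obs_mean] * n_missing
        se_imp = statistics.stdev(completed) / math.sqrt(n)
        hits_imp += abs(statistics.fmean(completed) - true_mean) <= z * se_imp
    return hits_cc / reps, hits_imp / reps


coverage_cc, coverage_imp = naive_interval_coverage()
print(f"complete-case coverage: {coverage_cc:.3f}")
print(f"mean-imputation coverage: {coverage_imp:.3f}")
```

Under these assumptions the mean-imputation interval should cover the true mean noticeably less often than 95% of the time, because the imputed values shrink the estimated standard error. This is the kind of understated uncertainty that multiple imputation is designed to correct.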

Missing-Data Mechanisms and Ignorability

Missing-data mechanisms were formalized by Donald B. Rubin in the mid-1970s, and subsequent statistical literature distinguishes three cases: (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) not missing at random (NMAR). This terminology is consistent with much older terminology in classical experimental design for completely randomized, randomized, and not randomized studies. Letting Y be the N (units) by P (variables) matrix of complete data and R be the N by P matrix of indicator variables for observed and missing values in Y, the missing-data mechanism gives the probability of R given Y and any parameters ξ governing this process: p(R|Y, ξ).
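
As a small illustration of the notation, the sketch below builds R from a toy Y. The coding (1 for observed, 0 for missing), the use of None to mark a missing value, the function name response_indicators, and the data themselves are all assumptions made for this example.

```python
def response_indicators(Y):
    """Return R with R[i][j] = 1 if Y[i][j] is observed and 0 if it is missing (None)."""
    return [[0 if value is None else 1 for value in row] for row in Y]


# Hypothetical data: N = 3 units, P = 3 variables (age, baseline BP, final BP).
Y = [
    [54, 142.0, None],
    [61, None, 128.0],
    [47, 118.0, 121.0],
]
R = response_indicators(Y)
for row in R:
    print(row)
```

R has the same N by P shape as Y, and the missing-data mechanism is a probability model for this matrix.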

MCAR

Here, “missingness” does not depend on any data values, missing or observed: p(R|Y, ξ) = p(R|ξ). MCAR can be unrealistically restrictive and can be contradicted by the observed data, for example, when men are observed to have a higher rate of missing data on post-operative blood pressure than are women.
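
A check of this kind can be sketched by comparing missingness rates across an observed grouping variable. The helper missing_rate_by_group and the tiny data set are hypothetical; with real data, a formal comparison of the rates would be needed before concluding that MCAR is contradicted.

```python
from collections import defaultdict


def missing_rate_by_group(groups, values):
    """Return the fraction of missing (None) values within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_missing, n_total]
    for g, v in zip(groups, values):
        counts[g][0] += v is None
        counts[g][1] += 1
    return {g: n_missing / n_total for g, (n_missing, n_total) in counts.items()}


# Hypothetical post-operative blood pressure readings; None marks a missing value.
sex = ["M", "M", "M", "M", "F", "F", "F", "F"]
bp = [None, 135.0, None, None, 122.0, None, 118.0, 127.0]
rates = missing_rate_by_group(sex, bp)
print(rates)
```

A large gap between the group rates suggests that missingness depends on an observed variable, which is inconsistent with MCAR.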

MAR

Missingness, in this case, depends only on observed values, not on any missing values: p(R|Y, ξ) = p(R|Yobs, ξ), where Yobs are the observed values in Y, Y = (Yobs, Ymis), with Ymis the missing values in Y. Thus, if the value of blood pressure at the end of a clinical trial is more likely to be missing when some previously observed values of blood pressure are high, and given these, the probability of missingness is independent of the missing value of blood pressure at the end of the trial, the missingness mechanism is MAR.
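
The blood pressure example can be simulated. The sketch below is an illustration under stated assumptions, not a method described in the entry: the linear relationship between baseline and final blood pressure, the logistic form of the missingness probability, and every numeric value are hypothetical. It compares the complete-case mean of the final measurement with the mean after filling in missing values from a regression on baseline.

```python
import math
import random
import statistics


def simulate_mar(n=20000, seed=1):
    """Return (full-data mean, complete-case mean, regression-imputed mean) of final BP."""
    rng = random.Random(seed)
    baseline = [rng.gauss(130.0, 15.0) for _ in range(n)]
    final = [26.0 + 0.8 * b + rng.gauss(0.0, 8.0) for b in baseline]

    # MAR: the chance that final BP is missing depends only on the observed baseline.
    observed = [
        rng.random() >= 1.0 / (1.0 + math.exp(-(b - 130.0) / 10.0)) for b in baseline
    ]

    xs = [b for b, o in zip(baseline, observed) if o]
    ys = [f for f, o in zip(final, observed) if o]

    # Least-squares line of final on baseline, fitted to the complete cases.
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Fill each missing final value with its predicted value given baseline.
    completed = [
        f if o else intercept + slope * b
        for b, f, o in zip(baseline, final, observed)
    ]
    return statistics.fmean(final), y_bar, statistics.fmean(completed)


full_mean, cc_mean, imputed_mean = simulate_mar()
print(f"full data: {full_mean:.2f}")
print(f"complete cases: {cc_mean:.2f}")
print(f"regression-imputed: {imputed_mean:.2f}")
```

In this setup the complete-case mean should fall below the full-data mean, while conditioning on the observed baseline brings the estimate back close to it. Filling in predicted values gives a single imputation only, so it does not by itself reflect the uncertainty in the imputed values.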

NMAR

If, even given the observed values, missingness still depends on data values that are missing, the missing data are NMAR: p(R|Y, ξ) ≠ p(R|Yobs, ξ). This could be the case, for example, if people with higher final blood pressure tend to be more likely to be missing this value than people with lower final blood pressure, even though they have the exact same observed values of race, education, and all previous blood pressure measurements. The richer the data set is in terms of observed variables, the more plausible the MAR assumption becomes.
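
A minimal simulation of this situation, again under hypothetical assumptions (normally distributed final blood pressure, a logistic missingness probability that rises with the final value itself, and no informative observed covariates), might look like this:

```python
import math
import random
import statistics


def simulate_nmar(n=20000, seed=2):
    """Return (full-data mean, observed mean, fraction observed) of final BP."""
    rng = random.Random(seed)
    final = [rng.gauss(130.0, 15.0) for _ in range(n)]

    # NMAR: the chance that a value is missing rises with the value itself.
    kept = [
        y for y in final
        if rng.random() >= 1.0 / (1.0 + math.exp(-(y - 130.0) / 10.0))
    ]
    return statistics.fmean(final), statistics.fmean(kept), len(kept) / n


full_mean, observed_mean, fraction_observed = simulate_nmar()
print(f"full data: {full_mean:.2f}")
print(f"observed only: {observed_mean:.2f}")
print(f"fraction observed: {fraction_observed:.2f}")
```

Because every unit here shares the same observed information, nothing in the observed data can be used to correct the downward shift in the observed mean. Correcting it requires an assumption about the mechanism itself.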

...
