
Missing Data, Imputation of

Imputation involves replacing missing values, or missings, with estimated values. In a sense, imputation is a prediction solution. Besides ignoring the missings when possible, there are three options for handling missing data: deleting, imputing, and segmenting. The general principle is to delete when the data are expendable, impute when the data are precious, and segment on missing versus nonmissing when this is informative. When measured against deletion, imputation often affords more accurate results. This entry discusses the differences between imputing and deleting, the types of missings, the criteria for preferring imputation, and various imputation techniques. It closes with application suggestions.

Impute or Delete

The trade-off is between convenience and bias. There are two choices for deletion (casewise or pairwise) and several approaches to imputation. Casewise deletion omits the entire observation (or case) from the analysis of all the variables when a value is missing for any one variable. This sacrifices the partial information contained in the other variables, either for convenience or to accommodate certain statistical techniques. Techniques such as structural equation modeling may require complete data for all the variables, so only casewise deletion is possible for them. Pairwise deletion omits observations on a variable-by-variable basis. For techniques like calculating correlation coefficients, pairwise deletion leverages the partial information in the observations, which can be advantageous for small sample sizes and when missings are not at random.
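The difference between the two deletion choices can be illustrated with a minimal sketch. The toy data set, the variable names x, y, and z, and the helper functions below are hypothetical and are not part of the entry; None marks a missing value.

```python
from math import sqrt

# Hypothetical toy data set: five observations on three variables.
rows = [
    {"x": 1.0, "y": 2.0, "z": 1.0},
    {"x": 2.0, "y": None, "z": 3.0},
    {"x": 3.0, "y": 6.0, "z": None},
    {"x": 4.0, "y": 8.0, "z": 4.0},
    {"x": 5.0, "y": 10.0, "z": 6.0},
]


def casewise_delete(rows):
    """Keep only observations that are complete on every variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise_pairs(rows, a, b):
    """Keep observations that are complete on the two variables in use."""
    return [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / sqrt(var_a * var_b)


complete_cases = casewise_delete(rows)
xy_pairs = pairwise_pairs(rows, "x", "y")
xz_pairs = pairwise_pairs(rows, "x", "z")

# Casewise deletion drops any observation with a missing value on any variable;
# pairwise deletion drops it only from the correlations that involve that variable.
print(len(complete_cases), len(xy_pairs), len(xz_pairs))
print(pearson(xy_pairs), pearson(xz_pairs))
```

In this sketch, casewise deletion leaves fewer observations for every correlation, whereas pairwise deletion retains each observation for the variable pairs on which it is complete. Each pairwise correlation may then rest on a different subset of observations.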

Imputation is usually the more advantageous technique when (a) the missings are not random, (b) the missings represent a large proportion of the data set, or (c) the data set is small or otherwise precious. If the missings do not occur at random, which is the most common situation, then deleting can create significant bias. For some situations, it is possible to repair the bias through weighting—as in poststratification for surveys. If the data set is small or otherwise precious, then deleting can severely reduce the statistical power or value of the data analysis.

Imputation can repair the missing data by creating one or more versions of how the data set should appear. By leveraging external knowledge, good technique, or both, it is possible to reduce the bias due to missing values. Some techniques offer a quick improvement over deletion. Software is making these techniques faster and sharper. However, improved software also facilitates more low-quality data analysis, because it enables analysts who lack appropriate training or sufficient governance to apply these techniques. Even with software, training, and experience, imputation may require as much effort as building the model that uses the imputed values.
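A small, idealized sketch can show how deletion biases an estimate when the missings are not random and how an imputation can reduce that bias. The data, the variable names, and the choice of regression imputation are assumptions made for illustration only; the entry does not prescribe this technique here, and real data would include noise, so the recovery would not be exact.

```python
# Hypothetical data: y depends on x, and y is missing whenever x is large,
# so the missings are not random.
xs = list(range(1, 11))
true_ys = [2.0 * x for x in xs]
observed_ys = [y if x < 8 else None for x, y in zip(xs, true_ys)]


def mean(values):
    return sum(values) / len(values)


# Deletion: estimate the mean of y from the complete cases only.
complete = [(x, y) for x, y in zip(xs, observed_ys) if y is not None]
deleted_mean = mean([y for _, y in complete])

# Regression imputation (illustrative choice): fit y = a + b * x by least
# squares on the complete cases, then predict y where it is missing.
mean_x = mean([x for x, _ in complete])
mean_y = mean([y for _, y in complete])
b = sum((x - mean_x) * (y - mean_y) for x, y in complete) / sum(
    (x - mean_x) ** 2 for x, _ in complete
)
a = mean_y - b * mean_x
imputed_ys = [y if y is not None else a + b * x for x, y in zip(xs, observed_ys)]
imputed_mean = mean(imputed_ys)

true_mean = mean(true_ys)
print(true_mean, deleted_mean, imputed_mean)
```

Because the large values of y are the ones that are missing, the mean computed after deletion falls below the true mean. The imputed version uses the observed relationship between x and y to fill in the missings, which brings the estimate back toward the true mean in this noise-free example.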

Categorizing Missingness

Missingness can be categorized in two ways: by the physical structure of the missings and by the underlying nature of the missingness. First, the structure of the missings can be due to item or unit missingness, a merge of structurally different data sets, or limitations attributable to the data collection tools. Item missingness refers to the situation in which a single value is missing for a particular observation, and unit missingness refers to the situation in which all the values for an observation are missing.
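The distinction between item and unit missingness can be sketched as a simple classification of observations. The function name, the labels, and the survey-style variables below are hypothetical; None marks a missing value. As an assumption of this sketch, any observation with some but not all values missing is labeled as item missingness.

```python
def classify_missingness(observation):
    """Label an observation as 'unit', 'item', or 'complete'."""
    missing = [value is None for value in observation.values()]
    if all(missing):
        return "unit"  # every value for the observation is missing
    if any(missing):
        return "item"  # some, but not all, values are missing
    return "complete"


# Hypothetical observations from a survey with three questions.
observations = [
    {"age": 34, "income": 52000, "region": "north"},
    {"age": 41, "income": None, "region": "south"},
    {"age": None, "income": None, "region": None},
]

labels = [classify_missingness(obs) for obs in observations]
print(labels)
```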

Second, missings can be categorized by the underlying nature of the missingness. The three categories are (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR), summarized in Table 1 and discussed below.

...
