
Matching is a statistical method that can be used to estimate quantities of interest that depend on missing, that is, unobserved, values of some variable Y. (As is made clear later in this entry, the variables with missing values in causal inference applications are subtly different from the observed outcome variable, which is commonly referred to as Y.) Schematically, matching works as follows. For each observation with a missing value of Y, find another observation that does not have a missing Y value but that is otherwise maximally similar to the initial observation in question. This similar observation is said to match the observation with the missing Y value. Now use the observed Y value from the matched observation to fill in the missing Y value. Matching can be done by selecting matching observations with or without replacement from the original dataset. It is also possible to match many observations to a single missing observation, in which case the mean of Y from the matching observations is typically used to fill in the missing Y value.
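The schematic procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method prescribed by this entry: the function name `match_impute`, the dictionary-based data layout, and the use of squared Euclidean distance as the measure of similarity are all assumptions made for the example. Matching here is done with replacement, and setting `k` greater than 1 fills each missing value with the mean of Y over the `k` closest matches.

```python
from statistics import mean


def match_impute(rows, covariates, outcome="y", k=1):
    """Fill in missing outcomes by matching with replacement.

    rows: list of dicts; a missing outcome is stored as None.
    covariates: keys used to judge similarity (assumed numeric).
    k: number of matched observations averaged per missing value.
    """
    # Candidate matches: observations whose outcome is observed.
    donors = [r for r in rows if r[outcome] is not None]
    filled = []
    for r in rows:
        new = dict(r)
        if r[outcome] is None:
            # Assumed similarity measure: squared Euclidean distance
            # on the covariates (smaller means more similar).
            def distance(d, target=r):
                return sum((d[c] - target[c]) ** 2 for c in covariates)

            matches = sorted(donors, key=distance)[:k]
            # With k == 1 this copies the single match's Y value;
            # with k > 1 it uses the mean of Y over the matches.
            new[outcome] = mean(d[outcome] for d in matches)
        filled.append(new)
    return filled


# Hypothetical toy data: one covariate x, outcome y (None = missing).
toy = [
    {"x": 1.0, "y": 10.0},
    {"x": 2.0, "y": 20.0},
    {"x": 5.0, "y": 50.0},
    {"x": 1.2, "y": None},
    {"x": 4.0, "y": None},
]
one_to_one = match_impute(toy, covariates=["x"], k=1)
many_to_one = match_impute(toy, covariates=["x"], k=2)
```

Because the donors are never removed from the pool, the same observed row can serve as the match for several missing rows. Matching without replacement would instead remove each donor once it has been used.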

Matching can be applied to a variety of missing data problems, from estimating the population mean of Y to estimating causal effects. The examples below make this clearer. Matching is not a panacea: matching methods rely on assumptions that can be tested only with auxiliary data, additional assumptions, or both. The key assumptions of conditional ignorability and overlap are discussed later in this entry. Matching can be implemented in a wide variety of ways; a discussion of particular matching methods and their statistical properties is beyond the scope of this entry.

Examples

The easiest way to begin to understand how matching works is to walk through some relatively simple examples.

Estimating a Population Mean

Consider a situation where we are interested in estimating the fraction of Republicans in a population. We sample 20 individuals from this population and administer a face-to-face survey. The pollster records each respondent's gender (0 = male, 1 = female), race (0 = nonwhite, 1 = white), and partisanship (0 = non-Republican, 1 = Republican). All respondents report their gender and race accurately; however, some respondents do not report their partisanship. Respondents who report their partisanship will be called reporters, and those who do not will be called nonreporters. These data are summarized in Table 1.

We would like to use our sample data to estimate the fraction of individuals in the population who self-identify as Republicans. The simplest way to do this is to take the sample average of the partisanship variable among the reporters. Doing so, we would estimate that 44% of the population are Republican identifiers. Note that if Republicans are more likely than non-Republicans to answer the partisanship question, this simple approach will generally yield estimates of Republican partisanship that are biased upward. Looking at the true (but partially unobserved) partisanship of each individual in Table 1 and the associated sample average, we see that the simple approach of ignoring the missing data yields an estimate that is 9 percentage points too high.
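The upward bias of the reporters-only average can be illustrated with a small Python sketch. The data below are hypothetical and are not the values in Table 1; they are constructed so that Republicans answer the partisanship question more often than non-Republicans. The names `naive_estimate` and `true_mean` are chosen for this example only.

```python
# Hypothetical toy data, NOT the values from Table 1.
# Each tuple is (true_partisanship, reported), where `reported` says
# whether the respondent answered the partisanship question.
# Here 3 of 4 Republicans report, but only 2 of 6 non-Republicans do.
people = [
    (1, True), (1, True), (1, True), (1, False),
    (0, True), (0, True), (0, False), (0, False), (0, False), (0, False),
]


def naive_estimate(people):
    """Sample average of partisanship among reporters only."""
    observed = [y for y, reported in people if reported]
    return sum(observed) / len(observed)


def true_mean(people):
    """Average over everyone; not computable in practice, because the
    partisanship of nonreporters is unobserved."""
    return sum(y for y, _ in people) / len(people)


# Positive bias: the reporters-only average overstates Republican share.
bias = naive_estimate(people) - true_mean(people)
```

In this toy sample the reporters-only average is 3/5, while the average over all ten individuals is 4/10, so ignoring the missing data overstates the Republican share.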

...
