
Perturbation Methods

Perturbation methods are procedures that are applied to data sets in order to protect the confidentiality of survey respondents. The goal of statistical disclosure control (SDC) is to provide accurate and useful data, especially public use data files, while also protecting confidentiality. Various methods have been suggested, and these may be classified in two ways: (1) methods that do not alter the original data but reduce the amount of data released; and (2) methods that alter individual values while maintaining the reported level of detail. The first set of methods may be described as data coarsening; the second set of methods may be described as statistical perturbation methods.

Perturbation methods have the advantage of maintaining more of the actual data collected from survey respondents than data coarsening. Variables selected for perturbation may be those containing sensitive information about a respondent (such as income) or those that may potentially identify a respondent (such as race). These methods can be used for data released at the microdata level (individual respondent records) or at the tabular level (in the form of frequency tables). Depending on the data, their values, and the method of data release, researchers may select one perturbation method over another, use multiple perturbation techniques, or use these techniques in addition to data coarsening.

Examples of perturbation methods are described below, with a focus primarily on perturbation of microdata. This is not an exhaustive list, as new methods are continually being developed.

Data swapping. In this method, selected records are paired with other records in the file based on a predetermined set of characteristics. Data values from some identifying or sensitive variables are then swapped between the two records. The sampling rate (the share of records selected for swapping) is designed to protect the confidentiality of the data without affecting the usability of the data set. This method introduces uncertainty to an intruder as to which reported values were provided by a particular respondent.
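The pairing and swapping steps can be illustrated with a minimal sketch. The function name `swap_values`, its parameters, and the example records are hypothetical and chosen only for illustration; the sketch assumes records are paired at random among those that match exactly on the chosen characteristics.

```python
import random


def swap_values(records, match_keys, swap_var, rate, seed=0):
    """Return a copy of `records` with `swap_var` swapped within matched pairs."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]  # leave the input untouched

    # Group record indices by their values on the matching characteristics.
    groups = {}
    for i, rec in enumerate(out):
        groups.setdefault(tuple(rec[k] for k in match_keys), []).append(i)

    for idx in groups.values():
        rng.shuffle(idx)
        # `rate` is the share of records in the group selected for swapping
        # (an assumed reading of "sampling rate"); records are swapped two
        # at a time, so round down to whole pairs.
        n_pairs = int(len(idx) * rate) // 2
        for p in range(n_pairs):
            a, b = idx[2 * p], idx[2 * p + 1]
            out[a][swap_var], out[b][swap_var] = out[b][swap_var], out[a][swap_var]
    return out


# Hypothetical records: match on age group and sex, swap county.
people = [
    {"age_group": "30-39", "sex": "F", "county": "A"},
    {"age_group": "30-39", "sex": "F", "county": "B"},
    {"age_group": "30-39", "sex": "F", "county": "C"},
    {"age_group": "30-39", "sex": "F", "county": "D"},
    {"age_group": "60-69", "sex": "M", "county": "E"},
]
masked = swap_values(people, ["age_group", "sex"], "county", rate=1.0)
```

Because values are only exchanged between records, the marginal distribution of the swapped variable is unchanged; a record with no match (the last one here) keeps its reported value.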

Rank swapping. This method is similar to data swapping. With rank swapping, pairs are created that do not exactly match on the selected characteristics but are close in terms of the ranks of the characteristics.
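A minimal sketch of pairing by rank, under stated assumptions: records are ranked on a single characteristic, and each record may be paired only with a not-yet-paired record at most `window` ranks above it. The function name `rank_swap`, the `window` parameter, and the example records are hypothetical.

```python
import random


def rank_swap(records, rank_var, swap_var, window, seed=0):
    """Swap `swap_var` between records that are close in rank on `rank_var`."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]
    # Record indices ordered by rank on the pairing characteristic.
    order = sorted(range(len(records)), key=lambda i: records[i][rank_var])
    used = set()
    for r, i in enumerate(order):
        if i in used:
            continue
        # Unpaired records at most `window` ranks above record i.
        candidates = [j for j in order[r + 1 : r + 1 + window] if j not in used]
        if not candidates:
            continue
        j = rng.choice(candidates)
        out[i][swap_var], out[j][swap_var] = records[j][swap_var], records[i][swap_var]
        used.update((i, j))
    return out


# Hypothetical records: pair on the rank of age, swap income.
sample = [
    {"age": 34, "income": 41000},
    {"age": 29, "income": 38000},
    {"age": 61, "income": 72000},
    {"age": 35, "income": 45000},
    {"age": 58, "income": 66000},
    {"age": 30, "income": 39000},
]
swapped = rank_swap(sample, "age", "income", window=2)
```

A smaller `window` keeps swapped partners more alike, which preserves relationships between variables better but offers less protection.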

Adding random noise. This method is a way of masking sensitive items by adding or multiplying by random numbers. The random numbers are selected from a pre-specified distribution with a mean of 0 and a selected standard deviation, so that the value is altered as little as possible but enough to prevent reidentification.
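As a minimal sketch of the additive case, assuming the pre-specified distribution is normal: each value receives an independent draw with mean 0 and standard deviation `sd`. The function name `add_noise` and the example incomes are hypothetical; the multiplicative case is not shown.

```python
import random


def add_noise(values, sd, seed=0):
    """Add independent normal noise with mean 0 and standard deviation `sd`."""
    rng = random.Random(seed)  # seeded so the masking is reproducible
    return [v + rng.gauss(0.0, sd) for v in values]


# Hypothetical reported incomes.
incomes = [41000, 38000, 72000, 45000, 66000]
noisy = add_noise(incomes, sd=1000)
```

The choice of `sd` carries the trade-off described above: a larger value gives more protection but distorts the released values more.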

Replacing values with imputed data. With this method, specified sensitive values on a randomly selected set of records are replaced with imputed values from other, similar records. This approach will introduce some uncertainty as to whether the sensitive items on a record were actually reported by a particular respondent.
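One simple way to sketch this is a hot-deck style replacement, in which "similar" is assumed to mean agreement on a set of classing variables and the imputed value is borrowed from a randomly chosen donor in the same class. The function name `replace_with_donor_values`, its parameters, and the example records are hypothetical.

```python
import random


def replace_with_donor_values(records, class_keys, sensitive_var, rate, seed=0):
    """Replace `sensitive_var` on a random subset of records with a donor's value.

    Returns the masked copy and the indices of the records that were replaced.
    """
    rng = random.Random(seed)
    out = [dict(r) for r in records]

    # Records sharing the same values on `class_keys` are treated as similar.
    classes = {}
    for i, rec in enumerate(records):
        classes.setdefault(tuple(rec[k] for k in class_keys), []).append(i)

    replaced = []
    for i, rec in enumerate(records):
        donors = [j for j in classes[tuple(rec[k] for k in class_keys)] if j != i]
        # Each record with at least one donor is selected with probability `rate`.
        if donors and rng.random() < rate:
            out[i][sensitive_var] = records[rng.choice(donors)][sensitive_var]
            replaced.append(i)
    return out, replaced


# Hypothetical records: class on occupation, replace income.
workers = [
    {"occupation": "nurse", "income": 52000},
    {"occupation": "nurse", "income": 55000},
    {"occupation": "nurse", "income": 51000},
    {"occupation": "pilot", "income": 120000},
]
masked_workers, replaced_idx = replace_with_donor_values(
    workers, ["occupation"], "income", rate=1.0
)
```

In a released file the list of replaced indices would be withheld, since the uncertainty about which records were altered is what provides the protection.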

Data synthesis. Values are replaced with those predicted from models developed to generate multiple imputations that allow for valid statistical inference. All values for all records may be replaced (full synthesis), or a subset of variables on a subset of records (partial synthesis).
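A deliberately simplified sketch of partial synthesis for one variable follows. It assumes a simple linear regression of the sensitive variable on one predictor and draws `m` synthetic copies from the fitted model. The function name `synthesize` and the example data are hypothetical, and, unlike proper multiple imputation, this sketch does not propagate uncertainty in the estimated coefficients.

```python
import random
import statistics


def synthesize(x, y, m, seed=0):
    """Return `m` synthetic versions of `y` drawn from a fitted line plus noise."""
    rng = random.Random(seed)

    # Ordinary least squares fit of y on x.
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x

    # Residual standard deviation (n - 2 degrees of freedom).
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma = (sum(e * e for e in residuals) / (len(x) - 2)) ** 0.5

    # Each synthetic copy replaces every y with a draw around its prediction.
    return [
        [intercept + slope * xi + rng.gauss(0.0, sigma) for xi in x]
        for _ in range(m)
    ]


# Hypothetical data: years of experience and income.
experience = [1, 3, 5, 8, 12, 20]
income = [30000, 36000, 41000, 52000, 60000, 85000]
synthetic_sets = synthesize(experience, income, m=5)
```

An analyst would fit the same model on each of the `m` released copies and combine the results, which is what makes inference from synthetic data possible.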

Blurring. In this method, small groups of records are formed based on the proximity of their values of a sensitive variable or other variables related to the sensitive variable. Aggregate (usually average) values are calculated from the individual responses for the sensitive item in that group. The aggregate value may be used in place of one (for example, the middle) or all individual responses for the group on the released data file.
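The grouping and averaging can be sketched as follows, assuming groups are formed from consecutive values of the sensitive variable itself and that the group mean replaces all responses in the group (rather than only the middle one). The function name `blur`, the group size `k`, and the example values are hypothetical.

```python
import statistics


def blur(values, k):
    """Replace each value with the mean of its group of about `k` nearest values."""
    # Indices sorted by value, so neighbours in `order` are close in value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[s : s + k] for s in range(0, len(order), k)]

    # Fold a short final group into the previous one so no group is below k.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)

    out = list(values)
    for group in groups:
        average = statistics.fmean([values[i] for i in group])
        for i in group:
            out[i] = average
    return out


# Hypothetical values of a sensitive item.
reported = [10, 50, 12, 48, 11, 49, 100]
blurred = blur(reported, k=3)
# blurred == [11.0, 61.75, 11.0, 61.75, 11.0, 61.75, 61.75]
```

Replacing values with group means preserves the overall total of the variable while hiding each individual response among at least `k` records.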

...
