Data Cleaning

Data cleaning, or data cleansing, is an important part of the process involved in preparing data for analysis. Data cleaning is a subset of data preparation, which also includes scoring tests, matching data files, selecting cases, and other tasks that are required to prepare data for analysis.

Missing and erroneous data can pose a significant problem to the reliability and validity of study outcomes. Many problems can be avoided through careful survey and study design. During the study, watchful monitoring and data cleaning can catch problems while they can still be fixed. At the end of the study, multiple imputation procedures may be used for data that are truly irretrievable.

The opportunities for data cleaning depend on the study design and data collection methods. At one extreme is the anonymous Web survey, with limited recourse in the case of errors and missing data. At the other extreme are longitudinal studies with multiple treatment visits and outcome evaluations. Conducting data cleaning during the course of a study allows the research team to obtain otherwise missing data and can prevent costly data cleaning at the end of the study. This entry discusses problems associated with data cleaning and their solutions.

Types of “Dirty Data”

Two types of problems are encountered in data cleaning: missing data and errors. The latter may be the result of respondent mistakes or data entry errors. The presence of “dirty data” reduces the reliability and validity of the measures. If responses are missing or erroneous, they will not be reliable over time. Because reliability sets the upper bound for validity, unreliable items reduce validity.

Missing Data

Missing data reduce the sample size available for the analyses. An investigator's research design may require 100 respondents in order to have sufficient power to test the study hypotheses. Substantial effort may be required to recruit and treat 100 respondents. At the end of the study, if there are 10 important variables, each missing only 5% of the time, the investigator may be left with far fewer respondents with complete data for the analyses: roughly 60 if missingness is independent across variables (0.95^10 ≈ 0.60), and somewhat more if the same respondents tend to skip several items. Missing data thus effectively reduce the power of the study. Missing data can also introduce bias because respondents may leave blank any questions that are embarrassing or that could reveal illegal activity. For example, if some respondents do not answer items about income, place of birth (for immigrants without documents), or drug use, the remaining cases with complete data are a biased sample that is no longer representative of the population.
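The loss of complete cases can be worked through with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes that each variable is missing independently of the others, which real data often violate (when the same respondents skip several items, more complete cases survive than this estimate suggests).

```python
def expected_complete_cases(n_respondents, n_variables, missing_rate):
    """Expected number of respondents with no missing values.

    Assumption (not from the entry): missingness is independent across
    variables, so the chance of a fully complete record is
    (1 - missing_rate) ** n_variables.
    """
    p_complete = (1.0 - missing_rate) ** n_variables
    return n_respondents * p_complete


# 100 respondents, 10 variables, each missing 5% of the time
print(round(expected_complete_cases(100, 10, 0.05)))  # 60
```

Under this assumption, even a small per-variable missing rate compounds quickly once an analysis requires complete data on many variables.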

Data Errors

Data errors are also costly to the study because lowered reliability attenuates the observed relationships among variables. Respondents may make mistakes, and errors can be introduced during data entry. Data errors are more difficult to detect than missing data. Table 1 shows examples of missing data (ethnicity, income), incomplete data (date and place of birth), and erroneous data (sex).
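How lowered reliability attenuates results can be illustrated with the classical attenuation formula from test theory, in which the observed correlation equals the true correlation multiplied by the square root of the product of the two measures' reliabilities. The sketch below is a minimal illustration; the function name and the numeric values are hypothetical and are not taken from this entry.

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Observed correlation implied by the classical attenuation formula:
    r_observed = r_true * sqrt(reliability_x * reliability_y).
    """
    return true_r * math.sqrt(reliability_x * reliability_y)


# Hypothetical values: a true correlation of 0.50 measured with
# reliabilities of 0.70 and 0.80
print(round(attenuated_correlation(0.50, 0.70, 0.80), 3))  # 0.374
```

With perfectly reliable measures (both reliabilities equal to 1) the observed and true correlations coincide; any data errors that lower reliability shrink the observed value toward zero.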

Causes

All measuring instruments are flawed, whether they are used in the physical or the social sciences. Even with the best intentions, everyone makes errors. In the social sciences, most measures are self-report. Potentially embarrassing items can result in biased responses. Lack of motivation is also an important source of error. For example, respondents are highly motivated in high-stakes testing such as the College Board exams but probably do not bring the same keen interest to a research study.

...
