Data Management

Sarah Boslaugh

doi:10.4135/9781412953948

Edited by:
Sarah Boslaugh
In: Encyclopedia of Epidemiology
Chapter DOI: https://doi.org/10.4135/9781412953948.n100
Subject: Epidemiology & Biostatistics, Public Health (general), Public Health Research Methods

Data analysts often say that they spend 80% of their time getting data ready to analyze and 20% or less actually analyzing them. Because courses in epidemiology and public health that include data analysis rarely reflect this emphasis, a gap results that tends to be filled only through practical experience. Yet data are an omnipresent part of life in the 21st century. Electronic data are in our pockets, in small databases residing on our memory sticks, and in cell phones and personal digital assistants. They are on our desktops, residing in our computers’ personal information managers. They are in spreadsheets and relational databases on our desktops or on a computer server.

Though the specifics of the data manager's task vary depending on the software we use, a handful of basic principles are common to all systems. This entry looks at issues regarding existing data, designing new databases, and formulating questions about data. It also discusses object-oriented database structures, database utilization with Web-enabled remote servers, and data security.

Studying Existing Data

Often, we work with data we did not collect ourselves. Even if we were involved in the data collection process, many other people may have had input into both the design and the content of those files. When we receive data files from other hands, we first need to inspect each file carefully with two principal concerns in mind: Are these data correct and undamaged? And what issues exist within these files?

Years of experience have taught the lesson that nothing can be assumed, not even that the correct data file has been provided. We should determine the dates of both file creation and most recent update, plus the numbers of cases and variables in each file, and compare these item for item with the specifications (contained in the Manual of Procedures, the ‘MOP’) supplied by the people who provided the data sets. The absence of a codebook should be taken as a warning to inspect the file with particular care. We can learn a great deal by running boxplots, violin plots, and panel plots to obtain a quick visual depiction of how each variable is distributed, along with basic descriptive statistics or frequencies for every variable in the entire set of files, and perhaps cross-tabulations for selected variable pairs. All these approaches help locate file transfer problems (e.g., a variable that was imported in the wrong format or is entirely missing); peculiarities within the data, especially values that are outside the expected range; and entries that are apparently meaningless (e.g., a code for a categorical variable that does not translate to any defined category).
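A first-pass inspection of this kind might be sketched in Python as follows. This is only an illustration: the file contents, the expected counts, the valid codes, and the valid ranges are hypothetical stand-ins for what a codebook or MOP would specify, and the function name inspect_file is our own. The sketch uses only the standard library and leaves plotting aside.

```python
import csv
import io
from collections import Counter

# Hypothetical data file; in practice this would be read from disk.
RAW = """id,age,sex,sbp
1,34,F,118
2,51,M,135
3,47,X,122
4,290,F,
"""

# Hypothetical specifications, as a codebook or MOP might state them.
EXPECTED_CASES = 4
EXPECTED_VARIABLES = ["id", "age", "sex", "sbp"]
VALID_CODES = {"sex": {"F", "M"}}
VALID_RANGES = {"age": (0, 120), "sbp": (60, 260)}


def inspect_file(text):
    """Compare a CSV file with its specifications and list apparent problems."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    problems = []

    # File-level checks: numbers of cases and variables against the specification.
    if len(rows) != EXPECTED_CASES:
        problems.append(("file", "number of cases differs from specification"))
    if reader.fieldnames != EXPECTED_VARIABLES:
        problems.append(("file", "variables differ from specification"))

    # Frequencies for each categorical variable listed in the codebook.
    frequencies = {var: Counter(row[var] for row in rows) for var in VALID_CODES}

    for row in rows:
        # Codes that do not correspond to any defined category.
        for var, codes in VALID_CODES.items():
            if row[var] not in codes:
                problems.append((row["id"], f"{var}: undefined code {row[var]!r}"))
        # Values that are non-numeric or outside the expected range.
        for var, (low, high) in VALID_RANGES.items():
            value = row[var]
            if value is None or value == "":
                continue  # blank entry: a missing value, not a range error
            try:
                number = float(value)
            except ValueError:
                problems.append((row["id"], f"{var}: not numeric ({value!r})"))
                continue
            if not low <= number <= high:
                problems.append((row["id"], f"{var}: {value} outside {low}-{high}"))

    return {
        "n_cases": len(rows),
        "variables": reader.fieldnames,
        "frequencies": frequencies,
        "problems": problems,
    }


report = inspect_file(RAW)
for case_id, message in report["problems"]:
    print(case_id, message)
```

In this made-up file the check flags an undefined code for sex in case 3 and an implausible age in case 4. The blank blood pressure value is skipped here, because it is a missing value rather than a range error.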

Strategies for uncovering hidden issues about the data set will vary with each situation. However, the following procedures are nearly always appropriate: checking for duplicate cases, locating extreme or out-of-range data, and studying the amount and patterns of missing data. Very few data sets are entirely complete. Therefore, to best analyze a data file, we need to determine how many data are missing and why they are missing, and identify the patterns of missing data among variables. For purposes of analysis, we are obliged to think carefully about how we treat missing data, because such decisions can substantially affect our results. Cross-tabulations can reveal patterns of missing data involving pairs of variables, and we can write code that reveals the patterns among larger groups of variables. Knowing the amount and pattern of missing data within a file allows us to decide if we should consider some kind of interpolation or other substitution for missing data values. Note, however, that newer statistical modeling procedures often have vastly improved ways of handling missing values, so the consequences of varying missing value substitution algorithms are best addressed with the aid of a knowledgeable statistician.
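The checks for duplicate cases and for patterns of missing data across several variables can be sketched in a few lines of Python. The records, the variable names, and the three helper functions below are hypothetical and serve only as an illustration; None is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
RECORDS = [
    {"id": 1, "age": 34, "smoker": "N", "bmi": 22.5},
    {"id": 2, "age": None, "smoker": "Y", "bmi": None},
    {"id": 3, "age": 47, "smoker": None, "bmi": 27.1},
    {"id": 2, "age": None, "smoker": "Y", "bmi": None},
    {"id": 5, "age": None, "smoker": "N", "bmi": None},
]
VARIABLES = ["age", "smoker", "bmi"]


def duplicate_ids(records, key="id"):
    """Return the identifiers that occur more than once, in sorted order."""
    counts = Counter(record[key] for record in records)
    return sorted(k for k, n in counts.items() if n > 1)


def missing_counts(records, variables):
    """Count the missing values for each variable."""
    return {v: sum(record[v] is None for record in records) for v in variables}


def missing_patterns(records, variables):
    """Tabulate which combinations of variables are missing together.

    Each pattern is a tuple of the variable names missing in one record;
    the empty tuple stands for a complete record.
    """
    return Counter(
        tuple(v for v in variables if record[v] is None) for record in records
    )


print(duplicate_ids(RECORDS))
print(missing_counts(RECORDS, VARIABLES))
for pattern, n in missing_patterns(RECORDS, VARIABLES).most_common():
    print(n, pattern if pattern else "(complete)")
```

In this made-up example, age and bmi are always missing together, the kind of pattern that would prompt a question about how those two items were collected. Note that the duplicated case is still counted in the missing-data tallies; in practice duplicates would normally be resolved first.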

...
