
John W. Tukey, who coined the phrase exploratory data analysis (EDA), made remarkable contributions to the physical and social sciences. In data analysis, his groundbreaking contributions included the fast Fourier transform algorithm and EDA. He reenergized descriptive statistics through EDA and, in doing so, changed the language and paradigm of statistics. Interestingly, it is hard, if not impossible, to find a precise definition of EDA in Tukey's writings. This is no great surprise, because he liked to work with vague concepts that could be made precise in several ways. He appears instead to have introduced EDA by describing its characteristics and creating novel tools. His descriptions include the following:

  • “Three of the main strategies of data analysis are: 1. graphical presentation. 2. provision of flexibility in viewpoint and in facilities, 3. intensive search for parsimony and simplicity.” (Jones, 1986, Vol. IV, p. 558)
  • “In exploratory data analysis there can be no substitute for flexibility; for adapting what is calculated—and what we hope plotted—both to the needs of the situation and the clues that the data have already provided.” (p. 736)
  • “I would like to convince you that the histogram is old-fashioned. …” (p. 741)
  • “Exploratory data analysis … does not need probability, significance or confidence.” (p. 794)
  • “I hope that I have shown that exploratory data analysis is actively incisive rather than passively descriptive, with real emphasis on the discovery of the unexpected.” (p. lxii)
  • “‘Exploratory data analysis’ is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.” (p. 806)
  • “Exploratory data analysis isolates patterns and features of the data and reveals these forcefully to the analyst.” (Hoaglin, Mosteller, & Tukey, 1983, p. 1)
  • “If we need a short suggestion of what exploratory data analysis is, I would suggest that: 1. it is an attitude, AND 2. a flexibility, AND 3. some graph paper (or transparencies, or both).” (Jones, 1986, Vol. IV, p. 815)

This entry presents a selection of EDA techniques and topics, including tables, five-number summaries, stem-and-leaf displays, scatterplot matrices, box plots, residual plots, outliers, bag plots, smoothers, reexpressions, and median polishing. Graphics are a common theme. These are tools for finding structure in the data, or the lack of it.

Some of these EDA tools are illustrated here using data from the U.S. presidential elections of 1952 through 2008. Specifically, Table 1 displays the percentage of the vote that the Democrats received in the states of California, Oregon, and Washington in those years. The percentages for the Republican and third-party candidates are not of concern here. In EDA, one seeks displays and quantities that provide insights, understanding, and surprises.

Table

A table is the simplest EDA object. It arranges the data in a convenient form. Table 1 is a two-way table.

Five-Number Summary

Given a batch of numbers, the five-number summary consists of the smallest and largest values, the median, and the lower and upper quartiles. These numbers are useful for auditing a data set and for getting a feel for the data. More complex EDA tools may be based on them. For the California data, the five-number summary in percents is shown in Figure 1.
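As a rough illustration of the computation, the following Python sketch returns the five-number summary of a batch. It rests on assumptions that are not part of this entry: the function name five_number_summary is hypothetical, the quartiles are taken as the medians of the lower and upper halves of the sorted batch (the "hinge" convention, one of several in common use), and the sample batch consists of placeholder integers rather than the election percentages of Table 1.

```python
from statistics import median


def five_number_summary(batch):
    """Return (smallest, lower quartile, median, upper quartile, largest).

    Assumption: quartiles are computed as hinges, i.e. the medians of the
    lower and upper halves of the sorted batch, with the overall median
    included in both halves when the batch size is odd. Other quartile
    conventions can give slightly different values.
    """
    xs = sorted(batch)
    if not xs:
        raise ValueError("batch must contain at least one number")
    n = len(xs)
    half = (n + 1) // 2          # size of each half (shares the median if n is odd)
    lower_half = xs[:half]
    upper_half = xs[n - half:]
    return (xs[0], median(lower_half), median(xs), median(upper_half), xs[-1])


# Placeholder batch for illustration only; not data from Table 1.
batch = [7, 1, 9, 3, 5, 8, 2, 6, 4]
print(five_number_summary(batch))  # prints (1, 3, 5, 7, 9)
```

Statistical packages often default to interpolated quantiles instead of hinges, so their quartiles may differ slightly from these values for small batches.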

...
