
The term data snooping, sometimes also referred to as data dredging or data fishing, describes the situation in which a particular data set is analyzed repeatedly without an a priori hypothesis of interest. The practice of data snooping, although common, is problematic because it can result in a significant finding (e.g., rejection of a null hypothesis) that is nothing more than a chance artifact of the repeated analyses of the data. The biases introduced by data snooping grow as a data set is analyzed more and more times in the hope of a significant finding. Any empirical research based on experimentation or observation can be affected by data snooping.

Data Snooping and Multiple Hypothesis Testing

A hypothesis test is conducted at a significance level, denoted α, corresponding to the probability of incorrectly rejecting a true null hypothesis (the so-called Type I error). Data snooping essentially involves performing a large number of hypothesis tests on a particular data set in the hope that one of the tests will be significant. Performing many tests in this way inflates the actual significance level; equivalently, it substantially lowers the burden of proof for finding a significant result, and so can produce misleading results. For example, if 100 independent hypothesis tests are conducted on a data set at a significance level of 5%, about 5 of the 100 tests would be expected to yield significant results by chance alone, even if every null hypothesis were, in fact, true. Any conclusions of statistical significance at the 5% level based on an analysis such as this are misleading because the data-snooping process has essentially ensured that something significant will be found. This means that if new data are obtained, it is unlikely that the “significant” results found via the data-snooping process would be replicated.
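The arithmetic behind the 100-test example can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text (independent tests, every null hypothesis true); the function names are hypothetical and not part of the original entry. It computes the expected number of false positives and the probability of at least one false positive, 1 − (1 − α)^m, for m tests.

```python
# Illustrative sketch, assuming m independent tests and that every
# null hypothesis is true. Function names are hypothetical.

def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected number of tests that reject a true null hypothesis."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests: int, alpha: float) -> float:
    """Probability that at least one of the independent tests rejects by chance."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    alpha = 0.05
    for m in (1, 10, 100):
        print(
            f"tests={m:3d}  "
            f"expected false positives={expected_false_positives(m, alpha):.2f}  "
            f"P(at least one)={prob_at_least_one_false_positive(m, alpha):.3f}"
        )
```

With 100 tests at the 5% level, the probability of at least one spurious "significant" result is above 99%, which is the sense in which the process essentially ensures that something significant will be found.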

Data-Snooping Examples

Example 1

An investigator obtains data to investigate the impact of a treatment on the mean of a response variable of interest without a predefined view (alternative hypothesis) of the direction (positive or negative) of the possible effect of the treatment. Data snooping would occur in this situation if, after analyzing the data, the investigator observes that the treatment appears to have a negative effect on the response variable and then uses a one-sided alternative hypothesis corresponding to the treatment having a negative effect. In this situation, a two-sided alternative hypothesis, corresponding to the investigator's a priori ignorance of the direction of the treatment effect, would be appropriate. Data snooping in this example results in the p value for the hypothesis test being halved, giving a greater chance of declaring a significant effect of the treatment. To avoid problems of this nature, many journals require that two-sided alternatives be used for hypothesis tests.
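The halving of the p value can be illustrated with a small sketch. It assumes, purely for illustration, that the test statistic is approximately standard normal (a z test); the entry itself does not specify the test, and the observed value of −1.8 and the function names are hypothetical.

```python
from statistics import NormalDist

# Illustrative sketch, assuming a z test whose statistic is standard
# normal under the null hypothesis. Names and numbers are hypothetical.

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_p_value(z: float) -> float:
    """p value when the alternative allows an effect in either direction."""
    return 2.0 * STANDARD_NORMAL.cdf(-abs(z))


def lower_one_sided_p_value(z: float) -> float:
    """p value when the alternative is a negative effect only."""
    return STANDARD_NORMAL.cdf(z)


if __name__ == "__main__":
    z_observed = -1.8  # hypothetical statistic suggesting a negative effect
    print(f"two-sided p value: {two_sided_p_value(z_observed):.4f}")
    print(f"one-sided p value: {lower_one_sided_p_value(z_observed):.4f}")
```

For a statistic such as this, the two-sided p value is above 0.05 while the one-sided p value, chosen after seeing the direction of the effect, falls below it, so the same data move from "not significant" to "significant" at the 5% level.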

Example 2

A data set containing information on a response variable and six explanatory variables is analyzed without any a priori hypotheses of interest. Each of the 64 multiple linear regression models that can be formed from subsets of the six explanatory variables (2^6 = 64, counting the model with no explanatory variables) is fitted, and only the statistically significant associations are reported. The effect of data snooping in this example would be more severe than in Example 1 because the data are being analyzed many more times (more hypothesis tests are performed), meaning that one would expect to see a number of significant associations simply due to chance.
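The count of 64 models, and a rough sense of how many hypothesis tests such a search involves, can be checked with a short enumeration. This is a sketch only: the variable names x1 to x6 are hypothetical, and counting one test per explanatory variable in each model is an assumption about how the analysis is carried out, not something stated in the entry.

```python
from itertools import combinations

# Illustrative sketch. The names x1..x6 are hypothetical placeholders
# for the six explanatory variables.
explanatory_variables = ["x1", "x2", "x3", "x4", "x5", "x6"]


def all_subsets(names):
    """Return every subset of the given names, including the empty subset."""
    return [
        subset
        for size in range(len(names) + 1)
        for subset in combinations(names, size)
    ]


candidate_models = all_subsets(explanatory_variables)
num_models = len(candidate_models)

# Assumption: one significance test per explanatory variable in each model.
num_coefficient_tests = sum(len(model) for model in candidate_models)

if __name__ == "__main__":
    print(f"number of candidate models: {num_models}")
    print(f"number of coefficient tests: {num_coefficient_tests}")
```

Under that assumption, the search involves far more tests than the single test of Example 1, and the tests are not independent because the models share variables and data, which is why a handful of chance "significant" associations would not be surprising.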

...
