
The ecological inference problem is long-standing and encompasses a rich set of intriguing puzzles. Scholars with diverse backgrounds and interests have a stake in approaches to ecological inference, which arises as frequently in political science as it does in medicine, geography, economics, or sociology. The problem occurs, for instance, when one is interested in the behavior of individuals but has data only at an aggregated level (e.g., precincts, hospital wards, counties). In other words, a data limitation creates a situation where the behavior of individuals must be surmised from data on aggregated sets of individuals rather than on the individuals themselves. Since the goal is to make inferences from aggregate units that are often defined at an “environmental level” (i.e., geographical/ecological units such as a county or precinct), the term ecological inference is used to describe this type of analysis. More generally, the problem manifests itself whenever one has data at one level of aggregation (e.g., the state level) but is interested in inferences at another level of aggregation (e.g., the county level). Accordingly, the term cross-level inference is often used as a synonym for ecological inference. This entry discusses the nature and implications of this problem.

A classic example of the ecological inference problem in political science occurs when one tries to determine how members of different racial groups cast their ballots. Because the United States employs the secret ballot, the only data available for answering this question are at the precinct level, where vote totals and racial demographics can be obtained but vote totals broken down by racial category cannot. Epidemiologists confront identical methodological issues when they seek to explain which environmental factors influence disease susceptibility using only data from counties or hospital wards rather than from individual patients. Economists studying consumer demand and marketing strategies might need to infer individual spending habits from sales data for a specific region and the aggregate characteristics of individuals in that region, rather than from data on individuals' characteristics and purchases. These queries are substantively varied, and it would be simple to identify a host of others, equally distinct, that fit the ecological inference mold.

In addition to substantive applications that span many fields, the mathematics of the ecological inference problem are also related, sometimes isomorphic, to inferential problems in other disciplines, even when the subject matter is not substantially related. For instance, geographers have long been intrigued with the modifiable areal unit problem (MAUP). MAUP occurs when the estimates at one level of aggregation are different from the estimates obtained at a different level of aggregation. Many statisticians and mathematicians have been captivated by Simpson's paradox, which is the reversal in direction of association between two variables when a third (“lurking”) variable is controlled. Described in this way, Simpson's paradox (and consequently ecological inference) is akin to the omitted variable problem discussed in virtually all econometrics and regression texts. While notation and terminology may differ, the similarities of the underlying problems cannot be denied.
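The reversal described by Simpson's paradox can be made concrete with a small sketch. The counts below are invented purely for illustration (they come from no study), and the names `counts`, `within_stratum_rates`, and `pooled_rates` are hypothetical. Option A has the higher success rate inside each stratum of the lurking variable, yet option B looks better once the strata are pooled.

```python
# Hypothetical (successes, trials) counts, invented for illustration only.
# The stratum plays the role of the "lurking" third variable.
counts = {
    "stratum_1": {"A": (18, 20), "B": (64, 80)},
    "stratum_2": {"A": (24, 80), "B": (4, 20)},
}


def within_stratum_rates(table):
    """Success rate of each option inside each stratum."""
    return {
        stratum: {opt: s / n for opt, (s, n) in options.items()}
        for stratum, options in table.items()
    }


def pooled_rates(table):
    """Success rate of each option after ignoring the stratum."""
    totals = {}
    for options in table.values():
        for opt, (s, n) in options.items():
            prev_s, prev_n = totals.get(opt, (0, 0))
            totals[opt] = (prev_s + s, prev_n + n)
    return {opt: s / n for opt, (s, n) in totals.items()}


by_stratum = within_stratum_rates(counts)
pooled = pooled_rates(counts)

for stratum, rates in by_stratum.items():
    print(stratum, rates)
print("pooled", pooled)
```

The reversal arises because option A is applied mostly in the stratum with low success rates, so pooling mixes the effect of the option with the effect of the stratum. This is the same mechanism as an omitted variable in a regression.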

The main difficulty with cross-level inference is that estimates at different levels of aggregation may not merely differ; they can differ substantially and even lead to opposite conclusions. William Robinson (1950) was among the first to point out this conundrum with correlation coefficients. He examined the relationship between nativity (whether a person was foreign born) and illiteracy. At the individual level, the correlation coefficient was .118. When the data were aggregated at the state level, the correlation coefficient reversed sign and became –.526. When the data were aggregated at the level of census geographic divisions, the correlation coefficient remained negative at –.619. The same phenomenon occurs when race and illiteracy are examined. The correlation at the individual level was .203. At the census division level, the correlation was .946. At the state level, the correlation was .773. While these correlation coefficients do not change sign, they vary widely and imply different substantive conclusions. In fact, such wide discrepancies are common, and worse, in a given application, a researcher is unable to determine whether his or her aggregated results bear any similarity to the true individual-level relationship.
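A sign reversal of the kind Robinson reported can be reproduced with a minimal sketch. The data below are not Robinson's; they are hypothetical counts for three made-up regions, chosen so that individuals with trait x = 1 are more likely to have outcome y = 1 within every region, while regions with a larger share of x = 1 have a lower overall rate of y = 1. The names `regions` and `pearson` are assumptions introduced for this illustration.

```python
from math import sqrt

# Hypothetical counts of individuals per region, keyed by (x, y).
# x and y are binary individual-level traits; the numbers are invented.
regions = {
    "region_1": {(1, 1): 5, (1, 0): 5, (0, 1): 27, (0, 0): 63},
    "region_2": {(1, 1): 9, (1, 0): 21, (0, 1): 7, (0, 0): 63},
    "region_3": {(1, 1): 10, (1, 0): 40, (0, 1): 2, (0, 0): 48},
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Individual level: one (x, y) observation per person, regions ignored.
ind_x, ind_y = [], []
for table in regions.values():
    for (x, y), count in table.items():
        ind_x.extend([x] * count)
        ind_y.extend([y] * count)

# Aggregate level: one observation per region, the shares of x = 1 and y = 1.
agg_x, agg_y = [], []
for table in regions.values():
    total = sum(table.values())
    agg_x.append(sum(c for (x, _), c in table.items() if x == 1) / total)
    agg_y.append(sum(c for (_, y), c in table.items() if y == 1) / total)

individual_r = pearson(ind_x, ind_y)
aggregate_r = pearson(agg_x, agg_y)

print(f"individual-level correlation: {individual_r:.3f}")
print(f"region-level correlation:     {aggregate_r:.3f}")
```

In this constructed example the individual-level correlation is weakly positive while the region-level correlation is strongly negative. An analyst who saw only the three regional shares would have no way of recovering the individual-level relationship from them, which is the predicament described above.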

...
