
Introduction

Methods for detecting differential item functioning (DIF) and item bias are typically used in the process of developing new measures, adapting existing measures, or validating test score inferences. DIF methods allow one to judge whether items (and ultimately the test they constitute) function in the same manner across groups of examinees. In broad terms, this is a matter of measurement invariance: is the test performing in the same manner for each group of examinees? What follows is a brief introduction to DIF and item bias, including the context in which DIF methods arose. The goal is to provide some organizing principles that allow one to catalogue and then contrast the various DIF detection methods. This entry ends with a discussion of current and future directions for DIF.

Context in which DIF Methods Arose

Concerns about item bias emerged within the context of test bias and high-stakes decision-making involving achievement, aptitude, certification, and licensure tests in which matters of fairness and equity were paramount. Historically, concerns about test bias have centred on differential performance by groups based on gender or race. If the average test scores for such groups (e.g. men vs. women, Blacks vs. Whites) were found to differ, the question arose as to whether the difference reflected bias in the test. Given that a test comprises items, questions soon emerged about which specific items might be the source of such bias.

Given this context, many of the early item bias methods were characterized by (a) comparisons of only two groups of examinees, (b) the use of terminology such as 'focal' and 'reference' groups to denote minority and majority groups, respectively, and (c) binary (rather than polytomous) scored items. Due to the highly politicized environment in which item bias was being examined, two interrelated changes occurred. First, the expression 'item bias' was replaced in many descriptions by the more palatable term 'differential item functioning', or DIF. DIF was the statistical term used simply to describe the situation in which persons from one group answered an item correctly more often than equally knowledgeable persons from another group. Second, the introduction of the term 'differential item functioning' allowed one to distinguish item impact from item bias. Item impact describes the situation in which DIF exists because there are true differences between the groups in the underlying ability of interest being measured by the item. Item bias describes the situation in which there is DIF because of some characteristic of the test item or testing situation that is not relevant to the underlying ability of interest (and hence to the test purpose).
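The core idea above — comparing groups of examinees who are matched on ability — can be sketched computationally. The entry does not name a specific statistic, but a classic choice for binary-scored items is the Mantel-Haenszel common odds ratio, shown here as a minimal illustration (the data and function names are hypothetical): examinees are matched on total test score, and within each score stratum a 2x2 table of group (reference/focal) by item response (correct/incorrect) is formed.

```python
# Hedged sketch of Mantel-Haenszel DIF detection for a single binary item.
# Assumption: the matching variable is the total test score; the entry
# itself does not prescribe this (or any) particular statistic.
from collections import defaultdict

def mantel_haenszel_odds_ratio(responses):
    """responses: iterable of (group, total_score, item_correct) tuples,
    where group is 'reference' or 'focal' and item_correct is 0 or 1.
    Returns the MH common odds ratio; values near 1.0 suggest no DIF."""
    # One 2x2 table [a, b, c, d] per matched score level:
    # a = reference correct, b = reference incorrect,
    # c = focal correct,     d = focal incorrect.
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in responses:
        cell = strata[score]
        if group == 'reference':
            cell[0 if correct else 1] += 1
        else:
            cell[2 if correct else 3] += 1
    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    return num / den if den else float('nan')

# Toy data: at each matched score level both groups answer the item
# correctly at the same rate, so the odds ratio should equal 1.0.
data = (
    [('reference', s, 1) for s in (3, 4, 5)] * 4 +
    [('reference', s, 0) for s in (3, 4, 5)] * 2 +
    [('focal', s, 1) for s in (3, 4, 5)] * 4 +
    [('focal', s, 0) for s in (3, 4, 5)] * 2
)
print(round(mantel_haenszel_odds_ratio(data), 2))  # → 1.0
```

The matching step is what separates DIF from raw group differences: an odds ratio far from 1.0 among *equally scoring* examinees flags the item, whereas a simple difference in overall correct rates may reflect item impact rather than bias.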

Traditionally, the consumers of DIF methodology and technology have been educational and psychological measurement specialists. As a result, research has focused primarily on developing sophisticated statistical methods for detecting or 'flagging' DIF items, rather than on refining methods to distinguish item bias from item impact or on explaining why DIF occurs. This is changing, however: as increasing numbers of non-measurement specialists become interested in exploring DIF and item bias in tests, it has become apparent that much of the statistical terminology and software in use is not very accessible to many researchers.

...
