
Item and test bias have received much attention from the legal system, policymakers, test consumers, educational and psychological researchers, test developers, and the general public. This attention is well deserved because the essence of the issue is an ethical concern. Bias refers to differential validity across subgroups (e.g., males vs. females, minority vs. majority) and suggests that scores have different meanings for members of these subgroups. The Code of Professional Responsibilities in Educational Measurement states that test developers should make their products “as free as possible from bias due to characteristics irrelevant to the construct being measured, such as gender, ethnicity, race, socioeconomic status, disability, religion, age, or national origin” (Section 1.2a). However, there is no such thing as a “nonbiased test” or a test that is “fair” or “valid” for all subgroups under all conditions. This fact should not deter test developers from going to extensive lengths to create instruments that are free of bias against intended subgroups.

To study whether performance may be influenced by factors specific to group membership (e.g., language, culture, gender), the psychometric properties of a test can be investigated for invariance (equality) across groups. The type of invariance investigation depends on the suspected nature of bias and can include a variety of methods to detect (a) differential item functioning (DIF), (b) factor structure invariance, and (c) differential prediction.

Item Bias and Differential Item Functioning

Although the terms item bias and DIF are often used interchangeably, DIF refers specifically to differences in the statistical properties of an item between groups of examinees of equal ability. Two types of DIF can exist. Uniform DIF is a difference in item performance that is consistent across the ability distribution, whereas nonuniform DIF is a difference that is not consistent across the ability distribution (i.e., a group by ability-level interaction). Groups often are referred to as the reference (e.g., majority) and focal (e.g., minority or studied) groups. The concept of comparing groups of equal ability is a cardinal feature separating DIF from the traditional item bias detection methods. Traditional methods, because they do not control for ability differences, are affected by differences in the examinee groups' ability distributions. Overall ability differences may explain differential item performance, resulting in an item appearing to be, for example, more difficult simply because the examinees in the focal group are less able overall. Impact is the more appropriate term for differences in item performance that can be explained by group ability differences. DIF detection methods "condition on" or control for ability, meaning that examinees are matched on ability; thus, only examinees of equal ability (e.g., equal total test score) in the reference and focal groups are compared.
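The matching logic described above can be sketched with the Mantel-Haenszel procedure, one widely used DIF detection method: examinees are stratified on the matching variable (here, total test score), a 2 × 2 group-by-correctness table is tallied within each stratum, and a common odds ratio is pooled across strata. This is a minimal illustrative sketch, not a production implementation; the function name and data layout are assumptions, and no continuity correction or significance test is included.

```python
# Hypothetical sketch of Mantel-Haenszel DIF detection. Examinees are
# stratified on total test score, and the odds of answering the studied
# item correctly are compared between reference and focal groups within
# each stratum, then pooled into a common odds ratio (alpha_MH).
from collections import defaultdict

def mantel_haenszel_alpha(item_correct, total_score, is_focal):
    """Common odds ratio (alpha_MH) across score strata.

    item_correct : list of 0/1 responses to the studied item
    total_score  : list of matching scores (e.g., total test score)
    is_focal     : list of booleans, True for focal-group examinees
    alpha_MH = 1 indicates no DIF; alpha_MH > 1 favors the reference group.
    """
    # Tally a 2x2 table (group x correct/incorrect) per score stratum.
    strata = defaultdict(lambda: {"Ar": 0, "Br": 0, "Af": 0, "Bf": 0})
    for y, s, f in zip(item_correct, total_score, is_focal):
        cell = strata[s]
        if f:
            cell["Af" if y else "Bf"] += 1  # focal correct / incorrect
        else:
            cell["Ar" if y else "Br"] += 1  # reference correct / incorrect

    num = den = 0.0
    for cell in strata.values():
        n = sum(cell.values())
        if n == 0:
            continue
        num += cell["Ar"] * cell["Bf"] / n  # reference right, focal wrong
        den += cell["Af"] * cell["Br"] / n  # focal right, reference wrong
    return num / den if den else float("inf")
```

Because each 2 × 2 table holds only examinees with the same total score, a value of alpha_MH far from 1 reflects a group difference beyond ability, not impact.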

Items that exhibit DIF threaten test score validity and may have serious consequences for groups as well as individuals, because correct responses are determined not only by the trait the test claims to measure but also by factors specific to group membership. The most obvious consequence is the potential impact of DIF on the observed score distributions of specific groups. A less obvious consequence of DIF, yet one critically important to the construct validity of a test, is its impact on the meaning and interpretation of test scores, even in the absence of mean score differences between groups. DIF items may cancel one another and result in similar score distributions across groups. However, when scores are composed of different items systematically scored as correct, it is invalid to infer that "equal" scores are comparable or have the same meaning. In fact, the Standards for Educational and Psychological Testing (Section 7.10) states that mean score differences alone are insufficient evidence of bias.
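The cancellation point above can be made concrete with a small fabricated example: on a two-item test, each group answers a different item correctly, so the total-score distributions are identical even though item-level performance differs completely. The data and helper names below are invented purely for illustration.

```python
# Fabricated two-item example of DIF cancellation: group A succeeds on
# item1, group B on item2, yet every examinee earns the same total score.
group_a = [
    {"item1": 1, "item2": 0},  # each examinee in A gets only item1 right
    {"item1": 1, "item2": 0},
]
group_b = [
    {"item1": 0, "item2": 1},  # each examinee in B gets only item2 right
    {"item1": 0, "item2": 1},
]

def totals(group):
    # Total test score for each examinee in the group
    return [sum(responses.values()) for responses in group]

def item_mean(group, item):
    # Proportion of the group answering the item correctly
    return sum(r[item] for r in group) / len(group)

print(totals(group_a), totals(group_b))                    # [1, 1] [1, 1]
print(item_mean(group_a, "item1"), item_mean(group_b, "item1"))  # 1.0 0.0
```

The score distributions match exactly, but the "equal" scores are built from different items, so they do not carry the same meaning across groups.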

...
