Within education, tests are samples of knowledge, skill, or other qualities that are used to make inferences about the students or other components of the education system. Modern standardized testing entails methods of development, administration, and scoring that have been devised by psychometricians to foster appropriate inferences. These are described in the publication Standards for Educational and Psychological Testing, a set of testing standards developed jointly by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. The technical quality of standardized tests lends confidence to policymakers and citizens seeking to understand students' aptitudes, social and emotional qualities, intelligence, and academic achievement. Policymakers and citizens are also increasingly relying on such tests both to shed light on the performance of teachers, schools, and school systems and to systemically alter that performance. Because standardized testing is ubiquitous within education systems in many parts of the world, it is important to examine key components of its development, administration, and score reporting. This entry considers these components and the systemic effects of testing systems.

Test Development

Test development entails many techniques to enable reasonable inferences to be drawn from a sample of individual or institutional performances. Sampling begins by developing a framework for the “construct” that is to be tested. For example, a framework for the construct of third-grade mathematics might include topics such as addition and subtraction with whole numbers, multiplication, division, and simple fractions. A framework that leaves out important components (such as geometric shapes) or includes irrelevant ones (such as paragraph reading) will threaten the appropriateness of inferences drawn from the test scores.

Once the framework's components have been specified, an appropriate number of test questions (called items) must be developed to sample each component. The number of items is constrained by such things as the length of time the test should take, the amount of money available for test development, and the need for test reliability. When a greater number of items is used to sample each component of the framework, the scores from the test become more reliable, that is, more consistent. Reliability is necessary, but not sufficient, for drawing appropriate inferences from scores. For example, a test may yield reliable scores when used over time with similar populations of students. However, if that test includes irrelevant components or leaves out important topics, its scores may not permit a valid inference about what the students know and can do. Validity refers to the evidence and argument needed to support the use of a given score for a given purpose.
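The relationship between the number of items and reliability can be illustrated with the Spearman-Brown formula from classical test theory. The entry itself does not name this formula; it is offered here only as a minimal sketch, and it assumes that any added items are parallel to the existing ones (same content and difficulty). The function name, the starting reliability of 0.60, and the length factors are hypothetical values chosen for illustration.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict score reliability after changing test length.

    reliability   -- reliability of the current test, between 0 and 1
    length_factor -- new length divided by current length (2.0 = twice as many items)

    Assumes the added (or removed) items are parallel to the existing ones.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


if __name__ == "__main__":
    current = 0.60  # hypothetical reliability of a short test
    for factor in (0.5, 1.0, 2.0, 3.0):
        predicted = spearman_brown(current, factor)
        print(f"length x{factor}: predicted reliability {predicted:.3f}")
```

Under these assumptions, each additional block of items raises predicted reliability by a smaller amount than the previous one, which is one reason test length is weighed against testing time and development cost. The formula says nothing about validity: a longer test built on a flawed framework yields more consistent scores that still support the wrong inference.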

Test items are analyzed for bias, both through statistical techniques and through the review of items by panels of diverse readers. A finding that different groups of test takers produce different average scores does not, by itself, indicate that a test is biased. Instead, bias refers to inadequacies in the test itself (or in its administration) that create systematic differences in the inferences drawn from scores for different groups of test takers. To illustrate, if average scores from a well-designed and well-administered standardized test of writing are lower among rural students, this does not mean the test is biased against them. Many test developers consider such score disparities to arise from problems in social policy, not from test development or administration.

...
