
Criterion-Referenced Testing: Methods and Procedures

Introduction

Criterion-referenced tests are constructed to allow users to interpret examinee test performance in relation to well-defined domains of content and/or behaviours. Normally, performance standards are set on the test score reporting scale to permit examinee test performance to be classified into performance categories such as below basic, basic, proficient, and advanced. Criterion-referenced tests are well suited for many of the assessment needs that exist in education, the professions, the military, and industry. Today, criterion-referenced tests are called by many names – domain-referenced tests, competency tests, basic skills tests, mastery tests, performance tests, authentic assessments, objectives-referenced tests, and more. In different contexts, test developers and users have adopted these different names. For example, in school contexts, the term ‘mastery testing’ is common. When criterion-referenced tests are developed to model classroom activities or exercises, the term ‘authentic test’ is sometimes used. When criterion-referenced tests consist of many performance tasks, the terms ‘performance test’ or ‘performance assessment’ are used. Regardless, all of these terms refer to a type of assessment where what examinees know and can do is estimated, and often performance standards are used for interpreting examinee performance.
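The classification of examinees into performance categories described above can be sketched in a few lines of code. The cut scores and category labels below are hypothetical illustrations on a 0-100 reporting scale; in practice, performance standards are set through formal standard-setting procedures, not chosen ad hoc.

```python
from bisect import bisect_right

# Hypothetical cut scores separating four performance categories
# on an assumed 0-100 score reporting scale.
CUT_SCORES = [40, 60, 80]
CATEGORIES = ["below basic", "basic", "proficient", "advanced"]

def classify(score: float) -> str:
    """Map a scale score to a performance category via the cut scores."""
    return CATEGORIES[bisect_right(CUT_SCORES, score)]
```

A score of 35 would fall in the "below basic" category, while a score at or above the highest cut score (80 here) would be classified as "advanced".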

This entry has been divided into three sections. First, the most important criterion-referenced testing concepts will be presented. Second, criterion-referenced tests will be compared to norm-referenced tests. Finally, some conclusions and predictions about the future for criterion-referenced tests will be offered.

Key Criterion-Referenced Testing Concepts

Defining Content Domains

When this approach to assessment was introduced by Glaser (1963) and Popham and Husek (1969), criterion-referenced tests were constructed to assess a set of behavioural objectives. Over the years, it became clear that behavioural objectives did not have the specificity needed to guide instruction or to serve as targets for test development and test score interpretation (Popham, 1978). Numerous attempts were made to increase the clarity of behavioural objectives, including the development of detailed domain specifications that included a clearly written objective, a sample test item or two, detailed specifications for appropriate content, and details on the construction of relevant assessment materials (see Hambleton, 1998). Domain specifications seemed to meet the demand for clearer statements of the intended assessment targets, but they were very time-consuming to write. Moreover, the level of detail needed for good assessment was often impossible to achieve for higher-order cognitive skills, and so test developers came to find domain specifications limiting.

Recently the trend in criterion-referenced testing practices has been to write objectives focused on the more important educational outcomes (fewer instructional and assessment targets seem to be preferable) and then offer a couple of sample assessments, preferably samples that show the diversity of approaches that might be used for assessment (Popham, 2000). Coupled with these looser specifications of the objectives is an intensive effort to demonstrate the validity of any assessments that are constructed.

Writing Valid Test Items

The production of valid test items, that is, test items that provide a psychometrically sound basis for assessing examinee level of proficiency or performance, requires (1) well-trained item writers, (2) item review, (3) field testing, and (4) the use of multiple item formats. Well-trained item writers are persons who have had experience with the intended population of examinees, know the intended curricula, and have experience writing test items using a variety of item formats. Item review often involves checking test items for their validity in measuring the intended objectives, checking their technical adequacy (that is, their consistency with the best item-writing practices), and ensuring items are free of bias and stereotyping. Field testing must be carried out on samples large enough to provide stable statistical information and representative of the intended population of examinees. Unstable and/or biased item statistical information only complicates and threatens the validity of the test development process. And, finally, one of the most important changes today in testing is the introduction of new item formats, formats that permit the assessment of higher-level cognitive skills (see Zenisky & Sireci, in press).
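The field-testing step described above typically yields classical item statistics. A minimal sketch, assuming dichotomously scored (0/1) items, is shown below: item difficulty is computed as the proportion of examinees answering correctly, and item discrimination as the point-biserial correlation between the item score and the total test score. The function name and data layout are illustrative, not drawn from any particular testing program.

```python
def item_statistics(responses):
    """Compute (difficulty, discrimination) for each item.

    responses: one row per examinee, one 0/1 entry per item.
    Difficulty is the proportion correct; discrimination is the
    point-biserial correlation of the item with the total score.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_t = sum(totals) / n_examinees
    sd_t = (sum((t - mean_t) ** 2 for t in totals) / n_examinees) ** 0.5

    stats = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        p = sum(item) / n_examinees  # proportion correct (difficulty)
        sd_item = (p * (1 - p)) ** 0.5
        cov = sum((x - p) * (t - mean_t)
                  for x, t in zip(item, totals)) / n_examinees
        # Guard against zero variance (everyone right, or everyone wrong)
        r_pb = cov / (sd_item * sd_t) if sd_item and sd_t else 0.0
        stats.append((p, r_pb))
    return stats
```

Items with unstable statistics from small field-test samples, or with near-zero variance, illustrate why the entry stresses samples that are both large and representative of the intended examinee population.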

...
