
Introduction

Reliability as a central concept of test theory dates back to the beginning of the 20th century. It rests on the existence of intra-individual variability as well as variation between persons. Along with intra-individual variability, or measurement error, the true score was introduced as a central concept of classical test theory. Observed score variance could then be thought of as true score variance plus error variance. The reliability of a test, rating scale, assessment or any other more or less standardized procedure within a given (sub)population of persons (or other objects of measurement, e.g. classrooms) is defined as the ratio of true score variance to observed score variance, or equivalently as the squared correlation between observed scores and true scores (Lord & Novick, 1968: 61):

$$\rho_{XT}^2 = \frac{\sigma_T^2}{\sigma_X^2}$$

Its minimum value is zero, its maximum value one. As will be demonstrated, the definition is not very useful until we have defined precisely what we mean by ‘error’.
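The two forms of the definition can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes normally distributed true scores with standard deviation 15 and independent errors with standard deviation 5, so that the population reliability would be 225/250 = 0.9. The function names and all parameter values are illustrative assumptions.

```python
import random
import statistics


def simulate_scores(n_persons=20000, true_sd=15.0, error_sd=5.0, seed=0):
    """Draw hypothetical true scores and add independent measurement error."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(100.0, true_sd) for _ in range(n_persons)]
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return true_scores, observed


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


true_scores, observed = simulate_scores()

# Definition 1: ratio of true score variance to observed score variance.
variance_ratio = statistics.pvariance(true_scores) / statistics.pvariance(observed)

# Definition 2: squared correlation between observed scores and true scores.
squared_corr = pearson_r(observed, true_scores) ** 2

print(round(variance_ratio, 3), round(squared_corr, 3))
```

In a sample the two quantities agree only approximately, because the sample covariance between true scores and errors is not exactly zero. In practice true scores are unobservable, which is why reliability has to be estimated indirectly.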

After the 1960s, Item Response Theory (IRT) became an influential approach in test theory. In IRT, person parameters on a latent scale replace true scores. At first sight, there seems to be no place for reliability within the context of IRT. It can be demonstrated, however, that reliability is an important concept in this newer test-theoretical approach as well.

Reliability and Sources of Variation

When a person's height is measured repeatedly, the readings differ slightly: there is error in the measurements. The same holds when a person's characteristics are measured in psychological testing. If an intelligence test were administered to a person repeatedly, we would expect the scores to vary: again there is measurement error. Unfortunately, the experiment of repeatedly testing a person with the same instrument is seldom carried out, because in practice we should expect memory effects. Instead, we could administer two tests meant to measure the same construct. A score difference might then be due not only to chance fluctuations in item responses, but also to differences in content. Many more sources of variation can be thought of, for example systematic fluctuation of responses over time. Sources of variance due to person characteristics can be classified as lasting or temporary, and as general or specific. Further, there are factors affecting test administration, and there is a category for variance not otherwise accounted for. Most of these sources of variation in responses might be regarded as sources of error variation, but the same sources might be regarded as sources of true variation, depending on the purpose of the test administrator. Stanley (1971: 366), who discusses sources of variation extensively, gives an example. A person may be fatigued on the day of testing, and this influences test performance. When our interest is in predicting performance over some period, reliability is consistency over time, and the effect of fatigue counts as error. When the intercorrelations among tests administered at the same session are studied, consistency at that session is relevant, and the same effect counts as true variation. So the definition of error depends on the purpose of the investigator, and this should determine the choice of reliability coefficient(s).
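The fatigue example can be illustrated with a small simulation. This is a sketch under assumed variance components, not taken from Stanley: each score is the sum of a lasting person component (variance 100), a day-specific component such as fatigue (variance 25) and random error (variance 25). Two tests taken at the same session share the day-specific component; two tests taken on different days do not. The function names and all numbers are hypothetical.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_two_tests(n_persons=20000, same_session=True, seed=1):
    """Simulate scores on two tests of the same construct (assumed model)."""
    rng = random.Random(seed)
    test_a, test_b = [], []
    for _ in range(n_persons):
        lasting = rng.gauss(0.0, 10.0)  # stable person characteristic
        day_a = rng.gauss(0.0, 5.0)     # temporary state, e.g. fatigue
        # Same session: both tests share the state; otherwise draw a new one.
        day_b = day_a if same_session else rng.gauss(0.0, 5.0)
        test_a.append(lasting + day_a + rng.gauss(0.0, 5.0))
        test_b.append(lasting + day_b + rng.gauss(0.0, 5.0))
    return test_a, test_b


# Consistency within a session: the day-specific component acts as true variance.
r_same_session = pearson_r(*simulate_two_tests(same_session=True))

# Consistency over time: the day-specific component acts as error variance.
r_across_days = pearson_r(*simulate_two_tests(same_session=False))

print(round(r_same_session, 3), round(r_across_days, 3))
```

Under these assumptions the population correlations are 125/150 for the same session and 100/150 across days, so the same data-generating process yields two different reliability values depending on which component is treated as error.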

...
