
The two most important properties of an assessment are its validity and reliability. Validity refers to the meaningfulness of the interpretations and uses of a test score and is the most important property of an assessment. Reliability refers to the extent to which test scores are free from errors of measurement. Thus, validity examines the interpretations and uses that can reasonably be made from the consistent part of the test scores, whereas reliability is concerned with inconsistent or random errors of measurement. As a result, reliability is a necessary but not sufficient condition for validity. That is, there needs to be some level of consistency to understand the meaningfulness of particular uses and interpretations of test scores, but measuring consistently does not guarantee the meaningfulness of the interpretations or uses.

Reliability and validity are not global properties of an assessment. Instead, they are properties of specific uses and interpretations that are made from a set of test scores. A test could be valid for a particular use or interpretation and not for another. For example, a test might measure the curriculum covered in a school without providing valid estimates of student performance because of the length of the tests or the nonequivalence of forms. The same is true for reliability. For example, a test might provide reliable scoring without being stable over time. In addition, reliability and validity are a matter of degree. Tests are not considered valid or invalid. Instead, they are valid to some degree. Similarly, a test is not considered reliable or unreliable, but is reliable to some degree.

Estimates of reliability are indices that quantify the amount of measurement error for a particular test use or interpretation for a specified population. Although reliability can be defined broadly in terms of consistency or generalizability, specific statistical indices of reliability vary depending on the statistical model and the sources of error. The statistical model may be based on classical test theory, generalizability theory, or item response theory. Classical test theory and generalizability theory are based on total scores, whereas item response theory is based on an estimate of a latent trait. In this entry, only classical test theory and generalizability theory are considered. Within each theory, there are multiple indices of reliability based on multiple sources of measurement error, including item heterogeneity, equivalence of test forms, stability over time, and consistency of subjective ratings. Different sources of error are of concern in different contexts. For example, the test score of a student writing an essay is affected by errors in scoring, whereas the test score of a student taking a multiple-choice test is affected by the heterogeneity of the items selected to measure the construct. In addition, a test score can be affected by multiple sources of error simultaneously. A student taking the GRE might be affected by the heterogeneity of the items, the form of the test, and the subjectivity of the scoring for the written portion of the test. Thus, there are many types of reliability that vary depending on the sources of error being considered as well as the statistical model or test theory being used. The appropriate index is selected based on the particular test use or score interpretation being made, and one type of reliability should not be considered interchangeable with another.
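To make one such index concrete, the following is a minimal sketch of Cronbach's alpha, a common classical-test-theory index of internal consistency that addresses measurement error arising from item heterogeneity. The function name and the small score matrix are hypothetical, for illustration only; alpha is computed as (k / (k − 1)) × (1 − Σ item variances / total-score variance), where k is the number of items.

```python
# Cronbach's alpha: a classical-test-theory index of internal consistency,
# quantifying measurement error due to item heterogeneity.
# Rows are examinees; columns are item scores.

def cronbach_alpha(scores):
    n_items = len(scores[0])

    def var(xs):
        # Sample variance with an (n - 1) denominator.
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [var([row[i] for row in scores]) for i in range(n_items)]
    total_var = var([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 0/1 item scores for five examinees on a four-item test.
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(data), 3))  # → 0.8
```

Note that this index speaks only to one source of error (item heterogeneity); a different source of error, such as instability over time, would call for a different index, such as a test–retest correlation.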

...
