Introduction to Reliability & Validity

      DR. RYAN MELDRUM: Hello, I'm Dr. Ryan Meldrum. I'm an assistant professor in the Department of Criminal Justice at Florida International University. In this tutorial, I'll be discussing for you what reliability is and I'll also discuss some examples of how researchers assess reliability. Then, I'll move on to a discussion of what validity is and how researchers establish the validity of measures.

    • 00:34

      DR. RYAN MELDRUM [continued]: [What is reliability and how do researchers assess reliability?] With regard to measuring or establishing reliability, reliability entirely boils down to whether or not there is consistency in the way in which something is being measured. If you think about, just for example,

    • 00:56

      DR. RYAN MELDRUM [continued]: what the word reliable means when we talk about a person being reliable. It means that they are behaving in a consistent manner over time. So just as we can think about reliability with regard to a person, we would like the way in which we measure our variables to be reliable as well. And whether or not you realize it,

    • 01:16

      DR. RYAN MELDRUM [continued]: we depend upon reliability every day. Think about, for example, the speedometers in our cars or the weight scales that we stand on. We hope that those speedometers and that those weight scales are reliable in telling us how fast we're going or how much we weigh. We can imagine how problematic it would be if we got on the freeway

    • 01:38

      DR. RYAN MELDRUM [continued]: and the car told us we were going 55 miles an hour, when we actually were going 60. And if the next time we got in the car, we were driving 70 miles an hour and the car was actually only telling us we were driving 60 miles an hour. So we rely upon that measurement being consistent and reliable for us so that we don't get pulled over or that we don't get into an accident.

    • 01:59

      DR. RYAN MELDRUM [continued]: And the same thing would apply if we were talking about the bathroom scales and weighing ourselves. I sure hope that when I get on my bathroom scale, that it's going to consistently provide for me a measurement of how much I weigh and that it's not going to be fluctuating all the time because it's incorrectly calibrated. So there are at least three ways that researchers can commonly

    • 02:20

      DR. RYAN MELDRUM [continued]: assess the reliability of what is being measured in a study. The first way that researchers can assess how reliable a measurement is, is by measuring the same concept or measuring the same variable at two different points in time and assessing whether or not we get consistency in the scores for that measure. This type of reliability is known

    • 02:41

      DR. RYAN MELDRUM [continued]: as test-retest reliability. One example of establishing test-retest reliability would be asking college students what their GPA was in high school. This is something that has already happened in the past. It's fixed. So what that means is if we were to ask college students what their GPA was in high school, we should

    • 03:03

      DR. RYAN MELDRUM [continued]: be getting the same answer from them at two different points in time. Let's imagine that I asked a group of college students today what their high school GPA was and that, let's say, one individual participant reported for me a GPA of 3.1 when they were in high school.
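
The test-retest idea described here can be sketched numerically. The following is a minimal illustration, not a procedure taken from the tutorial: the GPA values are invented, and using a Pearson correlation between the two waves is an assumption about how one might summarize consistency.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical self-reported high school GPAs for five students,
# asked once today (time 1) and again a month later (time 2).
gpa_time1 = [3.1, 2.8, 3.9, 3.4, 2.5]
gpa_time2 = [3.1, 2.9, 3.9, 3.3, 2.5]

# A correlation near 1 suggests the answers are consistent over time.
test_retest_r = pearson_r(gpa_time1, gpa_time2)

# The largest individual change shows how far any one answer drifted.
largest_change = max(abs(a - b) for a, b in zip(gpa_time1, gpa_time2))

print(f"test-retest r = {test_retest_r:.3f}")
print(f"largest change in reported GPA = {largest_change:.2f}")
```

With answers this stable, the correlation comes out close to 1. A respondent who reported 3.1 at one time and 4.0 at the next would pull the correlation down and show up as a large individual change.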

    • 03:28

      DR. RYAN MELDRUM [continued]: Because that GPA is fixed, it occurred in the past, they're never going to go back to high school. The GPA will never change. I should be able to ask that person tomorrow or next week or next year what their high school GPA was and I should get the exact same answer or something very close to it. To the extent that they are accurately able to recall what

    • 03:50

      DR. RYAN MELDRUM [continued]: their GPA was. If on the other hand, I had somebody report to me that their high school GPA was 3.1 today, but if I ask them a month from now, they reported it was 4.0, well, that would suggest that there is not a good test-retest reliability. We could take this same principle of test-retest reliability and apply it

    • 04:11

      DR. RYAN MELDRUM [continued]: to a more criminal justice-oriented example. Let's imagine that we are collecting data from people who are in prison. They've already been arrested, charged and convicted for a particular type of crime. So we might be interested in knowing when these individuals started their criminal careers or what we call the onset of offending. So we might ask individuals in prison

    • 04:33

      DR. RYAN MELDRUM [continued]: to self-report for us at what age they were first arrested for committing a crime. Again, that is something that has happened in the past. It's fixed. And everyone that we're talking to would be able to report for us the age at which they were first arrested, because they're all in prison. So if somebody, for example, reported to me that they were 14 years old when they were first arrested

    • 04:55

      DR. RYAN MELDRUM [continued]: for a crime, and if I asked them a month from now, they should give me the exact same answer of 14 years old. If they give me a different age, that would suggest I do not have very strong test-retest reliability. The second way that we can assess reliability is by measuring the same concept using multiple people

    • 05:15

      DR. RYAN MELDRUM [continued]: at the same point in time. This is known as inter-rater reliability or inter-observer reliability, and we see this all the time. Again, it's just we don't recognize it as an example of reliability. If we think about, for example, why we have multiple judges who score boxing matches or why we have multiple judges who

    • 05:35

      DR. RYAN MELDRUM [continued]: are scoring Olympic events, like the gymnastics competition or diving competitions--we have at least three or four, even more judges who are viewing that event and are assigning a score--whether it be the number of punches landed or the number of punches thrown or the score that somebody gets assigned based upon their performance in the high dive

    • 05:56

      DR. RYAN MELDRUM [continued]: or on a floor event exercise in the gymnastics competition in the Olympics. What we typically see is that across the three, four, five different judges, the scores that are being assigned to the participant are nearly identical to one another. And that is because all of those judges have been trained to professionally evaluate

    • 06:17

      DR. RYAN MELDRUM [continued]: the performance of the participant. If we were getting wildly different scores across each one of the judges, this would suggest that we do not have a very reliable measure of what is being scored. But instead what we typically do see is that the scores are very closely clustered together. And that's because everybody has been trained

    • 06:39

      DR. RYAN MELDRUM [continued]: and they're scoring things in a very reliable, very consistent manner. Now we could take that same principle and apply it to a criminology example where one of the things that has been most extensively studied by researchers is the relationship between quality of parenting and involvement in delinquent behavior. And one thing that researchers have done

    • 07:00

      DR. RYAN MELDRUM [continued]: is they've studied the quality of the interaction between mothers and toddlers within laboratory environments. You can bring a mother and a toddler into a controlled room and you might assign them to engage in some type of task. Whether it be completing a puzzle or building blocks or any other task where it would take mutual cooperation between the mother

    • 07:22

      DR. RYAN MELDRUM [continued]: and the child in order to complete that task. Well, the researchers would typically either view this with multiple researchers or they might videotape it. And then they would have multiple researchers assign scores to the quality of that interaction between the mother and the child. If you have good inter-rater or inter-observer reliability,

    • 07:45

      DR. RYAN MELDRUM [continued]: what you hopefully would see is that each one of the judges, or each one of the researchers in this instance, would assign a score that is very similar to the other scores being assigned by the other researchers. And if that's the case, and it's something that we should expect if they've all been trained to score or code that observation, again, the score

    • 08:06

      DR. RYAN MELDRUM [continued]: should be closely or tightly clustered around one another. The third way that we can assess reliability is by measuring a single concept using slightly different, but overlapping indicators of that concept. This is known as interitem reliability or internal consistency reliability. On screen is an example of a measure for low self-control

    • 08:29

      DR. RYAN MELDRUM [continued]: that possesses very strong interitem reliability based on the participant's responses. As you can see, the person who responded to each of the items selected responses that were consistent and indicate that this person has a lot of self-control. You will notice that some of the items are worded in such a way

    • 08:50

      DR. RYAN MELDRUM [continued]: that the individual who is indicating that it is very characteristic of them would indicate high self-control, where some of the other items would indicate very low self-control. And so you can see that this participant has selected answers or responses to each one of the items that are very consistent in indicating that they are high in self-control.
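
The point about items worded in opposite directions can be made concrete with a short sketch. The tutorial does not name a statistic; Cronbach's alpha is used here as one commonly reported index of internal consistency. The four items, the 1-to-5 response scale, and all scores below are hypothetical.

```python
from statistics import variance


def reverse_code(score, low=1, high=5):
    """Flip a score on a low..high scale so all items point the same way."""
    return low + high - score


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    All items must already be coded in the same direction.
    """
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical raw answers from five respondents to four self-control items.
# Items at index 2 and 3 are worded in the opposite direction of items 0 and 1.
raw_responses = [
    [5, 4, 1, 2],
    [4, 4, 2, 2],
    [2, 1, 4, 5],
    [3, 3, 3, 3],
    [1, 2, 5, 4],
]
reverse_worded = {2, 3}

# Reverse-code the oppositely worded items before computing alpha.
coded = [
    [reverse_code(s) if i in reverse_worded else s for i, s in enumerate(row)]
    for row in raw_responses
]

alpha = cronbach_alpha(coded)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Because each hypothetical respondent answers the four items consistently once the wording direction is accounted for, alpha comes out well above 0.9. Skipping the reverse-coding step would make the same answers look inconsistent.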

    • 09:11

      DR. RYAN MELDRUM [continued]: [What is validity and how do researchers assess validity?] Now that we've discussed what reliability is and the different ways that researchers commonly assess the reliability of measurement quality, we can move on to discussing what validity is and the different ways that researchers can assess validity. Validity has to do with the extent to which what you have

    • 09:33

      DR. RYAN MELDRUM [continued]: measured accurately reflects the concept that you are trying to measure. There are several ways that researchers can establish validity and I'm going to discuss three of the specific ways. The first way that a researcher can assess validity is simply by examining on its face, does a measurement strategy

    • 09:54

      DR. RYAN MELDRUM [continued]: seem to make good common sense? So this is what we refer to as establishing face validity. Let's look at one example. Let's imagine that we are wanting to measure the concept of academic achievement. One way in which we could measure academic achievement is by looking up somebody's test score for a single final exam

    • 10:14

      DR. RYAN MELDRUM [continued]: in a single course. That certainly gives us some indication of the academic achievement, but it seems rather incomplete. We would likely agree that looking up somebody's GPA would provide a more face valid measure of their academic achievement relative to a single class because it's more all encompassing.

    • 10:36

      DR. RYAN MELDRUM [continued]: The GPA is based upon the average of how they performed on multiple exams within multiple classes. So we would likely agree that using GPA as a measure of academic achievement would be more face valid than relying only upon a measure based upon somebody's single exam grade from one semester.

    • 10:56

      DR. RYAN MELDRUM [continued]: As a second example of establishing face validity, let's imagine that we're wanting to measure criminal behavior and we have two different options. We can either rely upon arrest data for individuals or rely upon conviction data. What we know is that several people who are arrested never end up becoming convicted for their particular crime.

    • 11:18

      DR. RYAN MELDRUM [continued]: And so in that sense, from a legalistic standpoint, a measure of criminal behavior based upon conviction data would probably be more face valid than a measure that is based upon arrest. Again, because of that disparity between people who get arrested vs. those who are ultimately convicted. The second way that researchers can assess validity

    • 11:40

      DR. RYAN MELDRUM [continued]: is by comparing a measure of a concept to some external measure that should be predicted by the original measure. This is what is known as criterion-related validity. An example of establishing criterion-related validity would be comparing a measure of self-control that is self-reported by a teenager to a measure

    • 12:02

      DR. RYAN MELDRUM [continued]: of self-control that has been reported upon by the mother of that teenager. If the measure of self-reported self-control by the teenager is valid and if it has criterion-related validity, what we would expect to see is that those teenagers who report that they have higher self-control correspondingly,

    • 12:24

      DR. RYAN MELDRUM [continued]: their mothers should also report that they are high in self-control. And conversely, those teenagers who indicate they are lower in self-control, the mothers of those teenagers also as well, would be indicating scores that would represent lower self-control. And so you have the measure of self-control reported by the teenagers and you are evaluating that score

    • 12:45

      DR. RYAN MELDRUM [continued]: against the external criterion of the score being provided by the mothers. So to the extent that those two scores correlate well with one another as we would expect, we can establish the criterion-related validity of the original self-reported measure of self-control provided by the teenagers. A third way that researchers can determine

    • 13:05

      DR. RYAN MELDRUM [continued]: the validity of a measure is by assessing the degree to which the measure of a concept overlaps with the entirety of that concept. This is called content validity. Let's talk about one example of a measure of content validity where that measure would be very strong on content validity vs. an instance where the measure would be very weak on content validity.

    • 13:28

      DR. RYAN MELDRUM [continued]: Let's imagine we're wanting to measure involvement in juvenile delinquency. And as one option we have information on whether or not teenagers report wearing a seat belt. Certainly, failing to wear one is a prohibited act. It's a violation of law, but it is a very narrow range of delinquent behavior that would be assessed.

    • 13:48

      DR. RYAN MELDRUM [continued]: Now let's compare that measurement of delinquency to one in which we are asking teenagers to self-report for us not just whether or not they wear seat belts, but are they using substances? Are they engaged in property offending? Are they committing violent behaviors? If we think about the entire range of the concept of delinquency,

    • 14:08

      DR. RYAN MELDRUM [continued]: we could think of dozens, if not hundreds, of different types of delinquent acts that could be measured. So in the first instance, if we're only basing our measure of delinquency on whether or not teenagers report wearing seat belts, we would say that that measure would be very poor in terms of its content validity versus if we

    • 14:29

      DR. RYAN MELDRUM [continued]: had a measure of delinquency where we had information on a wide variety of delinquent acts that the participants had reported upon. That would be an example where you have very strong content validity. This idea is illustrated in the graphic on the screen. On the left half of the screen is an example where you would have very strong content validity.

    • 14:50

      DR. RYAN MELDRUM [continued]: There are two circles. The red circle is the concept you're wanting to measure and the green circle is your actual measurement or your actual operationalization of that concept. And you can see there is significant overlap between the two circles. The graphic on the right half of the screen, however, would be an instance in which we have done a very poor job of really

    • 15:10

      DR. RYAN MELDRUM [continued]: tapping into the concept we were intending to measure. So if we refer back to the example of measuring delinquency based upon a single item where we have people self-reporting to us whether or not they've worn a seat belt, that would probably fit closer to the graphic on the right side of the screen where we have the entirety of delinquency representing

    • 15:31

      DR. RYAN MELDRUM [continued]: the red circle and our measurement of delinquency being seat belt use as the green circle. And you can see there's just not a whole lot of overlap between the two circles. Now on the other hand, if we were measuring delinquency based not just on seat belt use, but based upon substance abuse, involvement in violence, property offending,

    • 15:51

      DR. RYAN MELDRUM [continued]: well, then you would have an instance in which there is much greater overlap between the concept of delinquency and then our ability to have measured that concept in a much more multi-dimensional, much more comprehensive manner. So in this tutorial, I've discussed what reliability is and how researchers commonly assess reliability.

    • 16:13

      DR. RYAN MELDRUM [continued]: And then I proceeded to discuss what validity is and some of the ways in which researchers discuss and assess validity. [MUSIC PLAYING]

Dr. Ryan Meldrum defines reliability and validity, and he explains different ways researchers assess both. He draws on specific examples to illustrate the assessment methods.
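
Two of the checks described in the tutorial, inter-rater reliability and criterion-related validity, can also be sketched with a few lines of code. This is a hedged illustration only: every score below is invented, the per-interaction spread is an informal stand-in for a formal agreement statistic, and the Pearson correlation is an assumed way of checking whether two sets of scores "correlate well."

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Inter-rater reliability: three hypothetical trained coders each score the
# same five videotaped mother-toddler interactions on a 1-10 quality scale.
ratings = {
    "coder_a": [7, 4, 9, 5, 6],
    "coder_b": [7, 5, 9, 5, 6],
    "coder_c": [8, 4, 9, 4, 6],
}

# For each interaction, the gap between the highest and lowest score.
# Small gaps mean the scores are tightly clustered across coders.
spreads = [max(scores) - min(scores) for scores in zip(*ratings.values())]
print("spread per interaction:", spreads)

# Criterion-related validity: hypothetical self-control scores reported by
# six teenagers, compared against scores reported by their mothers.
teen_self_report = [12, 18, 25, 30, 22, 15]
mother_report = [14, 17, 27, 29, 20, 16]

criterion_r = pearson_r(teen_self_report, mother_report)
print(f"teen vs. mother correlation = {criterion_r:.2f}")
```

In this made-up data no interaction differs by more than one point across coders, and the teen and mother reports are strongly positively correlated, which is the pattern the tutorial describes for a reliable coding scheme and a valid self-report measure.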


