Essentials of a Good Psychological Test
Reliability is the extent to which a test is repeatable and yields consistent scores.
Note: In order to be valid, a test must be reliable; but reliability does not guarantee validity.
All measurement procedures have the potential for error, so the aim is to minimize it. An observed test score is made up of the true score plus measurement error.
The goal of estimating reliability (consistency) is to determine how much of the variability in test scores is due to measurement error and how much is due to variability in true scores.
Measurement errors are essentially random: a person’s test score might not reflect the true score because they were sick, hungover, anxious, in a noisy room, etc.
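This classical model (observed score = true score + error) can be sketched with made-up numbers. Everything below is hypothetical: an IQ-like scale with true-score SD 15 and error SD 5.

```python
import random

random.seed(0)

# Classical test theory sketch: observed score = true score + random error.
# All numbers are hypothetical (IQ-like scale: mean 100, SD 15, error SD 5).
n = 1000
true_scores = [random.gauss(100, 15) for _ in range(n)]
errors = [random.gauss(0, 5) for _ in range(n)]
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability = proportion of observed-score variance due to true scores.
reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))  # close to 15**2 / (15**2 + 5**2) = 0.9
```

Because the error is random and independent of the true scores, the observed-score variance is roughly the sum of true and error variance, and the ratio estimates the test's reliability.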
Reliability can be improved by aggregating across multiple items, occasions, and assessment formats:
- e.g., university assessment for grades involves several sources. You would not consider one multiple-choice exam question a reliable basis for testing your knowledge of "individual differences"; many questions are asked in many different formats (e.g., exam, essay, presentation) to help provide a more reliable score.
There are several types of reliability, and a number of ways to estimate it. I'll mention a few of them now:
1. Test-retest reliability
The test-retest method of estimating a test's reliability involves administering the test to the same group of people at least twice, then correlating the first set of scores with the second. Correlations range between 0 (low reliability) and 1 (high reliability); it is highly unlikely they will be negative.
Remember that change might be due to measurement error. For example, if you use a tape measure to measure a room on two different days, any difference in the result is likely due to measurement error rather than a change in the room's size. However, if you measure children's reading ability in February and then again in June, the change is likely due to genuine changes in reading ability. The actual experience of taking the test can also have an impact (called reactivity): after a history quiz you might look up the answers and do better next time, or you might simply remember your original answers.
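As a sketch, the test-retest estimate is just a Pearson correlation between the two administrations. The scores below are hypothetical:

```python
# Hypothetical test-retest data: five people sitting the same test twice.
time1 = [12, 18, 25, 30, 34]
time2 = [14, 17, 27, 29, 35]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(time1, time2)
print(round(r, 2))  # a value near 1 indicates high test-retest reliability
```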
2. Alternate Forms
Administer Test A to a group and then administer an equivalent Test B to the same group. The correlation between the two sets of scores is the estimate of the test's reliability.
3. Split Half reliability
Split the test in half (e.g., odd-numbered vs. even-numbered items) and correlate scores on one half with scores on the other. Because each half is shorter than the full test, the half-test correlation is usually stepped up with the Spearman-Brown formula to estimate the reliability of the full-length test.
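A quick sketch with hypothetical right/wrong item responses, using an odd-even split and the Spearman-Brown correction for test length:

```python
# Hypothetical responses (1 = correct, 0 = incorrect) of six people to a 6-item test.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
]

# Odd-even split: score each half separately, then correlate the half scores.
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r_half = pearson(odd_half, even_half)

# Spearman-Brown correction: estimate the reliability of the full-length
# test from the correlation between its two halves.
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```

Note that the corrected estimate is always at least as high as the raw half-test correlation, reflecting the fact that longer tests are more reliable.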
4. Inter-rater Reliability
Compare scores given by different raters. For example, for important work in higher education (e.g., theses), multiple markers are used to help ensure accurate assessment by checking inter-rater reliability.
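A minimal sketch of checking inter-rater reliability by correlating two markers' scores; the marks below are hypothetical:

```python
# Hypothetical marks awarded by two independent markers to the same six theses.
marker_a = [72, 65, 80, 58, 90, 68]
marker_b = [70, 68, 78, 60, 88, 71]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson(marker_a, marker_b)
print(round(r, 2))  # a high correlation indicates the markers agree
```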
5. Internal consistency
Internal consistency is commonly measured with Cronbach's alpha (based on inter-item correlations), which ranges between 0 (low) and 1 (high). The greater the number of similar items, the greater the internal consistency; that's why you sometimes get very long scales asking a question in a myriad of different ways - adding more items yields a higher Cronbach's alpha. Generally, an alpha of .80 is considered a reasonable benchmark:
.90 = high reliability
.80 = moderate reliability
.70 = low reliability
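Cronbach's alpha can be computed from the item variances and the variance of the total scores. The Likert-style responses below are hypothetical:

```python
# Hypothetical 1-5 Likert responses of five people to a 4-item scale.
responses = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

k = len(responses[0])                                    # number of items
item_vars = [variance([row[i] for row in responses]) for i in range(k)]
total_var = variance([sum(row) for row in responses])    # variance of total scores

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total-score variance).
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))
```

When the items covary strongly (people who agree with one item tend to agree with the others), the total-score variance is much larger than the sum of the item variances, and alpha approaches 1.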
High reliability is required when tests are used to make important decisions about individuals. (Note: most standardized tests of intelligence report reliability estimates around .90, i.e., high.)
Lower reliability is acceptable for less critical purposes, such as preliminary screening or research. (Note: for most testing applications, reliability estimates around .70 are usually regarded as low; since the reliability coefficient estimates the proportion of observed-score variance due to true scores, a reliability of .70 implies that roughly 30% of score variability is due to measurement error.)
Reliability estimates of .80 or higher are typically regarded as moderate to high (approx. 20% or less of the variability in test scores is attributable to error).
Reliability estimates below .60 are usually regarded as unacceptably low.
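Under classical test theory, the reliability coefficient estimates the proportion of observed-score variance due to true scores, so converting a coefficient into an error-variance percentage is simple arithmetic:

```python
# The reliability coefficient is the proportion of observed-score variance
# attributable to true scores; the remainder is attributable to error.
for reliability in (0.90, 0.80, 0.70, 0.60):
    error_pct = (1 - reliability) * 100
    print(f"reliability {reliability:.2f} -> ~{error_pct:.0f}% error variance")
```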
Levels of reliability typically reported for different types of tests and measurement devices are given in Table 7-6 of Murphy and Davidshofer (2001, p. 142).
Validity is the extent to which a test measures what it is supposed to measure.
Validity is a subjective judgment made on the basis of experience and empirical indicators.
Validity asks "Is the test measuring what you think it’s measuring?"
For example, we might define "aggression" as an act intended to cause harm to another person (a conceptual definition), but the operational definition might be observing specific behaviours, such as hitting, pushing, or verbal insults.
Are these valid measures of aggression? i.e., how well does the operational definition match the conceptual definition?
Remember: In order to be valid, a test must be reliable; but reliability does not guarantee validity, i.e. it is possible to have a highly reliable test which is meaningless (invalid).
Note that where validity coefficients are calculated, they will range from 0 (low) to 1 (high).
Face validity is the least important aspect of validity, because validity still needs to be directly checked through other methods. All that face validity means is:
"Does the measure, on the face of it, seem to measure what is intended?"
Sometimes researchers try to obscure a measure’s face validity - say, if it’s measuring a socially undesirable characteristic (such as modern racism). But the more practical point is to be suspicious of any measures that purport to measure one thing, but seem to measure something different. e.g., political polls - a politician's current popularity is not necessarily a valid indicator of who is going to win an election.
Construct Validity is the most important kind of validity
If a measure has construct validity it measures what it purports to measure.
Establishing construct validity is a long and complex process.
Various qualities contribute to construct validity, including the criterion-related and discriminant evidence described below.
To create a measure with construct validity, first define the domain of interest (i.e., what is to be measured), then design measurement items which adequately measure that domain. A scientific process of rigorously testing and modifying the measure is then undertaken.
Note that in psychological testing there may be a bias towards selecting items which can be objectively written down rather than other indicators of the domain of interest (i.e., a potential source of invalidity).
Criterion validity consists of concurrent and predictive validity.
It is important to know whether a test returns similar results to other tests which purport to measure the same or related constructs.
Does the measure match an external 'criterion', e.g., behaviour or another, well-established test? Does it agree with the criterion measured at the same time (concurrent validity), and can it predict future behaviour (predictive validity)?
Discriminant validity: it is important to show that a measure doesn't measure what it isn't meant to measure - i.e., that it discriminates.
For example, discriminant validity would be evidenced by a low correlation between a quantitative reasoning test and scores on a reading comprehension test, since reading ability is an irrelevant variable in a test designed to measure quantitative reasoning.
Just a brief word on generalizability. Reliability and validity are often discussed separately but sometimes you will see them both referred to as aspects of generalizability. Often we want to know whether the results of a measure or a test used with a particular group can be generalized to other tests or other groups.
So, is the result you get with one test, let's say the WISC III, equivalent to the result you would get using the Stanford-Binet? Do both these tests give a similar IQ score? And do the results you get from the people you assessed apply to other kinds of people? Are the results generalizable?
So a test may be reliable and it may be valid but its results may not be generalizable to other tests measuring the same construct nor to populations other than the one sampled.
Let me give you an example. If I measured the levels of aggression of a very large random sample of children in primary schools in the ACT, I may use a scale which is perfectly reliable and a perfectly valid measure of aggression. But would my results be exactly the same had I used another equally valid and reliable measure of aggression? Probably not, as it’s difficult to get a perfect measure of a construct like aggression.
Furthermore, could I then generalize my findings to ALL children in the world, or even in Australia? No. The demographics of the ACT are quite different from those of Australia as a whole, and my sample is only truly representative of the population of primary school children in the ACT. Could I generalize my findings about levels of aggression to all 5-18 year olds in the ACT? No, because I've only measured primary school children, and their levels of aggression are not necessarily similar to the levels of aggression shown by adolescents.
Standardization: Standardized tests are administered and scored under uniform, prescribed conditions, so that scores can be compared against norms derived from a normative sample.
The normative sample should (for hopefully obvious reasons!) be representative of the target population. However, this is not always the case, in which case the norms, and the structure of the test, need to be interpreted with appropriate caution.