Individual Differences


Essentials of a Good Psychological Test

Last updated:
25 Jul 2004


Reliability - overview

Reliability is the extent to which a test is repeatable and yields consistent scores.

Note:  In order to be valid, a test must be reliable; but reliability does not guarantee validity.

All measurement procedures have the potential for error, so the aim is to minimize it. An observed test score is made up of the true score plus measurement error.

The goal of estimating reliability (consistency) is to determine how much of the variability in test scores is due to measurement error and how much is due to variability in true scores.

Measurement errors are essentially random: a person’s test score might not reflect the true score because they were sick, hungover, anxious, in a noisy room, etc.
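The idea that an observed score is a true score plus random error can be sketched in a few lines of Python. This is purely an illustration with made-up numbers: the "true" abilities and the size of the error are assumptions, not data from any real test.

```python
import random

random.seed(1)

# Hypothetical illustration: observed score = true score + random error.
true_scores = [random.gauss(100, 15) for _ in range(10000)]  # stable underlying ability
errors = [random.gauss(0, 5) for _ in range(10000)]          # random measurement error

observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability = proportion of observed-score variance due to true scores.
reliability = variance(true_scores) / variance(observed)
```

With these assumed numbers, most of the observed variance comes from true scores, so the reliability estimate is high; making the errors larger would drag it down.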

Reliability can be improved by:

  • getting repeated measurements using the same test and

  • getting many different measures using slightly different techniques and methods.

- e.g., consider how university assessment for grades involves several sources. You would not consider one multiple-choice exam question to be a reliable basis for testing your knowledge of "individual differences". Many questions are asked in many different formats (e.g., exam, essay, presentation) to help provide a more reliable score.

Types of reliability

There are several ways to estimate a test's reliability. I'll mention a few of them now:

1. Test-retest reliability

The test-retest method of estimating a test's reliability involves administering the test to the same group of people at least twice. The first set of scores is then correlated with the second set. Reliability coefficients range between 0 (low reliability) and 1 (high reliability); it is highly unlikely they will be negative!

Remember that change might be due to measurement error. For example, if you use a tape measure to measure a room on two different days, any difference in the result is likely due to measurement error rather than a change in the room's size. However, if you measure children's reading ability in February and then again in June, the change is likely due to real growth in the children's reading ability. The actual experience of taking the test can also have an impact (called reactivity): with a history quiz, for instance, people might look up the answers between administrations and do better the second time, or simply remember their original answers.
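As a sketch, test-retest reliability is just the correlation between the two administrations. The scores below are invented for illustration:

```python
# Hypothetical scores for the same five people, tested on two occasions.
time1 = [12, 18, 15, 20, 9]
time2 = [13, 17, 16, 19, 10]

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

retest_reliability = pearson_r(time1, time2)
print(round(retest_reliability, 2))  # 0.99 - the rank order is very stable
```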

2. Alternate Forms

Administer Test A to a group and then administer an equivalent Test B to the same group. The correlation between the two sets of scores is the estimate of the test's reliability.

3. Split Half reliability

The scores on one half of the test's items (e.g., the odd-numbered items) are correlated with the scores on the other half. Because this correlation is based on half-length tests, it is usually adjusted upwards (via the Spearman-Brown formula) to estimate the reliability of the full-length test.
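A sketch of the split-half idea, using invented right/wrong item data: correlate odd- versus even-numbered item scores, then apply the Spearman-Brown correction to estimate full-length reliability.

```python
# Hypothetical 0/1 (wrong/right) responses: rows = people, columns = items.
items = [
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1],
]

# Score the odd-numbered and even-numbered items separately for each person.
odd_half  = [sum(row[0::2]) for row in items]
even_half = [sum(row[1::2]) for row in items]

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

half_r = pearson_r(odd_half, even_half)
# Spearman-Brown: estimate the reliability of the full-length test.
full_r = 2 * half_r / (1 + half_r)
```

Note that the corrected estimate is always at least as high as the raw half-test correlation, since a longer test is more reliable than either half alone.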

4. Inter-rater Reliability

Compare the scores given by different raters. e.g., for important work in higher education (e.g., theses), there are multiple markers to help ensure accurate assessment by checking inter-rater reliability.
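One simple (if crude) index of inter-rater reliability for categorical judgements is the proportion of cases on which two markers agree; the grades below are invented. (In practice Cohen's kappa is preferred because it corrects for agreement expected by chance.)

```python
# Hypothetical grades given by two markers to the same six theses.
rater_a = ["pass", "fail", "pass", "pass", "credit", "pass"]
rater_b = ["pass", "fail", "credit", "pass", "credit", "pass"]

# Proportion of cases on which the two markers agree.
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(round(agreement, 2))  # 5 of 6 grades match: 0.83
```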

5. Internal consistency

Internal consistency is commonly measured with Cronbach's alpha (based on the inter-item correlations), which ranges between 0 (low) and 1 (high). The greater the number of similar items, the greater the internal consistency. That's why you sometimes get very long scales asking a question in a myriad of different ways - adding more items tends to raise Cronbach's alpha. Generally, an alpha of .80 is considered a reasonable benchmark.
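Cronbach's alpha can be computed directly from its standard formula: alpha = k/(k−1) × (1 − Σ item variances / variance of total scores), where k is the number of items. The ratings below are made up for illustration.

```python
# Hypothetical 1-5 ratings on a four-item scale: rows = respondents, columns = items.
data = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 3, 4],
]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

k = len(data[0])  # number of items
item_vars = [variance([row[i] for row in data]) for i in range(k)]
total_var = variance([sum(row) for row in data])

# Cronbach's alpha: high when items vary together (i.e., measure the same thing).
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))  # high alpha - these invented items are very consistent
```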

How reliable should tests be?  Some reliability guidelines

.90 = high reliability

.80 = moderate reliability

.70 = low reliability

High reliability is required when (note that most standardized tests of intelligence report reliability estimates around .90, i.e., high):

  • tests are used to make important decisions

  • individuals are sorted into many different categories based upon relatively small individual differences e.g. intelligence

Lower reliability is acceptable when (note that for most testing applications, reliability estimates around .70 are usually regarded as low, i.e., about 30% of the variability in scores is attributable to measurement error):

  • tests are used for preliminary rather than final decisions

  • tests are used to sort people into a small number of groups based on gross individual differences e.g. height or sociability /extraversion

Reliability estimates of .80 or higher are typically regarded as moderate to high (approx. 20% of the variability in test scores is attributable to error).

Reliability estimates below .60 are usually regarded as unacceptably low.

Levels of reliability typically reported for different types of tests and measurement devices are reported in Table 7-6: Murphy and Davidshofer (2001, p.142).


Validity - overview

Validity is the extent to which a test measures what it is supposed to measure.

Validity is a subjective judgment made on the basis of experience and empirical indicators.

Validity asks "Is the test measuring what you think it’s measuring?"

For example, we might define "aggression" as an act intended to cause harm to another person (a conceptual definition) but the operational definition might be seeing:

  • how many times a child hits a doll
  • how  often a child pushes to the front of the queue
  • how many physical scraps he/she gets into in the playground.

Are these valid measures of aggression?  i.e., how well does the operational definition match the conceptual definition?

Remember: In order to be valid, a test must be reliable; but reliability does not guarantee validity, i.e. it is possible to have a highly reliable test which is meaningless (invalid).

Note that where validity coefficients are calculated, they will range between 0 (low) and 1 (high).

Types of Validity

Face validity

Face validity is the least important aspect of validity, because validity still needs to be directly checked through other methods. All that face validity means is:

"Does the measure, on the face of it, seem to measure what is intended?"

Sometimes researchers try to obscure a measure’s face validity - say, if it’s measuring a socially undesirable characteristic (such as modern racism).  But the more practical point is to be suspicious of any measures that purport to measure one thing, but seem to measure something different.  e.g., political polls - a politician's current popularity is not necessarily a valid indicator of who is going to win an election.

Construct validity

Construct validity is the most important kind of validity.

If a measure has construct validity it measures what it purports to measure.

Establishing construct validity is a long and complex process.

The various qualities that contribute to construct validity include:

  • criterion validity (includes predictive and concurrent)
  • convergent validity
  • discriminant validity

To create a measure with construct validity, first define the domain of interest (i.e., what is to be measured), then design measurement items which adequately measure that domain. A scientific process of rigorously testing and modifying the measure is then undertaken.

Note that in psychological testing there may be a bias towards selecting items which can be objectively written down, etc. rather than other indicators of the domain of interest (i.e. a source of invalidity)

Criterion validity

Criterion validity consists of concurrent and predictive validity.

  • Concurrent validity: "Does the measure relate to other manifestations of the construct the device is supposed to be measuring?"
  • Predictive validity: "Does the test predict an individual's performance in specific abilities?"

Convergent validity

It is important to know whether a test returns similar results to other tests which purport to measure the same or related constructs.

Does the measure match an external "criterion", e.g., behaviour or another, well-established test? Does it measure the construct concurrently, and can it predict this behaviour?

  • Observations of dominant behaviour (criterion) can be compared with self-report dominance scores (measure)
  • Trained interviewer ratings (criterion) can be compared with self-report dominance scores (measure)

Discriminant validity

It is important to show that a measure doesn't measure what it isn't meant to measure - i.e., it discriminates.

For example, discriminant validity would be evidenced by a low correlation between a quantitative reasoning test and scores on a reading comprehension test, since reading ability is an irrelevant variable in a test designed to measure quantitative reasoning.
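Here is a sketch of the pattern you would look for: a quantitative reasoning test should correlate highly with another quantitative test (convergent) and only weakly with a reading test (discriminant). All scores below are invented.

```python
# Hypothetical scores for six students on three tests.
quant_test_a = [10, 12, 9, 15, 11, 8]   # quantitative reasoning test
quant_test_b = [11, 13, 9, 14, 12, 8]   # another quantitative test
reading_test = [8, 7, 8, 9, 6, 8]       # reading comprehension test

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

convergent = pearson_r(quant_test_a, quant_test_b)    # should be high
discriminant = pearson_r(quant_test_a, reading_test)  # should be low
```

With these made-up numbers the convergent correlation is high and the discriminant correlation is near zero, which is the pattern that supports construct validity.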

Sources of Invalidity

  • Unreliability
  • Response sets = psychological orientation or bias towards answering in a particular way:
    • Acquiescence: tendency to agree, i.e., say "Yes". Hence the use of half negatively and half positively worded items (but there can be semantic difficulties with negatively worded items).
    • Social desirability: tendency to portray oneself in a positive light. Try to design questions so that social desirability isn't salient.
    • Faking bad: Purposely saying 'no' or looking bad if there's a 'reward' (e.g. attention, compensation, social welfare, etc.).
  • Bias
    • Cultural bias: does the psychological construct have the same meaning from one culture to another; how are the different items interpreted by people from different cultures; actual content (face) validity may be different for different cultures.
    • Gender bias may also be possible.
    • Test Bias
      • Bias in measurement occurs when the test makes systematic errors in measuring a particular characteristic or attribute e.g. many say that most IQ tests may well be valid for middle-class whites but not for blacks or other minorities.  In interviews, which are a type of test, research shows that there is a bias in favour of good-looking applicants.
      • Bias in prediction occurs when the test makes systematic errors in predicting some outcome (or criterion). It is often suggested that tests used in academic admissions and in personnel selection under-predict the performance of minority applicants. Also, a test may be useful for predicting the performance of one group (e.g., males) but be less accurate in predicting the performance of females.


Generalizability

Just a brief word on generalizability. Reliability and validity are often discussed separately, but sometimes you will see them both referred to as aspects of generalizability. Often we want to know whether the results of a measure or a test used with a particular group can be generalized to other tests or other groups.

So, is the result you get with one test, let's say the WISC III, equivalent to the result you would get using the Stanford-Binet? Do both these tests give a similar IQ score? And do the results you get from the people you assessed apply to other kinds of people? Are the results generalizable?

So a test may be reliable and it may be valid but its results may not be generalizable to other tests measuring the same construct nor to populations other than the one sampled.

Let me give you an example.  If I measured the levels of aggression of a very large random sample of children in primary schools in the ACT, I may use a scale which is perfectly reliable and a perfectly valid measure of aggression. But would my results be exactly the same had I used another equally valid and reliable measure of aggression? Probably not, as it’s difficult to get a perfect measure of a construct like aggression.

Furthermore, could I then generalize my findings to ALL children in the world, or even in Australia? No. The demographics of the ACT are quite different from those of Australia as a whole, and my sample is only truly representative of the population of primary school children in the ACT. Could I generalize my findings on levels of aggression to all 5-18 year olds in the ACT? No. Because I've only measured primary school children, and their levels of aggression are not necessarily similar to levels of aggression shown by adolescents.


Standardization

Standardized tests are:

  • administered under uniform conditions. i.e. no matter where, when, by whom or to whom it is given, the test is administered in a similar way.
  • scored objectively, i.e. the procedures for scoring the test are specified in detail so that any number of trained scorers will arrive at the same score for the same set of responses. So, for example, questions that need subjective evaluation (e.g. essay questions) are generally not included in standardized tests.
  • designed to measure relative performance. i.e. they are not designed to measure ABSOLUTE ability on a task. In order to measure relative performance, standardized tests are interpreted with reference to a comparable group of people, the standardization, or normative sample. e.g. Highest possible grade in a test is 100. Child scores 60 on a standardized achievement test. You may feel that the child has not demonstrated mastery of the material covered in the test (absolute ability) BUT if the average of the standardization sample was 55 the child has done quite well (RELATIVE performance).
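The relative-performance idea can be sketched by converting a raw score to a z-score against a normative sample. All the numbers below are invented for illustration.

```python
# Hypothetical normative sample on an achievement test (highest possible grade 100).
norm_sample = [55, 50, 60, 48, 57, 52, 58, 53, 49, 63]
child_score = 60

mean = sum(norm_sample) / len(norm_sample)
sd = (sum((x - mean) ** 2 for x in norm_sample) / len(norm_sample)) ** 0.5

# z-score: how far above or below the normative average the child scored.
z = (child_score - mean) / sd
print(round(z, 2))  # positive: above the normative sample's average
```

Even though 60/100 looks unimpressive in absolute terms, this child is more than a standard deviation above the normative sample's mean, i.e. relative performance is quite good.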

The normative sample should (for hopefully obvious reasons!) be representative of the target population - however, this is not always the case, in which case the norms and the structure of the test would need to be interpreted with appropriate caution.
