Research Methods

Why use Effect Sizes instead of Significance Testing in Program Evaluation?

James Neill
Last updated: 11 Sep 2008

Introduction

This article discusses the overuse of significance testing and the underuse of effect sizes for reporting on the effects of intervention programs.  

Effect sizes are now a standard and expected part of much statistical reporting and should be more commonly reported and understood (Thompson, 2000).

Significance testing

There are three main issues with relying on significance testing for evaluating the effects of interventions:

  1. Significance tests should only be used when generalising from the sample to a population

Significance tests should only be used in attempts to generalise the sample's results to a population.  For example, based on a sample of 10% of our clients, we want to use the sample data to make generalised conclusions for all of our clients.  In this case, the results of the sample are of little intrinsic interest.  Of primary interest is what the sample data might represent about the target population.

Program evaluations often have access to data from the entire population of interest, e.g., all participants in a specific program.  In such situations, the full population data set is available and there is no need for inferential testing because there is no need to generalise beyond the sample.  In these situations, descriptive statistics and effect sizes are all that is needed.

  2. Significance tests conducted with low power can be misleading

Program evaluation studies with fewer than approximately 50 participants tend to lack sufficient statistical power (a function of sample size, effect size, and p level) for detecting small, medium, or possibly even large effects.  In such situations, the results of significance tests can be misleading because they are subject to Type II errors (incorrectly failing to reject the null hypothesis).  In these situations, it can be more informative to use effect sizes, possibly with confidence intervals.

  3. Significance tests conducted with high power can be misleading

For studies involving large sample sizes (e.g., > ~400), a different problem occurs with significance testing: even small effects are likely to become statistically significant, although these effects may be trivial in practical terms.  In these situations, more attention should be paid to effect sizes than to statistical significance testing.
 

To sum up:

  1. When there is no interest in generalising (e.g., we are only interested in the results for the sample), there is no need for significance testing.  In these situations, effect sizes are sufficient and suitable.

  2. When examining effects using small sample sizes, significance testing can be misleading because it is subject to Type II errors.  Contrary to popular opinion, statistical significance is not a direct indicator of the size of an effect; rather, it is a function of sample size, effect size, and p level.  In these situations, effect sizes and confidence intervals are more informative than significance testing.

  3. When examining effects using large samples, significance testing can be misleading because even small or trivial effects are likely to produce statistically significant results.  This can be dealt with by reporting and emphasising effect sizes rather than significance test results (a brief numerical illustration follows this list).
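
As an illustration of points 2 and 3, here is a minimal sketch (not from the original article; the sample sizes and effect sizes are hypothetical) showing, in Python, how the same logic plays out numerically for a simple paired (pre-post) design, where the t statistic is the effect size multiplied by the square root of the sample size:

    # Illustrative only: p-values for a paired (pre-post) design, where the
    # effect size d is the mean change divided by the SD of the change scores,
    # so that t = d * sqrt(n).
    import math
    from scipy.stats import t as t_dist

    def p_value(d: float, n: int) -> float:
        """Two-tailed p-value for effect size d with n participants."""
        t_stat = d * math.sqrt(n)
        return 2 * t_dist.sf(abs(t_stat), df=n - 1)

    print(round(p_value(d=0.5, n=10), 3))    # ~0.148: moderate effect, small sample, "not significant"
    print(round(p_value(d=0.1, n=1000), 3))  # ~0.002: trivial effect, large sample, highly "significant"

A moderate effect in a small sample fails to reach significance (a likely Type II error), while a trivial effect in a large sample is highly "significant"; in both cases the effect size is the more informative statistic.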

Effect sizes

Types of questions

One of the most common questions driving evaluation of intervention programs is "did we get an effect?".  However, very often, the question(s) could be more usefully phrased as "how much effect did this program have?" and perhaps also "how does this effect compare with the effects of other interventions?".

To answer the "did we get an effect?" question it is necessary to compare the observed result against 0.  If the results are to be generalised, the comparison against 0 or a control group can be done inferentially, using statistics such as paired samples t-tests, repeated measures ANOVAs, etc. are used.  Alternatively, confidence intervals can be used.  If the results are not to be generalised (i.e., where the population data is available), then the question can be answered simply by examining the effect sizes.

To answer the "how much effect did the program get?" question, effect sizes can be used. 

To answer the "how does the effect compare with the effects of other interventions?" question, effect sizes can be used because they are standardised and allow ready comparison with benchmarks such as effect sizes for other types of interventions obtained from meta-analyses.

Use of effect sizes can also be combined with other data, such as cost, to provide a measure of cost-effectiveness.  In other words, "how much bang (effect size) for the buck (cost)?".  Government and philanthropic funders commonly wish to ask this question, and there is increasing demand for such "proven evidence" of outcomes and cost-effectiveness in psycho-social intervention programs.

Advantages and disadvantages

Some advantages of effect size reporting are that:

  • It tends to be easier for practitioners to relate intuitively to effect sizes (once they are explained) than to significance test results

  • Effect sizes facilitate ready comparison with internal or external benchmarks

  • Confidence intervals can be placed around effect sizes (providing an equivalent to significance testing if desired)

The main disadvantages of using effect sizes are that:

  • Research culture and software packages are still in transition from habitual significance testing to habitual effect size reporting.  Thus, commonly used statistical packages surprisingly still tend to offer limited functionality for creating effect sizes.

  • Most undergraduate and postgraduate research methods and statistics courses tend to teach and overemphasise classical test theory and inferential statistical methods, and to underemphasise effect sizes and confidence intervals.  In response, there has been a campaign since the 1980s to educate social scientists about the misuse of significance testing and the need for more common reporting of effect sizes.  Significantly, these endeavours were recognised in changes to the 5th edition of the American Psychological Association's publication manual, which indicates that failure to report effect sizes is a deficiency in the reporting of research (see Thompson, 2000).

If in doubt about which statistics to report, report effect sizes, confidence intervals, and significance test results.

Types of Effect Size

There are several types of effect size, based on either difference scores or correlations.  For more information, see Valentine and Cooper (2003), Wikipedia, and Wikiversity.  One type of effect size can be converted into another, so the choice of effect size is somewhat arbitrary.  For program evaluation, however, standardised mean effect sizes are most commonly used.

Standardised mean effect size

Standardised mean effect sizes (such as Cohen's d and Hedges' g) are basically z scores.  These effect sizes indicate the mean difference between two variables expressed in standard deviation units.  A score of 0 represents no change, and effect size scores can be negative or positive.  The meaning of an effect size depends on the measurement context, so rules of thumb should be treated cautiously.  A well-known guide is offered by Cohen (1988):

  • .8 = large
  • .5 = moderate
  • .2 = small
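
As an illustration, the following is a minimal sketch (not the article's own code; the data are hypothetical) of how a standardised mean effect size might be computed for a simple pre-post program evaluation, using the pooled standard deviation of the two time points as the standardiser:

    # Illustrative sketch: Cohen's d for pre- and post-program scores,
    # using the pooled SD of the two time points as the denominator.
    # (Other standardisers, e.g. the SD of change scores, are also used.)
    import math

    def cohens_d(pre, post):
        mean_pre = sum(pre) / len(pre)
        mean_post = sum(post) / len(post)
        var_pre = sum((x - mean_pre) ** 2 for x in pre) / (len(pre) - 1)
        var_post = sum((x - mean_post) ** 2 for x in post) / (len(post) - 1)
        pooled_sd = math.sqrt((var_pre + var_post) / 2)
        return (mean_post - mean_pre) / pooled_sd

    pre = [12, 15, 11, 14, 13, 16, 12, 15]     # hypothetical pre-program scores
    post = [14, 17, 13, 15, 16, 18, 13, 16]    # hypothetical post-program scores
    print(round(cohens_d(pre, post), 2))       # 0.97: a positive change of about 1 SD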

Percentile scores, based on the properties of the normal distribution, can be used to aid interpretation of standardised mean effect sizes.  Percentiles can be used to indicate, for example, where someone who started on the 50th percentile could, on average, expect to end up after the intervention (compared to people who didn't experience the intervention).  For example, imagine a child of average reading ability who participates in an intensive reading program.  If the program had a small effect (ES = .2), this would raise the child's reading performance to the 58th percentile, a moderate effect (ES = .5) would raise the child to the 69th percentile, and a large effect (ES = .8) would raise the child to the 79th percentile.  These calculations can be easily done using a free web calculator, such as at HyperStat Online.
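
The percentile figures quoted above can be reproduced directly from the normal distribution; the following is a minimal sketch (not part of the original article) using Python rather than a web calculator:

    # The cumulative normal distribution converts a standardised mean effect
    # size into the expected percentile for someone who started at the 50th.
    from scipy.stats import norm

    for es in (0.2, 0.5, 0.8):
        percentile = norm.cdf(es) * 100
        print(f"ES = {es:.1f} -> approximately the {percentile:.0f}th percentile")
    # Prints roughly the 58th, 69th and 79th percentiles, as described above.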

Correlational effect sizes

Correlations (r) are effect sizes.  Correlational and standardised mean effect sizes cannot be directly compared, but one can be converted into the other (a conversion sketch follows the guide below).  In the context of an intervention program, a correlational effect size would indicate the standardised covariance between Time (independent variable) and Outcome (dependent variable).  As with standardised mean effect sizes, interpretation of the size of correlational effects needs to occur within the context of the study, but a general guide (Cohen, 1988) is that:

  • .5 = large
  • .3 = medium
  • .1 = small
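
The following is a minimal sketch (an illustration, not the article's own material) of the standard conversion formulas between the two families of effect size, assuming two groups of equal size:

    # Standard conversions between a correlation (r) and a standardised mean
    # difference (Cohen's d), assuming equal group sizes.
    import math

    def r_to_d(r: float) -> float:
        return 2 * r / math.sqrt(1 - r ** 2)

    def d_to_r(d: float) -> float:
        return d / math.sqrt(d ** 2 + 4)

    print(round(r_to_d(0.3), 2))  # ~0.63: a "medium" r corresponds to a medium-to-large d
    print(round(d_to_r(0.5), 2))  # ~0.24: a "moderate" d corresponds to a small-to-medium r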

Confidence Intervals

If there is no desire to generalise the findings to a population, then the effect size (ES) statistics can be interpreted without confidence intervals (CIs).

The main use of CIs is to help guide the process of generalisation based on a sample.  In this sense, CIs serve the same function as classic inferential statistical tests such as t-tests, but are more informative because they indicate the precision of the estimated effect size.

A confidence interval represents the likely range of the true population mean effect size, based on the data obtained from the sample.

There are two basic ways a confidence interval can be used or interpreted:

a. Does the CI include 0?

ES = .6 (95% CI: .2 to 1.0) indicates that, based on the sample data, the true ES in the population from which the sample was drawn is estimated, with 95% confidence, to lie in the range .2 to 1.0.  Since this range does not include zero, we can be reasonably confident that a positive change occurred - or, equivalently, that a statistically significant change occurred.
ES = .2 (95% CI: -.3 to .5) indicates that, based on the sample data, the true ES in the population is estimated, with 95% confidence, to lie in the range -.3 to .5.  Since this range includes zero, if we are seeking to generalise from the sample to the broader population, we should acknowledge that the true ES could be 0 and that the observed ES of .2 could have been obtained by chance.

b. Does the CI include some other benchmark?  Using the logic described in a., it is possible to make comparisons with benchmarks - for example, to ask whether the population effect size is likely to differ from .2, the typical ES reported for outdoor education programs with school students in the Hattie et al. (1997) meta-analysis.
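
The following is a minimal sketch (illustrative only; the sample sizes are hypothetical and the method assumes two independent groups, with a common large-sample approximation to the standard error of d) of how a confidence interval can be placed around an effect size and checked against 0 or a benchmark:

    # Approximate confidence interval around Cohen's d for two independent
    # groups, using SE(d) ~ sqrt((n1 + n2)/(n1*n2) + d^2 / (2*(n1 + n2))).
    import math
    from scipy.stats import norm

    def d_confidence_interval(d, n1, n2, confidence=0.95):
        se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
        z = norm.ppf(1 - (1 - confidence) / 2)   # e.g. 1.96 for 95%
        return d - z * se, d + z * se

    lower, upper = d_confidence_interval(d=0.6, n1=25, n2=25)
    print(f"ES = .6, 95% CI: {lower:.2f} to {upper:.2f}")  # ~0.03 to ~1.17
    # a. The interval excludes 0, so a positive effect can be claimed.
    # b. The interval includes the benchmark of .2, so the effect cannot be
    #    claimed to differ from that benchmark.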

It should also be noted that the desired level of confidence is chosen by the researcher.  Although 95% is commonly chosen as a default, this is likely to lead to Type II errors in program evaluation, which is often conducted with small sample sizes.  Less stringent confidence intervals, such as 90% or even 80%, may be more appropriate.  Confidence intervals narrow as the chosen confidence level decreases.

Confidence intervals also narrow as the sample size increases (i.e., we become more certain about the estimated range as we have more data on which to base our estimates).
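
A minimal numerical sketch (illustrative only, using the same approximate standard error of d as above and hypothetical sample sizes) of these two points - the interval narrows as the confidence level decreases and as the sample size increases:

    # Width of the CI around d = .5 for different confidence levels and
    # different (equal) group sizes.
    import math
    from scipy.stats import norm

    def ci_width(d, n_per_group, confidence):
        se = math.sqrt(2 / n_per_group + d ** 2 / (4 * n_per_group))
        return 2 * norm.ppf(1 - (1 - confidence) / 2) * se

    for confidence in (0.95, 0.90, 0.80):
        print(f"{confidence:.0%} CI width (n = 25 per group): {ci_width(0.5, 25, confidence):.2f}")
    for n in (25, 100, 400):
        print(f"95% CI width (n = {n} per group): {ci_width(0.5, n, 0.95):.2f}")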

For more information about confidence intervals for effect sizes and their interpretation, see NIST/SEMATECH e-Handbook of Statistical Methods and Davies (2001).

Links

  • Psychologist questions holy cow [.pdf] - Article from Monitor, newspaper of the University of Canberra, 15 May 2001.

  • What Works Clearinghouse - Scientific evidence of what works in education; a US Department of Education resource website.

  • ZCalc - an Excel spreadsheet for helping to explain a standardised mean effect size (ES) in various ways.

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Davies, H. T. O. (2001). What are confidence intervals? Hayward Medical Communications. (Easy to read summary of the meaning of confidence intervals for effect sizes.)

McCartney, K. & Dearing, E. (2002). Evaluating effect sizes in the policy arena. The Evaluation Exchange, 8(1). (Explains the interpretation of effect sizes in the context of family support intervention programs.)

Neill, J. T. (2002, January). Meta-analytic research on the outcomes of outdoor education. Paper presented to the 6th Biennial Coalition for Education in the Outdoors Research Symposium, Bradford Woods, IN. (Provides an overview of meta-analytic research of outdoor education (which reports effect sizes) and suggests applications to program evaluation (such as benchmarking).)

NIST/SEMATECH e-Handbook of Statistical Methods. What are confidence intervals? (A basic explanation and guide to confidence intervals.)

Thompson, B. (2000). A suggested revision to the forthcoming 5th edition of the APA publication manual: Effect size section. (Solid, clear advice and recommendations regarding the need for social sciences to shift towards effect size reporting.)

Warmbrod, J. R. (2001). Conducting, interpreting, and reporting quantitative research [Excerpted from workshop notes]. Research Pre-Session, National Agricultural Education Research Conference, December 11, New Orleans, LA. (Handy notes with rationale, key formulas, examples, and rules of thumb for different types of effect size.)

Valentine, J. C. & Cooper, H. (2003). Effect size substantive interpretation guidelines: Issues in the interpretation of effect sizes. Washington, DC: What Works Clearinghouse.

Wikipedia (2008). Effect size. (Statistical formulae oriented article covering the major types of effect size.)

Wikiversity (2008). Effect size. (Educationally-oriented, tutorial-type notes.)

Footnotes

Generalisation forms the major distinction between "program evaluation" (no generalisation) and "research" (generalisation).  Evaluation tends to have a specific, internal focus, whereas research tends to focus on theory and on how the results of the sample apply to a broader population.