Why use Effect Sizes instead of Significance Testing in Program Evaluation?
11 Sep 2008
This article discusses the overuse of significance testing and the underuse of effect sizes for reporting on the effects of intervention programs.
Effect sizes are now a standard and expected part of much statistical reporting and should be more commonly reported and understood (Thompson, 2000).
There are two main issues with relying on significance testing for evaluating the effects of interventions:
Significance tests should only be used in attempts to generalise the sample's results to a population. For example, based on a sample of 10% of our clients, we want to use the sample data to make generalised conclusions for all of our clients. In this case, the results of the sample are of little intrinsic interest. Of primary interest is what the sample data might represent about the target population.
Program evaluations often have access to data from the entire population of interest, e.g., all participants in a specific program. In such situations, the full population data set is available and there is no need for inferential testing because there is no need to generalise beyond the sample. In these situations, descriptive statistics and effect sizes are all that is needed.
Program evaluation studies with less than approximately 50 participants tend to lack sufficient statistical power (a function of sample size, effect size, and p level) for detecting small, medium or possibly even large effects. In such situations, the results of significance tests can be misleading because of being subject to Type II errors (incorrectly failing to reject the null hypothesis). In these situations, it can be more informative to use the effect sizes, possibly with confidence intervals.
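As a rough illustration of the power problem, the approximate power of a two-sided one-sample or paired test can be sketched with the normal approximation (the function name `approx_power` and the sample sizes below are illustrative assumptions, not part of the original article):

```python
# Approximate power for a two-sided one-sample (or paired) test, using the
# normal approximation: power ~ Phi(d * sqrt(n) - z_crit).
# A sketch only -- the n and d values below are illustrative assumptions.
from scipy.stats import norm

def approx_power(d, n, alpha=0.05):
    """Approximate power to detect effect size d with n participants."""
    z_crit = norm.ppf(1 - alpha / 2)        # critical z for a two-sided test
    return norm.cdf(d * (n ** 0.5) - z_crit)

for n in (20, 50, 400):
    for d in (0.2, 0.5, 0.8):
        print(f"n={n:3d}, d={d}: power = {approx_power(d, n):.2f}")
```

With around 20 participants, even a medium effect (d = .5) has power of only about .6, which is consistent with the point above that small studies risk Type II errors.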
For studies involving large sample sizes (e.g., > ~400), a different problem occurs with significance testing: even small effects are likely to become statistically significant, although these effects may be trivial. In these situations, more attention should be paid to effect sizes than to statistical significance.
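The large-sample point can be illustrated with a quick calculation (a sketch using the normal approximation; the function name `approx_p` and the values are illustrative assumptions):

```python
# Illustration (normal approximation): with a large enough sample, even a
# trivial effect size yields p < .05. Values are illustrative assumptions.
from scipy.stats import norm

def approx_p(d, n):
    """Two-sided p-value for effect size d with n participants (z approximation)."""
    z = d * n ** 0.5
    return 2 * (1 - norm.cdf(abs(z)))

print(approx_p(0.1, 50))   # tiny effect, small sample: not significant
print(approx_p(0.1, 400))  # same tiny effect, large sample: significant
```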
To sum up:
One of the most common questions driving evaluation of intervention programs is "did we get an effect?". However, very often, the question(s) could be more usefully phrased as "how much effect did this program have?" and perhaps also "how does this effect compare with the effects of other interventions?".
To answer the "did we get an effect?" question it is necessary to compare the observed result against 0. If the results are to be generalised, the comparison against 0 or a control group can be done inferentially, using statistics such as paired samples t-tests, repeated measures ANOVAs, etc. Alternatively, confidence intervals can be used. If the results are not to be generalised (i.e., where the population data is available), then the question can be answered simply by examining the effect sizes.
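As a sketch of the two approaches, the same pre/post scores can be analysed both inferentially (a paired samples t-test) and descriptively (an effect size). The scores below are invented for illustration, and the paired-d formula used (mean difference divided by the standard deviation of the differences) is one common convention among several:

```python
# Comparing an inferential answer (paired samples t-test) with a descriptive
# one (effect size) on the same invented pre/post data.
import numpy as np
from scipy.stats import ttest_rel

pre = np.array([10, 12, 9, 14, 11, 13, 10, 12])
post = np.array([12, 14, 10, 15, 13, 16, 11, 14])

t, p = ttest_rel(post, pre)          # inferential: generalising to a population
diff = post - pre
d = diff.mean() / diff.std(ddof=1)   # descriptive: paired Cohen's d convention

print(f"t = {t:.2f}, p = {p:.4f}, d = {d:.2f}")
```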
To answer the "how much effect did the program have?" question, effect sizes can be used.
To answer the "how does the effect compare with the effects of other interventions?" question, effect sizes can be used because they are standardised and allow ready comparison with benchmarks such as effect sizes for other types of interventions obtained from meta-analyses.
Use of effect sizes can also be combined with other data, such as cost, to provide a measure of cost-effectiveness. In other words, "how much bang (effect size) for the buck (cost)?". This is a question that government or philanthropic funders commonly wish to ask, and increasingly there is a demand for such "proven evidence" of outcomes and cost-effectiveness in psycho-social intervention programs.
Some advantages of effect size reporting are that:
The main disadvantages of using effect sizes include that:
If confused about what types of statistics to report, you can report effect sizes, confidence intervals and significance test results.
There are several types of effect size, based on either difference scores or correlations. For more information, see Valentine and Cooper (2003), Wikipedia, and Wikiversity. Effect sizes can be converted into other types of effect sizes, thus the choice of effect size is somewhat arbitrary. For program evaluation, however, standardised mean effect sizes are more commonly used.
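For example, a standardised mean difference can be converted to a correlation and back using a standard pair of formulas (the versions below assume two equal-sized groups; real conversions have variants depending on the design, so this is a sketch only):

```python
# Converting between effect-size families: Cohen's d <-> correlation r.
# These formulas assume two equal-sized groups; variants exist for other designs.
import math

def d_to_r(d):
    """Cohen's d -> correlation r (equal group sizes assumed)."""
    return d / math.sqrt(d ** 2 + 4)

def r_to_d(r):
    """Correlation r -> Cohen's d."""
    return 2 * r / math.sqrt(1 - r ** 2)

print(d_to_r(0.8))   # a 'large' d of .8 corresponds to r of about .37
print(r_to_d(0.3))   # a 'medium' r of .3 corresponds to d of about .63
```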
Standardised mean effect sizes (such as Cohen's d and Hedge's g) are basically z scores. These effect sizes indicate the mean difference between two variables expressed in standard deviation units. A score of 0 represents no change and effect size scores can be negative or positive. The meaning of an effect size depends on the measurement context, so rules of thumb should be treated cautiously. A well-known guide is offered by Cohen (1988): .2 indicates a small effect, .5 a medium effect, and .8 a large effect.
Percentile scores, based on the properties of the normal distribution, can be used to aid interpretation of standardised mean effect sizes. Percentiles can be used to indicate, for example, where someone who started on the 50th percentile could, on average, expect to end up after the intervention (compared to people who didn't experience the intervention). For example, imagine a child of average reading ability who participates in an intensive reading program. If the program had a small effect (ES = .2), this would raise the child's reading performance to the 58th percentile, a moderate effect (ES = .5) would raise the child to the 69th percentile, and a large effect (ES = .8) would raise the child to the 79th percentile. These calculations can be easily done using a free web calculator, such as at HyperStat Online.
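The percentile figures above can be reproduced directly from the cumulative normal distribution (a minimal sketch using scipy in place of the web calculator):

```python
# A person starting at the 50th percentile is expected, on average, to move
# to the norm.cdf(ES) percentile after an intervention of effect size ES.
from scipy.stats import norm

for es in (0.2, 0.5, 0.8):
    print(f"ES = {es}: {norm.cdf(es) * 100:.0f}th percentile")
```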
Correlations (r) are effect sizes. Correlational and standardised mean effect sizes cannot be directly compared, but they can be converted. In the context of an intervention program, a correlational effect size would indicate the standardised covariance between Time (independent variable) and Outcome (dependent variable). As with standardised mean effect sizes, interpretation of the size of correlational effects needs to occur within the context of the study, but a general guide (Cohen, 1988) is that: .1 indicates a small effect, .3 a medium effect, and .5 a large effect.
If there is no desire to generalise the findings to a population, then the ES statistics can be interpreted without CIs.
The main typical use of CIs is to help guide the process of generalisation based on a sample. In this sense, CIs serve the same function as classic inferential statistical tests such as t-tests, but are more informative because they indicate the spread in the data around the mean effect size.
A confidence interval represents the likely range of the true population mean effect size, based on the data obtained from the sample.
There are two basic ways a confidence interval can be used or interpreted:
a. Does the CI include 0?
ES = .6 (CI: .2 to 1.0) indicates that, based on the sample of data, it is estimated that the true ES in the population from which the sample is taken is 95% certain to be in the range of .2 to 1.0, i.e., it doesn't include zero, therefore we can be reasonably certain that a positive change was reported - or we say that a statistically significant change occurred.
ES = .2 (CI: -.3 to .5) indicates that, based on the sample of data, it is estimated that the true ES in the population from which the sample is taken is 95% certain to be in the range of -.3 to .5, i.e., it could be zero, therefore if we are seeking to generalise from the sample to the broader population then we should acknowledge the true ES could be 0 and that the observed ES of .2 could have been obtained by chance.
b. Does the CI include some other benchmark? Using the logic described in a., it is possible to test whether the population effect size is likely to differ from a benchmark such as .2, the typical ES reported for outdoor education programs with school students in the Hattie et al. (1997) meta-analysis.
It should also be noted that the desired range of likelihood is chosen by the researcher. Although 95% is commonly chosen as a default, this is likely to lead to Type II errors in program evaluation, which is often conducted with small sample sizes. Less stringent confidence intervals, such as 90% or even 80%, may be more appropriate. Confidence intervals will narrow as the chosen range of likelihood decreases.
Confidence intervals also narrow as the sample size increases (i.e., we become more certain about the estimated range as we have more data on which to base our estimates).
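Both effects can be illustrated with an approximate confidence interval for a standardised mean effect size, using a common large-sample approximation to the standard error, SE = sqrt(1/n + d²/2n) for the one-sample/paired case. The function name `es_ci` and the values of d and n below are illustrative assumptions:

```python
# Approximate CI for a standardised mean effect size (one-sample/paired case),
# using the large-sample standard error approximation SE = sqrt(1/n + d^2/(2n)).
# The d and n values are illustrative assumptions.
import math
from scipy.stats import norm

def es_ci(d, n, confidence=0.95):
    """Approximate confidence interval for effect size d from a sample of n."""
    se = math.sqrt(1 / n + d ** 2 / (2 * n))
    z = norm.ppf(1 - (1 - confidence) / 2)
    return d - z * se, d + z * se

for conf in (0.95, 0.90):
    lo, hi = es_ci(0.6, 30, conf)
    print(f"{conf:.0%} CI for d = .6, n = 30: ({lo:.2f}, {hi:.2f})")

lo, hi = es_ci(0.6, 120)
print(f"95% CI for d = .6, n = 120: ({lo:.2f}, {hi:.2f})")  # narrower with more data
```

With d = .6 and n = 30 the 95% interval is roughly (.2, 1.0), matching the first example above; the 90% interval and the larger-sample interval are both visibly narrower.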
Davies, H. T. O. (2001). What are confidence intervals? Hayward Medical Communications. (Easy to read summary of the meaning of confidence intervals for effect sizes.)
McCartney, K. & Dearing, E. (2002). Evaluating effect sizes in the policy arena. The Evaluation Exchange, 8(1). (Explains the interpretation of effect sizes in the context of family support intervention programs.)
Neill, J. T. (2002, January). Meta-analytic research on the outcomes of outdoor education. Paper presented to the 6th Biennial Coalition for Education in the Outdoors Research Symposium, Bradford Woods, IN. (Provides an overview of meta-analytic research of outdoor education (which reports effect sizes) and suggests applications to program evaluation (such as benchmarking).)
NIST/SEMATECH e-Handbook of Statistical Methods. What are confidence intervals? (A basic explanation and guide to confidence intervals.)
Thompson, B. (2000). A suggested revision to the forthcoming 5th edition of the APA publication manual: Effect size section. (Solid, clear advice and recommendations regarding the need for social sciences to shift towards effect size reporting.)
Warmbrod, J. R. (2001). Conducting, interpreting, and reporting quantitative research [Excerpted from workshop notes], Research Pre-Session, National Agricultural Education Research Conference, December 11, New Orleans, LA. (Handy notes with rationale, key formulas, examples, and rules of thumb for different types of effect size.)
Valentine, J. C. & Cooper, H. (2003). Effect size substantive interpretation guidelines: Issues in the interpretation of effect sizes. Washington, DC: What Works Clearinghouse.
Wikipedia (2008). Effect size. (Statistical formulae oriented article covering the major types of effect size.)
Wikiversity (2008). Effect size. (Educationally-oriented, tutorial-type notes.)
Generalisation forms the major distinction between "program evaluation" (no generalisation) and "research" (generalisation). Evaluation tends to have a specific, internal focus, whereas research tends to have a focus on theory and how the results of the sample apply to a broader population.