
Tutorial 3:
Correlations

Last updated:
19 Mar 2005


Assumed knowledge

  • Francis Section 3.1 "Relations Between Metric Variables"

  • Francis Section 3.2 "Relations Between Categorical Variables"

  • Francis Section 4.3 "Recoding Variables"

Data files

General advice

The general recommended strategy for tackling these correlation analyses is:

  1. Determine the level of measurement for each variable in the analysis
  2. Obtain univariate descriptive statistics and graphical displays for each variable, to:
    • check for mis-entered data
    • check the frequency / central tendency / distribution
  3. Recode as necessary
  4. Create a bivariate visual display (e.g., clustered bar graph, scatterplot)
  5. Create tables (e.g., crosstabs with separate tables for row and column %s) and relevant correlational statistics
  6. Interpret/conclude
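
Step 5 asks for separate row and column percentages. As a minimal sketch of what those two tables contain, the following uses made-up counts (the `counts` table and its labels are hypothetical, not taken from any of the tutorial data files):

```python
# Hypothetical 2x2 counts: rows = gender, columns = snores (no / yes).
counts = {
    "female": {"no": 40, "yes": 10},
    "male": {"no": 30, "yes": 20},
}

def row_percentages(table):
    """Express each cell as a percentage of its row total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result

def column_percentages(table):
    """Express each cell as a percentage of its column total."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }

row_pct = row_percentages(counts)
col_pct = column_percentages(counts)
print(row_pct)  # e.g. what % of females snore
print(col_pct)  # e.g. what % of snorers are female
```

Row percentages answer "within each row group, how are the columns distributed?", while column percentages answer the reverse question, which is why it helps to look at both.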

Phi (φ) & Cramer's V

  • qfsall.sav
  • Phi and Cramer's V are used for analyzing the relationship between two nominal/categorical variables
  • Phi is used when you have 2x2, 2x3 or 3x2 tables
  • Cramer’s V is used when tables of 3x3 or larger are analysed
  • These are non-parametric tests which make few assumptions about distribution. However, you should make sure that the expected frequency in each cell is at least 5.  You can get this via descriptives - crosstabs - cells - expected.  If the expected frequencies are too low, you should recode the data into fewer categories.
  • Note that the sign (+ or -) of Phi doesn't mean much because there is no meaningful order to the way the variables are coded.
  1. Is there an association between Gender and Belief in God? (recode to remove mis-entered data)
    (φ is small (.024, p = .94) and not significant; there is no evidence of a relationship; use crosstabs and bar graph - clustered)

  2. Is there an association between snoring and smoking? (recode smoking from continuous to dichotomous)
    (φ is ~.24 and significant, p = .001; smokers are almost twice as likely to snore as non-smokers, but be careful in interpretation - this could be due to non-causal factors (e.g., age?); use crosstabs and bar graph - clustered)

  3. Is there an association between favourite season and favourite sense? (recode to remove mis-entered data)
    (Cramer's V is ~.23 and significant, p = .005; in other words there is a different profile of favourite senses, depending on favourite season, e.g., almost 50% of Summer and Spring people are Visual people.  Winter people, in contrast, tend to prefer Taste and Smell; use crosstabs and stacked area graph)

  4. Is there an association between type of household (urban/rural) and whether or not the household has chickens (Yes/No)? [chickens.sav].  The file contains hypothetical data for two categorical variables.  Resid indicates whether households are in urban or rural areas.  Chickens indicates whether or not the household owns chickens.
    (the answer to this is potential quiz question material - no clues!)
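
To make the statistics in this section concrete, here is a rough sketch of how expected frequencies, phi and Cramer's V can be computed from a table of counts. It assumes the usual chi-square based definitions (phi = sqrt(χ²/N), V = sqrt(χ²/(N × (min(rows, columns) - 1)))) with no continuity correction, so phi is returned unsigned. The two tables are invented for illustration and are not the qfsall.sav or chickens.sav data.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Pearson chi-square statistic (no continuity correction)."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

def phi(table):
    """Unsigned phi coefficient: sqrt(chi-square / N)."""
    n = sum(sum(row) for row in table)
    return math.sqrt(chi_square(table) / n)

def cramers_v(table):
    """Cramer's V: sqrt(chi-square / (N * (min(rows, cols) - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square(table) / (n * k))

# Hypothetical counts, for illustration only.
two_by_two = [[30, 20], [20, 30]]
three_by_three = [[20, 10, 10], [10, 20, 10], [10, 10, 20]]

smallest_expected = min(min(row) for row in expected_counts(two_by_two))
print(smallest_expected)          # check this is at least 5
print(phi(two_by_two))
print(cramers_v(three_by_three))
```

When either dimension of the table is 2, min(rows, columns) - 1 equals 1, so Cramer's V and phi give the same value.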

Point Bi-serial Correlation

  • Point bi-serial correlation is for analyzing the relationship between a dichotomous and a continuous variable

  • Point bi-serial correlation is computed as for the product-moment correlation, but interpretation must be appropriate to the direction of coding for the dichotomous scale.

  • Testing the significance of a point bi-serial correlation is equivalent to doing an independent-samples t-test of the difference between the two group means (e.g., between males' and females' ratings of their Australianness).

  1. What is the relationship between Gender (dichotomous) and Australianness (assume continuous)? [qfsall.sav]
    (no relationship (technically it is slightly negative, i.e., males in the sample perceive themselves as very slightly more Australian), i.e., the (point bi-serial) correlation is very small and non-significant; use correlation - bivariate - pearson and scatterplot - chart options - sunflowers and line of best fit)

  2. What is the relationship between Belief in God (recode to dichotomous) and number of Countries visited?
    (important to check the scatterplot on this one - there are outliers which look like they are influencing the small, non-significant correlation; use correlation - bivariate - pearson and scatterplot - chart options - sunflowers and line of best fit)
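
The link between the point bi-serial correlation and the t-test can be checked numerically. The sketch below computes an ordinary Pearson r on a 0/1 coded group variable, converts it to t with t = r × sqrt((n - 2) / (1 - r²)), and compares that with a pooled-variance independent-samples t. The scores and group sizes are made up for illustration; they are not from qfsall.sav.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson_r(x, y):
    """Product-moment correlation from deviation sums."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_from_r(r, n):
    """t statistic (df = n - 2) for testing a correlation against zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def pooled_t(group0, group1):
    """Independent-samples t with pooled variance (mean of group1 minus group0)."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = mean(group0), mean(group1)
    ss0 = sum((v - m0) ** 2 for v in group0)
    ss1 = sum((v - m1) ** 2 for v in group1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# Hypothetical ratings for two groups coded 0 and 1.
group0 = [5, 6, 7, 6, 8]
group1 = [6, 7, 9, 8, 7]
codes = [0] * len(group0) + [1] * len(group1)
scores = group0 + group1

r_pb = pearson_r(codes, scores)
r_flipped = pearson_r([1 - c for c in codes], scores)  # reverse the coding

print(r_pb, r_flipped)
print(t_from_r(r_pb, len(scores)), pooled_t(group0, group1))
```

Reversing the 0/1 coding flips the sign of r but not its size, which is why the interpretation has to follow the direction of coding.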

Product-Moment Correlation

  • Pearson's correlation (the product-moment correlation) is for analyzing the linear relationship between two continuous variables, or near-continuous variables (e.g., interval data with more than 5 categories)

  1. What is the relationship between Australianness and Femininity/Masculinity? [qfsall.sav]

    • Draw a scatterplot for Australianness and Femininity/Masculinity

    • Add sunflowers and line of best fit

    • Compute correlation

    (the r here is .12, p = .100, which is larger than the point bi-serial correlation for Gender and Australianness, but is still very small and non-significant; use correlation - bivariate - pearson and scatterplot - chart options - sunflowers and line of best fit)
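
For readers who want to see what SPSS is computing here, the following is a small sketch of the product-moment correlation and the least-squares line of best fit, written from the standard deviation-sum formulas. The `x` and `y` values are hypothetical, not the Australianness or Femininity/Masculinity ratings.

```python
import math

def correlation_and_line(x, y):
    """Return (r, slope, intercept) for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx                      # change in y per unit change in x
    intercept = mean_y - slope * mean_x    # the line passes through the means
    return r, slope, intercept

# Hypothetical paired ratings.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r, slope, intercept = correlation_and_line(x, y)
print(r, r ** 2)          # correlation and proportion of shared variance
print(slope, intercept)   # line of best fit: y = intercept + slope * x
```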

Correlation Explore & Correlation Guess

  • These exercises help you to intuitively estimate a correlation based on a scatterplot
  1. Correlation Explore
    (explore 20 plots with .1 increments)
  2. Correlation Guess
    (guess 20 plots with .1 increments) - try to get 25 out of 50
  • Note: The following three exercises are desirable, but unfortunately they are java applets which will not currently run due to the UC proxy host firewall.  Try to access these from off-campus if you can - the problem has been reported, but there's no word on when it may be fixed.
  1. Guessing Correlations
    (4 plot exact match to correlations) - try to average over 75%
  2. Guess the Correlation
    (single plot, guess exact correlation) - try to get within .1
  3. Spearman's rank correlation

Exploring the Effect of Outliers

  • regressp.exe (Continue- “Explore the impact of an outlier”)

  • Drag the white point to explore how an outlier can inflate or deflate the correlation, hitting “Recalculate” to recompute the correlation. 

  1. Where would you put the white dot to maximise the correlation?
    (as far to the ends of the line of best fit as possible)

  2. Where would you put the white dot to minimise the correlation?
    (to shift the correlation towards zero, place the outlier as far as possible to the ends of a line which would run perpendicular to the line of best fit, crossing at the mean for X and the mean for Y)

  3. Where would you put the white dot to not change the correlation?
    (on the mean for X and the mean for Y)
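
The three answers above can be tried out without the applet. This sketch adds one extra point to a small made-up dataset in each of the three positions and recomputes r. The data and the distances chosen for the extra point are hypothetical. Note that in an unbounded plot a sufficiently extreme point on the perpendicular line can push r past zero and reverse its sign, so a moderate distance is used here.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation from deviation sums."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical base data with a moderate positive correlation.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x

def r_with_extra_point(px, py):
    """Correlation after adding a single extra point (the 'white dot')."""
    return pearson_r(x + [px], y + [py])

r_base = pearson_r(x, y)
# 1. Far out along the line of best fit: r increases.
r_along_line = r_with_extra_point(20, intercept + slope * 20)
# 2. Off to the side, on the perpendicular through the means: r shrinks.
r_perpendicular = r_with_extra_point(mean_x - slope * 3, mean_y + 3)
# 3. Exactly on the two means: r is unchanged.
r_at_means = r_with_extra_point(mean_x, mean_y)

print(r_base, r_along_line, r_perpendicular, r_at_means)
```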

Correlations and Non-linear Distributions

  • xy.sav
  • Draw scatterplots, compute the correlations (they are all r=.82) and explain the relationships between:
    1. X1 Y1
      (r is appropriate – linear relationship)
    2. X1 Y2
      (curvilinear – r not appropriate)
    3. X1 Y3
      (strong linear, with outlier, r=.82 is not appropriate)
    4. X2 Y4
      (restricted range, with outlier, r=.82 is not appropriate)
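
The pattern in this exercise (four pairings with the same r of about .82 but very different scatterplots) is the same as in Anscombe's well-known quartet. The sketch below uses Anscombe's published values as a stand-in; it is an assumption that xy.sav resembles them, and the variable names are chosen to mirror the exercise.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation from deviation sums."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Anscombe's quartet (used as a stand-in; xy.sav itself may differ).
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x2 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

pairs = {
    "X1 Y1": (x1, y1),  # roughly linear
    "X1 Y2": (x1, y2),  # curved
    "X1 Y3": (x1, y3),  # linear with one outlier
    "X2 Y4": (x2, y4),  # no spread in x apart from one outlier
}

correlations = {name: pearson_r(x, y) for name, (x, y) in pairs.items()}
for name, r in correlations.items():
    print(name, round(r, 2))
```

Identical correlations from such different shapes are the reason for drawing the scatterplot before interpreting r.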

Outliers and Restricted Range

  • aggr.sav 

  • This is a dataset collected by Bernd Heubeck (Division of Psychology, ANU) comparing a sample of 89 children, aged 8-14, from Western Sydney with a sample of 89 children from the same area who had been referred to a Child Psychiatric Clinic.  Separate aggressiveness ratings of the child were obtained independently from both parents.  Aggressiveness ratings can range from 0 (low) to 40 (high).

  1. To what extent do the mothers’ and fathers’ Aggressive Behaviour ratings agree with one another?

  • Select only the normal cases (Data – Select Cases – Clin=1 – Filtered – OK)

  • Draw a scatterplot of Mother’s Aggressive Ratings (maggr) and Father’s Aggressive Ratings (faggr)
    • Add sunflowers (Chart Editor – Chart – Options – Sunflowers)
  • What is the product-moment correlation? What % of variance does one variable explain in another variable? 
    (r=.56, r²=.31)
  • What happens if you remove the outliers?  Why?
    (To identify the cases which are outliers – Chart – Options – Case Labels – On.  Then you can delete these cases from the datafile).
    (r=.46, r²=.22; r drops because these outliers were inflating the correlation. Outliers can either inflate or deflate r, depending on where they lie)
  • Now include the rest of the sample (to regain the outliers, shut down SPSS and restart it with the original datafile).  Examine the scatterplot, and compute the r and r².  What has happened?  Why?
    (r=.68, r²=.46; the r has increased and over twice as much variance is now explained.  The reason is RANGE RESTRICTION in the normal sample, since there tend to be only low ratings.  By including the clinic sample, we now have high aggressiveness data and the range is no longer restricted)
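
Range restriction can also be demonstrated with simulated data. The sketch below generates hypothetical paired ratings (not the aggr.sav data), computes r for everyone, and then recomputes it for only the cases with low scores on the first rating. The sample size, the noise level, the cut-off of 15 and the random seed are arbitrary choices made for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Product-moment correlation from deviation sums."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(301)  # fixed seed so the sketch gives the same numbers each run

# Simulated ratings: the second rater agrees with the first, plus noise.
first = [random.uniform(0, 40) for _ in range(2000)]
second = [score + random.gauss(0, 8) for score in first]

# Full range of scores.
r_full = pearson_r(first, second)

# Restricted range: keep only cases with a low first rating.
low_cases = [(a, b) for a, b in zip(first, second) if a < 15]
r_restricted = pearson_r([a for a, _ in low_cases], [b for _, b in low_cases])

print(round(r_full, 2), round(r_full ** 2, 2))
print(round(r_restricted, 2), round(r_restricted ** 2, 2))
```

The underlying relationship between the two ratings is the same in both calculations; only the spread of scores differs, and the narrower spread gives the smaller correlation.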