L30 — Interpreting Bivariate Data
Key Terms
- Coefficient of determination r²
- The proportion of variation in y explained by x through the linear model; always between 0 and 1 regardless of the sign of r.
- Interpolation
- Predicting y within the observed data range — generally reliable.
- Extrapolation
- Predicting y outside the data range — unreliable; the trend may not continue beyond the observed data.
- Two-way table
- A table for categorical bivariate data; use row or column percentages to compare groups and identify associations.
- Lurking variable
- A hidden third variable that can explain an observed correlation between two variables without one causing the other.
- Residual
- Actual y minus predicted ŷ — indicates how far a data point lies from the regression line.
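As a minimal sketch of the residual definition above, the Python snippet below computes actual minus predicted. The line ŷ = 5 + 0.4x is borrowed from Worked Example 1; the observed point (20 mm of rain, 15 umbrellas sold) and the function names are made up for illustration only.

```python
def predict(x, a, b):
    """Predicted value from the line y-hat = a + b*x."""
    return a + b * x

def residual(actual_y, x, a, b):
    """Residual = actual y minus predicted y-hat."""
    return actual_y - predict(x, a, b)

# Hypothetical observation: 20 mm of rain, 15 umbrellas sold.
res = residual(15, 20, a=5, b=0.4)
print(round(res, 2))  # positive residual: the point lies above the line
```

A positive residual means the actual value is above the regression line; a negative residual means it is below.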
Complete interpretation framework
When interpreting bivariate data, address all of the following:
- Context: What are the variables? What do they measure?
- Direction: Positive or negative association?
- Form: Linear or non-linear?
- Strength: Strong, moderate, or weak? Describe r.
- Outliers: Any unusual points? What effect do they have?
- Causation: Correlation ≠ causation. What might cause the association?
- Prediction: Use ŷ = a + bx; distinguish interpolation from extrapolation.
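The prediction step of the checklist can be sketched in Python as follows. The function name, the line ŷ = 2 + 3x and the data range 0 to 10 are all hypothetical values chosen for illustration; the only idea taken from the checklist is that a prediction is interpolation when x lies inside the observed range and extrapolation otherwise.

```python
def predict_with_check(x, a, b, x_min, x_max):
    """Return (y_hat, kind), where kind is 'interpolation' or 'extrapolation'."""
    y_hat = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind

# Hypothetical line y-hat = 2 + 3x fitted to data with x between 0 and 10.
inside = predict_with_check(4, a=2, b=3, x_min=0, x_max=10)
outside = predict_with_check(25, a=2, b=3, x_min=0, x_max=10)
print(inside)   # (14, 'interpolation')
print(outside)  # (77, 'extrapolation')
```

The arithmetic is the same in both cases; only the reliability of the result differs.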
Coefficient of determination r²
r² tells us the proportion of variation in y that is explained by x through the linear model.
| r | r² | Interpretation |
|---|---|---|
| 0.9 | 0.81 | 81% of variation in y is explained by x |
| 0.7 | 0.49 | 49% of variation in y is explained by x |
| 0.5 | 0.25 | 25% of variation in y is explained by x |
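The table values can be checked with a short Python sketch. The function name is an assumption made for this example, and the extra value r = −0.9 is added only to show that the sign of r does not change r².

```python
def r_squared_percent(r):
    """Return r squared as a percentage, rounded to one decimal place."""
    return round(r ** 2 * 100, 1)

for r in (0.9, 0.7, 0.5, -0.9):
    print(r, r_squared_percent(r))
```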
Effect of an outlier
An outlier can significantly change r and the slope of the line of best fit, especially in small datasets. Always report analyses with and without the outlier if appropriate.
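To illustrate, the sketch below computes r and the least-squares slope with and without a single outlier. The six data points are made up for this example and the helper name `fit_stats` is hypothetical; the formulas are the standard ones for Pearson's r and the slope of the regression of y on x.

```python
import math

def fit_stats(xs, ys):
    """Return (r, slope) for the least-squares line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy), sxy / sxx

# Made-up data: five points near a line plus one outlier at (6, 2).
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 5, 8, 10, 2]

r_with, slope_with = fit_stats(xs, ys)
r_without, slope_without = fit_stats(xs[:-1], ys[:-1])

print(f"with outlier:    r = {r_with:.2f}, slope = {slope_with:.2f}")
print(f"without outlier: r = {r_without:.2f}, slope = {slope_without:.2f}")
```

With so few points, the single outlier pulls both r and the slope down noticeably, which is why reporting both versions of the analysis is useful.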
Two-way tables (categorical bivariate data)
When both variables are categorical, a two-way (contingency) table is used instead of a scatter plot. We calculate row percentages or column percentages to compare groups.
Worked Example 1 — Full interpretation of bivariate data
A study records daily rainfall (mm) and umbrella sales for a shop over 12 days. r = 0.88, ŷ = 5 + 0.4x.
Direction: Positive (more rain ⇒ more umbrella sales).
Form: Linear (a high r is consistent with a linear model; confirm this with the scatter plot).
Strength: Strong (r = 0.88 > 0.7).
r² = 0.77: 77% of variation in umbrella sales is explained by rainfall.
Prediction: On a day with 30 mm rain: ŷ = 5 + 0.4(30) = 17 umbrellas.
Causation: Rain likely causes increased umbrella sales (this causal link is physically plausible).
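The arithmetic in this worked example can be reproduced with a few lines of Python. This is only a sketch: the variable names are chosen for illustration, and the values of r, the intercept and the slope are taken directly from the example.

```python
r = 0.88
a, b = 5, 0.4          # intercept and slope of y-hat = 5 + 0.4x
rain_mm = 30

direction = "positive" if r > 0 else "negative"
r_squared = r ** 2
predicted_sales = a + b * rain_mm

print(direction)                   # positive
print(round(r_squared, 2))         # 0.77
print(round(predicted_sales, 1))   # 17.0
```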
Worked Example 2 — r² interpretation
A regression of household income (y) on years of education (x) gives r = 0.72. Interpret r².
r² = 0.72² = 0.518 ≈ 52%.
About 52% of the variation in household income is explained by years of education. The remaining 48% is due to other factors (occupation, location, experience, etc.).
Worked Example 3 — Two-way table interpretation
200 students surveyed: preferred subject (Maths/English) by gender.
| | Maths | English | Total |
|---|---|---|---|
| Female | 45 | 75 | 120 |
| Male | 60 | 20 | 80 |
| Total | 105 | 95 | 200 |
Female preferring Maths: 45/120 = 37.5%.
Male preferring Maths: 60/80 = 75%.
Conclusion: Males in this sample were more likely to prefer Maths (75% vs 37.5%), suggesting an association between gender and subject preference.
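The row percentages above can be computed with a short Python sketch. The counts come from the table in this example; the dictionary layout and the function name `row_percentages` are assumptions made for illustration.

```python
# Counts from Worked Example 3: rows are gender, columns are preferred subject.
table = {
    "Female": {"Maths": 45, "English": 75},
    "Male": {"Maths": 60, "English": 20},
}

def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for group, counts in table.items():
        total = sum(counts.values())
        result[group] = {k: 100 * v / total for k, v in counts.items()}
    return result

pcts = row_percentages(table)
print(pcts["Female"]["Maths"])  # 37.5
print(pcts["Male"]["Maths"])    # 75.0
```

Row percentages are used here because the groups being compared (female and male students) are the rows; if the groups were the columns, column percentages would be the natural choice.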
Worked Example 4 — Outlier impact
A dataset of 8 points gives r = 0.92 and ŷ = 10 + 3x. When one outlier is removed, r rises to 0.97. What can we say?
The outlier was weakening the linear association: without it, the remaining 7 points show a stronger relationship (r = 0.97). The outlier may be a data entry error or a genuinely unusual case, so it is worth investigating before deciding whether to exclude it.
Worked Example 5 — Limitations of bivariate analysis
List three limitations when interpreting a scatter plot and regression line.
- Causation cannot be inferred from correlation alone.
- Extrapolation is unreliable — the trend may not hold beyond the data range.
- Outliers can distort r and the regression line, especially in small samples.
- (Bonus) The linear model may not be the best fit if the true relationship is non-linear.
Interpreting r². Fluency
- (a) r = 0.8. Find r² and interpret it.
- (b) r = −0.6. Find r². Does the negative sign affect r²?
- (c) r² = 0.49. What is r (positive or negative)? What does this mean?
- (d) A model has r² = 0.25. What percentage of variation in y is not explained by x?
Two-way table. Fluency
150 teenagers surveyed: own a pet (Yes/No) by whether they exercise daily (Yes/No).
| | Exercise Yes | Exercise No | Total |
|---|---|---|---|
| Pet Yes | 48 | 32 | 80 |
| Pet No | 22 | 48 | 70 |
| Total | 70 | 80 | 150 |
- (a) What percentage of pet owners exercise daily?
- (b) What percentage of non-pet-owners exercise daily?
- (c) Does the table suggest an association between pet ownership and daily exercise?
- (d) Can we conclude that owning a pet causes people to exercise more? Why or why not?
Full interpretation checklist. Fluency
A scatter plot of age (x, years) vs blood pressure (y, mmHg) shows r = 0.74 and ŷ = 95 + 0.5x, for ages 30–70.
- (a) Describe the direction and strength of the association.
- (b) Calculate r² and interpret it in context.
- (c) Predict the blood pressure of a 50-year-old. Is this interpolation or extrapolation?
- (d) Can we conclude that ageing causes blood pressure to rise?
Effect of removing an outlier. Fluency
- (a) A dataset of 6 points has r = 0.55. When a clear outlier is removed, r changes to 0.91. What does this suggest about the outlier?
- (b) In a scatter plot of height vs weight, one individual is 190 cm and 50 kg (very underweight). Is this likely an outlier?
- (c) Should outliers always be removed from analysis? Explain.
- (d) A dataset has an outlier at (100, 5) while all other points cluster in the range x=1–10. How does this outlier likely affect the slope of the regression line?
Interpreting a scatter plot with outlier. Understanding
The scatter plot below shows hours of sleep (x) vs reaction time in milliseconds (y) for 10 people. One point is marked as an outlier.
- (a) Describe the direction and form of the association for the 10 main points (excluding outlier).
- (b) r = −0.97 (without outlier). Interpret r and r² in context.
- (c) Suggest a reason for the outlier (6 hours sleep, 460 ms reaction time).
- (d) How would including the outlier change r? Would it increase or decrease |r|?
Comparing two datasets. Understanding
Two classes both do a study of study time (x) vs mark (y). Class A has r = 0.92 and Class B has r = 0.55.
- (a) Which class shows the stronger link between study time and marks?
- (b) In Class A, r² = 0.846. Interpret this.
- (c) In Class B, the line of best fit is ŷ = 40 + 6x. Predict the mark for 5 hours study. How reliable is this prediction?
- (d) What might explain why Class B has a weaker correlation even though both classes did the same subject?
Categorical bivariate data. Understanding
300 adults surveyed: smoke (Yes/No) vs develop lung disease (Yes/No) over 10 years.
| | Lung Disease Yes | Lung Disease No | Total |
|---|---|---|---|
| Smoke Yes | 90 | 60 | 150 |
| Smoke No | 30 | 120 | 150 |
| Total | 120 | 180 | 300 |
- (a) What percentage of smokers developed lung disease?
- (b) What percentage of non-smokers developed lung disease?
- (c) Does the data suggest an association between smoking and lung disease?
- (d) Can we conclude causation? What kind of study would provide stronger evidence?
Critique a statistical claim. Understanding
- (a) “Our study shows r = 0.78 between coffee consumption and productivity. Drink more coffee to be more productive!” Evaluate this claim.
- (b) A scatter plot shows r = −0.4. A journalist writes: “There is no real relationship.” Is this accurate?
- (c) A regression line has r² = 0.95. A student says: “The model perfectly predicts y.” Why is this wrong?
- (d) A study finds that people with more books in their home have higher IQs. A school proposes buying all students 20 books. Critique this policy.
Real estate analysis. Problem Solving
A real estate agent plots house price ($000s) vs distance from city centre (km) for 12 houses. r = −0.81, ŷ = 850 − 18x, for distances 2–30 km.
- (a) Interpret r and r² in context.
- (b) Predict the price of a house 15 km from the centre. Is this interpolation?
- (c) Predict the price of a house 50 km from the centre. Comment on reliability.
- (d) A house 10 km from the centre sells for $790 000. Calculate the residual and interpret it.
Designing a bivariate study. Problem Solving
A nutritionist wants to investigate the relationship between sugar intake (grams/day) and risk of type 2 diabetes.
- (a) Identify the explanatory and response variables.
- (b) Describe how you would collect the data. What would you measure and how many participants would be ideal?
- (c) If r = 0.65 is found, write a correct interpretation (including r²) and a statement about causation.
- (d) What confounding variables might affect the results? Name at least three.