L30 — Interpreting Bivariate Data

Key Terms

Coefficient of determination r²: The proportion of variation in y explained by x through the linear model; always between 0 and 1 regardless of the sign of r.
Interpolation: Predicting y within the observed data range — generally reliable.
Extrapolation: Predicting y outside the data range — unreliable; the trend may not continue beyond the observed data.
Two-way table: A table for categorical bivariate data; use row or column percentages to compare groups and identify associations.
Lurking variable: A hidden third variable that can explain an observed correlation between two variables without one causing the other.
Residual: Actual y minus predicted ŷ — indicates how far a data point lies from the regression line.
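The residual definition above can be sketched in a few lines of Python. This is only an illustration: the helper name `residual`, the line ŷ = 2 + 0.5x, and the observed point (4, 5) are all made up and do not come from this lesson's data.

```python
def residual(a: float, b: float, x: float, y_actual: float) -> float:
    """Return actual y minus predicted y-hat for the line y-hat = a + b*x."""
    y_predicted = a + b * x
    return y_actual - y_predicted


# Hypothetical line y-hat = 2 + 0.5x and hypothetical observed point (4, 5).
e = residual(a=2.0, b=0.5, x=4.0, y_actual=5.0)
print(f"residual = {e:.1f}")
```

A positive residual means the point lies above the regression line; a negative residual means it lies below.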

Complete interpretation framework

When interpreting bivariate data, address all of the following:

Context: What are the variables? What do they measure?
Direction: Positive or negative association?
Form: Linear or non-linear?
Strength: Strong, moderate, or weak? Describe r.
Outliers: Any unusual points? What effect do they have?
Causation: Correlation ≠ causation. What might cause the association?
Prediction: Use ŷ = a + bx; distinguish interpolation from extrapolation.

Coefficient of determination r²

r² tells us the proportion of variation in y that is explained by x through the linear model.

r	r²	Interpretation
0.9	0.81	81% of variation in y is explained by x
0.7	0.49	49% of variation in y is explained by x
0.5	0.25	25% of variation in y is explained by x
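The table rows can be reproduced with a short Python sketch. The function name `r_squared_percent` is a hypothetical helper chosen for this illustration; the extra value r = −0.9 is included only to show that squaring removes the sign.

```python
def r_squared_percent(r: float) -> float:
    """Return r squared, expressed as a percentage, for a correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return r * r * 100


# The three values from the table, plus a negative r for comparison.
for r in (0.9, 0.7, 0.5, -0.9):
    pct = r_squared_percent(r)
    print(f"r = {r:+.1f}  ->  {pct:.0f}% of variation in y explained by x")
```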

Effect of an outlier

An outlier can significantly change r and the slope of the line of best fit, especially in small datasets. Where appropriate, report the analysis both with and without the outlier.
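To see how much a single point can matter in a small dataset, here is a minimal Python sketch. The six data points are invented for illustration (five lie close to a line, the last is an outlier), and `pearson_r` and `slope` are hypothetical helper names using the standard least-squares formulas.

```python
from math import sqrt


def _sums(xs, ys):
    """Return (Sxy, Sxx, Syy), the sums of products of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient r."""
    sxy, sxx, syy = _sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def slope(xs, ys):
    """Least-squares slope b of the line y-hat = a + b*x."""
    sxy, sxx, _ = _sums(xs, ys)
    return sxy / sxx


# Made-up data: the final point (6, 3.0) is the outlier.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 3.0]

r_with, b_with = pearson_r(xs, ys), slope(xs, ys)
r_without, b_without = pearson_r(xs[:-1], ys[:-1]), slope(xs[:-1], ys[:-1])

print(f"with outlier:    r = {r_with:.2f}, slope = {b_with:.2f}")
print(f"without outlier: r = {r_without:.2f}, slope = {b_without:.2f}")
```

For this invented dataset, removing the one point raises r sharply and steepens the slope, which is why reporting both versions is informative.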

Two-way tables (categorical bivariate data)

When both variables are categorical, a two-way (contingency) table is used instead of a scatter plot. We calculate row percentages or column percentages to compare groups.

Figure: r² partitions total variation into explained and unexplained components.

Hot Tip: r² is always between 0 and 1 regardless of the sign of r — squaring removes the direction. A model with r = −0.9 and one with r = 0.9 both have r² = 0.81, meaning 81% of variation in y is explained by x.
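The statement that r² is the proportion of variation explained can be checked numerically. The sketch below assumes a least-squares line and uses made-up data; the names `fit_line` and `variation_parts` are hypothetical helpers, not part of this lesson.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def variation_parts(xs, ys):
    """Return (total, explained, unexplained) sums of squares in y."""
    a, b = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    predicted = [a + b * x for x in xs]
    total = sum((y - mean_y) ** 2 for y in ys)
    explained = sum((p - mean_y) ** 2 for p in predicted)
    unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return total, explained, unexplained


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

total, explained, unexplained = variation_parts(xs, ys)
r_squared = explained / total

print(f"total = {total:.2f}")
print(f"explained = {explained:.2f}, unexplained = {unexplained:.2f}")
print(f"r^2 = explained / total = {r_squared:.2f}")
```

For these points the total variation of 6 splits into 3.6 explained and 2.4 unexplained, so r² = 0.6: the line accounts for 60% of the variation in y.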

Worked Example 1 — Full interpretation of bivariate data

A study records daily rainfall (mm) and umbrella sales for a shop over 12 days. r = 0.88, ŷ = 5 + 0.4x.

Direction: Positive (more rain ⇒ more umbrella sales).

Form: Linear (assumed here; check the scatter plot, since a high r alone does not confirm that a linear model is appropriate).

Strength: Strong (r = 0.88 > 0.7).

r² = 0.77: 77% of variation in umbrella sales is explained by rainfall.

Prediction: On a day with 30 mm rain: ŷ = 5 + 0.4(30) = 17 umbrellas.

Causation: Rain likely causes increased umbrella sales (this causal link is physically plausible).
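The prediction step in this example can be sketched in Python, together with a check for interpolation versus extrapolation. The model ŷ = 5 + 0.4x is taken from the example; the observed rainfall range is not stated there, so the 0 to 40 mm range below is an assumption made purely for illustration, and `predict_sales` is a hypothetical helper name.

```python
def predict_sales(rain_mm: float, x_min: float, x_max: float):
    """Predict umbrella sales from y-hat = 5 + 0.4x and label the prediction."""
    y_hat = 5 + 0.4 * rain_mm
    kind = "interpolation" if x_min <= rain_mm <= x_max else "extrapolation"
    return y_hat, kind


# ASSUMPTION: the example does not give the observed rainfall range.
# 0 to 40 mm is a hypothetical range used only to demonstrate the check.
OBSERVED_MIN, OBSERVED_MAX = 0.0, 40.0

for rain in (30, 80):
    y_hat, kind = predict_sales(rain, OBSERVED_MIN, OBSERVED_MAX)
    print(f"{rain} mm -> about {y_hat:.0f} umbrellas ({kind})")
```

Under the assumed range, the 30 mm prediction of 17 umbrellas is an interpolation, while a prediction for 80 mm would be an extrapolation and should be treated with caution.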

Worked Example 2 — r² interpretation

A regression of household income (y) on years of education (x) gives r = 0.72. Interpret r².

r² = 0.72² = 0.518 ≈ 52%.

About 52% of the variation in household income is explained by years of education. The remaining 48% is due to other factors (occupation, location, experience, etc.).

Worked Example 3 — Two-way table interpretation

200 students surveyed: preferred subject (Maths/English) by gender.

	Maths	English	Total
Female	45	75	120
Male	60	20	80
Total	105	95	200

Female preferring Maths: 45/120 = 37.5%.

Male preferring Maths: 60/80 = 75%.

Conclusion: Males in this sample were more likely to prefer Maths (75% vs 37.5%), suggesting an association between gender and subject preference.
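The row percentages in this example can be reproduced with a short Python sketch. The counts are those in the table above; the dictionary layout and the helper name `row_percentages` are choices made for this illustration.

```python
# Counts from the survey table: rows are gender, columns are preferred subject.
counts = {
    "Female": {"Maths": 45, "English": 75},
    "Male": {"Maths": 60, "English": 20},
}


def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row.values())
        result[row_label] = {col: 100 * n / row_total for col, n in row.items()}
    return result


for gender, row in row_percentages(counts).items():
    formatted = ", ".join(f"{subject}: {pct:.1f}%" for subject, pct in row.items())
    print(f"{gender}: {formatted}")
```

Row percentages are used here because the groups being compared (female and male students) are the rows; comparing raw counts would be misleading since the two groups differ in size.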

Worked Example 4 — Outlier impact

A dataset of 8 points gives r = 0.92 and ŷ = 10 + 3x. When one outlier is removed, r rises to 0.97. What can we say?

The outlier was weakening the linear association: without it, the relationship is stronger (r = 0.97). The outlier may be a data entry error or a genuinely unusual case, so it is worth investigating before deciding whether to exclude it.

Worked Example 5 — Limitations of bivariate analysis

List three limitations when interpreting a scatter plot and regression line.

Causation cannot be inferred from correlation alone.
Extrapolation is unreliable — the trend may not hold beyond the data range.
Outliers can distort r and the regression line, especially in small samples.
(Bonus) The linear model may not be the best fit if the true relationship is non-linear.


Interpreting r². Fluency
- (a) r = 0.8. Find r² and interpret it.
- (b) r = −0.6. Find r². Does the negative sign affect r²?
- (c) r² = 0.49. What is r (positive or negative)? What does this mean?
- (d) A model has r² = 0.25. What percentage of variation in y is not explained by x?

Two-way table. Fluency

150 teenagers surveyed: own a pet (Yes/No) by whether they exercise daily (Yes/No).

	Exercise Yes	Exercise No	Total
Pet Yes	48	32	80
Pet No	22	48	70
Total	70	80	150

- (a) What percentage of pet owners exercise daily?
- (b) What percentage of non-pet-owners exercise daily?
- (c) Does the table suggest an association between pet ownership and daily exercise?
- (d) Can we conclude that owning a pet causes people to exercise more? Why or why not?

Full interpretation checklist. Fluency

A scatter plot of age (x, years) vs blood pressure (y, mmHg) shows r = 0.74 and ŷ = 95 + 0.5x, for ages 30–70.
- (a) Describe the direction and strength of the association.
- (b) Calculate r² and interpret it in context.
- (c) Predict the blood pressure of a 50-year-old. Is this interpolation or extrapolation?
- (d) Can we conclude that ageing causes blood pressure to rise?

Effect of removing an outlier. Fluency
- (a) A dataset of 6 points has r = 0.55. When a clear outlier is removed, r changes to 0.91. What does this suggest about the outlier?
- (b) In a scatter plot of height vs weight, one individual is 190 cm and 50 kg (very underweight). Is this likely an outlier?
- (c) Should outliers always be removed from analysis? Explain.
- (d) A dataset has an outlier at (100, 5) while all other points cluster in the range x = 1–10. How does this outlier likely affect the slope of the regression line?

Interpreting a scatter plot with outlier. Understanding

The scatter plot below shows hours of sleep (x) vs reaction time in milliseconds (y) for 10 people. One point is marked as an outlier.
- (a) Describe the direction and form of the association for the main points (excluding the outlier).
- (b) r = −0.97 (without outlier). Interpret r and r² in context.
- (c) Suggest a reason for the outlier (6 hours sleep, 460 ms reaction time).
- (d) How would including the outlier change r? Would it increase or decrease |r|?

Comparing two datasets. Understanding

Two classes each investigate study time (x) vs mark (y). Class A has r = 0.92 and Class B has r = 0.55.
- (a) Which class shows the stronger link between study time and marks?
- (b) In Class A, r² = 0.846. Interpret this.
- (c) In Class B, the line of best fit is ŷ = 40 + 6x. Predict the mark for 5 hours study. How reliable is this prediction?
- (d) What might explain why Class B has a weaker correlation even though both classes did the same subject?

Categorical bivariate data. Understanding

300 adults surveyed: smoke (Yes/No) vs develop lung disease (Yes/No) over 10 years.

	Lung Disease Yes	Lung Disease No	Total
Smoke Yes	90	60	150
Smoke No	30	120	150
Total	120	180	300

- (a) What percentage of smokers developed lung disease?
- (b) What percentage of non-smokers developed lung disease?
- (c) Does the data suggest an association between smoking and lung disease?
- (d) Can we conclude causation? What kind of study would provide stronger evidence?

Critique a statistical claim. Understanding
- (a) “Our study shows r = 0.78 between coffee consumption and productivity. Drink more coffee to be more productive!” Evaluate this claim.
- (b) A scatter plot shows r = −0.4. A journalist writes: “There is no real relationship.” Is this accurate?
- (c) A regression line has r² = 0.95. A student says: “The model perfectly predicts y.” Why is this wrong?
- (d) A study finds that people with more books in their home have higher IQs. A school proposes buying all students 20 books. Critique this policy.

Real estate analysis. Problem Solving

A real estate agent plots house price ($000s) vs distance from city centre (km) for 12 houses. r = −0.81, ŷ = 850 − 18x, for distances 2–30 km.
- (a) Interpret r and r² in context.
- (b) Predict the price of a house 15 km from the centre. Is this interpolation?
- (c) Predict the price of a house 50 km from the centre. Comment on reliability.
- (d) A house 10 km from the centre sells for $790 000. Calculate the residual and interpret it.

Designing a bivariate study. Problem Solving

A nutritionist wants to investigate the relationship between sugar intake (grams/day) and risk of type 2 diabetes.
- (a) Identify the explanatory and response variables.
- (b) Describe how you would collect the data. What would you measure and how many participants would be ideal?
- (c) If r = 0.65 is found, write a correct interpretation (including r²) and a statement about causation.
- (d) What confounding variables might affect the results? Name at least three.
