Practice Maths

L30 — Interpreting Bivariate Data

Key Terms

Coefficient of determination r²
The proportion of variation in y explained by x through the linear model; always between 0 and 1 regardless of the sign of r.
Interpolation
Predicting y within the observed data range — generally reliable.
Extrapolation
Predicting y outside the data range — unreliable; the trend may not continue beyond the observed data.
Two-way table
A table for categorical bivariate data; use row or column percentages to compare groups and identify associations.
Lurking variable
A hidden third variable that can explain an observed correlation between two variables without one causing the other.
Residual
Actual y minus predicted ŷ — indicates how far a data point lies from the regression line.

Complete interpretation framework

When interpreting bivariate data, address all of the following:

  1. Context: What are the variables? What do they measure?
  2. Direction: Positive or negative association?
  3. Form: Linear or non-linear?
  4. Strength: Strong, moderate, or weak? Describe r.
  5. Outliers: Any unusual points? What effect do they have?
  6. Causation: Correlation ≠ causation. What might cause the association?
  7. Prediction: Use ŷ = a + bx; distinguish interpolation from extrapolation.
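
Step 7 can be sketched in Python. This is a minimal illustration, not a prescribed method: the function names `predict`, `prediction_type` and `residual` are hypothetical, and the line ŷ = 2 + 3x, the x-range 1 to 10 and the observed point (4, 16) are made-up values chosen only for the demonstration.

```python
def predict(a, b, x):
    """Return the predicted value y-hat = a + b*x from the regression line."""
    return a + b * x


def prediction_type(x, x_min, x_max):
    """Classify a prediction by whether x lies inside the observed x-range."""
    if x_min <= x <= x_max:
        return "interpolation"
    return "extrapolation"


def residual(y_actual, y_predicted):
    """Residual = actual y minus predicted y-hat."""
    return y_actual - y_predicted


# Hypothetical line y-hat = 2 + 3x, fitted to data with x between 1 and 10.
a, b = 2, 3
x_min, x_max = 1, 10

for x in (4, 15):
    print(x, predict(a, b, x), prediction_type(x, x_min, x_max))

# A hypothetical observed point (4, 16) lies 2 units above the line.
print(residual(16, predict(a, b, 4)))
```

A positive residual means the actual value sits above the line; a negative residual means it sits below.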

Coefficient of determination r²

r² tells us the proportion of variation in y that is explained by x through the linear model.

r      r²      Interpretation
0.9    0.81    81% of variation in y is explained by x
0.7    0.49    49% of variation in y is explained by x
0.5    0.25    25% of variation in y is explained by x
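
The values of r² above can be reproduced with a short Python sketch. The function name `r_squared` is a hypothetical helper, and the extra value r = −0.9 is included only to show that the sign of r does not change r².

```python
def r_squared(r):
    """Coefficient of determination for a simple linear model: r squared."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    return r * r


for r in (0.9, 0.7, 0.5, -0.9):
    percent = round(100 * r_squared(r))
    print(f"r = {r:+.1f}   r^2 = {r_squared(r):.2f}   about {percent}% explained")
```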

Effect of an outlier

An outlier can significantly change r and the slope of the line of best fit, especially in small datasets. Where appropriate, report the analysis both with and without the outlier.
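
One way to see this effect is to compute r and the slope twice, once with and once without the unusual point. The sketch below uses made-up data (six roughly linear points plus one hypothetical outlier at (7, 2)); the helper names `pearson_r` and `slope` are assumptions for illustration, written from the standard definitions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def slope(xs, ys):
    """Least-squares slope b in y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: six roughly linear points.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 6, 9, 10, 13]

# The same data with one hypothetical outlier appended.
xs_with = xs + [7]
ys_with = ys + [2]

print("without outlier:", round(pearson_r(xs, ys), 2), round(slope(xs, ys), 2))
print("with outlier:   ", round(pearson_r(xs_with, ys_with), 2),
      round(slope(xs_with, ys_with), 2))
```

With only seven points, the single outlier lowers both r and the slope noticeably, which is why small datasets are especially sensitive.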

Two-way tables (categorical bivariate data)

When both variables are categorical, a two-way (contingency) table is used instead of a scatter plot. We calculate row percentages or column percentages to compare groups.

Figure: r² partitions total variation in y (100%) into explained and unexplained components. Example: r = 0.837 ⇒ r² = 0.70, so 70% of the variation in y is explained by x and the remaining 30% is due to other factors.
Hot Tip: r² is always between 0 and 1 regardless of the sign of r — squaring removes the direction. A model with r = −0.9 and one with r = 0.9 both have r² = 0.81, meaning 81% of variation in y is explained by x.

Worked Example 1 — Full interpretation of bivariate data

A study records daily rainfall (mm) and umbrella sales for a shop over 12 days. r = 0.88, ŷ = 5 + 0.4x.

Direction: Positive (more rain ⇒ more umbrella sales).

Form: Linear (the high value of r is consistent with a linear model; the scatter plot should confirm there is no curve).

Strength: Strong (r = 0.88 > 0.7).

r² = 0.77: 77% of variation in umbrella sales is explained by rainfall.

Prediction: On a day with 30 mm rain: ŷ = 5 + 0.4(30) = 17 umbrellas.

Causation: Rain likely causes increased umbrella sales (this causal link is physically plausible).
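
The arithmetic in this worked example can be checked with a few lines of Python. This is only a sketch: the variable names are hypothetical, and the 0.7 cutoff for "strong" is the one used in the Strength step above.

```python
r = 0.88
a, b = 5, 0.4  # y-hat = 5 + 0.4x, as given in the example

r_sq = r ** 2
strength = "strong" if abs(r) > 0.7 else "not strong by this cutoff"
predicted_sales = a + b * 30  # 30 mm of rain

print(f"r^2 = {r_sq:.2f}")
print(f"strength: {strength}")
print(f"predicted sales at 30 mm: {predicted_sales:.0f} umbrellas")
```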

Worked Example 2 — r² interpretation

A regression of household income (y) on years of education (x) gives r = 0.72. Interpret r².

r² = 0.72² = 0.518 ≈ 52%.

About 52% of the variation in household income is explained by years of education. The remaining 48% is due to other factors (occupation, location, experience, etc.).

Worked Example 3 — Two-way table interpretation

200 students surveyed: preferred subject (Maths/English) by gender.

          Maths   English   Total
Female    45      75        120
Male      60      20        80
Total     105     95        200

Female preferring Maths: 45/120 = 37.5%.

Male preferring Maths: 60/80 = 75%.

Conclusion: Males in this sample were more likely to prefer Maths (75% vs 37.5%), suggesting an association between gender and subject preference.
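
The row percentages in this example can be computed with a short Python sketch. The counts come from the table above; the dictionary layout and the function name `row_percentages` are assumptions made for illustration.

```python
# Counts from the two-way table in Worked Example 3.
table = {
    "Female": {"Maths": 45, "English": 75},
    "Male": {"Maths": 60, "English": 20},
}


def row_percentages(counts):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for group, row in counts.items():
        total = sum(row.values())
        result[group] = {label: 100 * n / total for label, n in row.items()}
    return result


for group, row in row_percentages(table).items():
    print(group, {label: round(p, 1) for label, p in row.items()})
```

Row percentages are used here because the groups being compared (Female, Male) are the rows; if the groups were the columns, column percentages would be the natural choice.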

Worked Example 4 — Outlier impact

A dataset of 8 points gives r = 0.92 and ŷ = 10 + 3x. When one outlier is removed, r rises to 0.97. What can we say?

The outlier was weakening the linear association. The true relationship (without the outlier) is stronger (r = 0.97). The outlier may represent a data entry error or genuinely unusual case worth investigating.

Worked Example 5 — Limitations of bivariate analysis

List three limitations when interpreting a scatter plot and regression line.

  1. Causation cannot be inferred from correlation alone.
  2. Extrapolation is unreliable — the trend may not hold beyond the data range.
  3. Outliers can distort r and the regression line, especially in small samples.
  4. (Bonus) The linear model may not be the best fit if the true relationship is non-linear.

  1. Interpreting r². Fluency

    • (a) r = 0.8. Find r² and interpret it.
    • (b) r = −0.6. Find r². Does the negative sign affect r²?
    • (c) r² = 0.49. What is r (positive or negative)? What does this mean?
    • (d) A model has r² = 0.25. What percentage of variation in y is not explained by x?
  2. Two-way table. Fluency

    150 teenagers surveyed: own a pet (Yes/No) by whether they exercise daily (Yes/No).

               Exercise Yes   Exercise No   Total
    Pet Yes    48             32            80
    Pet No     22             48            70
    Total      70             80            150
    • (a) What percentage of pet owners exercise daily?
    • (b) What percentage of non-pet-owners exercise daily?
    • (c) Does the table suggest an association between pet ownership and daily exercise?
    • (d) Can we conclude that owning a pet causes people to exercise more? Why or why not?
  3. Full interpretation checklist. Fluency

    A scatter plot of age (x, years) vs blood pressure (y, mmHg) shows r = 0.74 and ŷ = 95 + 0.5x, for ages 30–70.

    • (a) Describe the direction and strength of the association.
    • (b) Calculate r² and interpret it in context.
    • (c) Predict the blood pressure of a 50-year-old. Is this interpolation or extrapolation?
    • (d) Can we conclude that ageing causes blood pressure to rise?
  4. Effect of removing an outlier. Fluency

    • (a) A dataset of 6 points has r = 0.55. When a clear outlier is removed, r changes to 0.91. What does this suggest about the outlier?
    • (b) In a scatter plot of height vs weight, one individual is 190 cm and 50 kg (very underweight). Is this likely an outlier?
    • (c) Should outliers always be removed from analysis? Explain.
    • (d) A dataset has an outlier at (100, 5) while all other points cluster in the range x=1–10. How does this outlier likely affect the slope of the regression line?
  5. Interpreting a scatter plot with outlier. Understanding

    The scatter plot below shows hours of sleep (x) vs reaction time in milliseconds (y) for 10 people. One point is marked as an outlier.

    [Scatter plot: hours of sleep (x-axis, 4 to 9) vs reaction time in ms (y-axis, 200 to 500); outlier marked at (6 h, 460 ms).]
    • (a) Describe the direction and form of the association for the main points (excluding the outlier).
    • (b) r = −0.97 (without outlier). Interpret r and r² in context.
    • (c) Suggest a reason for the outlier (6 hours sleep, 460 ms reaction time).
    • (d) How would including the outlier change r? Would it increase or decrease |r|?
  6. Comparing two datasets. Understanding

    Two classes each investigate study time (x) vs mark (y). Class A has r = 0.92 and Class B has r = 0.55.

    • (a) Which class shows the stronger link between study time and marks?
    • (b) In Class A, r² = 0.846. Interpret this.
    • (c) In Class B, the line of best fit is ŷ = 40 + 6x. Predict the mark for 5 hours study. How reliable is this prediction?
    • (d) What might explain why Class B has a weaker correlation even though both classes did the same subject?
  7. Categorical bivariate data. Understanding

    300 adults surveyed: smoke (Yes/No) vs develop lung disease (Yes/No) over 10 years.

                 Lung Disease Yes   Lung Disease No   Total
    Smoke Yes    90                 60                150
    Smoke No     30                 120               150
    Total        120                180               300
    • (a) What percentage of smokers developed lung disease?
    • (b) What percentage of non-smokers developed lung disease?
    • (c) Does the data suggest an association between smoking and lung disease?
    • (d) Can we conclude causation? What kind of study would provide stronger evidence?
  8. Critique a statistical claim. Understanding

    • (a) “Our study shows r = 0.78 between coffee consumption and productivity. Drink more coffee to be more productive!” Evaluate this claim.
    • (b) A scatter plot shows r = −0.4. A journalist writes: “There is no real relationship.” Is this accurate?
    • (c) A regression line has r² = 0.95. A student says: “The model perfectly predicts y.” Why is this wrong?
    • (d) A study finds that people with more books in their home have higher IQs. A school proposes buying all students 20 books. Critique this policy.
  9. Real estate analysis. Problem Solving

    A real estate agent plots house price ($000s) vs distance from city centre (km) for 12 houses. r = −0.81, ŷ = 850 − 18x, for distances 2–30 km.

    • (a) Interpret r and r² in context.
    • (b) Predict the price of a house 15 km from the centre. Is this interpolation?
    • (c) Predict the price of a house 50 km from the centre. Comment on reliability.
    • (d) A house 10 km from the centre sells for $790 000. Calculate the residual and interpret it.
  10. Designing a bivariate study. Problem Solving

    A nutritionist wants to investigate the relationship between sugar intake (grams/day) and risk of type 2 diabetes.

    • (a) Identify the explanatory and response variables.
    • (b) Describe how you would collect the data. What would you measure and how many participants would be ideal?
    • (c) If r = 0.65 is found, write a correct interpretation (including r²) and a statement about causation.
    • (d) What confounding variables might affect the results? Name at least three.