Bivariate Data Analysis — Topic Review — Solutions

This review covers all four lessons in Bivariate Data Analysis: Scatterplots & Correlation, Least-Squares Regression, Prediction & Interpolation, and Residual Analysis.

Review Questions

A researcher records the number of hours per week spent exercising (x) and the systolic blood pressure in mmHg (y) for 15 adults. The data gives Pearson’s r = −0.84.
1. Describe the association fully (direction, form, strength).
2. Which variable is the explanatory variable? Justify.
3. Does this mean exercise causes lower blood pressure? Explain.
(a) Strong, negative, linear association (r = −0.84, |r| ≥ 0.75). Adults who exercise more tend to have lower systolic blood pressure.

(b) Hours of exercise (x) is the explanatory variable — we expect exercise to influence blood pressure, not the reverse.

(c) No. Correlation does not imply causation. Other factors (diet, age, medication, genetics) could explain both greater exercise habits and lower blood pressure. A controlled experiment would be needed to establish causation.

Match each r value to its correct description:

r value	Description
0.12	Strong negative linear
−0.79	Moderate positive linear
0.64	Weak positive linear
−0.97	Very strong negative linear

r = 0.12 → Weak positive linear (|r| < 0.5)

r = −0.79 → Strong negative linear (|r| ≥ 0.75, negative)

r = 0.64 → Moderate positive linear (0.5 ≤ |r| < 0.75)

r = −0.97 → Very strong negative linear (|r| close to 1)
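
The cut-offs used in these answers can be written as a small lookup. This is a minimal sketch: the function name `describe_r` is illustrative, and the 0.95 boundary for "very strong" is an assumption, since the answers only say "|r| close to 1".

```python
def describe_r(r: float) -> str:
    """Describe a linear association from Pearson's r using this review's cut-offs."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear association"
    size = abs(r)
    if size >= 0.95:      # assumed boundary for "very strong" (not stated in the text)
        strength = "very strong"
    elif size >= 0.75:    # strong: |r| >= 0.75
        strength = "strong"
    elif size >= 0.5:     # moderate: 0.5 <= |r| < 0.75
        strength = "moderate"
    else:                 # weak: |r| < 0.5
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear"


for r in (0.12, -0.79, 0.64, -0.97):
    print(r, "->", describe_r(r))
```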

A least-squares regression line for data on fertiliser applied (x, kg/ha) and crop yield (y, t/ha) is found to be ŷ = 1.8 + 0.24x.
1. Interpret the gradient 0.24 in context.
2. Interpret the y-intercept 1.8 in context.
3. Use the equation to predict the yield when 30 kg/ha of fertiliser is applied.
(a) For each additional 1 kg/ha of fertiliser applied, crop yield is predicted to increase by 0.24 t/ha on average.

(b) When no fertiliser is applied (x = 0), the predicted crop yield is 1.8 t/ha. This is the base yield without fertiliser.

(c) ŷ = 1.8 + 0.24(30) = 1.8 + 7.2 = 9.0 t/ha

The table below shows data on average daily temperature (°C) and number of hot drinks sold at a café:

Temp x (°C)	10	15	20	25	30
Drinks sold y	85	74	62	48	35

1. Identify the explanatory and response variables.
2. Technology gives the regression equation ŷ = 111.2 − 2.52x. Interpret the gradient.
3. What does the negative gradient tell us about the relationship?

(a) Explanatory variable: temperature (x). Response variable: hot drinks sold (y).

(b) For each 1°C increase in temperature, the number of hot drinks sold is predicted to decrease by 2.52 on average.

(c) The negative gradient confirms a negative association: hotter days are associated with fewer hot drinks sold, which makes intuitive sense.
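
A regression equation quoted from technology can be checked against the table by applying the least-squares formulas directly. The sketch below is one way to do that in plain Python; the helper name `least_squares` is illustrative and not part of the lesson.

```python
def least_squares(xs, ys):
    """Return (intercept, gradient, r) for the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x   # the line passes through (mean_x, mean_y)
    r = sxy / (sxx * syy) ** 0.5
    return intercept, gradient, r


temps = [10, 15, 20, 25, 30]     # daily temperature, °C
drinks = [85, 74, 62, 48, 35]    # hot drinks sold

a, b, r = least_squares(temps, drinks)
print(f"intercept = {a:.1f}, gradient = {b:.2f}, r = {r:.3f}")
```

The sign of the gradient and the sign of r always agree, so a negative gradient here goes with a negative correlation.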

The regression equation for house price (y, $’000) and floor area (x, m²) is ŷ = 180 + 2.35x. The data was collected from houses with floor areas between 80 m² and 280 m².
1. Predict the price of a house with floor area 150 m². Is this interpolation or extrapolation?
2. Predict the price of a house with floor area 400 m². Is this reliable? Explain.
3. A house sells for $650,000 (y = 650). Find the residual if its floor area is 180 m².
(a) ŷ = 180 + 2.35(150) = 180 + 352.5 = $532,500. This is interpolation (150 m² is within the data range 80–280 m²).

(b) ŷ = 180 + 2.35(400) = 180 + 940 = $1,120,000. This is extrapolation (400 m² is outside the data range). The prediction is not reliable — the linear relationship may not hold beyond 280 m².

(c) ŷ = 180 + 2.35(180) = 180 + 423 = 603. Residual = 650 − 603 = +$47,000 (actual price is $47,000 above the predicted price).
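
The three steps in this question (predict, classify the prediction, find the residual) can be sketched as follows. The function names are illustrative; the equation and the 80–280 m² range are taken from the question.

```python
def predict_price(area_m2: float) -> float:
    """Predicted price in $'000 from y-hat = 180 + 2.35x."""
    return 180 + 2.35 * area_m2


def prediction_type(x: float, x_min: float = 80, x_max: float = 280) -> str:
    """Interpolation if x lies inside the observed range, otherwise extrapolation."""
    return "interpolation" if x_min <= x <= x_max else "extrapolation"


for area in (150, 400):
    print(area, "m²:", round(predict_price(area), 1), "($'000),", prediction_type(area))

# Residual = actual - predicted, for the 180 m² house that sold for $650,000
residual = 650 - predict_price(180)
print("residual:", round(residual, 1), "($'000)")
```
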
A regression line is fitted to data on study hours (x) and exam score (y). The equation is ŷ = 38 + 6.4x, based on students who studied between 1 and 10 hours.
1. Predict the score for a student who studied 5 hours.
2. A student claims the model predicts a score of 166 for 20 hours of study. Evaluate this claim.
3. Why does extrapolation become increasingly unreliable the further we go beyond the data range?
(a) ŷ = 38 + 6.4(5) = 38 + 32 = 70 marks. This is interpolation and is a reliable prediction.

(b) ŷ = 38 + 6.4(20) = 38 + 128 = 166. Numerically correct, but the prediction is unreliable. 20 hours is well outside the data range (1–10 hours). Also, scores cannot exceed 100 — the linear model breaks down at the extremes.

(c) The linear relationship was only validated within the observed range. Outside this range, the true relationship may curve, plateau, or change direction. The further we extrapolate, the more we rely on assumptions about a pattern we have not observed.
A regression line ŷ = 3 + 2.1x is fitted to a dataset. The residuals (in x-order) are: −2.1, −1.5, −0.8, +0.2, +1.1, +1.9, +2.8.
1. Describe the pattern in these residuals.
2. What does this pattern tell you about the linear model?
3. What type of model would be more appropriate?
(a) The residuals increase steadily from negative to positive as x increases — a clear positive trend (or drifting upward pattern).

(b) This systematic pattern indicates the linear model is not appropriate. The residual plot is not random; there is a trend, meaning the model consistently underestimates for large x and overestimates for small x. This suggests the true relationship may be non-linear (possibly exponential or curved upward).

(c) An exponential or polynomial model would be more appropriate. A log(y) vs x transformation could be tried to check for an exponential relationship.
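
A residual plot is normally judged by eye, but two crude numerical checks can support the description in (a). This is only a sketch of a heuristic, not a formal test: residuals that scatter randomly tend to change sign often, while a systematic pattern produces long runs of the same sign.

```python
residuals = [-2.1, -1.5, -0.8, 0.2, 1.1, 1.9, 2.8]   # listed in x-order


def sign_changes(values):
    """Count how many times consecutive non-zero residuals switch sign."""
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def steadily_increasing(values):
    """True if every residual is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


print("sign changes:", sign_changes(residuals))
print("steadily increasing:", steadily_increasing(residuals))
```
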
For a study on advertising spend and sales, technology gives r = 0.88.
1. Calculate the coefficient of determination r².
2. Interpret r² in context.
3. A colleague says “r² = 0.88 means the model is almost perfect.” Is this correct? Explain.
(a) r² = 0.88² = 0.7744 ≈ 0.774

(b) Approximately 77.4% of the variation in sales is explained by the linear relationship with advertising spend. The remaining 22.6% is due to other factors.

(c) The colleague is confusing r with r². While r = 0.88 is strong, r² = 0.774, meaning 22.6% of variation is still unexplained. The model is good but not almost perfect. Also, a high r² does not guarantee the model is appropriate — the residual plot must also show random scatter.
Data on the age of a machine (x, years) and its maintenance cost (y, $) gives the regression equation ŷ = 420 + 185x, with r = 0.94 and data collected for machines aged 1–8 years.
1. Predict the maintenance cost for a 5-year-old machine.
2. Calculate and interpret r².
3. Is it appropriate to use this equation to predict the maintenance cost of a 15-year-old machine? Explain.
4. The actual maintenance cost for a 5-year-old machine is $1,450. Find and interpret the residual.
(a) ŷ = 420 + 185(5) = 420 + 925 = $1,345

(b) r² = 0.94² = 0.8836 ≈ 0.884. About 88.4% of the variation in maintenance cost is explained by the linear relationship with machine age. This is a strong fit.

(c) No. x = 15 is well outside the data range of 1–8 years. This is extrapolation. The linear relationship may not continue to hold for very old machines (maintenance costs might increase rapidly or the machines may have been replaced). The prediction is unreliable.

(d) Residual = 1450 − 1345 = +$105. The actual maintenance cost is $105 above the model’s prediction for a 5-year-old machine — it lies above the regression line.
A scatterplot of body mass index (x) and resting metabolic rate (y, calories/day) for 20 participants shows a moderate positive linear association. Technology gives: ŷ = 820 + 14.3x and r = 0.72.
1. Interpret the gradient in context.
2. Calculate r² and interpret it.
3. Predict the resting metabolic rate for a person with BMI = 28.
4. Explain why this regression line should not be used to establish that high BMI causes a higher metabolic rate.
(a) For each 1-unit increase in BMI, the resting metabolic rate is predicted to increase by 14.3 calories/day on average.

(b) r² = 0.72² = 0.5184 ≈ 0.518. About 51.8% of the variation in resting metabolic rate is explained by the linear relationship with BMI. Nearly half the variation is due to other factors (age, muscle mass, hormones).

(c) ŷ = 820 + 14.3(28) = 820 + 400.4 = 1220.4 calories/day ≈ 1220 calories/day.

(d) Correlation (and regression) describes association, not causation. This is an observational study — we have not manipulated BMI. Other variables (lean muscle mass, age, sex) influence metabolic rate and are correlated with BMI. A controlled experiment would be needed to establish causation.
In a study of primary school children, the number of books read per month (x) and mathematics test score (y) have r = 0.73. A school principal concludes that encouraging reading will improve maths scores.
1. Identify a possible lurking variable.
2. Does r = 0.73 prove the principal’s conclusion? Explain.
3. What study design could better investigate a causal link?
(a) A likely lurking variable is general academic ability or parental engagement — academically engaged children tend to both read more and perform better in maths, without one causing the other.

(b) No. r = 0.73 indicates a moderate positive association (0.5 ≤ |r| < 0.75), but correlation does not prove causation. The observed relationship may be entirely due to the lurking variable.

(c) A randomised controlled experiment: randomly assign students to a reading programme vs a control group, and compare maths scores after a set period, controlling for prior ability.
A linear model for rainfall (x, mm) and dam water level (y, m) gives r² = 0.81. The residual plot shows a random scatter pattern.
1. What does r² = 0.81 tell you?
2. What does the random scatter residual plot confirm?
3. Calculate r given r² = 0.81. (Assume a positive association.)
4. Would you trust predictions from this model? Justify.
(a) r² = 0.81 means 81% of the variation in dam water level is explained by the linear relationship with rainfall. Only 19% is due to other factors.

(b) Random scatter in the residual plot confirms that the linear model is appropriate — there is no systematic pattern left unexplained by the model.

(c) r = √0.81 = 0.9 (positive, since the association is positive: more rain → higher dam level).

(d) Yes. Both r² = 0.81 (high explanatory power) and the random residual plot (appropriate model structure) support trusting predictions from this model within the observed data range.
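
Converting between r and r² loses the sign, so the direction of the association has to be supplied separately when recovering r. A minimal sketch, with illustrative function names:

```python
import math


def r_from_r_squared(r_squared: float, positive: bool = True) -> float:
    """Recover r from r²; the sign must come from the direction of the association."""
    r = math.sqrt(r_squared)
    return r if positive else -r


def explained_percent(r: float) -> float:
    """Percentage of variation in y explained by the linear model (100 × r²)."""
    return 100 * r * r


print(round(r_from_r_squared(0.81), 3))   # dam level example, positive association
print(round(explained_percent(0.88), 2))  # advertising example
```
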
A marine biologist measures water temperature (x, °C) and coral bleaching percentage (y, %) across 12 reef sites. Technology gives: ŷ = −28.4 + 3.7x and r = 0.91.
1. Describe the association (direction, form, strength).
2. Interpret the gradient in context.
3. Calculate r² and interpret it.
4. Predict the bleaching percentage when water temperature is 29°C.
5. The actual bleaching at 29°C is 82%. Find the residual and interpret it.
(a) Strong, positive, linear association (r = 0.91, |r| ≥ 0.75). Higher water temperatures are associated with greater coral bleaching.

(b) For each 1°C increase in water temperature, coral bleaching is predicted to increase by 3.7 percentage points on average.

(c) r² = 0.91² = 0.8281 ≈ 82.8%. About 82.8% of the variation in bleaching percentage is explained by the linear relationship with water temperature. The remaining 17.2% may be due to other factors (salinity, currents, pollution).

(d) ŷ = −28.4 + 3.7(29) = −28.4 + 107.3 = 78.9%

(e) Residual = 82 − 78.9 = +3.1 percentage points. The actual bleaching at this site is 3.1 percentage points above the model’s prediction, so the site is more severely affected than expected based on temperature alone.
A dataset on bacterial population (y, thousands) after x hours shows a curved scatterplot. A log(y) vs x transformation gives r = 0.997 and a residual plot showing random scatter. The linear model (no transformation) has r² = 0.78 and a curved residual plot.
1. Calculate r² for the transformed model.
2. Which model is more appropriate? Justify using both numerical and graphical evidence.
3. What type of growth does the successful transformation suggest?
(a) r² = 0.997² = 0.994009 ≈ 0.994 (99.4% of variation in log(y) is explained).

(b) The log(y) vs x transformed model is far more appropriate. Numerically, r² = 0.994 vs 0.78 — the transformed model explains 99.4% of variation compared to only 78%. Graphically, the transformed model’s residual plot shows random scatter (confirming the model structure is appropriate), whereas the linear model’s curved residual plot confirms systematic non-linearity that the linear model fails to capture.

(c) The log(y) vs x transformation being successful indicates the bacteria are growing exponentially — the population multiplies by a constant factor each hour, which is the classic pattern of bacterial growth in ideal conditions.

The data below shows the selling price ($’000) of 6 used cars and their age (years):

Age x (years)	1	2	3	5	7	9
Price y ($’000)	32	27	23	16	11	7

Technology gives: ŷ = 33.26 − 3.09x, r = −0.989.

1. Describe the association.
2. Calculate r² and interpret it.
3. Predict the price of a 4-year-old car. State whether this is interpolation or extrapolation.
4. The residual for the 5-year-old car is −1.8. Show that the actual price is consistent with the table.
5. Would it be appropriate to use the model to predict the price of a 20-year-old car? Justify.

(a) Very strong, negative, linear association (r = −0.989). As car age increases, selling price decreases. The points lie very close to a straight line.

(b) r² = (−0.989)² = 0.978 ≈ 97.8%. About 97.8% of the variation in car price is explained by the linear relationship with age. This is an exceptionally strong fit.

(c) ŷ = 33.26 − 3.09(4) = 33.26 − 12.36 = 20.90, i.e. about $20,900. This is interpolation since x = 4 is within the data range of 1–9 years.

(d) Predicted value at x = 5: ŷ = 33.26 − 3.09(5) = 33.26 − 15.45 = 17.81. Since residual = actual − predicted, actual = predicted + residual = 17.81 + (−1.8) = 16.01 ≈ 16, which matches the table value of 16 ($16,000). The negative residual means this car sold for about $1,800 less than the model predicts.

(e) No. x = 20 is far outside the data range (1–9 years). This is extrapolation. The model would predict ŷ = 33.26 − 3.09(20) = 33.26 − 61.8 = −28.54 ($’000), a negative price, which is meaningless. The linear model clearly breaks down for very old cars.