Practice Maths

Solutions: Residual Analysis

  1. (a) ŷ = 5 + 12(4) = 5 + 48 = 53 units

    (b) e = y − ŷ = 53 − 53 = 0

    (c) The residual is zero — the actual value lies exactly on the regression line.

  2. Regression equation: ŷ = 2.4 + 0.8x

    x     Actual y    Predicted ŷ    Residual e = y − ŷ
    5     7.2         6.4            +0.8
    10    9.8         10.4           −0.6
    15    14.9        14.4           +0.5
    20    18.1        18.4           −0.3

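    Sum of residuals = 0.8 − 0.6 + 0.5 − 0.3 = +0.4. This is not exactly zero, most likely because the coefficients are rounded or these four points are only part of the data; for a least-squares line fitted to the full data set, the residuals sum to exactly zero.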
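
    The table for question 2 can be checked with a short Python sketch. The function name predict and the rounding to one decimal place are choices made here for illustration, not part of the question.

```python
# Regression line from question 2: y-hat = 2.4 + 0.8x
def predict(x):
    return 2.4 + 0.8 * x

# (x, actual y) pairs taken from the table
data = [(5, 7.2), (10, 9.8), (15, 14.9), (20, 18.1)]

# Residual = actual - predicted; round to 1 d.p. to hide floating-point noise
residuals = [round(y - predict(x), 1) for x, y in data]
total = round(sum(residuals), 1)

for (x, y), e in zip(data, residuals):
    print(f"x={x:>2}  y={y:>4}  y-hat={predict(x):.1f}  e={e:+.1f}")
print(f"Sum of residuals = {total:+.1f}")
```

    The printed residuals should match the last column of the table, and the sum should match the +0.4 above.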

  3. (a) The inverted U-shape indicates the linear model is not appropriate — the residuals show a systematic curved pattern, meaning the model over-predicts at the extremes and under-predicts in the middle.

    (b) A quadratic (parabolic) model would be more appropriate, since the inverted U in the residuals points to curvature that a straight line cannot capture.

    (c) A square root transformation of y, or fitting y as a quadratic function of x, could be tried.

  4. (a) r² = 0.91² = 0.8281 ≈ 0.828

    (b) Approximately 82.8% of the variation in fuel consumption is explained by the linear relationship with speed.

    (c) 100% − 82.8% = 17.2% of the variation is not explained by speed (due to other factors such as road conditions, tyre pressure, and driving style).

  5. (a) Model B better explains variation in y. It has r² = 0.93, explaining 93% of variation in y, versus only 74% for Model A. Its residual plot also shows random scatter, confirming it is appropriate.

    (b) The residual plot for Model A should be examined. If it shows a pattern, this confirms Model A is not appropriate even at r² = 0.74.

    (c) r² = 0.93 means 93% of the variation in the response variable is explained by the transformed linear model; only 7% remains unexplained.

  6. (a) r² = 0.98² = 0.9604 ≈ 0.960

    (b) Yes, the log transformation is very successful. r = 0.98 (r² = 0.960) means 96% of variation in log(y) is explained by a linear relationship with x. The residual plot should show random scatter to confirm.

    (c) The log transformation being successful suggests that plant height grows exponentially over time — increasing by a constant percentage per day rather than a constant number of centimetres.
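
    To see why a straight line in log(y) corresponds to constant percentage growth, here is a minimal Python sketch. The coefficients a and b are hypothetical values chosen only for illustration (they are not from the question), and a base-10 log is assumed.

```python
# Hypothetical fitted line on the transformed scale: log10(height) = a + b * day
# (a and b are illustrative values, not taken from the question)
a, b = 0.5, 0.05

def height(day):
    # Undo the log transformation to get back to centimetres
    return 10 ** (a + b * day)

# Ratio of each day's height to the previous day's height
ratios = [height(d + 1) / height(d) for d in range(5)]

# Every ratio equals 10**b, so growth is a constant percentage per day
growth_factor = 10 ** b
print([round(r, 4) for r in ratios])
print(f"Daily growth is about {100 * (growth_factor - 1):.1f}%")
```

    Whatever values a and b take, the day-to-day ratio is always 10^b, which is what "a constant percentage per day" means.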

  7. (a) The fan shape indicates non-constant variability (heteroscedasticity) — the spread of residuals increases as x increases.

    (b) A linear model is not fully appropriate. The non-constant spread violates regression assumptions, making predictions less reliable for large x-values. A log(y) transformation may help stabilise the variance.

    (c) For house prices vs floor area: small houses have relatively consistent pricing, but large luxury homes vary widely in price due to location, finishes, and prestige — creating increasing spread (fan shape) for larger floor areas.

    (a) Dataset P: r² = (−0.95)² = 0.9025 ≈ 0.903. Dataset Q: r² = (0.82)² = 0.6724 ≈ 0.672.

    (b) Dataset P has the better linear fit — r² = 0.903 means 90.3% of variation in y is explained, compared to only 67.2% for Dataset Q.

    (c) No — the negative sign of r does not make the model worse. r² ignores the sign of r. A strong negative linear relationship fits the data just as well as a strong positive one.
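
    The comparison in question 8 can be sketched in Python as follows. The dictionary names are chosen here for illustration only.

```python
# Correlation coefficients given in question 8
r_values = {"P": -0.95, "Q": 0.82}

# Squaring removes the sign, so r-squared reflects only the strength of r
r_squared = {name: round(r ** 2, 4) for name, r in r_values.items()}

# The dataset with the larger r-squared has the better linear fit
best = max(r_squared, key=r_squared.get)

for name, value in r_squared.items():
    print(f"Dataset {name}: r^2 = {value}")
print(f"Better linear fit: Dataset {best}")
```

    Because the sign disappears on squaring, Dataset P comes out ahead even though its correlation is negative.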

  9. Predicted score at x = 6: ŷ = 40 + 5.2(6) = 40 + 31.2 = 71.2

    (a) Student 1: e = 72 − 71.2 = +0.8

    (b) This student scored 0.8 marks above the model’s prediction for someone who studied 6 hours. Their actual score is slightly above the regression line.

    (c) Student 2: e = 61 − 71.2 = −10.2. This student scored 10.2 marks below the model’s prediction — well below the regression line. Study hours alone do not fully explain their performance; other factors may have affected their result.
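
    The two residuals in question 9 and their interpretation can be reproduced with a small Python sketch. The helper names predict and describe are assumptions made for this example.

```python
# Regression line from question 9: predicted score = 40 + 5.2 * hours studied
def predict(hours):
    return 40 + 5.2 * hours

def describe(actual, hours):
    """Return the residual (1 d.p.) and where the point sits relative to the line."""
    e = round(actual - predict(hours), 1)
    if e > 0:
        position = "above"
    elif e < 0:
        position = "below"
    else:
        position = "on"
    return e, position

# Both students studied 6 hours
for label, score in [("Student 1", 72), ("Student 2", 61)]:
    e, position = describe(score, 6)
    print(f"{label}: e = {e:+.1f} ({position} the regression line)")
```

    A positive residual places the student above the line, a negative one below it.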