Solutions: Residual Analysis

(a) ŷ = 5 + 12(4) = 5 + 48 = 53 units

(b) e = 53 − 53 = 0

(c) The residual is zero — the actual value lies exactly on the regression line.

Step 1: Substitute x = 4 into the regression equation ŷ = 5 + 12x.

ŷ = 5 + 12 × 4 = 5 + 48 = 53

Step 2: Residual = actual − predicted = 53 − 53 = 0.

Step 3: A residual of zero means the data point falls exactly on the regression line — the model predicted this week’s sales perfectly.

Note: This is a coincidence; most residuals will not be zero.
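
The calculation above can be checked with a few lines of Python. This is only a minimal sketch: the function name `predict` and the variable names are illustrative choices, not part of the original question.

```python
# Regression line from the solution: y-hat = 5 + 12x
def predict(x):
    return 5 + 12 * x

actual = 53                     # observed sales when x = 4
predicted = predict(4)          # 5 + 48 = 53
residual = actual - predicted   # residual = actual - predicted

print(predicted, residual)
```

A residual of 0 confirms that the observed point sits on the line.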

Regression equation: ŷ = 2.4 + 0.8x

x	Actual y	Predicted ŷ	Residual e = y − ŷ
5	7.2	6.4	+0.8
10	9.8	10.4	−0.6
15	14.9	14.4	+0.5
20	18.1	18.4	−0.3

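Sum of residuals = 0.8 − 0.6 + 0.5 − 0.3 = +0.4. The sum is not exactly zero here, most likely because the coefficients are rounded or because these four points are only part of the data set. For a least-squares line, the residuals over all the data used in the fit sum to exactly zero.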
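
The table can be reproduced with a short script. This is a sketch that assumes the four rows shown are the only values of interest; rounding to one decimal place is applied to match the table and to hide floating-point noise.

```python
# Regression line from the solution: y-hat = 2.4 + 0.8x
xs = [5, 10, 15, 20]
actual = [7.2, 9.8, 14.9, 18.1]

predicted = [2.4 + 0.8 * x for x in xs]

# residual = actual - predicted, rounded to 1 d.p. as in the table
residuals = [round(y - y_hat, 1) for y, y_hat in zip(actual, predicted)]
total = round(sum(residuals), 1)

for x, y, y_hat, e in zip(xs, actual, predicted, residuals):
    print(x, y, round(y_hat, 1), e)
print("sum of residuals:", total)
```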

(a) The inverted U-shape indicates the linear model is not appropriate — the residuals show a systematic curved pattern, meaning the model over-predicts at the extremes and under-predicts in the middle.

(b) A quadratic (parabolic) model would be more appropriate, since the residual pattern indicates a curved (concave-down) relationship in which the data may rise and then fall.

(c) A square root transformation of y, or fitting y as a quadratic function of x, could be tried.

(a) The inverted U-shape means residuals are negative, then positive, then negative again as x increases. This systematic pattern is a clear sign of non-linearity — the linear model is consistently wrong in predictable ways across the range of x. A good residual plot would show random scatter with no trend.

(b) An inverted U-shape in the residuals suggests the relationship is concave down: y increases more slowly than a straight line allows as x grows, and may peak at some middle x-value. A quadratic (parabolic) model of the form y = a + bx + cx² (with c < 0) would capture this shape.

(c) Applying a square root transformation to y or treating x² as an additional predictor are both reasonable approaches to address the curvature.
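
To see where the inverted U-shape comes from, it can help to fit a straight line to data that are known to be curved. The sketch below uses hypothetical data generated from y = 12x − x² (not data from the question) and computes the least-squares line by hand.

```python
# Hypothetical curved data: y = 12x - x^2 (concave down, peaks at x = 6)
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [12 * x - x**2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept for a straight line
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# residual = actual - predicted
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
signs = ["+" if e > 0 else "-" for e in residuals]
print(signs)
```

The signs run negative, then positive, then negative as x increases, which is the inverted U described in part (a). Because this line is a least-squares fit to all nine points, the residuals also sum to zero (up to floating-point error).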

(a) r² = 0.91² = 0.8281 ≈ 0.828

(b) Approximately 82.8% of the variation in fuel consumption is explained by the linear relationship with speed.

(c) 100% − 82.8% = 17.2% of the variation is not explained by speed (due to other factors such as road conditions, tyre pressure, and driving style).

Step 1: Square the correlation coefficient. r² = (0.91)² = 0.8281.

Step 2: Express as a percentage: 0.8281 × 100% = 82.81% ≈ 82.8%.

Step 3: Interpret. We say “82.8% of the variation in fuel consumption is explained by the linear relationship with vehicle speed.”

Step 4: Unexplained variation = 100% − 82.8% = 17.2%. This represents the influence of other variables not included in the model.
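
Steps 1 to 4 can be sketched in Python as follows. The variable names are illustrative, and rounding to one decimal place mirrors the rounding used in the solution.

```python
r = 0.91                       # correlation coefficient from the question

r_squared = r ** 2             # Step 1: square r
explained_pct = round(r_squared * 100, 1)        # Step 2: as a percentage
unexplained_pct = round(100 - explained_pct, 1)  # Step 4: what is left over

print(round(r_squared, 4), explained_pct, unexplained_pct)
```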

(a) Model B better explains variation in y. It has r² = 0.93, explaining 93% of variation in y, versus only 74% for Model A. Its residual plot also shows random scatter, confirming it is appropriate.

(b) The residual plot for Model A should be examined. If it shows a pattern, this confirms Model A is not appropriate even at r² = 0.74.

(c) r² = 0.93 means 93% of the variation in the response variable is explained by the transformed linear model; only 7% remains unexplained.

(a) Model B is superior on two counts: (i) it has a higher r² (0.93 vs 0.74), explaining substantially more of the variation in y; (ii) its residual plot shows random scatter, which confirms the model structure is appropriate. Model A’s residual plot showing a curved pattern is a serious problem, regardless of its r² value.

(b) The residual plot for Model A must be checked. A curved residual plot, even paired with a moderate r², means the linear model is systematically wrong and should not be used for predictions.

(c) In plain English: if you know x (the explanatory variable in its transformed form), you can account for 93% of the differences between individuals’ y-values. The model is highly informative.

(a) r² = 0.98² = 0.9604 ≈ 0.960

(b) Yes, the log transformation is very successful. r = 0.98 (r² = 0.960) means 96% of variation in log(y) is explained by a linear relationship with x. To confirm, the residual plot of the transformed model should show random scatter.

(c) The log transformation being successful suggests that plant height grows exponentially over time — increasing by a constant percentage per day rather than a constant number of centimetres.

(a) r² = (0.98)² = 0.9604. Expressed as a percentage: 96.04% ≈ 96%.

(b) r = 0.98 is extremely close to 1, indicating a near-perfect linear relationship between log(y) and x after transformation. This is strong evidence the transformation was successful. To be complete, also check the residual plot of the transformed model for random scatter with no pattern.

(c) When log(y) is linearly related to x, the original relationship between y and x is exponential: y = ab^x for some constants a and b. This means plant height multiplies by the same factor b each day, which is exponential growth.
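
The link between a straight line on the log scale and a constant growth factor can be checked numerically. The sketch below assumes base-10 logs and uses hypothetical coefficients c = 0.5 and d = 0.1 for the line log10(y) = c + dx; neither value comes from the question.

```python
# Hypothetical fitted line on the log scale: log10(y) = c + d*x
c, d = 0.5, 0.1

def height(x):
    # Undo the log: y = 10^(c + d*x)
    return 10 ** (c + d * x)

a = 10 ** c   # y = a * b^x with a = 10^c ...
b = 10 ** d   # ... and b = 10^d

# Growth factor from one day to the next
ratios = [height(day + 1) / height(day) for day in range(5)]
print(b, ratios)
```

Every ratio equals b (up to floating-point error), so the height is multiplied by the same factor each day.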

(a) The fan shape indicates non-constant variability (heteroscedasticity): the spread of residuals increases as x increases.

(b) A linear model is not fully appropriate. The non-constant spread violates regression assumptions, making predictions less reliable for large x-values. A log(y) transformation may help stabilise the variance.

(c) For house prices vs floor area: small houses have relatively consistent pricing, but large luxury homes vary widely in price due to location, finishes, and prestige — creating increasing spread (fan shape) for larger floor areas.

(a) Heteroscedasticity means the variability (spread) of the response variable around the regression line is not constant — it changes as x changes. A fan shape (wide at the right, narrow at the left) is the classic pattern. This is a violation of a key assumption of least-squares regression.

(b) The model may still predict y reasonably on average, but confidence intervals and prediction intervals will be unreliable — they will be too narrow for large x and too wide for small x. A log(y) transformation often helps, because taking logs compresses the large values more than the small ones, reducing the spread at the high end.

(c) In the house price context: prices for a 150 m² home fall in a fairly narrow range (say, $500,000–$600,000 depending on suburb). A 500 m² home could range from $1.5M to $10M depending on location, pool, views, and finishes. This produces a fan-shaped residual pattern in a price vs area regression.
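
The effect of a log transformation on spread can be illustrated with the two price ranges quoted in part (c). This is a rough sketch only: it compares the width of each range before and after taking base-10 logs, and the variable names are illustrative.

```python
import math

small_home = (500_000, 600_000)       # price range for the smaller home
large_home = (1_500_000, 10_000_000)  # price range for the larger home

def width(lo, hi):
    return hi - lo

# How many times wider is the large-home range than the small-home range?
raw_ratio = width(*large_home) / width(*small_home)

log_small = width(math.log10(small_home[0]), math.log10(small_home[1]))
log_large = width(math.log10(large_home[0]), math.log10(large_home[1]))
log_ratio = log_large / log_small

print(raw_ratio, round(log_ratio, 1))
```

On the raw scale the larger home's range is 85 times wider. On the log scale the gap shrinks to roughly a factor of 10, so the spread is more even, although not perfectly constant.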

(a) Dataset P: r² = (−0.95)² = 0.9025. Dataset Q: r² = (0.82)² = 0.6724

(b) Dataset P has the better linear fit — r² = 0.903 means 90.3% of variation in y is explained, compared to only 67.2% for Dataset Q.

(c) No — the negative sign of r does not make the model worse. r² ignores the sign of r. A strong negative linear relationship fits the data just as well as a strong positive one.

(a) Dataset P: r² = (−0.95)² = 0.9025 ≈ 0.903 (90.3%). Dataset Q: r² = (0.82)² = 0.6724 ≈ 0.672 (67.2%).

(b) r² of 0.903 means 90.3% of the variation in y is explained by the linear model for Dataset P. Dataset Q only achieves 67.2%. Dataset P has the better fit by a substantial margin.

(c) The negative sign of r simply means y decreases as x increases (a negative slope). It says nothing about the quality of the linear fit. Whether r = +0.95 or r = −0.95, r² = 0.9025 and the goodness of fit is identical. The sign of r gives the direction of the relationship; its magnitude gives the strength.
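
A short script makes the comparison concrete and shows that the sign of r drops out when it is squared. The dictionary layout and names are illustrative only.

```python
r_values = {"P": -0.95, "Q": 0.82}   # correlation coefficients from the question

# r squared for each dataset, rounded to 4 d.p.
r_squared = {name: round(r ** 2, 4) for name, r in r_values.items()}

# Dataset with the larger r squared has the better linear fit
best = max(r_squared, key=r_squared.get)

print(r_squared, best)
print((-0.95) ** 2 == (0.95) ** 2)   # squaring removes the sign
```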

Predicted score at x = 6: ŷ = 40 + 5.2(6) = 40 + 31.2 = 71.2

(a) Student 1: e = 72 − 71.2 = +0.8

(b) This student scored 0.8 marks above the model’s prediction for someone who studied 6 hours. Their actual score is slightly above the regression line.

(c) Student 2: e = 61 − 71.2 = −10.2. This student scored 10.2 marks below the model’s prediction — well below the regression line. Study hours alone do not fully explain their performance; other factors may have affected their result.

Step 1: Calculate the predicted value for x = 6.

ŷ = 40 + 5.2 × 6 = 40 + 31.2 = 71.2

Step 2: Student 1 residual = actual − predicted = 72 − 71.2 = +0.8.

A positive residual means the actual score is above the regression line. This student performed slightly better than the model expected based on their study hours.

Step 3: Student 2 residual = 61 − 71.2 = −10.2.

A large negative residual means this student performed substantially below what the model predicted. Despite studying for 6 hours (same as Student 1), they scored 10.2 marks below the model’s expectation. Possible reasons: anxiety, illness, poor sleep, prior knowledge gaps. This illustrates that the regression line gives an average prediction — individuals can vary significantly.
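
Both residuals can be computed in one pass. This is a minimal sketch; the function name `predict` and the dictionary of scores are illustrative, and results are rounded to one decimal place to match the worked solution.

```python
# Regression line from the solution: y-hat = 40 + 5.2x
def predict(hours):
    return 40 + 5.2 * hours

scores = {"Student 1": 72, "Student 2": 61}   # both studied for 6 hours
predicted = round(predict(6), 1)

# residual = actual - predicted
residuals = {name: round(score - predicted, 1) for name, score in scores.items()}

print(predicted, residuals)
```

A positive value places the student above the regression line and a negative value places them below it.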

The y vs log(x) transformed model should be used. The reasoning is as follows:

Statistical evidence: The transformed model has r² = 0.89, meaning it explains 89% of the variation in reaction time. The linear model only explains 71% (r² = 0.71). The transformed model is substantially more informative.

Residual plot evidence: The linear model’s residual plot shows a curved pattern, which is a direct indication that the model is not appropriate — it systematically over- or underestimates across the range of x. The transformed model’s residual plot shows random scatter, confirming the model structure is suitable.

Contextual support: The logarithmic model is consistent with what we know about skill learning. Early practice sessions produce large improvements (reductions) in reaction time, while later sessions produce diminishing returns, which is exactly what a y vs log(x) relationship describes.

Conclusion: Both the statistical evidence (higher r² and random residual plot) and the contextual reasoning support using the y vs log(x) model for this data.