Residual Analysis

Key Terms

Residual: e_i = y_i − ŷ_i (actual − predicted); positive = above line, negative = below line.
Sum of residuals: Always equals zero for a least-squares regression line.
Residual plot: Plots residuals vs x-values; used to assess whether a linear model is appropriate.
Random scatter: If the residual plot shows random scatter with no pattern, the linear model is appropriate.
Pattern in residuals: A curved pattern indicates a non-linear model would be more appropriate.
Coefficient of determination r²: r² × 100% of the variation in y is explained by the linear relationship with x.

Residual
e_i = y_i − ŷ_i (actual − predicted)

Coefficient of Determination
r² = (Pearson's r)²
Interpretation: r² × 100% of the variation in y is explained by the linear relationship with x.

Understanding Residuals

A positive residual means the actual value is above the regression line (the line underestimates).
A negative residual means the actual value is below the regression line (the line overestimates).
The sum of all residuals is always zero for a least-squares regression line.

Residual Plots

A residual plot graphs each residual (e_i) against the corresponding x-value (or fitted value ŷ_i). It is used to check whether a linear model is appropriate.

Random scatter with no pattern → linear model is appropriate.
Curved pattern → the relationship is non-linear; a linear model is not ideal.
Fan shape (increasing spread) → the assumption of constant variability is violated.

Linearising Data

If residuals show a pattern, a transformation may straighten the data:

Try log(y) vs x for exponential-type growth
Try y vs log(x) for logarithmic growth
Try log(y) vs log(x) for power-type relationships
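
Taking logarithms of an assumed model shows why each transformation can straighten the data. The following is a sketch only: the constants a and b are introduced here for illustration (with a > 0 and b > 0 wherever a logarithm is taken), and any log base gives the same conclusion.

```latex
\begin{align*}
y = a\,b^{x} &\;\Longrightarrow\; \log y = \log a + (\log b)\,x
  && \text{(linear in } x\text{)} \\
y = a + b\log x &&& \text{(already linear in } \log x\text{)} \\
y = a\,x^{b} &\;\Longrightarrow\; \log y = \log a + b\,\log x
  && \text{(linear in } \log x\text{)}
\end{align*}
```

In each case the transformed variables are related by a straight line, so an ordinary linear regression can be fitted to them.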

Worked Example 1

A researcher fits a linear regression to 10 data points and obtains the following residuals (in order of x):

+3.1, +2.8, +1.4, −0.2, −1.8, −3.5, −2.9, −1.2, +1.6, +3.4

Identify the pattern: The residuals start positive, become negative in the middle, then return to positive. This U-shaped (curved) pattern suggests the relationship is non-linear (possibly quadratic). A linear model is not appropriate; a transformation or a curved model should be used.
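
The sign pattern in this example can also be checked mechanically. The Python sketch below groups the residuals into runs of the same sign. The helper name `sign_runs` is illustrative, and a few long runs are only a hint of curvature, not a formal test.

```python
# Residuals from Worked Example 1, listed in order of increasing x.
residuals = [3.1, 2.8, 1.4, -0.2, -1.8, -3.5, -2.9, -1.2, 1.6, 3.4]


def sign_runs(values):
    """Group consecutive residuals with the same sign into (sign, length) runs."""
    runs = []
    for v in values:
        sign = "+" if v >= 0 else "-"
        if runs and runs[-1][0] == sign:
            # Same sign as the previous residual: extend the current run.
            runs[-1] = (sign, runs[-1][1] + 1)
        else:
            # Sign changed (or first value): start a new run.
            runs.append((sign, 1))
    return runs


runs = sign_runs(residuals)
print(runs)  # [('+', 3), ('-', 5), ('+', 2)]
```

Ten residuals falling into only three runs (positive, negative, positive) is what the U-shape looks like numerically. Random scatter would usually switch sign far more often.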

Worked Example 2

For a dataset, technology gives r = 0.94.

(a) Calculate r²: r² = 0.94² = 0.8836 ≈ 0.884

(b) Interpret: Approximately 88.4% of the variation in y is explained by the linear relationship with x. The remaining 11.6% is due to other factors or random variation.

Hot Tip: r² is always between 0 and 1. If r = 0.9, then r² = 0.81, meaning 81% of variation in y is explained — not 90%. Students often confuse r and r². Always square r to get the coefficient of determination.

Full Lesson: Residual Analysis

1. What Is a Residual?

Once we fit a least squares regression line to data, we can ask: how well does the line actually predict each data point? The answer is captured by the residual — the vertical distance between each actual data point and the predicted value on the regression line. Formally, the residual for observation i is e_i = y_i − ŷ_i, where y_i is the actual observed value and ŷ_i is the value predicted by the regression equation for that x-value.

Positive residuals indicate that the actual value lies above the line (the model underestimated), while negative residuals indicate the actual value lies below (the model overestimated). For a least-squares regression line with an intercept, the residuals always sum to exactly zero: the positive and negative deviations cancel. This follows from how the line is fitted rather than from the definition of a residual, and it is a useful check on your calculations (allowing for small rounding differences).
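
The zero-sum property can be checked numerically. The sketch below fits a least-squares line by the usual formulas and sums the residuals. The data are hypothetical, invented only for illustration, and the function name `least_squares` is an assumption rather than anything from the lesson.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b


# Hypothetical data, invented purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
total = sum(residuals)
print(f"a = {a:.3f}, b = {b:.3f}, sum of residuals = {total:.10f}")
```

In floating-point arithmetic the sum comes out as a very small number rather than an exact zero, which is the same effect as rounding by hand.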

2. Calculating Residuals

To calculate residuals, first fit the regression line using technology to obtain the equation ŷ = a + bx. Then substitute each x-value into the equation to find the predicted value, and subtract it from the actual y-value. For example, if the regression line is ŷ = 5 + 2x and a data point is (3, 14), then ŷ = 5 + 2(3) = 11, and the residual is e = 14 − 11 = +3. This positive residual tells us the actual value is 3 units above the line at that point.
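
The same arithmetic can be written as a short Python sketch. The function names `predict` and `residual` are illustrative, and the coefficients are those of the example line ŷ = 5 + 2x.

```python
def predict(x, a=5.0, b=2.0):
    """Predicted value from the example line y-hat = 5 + 2x."""
    return a + b * x


def residual(x, y):
    """Residual = actual - predicted."""
    return y - predict(x)


e = residual(3, 14)
print(e)  # 3.0
```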

Technology (ClassPad, spreadsheet) can calculate all residuals simultaneously and store them for plotting. On ClassPad, after performing linear regression, the residuals are available as a list that can be plotted directly against the x-values.

3. Residual Plots and Model Checking

A residual plot is a scatterplot of residuals (e_i) versus x-values (or fitted values ŷ_i). It is the single most important diagnostic tool for checking whether a linear model is appropriate. The key insight is this: if the linear model is correct, then the residuals should be purely random noise — they should show no systematic pattern. A residual plot with random scatter around the horizontal line e = 0, with roughly equal spread throughout, confirms that the linear model is appropriate.

However, if there is a pattern in the residual plot, this is a signal that the linear model is missing something. A curved or U-shaped pattern in the residuals suggests the true relationship is non-linear — perhaps quadratic or exponential. A fan shape (where the spread of residuals increases as x increases) suggests that the variability of y is not constant across the range of x, violating an assumption of regression. Recognising these patterns and knowing what they mean is a key skill in residual analysis.
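
A fan shape can be given a rough numerical check by comparing the typical size of the residuals for small and large x. The sketch below is an informal heuristic chosen for illustration, not a standard test, and both residual lists are hypothetical. The residual plot itself remains the main evidence.

```python
def mean_abs(values):
    """Average size of the residuals, ignoring sign."""
    return sum(abs(v) for v in values) / len(values)


def spread_ratio(residuals):
    """Typical residual size in the upper half of x divided by the lower half.

    Assumes the residuals are already sorted by increasing x.
    """
    half = len(residuals) // 2
    return mean_abs(residuals[half:]) / mean_abs(residuals[:half])


# Hypothetical residuals, invented for illustration and ordered by x.
even_spread = [0.8, -1.1, 0.9, -0.7, 1.0, -0.9, 1.1, -0.8]
fan_shape = [0.2, -0.3, 0.4, -0.5, 1.5, -2.2, 3.1, -4.0]

even_ratio = spread_ratio(even_spread)
fan_ratio = spread_ratio(fan_shape)
print(f"even spread: {even_ratio:.2f}")
print(f"fan shape:   {fan_ratio:.2f}")
```

A ratio near 1 is consistent with roughly constant spread, while a ratio well above 1 matches the widening fan described above.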

4. The Coefficient of Determination (r²)

The coefficient of determination, r², is the square of Pearson's correlation coefficient. It is one of the most useful statistics in regression because it tells us the proportion of variation in the y-variable that is explained by the linear relationship with x. For example, if r = 0.87, then r² = 0.7569 ≈ 0.757, meaning that about 75.7% of the total variation in y-values can be attributed to the linear relationship with x. The remaining 24.3% is due to other variables, randomness, or measurement error.
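
The "proportion of variation explained" reading can be checked directly. The sketch below computes r² in two ways: by squaring Pearson's r, and as 1 − SSE/SST. Here SSE (the sum of squared residuals) and SST (the total squared deviation of y from its mean) are notation introduced for this example, and the data are hypothetical.

```python
import math

# Hypothetical data, invented purely for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [3.2, 4.1, 6.8, 7.9, 11.2, 11.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # SST: total variation in y
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Pearson's r and its square.
r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2

# Least-squares line and the share of variation left in the residuals.
b = s_xy / s_xx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
explained = 1 - sse / s_yy

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}, 1 - SSE/SST = {explained:.4f}")
```

For a least-squares line the two values agree, which is why r² can be read as the fraction of the variation in y that the line accounts for.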

A high r² (close to 1) indicates the linear model explains most of the variation, so the regression line fits the data well. A low r² (close to 0) means the linear model has little predictive power. However, a high r² does not by itself confirm that a linear model is appropriate; you must also examine the residual plot to check for patterns. A curved relationship that rises or falls steadily across the whole range of x (for example, exponential growth) can still give a high r², because a straight line captures most of the overall trend while missing the curvature.

5. Linearising Data by Transformation

When the residual plot shows a curved pattern, or when the scatterplot of raw data is clearly non-linear, we can try to linearise the data by applying a mathematical transformation. Common transformations include: plotting log(y) versus x (useful when y grows or decays exponentially), plotting y versus log(x) (useful for diminishing-returns relationships), and plotting log(y) versus log(x) (useful for power functions). After transformation, we check whether the new scatterplot appears linear and whether the residual plot of the transformed data shows random scatter.

For example, if a dataset of bacterial population (y) over time in hours (x) gives a curved scatterplot, transforming to log(y) vs x might produce a straight line. We then fit a linear regression to the transformed data, and the resulting equation can be back-transformed to give a model for the original data. This technique is powerful but requires careful interpretation: logarithms are defined only for positive values, predictions from the transformed model are in log units until they are back-transformed, and the transformation should make sense in context.
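
The fit-then-back-transform procedure can be sketched in Python. The bacterial counts below are hypothetical and deliberately exact (the population doubles every hour), so the result is easy to verify. Base-10 logarithms are an assumption here; any base works if the same base is used for the back-transformation.

```python
import math

# Hypothetical bacterial counts, invented for illustration: the population
# starts at 100 and doubles every hour, so y = 100 * 2**x exactly.
hours = [0, 1, 2, 3, 4, 5]
population = [100, 200, 400, 800, 1600, 3200]

# Transform: work with log10(y) instead of y.
log_pop = [math.log10(y) for y in population]

# Least-squares line for log10(y) = a + b*x.
n = len(hours)
x_bar = sum(hours) / n
l_bar = sum(log_pop) / n
b = sum((x - x_bar) * (l - l_bar) for x, l in zip(hours, log_pop)) / sum(
    (x - x_bar) ** 2 for x in hours
)
a = l_bar - b * x_bar

# Back-transform: y = 10**(a + b*x) = (10**a) * (10**b)**x.
initial = 10 ** a
growth_factor = 10 ** b
print(f"log10(y) = {a:.4f} + {b:.4f}x")
print(f"y = {initial:.1f} * {growth_factor:.3f}^x")
```

The back-transformed constants recover the starting population and the hourly growth factor, showing how a straight line in log(y) corresponds to an exponential model for y.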

6. Comparing Models Using r²

When choosing between two models (e.g., linear vs transformed), compare their r² values. The model with the higher r² (and a residual plot showing random scatter) is generally preferred. However, a slightly higher r² does not always justify the complexity of a transformation — if both r² values are similar, the simpler linear model may be preferred. Always present both the numerical evidence (r² values) and the visual evidence (residual plots) when comparing models. This holistic approach is expected in exam responses.

Mastery Practice

A regression line is fitted to data on advertising spend (x, $'000) and weekly sales (y, units). The regression equation is ŷ = 5 + 12x. For one data point, x = 4 and the actual sales for that week were 53 units.
1. Calculate the predicted sales when x = 4.
2. Calculate the residual for this data point.
3. State whether the actual value is above or below the regression line.

The regression line for a dataset is ŷ = 2.4 + 0.8x. The following data points are given:

x	5	10	15	20
y (actual)	7.2	9.8	14.9	18.1

Calculate the residual for each data point and find their sum. Comment on whether the sum is close to zero, as expected for a least-squares line (small differences can arise when the coefficients are rounded).

A residual plot for a linear regression shows the following pattern: residuals are negative for small x-values, positive for middle x-values, and negative again for large x-values (an inverted U-shape).
1. Describe what this pattern tells you about the linear model.
2. What type of model might be more appropriate?
3. What transformation could be tried to linearise the data?

For a study of fuel consumption (y) and speed (x), technology gives r = 0.91.
1. Calculate r².
2. Interpret r² in the context of fuel consumption and speed.
3. What percentage of variation in fuel consumption is not explained by speed?

Two models are fitted to the same dataset. Model A (linear) has r² = 0.74. Model B (log transformation) has r² = 0.93, and its residual plot shows random scatter.
1. Which model better explains variation in y? Justify.
2. What additional check should be done on Model A before rejecting it?
3. What does r² = 0.93 mean in plain English?

A dataset on plant height (y, cm) after x days shows a curved scatterplot. After applying a log(y) vs x transformation, technology gives r = 0.98 for the transformed data.
1. Calculate r² for the transformed model.
2. Is the log transformation successful? How do you know?
3. What does this suggest about how plant height changes over time?

A residual plot shows that residuals are small and random for low x-values, but become much larger (more spread out) for high x-values, creating a fan shape.
1. What problem does this fan shape indicate?
2. Is a linear model appropriate here? Explain.
3. Suggest what might cause this pattern in a real-world context involving house prices and floor area.

A student is given two values of Pearson's r for two different datasets and must determine which has a better linear fit:
- Dataset P: r = −0.95
- Dataset Q: r = +0.82
1. Calculate r² for each dataset.
2. Which dataset has the better linear fit? Why?
3. Does the negative sign of r for Dataset P mean the model is worse? Explain.

A regression line is fitted to data on hours studied (x) and test score (y) for 12 students. The regression equation is ŷ = 40 + 5.2x. One student studied for 6 hours and scored 72.
1. Find this student's residual.
2. Interpret this residual in context.
3. Another student who studied for 6 hours scored 61. Find their residual and interpret it.

A researcher compares a linear model and a y vs log(x) transformed model for data on reaction time (y, ms) and practice sessions (x). Results:
- Linear model: r² = 0.71, residual plot shows a curved pattern.
- Transformed model: r² = 0.89, residual plot shows random scatter.
Write a statistical conclusion comparing the two models, recommending which should be used and fully justifying your answer.
