Prediction and Interpolation
Key Terms
- Interpolation
- Predicting ŷ for an x-value WITHIN the observed data range; generally reliable.
- Extrapolation
- Predicting ŷ for an x-value OUTSIDE the data range; unreliable because the relationship may not continue.
- Prediction
- Substitute x into ŷ = a + bx to obtain a predicted y-value.
- Coefficient of determination r²
- r² × 100% of the variation in y is explained by the linear model; the remaining (1 − r²) × 100% is unexplained.
- Reliability
- The closer |r| is to 1 and the more appropriate the model, the more reliable the prediction within the data range.
- Residual
- The difference between the actual y-value and the predicted ŷ-value at a given x: residual = y − ŷ.
Given the LSRL ŷ = a + bx, substitute an x-value to obtain a predicted y-value.
Interpolation: predicting within the observed range of x — generally reliable.
Extrapolation: predicting outside the observed range of x — unreliable; the relationship may not continue.
Coefficient of Determination:
r² = proportion of variation in y explained by the linear relationship with x
e.g. r² = 0.81 means 81% of the variation in y is explained by the linear relationship with x.
Interpreting r²
If r = 0.9, then r² = 0.81. This means 81% of the variation in the response variable is explained by the linear relationship with the explanatory variable. The remaining 19% is not explained by the model and is due to other factors or random variation.
The higher r², the better the linear model fits the data. r² = 1 means a perfect fit; r² = 0 means no linear relationship.
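The arithmetic above can be sketched in a few lines of Python. The function name `explained_variation` is made up for this illustration and is not part of any library.

```python
def explained_variation(r):
    """Return r squared plus the explained and unexplained percentages of variation in y."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    r_squared = r ** 2
    explained_pct = 100 * r_squared
    return r_squared, explained_pct, 100 - explained_pct


# The sign of r is lost when squaring, so r = 0.9 and r = -0.9 give the same r².
r_squared, explained, unexplained = explained_variation(0.9)
print(f"r² = {r_squared:.2f}: {explained:.0f}% explained, {unexplained:.0f}% unexplained")
```

Note that r² says nothing about direction, so r is still needed to tell a positive association from a negative one.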
Worked Example 1
A study of 12 Perth apartments gives the LSRL: ŷ = 180 + 2.4x, where x = floor area (m²) and y = weekly rent ($). The data ranges from x = 40 to x = 120 m². Technology gives r = 0.92.
Predict rent for 85 m²: ŷ = 180 + 2.4(85) = 180 + 204 = $384/week. This is interpolation (85 is within 40–120) and r = 0.92 is strong, so the prediction is generally reliable.
Interpret r²: r² = 0.92² = 0.8464. About 84.6% of variation in weekly rent is explained by floor area.
Worked Example 2
Using the same equation, predict rent for x = 200 m²: ŷ = 180 + 2.4(200) = $660/week. This is extrapolation (200 is well outside 40–120) — the prediction is unreliable. The relationship may not be linear beyond the data range.
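Both worked examples follow the same two steps: substitute x, then compare x with the data range. Here is a minimal Python sketch of that routine, using the apartment equation and range from the examples. The function name `predict` and its parameters are assumptions chosen for illustration.

```python
def predict(a, b, x, x_min, x_max):
    """Substitute x into y-hat = a + b*x and classify the prediction by the data range."""
    y_hat = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind


# LSRL from the worked examples: rent = 180 + 2.4 * area, data from 40 to 120 m².
for area in (85, 200):
    rent, kind = predict(180, 2.4, area, 40, 120)
    note = "generally reliable" if kind == "interpolation" else "treat as unreliable"
    print(f"{area} m²: ${rent:.0f}/week ({kind}, {note})")
```

The range check only tells you whether the model was fitted on comparable data. How far the prediction can be trusted within the range still depends on r² and on the linear model being appropriate.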
Full Lesson: Prediction and Interpolation
1. From Correlation to Prediction
Once we have established a linear association between two variables and fitted a least squares regression line, the natural next step is to use that line for prediction. The equation ŷ = a + bx allows us to substitute any x-value and obtain a predicted y-value. However, not all predictions are equally trustworthy — the reliability depends critically on where the x-value sits relative to the original data.
2. Interpolation: Predicting Within the Data Range
Interpolation means predicting for x-values that fall inside the range of the observed data. For example, if we collected data on house sizes ranging from 60 m² to 150 m², predicting the price of a 100 m² house is interpolation. This is generally reliable because we have direct evidence that the linear model holds across that range — the data points themselves support the shape of the relationship in that region.
3. Extrapolation: Predicting Beyond the Data Range
Extrapolation means predicting for x-values outside the observed range. This is inherently risky. Just because a relationship is linear from 60 to 150 m² does not mean it continues linearly beyond those bounds. In fact, many real-world relationships eventually level off, reverse, or become non-linear at extreme values. A classic example: a medication may reduce blood pressure linearly up to a certain dose, but beyond that dose the effect might plateau or the medication might become dangerous. Always flag extrapolated predictions as unreliable.
4. The Coefficient of Determination r²
Pearson's r tells us the direction and strength of the linear association. Its square, r², gives us a proportion that is arguably even more informative: it tells us what fraction of the total variability in the y-values is “explained” by the linear relationship with x. If r = 0.85, then r² = 0.7225, meaning 72.25% of the variation in y is accounted for by the linear model. The remaining 27.75% is due to other factors not captured by x.
Interpreting r² in context is important. A high r² (say, above 0.75) suggests the linear model is a strong predictor. A low r² (below 0.25) means most variation in y is unexplained by x alone. Always state r² as a percentage with a contextual interpretation: “72% of the variation in exam scores is explained by the linear relationship with study hours.”
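To see why r² is read as "proportion of variation explained", it can help to compute it two ways on a small dataset: by squaring r, and as 1 − (sum of squared residuals) ÷ (total variation in y). The sketch below uses invented study-hours data purely for illustration. The numbers are hypothetical and do not come from any study in this lesson.

```python
from math import sqrt

# Hypothetical data, invented for illustration only.
hours = [1, 2, 3, 4, 5, 6]
scores = [40, 48, 52, 63, 68, 77]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in scores)  # total variation in y
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))

# Pearson's r and the least squares line y-hat = a + b*x.
r = s_xy / sqrt(s_xx * s_yy)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual = actual y minus predicted y-hat; their squares are the unexplained variation.
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]
ss_res = sum(e ** 2 for e in residuals)
explained_fraction = 1 - ss_res / s_yy

print(f"r = {r:.4f}, r² = {r ** 2:.4f}")
print(f"1 - SS_res/SS_total = {explained_fraction:.4f}")
```

For a least squares line the two printed values agree, which is what justifies the "explained variation" wording.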
5. Predictions in Real-World Context
When making a prediction, always state clearly: (a) the x-value used, (b) the predicted y-value with appropriate units, (c) whether it is interpolation or extrapolation, and (d) a comment on reliability. Rounding predictions sensibly matters too — predicting rent to the nearest dollar is fine; predicting to the nearest cent is absurd given the uncertainty in the model.
6. The Regression Line Always Passes Through (x̄, ȳ)
A key property is that the least squares regression line always passes through the point (x̄, ȳ), the means of both variables. This means predicting at x = x̄ will always give ŷ = ȳ. This also gives a useful check: if you substitute the mean of x, you should get the mean of y, up to rounding in the coefficients a and b.
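The mean-point property follows from the least squares formula for the intercept, a = ȳ − b·x̄. A short Python sketch can confirm it numerically and show why rounded coefficients only give approximately ȳ. The floor-area and rent values below are hypothetical, made up for this check.

```python
# Hypothetical data, invented for illustration only.
floor_area = [40, 55, 70, 85, 100, 120]
rent = [270, 320, 345, 390, 415, 470]

n = len(floor_area)
x_bar = sum(floor_area) / n
y_bar = sum(rent) / n

# Least squares slope and intercept.
s_xx = sum((x - x_bar) ** 2 for x in floor_area)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(floor_area, rent))
b = s_xy / s_xx
a = y_bar - b * x_bar

# Predicting at the mean of x returns the mean of y.
y_hat_at_mean = a + b * x_bar

# With coefficients rounded as they would be in a written equation,
# the check holds only approximately.
y_hat_rounded = round(a, 1) + round(b, 2) * x_bar

print(f"mean of y:               {y_bar:.3f}")
print(f"y-hat at mean of x:      {y_hat_at_mean:.3f}")
print(f"with rounded a and b:    {y_hat_rounded:.3f}")
```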
7. Common Errors and Limitations
Students sometimes confuse r and r², or incorrectly state “r² = 0.81 means the correlation is 81%”. The correct statement is: “81% of the variation in y is explained by the linear relationship with x.” Another common error is failing to check the data range before making a prediction and presenting an extrapolated value as reliable. Always ask: “Is my x-value inside the range of the data used to fit the line?”
Mastery Practice
-
A regression line for predicting a student's exam score (y) from hours of study (x) is ŷ = 32 + 7.5x. Data was collected for students studying between 1 and 10 hours.
- Predict the exam score for a student who studies for 6 hours.
- Is this interpolation or extrapolation? Explain.
-
Technology gives r = 0.85 for a dataset. Calculate r² and interpret it in the context: x = daily exercise time (minutes), y = resting heart rate (bpm).
-
The LSRL for predicting weekly sales (y, units) from advertising spend (x, $'000) is ŷ = 120 + 18.5x. The data ranged from $1,000 to $8,000 advertising spend.
- Predict sales when $5,000 is spent on advertising.
- Predict sales when $15,000 is spent. Explain why this prediction should be treated with caution.
-
A regression analysis gives r = 0.73. State the value of r² and interpret what it tells us about the linear model.
-
A marine biologist records the water temperature (x, °C) and the number of coral bleaching events (y) at 14 reef sites around Queensland. The LSRL is ŷ = −12.4 + 1.8x, with r = 0.88. Data ranges from 22°C to 30°C.
- Calculate r² and interpret it in context.
- Predict the number of bleaching events at 26°C. Is this reliable?
- What does the slope of 1.8 tell us?
-
A fitness study measures the relationship between body mass index (BMI, x) and blood pressure (y, mmHg). The LSRL is ŷ = 68 + 1.9x, r² = 0.61. Data collected for BMI values 18–38.
- Predict blood pressure for BMI = 25.
- A patient has BMI = 45. Comment on predicting their blood pressure using this equation.
- Interpret r² = 0.61 in this context.
-
Two datasets both have r² = 0.64, but Dataset A has 8 data points and Dataset B has 80 data points. Both use x-values ranging from 10 to 50.
- What does r² = 0.64 tell us about both datasets?
- For which dataset is a prediction at x = 30 more reliable? Explain.
- Both datasets are used to predict at x = 55. Which prediction is more concerning, and why?
-
A transport researcher models the relationship between vehicle speed (x, km/h) and fuel consumption (y, L/100km) on Australian highways. The LSRL is ŷ = 4.2 + 0.052x, r = 0.91. Data collected from 60 km/h to 110 km/h. The mean speed is x̄ = 88 km/h.
- Use the fact that the line passes through (x̄, ȳ) to find ȳ.
- Calculate r² and interpret it.
- A driver travels at 95 km/h. Predict their fuel consumption.
-
An agricultural scientist studies the effect of fertiliser application (x, kg/ha) on crop yield (y, tonnes/ha) across 20 farms. The LSRL is ŷ = 1.8 + 0.034x, r = 0.87. Data ranges from x = 50 to x = 200 kg/ha.
- Interpret the slope and y-intercept in context. Is the y-intercept meaningful?
- Calculate r² and explain what the remaining unexplained variation might be due to.
- A farm uses 250 kg/ha of fertiliser. A colleague says “the model predicts 10.3 tonnes/ha — that's a reliable forecast.” Evaluate this claim fully.
- What y-value would be predicted at the mean fertiliser rate x̄ = 125 kg/ha?
-
A climate scientist analyses the relationship between CO₂ concentration (x, ppm) and average global temperature anomaly (y, °C above pre-industrial baseline) using data from 1960 to 2020. The LSRL is ŷ = −10.2 + 0.030x, with r² = 0.93. The CO₂ values range from 316 ppm (1960) to 412 ppm (2020).
- Interpret r² = 0.93 in this context.
- Predict the temperature anomaly when CO₂ = 380 ppm.
- The 2030 forecast projects CO₂ at 445 ppm. Predict the temperature anomaly and discuss the reliability of this forecast.
- A journalist writes: “Since CO₂ and temperature are strongly correlated (r² = 0.93), CO₂ definitely causes global warming.” Evaluate this statement statistically.