L29 — Lines of Best Fit
Key Terms
- Line of best fit: a straight line that minimises the sum of squared vertical distances from each data point to the line (the least-squares line).
- ŷ = a + bx: the equation of the regression line; ŷ (y-hat) is the predicted value of y for a given x.
- Slope b: the rate of change; for each 1-unit increase in x, y changes by b units on average.
- y-intercept a: the predicted value of y when x = 0; only meaningful if x = 0 is within (or near) the data range.
- Interpolation: predicting y for an x value within the observed range of data; generally reliable.
- Residual: the difference (actual y − predicted ŷ); positive residuals sit above the line, negative residuals below.
Line of best fit (least-squares regression line)
A line of best fit is a straight line drawn through a scatter plot to best represent the trend in the data. The least-squares line minimises the sum of squared vertical distances from each point to the line.
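Written in symbols, the least-squares idea can be sketched as follows. The index notation (x_i, y_i) for the n data points is an assumption introduced here for illustration; it is not used elsewhere in this lesson.

```latex
\text{Choose } a \text{ and } b \text{ to minimise } \quad
S(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2
```

Each bracket is the vertical distance from a data point to the line, so S is the sum of the squared vertical distances.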
Equation of the line of best fit
ŷ = a + bx
| Symbol | Meaning |
|---|---|
| ŷ (y-hat) | Predicted value of y |
| a | y-intercept (predicted y when x = 0) |
| b | Slope (change in y per unit increase in x) |
| x | Value of the explanatory variable |
Finding the equation
| Method | How |
|---|---|
| By hand (two points) | Read two points on the drawn line; find the slope b = (y₂ − y₁)/(x₂ − x₁), then substitute one point into ŷ = a + bx to find a |
| Calculator / technology | Enter data, run linear regression; gives a and b directly |
| Mean point | The least-squares line always passes through the mean point (x̄, ȳ) |
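As a rough sketch of what a calculator's linear regression does, the slope and intercept can be computed from the standard least-squares formulas b = Sxy / Sxx and a = ȳ − b·x̄. The function name `least_squares_line` and the sample data are hypothetical, chosen only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sxy: sum of products of deviations; Sxx: sum of squared x-deviations
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept, chosen so the line passes through the mean point
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x, so the fit is easy to check by hand
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = least_squares_line(xs, ys)
print(f"y_hat = {a:.2f} + {b:.2f}x")  # y_hat = 1.00 + 2.00x
```

Because the intercept is computed as ȳ − b·x̄, the fitted line passes through the mean point by construction.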
Interpolation and extrapolation
| Term | Meaning | Reliability |
|---|---|---|
| Interpolation | Predicting within the range of the data | Generally reliable |
| Extrapolation | Predicting outside the range of the data | Unreliable — use with caution |
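The distinction in the table can be sketched as a simple range check. The helper name `classify_prediction` and the observed x-values are hypothetical, used only to illustrate the idea.

```python
def classify_prediction(x_new, observed_xs):
    """Label a prediction by whether x_new lies inside the observed x-range."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"

observed_xs = [12, 18, 25, 31, 40]  # hypothetical observed x-values

print(classify_prediction(28, observed_xs))  # interpolation
print(classify_prediction(55, observed_xs))  # extrapolation
```

The check only says where the prediction falls relative to the data; judging how far outside the range is "too far" still depends on the context.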
Residuals
A residual = actual y − predicted ŷ. Positive residuals sit above the line; negative residuals sit below. A good fit has residuals randomly scattered around zero.
Worked Example 1 — Drawing a line of best fit by eye
Data: (1,5), (2,7), (3,9), (4,8), (5,12), (6,13). Draw a line of best fit and find its equation.
Step 1: Plot the points. The pattern is positive and roughly linear.
Step 2: Draw a line with roughly equal numbers of points above and below.
Step 3: Read two points on the line: approximately (1, 4) and (6, 14).
Step 4: Slope b = (14 − 4)/(6 − 1) = 10/5 = 2. Intercept: 4 = 2(1) + a ⇒ a = 2.
Equation: ŷ = 2 + 2x.
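Steps 3 and 4 can be sketched in a few lines of code. The function name `line_through` is a hypothetical helper; the two points are the ones read from the line in Step 3.

```python
def line_through(p1, p2):
    """Return (a, b) for the line y_hat = a + b*x through two points."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)  # slope = rise / run
    a = y1 - b * x1            # substitute one point to find the intercept
    return a, b

a, b = line_through((1, 4), (6, 14))
print(f"y_hat = {a:g} + {b:g}x")  # y_hat = 2 + 2x
```

A line drawn by eye depends on the two points chosen, so a calculator's least-squares line for the same data will usually differ slightly.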
Worked Example 2 — Using the equation to predict
A line of best fit for temperature (x, °C) vs ice cream sales (y, units) is ŷ = −20 + 15x. Predict sales when temperature is 28°C.
ŷ = −20 + 15(28) = −20 + 420 = 400 units.
Note: if 28°C is within the range of observed temperatures, this is interpolation and the prediction is generally reliable.
Worked Example 3 — Interpreting slope and intercept
ŷ = 30 + 4.5x, where x = hours of training, y = fitness score (out of 100).
Slope (b = 4.5): For each additional hour of training, fitness score increases by approximately 4.5 points.
Intercept (a = 30): A person who does zero hours of training is predicted to score 30. (Only meaningful if x = 0 is in the data range.)
Worked Example 4 — Calculating a residual
A student studied 4 hours and scored 78. The line of best fit gives ŷ = 2 + 15(4) = 62. Find the residual.
Residual = actual − predicted = 78 − 62 = +16.
This student scored 16 marks above what the model predicted, so their data point sits above the line.
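The same calculation can be sketched in code. The function name `residual` is a hypothetical helper; the values a = 2, b = 15, x = 4 and actual y = 78 come from this example.

```python
def residual(actual_y, x, a, b):
    """Residual = actual y minus the value predicted by y_hat = a + b*x."""
    predicted = a + b * x
    return actual_y - predicted

r = residual(78, 4, a=2, b=15)

# The sign of the residual tells us where the point sits relative to the line
if r > 0:
    position = "above"
elif r < 0:
    position = "below"
else:
    position = "on"

print(f"residual = {r:+g}; the point lies {position} the line")
# residual = +16; the point lies above the line
```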
Worked Example 5 — Mean point check
Data: x = {2, 4, 6, 8}, y = {10, 14, 20, 24}. Verify that (x̄, ȳ) lies on the least-squares line ŷ = 5 + 2.4x.
x̄ = (2+4+6+8)/4 = 5. ȳ = (10+14+20+24)/4 = 17.
Predicted: ŷ = 5 + 2.4(5) = 5 + 12 = 17 = ȳ. ✓ (The line passes exactly through the mean point.)
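As a cross-check, the sketch below fits the least-squares line to the same four data points and evaluates it at x̄. The variable names are hypothetical, and the formulas b = Sxy / Sxx and a = ȳ − b·x̄ are the standard least-squares ones rather than a method stated in this lesson.

```python
import math

xs = [2, 4, 6, 8]
ys = [10, 14, 20, 24]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Standard least-squares slope and intercept
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Evaluate the fitted line at the mean of x and compare with the mean of y
y_hat_at_mean = a + b * x_bar
on_line = math.isclose(y_hat_at_mean, y_bar, abs_tol=1e-9)

print(f"mean point = ({x_bar:g}, {y_bar:g})")
print(f"least-squares line: y_hat = {a:.2f} + {b:.2f}x")
print(f"passes through mean point: {on_line}")
```

Since the intercept is defined as ȳ − b·x̄, substituting x = x̄ always returns ȳ, which is why the least-squares line must pass through the mean point.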
Reading the line of best fit (Fluency)
- (a) A line of best fit passes through (2, 10) and (8, 28). Find its equation.
- (b) ŷ = 5 + 3x. Predict y when x = 7.
- (c) ŷ = 100 − 4x. What is the y-intercept? What is the slope?
- (d) ŷ = 12 + 2.5x. Predict y when x = 0. What does this represent?
Interpreting slope and intercept (Fluency)
- (a) ŷ = 50 + 6x, where x = weeks of exercise and y = fitness score. Interpret the slope.
- (b) ŷ = 200 − 3x, where x = years and y = resale value ($00s). Interpret the slope and intercept.
- (c) ŷ = 8 + 0.5x, where x = hours of sleep and y = alertness rating. What alertness is predicted for 7 hours?
- (d) ŷ = 60 − 2x, where x = number of absences and y = exam mark. What does a slope of −2 mean?
Residuals (Fluency)
- (a) Actual y = 45, predicted ŷ = 38. Find the residual. Is the point above or below the line?
- (b) ŷ = 10 + 4x. For x = 5, actual y = 25. Find the residual.
- (c) A point has residual = −8. Is the actual value higher or lower than predicted?
- (d) Why is the sum of residuals for a least-squares line always (approximately) zero?
Interpolation vs extrapolation (Fluency)
- (a) Data was collected for x = 10 to x = 50. A prediction is made at x = 35. Is this interpolation or extrapolation?
- (b) Using the same data, a prediction is made at x = 70. Which type? Is it reliable?
- (c) ŷ = −5 + 2x predicts negative values when x < 2.5. If x ≥ 3 in all observed data, is predicting y at x = 1 sensible?
- (d) Why is extrapolation risky even when r is very close to 1?
Line of best fit from a scatter plot (Understanding)
The scatter plot shows hours of revision (x) and exam score (y) for 8 students. A line of best fit has been drawn.
- (a) Read two points on the line of best fit and find its equation.
- (b) Predict the score for a student who revised for 7 hours.
- (c) One student revised 3 hours and scored 65. Calculate their residual.
- (d) Is it reliable to use the line to predict a score for 15 hours of revision? Why?
Finding a line through the mean point (Understanding)
Data: x = {2, 4, 6, 8, 10}, y = {14, 18, 20, 26, 32}.
- (a) Find x̄ and ȳ.
- (b) Technology gives the slope as b = 2.2. Find the y-intercept a using the mean point.
- (c) Write the equation of the line of best fit.
- (d) Predict y for x = 5 and x = 12. Which prediction is more reliable?
Interpreting a regression line (Understanding)
A study of 20 cars gives the regression line: ŷ = 22.5 − 1.8x, where x = age of car (years) and y = resale value ($000s).
- (a) Predict the resale value of a 5-year-old car.
- (b) Interpret the slope in context.
- (c) The data covers cars aged 1 to 10 years. Predict the value of a 15-year-old car. Is this reliable?
- (d) At what age does the model predict the car has no value (y = 0)? Is this realistic?
Residual analysis (Understanding)
ŷ = 40 + 5x for a dataset of study hours vs test score.
| x (hours) | 2 | 4 | 6 | 8 | 10 |
|---|---|---|---|---|---|
| y (actual) | 50 | 55 | 72 | 80 | 88 |
- (a) Calculate the predicted ŷ for each x.
- (b) Calculate the residual for each data point.
- (c) Which student performed most above prediction?
- (d) Do the residuals suggest the linear model is a good fit? Explain.
Temperature and fuel consumption (Problem Solving)
An engineer records daily temperature x (°C) and fuel consumption y (L/100 km) for a bus over 8 days:
| x (°C) | 5 | 8 | 10 | 15 | 18 | 22 | 25 | 30 |
|---|---|---|---|---|---|---|---|---|
| y (L/100 km) | 14.2 | 13.5 | 13.0 | 12.2 | 11.8 | 11.0 | 10.5 | 9.8 |
- (a) Describe the correlation (direction, form, strength).
- (b) x̄ ≈ 16.6, ȳ = 12.0. The least-squares slope is b ≈ −0.174. Find the y-intercept a.
- (c) Write the equation and predict fuel use at 20°C.
- (d) Predict fuel use at 40°C. Is this reliable? What limitations apply?
Choosing the better model (Problem Solving)
Two researchers analyse the same dataset (age x vs reaction time y in milliseconds). Researcher A proposes ŷ = 200 + 3x. Researcher B proposes ŷ = 150 + 5x.
| x (age) | 20 | 30 | 40 | 50 | 60 |
|---|---|---|---|---|---|
| y (actual) | 258 | 298 | 350 | 395 | 440 |
- (a) Calculate the predictions from each model at each x.
- (b) Calculate the residuals for each model.
- (c) Which model's residuals are smaller in size overall (closer to 0)? That model is the better fit.
- (d) At which age does Model A give a better prediction than Model B (a residual closer to 0)?