Practice Maths

L29 — Lines of Best Fit

Key Terms

Line of best fit
A straight line that minimises the sum of squared vertical distances from each data point to the line (least-squares line).
ŷ = a + bx
The equation of the regression line; ŷ (y-hat) is the predicted value of y for a given x.
Slope b
The rate of change — for each 1-unit increase in x, y changes by b units on average.
y-intercept a
The predicted value of y when x = 0; only meaningful if x = 0 is within (or near) the data range.
Interpolation
Predicting y for an x value within the observed range of data — generally reliable.
Residual
The difference (actual y − predicted ŷ); positive residuals sit above the line, negative residuals below.

Line of best fit (least-squares regression line)

A line of best fit is a straight line drawn through a scatter plot to best represent the trend in the data. The least-squares line minimises the sum of squared vertical distances from each point to the line.
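To make the least-squares idea concrete, the sketch below adds up the squared vertical distances for two candidate lines on the same data. It is an illustration only: the helper name sum_squared_residuals is made up for this example, the data are the six points used in Worked Example 1, and the second line's coefficients come from applying the standard least-squares formulas to those points.

```python
def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical distances from each point to the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Six data points (the same ones used in Worked Example 1).
xs = [1, 2, 3, 4, 5, 6]
ys = [5, 7, 9, 8, 12, 13]

# Candidate 1: a line drawn by eye, y-hat = 2 + 2x.
by_eye = sum_squared_residuals(2, 2, xs, ys)

# Candidate 2: the least-squares line for this data, y-hat = 3.6 + (27/17.5)x.
least_squares = sum_squared_residuals(3.6, 27 / 17.5, xs, ys)

print(f"by eye:        {by_eye:.2f}")
print(f"least squares: {least_squares:.2f}")
```

The by-eye line is a reasonable fit, but the least-squares line gives the smaller total, which is what "best fit" means in this sense.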

Equation of the line of best fit

ŷ = a + bx

Symbol | Meaning
ŷ (y-hat) | Predicted value of y
a | y-intercept (predicted y when x = 0)
b | Slope (change in y per unit increase in x)
x | Value of the explanatory variable

Finding the equation

Method | How
By hand (two points) | Read two points on the drawn line; use y = mx + c
Calculator / technology | Enter data, run linear regression; gives a and b directly
Mean point | The line always passes through (x̄, ȳ)
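When a calculator or spreadsheet runs linear regression, it is in effect applying the standard least-squares formulas: slope b = Sxy / Sxx and intercept a = ȳ − b·x̄. The sketch below shows one way to write that out. The function name least_squares_line is an assumption for this example, and the data are the six points from Worked Example 1.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sxy: how x and y vary together; Sxx: how much x varies on its own.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx        # slope
    a = y_bar - b * x_bar  # intercept, chosen so the line passes through (x_bar, y_bar)
    return a, b

xs = [1, 2, 3, 4, 5, 6]
ys = [5, 7, 9, 8, 12, 13]
a, b = least_squares_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")  # y-hat = 3.60 + 1.54x
```

Because the intercept is calculated from the means, the resulting line passes through the mean point by construction.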

Interpolation and extrapolation

Term | Meaning | Reliability
Interpolation | Predicting within the range of the data | Generally reliable
Extrapolation | Predicting outside the range of the data | Unreliable; use with caution
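The distinction comes down to comparing the x value with the smallest and largest observed x. A minimal sketch, assuming the observed x values are available as a list (the function name prediction_type is invented for this example):

```python
def prediction_type(x, observed_xs):
    """Classify a prediction at x as interpolation or extrapolation."""
    if min(observed_xs) <= x <= max(observed_xs):
        return "interpolation"
    return "extrapolation"

observed_xs = [1, 2, 3, 4, 5, 6]          # x values that were actually measured
print(prediction_type(3.5, observed_xs))  # interpolation
print(prediction_type(10, observed_xs))   # extrapolation
```

The check says nothing about how good the model is. It only flags whether the data can support a prediction at that x value.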

Residuals

A residual = actual y − predicted ŷ. Positive residuals sit above the line; negative residuals sit below. A good fit has residuals randomly scattered around zero.

[Figure: scatter plot with axes x and y. The line of best fit ŷ = a + bx passes through the mean point (x̄, ȳ); residuals are shown as dashed lines.]
Hot Tip: The least-squares line of best fit always passes through the mean point (x̄, ȳ). Use this to check your equation: substitute x̄ and verify that the predicted value equals ȳ (allowing for rounding). A line drawn by eye should pass close to the mean point.
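Residuals and the mean point are linked: for any line that passes through the mean point, the residuals add up to zero. The sketch below checks this using the six data points and the by-eye line ŷ = 2 + 2x from Worked Example 1. The variable names are chosen for this illustration only.

```python
xs = [1, 2, 3, 4, 5, 6]
ys = [5, 7, 9, 8, 12, 13]
a, b = 2, 2  # line drawn by eye in Worked Example 1: y-hat = 2 + 2x

# Residual = actual y minus predicted y-hat, one per data point.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals)       # [1, 1, 1, -2, 0, -1]
print(sum(residuals))  # 0

# Mean point check: the prediction at x_bar should match y_bar.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
print(a + b * x_bar, y_bar)  # 9.0 9.0
```

Positive entries are points above the line, negative entries are points below, and the zero is a point that sits on the line.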

Worked Example 1 — Drawing a line of best fit by eye

Data: (1,5), (2,7), (3,9), (4,8), (5,12), (6,13). Draw a line of best fit and find its equation.

Step 1: Plot the points. The pattern is positive and roughly linear.

Step 2: Draw a line with roughly equal numbers of points above and below.

Step 3: Read two points on the line: approximately (1, 4) and (6, 14).

Step 4: Slope b = (14 − 4)/(6 − 1) = 10/5 = 2. Intercept: 4 = 2(1) + a ⇒ a = 2.

Equation: ŷ = 2 + 2x.
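Steps 3 and 4 can be sketched as a small function that takes two points read from the drawn line and returns the intercept and slope. The name line_through is made up for this example.

```python
def line_through(p1, p2):
    """Return (a, b) for the line y = a + b*x through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("the two points need different x values")
    b = (y2 - y1) / (x2 - x1)  # slope = rise / run
    a = y1 - b * x1            # substitute one point to find the intercept
    return a, b

a, b = line_through((1, 4), (6, 14))
print(a, b)  # 2.0 2.0
```

Choosing two points that are far apart on the line, as here, keeps small reading errors from distorting the slope too much.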

Worked Example 2 — Using the equation to predict

A line of best fit for temperature (x, °C) vs ice cream sales (y, units) is ŷ = −20 + 15x. Predict sales when temperature is 28°C.

ŷ = −20 + 15(28) = −20 + 420 = 400 units.

Note: if 28°C is within the range of observed data, this is interpolation and the prediction is generally reliable.

Worked Example 3 — Interpreting slope and intercept

ŷ = 30 + 4.5x, where x = hours of training, y = fitness score (out of 100).

Slope (b = 4.5): For each additional hour of training, fitness score increases by approximately 4.5 points.

Intercept (a = 30): A person who does zero hours of training is predicted to score 30. (Only meaningful if x = 0 is in the data range.)

Worked Example 4 — Calculating a residual

A student studied 4 hours and scored 78. The line of best fit gives ŷ = 2 + 15(4) = 62. Find the residual.

Residual = actual − predicted = 78 − 62 = +16.

This student scored 16 marks above what the model predicted. They sit above the line.

Worked Example 5 — Mean point check

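Data: x = {2, 4, 6, 8}, y = {10, 14, 20, 24}. Check whether the mean point (x̄, ȳ) lies on the line ŷ = 5 + 2.3x.

x̄ = (2 + 4 + 6 + 8)/4 = 5. ȳ = (10 + 14 + 20 + 24)/4 = 17.

Predicted: ŷ = 5 + 2.3(5) = 5 + 11.5 = 16.5. This is close to ȳ = 17 but not equal to it, so the line passes near the mean point rather than exactly through it. That is acceptable for a line drawn by eye or one with a rounded slope. The exact least-squares line for this data is ŷ = 5 + 2.4x, which gives ŷ = 5 + 2.4(5) = 17 at x = 5.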
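The same check can be sketched in code, together with the related step of finding the intercept from a known slope and the mean point (a = ȳ − b·x̄). The function name intercept_from_mean_point is an assumption for this illustration.

```python
def intercept_from_mean_point(b, xs, ys):
    """Intercept a that makes y-hat = a + b*x pass through (x_bar, y_bar)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return y_bar - b * x_bar

xs = [2, 4, 6, 8]
ys = [10, 14, 20, 24]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Evaluate the given line y-hat = 5 + 2.3x at the mean of x.
predicted_at_mean = 5 + 2.3 * x_bar
print(f"prediction at x_bar: {predicted_at_mean:.2f}, y_bar: {y_bar:.2f}")

# Intercept that a line with slope 2.3 would need to hit the mean point exactly.
a = intercept_from_mean_point(2.3, xs, ys)
print(f"intercept through the mean point: {a:.2f}")
```

With slope 2.3 the line would need an intercept of 5.5, not 5, to pass exactly through (5, 17). That is where the gap of 0.5 in the check comes from.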

  1. Reading the line of best fit. Fluency

    • (a) A line of best fit passes through (2, 10) and (8, 28). Find its equation.
    • (b) ŷ = 5 + 3x. Predict y when x = 7.
    • (c) ŷ = 100 − 4x. What is the y-intercept? What is the slope?
    • (d) ŷ = 12 + 2.5x. Predict y when x = 0. What does this represent?
  2. Interpreting slope and intercept. Fluency

    • (a) ŷ = 50 + 6x, where x = weeks of exercise and y = fitness score. Interpret the slope.
    • (b) ŷ = 200 − 3x, where x = years and y = resale value ($00s). Interpret the slope and intercept.
    • (c) ŷ = 8 + 0.5x, where x = hours of sleep and y = alertness rating. What alertness is predicted for 7 hours?
    • (d) ŷ = 60 − 2x, where x = number of absences and y = exam mark. What does a slope of −2 mean?
  3. Residuals. Fluency

    • (a) Actual y = 45, predicted ŷ = 38. Find the residual. Is the point above or below the line?
    • (b) ŷ = 10 + 4x. For x = 5, actual y = 25. Find the residual.
    • (c) A point has residual = −8. Is the actual value higher or lower than predicted?
    • (d) Why is the sum of residuals for a least-squares line always (approximately) zero?
  4. Interpolation vs extrapolation. Fluency

    • (a) Data was collected for x = 10 to x = 50. A prediction is made at x = 35. Is this interpolation or extrapolation?
    • (b) Using the same data, a prediction is made at x = 70. Which type? Is it reliable?
    • (c) ŷ = −5 + 2x predicts negative values when x < 2.5. If x ≥ 3 in all observed data, is predicting y at x = 1 sensible?
    • (d) Why is extrapolation risky even when r is very close to 1?
  5. Line of best fit from a scatter plot. Understanding

    The scatter plot shows hours of revision (x) and exam score (y) for 8 students. A line of best fit has been drawn.

    [Scatter plot: revision hours (x-axis, 0 to 10) against exam score (y-axis, 0 to 100), with a line of best fit drawn.]
    • (a) Read two points on the line of best fit and find its equation.
    • (b) Predict the score for a student who revised for 7 hours.
    • (c) One student revised 3 hours and scored 65. Calculate their residual.
    • (d) Is it reliable to use the line to predict a score for 15 hours of revision? Why?
  6. Finding a line through the mean point. Understanding

    Data: x = {2, 4, 6, 8, 10}, y = {14, 18, 20, 26, 32}.

    • (a) Find x̄ and ȳ.
    • (b) Using technology or estimation, the slope is b = 1.8. Find the y-intercept a using the mean point.
    • (c) Write the equation of the line of best fit.
    • (d) Predict y for x = 5 and x = 12. Which prediction is more reliable?
  7. Interpreting a regression line. Understanding

    A study of 20 cars gives the regression line: ŷ = 22.5 − 1.8x, where x = age of car (years) and y = resale value ($000s).

    • (a) Predict the resale value of a 5-year-old car.
    • (b) Interpret the slope in context.
    • (c) The data covers cars aged 1 to 10 years. Predict the value of a 15-year-old car. Is this reliable?
    • (d) At what age does the model predict the car has no value (y = 0)? Is this realistic?
  8. Residual analysis. Understanding

    ŷ = 40 + 5x for a dataset of study hours vs test score.

    x (hours) | 2 | 4 | 6 | 8 | 10
    y (actual) | 50 | 55 | 72 | 80 | 88
    • (a) Calculate the predicted ŷ for each x.
    • (b) Calculate the residual for each data point.
    • (c) Which student performed most above prediction?
    • (d) Do the residuals suggest the linear model is a good fit? Explain.
  9. Temperature and fuel consumption. Problem Solving

    An engineer records daily temperature x (°C) and fuel consumption y (L/100 km) for a bus over 8 days:

    x | 5 | 8 | 10 | 15 | 18 | 22 | 25 | 30
    y | 14.2 | 13.5 | 13.0 | 12.2 | 11.8 | 11.0 | 10.5 | 9.8
    • (a) Describe the correlation (direction, form, strength).
    • (b) x̄ = 16.6, ȳ = 12.0. The slope is b = −0.203. Find the y-intercept a.
    • (c) Write the equation and predict fuel use at 20°C.
    • (d) Predict fuel use at 40°C. Is this reliable? What limitations apply?
  10. Choosing the better model. Problem Solving

    Two researchers analyse the same dataset (age x vs reaction time y in milliseconds). Researcher A proposes ŷ = 200 + 3x. Researcher B proposes ŷ = 150 + 5x.

    x (age) | 20 | 30 | 40 | 50 | 60
    y (actual) | 258 | 298 | 350 | 395 | 440
    • (a) Calculate the predictions from each model at each x.
    • (b) Calculate the residuals for each model.
    • (c) Which model has smaller overall residuals (closer to 0)? This is the better fit.
    • (d) For which age does Model A give a better prediction than Model B?