Least Squares Regression Line
Key Terms
- Least squares regression line (LSRL)
- The line ŷ = a + bx that minimises the sum of squared vertical distances from each data point to the line.
- Slope b
- b = r × (sy/sx); for each 1-unit increase in x, y is PREDICTED to change by b units.
- y-intercept a
- a = ȳ − b x̄; the predicted value of y when x = 0 (only meaningful if x = 0 is within, or close to, the data range).
- Key property
- The LSRL always passes through the mean point (x̄, ȳ).
- Interpreting the slope
- State: direction (increases/decreases), amount (b units), and context (variable names and units).
- Technology
- Check the scatterplot for a linear pattern first, then use ClassPad's linear regression to obtain a, b, and r.
The Least Squares Regression Line
When the scatterplot shows a linear association, we fit a least squares regression line (LSRL) as the line of best fit. It minimises the sum of squared vertical distances from each data point to the line.
Slope: b = r × (sy / sx)
y-intercept: a = ȳ − b x̄
where r = Pearson's correlation coefficient,
sy, sx = standard deviations of y and x,
ȳ, x̄ = means of y and x
In practice, use technology (ClassPad, spreadsheet) to fit the line. You must be able to interpret the equation in context.
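The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for ClassPad; the function name `lsrl_from_summary` and the summary statistics passed to it are hypothetical values chosen only to show the arithmetic.

```python
def lsrl_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Return (a, b) for y-hat = a + b*x from summary statistics."""
    b = r * (s_y / s_x)      # slope: b = r * (sy / sx)
    a = y_bar - b * x_bar    # y-intercept: a = y-bar - b * x-bar
    return a, b

# Hypothetical summary statistics, for illustration only.
a, b = lsrl_from_summary(r=0.8, s_x=2.0, s_y=5.0, x_bar=10.0, y_bar=40.0)
print(f"y-hat = {a:.1f} + {b:.1f}x")
```

Note that the slope is calculated first, because the intercept formula needs b.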
Interpreting the Equation
- Slope (b): For each 1-unit increase in x, y is predicted to change by b units. Give the sign and the context.
- y-intercept (a): The predicted value of y when x = 0. This is only meaningful if x = 0 is within (or close to) the data range.
Worked Example 1
A real estate agent collects data on house size (x, m²) and sale price (y, $'000) for 15 Brisbane properties. Technology gives:
ŷ = 85.3 + 3.42x
Interpret the slope: For each additional 1 m² of house size, the sale price is predicted to increase by $3,420 (i.e., 3.42 × $1,000).
Interpret the y-intercept: A house of 0 m² would be predicted to sell for $85,300. This is not meaningful in context (a house must have some size), so we note the y-intercept has no practical interpretation here.
Worked Example 2
A school canteen records the daily temperature (x, °C) and number of hot drinks sold (y). The LSRL is:
ŷ = 142 − 3.8x
Interpret slope: For each 1°C increase in temperature, hot drink sales are predicted to decrease by 3.8 drinks per day.
Predict sales on a 25°C day: ŷ = 142 − 3.8(25) = 142 − 95 = 47 hot drinks.
The line always passes through (x̄, ȳ). This is a key property of the LSRL.
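As a quick check of Worked Example 2, the prediction can be reproduced in Python. This is only a sketch; the helper name `predict` is an assumption made for illustration.

```python
def predict(a, b, x):
    """Predicted value y-hat = a + b*x."""
    return a + b * x

# Worked Example 2: y-hat = 142 - 3.8x
a, b = 142, -3.8

sales_at_25 = predict(a, b, 25)
print(round(sales_at_25, 1))  # 47.0

# Raising x by 1 changes the prediction by exactly the slope.
change = predict(a, b, 26) - predict(a, b, 25)
print(round(change, 1))  # -3.8
```

The second calculation shows the slope interpretation directly: one extra degree lowers the predicted sales by 3.8 drinks.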
Full Lesson: Least Squares Regression Line
1. Why Fit a Line?
Once we establish that two variables have a linear association, it is useful to quantify that relationship with an equation. This equation lets us:
- Make predictions for new values of x
- Describe the rate of change between variables
- Compare relationships across different datasets
The least squares regression line is the unique line that minimises the sum of squared residuals (vertical distances from data points to the line). No other straight line has a smaller sum of squared residuals for the same data.
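The minimising property can be checked numerically. The sketch below fits a line to a small hypothetical dataset using b = Sxy / Sxx (algebraically the same as b = r × sy / sx), then compares its sum of squared residuals with a few nudged lines. The function names and the data are assumptions for illustration.

```python
from statistics import mean

def fit_lsrl(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_lsrl(xs, ys)
best = sum_squared_residuals(a, b, xs, ys)

# Nudge the intercept and/or slope; every alternative should do worse.
shifts = [(0.5, 0), (-0.5, 0), (0, 0.1), (0, -0.1), (0.3, -0.05)]
for da, db in shifts:
    other = sum_squared_residuals(a + da, b + db, xs, ys)
    print(f"a{da:+}, b{db:+}: {other:.3f} vs LSRL {best:.3f}")
```

Trying a handful of alternatives is not a proof, but it illustrates why the fitted line is called the line of best fit.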
2. The Formulas
The LSRL has the form ŷ = a + bx, where ŷ (y-hat) represents the predicted value of y for a given x.
b = r × (sy / sx)
a = ȳ − b x̄
Notice that b depends on the correlation r and the standard deviations of both variables, while a also depends on their means. If r = 0, then b = 0 (horizontal line at y = ȳ). If the standard deviations are equal, then b = r.
Key property: The LSRL always passes through the point (x̄, ȳ) — the mean of x and the mean of y. You can verify this by substituting x = x̄ into the equation.
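The substitution described in the key property can be written out in one line, using only the intercept formula a = ȳ − b x̄:

```latex
\hat{y}\,\big|_{x=\bar{x}} \;=\; a + b\bar{x} \;=\; (\bar{y} - b\bar{x}) + b\bar{x} \;=\; \bar{y}
```

The b x̄ terms cancel whatever the value of b, which is why the property holds for every least squares line.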
3. Finding the Equation with Technology
ClassPad: Enter data in two lists (List 1 = x, List 2 = y). Go to Statistics → Calc → Linear Reg (ax+b). The output gives a (the slope, i.e. the coefficient of x), b (the y-intercept), and r. Note: ClassPad uses the form y = ax + b, so ClassPad's 'a' is the slope and its 'b' is the y-intercept, which is the opposite of the naming convention in this course (ŷ = a + bx). Be careful to swap them when writing your equation.
Spreadsheet (Excel/Google Sheets): Use SLOPE(y_range, x_range) for b and INTERCEPT(y_range, x_range) for a. Or insert a trendline on a chart and display the equation.
4. Interpreting the Slope in Context
The slope is the most important part of the regression equation. Always interpret it with:
- The direction (increase or decrease)
- The amount (b units of y for each 1-unit increase in x)
- The units of both variables
- The context (name the variables)
Example: If ŷ = 12.4 + 0.85x where x = weekly study hours and y = exam score:
"For each additional hour of study per week, the exam score is predicted to increase by 0.85 marks."
5. Interpreting the y-Intercept
The y-intercept (a) is the predicted value of y when x = 0. It is only meaningful if:
- x = 0 is a realistic value in context, AND
- x = 0 is within or near the observed range of x values
If x = 0 is far outside the data range, the y-intercept is a mathematical extrapolation with no practical meaning. In that case, simply note: "The y-intercept of [value] has no practical meaning in this context as an x-value of 0 is outside the observed data range."
6. The Coefficient of Determination (r²)
Technology also gives r², the coefficient of determination. It tells us the proportion of variation in y that is explained by the linear relationship with x.
Express as a percentage: r² × 100%
Example: If r = 0.87, then r² = 0.757. We say: "75.7% of the variation in y is explained by the linear relationship with x." The remaining 24.3% is due to other factors or random variation.
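The arithmetic in this example, and the meaning of "proportion of variation explained", can be sketched in Python. The second half uses a small hypothetical dataset to show that r² equals 1 minus (unexplained variation ÷ total variation) for a least squares line; the variable names are assumptions for illustration.

```python
from math import sqrt
from statistics import mean

# Example from the text: r = 0.87
r = 0.87
r_squared = r ** 2
print(f"r^2 = {r_squared:.3f}")               # r^2 = 0.757
print(f"{r_squared * 100:.1f}% explained")    # 75.7% explained

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 4, 8, 7]
x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
ss_total = sum((y - y_bar) ** 2 for y in ys)   # total variation in y

# Fit the least squares line and measure the leftover variation.
b = s_xy / s_xx
a = y_bar - b * x_bar
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_data = s_xy / sqrt(s_xx * ss_total)          # Pearson's r for this data
explained = 1 - ss_resid / ss_total            # proportion explained
print(f"{r_data ** 2:.4f} {explained:.4f}")    # the two values agree
```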
7. Common Errors
- Swapping x and y: The regression line of y on x is not the same as x on y. Always regress the response variable (y) on the explanatory variable (x).
- Predicting outside the data range: Predictions for x-values beyond the observed data (extrapolation) may be unreliable. See the next lesson on prediction and reliability.
- Reporting the line without interpreting it: Always explain what the slope means in the real-world context.
- Forgetting units: The slope has units of (y-units)/(x-units). Include them in your interpretation.
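The first error in the list above (swapping x and y) can be demonstrated numerically. The sketch below fits both regressions to a small hypothetical dataset and rewrites the x-on-y line in the form y = ... so the two slopes can be compared on the same axes. The function name `fit` and the data are assumptions for illustration.

```python
from statistics import mean

def fit(explanatory, response):
    """Return (intercept, slope) of the least squares line for response on explanatory."""
    e_bar, r_bar = mean(explanatory), mean(response)
    s_er = sum((e - e_bar) * (v - r_bar) for e, v in zip(explanatory, response))
    s_ee = sum((e - e_bar) ** 2 for e in explanatory)
    slope = s_er / s_ee
    return r_bar - slope * e_bar, slope

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 4, 8, 7]

a_yx, b_yx = fit(xs, ys)   # y on x: y-hat = a_yx + b_yx * x
a_xy, b_xy = fit(ys, xs)   # x on y: x-hat = a_xy + b_xy * y

# Rearranging x = a_xy + b_xy * y gives y = (x - a_xy) / b_xy,
# so on the usual axes the x-on-y line has slope 1 / b_xy.
slope_rearranged = 1 / b_xy

print(f"y on x slope: {b_yx:.3f}")
print(f"x on y slope (rearranged): {slope_rearranged:.3f}")
```

The two slopes match only when the points lie exactly on a line (r = ±1), because the product of the two regression slopes equals r².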
Mastery Practice
-
A regression line fitted to data on weekly advertising spend (x, $'000) and weekly revenue (y, $'000) gives the equation:
ŷ = 45.2 + 8.6x
- Interpret the slope in context.
- Interpret the y-intercept in context, and comment on whether it is meaningful.
- Predict the weekly revenue when advertising spend is $3,000.
-
A study of NRL players records body mass (x, kg) and bench press maximum (y, kg). Technology gives x̄ = 98.4, ȳ = 118.7, sx = 12.3, sy = 18.5, r = 0.72.
- Calculate the slope b.
- Calculate the y-intercept a.
- Write the regression equation.
- Predict the bench press maximum for a player with body mass 110 kg.
-
For the data below, use technology to find the least squares regression equation where x = temperature (°C) and y = electricity usage (kWh/day).
x (°C): 18, 22, 25, 28, 31, 35, 38
y (kWh/day): 14.2, 16.8, 18.5, 21.1, 23.4, 26.7, 29.0
- Find the LSRL equation.
- Interpret the slope.
- What proportion of variation in electricity usage is explained by temperature?
-
A regression equation for predicting a student's physics score (y) from their mathematics score (x) is:
ŷ = 12.8 + 0.73x
The mean mathematics score is x̄ = 68 and mean physics score is ȳ = 62.4. Verify that the line passes through (x̄, ȳ) by substituting x̄ into the equation.
-
A marine biologist measures the length (x, cm) and weight (y, kg) of 20 barramundi. Technology gives the LSRL as ŷ = −1.84 + 0.062x, with r² = 0.891.
- Interpret the slope in context.
- Interpret r² in context.
- Estimate the weight of a barramundi that is 75 cm long.
- Should the y-intercept be interpreted as the weight of a 0 cm barramundi? Explain.
-
A car rental company finds that the daily hire fee (x, $) and number of bookings per week (y) have the regression equation:
ŷ = 320 − 2.4x
- What does the negative slope tell us about this relationship?
- Predict the number of bookings when the daily hire fee is $85.
- At what daily fee does the model predict zero bookings? Comment on the reliability of this prediction.
-
Two students fit regression lines to the same dataset, but Student A regresses y on x and Student B regresses x on y. Explain why these two lines are different, and which one should be used to predict y from x.
-
A dataset on hours of sunshine (x) and café outdoor seating revenue (y, $) has x̄ = 6.2, ȳ = 842, and the regression line ŷ = 315 + 85x.
- Verify the line passes through (x̄, ȳ).
- Interpret the slope in context.
- If r = 0.88, find the proportion of variation in revenue explained by the regression model.
-
A health researcher records the number of cigarettes smoked per day (x) and lung capacity (y, litres) for 30 adults. Technology gives ŷ = 5.2 − 0.048x, r = −0.79.
- Interpret the slope and the sign of r in context.
- Estimate the lung capacity of a person who smokes 20 cigarettes per day.
- Estimate the lung capacity of a non-smoker (x = 0). Is this interpolation or extrapolation?
-
A financial analyst examines the relationship between years of experience (x) and annual salary (y, $'000) for 40 accountants. The LSRL is ŷ = 52.4 + 3.15x, with r² = 0.634.
- Write a sentence interpreting the slope for a non-mathematical audience.
- Predict the salary of an accountant with 12 years of experience.
- r² = 0.634. What does this mean, and what factors might account for the unexplained variation?
- Would you use this equation to predict the salary of an accountant with 40 years of experience? Justify your answer.