Practice Maths

Scatter Plots and Correlation

Key Ideas

Key Terms

Bivariate data
Data in which two variables are measured for each subject (e.g. a person’s height and shoe size).
Scatter plot
Displays bivariate data as points on a coordinate plane. The independent variable goes on the x-axis; the dependent variable goes on the y-axis.
Positive correlation
As x increases, y tends to increase (points slope upward left to right).
Negative correlation
As x increases, y tends to decrease (points slope downward left to right).
No correlation
No clear pattern — points are scattered randomly.
Strength
Strong (points cluster tightly near a line), moderate (some scatter but a trend is visible), or weak (very scattered, trend barely visible).
Outlier
A point that lies far away from the general pattern of the data.

Constructing a Scatter Plot

Label the x-axis with the independent variable and the y-axis with the dependent variable. Plot each data pair as a single point. Do not join the points with lines.

Hot Tip: Correlation does not mean causation. Two variables may move together for reasons unrelated to one causing the other. Always consider whether a third factor might explain both.

Worked Example

Question: A scatter plot shows data about 20 students. As the number of hours of sleep increases, test scores also increase, but the points have quite a bit of spread around the trend. One student who slept 9 hours scored only 42%, well below the rest. Describe the correlation and identify any outlier.

Step 1 — Identify the direction.
As hours of sleep increase, test scores increase → positive correlation.

Step 2 — Assess the strength.
There is noticeable spread around the trend → moderate positive correlation.

Step 3 — Identify outliers.
The student who slept 9 hours but scored 42% lies far below the general pattern → this point is an outlier.

Conclusion: The data shows a moderate positive correlation between hours of sleep and test scores. There is one outlier at (9, 42): the student who slept 9 hours but scored 42%.

What is Bivariate Data?

Bivariate data involves two variables measured on the same subject. For example, recording both the height and shoe size of each person in your class gives you bivariate data. We want to know whether the two variables are related — does one tend to increase as the other increases?

The variable we think explains the change is called the independent variable (plotted on the x-axis). The variable that responds is called the dependent variable (plotted on the y-axis). For example, study time (hours) is independent; exam score is dependent.

Drawing a Scatter Plot

A scatter plot displays bivariate data as a set of points on a coordinate plane. Each data pair (x, y) becomes one dot on the graph. There are no lines connecting the dots.

Steps: (1) Draw and label your axes with units. (2) Choose a suitable scale for each axis. (3) Plot each (x, y) pair as a point. (4) Give your graph a title.

Example: If a student studied 3 hours and scored 65%, plot (3, 65). Another student studied 6 hours and scored 82% — plot (6, 82). Continue for all data pairs.
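As a rough illustration of the plotting steps, the Python sketch below places the two example pairs on a coarse text grid. The function name text_scatter, its parameters, and the 10-point row height are assumptions made for this sketch; on paper or in a graphing tool you would plot each point at its exact position.

```python
def text_scatter(points, x_max, y_max, y_step):
    """Return a rough text scatter plot: one '*' per (x, y) pair, no joining lines.

    Assumes whole-number x-values from 0 to x_max (an assumption of this sketch).
    """
    rows = []
    # Work from the top of the y-axis down to 0 so larger y-values appear higher up.
    for y_low in range(y_max, -1, -y_step):
        cells = []
        for x in range(x_max + 1):
            # Mark the cell if a point has this x and a y in [y_low, y_low + y_step).
            hit = any(px == x and y_low <= py < y_low + y_step for px, py in points)
            cells.append("*" if hit else ".")
        rows.append(f"{y_low:>3} | " + " ".join(cells))
    # Bottom row: the x-axis scale.
    rows.append("      " + " ".join(str(x) for x in range(x_max + 1)))
    return "\n".join(rows)


# The two pairs from the example: (hours studied, exam score in %).
data = [(3, 65), (6, 82)]
plot = text_scatter(data, x_max=8, y_max=100, y_step=10)
print("Exam score (%) against hours studied")
print(plot)
```

Each pair becomes exactly one mark and nothing joins the marks, which is the defining feature of a scatter plot.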

Types of Correlation

Once the points are plotted, look at the overall pattern:

  • Positive correlation: As x increases, y tends to increase. Points slope upward left to right. Example: height vs arm span.
  • Negative correlation: As x increases, y tends to decrease. Points slope downward left to right. Example: hours of TV vs exam score.
  • No correlation: No clear pattern. Points are scattered randomly. Example: shoe size vs exam score.

The strength describes how closely the points cluster around a straight line: strong (tight cluster), moderate (some spread), or weak (very spread out).
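The lesson judges direction and strength by eye. A common numerical companion, not covered in the text above, is Pearson's correlation coefficient r, which runs from -1 to 1. The sketch below computes r and turns it into a description. The cut-offs (0.8, 0.5, 0.2) and the data are made up for illustration; different textbooks use different boundaries.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def describe(r):
    """Turn r into words. The cut-offs are illustrative assumptions only."""
    size = abs(r)
    if size < 0.2:
        return "no correlation"
    if size >= 0.8:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


# Made-up data: hours studied and exam score (%) for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 73]

r = pearson_r(hours, scores)
print(describe(r))
```

For this made-up data the points lie very close to a rising line, so describe(r) gives "strong positive".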

Identifying Outliers

An outlier is a point that does not fit the general pattern — it sits well away from the other points. Outliers can occur due to measurement errors, or they may represent genuinely unusual cases. Always comment on outliers when describing a scatter plot: note where the point is and that it does not follow the trend.
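In class, outliers are usually spotted by eye. As one possible way to make the idea concrete, the sketch below fits a straight line to the data and flags any point whose vertical distance from that line is more than twice the typical distance. The "twice" rule, the function names, and the data (loosely modelled on the sleep and test score example) are assumptions for illustration, not a standard definition.

```python
from math import sqrt


def fit_line(points):
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_outliers(points, k=2.0):
    """Return the points lying more than k 'typical distances' from the fitted line."""
    slope, intercept = fit_line(points)
    # Residual = actual y minus the y predicted by the line.
    residuals = [y - (slope * x + intercept) for x, y in points]
    # Typical distance: root mean square of the residuals.
    spread = sqrt(sum(e * e for e in residuals) / len(residuals))
    return [p for p, e in zip(points, residuals) if abs(e) > k * spread]


# Made-up data: (hours of sleep, test score %). The last student breaks the pattern.
data = [(5, 55), (6, 60), (6, 63), (7, 68), (7, 70), (8, 75), (8, 78), (9, 82), (9, 42)]
outliers = flag_outliers(data)
print(outliers)
```

Only the point (9, 42) is flagged here. Note that 9 hours and 42% are each ordinary values on their own; it is the combination that sits far from the trend.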

Correlation Does Not Mean Causation

Just because two variables have a strong correlation does not mean one causes the other. Ice cream sales and drowning rates are strongly positively correlated — but eating ice cream does not cause drowning. Both are caused by a third factor: hot weather. This hidden factor is called a lurking variable.

Always ask: "Is there a logical reason why one would cause the other, or could a third factor explain both?"
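The ice cream example can be imitated with a small simulation. In the sketch below, temperature drives both ice cream sales and drownings, and neither of those two depends on the other, yet they still come out strongly correlated. All numbers are invented for illustration and are not real data; Pearson's r is used as the measure of correlation, which is an addition to the lesson text.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


random.seed(1)  # fixed seed so the run is repeatable

# The lurking variable: daily temperature (degrees Celsius) on 200 invented days.
temps = [random.uniform(15, 40) for _ in range(200)]

# Each quantity depends on temperature plus random noise, never on the other.
ice_cream = [10 * t + random.gauss(0, 20) for t in temps]
drownings = [0.2 * t + random.gauss(0, 0.5) for t in temps]

r = pearson_r(ice_cream, drownings)
print(round(r, 2))
```

The value of r is high even though the code never lets ice cream sales influence drownings, which is exactly the situation the lurking variable describes.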

Key tip: In an exam, describe a scatter plot using three things: direction (positive/negative/none), strength (strong/moderate/weak), and any outliers. A complete answer earns full marks; saying just "positive correlation" usually does not.

Mastery Practice

  1. Describe the direction of correlation (positive, negative, or no correlation) for each described scatter plot. (Fluency)

    1. As the outside temperature increases, the number of hot drinks sold at a café decreases.
    2. As a student’s study hours per week increase, their exam mark tends to increase.
    3. The number of goals scored by a footballer and the colour of their boots show no clear pattern.
    4. As the distance from the city centre increases, house prices tend to decrease.
    5. As a car engine’s age (in years) increases, its fuel efficiency tends to decrease.
    6. As a person’s age increases, the number of social media posts they make per day shows no consistent pattern.
    7. As the number of rainy days per month increases, umbrella sales increase.
    8. As hours of television watched per day increase, fitness test scores tend to decrease.
  2. For each description, state both the strength (strong, moderate, or weak) and direction of the correlation. (Fluency)

    1. Points cluster very tightly along a line that goes up from left to right.
    2. Points show a downward trend but are spread quite widely around it.
    3. Points are scattered all over the graph with no discernible pattern.
    4. Points generally increase from left to right but with considerable scatter.
    5. Points fall very closely along a downward-sloping line.
    6. There is a very faint upward tendency in the points, but it would be easy to miss.
  3. For each situation, identify the independent variable (x-axis) and the dependent variable (y-axis). (Fluency)

    1. A researcher investigates whether the number of hours spent training affects a swimmer’s race time.
    2. A shop owner wants to see if daily temperature affects ice-cream sales.
    3. A scientist studies whether the amount of fertiliser used affects crop yield.
    4. A teacher examines whether a student’s attendance rate is linked to their final grade.
    5. An engineer tests whether a bridge’s load (in tonnes) affects the amount it deflects (bends).
    6. A dietitian records patients’ daily sugar intake and their resting heart rate.
  4. Decide whether each claim confuses correlation with causation. Explain your reasoning. (Understanding)

    1. “Data shows that towns with more swimming pools have higher average income. Therefore, buying a swimming pool will make you richer.”
    2. “Students who eat breakfast tend to perform better on morning tests. Eating breakfast causes higher test scores.”
    3. “Countries with more televisions per household tend to have longer life expectancy. Televisions must be helping people live longer.”
    4. “Ice-cream sales and drowning rates are strongly positively correlated. Eating ice cream causes drowning.”
  5. Each description below matches one of the scatter plot types listed. Match each description (1–4) to the correct type. Types: strong positive, weak negative, no correlation, moderate positive. (Understanding)

    1. A plot of shoe size (x) versus height (y) for 50 adults. Points follow a clear upward trend with only minor scatter.
    2. A plot of the number of TV ads watched (x) versus the time taken to fall asleep (y). There is a vague downward drift but it is hard to be sure.
    3. A plot of birth month (x) versus annual salary (y) for 200 employees. Points are spread evenly across the entire graph.
    4. A plot of daily exercise minutes (x) versus resting heart rate (y) for a group of athletes. Points rise from left to right with moderate scatter.
  6. Real-world data interpretation. (Problem Solving)

    1. A journalist writes: “Our data shows a strong negative correlation between the number of libraries in a suburb and the local crime rate. We should build more libraries to reduce crime.”
      1. What does “strong negative correlation” mean in this context?
      2. Identify a potential confounding variable that might explain both quantities.
      3. Is the journalist’s conclusion justified? Explain.
    2. Twelve data points are plotted showing the relationship between hours of screen time (x) and hours of physical activity (y) per day for teenagers. Eleven points follow a moderate negative trend. One point shows a teenager with 8 hours of screen time and 4 hours of physical activity.
      1. Is this point likely to be an outlier? Explain.
      2. How might the presence of this outlier affect a researcher’s description of the overall correlation?
      3. Suggest one reason why this data point might be unusual.
    3. A health researcher claims: “Among the 30 patients we studied, there was a weak positive correlation between weekly fruit servings and blood pressure.” A newspaper headline reads: “Eating more fruit raises blood pressure!”
      1. Identify two problems with the newspaper headline.
      2. Write a more accurate headline that reflects what the data actually shows.
  7. Design and interpret a bivariate study. (Problem Solving)

    Sports Science. A sports scientist wants to investigate whether the number of hours of strength training per week is associated with the maximum weight a weightlifter can lift (in kg).
    1. Which variable should be placed on the x-axis? Explain.
    2. The scientist collects data from 6 elite lifters and 6 beginners. Explain one reason why this sample might produce a misleading scatter plot.
    3. The resulting scatter plot shows a strong positive correlation. Write a careful conclusion that avoids claiming causation, and identifies one possible confounding variable.
  8. Outliers and their effect on correlation. (Problem Solving)

    Scenario. A scatter plot shows 14 data points with a strong negative correlation between hours of social media use per day (x) and hours of sleep per night (y). A 15th data point is added: (1, 5), which sits far from the trend line of the other points.
    1. Explain why the point (1, 5) might be considered an outlier even though both values seem reasonable individually.
    2. Describe how adding this outlier would likely change the apparent strength of the correlation.
    3. Should the researcher remove the outlier before drawing conclusions? Discuss what they should do instead.
  9. Interpret bivariate data from a table of values. (Problem Solving)

    Air quality study. The table below shows the average daily traffic volume (thousands of vehicles) and the average fine particle concentration (PM2.5, μg/m³) at 8 monitoring stations.
    Traffic (thousands)    8   15   20   25   30   38   42   50
    PM2.5 (μg/m³)          6   11   13   18   20   25   27   33
    1. Describe the correlation (direction and strength) you would expect from this data.
    2. Identify the independent and dependent variables for this study.
    3. A city planner says: “This proves that reducing traffic will clean the air.” Critique this statement using your knowledge of correlation and causation.
  10. Compare and contrast two bivariate studies. (Problem Solving)

    Two studies. Study A: 200 adults — weak positive correlation between age and number of doctor visits per year. Study B: 12 students — strong negative correlation between daily exercise minutes and body mass index (BMI).
    1. Which study’s conclusions are more reliable? Justify your answer using the concept of sample size.
    2. Study B shows a strong negative correlation. Can the researcher conclude that exercising more causes lower BMI? Explain why or why not.
    3. Name one confounding variable that could explain the correlation in Study A, and one that could explain the correlation in Study B.