Scatterplots and Correlation

Key Terms

Explanatory variable: The independent variable (x-axis) thought to explain or cause changes in the response variable.
Response variable: The dependent variable (y-axis); the outcome being measured or predicted.
Association: A relationship between two variables; described by direction, form, and strength.
Pearson’s r: Measures strength and direction of a LINEAR association; −1 ≤ r ≤ 1.
Strength guidelines: |r| ≥ 0.75: strong; 0.5 ≤ |r| < 0.75: moderate; |r| < 0.5: weak.
Causation: Correlation does NOT imply causation; a lurking (confounding) variable may explain the association.

What is Bivariate Data?

Bivariate data involves two variables measured on the same subject. We look for association (a relationship) between them. One variable is called the explanatory (independent) variable (x-axis) and the other the response (dependent) variable (y-axis).

Describing Association from a Scatterplot

Direction: Positive (both increase together) or Negative (one increases as the other decreases)
Form: Linear (straight-line pattern) or Non-linear (curved pattern)
Strength: Strong (points close to a line), Moderate, or Weak (points scattered widely)

Pearson's Correlation Coefficient (r)

r measures the strength and direction of a linear association.

• r = 1: Perfect positive linear association
• r = −1: Perfect negative linear association
• r = 0: No linear association

Strength guidelines:
|r| ≥ 0.75: Strong
0.5 ≤ |r| < 0.75: Moderate
|r| < 0.5: Weak

Correlation Does NOT Mean Causation

Even a strong correlation does not mean one variable causes changes in the other. There may be a lurking variable (confounding variable) creating the apparent relationship. Always think critically about why an association exists.

Hot Tip: When describing association from a scatterplot, always comment on all three features: direction (positive/negative), form (linear/non-linear), and strength (strong/moderate/weak). A complete description earns full marks; mentioning only one feature earns partial credit at best.

Worked Example 1

A study records the daily maximum temperature (°C) and ice-cream sales (units) at a Gold Coast kiosk. The scatterplot shows a strong positive linear association. Technology gives r = 0.91.

Describe the association: There is a strong, positive, linear association between temperature and ice-cream sales (r = 0.91). As temperature increases, ice-cream sales tend to increase.

Worked Example 2

A researcher finds r = −0.62 between hours of screen time per day and exam score for a sample of Year 12 students.

Interpret r: There is a moderate, negative, linear association — students who spend more hours on screens tend to have lower exam scores. However, this does not prove screen time causes lower scores. Other factors (sleep, study time) may explain the relationship.
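
The strength guidelines and the sign rule used in both worked examples can be written as a small helper. This is a minimal sketch: the function name describe_r is made up for illustration, and the cut-offs are the ones given on this page (other courses may use different boundaries). It only reads the value of r, so it cannot check form; that still requires looking at the scatterplot.

```python
def describe_r(r):
    """Return (strength, direction) for a Pearson's r value, using this page's guidelines."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)
    if size >= 0.75:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "no linear association"
    return strength, direction


print(describe_r(0.91))   # Worked Example 1
print(describe_r(-0.62))  # Worked Example 2
```

The two calls return ("strong", "positive") and ("moderate", "negative"), matching the descriptions in the worked examples.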

Full Lesson: Scatterplots and Correlation

1. Why Study Two Variables Together?

In real life, we rarely care about a single variable in isolation. A doctor wants to know whether blood pressure is related to age. An economist asks whether unemployment relates to crime rates. A sports scientist wonders if training hours relate to performance. Bivariate analysis lets us explore and quantify these relationships.

The explanatory variable (x) is the one we think might influence the other. The response variable (y) is the one we measure as the outcome. Placing the right variable on each axis matters when interpreting results.

2. Constructing a Scatterplot

Each data point is plotted as a coordinate (x, y). For example, if a student studies for 3 hours and scores 72, we plot (3, 72). Key steps:

Label axes with variable names and units
Choose appropriate scales that spread the data across the plot area
Plot each (x, y) pair as a dot
Do not connect the dots with a line
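
The steps above can be sketched in Python as a rough text plot. This is only an illustration: the function name text_scatter and the data are made up (apart from the point (3, 72) from the text), and in practice you would use a graphics calculator, spreadsheet, or plotting library. The sketch scales the data to fill the grid, marks each pair with a single dot, and never joins the dots.

```python
def text_scatter(xs, ys, x_label, y_label, width=40, height=12):
    """Return a rough text scatterplot: one '*' per (x, y) pair, no connecting lines."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    if x_min == x_max or y_min == y_max:
        raise ValueError("each variable needs at least two different values")
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value onto the grid so the data spread across the plot area
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 of the grid is the top of the plot
    lines = [f"{y_label} (from {y_min} to {y_max})"]
    lines += ["|" + "".join(cells) for cells in grid]
    lines.append("+" + "-" * width)
    lines.append(f"{x_label} (from {x_min} to {x_max})")
    return "\n".join(lines)


# Made-up data for illustration; (3, 72) is the point mentioned in the text
hours = [1, 2, 3, 4, 5, 6]
scores = [55, 61, 72, 70, 80, 86]
plot = text_scatter(hours, scores, "Study time (hours)", "Exam score (marks)")
print(plot)
```

The explanatory variable (study time) runs along the horizontal axis and the response variable (exam score) up the vertical axis, with both labels including units.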

3. Describing Association — Direction, Form and Strength

Direction: A positive association means both variables tend to increase together (dots go up left to right). A negative association means as one increases, the other decreases (dots go down left to right).

Form: If the points roughly follow a straight-line pattern, the association is linear. If they follow a curve, it is non-linear. Pearson's r only measures linear association — it can be misleading if the relationship is curved.

Strength: How closely do the points follow the pattern? Strong means very close to a straight line. Weak means points are scattered far from any line. Outliers (individual points far from the main pattern) should be noted separately.

4. Pearson's Correlation Coefficient

Pearson's r is calculated by technology (ClassPad, spreadsheet) from the formula:

r = Σ[(x_i − x̄)(y_i − ȳ)] ÷ [(n−1) · s_x · s_y]
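
As a minimal sketch, the formula above can be followed term by term in Python using the standard statistics module. The function name pearson_r and the data are made up for illustration; the sketch assumes neither variable is constant (otherwise a standard deviation is 0 and the division fails).

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson's r: sum of products of deviations divided by (n - 1) * s_x * s_y."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / ((n - 1) * s_x * s_y)


# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 3))
```

For this small dataset the sum of products is 6 and (n − 1) · s_x · s_y is about 7.746, so r is about 0.775.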

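In this formula, x̄ and ȳ are the sample means, s_x and s_y are the sample standard deviations, and n is the number of data pairs. You do not need to calculate this by hand; technology does it. What you must be able to do is interpret the value. Key properties: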

r is always between −1 and 1
The sign gives direction; the magnitude gives strength
r only measures linear strength — a strong curved relationship may have r near 0
r is sensitive to outliers — a single outlier can dramatically change the value
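
The last two properties can be checked numerically. The sketch below uses made-up data and a hypothetical helper pearson_r, written with sums of squared deviations (algebraically the same as the formula above). It assumes neither variable is constant.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson's r computed from sums of products and squares of deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5


# 1. A perfect curved pattern: y = x^2, symmetric about x = 0
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x ** 2 for x in curve_x]
r_curved = pearson_r(curve_x, curve_y)

# 2. A perfectly linear pattern, then the same data with one outlier added
line_x = [1, 2, 3, 4, 5, 6]
line_y = [2, 4, 6, 8, 10, 12]
r_line = pearson_r(line_x, line_y)
r_outlier = pearson_r(line_x + [7], line_y + [0])

print(r_curved, r_line, r_outlier)
```

Here the curved data give r = 0 even though y is completely determined by x, and adding the single point (7, 0) to a perfect line drops r from 1 to 0.25.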

Technology tip (ClassPad): Enter data in two lists, then use Statistics → Calc → Linear Reg to find r. On a TI-84, r is displayed only when DiagnosticOn is enabled; the ClassPad shows r without this setting.

5. Correlation vs Causation

This is one of the most important critical thinking concepts in statistics. A correlation tells us that two variables tend to move together — it says nothing about why. Classic examples:

Ice cream sales and drowning rates are positively correlated — both increase in summer. Hot weather is the lurking variable.
Countries with more televisions per household have higher life expectancy — wealth explains both.
Shoe size and reading ability are correlated in primary school children — age explains both.

To establish causation, we need a controlled experiment where the explanatory variable is deliberately manipulated while all other factors are held constant. Observational studies (like most real-world data collection) can only establish association, not causation.
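
A lurking variable can be illustrated with synthetic data. In the sketch below, every number is made up: age drives both shoe size and reading score (both in arbitrary units), and within each age group the two are deliberately arranged to be unrelated. The helper pearson_r is hypothetical, and comparing children of the same age is just one simple way to hold the lurking variable fixed in observational data; it is not a substitute for a controlled experiment.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson's r computed from sums of products and squares of deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5


# Synthetic (made-up) data: age drives both shoe size and reading score.
# Within each age, the offsets are arranged so the two variables are unrelated.
offsets = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
children = [
    (age, age + shoe_off, 10 * age + 5 * read_off)
    for age in range(6, 12)
    for shoe_off, read_off in offsets
]

shoe = [child[1] for child in children]
reading = [child[2] for child in children]
r_all = pearson_r(shoe, reading)

# Hold age fixed by looking only at the 8-year-olds
eight = [child for child in children if child[0] == 8]
r_age_8 = pearson_r([child[1] for child in eight], [child[2] for child in eight])

print(round(r_all, 2), r_age_8)
```

Across all ages r is about 0.83, which the guidelines would call strong, yet among children of the same age r is 0. The apparent link between shoe size and reading comes entirely from age.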

6. Common Errors to Avoid

Swapping axes: The explanatory variable goes on the x-axis. If you swap them, Pearson's r stays the same, but the regression line will be different.
Applying r to non-linear data: Always look at the scatterplot first. If the pattern is curved, r can badly understate the strength of the relationship and should not be used to describe it.
Ignoring outliers: Report any outliers and consider their effect on r.
Stating causation from correlation: Always use language like "tends to", "is associated with", "there is evidence of an association".
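
The swapped-axes point can be checked with a short sketch. Regression lines are not developed in this section, so the slope here uses the standard least-squares formula (sum of products of deviations divided by the sum of squared x deviations) as an assumption; the function name r_and_slope and the data are made up for illustration.

```python
from statistics import mean


def r_and_slope(xs, ys):
    """Return (Pearson's r, least-squares slope for predicting ys from xs)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    r = s_xy / (s_xx * s_yy) ** 0.5
    slope = s_xy / s_xx
    return r, slope


# Made-up data for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_xy, slope_y_on_x = r_and_slope(x, y)  # x as the explanatory variable
r_yx, slope_x_on_y = r_and_slope(y, x)  # axes swapped

print(round(r_xy, 3), round(r_yx, 3))
print(slope_y_on_x, slope_x_on_y)
```

Both orderings give the same r (about 0.775), but the slopes are 0.6 and 1.0. If swapping the axes simply redrew the same line, the second slope would be 1 ÷ 0.6 ≈ 1.67, so the two regression lines really are different.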

Mastery Practice

The following scatterplot shows data on weekly hours of exercise (x) and resting heart rate (y, beats per minute) for 12 adults.

Describe the association shown in the scatterplot in terms of direction, form and strength.

Technology gives the following Pearson's correlation coefficients for four datasets. Match each r value to its most likely description.

r value	Description
0.94	Moderate negative linear
−0.61	Strong positive linear
0.28	Weak positive linear
−0.88	Strong negative linear

A study of 20 Brisbane households records the number of people in the household (x) and the weekly electricity bill in dollars (y). The data gives r = 0.83.
1. Describe the direction and strength of the association.
2. Which variable is the explanatory variable? Justify your answer.
3. Can we conclude that having more people in a household causes higher electricity bills? Explain.

For each of the following, state whether the association would most likely be positive, negative, or near zero:
1. Distance from CBD and house price in a major city
2. Daily rainfall and number of beach visitors
3. Shoe size and IQ score
4. Age of a car (years) and its resale value ($)

A sports scientist measures sprint speed (m/s) and vertical jump height (cm) for 15 athletes. The scatterplot shows a weak positive association. Technology gives r = 0.38.
1. Interpret this r value in context.
2. An athlete with a very fast sprint speed also has an unusually low jump height. What effect would removing this outlier likely have on r? Explain.

Two variables, monthly rainfall (mm) and wheat yield (tonnes per hectare) in a Queensland farming region, have r = 0.77.
1. Describe the association fully (direction, form, strength).
2. A farmer concludes: "More rain definitely makes wheat grow better — I should build an irrigation system." Evaluate this statement statistically.
3. What additional evidence would be needed to establish causation?

A dataset has 10 points with r = 0.72. One additional data point is added that is far from the main cluster and creates a strong outlier.
1. Explain how an outlier can affect Pearson's r.
2. What should a statistician always do before reporting r?

A researcher investigating childhood development finds that shoe size and reading ability in primary school children are strongly positively correlated (r = 0.81).
1. Identify the likely lurking variable explaining this correlation.
2. Explain why this does not mean bigger feet help children read better.
3. Suggest how the researcher could control for this lurking variable.

The table shows data for study hours (x) and exam score out of 100 (y) for 8 students.

Hours (x)	2	3	4	5	6	7	8	9
Score (y)	45	52	58	63	71	74	82	88

Plot these points on a scatterplot (identify the explanatory variable first).
Describe the association.
Use technology to calculate r and interpret it in context.

A health researcher publishes the headline: "Eating chocolate daily is associated with winning Nobel Prizes — countries with high chocolate consumption have more Nobel laureates per capita (r = 0.79)."
1. Is the correlation coefficient of 0.79 meaningful evidence that chocolate consumption boosts intelligence? Justify your answer.
2. Suggest at least two lurking variables that might explain this correlation.
3. What type of study design would be needed to establish a causal link between diet and cognitive performance?