Interpreting Bivariate Data
Key Ideas
Key Terms
- Sample size
- Small samples (e.g. <20) produce less reliable conclusions. Large, representative samples give more confidence.
- Confounding variables
- Hidden third variables that may explain both x and y, making the association misleading.
- Bias
- Bias in data collection (e.g. only surveying one group) means the data may not represent the population.
- Ethical data use
- Data should not be cherry-picked, manipulated, or presented in a misleading way. Scales and ranges should be chosen fairly.
Writing a Statistical Conclusion
A strong conclusion names both variables, states the direction and strength of the relationship, acknowledges limitations (sample size, possible confounders), and avoids claiming causation unless a controlled experiment was conducted.
Worked Example
Question: A survey of 15 Year 9 students found a moderate positive correlation between the number of books read per month and performance on a reading comprehension test. Write a statistical conclusion and identify two limitations.
Statistical Conclusion:
The data suggests a moderate positive association between the number of books read per month and reading comprehension test scores among Year 9 students. Students who read more books per month tended to score higher on the test.
Limitations:
1. The sample size of 15 is very small, making it difficult to generalise the findings to all Year 9 students.
2. The survey does not establish causation — students with stronger reading ability may choose to read more books rather than reading causing the improvement.
Making Predictions from the Line of Best Fit
The line of best fit is a tool for making predictions. Once you have the equation of the line (y = mx + b), substitute a given x-value to predict the y-value, or rearrange to find x for a given y-value.
Example: A study finds the equation y = 4x + 30, where x = hours of revision and y = exam score (%). Predict the score for a student who revises 7 hours: y = 4(7) + 30 = 28 + 30 = 58%. Always state the units and context in your answer.
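The substitution and rearrangement described above can be sketched in Python. This is a minimal illustration, assuming the example equation y = 4x + 30; the function names `predict_score` and `hours_for_score` are made up for this sketch and are not part of the study.

```python
# Example line of best fit from the text: y = 4x + 30
GRADIENT = 4      # m: change in predicted score per hour of revision
INTERCEPT = 30    # b: predicted score (%) when x = 0


def predict_score(hours):
    """Substitute x (hours of revision) to predict y (exam score, %)."""
    return GRADIENT * hours + INTERCEPT


def hours_for_score(score):
    """Rearrange y = mx + b to x = (y - b) / m to find x for a given y."""
    return (score - INTERCEPT) / GRADIENT


print(predict_score(7))     # 58
print(hours_for_score(70))  # 10.0
```

In words: a student who revises for 7 hours is predicted to score 58%, and a predicted score of 70% corresponds to 10 hours of revision.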
Interpreting the Gradient in Context
The gradient (m) in y = mx + b represents the rate of change: how much y increases (or decreases) for each 1-unit increase in x.
Example: In y = 4x + 30 (score vs revision hours), the gradient is 4. This means that for each additional hour of revision, the predicted exam score increases by 4 percentage points. The gradient has units: percentage points per hour.
A negative gradient means y decreases as x increases. For example, if more hours of TV are associated with lower scores, the gradient would be negative.
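One way to check the meaning of the gradient is to compare two predictions that are one unit of x apart. The sketch below assumes the example equation y = 4x + 30. The second equation, with gradient -2, is invented here purely to illustrate a negative gradient and does not come from any data.

```python
def predict(x, m, b):
    """Predicted y on the line y = mx + b."""
    return m * x + b


# Example from the text: score = 4 * hours + 30
rise = predict(6, m=4, b=30) - predict(5, m=4, b=30)
print(rise)  # 4: one extra hour of revision adds 4 to the predicted score

# Hypothetical negative-gradient line (invented for illustration only):
# score = -2 * tv_hours + 80
fall = predict(3, m=-2, b=80) - predict(2, m=-2, b=80)
print(fall)  # -2: one extra hour of TV lowers the predicted score by 2
```

Whichever pair of x-values one unit apart is chosen, the difference between the two predictions equals the gradient.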
Interpreting the y-Intercept in Context
The y-intercept (b) is the predicted value of y when x = 0. This is the "starting value."
Example: In y = 4x + 30, the y-intercept is 30. This means a student who does no revision is predicted to score 30%. However, always consider whether the y-intercept makes sense in context — sometimes x = 0 is outside the range of data (extrapolation), making the y-intercept less meaningful.
Limitations of Predictions
Predictions made from a line of best fit have important limitations:
1. Extrapolation is unreliable: the trend may not continue outside the data range.
2. Scatter reduces accuracy: the more spread out the data, the less precise the predictions.
3. Correlation is not causation: even if the model predicts well, there may be other explanations for the relationship.
4. Outliers can distort where the line is drawn.
A good answer will make the prediction and then comment on its reliability.
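A prediction can be paired with a simple reliability check. The sketch below assumes the example equation y = 4x + 30 and a hypothetical data range of 2 to 12 hours of revision. The text does not state the actual range, so `X_MIN` and `X_MAX` are placeholders, and `predict_with_comment` is a name made up for this sketch.

```python
# Hypothetical range of x-values in the data set (not given in the text)
X_MIN, X_MAX = 2, 12


def predict_with_comment(hours, m=4, b=30):
    """Return the predicted score and whether it is interpolation or extrapolation."""
    score = m * hours + b
    if X_MIN <= hours <= X_MAX:
        kind = "interpolation"
    else:
        kind = "extrapolation"
    return score, kind


print(predict_with_comment(7))   # (58, 'interpolation')
print(predict_with_comment(20))  # (110, 'extrapolation')
```

Under these assumed bounds, the prediction for 20 hours is 110%, which is not a possible exam score. This shows how a trend can break down outside the data range.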
Critically Evaluating Bivariate Data Claims
When someone makes a claim based on a scatter plot (e.g. "studying more causes better grades"), think critically: Was the sample representative? Was the data collected carefully? Could a third variable explain the trend? Is the prediction based on interpolation or extrapolation?
A strong correlation from a large, representative, carefully collected data set gives more confidence in predictions than a weak correlation from a small or biased sample.
Mastery Practice
-
Read each data description and answer the questions. Fluency
- A scatter plot of 40 adults shows a strong positive correlation between weekly exercise hours (x) and self-reported energy levels (y, out of 10).
- What does the correlation tell us about the relationship?
- Can we conclude that exercise causes higher energy? Explain.
- A researcher surveys 8 patients and finds a moderate negative correlation between daily sodium intake (x, mg) and kidney function score (y).
- State one concern about the sample used.
- Should the researcher recommend a nationwide salt reduction campaign based on this data alone? Why or why not?
- A company plots employee age (x) against annual sick days taken (y) for their 200 employees and finds almost no correlation.
- What does “almost no correlation” tell us?
- Should the company use age to predict sick days? Explain.
-
Evaluate each statistical conclusion. Identify what is correct and what is flawed. Understanding
- A study of 500 teenagers found a moderate negative correlation between hours of sleep and frequency of headaches. Conclusion written: “This study proves that sleeping more cures headaches in teenagers.”
- A scatter plot of 12 data points shows a weak positive correlation between coffee consumption and productivity. Conclusion: “There is very limited evidence of a positive association between coffee intake and productivity. The small sample and weak correlation mean further research with a larger group is needed before any recommendations can be made.”
- A graph showing national chocolate consumption vs number of Nobel Prize winners per million people shows a strong positive correlation. Conclusion: “Nations should encourage citizens to eat more chocolate to win more Nobel Prizes.”
-
Identify the main problem with each data collection method and explain how it affects the reliability of the bivariate analysis. Understanding
- A researcher asks students at an elite private school to report their study hours and test scores to investigate whether study time predicts grades in Queensland students generally.
- A gym owner surveys only people currently at the gym about their exercise habits and weight, to understand the general population’s activity levels.
- A company uses data from only the five best-performing sales months to create a scatter plot showing the relationship between advertising spend and sales revenue.
- An online poll asks: “How many hours of social media do you use per day?” to investigate whether social media use predicts academic performance. The poll is posted on a social media platform.
-
For each situation, identify a likely confounding variable and explain how it might create a misleading association. Understanding
- A positive correlation is found between the number of fire trucks at a fire and the amount of damage caused.
- A negative correlation is found between the number of dentists per capita in a country and the rate of tooth decay.
- A positive correlation is found between a child’s shoe size and their reading ability.
- A positive correlation is found between the number of hospitals in a region and the death rate in that region.
-
Write a full statistical conclusion for each described study. Your conclusion should include: direction and strength of relationship, at least two limitations, and a recommendation for further research. Problem Solving
- A Year 9 class of 25 students recorded their hours of homework per night and their score on a mathematics test. A scatter plot showed a moderate positive correlation, though three students who did a lot of homework scored low, and two who did very little scored high.
- A public health researcher measured daily steps taken (from fitness trackers) and systolic blood pressure for 300 adults aged 40–60. The scatter plot showed a moderate negative correlation (r ≈ −0.55). The data was collected from volunteers who had access to fitness trackers.
-
Ethical and critical data analysis. Problem Solving
- A food manufacturer shows a scatter plot with the y-axis starting at 90 (not 0) to display the relationship between their product’s sales and “customer satisfaction scores.” The line looks steeply positive.
- Explain how starting the y-axis at 90 instead of 0 creates a misleading impression.
- What should a reader always check when viewing a scatter plot?
- A political party publishes a graph showing only the last 3 months of economic data to argue that their policies improved employment, when the 12-month picture shows a flat trend.
- Identify the ethical problem with this presentation.
- How does cherry-picking data affect the reliability of a statistical conclusion?
- Design a brief plan for collecting bivariate data to investigate whether sleep duration is related to reaction time in Queensland Year 9 students. Include: what data to collect, how to measure it, how many participants you would need, and how you would avoid bias.
-
Interpreting r values. The correlation coefficient r measures the strength and direction of a linear association. r = 1 is perfect positive, r = −1 is perfect negative, r = 0 is no linear correlation. Problem Solving
- A dataset of 50 values yields r = 0.87. Describe the strength and direction of the association. Is it appropriate to say that one variable causes the other? Explain.
- A study of 200 patients gives r = −0.43 between hours of sleep and number of doctor visits per year. (i) Describe this correlation. (ii) A newspaper headline reads “Sleep deprivation sends people to the doctor.” Evaluate this headline using the r value and the concept of causation.
- Two datasets both have r = 0.6. Dataset A has 15 data points; Dataset B has 150 data points. Which dataset provides stronger evidence of a genuine association? Why does sample size matter when interpreting r?
- A scatter plot has r ≈ 0. Does this mean there is definitely no relationship between the two variables? Give an example of a case where r = 0 but a clear relationship still exists.
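For reference, the coefficient r described in this question can be computed directly from paired data. The following is a minimal sketch of the standard Pearson formula. The function name `pearson_r` and the five data pairs are invented for illustration and do not come from any study in this text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired lists of equal length.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: hours of revision and exam score (%)
hours = [1, 2, 3, 4, 5]
scores = [35, 41, 40, 48, 50]

print(round(pearson_r(hours, scores), 2))  # 0.95
```

The sign of the result gives the direction of the association and its size gives the strength. The value alone says nothing about causation or about how representative the sample is.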
-
Comparing studies. Read both study descriptions carefully, then answer the questions that follow. Problem Solving
Study A: 25 Year 10 students at one Brisbane school self-reported their average daily screen time and their most recent NAPLAN reading score. r = −0.62.
Study B: 1200 Year 10 students randomly sampled from 40 Queensland schools completed an objective screen-time log over 2 weeks and were given a standardised reading assessment. r = −0.58.
- Both studies show a similar r value. Explain why Study B’s findings are more reliable despite having a slightly weaker correlation.
- Identify one source of bias in Study A that Study B avoids. Explain how that bias could affect the r value.
- A school principal reads Study A and immediately bans all screens during recess to improve reading scores. Evaluate this decision from a statistical perspective, identifying at least two concerns.
- Suggest one confounding variable that might explain the negative correlation seen in both studies, and explain its likely effect.
-
Statistical report. Read the scenario and write a full statistical report, as if advising a decision-maker. Problem Solving
A sports science researcher collects data from 80 recreational runners. She records their average weekly running distance (km) and their resting heart rate (beats per minute). The scatter plot shows a moderate negative correlation (r = −0.54). Data was collected from runners who responded to a social media post in a running group.
- Write a statistical conclusion stating the direction, strength, and what the correlation means in plain language.
- Identify two limitations of this study’s data collection method.
- Identify one likely confounding variable and explain its effect.
- A fitness company wants to use this data in an advertisement claiming “running makes your heart healthier.” Write a brief critique of this advertising claim from a statistical perspective.
-
Design and critique. Think critically about how bivariate studies are designed and presented. Problem Solving
- A researcher wants to study whether the number of books in a household (x) is associated with children’s vocabulary scores (y). Design a brief study plan including: what data to collect and how, sample size and sampling method, and two steps to minimise bias.
- A news article shows a scatter plot of two variables with a strong positive correlation (r = 0.91) and concludes that one variable causes the other. List three questions a statistically literate reader should ask before accepting this conclusion.
- A dataset shows a strong positive correlation between the number of swimming pools in a suburb and the number of skin cancer cases. (i) Identify the most likely confounding variable. (ii) Explain why this confounding variable, rather than swimming pools, is the more likely driver of both trends. (iii) What type of study would be needed to investigate a causal link between UV exposure and skin cancer?