Random Sampling and Bias
Key Terms
- A population is the entire group of interest; a sample is a subset selected for study
- A parameter describes a population (e.g. population mean μ); a statistic describes a sample (e.g. sample mean x̄)
- Simple random sample (SRS): every possible set of n individuals is equally likely to be chosen; each member has equal probability of selection
- Systematic: select every k-th member of the population after a random start; k = N/n
- Stratified: divide population into subgroups (strata), then take a proportional SRS from each
- Cluster: divide into groups, randomly select entire groups and survey everyone in them
- Sampling bias: occurs when the sampling method systematically over- or under-represents part of the population, producing unrepresentative results
Stratified Sampling:
Sample size from stratum = (Stratum size / Population size) × Total sample size
Systematic Sampling:
Sampling interval k = N / n (round to nearest whole number)
Select a random start between 1 and k, then every k-th member
Worked example: a stratified sample of 50 from 500 employees (300 full-time, 200 part-time)
Full-time: (300/500) × 50 = 30 employees
Part-time: (200/500) × 50 = 20 employees
Total: 30 + 20 = 50 ✓
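The allocation arithmetic above can be checked with a short Python sketch. The function name `proportional_allocation` and the dictionary layout are illustrative assumptions, not part of the notes. Results are rounded to whole people, so for other inputs the rounded counts may not add up exactly to the target and may need adjusting by hand.

```python
def proportional_allocation(stratum_sizes, total_sample_size):
    """Return the number to sample from each stratum, in proportion to its size."""
    population_size = sum(stratum_sizes.values())
    return {
        # (stratum size / population size) x total sample size, rounded to a whole person
        name: round(size * total_sample_size / population_size)
        for name, size in stratum_sizes.items()
    }


allocation = proportional_allocation({"Full-time": 300, "Part-time": 200}, 50)
print(allocation)                 # {'Full-time': 30, 'Part-time': 20}
print(sum(allocation.values()))   # 50
```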
Populations and Samples
In statistics, the population is the complete collection of individuals or objects we want to draw conclusions about. In practice, populations are often too large, too costly, or too time-consuming to study entirely. We therefore select a sample — a manageable subset — and use it to draw inferences about the population.
Numbers that describe a population are called parameters. They are fixed (though often unknown). Numbers calculated from a sample are called statistics. Statistics vary from sample to sample and are used to estimate parameters. For example, the population proportion p is a parameter; the sample proportion p̂ is the corresponding statistic.
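The difference between a fixed parameter and a varying statistic can be seen in a small simulation. This is a sketch on a made-up population: the population size, the 40% share, and the sample size of 100 are all assumptions chosen for illustration.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 individuals, 4,000 of whom have some attribute (coded 1).
population = [1] * 4_000 + [0] * 6_000

# Parameter: calculated from the whole population, so it is one fixed number.
p = sum(population) / len(population)


def sample_proportion(pop, n):
    """Statistic: the proportion with the attribute in one random sample of size n."""
    sample = random.sample(pop, n)
    return sum(sample) / n


# Five separate samples give five estimates of p, which generally differ.
p_hats = [sample_proportion(population, 100) for _ in range(5)]

print("parameter p =", p)
print("statistics p-hat =", p_hats)
```

Each value of p̂ is an estimate of the same p; the spread between them is sampling variability.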
The goal of a good sampling method is to produce a sample that is representative of the population — one from which valid inferences can be made.
Simple Random Sampling
In a simple random sample (SRS) of size n, every possible subset of n individuals from the population is equally likely to be selected. This gives every member of the population an equal chance of inclusion and eliminates deliberate or unconscious selection bias.
In practice, an SRS can be taken by assigning every population member a number, then using a random number generator or table to select n numbers. The individuals corresponding to those numbers form the sample.
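The numbering procedure just described might look like this in Python. The function name, the population size of 500, and the sample size of 10 are assumptions for illustration; `random.sample` draws without replacement, so no label is chosen twice.

```python
import random


def simple_random_sample(population_size, n, seed=None):
    """Pick n distinct labels from 1..population_size, every subset being equally likely."""
    rng = random.Random(seed)
    labels = range(1, population_size + 1)   # each member is assigned a number
    return sorted(rng.sample(labels, n))     # draw n numbers without replacement


chosen = simple_random_sample(population_size=500, n=10, seed=42)
print(chosen)  # the members with these numbers form the sample
```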
Systematic Sampling
In systematic sampling, the population of size N is listed in some order. We compute the sampling interval k = N/n, choose a random starting point between 1 and k, then select every k-th individual thereafter.
Potential weakness: if the list has a periodic pattern with period k, the sample may systematically include (or exclude) items sharing a particular characteristic, introducing bias.
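A minimal sketch of systematic sampling, followed by an example of the periodic-pattern weakness. The function name and the weekday list are assumptions for illustration. This sketch rounds k down (integer division), a choice made here so that the list always yields at least n selections.

```python
import random


def systematic_sample(items, n, seed=None):
    """Take every k-th item after a random start between 1 and k."""
    k = len(items) // n                          # sampling interval, rounded down in this sketch
    start = random.Random(seed).randint(1, k)    # random start, 1 to k inclusive
    # Positions in the notes are 1-based; Python lists are 0-based, hence start - 1.
    return items[start - 1::k][:n]


# A list with period 7 sampled with k = 7: every selected day is the same weekday.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] * 20   # N = 140
picked = systematic_sample(days, n=20, seed=0)
print(len(picked), set(picked))   # 20 days, but only one distinct weekday
```

Because the interval matches the period of the list, the random start decides which weekday is sampled and every other weekday is excluded.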
Stratified Sampling
In stratified sampling, the population is divided into non-overlapping subgroups called strata (e.g. by year level, gender, or region). An SRS is taken from each stratum. The number sampled from each stratum is usually proportional to the stratum’s share of the population.
Advantage: all subgroups are represented, giving more precise estimates when subgroups differ substantially in the characteristic being measured.
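Drawing the sample itself, stratum by stratum, could be sketched as follows. It reuses the 300 full-time and 200 part-time figures from the worked example; the member IDs (`FT1`, `PT1`, ...) and the function name are hypothetical.

```python
import random


def stratified_sample(strata, total_sample_size, seed=None):
    """Take an SRS from each stratum, sized in proportion to that stratum."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional share of the total sample, rounded to a whole number.
        n_stratum = round(len(members) * total_sample_size / population_size)
        sample[name] = rng.sample(members, n_stratum)   # SRS within the stratum
    return sample


strata = {
    "Full-time": [f"FT{i}" for i in range(1, 301)],   # 300 hypothetical IDs
    "Part-time": [f"PT{i}" for i in range(1, 201)],   # 200 hypothetical IDs
}
sample = stratified_sample(strata, total_sample_size=50, seed=7)
print({name: len(chosen) for name, chosen in sample.items()})
# {'Full-time': 30, 'Part-time': 20}
```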
Cluster Sampling
In cluster sampling, the population is divided into groups (clusters) — often by geography or natural grouping (e.g. schools, suburbs). A random selection of clusters is chosen, and every member of each chosen cluster is surveyed.
Advantage: practical and cost-effective when the population is spread over a large area. Disadvantage: members within a cluster often resemble each other more than the wider population, which can reduce the statistical precision compared with an SRS of the same total size.
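Cluster sampling randomises the choice of groups, not of individuals, as this sketch shows. The setup of 30 classrooms with 25 students each is a made-up example, and the function and ID names are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every member of each chosen cluster."""
    rng = random.Random(seed)
    chosen_names = rng.sample(sorted(clusters), n_clusters)   # random selection of clusters
    return {name: clusters[name] for name in chosen_names}    # everyone in them is surveyed


# Hypothetical school: 30 classrooms, 25 students in each.
classrooms = {
    f"Room {r}": [f"R{r}-S{s}" for s in range(1, 26)]
    for r in range(1, 31)
}
surveyed = cluster_sample(classrooms, n_clusters=5, seed=3)
print(sorted(surveyed))                                    # the 5 chosen classrooms
print(sum(len(group) for group in surveyed.values()))      # 125 students in total
```

Only 5 random choices are made, even though 125 students are surveyed, which is one way to see why precision can be lower than for an SRS of 125.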
Sampling Bias
Sampling bias occurs when the method of selecting the sample systematically favours some individuals over others, making the sample unrepresentative. Common sources include:
- Voluntary response bias: respondents choose themselves, so individuals who feel strongly (often negatively) are over-represented
- Convenience sampling: selecting whoever is easiest to reach, rather than a random cross-section
- Undercoverage: some groups in the population have little or no chance of being selected
- Non-response bias: people who do not respond may differ systematically from those who do
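The effect of undercoverage can be illustrated with a small simulation. Everything here is synthetic: the two groups, their sizes, and their value ranges are made-up numbers chosen only to show the mechanism.

```python
import random
import statistics

rng = random.Random(5)  # fixed seed so the run is repeatable

# Synthetic population (made-up numbers): a measured quantity for two groups.
reachable = [rng.uniform(3, 6) for _ in range(400)]     # group the sampling frame covers
unreachable = [rng.uniform(0, 3) for _ in range(600)]   # group with no chance of selection
population = reachable + unreachable

true_mean = statistics.mean(population)

# Undercoverage: selection is random, but only within the covered group.
biased_mean = statistics.mean(rng.sample(reachable, 50))

# For comparison: an SRS of the same size from the whole population.
srs_mean = statistics.mean(rng.sample(population, 50))

print(f"population mean:        {true_mean:.2f}")
print(f"undercoverage estimate: {biased_mean:.2f}")
print(f"SRS estimate:           {srs_mean:.2f}")
```

Taking a larger sample from the covered group would not fix the first estimate; the error comes from who can be selected, not from how many.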
A census surveys every member of the population, so no bias can arise from how the sample is selected (although non-response can still distort the results). However, a census is expensive, slow, and sometimes impossible (e.g. destructive testing). A well-designed random sample is usually preferred.
Mastery Practice
- Fluency: Distinguish between a parameter and a statistic. Give one example of each in the context of estimating the average commute time for all workers in Brisbane.
- Fluency: A school has 600 students: 200 in Year 10, 250 in Year 11, and 150 in Year 12. A stratified sample of 60 students is needed. How many students should be selected from each year level?
- Fluency: Describe, step by step, how to take a systematic sample of 50 employees from a list of 1000 employees.
- Fluency: Give one advantage and one disadvantage of cluster sampling compared with simple random sampling.
- Understanding: An online survey of social media users reports that 90% of respondents spend more than 2 hours per day online. Identify the type of sampling bias present and explain how it makes this result unrepresentative.
- Understanding: A factory tests quality by inspecting every 20th item off a production line. (a) What type of sampling is this? (b) Describe a situation in which this could introduce bias.
- Understanding: Why might a researcher choose a census over a sample? Give one reason. Then give one reason a sample would be preferred over a census.
- Understanding: A researcher selects 5 classrooms at random from 30 classrooms in a school, then surveys every student in those 5 classrooms. What type of sampling method is this? Explain why the results might not generalise well to all students at the school.
- Problem Solving: A political poll samples 400 people and finds 52% support Party A. (a) Could this result be due to sampling variability even if exactly 50% of the population supports Party A? Explain. (b) If the researcher wants to halve the margin of error, what sample size is needed?
- Problem Solving: Design a complete sampling strategy to estimate the mean hours of exercise per week for students at a large university with 4 faculties of very different sizes (Faculty A: 5000 students, Faculty B: 3000 students, Faculty C: 1500 students, Faculty D: 500 students). Target sample size: 200. Justify your choice of method and show all calculations.