Random Sampling and Bias

Key Terms

A population is the entire group of interest; a sample is a subset selected for study. A parameter describes a population (e.g. population mean μ); a statistic describes a sample (e.g. sample mean x̄).
Simple random sample (SRS): every possible set of n individuals is equally likely to be chosen, so each member has an equal probability of selection
Systematic: select every k-th member of the population after a random start; k = N/n
Stratified: divide population into subgroups (strata), then take a proportional SRS from each
Cluster: divide into groups, randomly select entire groups and survey everyone in them
Sampling bias: occurs when the sampling method systematically over- or under-represents part of the population, producing unrepresentative results

Stratified Sample Size Formula:
Sample size from stratum = (Stratum size / Population size) × Total sample size

Systematic Sampling:
Sampling interval k = N / n (if this is not a whole number, round down so the selection does not run past the end of the list)
Select a random start between 1 and k, then every k-th member

Worked Example (Stratified Sample): A company has 300 full-time and 200 part-time staff. A stratified sample of 50 is needed.

Full-time: (300/500) × 50 = 30 employees
Part-time: (200/500) × 50 = 20 employees
Total: 30 + 20 = 50 ✓
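
The same calculation can be sketched in Python. This is a minimal illustration, not a standard routine: the function name proportional_allocation and the dictionary layout are assumptions made for this example, and Fraction is used only to keep the arithmetic exact.

```python
from fractions import Fraction


def proportional_allocation(stratum_sizes, total_sample):
    """Sample size per stratum: (stratum size / population size) x total sample size."""
    population = sum(stratum_sizes.values())
    return {
        name: Fraction(size, population) * total_sample
        for name, size in stratum_sizes.items()
    }


# The staff numbers from the worked example
staff = {"full-time": 300, "part-time": 200}
allocation = proportional_allocation(staff, 50)

for name, size in allocation.items():
    print(name, size)
print("total", sum(allocation.values()))
```

If a share does not come out as a whole number, it has to be rounded, and the rounded values should be checked so that they still add to the total sample size.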

Hot Tip: Know when to use each method. Stratified sampling is best when the population has distinct subgroups and you want each represented proportionally. Cluster sampling is practical when the population is geographically spread. Systematic sampling is quick for ordered lists. The key exam skill is identifying what bias, if any, is introduced by a given method.

Populations and Samples

In statistics, the population is the complete collection of individuals or objects we want to draw conclusions about. In practice, populations are often too large, too costly, or too time-consuming to study entirely. We therefore select a sample — a manageable subset — and use it to draw inferences about the population.

Numbers that describe a population are called parameters. They are fixed (though often unknown). Numbers calculated from a sample are called statistics. Statistics vary from sample to sample and are used to estimate parameters. For example, the population proportion p is a parameter; the sample proportion p̂ is the corresponding statistic.
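
The difference between a fixed parameter and a varying statistic can be seen in a small simulation. The sketch below uses an invented population of 10,000 people in which exactly 6,000 answer "yes"; the population, the sample size of 100, and the name sample_proportion are all assumptions made for illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the run can be repeated

# Hypothetical population: 1 = "yes", 0 = "no"
population = [1] * 6_000 + [0] * 4_000
p = sum(population) / len(population)  # parameter: fixed at 0.6


def sample_proportion(n):
    """Statistic: the proportion of "yes" answers in a random sample of size n."""
    sample = rng.sample(population, n)
    return sum(sample) / n


# Five separate samples give five estimates of the same parameter
p_hats = [sample_proportion(100) for _ in range(5)]

print("parameter p =", p)
for i, p_hat in enumerate(p_hats, start=1):
    print(f"sample {i}: p-hat = {p_hat:.2f}")
```

The value of p never changes, while each p̂ depends on which 100 people happened to be drawn.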

The goal of a good sampling method is to produce a sample that is representative of the population — one from which valid inferences can be made.

Simple Random Sampling

In a simple random sample (SRS) of size n, every possible subset of n individuals from the population is equally likely to be selected. This gives every member of the population an equal chance of inclusion and eliminates deliberate or unconscious selection bias.

In practice, an SRS can be taken by assigning every population member a number, then using a random number generator or table to select n numbers. The individuals corresponding to those numbers form the sample.
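
The numbering procedure above might look like this in Python. It is a sketch under stated assumptions: the function name simple_random_sample and the population size of 500 are invented for the example, and random.sample from the standard library plays the role of the random number generator.

```python
import random


def simple_random_sample(population_size, n, seed=None):
    """Number the members 1..N and draw n distinct numbers without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), n))


# Hypothetical population of 500 numbered members, sample of 10
chosen = simple_random_sample(500, 10, seed=42)
print(chosen)
```

Because the draw is made without replacement, no member can be chosen twice, and every set of 10 numbers is equally likely.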

Systematic Sampling

In systematic sampling, the population of size N is listed in some order. We compute the sampling interval k = N/n, choose a random starting point between 1 and k, then select every k-th individual thereafter.
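
As a rough sketch, the procedure can be written in a few lines of Python. The function name systematic_sample and the list size of 240 are assumptions for illustration, and the code rounds k down when N/n is not a whole number so that the last selection stays inside the list.

```python
import random


def systematic_sample(N, n, seed=None):
    """Return the list positions (1-based) chosen by systematic sampling."""
    rng = random.Random(seed)
    k = N // n                 # sampling interval, rounded down (an assumption)
    start = rng.randint(1, k)  # random start between 1 and k inclusive
    return [start + i * k for i in range(n)]


# Hypothetical ordered list of 240 members, sample of 12, so k = 20
positions = systematic_sample(240, 12, seed=7)
print(positions)
```

Only the starting point is random; once it is chosen, every other position is fixed by the interval.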

Potential weakness: if the list has a periodic pattern with period k, the sample may systematically include (or exclude) items sharing a particular characteristic, introducing bias.
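
A small demonstration of this weakness, using an invented list of daily records: the list repeats every 7 entries (one per weekday), and the sampling interval also works out to 7. The data and variable names are hypothetical.

```python
import random

# Hypothetical ordered list: 52 weeks of daily records, labelled by weekday
weekdays = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
records = [weekdays[i % 7] for i in range(364)]  # N = 364

n = 52
k = len(records) // n         # k = 7, the same as the period of the list
start = random.randint(1, k)  # random start between 1 and k
sample = [records[start - 1 + i * k] for i in range(n)]

print(set(sample))  # a single weekday, whichever start was drawn
```

Every record in the sample falls on the same weekday, so anything that differs between weekdays (for example weekend versus midweek activity) is misrepresented.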

Stratified Sampling

In stratified sampling, the population is divided into non-overlapping subgroups called strata (e.g. by year level, gender, or region). An SRS is taken from each stratum. The number sampled from each stratum is usually proportional to the stratum’s share of the population.

Advantage: all subgroups are represented, giving more precise estimates when subgroups differ substantially in the characteristic being measured.
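
One way to sketch the whole procedure in Python is shown below. The function name stratified_sample, the three regions, and their sizes are invented for the example, and the code assumes the proportional shares come out as whole numbers.

```python
import random


def stratified_sample(strata, total_sample, seed=None):
    """Take a proportional SRS from each stratum."""
    rng = random.Random(seed)
    population = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional share; this sketch assumes the share is a whole number
        n_stratum = len(members) * total_sample // population
        sample[name] = rng.sample(members, n_stratum)  # SRS within the stratum
    return sample


# Hypothetical population of 1000 split into three regions
strata = {
    "north": [f"N{i}" for i in range(400)],
    "south": [f"S{i}" for i in range(350)],
    "west": [f"W{i}" for i in range(250)],
}
picked = stratified_sample(strata, 40, seed=3)

for name, members in picked.items():
    print(name, len(members))
```

Each region contributes to the sample in proportion to its size, so no region can be left out by chance.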

Cluster Sampling

In cluster sampling, the population is divided into groups (clusters) — often by geography or natural grouping (e.g. schools, suburbs). A random selection of clusters is chosen, and every member of each chosen cluster is surveyed.

Advantage: practical and cost-effective when the population is spread over a large area. Disadvantage: members within a cluster often resemble each other more than the wider population, which can reduce the statistical precision compared with an SRS of the same total size.
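
Cluster sampling can be sketched in the same style. The function name cluster_sample and the population of 8 schools with 25 students each are assumptions made for illustration.

```python
import random


def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly choose whole clusters, then include every member of each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)
    members = [person for name in chosen for person in clusters[name]]
    return chosen, members


# Hypothetical population: 8 schools with 25 students each
clusters = {
    f"school_{s}": [f"school_{s}_student_{i}" for i in range(25)]
    for s in range(1, 9)
}
chosen, members = cluster_sample(clusters, 3, seed=5)

print(chosen)
print(len(members), "students surveyed")
```

The randomness applies to the choice of clusters only; within a chosen cluster nobody is left out.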

Sampling Bias

Sampling bias occurs when the method of selecting the sample systematically favours some individuals over others, making the sample unrepresentative. Common sources include:

Voluntary response bias: respondents choose whether to take part, so those who feel strongly (often negatively) are over-represented
Convenience sampling: selecting whoever is easiest to reach, rather than a random cross-section
Undercoverage: some groups in the population have little or no chance of being selected
Non-response bias: people who do not respond may differ systematically from those who do
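
The effect of one of these sources, undercoverage, can be illustrated with a simulation. Everything here is invented for the example: a workforce of 800 day-shift and 200 night-shift workers, with hours of sleep generated from assumed averages of 7.5 and 6.0 hours.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run can be repeated

# Hypothetical population: hours of sleep for day-shift and night-shift workers
day = [rng.gauss(7.5, 0.5) for _ in range(800)]
night = [rng.gauss(6.0, 0.5) for _ in range(200)]
population = day + night
true_mean = statistics.mean(population)

# Undercoverage: the survey runs only during the day, so night-shift
# workers have no chance of being selected
biased_sample = rng.sample(day, 100)

# Comparison: an SRS of the same size from the whole population
srs = rng.sample(population, 100)

print(f"population mean:     {true_mean:.2f}")
print(f"day-only sample:     {statistics.mean(biased_sample):.2f}")
print(f"SRS of all workers:  {statistics.mean(srs):.2f}")
```

The day-only sample overestimates the mean because the group that sleeps less is missing, and taking a larger sample from the same restricted list would not correct this.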

A census surveys every member of the population, so no bias arises from how the sample is selected (although non-response can still distort the results). However, a census is expensive, slow, and sometimes impossible (e.g. destructive testing). A well-designed random sample is usually preferred.

Exam technique: When asked to identify bias, name the specific type of bias and explain how it causes the sample to be unrepresentative and in which direction it skews the results. When asked to design a sampling strategy, state the method, explain why it suits the context, and show any relevant calculations (e.g. stratum sizes).

Mastery Practice

Fluency: Distinguish between a parameter and a statistic. Give one example of each in the context of estimating the average commute time for all workers in Brisbane.
Fluency: A school has 600 students: 200 in Year 10, 250 in Year 11, and 150 in Year 12. A stratified sample of 60 students is needed. How many students should be selected from each year level?
Fluency: Describe, step by step, how to take a systematic sample of 50 employees from a list of 1000 employees.
Fluency: Give one advantage and one disadvantage of cluster sampling compared with simple random sampling.
Understanding: An online survey of social media users reports that 90% of respondents spend more than 2 hours per day online. Identify the type of sampling bias present and explain how it makes this result unrepresentative.
Understanding: A factory tests quality by inspecting every 20th item off a production line. (a) What type of sampling is this? (b) Describe a situation in which this could introduce bias.
Understanding: Why might a researcher choose a census over a sample? Give one reason. Then give one reason a sample would be preferred over a census.
Understanding: A researcher selects 5 classrooms at random from 30 classrooms in a school, then surveys every student in those 5 classrooms. What type of sampling method is this? Explain why the results might not generalise well to all students at the school.
Problem Solving: A political poll samples 400 people and finds 52% support Party A. (a) Could this result be due to sampling variability even if exactly 50% of the population supports Party A? Explain. (b) If the researcher wants to halve the margin of error, what sample size is needed?
Problem Solving: Design a complete sampling strategy to estimate the mean hours of exercise per week for students at a large university with 4 faculties of very different sizes (Faculty A: 5000 students, Faculty B: 3000 students, Faculty C: 1500 students, Faculty D: 500 students). Target sample size: 200. Justify your choice of method and show all calculations.
