Random Sampling and Bias — Full Worked Solutions
-
Question: Distinguish between a parameter and a statistic. Give one example of each in the context of estimating the average commute time for all workers in Brisbane.
Parameter vs Statistic
A parameter is a numerical value that describes an entire population. Parameters are fixed (they have a single true value) but are usually unknown because we cannot feasibly measure the entire population.
A statistic is a numerical value calculated from a sample. Statistics are observable but variable — different samples give different values of the statistic.
Examples in Context
Parameter: The true mean commute time μ for all workers in Brisbane. This is a fixed number, but it cannot be known without surveying every Brisbane worker.
Statistic: The sample mean x̄ calculated from a random sample of, say, 200 Brisbane workers. This is an estimate of μ and will take a slightly different value each time a new random sample is drawn.
-
Question: A school has 600 students: 200 in Year 10, 250 in Year 11, and 150 in Year 12. A stratified sample of 60 students is needed. How many students should be selected from each year level?
Setting Up the Stratified Sample
The sampling fraction is:
n/N = 60/600 = 1/10
Apply this fraction to each stratum:
Sample Sizes by Year Level
Year 10: (200/600) × 60 = (1/10) × 200 = 20 students
Year 11: (250/600) × 60 = (1/10) × 250 = 25 students
Year 12: (150/600) × 60 = (1/10) × 150 = 15 students
Verification: 20 + 25 + 15 = 60 ✓
Within each year level, use a simple random sample to select the required number of students.
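The allocation above can be checked with a short Python sketch. The function name `proportional_allocation` and the dictionary layout are illustrative choices, not part of the original solution, and the rounding check is an added safeguard for cases where the shares are not whole numbers.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to each stratum's size."""
    total = sum(strata_sizes.values())
    allocation = {
        name: round(size * sample_size / total)
        for name, size in strata_sizes.items()
    }
    # With other numbers, rounding can make the parts miss the target,
    # so fail loudly rather than return a sample of the wrong size.
    if sum(allocation.values()) != sample_size:
        raise ValueError("rounded allocation does not sum to the sample size")
    return allocation


school = {"Year 10": 200, "Year 11": 250, "Year 12": 150}
allocation = proportional_allocation(school, 60)
print(allocation)  # {'Year 10': 20, 'Year 11': 25, 'Year 12': 15}
```

Here every share is a whole number, so no rounding adjustment is needed.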
-
Question: Describe, step by step, how to take a systematic sample of 50 employees from a list of 1000 employees.
Step 1: Calculate the Sampling Interval
k = N/n = 1000/50 = 20
Step 2: Choose a Random Start
Use a random number generator to select a starting number between 1 and 20 (inclusive). Suppose the result is 7.
Step 3: Select Every k-th Employee
Select employees numbered: 7, 27, 47, 67, 87, … continuing in steps of 20 until 50 employees are selected. The last selected employee is: 7 + (50 − 1) × 20 = 7 + 980 = 987.
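Steps 1 to 3 can be sketched in Python as follows. The function `systematic_sample` and its optional `start` argument are hypothetical names introduced for illustration; passing `start=7` reproduces the worked example, while omitting it draws a random start as in Step 2.

```python
import random


def systematic_sample(population_size, sample_size, start=None):
    """Return the list positions (1-based) chosen by systematic sampling."""
    k = population_size // sample_size      # Step 1: sampling interval
    if start is None:
        start = random.randint(1, k)        # Step 2: random start in 1..k inclusive
    # Step 3: take every k-th position beginning at the start
    return [start + i * k for i in range(sample_size)]


worked = systematic_sample(1000, 50, start=7)
print(worked[:5], worked[-1])  # [7, 27, 47, 67, 87] 987

random_draw = systematic_sample(1000, 50)   # a fresh sample with a random start
```

Because the start is at most k = 20, the last selected position is at most 20 + 49 × 20 = 1000, so every choice of start stays on the list.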
Verification
The sequence 7, 27, 47, …, 987 has (987 − 7)/20 + 1 = 50 terms ✓
All selected employees are on the list (all ≤ 1000) ✓
-
Question: Give one advantage and one disadvantage of cluster sampling compared with simple random sampling.
Advantage of Cluster Sampling
Practical efficiency: When a population is geographically dispersed, cluster sampling reduces travel and administrative costs dramatically. A researcher surveys all members of a randomly chosen set of clusters rather than travelling to hundreds of individual locations. For example, surveying every student in 10 randomly selected Queensland schools is far more feasible than surveying 1000 individual students scattered across all schools statewide.
Disadvantage of Cluster Sampling
Reduced precision (within-cluster homogeneity): Members of the same cluster tend to be more similar to each other than to the broader population (e.g. students in the same school share a local community, socioeconomic background, and teaching staff). This clustering effect means that data from within a cluster contains less unique information per observation than data from an SRS. As a result, cluster samples of the same total size as an SRS typically produce less precise estimates, with wider confidence intervals.
-
Question: An online survey of social media users reports that 90% of respondents spend more than 2 hours per day online. Identify the type of sampling bias present and explain how it makes this result unrepresentative.
Type of Bias
This is an example of voluntary response bias combined with undercoverage bias.
Explanation
Voluntary response bias: The survey was completed only by people who chose to participate. People who spend a great deal of time online are precisely those most likely to encounter and complete an online survey. Those who rarely use the internet — who certainly spend less than 2 hours online per day — are very unlikely to see or respond to the survey.
Undercoverage: The survey population (online social media users) is a biased subset of the general population. People who are not social media users are completely excluded from the sample.
Effect: Both sources of bias cause the sample to heavily over-represent heavy internet users. The reported 90% almost certainly overestimates the true proportion of the general population who spend more than 2 hours per day online.
-
Question: A factory tests quality by inspecting every 20th item off a production line. (a) What type of sampling is this? (b) Describe a situation in which this could introduce bias.
(a) Type of Sampling
This is systematic sampling with sampling interval k = 20. The production line serves as an ordered list, and every 20th item is inspected.
(b) Potential for Bias
Suppose the production line operates in a 20-item cycle: for example, the machine requires re-lubrication every 20 items, and the item produced immediately after re-lubrication consistently has a slightly different thickness. If the sampling interval k = 20 coincides with this mechanical cycle, then every inspected item is the post-lubrication item. The inspection will either consistently see a particular defect (if the post-lubrication item is defective) or consistently miss a defect that only appears at other positions in the cycle.
This periodic pattern in the production process, aligned with the sampling interval, introduces systematic bias: the sample is no longer representative of all items produced.
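A small Python sketch can make the effect concrete. It assumes, purely for illustration, a run of 2000 items in which the first item of every 20-item cycle (items 1, 21, 41, …) is the affected post-lubrication item; the names `is_post_lubrication` and `inspected_rate` are hypothetical.

```python
CYCLE = 20          # assumed length of the machine's cycle
RUN_LENGTH = 2000   # assumed number of items produced


def is_post_lubrication(item_number):
    """Assumption: items 1, 21, 41, ... come straight after re-lubrication."""
    return item_number % CYCLE == 1


all_items = range(1, RUN_LENGTH + 1)
true_rate = sum(is_post_lubrication(i) for i in all_items) / len(all_items)


def inspected_rate(start):
    """Share of affected items among those inspected (every 20th from start)."""
    inspected = range(start, RUN_LENGTH + 1, CYCLE)
    return sum(is_post_lubrication(i) for i in inspected) / len(inspected)


print(true_rate)           # 0.05
print(inspected_rate(1))   # 1.0  (inspection sees only affected items)
print(inspected_rate(20))  # 0.0  (inspection never sees an affected item)
```

Under these assumptions the true share of affected items is 5%, yet the inspection reports either 100% or 0% depending on where it starts, which is the bias described above.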
-
Question: Why might a researcher choose a census over a sample? Give one reason. Then give one reason a sample would be preferred over a census.
Reason to Prefer a Census
A census surveys every member of the population and therefore produces exact population parameters with no sampling error. For decisions where every individual case matters — such as counting all votes in an election, determining who pays tax, or allocating government funding based on exact population counts — a census is necessary. A sample estimate may be close to the truth but still introduces uncertainty that cannot be eliminated.
Reason to Prefer a Sample
A well-designed sample is far less expensive, faster, and sometimes the only option. For large populations, a census is prohibitively costly and time-consuming. Moreover, some measurements are destructive: testing every light bulb until it fails, or crash-testing every car, would leave nothing to sell. In these cases a carefully drawn sample, combined with statistical inference, delivers reliable conclusions at a fraction of the cost of a census.
-
Question: A researcher selects 5 classrooms at random from 30 classrooms in a school, then surveys every student in those 5 classrooms. What type of sampling method is this? Explain why the results might not generalise well.
Sampling Method
This is cluster sampling. The 30 classrooms are the clusters; 5 were randomly selected; all students within those 5 classrooms were surveyed (a census within each cluster).
Why Results May Not Generalise Well
Within-cluster homogeneity: Students in the same classroom are often from the same year level, subject stream, or ability group. They have the same teacher and shared learning experiences. Their responses on many survey topics (e.g. study habits, subject preferences, social dynamics) may be more similar to each other than to students in other classrooms. The 5 selected classrooms may not capture the diversity of all 30.
Small number of clusters: With only 5 clusters selected from 30, there is a meaningful chance that entire year levels, subject streams, or demographic groups are not represented at all. With so few clusters, one unusual class can have an outsized effect on the results.
An SRS of the same number of students would give more reliable generalisation to the whole school, at the cost of higher administrative complexity.
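To put a rough number on the "small number of clusters" concern, the sketch below assumes a hypothetical layout that is not given in the question: 3 year levels with 10 classrooms each. It then counts, by inclusion-exclusion, how often a random choice of 5 classrooms misses at least one year level entirely.

```python
from math import comb

CLASSROOMS = 30
PER_YEAR_LEVEL = 10   # assumption: 3 year levels, 10 classrooms each
CHOSEN = 5

total_ways = comb(CLASSROOMS, CHOSEN)

# Ways to miss one particular year level: choose all 5 from the other 20 rooms.
miss_one = comb(CLASSROOMS - PER_YEAR_LEVEL, CHOSEN)
# Ways to miss two particular year levels: choose all 5 from the remaining 10.
miss_two = comb(CLASSROOMS - 2 * PER_YEAR_LEVEL, CHOSEN)

# Inclusion-exclusion over the 3 year levels (missing all 3 is impossible).
p_miss_a_year_level = (3 * miss_one - 3 * miss_two) / total_ways
print(round(p_miss_a_year_level, 3))  # roughly 0.32
```

Under this assumed layout, about one cluster sample in three would contain no students at all from some year level.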
-
Question: A political poll samples 400 people and finds 52% support Party A. (a) Could this result be due to sampling variability even if exactly 50% of the population supports Party A? Explain. (b) If the researcher wants to halve the margin of error, what sample size is needed?
(a) Sampling Variability
Yes, absolutely. Assuming p = 0.50, the standard deviation of the sample proportion is:
SD(p̂) = √(p(1 − p)/n) = √(0.50 × 0.50 / 400) = √(0.000625) = 0.025
The observed p̂ = 0.52 corresponds to:
z = (0.52 − 0.50) / 0.025 = 0.02 / 0.025 = 0.8
so it lies only 0.8 standard deviations above the assumed true proportion.
This is well within normal sampling variation: about 95% of sample proportions fall within ±2 standard deviations of the true proportion, so a result 0.8 standard deviations away is unremarkable. There is no statistical evidence here that the true proportion differs from 50%. The 52% result is entirely consistent with sampling variability.
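The same calculation can be reproduced in Python. The tail probability uses the normal approximation to the distribution of the sample proportion, which is an added assumption here (reasonable for n = 400 and p = 0.50) rather than something stated in the solution.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.50      # assumed true proportion
n = 400        # sample size
p_hat = 0.52   # observed sample proportion

sd = sqrt(p0 * (1 - p0) / n)   # standard deviation of the sample proportion
z = (p_hat - p0) / sd          # distance from p0 in standard deviations

# Normal approximation: chance of a sample proportion at least this high
upper_tail = 1 - NormalDist().cdf(z)

print(round(sd, 3), round(z, 2), round(upper_tail, 2))  # 0.025 0.8 0.21
```

Under this approximation, roughly one sample in five would show 52% or more support even when the true figure is exactly 50%.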
(b) Required Sample Size to Halve the Margin of Error
The margin of error is proportional to 1/√n:
E ∝ 1/√n
To halve E, we need √n to double, which requires n to quadruple:
New n = 4 × 400 = 1600 people
Verification: √1600 = 40 = 2 × √400 = 2 × 20 ✓ (double the original √n, so half the margin of error)
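The scaling can also be checked numerically. The sketch below assumes the usual margin-of-error formula z × √(p(1 − p)/n) with z = 1.96 (95% confidence) and p = 0.5; neither value is specified in the question, and the ratio of the two margins does not depend on them.

```python
from math import sqrt


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion (assumed 95% level)."""
    return z * sqrt(p * (1 - p) / n)


e_400 = margin_of_error(400)
e_1600 = margin_of_error(1600)
ratio = e_1600 / e_400

print(round(e_400, 4), round(e_1600, 4), round(ratio, 2))  # 0.049 0.0245 0.5
```

Quadrupling the sample from 400 to 1600 takes the assumed 95% margin from about 4.9 to about 2.45 percentage points.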
-
Question: Design a complete sampling strategy to estimate the mean hours of exercise per week for students at a large university with 4 faculties of very different sizes (Faculty A: 5000, Faculty B: 3000, Faculty C: 1500, Faculty D: 500). Target sample size: 200. Justify your method and show all calculations.
Recommended Method: Stratified Random Sampling
Justification: The four faculties vary enormously in size and may well differ in their students’ exercise habits (e.g. students in health sciences may exercise more than those in other faculties). Stratified sampling ensures proportional representation of each faculty. A simple random sample of 200 from 10 000 students might, by chance, over-represent one faculty. Stratification eliminates this risk and typically produces more precise estimates when strata differ in the characteristic being measured.
Calculations
Total population: 5000 + 3000 + 1500 + 500 = 10 000 students
Sampling fraction: 200/10 000 = 1/50
Faculty A: (5000/10 000) × 200 = 0.50 × 200 = 100 students
Faculty B: (3000/10 000) × 200 = 0.30 × 200 = 60 students
Faculty C: (1500/10 000) × 200 = 0.15 × 200 = 30 students
Faculty D: (500/10 000) × 200 = 0.05 × 200 = 10 students
Total: 100 + 60 + 30 + 10 = 200 ✓
Implementation
Within each faculty, obtain the complete enrolment list and use a random number generator to select the required number of students. Survey each selected student about their weekly exercise hours. The overall mean estimate is a weighted combination of the four faculty sample means, with each weight equal to that faculty’s share of total enrolment (0.50, 0.30, 0.15, and 0.05).
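The weighted combination can be sketched in Python as below. The faculty sample means are made-up placeholder values chosen only to show the arithmetic; they are not survey results.

```python
faculty_sizes = {"A": 5000, "B": 3000, "C": 1500, "D": 500}
total_enrolment = sum(faculty_sizes.values())   # 10 000 students

# Weight for each faculty = its share of total enrolment
weights = {f: size / total_enrolment for f, size in faculty_sizes.items()}

# Hypothetical sample means (hours of exercise per week), for illustration only
sample_means = {"A": 3.0, "B": 4.0, "C": 5.0, "D": 6.0}

overall_mean = sum(weights[f] * sample_means[f] for f in faculty_sizes)
print(round(overall_mean, 2))  # 3.75
```

With these placeholder means the estimate is 0.50 × 3.0 + 0.30 × 4.0 + 0.15 × 5.0 + 0.05 × 6.0 = 3.75 hours per week. Because the allocation is proportional, this weighted mean equals the plain average of all 200 responses.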