Epistemological & Theoretical Foundations
Cluster Sampling is a probability sampling design in which the population is divided into naturally occurring groups — called clusters — and a probability sample of clusters is selected. All elements within selected clusters are measured (one-stage), or a further probability sample is drawn within each selected cluster (two-stage or multi-stage). Cluster sampling is not stratification in reverse — it is a fundamentally different design with distinct efficiency properties, a unique cost logic, and specific variance consequences governed by the within-cluster intraclass correlation.
Cluster sampling is a method of selecting a probability sample from a population of N elements organised into M mutually exclusive and exhaustive groups (clusters) of sizes N₁, N₂, …, N_M, by selecting a probability sample of m clusters and then measuring either all elements within selected clusters (one-stage), or a further probability sub-sample within each selected cluster (two-stage). The primary sampling unit (PSU) is the cluster; the element is the secondary or ultimate sampling unit (SSU).
Sources: Kish, L. (1965). Survey Sampling. John Wiley & Sons, pp. 148–149; Cochran, W.G. (1977). Sampling Techniques (3rd ed.). John Wiley & Sons, pp. 233–235; Hansen, M.H., Hurwitz, W.N., & Madow, W.G. (1953). Sample Survey Methods and Theory (Vol. 1). John Wiley & Sons, pp. 97–100.
The Core Logic: Why Clusters, Not Elements
The defining question in cluster sampling is: why would a researcher deliberately accept a statistically less efficient design — one that produces larger sampling variances for a given total sample size — compared to simple random sampling of elements? The answer is economic and structural, not statistical. Cluster sampling exists to solve two problems that SRS cannot: the absence of a complete element-level frame, and the prohibitive cost of geographically dispersed element-level sampling.
Consider a national study of primary school students. A complete list of every student in the country does not exist in a single, accessible database — but a list of every school does. Cluster sampling selects schools (clusters) at random and then samples or enumerates students within selected schools. The researcher pays travel and access costs to reach each school once, then collects data from many students at that same school with minimal additional cost per student. The efficiency loss from clustering — measured by the design effect — is the statistical price paid for this logistical and cost advantage.
This trade-off is the defining intellectual structure of cluster sampling, and it distinguishes the design from every other probability method. All other common designs aim to maximise statistical efficiency. Cluster sampling deliberately accepts reduced statistical efficiency in exchange for gains in operational feasibility, cost control, and the ability to sample from populations for which no element-level frame exists (Hansen, Hurwitz & Madow, 1953, Vol. 1, pp. 97–100).
Historical Development
The systematic mathematical treatment of cluster sampling was established by Morris H. Hansen, William N. Hurwitz, and William G. Madow in their landmark two-volume work Sample Survey Methods and Theory (1953), published by John Wiley & Sons. This work, produced by researchers at the United States Bureau of the Census, formalised the variance formulae for one-stage and two-stage cluster designs, established the PPS (probability proportional to size) selection method, and proved the conditions under which cluster sampling is cost-optimal relative to SRS. The Census Bureau's use of cluster sampling for large-scale population surveys was the primary applied context that motivated this theoretical development.
Leslie Kish (1965), in Survey Sampling, provided the definitive treatment of the design effect (DEFF) and the intraclass correlation coefficient (ρ, roh) as the two fundamental quantities governing the efficiency of cluster designs relative to SRS. Kish's conceptualisation of DEFF = 1 + (b̄ − 1)ρ — where b̄ is the average cluster size and ρ is the intraclass correlation within clusters — remains the central analytical tool for cluster sampling design and has been adopted universally in survey methodology, epidemiology, and educational research. Cochran (1977) embedded cluster sampling within the complete probability sampling framework and derived the exact estimators and variance formulae that doctoral researchers apply directly.
Primary Sampling Units (PSUs)
A probability sample of m clusters is selected from the full list of M clusters. Selection may be with equal probability (SRS) or with probability proportional to cluster size (PPS). This is the only stage in one-stage cluster sampling.
Secondary Sampling Units (SSUs)
Within each selected cluster, a probability sub-sample of n_i elements is drawn (SRS, systematic, or stratified). If all elements are measured, this is one-stage sampling. If a sub-sample is taken, this is two-stage (or multi-stage) cluster sampling.
Why Cluster Sampling Is Used
No Element-Level Frame Exists
In many populations of research interest — all residents of a country, all students in a national education system, all patients treated in a national health network — a complete, current, accessible element-level list does not exist. A list of hospitals, schools, or geographic areas does exist. Cluster sampling makes probability sampling possible when no element frame is available, by using the cluster-level list as the first-stage frame.
Cost Reduction Through Geographic Concentration
When data collection involves travel or site access — face-to-face interviewing, physical measurement, classroom observation — an SRS drawn from a geographically dispersed population would require visiting nearly every location in the country. Cluster sampling concentrates fieldwork within selected clusters, dramatically reducing travel cost and interviewer time. The statistical penalty (increased variance) is traded for an economic gain (reduced cost), and with appropriately sized clusters, the cost-efficiency can be far superior to SRS (Hansen, Hurwitz & Madow, 1953, Vol. 1, pp. 226–234).
Multi-Stage Institutional Sampling
Research in education, healthcare, organisational behaviour, and public policy typically targets individuals nested within institutions: students within schools, patients within hospitals, employees within firms. The institution is the natural primary sampling unit. Cluster sampling formalises this structure, treating the institution as the PSU and individuals as SSUs, producing a design that matches the data-generating structure of the research context and enables correct multilevel analysis.
Pilot Survey and Feasibility Studies
When a researcher is in the early stages of a research programme, detailed element-level frames are often unavailable. A cluster-based pilot survey — selecting a small number of clusters and exhaustively enumerating their elements — simultaneously gathers substantive data and constructs the element-level frame for a subsequent, more precise stage of sampling. This two-stage process is the standard approach for large national surveys in low- and middle-income countries where registration systems are incomplete (Groves et al., 2009, pp. 94–96).
Natural Cluster Structure in the Population
Many phenomena occur within naturally bounded groupings: infectious disease within households, learning within classrooms, voting behaviour within precincts. When the research question is specifically about between-cluster or within-cluster variation — as in multilevel modelling — cluster sampling is not merely a convenience but the scientifically correct design. Sampling clusters directly ensures that the design reflects the hierarchical structure being studied.
EPI Cluster Sampling in Public Health
The Expanded Programme on Immunisation (EPI) cluster sampling design, originally developed by Lemeshow and Robinson (1985) for the WHO, uses 30 clusters of 7 subjects (30×7 design) to estimate vaccination coverage in settings where no household roster exists. Each cluster is selected with PPS based on population counts. This is one of the most widely used cluster sampling applications in public health practice, illustrating the design's unique applicability in resource-constrained settings.
The Fundamental Trade-Off: Efficiency vs. Feasibility
Cluster sampling is the only common probability design in which statistical efficiency is deliberately reduced to achieve operational feasibility and cost control. Every other design — SRS, stratified, systematic — aims to maximise precision. Cluster sampling accepts reduced precision as the price for being able to conduct the survey at all, or at acceptable cost. The doctoral researcher's task is to quantify this trade-off explicitly: estimate the expected design effect DEFF = 1 + (b̄ − 1)ρ before the study, report the actual DEFF after data collection, and justify the cluster design in terms of the cost and frame arguments that make alternative designs infeasible or impractical (Kish, 1965, pp. 161–164; Cochran, 1977, pp. 241–243).
Estimators, the Design Effect & Variance Theory
The mathematics of cluster sampling reveals that the key quantity governing all efficiency comparisons is the intraclass correlation coefficient ρ — the degree to which elements within the same cluster resemble one another. When ρ is high, clustering is extremely costly in statistical terms. When ρ is low or zero, clustering approaches SRS efficiency. Understanding ρ is therefore the central analytical task in any cluster sampling design.
1. Notation and Setup
M = number of clusters in the population
N_i = number of elements in cluster i; N = ΣN_i (total elements)
b̄ = N/M = average cluster size
m = number of clusters selected (PSUs)
n_i = elements sampled within cluster i (SSUs)
ȳ_i = sample mean of cluster i; ȳ_cl = (1/m)·Σ ȳ_i
In one-stage cluster sampling: n_i = N_i (all elements in selected clusters are measured).
In two-stage cluster sampling: n_i < N_i (a sub-sample is drawn within each selected PSU).
2. One-Stage Cluster Sampling: Estimator and Variance
V(ȳ_cl) = (1 − m/M) · S_cl² / m
S_cl² = between-cluster variance = (1/(M−1)) · Σ(Ȳ_i − Ȳ)² where Ȳ_i = true cluster mean
(1 − m/M) = finite population correction for the first stage
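Key insight: the variance depends entirely on how much the cluster means Ȳ_i vary, not on the within-cluster variance. If all clusters have identical means (Ȳ_i = Ȳ), then S_cl² = 0 and any single cluster reproduces the population mean exactly; selecting more clusters adds no information because none is needed. The design loses efficiency in the opposite case, when cluster means differ widely.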
Estimated: v̂(ȳ_cl) = (1 − m/M) · s_cl² / m where s_cl² = Σ(ȳᵢ − ȳ_cl)²/(m−1)
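As a minimal sketch of the estimator and its estimated variance, the calculation can be written in a few lines of Python. It assumes equal-probability selection of clusters and clusters of roughly equal size, so that the simple mean of cluster means is appropriate. The function name `one_stage_estimate` and the example figures are illustrative assumptions, not taken from the cited sources.

```python
from statistics import mean, variance


def one_stage_estimate(cluster_means, M):
    """Return (y_cl, v_hat) for a one-stage cluster sample.

    cluster_means: observed means of the m selected clusters
    M: number of clusters in the population
    """
    m = len(cluster_means)
    if m < 2:
        raise ValueError("at least two sampled clusters are needed")
    y_cl = mean(cluster_means)
    s_cl2 = variance(cluster_means)      # sample variance, divisor m - 1
    v_hat = (1 - m / M) * s_cl2 / m      # includes the first-stage fpc
    return y_cl, v_hat


# Hypothetical data: 4 schools sampled out of 40.
y_cl, v_hat = one_stage_estimate([52.0, 48.0, 55.0, 45.0], M=40)
print(round(y_cl, 2), round(v_hat, 2))
```

Only the m cluster means enter the calculation; no within-cluster variance is needed because every element of a selected cluster is measured.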
3. Two-Stage Cluster Sampling: Estimator and Variance
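V(ȳ_ts) = (1 − f₁)·S_cl²/m + (1 − f₂)·S_w²/(m·n̄)
f₁ = m/M = first-stage sampling fraction · f₂ = n̄/b̄ = second-stage sampling fraction
S_cl² = between-cluster variance · S_w² = within-cluster variance (average across clusters)
n̄ = average number of elements sampled per cluster
Decomposition: Total variance = first-stage component (between clusters) + second-stage component (within clusters).
The first-stage component does not depend on n̄. Once S_w²/n̄ is small relative to S_cl², increasing n̄ (more elements per cluster) yields diminishing returns, and only a larger m reduces the variance appreciably.
When n̄ is small (few elements sampled per cluster), the second-stage component is proportionally more important.
In the sample-based variance estimator the within-cluster term carries an additional factor f₁, so when f₁ is small (m ≪ M) the between-cluster term of the estimator is nearly sufficient on its own (Cochran, 1977).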
4. The Intraclass Correlation Coefficient (ρ, roh)
S_cl² = S²[1 + (b̄−1)ρ] / b̄ (implied relationship)
S² = overall population variance across all N elements
Range: −1/(b̄−1) ≤ ρ ≤ 1 · In practice for social/health research: typically 0.01 ≤ ρ ≤ 0.30
When ρ = 0: Cluster membership is irrelevant — elements within a cluster are no more alike than any two random elements from the population. Cluster sampling is as efficient as SRS.
When ρ → 1: All elements in each cluster are identical, so measuring additional elements within a selected cluster adds no new information. Each cluster yields information equivalent to one element. Ignoring the finite population correction, V(ȳ_cl) ≈ S²/m, which is b̄ times the SRS variance S²/(m·b̄) for the same number of elements: a catastrophic efficiency loss.
Negative ρ: Rare in practice; implies within-cluster heterogeneity exceeds random expectation — cluster sampling would be more efficient than SRS.
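One common way to estimate ρ from data is the one-way ANOVA estimator. The sketch below is an illustration under assumptions: it requires equal-sized clusters, and both the choice of estimator and the function name `anova_icc` are additions for this example rather than a formula stated in this text.

```python
def anova_icc(clusters):
    """ANOVA estimate of the intraclass correlation for equal-sized clusters.

    clusters: list of lists, one inner list of element values per cluster.
    """
    m = len(clusters)
    b = len(clusters[0])
    if m < 2 or b < 2 or any(len(c) != b for c in clusters):
        raise ValueError("need >= 2 clusters of equal size >= 2")
    means = [sum(c) / b for c in clusters]
    grand = sum(means) / m
    # Mean square between clusters and mean square within clusters.
    msb = b * sum((ybar - grand) ** 2 for ybar in means) / (m - 1)
    msw = sum((y - ybar) ** 2
              for c, ybar in zip(clusters, means)
              for y in c) / (m * (b - 1))
    return (msb - msw) / (msb + (b - 1) * msw)


# Identical elements within each cluster: rho at its upper bound.
rho_high = anova_icc([[1, 1, 1], [5, 5, 5]])
# Identical cluster means, all variation within clusters: rho at its
# lower bound -1/(b - 1), here with b = 2.
rho_low = anova_icc([[1, 5], [1, 5]])
print(rho_high, rho_low)
```

The two toy inputs reproduce the extremes of the range given above; the function fails with a division by zero if every value in the data is identical.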
5. The Design Effect (DEFF) — The Central Quantity
DEFF ≈ 1 + (b̄ − 1) · ρ
ρ = intraclass correlation within clusters
DEFF ≥ 1 for positive ρ — cluster sampling always requires a larger sample than SRS to achieve the same precision when ρ > 0
Effective sample size: n_eff = n / DEFF < n — the cluster sample of n elements is statistically equivalent to only n/DEFF independent observations
Example: b̄ = 20 students per school, ρ = 0.15 (typical for academic achievement): DEFF = 1 + 19 × 0.15 = 3.85. A cluster sample of n = 1,000 students is equivalent to only about 260 independently drawn students in precision terms. Conversely, a cluster sample of 3,850 students would be needed to match the precision of an SRS of 1,000 students: the cluster design requires 3.85× the observations of SRS for equivalent precision.
Critical implication: Ignoring DEFF and analysing cluster samples as if they were SRS underestimates standard errors by a factor of √DEFF, producing anti-conservative confidence intervals and inflated Type I error rates.
Each bar represents the DEFF for a given combination of cluster size b̄ and intraclass correlation ρ. The dashed line at DEFF = 1.0 marks the SRS benchmark. Any bar extending beyond this represents statistical inefficiency relative to SRS — the cluster sample requires that many times more elements to match SRS precision.
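The design effect and effective sample size are simple enough to tabulate directly. The following Python sketch reproduces the worked school example and prints DEFF for a small grid of cluster sizes and ρ values. The grid values and function names are assumptions chosen for illustration and are not the values behind the original chart.

```python
def design_effect(b_bar, rho):
    """Kish's approximation DEFF = 1 + (b_bar - 1) * rho."""
    return 1 + (b_bar - 1) * rho


def effective_sample_size(n, b_bar, rho):
    """Number of independent observations the cluster sample is worth."""
    return n / design_effect(b_bar, rho)


# Worked example from the text: 20 students per school, rho = 0.15.
deff = design_effect(20, 0.15)                   # about 3.85
n_eff = effective_sample_size(1000, 20, 0.15)    # about 260

# Illustrative grid (assumed values) of DEFF by cluster size and rho.
cluster_sizes = [5, 10, 20, 50]
rhos = [0.01, 0.05, 0.15, 0.30]
print("b_bar " + " ".join(f"rho={r:<5}" for r in rhos))
for b in cluster_sizes:
    print(f"{b:<5} " + " ".join(f"{design_effect(b, r):<9.2f}" for r in rhos))
```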
6. Probability Proportional to Size (PPS) Sampling
When clusters have unequal sizes — the common case in practice — equal-probability selection of clusters is inefficient and introduces bias if cluster means are correlated with cluster size. Probability Proportional to Size (PPS) sampling addresses both problems by assigning each cluster a selection probability proportional to its size: π_i = m · N_i / N. In PPS sampling with a fixed sub-sample of n̄ elements per cluster, the product π_i × n̄/N_i = m·n̄/N = constant for all i — meaning every element has the same marginal inclusion probability regardless of cluster size. The design is therefore EPSEM — self-weighting — even when clusters are of different sizes.
ȳ_HH = (1/m) · Σᵢ₌₁ᵐ ȳᵢ (self-weighting when n̄ constant)
v̂(ȳ_HH) = [1/(m(m−1))] · Σᵢ(ȳᵢ − ȳ_HH)²
m = number of clusters selected in the first stage
The Hansen-Hurwitz estimator ȳ_HH is an unbiased estimator under PPS with replacement sampling (PPSWR).
Variance estimation: v̂(ȳ_HH) requires only m ≥ 2 selected clusters — the variance is computed entirely from the between-cluster variation in ȳᵢ, requiring no within-cluster variance estimation.
Practical PPS methods: Systematic PPS (cumulative size method); Lahiri's method; Brewer's method; Sampford's method (for without-replacement PPS). The cumulative size method is by far the most commonly implemented in survey practice (Cochran, 1977, pp. 251–259; Lohr, 2010, pp. 176–184).
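The cumulative size method can be sketched as follows: lay the clusters end to end on a line of length N, choose a random start within the first sampling interval N/m, and step through the line at that interval. This is an illustrative implementation under assumptions, not the procedure of any particular survey; the function names, the optional `start` argument, and the example sizes are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def systematic_pps(sizes, m, start=None, rng=random):
    """Select m cluster indices with probability proportional to size."""
    total = sum(sizes)
    step = total / m                       # sampling interval N / m
    if start is None:
        start = rng.uniform(0, step)       # random start in the first interval
    cum = list(accumulate(sizes))          # cumulative size totals
    picks = []
    for k in range(m):
        point = start + k * step
        idx = bisect_right(cum, point)     # cluster whose range holds the point
        picks.append(min(idx, len(sizes) - 1))   # guard against rounding
    return picks


def element_inclusion_probability(size, total, m, n_bar):
    """pi_i * n_bar / N_i for a cluster of the given size."""
    pi_cluster = m * size / total
    return pi_cluster * n_bar / size


sizes = [100, 200, 300, 400]
picks = systematic_pps(sizes, m=2, start=250)   # selection points 250 and 750
probs = [element_inclusion_probability(s, sum(sizes), 2, 10) for s in sizes]
print(picks, probs)
```

With a fixed take of 10 elements per selected cluster, every element's inclusion probability equals m·n̄/N regardless of cluster size, which is the self-weighting property described above. A cluster larger than the sampling interval can be hit more than once; in practice such clusters are usually handled separately as certainty selections.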
7. Optimal Number of Elements per Cluster
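b̄_opt = √[(c₁/c₂) · (1 − ρ)/ρ]
c₁ = cost of adding one cluster to the sample (travel and access cost per cluster)
c₂ = cost of measuring one additional element within an already-selected cluster (marginal element cost)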
ρ = intraclass correlation
Logic: When ρ is high (within-cluster elements are similar), additional elements within a cluster add little new information → small b̄_opt. When ρ is low (within-cluster heterogeneity is high), additional elements are informative → larger b̄_opt.
When c₁ ≫ c₂ (access is expensive, marginal measurement is cheap) → large b̄_opt (measure many elements per cluster to justify the fixed access cost).
When c₁ ≈ c₂ (access and measurement cost roughly equally) → b̄_opt approaches √[(1−ρ)/ρ].
Example: ρ = 0.10, c₁ = $500 (travel), c₂ = $10 (interview): b̄_opt = √[(500/10) · (0.9/0.1)] = √[50 × 9] = √450 ≈ 21 elements per cluster.
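The worked example above can be checked with a short Python function. This is a sketch of the formula b̄_opt = √[(c₁/c₂)·(1−ρ)/ρ] only; the function name is an assumption, and the formula is defined only for 0 < ρ < 1.

```python
from math import sqrt


def optimal_cluster_size(c1, c2, rho):
    """Cost-optimal number of elements per cluster.

    c1: cost of adding one cluster, c2: cost per additional element,
    rho: intraclass correlation (must be strictly between 0 and 1).
    """
    if not 0 < rho < 1:
        raise ValueError("rho must lie strictly between 0 and 1")
    return sqrt((c1 / c2) * (1 - rho) / rho)


# Travel cost 500, interview cost 10, rho = 0.10.
b_opt = optimal_cluster_size(500, 10, 0.10)   # sqrt(450), about 21.2
print(round(b_opt, 1))
```

Because the result depends on the square root of the cost ratio, even a hundredfold increase in c₁/c₂ raises the optimal cluster size only tenfold.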
8. Variance Estimation in Cluster Designs
A critical property of cluster sampling — shared with stratified sampling but not with systematic sampling — is that design-based unbiased variance estimation is straightforward when m ≥ 2 clusters are selected. The between-cluster variance in observed cluster means ȳᵢ is directly estimable from the m selected clusters, producing an unbiased estimate of the first-stage variance component. This property holds for both PPS and equal-probability selection, and for both one-stage and two-stage designs (with different formulas). For complex multi-stage designs, Taylor linearisation (the delta method), jackknife, and balanced repeated replication (BRR) are the standard approaches, all implemented in major survey software packages.
The Anti-Conservative Analysis Error: Ignoring the Design Effect
The most prevalent and consequential error in the analysis of cluster samples is treating the data as if they were drawn by SRS — computing standard errors as √(s²/n) and conducting standard OLS-based significance tests without accounting for the clustering structure. When the true DEFF is, say, 3.0 and the analyst ignores it, reported standard errors are underestimated by a factor of √3 ≈ 1.73, reported 95% confidence intervals are too narrow, and the nominal Type I error rate of 0.05 may correspond to an actual Type I error rate of 0.20 or higher. This is not a minor correction — it is a fundamental error that has led to numerous false positives in published social science and public health literature. All major methodologists agree: cluster samples must be analysed using design-correct methods that explicitly account for the PSU structure, regardless of the statistical software defaults used (Kish, 1965, pp. 258–265; Groves et al., 2009, pp. 239–244; Lohr, 2010, pp. 158–162).
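The size of the inflation can be approximated numerically. The sketch below assumes a two-sided large-sample z-test in which the naive standard error is too small by a factor of √DEFF; the function name is hypothetical, and the result is an approximation under those assumptions rather than a general guarantee.

```python
from math import sqrt
from statistics import NormalDist


def actual_type1_error(deff, alpha=0.05):
    """Approximate true rejection rate when DEFF is ignored.

    The analyst rejects when |estimate / naive SE| exceeds the usual
    critical value, but the correct statistic is smaller by sqrt(deff).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return 2 * (1 - std_normal.cdf(z_crit / sqrt(deff)))


rate = actual_type1_error(3.0)   # roughly 0.26 for DEFF = 3
print(round(rate, 3))
```

With DEFF = 1 the function returns the nominal 0.05, and the rate rises steadily as DEFF grows.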
Cluster Sampling Simulator
Configure the number of clusters, cluster size, the number of clusters to select, and the sampling stage. Observe how PSU selection distributes across the population, compute the live design effect, and compare the sampling distribution of the cluster mean against the SRS benchmark.
What the Simulator Demonstrates
Draw Sample: Randomly selects m PSUs from the M available clusters. Selected clusters are highlighted. In one-stage mode, all b̄ elements within selected clusters are shaded as measured. In two-stage mode, a random sub-sample of elements is drawn within each selected cluster. The live DEFF and effective sample size update immediately, showing the statistical cost of the clustering structure.
ρ Slider: Adjusting the intraclass correlation from 0 to 0.50 directly updates DEFF = 1 + (b̄ − 1)ρ. Watch the effective sample size drop precipitously as ρ increases — this is the most direct illustration of why high ρ makes cluster sampling statistically expensive.
Run Simulation: Executes 300 independent cluster samples, each time randomly selecting m PSUs and computing ȳ_cl. The resulting histogram displays the empirical sampling distribution of the cluster mean. Higher ρ and larger b̄ produce wider distributions — confirming the DEFF formula empirically.
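For readers without access to the interactive simulator, a comparable experiment can be run offline. The Python sketch below is an independent illustration, not the simulator's own code: it builds a hypothetical population from a random-intercept model (cluster effect plus element noise), repeatedly draws one-stage cluster samples and SRS samples of the same size, and compares the empirical variances of the two sample means. All parameter values and names are assumptions.

```python
import random
from statistics import pvariance


def simulate(M=200, b=20, m=20, rho=0.2, reps=2000, seed=1):
    """Empirical variance ratio: cluster sample mean vs SRS sample mean."""
    rng = random.Random(seed)
    # Population: y = cluster effect + element noise, total variance about 1,
    # so that the nominal intraclass correlation is rho.
    clusters = []
    for _ in range(M):
        u = rng.gauss(0, rho ** 0.5)
        clusters.append([u + rng.gauss(0, (1 - rho) ** 0.5) for _ in range(b)])
    elements = [y for c in clusters for y in c]
    n = m * b
    cl_means, srs_means = [], []
    for _ in range(reps):
        chosen = rng.sample(clusters, m)           # one-stage: keep every element
        cl_means.append(sum(sum(c) for c in chosen) / n)
        srs_means.append(sum(rng.sample(elements, n)) / n)
    return pvariance(cl_means) / pvariance(srs_means)


ratio = simulate()
print("empirical variance ratio:", round(ratio, 2))
print("formula DEFF:", 1 + (20 - 1) * 0.2)
```

The empirical ratio should fall in the neighbourhood of the formula value, with deviations caused by Monte Carlo noise and by the realised population differing from its nominal ρ. Raising `rho` or `b` widens the gap between the two sampling distributions.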
Assumptions, Conditions & Limitations
Cluster sampling carries a specific and consequential set of assumptions. Five of these — the exhaustiveness of the cluster list, the known or estimable cluster sizes for PPS, the independence of cluster selection, the adequacy of m, and the correct analysis accounting for the design effect — require explicit justification and documentation in any doctoral research employing this design.
Formal Assumptions
| Assumption | Technical Requirement | Violation Consequence | Diagnostic / Remedy |
|---|---|---|---|
| Exhaustive Cluster List (First-Stage Frame) | Every cluster in the defined target population must appear on the PSU frame; no cluster can have zero probability of selection | Coverage error: elements in unlisted clusters have πᵢ = 0, violating EPSEM and introducing coverage bias of unknown direction | Audit the PSU frame against administrative records or geographic maps; quantify the proportion of the population in uncovered clusters; assess non-coverage bias direction |
| Known Cluster Sizes (for PPS) | The measure of size N_i must be known for every cluster on the PSU frame to compute PPS selection probabilities π_i = m·N_i/N | Incorrect size measures produce selection probabilities that deviate from the intended PPS design; if sizes have changed since the frame was constructed, the effective inclusion probabilities differ from the nominal ones | Use the most recent available size measure; document its vintage; implement a size ratio estimator if sizes are outdated; conduct sensitivity analysis using the range of plausible size values |
| Independent Cluster Selection | The selection of one cluster must be statistically independent of the selection of any other cluster at the first stage | Correlated PSU selection (e.g., systematic PPS with undisclosed periodicity) invalidates the standard variance formula; actual variance may differ substantially from the estimated variance | Use SRS without replacement or PPS with replacement for PSU selection; document the selection mechanism; if systematic PPS is used, check the ordering for periodicity matching the skip interval |
| m ≥ 2 PSUs Selected | At least two clusters must be selected to permit design-based variance estimation from the between-cluster variation in ȳᵢ | Single PSU selection (m = 1) makes variance estimation impossible by design; no information exists about between-cluster variation from the data alone | Always select m ≥ 2 PSUs; for reliable variance estimation in complex multi-stage designs, m ≥ 20–30 PSUs is commonly recommended; document the total PSU count and justify the choice of m in the study protocol |
| Correct Design-Based Analysis | All statistical analyses must account for the PSU structure, cluster weights, and the two-stage sampling design using appropriate survey analysis methods | Ignoring the design structure and treating data as SRS underestimates standard errors by √DEFF, inflates test statistics, and produces anti-conservative p-values with actual Type I error rates far exceeding the nominal level | Specify the survey design in software using svydesign (R), svyset (Stata), PROC SURVEYMEANS (SAS), or CSPLAN (SPSS); use Taylor linearisation, jackknife, or BRR for variance estimation; report DEFF for all primary estimates |
| Complete Enumeration Within Selected Clusters (One-Stage) | In one-stage designs, every element in each selected cluster must be contacted and measured; missing elements introduce selection bias if missingness is related to the outcome | Incomplete cluster enumeration converts the design from a probability to a non-probability sample within the affected cluster; bias magnitude depends on the correlation between the outcome and the probability of element exclusion | Pre-commit to exhaustive within-cluster enumeration in the protocol; establish explicit inclusion/exclusion criteria for elements within clusters; track and report the within-cluster response rate separately from the cluster-level response rate |
Core Limitations
The central statistical limitation of cluster sampling is its reduced precision relative to SRS for the same total number of elements measured. This inefficiency is irreducible when within-cluster homogeneity (ρ > 0) exists — and in virtually all real-world contexts, it does. Students in the same school share a teacher, curriculum, and socioeconomic environment. Patients in the same hospital share clinical protocols and local disease burden. Residents of the same neighbourhood share infrastructure, services, and social norms. Wherever the research context provides a natural cluster, the clustering variable is almost always correlated with the outcome variable, producing positive ρ.
The magnitude of this inefficiency is often dramatically underestimated by researchers who have not formally computed DEFF. Kish (1965, pp. 257–262) documents ρ values of 0.10–0.30 as typical for educational achievement, health behaviours, and socioeconomic indicators. With b̄ = 25 elements per cluster (a modest school-based study), these values imply DEFF = 1 + 24×0.10 = 3.40 to DEFF = 1 + 24×0.30 = 8.20. This means the effective sample size is only 29% to 12% of the nominal sample size — an enormous loss that has major consequences for power calculations and confidence interval width.
The doctoral researcher must always conduct formal power calculations using the expected DEFF before data collection, using available prior estimates of ρ from the literature or from pilot studies. Failing to account for DEFF in power calculations leads to severely underpowered studies — one of the most common and consequential errors in cluster-based doctoral research (Donner & Klar, 2000; Murray, 1998).
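A first-pass planning calculation simply inflates the sample size required under SRS by the expected DEFF and converts the result into a number of clusters. The sketch below assumes equal cluster sizes and a prior estimate of ρ; the function name and the example inputs are hypothetical.

```python
from math import ceil


def clusters_required(n_srs, b_bar, rho):
    """Clusters needed so the cluster sample matches an SRS of size n_srs."""
    deff = 1 + (b_bar - 1) * rho
    n_cluster = n_srs * deff            # elements needed under clustering
    return ceil(n_cluster / b_bar)      # round up to whole clusters


# Suppose a power analysis under SRS calls for 400 students,
# with 25 students per school and rho = 0.10 (DEFF = 3.4).
m_needed = clusters_required(400, 25, 0.10)
total_elements = m_needed * 25
print(m_needed, total_elements)
```

In this example the design needs 55 schools and 1,375 students to deliver the precision that 400 independently sampled students would provide.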
A non-intuitive but critical property of one-stage cluster sampling is that its variance depends on the between-cluster variance S_cl² — not the total population variance S². This means that increasing the number of elements measured per cluster (while holding m fixed) does not reduce the first-stage variance component at all. The only way to reduce V(ȳ_cl) in a one-stage design is to increase m — the number of clusters selected. This is precisely the opposite of the intuition most researchers bring from SRS, where increasing the sample size always reduces variance.
The practical implication is that for a fixed total budget, the optimal design almost always involves selecting more clusters with fewer elements per cluster, rather than fewer clusters with exhaustive within-cluster measurement. This is the fundamental insight of the b̄_opt formula: b̄_opt = √[(c₁/c₂)·(1−ρ)/ρ]. Unless the cluster access cost c₁ is extraordinarily high relative to the per-element cost c₂ — which is sometimes true in remote geographic sampling — the optimal cluster size is typically much smaller than the cluster's natural size N_i (Cochran, 1977, pp. 241–244; Hansen, Hurwitz & Madow, 1953, Vol. 1, pp. 226–234).
Doctoral researchers designing cluster studies should treat the choice of m (number of clusters) as the primary efficiency-determining parameter and b̄ (elements per cluster) as a secondary parameter determined by the cost-optimisation formula — not by convenience or the desire to measure every available element within selected clusters.
In virtually all real-world cluster sampling applications, clusters are of unequal size — schools have different numbers of students, hospitals have different numbers of patients, geographic areas have different population counts. Unequal cluster sizes create two distinct problems that must be addressed separately.
Problem 1 — Bias under equal-probability selection: If clusters are selected with equal probability and cluster means Ȳ_i are correlated with cluster sizes N_i (which they commonly are — larger schools may have lower per-student resources, larger hospitals may treat more complex cases), the simple cluster mean ȳ_cl = (1/m)Σȳ_i is a biased estimator of the population mean Ȳ. The bias arises because large clusters are underrepresented in the equal-probability design relative to their contribution to the population. The ratio estimator ȳ_r = (Σ n_i ȳ_i)/(Σ n_i) — weighting cluster means by cluster size — removes this bias (approximately) but introduces a small bias of its own from the ratio approximation.
Problem 2 — Increased variance: Even with an unbiased estimator, unequal cluster sizes increase the sampling variance relative to the equal-size case because large clusters contribute disproportionately to the sample mean in some draws and less in others. PPS sampling directly addresses this: by selecting clusters with probability proportional to N_i and measuring a fixed n̄ elements per cluster, every element has the same marginal inclusion probability and the estimator is exactly unbiased without requiring ratio estimation (Cochran, 1977, pp. 247–252; Kish, 1965, pp. 186–192).
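A two-cluster toy population makes the bias in Problem 1 concrete. The Python sketch below uses made-up sizes and means in which the larger cluster has the lower mean; the function names are hypothetical.

```python
def population_mean(sizes, means):
    """True element-level mean: cluster means weighted by cluster size."""
    return sum(N * y for N, y in zip(sizes, means)) / sum(sizes)


def expected_unweighted_mean(means):
    """Expectation of (1/m) * sum of cluster means when clusters are
    drawn with equal probability: the plain average of all cluster means."""
    return sum(means) / len(means)


def ratio_estimate(sample_sizes, sample_means):
    """Size-weighted mean of the sampled cluster means."""
    return sum(n * y for n, y in zip(sample_sizes, sample_means)) / sum(sample_sizes)


sizes = [100, 400]         # hypothetical cluster sizes
means = [60.0, 50.0]       # the larger cluster has the lower mean
true_mean = population_mean(sizes, means)
naive_expectation = expected_unweighted_mean(means)
bias = naive_expectation - true_mean
print(true_mean, naive_expectation, bias)
```

The unweighted estimator centres on 55 while the population mean is 52, because the small cluster counts as much as the large one. Weighting by size, as in `ratio_estimate`, recovers 52 when both clusters are in the sample.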
Variance estimation in cluster samples requires at least m = 2 selected PSUs to compute the between-cluster variance s_cl² = Σ(ȳᵢ − ȳ_cl)²/(m−1). When m is small, as it often is in budget-constrained studies, the variance estimate itself becomes highly unstable: with m = 4 clusters, the variance estimate has only m − 1 = 3 degrees of freedom, so the standard error is itself very imprecisely estimated and confidence intervals must be widened accordingly. The limiting case, a stratum that contains only a single sampled PSU and therefore permits no within-stratum variance estimate, is the cluster-level analogue of the single-unit stratum problem in stratified sampling and is known in the survey literature as the "lonely PSU" problem (Wolter, 2007, pp. 158–162).
Practical consequences: (1) T-statistics for cluster-based inference should use m − 1 degrees of freedom at the PSU level — not n − 1 at the element level. For m = 4 clusters, the critical value for a 95% CI is t₃,0.975 = 3.18, not 1.96 — a very substantial correction. (2) The "collapsed PSU" technique — pairing adjacent PSUs in the same stratum of a stratified cluster design for variance estimation — provides a conservative but computable variance estimate when individual PSU variance computation is unstable. (3) For regulatory and policy surveys, most methodologists recommend m ≥ 20 PSUs per domain to support stable variance estimation (Groves et al., 2009; Lohr, 2010, pp. 178–180).
When cluster sampling is used and the intraclass correlation ρ is non-trivial, the researcher faces a conceptually important distinction between the statistical clustering problem (the need for design-correct analysis) and the substantive multilevel question (whether and how cluster membership affects the outcome). These two issues are related but distinct, and conflating them is a common source of analytical errors.
The statistical issue: Regardless of whether cluster membership causally affects the outcome, the sampling structure requires design-correct standard errors. A researcher who uses survey regression (svyglm in R, or svy: regress in Stata) accounts for the design structure in the standard errors without making any multilevel causal claims.
The substantive multilevel issue: If the researcher wishes to model the cluster-level context as an explanatory factor — e.g., testing whether school-level socioeconomic composition predicts individual achievement beyond individual-level SES — multilevel models (random effects models, hierarchical linear models) are required. These models must account for the non-random nature of the cluster sample if cluster selection was not independent of the outcome (e.g., a convenience cluster sample of easily accessible schools). Using HLM without accounting for the complex sampling design can produce biased fixed-effect estimates if the PSU selection was informative (Raudenbush & Bryk, 2002; Pfeffermann et al., 1998).
The doctoral researcher must therefore be explicit about which analysis goal is operative: design-correct estimation of population parameters (use survey regression), or estimation of contextual effects and their variance decomposition (use multilevel models with design weights and PSU indicators).
Cluster Sampling vs. Other Probability Designs
Cluster sampling occupies a unique niche among probability designs: it is the least statistically efficient of the common designs, yet the most operationally feasible when no element-level frame exists and geographic concentration of data collection is required. Understanding precisely where it is justified — and where its efficiency costs make alternative designs preferable — is essential for doctoral-level design selection.
Stratified Cluster Sampling: The Real-World Standard
In large-scale national surveys (government population surveys, educational assessments, health examination surveys), pure cluster sampling is rarely used alone. The standard in practice is stratified multistage cluster sampling: the population is first divided into strata (by geographic region, urbanicity, population density, or institutional type), and then clusters are selected independently within each stratum using PPS. This hybrid design captures the efficiency advantage of stratification (between-stratum variation is removed from the sampling variance of the estimator) while retaining the operational advantages of clustering (no element-level frame, geographic concentration). The design effect of the combined design is approximately DEFF_combined ≈ DEFF_cluster × DEFF_stratification, where DEFF_stratification ≤ 1 reflects the gain from stratification; the product is typically still greater than 1 but substantially smaller than for pure cluster sampling (Kish, 1965, pp. 248–255; Groves et al., 2009, pp. 120–126).
When Cluster Sampling Is the Appropriate Choice
No Complete Element-Level Frame
When the target population consists of individuals nested within institutions or geographic areas, and no complete, accessible list of individuals exists — but a complete list of institutions or areas does — cluster sampling is the only probability design available. This condition is the primary justification for cluster sampling and applies to a majority of large-scale social, health, and educational research contexts in both high- and low-income countries.
Data Collection Requires Physical Site Access
When the measurement process requires researchers to be physically present — classroom observations, medical examinations, facility audits, biometric data collection — each additional site visited imposes substantial fixed costs. Cluster sampling minimises the number of sites visited while maximising the data yield per site, producing the lowest cost per unit of information when between-site travel or access cost is high relative to within-site per-element cost.
Research Question is Explicitly Multilevel
When the research question concerns relationships that operate at both the individual and the institutional level — school effects on student learning, neighbourhood effects on health behaviour, organisational culture effects on employee outcomes — cluster sampling is not merely acceptable but scientifically required. The PSU structure must be preserved in the design so that between-cluster and within-cluster variances can be separately estimated, enabling valid multilevel analysis (Raudenbush & Bryk, 2002).
Low Intraclass Correlation and Large m
The statistical cost of clustering is minimised when ρ is small and m is large. When prior research or pilot data suggest ρ < 0.05 and the design permits selecting m ≥ 30 clusters, DEFF = 1 + (b̄ − 1)ρ stays below 1 + (b̄ − 1) × 0.05, which is 1.45 for b̄ = 10 and closer to 1.0 for smaller b̄. In these circumstances, the operational advantages of cluster sampling can be realised at minimal statistical cost: the rare scenario in which cluster sampling is competitive with SRS on both cost and precision grounds.
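To make the trade-off concrete, the following minimal sketch tabulates DEFF = 1 + (b̄ − 1)ρ and the implied effective sample size for a few combinations of ρ and b̄. The function names and the choice of m = 30 clusters are illustrative assumptions, not values taken from any particular study.

```python
def design_effect(b_bar: float, rho: float) -> float:
    """Approximate design effect DEFF = 1 + (b_bar - 1) * rho."""
    if b_bar < 1:
        raise ValueError("average cluster take b_bar must be at least 1")
    return 1.0 + (b_bar - 1.0) * rho


def effective_n(m: int, b_bar: float, rho: float) -> float:
    """Effective sample size n / DEFF for m clusters of b_bar elements each."""
    return m * b_bar / design_effect(b_bar, rho)


if __name__ == "__main__":
    m = 30  # hypothetical number of selected clusters
    for rho in (0.01, 0.03, 0.05):
        for b_bar in (5, 10, 20):
            deff = design_effect(b_bar, rho)
            print(f"rho={rho:.2f}  b_bar={b_bar:>2}  DEFF={deff:.2f}  "
                  f"n={m * b_bar:>3}  n_eff={effective_n(m, b_bar, rho):.0f}")
```

Reading the output row by row shows how quickly the effective sample size falls behind the nominal n as b̄ grows, even when ρ is small.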
Implementation Protocol for Doctoral Research
Rigorous implementation of cluster sampling requires explicit documentation of every methodologically consequential decision: the definition of clusters and their boundaries, the PSU frame, the selection method (equal probability or PPS), the second-stage design, the b̄_opt calculation, and the variance estimation approach. The following seven-step protocol meets the reporting standards of APA 7th Edition, STROBE, and CONSORT-equivalent guidelines for cluster-randomised and cluster-sampled designs.
Define Clusters & Obtain PSU Frame
Define cluster boundaries (geographic, institutional, or administrative). Verify the PSU frame is exhaustive. Record cluster identifiers and measures of size N_i from the most recent available source.
Estimate ρ and Compute DEFF
Obtain prior estimates of ρ from published literature or pilot data. Compute the expected DEFF = 1 + (b̄ − 1)ρ for planned b̄. Use DEFF to adjust sample size requirements upward from the SRS-based n: n_cluster = n_SRS × DEFF.
Compute b̄_opt and Determine m
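Using ρ and the cost ratio c₁/c₂ (cost of adding one cluster relative to the cost of adding one element within a cluster), compute b̄_opt = √[(c₁/c₂)·(1−ρ)/ρ]. If b̄_opt differs from the b̄ assumed when DEFF was first computed, recompute DEFF and n_cluster at b̄_opt. Determine m = n_cluster / b̄_opt, rounded up to a whole number of clusters. Enforce m ≥ 2 (minimum for variance estimation); aim for m ≥ 20 for stable inference.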
Select PSUs (Equal Prob. or PPS)
If clusters are equal or near-equal in size: SRS without replacement is appropriate. If clusters differ substantially in size: use PPS selection (cumulative size / systematic PPS). Document the selection method and all random seeds used.
Enumerate Elements Within Selected Clusters
Obtain or construct a complete element list for each selected cluster. In one-stage designs: measure all N_i elements. In two-stage designs: apply a pre-specified probability design (SRS, systematic) within each selected cluster to select n_i elements.
Collect Data & Track Response
Apply the pre-specified contact and non-response protocol. Record cluster-level and element-level response dispositions separately per AAPOR standards. Document refusals, non-contacts, and ineligibles at both PSU and SSU levels.
Analyse with Design Specification
Specify the PSU, strata (if any), and weights in survey software. Compute ȳ_cl, v̂(ȳ_cl), and DEFF. Report DEFF, effective n, and degrees of freedom = m − 1 (per stratum) for all primary estimates.
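The sample-size arithmetic in the protocol above (DEFF, b̄_opt, n_cluster, m) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a planning tool: the function name `plan_cluster_sample`, the rounding of b̄_opt to a whole number of elements, and the example inputs (n_SRS = 400, ρ = 0.05, c₁/c₂ = 10) are all introduced here for illustration.

```python
import math


def plan_cluster_sample(n_srs: float, rho: float, cost_ratio: float,
                        min_clusters: int = 2) -> dict:
    """Plan a two-stage cluster sample.

    n_srs      : sample size an SRS would need for the target precision
    rho        : prior intraclass correlation (0 < rho < 1)
    cost_ratio : c1 / c2, per-cluster cost over per-element cost
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")

    # Optimal average take per cluster: sqrt((c1/c2) * (1 - rho) / rho).
    b_opt = math.sqrt(cost_ratio * (1.0 - rho) / rho)
    b = max(1, round(b_opt))  # assumption: round to a whole element count

    # Design effect evaluated at the take actually planned.
    deff = 1.0 + (b - 1) * rho

    # Inflate the SRS sample size, then convert to a number of clusters.
    n_cluster = n_srs * deff
    m = max(min_clusters, math.ceil(n_cluster / b))

    return {"b_opt": b_opt, "b": b, "deff": deff,
            "n_cluster": n_cluster, "m": m}


if __name__ == "__main__":
    plan = plan_cluster_sample(n_srs=400, rho=0.05, cost_ratio=10)
    for key, value in plan.items():
        print(f"{key:>9}: {value:.3f}" if isinstance(value, float)
              else f"{key:>9}: {value}")
```

Evaluating DEFF at the rounded b̄ rather than at a separately "planned" b̄ keeps the DEFF and b̄_opt steps consistent with each other; a real plan would also add allowances for non-response at both stages.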
PPS Selection: The Cumulative Size Method
Implementing Systematic PPS Selection — Step by Step
The cumulative size method (in its systematic form, systematic PPS sampling; the Hartley-Rao procedure is the variant applied to a randomly ordered list) is the most widely taught PPS selection procedure. The steps are:
(1) List all M clusters with their sizes N_i.
(2) Compute cumulative totals: C_0 = 0; C_i = C_{i−1} + N_i for i = 1, …, M, so that C_M = N.
(3) Compute the sampling interval k = N/m.
(4) Draw a random start r ~ Uniform(0, k).
(5) Form the m selection points r, r+k, r+2k, …, r+(m−1)k and select each cluster whose cumulative range (C_{i−1}, C_i] contains one of them. Each selection point falls in cluster i's range with probability proportional to N_i, so the expected number of selections of cluster i is m·N_i/N, which is its inclusion probability when N_i ≤ k. This is what makes the procedure PPS.
(6) The selected clusters constitute the PSU sample. If a cluster is selected more than once (possible only when N_i > k), it is entered once per selection in the sample and in all estimators, consistent with analysing the sample under the with-replacement (PPSWR) framework.
Cochran (1977, pp. 251–253) and Lohr (2010, pp. 177–180) provide complete worked examples.
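The cumulative size procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `systematic_pps` and the example cluster sizes are assumptions made here, and repeated indices in the result correspond to clusters hit by more than one selection point.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def systematic_pps(sizes, m, rng=None):
    """Select m PSUs by systematic PPS; returns 0-based cluster indices.

    sizes : list of positive cluster sizes N_i
    m     : number of selection points (PSUs to draw)
    rng   : optional random.Random instance, for a documented seed
    """
    if m < 1 or any(s <= 0 for s in sizes):
        raise ValueError("need m >= 1 and strictly positive cluster sizes")
    rng = rng or random.Random()

    cum = list(accumulate(sizes))   # C_1, ..., C_M (C_0 = 0 is implicit)
    k = cum[-1] / m                 # sampling interval k = N / m
    r = rng.uniform(0, k)           # random start

    picks = []
    for j in range(m):
        point = r + j * k
        # First index i with C_i >= point, i.e. C_{i-1} < point <= C_i.
        i = bisect_left(cum, point)
        picks.append(min(i, len(cum) - 1))  # guard against float overshoot
    return picks


if __name__ == "__main__":
    sizes = [120, 45, 300, 80, 60, 210, 95, 40, 150, 100]  # hypothetical N_i
    print(systematic_pps(sizes, m=4, rng=random.Random(2024)))
```

Passing an explicitly seeded `random.Random` makes the draw reproducible, which supports the requirement to document the random seed used for PSU selection.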
Variance Estimation Method Selection
| Design Context | Recommended Estimator | Software Implementation | Degrees of Freedom |
|---|---|---|---|
| Simple one-stage, equal-prob. selection | (1 − m/M)·s_cl²/m, where (1 − m/M) is the FPC | svydesign(ids=~cluster_id, fpc=~M, data=df); svymean() | m − 1 |
| One-stage PPS with replacement (PPSWR) | Hansen-Hurwitz: [1/(m(m−1))]·Σ(ȳᵢ−ȳ_HH)² | svydesign(ids=~cluster_id, probs=~pi_i, data=df) | m − 1 |
| Two-stage SRS within clusters | Between-cluster + within-cluster components; Taylor linearisation | svydesign(ids=~cluster_id+element_id, fpc=~M+N_i, data=df) | m − 1 (dominated by 1st stage) |
| Stratified PPS multistage (national surveys) | Taylor linearisation or BRR/jackknife | svydesign(ids=~psu+ssu, strata=~stratum_var, weights=~wt, data=df) | Σ(m_h − 1) across strata |
| m small (< 10 PSUs); unstable variance estimate | Collapsed PSU method; conservative estimate | Pair adjacent PSUs; compute between-pair variance | floor(m/2) |
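The first two rows of the table can be illustrated by hand. The sketch below is a minimal, standard-library illustration that assumes equal-size clusters in the equal-probability case and selection probabilities proportional to N_i in the PPSWR case, so that both point estimates reduce to the mean of the cluster means. The function names and the toy data are assumptions introduced here.

```python
from statistics import mean, variance


def srs_cluster_mean(cluster_means, M):
    """One-stage SRS of m equal-size clusters out of M, without replacement.

    Returns (estimate, estimated variance, degrees of freedom).
    """
    m = len(cluster_means)
    if m < 2:
        raise ValueError("at least two clusters are needed to estimate variance")
    y_bar = mean(cluster_means)
    s2_cl = variance(cluster_means)      # between-cluster variance, divisor m - 1
    v_hat = (1 - m / M) * s2_cl / m      # FPC applied at the cluster level
    return y_bar, v_hat, m - 1


def hansen_hurwitz_mean(cluster_means):
    """PPSWR with selection probability proportional to N_i.

    Returns (estimate, estimated variance, degrees of freedom).
    """
    m = len(cluster_means)
    if m < 2:
        raise ValueError("at least two clusters are needed to estimate variance")
    y_hh = mean(cluster_means)
    v_hat = sum((y - y_hh) ** 2 for y in cluster_means) / (m * (m - 1))
    return y_hh, v_hat, m - 1


if __name__ == "__main__":
    means = [10, 12, 14, 16]             # hypothetical cluster means
    print(srs_cluster_mean(means, M=40))
    print(hansen_hurwitz_mean(means))
```

The two variance estimates differ only by the finite population correction, which is why the with-replacement formula is often used as a slightly conservative approximation when m/M is small.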
Reporting Requirements for Cluster Sampling in Peer-Reviewed Research
(a) Cluster definition: Define what constitutes a cluster (school, hospital, geographic area, household), its geographic or administrative boundaries, and the rationale for treating it as the primary sampling unit. Justify why clustering was the only feasible design — or why it was preferable to SRS given the research context.
(b) PSU frame description: Identify the source of the cluster list, its date, the total number of clusters M, the range and distribution of cluster sizes N_i, and whether any clusters were excluded from the frame and why.
(c) First-stage selection: State whether equal-probability or PPS selection was used, and justify the choice in terms of cluster size variability. Document the selection procedure (SRS, systematic PPS, or other), the random seed(s) used, and m — the number of clusters selected.
(d) Second-stage design: For two-stage designs, describe the within-cluster sampling procedure, the target n_i, whether n_i was constant or variable across clusters, and the within-cluster sampling fraction f₂ = n_i/N_i.
(e) Intraclass correlation and design effect: Report the prior estimate of ρ used in the sample size calculation, the source of this estimate, and the actual DEFF computed from the collected data. Report both nominal n and effective n_eff = n/DEFF for all primary estimates.
(f) Variance estimation: Name the variance estimation method (Taylor linearisation, jackknife, BRR, between-PSU), specify the software and design specification syntax, and report the degrees of freedom used for inference. Do not report standard errors computed assuming SRS — this is a fundamental error in cluster sample analysis.
(g) Response rates: Report cluster-level (PSU) response rate and element-level (SSU) response rate separately per AAPOR standards. Document the non-response protocol at both stages. Assess whether non-response was systematically related to cluster characteristics (geography, size, accessibility) — a specific form of non-response bias unique to multi-stage designs.
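For item (e), one common way to obtain a data-based ρ is the one-way ANOVA estimator, from which DEFF and the effective sample size follow. The sketch below is a minimal illustration that assumes equal cluster takes; the function names and the toy data are assumptions introduced here, and a design-based DEFF from survey software (the ratio of the design variance to the SRS variance) is an alternative that may give a somewhat different value.

```python
from statistics import mean


def anova_icc(clusters):
    """One-way ANOVA estimator of rho for m clusters of equal size b."""
    m, b = len(clusters), len(clusters[0])
    if m < 2 or b < 2 or any(len(c) != b for c in clusters):
        raise ValueError("need >= 2 clusters, each with the same size >= 2")

    grand = mean(v for c in clusters for v in c)
    # Mean square between clusters and mean square within clusters.
    msb = b * sum((mean(c) - grand) ** 2 for c in clusters) / (m - 1)
    msw = sum((v - mean(c)) ** 2 for c in clusters for v in c) / (m * (b - 1))

    denom = msb + (b - 1) * msw
    if denom == 0:
        raise ValueError("no variation in the data; rho is undefined")
    return (msb - msw) / denom


def deff_and_neff(clusters):
    """Return (rho, DEFF, n_eff) using DEFF = 1 + (b - 1) * rho."""
    m, b = len(clusters), len(clusters[0])
    rho = anova_icc(clusters)
    deff = 1 + (b - 1) * rho
    if deff <= 0:
        raise ValueError("estimated DEFF is not positive; rho estimate unusable")
    return rho, deff, m * b / deff


if __name__ == "__main__":
    data = [[4, 5, 6], [7, 9, 8], [5, 5, 7], [9, 10, 8]]  # hypothetical scores
    rho, deff, n_eff = deff_and_neff(data)
    print(f"rho={rho:.3f}  DEFF={deff:.2f}  n_eff={n_eff:.1f}")
```

With very few clusters the ANOVA estimate of ρ is unstable and can be negative, so the reported value should be accompanied by its source and the number of clusters it rests on.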
Survey Software Commands for Cluster Design Specification
| Software | Design Specification Command | Notes |
|---|---|---|
| R (survey package) | svydesign(ids=~psu_id+ssu_id, strata=~stratum, weights=~wt, fpc=~M+N_i, data=df) | Lumley (2010); svymean(), svytotal(), svyglm() for design-correct inference |
| Stata | svyset psu_id [pweight=wt], strata(stratum) fpc(M) || ssu_id, fpc(N_i) | svy: mean outcome; svy: logistic; svy: regress for design-correct regression |
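| SAS | PROC SURVEYMEANS DATA=df TOTAL=total_psu; CLUSTER psu_id; STRATA stratum; WEIGHT wt; VAR outcome; RUN; | TOTAL= is an option on the PROC statement, not a separate statement; variance estimation is based on the first-stage clusters named in CLUSTER; add the DEFF keyword to the PROC statement to request design effects |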
| SPSS Complex Samples | CSPLAN … CLUSTER psu_var / INCLPROB wt_var; (then CSDESCRIPTIVES or CSLOGISTIC) | SPSS outputs design-corrected standard errors and DEFF for all estimates |
| Python (samplics) | TaylorEstimator(param="mean").estimate(y, psu=psu_var, samp_weight=wt) | samplics library; supports multi-stage designs with Taylor linearisation |
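To show what these design specifications compute in the simplest case, the sketch below implements the with-replacement ("ultimate cluster") Taylor linearisation variance for a weighted mean using only the standard library. It is an illustration under stated assumptions, not the internal code of any package in the table: the function name `linearised_mean` and its argument names are hypothetical, finite population corrections are ignored, and every stratum must contain at least two PSUs.

```python
from collections import defaultdict


def linearised_mean(y, w, psu, stratum=None):
    """Weighted mean with a PSU-level linearisation variance.

    y, w, psu, stratum : equal-length sequences (values, weights, PSU ids,
                         stratum ids); stratum=None means a single stratum.
    Returns (estimate, estimated variance, degrees of freedom).
    """
    n = len(y)
    if stratum is None:
        stratum = [0] * n

    w_sum = sum(w)
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / w_sum

    # Weighted PSU totals of the linearised variable (y - y_bar) / sum(w).
    totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, pi, hi in zip(y, w, psu, stratum):
        totals[hi][pi] += wi * (yi - y_bar) / w_sum

    v_hat, df = 0.0, 0
    for h, psu_totals in totals.items():
        m_h = len(psu_totals)
        if m_h < 2:
            raise ValueError(f"stratum {h!r} has a single PSU")
        z = list(psu_totals.values())
        z_bar = sum(z) / m_h
        # Between-PSU variation within the stratum.
        v_hat += m_h / (m_h - 1) * sum((zi - z_bar) ** 2 for zi in z)
        df += m_h - 1
    return y_bar, v_hat, df


if __name__ == "__main__":
    y = [10, 10, 12, 12, 14, 14, 16, 16]   # hypothetical outcomes
    w = [1.0] * 8                          # equal weights
    psu = [1, 1, 2, 2, 3, 3, 4, 4]
    print(linearised_mean(y, w, psu))
```

With equal weights and equal cluster sizes this reduces to the between-cluster formula Σ(ȳᵢ − ȳ)²/[m(m−1)], which is a convenient check when validating output from any of the packages listed.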
Primary Scholarly References
All content in this reference is grounded in peer-reviewed foundational literature in survey sampling, epidemiological methodology, and multilevel research design. References are formatted per APA 7th Edition.
- Hansen, M. H., Hurwitz, W. N., & Madow, W. G. (1953). Sample survey methods and theory (Vols. 1–2). John Wiley & Sons. [The foundational mathematical derivation of cluster sampling theory, PPS selection, the optimal cluster size formula, and the variance components of one-stage and two-stage designs. Volume 1 covers methods; Volume 2 covers the mathematical proofs.]
- Kish, L. (1965). Survey sampling. John Wiley & Sons. [Chapters 5 and 6 provide the definitive treatment of the design effect (DEFF), the intraclass correlation coefficient ρ, and the efficiency comparison between cluster sampling and SRS. Chapter 8 covers the DEFF framework in complex multi-stage designs.]
- Cochran, W. G. (1977). Sampling techniques (3rd ed.). John Wiley & Sons. [Chapter 9 provides a rigorous and comprehensive doctoral-level treatment of one-stage and two-stage cluster sampling, equal-probability and PPS selection, the optimal cluster size derivation, and variance estimation under all cluster designs.]
- Lohr, S. L. (2010). Sampling: Design and analysis (2nd ed.). Brooks/Cole. [Chapters 5–6 cover cluster sampling with accessible derivations, PPS implementation using the cumulative size method, and R-based survey software implementation including svydesign() for two-stage designs.]
- Groves, R. M., Fowler, F. J., Jr., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2009). Survey methodology (2nd ed.). John Wiley & Sons. [Total survey error framework applied to cluster designs; coverage error in PSU frames; non-response at the PSU and SSU levels; design specification in survey analysis software; the isolated PSU problem.]
- Wolter, K. M. (2007). Introduction to variance estimation (2nd ed.). Springer. [Chapters 3 and 6 cover variance estimation in cluster designs: Taylor linearisation, jackknife, and balanced repeated replication (BRR) for multi-stage designs; the lonely PSU problem and collapsed PSU solution.]
- Donner, A., & Klar, N. (2000). Design and analysis of cluster randomization trials in health research. Arnold. [The authoritative reference for cluster design effect and intraclass correlation in health research; empirical ρ values across health and social outcomes; sample size planning accounting for DEFF in cluster randomised trials.]
- Murray, D. M. (1998). Design and analysis of group-randomized trials. Oxford University Press. [Comprehensive treatment of group-randomised (cluster-randomised) designs; empirical ρ values for behavioural outcomes; power calculation methods accounting for the design effect.]
- Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Sage. [The definitive reference for multilevel modelling nested within cluster-sampled populations; variance decomposition at PSU and SSU levels; design-weighted multilevel estimation for informative cluster designs.]
- Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., & Rasbash, J. (1998). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society: Series B, 60(1), 23–40. [The foundational paper establishing how design weights must be incorporated into HLM/multilevel analyses of informatively selected cluster samples to avoid biased fixed-effect estimates.]
- Lemeshow, S., & Robinson, D. (1985). Surveys to measure programme coverage and impact: A review of the methodology used by the expanded programme on immunisation. World Health Statistics Quarterly, 38(1), 65–75. [Originating paper for the EPI 30×7 cluster sampling design widely used in public health surveillance; PPS-based cluster selection for population coverage estimation in resource-limited settings.]
- Lumley, T. (2010). Complex surveys: A guide to analysis using R. Wiley. [Practical implementation of multi-stage cluster designs in R's survey package; svydesign() specification for one-stage and two-stage cluster sampling; variance estimation using Taylor linearisation and jackknife; DEFF computation from fitted survey objects.]
- Särndal, C.-E., Swensson, B., & Wretman, J. (1992). Model assisted survey sampling. Springer. [Advanced model-assisted treatment of cluster sampling within a design-based inference framework; calibration estimators for multi-stage designs; the GREG estimator applied to cluster samples with auxiliary information.]
Recommended Further Reading for Doctoral Candidates
For the most rigorous mathematical treatment of PPS sampling without replacement and the associated exact variance estimators: Brewer, K.R.W., & Hanif, M. (1983). Sampling with Unequal Probabilities. Springer — particularly Chapters 3–5 on Sampford's and Brewer's methods. For applied estimation of ρ from pilot data and published benchmarks by research domain: Donner, A., & Klar, N. (2000, pp. 30–47) and Murray, D.M. (1998, pp. 98–134). For the most current treatment of design-based inference in multilevel models with complex sampling: Stapleton, L.M. (2006). An assessment of practical solutions for structural equation modeling with complex sample data. Structural Equation Modeling, 13(1), 28–58. For implementation of multi-stage complex designs in Stata including DEFF computation: StataCorp (current edition). Stata Survey Data Reference Manual — the svyset and estat effects documentation.