Cluster Sampling: A Doctoral-Level Reference

Section 01 — Foundational Theory

Epistemological & Theoretical Foundations

Cluster Sampling is a probability sampling design in which the population is divided into naturally occurring groups — called clusters — and a probability sample of clusters is selected. All elements within selected clusters are measured (one-stage), or a further probability sample is drawn within each selected cluster (two-stage or multi-stage). Cluster sampling is not stratification in reverse — it is a fundamentally different design with distinct efficiency properties, a unique cost logic, and specific variance consequences governed by the within-cluster intraclass correlation.

Formal Definition

Cluster sampling is a method of selecting a probability sample from a population of N elements organised into M mutually exclusive and exhaustive groups (clusters) of sizes N₁, N₂, …, N_M, by selecting a probability sample of m clusters and then measuring either all elements within selected clusters (one-stage), or a further probability sub-sample within each selected cluster (two-stage). The primary sampling unit (PSU) is the cluster; the element is the secondary or ultimate sampling unit (SSU).

— Kish, L. (1965). Survey Sampling. John Wiley & Sons, pp. 148–149; Cochran, W.G. (1977). Sampling Techniques (3rd ed.). John Wiley & Sons, pp. 233–235; Hansen, M.H., Hurwitz, W.N., & Madow, W.G. (1953). Sample Survey Methods and Theory (Vol. 1). John Wiley & Sons, pp. 97–100.

Illustration: M = 8 clusters, 3 clusters selected (one-stage); all elements within selected clusters are measured. Legend: unselected cluster · selected cluster · measured element · unmeasured element.

The Core Logic: Why Clusters, Not Elements

The defining question in cluster sampling is: why would a researcher deliberately accept a statistically less efficient design — one that produces larger sampling variances for a given total sample size — compared to simple random sampling of elements? The answer is economic and structural, not statistical. Cluster sampling exists to solve two problems that SRS cannot: the absence of a complete element-level frame, and the prohibitive cost of geographically dispersed element-level sampling.

Consider a national study of primary school students. A complete list of every student in the country does not exist in a single, accessible database — but a list of every school does. Cluster sampling selects schools (clusters) at random and then samples or enumerates students within selected schools. The researcher pays travel and access costs to reach each school once, then collects data from many students at that same school with minimal additional cost per student. The efficiency loss from clustering — measured by the design effect — is the statistical price paid for this logistical and cost advantage.

This trade-off is the defining intellectual structure of cluster sampling, and it distinguishes the design from every other probability method. All other common designs aim to maximise statistical efficiency. Cluster sampling deliberately accepts reduced statistical efficiency in exchange for gains in operational feasibility, cost control, and the ability to sample from populations for which no element-level frame exists (Hansen, Hurwitz & Madow, 1953, Vol. 1, pp. 97–100).

Historical Development

The systematic mathematical treatment of cluster sampling was established by Morris H. Hansen, William N. Hurwitz, and William G. Madow in their landmark two-volume work Sample Survey Methods and Theory (1953), published by John Wiley & Sons. This work, produced by researchers at the United States Bureau of the Census, formalised the variance formulae for one-stage and two-stage cluster designs, established the PPS (probability proportional to size) selection method, and proved the conditions under which cluster sampling is cost-optimal relative to SRS. The Census Bureau's use of cluster sampling for large-scale population surveys was the primary applied context that motivated this theoretical development.

Leslie Kish (1965), in Survey Sampling, provided the definitive treatment of the design effect (DEFF) and the intraclass correlation coefficient (ρ, roh) as the two fundamental quantities governing the efficiency of cluster designs relative to SRS. Kish's conceptualisation of DEFF = 1 + (b̄ − 1)ρ — where b̄ is the average cluster size and ρ is the intraclass correlation within clusters — remains the central analytical tool for cluster sampling design and has been adopted universally in survey methodology, epidemiology, and educational research. Cochran (1977) embedded cluster sampling within the complete probability sampling framework and derived the exact estimators and variance formulae that doctoral researchers apply directly.

Stage 1 — First-Stage Selection

Primary Sampling Units (PSUs)

A probability sample of m clusters is selected from the full list of M clusters. Selection may be with equal probability (SRS) or with probability proportional to cluster size (PPS). This is the only stage in one-stage cluster sampling.


Stage 2 — Second-Stage Selection

Secondary Sampling Units (SSUs)

Within each selected cluster, a probability sub-sample of n_i elements is drawn (SRS, systematic, or stratified). If all elements are measured, this is one-stage sampling. If a sub-sample is taken, this is two-stage (or multi-stage) cluster sampling.
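The two stages described above can be sketched in a few lines of Python. This is a minimal illustration, not a production sampler: the function name `select_two_stage`, the frame of eight hypothetical schools, and the choice of SRS at both stages are assumptions made for the example.

```python
import random

def select_two_stage(clusters, m, n_per_cluster=None, seed=0):
    """Select m clusters by SRS without replacement, then sub-sample elements.

    clusters: dict mapping a cluster id to the list of its elements.
    n_per_cluster=None means one-stage sampling (every element is measured).
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)            # stage 1: select PSUs
    sample = {}
    for cid in chosen:
        elements = clusters[cid]
        if n_per_cluster is None or n_per_cluster >= len(elements):
            sample[cid] = list(elements)                # one-stage: full enumeration
        else:
            sample[cid] = rng.sample(elements, n_per_cluster)  # stage 2: select SSUs
    return sample

# Hypothetical frame: 8 schools (clusters) with 10 students (elements) each.
frame = {f"school_{i}": [f"s{i}_{j}" for j in range(10)] for i in range(8)}

one_stage = select_two_stage(frame, m=3)                   # all 10 students per school
two_stage = select_two_stage(frame, m=3, n_per_cluster=4)  # 4 students per school
```

Note that only the list of schools is needed to run the first stage; student lists are required only for the schools that are actually selected.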

Why Cluster Sampling Is Used

Reason 01

No Element-Level Frame Exists

In many populations of research interest — all residents of a country, all students in a national education system, all patients treated in a national health network — a complete, current, accessible element-level list does not exist. A list of hospitals, schools, or geographic areas does exist. Cluster sampling makes probability sampling possible when no element frame is available, by using the cluster-level list as the first-stage frame.

Reason 02

Cost Reduction Through Geographic Concentration

When data collection involves travel or site access (face-to-face interviewing, physical measurement, classroom observation), an SRS drawn from a geographically dispersed population can scatter the sample across almost as many locations as there are sampled elements. Cluster sampling concentrates fieldwork within selected clusters, dramatically reducing travel cost and interviewer time. The statistical penalty (increased variance) is traded for an economic gain (reduced cost), and with appropriately sized clusters, the cost-efficiency can be far superior to SRS (Hansen, Hurwitz & Madow, 1953, Vol. 1, pp. 226–234).

Reason 03

Multi-Stage Institutional Sampling

Research in education, healthcare, organisational behaviour, and public policy typically targets individuals nested within institutions: students within schools, patients within hospitals, employees within firms. The institution is the natural primary sampling unit. Cluster sampling formalises this structure, treating the institution as the PSU and individuals as SSUs, producing a design that matches the data-generating structure of the research context and enables correct multilevel analysis.

Reason 04

Pilot Survey and Feasibility Studies

When a researcher is in the early stages of a research programme, detailed element-level frames are often unavailable. A cluster-based pilot survey — selecting a small number of clusters and exhaustively enumerating their elements — simultaneously gathers substantive data and constructs the element-level frame for a subsequent, more precise stage of sampling. This two-stage process is the standard approach for large national surveys in low- and middle-income countries where registration systems are incomplete (Groves et al., 2009, pp. 94–96).

Reason 05

Natural Cluster Structure in the Population

Many phenomena occur within naturally bounded groupings: infectious disease within households, learning within classrooms, voting behaviour within precincts. When the research question is specifically about between-cluster or within-cluster variation — as in multilevel modelling — cluster sampling is not merely a convenience but the scientifically correct design. Sampling clusters directly ensures that the design reflects the hierarchical structure being studied.

Reason 06

EPI Cluster Sampling in Public Health

The Expanded Programme on Immunisation (EPI) cluster sampling design, developed for the WHO and reviewed by Lemeshow and Robinson (1985), uses 30 clusters of 7 subjects (30×7 design) to estimate vaccination coverage in settings where no household roster exists. Each cluster is selected with PPS based on population counts. This is one of the most widely used cluster sampling applications in public health practice, illustrating the design's unique applicability in resource-constrained settings.


The Fundamental Trade-Off: Efficiency vs. Feasibility

Cluster sampling is the only common probability design in which statistical efficiency is deliberately reduced to achieve operational feasibility and cost control. Every other design — SRS, stratified, systematic — aims to maximise precision. Cluster sampling accepts reduced precision as the price for being able to conduct the survey at all, or at acceptable cost. The doctoral researcher's task is to quantify this trade-off explicitly: estimate the expected design effect DEFF = 1 + (b̄ − 1)ρ before the study, report the actual DEFF after data collection, and justify the cluster design in terms of the cost and frame arguments that make alternative designs infeasible or impractical (Kish, 1965, pp. 161–164; Cochran, 1977, pp. 241–243).

Section 02 — Mathematical Theory

Estimators, the Design Effect & Variance Theory

The mathematics of cluster sampling reveals that the key quantity governing all efficiency comparisons is the intraclass correlation coefficient ρ — the degree to which elements within the same cluster resemble one another. When ρ is high, clustering is extremely costly in statistical terms. When ρ is low or zero, clustering approaches SRS efficiency. Understanding ρ is therefore the central analytical task in any cluster sampling design.

1. Notation and Setup

Population and Sample Notation

M = total number of clusters in population
N_i = number of elements in cluster i; N = ΣN_i (total elements)
b̄ = N/M = average cluster size
m = number of clusters selected (PSUs)
n_i = elements sampled within cluster i (SSUs)
ȳ_i = sample mean of cluster i; ȳ_cl = (1/m)·Σ ȳ_i

PSU = Primary Sampling Unit (the cluster) · SSU = Secondary Sampling Unit (the element within a cluster)
In one-stage cluster sampling: n_i = N_i (all elements in selected clusters are measured).
In two-stage cluster sampling: n_i < N_i (a sub-sample is drawn within each selected PSU).

2. One-Stage Cluster Sampling: Estimator and Variance

Estimator — Equal-Sized Clusters (One-Stage)

ȳ_cl = (1/m) · Σᵢ₌₁ᵐ ȳᵢ

V(ȳ_cl) = (1 − m/M) · S_cl² / m

ȳ_i = mean of all elements in selected cluster i
S_cl² = between-cluster variance = (1/(M−1)) · Σ(Ȳ_i − Ȳ)² where Ȳ_i = true cluster mean
(1 − m/M) = finite population correction for the first stage
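Key insight: variance depends entirely on how much the cluster means Ȳ_i vary, not on within-cluster variance. If all clusters have identical means (Ȳ_i = Ȳ), then S_cl² = 0 and any single cluster estimates Ȳ exactly, so the design is maximally efficient. The more the cluster means differ (that is, the more homogeneous elements are within clusters), the larger V(ȳ_cl) becomes.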
Estimated: v̂(ȳ_cl) = (1 − m/M) · s_cl² / m where s_cl² = Σ(ȳᵢ − ȳ_cl)²/(m−1)
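As a worked illustration of the one-stage estimator and its estimated variance, the following sketch applies the two formulas to four hypothetical cluster means. The function name and the data (m = 4 selected clusters out of M = 40) are assumptions made for the example.

```python
from math import sqrt
from statistics import mean, variance

def one_stage_estimate(cluster_means, M):
    """Return (y_cl, v_hat) for one-stage sampling of equal-sized clusters.

    cluster_means: means of the m selected clusters (all elements measured).
    M: total number of clusters in the population.
    """
    m = len(cluster_means)
    y_cl = mean(cluster_means)
    s_cl2 = variance(cluster_means)      # sample variance with divisor m - 1
    v_hat = (1 - m / M) * s_cl2 / m      # finite population correction applied
    return y_cl, v_hat

# Hypothetical data: 4 selected clusters out of M = 40.
y_cl, v_hat = one_stage_estimate([52.0, 47.0, 60.0, 49.0], M=40)
se = sqrt(v_hat)
# y_cl is 52.0 and v_hat is about 7.35 (s_cl2 = 98/3, fpc = 0.9)
```

Only the cluster means enter the calculation; the individual element values are not needed once each cluster has been summarised.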

3. Two-Stage Cluster Sampling: Estimator and Variance

Two-Stage Estimator — Equal-Sized Clusters (Cochran, 1977, §9.3)

ȳ_ts = (1/m) · Σᵢ₌₁ᵐ ȳᵢ

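V(ȳ_ts) = (1 − f₁)·S_cl²/m + (1 − f₂)·S_w²/(m·n̄)

f₁ = m/M = first-stage sampling fraction (clusters) · f₂ = n̄/b̄ = second-stage sampling fraction (elements within clusters)
S_cl² = between-cluster variance · S_w² = within-cluster variance (average across clusters)
n̄ = average number of elements sampled per cluster
Decomposition: Total variance = first-stage component (between clusters) + second-stage component (within clusters).
Estimated: v̂(ȳ_ts) = (1 − f₁)·s_cl²/m + f₁·(1 − f₂)·s_w²/(m·n̄), where s_cl² and s_w² are the sample analogues. The factor f₁ appears only in the estimator, because s_cl² computed from sampled cluster means already absorbs part of the within-cluster sampling variation.
When f₁ is small (m ≪ M), the second term of the estimator is negligible and v̂(ȳ_ts) ≈ s_cl²/m. The first-stage term S_cl²/m of the true variance does not depend on n̄, so sampling more elements per cluster yields diminishing returns.
When f₂ is small (sampling few elements per cluster), the second-stage component is proportionally more important.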
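The sketch below computes a sample-based variance estimate for the two-stage mean with equal-sized clusters. It uses the estimator form usually paired with this design, in which the within-cluster term carries the factor f₁; the function name and the toy data are assumptions made for the example.

```python
from statistics import mean, variance

def two_stage_variance_estimate(samples, M, cluster_size):
    """Estimate the two-stage mean and its two variance components.

    samples: list of equal-length lists, one per selected cluster.
    M: total number of clusters; cluster_size: elements per cluster (b).
    """
    m = len(samples)
    n_bar = len(samples[0])
    f1 = m / M                      # first-stage sampling fraction
    f2 = n_bar / cluster_size       # second-stage sampling fraction
    cluster_means = [mean(s) for s in samples]
    s_cl2 = variance(cluster_means)                 # between sampled cluster means
    s_w2 = mean([variance(s) for s in samples])     # pooled within-cluster variance
    between = (1 - f1) * s_cl2 / m
    within = f1 * (1 - f2) * s_w2 / (m * n_bar)
    return mean(cluster_means), between, within

# Hypothetical data: 3 of M = 30 clusters, 2 of 20 elements sampled in each.
y_ts, between, within = two_stage_variance_estimate(
    [[4, 6], [10, 14], [7, 9]], M=30, cluster_size=20
)
v_hat = between + within
# between is about 3.70 and within is about 0.06, so the between term dominates
```

With f₁ = 0.1 the within-cluster term contributes less than 2% of the estimate, which illustrates why the between-cluster term is often reported alone when m ≪ M.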

4. The Intraclass Correlation Coefficient (ρ, roh)

Intraclass Correlation — Kish (1965), §5.4

ρ = (S_cl² − S_w²/b̄) / (S_cl² + (b̄−1)·S_w²/b̄)

S_cl² = S²[1 + (b̄−1)ρ] / b̄ (implied relationship)

ρ = intraclass correlation — correlation between values of any two elements in the same cluster
S² = overall population variance across all N elements
Range: −1/(b̄−1) ≤ ρ ≤ 1 · In practice for social/health research: typically 0.01 ≤ ρ ≤ 0.30
When ρ = 0: Cluster membership is irrelevant — elements within a cluster are no more alike than any two random elements from the population. Cluster sampling is as efficient as SRS.
When ρ → 1: All elements in each cluster are identical, so measuring additional elements within a selected cluster adds no new information. Each cluster yields information equivalent to one element. Ignoring the finite population correction, V(ȳ_cl) → S²/m, which is b̄ times the SRS variance S²/(m·b̄) for the same number of elements: catastrophic efficiency loss.
Negative ρ: Rare in practice; implies within-cluster heterogeneity exceeds random expectation — cluster sampling would be more efficient than SRS.
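The relationship between ρ and the two variance components can be checked numerically. The following sketch evaluates the formula for ρ and confirms the implied identity S_cl² = S²[1 + (b̄ − 1)ρ]/b̄, taking S² = S_cl² + (b̄ − 1)·S_w²/b̄ as in the denominator above. The function name and input values are hypothetical.

```python
def intraclass_rho(s_cl2, s_w2, b):
    """Intraclass correlation from between- and within-cluster variances."""
    return (s_cl2 - s_w2 / b) / (s_cl2 + (b - 1) * s_w2 / b)

# Hypothetical variance components for clusters of size b = 10.
s_cl2, s_w2, b = 4.0, 20.0, 10
rho = intraclass_rho(s_cl2, s_w2, b)          # 2 / 22, about 0.091

# Consistency check against the implied relationship for S_cl².
s2_total = s_cl2 + (b - 1) * s_w2 / b         # denominator of rho, here 22
implied_s_cl2 = s2_total * (1 + (b - 1) * rho) / b

# Boundary cases: identical elements within clusters, and identical cluster means.
rho_max = intraclass_rho(4.0, 0.0, b)         # 1.0
rho_min = intraclass_rho(0.0, 20.0, b)        # -1 / (b - 1)
```

The two boundary cases reproduce the stated range −1/(b̄ − 1) ≤ ρ ≤ 1.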

5. The Design Effect (DEFF) — The Central Quantity

Design Effect — Kish (1965), §8.2

DEFF = V(ȳ_cl) / V(ȳ_SRS,same n)

DEFF ≈ 1 + (b̄ − 1) · ρ

b̄ = average cluster size (number of elements per selected cluster)
ρ = intraclass correlation within clusters
DEFF ≥ 1 for positive ρ — cluster sampling always requires a larger sample than SRS to achieve the same precision when ρ > 0
Effective sample size: n_eff = n / DEFF < n — the cluster sample of n elements is statistically equivalent to only n/DEFF independent observations
Example: b̄ = 20 students per school, ρ = 0.15 (typical for academic achievement): DEFF = 1 + 19 × 0.15 = 3.85. The cluster sample of n = 1,000 students is equivalent to only 260 independently drawn students in precision terms. Conversely, a cluster sample of 3,850 students would be needed to match the precision of an SRS of 1,000 students: the cluster design requires 3.85× the observations of SRS for equivalent precision.
Critical implication: Ignoring DEFF and analysing cluster samples as if they were SRS underestimates standard errors by a factor of √DEFF, producing anti-conservative confidence intervals and inflated Type I error rates.

Design Effect by ρ and b̄ — Illustrative Values of DEFF = 1 + (b̄ − 1)ρ

Each bar represents the DEFF for a given combination of cluster size b̄ and intraclass correlation ρ. The dashed line at DEFF = 1.0 marks the SRS benchmark. Any bar extending beyond this represents statistical inefficiency relative to SRS — the cluster sample requires that many times more elements to match SRS precision.
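The values behind such a chart are easy to tabulate. The sketch below evaluates DEFF = 1 + (b̄ − 1)ρ over a small grid; the particular cluster sizes and ρ values are illustrative choices, not the ones used in the original figure.

```python
def design_effect(b_bar, rho):
    """Approximate design effect for average cluster size b_bar and ICC rho."""
    return 1 + (b_bar - 1) * rho

cluster_sizes = [5, 10, 20, 50]        # illustrative values of b_bar
rhos = [0.01, 0.05, 0.10, 0.30]        # illustrative values of rho

deff_table = {
    (b, rho): round(design_effect(b, rho), 2)
    for b in cluster_sizes
    for rho in rhos
}

# Print one row per cluster size and one column per rho.
print("b_bar " + "".join(f"rho={rho:<6}" for rho in rhos))
for b in cluster_sizes:
    print(f"{b:<6}" + "".join(f"{deff_table[(b, rho)]:<10}" for rho in rhos))
```

Reading across a row shows the effect of ρ at a fixed cluster size; reading down a column shows how quickly DEFF grows with b̄ even when ρ is small.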

6. Probability Proportional to Size (PPS) Sampling

When clusters have unequal sizes — the common case in practice — equal-probability selection of clusters is inefficient and introduces bias if cluster means are correlated with cluster size. Probability Proportional to Size (PPS) sampling addresses both problems by assigning each cluster a selection probability proportional to its size: π_i = m · N_i / N. In PPS sampling with a fixed sub-sample of n̄ elements per cluster, the product π_i × n̄/N_i = m·n̄/N = constant for all i — meaning every element has the same marginal inclusion probability regardless of cluster size. The design is therefore EPSEM — self-weighting — even when clusters are of different sizes.

PPS Selection Probability and the Hansen-Hurwitz Estimator

π_i = m · N_i / N (PPS inclusion probability for cluster i)

ȳ_HH = (1/m) · Σᵢ₌₁ᵐ ȳᵢ (self-weighting when n̄ constant)

v̂(ȳ_HH) = [1/(m(m−1))] · Σᵢ₌₁ᵐ (ȳᵢ − ȳ_HH)²

N_i = size of cluster i (known from the frame — the "measure of size")
m = number of clusters selected in the first stage
The Hansen-Hurwitz estimator ȳ_HH is an unbiased estimator under PPS with replacement sampling (PPSWR).
Variance estimation: v̂(ȳ_HH) requires only m ≥ 2 selected clusters — the variance is computed entirely from the between-cluster variation in ȳᵢ, requiring no within-cluster variance estimation.
Practical PPS methods: Systematic PPS (cumulative size method); Lahiri's method; Brewer's method; Sampford's method (for without-replacement PPS). The cumulative size method is by far the most commonly implemented in survey practice (Cochran, 1977, pp. 251–259; Lohr, 2010, pp. 176–184).
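A minimal sketch of systematic PPS selection by the cumulative size method follows. It assumes every cluster is smaller than the sampling interval N/m, so no cluster can be hit twice; the function name and the cluster sizes are hypothetical.

```python
import random

def systematic_pps(sizes, m, seed=0):
    """Select m cluster indices by systematic PPS (cumulative size method)."""
    rng = random.Random(seed)
    total = sum(sizes)
    interval = total / m                    # sampling interval on the size scale
    start = rng.random() * interval         # random start in [0, interval)
    targets = [start + k * interval for k in range(m)]

    selected = []
    cumulative = 0                          # total size of clusters already passed
    idx = 0
    for t in targets:
        # Advance to the cluster whose cumulative size range contains t.
        while idx < len(sizes) - 1 and cumulative + sizes[idx] <= t:
            cumulative += sizes[idx]
            idx += 1
        selected.append(idx)
    return selected

# Hypothetical measures of size for 8 clusters (N = 1000), m = 4, interval = 250.
sizes = [120, 80, 200, 50, 150, 100, 200, 100]
selected = systematic_pps(sizes, m=4)

# Inclusion probability of each cluster: pi_i = m * N_i / N.
inclusion_probs = [4 * s / sum(sizes) for s in sizes]
```

A cluster larger than the interval would be selected with certainty and possibly more than once; in practice such clusters are usually handled separately as certainty selections.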

7. Optimal Number of Elements per Cluster

Optimal b̄ — Cost-Variance Trade-Off (Hansen, Hurwitz & Madow, 1953)

b̄_opt = √[(c₁/c₂) · (1 − ρ)/ρ]

c₁ = cost of selecting and accessing one cluster (PSU travel cost, site establishment cost)
c₂ = cost of measuring one additional element within an already-selected cluster (marginal element cost)
ρ = intraclass correlation
Logic: When ρ is high (within-cluster elements are similar), additional elements within a cluster add little new information → small b̄_opt. When ρ is low (within-cluster heterogeneity is high), additional elements are informative → larger b̄_opt.
When c₁ ≫ c₂ (access is expensive, marginal measurement is cheap) → large b̄_opt (measure many elements per cluster to justify the fixed access cost).
When c₁ ≈ c₂ (access and measurement cost roughly equally) → b̄_opt approaches √[(1−ρ)/ρ].
Example: ρ = 0.10, c₁ = $500 (travel), c₂ = $10 (interview): b̄_opt = √[(500/10) · (0.9/0.1)] = √[50 × 9] = √450 ≈ 21 elements per cluster.
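The formula and the worked example can be reproduced as follows. The budget calculation at the end is an added illustration that assumes the simple linear cost model C = m·(c₁ + c₂·b̄); the budget figure of $20,000 is hypothetical.

```python
from math import sqrt

def optimal_cluster_size(c1, c2, rho):
    """Cost-optimal number of elements per cluster for 0 < rho < 1."""
    if not 0 < rho < 1:
        raise ValueError("rho must lie strictly between 0 and 1")
    return sqrt((c1 / c2) * (1 - rho) / rho)

# Values from the example: c1 = 500, c2 = 10, rho = 0.10.
b_opt = optimal_cluster_size(500, 10, 0.10)      # sqrt(450), about 21.2
b_used = round(b_opt)                            # 21 elements per cluster

# Assumed linear cost model: each cluster costs c1 + c2 * b_used.
budget = 20_000                                  # hypothetical total budget
cost_per_cluster = 500 + 10 * b_used             # 710
clusters_affordable = budget // cost_per_cluster # 28 clusters
```

Because b̄_opt depends on the square root of the cost ratio, even a fourfold increase in c₁ only doubles the optimal cluster size.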

8. Variance Estimation in Cluster Designs

A critical property of cluster sampling — shared with stratified sampling but not with systematic sampling — is that design-based unbiased variance estimation is straightforward when m ≥ 2 clusters are selected. The between-cluster variance in observed cluster means ȳᵢ is directly estimable from the m selected clusters, producing an unbiased estimate of the first-stage variance component. This property holds for both PPS and equal-probability selection, and for both one-stage and two-stage designs (with different formulas). For complex multi-stage designs, Taylor linearisation (the delta method), jackknife, and balanced repeated replication (BRR) are the standard approaches, all implemented in major survey software packages.


The Anti-Conservative Analysis Error: Ignoring the Design Effect

The most prevalent and consequential error in the analysis of cluster samples is treating the data as if they were drawn by SRS — computing standard errors as √(s²/n) and conducting standard OLS-based significance tests without accounting for the clustering structure. When the true DEFF is, say, 3.0 and the analyst ignores it, reported standard errors are underestimated by a factor of √3 ≈ 1.73, reported 95% confidence intervals are too narrow, and the nominal Type I error rate of 0.05 may correspond to an actual Type I error rate of 0.20 or higher. This is not a minor correction — it is a fundamental error that has led to numerous false positives in published social science and public health literature. All major methodologists agree: cluster samples must be analysed using design-correct methods that explicitly account for the PSU structure, regardless of the statistical software defaults used (Kish, 1965, pp. 258–265; Groves et al., 2009, pp. 239–244; Lohr, 2010, pp. 158–162).
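The size of this error can be approximated for a two-sided large-sample z-test. The sketch below assumes the test statistic is computed with an SRS standard error while its true standard deviation is larger by a factor of √DEFF; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def actual_type1_error(deff, alpha=0.05):
    """Actual two-sided rejection rate when clustering is ignored.

    Assumes a large-sample z-test whose reported standard error is too
    small by a factor of sqrt(deff).
    """
    std_normal = NormalDist()
    z_nominal = std_normal.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_actual = z_nominal / sqrt(deff)               # threshold on the true scale
    return 2 * (1 - std_normal.cdf(z_actual))

rate_no_clustering = actual_type1_error(1.0)   # 0.05, the nominal level
rate_deff_3 = actual_type1_error(3.0)          # roughly 0.26
```

Under these assumptions a DEFF of 3.0 turns a nominal 5% test into one that rejects a true null hypothesis about one time in four.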

Section 03 — Interactive Learning Tool

Cluster Sampling Simulator

Configure the number of clusters, cluster size, the number of clusters to select, and the sampling stage. Observe how PSU selection distributes across the population, compute the live design effect, and compare the sampling distribution of the cluster mean against the SRS benchmark.

Cluster Sampling Monte Carlo Simulator

Visualises PSU selection, within-cluster sub-sampling, DEFF, and the sampling distribution of ȳ_cl

Controls (default values): Total Clusters (M) = 12 · Cluster Size (b̄) = 8 · Clusters Selected (m) = 4 · Intraclass ρ = 0.15 · Sampling Stage

Readouts: Elements Measured n · DEFF ≈ 1+(b̄−1)ρ · Eff. Sample n/DEFF · Cluster Mean ȳ_cl · SE(ȳ_cl)

Panel 1: Population Clusters (dark border & bold = selected PSU · shaded units = measured elements)

Panel 2: Sampling distribution of ȳ_cl across simulation runs, compared to SRS benchmark (narrower = more efficient)


What the Simulator Demonstrates

Draw Sample: Randomly selects m PSUs from the M available clusters. Selected clusters are highlighted. In one-stage mode, all b̄ elements within selected clusters are shaded as measured. In two-stage mode, a random sub-sample of elements is drawn within each selected cluster. The live DEFF and effective sample size update immediately, showing the statistical cost of the clustering structure.

ρ Slider: Adjusting the intraclass correlation from 0 to 0.50 directly updates DEFF = 1 + (b̄ − 1)ρ. Watch the effective sample size drop precipitously as ρ increases — this is the most direct illustration of why high ρ makes cluster sampling statistically expensive.

Run Simulation: Executes 300 independent cluster samples, each time randomly selecting m PSUs and computing ȳ_cl. The resulting histogram displays the empirical sampling distribution of the cluster mean. Higher ρ and larger b̄ produce wider distributions — confirming the DEFF formula empirically.
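The logic of such a simulation can be reproduced offline. The sketch below is not the simulator's own code: it builds a synthetic population with y = √ρ·u + √(1−ρ)·e (cluster effect u and element noise e both standard normal), which is one common way to induce an intraclass correlation near ρ, and it uses more clusters and runs than the defaults above so that the empirical ratio is reasonably stable.

```python
import random
from statistics import variance

def simulate_deff(M=200, b=8, m=20, rho=0.3, runs=2000, seed=1):
    """Empirical variance ratio: one-stage cluster mean vs. SRS mean of equal n."""
    rng = random.Random(seed)
    # Build the population: a shared cluster effect plus element-level noise.
    clusters = []
    for _ in range(M):
        u = rng.gauss(0, 1)
        clusters.append(
            [rho ** 0.5 * u + (1 - rho) ** 0.5 * rng.gauss(0, 1) for _ in range(b)]
        )
    elements = [y for cluster in clusters for y in cluster]

    n = m * b
    cluster_means, srs_means = [], []
    for _ in range(runs):
        chosen = rng.sample(clusters, m)                  # one-stage cluster sample
        cluster_means.append(sum(sum(c) for c in chosen) / n)
        srs_means.append(sum(rng.sample(elements, n)) / n)  # SRS of the same size
    return variance(cluster_means) / variance(srs_means)

empirical_deff = simulate_deff()
theoretical_deff = 1 + (8 - 1) * 0.3      # about 3.1
```

The empirical ratio should land in the neighbourhood of the theoretical value rather than exactly on it, since both the finite synthetic population and the limited number of runs add noise.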

Section 04 — Critical Evaluation

Assumptions, Conditions & Limitations

Cluster sampling carries a specific and consequential set of assumptions. Five of these — the exhaustiveness of the cluster list, the known or estimable cluster sizes for PPS, the independence of cluster selection, the adequacy of m, and the correct analysis accounting for the design effect — require explicit justification and documentation in any doctoral research employing this design.

Formal Assumptions

Assumption	Technical Requirement	Violation Consequence	Diagnostic / Remedy
Exhaustive Cluster List (First-Stage Frame)	Every cluster in the defined target population must appear on the PSU frame; no cluster can have zero probability of selection	Coverage error: elements in unlisted clusters have πᵢ = 0, violating EPSEM and introducing coverage bias of unknown direction	Audit the PSU frame against administrative records or geographic maps; quantify the proportion of the population in uncovered clusters; assess non-coverage bias direction
Known Cluster Sizes (for PPS)	The measure of size N_i must be known for every cluster on the PSU frame to compute PPS selection probabilities π_i = m·N_i/N	Incorrect size measures produce selection probabilities that deviate from the intended PPS design; if sizes have changed since the frame was constructed, the effective inclusion probabilities differ from the nominal ones	Use the most recent available size measure; document its vintage; implement a size ratio estimator if sizes are outdated; conduct sensitivity analysis using the range of plausible size values
Independent Cluster Selection	The selection of one cluster must be statistically independent of the selection of any other cluster at the first stage	Correlated PSU selection (e.g., systematic PPS with undisclosed periodicity) invalidates the standard variance formula; actual variance may differ substantially from the estimated variance	Use SRS without replacement or PPS with replacement for PSU selection; document the selection mechanism; if systematic PPS is used, check the ordering for periodicity matching the skip interval
m ≥ 2 PSUs Selected	At least two clusters must be selected to permit design-based variance estimation from the between-cluster variation in ȳᵢ	Single PSU selection (m = 1) makes variance estimation impossible by design; no information exists about between-cluster variation from the data alone	Always select m ≥ 2 PSUs; for reliable variance estimation in complex multi-stage designs, m ≥ 20–30 PSUs is commonly recommended; document the total PSU count and justify the choice of m in the study protocol
Correct Design-Based Analysis	All statistical analyses must account for the PSU structure, cluster weights, and the two-stage sampling design using appropriate survey analysis methods	Ignoring the design structure and treating data as SRS underestimates standard errors by √DEFF, inflates test statistics, and produces anti-conservative p-values with actual Type I error rates far exceeding the nominal level	Specify the survey design in software using svydesign (R), svyset (Stata), PROC SURVEYMEANS (SAS), or CSPLAN (SPSS); use Taylor linearisation, jackknife, or BRR for variance estimation; report DEFF for all primary estimates
Complete Enumeration Within Selected Clusters (One-Stage)	In one-stage designs, every element in each selected cluster must be contacted and measured; missing elements introduce selection bias if missingness is related to the outcome	Incomplete cluster enumeration converts the design from a probability to a non-probability sample within the affected cluster; bias magnitude depends on the correlation between the outcome and the probability of element exclusion	Pre-commit to exhaustive within-cluster enumeration in the protocol; establish explicit inclusion/exclusion criteria for elements within clusters; track and report the within-cluster response rate separately from the cluster-level response rate

Core Limitations

The central statistical limitation of cluster sampling is its reduced precision relative to SRS for the same total number of elements measured. This inefficiency is irreducible when within-cluster homogeneity (ρ > 0) exists — and in virtually all real-world contexts, it does. Students in the same school share a teacher, curriculum, and socioeconomic environment. Patients in the same hospital share clinical protocols and local disease burden. Residents of the same neighbourhood share infrastructure, services, and social norms. Wherever the research context provides a natural cluster, the clustering variable is almost always correlated with the outcome variable, producing positive ρ.

The magnitude of this inefficiency is often dramatically underestimated by researchers who have not formally computed DEFF. Kish (1965, pp. 257–262) documents ρ values of 0.10–0.30 as typical for educational achievement, health behaviours, and socioeconomic indicators. With b̄ = 25 elements per cluster (a modest school-based study), these values imply DEFF = 1 + 24×0.10 = 3.40 to DEFF = 1 + 24×0.30 = 8.20. This means the effective sample size is only 29% to 12% of the nominal sample size — an enormous loss that has major consequences for power calculations and confidence interval width.

The doctoral researcher must always conduct formal power calculations using the expected DEFF before data collection, using available prior estimates of ρ from the literature or from pilot studies. Failing to account for DEFF in power calculations leads to severely underpowered studies — one of the most common and consequential errors in cluster-based doctoral research (Donner & Klar, 2000; Murray, 1998).
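In the simplest case, accounting for DEFF in a power calculation means inflating the sample size required under SRS. The sketch below assumes the SRS requirement has already been obtained from a conventional power calculation; the figure of 400 and the function name are hypothetical.

```python
from math import ceil

def cluster_sample_size(n_srs, b_bar, rho):
    """Inflate an SRS sample size by the expected design effect.

    Returns (deff, total elements needed, clusters needed).
    """
    deff = 1 + (b_bar - 1) * rho
    # Round before taking the ceiling to avoid floating-point overshoot.
    n_total = ceil(round(n_srs * deff, 6))
    m = ceil(n_total / b_bar)
    return deff, n_total, m

# Hypothetical SRS requirement of 400 elements, b_bar = 25, rho = 0.10.
deff, n_total, m = cluster_sample_size(400, 25, 0.10)
# deff is about 3.4, n_total is 1360 elements, m is 55 clusters
```

Because ρ is rarely known precisely before fieldwork, it is sensible to repeat the calculation over a plausible range of ρ values rather than rely on a single estimate.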

A non-intuitive but critical property of one-stage cluster sampling is that its variance depends on the between-cluster variance S_cl² — not the total population variance S². This means that increasing the number of elements measured per cluster (while holding m fixed) does not reduce the first-stage variance component at all. The only way to reduce V(ȳ_cl) in a one-stage design is to increase m — the number of clusters selected. This is precisely the opposite of the intuition most researchers bring from SRS, where increasing the sample size always reduces variance.

The practical implication is that for a fixed total budget, the optimal design almost always involves selecting more clusters with fewer elements per cluster, rather than fewer clusters with exhaustive within-cluster measurement. This is the fundamental insight of the b̄_opt formula: b̄_opt = √[(c₁/c₂)·(1−ρ)/ρ]. Unless the cluster access cost c₁ is extraordinarily high relative to the per-element cost c₂ — which is sometimes true in remote geographic sampling — the optimal cluster size is typically much smaller than the cluster's natural size N_i (Cochran, 1977, pp. 241–244; Hansen, Hurwitz & Madow, 1953, Vol. 1, pp. 226–234).

Doctoral researchers designing cluster studies should treat the choice of m (number of clusters) as the primary efficiency-determining parameter and b̄ (elements per cluster) as a secondary parameter determined by the cost-optimisation formula — not by convenience or the desire to measure every available element within selected clusters.

In virtually all real-world cluster sampling applications, clusters are of unequal size — schools have different numbers of students, hospitals have different numbers of patients, geographic areas have different population counts. Unequal cluster sizes create two distinct problems that must be addressed separately.

Problem 1 — Bias under equal-probability selection: If clusters are selected with equal probability and cluster means Ȳ_i are correlated with cluster sizes N_i (which they commonly are — larger schools may have lower per-student resources, larger hospitals may treat more complex cases), the simple cluster mean ȳ_cl = (1/m)Σȳ_i is a biased estimator of the population mean Ȳ. The bias arises because large clusters are underrepresented in the equal-probability design relative to their contribution to the population. The ratio estimator ȳ_r = (Σ n_i ȳ_i)/(Σ n_i) — weighting cluster means by cluster size — removes this bias (approximately) but introduces a small bias of its own from the ratio approximation.

Problem 2 — Increased variance: Even with an unbiased estimator, unequal cluster sizes increase the sampling variance relative to the equal-size case because large clusters contribute disproportionately to the sample mean in some draws and less in others. PPS sampling directly addresses this: by selecting clusters with probability proportional to N_i and measuring a fixed n̄ elements per cluster, every element has the same marginal inclusion probability and the estimator is exactly unbiased without requiring ratio estimation (Cochran, 1977, pp. 247–252; Kish, 1965, pp. 186–192).
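The bias described under Problem 1 can be made concrete by enumerating every possible sample from a tiny population. The sketch below uses three hypothetical clusters whose means fall as size rises, selects m = 2 clusters with equal probability (one-stage, so n_i = N_i), and compares the expected value of each estimator with the true population mean.

```python
from itertools import combinations

# Hypothetical population: larger clusters have lower means.
sizes = [100, 200, 700]
means = [70.0, 60.0, 50.0]
true_mean = sum(n * y for n, y in zip(sizes, means)) / sum(sizes)   # 54.0

def unweighted_mean(idx):
    """Simple average of the selected cluster means."""
    return sum(means[i] for i in idx) / len(idx)

def ratio_estimate(idx):
    """Cluster means weighted by cluster size."""
    return sum(sizes[i] * means[i] for i in idx) / sum(sizes[i] for i in idx)

# All equally likely samples of m = 2 clusters out of M = 3.
samples = list(combinations(range(len(sizes)), 2))
expected_unweighted = sum(unweighted_mean(s) for s in samples) / len(samples)
expected_ratio = sum(ratio_estimate(s) for s in samples) / len(samples)
# expected_unweighted is 60.0 (bias 6.0); expected_ratio is about 56.0 (bias about 2.0)
```

In this deliberately extreme example the ratio estimator removes most, but not all, of the bias; its residual bias shrinks as the number of selected clusters grows.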

Variance estimation in cluster samples requires at least m = 2 selected PSUs to compute the between-cluster variance s_cl² = Σ(ȳᵢ − ȳ_cl)²/(m−1). When m is small, as it often is in budget-constrained studies, the variance estimate itself becomes highly unstable: with m = 4 clusters, the variance estimate has only m − 1 = 3 degrees of freedom, so the estimated standard error, and with it the width of any confidence interval, is itself very imprecise. The limiting case, a stratum that contains only one sampled PSU and therefore permits no variance estimate at all, is the cluster-level analogue of the single-unit stratum problem in stratified sampling and is often called the "lonely PSU" problem (Wolter, 2007, pp. 158–162).

Practical consequences: (1) t-statistics for cluster-based inference should use m − 1 degrees of freedom at the PSU level, not n − 1 at the element level. For m = 4 clusters, the critical value for a 95% CI is t₃,0.975 = 3.18, not 1.96, which is a very substantial correction. (2) The "collapsed stratum" technique, which pairs similar strata of a stratified cluster design that each contain a single sampled PSU and estimates variance from the differences within pairs, provides a conservative but computable variance estimate when within-stratum variance cannot be computed directly. (3) For regulatory and policy surveys, most methodologists recommend m ≥ 20 PSUs per domain to support stable variance estimation (Groves et al., 2009; Lohr, 2010, pp. 178–180).

When cluster sampling is used and the intraclass correlation ρ is non-trivial, the researcher faces a conceptually important distinction between the statistical clustering problem (the need for design-correct analysis) and the substantive multilevel question (whether and how cluster membership affects the outcome). These two issues are related but distinct, and conflating them is a common source of analytical errors.

The statistical issue: Regardless of whether cluster membership causally affects the outcome, the sampling structure requires design-correct standard errors. A researcher who uses survey regression (svyglm in R, or svy: regress in Stata) accounts for the design structure in the standard errors without making any multilevel causal claims.

The substantive multilevel issue: If the researcher wishes to model the cluster-level context as an explanatory factor — e.g., testing whether school-level socioeconomic composition predicts individual achievement beyond individual-level SES — multilevel models (random effects models, hierarchical linear models) are required. These models must account for the non-random nature of the cluster sample if cluster selection was not independent of the outcome (e.g., a convenience cluster sample of easily accessible schools). Using HLM without accounting for the complex sampling design can produce biased fixed-effect estimates if the PSU selection was informative (Raudenbush & Bryk, 2002; Pfeffermann et al., 1998).

The doctoral researcher must therefore be explicit about which analysis goal is operative: design-correct estimation of population parameters (use survey regression), or estimation of contextual effects and their variance decomposition (use multilevel models with design weights and PSU indicators).

Section 05 — Comparative Analysis

Cluster Sampling vs. Other Probability Designs

Cluster sampling occupies a unique niche among probability designs: it is the least statistically efficient of the common designs, yet the most operationally feasible when no element-level frame exists and geographic concentration of data collection is required. Understanding precisely where it is justified — and where its efficiency costs make alternative designs preferable — is essential for doctoral-level design selection.

Criterion	Cluster Sampling	SRS	Stratified RS	Systematic RS
Frame Required	Cluster list only; no element list needed at design stage	Complete element list with unique IDs	Complete element list with stratum membership	Ordered element list; no strata required
Statistical Efficiency	Lowest: DEFF = 1+(b̄−1)ρ ≥ 1 for ρ ≥ 0; effective n often far below nominal n	Baseline: DEFF = 1 by definition	Highest: DEFF < 1 under proportional allocation with correlated stratification variable	Better than SRS (ordered frame); worse if periodic
Variance Estimation	Design-based unbiased when m ≥ 2; requires PSU structure specification in software	Exact, unbiased; simple formula s²/n	Exact, unbiased when n_h ≥ 2	Approximate only (fundamental theoretical limitation)
Cost Efficiency	Highest: geographic concentration reduces travel and access cost dramatically	Lowest: geographically dispersed elements require many site visits	Moderate: dispersed but cost-optimal allocation possible	Moderate: sequential selection may still be dispersed
EPSEM Property	With PPS first stage and a fixed second-stage take, or with equal-probability first stage and a constant second-stage fraction; not with equal-probability first stage and a fixed take from unequal clusters	Always EPSEM	Only under proportional allocation	Always EPSEM (random or non-periodic frame)
Without Element Frame	Feasible: cluster list substitutes for element frame	Impossible	Impossible	Impossible
Domain Estimation	Feasible if domains align with clusters; poor for cross-cutting domains (e.g., age group across all clusters)	Unreliable for rare subgroups	Excellent: guaranteed n_h per domain by design	Limited
Multilevel Analysis	Natural fit: PSU structure maps directly onto HLM level-2 units; cluster effects directly estimable	No natural cluster structure; HLM not applicable	Strata usable as level-2 grouping but stratification ≠ clustering	No natural cluster structure
Best Used When	No element frame; geographically dispersed population; multi-stage institutional structure (schools, hospitals, firms); cost constraints require site concentration; multilevel research design	Complete element frame; homogeneous population; no geographic constraints; simplest possible design adequate	Heterogeneous subgroups; domain estimates mandatory; complete frame with stratum info; variance reduction a priority	Ordered frame; sequential/continuous populations; no periodic structure; operational simplicity required
Foundational Reference	Hansen, Hurwitz & Madow (1953); Kish (1965) Chs. 5–6; Cochran (1977) Ch. 9	Cochran (1977) Ch. 2	Neyman (1934); Cochran (1977) Chs. 5–6	Madow & Madow (1944); Cochran (1977) Ch. 8


Stratified Cluster Sampling: The Real-World Standard

In large-scale national surveys (government population surveys, educational assessments, health examination surveys), pure cluster sampling is rarely used alone. The standard in practice is stratified multistage cluster sampling: the population is first divided into strata (by geographic region, urbanicity, population density, or institutional type), and then clusters are selected independently within each stratum using PPS. This hybrid design captures the efficiency advantages of stratification (the between-stratum variance component is removed from the estimator's variance) while retaining the operational advantages of clustering (no element-level frame, geographic concentration). The design effect of the combined design is approximately DEFF_combined ≈ DEFF_cluster × DEFF_stratification, where DEFF_stratification < 1 captures the gain from stratifying; the product is typically still greater than 1 but substantially smaller than for pure cluster sampling (Kish, 1965, pp. 248–255; Groves et al., 2009, pp. 120–126).

When Cluster Sampling Is the Appropriate Choice

Condition 01

No Complete Element-Level Frame

When the target population consists of individuals nested within institutions or geographic areas, and no complete, accessible list of individuals exists — but a complete list of institutions or areas does — cluster sampling is the only probability design available. This condition is the primary justification for cluster sampling and applies to a majority of large-scale social, health, and educational research contexts in both high- and low-income countries.

Condition 02

Data Collection Requires Physical Site Access

When the measurement process requires researchers to be physically present — classroom observations, medical examinations, facility audits, biometric data collection — each additional site visited imposes substantial fixed costs. Cluster sampling minimises the number of sites visited while maximising the data yield per site, producing the lowest cost per unit of information when between-site travel or access cost is high relative to within-site per-element cost.

Condition 03

Research Question is Explicitly Multilevel

When the research question concerns relationships that operate at both the individual and the institutional level — school effects on student learning, neighbourhood effects on health behaviour, organisational culture effects on employee outcomes — cluster sampling is not merely acceptable but scientifically required. The PSU structure must be preserved in the design so that between-cluster and within-cluster variances can be separately estimated, enabling valid multilevel analysis (Raudenbush & Bryk, 2002).

Condition 04

Low Intraclass Correlation and Large m

The statistical cost of clustering is minimised when ρ is small and m is large. When prior research or pilot data suggest ρ < 0.05 and the design permits selecting m ≥ 30 clusters, the design effect is bounded by DEFF < 1 + (b̄ − 1) × 0.05, which may be acceptably close to 1.0 for modest b̄, while the large m keeps the variance estimate stable. In these circumstances, the operational advantages of cluster sampling can be realised at minimal statistical cost: the rare scenario in which cluster sampling is competitive with SRS on both cost and precision grounds.
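To make "acceptably close to 1.0 for modest b̄" concrete, the short sketch below evaluates DEFF = 1 + (b̄ − 1)ρ at the boundary value ρ = 0.05 for a few cluster takes. The chosen b̄ values and the function name are illustrative assumptions only.

```python
def design_effect(b_bar, rho):
    """Kish approximation DEFF = 1 + (b_bar - 1) * rho."""
    return 1.0 + (b_bar - 1.0) * rho

rho = 0.05  # boundary value of the intraclass correlation in Condition 04
for b_bar in (5, 10, 20, 40):  # illustrative average cluster takes
    deff = design_effect(b_bar, rho)
    print(f"b_bar = {b_bar:>2}: DEFF = {deff:.2f}, "
          f"effective n per 1,000 sampled = {1000 / deff:.0f}")
```

At ρ = 0.05 the design effect is 1.20 for b̄ = 5 but already 2.95 for b̄ = 40, so the condition is only useful when the per-cluster take stays small.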

Section 06 — Procedural Guide

Implementation Protocol for Doctoral Research

Rigorous implementation of cluster sampling requires explicit documentation of every methodologically consequential decision: the definition of clusters and their boundaries, the PSU frame, the selection method (equal probability or PPS), the second-stage design, the b̄_opt calculation, and the variance estimation approach. The following seven-step protocol meets the reporting standards of APA 7th Edition, STROBE, and CONSORT-equivalent guidelines for cluster-randomised and cluster-sampled designs.

Define Clusters & Obtain PSU Frame

Define cluster boundaries (geographic, institutional, or administrative). Verify the PSU frame is exhaustive. Record cluster identifiers and measures of size N_i from the most recent available source.

Estimate ρ and Compute DEFF

Obtain prior estimates of ρ from published literature or pilot data. Compute the expected DEFF = 1 + (b̄ − 1)ρ for planned b̄. Use DEFF to adjust sample size requirements upward from the SRS-based n: n_cluster = n_SRS × DEFF.

Compute b̄_opt and Determine m

Using ρ and the cost ratio c₁/c₂, compute b̄_opt = √[(c₁/c₂)·(1−ρ)/ρ]. Determine m = n_cluster / b̄_opt. Enforce m ≥ 2 (minimum for variance estimation); aim for m ≥ 20 for stable inference.

Select PSUs (Equal Prob. or PPS)

If clusters are equal or near-equal in size: SRS without replacement is appropriate. If clusters differ substantially in size: use PPS selection (cumulative size / systematic PPS). Document the selection method and all random seeds used.

Enumerate Elements Within Selected Clusters

Obtain or construct a complete element list for each selected cluster. In one-stage designs: measure all N_i elements. In two-stage designs: apply a pre-specified probability design (SRS, systematic) within each selected cluster to select n_i elements.

Collect Data & Track Response

Apply the pre-specified contact and non-response protocol. Record cluster-level and element-level response dispositions separately per AAPOR standards. Document refusals, non-contacts, and ineligibles at both PSU and SSU levels.

Analyse with Design Specification

Specify the PSU, strata (if any), and weights in survey software. Compute ȳ_cl, v̂(ȳ_cl), and DEFF. Report DEFF, effective n, and degrees of freedom = m − 1 (per stratum) for all primary estimates.
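The planning arithmetic in the "Estimate ρ and Compute DEFF" and "Compute b̄_opt and Determine m" steps can be sketched as follows. All inputs (n_SRS, ρ, c₁, c₂) are hypothetical planning values, the function name is an assumption, and DEFF is evaluated at b̄_opt rather than at a separately planned b̄.

```python
from math import ceil, sqrt

def plan_two_stage(n_srs, rho, c1, c2):
    """Return (b_opt, deff, n_cluster, m) for a two-stage cluster design.

    n_srs: sample size that would suffice under SRS
    rho:   prior intraclass correlation (0 < rho < 1)
    c1:    cost of adding one cluster; c2: cost of one element within a cluster
    """
    if not 0 < rho < 1:
        raise ValueError("rho must lie strictly between 0 and 1")
    b_opt = sqrt((c1 / c2) * (1 - rho) / rho)   # optimal elements per cluster
    deff = 1 + (b_opt - 1) * rho                # design effect at b_opt
    n_cluster = n_srs * deff                    # inflate the SRS requirement
    m = max(2, ceil(n_cluster / b_opt))         # at least 2 PSUs for variance estimation
    return b_opt, deff, n_cluster, m

# Hypothetical planning inputs, not taken from any real study.
b_opt, deff, n_cluster, m = plan_two_stage(n_srs=400, rho=0.05, c1=400.0, c2=10.0)
print(f"b_opt = {b_opt:.1f}, DEFF = {deff:.2f}, n_cluster = {n_cluster:.0f}, m = {m}")
if m < 20:
    print("Warning: fewer than 20 PSUs; variance estimates may be unstable")
```

In practice b̄_opt would be rounded to a whole number of elements and m recomputed, and the resulting design checked against the total budget.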

PPS Selection: The Cumulative Size Method


Implementing Systematic PPS Selection — Step by Step

The cumulative size method, applied here in its systematic form (systematic PPS sampling), is the most widely taught PPS selection procedure. The steps are:

(1) List all M clusters with their sizes N_i.

(2) Compute cumulative totals: C_0 = 0; C_i = C_{i−1} + N_i for i = 1, …, M, so that C_M = N.

(3) Compute the sampling interval k = N/m.

(4) Draw a random start r ~ Uniform(0, k).

(5) Form the m selection points r, r+k, r+2k, …, r+(m−1)k and, for each point, select the cluster whose cumulative range (C_{i−1}, C_i] contains it. Cluster i is hit N_i/k = m·N_i/N times in expectation, so selection is exactly proportional to size.

(6) The selected clusters constitute the PSU sample. If a cluster is selected more than once (possible when N_i > k), it is included that many times in the sample and counted that many times in all estimators, following the with-replacement (PPSWR) framework.

Cochran (1977, pp. 251–253) and Lohr (2010, pp. 177–180) provide complete worked examples.
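The selection steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production sampler: the function name and the optional `start` argument (used to reproduce a documented draw) are hypothetical, and the four-cluster frame is invented.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def systematic_pps(sizes, m, rng=None, start=None):
    """Select m clusters by systematic PPS; returns 0-based indices (repeats possible).

    sizes: measures of size N_i, in frame order
    start: optional fixed random start r, for reproducing a documented draw
    """
    cum = list(accumulate(sizes))          # C_1, ..., C_M with C_M = N
    total = cum[-1]
    k = total / m                          # sampling interval
    if start is None:
        rng = rng or random.Random()
        start = rng.uniform(0, k)          # random start r
    selected = []
    for j in range(m):
        point = start + j * k
        # first i with C_i >= point, i.e. C_{i-1} < point <= C_i
        idx = min(bisect_left(cum, point), len(cum) - 1)  # guard against float overshoot
        selected.append(idx)
    return selected

# Hypothetical frame of four clusters; fixed start so the draw is reproducible.
print(systematic_pps([10, 20, 30, 40], m=2, start=25))   # prints [1, 3]
```

With N = 100 and m = 2 the interval is k = 50, so the selection points 25 and 75 fall in the second and fourth clusters. Passing a seeded `random.Random` as `rng` instead of `start` gives a reproducible random draw whose seed can be reported.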

Variance Estimation Method Selection

Design Context	Recommended Estimator	Software Implementation	Degrees of Freedom
Simple one-stage, equal-prob. selection	s_cl²/m with FPC = (1 − m/M)	svydesign(ids=~cluster_id); svymean()	m − 1
One-stage PPS with replacement (PPSWR)	Hansen-Hurwitz: [1/(m(m−1))]·Σ(ȳᵢ−ȳ_HH)²	svydesign(ids=~cluster_id, probs=~pi_i)	m − 1
Two-stage SRS within clusters	Between-cluster + within-cluster components; Taylor linearisation	svydesign(ids=~cluster_id+element_id, fpc=~M+N_i)	m − 1 (dominated by 1st stage)
Stratified PPS multistage (national surveys)	Taylor linearisation or BRR/jackknife	svydesign(ids=~psu+ssu, strata=~stratum_var, weights=~wt)	Σ(m_h − 1) across strata
m small (< 10 PSUs); unstable variance estimate	Collapsed PSU method; conservative estimate	Pair adjacent PSUs; compute between-pair variance	floor(m/2)
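For the PPSWR row of the table, a minimal sketch of the Hansen-Hurwitz calculation is given below. It assumes draw probabilities p_i = N_i/N, so that each sampled cluster mean is itself a single-draw estimate of the population mean; the function name and the five cluster means are hypothetical.

```python
from math import sqrt

def hansen_hurwitz_mean(cluster_means):
    """Hansen-Hurwitz estimate of the population mean and its SE under PPSWR.

    Assumes draw probabilities p_i = N_i / N, so the estimator is the simple
    average of the sampled cluster means. A cluster drawn twice appears twice.
    """
    m = len(cluster_means)
    if m < 2:
        raise ValueError("need m >= 2 draws to estimate the variance")
    y_hh = sum(cluster_means) / m
    # v(y_hh) = sum((y_i - y_hh)^2) / (m * (m - 1))
    v_hat = sum((y - y_hh) ** 2 for y in cluster_means) / (m * (m - 1))
    return y_hh, sqrt(v_hat), m - 1            # estimate, SE, degrees of freedom

est, se, df = hansen_hurwitz_mean([0.31, 0.27, 0.35, 0.29, 0.33])  # hypothetical
print(f"estimate = {est:.3f}, SE = {se:.4f}, df = {df}")
```

No finite population correction appears because the draws are treated as independent; inference then uses the t distribution on m − 1 degrees of freedom, as listed in the table.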

Reporting Requirements for Cluster Sampling in Peer-Reviewed Research

(a) Cluster definition: Define what constitutes a cluster (school, hospital, geographic area, household), its geographic or administrative boundaries, and the rationale for treating it as the primary sampling unit. Justify why clustering was the only feasible design — or why it was preferable to SRS given the research context.

(b) PSU frame description: Identify the source of the cluster list, its date, the total number of clusters M, the range and distribution of cluster sizes N_i, and whether any clusters were excluded from the frame and why.

(c) First-stage selection: State whether equal-probability or PPS selection was used, and justify the choice in terms of cluster size variability. Document the selection procedure (SRS, systematic PPS, or other), the random seed(s) used, and m — the number of clusters selected.

(d) Second-stage design: For two-stage designs, describe the within-cluster sampling procedure, the target n_i, whether n_i was constant or variable across clusters, and the within-cluster sampling fraction f₂ = n_i/N_i.

(e) Intraclass correlation and design effect: Report the prior estimate of ρ used in the sample size calculation, the source of this estimate, and the actual DEFF computed from the collected data. Report both nominal n and effective n_eff = n/DEFF for all primary estimates.

(f) Variance estimation: Name the variance estimation method (Taylor linearisation, jackknife, BRR, between-PSU), specify the software and design specification syntax, and report the degrees of freedom used for inference. Do not report standard errors computed assuming SRS — this is a fundamental error in cluster sample analysis.

(g) Response rates: Report cluster-level (PSU) response rate and element-level (SSU) response rate separately per AAPOR standards. Document the non-response protocol at both stages. Assess whether non-response was systematically related to cluster characteristics (geography, size, accessibility) — a specific form of non-response bias unique to multi-stage designs.
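For item (e), one simple way to obtain an empirical ρ, DEFF and effective n from collected data is the one-way ANOVA estimator, sketched below for the balanced case only. This is an assumption-laden simplification: survey software typically reports DEFF as the ratio of the design-based variance to the SRS variance, which handles weights and unequal cluster sizes. The function name and the toy data are hypothetical.

```python
from statistics import mean

def anova_icc_deff(clusters):
    """ANOVA estimate of rho, DEFF and effective n for equal-sized clusters.

    clusters: list of lists, one inner list of element values per sampled
              cluster, all of the same length b (balanced case only).
    """
    m, b = len(clusters), len(clusters[0])
    if m < 2 or b < 2 or any(len(c) != b for c in clusters):
        raise ValueError("need at least 2 clusters of equal size b >= 2")
    grand = mean(v for c in clusters for v in c)
    # Between-cluster and within-cluster mean squares
    msb = b * sum((mean(c) - grand) ** 2 for c in clusters) / (m - 1)
    msw = sum((v - mean(c)) ** 2 for c in clusters for v in c) / (m * (b - 1))
    rho = (msb - msw) / (msb + (b - 1) * msw)
    deff = 1 + (b - 1) * rho
    n = m * b
    return rho, deff, n / deff

# Toy data: three clusters of three elements each (hypothetical values).
rho_hat, deff_hat, n_eff = anova_icc_deff([[4, 5, 6], [7, 8, 9], [1, 2, 3]])
print(f"rho = {rho_hat:.3f}, DEFF = {deff_hat:.2f}, n_eff = {n_eff:.1f}")
```

Reporting the nominal n alongside n_eff makes the precision cost of clustering visible to readers.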

Survey Software Commands for Cluster Design Specification

Software	Design Specification Command	Notes
R (survey package)	svydesign(ids=~psu_id+ssu_id, strata=~stratum, weights=~wt, fpc=~M+N_i, data=df)	Lumley (2010); svymean(), svytotal(), svyglm() for design-correct inference
Stata	svyset psu_id [pweight=wt], strata(stratum) fpc(M) || ssu_id, fpc(N_i)	svy: mean outcome; svy: logistic; svy: regress for design-correct regression
SAS	PROC SURVEYMEANS DATA=df TOTAL=total_psu; CLUSTER psu_id; STRATA stratum; WEIGHT wt;	Variance estimation is based on the first-stage clusters named in the CLUSTER statement; add the DEFF keyword to the PROC statement to request design effects
SPSS Complex Samples	CSPLAN … CLUSTER psu_var / INCLPROB wt_var; (then CSDESCRIPTIVES or CSLOGISTIC)	SPSS outputs design-corrected standard errors and DEFF for all estimates
Python (samplics)	TaylorEstimator(param="mean").estimate(y, psu=psu_var, samp_weight=wt)	samplics library; supports multi-stage designs with Taylor linearisation

Section 07 — Knowledge Assessment

Doctoral-Level Self-Assessment

These questions require application of theoretical, mathematical, and methodological concepts — not rote recall. Questions are calibrated to doctoral comprehensive examination standard and emphasise the design effect, intraclass correlation, PPS selection, variance estimation, and analysis of cluster samples in ways that distinguish them from simpler designs.

Self-Assessment Quiz — Cluster Sampling

Select the best answer for each item.

Question 01 of 06

A researcher selects m = 25 schools from M = 200 schools using equal probability SRS and administers a reading assessment to all students within each selected school. The average school size is b̄ = 80 students. A published meta-analysis for similar reading outcomes in comparable schools reports ρ = 0.18. What is the expected design effect, the effective sample size, and the critical implication for the precision of the population mean estimate?

A. DEFF = 1 + 80 × 0.18 = 15.4; n_eff = 2,000/15.4 ≈ 130. The design is highly inefficient.

B. DEFF = 1 + (80−1) × 0.18 = 15.22; n_eff = 2,000/15.22 ≈ 131. Despite 2,000 students measured, precision is equivalent to only 131 independent observations: the design is statistically very costly.

C. DEFF = 1.18; n_eff = 2,000 − 18% = 1,640. The efficiency loss is modest and acceptable.

D. DEFF = 1.0; cluster sampling always achieves SRS efficiency when the entire cluster is measured rather than a sub-sample.

Question 02 of 06

A national health survey team is designing a study to estimate the prevalence of hypertension. The total budget is C = $120,000. Accessing each sampled health centre (PSU) costs c₁ = $1,500 (travel, permits, equipment transport). Interviewing and measuring each patient within a selected centre costs c₂ = $15 per patient. Prior research in similar settings reports ρ = 0.06. What is the optimal number of patients to sample per health centre, and approximately how many health centres should be selected?

A. b̄_opt = 50 patients per centre; m ≈ 34 centres, using equal allocation based on budget division.

B. b̄_opt = √[(1500/15)·(0.94/0.06)] ≈ 40 patients per centre; m ≈ 57 centres. Total budget: 57×(1,500 + 40×15) = $119,700 ≈ $120,000.

C. b̄_opt = √(1500/15) = 10 patients per centre; m ≈ 109 centres, based on cost ratio alone.

D. b̄_opt is always the full natural cluster size: since all elements at a site can be measured for only the marginal cost c₂, it is always efficient to measure everyone at each selected site.

Question 03 of 06

A researcher selects villages for a rural survey using probability proportional to size (PPS) based on the latest census village population counts, and then interviews a fixed n̄ = 10 households within each selected village. However, the census was conducted 12 years ago and significant population changes have occurred since. What is the precise nature of the methodological concern, and what are the statistical consequences?

A. PPS is invalid when village sizes vary; only equal-probability selection is appropriate with heterogeneous cluster sizes.

B. Outdated size measures produce incorrect PPS probabilities: villages that have grown are underrepresented; those that have shrunk are overrepresented. The design is no longer EPSEM, and estimates are biased if current size correlates with the outcome. A ratio estimator or updated sizes are required.

C. Outdated census data makes PPS operationally impossible; the cumulative size method cannot be applied without current population counts.

D. The concern is minor; stale size measures only affect a small proportion of unusually rare villages and do not materially affect the overall estimate.

Question 04 of 06

A doctoral student analyses a two-stage cluster sample of n = 800 employees nested in m = 20 firms, using standard OLS regression in statistical software with default settings. The software reports a standard error of SE = 0.042 for the key regression coefficient. The design effect estimated from the PSU structure is DEFF = 4.2. What is the correct standard error, and what is the consequence of reporting the OLS-based SE without correction?

A. SE_correct = 0.042 × 4.2 = 0.176; the OLS SE overstates uncertainty by a factor of 4.2 and is too conservative.

B. SE_correct = 0.042 × √4.2 ≈ 0.086; the OLS SE underestimates uncertainty by a factor of √4.2 ≈ 2.05, producing anti-conservative CIs, inflated t-statistics, and spuriously significant results. Degrees of freedom should be m − 1 = 19, not 797.

C. SE_correct = 0.042 / √4.2 ≈ 0.020; the OLS SE overestimates the true standard error, making the cluster sample appear less precise than it actually is.

D. The correction is negligible for n = 800; large samples self-correct for clustering and the OLS SE is approximately valid.

Question 05 of 06

A doctoral candidate proposes to improve cluster sampling efficiency by selecting a larger number of smaller clusters — increasing m from 20 to 80 while reducing b̄ from 40 to 10, keeping total n = 800 constant. She argues this will reduce the design effect. Her thesis committee member argues that it will also increase fieldwork costs dramatically. Both are asserting facts about the design trade-off. Evaluate both claims using the relevant formulae.

A. The candidate is correct and the committee member is wrong; more clusters always improves both precision and cost efficiency.

B. Both are correct. Reducing b̄ from 40 to 10 reduces DEFF (statistical improvement: the candidate is right). But m quadruples from 20 to 80, multiplying cluster access costs by 4 (cost argument: the committee is right). The resolution is b̄_opt = √[(c₁/c₂)·(1−ρ)/ρ], which balances both concerns.

C. The candidate is correct and the committee member's cost argument is irrelevant to sample design decisions; statistical optimality should always govern over cost.

D. Both are wrong; DEFF depends only on ρ, not on b̄ or m, so changing the allocation of n between m and b̄ does not affect precision.

Question 06 of 06

A thesis committee argues that a researcher who used cluster sampling to select 30 urban neighbourhoods and interviewed all adult residents in selected neighbourhoods should perform a multilevel (hierarchical linear) model to account for the clustering, rather than simple survey regression with cluster-robust standard errors. The researcher disagrees, arguing that survey regression is sufficient for producing unbiased population estimates. Who is correct, and why does the distinction matter?

A. The committee is correct; multilevel models are always preferable to survey regression when cluster structure exists, because they model the between-cluster variance directly.

B. The researcher is correct; survey regression with cluster-robust SEs is always preferable because it makes fewer distributional assumptions than HLM.

C. Both are correct for different research goals. Survey regression correctly estimates population parameters with design-valid SEs. HLM correctly estimates contextual (neighbourhood-level) effects and variance decomposition. When the research question involves neighbourhood context, HLM is scientifically required, but it must use design weights if PSU selection was informative.

D. Neither method is correct; the appropriate analysis is a fixed-effects model that treats each neighbourhood as a dummy variable, eliminating between-neighbourhood confounding.


Section 08 — Scholarly References

Primary Scholarly References

All content in this reference is grounded in peer-reviewed foundational literature in survey sampling, epidemiological methodology, and multilevel research design. References are formatted per APA 7th Edition.

Hansen, M. H., Hurwitz, W. N., & Madow, W. G. (1953). Sample survey methods and theory (Vols. 1–2). John Wiley & Sons. [The foundational mathematical derivation of cluster sampling theory, PPS selection, the optimal cluster size formula, and the variance components of one-stage and two-stage designs. Volume 1 covers methods; Volume 2 covers the mathematical proofs.]
Kish, L. (1965). Survey sampling. John Wiley & Sons. [Chapters 5 and 6 provide the definitive treatment of the design effect (DEFF), the intraclass correlation coefficient ρ, and the efficiency comparison between cluster sampling and SRS. Chapter 8 covers the DEFF framework in complex multi-stage designs.]
Cochran, W. G. (1977). Sampling techniques (3rd ed.). John Wiley & Sons. [Chapter 9 provides a rigorous and comprehensive doctoral-level treatment of one-stage and two-stage cluster sampling, equal-probability and PPS selection, the optimal cluster size derivation, and variance estimation under all cluster designs.]
Lohr, S. L. (2010). Sampling: Design and analysis (2nd ed.). Brooks/Cole. [Chapters 5–6 cover cluster sampling with accessible derivations, PPS implementation using the cumulative size method, and R-based survey software implementation including svydesign() for two-stage designs.]
Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2009). Survey methodology (2nd ed.). John Wiley & Sons. [Total survey error framework applied to cluster designs; coverage error in PSU frames; non-response at the PSU and SSU levels; design specification in survey analysis software; the isolated PSU problem.]
Wolter, K. M. (2007). Introduction to variance estimation (2nd ed.). Springer. [Chapters 3 and 6 cover variance estimation in cluster designs: Taylor linearisation, jackknife, and balanced repeated replication (BRR) for multi-stage designs; the lonely PSU problem and collapsed PSU solution.]
Donner, A., & Klar, N. (2000). Design and analysis of cluster randomization trials in health research. Arnold. [The authoritative reference for cluster design effect and intraclass correlation in health research; empirical ρ values across health and social outcomes; sample size planning accounting for DEFF in cluster randomised trials.]
Murray, D. M. (1998). Design and analysis of group-randomized trials. Oxford University Press. [Comprehensive treatment of group-randomised (cluster-randomised) designs; empirical ρ values for behavioural outcomes; power calculation methods accounting for the design effect.]
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Sage. [The definitive reference for multilevel modelling nested within cluster-sampled populations; variance decomposition at PSU and SSU levels; design-weighted multilevel estimation for informative cluster designs.]
Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., & Rasbash, J. (1998). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society: Series B, 60(1), 23–40. [The foundational paper establishing how design weights must be incorporated into HLM/multilevel analyses of informatively selected cluster samples to avoid biased fixed-effect estimates.]
Lemeshow, S., & Robinson, D. (1985). Surveys to measure programme coverage and impact: A review of the methodology used by the expanded programme on immunisation. World Health Statistics Quarterly, 38(1), 65–75. [Review of the EPI 30×7 cluster sampling design widely used in public health surveillance; PPS-based cluster selection for population coverage estimation in resource-limited settings.]
Lumley, T. (2010). Complex surveys: A guide to analysis using R. Wiley. [Practical implementation of multi-stage cluster designs in R's survey package; svydesign() specification for one-stage and two-stage cluster sampling; variance estimation using Taylor linearisation and jackknife; DEFF computation from fitted survey objects.]
Särndal, C.-E., Swensson, B., & Wretman, J. (1992). Model assisted survey sampling. Springer. [Advanced model-assisted treatment of cluster sampling within a design-based inference framework; calibration estimators for multi-stage designs; the GREG estimator applied to cluster samples with auxiliary information.]
