Doctoral Research Methodology Series

Simple Random Sampling

A rigorous scholarly reference covering the epistemological foundations, mathematical theory, probability mechanics, assumptions, limitations, and practical implementation of SRS in empirical research.

Classification: Probability Sampling
Level: Doctoral / Post-Graduate
Grounded in: Cochran (1977) · Neyman (1934) · Lohr (2010)

Epistemological & Theoretical Foundations

Simple Random Sampling (SRS) constitutes the purest form of probability sampling: every element in the sampling frame has an equal, non-zero, and calculable probability of selection, and every possible sample of a given size is equally likely to be drawn.

Formal Definition

Simple Random Sampling is a probabilistic selection method in which a sample of n units is drawn from a finite population of N units such that every possible sample of size n has an equal probability of selection, specifically 1/C(N,n), where C(N,n) is the number of possible combinations.

— Cochran, W.G. (1977). Sampling Techniques (3rd ed.). John Wiley & Sons, p. 18.

Historical Lineage

The formal mathematical treatment of SRS emerges from the foundational work of Jerzy Neyman (1934), whose seminal paper "On the Two Different Aspects of the Representative Method" established the probabilistic basis of modern sampling theory. Neyman distinguished between purposive (quota) sampling and random sampling, demonstrating that only the latter permits legitimate inferential statements about population parameters. This epistemological distinction—between descriptive adequacy and inferential validity—remains central to doctoral-level research methodology today.

W.G. Cochran's (1977) comprehensive formalization and F. Yates's earlier contributions in agricultural experimentation further cemented SRS within the classical frequentist paradigm. The method's design-based inferential logic—where randomness resides in the selection mechanism rather than in any assumed probability model for the outcome variable—distinguishes it from model-based approaches (Royall, 1970).

Two Varieties: With vs. Without Replacement

SRS-WOR · Without Replacement

Simple Random Sampling Without Replacement

Each element may be selected at most once. This is the standard form in social, behavioural, and health sciences research. Each draw reduces the available pool, so successive draws are not independent; the sample mean nevertheless remains an unbiased estimator of the population mean. The finite population correction (FPC) applies. The vast majority of applied survey research employs SRSWOR.

SRS-WR · With Replacement

Simple Random Sampling With Replacement

Each selected element is returned to the population before the next draw, so the same unit may appear more than once in the sample. The draws are independent and identically distributed. While less common in practice, SRSWR simplifies variance derivation and forms the basis of bootstrapping and other resampling methods (Efron & Tibshirani, 1993).
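The contrast between the two varieties can be illustrated with a short Python sketch. It assumes NumPy is available; the frame size, sample size, and seed are arbitrary illustrative values, not taken from any cited source.

```python
import numpy as np

# Hypothetical frame: unit IDs 1..N (N and n are illustrative values).
N, n = 20, 8
frame = np.arange(1, N + 1)

rng = np.random.default_rng(seed=2024)  # a recorded seed makes the draws repeatable

# SRSWOR: each unit can appear at most once in the sample.
srswor = rng.choice(frame, size=n, replace=False)

# SRSWR: units are returned to the pool, so repeats are possible.
srswr = rng.choice(frame, size=n, replace=True)

print("SRSWOR:", np.sort(srswor))
print("SRSWR: ", np.sort(srswr))
```

The only difference between the two calls is the `replace` flag, which mirrors the conceptual distinction drawn above.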

EPSEM: The Core Principle

Equal Probability of Selection Method (EPSEM)

SRS is the canonical EPSEM design. Kish (1965) formalized EPSEM as a design property ensuring self-weighting samples, where each sampled unit represents the same number of population units. In SRSWOR, P(element i selected) = n/N for all i ∈ {1,…,N}. The unweighted sample mean is therefore an unbiased estimator of the population mean, with no design weights required.

Key Properties at a Glance

Property 01

Unbiasedness

The sample mean ȳ is an unbiased estimator of the population mean Ȳ. E(ȳ) = Ȳ for any population distribution, by virtue of the randomisation mechanism alone.

Property 02

Consistency

As n → N, the sample mean converges in probability to the population mean. Variance of the estimator approaches zero as the sampling fraction f = n/N approaches 1.

Property 03

Design-Based Validity

Inferential validity derives from the randomisation mechanism, not from assumptions about the population distribution (Hansen, Hurwitz & Madow, 1953).

Property 04

Reproducibility

With a documented random seed, the selection process is fully replicable — a critical requirement for peer-reviewed research transparency (AAPOR, 2016).

Property 05

Minimum Variance

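Under SRSWOR, the sample mean ȳ (the Horvitz–Thompson estimator of the total divided by N) has the smallest design variance among linear unbiased estimators whose weights do not depend on the unit labels. The Cramér–Rao lower bound is a model-based result and does not carry over to design-based inference.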

Property 06

CLT Applicability

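For SRSWR, the classical Central Limit Theorem gives approximate normality of the sampling distribution of ȳ for sufficiently large n. For SRSWOR, a finite-population version of the theorem applies when both n and N − n are large. Either result supports normal-theory confidence intervals without assumptions about the population shape, although heavily skewed populations require larger n.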

Probability, Estimation & Variance Theory

SRS rests on a rigorous mathematical framework encompassing combinatorial probability, unbiased estimation, and variance decomposition. Mastery of these derivations is essential for doctoral-level critical appraisal of sampling designs.

1. Selection Probability

For a finite population of N elements, the number of distinct samples of size n (without replacement) is the binomial coefficient C(N,n). Under SRS, each such sample is equally probable:

Probability of Any Specific Sample
P(S) = 1 / C(N, n) = (n! · (N-n)!) / N!
P(S) = probability of selecting a specific sample S of size n
N = total population size
n = required sample size
C(N,n) = "N choose n" — number of possible combinations

The marginal (first-order inclusion) probability that a specific element i is included in the sample is:

First-Order Inclusion Probability (SRSWOR)
πᵢ = P(i ∈ s) = n / N for all i = 1, 2, …, N
πᵢ = marginal inclusion probability for element i
n = sample size  ·  N = population size
Crucially, πᵢ is identical for all elements — the EPSEM property.
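For a very small population, both probabilities can be checked by listing every possible sample. The sketch below uses illustrative values N = 6 and n = 3 and exact rational arithmetic; it is a verification aid under those assumptions, not a practical selection procedure.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

N, n = 6, 3                      # small illustrative population
units = range(1, N + 1)

samples = list(combinations(units, n))   # every possible SRSWOR sample
p_sample = Fraction(1, comb(N, n))       # P(S) = 1 / C(N, n)

# Inclusion probability of unit i: sum of P(S) over all samples containing i.
inclusion = {i: sum(p_sample for s in samples if i in s) for i in units}

print("number of samples:", len(samples))
print("P(S):", p_sample)
print("inclusion probabilities:", inclusion)
```

Every unit receives the same inclusion probability, equal to n/N, which is the EPSEM property in miniature.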

2. The Horvitz–Thompson Estimator

The general Horvitz–Thompson (H-T) estimator of the population total T is:

Horvitz–Thompson Estimator of Population Total
T̂ₕₜ = Σᵢ∈ₛ (yᵢ / πᵢ) = (N/n) · Σᵢ∈ₛ yᵢ
Under SRSWOR: πᵢ = n/N, so the H-T estimator simplifies to N·ȳ
yᵢ = observed value for sampled element i
This estimator is unbiased: E(T̂ₕₜ) = T
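A minimal Python sketch of the estimator follows. The function name `ht_total` and the five data values are hypothetical; the point is only that, with equal inclusion probabilities n/N, the general formula reduces to N·ȳ.

```python
def ht_total(y_sample, pi_sample):
    """Horvitz-Thompson estimate of the population total: sum of y_i / pi_i."""
    return sum(y / pi for y, pi in zip(y_sample, pi_sample))

N, n = 200, 5                               # hypothetical population and sample sizes
y_sample = [12.0, 7.5, 9.0, 14.5, 11.0]     # hypothetical observed values
pi_sample = [n / N] * n                     # SRSWOR: every pi_i equals n/N

t_hat = ht_total(y_sample, pi_sample)
y_bar = sum(y_sample) / n

print("H-T total:", t_hat)
print("N * y_bar:", N * y_bar)
```

Passing unequal values in `pi_sample` gives the general unequal-probability form of the same estimator.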

3. Sample Mean & Unbiasedness Proof

Unbiased Estimator of Population Mean
ȳ = (1/n) · Σᵢ∈ₛ yᵢ

E(ȳ) = Ȳ = (1/N) · Σᵢ₌₁ᴺ Yᵢ ✓
ȳ = sample mean (estimator)
Ȳ = population mean (target parameter)
Proof: E(ȳ) = E[(1/n)Σyᵢ] = (1/n)·n·Ȳ = Ȳ, because the value observed at each of the n draws has expectation Ȳ when every population unit is equally likely to be selected at that draw.
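The same result can be written out with sample-membership indicators, which makes the role of the inclusion probability explicit. The indicator I_i is notation introduced here for the sketch; it does not appear elsewhere in this reference.

```latex
\begin{align*}
\bar{y} &= \frac{1}{n}\sum_{i=1}^{N} I_i\,Y_i,
  \qquad I_i = \begin{cases} 1 & \text{if } i \in s \\ 0 & \text{otherwise} \end{cases},
  \qquad E(I_i) = \pi_i = \frac{n}{N}, \\
E(\bar{y}) &= \frac{1}{n}\sum_{i=1}^{N} E(I_i)\,Y_i
  = \frac{1}{n}\cdot\frac{n}{N}\sum_{i=1}^{N} Y_i
  = \bar{Y}.
\end{align*}
```

Only the randomisation distribution of the indicators is used; the Y_i are treated as fixed constants, which is the design-based viewpoint.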

4. Variance of the Sample Mean

Variance of ȳ — SRSWOR
V(ȳ) = (1 - f) · (S²/n)
f = n/N = sampling fraction
(1 - f) = Finite Population Correction (FPC) factor
S² = population variance = [1/(N-1)] · Σ(Yᵢ - Ȳ)²
n = sample size
As N→∞ or f→0: V(ȳ) → S²/n (the familiar σ²/n formula for sampling from an infinite population)
When to Apply the Finite Population Correction (FPC)

The FPC factor (1 - f) is theoretically required whenever sampling is without replacement from a finite population. In practice, the correction is negligible when f ≤ 0.05 (i.e., the sample constitutes at most 5% of the population): at f = 0.05, omitting it overstates the standard error by only about 2.6%. When f exceeds this threshold (common in organizational studies, census sub-studies, or small-population clinical trials), ignoring the FPC produces systematically inflated standard errors and overly conservative confidence intervals. Cochran (1977, p. 25) provides the formal derivation.
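The variance formula, and the cost of dropping the FPC, can be checked by brute force on a toy population. The six population values below are invented for illustration; with N = 6 and n = 3 the sampling fraction is deliberately large so the correction is visible.

```python
from itertools import combinations
from statistics import mean, variance

Y = [3, 7, 8, 12, 15, 21]     # hypothetical population values
N, n = len(Y), 3

Y_bar = mean(Y)
S2 = variance(Y)              # divisor N - 1, matching the definition of S²

# Sample mean for every one of the C(N, n) equally likely samples.
sample_means = [mean(s) for s in combinations(Y, n)]

# Design variance: average squared deviation of the sample mean from Y_bar.
v_exact = sum((m - Y_bar) ** 2 for m in sample_means) / len(sample_means)

v_formula = (1 - n / N) * S2 / n   # (1 - f) * S² / n
v_no_fpc = S2 / n                  # variance formula with the FPC omitted

print("exact design variance:", v_exact)
print("formula with FPC:     ", v_formula)
print("formula without FPC:  ", v_no_fpc)
```

The enumerated variance agrees with the FPC formula, while the uncorrected formula is too large by the factor 1/(1 − f).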

5. Estimated Variance (Unknown S²)

Since the population variance S² is unknown in practice, it is estimated from the sample:

Estimated Variance of the Sample Mean
v̂(ȳ) = (1 - f) · (s²/n)

s² = [1/(n-1)] · Σᵢ∈ₛ (yᵢ - ȳ)²
s² = sample variance (unbiased estimator of S²)
E(s²) = S²: the (n-1) divisor (Bessel's correction) removes the downward bias of the divide-by-n variance
SE(ȳ) = √v̂(ȳ) = standard error of the mean

6. Confidence Interval Construction

95% Confidence Interval for Population Mean
CI: ȳ ± z_{α/2} · SE(ȳ)

= ȳ ± 1.96 · √[(1 - n/N) · (s²/n)]
z_{α/2} = 1.96 for α = 0.05 (large n, CLT applies)
For small n: substitute t_{α/2, n-1} (Student's t-distribution)
The 95% CI has the design-based interpretation: in repeated sampling from the same frame, 95% of such intervals will contain Ȳ.
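Sections 5 and 6 can be combined into one small routine, sketched below using only the Python standard library. The function name `srs_mean_ci`, its `crit` parameter, and the ten data values are hypothetical. Because the example has n = 10, the critical value t(0.025, 9) ≈ 2.262 from standard t tables is passed explicitly; when `crit` is omitted, the normal quantile is used.

```python
import math
from statistics import NormalDist, mean, variance

def srs_mean_ci(sample, N, crit=None, conf=0.95):
    """Return (mean, SE, lower, upper) for an SRSWOR sample, with the FPC applied."""
    n = len(sample)
    y_bar = mean(sample)
    s2 = variance(sample)                        # divisor n - 1
    se = math.sqrt((1 - n / N) * s2 / n)         # sqrt of (1 - f) * s² / n
    if crit is None:
        # Large-sample default: z quantile, about 1.96 when conf = 0.95.
        crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return y_bar, se, y_bar - crit * se, y_bar + crit * se

sample = [52, 47, 61, 55, 49, 58, 50, 53, 60, 45]   # hypothetical data, n = 10
y_bar, se, lower, upper = srs_mean_ci(sample, N=42, crit=2.262)

print(f"mean = {y_bar}, SE = {se:.3f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Leaving the critical value as an argument keeps the sketch free of third-party dependencies; a t quantile from statistical software can be supplied in the same way.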

7. Optimal Sample Size Determination

Cochran's (1977) Sample Size Formula
n₀ = (z²_{α/2} · p · q) / e²

n = n₀ / (1 + (n₀ - 1)/N)
p = estimated proportion (use 0.5 for maximum conservatism)
q = 1 - p  ·  e = desired margin of error
n = finite-population corrected final sample size
Example: N=10,000, p=0.5, e=0.05, z=1.96 → n₀≈384 → n≈370
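The two-step calculation can be expressed as a short function, sketched below with the standard library only. The name `cochran_n` is hypothetical, and rounding the final n up to the next integer is an assumed convention rather than something the formula itself prescribes.

```python
import math
from statistics import NormalDist

def cochran_n(N, p=0.5, e=0.05, alpha=0.05):
    """Return (n0, n): the infinite-population size and the FPC-adjusted size."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # about 1.96 for alpha = 0.05
    n0 = z ** 2 * p * (1 - p) / e ** 2           # first-step (uncorrected) size
    n = n0 / (1 + (n0 - 1) / N)                  # finite-population adjustment
    return n0, math.ceil(n)                      # round up (assumed convention)

n0, n = cochran_n(N=10_000)
print(f"n0 = {n0:.1f}, n = {n}")
```

With N = 10,000, p = 0.5, and e = 0.05, the function reproduces the worked example above (n₀ ≈ 384, n ≈ 370).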

Simple Random Sampling Simulator

Engage directly with the sampling mechanism. Adjust population size (N), sample size (n), and observe selection probabilities, the sampling distribution of the mean, and the effect of the FPC.

SRS Monte Carlo Simulator

Visualises selection mechanics and the Central Limit Theorem

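Control defaults: population size N = 42, sample size n = 10, simulated samples = 100
Population display: units 1 to N, with units selected into the sample highlighted
Inclusion probability πᵢ = n/N = 10/42 ≈ 23.8%
FPC factor (1 − f) = 1 − 10/42 ≈ 0.762
Readouts: sample mean ȳ and standard error SE(ȳ)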
What the Simulator Demonstrates

Draw Sample: Animates the random selection of n units from N, illustrating the uniform inclusion probability πᵢ = n/N for every population element. Each element's identification number serves as its value yᵢ.

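Run Simulation: Executes the specified number of independent samples and builds the empirical sampling distribution of ȳ. This illustrates the Central Limit Theorem: although the population distribution is rectangular (uniform), the distribution of sample means is approximately normal, and the approximation improves as the sample size n increases. Increasing the number of simulated samples only smooths the histogram; it does not change the shape of the underlying sampling distribution.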
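For readers without access to the interactive tool, a comparable experiment can be sketched in Python with NumPy. It assumes the same set-up described above (N = 42, n = 10, each unit's ID used as its value); the number of repetitions and the seed are illustrative choices.

```python
import numpy as np

N, n, reps = 42, 10, 5000           # reps is an illustrative choice
population = np.arange(1, N + 1)    # each unit's ID doubles as its value y_i
rng = np.random.default_rng(seed=12345)

# Draw `reps` independent SRSWOR samples and record each sample mean.
means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(reps)
])

# Theoretical design variance of the mean: (1 - f) * S² / n.
theory_var = (1 - n / N) * population.var(ddof=1) / n

print(f"mean of sample means:     {means.mean():.3f} (population mean {population.mean():.1f})")
print(f"variance of sample means: {means.var(ddof=1):.3f} (theory {theory_var:.3f})")
```

A histogram of `means` would show the roughly bell-shaped sampling distribution; the two printed comparisons check unbiasedness and the FPC-adjusted variance empirically.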

Assumptions, Conditions & Limitations

Doctoral-level engagement with SRS requires critical evaluation of its conditions of applicability. Uncritical deployment of SRS without examining underlying assumptions constitutes a methodological error.

Formal Assumptions

Assumption | Technical Statement | Violation Consequence | Diagnostic / Remedy
Complete Sampling Frame | Every population element must be listed and accessible in the frame | Coverage bias; non-representativeness; exclusion of unlisted subgroups | Frame audit; dual-frame designs (Hartley, 1962)
Equal Inclusion Probability | πᵢ = n/N for all i; no element privileged or excluded | Biased unweighted sample mean; invalid standard errors | Verify frame completeness and remove duplicates; weight by 1/πᵢ (Horvitz–Thompson) if probabilities are unequal
Independence of Selection | SRSWR: draws are i.i.d.; SRSWOR: controlled dependence via FPC | Under-estimated variance if clustering ignored | Design effect (DEFF) calculation; complex survey SE methods
Finite Population | N must be known and fixed; super-population model not assumed | FPC miscalculation; over/under-coverage | Census or registry enumeration; model-assisted estimation
Non-zero Response Rate | Selected units must respond / be measurable | Non-response bias unless missingness is completely at random (MCAR) | Propensity weighting; multiple imputation (Little & Rubin, 2002)
True Randomisation | Selection must use a validated random mechanism (PRNG or table) | Selection bias; pseudo-random artifacts | Cryptographically secure PRNG (NIST SP 800-90A)

Frequently Cited Limitations in the Literature

SRS is only as valid as its sampling frame. Groves et al. (2009) distinguish between the target population (conceptually defined) and the frame population (operationally accessible). Any discrepancy between these two — termed coverage error — produces systematic bias that cannot be corrected through any design-based estimator.

Common frame deficiencies include: undercoverage (unlisted elements, e.g., homeless populations in household surveys), overcoverage (duplicate entries, out-of-scope units), and clustering (groups of elements represented by a single entry). Doctoral researchers must report frame construction procedures explicitly and assess coverage error magnitude.

SRS implicitly treats the population as homogeneous with respect to variability. When the population contains distinct subgroups (strata) with markedly different means or variances, SRS may yield disproportionate representation by chance. Neyman (1934) showed that optimal allocation in stratified random sampling minimises the variance of the estimated mean for a fixed total sample size; the gain over SRS grows with the heterogeneity between strata and is negligible when the strata are alike.

For example, in a study of household income across urban and rural regions with vastly different income distributions, SRS risks severely under-representing one stratum. Stratified random sampling with proportional or optimal allocation (Neyman allocation) is the methodologically superior choice in such scenarios.

SRS ignores geographical structure entirely. When selected units are dispersed across a large geographic area, data collection costs escalate dramatically. Cluster sampling — grouping population elements into natural clusters and selecting clusters randomly — trades increased variance (measured by the design effect, DEFF = 1 + (b̄ - 1)ρ, where ρ is the intracluster correlation coefficient and b̄ is the average cluster size) for dramatically reduced travel and administrative costs. Kish (1965) provides the definitive treatment of this trade-off.

When the research interest lies in estimating parameters for a rare subgroup (e.g., a 2% prevalence condition), SRS requires an enormously large total sample to produce adequate domain estimates. If the target subgroup prevalence is P = 0.02 and a minimum domain sample of n_d = 100 is required, the total SRS sample must be approximately n = 100/0.02 = 5,000 for the expected domain count to reach 100; because the realised count is random, securing 100 cases with high probability requires a somewhat larger n. Oversampling designs, disproportionate stratification, or targeted sampling methods (Watters & Biernacki, 1989) are more cost-efficient in such contexts.
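The rare-subgroup arithmetic can be explored numerically. The sketch below treats the domain count as binomial, which is an approximation that assumes N is much larger than n (under SRSWOR the exact distribution is hypergeometric); the function names and the alternative sample sizes 5,500 and 6,000 are illustrative.

```python
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    """Binomial probability P(X = k), computed in log space for numerical stability."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def prob_domain_at_least(n, prevalence, n_d):
    """P(at least n_d subgroup members in a sample of n), binomial approximation."""
    return 1.0 - sum(binom_pmf(k, n, prevalence) for k in range(n_d))

P, n_d = 0.02, 100
n_expected = round(n_d / P)        # total n at which the expected domain count is n_d

for n_total in (n_expected, 5500, 6000):
    prob = prob_domain_at_least(n_total, P, n_d)
    print(f"n = {n_total}: P(domain count >= {n_d}) = {prob:.3f}")
```

Comparing the printed probabilities shows how much the total sample must grow before the minimum domain size is reached reliably rather than only on average.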

Randomisation of the selection mechanism does not protect against bias introduced by differential non-response. If non-response is related to the outcome variable — Missing Not At Random (MNAR) in the terminology of Rubin (1976) — the effective realised sample is no longer a probability sample in the strict sense. The response propensity model (Rosenbaum & Rubin, 1983) offers a partial remedy, but the assumption of ignorable non-response remains untestable without auxiliary data. This is one of the most consequential practical limitations of SRS in field research contexts.

SRS versus Other Probability Sampling Designs

No single sampling design is universally optimal. The appropriate design is determined by population structure, research objectives, budgetary constraints, and acceptable levels of design complexity.

Criterion | SRS | Stratified RS | Cluster S. | Systematic S.
Requires Full Frame | Yes | Yes | Cluster list only | Yes
Statistical Efficiency | Baseline | Higher (if strata internally homogeneous) | Lower (DEFF > 1) | Equal or higher (ordered frame); lower if the frame is periodic
Subgroup Analysis | Poor (rare groups) | Excellent (proportional/optimal alloc.) | Feasible | Limited
Cost / Logistics | Moderate–High | Moderate–High | Low (geographically clustered) | Moderate
Design-Based Validity | Full | Full | Full (with DEFF) | Full (if frame periodicity is not a concern)
Variance Estimation | Simple closed-form | Moderately complex | Complex (sandwich/BRR/Jackknife) | Conservative (requires assumptions)
Best Used When | Homogeneous population, accessible frame, sufficient budget | Known heterogeneous subgroups exist | No complete frame; geographic dispersion | Ordered frame exists; no periodicity aligned with the sampling interval
Foundational Reference | Cochran (1977) | Neyman (1934) | Kish (1965) | Madow & Madow (1944)
The Design Effect (DEFF) as a Comparative Metric

The design effect, defined by Kish (1965) as DEFF = V_design(ȳ) / V_SRS(ȳ), quantifies the variance inflation (or deflation) of an alternative design relative to SRS. DEFF > 1 indicates that the alternative design is less efficient than SRS (e.g., cluster sampling with high intracluster correlation). DEFF < 1 indicates superior efficiency (e.g., stratified sampling with high between-stratum variance). The effective sample size is n_eff = n / DEFF. All published survey analyses should report DEFF to enable cross-study comparison and meta-analytic integration.
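A brief numerical sketch shows how the design effect translates into effective sample size. It uses the cluster-sampling approximation DEFF = 1 + (b̄ − 1)ρ given earlier; the function names and the input values (average cluster size 20, ρ = 0.05, n = 1,000) are hypothetical.

```python
def cluster_deff(avg_cluster_size, icc):
    """Approximate design effect for cluster sampling: 1 + (b - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc

def effective_n(n, deff):
    """Effective sample size: the SRS sample size giving the same variance."""
    return n / deff

deff = cluster_deff(avg_cluster_size=20, icc=0.05)   # hypothetical inputs
n_eff = effective_n(1000, deff)

print(f"DEFF = {deff:.2f}, effective n = {n_eff:.0f}")
```

Even a modest intracluster correlation can roughly halve the information in a clustered sample, which is why DEFF belongs in the reported results.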

When SRS is the Optimal Choice

Condition 01

Complete, Accurate Frame Available

SRS is most appropriate when a comprehensive, up-to-date, unduplicated sampling frame exists — e.g., a student registry, employee database, or national health record system — with full coverage of the target population.

Condition 02

Population Relatively Homogeneous

When within-population variance on the key outcome variable is relatively uniform (low between-subgroup heterogeneity), SRS achieves near-optimal efficiency and stratification yields minimal gains.

Condition 03

No Domain Estimation Required

When research objectives concern the overall population mean or total — not subgroup-specific estimates — SRS provides unbiased estimation with minimal design complexity.

Condition 04

Theoretical Baseline Required

In methods research, pilot studies, and simulation studies evaluating estimator properties, SRS serves as the canonical design-based benchmark against which alternative procedures are evaluated.

Implementation Protocol for Doctoral Research

Rigorous implementation of SRS requires systematic adherence to a documented protocol. Each step must be reported with sufficient detail to satisfy peer-review and research ethics board requirements.

1
Define Target Population

Specify inclusion and exclusion criteria with precision. Ambiguous boundaries cause coverage error.

2
Construct Sampling Frame

Enumerate all N population elements. Audit for duplicates, out-of-scope entries, and undercoverage.

3
Determine Sample Size n

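Apply Cochran's formula for estimation problems: specify α, the anticipated proportion p (or variance), and the margin of error e. For hypothesis-testing aims, also specify the desired power and expected effect size.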

4
Assign Unique IDs

Assign sequential identifiers (1 to N) to each population element. Document the assignment.

5
Generate Random Numbers

Use a validated PRNG or random number table. Record the seed for reproducibility and audit trail.

6
Select & Contact Units

Select the n elements corresponding to generated numbers. Document all contact attempts.
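Steps 4 to 6 can be sketched in Python as follows, assuming NumPy is installed. The function name `draw_srswor`, the frame size of 1,200, the sample size of 120, and the seed value are all hypothetical; in a real study the IDs would be matched back to the audited frame.

```python
import numpy as np

def draw_srswor(frame_ids, n, seed):
    """Select n IDs from frame_ids without replacement, using a recorded seed."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(np.asarray(frame_ids), size=n, replace=False))

N, n = 1200, 120                  # hypothetical frame and sample sizes
SEED = 20240115                   # illustrative seed; report it with the software version
frame_ids = np.arange(1, N + 1)   # step 4: sequential IDs 1..N

selected = draw_srswor(frame_ids, n, SEED)   # steps 5 and 6: generate and select

print(f"NumPy {np.__version__}, seed {SEED}")
print("first selected IDs:", selected[:5])
```

Re-running the function with the same seed and library version returns the same IDs, which is what makes the documented selection auditable.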

Randomisation Tools: Standards and Best Practices

Method | Standard | Acceptable For | Reproducibility
Cryptographic PRNG | NIST SP 800-90A (e.g., CTR_DRBG) | All research levels, including RCTs | Seed-dependent; record seed
R: sample() / set.seed() | Mersenne Twister (MT19937) | Academic / social science research | Full reproducibility with set.seed()
Python: random.sample() / numpy | PCG-64 (NumPy ≥1.17) or MT19937 | Computational and applied research | numpy.random.default_rng(seed)
SPSS: Random Cases | Wichmann–Hill algorithm | Behavioural sciences | Seed via SET SEED command
Physical Random Number Table | RAND Corporation (1955) Million Random Digits | Historical benchmark; small N studies | Document start row and column

Reporting Requirements (APA 7th / CONSORT / STROBE)

(a) Population and frame: Full description of target population, frame source, frame date, coverage rate estimate, and any known frame deficiencies.

(b) Sample size determination: Report n, the formula used (Cochran or equivalent), assumed parameter values (α, power, effect size or margin of error), and any adjustment for anticipated non-response or design effect.

(c) Randomisation procedure: Software/algorithm used, version number, and random seed. State explicitly whether sampling was with or without replacement.

(d) Response rate: Report final response rate per AAPOR Response Rate Definitions. Provide comparison of respondent vs. non-respondent characteristics on available auxiliary variables.

(e) Variance estimation: State whether FPC was applied and justify the decision. Report standard errors, not merely standard deviations.

Doctoral-Level Self-Assessment

These questions require application of theoretical concepts, not rote recall. Questions are calibrated to doctoral comprehensive examination standard.

Self-Assessment Quiz — Simple Random Sampling


Question 01 of 06
In a SRSWOR design with N = 500 and n = 50, a researcher omits the finite population correction factor when calculating the standard error. What is the direction and magnitude of the resulting error?
Question 02 of 06
A researcher uses SRS to study employees' job satisfaction in an organisation with 1,200 employees across three departments: 800 in Operations, 300 in Finance, and 100 in HR. The sample of n=120 by chance yields 98 from Operations, 18 from Finance, and 4 from HR. Which methodological concern is MOST pressing?
Question 03 of 06
Which of the following constitutes a violation of the EPSEM principle in a nominally SRS design?
Question 04 of 06
A researcher reports a 95% CI for ȳ as [42.3, 51.7] based on SRS. What is the correct design-based interpretation?
Question 05 of 06
Using Cochran's (1977) formula with p=0.5, e=0.05, α=0.05, what adjusted sample size n is needed from a population of N = 800?
Question 06 of 06
A survey achieves 60% response rate (n_r = 240 of n = 400 selected). The researcher compares respondents vs. non-respondents on administrative records and finds no significant difference on key covariates. What is the MOST defensible conclusion?

Primary Scholarly References

All content in this resource is grounded in peer-reviewed foundational literature. References are formatted per APA 7th Edition.

  • Cochran, W. G. (1977). Sampling techniques (3rd ed.). John Wiley & Sons. [The definitive doctoral-level reference for SRS theory, variance derivation, and sample size determination.]
  • Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4), 558–625. [Foundational paper establishing probability sampling on rigorous mathematical grounds.]
  • Kish, L. (1965). Survey sampling. John Wiley & Sons. [Classic treatment of EPSEM, design effects, cluster sampling, and complex survey analysis.]
  • Lohr, S. L. (2010). Sampling: Design and analysis (2nd ed.). Brooks/Cole. [Contemporary graduate-level survey sampling textbook; recommended for variance estimation chapters.]
  • Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2009). Survey methodology (2nd ed.). John Wiley & Sons. [Authoritative reference on non-response bias, coverage error, and total survey error.]
  • Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260), 663–685. [Original H-T estimator derivation; foundational for weighted estimation under unequal probability sampling.]
  • Hansen, M. H., Hurwitz, W. N., & Madow, W. G. (1953). Sample survey methods and theory (Vols. 1–2). John Wiley & Sons. [Comprehensive two-volume reference; essential for design-based inference theory.]
  • Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. Chapman & Hall. [Establishes SRSWR as the basis of bootstrap resampling; relevant for variance estimation via Monte Carlo methods.]
  • Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). John Wiley & Sons. [Rubin's MCAR/MAR/MNAR taxonomy and its implications for non-response in SRS surveys.]
  • Royall, R. M. (1970). On finite population sampling theory under certain linear regression models. Biometrika, 57(2), 377–387. [Model-based vs. design-based inference debate; critical for doctoral-level epistemological positioning.]
  • American Association for Public Opinion Research (AAPOR). (2016). Standard definitions: Final dispositions of case codes and outcome rates for surveys (9th ed.). AAPOR. [Mandatory reference for response rate reporting standards.]
  • Madow, W. G., & Madow, L. H. (1944). On the theory of systematic sampling. Annals of Mathematical Statistics, 15(1), 1–24. [Foundational paper on systematic sampling as an alternative to SRS.]
Recommended Further Reading for Doctoral Candidates

  • Model-assisted and model-based extensions of SRS theory: Särndal, C.-E., Swensson, B., & Wretman, J. (1992). Model assisted survey sampling. Springer.
  • Bayesian approaches to finite population inference: Ericson, W. A. (1969). Subjective Bayesian models in sampling finite populations. Journal of the Royal Statistical Society, Series B, 31(2), 195–233.
  • Complex survey analysis in R: Lumley, T. (2010). Complex surveys: A guide to analysis using R. Wiley.