Experimental Research: A Doctoral-Level Guide

Section 01

Introduction to Experimental Research

Among the various methodological traditions that define scientific inquiry, experimental research occupies a singular and privileged position. It is the research approach that most directly permits researchers to speak with confidence about causation rather than mere association: the difference between saying that two variables move together and saying that one variable causes the other to change. This distinction, subtle as it may appear to the uninitiated, carries profound consequences for the kinds of claims researchers can make, the policies that can be recommended, and the interventions that can be justified in clinical, educational, and social practice.

Experimental research traces its intellectual lineage to some of the most transformative moments in the history of science. The rigorous manipulation of conditions, the careful assignment of participants to groups, the meticulous measurement of outcomes — these practices emerged not from one discipline but from the convergence of multiple traditions in natural science, medicine, and, eventually, the social sciences. Today, experimental methods underpin evidence hierarchies in medicine, inform program evaluations in education, and remain the benchmark against which other forms of empirical inquiry are assessed.

Yet experimental research is not without its complexities, tensions, and practical constraints. The very features that give experiments their explanatory power — controlled manipulation, random assignment, laboratory or controlled settings — can simultaneously limit how broadly findings may be generalized. This tension between internal validity and external validity is one of the persistent intellectual challenges that any serious researcher must grapple with when designing an experimental study.

Why This Matters for Doctoral Research

At the doctoral level, understanding experimental methodology is not merely about knowing its procedures. It requires appreciating the philosophical assumptions underlying causal inference, the statistical logic of hypothesis testing, the ethical imperatives governing human subjects research, and the practical judgment required when real-world constraints prevent textbook designs. This page addresses all of these dimensions comprehensively.

This resource is structured to guide you systematically — from foundational concepts and historical roots, through the taxonomy of experimental designs and their respective strengths and weaknesses, to practical guidance on conducting studies, analyzing data, and navigating ethical requirements. Each section builds on the previous, but the page is also designed so that experienced researchers can navigate directly to the sections most relevant to their current work.

Section 02

Definition and Core Concepts

Scholarly Definition

Experimental research is a systematic, empirical method of inquiry in which a researcher deliberately manipulates one or more independent variables under controlled conditions, randomly assigns participants (or units) to treatment and control groups, and measures the effect of the manipulation on one or more dependent variables — with the express purpose of establishing causal relationships.

Adapted from: Campbell & Stanley (1963); Creswell & Creswell (2018); Shadish, Cook & Campbell (2002)

Several elements of this definition warrant careful unpacking. The word systematic signals that experimental inquiry follows planned, replicable procedures — not improvisation. Empirical means the investigation relies on observable, measurable evidence gathered from the real world. Deliberate manipulation distinguishes experimental research from observational studies, where researchers observe phenomena without intervening. And causal relationships identifies the ultimate inferential goal: to determine whether changes in X bring about changes in Y.

Key Characteristics of Experimental Research

Seven core characteristics distinguish experimental research from other quantitative approaches. Understanding these characteristics is essential because each one contributes — either alone or in combination with others — to the capacity of the experiment to yield causal inferences.

Manipulation of Variables

The researcher actively introduces, modifies, or withholds the independent variable — rather than simply observing it. This is the defining act that separates experimental from non-experimental research.

Random Assignment

Participants are assigned to conditions through a chance mechanism (e.g., lottery, random number table, computerized randomization), so that each participant has a known, typically equal, probability of receiving any condition.

Controlled Conditions

Extraneous variables that could influence the dependent variable are identified and held constant, or their effects are distributed equally across groups — preventing confounding of results.

Comparison Groups

At least one experimental group (receiving treatment) is compared against a control group (receiving no treatment or an alternative treatment), providing a baseline against which effects are measured.

Measurement of Outcomes

Dependent variables are operationalized into observable, measurable indicators, and data are collected using standardized instruments to enable statistical analysis and replication.

Hypothesis Testing

Experiments are driven by specific, testable hypotheses (null and alternative) that articulate predicted relationships or differences among variables, allowing probabilistic conclusions about those relationships.

Replicability

Experimental procedures are documented with sufficient detail so that independent researchers can repeat the study under similar conditions to verify, challenge, or extend the original findings.

Section 03

Historical Development of Experimental Research

The roots of experimental methodology reach back to antiquity, though it was not until the seventeenth century that systematic experimentation began to crystallize as a recognized approach to generating knowledge. Francis Bacon's advocacy for inductive reasoning and empirical investigation, articulated in Novum Organum (1620), laid the philosophical groundwork for what would eventually become the scientific method. William Harvey's demonstrations of blood circulation through careful physiological experiments in the 1620s showed that controlled observation and systematic manipulation could yield knowledge far beyond what armchair speculation could provide.

The formalization of experimental logic in the social sciences followed a longer path. John Stuart Mill's A System of Logic (1843) introduced the method of difference — the foundational logic underlying experimental comparison — which holds that if two circumstances differ only in one factor, that factor must be either the cause or the effect of the observed difference. This principle, though articulated in philosophical rather than statistical terms, anticipates the internal validity logic of the modern randomized controlled trial.

The statistical machinery essential to modern experimental design emerged largely through the work of Ronald A. Fisher at the Rothamsted Experimental Station in England during the 1920s and 1930s. Fisher's contributions included analysis of variance (ANOVA), the concept of randomization as both a design principle and the basis for valid significance testing, factorial designs, and the notion of replication. His landmark texts, Statistical Methods for Research Workers (1925) and The Design of Experiments (1935), provided the mathematical scaffolding upon which much of subsequent experimental methodology has been built.

Landmark Contribution

Fisher's insistence that randomization, not just control, was the basis for valid causal inference was a conceptual breakthrough that distinguished modern experimental design from earlier controlled observations. The randomized controlled trial (RCT) that now dominates medical research traces its direct lineage to Fisher's agricultural experiments.

Donald Campbell and Julian Stanley's seminal monograph Experimental and Quasi-Experimental Designs for Research (1963) marked the inflection point at which experimental methodology became the gold standard framework in the social sciences and education. Their system for evaluating threats to internal and external validity remains, with later refinements by Shadish, Cook, and Campbell (2002), the dominant framework for assessing experimental designs today.

In recent decades, experimental methodology has extended into new domains. Field experiments, conducted in naturalistic settings rather than laboratories, became prominent tools in economics and public policy through the work of researchers like Esther Duflo and Abhijit Banerjee, whose randomized evaluations of poverty interventions in developing countries earned them, together with Michael Kremer, the 2019 Nobel Memorial Prize in Economic Sciences. This expansion demonstrates that the experimental tradition, far from being confined to laboratory benches, has become a tool for answering some of the most pressing questions in global social policy.

Section 04

Types of Experimental Research

Experimental research is not a monolithic category but a family of related designs that vary in the degree to which they fulfill the classical requirements of manipulation, random assignment, and control. Three major types are recognized in the research methodology literature: true experimental, quasi-experimental, and pre-experimental designs. Each has its appropriate applications, strengths, and limitations.

True Experimental Designs

True experimental designs are characterized by three defining features: (1) deliberate manipulation of an independent variable, (2) random assignment of participants to conditions, and (3) a control or comparison group. When all three features are present, researchers have the strongest possible basis for causal inferences. The randomized controlled trial (RCT) is the archetypal true experimental design, widely regarded across medicine, psychology, and educational research as the "gold standard" for establishing causation. (Shadish, Cook & Campbell, 2002)

The logic is straightforward: if participants are randomly assigned, then on average, the experimental and control groups will be equivalent on all variables — measured and unmeasured — prior to the intervention. Any differences observed afterward can therefore be attributed to the intervention itself, not to pre-existing differences between groups. This is the power of randomization: it controls for confounders you have not thought to measure, not merely those you have anticipated.

▶ Practical Example

True Experiment: Testing a Reading Intervention in Elementary Schools

A researcher wants to determine whether a phonics-based reading program improves reading fluency in Grade 2 pupils. Sixty students are randomly assigned — 30 to the experimental group (receives 12 weeks of the phonics program) and 30 to the control group (continues with the standard curriculum). Pre-tests and post-tests measure reading fluency. Because of random assignment, any significant post-test difference between groups can be attributed to the phonics program.

IV: Type of instruction (phonics program vs. standard curriculum)
DV: Reading fluency score (words per minute)
Design: Pretest–Posttest Control Group Design
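The random assignment step in this example can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name `randomly_assign`, the placeholder student IDs, and the seed value are all assumptions introduced here. Fixing the seed makes the allocation reproducible and auditable.

```python
import random

def randomly_assign(participant_ids, seed=None):
    """Shuffle participants and split them into two equal-sized groups."""
    rng = random.Random(seed)          # seeded generator so the allocation can be reproduced
    shuffled = list(participant_ids)   # copy, so the caller's roster is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"experimental": shuffled[:half], "control": shuffled[half:]}

# Hypothetical roster of 60 Grade 2 pupils (the IDs are placeholders).
students = [f"S{i:02d}" for i in range(1, 61)]
groups = randomly_assign(students, seed=2024)

print(len(groups["experimental"]), len(groups["control"]))  # 30 30
```

In practice the allocation list would be generated before recruitment and, ideally, concealed from whoever enrolls participants.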

Quasi-Experimental Designs

Quasi-experimental designs include manipulation and comparison groups but lack random assignment. This absence is often not a researcher's choice but a practical and ethical reality: in many educational and social settings, randomly assigning students to classrooms, patients to hospitals, or communities to programs is impossible or inappropriate. Quasi-experimental designs are the researcher's methodological response to this constraint.

Because random assignment is absent, pre-existing differences between groups become a serious concern. Researchers using quasi-experimental designs must employ techniques such as matching, analysis of covariance (ANCOVA), regression discontinuity, or difference-in-differences analysis to account for these differences and strengthen causal claims. (Imbens & Rubin, 2015)

▶ Practical Example

Quasi-Experiment: Evaluating a Teacher Training Program

A school district implements a professional development program for teachers in three schools. Three other schools, similar in demographics and academic performance, serve as comparison schools. Teachers are not randomly assigned; they are in their assigned schools based on employment. The researcher compares student achievement gains in treatment versus comparison schools, using pre-program achievement scores as covariates in ANCOVA.

Pre-Experimental Designs

Pre-experimental designs are the weakest of the three categories. They lack both random assignment and adequate comparison groups, making causal inference highly problematic. Despite this, pre-experimental designs are frequently encountered in applied settings where resources or logistics prohibit more rigorous designs. They include the one-shot case study, the one-group pretest-posttest design, and the static-group comparison.

Their primary utility lies in exploratory research: generating hypotheses for more rigorous follow-up studies, or providing preliminary data to justify the investment required for a true experiment. At the doctoral level, using a pre-experimental design requires explicit acknowledgment of its limitations and careful justification of why a stronger design was not feasible. (Campbell & Stanley, 1963)

Design Type	Manipulation	Random Assignment	Control Group	Causal Inference
True Experimental	✓ Yes	✓ Yes	✓ Yes	Strong
Quasi-Experimental	✓ Yes	✗ No	✓ Yes (non-random)	Moderate
Pre-Experimental	✓ Yes	✗ No	✗ Absent or weak	Weak

Section 05

Key Components of an Experiment

Every experiment, regardless of its specific design, is built from a set of conceptual and operational components. Understanding these components — and the distinctions among them — is foundational to competent design, execution, and reporting of experimental research.

Independent Variable (IV)

The independent variable is the factor that the researcher deliberately manipulates. It is the presumed cause in the causal relationship being tested. In experiments, IVs are defined in terms of levels or conditions — the specific values or forms the variable takes. A study examining the effect of sleep duration on cognitive performance might have three levels of the IV: 4 hours, 6 hours, and 8 hours. IVs must be operationalized with precision: vague manipulations produce uninterpretable results.

Dependent Variable (DV)

The dependent variable is the outcome the researcher measures to assess the effect of the independent variable. It is the presumed effect in the causal relationship. DVs must be operationalized — that is, defined in terms of specific, observable, and measurable indicators. In the sleep example above, the DV might be operationalized as score on the Trail Making Test Part B, a standardized neuropsychological measure of processing speed and executive function. Poor operationalization — using vague or unreliable measures — is among the most common threats to the interpretability of experimental findings.

Control Variables (CVs)

Control variables are extraneous variables — factors other than the IV that could influence the DV — that are held constant or otherwise accounted for. They might be controlled through inclusion criteria (e.g., restricting participants to a specific age range), experimental procedures (e.g., conducting all sessions at the same time of day), or statistical methods (e.g., entering them as covariates in ANCOVA). The identification of control variables requires both theoretical knowledge of the phenomenon under study and practical familiarity with the research context.

Experimental and Control Groups

In its most basic form, an experiment involves two groups. The experimental group (also called the treatment group) receives the manipulation — the independent variable. The control group receives no treatment, a placebo, or a comparison treatment, depending on the research question. In more complex designs, multiple experimental groups may be created to test different levels of the IV, with or without a separate control group.

Important Distinction

A placebo control is used when simply receiving any treatment — regardless of its specific content — might affect participant behavior or outcomes (the placebo effect). Medical trials routinely use inert pills as placebos; educational researchers might use an alternative activity of equal duration to control for time-on-task effects. Failure to account for placebo effects can lead to overestimation of treatment efficacy.

Confounding Variables

A confounding variable is one that is correlated with both the IV and the DV, creating the spurious appearance of a causal relationship or masking a true one. Consider a study finding that students who attend private schools outperform public school students on standardized tests. Socioeconomic status (SES) is a confounder: it is correlated with both private school attendance (the presumed cause) and academic achievement (the outcome). Without controlling for SES, the apparent effect of school type is not interpretable as a causal effect. Random assignment, when feasible, balances confounders across groups in expectation, including those the researcher has not measured; when randomization is not possible, researchers must identify and statistically adjust for known confounders.
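The school-type example can be made concrete with a small simulation. The sketch below is purely illustrative: all numbers (score scale, SES effect, selection rule, seed) are invented assumptions, and school type is given no effect at all by construction. The observational comparison still shows a large gap, while a coin-flip assignment does not.

```python
import random
from statistics import mean

rng = random.Random(7)
N = 20_000

def achievement(ses):
    # School type has NO effect here by construction; only SES drives scores.
    return 50 + 10 * ses + rng.gauss(0, 5)

def mean_difference(records):
    """Mean score of private-school students minus mean score of public-school students."""
    private = [score for attends_private, score in records if attends_private]
    public = [score for attends_private, score in records if not attends_private]
    return mean(private) - mean(public)

# Observational world: higher-SES families are more likely to choose private school.
observational = []
for _ in range(N):
    ses = rng.gauss(0, 1)
    attends_private = ses + rng.gauss(0, 1) > 0.5
    observational.append((attends_private, achievement(ses)))

# Experimental world: a coin flip decides school type, so SES is balanced on average.
randomized = []
for _ in range(N):
    ses = rng.gauss(0, 1)
    attends_private = rng.random() < 0.5
    randomized.append((attends_private, achievement(ses)))

naive_gap = mean_difference(observational)
randomized_gap = mean_difference(randomized)
print(f"observational gap: {naive_gap:.2f}")   # large, driven entirely by SES
print(f"randomized gap:    {randomized_gap:.2f}")  # close to zero
```

The gap in the observational data is produced by SES alone, which is exactly the spurious relationship the paragraph above describes.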

Operational Definitions

An operational definition specifies exactly how a variable will be measured or manipulated in a given study. It translates an abstract theoretical construct into a concrete procedure. Consider the construct "academic anxiety." An operational definition might be: "scores on the Academic Anxiety Scale – Revised (AAS-R; Betz & Hackett, 1983), with higher scores indicating greater anxiety." The quality of operational definitions directly determines the construct validity of an experiment, that is, whether the study actually measures and manipulates what it claims to. (Trochim, 2020)

Section 06

Common Experimental Research Designs

Within the categories of true, quasi, and pre-experimental research, a rich variety of specific designs exist. Selecting the appropriate design requires balancing research questions, practical constraints, ethical considerations, and the inferential goals of the study.

Posttest-Only Control Group Design

In this design, participants are randomly assigned to experimental and control groups. No pretest is administered. After the treatment, both groups are measured on the dependent variable. The key advantage is simplicity: it avoids pretest sensitization, the risk that taking a pretest influences participants' performance on the posttest. Its limitation is that, without pretest data, it is not possible to assess baseline equivalence of groups or to measure change within individuals. (Creswell & Creswell, 2018)

R → Group A → Treatment (X) → Posttest (O)
R → Group B → No Treatment → Posttest (O)

R = Random Assignment; X = Treatment; O = Observation/Measurement

Pretest-Posttest Control Group Design

This classic design adds a pretest before the treatment, allowing researchers to verify initial equivalence of groups and measure change scores. Both groups are measured before and after the treatment. The pretest also enables more sensitive statistical analyses, such as ANCOVA with pretest scores as a covariate, which reduces error variance and increases statistical power. This is among the most commonly used designs in educational and psychological research. (Maxwell, Delaney & Kelley, 2018)
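To show what the pretest covariate does, the sketch below computes the classic ANCOVA-adjusted group difference by hand: the raw posttest difference minus the pooled within-group slope times the pretest difference. The function names and the noise-free scores are assumptions made for illustration, and the calculation presumes the pretest-posttest slope is the same in both groups. A real analysis would use a statistics package and report standard errors.

```python
from statistics import mean

def pooled_within_slope(groups):
    """Pooled within-group slope of posttest on pretest (the ANCOVA covariate slope)."""
    sxy = sxx = 0.0
    for pre, post in groups:
        pre_bar, post_bar = mean(pre), mean(post)
        sxy += sum((x - pre_bar) * (y - post_bar) for x, y in zip(pre, post))
        sxx += sum((x - pre_bar) ** 2 for x in pre)
    return sxy / sxx

def adjusted_difference(pre_t, post_t, pre_c, post_c):
    """Treatment-minus-control posttest difference, adjusted for pretest imbalance."""
    slope = pooled_within_slope([(pre_t, post_t), (pre_c, post_c)])
    raw = mean(post_t) - mean(post_c)
    return raw - slope * (mean(pre_t) - mean(pre_c))

# Hypothetical, noise-free scores: posttest = 10 + 0.8 * pretest, plus 5 if treated.
pre_t = [40, 45, 50, 55, 60]
pre_c = [38, 44, 48, 52, 58]
post_t = [10 + 0.8 * x + 5 for x in pre_t]
post_c = [10 + 0.8 * x for x in pre_c]

raw_gap = mean(post_t) - mean(post_c)   # inflated by the treatment group's higher pretest mean
adjusted_gap = adjusted_difference(pre_t, post_t, pre_c, post_c)
print(round(raw_gap, 2), round(adjusted_gap, 2))
```

Because the treatment group happened to start two points higher, the raw gap overstates the built-in effect of 5; the adjustment recovers it.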

Solomon Four-Group Design

Developed by Richard Solomon in 1949, this elegant design combines a pretest-posttest design with a posttest-only design, yielding four groups: two that receive pretests (one treatment, one control) and two that do not (one treatment, one control). This design directly tests whether the pretest interacts with the treatment, a threat to external validity known as the testing-treatment interaction (or pretest sensitization). It is rarely used in practice because, with four groups instead of two, it requires roughly twice the participants of a simple two-group design at the same group size, but it remains the theoretical ideal for detecting pretest-treatment interactions.

Factorial Designs

Factorial designs incorporate two or more independent variables simultaneously, allowing researchers to examine not only the main effect of each IV but also the interaction effects between IVs. An interaction occurs when the effect of one IV depends on the level of another IV. For example, a 2×2 factorial design might cross type of instruction (lecture vs. problem-based learning) with class size (small vs. large), producing four conditions. The researcher can test whether the effectiveness of instructional method varies depending on class size, an interaction that no single-factor design could detect. (Kirk, 2013)
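The arithmetic behind main effects and interactions in the 2×2 case can be sketched from cell means alone. The cell means below are invented for illustration, and the variable names are assumptions; a real analysis would test these contrasts with a factorial ANOVA rather than read them off the means.

```python
from statistics import mean

# Hypothetical cell means (test scores) for a 2x2 design; values are illustrative only.
cell_means = {
    ("lecture", "small"): 70.0,
    ("lecture", "large"): 60.0,
    ("pbl", "small"): 85.0,
    ("pbl", "large"): 62.0,
}

def marginal_mean(factor_index, level):
    """Average the cell means over the other factor."""
    return mean(v for k, v in cell_means.items() if k[factor_index] == level)

# Main effects: difference between marginal means of each factor.
instruction_effect = marginal_mean(0, "pbl") - marginal_mean(0, "lecture")
class_size_effect = marginal_mean(1, "small") - marginal_mean(1, "large")

# Interaction: does the advantage of problem-based learning depend on class size?
pbl_gain_small = cell_means[("pbl", "small")] - cell_means[("lecture", "small")]
pbl_gain_large = cell_means[("pbl", "large")] - cell_means[("lecture", "large")]
interaction = pbl_gain_small - pbl_gain_large

print(instruction_effect, class_size_effect, interaction)  # 8.5 16.5 13.0
```

A nonzero interaction contrast here means the benefit of problem-based learning is much larger in small classes (15 points) than in large ones (2 points).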

▶ Practical Example

2×3 Factorial Design in Workplace Training Research

A researcher studies the effects of training format (online vs. face-to-face) and training duration (2 hours, 4 hours, 6 hours) on employee skill acquisition. The 2×3 design creates six cells. Main effects of format and duration can be tested independently. More importantly, the interaction can reveal whether, for example, online training is more effective at shorter durations but face-to-face training becomes superior at longer durations — a nuanced finding impossible to detect from two separate one-factor experiments.

Repeated Measures (Within-Subjects) Designs

In repeated measures designs, the same participants are exposed to all conditions of the independent variable in sequence, often with a washout period between conditions to limit carryover effects. This design is powerful because it removes stable between-person variability from the error term: each participant serves as their own control. It is particularly suited to studies of learning, habituation, or treatment response over time. The key threats to this design are order effects (performance changes due to the sequence of conditions) and carryover effects (effects of one condition lingering into the next). Order effects are typically addressed through counterbalancing, and carryover effects through adequate washout periods. (Field, 2013)
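Counterbalancing schedules are easy to generate programmatically. The sketch below shows two common options under assumed names (`full_counterbalance`, `latin_square`) and placeholder condition labels: every possible order, and a cyclic Latin square in which each condition appears once in each serial position. Note that a cyclic square balances position only; it does not balance which condition immediately precedes which.

```python
from itertools import permutations

def full_counterbalance(conditions):
    """Every possible order of the conditions (k! orders for k conditions)."""
    return [list(order) for order in permutations(conditions)]

def latin_square(conditions):
    """Cyclic Latin square: each condition occupies each serial position exactly once."""
    k = len(conditions)
    return [[conditions[(row + col) % k] for col in range(k)] for row in range(k)]

conditions = ["A", "B", "C"]          # placeholder labels for three IV levels
all_orders = full_counterbalance(conditions)
square = latin_square(conditions)

print(len(all_orders))  # 6
for row in square:
    print(row)
# ['A', 'B', 'C']
# ['B', 'C', 'A']
# ['C', 'A', 'B']
```

Participants would then be randomly assigned to one of the orders, with equal numbers per order where possible.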

Regression Discontinuity Design

The regression discontinuity (RD) design is a quasi-experimental approach used when assignment to treatment is based on a cutoff score on a continuous variable (called the assignment variable or running variable). For example, students scoring below a cutoff on a reading test are assigned to a remedial program; those scoring above are not. By comparing outcomes for participants just below and just above the cutoff, who are assumed to be nearly equivalent in all respects, the researcher can estimate the causal effect of the program at the cutoff. The RD design is one of the most credible quasi-experimental alternatives to randomization when an eligibility threshold governs program access. (Imbens & Lemieux, 2008)
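A bare-bones version of the RD comparison fits a separate straight line on each side of the cutoff, within a chosen bandwidth, and takes the gap between the two fitted lines at the cutoff. The sketch below makes several simplifying assumptions: the helper names, the linear functional form, the bandwidth of 10 points, and the noise-free synthetic data with a built-in effect of 8 are all invented for illustration. Applied work would typically use dedicated RD software with data-driven bandwidths and robust inference.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def rd_estimate(scores, outcomes, cutoff, bandwidth):
    """Jump in the fitted outcome at the cutoff: treated side (below) minus untreated side."""
    below = [(s, y) for s, y in zip(scores, outcomes) if cutoff - bandwidth <= s < cutoff]
    above = [(s, y) for s, y in zip(scores, outcomes) if cutoff <= s <= cutoff + bandwidth]
    a_lo, b_lo = fit_line(*zip(*below))
    a_hi, b_hi = fit_line(*zip(*above))
    return (a_lo + b_lo * cutoff) - (a_hi + b_hi * cutoff)

# Synthetic, noise-free data: pupils scoring below 50 receive the remedial program,
# which adds 8 points to their later outcome.
scores = list(range(30, 71))
outcomes = [20 + 0.6 * s + (8 if s < 50 else 0) for s in scores]

effect = rd_estimate(scores, outcomes, cutoff=50, bandwidth=10)
print(round(effect, 2))
```

The estimate applies only to pupils near the cutoff, which is the main external validity limitation of the design.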

Interrupted Time Series Design

The interrupted time series (ITS) design involves collecting observations on a dependent variable at multiple time points before and after an intervention. Rather than comparing experimental and control groups at a single point in time, it compares the trajectory of the outcome variable before and after the intervention. Changes in level (immediate shift) and slope (change in trend) at the point of intervention provide evidence of impact. ITS designs are common in public health, economics, and policy research, where population-level data are available over extended periods. (Bernal et al., 2017)
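The level and slope changes can be illustrated by fitting one trend line to the pre-intervention points and another to the post-intervention points. This is a deliberately simplified sketch: the function names, the 24 monthly observations, and the built-in effects (a drop of 10 and a slope change of -0.5) are assumptions, and the data contain no noise. Real ITS analyses use segmented regression that also handles autocorrelation and seasonality.

```python
def fit_line(ts, ys):
    """Ordinary least squares for y = intercept + slope * t."""
    n = len(ts)
    t_bar, y_bar = sum(ts) / n, sum(ys) / n
    sxx = sum((t - t_bar) ** 2 for t in ts)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope

def its_effects(times, values, start):
    """Level change at the intervention point and change in trend afterward."""
    pre = [(t, y) for t, y in zip(times, values) if t < start]
    post = [(t, y) for t, y in zip(times, values) if t >= start]
    a_pre, b_pre = fit_line(*zip(*pre))
    a_post, b_post = fit_line(*zip(*post))
    level_change = (a_post + b_post * start) - (a_pre + b_pre * start)
    slope_change = b_post - b_pre
    return level_change, slope_change

# Synthetic monthly series: upward trend of 1 per month, intervention at month 13
# produces an immediate drop of 10 and reduces the trend by 0.5 per month.
months = list(range(1, 25))
series = [100 + 1.0 * t + (-10 - 0.5 * (t - 13) if t >= 13 else 0) for t in months]

level_change, slope_change = its_effects(months, series, start=13)
print(round(level_change, 2), round(slope_change, 2))
```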

Section 07

Validity in Experimental Research

Validity in experimental research refers to the trustworthiness of the inferences drawn from a study. It is not a single, unified property but a multidimensional concept encompassing four distinct types, each addressing a different inferential question. Campbell and Stanley (1963) introduced the distinction between internal and external validity; Cook and Campbell (1979) and later Shadish, Cook, and Campbell (2002) extended it to the four-part framework, which remains the authoritative taxonomy for evaluating the validity of experimental and quasi-experimental studies.

Internal Validity

The degree to which the observed relationship between the IV and DV is truly causal — not the result of confounding or extraneous variables. Internal validity asks: Did the treatment really cause the change?

External Validity

The degree to which study findings can be generalized beyond the specific sample, setting, and time of the study. External validity asks: To whom and in what settings do these findings apply?

Construct Validity

The degree to which the operational measures used accurately represent the theoretical constructs they are intended to capture. Construct validity asks: Are we really measuring what we think we are measuring?

Statistical Conclusion Validity

The degree to which inferences about covariation between the IV and DV are appropriate given the data and statistical analysis used. This asks: Did we use the right statistical approach, with adequate power?

Threats to Internal Validity

Threats to internal validity are factors other than the independent variable that could explain observed changes in the dependent variable. Campbell and Stanley (1963) identified eight classic threats; Shadish, Cook, and Campbell (2002) revised the list to nine, adding ambiguous temporal precedence. The major threats are described below.

History

Definition: Events occurring outside the experiment but during the study period that could affect the dependent variable. If a researcher is studying the effects of a stress-management workshop and a major national disaster occurs midway through the study, any changes in stress levels might be attributable to the external event rather than the workshop. History threats are particularly problematic in longer studies.

Control strategy: Use a concurrent control group that is exposed to the same historical events; keep study duration as short as feasible.

Maturation

Definition: Biological, psychological, or developmental changes within participants that occur naturally over time and may affect outcomes independent of the treatment. In a study of a six-month educational intervention with young children, cognitive development and literacy maturation are ongoing. Observed improvements at posttest may reflect normal development, not the intervention.

Control strategy: Use a concurrent control group, which matures at the same rate as the treatment group; comparing the two groups isolates the treatment effect from maturational change.

Testing

Definition: The effect of taking a test on subsequent performance on that test. Participants who take a pretest may score higher on a posttest simply because of familiarity with the test format (the testing effect), independent of any treatment. This is particularly relevant for cognitive assessments, intelligence tests, and attitude surveys.

Control strategy: Use parallel test forms for pretest and posttest; use a posttest-only design when pretest data are not essential; include a control group that takes both tests without receiving treatment.

Instrumentation

Definition: Changes in the measuring instrument or measurement procedures over the course of the study that affect observed scores. Human raters may become more lenient or stringent over time (rater drift). Questionnaire wording may be modified. Calibration of physical instruments may shift. Any of these changes can produce apparent change in the DV that is artifactual.

Control strategy: Standardize measurement procedures; train and recalibrate raters throughout the study; use mechanical or automated scoring where feasible; assess inter-rater reliability at multiple time points.

Statistical Regression (Regression to the Mean)

Definition: The statistical tendency for extreme scores on an initial measurement to move toward the mean on subsequent measurements, regardless of any intervention. If an intervention selects participants on the basis of extreme scores (e.g., lowest-performing students for remediation), apparent improvement at posttest may partly reflect this statistical phenomenon rather than treatment efficacy.

Control strategy: Use a randomly assigned control group selected using the same criteria; avoid selecting participants solely on the basis of extreme scores without a comparison group.
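Regression to the mean is easy to demonstrate by simulation. In the sketch below, which rests entirely on invented parameters (ability scale, measurement noise, seed, a bottom-10% selection rule), no intervention is applied at all, yet the students selected for their low pretest scores appear to improve at posttest.

```python
import random
from statistics import mean

rng = random.Random(11)
N = 5_000

# Each student has a stable true ability; every test adds independent measurement noise.
true_ability = [rng.gauss(100, 10) for _ in range(N)]
pretest = [a + rng.gauss(0, 10) for a in true_ability]
posttest = [a + rng.gauss(0, 10) for a in true_ability]   # no intervention at all

# Select roughly the lowest-scoring 10% on the pretest, as a remediation program might.
cutoff = sorted(pretest)[N // 10]
selected = [i for i in range(N) if pretest[i] < cutoff]

pre_mean = mean(pretest[i] for i in selected)
post_mean = mean(posttest[i] for i in selected)
print(f"selected group pretest mean:  {pre_mean:.1f}")
print(f"selected group posttest mean: {post_mean:.1f}")   # higher, with no treatment
```

A control group selected by the same rule would show the same artifactual gain, which is why the comparison between groups, not the raw pre-post change, carries the causal information.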

Selection

Definition: Pre-existing differences between experimental and control groups that result from the way participants were assigned or selected. In quasi-experimental designs, participants who volunteer for or are assigned to treatment may differ systematically from those who do not, whether in motivation, baseline skill, or other relevant characteristics. Any observed differences at posttest may reflect these initial differences, not the treatment.

Control strategy: Random assignment; when randomization is not possible, use matching, propensity score analysis, or ANCOVA with measured baseline covariates.

Attrition (Mortality)

Definition: The differential loss of participants from experimental and control groups over the course of the study. If participants who drop out of the treatment group differ systematically from those who remain (e.g., those who find the intervention too demanding drop out), the observed effect will be biased. This is known as differential attrition; it undermines the equivalence created by random assignment.

Control strategy: Minimize attrition through engagement strategies; conduct intention-to-treat (ITT) analysis; use multiple imputation for missing data; report attrition rates by condition and test for differential attrition.

Diffusion of Treatment

Definition: Also called treatment contamination, this occurs when participants in the control group learn about or gain access to elements of the experimental treatment, diminishing the contrast between conditions. In school-based studies, treatment and control students may communicate across classrooms; in workplace studies, employees may share information from training programs.

Control strategy: Assign intact units (e.g., schools, classrooms, clinics) rather than individuals when contamination is likely; physically separate treatment and control groups; monitor for contamination.

Section 08

Conducting an Experimental Study: Step-by-Step Guide

Conducting a rigorous experimental study requires careful, systematic planning before a single participant is recruited. Researchers who rush from hypothesis to data collection without adequate preparation routinely encounter problems — invalid measures, underpowered designs, ethical complications, or confounded results — that could have been avoided with proper planning. The following steps reflect best practices in experimental design as articulated in contemporary methodological literature.

Formulate a Clear Research Question and Hypotheses
Define the phenomenon you wish to study, the population of interest, and the causal relationship you wish to test. Formulate a null hypothesis (H₀) and a directional or non-directional alternative hypothesis (H₁). Hypotheses must be specific, testable, and derived from theory or prior empirical literature.
Review the Existing Literature
Conduct a systematic review of prior experimental and non-experimental studies on your topic. Identify gaps, inconsistencies, and unanswered questions that your study will address. Use this review to justify your choice of variables, measures, and design.(Borenstein et al., 2021)
Select and Justify Your Experimental Design
Choose the design that best fits your research question, population, setting, and resources. Document your reasoning explicitly, acknowledging design limitations. If random assignment is not feasible, explain why and specify how you will address threats to internal validity.
Identify and Operationalize Variables
Specify the levels of your independent variable, the operational definition of your dependent variable(s), and any covariates or control variables. Select or develop measurement instruments with documented reliability and validity for your population.
Conduct a Power Analysis
Calculate the sample size required to achieve adequate statistical power (conventionally ≥0.80) at your predetermined alpha level (typically 0.05), given an estimate of the expected effect size drawn from prior literature. Underpowered studies produce unreliable results and waste resources. (Cohen, 1988; Lakens, 2013)
Obtain Ethical Approval
Submit your protocol to your institution's Institutional Review Board (IRB) or Research Ethics Committee (REC). Prepare informed consent materials. Address issues of risk, confidentiality, data security, and participant rights. No data collection may begin until ethical approval is obtained.
Recruit and Screen Participants
Recruit participants from your target population using your chosen sampling strategy. Screen for inclusion and exclusion criteria. Obtain informed consent. Create a clear randomization protocol if applicable.
Administer Pretest Measures (if applicable)
Collect baseline data on the dependent variable(s) and any covariates. Standardize data collection procedures across all participants and assessors to minimize instrumentation bias.
Implement the Experimental Treatment
Administer the intervention with fidelity — ensuring it is delivered as designed and consistently across all participants in the experimental group. Monitor fidelity through observation checklists, session logs, or audio/video recording. Keep control group conditions constant.
Administer Posttest Measures
Collect outcome data from all participants using the same procedures as at pretest. Minimize missing data through follow-up procedures. If possible, ensure assessors are blind to participants' group assignment (blinding).
Analyze Data
Clean and prepare your dataset, then check the statistical assumptions of each planned analysis before running it. Conduct the analyses appropriate to your design (e.g., independent samples t-test, ANOVA, ANCOVA, mixed-model ANOVA) and report effect sizes alongside significance tests.
Interpret and Report Findings
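Interpret results in light of your hypotheses, study limitations, and existing literature. Follow reporting standards such as APA 7th edition guidelines or CONSORT (for RCTs). Registration on a trial registry (e.g., ClinicalTrials.gov, AEA Registry) must be completed prior to data collection, not at the reporting stage, because its purpose is to protect against selective reporting and publication bias. When writing up, report results against the registered plan and disclose any deviations from it.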
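
The power analysis step above can be approximated with a short calculation. The sketch below is illustrative only and is not a replacement for dedicated power software: it assumes a two-sided comparison of two independent group means and uses the normal approximation, so exact t-based calculations will give slightly larger sample sizes. The function name `n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-group comparison of means.

    Normal approximation: n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2,
    where d is the expected standardized mean difference (Cohen's d).
    """
    if d <= 0:
        raise ValueError("effect size d must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Compare required n for Cohen's conventional benchmarks at two power levels.
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(label, d, n_per_group(d, power=0.80), n_per_group(d, power=0.90))
```

The output makes the practical point concrete: halving the expected effect size roughly quadruples the required sample, which is why the effect size estimate should be justified from prior literature rather than chosen for convenience.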

Section 09

Data Analysis Methods in Experimental Research

The choice of statistical analysis in experimental research is determined by the design, the nature of the dependent variable, and the number of independent variables and groups. Conducting an incorrect analysis — or failing to check assumptions — can invalidate otherwise well-designed research. The following methods are the most commonly used in experimental studies.

Independent Samples t-Test

Used when comparing the means of two independent groups on a continuous dependent variable. In its standard form, it tests the null hypothesis that the population means of the two groups are equal. Assumptions include normality of the DV within each group, homogeneity of variances (tested with Levene's test), and independence of observations. Effect size is reported as Cohen's d. (Field, 2013)

t = (M₁ − M₂) / √[s²p(1/n₁ + 1/n₂)]
where s²p is the pooled variance, M = group mean, n = group size
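
As a minimal sketch of the formula above, the following computes the pooled-variance t statistic, its degrees of freedom, and Cohen's d using only the Python standard library. The scores are hypothetical, and the function name `pooled_t_test` is an illustrative assumption. It does not compute a p-value, which requires the t distribution.

```python
import math
from statistics import mean, variance


def pooled_t_test(group1, group2):
    """Return (t, df, d) for two independent groups, assuming equal variances."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    d = (m1 - m2) / math.sqrt(sp2)  # Cohen's d with the pooled SD as standardizer
    return t, n1 + n2 - 2, d


# Hypothetical posttest scores for illustration only.
treatment = [6, 7, 8, 9, 10]
control = [4, 5, 6, 7, 8]
t, df, d = pooled_t_test(treatment, control)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")
```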

One-Way Analysis of Variance (ANOVA)

Used when comparing means across three or more independent groups. ANOVA partitions total variance in the DV into variance due to group membership (between-groups variance) and variance due to individual differences within groups (within-groups variance). A significant F-ratio indicates that the group means are not all equal, but it does not identify which groups differ. Follow-up pairwise comparisons (e.g., Tukey's HSD, or t-tests with a Bonferroni correction) are required to locate the differences while controlling the Type I error rate. (Maxwell, Delaney & Kelley, 2018)
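
The variance partition described above can be sketched in a few lines. This is an illustrative calculation on hypothetical scores, with an assumed function name `one_way_anova`. It returns the sums of squares, degrees of freedom, and F-ratio, but not the p-value, which requires the F distribution.

```python
from statistics import mean


def one_way_anova(groups):
    """Partition variance for a one-way between-subjects design."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Between-groups: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: how far each score sits from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": f_ratio,
    }


# Hypothetical scores for three conditions.
result = one_way_anova([[4, 5, 6], [6, 7, 8], [8, 9, 10]])
print(result)
```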

Analysis of Covariance (ANCOVA)

ANCOVA extends ANOVA by statistically controlling for one or more covariates, typically pretest scores or background variables that are correlated with the DV. By removing the variance in the DV accounted for by the covariate, ANCOVA reduces error variance, increases statistical power, and provides a more precise estimate of treatment effects. It is the method of choice for pretest-posttest designs and quasi-experimental studies where group differences on covariates need to be controlled. (Tabachnick & Fidell, 2019)
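
One way to see what the covariate adjustment does is to compute covariate-adjusted group means by hand. The sketch below assumes a single covariate and homogeneous regression slopes across groups (the standard ANCOVA assumption). The data and the function name `ancova_adjusted_means` are hypothetical, and the sketch omits the significance test.

```python
from statistics import mean


def ancova_adjusted_means(groups):
    """groups maps a group name to a list of (pretest, posttest) pairs."""
    sxy = sxx = 0.0
    for pairs in groups.values():
        mx = mean(x for x, _ in pairs)
        my = mean(y for _, y in pairs)
        sxy += sum((x - mx) * (y - my) for x, y in pairs)
        sxx += sum((x - mx) ** 2 for x, _ in pairs)
    slope = sxy / sxx  # pooled within-group regression slope
    grand_x = mean(x for pairs in groups.values() for x, _ in pairs)
    # Shift each posttest mean to where it would sit at the grand pretest mean.
    adjusted = {
        name: mean(y for _, y in pairs)
        - slope * (mean(x for x, _ in pairs) - grand_x)
        for name, pairs in groups.items()
    }
    return slope, adjusted


# Hypothetical data: the treatment group starts lower at pretest.
data = {
    "treatment": [(10, 18), (12, 20), (14, 22)],
    "control": [(12, 16), (14, 18), (16, 20)],
}
slope, adjusted = ancova_adjusted_means(data)
print(slope, adjusted)
```

In this illustration the raw posttest means differ by 2 points, while the adjusted means differ by 4, because the adjustment accounts for the treatment group's lower starting point.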

Factorial ANOVA

Used in factorial designs with two or more independent variables. It tests main effects (the independent effect of each IV) and interaction effects (whether the effect of one IV depends on the level of another IV). Interaction effects are often the most theoretically interesting findings from factorial designs, as they reveal conditions under which treatments work or fail to work.
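
To make the distinction between main effects and interactions concrete, the sketch below describes a 2×2 pattern of cell means. The means, the factor coding, and the function name `two_by_two_effects` are hypothetical. The output is descriptive only, since testing these effects for significance requires the within-cell error term from the full ANOVA.

```python
def two_by_two_effects(cells):
    """Describe a 2x2 table of cell means.

    cells maps (level_of_A, level_of_B) to a cell mean, with levels coded 0 and 1.
    """
    a0 = (cells[(0, 0)] + cells[(0, 1)]) / 2  # marginal mean, A = 0
    a1 = (cells[(1, 0)] + cells[(1, 1)]) / 2  # marginal mean, A = 1
    b0 = (cells[(0, 0)] + cells[(1, 0)]) / 2  # marginal mean, B = 0
    b1 = (cells[(0, 1)] + cells[(1, 1)]) / 2  # marginal mean, B = 1
    simple_b_at_a0 = cells[(0, 1)] - cells[(0, 0)]
    simple_b_at_a1 = cells[(1, 1)] - cells[(1, 0)]
    return {
        "main_effect_A": a1 - a0,
        "main_effect_B": b1 - b0,
        # Interaction: does the effect of B change across levels of A?
        "interaction": simple_b_at_a1 - simple_b_at_a0,
    }


# Hypothetical means. A: 0 = lecture, 1 = discussion. B: 0 = low, 1 = high motivation.
means = {(0, 0): 60, (0, 1): 70, (1, 0): 62, (1, 1): 84}
print(two_by_two_effects(means))
```

Here the effect of motivation is 10 points under lecture but 22 points under discussion, so the interaction contrast is 12: the main effect of teaching method does not apply uniformly to all students.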

Repeated Measures ANOVA

Used when the same participants are measured at multiple time points or under multiple conditions. Because each participant serves as their own control, stable individual differences are removed from the error term, which substantially reduces error variance compared to between-subjects designs. With three or more levels, the sphericity assumption (that the variances of the differences between all pairs of conditions are equal) must be tested using Mauchly's test; violations require adjustment of the degrees of freedom using Greenhouse-Geisser or Huynh-Feldt epsilon corrections. (Field, 2013)
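
The Greenhouse-Geisser correction mentioned above scales the degrees of freedom by an estimate, epsilon, of how far the data depart from sphericity. The sketch below computes epsilon from a covariance matrix of the repeated measures using the double-centering form of the formula. The matrices and the function name `greenhouse_geisser_epsilon` are hypothetical illustrations, and statistical packages report this value directly.

```python
def greenhouse_geisser_epsilon(cov):
    """Greenhouse-Geisser epsilon from a k x k covariance matrix of conditions.

    Epsilon is 1 when sphericity holds and has a lower bound of 1 / (k - 1).
    """
    k = len(cov)
    row_means = [sum(row) / k for row in cov]
    grand_mean = sum(row_means) / k
    # Double-center the matrix (it is symmetric, so column means equal row means).
    centered = [
        [cov[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(k)]
        for i in range(k)
    ]
    trace = sum(centered[i][i] for i in range(k))
    sum_squares = sum(c * c for row in centered for c in row)
    if sum_squares == 0:
        raise ValueError("covariance matrix has no within-subject variability")
    return trace ** 2 / ((k - 1) * sum_squares)


# Compound symmetry (equal variances, equal covariances) satisfies sphericity.
spherical = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]
print(greenhouse_geisser_epsilon(spherical))
```

The corrected test multiplies both the numerator and denominator degrees of freedom of the F-test by epsilon, making the test more conservative as epsilon falls below 1.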

Effect Size and Practical Significance

Statistical significance alone does not indicate whether findings are practically meaningful. Effect size measures the magnitude of the treatment effect, independent of sample size. Common measures include Cohen's d (standardized mean difference), η² and partial η² (proportion of variance explained), and ω² (a less biased estimate of population variance explained than η²). Cohen's (1988) conventional benchmarks of d = 0.2 (small), 0.5 (medium), and 0.8 (large) provide rough guidance, but benchmarks should ideally be calibrated against typical effects in the substantive domain of research. (Lakens, 2013; Cumming, 2014)
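
As an illustration of how η² and ω² differ, the sketch below computes both from the sums of squares of a one-way between-subjects ANOVA. The input values are hypothetical, and the function name `eta_omega_squared` is an assumption made for this example.

```python
def eta_omega_squared(ss_between, ss_within, df_between, df_within):
    """Return (eta squared, omega squared) for a one-way between-subjects ANOVA."""
    ss_total = ss_between + ss_within
    ms_within = ss_within / df_within
    eta_sq = ss_between / ss_total
    # Omega squared subtracts the between-groups variance expected by chance alone.
    omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
    return eta_sq, omega_sq


# Hypothetical ANOVA summary: SS_between = 24, SS_within = 6, df = 2 and 6.
eta_sq, omega_sq = eta_omega_squared(24, 6, 2, 6)
print(f"eta squared = {eta_sq:.3f}, omega squared = {omega_sq:.3f}")
```

The gap between the two values is largest in small samples, which is where η² most overstates the population effect.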

Contemporary Best Practice

The American Statistical Association's 2016 statement on p-values, and the follow-up editorial published in The American Statistician in 2019, emphasize that statistical significance (p < .05) should not be the sole criterion for interpreting experimental results. Report effect sizes with confidence intervals, and consider the practical and theoretical significance of findings in addition to statistical significance. Replication of findings across multiple studies is a stronger basis for inference than a single p-value. (Wasserstein & Lazar, 2016)
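
Reporting an interval around an effect size can be sketched as follows. This example uses a common large-sample approximation to the standard error of Cohen's d together with a normal critical value, so it is only a rough guide for small samples, where intervals based on the noncentral t distribution are preferable. The inputs and the function name `cohens_d_ci` are hypothetical.

```python
import math
from statistics import NormalDist


def cohens_d_ci(d, n1, n2, level=0.95):
    """Approximate confidence interval for Cohen's d (large-sample formula)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return d - z * se, d + z * se


# Hypothetical study: d = 0.35 with 250 participants per group.
low, high = cohens_d_ci(0.35, 250, 250)
print(f"d = 0.35, 95% CI [{low:.2f}, {high:.2f}]")
```

An interval of this kind conveys both the estimated magnitude and its precision, which a p-value alone does not.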

Section 10

Practical Examples Across Disciplines

To ground the foregoing theoretical discussion in real-world research practice, the following examples illustrate how experimental methodology has been applied across a range of disciplines. Each example describes the study's design, independent and dependent variables, and key findings, drawing on published research.

▶ Education Research

Effect of Formative Assessment on Academic Achievement

Black and Wiliam (1998, 2009) synthesized dozens of experimental and quasi-experimental studies examining whether formative assessment — ongoing assessment for learning — improved student achievement. Studies typically compared classrooms using structured formative assessment strategies (experimental) against those using conventional assessments (control). Findings indicated effect sizes ranging from 0.4 to 0.7 standard deviations — among the largest effects observed for any educational intervention. These findings drove widespread policy reform in assessment practices globally.

IV: Formative assessment strategy (treatment vs. control)
DV: Standardized achievement scores
Design: Multiple pretest-posttest quasi-experimental and RCT studies

▶ Public Health / Medicine

COVID-19 Vaccine Efficacy Trials (2020–2021)

The phase III efficacy trials for COVID-19 mRNA vaccines (BNT162b2 and mRNA-1273) are landmark examples of large-scale double-blinded RCTs. Tens of thousands of participants were randomly assigned to receive either the vaccine or a placebo. Primary outcomes were laboratory-confirmed symptomatic COVID-19 cases. Polack et al. (2020) reported 95% vaccine efficacy for BNT162b2 (Pfizer-BioNTech); Baden et al. (2021) reported 94.1% for mRNA-1273 (Moderna). These trials exemplify rigorous experimental methodology applied to questions of global public health consequence.

IV: Vaccine vs. placebo
DV: Incidence of symptomatic COVID-19 infection
Design: Phase III double-blind RCT
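
Vaccine efficacy in trials of this kind is, at its simplest, one minus the ratio of the infection risk in the vaccine arm to the risk in the placebo arm. The sketch below uses hypothetical case counts, not the published trial data, and a simple attack-rate calculation. Published trials typically base their estimates on person-time of follow-up. The function name `vaccine_efficacy` is an assumption for illustration.

```python
def vaccine_efficacy(cases_vaccine, n_vaccine, cases_placebo, n_placebo):
    """Efficacy as 1 minus the risk ratio (vaccine arm risk / placebo arm risk)."""
    if cases_placebo == 0:
        raise ValueError("efficacy is undefined with zero placebo cases")
    risk_vaccine = cases_vaccine / n_vaccine
    risk_placebo = cases_placebo / n_placebo
    return 1 - risk_vaccine / risk_placebo


# Hypothetical counts: 10 cases among 20,000 vaccinated, 200 among 20,000 placebo.
ve = vaccine_efficacy(10, 20_000, 200, 20_000)
print(f"Estimated efficacy: {ve:.1%}")
```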

▶ Psychology

Cognitive Behavioral Therapy for Depression

DeRubeis et al. (2005) conducted a rigorous RCT comparing cognitive behavioral therapy (CBT), antidepressant medication (paroxetine), and placebo in patients with moderate to severe major depressive disorder. Patients were randomly assigned to one of the three conditions and assessed at 8 and 16 weeks. Both active treatments outperformed placebo at 8 weeks, and CBT performed comparably to medication over the 16 weeks of acute treatment. This finding supported CBT as a first-line treatment for depression and informed clinical practice guidelines internationally. (DeRubeis et al., 2005)

IV: Treatment condition (CBT / medication / placebo)
DV: Depression severity (Hamilton Rating Scale for Depression)
Design: Three-group RCT

▶ Development Economics

Conditional Cash Transfers and Educational Attainment (PROGRESA, Mexico)

The PROGRESA program (later renamed Oportunidades) in Mexico randomized 506 rural communities to receive conditional cash transfers contingent on school attendance and health clinic visits. Schultz (2004) used this randomized rollout to estimate program effects, finding significant increases in secondary school enrollment, especially for girls, in treatment communities. This study is a foundational example of large-scale policy evaluation using randomized design, and it influenced conditional cash transfer programs across Latin America, Africa, and Asia. (Schultz, 2004)

Section 11

Ethical Considerations in Experimental Research

The history of research ethics is in many respects a history of experimental abuses. The Nazi medical experiments documented at the Nuremberg trials, the Tuskegee Syphilis Study in the United States, and the Willowbrook hepatitis studies — among others — revealed catastrophic violations of human dignity and autonomy in the name of scientific inquiry. These abuses gave rise to the foundational principles of research ethics that govern experimental research today: respect for persons, beneficence, and justice, codified in the Belmont Report (National Commission for the Protection of Human Subjects, 1979) and refined in subsequent national and international guidelines.

Informed Consent

Participants must receive sufficient information about the study (its purposes, procedures, risks, benefits, and their right to withdraw at any time without penalty) to make a genuinely voluntary and informed decision about participation. For participants who cannot give informed consent independently (e.g., children, individuals with cognitive disabilities) or whose circumstances may compromise voluntariness (e.g., incarcerated persons), additional protections apply, including, where applicable, consent from legally authorized representatives and assent from the participant. Informed consent documents must be written in plain language accessible to the participant population, not in academic or legal jargon. (CIOMS, 2016)

Risk-Benefit Assessment

Researchers are obligated to minimize risks to participants and to ensure that the potential benefits of the research — both to individual participants and to society — justify any risks incurred. This requires a careful prospective assessment of potential harms: physical, psychological, social, economic, and legal. In experimental research, particular attention must be paid to withholding potentially beneficial treatments from control group participants. Active control conditions, delayed treatment designs (where control participants receive treatment after the study period), and equipoise — genuine uncertainty about whether treatment is superior — are ethical safeguards in such cases.

Confidentiality and Data Protection

Participant data must be protected from unauthorized access. This involves de-identification of data, secure storage of records, limited access protocols, and compliance with applicable data protection legislation — such as the Data Privacy Act of 2012 (Republic Act 10173) in the Philippines or the General Data Protection Regulation (GDPR) in the European Union. Researchers must specify data retention policies and procedures for data destruction upon study completion.

Deception and Debriefing

Some experimental designs require temporary deception of participants about the study's true purpose. For example, studies of prosocial behavior, obedience, or implicit bias may reveal their true purpose only after data collection, to prevent demand characteristics from distorting results. When deception is used, it must be scientifically justified, it must not conceal information that would have led participants to decline to take part, and a thorough debriefing must follow data collection, explaining the true purpose and removing any misconceptions. Deception that causes distress requires additional justification and safeguards. (APA, 2017)

Vulnerable Populations

Special protections apply when research involves vulnerable populations: children, pregnant women, prisoners, individuals with diminished decision-making capacity, economically disadvantaged persons, and members of communities with histories of exploitation by researchers. Vulnerability does not disqualify populations from research participation; indeed, excluding vulnerable populations can itself be unethical if it deprives them of the benefits of evidence-based interventions. It does, however, demand heightened procedural protections and community engagement. (CIOMS, 2016)

Institutional Review and Pre-Registration

All experimental research involving human participants must be reviewed and approved by an accredited Institutional Review Board (IRB) or Research Ethics Committee (REC) before data collection begins. Pre-registration of experimental studies on public registries (ClinicalTrials.gov, OSF Registries, AEA Registry) has become an important safeguard against publication bias, outcome switching, and p-hacking, practices that distort the research literature by selectively reporting significant findings or modifying hypotheses after results are known. (Nosek et al., 2018)

Section 12

Strengths and Limitations of Experimental Research

Causal Inference

The strongest basis for concluding that X causes Y. No other design provides the same logical and statistical foundation for causal claims when properly executed.

Control of Extraneous Variables

Random assignment and experimental control minimize the influence of confounders, providing cleaner tests of hypothesized relationships than observational designs.

Replicability

Standardized, documented procedures enable other researchers to reproduce, challenge, and build upon findings — the cornerstone of cumulative scientific knowledge.

Precision and Quantification

Experimental data permit precise quantification of effect magnitudes, enabling evidence-based decision-making and comparison of effect sizes across studies and populations.

External Validity Concerns

Highly controlled experiments may not generalize well to real-world settings, populations, or contexts — particularly when conducted in laboratories with convenience samples of university students.

Ethical Constraints

Random assignment is unethical in many contexts: withholding an established effective treatment from a control group, or deliberately exposing participants to harmful conditions, is not permissible.

Cost and Complexity

Well-designed experiments, particularly large RCTs, require substantial resources: funding, time, personnel, infrastructure, and regulatory compliance that are not always accessible to researchers.

Hawthorne and Demand Effects

Participants may behave differently simply because they know they are being studied (Hawthorne effect), or they may respond in ways they believe the researcher expects (demand characteristics), both of which threaten validity.

Section 13

Experimental Research Compared to Other Methods

Research methods do not stand in isolation from one another, nor do they form a simple hierarchy; different methods serve different purposes and answer different questions. Understanding where experimental research stands relative to other quantitative and qualitative approaches allows researchers to make informed methodological choices aligned with their research questions, rather than defaulting to a preferred paradigm regardless of fit.

Dimension	Experimental	Correlational / Survey	Causal-Comparative (Ex Post Facto)	Qualitative
Primary Purpose	Establish causation	Describe relationships	Explore cause after the fact	Understand meaning & experience
Manipulation of IV	Yes	No	No (IV already occurred)	No
Random Assignment	Yes (true exp.)	No	No	No
Control of Extraneous Variables	High	Low-moderate	Low-moderate	Not applicable
Causal Claims	Strong	Weak (association only)	Weak-moderate	Not primary aim
Generalizability	Variable	Moderate-high (large samples)	Moderate	Transferability (not generalizability)
Typical Sample Sizes	Small-medium (power-based)	Large	Medium-large	Small (purposive)
Data Type	Numerical/interval	Numerical/ordinal	Numerical	Text, narrative, observation

Mixed Methods Note

Increasingly, researchers combine experimental methods with qualitative inquiry in mixed methods designs. An experiment may yield significant treatment effects, but qualitative interviews with participants can illuminate why and how the treatment worked — causal mechanisms that statistical analyses alone cannot reveal. Creswell and Plano Clark (2018) describe several mixed methods designs appropriate for embedding qualitative components within experimental frameworks.

Section 14

Self-Assessment Quiz

Test your understanding of experimental research with these 10 doctoral-level questions. Detailed feedback is provided after each answer.

Question 01

What is the primary advantage of random assignment in an experimental design?

Random assignment is the cornerstone of true experimental designs because it distributes both measured and unmeasured confounders across groups through chance, not through researcher judgment or participant self-selection. The resulting equivalence of groups in expectation at baseline is what permits causal inference: if groups differ at posttest by more than chance would predict, the difference cannot plausibly be attributed to systematic pre-existing group differences. Note that random assignment does NOT guarantee identical groups (especially in small samples), does not eliminate the need for statistical analysis, and does not automatically ensure generalizability.

Question 02

A researcher studies whether a new math curriculum improves achievement. Due to school policies, she cannot randomly assign students to classrooms. She instead compares students in three schools using the new curriculum to students in three demographically similar schools using the standard curriculum. What type of design is this?

This is a quasi-experimental design — specifically a nonequivalent comparison group design. The researcher manipulates the IV (curriculum type) and includes a comparison group, but participants were not randomly assigned to conditions. The absence of random assignment is the defining feature of quasi-experimental research. The design is not causal-comparative because the researcher does intervene — she implements the new curriculum. It is not pre-experimental because a meaningful comparison group is present.

Question 03

Which of the following best describes "statistical regression" as a threat to internal validity?

Regression to the mean occurs because extreme scores on any measurement contain a component of random error. On re-measurement, that random error component will tend to be smaller (closer to zero), making the observed score move toward the mean. This is a serious threat when participants are selected precisely because of extreme scores (e.g., the lowest achievers for a remediation program). Apparent improvement may partly reflect this statistical phenomenon, not the treatment effect. Option A describes maturation; Option B describes attrition (experimental mortality); Option D describes instrumentation.

Question 04

In a 2×3 factorial design, how many conditions (cells) does the experiment have?

A factorial design notation reads as the number of levels for each independent variable. A 2×3 design has two IVs: the first has 2 levels and the second has 3 levels. The total number of conditions (cells) is the product of all levels: 2 × 3 = 6. Each cell represents a unique combination of IV levels. For example, if IV1 is instructional method (lecture, discussion) and IV2 is class size (small, medium, large), the six cells are: lecture-small, lecture-medium, lecture-large, discussion-small, discussion-medium, discussion-large.
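
The cells of a factorial design are simply the Cartesian product of the factor levels, which can be enumerated directly. The sketch below uses the instructional method and class size example from the explanation above; the variable names are illustrative.

```python
from itertools import product

methods = ["lecture", "discussion"]       # IV1: 2 levels
class_sizes = ["small", "medium", "large"]  # IV2: 3 levels

# Each cell is one unique combination of levels across the two IVs.
cells = [f"{method}-{size}" for method, size in product(methods, class_sizes)]

print(len(cells))
for cell in cells:
    print(cell)
```

The same pattern extends to larger designs: adding a third factor with two levels to `product` would yield 2 × 3 × 2 = 12 cells.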

Question 05

A researcher adds pretest scores as a covariate in analyzing the results of a pretest-posttest control group experiment. What is the primary statistical purpose of this procedure?

Using pretest scores as a covariate in ANCOVA serves two primary purposes: (1) it removes from the error term the variance in posttest scores that is predictable from pretest performance — reducing error variance and increasing the sensitivity (statistical power) of the test; and (2) it adjusts for any residual group differences on the covariate that remain after randomization (which is especially important in small samples or quasi-experimental designs). ANCOVA does NOT convert a quasi-experimental design into a true experiment — that requires random assignment, which no statistical technique can substitute for after the fact.

Question 06

Which ethical principle, codified in the Belmont Report, requires researchers to ensure that the potential benefits of research justify any risks to participants?

The Belmont Report (1979) identifies three foundational principles: Respect for Persons (recognizing individuals as autonomous agents with the right to make informed decisions about participation), Beneficence (the obligation to maximize benefits and minimize harms — this is the risk-benefit assessment principle), and Justice (fair distribution of the burdens and benefits of research across populations). Veracity is not a Belmont principle; it is a principle in nursing ethics associated with truthfulness.

Question 07

Which validity type is most directly threatened when participants in an experiment modify their behavior because they know they are being observed or studied?

When participants change their behavior because they are aware they are being observed (the Hawthorne effect), the observed outcome differences between groups may reflect differential reactivity to being observed, not the actual causal effect of the treatment. If the experimental group responds more positively to scrutiny than the control group (or vice versa), the treatment effect estimate is confounded, which is why the effect is treated here as a threat to internal validity. Classification does vary across frameworks, however. Campbell and Stanley (1963) listed reactive effects of experimental arrangements among the threats to external validity, and later frameworks often treat participant reactivity as a construct validity threat (the DV partly measures reactivity rather than the intended construct). It should not be confused with instrumentation, which refers to changes in the measuring instrument or observers over time.

Question 08

Cohen (1988) proposed effect size benchmarks for Cohen's d: 0.2 (small), 0.5 (medium), 0.8 (large). A researcher obtains d = 0.35 with p = .003 in a large study (N = 500). What is the most accurate interpretation?

A d of 0.35 falls between Cohen's small (0.2) and medium (0.5) benchmarks. With N = 500, even a small true effect will likely be statistically significant (p < .05), because statistical significance is a function of both effect size AND sample size. The correct interpretation acknowledges that the effect is unlikely to be a chance finding (p = .003 is strong evidence against H₀) but modest in magnitude. Whether it is practically significant depends on the specific intervention: a d of 0.35 in a low-cost, scalable public health intervention may be highly meaningful; the same effect size for an expensive surgical procedure may not justify widespread adoption. Cohen himself cautioned that his benchmarks were rough guides, not rigid criteria.

Question 09

What is the defining feature that distinguishes the regression discontinuity (RD) design from other quasi-experimental designs?

The regression discontinuity design exploits a known rule: assignment to treatment is determined by a continuous score crossing a cutoff. For example, students scoring below 70 on a diagnostic test receive remediation; those scoring 70 or above do not. The causal insight is that participants just below and just above the cutoff are nearly identical in all respects (their scores differ by only a small, essentially random amount), yet one group receives the treatment and the other does not. Comparing outcomes for these near-threshold participants provides an estimate of the local average treatment effect. Option A describes randomization, which would make it a true experiment. Option C describes interrupted time series. Option D describes a crossover design.

Question 10

Pre-registration of an experimental study before data collection primarily serves to protect against which research practice?

Pre-registration involves publicly documenting a study's hypotheses, design, primary outcomes, and planned analyses before any data are collected. This creates a transparent, timestamped record that allows readers to distinguish confirmatory tests of pre-specified hypotheses from exploratory analyses conducted after seeing the data. Pre-registration guards against HARKing (presenting post-hoc hypotheses as if they were a priori), p-hacking (conducting multiple analyses and reporting only those yielding p < .05), and outcome switching (changing the primary outcome after data collection to whichever showed the largest effect). Nosek et al. (2018) and the Open Science Collaboration provide extensive documentation on why pre-registration has become a best practice in experimental science.

Section 15

Classroom Activities for Teachers

The following activities are designed for use by research methodology instructors at the undergraduate and graduate levels. Each activity targets a core concept in experimental research, encourages active engagement and critical thinking, and includes materials, time estimates, and learning outcomes. They are appropriate for both in-person and online synchronous classes.

Activity 1

The Candy Experiment: Experiencing Randomization

Students conduct a classroom experiment to experience random assignment firsthand. Each student receives a slip of paper with a random number. Those with odd numbers form the "experimental group" (they receive a piece of candy before a short cognitive task); those with even numbers form the "control group" (no candy). Students complete a 10-item pattern recognition task. Results are tabulated, and the class discusses whether candy improved performance — and what confounders might exist even with random assignment.

Learning Outcome: Students understand why random assignment does not guarantee identical groups in small samples and why statistical analysis is still necessary.
Follow-up discussion: What threats to validity might still apply? (e.g., knowledge of condition, expectation effects)

⏱ 30 minutes 👥 Any class size 📋 Materials: Candy, printed pattern tasks 🎯 Level: Introductory to intermediate
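
For online classes, or to show students what a documented randomization protocol looks like, the assignment step in this activity can be sketched in code. Unlike the odd/even slips, this version shuffles the roster and splits it in half, which guarantees near-equal group sizes. The roster, the seed, and the function name `randomly_assign` are hypothetical.

```python
import random


def randomly_assign(participants, seed=None):
    """Shuffle participants and split them into two groups of near-equal size.

    Recording the seed makes the assignment reproducible and auditable.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"experimental": shuffled[:half], "control": shuffled[half:]}


# Hypothetical class roster of 20 student ID numbers.
groups = randomly_assign(range(1, 21), seed=2024)
print("Experimental:", sorted(groups["experimental"]))
print("Control:", sorted(groups["control"]))
```

Running the function with several different seeds and comparing a baseline characteristic across the resulting groups gives students a direct view of chance imbalance in small samples.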

Activity 2

Design Critique Workshop: Identifying Validity Threats

In groups of four, students receive a printed description of one of four research scenarios (each with a different embedded validity threat: history, maturation, attrition, or instrumentation). Groups identify the threat, explain how it would bias the study's conclusions, and propose a design modification that would address it. Groups present to the class, and the instructor facilitates discussion on trade-offs between internal and external validity.

Learning Outcome: Students can identify, explain, and address specific threats to internal validity in realistic experimental scenarios.
Assessment option: Groups submit a written critique (300–400 words) as a graded assignment.

⏱ 50 minutes 👥 Groups of 4 📋 Materials: Printed scenario cards 🎯 Level: Intermediate to advanced

Activity 3

Power Analysis Lab: Using G*Power Software

Students download the free G*Power 3.1 software and conduct power analyses for three research scenarios with different designs (independent samples t-test, one-way ANOVA, ANCOVA). They calculate required sample sizes for power = 0.80 and 0.90 at α = 0.05 with small, medium, and large effect sizes, then compare results. A structured worksheet guides them through the analysis and prompts reflection on why adequate power matters.

Learning Outcome: Students can conduct a priori power analyses, articulate the relationship between sample size, effect size, and power, and justify sample size decisions in a research proposal.
Reflection prompt: "What happens to required sample size when the expected effect size is small? What are the practical implications for dissertation research?"

⏱ 60 minutes 👥 Individual or pairs 📋 Materials: Computers with G*Power, worksheet 🎯 Level: Graduate / doctoral

Activity 4

Mock IRB Review: Ethical Evaluation of Experimental Proposals

Students are assigned roles on a mock Institutional Review Board panel. Each group receives a fictitious research proposal involving an experiment with a genuine ethical complication (e.g., partial deception, vulnerable population, withholding treatment from a control group). The "board" reviews the proposal against Belmont Report principles, formulates questions for the researcher, and renders a decision: full approval, modifications required, or rejection. This activity builds ethical reasoning skills and familiarity with the IRB process.

Learning Outcome: Students apply ethical principles to evaluate research designs, articulate concerns about research ethics in concrete scenarios, and understand the role of institutional oversight in experimental research.

⏱ 75 minutes 👥 Groups of 5–6 📋 Materials: IRB role cards, proposal sheets, Belmont Report summary 🎯 Level: Graduate / doctoral

Activity 5

Factorial Design Challenge: Mapping Interactions

Students are given a 2×2 factorial data set (pre-computed ANOVA results with interaction graphs) from a hypothetical educational study (e.g., teaching method × student motivation level). They must: (1) identify whether there is a significant interaction, (2) correctly interpret the interaction graph, (3) explain what the interaction implies for the main effects, and (4) write a one-paragraph interpretation in APA format. Students compare their interpretations in pairs, then discuss as a class.

Learning Outcome: Students can correctly interpret interaction effects in factorial ANOVA, distinguish between main effects and interactions, and communicate findings in scholarly writing.

⏱ 45 minutes 👥 Individual then pairs 📋 Materials: Printed data sheets, interaction graphs 🎯 Level: Graduate / doctoral

Section 16

References

References follow APA 7th edition format, prioritizing sources from 2010–2026 where available.

American Psychological Association. (2017). Ethical principles of psychologists and code of conduct (2002, amended 2010 and 2017). https://www.apa.org/ethics/code
American Statistical Association. (2016). ASA statement on statistical significance and p-values. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
Baden, L. R., El Sahly, H. M., Essink, B., Kotloff, K., Frey, S., Novak, R., … Zaks, T. (2021). Efficacy and safety of the mRNA-1273 SARS-CoV-2 vaccine. New England Journal of Medicine, 384(5), 403–416. https://doi.org/10.1056/NEJMoa2035389
Bernal, J. L., Cummins, S., & Gasparrini, A. (2017). Interrupted time series regression for the evaluation of public health interventions: A tutorial. International Journal of Epidemiology, 46(1), 348–355. https://doi.org/10.1093/ije/dyw098
Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability, 21(1), 5–31. https://doi.org/10.1007/s11092-008-9068-5
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2021). Introduction to meta-analysis (2nd ed.). Wiley.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Rand McNally.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Council for International Organizations of Medical Sciences (CIOMS). (2016). International ethical guidelines for health-related research involving humans (4th ed.). CIOMS. https://cioms.ch/publications/product/international-ethical-guidelines-for-health-related-research-involving-humans/
Creswell, J. W., & Creswell, J. D. (2018). Research design: Qualitative, quantitative, and mixed methods approaches (5th ed.). SAGE Publications.
Creswell, J. W., & Plano Clark, V. L. (2018). Designing and conducting mixed methods research (3rd ed.). SAGE Publications.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
DeRubeis, R. J., Hollon, S. D., Amsterdam, J. D., Shelton, R. C., Young, P. R., Salomon, R. M., … Gallop, R. (2005). Cognitive therapy vs medications in the treatment of moderate to severe depression. Archives of General Psychiatry, 62(4), 409–416. https://doi.org/10.1001/archpsyc.62.4.409
Field, A. (2013). Discovering statistics using IBM SPSS Statistics (4th ed.). SAGE Publications.
Imbens, G. W., & Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142(2), 615–635. https://doi.org/10.1016/j.jeconom.2007.05.001
Imbens, G. W., & Rubin, D. B. (2015). Causal inference for statistics, social, and biomedical sciences: An introduction. Cambridge University Press.
Kirk, R. E. (2013). Experimental design: Procedures for the behavioral sciences (4th ed.). SAGE Publications.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, Article 863. https://doi.org/10.3389/fpsyg.2013.00863
Maxwell, S. E., Delaney, H. D., & Kelley, K. (2018). Designing experiments and analyzing data: A model comparison perspective (3rd ed.). Routledge.
National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. (1979). The Belmont Report: Ethical principles and guidelines for the protection of human subjects of research. U.S. Department of Health and Human Services.
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114
Polack, F. P., Thomas, S. J., Kitchin, N., Absalon, J., Gurtman, A., Lockhart, S., … Gruber, W. C. (2020). Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine. New England Journal of Medicine, 383(27), 2603–2615. https://doi.org/10.1056/NEJMoa2034577
Republic of the Philippines. (2012). Data Privacy Act of 2012 (Republic Act No. 10173). Official Gazette. https://www.officialgazette.gov.ph/2012/08/15/republic-act-no-10173/
Schultz, T. P. (2004). School subsidies for the poor: Evaluating the Mexican PROGRESA poverty program. Journal of Development Economics, 74(1), 199–250. https://doi.org/10.1016/j.jdeveco.2003.12.009
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin.
Tabachnick, B. G., & Fidell, L. S. (2019). Using multivariate statistics (7th ed.). Pearson.
Trochim, W. M. K. (2020). Research methods knowledge base (3rd ed.). Cengage Learning.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
World Health Organization. (2021). Guidance on research methods for health emergency and disaster risk management. WHO Press.