The Scatterplot in Statistical and Academic Practice
The scatterplot is a two-dimensional graphical display in which each observation in a dataset is represented as a single point positioned according to its values on two continuous quantitative variables. The horizontal axis, conventionally designated the X-axis, encodes one variable; the vertical axis, the Y-axis, encodes the other. The resulting cloud of points carries information about the direction, form, and strength of the relationship between the two variables in a way that no single numerical summary can fully capture.
The scatterplot's origins in scientific visualization predate formal statistics. John Herschel used a form of the device in the 1830s for astronomical data, and Francis Galton employed it systematically in the 1870s and 1880s when developing his theory of regression toward the mean, making the scatterplot the founding visual artifact of bivariate statistical analysis. Karl Pearson later formalised the mathematical measure that the scatterplot visualises, introducing the product-moment correlation coefficient in 1896 in the Philosophical Transactions of the Royal Society. The scatterplot and the correlation coefficient thus emerged together as a unified graphical and numerical apparatus for examining linear association.
What the Scatterplot Communicates
A well-constructed scatterplot communicates four distinct properties of a bivariate relationship: the direction of the association (positive or negative), its form (linear or curvilinear), its strength (how tightly the points cluster around a trend), and the presence of unusual observations such as outliers. None of these properties is captured completely by any single statistic.
Correlation Coefficients: A Conceptual and Mathematical Foundation
Two correlation coefficients are in universal use in academic research. Each encodes a different set of mathematical assumptions and is appropriate for different data structures.
| Property | Pearson r | Spearman rho |
|---|---|---|
| Measures | Linear association between two continuous variables | Monotonic association between two variables |
| Data requirement | Continuous, ideally bivariate normal | Ordinal or continuous; no distribution assumption |
| Sensitivity to outliers | High: a single outlier can strongly inflate or deflate r | Low: based on ranks, which compress extreme values |
| Range | -1 to +1 | -1 to +1 |
| Equals zero when | No linear relationship | No monotonic relationship |
| Significance test | t = r sqrt(n-2) / sqrt(1-r^2), df = n-2 | Same formula applied to rho |
| Effect size benchmarks | Cohen (1988): small = .10, medium = .30, large = .50 | Same benchmarks by convention |
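The conceptual distinction in the table can be made concrete in a short Python sketch: Pearson r is computed directly from deviations about the means, while Spearman rho is the same computation applied to ranks. The data values below are illustrative only; scipy.stats serves as a cross-check.

```python
import numpy as np
from scipy import stats

# Illustrative data (hypothetical values)
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 11.0])
y = np.array([1.0, 3.0, 2.0, 6.0, 9.0, 12.0])

# Pearson r: sample covariance over the product of sample SDs
dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Spearman rho: Pearson r computed on the ranks (valid with or without ties)
rx, ry = stats.rankdata(x), stats.rankdata(y)
dxr, dyr = rx - rx.mean(), ry - ry.mean()
rho = np.sum(dxr * dyr) / np.sqrt(np.sum(dxr**2) * np.sum(dyr**2))

# Cross-check against the library implementations
assert abs(r - stats.pearsonr(x, y)[0]) < 1e-12
assert abs(rho - stats.spearmanr(x, y)[0]) < 1e-12
```

Note that the two coefficients differ here even though the data are nearly monotonic: the ranks discard information about spacing, which is exactly what makes Spearman rho robust to extreme values.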
Formula Reference: Correlation and Regression
Core Formula Summary
| Statistic | Formula | Range | Notes |
|---|---|---|---|
| Pearson r | r = S_xy / (S_x * S_y) | -1 to +1 | S_xy = sum of (xi - x-bar)(yi - y-bar); S_x, S_y = sample SDs of X and Y |
| Spearman rho | rho = 1 - 6*sum(d_i^2) / (n*(n^2-1)) | -1 to +1 | d_i = rank(x_i) - rank(y_i). Exact when no tied ranks; otherwise use Pearson r on ranks. |
| Regression slope b1 | b1 = S_xy / S_x^2 | Any real | S_x^2 = sample variance of X. Change in predicted Y per unit increase in X. |
| Regression intercept b0 | b0 = y-bar - b1 * x-bar | Any real | Value of predicted Y when X equals zero. |
| R-squared | R^2 = r^2 = SS_reg / SS_tot | 0 to 1 | Proportion of variance in Y explained by the linear model. In simple regression, R^2 = r^2 exactly. |
| t-test for r | t = r * sqrt(n-2) / sqrt(1-r^2) | df = n-2 | Tests H0: rho = 0 in the population. Same formula applied to Spearman rho. |
| Fisher Z (CI for r) | Z = 0.5 * ln((1+r)/(1-r)) | Any real | SE(Z) = 1 / sqrt(n-3). Back-transform CI endpoints using r = (e^(2Z)-1)/(e^(2Z)+1). |
| Notation: n = sample size; x-bar, y-bar = sample means; S_xy = sample covariance (sum divided by n-1); S_x, S_y = sample standard deviations; d_i = rank difference for observation i; SS_reg = regression sum of squares; SS_tot = total sum of squares. | |||
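The slope, intercept, and R-squared entries in the table can be verified numerically. The sketch below (hypothetical data values) implements b1 = S_xy / S_x^2 and b0 = y-bar - b1 * x-bar, then confirms that R^2 computed from sums of squares equals r^2 exactly in simple regression.

```python
import numpy as np

# Illustrative data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.5, 4.0, 4.5, 6.0])

n = len(x)
xbar, ybar = x.mean(), y.mean()
s_xy = np.sum((x - xbar) * (y - ybar)) / (n - 1)   # sample covariance
s_x2 = np.sum((x - xbar) ** 2) / (n - 1)           # sample variance of X
s_y2 = np.sum((y - ybar) ** 2) / (n - 1)           # sample variance of Y

b1 = s_xy / s_x2                  # slope: change in predicted Y per unit X
b0 = ybar - b1 * xbar             # intercept
r = s_xy / np.sqrt(s_x2 * s_y2)   # Pearson r

# R^2 from the sums-of-squares definition
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - ybar) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# In simple linear regression, R^2 = r^2 exactly
assert abs(r_squared - r**2) < 1e-12
```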
Pearson r
t = r * √(n-2) / √(1-r²), df = n-2
95% CI via Fisher Z: Z = 0.5 * ln((1+r)/(1-r))
Assumes a linear relationship and is sensitive to outliers. For n = 30, r = 0.40: t = 0.40 * √28 / √(1-0.16) = 2.31, p = .029 (two-tailed).
Spearman rho
rho = 1 - 6 * ∑(d_i²) / (n * (n² - 1)) [no tied ranks], or: Pearson r computed on ranks [ties present]
t = rho * √(n-2) / √(1-rho²), df = n-2
Nonparametric. Measures monotonic (not necessarily linear) association. Robust to outliers. When tied values occur, the Pearson-on-ranks method handles them correctly.
Simple linear regression (OLS)
b1 = S_xy / S_x²
b0 = y-bar - b1 * x-bar
Y-hat = b0 + b1 * X
R² = r² [simple linear regression only]
Minimises the sum of squared vertical residuals (Y - Y-hat)². The regression line always passes through the point (x-bar, y-bar).
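The worked example above (n = 30, r = 0.40) can be reproduced directly; the sketch below computes the t statistic from the formula and the two-tailed p-value from the t-distribution with n - 2 degrees of freedom.

```python
import math
from scipy import stats

# Worked example: n = 30, r = 0.40
n, r = 30, 0.40
df = n - 2
t = r * math.sqrt(df) / math.sqrt(1 - r**2)   # t = 0.40 * sqrt(28) / sqrt(0.84)
p = 2 * stats.t.sf(abs(t), df)                # two-tailed p-value

print(round(t, 2), round(p, 3))   # approximately t = 2.31, p = .029
```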
Inferential Statistics for Correlation
The t-test for a correlation coefficient tests the null hypothesis that the population correlation rho equals zero. The test statistic follows the t-distribution with n minus 2 degrees of freedom under the null hypothesis and the assumption of bivariate normality. A significant result (p below the chosen alpha level) leads to rejection of the null hypothesis and the conclusion that a linear relationship exists in the population from which the sample was drawn. Statistical significance does not establish the practical importance of the relationship; effect size (|r|) and confidence intervals are required for that purpose.
Cohen's (1988) benchmarks for |r|:
.10 ≤ |r| < .30: small effect
.30 ≤ |r| < .50: medium effect
|r| ≥ .50: large effect
R² interpretation: proportion of variance in Y explained by X.
r = .30 explains R² = 9% of variance; r = .50 explains 25%; r = .70 explains 49%.
Z_r = 0.5 * ln((1+r)/(1-r))
SE(Z_r) = 1 / √(n-3)
Z_lo = Z_r - 1.96 * SE(Z_r); Z_hi = Z_r + 1.96 * SE(Z_r)
r_lo = (e^(2*Z_lo) - 1) / (e^(2*Z_lo) + 1); r_hi = (e^(2*Z_hi) - 1) / (e^(2*Z_hi) + 1)
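As a numerical sketch of the Fisher Z interval, the code below uses the same example values as earlier in this section (r = 0.40, n = 30); the transformation and its inverse are the hyperbolic artanh/tanh pair.

```python
import math

# Example values from earlier in this section: r = 0.40, n = 30
r, n = 0.40, 30

z = 0.5 * math.log((1 + r) / (1 - r))   # Fisher transformation (artanh of r)
se = 1.0 / math.sqrt(n - 3)             # standard error of Z

z_lo, z_hi = z - 1.96 * se, z + 1.96 * se

def inverse_fisher(zv):
    # Back-transform: r = (e^(2Z) - 1) / (e^(2Z) + 1), i.e. tanh(Z)
    return (math.exp(2 * zv) - 1) / (math.exp(2 * zv) + 1)

r_lo, r_hi = inverse_fisher(z_lo), inverse_fisher(z_hi)
```

Because the back-transformation is nonlinear, the resulting interval is not symmetric around r = 0.40, which is the expected behaviour for correlation confidence intervals.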
SS_res = ∑(y_i - Y-hat_i)²
SS_reg = SS_tot - SS_res
R² = SS_reg / SS_tot = 1 - SS_res/SS_tot
Standard error of estimate: SE_est = √(SS_res / (n-2)). Measures average prediction error in Y units.
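The sums-of-squares decomposition can be checked directly: fitting OLS by the closed-form formulas, computing SS_reg from the fitted values, and confirming that SS_tot = SS_reg + SS_res holds exactly. Data values are hypothetical.

```python
import numpy as np

# Illustrative data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.5, 3.0, 2.5, 5.0, 4.5, 6.5])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
ss_reg = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares

# Standard error of estimate: average prediction error in Y units
se_est = np.sqrt(ss_res / (len(x) - 2))

# The OLS decomposition SS_tot = SS_reg + SS_res holds exactly
assert abs(ss_tot - (ss_reg + ss_res)) < 1e-9
```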
1. Linearity: the relationship between X and Y is linear
2. Independence: observations are independent
3. Homoscedasticity: variance of Y is constant across X
4. Normality: Y is approximately normal at each X (for inference)
5. No extreme influential observations distorting estimates
Verify conditions 1, 3, and 5 by visual inspection of the scatterplot and residual plot. Pearson r is robust to moderate violations of normality for n above 30.
Design Requirements for Academic Publication
The APA Publication Manual (7th edition, 2020) specifies that figures displaying bivariate relationships must include clearly labelled axes with the variable name and unit of measurement. Data points must be clearly visible and distinguishable from the regression line. The figure caption must state the sample size, identify the correlation coefficient and its significance level, and specify whether the regression line is displayed. When multiple groups are plotted on the same scatterplot, distinct point shapes or fills must be used and a legend provided.
- Label. Figure number (e.g., Figure 1) in bold above the image; figure title in italic title case on the line below the number.
- Axes. Both axes must be labelled with the variable name and unit of measurement in parentheses.
- Caption. States n, r (or rho), p-value, and whether the regression line is included. Ends with a period.
- Zero baseline. Required only when zero is a meaningful value on the scale. For scales with no natural zero (e.g., standardised scores), axes may begin at a non-zero value.
- Regression line. If included, state the equation and R-squared in the caption or figure note.
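The axis-label, point-marker, and legend requirements above can be sketched in matplotlib. The variable names, units, and data values below are hypothetical placeholders, and the figure number, title, and caption would normally be supplied in the manuscript rather than drawn inside the image.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for two groups (illustrative only)
rng = np.random.default_rng(42)
x_a = rng.uniform(0, 10, 25)
y_a = 2.0 + 0.8 * x_a + rng.normal(0, 1.5, 25)
x_b = rng.uniform(0, 10, 25)
y_b = 1.0 + 0.5 * x_b + rng.normal(0, 1.5, 25)

fig, ax = plt.subplots()

# Distinct point shapes per group, with a legend (required for multiple groups)
ax.scatter(x_a, y_a, marker="o", facecolors="none", edgecolors="black", label="Group A")
ax.scatter(x_b, y_b, marker="s", facecolors="black", label="Group B")

# Axis labels carry the variable name and unit of measurement
ax.set_xlabel("Study time (hours per week)")
ax.set_ylabel("Exam score (points)")
ax.legend(frameon=False)

fig.savefig("figure1.png", dpi=300)
```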
Conditions for Appropriate Use
Selected Methodological Questions
How many observations are needed for reliable correlation estimation?
The power of the significance test for Pearson r depends on the sample size, the true population correlation, and the chosen alpha level. For detecting a medium effect (rho = .30) with 80% power at alpha = .05 (two-tailed), a minimum of 84 observations is required by Cohen's (1988) power tables. For a large effect (rho = .50), 28 observations suffice. For a small effect (rho = .10), 782 observations are required. Researchers should conduct a priori power analysis before data collection rather than interpreting post-hoc significance as confirmation of adequate sample size.
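The sample sizes quoted above can be approximated without power tables by using the Fisher Z normal approximation, n = ((z_alpha + z_beta) / Z_rho)^2 + 3. This is a sketch, not a replacement for Cohen's tables: the approximation lands within a couple of observations of the tabled values (84, 28, 782), and the two normal quantiles are hard-coded for alpha = .05 two-tailed and 80% power.

```python
import math

def n_required(rho, z_alpha=1.959964, z_beta=0.841621):
    """Approximate n needed to detect population correlation rho
    (two-tailed alpha = .05, power = .80) via the Fisher Z approximation."""
    z_rho = 0.5 * math.log((1 + rho) / (1 - rho))   # Fisher transform of rho
    return math.ceil(((z_alpha + z_beta) / z_rho) ** 2 + 3)

for effect in (0.10, 0.30, 0.50):
    print(effect, n_required(effect))
```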
When should Spearman rho be reported instead of Pearson r?
Spearman rho is preferred when the data contain extreme outliers that would distort Pearson r, when one or both variables are measured on an ordinal scale, when the scatterplot reveals a monotonic but clearly nonlinear relationship, or when the assumption of bivariate normality is substantially violated in a small sample. For large samples (n above 100) without extreme outliers, Pearson r and Spearman rho typically converge to similar values and the choice between them is largely a matter of the substantive research question. Many researchers report both as a robustness check.
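The outlier sensitivity described above is easy to demonstrate. In the sketch below (hypothetical values), a single extreme point added to ten weakly associated observations drags Pearson r to nearly 1, while the rank-based Spearman rho moves far less.

```python
import numpy as np
from scipy import stats

# Ten observations with weak association (hypothetical values)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], dtype=float)

r_before = stats.pearsonr(x, y)[0]
rho_before = stats.spearmanr(x, y)[0]

# Add a single extreme outlier at (100, 100)
x_out = np.append(x, 100.0)
y_out = np.append(y, 100.0)

r_after = stats.pearsonr(x_out, y_out)[0]
rho_after = stats.spearmanr(x_out, y_out)[0]

# One point dominates the deviation products, so Pearson r jumps to ~.99;
# in the ranks the outlier is just the 11th-largest value, so rho moves modestly
print(r_before, r_after, rho_before, rho_after)
```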
What does it mean for a regression line to pass through the centroid?
The OLS regression line always passes through the point whose coordinates are the sample means of X and Y, called the centroid or center of gravity of the data. This is a mathematical consequence of the OLS normal equations, not a modelling choice. It means that when X equals its sample mean, the regression equation predicts Y equal to its sample mean. Researchers can use this property to verify the regression equation: substituting x-bar into the regression equation should yield y-bar to within rounding error.
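The verification procedure described above takes two lines of code: fit the line by the closed-form formulas and substitute x-bar. Any dataset works; the values below are illustrative.

```python
import numpy as np

# Illustrative data (any values work; the centroid property is algebraic)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 2.0, 2.0, 4.0, 6.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Substituting x-bar into the fitted equation recovers y-bar exactly,
# a direct consequence of how b0 is defined by the OLS normal equations
assert abs((b0 + b1 * x.mean()) - y.mean()) < 1e-12
```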