The Scatterplot in Statistical and Academic Practice

The scatterplot is a two-dimensional graphical display in which each observation in a dataset is represented as a single point positioned according to its values on two continuous quantitative variables. The horizontal axis, conventionally designated the X-axis, encodes one variable; the vertical axis, the Y-axis, encodes the other. The resulting cloud of points carries information about the direction, form, and strength of the relationship between the two variables in a way that no single numerical summary can fully capture.

The scatterplot's origins in scientific visualization predate formal statistics. John Herschel used a form of the device in the 1830s for astronomical data, and Francis Galton employed it systematically in the 1870s and 1880s when developing his theory of regression toward the mean, making the scatterplot the founding visual artifact of bivariate statistical analysis. Karl Pearson later formalised the mathematical measure that the scatterplot visualises, introducing the product-moment correlation coefficient in 1896 in the Philosophical Transactions of the Royal Society. The scatterplot and the correlation coefficient thus emerged together as a unified graphical and numerical apparatus for examining linear association.

Primacy of the Scatterplot

Before computing any correlation coefficient or fitting any regression line, the scatterplot must be examined. Anscombe's Quartet (1973) demonstrated with four artificial datasets that identical Pearson r values, means, variances, and regression lines can arise from radically different underlying distributions: one linear, one curved, one with an outlier distorting an otherwise perfect line, and one with all points on a vertical line except one. The numerical summaries conceal what the scatterplot reveals immediately. No correlation analysis is complete without a scatterplot.
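The quartet can be verified directly. A minimal standard-library sketch, using Anscombe's original published data values, computes Pearson r for all four datasets:

```python
import math

# Anscombe's (1973) quartet: datasets I-III share the same X values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def pearson_r(x, y):
    """Product-moment correlation: covariance over the product of SDs."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    sxx = sum((a - xb) ** 2 for a in x)
    syy = sum((b - yb) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    print(round(pearson_r(x, y), 2))   # each of the four prints 0.82
```

Despite the near-identical correlations, only the first dataset is appropriately summarised by Pearson r, which is exactly the point of plotting first.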

What the Scatterplot Communicates

A well-constructed scatterplot communicates four distinct properties of a bivariate relationship, none of which is captured completely by any single statistic.

Direction
A positive relationship appears as a cloud of points that rises from lower-left to upper-right: as X increases, Y tends to increase. A negative relationship runs from upper-left to lower-right: as X increases, Y tends to decrease. A horizontal or circular cloud indicates no systematic directional relationship. Direction corresponds to the sign of the correlation coefficient.
Form
The form of the relationship describes whether the trend is linear (points scattered around a straight line), curvilinear (points scattered around a curve), or patterned in some other way. Pearson r and OLS regression assume and detect only linear form. A strong curvilinear relationship may produce a near-zero Pearson r because the linear component averages to zero, even though a strong systematic relationship clearly exists in the scatterplot.
Strength
The strength of the relationship is conveyed by the tightness of the point cloud around the implied trend line. Points closely clustered around the line indicate a strong relationship. Points scattered widely around the line indicate a weak relationship. Strength corresponds to the absolute value of the correlation coefficient and to R-squared in the regression context.
Outliers and Influential Observations
Outliers appear as points well separated from the main cloud. In bivariate analysis, the relevant concept is the influential observation: a point whose removal would substantially change the regression line or correlation coefficient. High-leverage points are extreme on X; a point becomes influential when high leverage combines with a large residual, so that it pulls the fitted line toward itself. These points are invisible in summary statistics but immediately apparent in the scatterplot.

Correlation Coefficients: A Conceptual and Mathematical Foundation

Two correlation coefficients are in universal use in academic research. Each encodes a different set of mathematical assumptions and is appropriate for different data structures.

Property                | Pearson r                                                | Spearman rho
Measures                | Linear association between two continuous variables      | Monotonic association between two variables
Data requirement        | Continuous, ideally bivariate normal                     | Ordinal or continuous; no distribution assumption
Sensitivity to outliers | High: a single outlier can strongly inflate or deflate r | Low: based on ranks, which compress extreme values
Range                   | -1 to +1                                                 | -1 to +1
Equals zero when        | No linear relationship                                   | No monotonic relationship
Significance test       | t = r * sqrt(n-2) / sqrt(1-r^2), df = n-2                | Same formula applied to rho
Effect size benchmarks  | Cohen (1988): small = .10, medium = .30, large = .50     | Same benchmarks by convention
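The outlier-sensitivity contrast in the table can be demonstrated numerically. A minimal sketch on invented data, using the d_i^2 formula for Spearman rho (valid here because there are no tied values):

```python
import math

def pearson_r(x, y):
    # Product-moment correlation from deviation sums
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - xb) ** 2 for a in x) *
                           sum((b - yb) ** 2 for b in y))

def spearman_no_ties(x, y):
    # Shortcut formula; exact only when there are no tied values
    n = len(x)
    rx = {v: i + 1 for i, v in enumerate(sorted(x))}
    ry = {v: i + 1 for i, v in enumerate(sorted(y))}
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 4, 6, 8, 10, 12, 14, 16, 180]  # last Y is an extreme outlier

print(round(pearson_r(x, y), 2))       # 0.61: outlier drags r down from 1.0
print(round(spearman_no_ties(x, y), 2))  # 1.0: ranks remain perfectly monotonic
```

Without the outlier the data lie exactly on y = 2x and both coefficients equal 1.0; a single distorted value moves Pearson r to about .61 while Spearman rho is unchanged.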

Formula Reference: Correlation and Regression

Core Formula Summary

Statistic               | Formula                               | Range    | Notes
Pearson r               | r = S_xy / (S_x * S_y)                | -1 to +1 | S_xy = sample covariance of X and Y; S_x, S_y = sample SDs of X and Y
Spearman rho            | rho = 1 - 6*sum(d_i^2) / (n*(n^2-1)) | -1 to +1 | d_i = rank(x_i) - rank(y_i). Exact when no tied ranks; otherwise use Pearson r on ranks.
Regression slope b1     | b1 = S_xy / S_x^2                     | Any real | S_x^2 = sample variance of X. Change in predicted Y per unit increase in X.
Regression intercept b0 | b0 = y-bar - b1 * x-bar               | Any real | Value of predicted Y when X equals zero.
R-squared               | R^2 = r^2 = SS_reg / SS_tot           | 0 to 1   | Proportion of variance in Y explained by the linear model. In simple regression, R^2 = r^2 exactly.
t-test for r            | t = r * sqrt(n-2) / sqrt(1-r^2)       | df = n-2 | Tests H0: rho = 0 in the population. Same formula applied to Spearman rho.
Fisher Z (CI for r)     | Z = 0.5 * ln((1+r)/(1-r))             | Any real | SE(Z) = 1 / sqrt(n-3). Back-transform CI endpoints using r = (e^(2Z)-1)/(e^(2Z)+1).
Notation: n = sample size; x-bar, y-bar = sample means; S_xy = sample covariance (sum divided by n-1); S_x, S_y = sample standard deviations; d_i = rank difference for observation i; SS_reg = regression sum of squares; SS_tot = total sum of squares.

Pearson r
Product-Moment Correlation
r = ∑[(x_i - x-bar)(y_i - y-bar)] / [(n-1) * S_x * S_y]

t = r * √(n-2) / √(1-r²), df = n-2
95% CI via Fisher Z: Z = 0.5 * ln((1+r)/(1-r))

Assumes a linear relationship and is sensitive to outliers. For n = 30, r = 0.40: t = 0.40 * √28 / √(1-0.16) = 2.31, p = .029 (two-tailed).
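The worked example can be checked in a few lines. This sketch computes only the t statistic; the p-value requires a t-distribution CDF, which is not in the Python standard library:

```python
import math

# Worked example from the text: n = 30, r = 0.40
n, r = 30, 0.40
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
df = n - 2
print(round(t, 2), df)   # 2.31 with df = 28
```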

Spearman rho
Rank-Order Correlation
rho = 1 - 6∑d_i² / (n(n²-1)) [no ties]
or: Pearson r computed on ranks [ties present]

t = rho * √(n-2) / √(1-rho²), df = n-2

Nonparametric. Measures monotonic (not necessarily linear) association. Robust to outliers. When tied values are present, compute Pearson r on the midranks rather than relying on the d_i^2 shortcut formula, which is exact only without ties.
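The Pearson-on-ranks approach with midranks can be sketched as follows, on a small invented dataset that deliberately contains ties, alongside the shortcut formula to show the two diverge slightly when ties exist:

```python
import math

def pearson_r(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - xb) ** 2 for a in x) *
                           sum((b - yb) ** 2 for b in y))

def midranks(v):
    # 1-based ranks; tied values share the average of their positions
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

x = [1, 2, 2, 3, 4]   # tied X values (invented data)
y = [1, 3, 2, 3, 5]   # tied Y values

rho = pearson_r(midranks(x), midranks(y))   # exact with or without ties
d2 = sum((a - b) ** 2 for a, b in zip(midranks(x), midranks(y)))
shortcut = 1 - 6 * d2 / (len(x) * (len(x) ** 2 - 1))  # approximation under ties
print(round(rho, 3), round(shortcut, 3))   # 0.921 vs 0.925
```

The discrepancy (0.921 versus 0.925) is small here but grows with the number of ties, which is why the rank-based Pearson computation is preferred.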

OLS Regression
Ordinary Least Squares
b1 = ∑(x_i - x-bar)(y_i - y-bar) / ∑(x_i - x-bar)²
b0 = y-bar - b1 * x-bar
Y-hat = b0 + b1 * X

R² = r² [simple linear regression only]

Minimises the sum of squared vertical residuals (Y - Y-hat)^2. The regression line always passes through the point (x-bar, y-bar).
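The slope and intercept formulas, and the centroid property, can be verified with a short sketch on invented data:

```python
def ols(x, y):
    # b1 = covariance sum / X deviation sum of squares; b0 from the means
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = (sum((a - xb) * (b - yb) for a, b in zip(x, y)) /
          sum((a - xb) ** 2 for a in x))
    b0 = yb - b1 * xb
    return b0, b1

x = [1, 2, 3, 4, 5]           # invented illustrative data
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = ols(x, y)

# The fitted line always passes through the centroid (x-bar, y-bar):
xb, yb = sum(x) / len(x), sum(y) / len(y)
assert abs((b0 + b1 * xb) - yb) < 1e-9
print(round(b1, 2), round(b0, 2))   # 1.99 and 0.05
```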

Inferential Statistics for Correlation

The t-test for a correlation coefficient tests the null hypothesis that the population correlation rho equals zero. The test statistic follows the t-distribution with n minus 2 degrees of freedom under the null hypothesis and the assumption of bivariate normality. A significant result (p below the chosen alpha level) leads to rejection of the null hypothesis and the conclusion that a linear relationship exists in the population from which the sample was drawn. Statistical significance does not establish the practical importance of the relationship; effect size (|r|) and confidence intervals are required for that purpose.

Effect Size Benchmarks for Correlation (Cohen, 1988)
|r| < .10: negligible effect
.10 ≤ |r| < .30: small effect
.30 ≤ |r| < .50: medium effect
|r| ≥ .50: large effect

R² interpretation: proportion of variance in Y explained by X.
r = .30 explains R² = 9% of variance; r = .50 explains 25%; r = .70 explains 49%.

95% Confidence Interval for Pearson r (Fisher Z Method)
Step 1: Transform r to Z
Z_r = 0.5 * ln((1+r)/(1-r))
SE(Z_r) = 1 / √(n-3)
Step 2: CI on Z scale
Z_lo = Z_r - 1.96 * SE(Z_r)
Z_hi = Z_r + 1.96 * SE(Z_r)
Step 3: Back-transform to r scale
r_lo = (e^(2*Z_lo) - 1) / (e^(2*Z_lo) + 1)
r_hi = (e^(2*Z_hi) - 1) / (e^(2*Z_hi) + 1)
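The three steps translate directly into code. Applied to the n = 30, r = .40 example used earlier, the interval excludes zero, consistent with the significant t-test:

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    # Step 1: transform r to Z; the standard error depends only on n
    z = 0.5 * math.log((1 + r) / (1 - r))
    se = 1 / math.sqrt(n - 3)
    # Step 2: confidence limits on the Z scale
    z_lo, z_hi = z - z_crit * se, z + z_crit * se
    # Step 3: back-transform each endpoint to the r scale
    back = lambda zz: (math.exp(2 * zz) - 1) / (math.exp(2 * zz) + 1)
    return back(z_lo), back(z_hi)

lo, hi = fisher_ci(0.40, 30)
print(round(lo, 2), round(hi, 2))   # 0.05 and 0.66
```

Note how asymmetric the interval is around r = .40: the Z transformation corrects for the bounded, skewed sampling distribution of r.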

Regression Residuals and Model Fit
Residual for Observation i
e_i = y_i - Y-hat_i = y_i - (b0 + b1 * x_i)
Sum of Squares Decomposition
SS_tot = ∑(y_i - y-bar)²
SS_res = ∑(y_i - Y-hat_i)²
SS_reg = SS_tot - SS_res
R² = SS_reg / SS_tot = 1 - SS_res/SS_tot
Standard Error of the Estimate
S_e = √(SS_res / (n-2))
Measures average prediction error in Y units.
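The full decomposition can be sketched end to end on a small invented dataset, fitting the line, splitting the sums of squares, and recovering R-squared and the standard error of the estimate:

```python
import math

x = [1, 2, 3, 4, 5]            # invented illustrative data
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xb, yb = sum(x) / n, sum(y) / n

# OLS fit
b1 = (sum((a - xb) * (b - yb) for a, b in zip(x, y)) /
      sum((a - xb) ** 2 for a in x))
b0 = yb - b1 * xb
yhat = [b0 + b1 * a for a in x]

# Sum of squares decomposition: SS_tot = SS_reg + SS_res
ss_tot = sum((b - yb) ** 2 for b in y)
ss_res = sum((b - h) ** 2 for b, h in zip(y, yhat))
ss_reg = ss_tot - ss_res
r2 = ss_reg / ss_tot                 # equals 1 - ss_res / ss_tot
se = math.sqrt(ss_res / (n - 2))     # prediction error in Y units
print(round(r2, 3), round(se, 3))    # 0.998 and 0.174
```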

Assumptions of Pearson r and OLS Regression
Required Conditions
1. Linearity: the true relationship is linear
2. Independence: observations are independent
3. Homoscedasticity: variance of Y is constant across X
4. Normality: Y is approximately normal at each X (for inference)
5. No extreme influential observations distorting estimates

Verify conditions 1, 3, and 5 by visual inspection of the scatterplot and residual plot. Pearson r is robust to moderate violations of normality for n above 30.

Design Requirements for Academic Publication

The APA Publication Manual (7th edition, 2020) specifies that figures displaying bivariate relationships must include clearly labelled axes with the variable name and unit of measurement. Data points must be clearly visible and distinguishable from the regression line. The figure caption must state the sample size, identify the correlation coefficient and its significance level, and specify whether the regression line is displayed. When multiple groups are plotted on the same scatterplot, distinct point shapes or fills must be used and a legend provided.

APA 7th Edition Requirements for Scatterplots
  1. Label. Figure 1 in bold below the image; figure title in italic title case on the next line.
  2. Axes. Both axes must be labelled with the variable name and unit of measurement in parentheses.
  3. Caption. States n, r (or rho), p-value, and whether the regression line is included. Ends with a period.
  4. Zero baseline. Required only when zero is a meaningful value on the scale. For scales with no natural zero (e.g., standardised scores), axes may begin at a non-zero value.
  5. Regression line. If included, state the equation and R-squared in the caption or figure note.

Conditions for Appropriate Use

Appropriate Conditions
Both variables are continuous or at minimum ordinal with many levels. The research question concerns the association or predictive relationship between the two variables. The sample contains at least 10 paired observations; for reliable inference, at least 30 pairs are recommended. Each observation contributes one point independently of all other observations.
When Pearson r Is Inappropriate
The relationship is clearly nonlinear in the scatterplot. One or both variables are binary or have very few distinct values. The data contain extreme outliers on one axis that distort the correlation. Variables are measured on a nominal scale. In these cases, consider Spearman rho, Kendall tau, the point-biserial correlation, or eta-squared depending on the data structure.
Correlation Is Not Causation
A statistically significant Pearson r or regression slope does not establish a causal relationship between X and Y. Both variables may be caused by a third confounding variable; the temporal order may be reversed from what the researcher assumes; the relationship may be spurious. Causal conclusions require experimental or quasi-experimental designs with appropriate controls, not correlation analysis alone.

Selected Methodological Questions

How many observations are needed for reliable correlation estimation?

The power of the significance test for Pearson r depends on the sample size, the true population correlation, and the chosen alpha level. For detecting a medium effect (rho = .30) with 80% power at alpha = .05 (two-tailed), a minimum of 84 observations is required by Cohen's (1988) power tables. For a large effect (rho = .50), 28 observations suffice. For a small effect (rho = .10), 782 observations are required. Researchers should conduct a priori power analysis before data collection rather than interpreting post-hoc significance as confirmation of adequate sample size.

When should Spearman rho be reported instead of Pearson r?

Spearman rho is preferred when the data contain extreme outliers that would distort Pearson r, when one or both variables are measured on an ordinal scale, when the scatterplot reveals a monotonic but clearly nonlinear relationship, or when the assumption of bivariate normality is substantially violated in a small sample. For large samples (n above 100) without extreme outliers, Pearson r and Spearman rho typically converge to similar values and the choice between them is largely a matter of the substantive research question. Many researchers report both as a robustness check.

What does it mean for a regression line to pass through the centroid?

The OLS regression line always passes through the point whose coordinates are the sample means of X and Y, called the centroid or center of gravity of the data. This is a mathematical consequence of the OLS normal equations, not a modelling choice. It means that when X equals its sample mean, the regression equation predicts Y equal to its sample mean. Researchers can use this property to verify the regression equation: substituting x-bar into the regression equation should yield y-bar to within rounding error.