The Scatterplot in Statistical and Academic Practice
The scatterplot is a two-dimensional graphical display in which each observation in a dataset is represented as a single point positioned according to its values on two continuous quantitative variables. The horizontal axis, conventionally designated the X-axis, encodes one variable; the vertical axis, the Y-axis, encodes the other. The resulting cloud of points carries information about the direction, form, and strength of the relationship between the two variables in a way that no single numerical summary can fully capture.
The scatterplot's origins in scientific visualization predate formal statistics. John Herschel used a form of the device in the 1830s for astronomical data, and Francis Galton employed it systematically in the 1870s and 1880s when developing his theory of regression toward the mean, making the scatterplot the founding visual artifact of bivariate statistical analysis. Karl Pearson later formalised the mathematical measure that the scatterplot visualises, introducing the product-moment correlation coefficient in 1896 in the Philosophical Transactions of the Royal Society. The scatterplot and the correlation coefficient thus emerged together as a unified graphical and numerical apparatus for examining linear association.
What the Scatterplot Communicates
A well-constructed scatterplot communicates four distinct properties of a bivariate relationship: the direction of the association (positive or negative), its form (linear or curvilinear), its strength (how tightly the points cluster around a trend), and the presence of unusual observations such as outliers. None of these properties is captured completely by any single statistic.
Correlation Coefficients: A Conceptual and Mathematical Foundation
Two correlation coefficients are in universal use in academic research. Each encodes a different set of mathematical assumptions and is appropriate for different data structures.
| Property | Pearson r | Spearman rho |
|---|---|---|
| Measures | Linear association between two continuous variables | Monotonic association between two variables |
| Data requirement | Continuous, ideally bivariate normal | Ordinal or continuous; no distribution assumption |
| Sensitivity to outliers | High: a single outlier can strongly inflate or deflate r | Low: based on ranks, which compress extreme values |
| Range | -1 to +1 | -1 to +1 |
| Equals zero when | No linear relationship | No monotonic relationship |
| Significance test | t = r sqrt(n-2) / sqrt(1-r^2), df = n-2 | Same formula applied to rho |
| Effect size benchmarks | Cohen (1988): small = .10, medium = .30, large = .50 | Same benchmarks by convention |
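The conceptual distinction in the table can be made concrete in a short Python sketch: Pearson r is computed directly from deviations about the means, while Spearman rho is the same computation applied to ranks. The data values below are illustrative only; scipy.stats serves as a cross-check.

```python
import numpy as np
from scipy import stats

# Illustrative data (hypothetical values)
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 11.0])
y = np.array([1.0, 3.0, 2.0, 6.0, 9.0, 12.0])

# Pearson r: sample covariance over the product of sample SDs
dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Spearman rho: Pearson r computed on the ranks (valid with or without ties)
rx, ry = stats.rankdata(x), stats.rankdata(y)
dxr, dyr = rx - rx.mean(), ry - ry.mean()
rho = np.sum(dxr * dyr) / np.sqrt(np.sum(dxr**2) * np.sum(dyr**2))

# Cross-check against the library implementations
assert abs(r - stats.pearsonr(x, y)[0]) < 1e-12
assert abs(rho - stats.spearmanr(x, y)[0]) < 1e-12
```

Note that the two coefficients differ here even though the data are nearly monotonic: the ranks discard information about spacing, which is exactly what makes Spearman rho robust to extreme values.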
Formula Reference: Correlation and Regression
Core Formula Summary
| Statistic | Formula | Range | Notes |
|---|---|---|---|
| Pearson r | r = S_xy / (S_x * S_y) | -1 to +1 | S_xy = sum of (xi - x-bar)(yi - y-bar); S_x, S_y = sample SDs of X and Y |
| Spearman rho | rho = 1 - 6*sum(d_i^2) / (n*(n^2-1)) | -1 to +1 | d_i = rank(x_i) - rank(y_i). Exact when no tied ranks; otherwise use Pearson r on ranks. |
| Regression slope b1 | b1 = S_xy / S_x^2 | Any real | S_x^2 = sample variance of X. Change in predicted Y per unit increase in X. |
| Regression intercept b0 | b0 = y-bar - b1 * x-bar | Any real | Value of predicted Y when X equals zero. |
| R-squared | R^2 = r^2 = SS_reg / SS_tot | 0 to 1 | Proportion of variance in Y explained by the linear model. In simple regression, R^2 = r^2 exactly. |
| t-test for r | t = r * sqrt(n-2) / sqrt(1-r^2) | df = n-2 | Tests H0: rho = 0 in the population. Same formula applied to Spearman rho. |
| Fisher Z (CI for r) | Z = 0.5 * ln((1+r)/(1-r)) | Any real | SE(Z) = 1 / sqrt(n-3). Back-transform CI endpoints using r = (e^(2Z)-1)/(e^(2Z)+1). |
| Notation: n = sample size; x-bar, y-bar = sample means; S_xy = sample covariance (sum divided by n-1); S_x, S_y = sample standard deviations; d_i = rank difference for observation i; SS_reg = regression sum of squares; SS_tot = total sum of squares. | |||
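The slope, intercept, and R-squared entries in the table can be verified numerically. The sketch below (hypothetical data values) implements b1 = S_xy / S_x^2 and b0 = y-bar - b1 * x-bar, then confirms that R^2 computed from sums of squares equals r^2 exactly in simple regression.

```python
import numpy as np

# Illustrative data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.5, 4.0, 4.5, 6.0])

n = len(x)
xbar, ybar = x.mean(), y.mean()
s_xy = np.sum((x - xbar) * (y - ybar)) / (n - 1)   # sample covariance
s_x2 = np.sum((x - xbar) ** 2) / (n - 1)           # sample variance of X
s_y2 = np.sum((y - ybar) ** 2) / (n - 1)           # sample variance of Y

b1 = s_xy / s_x2                  # slope: change in predicted Y per unit X
b0 = ybar - b1 * xbar             # intercept
r = s_xy / np.sqrt(s_x2 * s_y2)   # Pearson r

# R^2 from the sums-of-squares definition
y_hat = b0 + b1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - ybar) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# In simple linear regression, R^2 = r^2 exactly
assert abs(r_squared - r**2) < 1e-12
```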
Pearson r
t = r * √(n-2) / √(1-r²), df = n-2
95% CI via Fisher Z: Z = 0.5 * ln((1+r)/(1-r))
Assumes a linear relationship and is sensitive to outliers. For n = 30, r = 0.40: t = 0.40 * √28 / √(1-0.16) = 2.31, p = .029 (two-tailed).
Spearman rho
rho = 1 - 6 * ∑(d_i²) / (n * (n² - 1)) [no tied ranks], or: Pearson r computed on ranks [ties present]
t = rho * √(n-2) / √(1-rho²), df = n-2
Nonparametric. Measures monotonic (not necessarily linear) association. Robust to outliers. When tied values occur, the Pearson-on-ranks method handles them correctly.
Simple linear regression (OLS)
b1 = S_xy / S_x²
b0 = y-bar - b1 * x-bar
Y-hat = b0 + b1 * X
R² = r² [simple linear regression only]
Minimises the sum of squared vertical residuals (Y - Y-hat)². The regression line always passes through the point (x-bar, y-bar).
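The worked example above (n = 30, r = 0.40) can be reproduced directly; the sketch below computes the t statistic from the formula and the two-tailed p-value from the t-distribution with n - 2 degrees of freedom.

```python
import math
from scipy import stats

# Worked example: n = 30, r = 0.40
n, r = 30, 0.40
df = n - 2
t = r * math.sqrt(df) / math.sqrt(1 - r**2)   # t = 0.40 * sqrt(28) / sqrt(0.84)
p = 2 * stats.t.sf(abs(t), df)                # two-tailed p-value

print(round(t, 2), round(p, 3))   # approximately t = 2.31, p = .029
```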
Inferential Statistics for Correlation
The t-test for a correlation coefficient tests the null hypothesis that the population correlation rho equals zero. The test statistic follows the t-distribution with n minus 2 degrees of freedom under the null hypothesis and the assumption of bivariate normality. A significant result (p below the chosen alpha level) leads to rejection of the null hypothesis and the conclusion that a linear relationship exists in the population from which the sample was drawn. Statistical significance does not establish the practical importance of the relationship; effect size (|r|) and confidence intervals are required for that purpose.
Cohen's (1988) benchmarks for |r|:
.10 ≤ |r| < .30: small effect
.30 ≤ |r| < .50: medium effect
|r| ≥ .50: large effect
R² interpretation: proportion of variance in Y explained by X.
r = .30 explains R² = 9% of variance; r = .50 explains 25%; r = .70 explains 49%.
Z_r = 0.5 * ln((1+r)/(1-r))
SE(Z_r) = 1 / √(n-3)
Z_lo = Z_r - 1.96 * SE(Z_r); Z_hi = Z_r + 1.96 * SE(Z_r)
r_lo = (e^(2*Z_lo) - 1) / (e^(2*Z_lo) + 1); r_hi = (e^(2*Z_hi) - 1) / (e^(2*Z_hi) + 1)
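As a numerical sketch of the Fisher Z interval, the code below uses the same example values as earlier in this section (r = 0.40, n = 30); the transformation and its inverse are the hyperbolic artanh/tanh pair.

```python
import math

# Example values from earlier in this section: r = 0.40, n = 30
r, n = 0.40, 30

z = 0.5 * math.log((1 + r) / (1 - r))   # Fisher transformation (artanh of r)
se = 1.0 / math.sqrt(n - 3)             # standard error of Z

z_lo, z_hi = z - 1.96 * se, z + 1.96 * se

def inverse_fisher(zv):
    # Back-transform: r = (e^(2Z) - 1) / (e^(2Z) + 1), i.e. tanh(Z)
    return (math.exp(2 * zv) - 1) / (math.exp(2 * zv) + 1)

r_lo, r_hi = inverse_fisher(z_lo), inverse_fisher(z_hi)
```

Because the back-transformation is nonlinear, the resulting interval is not symmetric around r = 0.40, which is the expected behaviour for correlation confidence intervals.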
SS_res = ∑(y_i - Y-hat_i)²
SS_reg = SS_tot - SS_res
R² = SS_reg / SS_tot = 1 - SS_res/SS_tot
Standard error of estimate: SE_est = √(SS_res / (n-2)). Measures average prediction error in Y units.
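The sums-of-squares decomposition can be checked directly: fitting OLS by the closed-form formulas, computing SS_reg from the fitted values, and confirming that SS_tot = SS_reg + SS_res holds exactly. Data values are hypothetical.

```python
import numpy as np

# Illustrative data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.5, 3.0, 2.5, 5.0, 4.5, 6.5])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
ss_reg = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares

# Standard error of estimate: average prediction error in Y units
se_est = np.sqrt(ss_res / (len(x) - 2))

# The OLS decomposition SS_tot = SS_reg + SS_res holds exactly
assert abs(ss_tot - (ss_reg + ss_res)) < 1e-9
```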
1. Linearity: the relationship between X and Y is linear
2. Independence: observations are independent
3. Homoscedasticity: variance of Y is constant across X
4. Normality: Y is approximately normal at each X (for inference)
5. No extreme influential observations distorting estimates
Verify conditions 1, 3, and 5 by visual inspection of the scatterplot and residual plot. Pearson r is robust to moderate violations of normality for n above 30.
Design Requirements for Academic Publication
The APA Publication Manual (7th edition, 2020) specifies that figures displaying bivariate relationships must include clearly labelled axes with the variable name and unit of measurement. Data points must be clearly visible and distinguishable from the regression line. The figure caption must state the sample size, identify the correlation coefficient and its significance level, and specify whether the regression line is displayed. When multiple groups are plotted on the same scatterplot, distinct point shapes or fills must be used and a legend provided.
- Label. Figure number (e.g., Figure 1) in bold above the image; figure title in italic title case on the line below the number.
- Axes. Both axes must be labelled with the variable name and unit of measurement in parentheses.
- Caption. States n, r (or rho), p-value, and whether the regression line is included. Ends with a period.
- Zero baseline. Required only when zero is a meaningful value on the scale. For scales with no natural zero (e.g., standardised scores), axes may begin at a non-zero value.
- Regression line. If included, state the equation and R-squared in the caption or figure note.
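The axis-label, point-marker, and legend requirements above can be sketched in matplotlib. The variable names, units, and data values below are hypothetical placeholders, and the figure number, title, and caption would normally be supplied in the manuscript rather than drawn inside the image.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for two groups (illustrative only)
rng = np.random.default_rng(42)
x_a = rng.uniform(0, 10, 25)
y_a = 2.0 + 0.8 * x_a + rng.normal(0, 1.5, 25)
x_b = rng.uniform(0, 10, 25)
y_b = 1.0 + 0.5 * x_b + rng.normal(0, 1.5, 25)

fig, ax = plt.subplots()

# Distinct point shapes per group, with a legend (required for multiple groups)
ax.scatter(x_a, y_a, marker="o", facecolors="none", edgecolors="black", label="Group A")
ax.scatter(x_b, y_b, marker="s", facecolors="black", label="Group B")

# Axis labels carry the variable name and unit of measurement
ax.set_xlabel("Study time (hours per week)")
ax.set_ylabel("Exam score (points)")
ax.legend(frameon=False)

fig.savefig("figure1.png", dpi=300)
```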
Conditions for Appropriate Use
Selected Methodological Questions
How many observations are needed for reliable correlation estimation?
The power of the significance test for Pearson r depends on the sample size, the true population correlation, and the chosen alpha level. For detecting a medium effect (rho = .30) with 80% power at alpha = .05 (two-tailed), a minimum of 84 observations is required by Cohen's (1988) power tables. For a large effect (rho = .50), 28 observations suffice. For a small effect (rho = .10), 782 observations are required. Researchers should conduct a priori power analysis before data collection rather than interpreting post-hoc significance as confirmation of adequate sample size.
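The sample sizes quoted above can be approximated without power tables by using the Fisher Z normal approximation, n = ((z_alpha + z_beta) / Z_rho)^2 + 3. This is a sketch, not a replacement for Cohen's tables: the approximation lands within a couple of observations of the tabled values (84, 28, 782), and the two normal quantiles are hard-coded for alpha = .05 two-tailed and 80% power.

```python
import math

def n_required(rho, z_alpha=1.959964, z_beta=0.841621):
    """Approximate n needed to detect population correlation rho
    (two-tailed alpha = .05, power = .80) via the Fisher Z approximation."""
    z_rho = 0.5 * math.log((1 + rho) / (1 - rho))   # Fisher transform of rho
    return math.ceil(((z_alpha + z_beta) / z_rho) ** 2 + 3)

for effect in (0.10, 0.30, 0.50):
    print(effect, n_required(effect))
```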
When should Spearman rho be reported instead of Pearson r?
Spearman rho is preferred when the data contain extreme outliers that would distort Pearson r, when one or both variables are measured on an ordinal scale, when the scatterplot reveals a monotonic but clearly nonlinear relationship, or when the assumption of bivariate normality is substantially violated in a small sample. For large samples (n above 100) without extreme outliers, Pearson r and Spearman rho typically converge to similar values and the choice between them is largely a matter of the substantive research question. Many researchers report both as a robustness check.
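The outlier sensitivity described above is easy to demonstrate. In the sketch below (hypothetical values), a single extreme point added to ten weakly associated observations drags Pearson r to nearly 1, while the rank-based Spearman rho moves far less.

```python
import numpy as np
from scipy import stats

# Ten observations with weak association (hypothetical values)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], dtype=float)

r_before = stats.pearsonr(x, y)[0]
rho_before = stats.spearmanr(x, y)[0]

# Add a single extreme outlier at (100, 100)
x_out = np.append(x, 100.0)
y_out = np.append(y, 100.0)

r_after = stats.pearsonr(x_out, y_out)[0]
rho_after = stats.spearmanr(x_out, y_out)[0]

# One point dominates the deviation products, so Pearson r jumps to ~.99;
# in the ranks the outlier is just the 11th-largest value, so rho moves modestly
print(r_before, r_after, rho_before, rho_after)
```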
What does it mean for a regression line to pass through the centroid?
The OLS regression line always passes through the point whose coordinates are the sample means of X and Y, called the centroid or center of gravity of the data. This is a mathematical consequence of the OLS normal equations, not a modelling choice. It means that when X equals its sample mean, the regression equation predicts Y equal to its sample mean. Researchers can use this property to verify the regression equation: substituting x-bar into the regression equation should yield y-bar to within rounding error.
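The verification procedure described above takes two lines of code: fit the line by the closed-form formulas and substitute x-bar. Any dataset works; the values below are illustrative.

```python
import numpy as np

# Illustrative data (any values work; the centroid property is algebraic)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 2.0, 2.0, 4.0, 6.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Substituting x-bar into the fitted equation recovers y-bar exactly,
# a direct consequence of how b0 is defined by the OLS normal equations
assert abs((b0 + b1 * x.mean()) - y.mean()) < 1e-12
```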