The Line Graph in Statistical and Academic Practice
The line graph is a graphical display in which quantitative values measured at successive ordered points are represented as markers and connected by line segments. Its defining visual claim is that the connecting line between adjacent data points is meaningful: it asserts either that the values change continuously between measured points, as in a true time series, or at minimum that the ordered sequence of categories carries directional significance. This claim distinguishes the line graph from the bar chart and makes it the primary graphical form for longitudinal, repeated-measures, and time-series data in academic research.
The line graph's intellectual origins lie in the same tradition as the scatterplot. William Playfair, the Scottish engineer and political economist who invented both the bar chart and the pie chart, introduced the line graph in his Commercial and Political Atlas (1786) to display the national debt of England across time. His innovation was to recognise that a connected sequence of points conveys not only the value at each measured occasion but the trajectory of change between occasions, transforming a static display into a visual argument about process, direction, and rate. This insight has proved so durable that the form Playfair established in 1786 remains the standard for time-series visualization in scientific publication today.
What the Line Graph Communicates
A well-constructed line graph communicates four properties of a quantitative series across an ordered dimension, each of which carries distinct analytical and interpretive content.
When to Use a Line Graph: Conditions and Contraindications
The appropriateness of a line graph depends on the measurement structure of the independent variable, the nature of the dependent variable, and the research question being addressed.
Formula Reference: Change and Trend Statistics
Core Measures Summary
| Statistic | Formula | Interpretation | Notes |
|---|---|---|---|
| Absolute Change | delta_t = y_t - y_(t-1) | Raw difference between consecutive values | Positive = increase; negative = decrease. Units are the same as the variable. |
| Percentage Change | pct_t = (y_t - y_(t-1)) / \|y_(t-1)\| * 100 | Relative change as a proportion of the prior value | Uses absolute value of prior value to handle negative baselines correctly. Undefined when y_(t-1) = 0. |
| Cumulative Change | cum_t = y_t - y_1 | Total change from the first observation to time t | Anchors all comparisons to the baseline observation. |
| Average Rate of Change | ARC = (y_last - y_first) / (T - 1) | Mean change per time interval over the entire series | T = number of time points. Equal to the slope of a line through the first and last values. |
| Coefficient of Variation | CV = (SD / \|mean\|) * 100 | Relative variability as a percentage of the mean | Allows comparison of variability across series with different units or scales. |

Notation: y_t = value at time point t; y_1 = y_first = first value; y_last = final value; T = total number of time points; SD = sample standard deviation; mean = arithmetic mean of the series.
Mean: y-bar = sum(y_t) / T
Variance: s^2 = sum[(y_t - y-bar)^2] / (T - 1)
SD: s = sqrt(s^2)
SE: SE = s / sqrt(T)
Median: middle value of the sorted series (mean of the two middle values when T is even)
The variance and SD use the sample formula (T - 1 denominator). The mean summarises level; the SD summarises the spread of the observations around that level.
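The summary formulas above, together with the coefficient of variation from the core measures table, can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `describe` and the sample values are hypothetical and not taken from any particular tool.

```python
import math
import statistics


def describe(series):
    """Mean, sample variance, SD, SE, median and CV of a numeric series."""
    T = len(series)
    if T < 2:
        raise ValueError("the sample variance needs at least two observations")
    mean = sum(series) / T
    # Sample variance: squared deviations from the mean over T - 1.
    variance = sum((y - mean) ** 2 for y in series) / (T - 1)
    sd = math.sqrt(variance)
    se = sd / math.sqrt(T)
    # CV is undefined when the mean is zero, so report None in that case.
    cv = sd / abs(mean) * 100 if mean != 0 else None
    return {
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "se": se,
        "median": statistics.median(series),
        "cv": cv,
    }


# Hypothetical series of eight observations.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = describe(values)
```

For these values the mean is 5, the squared deviations sum to 32, and the sample variance is therefore 32 / 7.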
Percentage: pct_t = delta_t / |y_(t-1)| * 100
Cumulative: cum_t = y_t - y_1
Avg Rate: ARC = (y_last - y_1) / (T-1)
Percentage change uses the absolute value of the prior value to produce a correctly signed result when the prior value is negative. When the prior value is zero, percentage change is mathematically undefined.
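The change statistics, including the handling of negative and zero prior values, might be computed as follows. This is a sketch under stated assumptions: the function name `change_statistics` and the example series are hypothetical, and an undefined percentage change is represented as `None`.

```python
def change_statistics(series):
    """Absolute, percentage and cumulative change, plus the average rate of change."""
    if len(series) < 2:
        raise ValueError("at least two time points are required")
    pairs = list(zip(series, series[1:]))  # (previous, current) for each step
    absolute = [curr - prev for prev, curr in pairs]
    # Dividing by |previous| keeps the sign of the change when the baseline is
    # negative; a zero baseline leaves the percentage undefined (None).
    percentage = [
        (curr - prev) / abs(prev) * 100 if prev != 0 else None
        for prev, curr in pairs
    ]
    cumulative = [y - series[0] for y in series]
    arc = (series[-1] - series[0]) / (len(series) - 1)
    return absolute, percentage, cumulative, arc


# Hypothetical series that starts negative and passes through zero.
balance = [-50.0, -40.0, 0.0, 20.0]
absolute, percentage, cumulative, arc = change_statistics(balance)
```

The first step rises from -50 to -40, an increase of 10, and dividing by |-50| reports it as a 20% increase. Dividing by the signed value -50 would instead report -20%, which has the wrong sign. The final step starts from zero, so its percentage change is left undefined.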
Trend line: y-hat_t = b0 + b1 * t
where t = 1, 2, 3, ..., T (index of time point)
b1 > 0: increasing trend
b1 < 0: decreasing trend
b1 = 0: no linear trend
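The OLS slope through the time index gives the least-squares linear fit to the series and is a standard measure of linear trend in time-series analysis. It equals the average rate of change when the series is perfectly linear; in general the two differ, because the average rate of change depends only on the first and last values whereas the OLS slope uses every observation.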
Descriptive Statistics Formulas
s^2 = sum[(y_t - y-bar)^2] / (T - 1)
s = sqrt(s^2)
SE = s / sqrt(T)
Q1 = 25th percentile (linear interpolation)
Q3 = 75th percentile (linear interpolation)
IQR = Q3 - Q1
Total pct change = (y_last - y_first) / |y_first| * 100
ARC = (y_last - y_first) / (T - 1)
CV less than 15%: low variability
CV 15% to 35%: moderate variability
CV greater than 35%: high variability
Stt = sum[(t - t-bar)^2]
Sty = sum[(t - t-bar)(y_t - y-bar)]
b1 = Sty / Stt
b0 = y-bar - b1 * t-bar
R^2 = 1 - SS_res / SS_tot
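The trend formulas (Stt, Sty, b1, b0 and R^2) translate directly into code. The following is a minimal sketch, assuming the time index runs t = 1, 2, ..., T; the function name `ols_trend` and the example series are hypothetical.

```python
def ols_trend(series):
    """Least-squares intercept, slope and R^2 of a series against t = 1..T."""
    T = len(series)
    if T < 2:
        raise ValueError("at least two time points are required")
    t = range(1, T + 1)
    t_bar = (T + 1) / 2
    y_bar = sum(series) / T
    s_tt = sum((ti - t_bar) ** 2 for ti in t)
    s_ty = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, series))
    b1 = s_ty / s_tt
    b0 = y_bar - b1 * t_bar
    ss_res = sum((yi - (b0 + b1 * ti)) ** 2 for ti, yi in zip(t, series))
    ss_tot = sum((yi - y_bar) ** 2 for yi in series)
    # R^2 is undefined for a constant series (SS_tot = 0).
    r_squared = 1 - ss_res / ss_tot if ss_tot != 0 else None
    return b0, b1, r_squared


# Hypothetical five-point series.
scores = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1, r_squared = ols_trend(scores)
arc = (scores[-1] - scores[0]) / (len(scores) - 1)
```

For this series Stt = 10 and Sty = 6, so the slope is 0.6 with intercept 2.2 and R^2 = 0.6, while the average rate of change is 0.75. The two trend measures differ here because the series is not perfectly linear.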
Multi-Series Line Graphs: Design and Interpretation
The multi-series line graph is among the most powerful and most frequently misused forms in academic data visualization. Its power lies in its capacity to place multiple temporal trajectories in direct visual correspondence, enabling the simultaneous perception of level differences, trend differences, and crossing patterns. Its misuse arises when too many series are plotted on a single set of axes, when series are not distinguished by sufficiently different visual encodings, or when the axes are scaled to favour the appearance of one series over others.
The Y-Axis Baseline: When Zero Is and Is Not Required
The question of whether the Y-axis must begin at zero is more nuanced for line graphs than for bar charts, and the answer depends on the encoding principle and the research question. Bar charts encode value by bar length, so a non-zero baseline distorts the perceived ratio between bars and constitutes misrepresentation. Line graphs encode value by position along the vertical axis, not by length, so a non-zero baseline does not create the same distortion of ratios. A line graph showing temperature change between 18 and 24 degrees Celsius can legitimately display that range without beginning at zero, because the reader is not comparing areas or lengths.
However, a non-zero baseline can still mislead if it visually amplifies apparent variation to a degree that is disproportionate to the actual magnitude of change. A series ranging from 99.1 to 99.9 plotted with a Y-axis from 99.0 to 100.0 will appear to show extreme volatility that the data do not support when interpreted in context. The guiding principle is that the axis range should be chosen to accurately represent the substantive significance of the variation in the data. When the range is restricted, the figure caption must state the axis limits explicitly and explain why a restricted range was chosen.
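One informal way to see how much a restricted axis magnifies variation is to compare the share of the axis height that the data occupy under two axis choices. The sketch below is an illustrative diagnostic only, not a standard statistic; the function name `axis_fill_fraction` and the readings are hypothetical.

```python
def axis_fill_fraction(series, axis_min, axis_max):
    """Fraction of the plotted axis height spanned by the data range."""
    if axis_max <= axis_min:
        raise ValueError("axis_max must exceed axis_min")
    if min(series) < axis_min or max(series) > axis_max:
        raise ValueError("the axis limits must contain every data value")
    return (max(series) - min(series)) / (axis_max - axis_min)


# Hypothetical readings within the 99.1 to 99.9 range discussed above.
readings = [99.1, 99.4, 99.9, 99.3, 99.6]
restricted = axis_fill_fraction(readings, 99.0, 100.0)  # axis from 99 to 100
zero_based = axis_fill_fraction(readings, 0.0, 100.0)   # axis from 0 to 100
amplification = restricted / zero_based
```

On the restricted axis the series fills about 80% of the plot height, against under 1% on a zero-based axis, so the same variation is drawn roughly 100 times taller. Whether that magnification is appropriate depends on whether a change of 0.8 units is substantively important in context.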
- Label. Figure 1 in bold below the image; title in italic title case on the next line.
- Axes. Both axes labelled with variable name and unit of measurement in parentheses. Tick marks and gridlines used sparingly.
- Legend. Required when two or more series are displayed. Placed inside the figure area or directly below; not as a separate element requiring page flipping.
- Caption. States the time period covered, identifies each series if multiple, states the unit of measurement, and reports the sample size or data source. Ends with a period.
- Data points. Should be shown as markers when the number of time points is small (fewer than 20) to allow the reader to distinguish actual observations from the interpolated connecting line.
Selected Methodological Questions
When should smoothed curves replace straight line segments?
Straight line segments connecting adjacent data points are the default and are appropriate in most research contexts because they make the actual data values unambiguous: the line passes exactly through each measured point, and any departure from the line is visually interpretable as a change in trajectory. Smoothed curves, such as cubic spline interpolations or LOESS fits, are appropriate when the data are densely sampled and the underlying process is known to be smooth, or when individual measurement noise is so large that straight segments produce a jagged display that obscures the underlying trend. When smooth curves are used, the figure caption must state that the curve is a smoothed fit and specify the method, because the smoothed line no longer passes exactly through the data points and the reader cannot recover individual values from the curve alone.
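The point that a smoothed line no longer passes through the observations can be illustrated with a centred moving average. This is a deliberately simpler smoother than the cubic spline or LOESS methods named above, swapped in to keep the sketch short. The function name `centred_moving_average` and the data are hypothetical.

```python
def centred_moving_average(series, window=3):
    """Centred moving average; positions without a full window are None."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        if i < half or i >= len(series) - half:
            smoothed.append(None)  # too few neighbours on one side
        else:
            chunk = series[i - half:i + half + 1]
            smoothed.append(sum(chunk) / window)
    return smoothed


# Hypothetical noisy series.
noisy = [10.0, 14.0, 9.0, 15.0, 11.0, 16.0]
smoothed = centred_moving_average(noisy)
# Gap between each observation and the smoothed value at the same point.
departures = [
    None if s is None else y - s
    for y, s in zip(noisy, smoothed)
]
```

Every interior departure is non-zero for this series, which is why a reader cannot recover the measured values from the smoothed curve and why the caption should name the smoothing method.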
How should missing time points be handled in a line graph?
When a series contains missing values at one or more time points, the connecting line should be broken rather than drawn through the gap, because drawing through the gap implies that the trajectory between the measured points is known when it is not. Chart.js and most statistical packages provide options for handling missing values as gaps or as linear interpolations; the choice must be documented in the figure caption. When the missing values represent a known event such as a study interruption, an annotated break in the line with explanatory text is the appropriate display. The most common error is to allow software to silently connect across missing values, producing a continuous line that implies data that do not exist.
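Breaking the line at missing values amounts to splitting the series into runs of consecutive observed points and drawing each run as its own line. The following is a minimal sketch, assuming missing values are encoded as `None` or NaN; the function name `split_at_gaps` and the data are hypothetical. In Chart.js the corresponding setting is the `spanGaps` option, which by default leaves a break at null points.

```python
import math


def split_at_gaps(times, values):
    """Split a series into segments of consecutive non-missing observations."""
    segments, current = [], []
    for t, y in zip(times, values):
        missing = y is None or (isinstance(y, float) and math.isnan(y))
        if missing:
            if current:  # close the segment that ends just before the gap
                segments.append(current)
                current = []
        else:
            current.append((t, y))
    if current:
        segments.append(current)
    return segments


# Hypothetical weekly scores with weeks 3 and 4 missing.
weeks = [1, 2, 3, 4, 5, 6]
scores = [12.0, 13.5, None, None, 15.0, 14.2]
segments = split_at_gaps(weeks, scores)
```

Here the result is two segments, weeks 1 to 2 and weeks 5 to 6, with no line drawn across the gap. The sketch only detects values explicitly marked as missing; time points that are absent from the data altogether would need to be inserted as missing entries first.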
What is the difference between a line graph and a time-series plot?
In common usage the terms are interchangeable. In the technical statistical literature, a time-series plot refers specifically to a line graph of data collected at regular, equally spaced intervals in time, to which time-series-specific analyses such as autocorrelation, stationarity testing, and spectral analysis may be applied. A line graph in the general sense also includes ordered categorical sequences and measurements that are not equally spaced in time. This tool produces line graphs in the general sense; researchers applying formal time-series analysis should verify that the equal-spacing assumption holds for their data before applying autocorrelation-based methods.
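The equal-spacing check can be sketched as a comparison of successive gaps between time points. This assumes the time points are supplied as numbers (for example days or months since baseline); the function name `is_equally_spaced` and the tolerance value are hypothetical choices.

```python
def is_equally_spaced(times, tolerance=1e-9):
    """True when the time points increase by the same step throughout."""
    if len(times) < 2:
        return True  # no gaps to compare
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if gaps[0] <= 0:
        return False  # time points must be strictly increasing
    return all(abs(gap - gaps[0]) <= tolerance for gap in gaps)


# Hypothetical designs: monthly measurements versus follow-ups at 0, 1, 3 and 6 months.
monthly = [0, 1, 2, 3, 4]
follow_up = [0, 1, 3, 6]
monthly_ok = is_equally_spaced(monthly)
follow_up_ok = is_equally_spaced(follow_up)
```

The monthly design passes the check, while the follow-up design does not, so autocorrelation-based methods that assume equal spacing would not apply to it directly.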