Test Pearson Correlation Significance From r and n
Test whether a Pearson correlation coefficient is statistically significant. Get t-statistic, p-value, confidence interval, and effect size interpretation.
Pearson's r measures linear association between two variables, bounded by [−1, +1]. The significance test asks whether the observed |r| is large enough to rule out ρ = 0, the null of zero population correlation. Test statistic: t = r·√(n − 2) / √(1 − r²), distributed under H₀ as t with n − 2 degrees of freedom.
For a confidence interval on ρ, the t alone isn't enough. r has a skewed sampling distribution near ±1 that the t doesn't correct for. Apply Fisher's z-transform z = ½·ln((1 + r) / (1 − r)), approximately normal with SE = 1/√(n − 3); build the interval in z-space, then transform back. The page returns t, df, two-sided and one-sided p, and the Fisher-based 95% CI on ρ.
From r to t to p-value
Test statistic and degrees of freedom
Under H₀: ρ = 0, the standardized statistic
t = r · √(n − 2) / √(1 − r²), df = n − 2
follows a t-distribution with df = n − 2. The √(n − 2) in the numerator is what lets large samples detect even small correlations: at n = 1000, df = 998 and the t critical value is essentially 1.96 (the normal limit). At n = 10, df = 8 and the critical value jumps to 2.31 because the t-distribution has heavier tails when df is small.
Worked example: r = 0.42, n = 45. df = 43. t = 0.42·√43 / √(1 − 0.42²) = 0.42 · 6.557 / 0.908 ≈ 3.03. With df = 43, two-tailed p ≈ 0.0042, so reject H₀ at α = 0.05. R's cor.test() and scipy.stats.pearsonr return the same numbers to several decimals.
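The worked example can be reproduced from r and n alone. This is a minimal sketch, assuming SciPy is available; `pearson_t_test` is a hypothetical helper name, not part of any library.

```python
import math
from scipy import stats

def pearson_t_test(r, n):
    """Return (t, df, two-sided p) for H0: rho = 0, given only r and n."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n < 3:
        raise ValueError("n must be at least 3 so that df = n - 2 >= 1")
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    # sf is the upper-tail probability 1 - CDF; doubling gives the two-sided p
    p_two_sided = 2.0 * stats.t.sf(abs(t), df)
    return t, df, p_two_sided

t_stat, df, p = pearson_t_test(0.42, 45)
print(f"t = {t_stat:.2f}, df = {df}, two-sided p = {p:.4f}")
```

Using the survival function `sf` instead of `1 - cdf` avoids losing precision when p is very small.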
P-value direction
Two-tailed: p = 2·(1 − F(|t|, df)) where F is the t-distribution CDF. One-tailed right (H₁: ρ > 0): p = 1 − F(t, df). One-tailed left (H₁: ρ < 0): p = F(t, df). Pre-register the tail before seeing data. Switching after the fact halves your p and is textbook p-hacking.
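The three tail choices map directly onto the t-distribution CDF. A small sketch, assuming SciPy; the function name `p_value` and the `alternative` labels are illustrative choices.

```python
from scipy import stats

def p_value(t, df, alternative="two-sided"):
    """p-value for a t statistic under the chosen alternative hypothesis."""
    if alternative == "two-sided":
        return 2.0 * stats.t.sf(abs(t), df)   # 2 * (1 - F(|t|, df))
    if alternative == "greater":              # H1: rho > 0
        return stats.t.sf(t, df)              # 1 - F(t, df)
    if alternative == "less":                 # H1: rho < 0
        return stats.t.cdf(t, df)             # F(t, df)
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

for alt in ("two-sided", "greater", "less"):
    print(alt, p_value(3.03, 43, alt))
```

When the observed t is positive, the right-tailed p is exactly half the two-tailed p, which is why choosing the tail after seeing the sign is misleading.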
Fisher z for confidence intervals on ρ
The sampling distribution of r is skewed when ρ is far from 0, which makes a symmetric ± formula on the r scale wrong. Fisher's transform
z = ½ · ln((1 + r) / (1 − r)), SE_z = 1 / √(n − 3)
produces a statistic that is approximately normal regardless of the true ρ. Build the CI in z-space (z ± z_{α/2} · SE_z, with z_{α/2} = 1.96 for 95%), then back-transform each endpoint with r = (e^(2z) − 1) / (e^(2z) + 1), which is tanh(z). For r = 0.42, n = 45: z = 0.448, SE_z = 1/√42 ≈ 0.154, CI in z-space = [0.145, 0.750], back-transformed CI on ρ ≈ [0.14, 0.64]. Asymmetric on the r scale, as it should be.
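The interval construction can be sketched in a few lines, assuming SciPy for the normal quantile. `fisher_ci` is a hypothetical helper; `math.atanh` and `math.tanh` are the transform and its inverse.

```python
import math
from scipy import stats

def fisher_ci(r, n, confidence=0.95):
    """Confidence interval on rho via Fisher's z-transform."""
    if n < 4:
        raise ValueError("n must be at least 4 so that SE = 1/sqrt(n - 3) exists")
    z = math.atanh(r)                       # 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)
    z_crit = stats.norm.ppf(0.5 + confidence / 2.0)
    # Build the interval in z-space, then map each endpoint back to the r scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

low, high = fisher_ci(0.42, 45)
print(f"95% CI on rho: [{low:.2f}, {high:.2f}]")
```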
Significance vs effect size, don't confuse them
A significant correlation isn't automatically meaningful. Cohen's (1988) benchmarks for |r| are 0.1 small, 0.3 medium, and 0.5 large; a commonly used extension treats values below 0.1 as negligible and values of 0.7 or more as very large. The benchmarks come from behavioral science and aren't universal. In physics, r = 0.5 between two measured quantities is unimpressive. In epidemiology, r = 0.3 can be substantial. Use the benchmarks as a sanity check, not a target.
The variance explained by a linear relationship is r², not r. So r = 0.7 explains 49% of the variance, not 70%. r = 0.3 explains 9%. r = 0.1 explains 1%. Practical importance comes from r², not the headline correlation.
Sample size determines what's detectable; effect size determines what's meaningful. With n = 1000, r = 0.07 hits two-tailed p ≈ 0.027 but explains under 0.5% of the variance. Statistically significant, practically irrelevant. The opposite case: r = 0.45 with n = 15 misses significance (p ≈ 0.09) despite a medium effect, because power is too low. Always report r alongside p, and ideally the 95% CI on ρ.
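The two contrasting cases are easy to check side by side. A rough sketch, assuming SciPy; `summarize` is a hypothetical helper that reports p next to r².

```python
import math
from scipy import stats

def summarize(r, n):
    """Two-sided p for H0: rho = 0, plus variance explained (r squared)."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    p = 2.0 * stats.t.sf(abs(t), df)
    return {"r": r, "n": n, "p": p, "r_squared": r * r}

for r, n in [(0.07, 1000), (0.45, 15)]:
    s = summarize(r, n)
    verdict = "significant" if s["p"] < 0.05 else "not significant"
    print(f"r = {r}, n = {n}: p = {s['p']:.3f} ({verdict}), "
          f"variance explained = {100 * s['r_squared']:.1f}%")
```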
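The CI on ρ is what tells you precision. Width tells you uncertainty. r = 0.42 with CI [0.14, 0.64] is medium but uncertain. r = 0.42 with CI [0.40, 0.44] would be a different story. A CI that excludes 0 almost always coincides with two-tailed rejection at the same α; the agreement is not exact because the interval uses Fisher's normal approximation while the test uses the t-distribution, so borderline cases can disagree. A CI containing 0 means the data don't rule out ρ = 0, which isn't the same as proving ρ = 0.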
APA 7 reporting expects r, df, p, and CI together, with no leading zero on statistics that cannot exceed 1 in absolute value. Format: "r(43) = .42, p = .004, 95% CI [.14, .64]." Don't strip out the df or the CI to save space. Both carry information not in the bare correlation.
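A report line like this can be generated mechanically. The sketch below assumes the APA convention of dropping the leading zero on r, p, and the CI bounds, and of writing very small p-values as "p < .001"; `format_apa` and `drop_leading_zero` are hypothetical helper names.

```python
def drop_leading_zero(x, digits=2):
    """Format a number bounded by 1 in absolute value without its leading zero."""
    s = f"{x:.{digits}f}"
    if s.startswith("0."):
        return s[1:]
    if s.startswith("-0."):
        return "-" + s[2:]
    return s

def format_apa(r, n, p, ci_low, ci_high):
    """Build an APA-style report string for a Pearson correlation."""
    p_text = "p < .001" if p < 0.001 else f"p = {drop_leading_zero(p, 3)}"
    return (f"r({n - 2}) = {drop_leading_zero(r)}, {p_text}, "
            f"95% CI [{drop_leading_zero(ci_low)}, {drop_leading_zero(ci_high)}]")

print(format_apa(0.42, 45, 0.0042, 0.14, 0.64))
```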
When r misleads: small n, nonlinearity, outliers
Small n. The t-based test always assumes bivariate normality of x and y, and with n below ~30 the assumption matters. Heavy tails inflate Type I error. Severe skew distorts the sampling distribution of r. For small n with non-normal marginals, Spearman's rank correlation is the safer choice. R's cor.test(x, y, method = "spearman") and scipy.stats.spearmanr handle it. Below n = 5, no correlation test is reliable; the data don't support any conclusion about ρ.
Non-linear relationships. r only catches linear association. A clean parabola y = x² centered at zero gives r ≈ 0 over a symmetric range, even though the relationship is deterministic. Anscombe's quartet (1973) and the Datasaurus Dozen (Matejka and Fitzmaurice 2017) show datasets with identical r and radically different scatter shapes. Always plot before reporting. If the scatter is monotonic but curved, Spearman captures the monotonic trend by working on ranks. If it's non-monotonic, no single correlation coefficient is the right summary and you want a regression with the appropriate functional form.
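Both cases in this paragraph can be checked numerically. A minimal sketch with synthetic data, assuming NumPy and SciPy; the grids and the exponential curve are illustrative choices only.

```python
import numpy as np
from scipy import stats

# Deterministic parabola over a symmetric range: Pearson's r is essentially zero
x_sym = np.linspace(-3.0, 3.0, 61)
r_parabola, _ = stats.pearsonr(x_sym, x_sym ** 2)

# Monotonic but curved relationship: Spearman sees a perfect monotonic trend,
# while Pearson understates it because the curve is not a straight line
x_pos = np.linspace(0.1, 5.0, 50)
y_exp = np.exp(x_pos)
r_exp, _ = stats.pearsonr(x_pos, y_exp)
rho_exp, _ = stats.spearmanr(x_pos, y_exp)

print(f"parabola: Pearson r = {r_parabola:.3f}")
print(f"exponential: Pearson r = {r_exp:.3f}, Spearman rho = {rho_exp:.3f}")
```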
Outliers. A single influential point can flip r from +0.5 to −0.2 at small n. Investigate before deciding what to do. Data-entry error: fix or remove. Genuine extreme observation: report r with and without the point, and prefer Spearman or a robust estimator such as biweight midcorrelation (for example bicor() in R's WGCNA package; R's WRS2 package offers the related percentage bend and Winsorized correlations). Never quietly drop outliers without disclosing it.
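The effect of one influential point can be seen on a toy dataset. The eight base points and the added outlier below are made up for illustration, assuming NumPy and SciPy.

```python
import numpy as np
from scipy import stats

# Eight points with a clear positive trend (hypothetical data)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)

# The same data plus one extreme observation far from the trend
x_out = np.append(x, 20.0)
y_out = np.append(y, -10.0)

r_without, _ = stats.pearsonr(x, y)
r_with, _ = stats.pearsonr(x_out, y_out)
rho_with, _ = stats.spearmanr(x_out, y_out)

print(f"Pearson without outlier: {r_without:.2f}")
print(f"Pearson with outlier:    {r_with:.2f}")
print(f"Spearman with outlier:   {rho_with:.2f}")
```

Here the single added point reverses the sign of Pearson's r, while the rank-based Spearman coefficient stays positive. Reporting both versions, as recommended above, makes the dependence on that point visible.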
Restricted range. Sampling x over a narrow window suppresses r below its true population value. Studying SAT vs college GPA only on Harvard students gives a misleadingly low r because the SAT range is truncated near the top. If your sample range is much narrower than the population range you care about, the correlation you compute won't generalize.
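Range restriction is easy to simulate. The sketch below draws a synthetic bivariate normal sample with a population correlation of 0.6 and then keeps only the upper tail of x; the cutoff of 1.5 standard deviations and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.6
cov = [[1.0, rho], [rho, 1.0]]
sample = rng.multivariate_normal([0.0, 0.0], cov, size=20_000)
x, y = sample[:, 0], sample[:, 1]

# Correlation over the full range of x
r_full = np.corrcoef(x, y)[0, 1]

# Correlation when only observations with high x are sampled
keep = x > 1.5
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(f"full range:       r = {r_full:.2f} (n = {x.size})")
print(f"restricted range: r = {r_restricted:.2f} (n = {keep.sum()})")
```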
Reading r alongside p: common pitfalls
Causation. r between A and B doesn't tell you A causes B, B causes A, or a third variable C causes both. Coffee drinking correlates with heart disease in observational data, but the direction and any common cause (age, smoking, work stress) need experimental design or causal inference machinery (instrumental variables, DAGs with sufficient assumptions, regression discontinuity) to disentangle. Pearl's "The Book of Why" is the accessible entry point. Correlations are useful for hypothesis generation and for predictive models that don't require causal interpretation, but they don't license causal claims on their own.
Multiple comparisons inflate false positives. Computing dozens of correlations across a wide dataset and reporting only the significant ones is data dredging. With 100 pairs of truly unrelated variables tested at α = 0.05, you expect 5 to come out significant by chance alone. Bonferroni controls the family-wise error rate; Benjamini-Hochberg controls the false discovery rate and is less conservative. Pre-register which correlations you care about, or use a two-stage exploratory/confirmatory split.
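Both corrections are short enough to write out by hand. This is a sketch using NumPy only; the function names and the ten p-values are hypothetical.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject where p <= alpha / m."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank i with p_(i) <= alpha * i / m."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # last sorted position meeting its threshold
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("uncorrected:", int(np.sum(np.asarray(pvals) <= 0.05)))
print("Bonferroni: ", int(bonferroni(pvals).sum()))
print("BH:         ", int(benjamini_hochberg(pvals).sum()))
```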
Selection effects produce spurious correlations and suppress real ones. Berkson's bias arises when selection into the sample depends on both variables, as in hospital-based studies: admission for either of two unrelated diseases produces a negative correlation between them among inpatients. Range restriction deflates r when sampling truncates x. Both are easy to overlook because they're features of the sampling design, not the data.
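Berkson's bias can be reproduced with a toy simulation. Everything below is an assumption made for illustration: two independent diseases with 10% prevalence each, and an admission rule that admits anyone with either disease plus a small share of other people.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two diseases that are independent in the general population
disease_a = rng.random(n) < 0.10
disease_b = rng.random(n) < 0.10

# Hypothetical admission rule: either disease, or a 2% chance for other reasons
admitted = disease_a | disease_b | (rng.random(n) < 0.02)

a = disease_a.astype(float)
b = disease_b.astype(float)
r_population = np.corrcoef(a, b)[0, 1]
r_hospital = np.corrcoef(a[admitted], b[admitted])[0, 1]

print(f"general population: r = {r_population:.3f}")
print(f"hospital sample:    r = {r_hospital:.3f}")
```

The negative correlation in the hospital sample comes entirely from the admission rule, since the diseases were generated independently.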
Treating Pearson's r as the universal correlation. r assumes linearity. If your scatter is monotonic but bent, r underestimates the strength of the relationship. Spearman's ρ captures the monotonic trend by working on ranks. For non-monotonic relationships, neither correlation coefficient is appropriate; you want a regression model that matches the shape, or distance correlation (Székely, Rizzo & Bakirov 2007) which detects general dependence without assuming monotonicity.
Limitations of correlation testing
Linear association only: r catches linear relationships, nothing else. Anscombe's quartet (1973) is the canonical demonstration that very different scatter shapes can share the same r down to two decimals. Always look at the scatter before reporting.
Outlier sensitivity: a single influential point can flip r from +0.5 to −0.2 at small n. Run robust alternatives (Spearman's ρ, biweight midcorrelation) when outliers are present, and report which method you used.
Bivariate normality at small n: the t-based significance test assumes both marginals and the joint distribution are normal. Heavy-tailed or skewed data distorts small-sample p-values. Non-parametric correlation tests are the safer fallback for n below ~30.
Significance ≠ strength: with n = 1000, r = 0.07 hits p ≈ 0.027 but explains under 0.5% of the variance. Report r alongside p, not as a substitute.
Note: scipy.stats.pearsonr returns r and a two-tailed p. R's cor.test() returns r, p, and the Fisher-based 95% CI on ρ. For Spearman, scipy.stats.spearmanr and R's cor.test(method="spearman") are the standard tools.
Sources
- NIST/SEMATECH e-Handbook on the correlation coefficient.
- Penn State STAT 501 §1.6 on Pearson correlation and significance.
- Anscombe, F. J. (1973). "Graphs in Statistical Analysis," The American Statistician, 27(1), 17-21.
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd edition. Lawrence Erlbaum.
- Pearl, J. and Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect.
Pearson r in practice: working questions
Pearson vs Spearman, which should I use?
Pearson measures linear association; Spearman measures monotonic association. If the relationship is linear and the marginals are roughly normal, Pearson has more power. If the relationship is monotonic but curved (a saturating curve, an exponential ramp), or if outliers are present, Spearman is the safer choice because it operates on ranks. Plot first; the diagnosis is easy. R: cor(x, y, method = "spearman") and Python's scipy.stats.spearmanr handle it.
How do I compute the p-value for r?
Test statistic t = r·√(n − 2) / √(1 − r²), distributed under H₀: ρ = 0 as t with n − 2 df. Two-tailed p comes from both tails. For r = 0.40 with n = 25, t ≈ 2.09 with df = 23, two-tailed p ≈ 0.047. The Fisher z-transform z = ½·ln((1 + r) / (1 − r)) gives an approximately normal alternative for testing non-zero null values, and is what you use for confidence intervals on ρ. R's cor.test() and scipy.stats.pearsonr both return r and a t-based p.
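For a null value other than zero, the Fisher-z version of the test can be sketched as follows, assuming SciPy. `fisher_z_test` is a hypothetical helper, and the null value of 0.2 in the second call is an arbitrary example.

```python
import math
from scipy import stats

def fisher_z_test(r, n, rho0=0.0):
    """Approximate two-sided test of H0: rho = rho0 using Fisher's z."""
    if n < 4:
        raise ValueError("n must be at least 4")
    # Difference of transformed values divided by SE = 1 / sqrt(n - 3)
    z_stat = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
    p = 2.0 * stats.norm.sf(abs(z_stat))
    return z_stat, p

z0, p0 = fisher_z_test(0.40, 25, rho0=0.0)
z2, p2 = fisher_z_test(0.40, 25, rho0=0.2)
print(f"H0: rho = 0.0 -> z = {z0:.2f}, p = {p0:.3f}")
print(f"H0: rho = 0.2 -> z = {z2:.2f}, p = {p2:.3f}")
```

With ρ₀ = 0 the result is close to, but not identical to, the t-based p, because the two tests rest on different approximations.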
What does Fisher's z-transform actually do?
Stabilizes the variance of r. The sampling distribution of r is skewed when ρ is far from 0, which makes symmetric CIs on the r scale wrong. Fisher's transform z = ½·ln((1 + r) / (1 − r)) makes the distribution approximately normal with SE = 1/√(n − 3), regardless of the true ρ. Build the CI in z-space (z ± 1.96/√(n − 3)), then back-transform with r = (e^(2z) − 1) / (e^(2z) + 1). The asymmetry of the resulting CI on r captures the truth that r near ±1 is more constrained than r near 0.
How does sample size affect the significance of r?
At n = 10, |r| = 0.63 is needed for two-tailed p < 0.05. At n = 30, |r| = 0.36. At n = 100, |r| = 0.20. At n = 1000, |r| = 0.062. With huge n, even trivial correlations hit significance, which is why r = 0.07 in a 1000-row dataset is statistically significant (p ≈ 0.027) but explains under 0.5% of the variance. Always report r alongside p, never as a substitute.
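These thresholds follow from inverting the t formula: solving t = r·√df / √(1 − r²) for r gives r = t / √(t² + df). A short sketch, assuming SciPy; `critical_r` is a hypothetical helper name.

```python
import math
from scipy import stats

def critical_r(n, alpha=0.05):
    """Smallest |r| that reaches two-tailed significance at level alpha."""
    df = n - 2
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df)
    return t_crit / math.sqrt(t_crit ** 2 + df)

for n in (10, 30, 100, 1000):
    print(f"n = {n:4d}: |r| >= {critical_r(n):.3f}")
```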
Is r = 0.3 a strong correlation?
Field-dependent. Cohen (1988) called 0.10 small, 0.30 medium, 0.50 large for behavioral science. In physics or chemistry, r = 0.3 is weak. In epidemiology or psychology, it can be a meaningful effect. The variance explained is r², so 0.3 explains 9% of the variance, which is modest. Look at the scatter, not just the number. A genuine r = 0.3 with no outliers is different from r = 0.3 driven by one extreme point that flips the sign when removed.
What if my variables aren't normal?
Pearson is reasonably robust to mild non-normality at n above ~30 thanks to the CLT. Heavy-tailed or strongly skewed marginals at small n inflate Type I error. Switch to Spearman, which only needs monotonicity and is invariant to monotonic transformations. For inference (CIs and p-values) when normality is questionable, the bootstrap is the cleanest fallback: resample (xᵢ, yᵢ) pairs with replacement, compute r each time, take percentile CI. R's boot package and scipy.stats.bootstrap handle it.
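The pairs bootstrap described above can be written directly with NumPy. This is a sketch, not a substitute for a library routine: `bootstrap_r_ci` is a hypothetical helper, and the skewed sample data, resample count, and seeds are arbitrary assumptions.

```python
import numpy as np

def bootstrap_r_ci(x, y, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for Pearson's r, resampling (x, y) pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = x.size
    rs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)      # same indices for x and y keeps pairs intact
        rs[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    tail = (1.0 - confidence) / 2.0
    low, high = np.quantile(rs, [tail, 1.0 - tail])
    return float(low), float(high)

# Skewed synthetic data for illustration
data_rng = np.random.default_rng(1)
x = data_rng.exponential(size=40)
y = 0.5 * x + data_rng.exponential(size=40)

low, high = bootstrap_r_ci(x, y)
print(f"observed r = {np.corrcoef(x, y)[0, 1]:.2f}, "
      f"bootstrap 95% CI = [{low:.2f}, {high:.2f}]")
```

Resampling indices rather than values is what preserves the pairing; shuffling x and y separately would destroy the correlation being estimated.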
How do I read a 95% CI on ρ?
An interval built by a procedure that captures the true population correlation 95% of the time across repeated samples. CI [−0.02, 0.62] for r = 0.34 with n = 30 says ρ is plausibly anywhere from about zero to large. If the CI excludes 0, the corresponding two-tailed test rejects ρ = 0; here it includes 0, so the test does not reject. The CI is wider on the side closer to 0 because of the Fisher transform's curvature. To compare two correlations, don't rely on whether their CIs overlap: overlapping intervals can still hide a significant difference. The formal test for ρ₁ ≠ ρ₂ in independent samples uses the difference of the Fisher-z-transformed values, with SE = √(1/(n₁ − 3) + 1/(n₂ − 3)).
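The formal comparison of two correlations can be sketched as below, assuming SciPy and assuming the two correlations come from independent samples (dependent correlations measured on the same subjects need a different test). `compare_independent_r` is a hypothetical helper, and the input values are illustrative.

```python
import math
from scipy import stats

def compare_independent_r(r1, n1, r2, n2):
    """Two-sided z test of H0: rho1 = rho2 for independent samples."""
    if n1 < 4 or n2 < 4:
        raise ValueError("each sample needs n >= 4")
    z_diff = math.atanh(r1) - math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z_stat = z_diff / se
    p = 2.0 * stats.norm.sf(abs(z_stat))
    return z_stat, p

z_stat, p = compare_independent_r(0.50, 100, 0.30, 100)
print(f"z = {z_stat:.2f}, two-sided p = {p:.3f}")
```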