
Run One-Sample, Two-Sample, and Paired T-Tests

Perform one-sample, two-sample (independent), or paired t-tests from summary statistics. Get t-statistic, p-value, confidence intervals, and effect size (Cohen's d).

Last Updated: February 13, 2026

You have two samples and want to know if they came from populations with different means. That's the question. Formally: H₀: μ₁ = μ₂ versus H₁: μ₁ ≠ μ₂ (or one-sided if you have a directional hypothesis you set in advance). The t statistic is the standardized difference: t = (x̄₁ − x̄₂) / SE, where SE depends on whether you assume equal variances (pooled) or not (Welch's). Default to Welch's unless you have strong prior reason to assume equal variances; Ruxton (2006) made the case clearly, and R's t.test() uses Welch as its default for good reason. Not every package does: SciPy's ttest_ind assumes equal variances unless you pass equal_var=False.

The page handles all three cases: one-sample (test against a known μ₀), two-sample independent, and paired (where the relevant statistic is on the difference scores, not the raw values). Output includes t, df, two-tailed p, the 95% CI on the mean difference, and Cohen's d so you have an effect size to report alongside the p-value. One trap: paired data analyzed as two independent samples inflates the variance estimate and robs the test of power. If you measured the same subjects twice, use paired.

Pick the Test Type (One, Two, Paired)

If you're trying to figure out whether a difference between means is real or just random noise, the t-test calculator does the heavy lifting. But first, you need to pick the right version. Most people get tripped up here because they grab whatever test looks familiar without checking whether it actually fits their data.

One-sample t-test: Use this when you have a single group and want to compare its average to some known or expected value. For instance, a company claims their light bulbs last 1,000 hours. You test 25 bulbs, record their lifespans, and ask: does my sample average differ from 1,000? The null hypothesis here is that your sample mean equals the claimed value.

Two-sample (independent) t-test: Use this when comparing two separate groups that have no overlap. Different people, different batches, different treatments. A classic example: does teaching method A produce higher scores than method B? Each student only experiences one method, so the groups are independent.

Paired t-test: Use this when the same subjects get measured twice, or when you have naturally matched pairs. Before-and-after blood pressure readings on the same patients. Left eye vs. right eye measurements. Twins assigned to different conditions. The pairing structure matters because it removes between-subject variability and focuses on within-subject change.

Quick decision rule: Count your groups. One group vs. a known value = one-sample. Two separate groups = two-sample. Same subjects measured twice = paired. Picking the wrong test doesn't just give you a different number—it can lead you to the wrong conclusion entirely.

Choosing pooled vs Welch's variance

Degrees of freedom (df) determine the shape of the t-distribution you're comparing against. The formula changes depending on your test type, and getting it wrong shifts your critical values and p-values.

One-sample: df = n − 1

Two-sample (pooled): df = n₁ + n₂ − 2

Two-sample (Welch): df from the Welch-Satterthwaite approximation (usually not an integer)

Paired: df = n − 1 (where n = number of pairs)
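The df rules above can be sketched in a few lines. This is a minimal illustration with hypothetical inputs; the function names are made up for this example and do not come from any library. The Welch version uses the Welch-Satterthwaite approximation.

```python
def pooled_df(n1, n2):
    """Student's pooled two-sample test: df = n1 + n2 - 2."""
    return n1 + n2 - 2


def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite df from sample SDs (s1, s2) and sizes (n1, n2)."""
    v1 = s1 ** 2 / n1  # squared standard error contributed by group 1
    v2 = s2 ** 2 / n2  # squared standard error contributed by group 2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


# Hypothetical inputs. Equal spread and equal n reproduce the pooled df;
# a small, noisy second group pulls the Welch df well below n1 + n2 - 2.
print(pooled_df(10, 10))
print(welch_df(5.0, 10, 5.0, 10))
print(pooled_df(40, 10))
print(welch_df(5.0, 40, 15.0, 10))
```

The Welch df never exceeds the pooled df, which is the "adjusting downward" behavior described in this section.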

For two-sample tests, you face a choice: assume equal variances (pooled method) or don't (Welch's method). The pooled approach combines both samples' variance estimates into one, which works fine when the two groups truly have similar spread. But if one group is more variable than the other, pooling produces misleading results.

Welch's t-test handles unequal variances by keeping separate variance estimates and adjusting the degrees of freedom downward. The trade-off: slightly less power when variances actually are equal. Most statisticians now recommend Welch as the default because it performs well either way, while the pooled test can fail badly when variances differ.

Practical tip: If your standard deviations differ by more than a factor of 2, or your sample sizes are unequal, lean toward Welch's method. When in doubt, Welch is safer.
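To see how much the choice can matter, here is a small sketch that runs both versions on the same summary statistics with scipy.stats.ttest_ind_from_stats. The numbers are hypothetical, chosen so that the smaller group is also the more variable one, which is the case where pooling tends to understate the uncertainty.

```python
from scipy import stats

# Hypothetical summary statistics: group 2 is smaller and far more variable.
m1, s1, n1 = 50.0, 5.0, 40
m2, s2, n2 = 54.0, 15.0, 10

# equal_var=True is Student's pooled test; equal_var=False is Welch's.
pooled = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=True)
welch = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)

print(f"pooled: t = {pooled.statistic:.3f}, p = {pooled.pvalue:.3f}")
print(f"Welch:  t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")
```

With these inputs the pooled test reports a larger |t| and a smaller p than Welch's, because the pooled standard error is dominated by the large, low-variance group.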

p-Value and t-Statistic With Interpretation

The t-statistic measures how far your observed difference is from zero, scaled by how much variability exists in your data. A larger absolute t-value means the difference is bigger relative to the noise. But the t-value alone doesn't tell you probability—that's where the p-value comes in.

One-sample: t = (x̄ − μ₀) / (s / √n)

Two-sample: t = (x̄₁ − x̄₂) / SE

Paired: t = d̄ / (s_d / √n)
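The three formulas translate directly into code. A minimal sketch working from summary statistics; the function names and the input values are hypothetical, and the two-sample version uses the Welch standard error √(s₁²/n₁ + s₂²/n₂).

```python
import math


def one_sample_t(mean, mu0, sd, n):
    """t = (x̄ − μ₀) / (s / √n)."""
    return (mean - mu0) / (sd / math.sqrt(n))


def welch_t(m1, s1, n1, m2, s2, n2):
    """t = (x̄₁ − x̄₂) / SE with the Welch (unpooled) standard error."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (m1 - m2) / se


def paired_t(mean_diff, sd_diff, n_pairs):
    """t = d̄ / (s_d / √n), computed on the difference scores."""
    return mean_diff / (sd_diff / math.sqrt(n_pairs))


# Hypothetical light-bulb sample: 25 bulbs, mean 980 h, SD 50 h, claim 1,000 h.
print(one_sample_t(980.0, 1000.0, 50.0, 25))
print(welch_t(50.0, 5.0, 40, 54.0, 15.0, 10))
print(paired_t(4.0, 8.0, 16))
```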

The p-value answers: if the null hypothesis were true (no real difference), what's the probability of getting a t-statistic at least as extreme as what you observed? A small p-value (typically below 0.05) suggests your data would be unusual if there were truly no effect, leading you to reject the null.

Two-tailed vs. one-tailed: Two-tailed tests check whether the true value differs from the hypothesized value in either direction. One-tailed tests only look in one direction, which gives you more power to detect effects that way but misses effects in the opposite direction. Specify your hypothesis before seeing the data—switching after the fact inflates your false positive rate.

Reading the result: p = 0.03 does not mean there's a 3% chance the null is true. It means that under the null, data this extreme would occur about 3% of the time by chance. The distinction matters when communicating findings.
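Turning a t-statistic into a p-value is a tail-area lookup on the t-distribution. A minimal sketch using scipy.stats.t, where sf is the survival function (1 − CDF); the helper name p_values and the inputs are hypothetical.

```python
from scipy import stats


def p_values(t_stat, df):
    """Two-sided and both one-sided p-values for a t-statistic."""
    return {
        # Area in both tails beyond |t|.
        "two_sided": 2 * stats.t.sf(abs(t_stat), df),
        # H1: true value is greater than hypothesized (upper tail only).
        "greater": stats.t.sf(t_stat, df),
        # H1: true value is less than hypothesized (lower tail only).
        "less": stats.t.cdf(t_stat, df),
    }


# Hypothetical result: t = 2.0 with 24 degrees of freedom.
for name, p in p_values(2.0, 24).items():
    print(f"{name}: {p:.4f}")
```

When t falls in the predicted direction, the one-sided p is exactly half the two-sided p, which is why choosing the tail after seeing the data is so tempting and so misleading.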

Effect Size Snapshot: Cohen's d

Statistical significance tells you whether the data would be surprising if there were no effect. Effect size tells you how big the effect is. You can have a tiny, meaningless difference that's statistically significant (with enough data) or a substantial difference that fails to reach significance (with too little data). Cohen's d puts the magnitude on a standardized scale.

One-sample: d = (x̄ − μ₀) / s

Two-sample (pooled): d = (x̄₁ − x̄₂) / s_pooled

Paired: d = d̄ / s_d
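A minimal sketch of the three d formulas from summary statistics. The function names are illustrative and the inputs are hypothetical (including the assumption of 25 observations per group).

```python
import math


def pooled_sd(s1, n1, s2, n2):
    """Pooled SD: weights each group's variance by its degrees of freedom."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def d_one_sample(mean, mu0, sd):
    return (mean - mu0) / sd


def d_two_sample(m1, s1, n1, m2, s2, n2):
    return (m1 - m2) / pooled_sd(s1, n1, s2, n2)


def d_paired(mean_diff, sd_diff):
    return mean_diff / sd_diff


# Hypothetical groups, assuming n = 25 each.
print(round(d_two_sample(85.2, 7.1, 25, 78.4, 8.3, 25), 2))
print(d_one_sample(980.0, 1000.0, 50.0))
print(d_paired(4.0, 8.0))
```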

Cohen's rough benchmarks: d around 0.2 is small, 0.5 is medium, 0.8 is large. But context matters. In clinical trials, even a small d can be clinically meaningful if it translates to fewer symptoms or longer survival. In education research, medium effects are often celebrated. The benchmarks are starting points, not hard rules.

|d| ≈ 0.2: small effect

|d| ≈ 0.5: medium effect

|d| ≥ 0.8: large effect

Always report both p-value and effect size. A result with p = 0.04 and d = 0.1 tells a very different story than p = 0.04 and d = 0.9. The first is statistically significant but practically negligible; the second is significant and substantial.

Assumptions Checklist Before You Trust It

The t-test isn't magic—it relies on certain conditions being at least approximately true. Violating these assumptions doesn't always ruin your analysis, but severe violations can mislead you.

Independence

Each observation should be independent of the others. If students in the same classroom influence each other's scores, or if you measure the same subject repeatedly without accounting for it, independence breaks down. Paired tests handle the "same subject twice" case, but more complex dependencies require different methods.

Normality

The t-test assumes data come from a normally distributed population. With larger samples (n > 30 per group), the Central Limit Theorem helps—sample means become approximately normal regardless of the underlying distribution. With smaller samples, check histograms or Q-Q plots for severe skewness or outliers.

Equal Variances (for pooled two-sample)

The pooled t-test assumes both groups have similar variability. If one group's standard deviation is much larger than the other's, use Welch's method instead. Levene's test can formally check this, but often a visual comparison or ratio check is enough.

Continuous Measurement

The dependent variable should be measured on an interval or ratio scale. Likert scales (1–5 ratings) are technically ordinal, though many researchers treat them as continuous when there are enough response options and the distribution isn't too skewed.

What breaks this test: Extreme outliers, severely non-normal distributions with small samples, non-independent observations, or mixing paired data with an independent test. When assumptions fail badly, consider non-parametric alternatives like Mann-Whitney U or Wilcoxon signed-rank tests.
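When raw data are available, the checks in this section take a few lines with SciPy. A rough sketch on hypothetical scores; Shapiro-Wilk is used here as one possible formal normality check alongside the plots mentioned above, and none of these tests replaces looking at the data.

```python
from scipy import stats

# Hypothetical raw scores for two independent groups.
group_a = [82, 85, 88, 79, 91, 84, 86, 90, 83, 87]
group_b = [75, 80, 72, 78, 95, 70, 77, 74, 79, 76]

# Shapiro-Wilk: a small p suggests a departure from normality.
for name, data in (("A", group_a), ("B", group_b)):
    w_stat, w_p = stats.shapiro(data)
    print(f"Shapiro-Wilk group {name}: W = {w_stat:.3f}, p = {w_p:.3f}")

# Levene (median-centered): a small p suggests unequal variances.
lev_stat, lev_p = stats.levene(group_a, group_b, center="median")
print(f"Levene: W = {lev_stat:.3f}, p = {lev_p:.3f}")

# Rank-based fallback if the assumptions look badly violated.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {u_p:.4f}")
```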

Common questions about t-tests

When should I use a t-test instead of a z-test?

Use a t-test when you're estimating the population standard deviation from your sample (which is almost always). The z-test requires you to already know the true population standard deviation—a rare situation outside textbook problems. In practice, the t-test is the default for comparing means.

My p-value is 0.06. Is that significant?

Strictly speaking, if you set α = 0.05 beforehand, then 0.06 doesn't cross the threshold. But statistical significance isn't a cliff edge. A p-value of 0.06 indicates moderate evidence—not enough to reject the null at the 0.05 level, but not nothing. Report the actual value, consider the effect size, and avoid treating 0.05 as a magic cutoff.

Can I use a t-test with unequal sample sizes?

Yes, but be careful. Unequal sample sizes combined with unequal variances can distort results with the pooled method. Welch's t-test handles this better because it doesn't assume equal variances and adjusts degrees of freedom accordingly.

What's the minimum sample size for a t-test?

Technically you need at least 2 observations per group, but practical recommendations suggest 10–30 per group for reasonable power and robustness. With very small samples, the normality assumption matters more, and you have limited ability to detect real effects.

How do I report t-test results properly?

Include the test type, t-statistic, degrees of freedom, p-value, effect size, and confidence interval. Example: "An independent samples t-test showed that Group A (n = 25, M = 85.2, SD = 7.1) scored significantly higher than Group B (n = 25, M = 78.4, SD = 8.3), t(48) = 3.11, p = 0.003, d = 0.88, 95% CI [2.4, 11.2]."

Why does the confidence interval matter if I already have a p-value?

The p-value tells you how surprising the data would be if there were no difference; the confidence interval tells you the plausible range of that difference. A CI of [0.5, 10.0] for a mean difference is much more informative than just "p < 0.05." If the interval barely excludes zero, you know the effect could be tiny.
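A confidence interval on the mean difference is the point estimate plus or minus a critical t value times the standard error. A minimal sketch for the Welch case, assuming hypothetical summary statistics; welch_ci is an illustrative name, not a library function.

```python
import math
from scipy import stats


def welch_ci(m1, s1, n1, m2, s2, n2, level=0.95):
    """Confidence interval for mean1 - mean2 without assuming equal variances."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    se = math.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    # Two-sided critical value, e.g. the 97.5th percentile for a 95% interval.
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df)
    diff = m1 - m2
    return diff - t_crit * se, diff + t_crit * se


# Hypothetical groups: means 12.0 and 10.0, SDs 4.0 and 3.0, n = 30 each.
low, high = welch_ci(12.0, 4.0, 30, 10.0, 3.0, 30)
print(f"95% CI for the mean difference: [{low:.2f}, {high:.2f}]")
```

An interval like this one, whose lower end sits close to zero, is the "could be tiny" situation: the result is significant at the 5% level, but very small true differences remain plausible.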

Limitations of the t-test

Independence: the binding assumption. Clustered data (students within classrooms, repeated measures on the same subject) inflates Type I error if you ignore the structure. Switch to mixed-effects models (R's lme4 package, or statsmodels.regression.mixed_linear_model in Python) or a proper repeated-measures setup.

Normality at small n: for n below ~30 per group, severe skew or heavy outliers distort p-values. Mann-Whitney for two-sample, Wilcoxon signed-rank for paired, are the non-parametric fallbacks.

Welch by default: default to Welch's variance unless you have a strong prior reason to assume equal variances. The cost when variances genuinely match is small. The protection when they don't is real (Ruxton 2006).

Note: The pitfall worth flagging up front: don't run twenty t-tests against the same data and report only the significant ones. That's not a 5% false-positive rate: with twenty independent tests at α = 0.05, the chance of at least one false positive is about 64%. Apply Bonferroni or Benjamini-Hochberg FDR. R's t.test() and scipy.stats.ttest_ind / ttest_rel are the standard implementations. The ASA Statement on p-values (2016) is the right reference for how to report and interpret.
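For m independent tests each run at level α, the chance of at least one false positive is 1 − (1 − α)^m. The sketch below computes that, then applies the two corrections named above to a hypothetical list of p-values. These are plain-Python illustrations of the textbook procedures, not a library's implementation.

```python
def familywise_error(alpha, m):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni(pvals, alpha=0.05):
    """Reject only where p <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up FDR: find the largest rank k with p_(k) <= (k / m) * alpha,
    then reject every hypothesis ranked at or below k."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max
    return reject


pvals = [0.001, 0.008, 0.025, 0.041, 0.20]  # hypothetical p-values
print(round(familywise_error(0.05, 20), 3))
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

On this list Bonferroni rejects two hypotheses and Benjamini-Hochberg rejects three, which reflects the usual trade-off: Bonferroni controls the chance of any false positive, while BH controls the expected share of false positives among the rejections.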


T-tests in practice: working questions

How do I choose between paired and unpaired t-tests?

Paired (or dependent) when the same subjects are measured twice, or when units are matched (twin studies, before/after on identical patients). The test runs on the difference scores, so within-subject variability cancels out. Unpaired (independent two-sample) when the two groups are different subjects with no natural pairing. The most common silent mistake: analyzing pre-post data on the same people as if it were two independent samples. That inflates the variance estimate and robs the test of power. If you measured the same units twice, use paired.
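The cost of that mistake is easy to demonstrate. The sketch below runs the same hypothetical before/after blood pressure readings through scipy.stats.ttest_rel (paired) and scipy.stats.ttest_ind (independent, Welch); the data are invented for illustration.

```python
from scipy import stats

# Hypothetical systolic readings on the same 8 patients, before and after.
before = [140, 152, 138, 160, 145, 155, 149, 142]
after = [135, 147, 136, 152, 141, 150, 146, 139]

# Correct: the test runs on the 8 within-patient differences.
paired = stats.ttest_rel(before, after)

# Incorrect for this design: treats the readings as 16 unrelated people,
# so between-patient spread swamps the consistent within-patient drop.
independent = stats.ttest_ind(before, after, equal_var=False)

print(f"paired:      t = {paired.statistic:.2f}, p = {paired.pvalue:.4f}")
print(f"independent: t = {independent.statistic:.2f}, p = {independent.pvalue:.4f}")
```

Every patient in this example improves by a few points, so the paired test finds a clear effect, while the independent analysis of the same numbers does not reach significance.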

Welch's vs Student's t-test, which is the default?

Welch's. R's t.test() defaults to Welch (var.equal = FALSE) for good reason: it doesn't assume equal variances, and Ruxton (2006) showed the cost in power when variances actually match is tiny while the protection when they don't is substantial. Student's pooled-variance t-test is the right choice only when you have a strong prior reason to believe variances are equal (for example, replicate measurements from the same instrument). For everything else, default to Welch and don't bother running Levene's test as a screen.

How small can n be for a t-test?

There's no hard floor, but small n needs the underlying distribution to be close to normal. With n = 5 per group, a heavy outlier dominates the test statistic. With n = 10-15, mild non-normality is fine. With n above 30, the Central Limit Theorem rescues you for most practical violations. The honest answer: if your data are clearly non-normal at small n, the t-test isn't the right tool. Mann-Whitney for two-sample, Wilcoxon signed-rank for paired, are the rank-based fallbacks that don't assume normality.

How do I interpret Cohen's d in practice?

Cohen's d = (x̄₁ − x̄₂) / s_pooled, the standardized mean difference. Cohen's original benchmarks (1988): 0.2 small, 0.5 medium, 0.8 large. Field-specific norms vary: in clinical trials, d = 0.3 may be clinically meaningful; in physics, d = 2 is common. With n = 5,000 per group, you can hit p < 0.001 at d = 0.1, which is statistically significant but practically trivial. Always report d alongside p. The point of effect size is to keep significance and importance separate quantities.
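For an equal-n pooled test, the t-statistic and the observed d are tied together by t = d · √(n/2), where n is the per-group sample size. The sketch below uses that identity to show how the p-value for a fixed observed d shrinks as n grows; the helper name and the sample sizes are hypothetical.

```python
import math
from scipy import stats


def two_sample_p_from_d(d, n_per_group):
    """Two-sided p for an equal-n pooled t-test whose observed effect size is d."""
    t_stat = d * math.sqrt(n_per_group / 2)
    df = 2 * n_per_group - 2
    return 2 * stats.t.sf(abs(t_stat), df)


# Same small effect (d = 0.1), increasingly large samples.
for n in (50, 500, 5000):
    print(f"n = {n:>4} per group: p = {two_sample_p_from_d(0.1, n):.2g}")
```

The effect size never changes in this loop; only the sample size does. That is the sense in which significance and importance are separate quantities.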

How do I report a t-test in a results section?

Standard format: "t(df) = X.XX, p = .XXX, d = X.XX, 95% CI [lower, upper]." Example: "Welch's t-test showed a significant difference in test scores between groups, t(47.3) = 3.21, p = .002, d = 0.78, 95% CI for the mean difference [2.2, 9.8]." Include the test variant (Welch's vs paired vs Student's), the test statistic, df, p-value to three decimals (or as p < .001 when it is smaller), Cohen's d, and the confidence interval. APA 7 requires effect size, which is why d matters even when significance carries the headline.

When should I use a one-tailed vs two-tailed test?

Two-tailed unless you have a strong directional hypothesis you set in advance. Most journal guidelines default to two-tailed for that reason. One-tailed is appropriate when theory or prior data lock in a direction (a new drug can't make symptoms worse; a new manufacturing process can't reduce yield below baseline). Picking one-tailed after seeing the data, because two-tailed didn't quite hit significance, halves your p-value and is textbook p-hacking. Pre-register the tail before collecting data.

What if my data have outliers?

Investigate before deciding. A single extreme value can flip a t-test result, especially at small n. First check if the outlier is a data-entry error or measurement glitch. If it's a real value, robust alternatives are the rank-based Mann-Whitney (for unpaired) or Wilcoxon signed-rank (for paired). Trimmed t-tests (Yuen's t-test) trim a fraction from each tail before computing. Don't quietly delete outliers; report what you did and why. Reviewers will ask.
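SciPy exposes Yuen's trimmed test through the trim argument of scipy.stats.ttest_ind (this assumes SciPy 1.7 or later, where the argument was added). A sketch on hypothetical data with one extreme value in the first group; the 20% trim fraction is a common choice, not a rule.

```python
from scipy import stats

# Hypothetical measurements; 48 in group_a is the suspect value.
group_a = [12, 14, 13, 15, 14, 13, 16, 15, 14, 48]
group_b = [10, 11, 12, 10, 11, 13, 12, 11, 10, 12]

# Ordinary Welch test: the outlier inflates group_a's variance.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Yuen's test: trims 20% of the observations from each tail of each group.
yuen = stats.ttest_ind(group_a, group_b, equal_var=False, trim=0.2)

print(f"Welch: t = {welch.statistic:.2f}, p = {welch.pvalue:.4f}")
print(f"Yuen:  t = {yuen.statistic:.2f}, p = {yuen.pvalue:.4f}")
```

Here the single extreme value hides a consistent difference between the groups; in other data sets an outlier can just as easily manufacture one. Either way, report both the trimmed and untrimmed results.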

How do I do a t-test from summary statistics instead of raw data?

Plug in n, x̄, and s for each group. The Welch t-statistic is t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂), with Welch-Satterthwaite degrees of freedom. R has BSDA::tsum.test for summary-stats t-tests; Python's SciPy has scipy.stats.ttest_ind_from_stats. Useful when the raw data isn't available (textbook problems, replicating a paper, regulatory comparisons). For paired tests from summary stats, you need the standard deviation of the differences, not the per-group SDs, and that's where most errors happen.
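A minimal sketch of both cases, with hypothetical inputs and illustrative function names. The Welch result is cross-checked against scipy.stats.ttest_ind_from_stats. For the paired case, the sketch assumes you know (or can recover from the paper) the correlation r between the two measurements, since s_d = √(s₁² + s₂² − 2·r·s₁·s₂); without r or s_d, a paired test cannot be reconstructed from per-group SDs alone.

```python
import math
from scipy import stats


def welch_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch t, Welch-Satterthwaite df, and two-sided p from summary stats."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t_stat = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df, 2 * stats.t.sf(abs(t_stat), df)


def sd_of_differences(s1, s2, r):
    """SD of paired differences from per-condition SDs and their correlation r."""
    return math.sqrt(s1 ** 2 + s2 ** 2 - 2 * r * s1 * s2)


def paired_from_summary(mean_diff, sd_diff, n_pairs):
    """Paired t, df, and two-sided p from the mean and SD of the differences."""
    t_stat = mean_diff / (sd_diff / math.sqrt(n_pairs))
    df = n_pairs - 1
    return t_stat, df, 2 * stats.t.sf(abs(t_stat), df)


# Hypothetical two-sample inputs, checked against SciPy's built-in.
t_stat, df, p = welch_from_summary(50.0, 5.0, 40, 54.0, 15.0, 10)
ref = stats.ttest_ind_from_stats(50.0, 5.0, 40, 54.0, 15.0, 10, equal_var=False)
print(f"manual: t = {t_stat:.4f}, df = {df:.2f}, p = {p:.4f}")
print(f"scipy:  t = {ref.statistic:.4f}, p = {ref.pvalue:.4f}")

# Hypothetical paired inputs: SDs of 8 in both conditions, r = 0.5, 16 pairs.
s_d = sd_of_differences(8.0, 8.0, 0.5)
print(paired_from_summary(4.0, s_d, 16))
```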