Run One-Sample, Two-Sample, and Paired T-Tests

Perform one-sample, two-sample (independent), or paired t-tests from summary statistics. Get t-statistic, p-value, confidence intervals, and effect size (Cohen's d).

Last Updated: February 13, 2026

Pick the Test Type (One, Two, Paired)

If you're trying to figure out whether a difference between means is real or just random noise, the t-test calculator does the heavy lifting. But first, you need to pick the right version. Most people get tripped up here because they grab whatever test looks familiar without checking whether it actually fits their data.

One-sample t-test: Use this when you have a single group and want to compare its average to some known or expected value. For instance, a company claims their light bulbs last 1,000 hours. You test 25 bulbs, record their lifespans, and ask: does my sample average differ from 1,000? The null hypothesis here is that your sample mean equals the claimed value.

Two-sample (independent) t-test: Use this when comparing two separate groups that have no overlap. Different people, different batches, different treatments. A classic example: does teaching method A produce higher scores than method B? Each student only experiences one method, so the groups are independent.

Paired t-test: Use this when the same subjects get measured twice, or when you have naturally matched pairs. Before-and-after blood pressure readings on the same patients. Left eye vs. right eye measurements. Twins assigned to different conditions. The pairing structure matters because it removes between-subject variability and focuses on within-subject change.

Quick decision rule: Count your groups. One group vs. a known value = one-sample. Two separate groups = two-sample. Same subjects measured twice = paired. Picking the wrong test doesn't just give you a different number—it can lead you to the wrong conclusion entirely.

Degrees of Freedom and Variance Options Explained

Degrees of freedom (df) determine the shape of the t-distribution you're comparing against. The formula changes depending on your test type, and getting it wrong shifts your critical values and p-values.

One-sample: df = n − 1

Two-sample (pooled): df = n₁ + n₂ − 2

Two-sample (Welch): df from the Welch–Satterthwaite approximation (usually not an integer)

Paired: df = n − 1 (where n = number of pairs)
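The Welch–Satterthwaite df can be computed directly from the two samples' standard deviations and sizes. A minimal sketch (the SDs and sample sizes below are illustrative, not from real data):

```python
# Welch–Satterthwaite approximation for two-sample degrees of freedom.
# Input values (s1, n1, s2, n2) are made-up illustrative numbers.

def welch_df(s1: float, n1: int, s2: float, n2: int) -> float:
    """Approximate df for Welch's t-test from sample SDs and sizes."""
    v1 = s1**2 / n1          # variance of the first sample mean
    v2 = s2**2 / n2          # variance of the second sample mean
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

df = welch_df(s1=7.1, n1=25, s2=8.3, n2=25)
print(df)  # non-integer, and never more than n1 + n2 - 2 = 48
```

When the two SDs and sample sizes are identical, the formula collapses back to the pooled value n₁ + n₂ − 2; the more they differ, the further the df drops below it.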

For two-sample tests, you face a choice: assume equal variances (pooled method) or don't (Welch's method). The pooled approach combines both samples' variance estimates into one, which works fine when the two groups truly have similar spread. But if one group is more variable than the other, pooling produces misleading results.

Welch's t-test handles unequal variances by keeping separate variance estimates and adjusting the degrees of freedom downward. The trade-off: slightly less power when variances actually are equal. Most statisticians now recommend Welch as the default because it performs well either way, while the pooled test can fail badly when variances differ.

Practical tip: If your standard deviations differ by more than a factor of 2, or your sample sizes are unequal, lean toward Welch's method. When in doubt, Welch is safer.
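Both methods can be run from summary statistics alone. A sketch using scipy's `ttest_ind_from_stats` (scipy is assumed to be installed; the means, SDs, and sample sizes are illustrative):

```python
# Pooled vs. Welch two-sample t-test from summary statistics,
# using scipy.stats.ttest_ind_from_stats (scipy assumed installed).
from scipy.stats import ttest_ind_from_stats

stats = dict(mean1=85.2, std1=7.1, nobs1=25,
             mean2=78.4, std2=8.3, nobs2=25)

t_pooled, p_pooled = ttest_ind_from_stats(**stats, equal_var=True)
t_welch, p_welch = ttest_ind_from_stats(**stats, equal_var=False)

print(f"pooled: t = {t_pooled:.3f}, p = {p_pooled:.4f}")
print(f"Welch:  t = {t_welch:.3f}, p = {p_welch:.4f}")
```

With equal sample sizes the two t-statistics coincide and only the df (and hence the p-value) differ; Welch's p is never smaller than the pooled p in that case, which is the "slightly less power" trade-off described above.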

p-Value and t-Statistic With Interpretation

The t-statistic measures how far your observed difference is from zero, scaled by how much variability exists in your data. A larger absolute t-value means the difference is bigger relative to the noise. But the t-value alone doesn't tell you probability—that's where the p-value comes in.

One-sample: t = (x̄ − μ₀) / (s / √n)

Two-sample: t = (x̄₁ − x̄₂) / SE, where SE = s_pooled · √(1/n₁ + 1/n₂) for the pooled test or SE = √(s₁²/n₁ + s₂²/n₂) for Welch's

Paired: t = d̄ / (s_d / √n)
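The one-sample formula is simple enough to compute by hand. A sketch using the light-bulb scenario from earlier (the sample mean of 980 hours and SD of 50 are made-up illustrative numbers):

```python
# One-sample t-statistic from summary statistics, using the light-bulb
# example: claimed mean 1,000 hours, illustrative sample x̄ = 980, s = 50, n = 25.
from math import sqrt

def one_sample_t(mean: float, mu0: float, sd: float, n: int) -> float:
    """t = (x̄ − μ₀) / (s / √n)."""
    return (mean - mu0) / (sd / sqrt(n))

t = one_sample_t(mean=980, mu0=1000, sd=50, n=25)
print(t)  # -2.0: the sample mean sits 2 standard errors below the claim
```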

The p-value answers: if the null hypothesis were true (no real difference), what's the probability of getting a t-statistic at least as extreme as what you observed? A small p-value (typically below 0.05) suggests your data would be unusual if there were truly no effect, leading you to reject the null.

Two-tailed vs. one-tailed: Two-tailed tests check whether the true value differs from the hypothesized value in either direction. One-tailed tests only look in one direction, which gives you more power to detect effects that way but misses effects in the opposite direction. Specify your hypothesis before seeing the data—switching after the fact inflates your false positive rate.

Reading the result: p = 0.03 does not mean there's a 3% chance the null is true. It means that under the null, data this extreme would occur about 3% of the time by chance. The distinction matters when communicating findings.
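The one- vs. two-tailed distinction is easiest to see numerically. A sketch computing both p-values for the same t-statistic (t = −2.0 and df = 24 are illustrative; scipy assumed installed):

```python
# Two-tailed vs. one-tailed p-values for the same t-statistic
# (t = -2.0, df = 24 are illustrative values; scipy assumed installed).
from scipy.stats import t as t_dist

t_stat, df = -2.0, 24

p_two = 2 * t_dist.sf(abs(t_stat), df)   # difference in either direction
p_lower = t_dist.cdf(t_stat, df)         # H1: mean is below the hypothesized value
print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_lower:.4f}")
```

The two-tailed p is exactly double the one-tailed p for the matching direction, which is why a pre-registered directional hypothesis buys power, and why switching tails after seeing the data is cheating.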

Effect Size Snapshot: Cohen's d

Statistical significance tells you whether an effect probably exists. Effect size tells you how big it is. You can have a tiny, meaningless difference that's statistically significant (with enough data) or a substantial difference that fails to reach significance (with too little data). Cohen's d puts the magnitude on a standardized scale.

One-sample: d = (x̄ − μ₀) / s

Two-sample (pooled): d = (x̄₁ − x̄₂) / s_pooled

Paired: d = d̄ / s_d
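The pooled version requires one extra step, combining the two SDs before standardizing. A minimal sketch (the means, SDs, and sample sizes are illustrative):

```python
# Pooled Cohen's d from summary statistics (illustrative numbers).
from math import sqrt

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    s_pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

d = cohens_d(85.2, 7.1, 25, 78.4, 8.3, 25)
print(round(d, 2))  # roughly 0.88: a large effect by Cohen's benchmarks
```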

Cohen's rough benchmarks: d around 0.2 is small, 0.5 is medium, 0.8 is large. But context matters. In clinical trials, even a small d can be clinically meaningful if it translates to fewer symptoms or longer survival. In education research, medium effects are often celebrated. The benchmarks are starting points, not hard rules.

|d| ≈ 0.2: small effect
|d| ≈ 0.5: medium effect
|d| ≥ 0.8: large effect

Always report both p-value and effect size. A result with p = 0.04 and d = 0.1 tells a very different story than p = 0.04 and d = 0.9. The first is statistically significant but practically negligible; the second is significant and substantial.

Assumptions Checklist Before You Trust It

The t-test isn't magic—it relies on certain conditions being at least approximately true. Violating these assumptions doesn't always ruin your analysis, but severe violations can mislead you.

Independence

Each observation should be independent of the others. If students in the same classroom influence each other's scores, or if you measure the same subject repeatedly without accounting for it, independence breaks down. Paired tests handle the "same subject twice" case, but more complex dependencies require different methods.

Normality

The t-test assumes data come from a normally distributed population. With larger samples (n > 30 per group), the Central Limit Theorem helps—sample means become approximately normal regardless of the underlying distribution. With smaller samples, check histograms or Q-Q plots for severe skewness or outliers.

Equal Variances (for pooled two-sample)

The pooled t-test assumes both groups have similar variability. If one group's standard deviation is much larger than the other's, use Welch's method instead. Levene's test can formally check this, but often a visual comparison or ratio check is enough.
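The ratio check mentioned above can be sketched in a few lines. Note the factor-of-2 cutoff is a heuristic from this article, not a formal test like Levene's:

```python
# Rough rule-of-thumb variance check: the factor-of-2 cutoff is a
# heuristic, not a formal test like Levene's.

def suggest_method(s1: float, s2: float) -> str:
    """Return 'welch' when SDs differ by more than 2x, else 'pooled'."""
    ratio = max(s1, s2) / min(s1, s2)
    return "welch" if ratio > 2 else "pooled"

print(suggest_method(7.1, 8.3))   # pooled: the spreads are similar
print(suggest_method(4.0, 11.0))  # welch: one group is far more variable
```

Even when the check says "pooled", Welch remains a safe default, as discussed earlier.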

Continuous Measurement

The dependent variable should be measured on an interval or ratio scale. Likert scales (1–5 ratings) are technically ordinal, though many researchers treat them as continuous when there are enough response options and the distribution isn't too skewed.

What breaks this test: Extreme outliers, severely non-normal distributions with small samples, non-independent observations, or mixing paired data with an independent test. When assumptions fail badly, consider non-parametric alternatives like Mann-Whitney U or Wilcoxon signed-rank tests.

t-Test Questions, Answered

When should I use a t-test instead of a z-test?

Use a t-test when you're estimating the population standard deviation from your sample (which is almost always). The z-test requires you to already know the true population standard deviation—a rare situation outside textbook problems. In practice, the t-test is the default for comparing means.

My p-value is 0.06. Is that significant?

Strictly speaking, if you set α = 0.05 beforehand, then 0.06 doesn't cross the threshold. But statistical significance isn't a cliff edge. A p-value of 0.06 indicates moderate evidence—not enough to reject the null at the 0.05 level, but not nothing. Report the actual value, consider the effect size, and avoid treating 0.05 as a magic cutoff.

Can I use a t-test with unequal sample sizes?

Yes, but be careful. Unequal sample sizes combined with unequal variances can distort results with the pooled method. Welch's t-test handles this better because it doesn't assume equal variances and adjusts degrees of freedom accordingly.

What's the minimum sample size for a t-test?

Technically you need at least 2 observations per group, but practical recommendations suggest 10–30 per group for reasonable power and robustness. With very small samples, the normality assumption matters more, and you have limited ability to detect real effects.

How do I report t-test results properly?

Include the test type, t-statistic, degrees of freedom, p-value, effect size, and confidence interval. Example: "An independent samples t-test showed that Group A (M = 85.2, SD = 7.1, n = 25) scored significantly higher than Group B (M = 78.4, SD = 8.3, n = 25), t(48) = 3.11, p = 0.003, d = 0.88, 95% CI [2.4, 11.2]."

Why does the confidence interval matter if I already have a p-value?

The p-value tells you whether the difference is likely real; the confidence interval tells you the plausible range of that difference. A CI of [0.5, 10.0] for a mean difference is much more informative than just "p < 0.05." If the interval barely excludes zero, you know the effect could be tiny.
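The interval itself comes straight from the summary statistics. A sketch for the pooled two-sample case (the numbers are illustrative; scipy assumed installed for the t critical value):

```python
# 95% CI for a two-sample mean difference from summary statistics
# (pooled method; all numbers are illustrative; scipy assumed installed).
from math import sqrt
from scipy.stats import t as t_dist

m1, s1, n1 = 85.2, 7.1, 25
m2, s2, n2 = 78.4, 8.3, 25

df = n1 + n2 - 2
s_pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
se = s_pooled * sqrt(1 / n1 + 1 / n2)

diff = m1 - m2
margin = t_dist.ppf(0.975, df) * se   # critical t for a 95% interval
ci = (diff - margin, diff + margin)
print(f"difference = {diff:.1f}, 95% CI [{ci[0]:.1f}, {ci[1]:.1f}]")
```

Because the interval excludes zero, the difference is significant at the 0.05 level, and its lower bound shows the smallest difference the data comfortably supports.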

Limitations and Scope

• Normality matters for small samples: With n < 30, severe skewness or heavy tails can distort p-values. Consider non-parametric tests (Mann-Whitney, Wilcoxon) for clearly non-normal data.

• Independence is non-negotiable: Correlated observations (clustered data, repeated measures without proper structure) require different methods—mixed models or repeated measures ANOVA.

• Pooled variance requires similar spreads: When group variances differ substantially, Welch's method protects against inflated Type I error. Default to Welch unless you have strong reason to believe variances are equal.

• Effect size adds meaning: A statistically significant result (p < 0.05) with a trivial effect size (d = 0.1) may not warrant action. Always interpret p-values alongside practical magnitude.

Note: This calculator is for educational purposes. For research publications, clinical decisions, or business planning, verify results with statistical software and consult a statistician when assumptions are unclear.

Frequently Asked Questions

Common questions about t-tests, one-sample and two-sample tests, paired t-tests, p-values, confidence intervals, effect sizes, assumptions, and how to use this calculator for homework and statistics practice.

What is a t-test and when should I use it?

A t-test is a statistical hypothesis test used to determine if there's a significant difference between means. Use a one-sample t-test to compare a sample mean to a known value, a two-sample t-test to compare two independent groups, or a paired t-test for before/after or matched pair comparisons. The t-test is appropriate when you have continuous data and want to test hypotheses about means.

What's the difference between one-tailed and two-tailed tests?

A two-tailed test checks if the mean is different (either higher or lower) from the hypothesized value. A one-tailed test checks for a difference in only one direction (either higher OR lower). Use two-tailed tests when you don't have a specific directional hypothesis. One-tailed tests have more power to detect differences in the specified direction but can miss effects in the opposite direction.

How do I interpret the p-value?

The p-value is the probability of observing results as extreme as yours (or more extreme) if the null hypothesis is true. If p < α (commonly 0.05), reject the null hypothesis—the difference is statistically significant. A small p-value doesn't prove the alternative hypothesis; it just suggests the data is unlikely under the null. Also consider effect size for practical significance.

What is Cohen's d and why does it matter?

Cohen's d is a measure of effect size that tells you the practical significance of a difference, independent of sample size. Values around 0.2 indicate a small effect, 0.5 a medium effect, and 0.8+ a large effect. Even if a result is statistically significant, a small effect size might mean the difference isn't practically meaningful. Always report both p-value and effect size.

When should I use Welch's t-test vs. the pooled t-test?

Use the pooled (equal variance) t-test when you're confident both groups have similar variances. Use Welch's t-test when variances are unequal or when you're unsure. Welch's test is more robust and is often recommended as the default choice because it performs well even when variances are equal, while the pooled test can give misleading results with unequal variances.

What sample size do I need for a t-test?

The t-test can work with small samples (even n < 30), but smaller samples require the data to be more normally distributed. With larger samples (n > 30), the Central Limit Theorem helps make the test robust to non-normality. For reliable results with small samples, check that your data doesn't have severe skewness or outliers. Power analysis can help determine the sample size needed to detect a specific effect.

What does the confidence interval tell me?

The confidence interval provides a range of plausible values for the true population parameter (mean or mean difference). A 95% CI means that if you repeated the study many times, 95% of the calculated intervals would contain the true value. For two-sample and paired tests, if the CI for the difference doesn't include zero, the difference is significant at that confidence level.

Can I use a t-test with non-normal data?

The t-test is fairly robust to violations of normality, especially with larger samples (n > 30). For small samples with clearly non-normal data, consider: (1) transforming the data, (2) using a non-parametric alternative like the Mann-Whitney U test or Wilcoxon signed-rank test, or (3) bootstrapping methods. Check for severe skewness and outliers, which can affect results more than mild non-normality.

What's the difference between statistical significance and practical significance?

Statistical significance (p < α) tells you whether an effect is likely real (not due to chance). Practical significance tells you whether the effect is meaningful in the real world. With a large enough sample, even tiny differences can be statistically significant. That's why effect size (Cohen's d) is important—it helps you judge whether a significant result actually matters in practice.

Why do I need summary statistics instead of raw data?

This calculator accepts summary statistics (mean, standard deviation, sample size) for efficiency and privacy. These are the values you'd compute from raw data anyway. Most statistical software and textbooks provide formulas using summary statistics. If you have raw data, first calculate the mean and standard deviation before using this tool. For paired data, calculate the mean and standard deviation of the differences.
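Getting from raw paired data to the required summaries takes only a few lines. A sketch with made-up before/after readings:

```python
# From raw paired data to the summary statistics this calculator needs:
# the mean and SD of the per-pair differences (readings are made up).
from math import sqrt
from statistics import mean, stdev

before = [142, 138, 150, 145, 160, 139, 148, 152]
after = [140, 139, 147, 145, 156, 138, 146, 149]

diffs = [b - a for b, a in zip(before, after)]
d_bar, s_d, n = mean(diffs), stdev(diffs), len(diffs)

t = d_bar / (s_d / sqrt(n))   # paired t-statistic from the summaries
print(f"mean diff = {d_bar:.2f}, SD of diffs = {s_d:.2f}, n = {n}, t = {t:.2f}")
```

Feeding `d_bar`, `s_d`, and `n` into the calculator's paired mode reproduces the same t-statistic.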
