
Confidence Intervals for Means, Proportions, and Differences

Compute confidence intervals for means (Z/t), proportions, and differences. Shows standard error, critical value, margin of error, and error-bar graph.

Last Updated: February 13, 2026

A confidence interval is an interval estimate, not a point. Instead of reporting "the mean is 47.3," you report a range that, under repeated-sampling logic, captures the true parameter at the stated rate: 95 times in 100 for a 95% CI built from fresh samples over and over. This page handles a single mean (z when σ is known, t when estimated from data), a single proportion, and the difference of two means or two proportions.

Half-width (the margin of error) = critical value × standard error; the full interval is twice that wide. The critical value is z (1.96 at 95%) or t (df-dependent, wider than z at small n). Standard error is s/√n for a mean, √(p(1−p)/n) for a proportion. Halving the CI width takes quadrupling n, not doubling, since n enters through √n. The interpretation trap people miss: a single 95% CI doesn't have a 95% probability of containing the true value. That's the long-run capture rate of the procedure, not a posterior. If you actually want the posterior probability, you want a Bayesian credible interval, not a frequentist CI.

Your Interval, Margin of Error, and Standard Error

A confidence interval calculator takes your sample mean, sample size, and variability estimate, then returns three interconnected values: the interval bounds, the margin of error, and the standard error. Polling firms report "52% approve, margin of error 3 points" because readers grasp that the true approval sits somewhere between 49% and 55%. That margin of error comes directly from multiplying the standard error by a critical value linked to your chosen confidence level.
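As a rough check on the polling example, the margin of error can be reproduced from the proportion formula. This is a minimal sketch using only Python's standard library; the sample size of 1,067 is a hypothetical value chosen for illustration, not a figure the poll above states, and `proportion_margin` is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist

def proportion_margin(p_hat, n, confidence=0.95):
    """Wald margin of error for a proportion: z * sqrt(p(1-p)/n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    se = sqrt(p_hat * (1 - p_hat) / n)              # standard error of p_hat
    return z * se

# Hypothetical poll: 52% approval among n = 1067 respondents (assumed n).
p_hat, n = 0.52, 1067
moe = proportion_margin(p_hat, n)
print(f"margin of error: {moe:.3f}")
print(f"interval: [{p_hat - moe:.3f}, {p_hat + moe:.3f}]")
```

With these assumed inputs the margin comes out close to 3 points, which is why polls of roughly a thousand respondents so often quote that figure.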

Standard error measures how much the sample statistic would bounce around if you repeated the study many times. For a mean, it equals s / √n—the sample standard deviation divided by the square root of n. Larger samples shrink that denominator, tightening your estimate. Double the sample size and your standard error drops by roughly 30%. This relationship shapes every study design: precision costs data.
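The scaling can be checked numerically. The sketch below assumes a hypothetical sample standard deviation of 12 and compares the standard error at n = 50, 100, and 200; only the ratios matter, not the specific numbers.

```python
from math import sqrt

def standard_error(s, n):
    """Standard error of a sample mean: s / sqrt(n)."""
    return s / sqrt(n)

s = 12.0  # hypothetical sample standard deviation
base = standard_error(s, 50)
doubled = standard_error(s, 100)     # sample size doubled
quadrupled = standard_error(s, 200)  # sample size quadrupled

print(f"n=50:  SE = {base:.3f}")
print(f"n=100: SE = {doubled:.3f} ({1 - doubled / base:.0%} smaller)")
print(f"n=200: SE = {quadrupled:.3f} ({1 - quadrupled / base:.0%} smaller)")
```

Doubling n multiplies the standard error by 1/√2 (a drop of about 29%), while quadrupling n is what it takes to cut it in half.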

Margin of error then scales standard error by a multiplier from the z or t distribution. A 95% interval uses approximately 1.96 standard errors on each side; a 99% interval stretches to about 2.58. The wider you want your net, the larger that multiplier. But remember, the interval isn't a probability statement about where the true parameter currently lives—it's a statement about how often this procedure captures the truth across repeated samples.

Clinical trials sometimes report a confidence interval such as [0.5, 1.2] for a hazard ratio. Because that interval includes 1, the data cannot rule out "no effect"; only an interval that excludes 1 would lead researchers to conclude the treatment changes risk. A quality engineer might get [4.97, 5.03] millimeters for a shaft diameter. That tight band signals the production process stays on spec. The numbers vary, but the mechanics stay the same: center plus or minus margin.

Z vs t: Small-Sample Reality Check

When your sample exceeds about 30 observations and you estimate population spread from the data, z and t critical values nearly overlap. At df = 120, t sits around 1.98 versus z's 1.96—barely a rounding difference. But drop to df = 10 and t jumps to 2.23. That extra width compensates for the added uncertainty in estimating the standard deviation from a handful of points.

The t-distribution was William Gosset's workaround at the Guinness brewery, where small batches made relying on large-sample theory foolish. His insight: replace the normal curve with one carrying heavier tails, and those tails shrink as data accumulates. Software usually defaults to t when you supply a sample standard deviation rather than a known population sigma. Unless you have historical data pinning the variance exactly—like decades of quality-control records for the same production line—stick with t.

A common mistake is plugging in n instead of n − 1 for degrees of freedom. For a single mean, df = n − 1. For the difference between two independent means using pooled variance, df = n₁ + n₂ − 2. Paired data? df = number of pairs minus one. Getting this wrong widens or narrows your interval incorrectly, especially when n is small enough for the difference to matter.

A researcher with 8 blood-pressure readings might see a t multiplier near 2.36 instead of 1.96. That stretch adds roughly 20% to the margin of error compared with a z interval. Ignoring it underestimates uncertainty and over-promises precision. Small-sample analysis demands the heavier-tailed distribution, no shortcuts.
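The t multipliers quoted in this section can be reproduced without a statistics package. The sketch below is one possible approach, not the method this calculator necessarily uses: it integrates the t density with Simpson's rule and finds the critical value by bisection. The helper names (`t_pdf`, `t_cdf`, `t_critical`) are illustrative; in practice a library routine such as `scipy.stats.t.ppf` would be the usual choice.

```python
from math import exp, lgamma, pi, sqrt
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """CDF for x >= 0: 0.5 plus Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(confidence, df):
    """Two-sided critical value, found by bisection on the CDF."""
    target = 0.5 + confidence / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = NormalDist().inv_cdf(0.975)
crit = {df: t_critical(0.95, df) for df in (7, 10, 120)}
for df, t in crit.items():
    print(f"df={df:>3}: t* = {t:.3f}, z = {z:.3f}, ratio = {t / z:.3f}")
```

The df = 7 row corresponds to the 8-reading example: the ratio of t to z is the factor by which the margin of error stretches.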

Confidence Level Tradeoffs (90/95/99)

A 95% confidence level became the social-science default partly because Ronald Fisher treated 0.05 as a convenient significance cutoff, but that doesn't mean it fits every scenario. Medical device approvals might demand 99% to limit false assurances about safety. Exploratory marketing surveys often settle for 90% because a bit more uncertainty is acceptable when stakes are lower and budgets tighter.

Raising confidence from 95% to 99% inflates the multiplier from roughly 1.96 to 2.58. If your standard error is 2 units, the margin jumps from 3.92 to 5.16—about 30% wider. That extra width buys coverage in 99 out of 100 repeated samples rather than 95. Whether that reassurance is worth the fuzzier estimate depends on consequences: a missed defect in an airplane bolt matters more than overestimating how many people prefer cola A.
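The tradeoff is easy to tabulate. This sketch uses the standard error of 2 units from the paragraph above and the standard library's normal quantile function; `z_multiplier` is an illustrative helper name. With the unrounded 99% multiplier (about 2.576) the margin comes out near 5.15 rather than the 5.16 obtained from the rounded 2.58.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided normal critical value for the given confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

se = 2.0  # standard error from the paragraph above
margins = {level: z_multiplier(level) * se for level in (0.90, 0.95, 0.99)}
for level, margin in margins.items():
    print(f"{level:.0%}: z = {z_multiplier(level):.3f}, margin = {margin:.2f}")
```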

Dropping to 90% does the opposite. The multiplier shrinks to 1.645, giving tighter bounds. Some regulatory frameworks permit 90% intervals for noninferiority trials when the cost of missing a real difference is acceptable. The key is matching the level to the decision context rather than defaulting mindlessly to 95%.

Finance analysts building Value-at-Risk models often use 99% or even 99.5%. Environmental scientists might use 95% but then conduct sensitivity analyses at 90% and 99% to see how conclusions shift. Choosing a confidence level is a judgment call blending convention, regulatory requirements, and the cost of being wrong.

Interpreting a CI Without Overclaiming

A 95% confidence interval does not mean "there is a 95% chance the true value lies inside." Once you calculate the interval, the parameter is either in or out—probability is 0 or 1. The 95% refers to the procedure's long-run hit rate: repeat the sampling many times, and roughly 95 of every 100 intervals capture the truth. This subtlety trips up journalists and students alike.

Think of it like archery with a blindfold. Each shot corresponds to one sample, and the bullseye is the true parameter. A 95% confidence interval is akin to a method that hits the bullseye 95% of the time across many shots, but any single arrow either struck or missed—there's no partial credit after the fact.
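The long-run hit rate can be seen in a small simulation. The sketch below makes several assumptions for illustration: a normal population with a known mean of 50 and known σ of 10, samples of 25, and a z-interval (σ treated as known). It counts how many of 2,000 simulated studies produce an interval that contains the true mean.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage_rate(true_mean=50.0, sigma=10.0, n=25, reps=2000,
                  confidence=0.95, seed=1):
    """Fraction of z-intervals (sigma known) that contain true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        xbar = fmean(sample)
        if xbar - margin <= true_mean <= xbar + margin:
            hits += 1  # this particular interval captured the truth
    return hits / reps

rate = coverage_rate()
print(f"coverage over 2000 simulated studies: {rate:.3f}")
```

Each simulated interval either contains 50 or it doesn't; the 95% only shows up as the proportion of hits across the whole batch.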

For differences—say, treatment minus control—an interval excluding zero signals statistical significance at the complementary alpha level. A 95% CI for a mean difference that runs from 2.1 to 8.3 implies p < 0.05 in a two-tailed test. But an interval of [−0.5, 4.2] spanning zero doesn't prove no effect; it means the data can't rule out zero at that confidence level. Maybe a larger sample would tighten things.

When reporting, say "we are 95% confident" rather than "there is a 95% probability." Alternatively, phrase it as "intervals constructed this way capture the true parameter in 95% of similarly conducted studies." Clear wording prevents readers from treating the interval as a Bayesian credible region, which would require a prior distribution and belongs to a different framework entirely.

Computation Notes and Assumptions

Every confidence interval rests on assumptions. For means, the classic t-interval assumes the underlying population is roughly normal or that n is large enough for the central limit theorem to kick in. Severely skewed distributions—like income data with a long right tail—can distort coverage in small samples. A log transformation or bootstrap approach may work better.

Independence is another pillar. Observations must not cluster in ways that inflate similarity—students from the same classroom, multiple measurements from the same patient. Ignoring clustering underestimates the true variance, making intervals too narrow and false confidence too high.

For proportions, the Wald interval (p-hat plus or minus z times standard error) can misbehave when the sample proportion is near zero or one. The Wilson score interval adjusts for this by inverting a hypothesis test, and modern software often defaults to Wilson or the Agresti-Coull correction. If your data include few successes or failures, check which method your tool uses.
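To see the difference, here is a minimal sketch of both intervals using the textbook formulas. The data (1 success in 20 trials) are hypothetical, and the function names are illustrative rather than any particular package's API.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.95):
    """p_hat +/- z * sqrt(p_hat(1 - p_hat)/n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval, obtained by inverting the score test."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical data: 1 success in 20 trials.
wald = wald_interval(1, 20)
wilson = wilson_interval(1, 20)
print(f"Wald:   [{wald[0]:.4f}, {wald[1]:.4f}]")
print(f"Wilson: [{wilson[0]:.4f}, {wilson[1]:.4f}]")
```

With so few successes the Wald lower bound falls below zero, while the Wilson interval stays inside [0, 1].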

Two-sample intervals for mean differences can assume pooled variance—appropriate when both groups share roughly equal spreads—or use Welch's adjustment for unequal variances. Welch's method is generally safer and is the default in many statistical packages, including R's t.test function. Pooled intervals assume homoscedasticity; violating this assumption can distort coverage in unpredictable directions.

Common questions about confidence intervals

Can a confidence interval go negative when the quantity can't be negative?

Yes, mathematically. A symmetric interval for a standard deviation might yield −0.3 to 1.2, even though a standard deviation can't be negative. The fix is a transformation (work in log scale for ratios) or a method designed for bounded quantities like proportions. Negative bounds signal the method doesn't fit the parameter space well.

How do I shrink my margin of error?

Increase sample size or accept a lower confidence level. Quadrupling n roughly halves the standard error and thus the margin. Switching from 99% to 95% cuts the multiplier, narrowing the interval without collecting more data—but you accept a higher miss rate.

What's the difference between confidence interval and credible interval?

A confidence interval is frequentist: it says the method captures the parameter in some percentage of repeated samples. A Bayesian credible interval says there's a given probability the parameter lies inside, conditional on observed data and a prior. Same numeric output sometimes, fundamentally different interpretation.

Why does my software give asymmetric intervals for some statistics?

For ratios, odds ratios, or hazard ratios, the sampling distribution is often skewed. Symmetric intervals would ignore that shape, so methods compute intervals on a log scale and exponentiate back. The result is asymmetric around the point estimate, which better reflects uncertainty for multiplicative quantities.
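A minimal sketch of the log-scale approach, assuming a hypothetical hazard ratio of 0.80 whose log has a standard error of 0.20 (both values are made up for illustration):

```python
from math import exp, log
from statistics import NormalDist

def ratio_interval(ratio, se_log, confidence=0.95):
    """Build a symmetric interval for log(ratio), then exponentiate back."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    center = log(ratio)
    return exp(center - z * se_log), exp(center + z * se_log)

# Hypothetical inputs: ratio estimate 0.80, standard error of log(ratio) 0.20.
ratio = 0.80
lower, upper = ratio_interval(ratio, 0.20)
print(f"ratio = {ratio:.2f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
print(f"distance below: {ratio - lower:.3f}, distance above: {upper - ratio:.3f}")
```

The upper bound sits farther from the estimate than the lower bound does; the interval is symmetric only on the log scale.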

Is a wider interval always worse?

Not necessarily. A wide interval honestly reflects high uncertainty—maybe your sample is small or variability is large. A misleadingly narrow interval gives false precision. Width should match reality. If the interval is too wide for your decision, collect more data rather than pretending certainty you don't have.

Limitations of confidence interval procedures

Distribution assumptions: normal for means via z, t for means with estimated σ, normal approximation for proportions when np and n(1−p) both exceed 5. Violations produce intervals with the wrong coverage.

Random sampling: required. Convenience samples, self-selection, or non-random recruitment produce biased estimates that no interval width can correct.

Independence: assumed. Clustered, time-series, or multi-stage sampled data needs design-based variance (R's survey package, Stata's svyset).

Interpretation: a single 95% CI doesn't have a 95% probability of containing the true parameter. That's the long-run capture rate of the procedure across repeated samples, not a posterior. If you actually want the posterior, build a Bayesian credible interval instead.

Note: Coverage probability is approximate, not identical to the nominal confidence level. The discrepancy is largest for proportion intervals near 0 or 1 and shrinks as n grows. For mean intervals, R's t.test() and scipy.stats.t.interval give the same numbers as this page. For proportions, see the proportion CI page in this category.

Sources & References

The mathematical formulas and statistical concepts used in this calculator are based on established statistical theory and authoritative academic sources:

  • NIST/SEMATECH e-Handbook: Confidence Intervals - Comprehensive guide to confidence interval construction from the National Institute of Standards and Technology.
  • Khan Academy: Confidence Intervals - Educational resource explaining confidence interval concepts and interpretation.
  • Penn State STAT 500: Confidence Intervals - University course material on confidence interval theory and applications.
  • Statistics By Jim: Confidence Intervals Guide - Practical explanation of confidence intervals with examples.
  • OpenStax Introductory Statistics: Confidence Intervals - Free, peer-reviewed textbook chapter on confidence interval fundamentals.

Confidence intervals: working questions

What does "95% confidence" actually mean?

It's a property of the procedure, not of any specific interval. Build a 95% CI from a fresh sample, repeat the experiment many times, and about 95 of every 100 intervals will capture the true parameter. For any single interval you've already computed, the parameter either is or isn't inside; there's no 95% probability statement to make about that one interval. This is the most-misinterpreted result in frequentist statistics. If you actually want "there's a 95% probability the parameter is in this range," that's a Bayesian credible interval, which requires a prior.

How do I interpret a CI that includes zero (or the null value)?

The data don't rule out the null at your chosen confidence level. For a difference in means, if the 95% CI is [−0.3, 4.1], you can't rule out a true difference of zero, so the corresponding two-tailed test at α = 0.05 doesn't reject. "CI excludes zero" is equivalent to "two-sided test rejects." That said, an interval that barely includes zero ([−0.1, 4.0]) is very different from one that solidly does ([−5, 5]), even though both are non-significant. Width and position carry information beyond the dichotomous reject/fail-to-reject call.

Z or t, which one for the mean?

Use t whenever you estimate σ from sample data, which is almost always. The t-distribution has heavier tails than the normal to reflect the extra uncertainty in σ̂. With n < 30 or so the difference is meaningful (t₀.₀₂₅ at df = 10 is 2.228 versus z = 1.96); above n = 100 it's a rounding error. If σ is genuinely known from population-level data (rare in practice), z is appropriate. R defaults to t in confint() and t.test(); SciPy's stats.t.interval is the equivalent. Using t is never wrong; using z when σ is estimated is technically wrong but harmless at large n.

How does sample size affect CI width?

Width scales with 1/√n. Halving the CI width takes quadrupling n, not doubling. Going from n = 25 to n = 100 cuts the half-width by a factor of 2; going to n = 400 cuts it by a factor of 4. The √n relationship is why studies hit diminishing returns on width past a certain n: the marginal precision per added subject keeps shrinking. For planning, decide your target half-width first, then back-solve for n given an estimated σ. The page's sample-size calculator does this directly.
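The back-solve step can be sketched as follows. This is a simplified planning formula, n = (z·σ / E)², that assumes a z critical value and a guessed σ; it is not necessarily how the page's sample-size calculator works, and a t-based calculation would ask for a slightly larger n. The σ of 10 and the target half-widths are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_n(sigma, half_width, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= half_width (z approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil((z * sigma / half_width) ** 2)

# Hypothetical planning inputs: sigma estimated at 10 units.
n_wide = required_n(sigma=10, half_width=2)    # target half-width of 2
n_narrow = required_n(sigma=10, half_width=1)  # half-width cut in half
print(n_wide, n_narrow)
```

Halving the target half-width roughly quadruples the required sample size, which is the √n relationship seen from the planning side.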

How do I compute a CI for the difference between two means?

Welch's CI by default: (x̄₁ − x̄₂) ± t* · √(s₁²/n₁ + s₂²/n₂), with Welch-Satterthwaite df. Pooled CI assumes equal variances and replaces the standard error with a single pooled estimate; only worth it when you have strong reason to assume equal variances. R's t.test(x1, x2)$conf.int returns Welch by default. If the CI excludes zero, the corresponding two-tailed Welch t-test rejects at the same α. For paired data, the CI is on the differences, not on the raw values: use t.test(x1, x2, paired = TRUE).
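The pieces of the Welch formula can be computed by hand as a check on software output. The sketch below uses made-up measurements and illustrative function names. The critical value t* is passed in rather than computed, since it has to be looked up for the fractional df (for example with `scipy.stats.t.ppf`, if SciPy is available); the example plugs in the table value for df = 8 (2.306) as a slightly conservative stand-in.

```python
from math import sqrt
from statistics import fmean, variance

def welch_se_df(x1, x2):
    """Standard error of the difference and Welch-Satterthwaite df."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = variance(x1) / n1, variance(x2) / n2  # s^2 / n for each group
    se = sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, df

def welch_interval(x1, x2, t_crit):
    """(mean1 - mean2) +/- t_crit * SE, with t_crit supplied by the caller."""
    se, _ = welch_se_df(x1, x2)
    diff = fmean(x1) - fmean(x2)
    return diff - t_crit * se, diff + t_crit * se

# Hypothetical measurements for two independent groups.
x1 = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
x2 = [4.2, 4.8, 4.1, 4.6, 4.4]
se, df = welch_se_df(x1, x2)
lo, hi = welch_interval(x1, x2, t_crit=2.306)  # table value for df = 8
print(f"SE = {se:.4f}, Welch df = {df:.2f}")
print(f"approximate 95% CI for the difference: [{lo:.3f}, {hi:.3f}]")
```

The Welch df always lands between the smaller group's n − 1 and the pooled n₁ + n₂ − 2, so it is a quick sanity check on any reported value.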

When does the standard CI formula break down?

Three common failure modes. Heavily skewed distributions at small n: the CLT hasn't kicked in yet and the t interval undercovers. Use bootstrap (boot package in R, scipy's bootstrap in Python) instead. Bounded parameters near the boundary: a CI for a proportion built around p̂ near 0 or 1 can extend below 0 or above 1; use Wilson or Clopper-Pearson. Ratios and other non-additive quantities: the CI for log(ratio) is symmetric and well-behaved; back-transform for the asymmetric CI on the ratio scale. Don't trust the symmetric ± formula in any of these regimes.
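For the first failure mode, a percentile bootstrap is one simple variant (library routines offer others, such as BCa). The sketch below uses only the standard library, hypothetical right-skewed data, and an illustrative function name; the seed is fixed so the run is repeatable.

```python
import random
from statistics import fmean

def bootstrap_mean_ci(data, confidence=0.95, reps=5000, seed=0):
    """Percentile bootstrap interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and record the mean of each resample.
    means = sorted(fmean(rng.choices(data, k=n)) for _ in range(reps))
    alpha = 1 - confidence
    lo_idx = round(alpha / 2 * reps)
    hi_idx = round((1 - alpha / 2) * reps) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical right-skewed sample (two large values in the tail).
data = [1.2, 0.8, 1.5, 2.1, 0.9, 1.1, 7.4, 1.3, 0.7, 12.6, 1.0, 1.8]
boot_lo, boot_hi = bootstrap_mean_ci(data)
print(f"sample mean = {fmean(data):.2f}")
print(f"95% percentile bootstrap CI: [{boot_lo:.2f}, {boot_hi:.2f}]")
```

With skewed data like this, the bootstrap interval is typically not centered on the sample mean, which the symmetric ± formula cannot express.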

Are confidence intervals and p-values redundant?

Mostly equivalent, but the CI carries more information. "95% CI for the difference is [0.4, 2.1]" tells you the test rejects at α = 0.05 (because the interval excludes zero) and gives the magnitude of the effect with uncertainty bounds. The p-value alone tells you only that the test rejects. APA and most journals now require CIs alongside (or instead of) p-values for that reason. Same goes for effect sizes: report d or r with their CIs, not just whether p crossed 0.05.

What's the difference between a CI and a prediction interval?

A confidence interval covers the population parameter (mean, proportion, regression coefficient). A prediction interval covers a future individual observation. Prediction intervals are wider because they include both the uncertainty in estimating the mean and the variability of an individual data point around that mean. For a normal regression, the 95% prediction interval at x is ŷ(x) ± t*·√(σ̂²·(1 + 1/n + (x − x̄)²/Σ(xᵢ − x̄)²)), versus the CI which drops the leading 1. Mixing these up is a frequent error in regression reporting.
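The two half-widths can be compared directly for a simple linear regression. This sketch fits the line by ordinary least squares on hypothetical data and evaluates both formulas at one x value; `regression_half_widths` is an illustrative name, and the critical value is supplied by the caller (2.447 is the table value for df = n − 2 = 6 at 95%).

```python
from math import sqrt
from statistics import fmean

def regression_half_widths(xs, ys, x0, t_crit):
    """Half-widths of the CI for the mean response and the PI at x0."""
    n = len(xs)
    xbar, ybar = fmean(xs), fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    intercept = ybar - slope * xbar
    resid_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = resid_ss / (n - 2)                   # residual variance estimate
    leverage = 1 / n + (x0 - xbar) ** 2 / sxx
    ci_half = t_crit * sqrt(sigma2 * leverage)        # mean response
    pi_half = t_crit * sqrt(sigma2 * (1 + leverage))  # new observation
    return ci_half, pi_half

# Hypothetical data: 8 points, so df = 6.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.3, 8.9]
ci_half, pi_half = regression_half_widths(xs, ys, x0=4.5, t_crit=2.447)
print(f"CI half-width: {ci_half:.3f}, PI half-width: {pi_half:.3f}")
```

The only difference between the two lines of arithmetic is the leading 1, and it is enough to make the prediction interval wider at every x.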
