Confidence Intervals for a Single Proportion

Compute a confidence interval for a single proportion using Wald (normal approximation) and Wilson (score) methods. Enter the number of successes, sample size, and confidence level to see interval bounds, margin of error, and method comparisons.

Last Updated: February 13, 2026

The Wald interval, p̂ ± z·√(p̂(1−p̂)/n), is the one most textbooks teach first, and it's wrong often enough that you should stop using it. Brown, Cai & DasGupta (2001) showed Wald undercovers the nominal level chaotically across n and p, especially near 0 or 1, where it can produce bounds outside [0, 1]. Three successes in 10 at 95% confidence: Wald gives [0.016, 0.584]. One success in 10: Wald returns a negative lower bound, about −0.09. That isn't a small-n curiosity; it's a genuine failure mode.

Use Wilson's score interval as the default. It inverts a hypothesis test rather than approximating around p̂, stays bounded in [0, 1] by construction, and tracks the nominal coverage even at small n. Agresti-Coull (Wald with the +2/+4 adjustment) is fine too and easier to teach by hand. For very small n or extreme proportions, Clopper-Pearson is conservative but exact. The page returns all four side-by-side so you can see the disagreement and pick the one your reporting standard requires.

Inputs: Successes, n, and Confidence Level

A confidence interval for a proportion starts with counting successes out of total trials. An e-commerce site might track 312 purchases from 4,200 visitors; a medical researcher might record 47 adverse events among 1,500 patients. The sample proportion p-hat equals successes divided by n. Plug those numbers into a formula alongside your chosen confidence level, and you get a range estimating the true population proportion.

Confidence level controls interval width. At 95%, the method captures the true proportion in 95 of every 100 repeated samples—not a statement that this specific interval has a 95% probability of containing the truth. Raising the level to 99% stretches the interval; dropping it to 90% tightens the bounds. Pick a level that matches the stakes: medical device clearances might demand 99%, while a quick marketing poll might tolerate 90%.
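
As a rough illustration of how the confidence level sets the width, the sketch below converts each level into its two-sided normal critical value z and applies it to the 312-of-4,200 example above. It uses the plain Wald margin only to show the scaling; the helper name `z_critical` is an illustrative choice, not part of the calculator.

```python
from math import sqrt
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided standard-normal critical value for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# E-commerce example from the text: 312 purchases out of 4,200 visitors.
successes, n = 312, 4200
p_hat = successes / n

for level in (0.90, 0.95, 0.99):
    z = z_critical(level)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)  # Wald margin, for scaling only
    print(f"{level:.0%}: z = {z:.3f}, margin = ±{margin:.4f}")
```

The margin grows in direct proportion to z, so moving from 90% to 99% widens the interval by the ratio of the two critical values regardless of the data.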

Sample size n drives precision. Doubling n shrinks the standard error by a factor of roughly 1.4, which narrows the interval. But sample collection costs money and time, so practical work involves balancing acceptable margin of error against budget. Pre-study power analysis helps decide n before data collection begins, but the calculator works with whatever sample you already have.
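
The square-root relationship can be checked directly, and inverted to get a planning estimate of n for a target margin. This is a minimal sketch assuming the simple normal-approximation margin z·√(p(1−p)/n) and, when no prior guess for p is available, the worst case p = 0.5; the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def standard_error(p, n):
    """Standard error of a sample proportion under the binomial model."""
    return sqrt(p * (1 - p) / n)

def n_for_margin(margin, confidence=0.95, p=0.5):
    """Smallest n whose normal-approximation margin is at most `margin`.

    p = 0.5 maximizes p * (1 - p), so it gives a conservative answer.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z * z * p * (1 - p) / margin ** 2)

# Doubling n divides the standard error by sqrt(2), about 1.41.
print(standard_error(0.3, 500) / standard_error(0.3, 1000))

# Planning number for a ±3 point margin at 95% confidence.
print(n_for_margin(0.03))
```

Halving the margin requires roughly four times the sample, which is usually where the budget conversation starts.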

Successes must be whole numbers between zero and n. Zero successes or n successes require special handling—some methods produce degenerate intervals in these edge cases. If your count sits at an extreme, the calculator may flag a warning or switch methods automatically.

Wald vs Wilson: When Methods Diverge

The Wald interval is the formula most textbooks introduce first: p-hat plus or minus z times the square root of p-hat times (1 − p-hat) divided by n. Simple to compute, easy to teach. But its coverage can fall short when p-hat is near zero or one, or when n is small. A study with 1 success out of 10 trials produces a 95% Wald interval of roughly [−0.09, 0.29], and a negative lower bound is impossible for a proportion.
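
The Wald formula is short enough to write out in a few lines. The sketch below deliberately leaves the bounds unclipped so the boundary problem stays visible; `wald_interval` is an illustrative name, not the calculator's own code.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.95):
    """Wald (normal-approximation) interval; bounds are NOT clipped to [0, 1]."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

print(wald_interval(3, 10))   # lower bound just above zero
print(wald_interval(1, 10))   # lower bound below zero
print(wald_interval(0, 10))   # collapses to a single point
```

With zero successes the standard error is exactly zero, so the interval degenerates to [0, 0] no matter how small n is.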

Wilson's score interval fixes these problems by inverting a hypothesis test rather than relying purely on the normal approximation. The formula looks more complicated, but the payoff is intervals that stay bounded between zero and one and maintain actual coverage closer to the stated confidence level. Simulation studies show Wilson outperforms Wald almost universally, especially when p-hat is extreme.
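
For comparison, here is a minimal sketch of the Wilson score interval in its usual closed form: a center pulled toward 0.5 and a half-width that includes a z²/(4n²) term. The final `max`/`min` only guard against floating-point noise at x = 0 or x = n; the function name is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    # Mathematically the bounds already lie in [0, 1]; clamp rounding error only.
    return max(0.0, center - half_width), min(1.0, center + half_width)

print(wilson_interval(3, 10))
print(wilson_interval(0, 20))
```

Note that the interval is not centered on p-hat: for 3 successes in 10 the center sits near 0.36 rather than 0.30, which is what keeps the bounds inside [0, 1].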

When n is large—say, thousands of observations—and p-hat sits near 0.5, Wald and Wilson give almost identical results. The divergence grows as p-hat approaches the boundaries or n shrinks. A survey of 50 people finding 2 positive responses will see substantially different intervals depending on the method. If your software offers both, run them side by side on edge cases to understand how they differ.

Agresti-Coull is a compromise: add two pseudosuccesses and two pseudofailures before computing a Wald-like interval. This simple adjustment often achieves coverage comparable to Wilson without changing the formula structure. Some practitioners default to Agresti-Coull because it's easy to explain while avoiding the worst Wald pitfalls.
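
The adjustment described above can be sketched literally: add two successes and two failures, then apply the Wald formula to the adjusted counts. Two assumptions are worth flagging: the +2/+4 shortcut is tuned to 95% confidence, and the clipping to [0, 1] at the end is a common convention added here because the unclipped lower bound can dip below zero at extreme counts.

```python
from math import sqrt
from statistics import NormalDist

def agresti_coull_interval(successes, n, confidence=0.95):
    """'Add two successes and two failures' interval, clipped to [0, 1]."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_adj = n + 4                      # four pseudo-observations in total
    p_adj = (successes + 2) / n_adj    # adjusted center
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

print(agresti_coull_interval(3, 10))
print(agresti_coull_interval(0, 20))   # lower bound is clipped at 0
```

For 3 successes in 10 the result lands close to the Wilson bounds, which is the sense in which the adjustment achieves comparable behavior with a much simpler formula.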

Continuity and Small-n Caveats

Proportion data are inherently discrete—you can observe 5 successes or 6, not 5.3. The normal distribution underlying Wald and Wilson is continuous, so there's a mismatch. Continuity correction adds 0.5 to or subtracts 0.5 from the success count before calculating, bridging discrete counts to a continuous curve. Some textbooks recommend it; others argue Wilson already handles the discreteness adequately.

Small samples amplify every approximation error. With n = 15, a single extra success can shift p-hat by nearly 7 percentage points. Interval width balloons because standard error is inversely proportional to the square root of n. Below about n = 30, many statisticians suggest exact binomial methods (Clopper-Pearson), which invert the cumulative binomial distribution rather than leaning on normal theory.

Clopper-Pearson intervals are wider than Wilson for the same data because they guarantee coverage at or above the stated level. That conservatism can feel frustrating when you want a tight estimate, but for high-stakes applications—like estimating the failure rate of a safety system—overestimating uncertainty beats underestimating it.
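
To make "inverting the cumulative binomial distribution" concrete, the sketch below finds the Clopper-Pearson bounds by bisection on the exact binomial CDF, using only the standard library. Production code would normally use beta-distribution quantiles instead; the bisection approach and the helper names are illustrative choices that keep the example self-contained.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve_decreasing(f, target, iters=100):
    """Bisection for f(p) == target on [0, 1], where f decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid    # f still too large, so p must be larger
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson_interval(successes, n, confidence=0.95):
    """Exact (Clopper-Pearson) interval by inverting the binomial CDF."""
    alpha = 1 - confidence
    x = successes
    # Lower bound solves P(X >= x | p) = alpha/2, i.e. P(X <= x-1 | p) = 1 - alpha/2.
    lower = 0.0 if x == 0 else solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper bound solves P(X <= x | p) = alpha/2.
    upper = 1.0 if x == n else solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper

print(clopper_pearson_interval(3, 10))
print(clopper_pearson_interval(0, 20))
```

For 3 successes in 10 this gives roughly [0.067, 0.652], noticeably wider than the Wilson interval for the same data.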

If your sample is small and p-hat is extreme, examine the calculator's output critically. A Wilson interval of [0.02, 0.35] might look wide, but it honestly reflects the data. Resist the temptation to pick whichever method gives the narrowest result—choose the method that best fits your assumptions and risk tolerance.

Interpreting a Proportion Interval Correctly

A 95% confidence interval for a proportion does not mean there is a 95% probability the true value lies inside. Once you compute the bounds, the parameter is either captured or not—probability is 0 or 1. The 95% refers to long-run coverage: repeat the survey many times, and about 95 of 100 intervals will contain the truth. This distinction matters when communicating results to decision-makers.
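
Long-run coverage can be computed exactly rather than simulated: for a fixed true p, sum the binomial probabilities of every outcome x whose interval contains p. The sketch below does this for the Wald and Wilson formulas at one hypothetical setting, n = 20 and p = 0.1; all function names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)

def wald(x, n, z=Z95):
    p = x / n
    h = z * sqrt(p * (1 - p) / n)
    return p - h, p + h

def wilson(x, n, z=Z95):
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    h = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - h, center + h

def exact_coverage(method, n, p_true):
    """P(interval contains p_true) when X ~ Binomial(n, p_true)."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = method(x, n)
        if lo <= p_true <= hi:
            total += comb(n, x) * p_true**x * (1 - p_true)**(n - x)
    return total

print("Wald:  ", round(exact_coverage(wald, 20, 0.1), 3))
print("Wilson:", round(exact_coverage(wilson, 20, 0.1), 3))
```

At this setting the nominal 95% Wald interval covers the truth well under 90% of the time, while Wilson stays close to 95%. The coverage is a property of the procedure across all possible samples, not of any single computed interval.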

A poll reports 48% support with a margin of error of 3 points. Readers often think "the true support is somewhere between 45% and 51%." That's roughly correct for practical purposes, but technically the interval is a property of the method, not a probability envelope around the unknown parameter. Bayesian credible intervals do offer that probability interpretation, but they require specifying a prior distribution.

When comparing two proportions (say, conversion rates between landing pages), checking whether the intervals overlap is a quick first look. Non-overlapping intervals generally indicate a statistically significant difference, but the check is conservative: two 95% intervals can overlap slightly while a formal two-sample test still rejects at the 5% level. Overlapping intervals therefore do not prove equality; the difference might be real but smaller than your precision can detect. A formal two-sample test is more precise.

Context shapes interpretation. An interval of [0.08, 0.12] for defect rate might be excellent in consumer electronics but alarming in aerospace. Always pair the statistical result with domain knowledge about acceptable thresholds and practical consequences of estimation error.

Method Notes: What's Assumed

The binomial model underpins all standard proportion intervals. It assumes each trial is independent, with a constant probability of success. Drawing names from a hat without replacement violates independence once the pool shrinks noticeably. Surveying friends of friends can cluster responses, inflating apparent precision. Check whether your sampling design matches the model before trusting the output.

Normal approximation methods—Wald, Wilson, Agresti-Coull—work best when both np and n(1 − p) exceed about 5. Below that threshold, the binomial distribution is too skewed for a symmetric normal curve to mimic. The calculator may warn you, but vigilance on your part catches edge cases the software misses.

Cluster sampling, stratified sampling, and other complex designs need design-based variance estimators. Standard formulas assume simple random sampling. Using the wrong variance estimator can underestimate uncertainty by a factor of two or more, producing falsely narrow intervals that mislead stakeholders.

If you have paired or matched data—say, pre-and-post measurements on the same subjects—proportion intervals for independent samples do not apply. You would need McNemar's test framework or a paired-proportion interval, which accounts for within-subject correlation. Applying the wrong model leads to incorrect inference.

Proportion CI Questions

Which method should I default to?

Wilson or Agresti-Coull are safer choices than Wald for general use. They maintain coverage across a wider range of p-hat and n values. Reserve Wald for large samples with moderate proportions, or when you need to match legacy reports using that formula.

Why does my Wald interval go below zero?

Because the formula can produce negative lower bounds when p-hat is near zero and n is small. Clipping to zero is one fix, but switching to Wilson or exact methods is cleaner. A negative bound signals the method is unsuitable for your data.

How do I choose between 90%, 95%, and 99%?

Consider the cost of missing the true value. High-stakes safety applications often warrant 99%. Exploratory analyses can accept 90%. Convention and regulatory guidance also play a role—many journals expect 95% unless justified otherwise.

Can I compare two intervals by looking at overlap?

Rough rule: non-overlapping intervals suggest a significant difference, and because the check is conservative, the implied significance level is usually stricter than the stated one. Overlapping intervals don't guarantee no difference; they just can't rule it out easily. A formal two-sample proportion test is more accurate for significance.

What if I have zero successes?

Zero out of n presents a special case. Wald gives [0, 0], which is misleading: absence of successes doesn't mean the true rate is exactly zero. Wilson and exact methods produce intervals that start at zero and have a positive upper bound, honestly reflecting uncertainty.

Limitations of proportion CI methods

Don't use Wald: below n ≈ 100 or for p̂ near 0 or 1, Wald undercovers the nominal level. Brown, Cai & DasGupta (2001) showed the chaotic coverage drop. Wilson is the right default for almost every practical case.

Paired data needs paired methods: for pre-post on the same subjects, this page doesn't apply. Use McNemar's test or a paired binomial CI instead.

Independent Bernoulli with constant p: all four methods assume that. Cluster sampling, complex survey designs, or contagion structures invalidate the formulas. Use design-based variance estimation (R's survey package).

Approximate coverage: the actual coverage of a nominal 95% interval is approximate, especially at small n. Wilson and Agresti-Coull track nominal level closely. Clopper-Pearson is conservative (covers ≥ 95%) by construction.

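Note: R's binom.confint() in the binom package returns all four intervals (Wald, Wilson, Agresti-Coull, Clopper-Pearson) side by side, along with several others, when called with its default methods = "all". In Python, statsmodels.stats.proportion.proportion_confint computes one method per call through its method argument ('normal' for Wald, 'wilson', 'agresti_coull', and 'beta' for Clopper-Pearson). Wilson's interval traces to Wilson (1927). Agresti-Coull is the +2/+4 adjustment from Agresti and Coull (1998).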

Proportion CIs: Wald, Wilson, and beyond

Wald vs Wilson, which interval should I actually use?

Wilson, almost always. Wald (the textbook formula p̂ ± z·√(p̂(1−p̂)/n)) undercovers the nominal level chaotically across n and p, especially near 0 or 1 where it can produce bounds outside [0, 1]. Brown, Cai & DasGupta (2001) is the canonical demonstration. Wilson inverts the score test rather than approximating around p̂, stays inside [0, 1] by construction, and tracks nominal coverage closely even at small n. In R, binom::binom.confint(x, n, methods = "wilson") returns it (the default, methods = "all", returns every method the package implements); statsmodels.stats.proportion.proportion_confint(method='wilson') is the Python equivalent.

Why does my Wald interval sometimes go below 0 or above 1?

Because Wald approximates the binomial with a normal centered at p̂, and that normal extends past [0, 1] when p̂ is close to a boundary. With 2 successes in n = 40 (p̂ = 0.05), the Wald 95% CI is roughly [−0.02, 0.12], which includes negative probabilities. Mathematical nonsense. Wilson and Clopper-Pearson stay inside [0, 1] by construction; Agresti-Coull behaves far better than Wald but can still overshoot slightly at the extremes (for example with zero successes), so implementations usually truncate it. If you see a Wald interval crossing the boundary, that's a signal to switch methods, not to truncate the bounds at 0 and 1.

When do I need Clopper-Pearson?

When you need guaranteed coverage at or above the nominal level. Clopper-Pearson uses the exact binomial distribution and is conservative by construction; the actual coverage can exceed 95% (sometimes substantially) but never falls below. Use it for regulatory submissions, safety-critical applications, and small samples where you can't afford undercoverage. The cost is wider intervals than Wilson. For routine inference, Wilson is usually preferred because it tracks 95% nominal more tightly.

How do I compute a CI when I have 0 successes (or n successes)?

Wald collapses uselessly: p̂ = 0 gives [0, 0], which says nothing. Use Wilson, Agresti-Coull, or Clopper-Pearson, all of which produce a non-degenerate interval. Wilson with x = 0, n = 20 gives roughly [0, 0.16]; Clopper-Pearson gives [0, 0.168]. The "rule of 3" approximates the one-sided 95% upper bound for x = 0 as 3/n. So with no observed events in n = 20 trials, you can say the true rate is at most about 0.15 with 95% confidence. By symmetry, x = n gives the mirror-image interval ending at 1.
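
With zero successes the upper bounds have simple closed forms, which makes the comparison easy to reproduce. This is a sketch under the usual binomial assumptions; the dictionary keys and function name are illustrative, and the 3/n shortcut is only meant for 95% confidence.

```python
from statistics import NormalDist

def zero_success_upper_bounds(n, confidence=0.95):
    """Upper bounds on the true proportion when 0 successes are seen in n trials."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        # Wilson upper bound at x = 0 simplifies to z^2 / (n + z^2).
        "wilson": z * z / (n + z * z),
        # Two-sided Clopper-Pearson: solve (1 - p)^n = alpha / 2.
        "clopper_pearson": 1 - (alpha / 2) ** (1 / n),
        # One-sided exact bound: solve (1 - p)^n = alpha.
        "one_sided_exact": 1 - alpha ** (1 / n),
        # Rule of three approximates the one-sided 95% bound.
        "rule_of_three": 3 / n,
    }

for name, bound in zero_success_upper_bounds(20).items():
    print(f"{name}: {bound:.3f}")
```

The rule of three tracks the one-sided exact bound, not the two-sided Clopper-Pearson bound, which is why it comes out a little lower than 0.168 at n = 20.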

What's the +2/+4 adjustment in Agresti-Coull?

Agresti and Coull (1998) showed that adding two successes and two failures (so 4 total "pseudo-observations") and then computing a Wald-style interval on the adjusted counts gives much better coverage than plain Wald. The center moves to (x + 2)/(n + 4) and the standard error uses the adjusted counts. It's the simplest fix to Wald that works well, and it's easy to compute by hand. Wilson is theoretically cleaner but Agresti-Coull is what teachers use to demonstrate why Wald is broken.

How big does n need to be for the normal approximation to work?

The textbook rule is np̂ ≥ 5 and n(1 − p̂) ≥ 5, but that's optimistic. Brown et al. (2001) showed Wald coverage is genuinely poor up to n in the hundreds for p̂ near 0.1 or 0.9. A safer working rule: avoid Wald entirely below n = 100 and use Wilson at all sample sizes. For n above 1000 and p̂ between 0.2 and 0.8, the four methods differ by only a few thousandths. When coverage must never dip below the nominal level, Clopper-Pearson is the only one of the four that guarantees it at any n.

Two-proportion CI, do I just stack two single-proportion CIs?

No. The CI for the difference p₁ − p₂ has its own formula. Wald: (p̂₁ − p̂₂) ± z·√(p̂₁(1−p̂₁)/n₁ + p̂₂(1−p̂₂)/n₂). Newcombe's method combines two separate Wilson intervals in a way that maintains coverage; it's the standard recommendation for the two-proportion case. R's prop.test() returns a Wilson-style interval only for a single proportion; given two samples, it reports a Wald-type interval for the difference, with continuity correction by default. DescTools::BinomDiffCI offers Newcombe's method. Stacking individual intervals and reading non-overlap as significance is approximately right but not exact, especially at small n.
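
A minimal sketch of the two approaches, assuming the uncorrected hybrid score form of Newcombe's method: each Wilson interval contributes the distance from its point estimate to the relevant bound, and those distances are combined in quadrature. The counts in the example (56 of 70 versus 48 of 80) are hypothetical, and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def wilson(x, n, z):
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    h = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - h, center + h

def wald_diff_interval(x1, n1, x2, n2, confidence=0.95):
    """Wald interval for p1 - p2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p1, p2 = x1 / n1, x2 / n2
    h = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - h, (p1 - p2) + h

def newcombe_diff_interval(x1, n1, x2, n2, confidence=0.95):
    """Newcombe hybrid score interval for p1 - p2 (no continuity correction)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson(x1, n1, z)
    l2, u2 = wilson(x2, n2, z)
    d = p1 - p2
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper

print(wald_diff_interval(56, 70, 48, 80))
print(newcombe_diff_interval(56, 70, 48, 80))
```

Both intervals exclude zero here, but the Newcombe interval is shifted slightly toward zero relative to Wald, reflecting the asymmetry of the underlying Wilson intervals.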

Continuity correction, when do I apply it?

When matching the discrete binomial to a continuous normal approximation. The half-step adjustment (±0.5) corrects for the fact that P(X ≤ k) in the discrete world should match Φ((k + 0.5 − np)/√(np(1−p))) in the continuous world. R's prop.test() applies it by default. Clopper-Pearson works from the exact binomial, so the correction doesn't apply; Wilson has an optional continuity-corrected variant, but it is not part of the standard score interval. For modern reporting, the consensus is that Wilson without continuity correction is preferred over Wald with it; the continuity correction is mostly a relic of the era when Wald was standard.
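
The effect of the half-step is easy to see numerically. The sketch below compares the exact binomial P(X ≤ k) with the normal approximation, with and without the +0.5 shift, for one hypothetical case (n = 20, p = 0.3, k = 5); the function names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf_approx(k, n, p, continuity=True):
    """Normal approximation to P(X <= k), optionally with the +0.5 correction."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    shift = 0.5 if continuity else 0.0
    return NormalDist().cdf((k + shift - mean) / sd)

n, p, k = 20, 0.3, 5
print("exact:      ", round(binom_cdf(k, n, p), 4))
print("corrected:  ", round(normal_cdf_approx(k, n, p, continuity=True), 4))
print("uncorrected:", round(normal_cdf_approx(k, n, p, continuity=False), 4))
```

In this case the corrected approximation lands much closer to the exact value, which is the motivation for the adjustment when a normal approximation is used at all.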