
Sample Size & Power Calculator

Calculate required sample size, statistical power, or minimum detectable effect for hypothesis tests. Master power analysis for research planning and study design.


What Decision Your Sample Size Answers

Your product manager asks: “How many users do we need before we can call this A/B test?” You open a sample size calculator, type in a baseline conversion rate of 3.2%, a minimum lift you care about (say 10% relative), pick 80% power and α = 0.05, and get back a number — maybe 38,000 per arm. That number is the answer to a very specific question: “How many observations do I need so that if a real difference of this size exists, my test has an 80% chance of catching it?”

The mistake that derails most experiments: running the calculator after the test ends. Post-hoc power analysis is circular — the observed effect size already determines the p-value, so computing power from it adds no new information. Run the calculator before you launch. The whole point is to commit to a sample size up front so you are not tempted to peek, stop early when results look good, or keep running when they do not.

Alpha, Power, and Effect Size — the Three Inputs That Move n

Four parameters are linked: sample size (n), significance level (α), power (1 − β), and effect size. Fix any three and the fourth is determined. In practice you choose α (usually 0.05), power (usually 0.80 or 0.90), and the minimum detectable effect (MDE) — the calculator returns n.

Effect size has the biggest impact. Halving the MDE roughly quadruples the required n because the sample-size formula has the effect in the denominator squared. Detecting a 1-percentage-point lift on a 10% baseline needs about 14,700 per arm at 80% power. Detecting a 2-point lift on the same baseline needs about 3,800 — roughly a 4× reduction. Before asking for more traffic, ask whether you really need to detect that small an effect.
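That quadratic relationship is easy to verify numerically. A minimal Python sketch of the standard normal-approximation formula for two proportions (the same formula given later on this page), using the baseline and lifts from the paragraph above:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Two-sample proportions, equal arms, two-tailed normal approximation."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # 1.960 for alpha = 0.05, two-tailed
    z_beta = z(power)            # 0.842 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(n_per_arm(0.10, 0.11))  # 1-point lift on a 10% baseline: about 14,700 per arm
print(n_per_arm(0.10, 0.12))  # 2-point lift: about 3,800, roughly a quarter of the traffic
```

Doubling the absolute lift cuts the denominator by 4×, which is exactly the reduction the printout shows.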

Power is the next lever. Moving from 80% to 90% power increases n by roughly 30%. Moving to 95% adds another 25% on top. Most product tests run at 80% because the cost of a false negative (missing a real improvement) is usually lower than the cost of running for weeks longer. Clinical trials often require 90% because missing a real drug effect has patient-safety consequences.

MDE and Why Smaller Lifts Cost Exponentially More Traffic

MDE — minimum detectable effect — is not the effect you expect. It is the smallest effect you refuse to miss. If the true lift turns out to be bigger than your MDE, your test catches it with more than the target power. If the true lift is smaller, your test probably will not reach significance, and you accept that risk intentionally.

Teams routinely set MDE too tight. A product manager says “I want to detect a 1% relative lift on a 5% conversion rate” without realising that is an absolute difference of 0.05 percentage points — requiring millions of users per arm. The right conversation is: what is the smallest lift that would actually change your decision? If a 5% relative lift would not justify the engineering cost, there is no point running a test to detect it. Set MDE at the threshold where the business would act.

You can also flip the calculation: given your available traffic and test duration, what is the smallest effect you can reliably detect? If the answer is a 15% relative lift and your hypothesis is a 3% lift, do not run the test — it will almost certainly come back inconclusive, wasting time and losing trust in experimentation.
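The flipped calculation can be sketched the same way. The baseline rate and traffic figures below are hypothetical placeholders; the function approximates both arms' variance by the baseline variance, which is slightly conservative for small lifts:

```python
from math import sqrt
from statistics import NormalDist

def mde_relative(p1, n_per_arm, alpha=0.05, power=0.80):
    """Smallest relative lift detectable at the given power,
    approximating the combined variance as 2 * p1 * (1 - p1)."""
    z = NormalDist().inv_cdf
    z_total = z(1 - alpha / 2) + z(power)
    abs_diff = z_total * sqrt(2 * p1 * (1 - p1) / n_per_arm)
    return abs_diff / p1

# Hypothetical: 10k users/week for 4 weeks, split 50/50 -> 20k per arm
print(f"{mde_relative(0.05, 20_000):.1%}")  # smallest detectable relative lift
```

If the printed MDE is far above the lift your hypothesis predicts, the honest answer is the one in the paragraph above: do not run the test yet.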

Unequal Splits and Multi-Arm Power Penalties

A 50/50 split gives the most power for a fixed total n. But sometimes you cannot give the variant half the traffic — maybe the change is risky, or you need a large control for other concurrent tests. A 90/10 split means the treatment arm is tiny, and the standard error is driven by the smaller group. At 90/10, you need roughly 2.8× the total traffic of a 50/50 split to achieve the same power.

The penalty is not linear. Going from 50/50 to 70/30 adds about 19% more total n. Going from 50/50 to 90/10 adds nearly 180%. If you must use an unequal split, keep the smaller arm at 20% or more to avoid blowing up the required sample size. Below 10%, the efficiency loss is severe enough that it is usually better to wait for more traffic than to starve the treatment arm.
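The inflation factor has a closed form: the variance of the difference scales as 1/n₁ + 1/n₂, so for a fixed total N with treatment share f, total traffic scales as 1/(4f(1−f)) relative to the balanced case. A quick sketch:

```python
def traffic_multiplier(treatment_share):
    """Total-traffic inflation vs a 50/50 split for the same power.
    Var(diff) ~ 1/n1 + 1/n2 = 1 / (N * f * (1 - f)), so relative to
    the balanced case total N scales by 1 / (4 * f * (1 - f))."""
    f = treatment_share
    return 1 / (4 * f * (1 - f))

for share in (0.5, 0.3, 0.2, 0.1):
    print(f"{share:.0%} treatment arm -> {traffic_multiplier(share):.2f}x total traffic")
```

The multiplier is flat near 50/50 and explodes below a 20% share, which is why the 20% floor above is a reasonable rule of thumb.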

Multi-arm tests (A/B/C/D) need multiple-comparison corrections — Bonferroni or Dunnett — which lower the per-comparison α and raise n per arm. Three variants against one control need roughly 2.5–3× the total traffic of a single A/B test at the same MDE. Before adding another variant, ask if you can test sequentially instead.
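A rough way to see the multi-arm penalty with a Bonferroni correction (the 10% baseline and 1-point lift here are illustrative, not from this page's example):

```python
from math import ceil
from statistics import NormalDist

def n_per_arm_prop(p1, p2, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    zt = z(1 - alpha / 2) + z(power)
    return ceil(zt ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

k = 3  # treatment variants, each compared against one shared control
plain = 2 * n_per_arm_prop(0.10, 0.11)                       # simple A/B: 2 arms
bonf = (k + 1) * n_per_arm_prop(0.10, 0.11, alpha=0.05 / k)  # 4 arms, alpha split 3 ways
print(plain, bonf, f"{bonf / plain:.1f}x")
```

Two of the ~2.7× comes simply from running four arms instead of two; the corrected α contributes the rest. Dunnett's correction is slightly less punishing than Bonferroni but the order of magnitude is the same.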

Power Curve Interpretation and Sensitivity Ranges

My power curve flattens near the top. Why bother going from 80% to 95%?
Because the incremental n buys diminishing returns. Going from 50% to 80% power might need 2,500 more users. Going from 80% to 95% might need another 3,000. The extra 15 percentage points of power cost more traffic than the first 30. Whether that is worth it depends on the cost of a false negative in your context.

I do not know the true effect size. How do I pick MDE?
Run the calculator for a range — say 5%, 10%, 15%, and 20% relative lift. Present the table to your stakeholder: “At 5% lift we need 60k users and 6 weeks. At 15% we need 7k users and 5 days.” Let the business decide which trade-off they want.
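Generating that kind of stakeholder table is a few lines of code. The 5% baseline and 10k weekly users below are hypothetical placeholders; substitute your own traffic:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p1, lift, alpha=0.05, power=0.80):
    """n per arm to detect a relative lift on baseline p1 (normal approximation)."""
    z = NormalDist().inv_cdf
    p2 = p1 * (1 + lift)
    zt = z(1 - alpha / 2) + z(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(zt ** 2 * var / (p2 - p1) ** 2)

baseline, weekly_users = 0.05, 10_000   # hypothetical site numbers
for lift in (0.05, 0.10, 0.15, 0.20):
    total = 2 * n_per_arm(baseline, lift)
    print(f"{lift:.0%} relative lift: {total:,} total users, ~{total / weekly_users:.1f} weeks")
```

Handing the stakeholder the whole table, rather than a single n, turns the MDE choice into the business decision it actually is.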

The calculator says I need 200k users but I only get 10k per week. Is the test hopeless?
Not necessarily. You can (a) raise MDE so you only detect larger effects, (b) lower power to 70% if you accept more false negatives, (c) reduce variance by using CUPED or pre-experiment covariates, or (d) switch to a more sensitive metric (revenue per user instead of conversion rate, for example). Do not just run the test for 20 weeks — user behaviour drifts over time, and a 5-month test is comparing January users to June users.

Should I use one-tailed or two-tailed?
Two-tailed unless you are genuinely uninterested in the possibility that the variant is worse. One-tailed tests need about 20% less sample for the same power, but if the variant actually harms the metric, a one-tailed test will not flag it. Most teams default to two-tailed.

Sample-Size Formula for Proportions and Means

Two core formulas cover the most common test types:

Two-sample proportions (equal arms)
n per arm ≈ (Zα/2 + Zβ)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₁−p₂)²
Two-sample means (equal arms, pooled σ)
n per arm ≈ 2(Zα/2 + Zβ)² × σ² / (μ₁−μ₂)²
Equivalent: n ≈ 2(Zα/2 + Zβ)² / d² where d = Cohen’s d
Common Z-values
α=0.05 two-tailed → Zα/2=1.96
Power 80% → Zβ=0.842 | Power 90% → Zβ=1.282

Units note: p₁ and p₂ are proportions (0–1, not percentages). σ is in the same units as the metric. Round n up — 63.2 per arm means 64.
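The means formula translates directly to code. A minimal sketch using the normal approximation (exact t-based methods give answers one or two higher at small n):

```python
from math import ceil
from statistics import NormalDist

def n_means(sigma, delta, alpha=0.05, power=0.80):
    """Two-sample means, equal arms, pooled sigma, two-tailed z-approximation."""
    z = NormalDist().inv_cdf
    zt = z(1 - alpha / 2) + z(power)
    return ceil(2 * zt ** 2 * sigma ** 2 / delta ** 2)

def n_cohens_d(d, alpha=0.05, power=0.80):
    """Equivalent form via the standardized effect d = (mu1 - mu2) / sigma."""
    return n_means(1.0, d, alpha, power)

print(n_cohens_d(0.5))  # medium effect: 63 per arm (exact t-test methods give 64)
print(n_cohens_d(0.2))  # small effect: 393 per arm
```

Note the rounding rule from above applied in code: `ceil` rounds n up, never down.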

Checkout Conversion A/B Test Sample-Size Walkthrough

Scenario: Your e-commerce checkout converts at 4.0%. You redesigned the payment form and want to detect at least a 10% relative lift (to 4.4%). You choose α = 0.05, two-tailed, 80% power, 50/50 split.

Step 1 — Identify inputs.
p₁ = 0.040 (control), p₂ = 0.044 (treatment). Zα/2 = 1.96, Zβ = 0.842.

Step 2 — Plug into the proportions formula.
Numerator: (1.96 + 0.842)² × [0.04×0.96 + 0.044×0.956] = 7.849 × 0.08046 = 0.63156.
Denominator: (0.044 − 0.040)² = 0.000016.
n per arm = 0.63156 / 0.000016 ≈ 39,473 after rounding up. Total: 78,946.

Step 3 — Translate to duration.
If the site gets 8,000 checkout-eligible visitors per day, you need 78,946 / 8,000 ≈ 10 days. Add a day or two as buffer for weekday/weekend traffic variation. Plan for 12 days.

Step 4 — Sanity check.
If 10 days feels too long, the lever is MDE. Raising MDE to 15% relative lift (to 4.6%) drops n per arm to about 17,900 — roughly 4.5 days. But you accept that a real 10% lift will likely go undetected. Present both options to the product manager and let them decide.
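For readers who want to check the walkthrough, here is the same calculation with unrounded z-values (hand calculations that round intermediates can drift by a few hundred users):

```python
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf
z_total = z(0.975) + z(0.80)        # 1.960 + 0.842 at full precision

def n_per_arm(p1, p2):
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z_total ** 2 * variance / (p1 - p2) ** 2)

n = n_per_arm(0.040, 0.044)
print(n, 2 * n)                     # per arm, then total
print(f"~{2 * n / 8_000:.1f} days at 8,000 eligible visitors per day")
```

The same function with p₂ = 0.046 reproduces the Step 4 alternative.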

Sources

NIST/SEMATECH — Sample Sizes for Two Proportions: Derivation of the sample-size formula for proportion tests with worked examples.

Evan Miller — Sample Size Calculator: Interactive tool and formula explanations for A/B test planning.

NCBI — Power and Sample Size Determination: Review of sample-size methodology for clinical and behavioural research.

Kohavi, Tang & Xu — Trustworthy Online Controlled Experiments: Industry reference for MDE selection, power trade-offs, and multi-arm corrections.

Frequently Asked Questions About Sample Size & Power Analysis

What is statistical power in simple terms?

Statistical power is the probability that your study will detect a true effect if it exists. It's defined as 1 − β, where β is the Type II error rate (the chance of missing a real effect). For example, 80% power means that if there really is an effect of the size you're looking for, you have an 80% chance of finding it statistically significant and a 20% chance of missing it. Higher power is better, but requires larger sample sizes. Think of power as the 'sensitivity' of your study—high power means you're less likely to miss real findings.

Why is 0.80 power often used as a rule of thumb?

A power of 0.80 (80%) is a widely accepted convention in statistics and research methods, balancing the desire to detect true effects with practical constraints on sample size and resources. It means you accept a 20% risk of Type II error (missing a true effect), which is considered reasonable in many fields. Some studies aim for 90% power (especially when missing an effect would be costly), but 80% is the standard for most homework, thesis proposals, and preliminary research. The 80% threshold comes from statistical tradition rather than a strict mathematical rule—it's a practical compromise.

What is the difference between α (alpha) and power?

α (alpha) and power control different types of errors. α is the significance level, typically 0.05, and represents the maximum probability of a Type I error (false positive—rejecting the null hypothesis when it's true). Power (1 − β) represents the probability of correctly rejecting the null hypothesis when the alternative is true—it controls Type II error (false negative). α is what you set before the study; power is what you achieve based on your sample size, effect size, and α. Think of α as your 'false alarm rate' and power as your 'detection rate.' Both are important, but they protect against different mistakes.

What is an effect size and how do I choose one for power analysis?

Effect size quantifies how big the difference or relationship is that you're trying to detect. For means, Cohen's d (standardized difference) is common: d = 0.2 (small), 0.5 (medium), 0.8 (large). For proportions, use the difference p₁ − p₂. For correlations, use the expected r value. To choose an effect size: (1) Look at similar prior studies and see what effects they found. (2) Use domain knowledge—experts often know what's a 'meaningful' difference. (3) Define the smallest effect you care about detecting (minimum meaningful difference). (4) Run pilot data for rough estimates. Be conservative: if in doubt, assume a smaller effect to avoid underpowering your study. Never pick a huge effect just to force a small sample size—it'll backfire if the true effect is smaller.

How does increasing sample size affect power?

Increasing sample size (n) increases power, holding effect size and α constant. More data gives you a better chance of detecting a true effect. However, the relationship isn't linear—there are diminishing returns. For example, going from n = 20 to 40 per group might boost power from 50% to 80%, but going from 100 to 200 might only raise it from 95% to 99%. Eventually, adding more participants gives tiny power gains. The calculator helps you find the 'sweet spot' where you achieve adequate power (e.g., 80%) without over-recruiting. Very large samples can detect tiny, meaningless effects, so balance statistical power with practical significance.
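The diminishing-returns curve is easy to trace for a concrete case, here a medium standardized effect (d = 0.5) under the normal approximation; the specific n values are illustrative:

```python
from statistics import NormalDist

def power_two_means(n_per_group, d, alpha=0.05):
    """Power of a two-sample z-test for standardized effect size d
    (upper tail only; the lower-tail contribution is negligible here)."""
    nd = NormalDist()
    return nd.cdf(d * (n_per_group / 2) ** 0.5 - nd.inv_cdf(1 - alpha / 2))

for n in (20, 40, 100, 200):
    print(f"n = {n:>3} per group -> power {power_two_means(n, 0.5):.0%}")
```

The early doublings buy large jumps in power; past roughly 90% each doubling buys only a few percentage points.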

What is the difference between one-tailed and two-tailed tests in power analysis?

One-tailed tests look for an effect in only one direction (e.g., 'treatment is better than control'), while two-tailed tests look for effects in either direction (e.g., 'treatment is different from control'). One-tailed tests require smaller sample sizes for the same power because they concentrate the α in one tail of the distribution. However, one-tailed tests are only appropriate when you have strong prior justification for directionality and are willing to ignore effects in the opposite direction. Most homework, research, and standard practice use two-tailed by default. Use one-tailed only if your instructor or field explicitly expects it and you can justify it.

Can I rely only on this calculator to design a real clinical or business study?

No. This calculator is designed for education, homework, thesis planning, and preliminary exploration—not as the sole basis for high-stakes clinical trials, regulatory submissions, or business-critical experiments. Real-world studies often involve complexities (interim analyses, multiple endpoints, non-normal distributions, stratification, etc.) that require expert statistical guidance and specialized software (e.g., nQuery, PASS, G*Power). Use this calculator to learn power concepts, check homework answers, and explore trade-offs, but consult professional statisticians for real clinical trials, medical device studies, pharmaceutical research, or large-scale business experiments. Transparency: this is a learning tool, not a replacement for rigorous study design.

What happens if my actual data has more variability than I assumed?

If the true variability (standard deviation σ) is larger than you assumed in your power calculation, your actual power will be lower than planned—you're more likely to miss the effect (Type II error). For example, if you planned for σ = 10 but the true σ = 15, your calculated n might give you 80% power in theory but under 50% in reality. To protect against this: (1) Use conservative (larger) estimates of σ based on pilot data or prior research. (2) Consider inflating your sample size by 10–20% as a buffer. (3) Report your assumptions transparently: 'Assuming σ = 10, we require n = 64. If σ is larger, power will be lower.' This acknowledges uncertainty and prepares readers for the possibility of null results.
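The power loss from an underestimated σ can be quantified directly. A sketch using the normal approximation, with a σ = 10 vs σ = 15 scenario like the one above (n = 64 per arm is the usual planning answer for d = 0.5 at 80% power):

```python
from statistics import NormalDist

def achieved_power(n_per_arm, sigma, delta, alpha=0.05):
    """Actual power of a two-sample test under the true sigma (normal approx)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    shift = delta / (sigma * (2 / n_per_arm) ** 0.5)  # standardized mean shift
    return nd.cdf(shift - z_alpha)

n = 64  # planned assuming sigma = 10, delta = 5 (d = 0.5)
print(f"{achieved_power(n, 10, 5):.0%}")  # close to the planned 80%
print(f"{achieved_power(n, 15, 5):.0%}")  # true sigma = 15: power collapses
```

A 50% larger σ shrinks the standardized effect from 0.5 to 0.33, and power falls with it.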

Can I use this tool after I collect data to 'fix' a low-power study?

No. Power analysis should be done before data collection (a priori), not after (post-hoc). Computing 'observed power' after finding a non-significant result is controversial and generally uninformative—it's directly tied to your p-value and doesn't add new information. Post-hoc power is often used incorrectly to excuse null results ('our power was low, so we can't conclude anything'), which is circular reasoning. Instead, report your results with confidence intervals and effect sizes, which convey both the estimated effect and its precision. If you're concerned about power after the fact, acknowledge it as a limitation and suggest that future studies with larger samples are needed. Always plan power prospectively.

How should I report sample size and power calculations in homework or project reports?

Report power calculations clearly and transparently: (1) State the test type: 'We used two-sample t-test power analysis.' (2) List all parameters: α = 0.05, power = 0.80, effect size d = 0.5 (or specify means, SDs, proportions). (3) Cite justification for effect size: 'Based on Smith et al. (2020), we expect a medium effect.' (4) Report required sample size: 'n = 64 per group (128 total).' (5) Mention assumptions: 'Assumes approximately normal distributions and equal variances.' (6) If relevant, discuss trade-offs: 'If we can only recruit 50 per group, power drops to 70%.' This demonstrates rigorous planning and helps reviewers or instructors assess your study design. Never just say 'we used 64 participants' without explaining why.

What is the minimum sample size I need for any statistical test?

There's no universal minimum—it depends on your test, effect size, power, and α. However, as a very rough guideline, most statistical tests become unreliable with fewer than 10–15 observations per group. For power = 0.80, α = 0.05, and a medium effect (d = 0.5 for means), you typically need 60–70 per group. For small effects (d = 0.2), you need roughly 400 per group. For proportions with small baseline rates or small differences, thousands may be needed. Always use a power calculator rather than guessing. Very small samples (n < 20 per group) rarely have adequate power unless effects are very large. If resources are limited, adjust your expectations: either accept lower power or focus on detecting larger effects (MDE).

Does higher power always mean a better study?

Not necessarily. While adequate power (typically 80–90%) is important, extremely high power (e.g., 99%) can be overkill and may detect tiny, practically meaningless effects. With very large samples, even trivial differences become statistically significant (e.g., a 0.1-point difference on a 100-point scale). This can lead to 'statistically significant but practically irrelevant' findings. The goal is to achieve adequate power to detect the smallest effect you care about (minimum meaningful difference), not to maximize power infinitely. Balance power with practical significance, cost, and feasibility. For homework and thesis work, 80% power is typically sufficient. For critical medical trials, 90% might be warranted. But there's no need to aim for 99% unless failure to detect an effect would have severe consequences.

Can I use power analysis for non-normal data or non-parametric tests?

Standard power formulas (like those in this calculator) assume approximately normal distributions for means-based tests. If your data are severely non-normal (e.g., highly skewed, binary outcomes, count data), these formulas may be approximate or less accurate. For proportions, the calculator uses normal approximations that work well with reasonably large samples. For non-parametric tests (e.g., Mann-Whitney U, Wilcoxon), power calculations are more complex and may require simulation or specialized software. As a rough guideline, non-parametric tests often have slightly lower power than parametric tests for the same sample size, so you might increase your sample by 5–15% as a buffer. For homework or preliminary planning, standard power calculations give a reasonable starting point even for mildly non-normal data.

What if I can't afford the sample size the calculator says I need?

If the required sample size exceeds your budget or resources, you have several options: (1) Accept lower power (e.g., 70% instead of 80%) and acknowledge this limitation. (2) Increase the minimum detectable effect—focus on detecting larger effects that require fewer participants. (3) Use the 'MDE' mode to find the smallest effect you can detect with your available sample, then assess whether that's meaningful in your field. (4) Consider paired or within-subjects designs (if appropriate), which often require smaller samples. (5) Treat your study as a pilot for a larger future study. (6) Collaborate with other researchers to pool data or resources. Always be transparent: report your actual power and explain resource constraints. Don't force a study with 20% power and then overinterpret null results.

Why do different power calculators sometimes give slightly different answers?

Small differences in results across calculators can occur due to: (1) Different approximation methods (some use exact distributions, others use normal approximations). (2) Rounding conventions (some round up sample sizes, others don't). (3) Treatment of continuity corrections for proportions. (4) Different handling of unequal group sizes or allocation ratios. (5) Assumptions about one-tailed vs two-tailed tests. These differences are usually minor (within a few participants) and don't change conclusions. For homework or planning, any reputable power calculator is fine. For critical research, document which calculator and version you used, and report exact assumptions (α, power, effect size, test type). If in doubt, try multiple calculators and see if results converge—they usually do within 5–10%.
