Estimate Statistical Power and Required Sample Size
Explore statistical power and sample size for simple z/t tests on means. Compute power given sample size, or find the sample size needed to achieve a target power. Visualize power curves to understand tradeoffs.
Power is the probability of catching a real effect when one exists. Formally, 1 − β, the complement of the Type II error rate. A study with 60% power against the effect size you actually care about will miss it 40% of the time even when the effect is real. Underpowered studies are the main reason the literature fills up with false negatives that look like "no effect."
Four quantities are linked: sample size n, effect size (Cohen's d for means, h for proportions), significance level α, and power. Fix any three and the fourth is determined. Two modes you'll actually use: solve for n given a target power (0.80 is conventional, 0.90 if you can afford it), or solve for power given a fixed n and effect size. The math runs through the noncentral t for t-tests and noncentral F for ANOVA. The recurring trap worth flagging: don't estimate effect size from your own pilot data and feed it back into the power calculation. Pilot estimates are noisy and biased upward by selection. Anchor on a domain-driven "smallest effect of interest" instead.
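The link between the four quantities can be sketched in a few lines. This is a minimal illustration, not this calculator's implementation: it assumes a two-sided, two-sample comparison of means with equal group sizes and uses a normal (z) approximation, whereas an exact t-test calculation goes through the noncentral t and comes out slightly lower. The function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test on means.

    Assumption: normal (z) approximation with equal group sizes.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    shift = d * sqrt(n_per_group / 2)   # expected test statistic under the alternative
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


power_medium = approx_power(d=0.5, n_per_group=64)
print(f"d = 0.5, n = 64 per group, alpha = 0.05 -> power ~ {power_medium:.2f}")
```

Fixing d, n, and α pins down power; changing any one input while holding the other two moves the result.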
Define Effect Size (Practical, Not Just Statistical)
A power calculator for hypothesis tests asks you to specify the effect size you want to detect before you run the study. Effect size is the magnitude of the difference between null and alternative hypotheses—maybe a 5-point IQ gain, a 2-millimeter reduction in tumor diameter, or a 0.3-unit shift in customer satisfaction score. Choosing this number forces you to think about what matters practically, not just statistically.
Standardized effect sizes like Cohen's d express the difference in units of standard deviation. A d of 0.2 is considered small, 0.5 medium, and 0.8 large. These benchmarks come from behavioral science, where Cohen cataloged typical study outcomes. But domain matters: a "small" effect in psychology might be clinically meaningful in medicine or trivial in physics. Use field-specific baselines when available.
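Converting a raw difference into Cohen's d is a single division, which the sketch below makes explicit. The planning values are assumptions for illustration (a 5-point IQ gain on a scale with a standard deviation of 15), and the helper names are hypothetical. Treating everything from 0.2 up to 0.5 as "small" is one reasonable reading of the benchmarks, not a rule from Cohen.

```python
def standardized_effect(raw_difference, sd):
    """Cohen's d: a raw difference expressed in standard-deviation units."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return raw_difference / sd


def cohen_benchmark(d):
    """Label |d| against the conventional cut-offs 0.2, 0.5, and 0.8."""
    size = abs(d)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Assumed planning values: 5-point gain, SD of 15 (illustrative only).
d_iq = standardized_effect(5, 15)
label = cohen_benchmark(d_iq)
print(f"d = {d_iq:.2f} ({label})")  # prints "d = 0.33 (small)"
```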
Picking an unrealistically large effect size will give you optimistic power estimates and undersized samples. The study then fails to detect a real but smaller effect, wasting resources. Picking too small an effect inflates sample requirements beyond budget. The sweet spot is the minimum effect that would actually change a decision—what some call the "smallest effect of interest."
Suppose a new drug costs twice as much as the current standard. A 5% improvement might not justify the price; only a 15% improvement would. That 15% becomes your target effect size. Power analysis then tells you how many patients you need to detect it reliably. The calculation ties statistical planning to real-world stakes.
Alpha, Tails, and What You're Testing
Alpha is the probability of rejecting the null hypothesis when it's actually true—a false positive or Type I error. Setting alpha to 0.05 means you accept a 5% chance of crying wolf. Regulatory agencies sometimes require alpha at 0.01 or even 0.005 for confirmatory trials, trading off power for stricter control of false claims.
One-tailed tests concentrate all rejection probability in a single direction. If theory firmly predicts a new therapy can only help—not harm—a one-tailed test at alpha 0.05 puts the entire 5% in the upper tail. This boosts power for detecting improvements but blinds you to worsening effects. Two-tailed tests split alpha evenly, detecting differences in either direction at the cost of slightly lower power per direction.
Most journal guidelines recommend two-tailed tests unless you can justify one-tailed a priori. "I expected the new method to be better" after seeing the data doesn't count—that's p-hacking. Pre-registration of one-tailed hypotheses protects against post-hoc rationalization.
Alpha, tails, and effect size interact in the power formula. Lowering alpha from 0.05 to 0.01 shrinks the rejection region, requiring larger samples to maintain the same power. Switching from two-tailed to one-tailed at fixed alpha increases power in the predicted direction but gives up any ability to detect an effect in the opposite direction. Understand these tradeoffs before committing to a design.
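To see how alpha and tails move the required sample size, here is a rough sketch under a normal approximation for a two-sample comparison with equal groups. The function name `n_per_group` and the choice of d = 0.5 are assumptions for illustration; exact t-based answers run about one subject per group higher.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80, two_sided=True):
    """Approximate per-group n for a two-sample test on means (z approximation)."""
    z = NormalDist()
    tail_area = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail_area)   # critical value for the chosen tails
    z_beta = z.inv_cdf(power)            # quantile matching the target power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


n_two_05 = n_per_group(0.5, alpha=0.05)
n_two_01 = n_per_group(0.5, alpha=0.01)
n_one_05 = n_per_group(0.5, alpha=0.05, two_sided=False)

print("two-sided, alpha 0.05:", n_two_05)
print("two-sided, alpha 0.01:", n_two_01)
print("one-sided, alpha 0.05:", n_one_05)
```

With these inputs the stricter alpha raises the requirement by roughly half, while the one-sided test lowers it by about a fifth.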
Power Curve: How n Changes Detection
A power curve plots power—probability of detecting a true effect—against sample size or effect size. At small n, power hovers near alpha because you'd barely do better than chance. As n grows, the curve rises steeply, then flattens as power approaches 1.0. Most studies aim for power around 0.80, meaning an 80% chance of finding a real effect of the specified size.
The shape reveals diminishing returns. Jumping from n = 50 to n = 100 might lift power from 0.55 to 0.80—a big gain. But going from n = 200 to n = 400 might only bump power from 0.95 to 0.99. After a point, adding more data yields marginal benefit while doubling costs. The curve helps identify the "elbow" where investment efficiency peaks.
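The diminishing returns are easy to reproduce numerically. The sketch below tabulates an approximate power curve as a text bar chart; it assumes d = 0.4, a two-sided alpha of 0.05, equal groups, and a normal approximation, so the figures are illustrative and will not match the example numbers above exactly.

```python
from math import sqrt
from statistics import NormalDist


def power_at(n_per_group, d, alpha=0.05):
    """Approximate two-sided, two-sample power (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


sample_sizes = [10, 25, 50, 100, 200, 400]
curve = [(n, power_at(n, d=0.4)) for n in sample_sizes]

for n, p in curve:
    bar = "#" * round(p * 40)   # crude text plot of the curve
    print(f"n = {n:>3} per group  power = {p:.2f}  {bar}")
```

Under these assumptions the step from 50 to 100 per group adds far more power than the step from 200 to 400, which is the "elbow" the curve is meant to expose.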
You can also fix n and vary effect size. Larger effects are easier to detect, so power climbs rapidly for big differences. The curve then shows the range of effects your study can reliably capture. If your minimum detectable effect exceeds what's practically meaningful, redesign before wasting resources.
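The minimum detectable effect at a fixed n can be approximated by inverting the sample-size formula. The sketch below assumes a two-sided, two-sample test with equal groups and a normal approximation, and it ignores the negligible chance of rejecting in the wrong direction; `minimum_detectable_d` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Smallest Cohen's d detectable with the given power (z approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return (z_alpha + z_beta) * sqrt(2 / n_per_group)


mde_64 = minimum_detectable_d(64)
mde_20 = minimum_detectable_d(20)
print(f"n = 64 per group -> detectable d ~ {mde_64:.2f}")
print(f"n = 20 per group -> detectable d ~ {mde_20:.2f}")
```

If the value printed for your n is larger than the smallest effect you care about, the design cannot reliably capture that effect.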
Visualizing power curves clarifies tradeoffs better than a single number. Grant reviewers appreciate seeing how the proposed sample handles a range of plausible effects. Presenting curves also guards against "just significant" thinking—0.80 power is conventional, but 0.90 or 0.95 may be warranted for high-stakes decisions.
Solve for n vs Solve for Power (Two Modes)
Power calculators typically offer two modes. In "solve for n" mode, you supply effect size, alpha, and target power—say, 0.80—and the calculator returns the sample size needed. This is standard for grant proposals and protocol planning, where you need a concrete recruitment target.
In "solve for power" mode, you fix sample size—perhaps constrained by budget or available patients—and the calculator tells you the power you'll achieve. If the answer comes back at 0.45, you face a hard choice: accept high risk of missing a real effect, or cut costs elsewhere to boost enrollment.
Some researchers run both modes iteratively. They start by asking how many subjects they can realistically recruit, check power, then adjust effect size assumptions or alpha to see if acceptable power is attainable. This exploratory dance often reveals whether the study is even feasible under budget constraints.
Neither mode changes the underlying relationship: power depends on effect size, variability, alpha, and n. Solving for one fixes the others. Understanding this constraint prevents magical thinking—you can't conjure power from thin air without bigger effects or bigger samples.
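The two modes are the same relationship read in opposite directions. One way to sketch this is to write "solve for power" as a function and obtain "solve for n" by searching for the smallest n that reaches the target. The code assumes a normal approximation for a two-sided, two-sample test with equal groups; the function names are hypothetical, and the linear search is chosen for clarity rather than speed.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def solve_power(n_per_group, d, alpha=0.05):
    """Mode 2: power achieved at a fixed per-group n (z approximation)."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


def solve_n(d, alpha=0.05, target_power=0.80, n_max=100_000):
    """Mode 1: smallest per-group n whose power reaches the target."""
    for n in range(2, n_max + 1):
        if solve_power(n, d, alpha) >= target_power:
            return n
    raise ValueError("target power not reachable within n_max")


n_needed = solve_n(d=0.5)                              # recruitment target
power_at_budget = solve_power(n_per_group=20, d=0.5)   # fixed-budget check

print("n per group for 80% power:", n_needed)
print(f"power with 20 per group: {power_at_budget:.2f}")
```

Because `solve_n` only calls `solve_power`, swapping in an exact noncentral-t power function would upgrade both modes at once.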
Planning traps: when pilot estimates inflate effect sizes
Researchers often plug in optimistic effect sizes pulled from pilot studies or published literature. But pilot studies are noisy, and published effects suffer from winner's curse—only the largest estimates clear the publication bar. Using inflated inputs produces undersized samples that fail to replicate the expected result.
Another mistake is ignoring attrition. If 20% of enrolled participants drop out, your final sample shrinks accordingly. Power analysis should target the post-attrition n, not the enrollment figure. Ignoring dropouts can turn a well-powered design into an underpowered mess once data collection finishes.
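The attrition adjustment is a one-line calculation, but it is easy to apply in the wrong direction. The sketch below divides the required analyzable sample by the expected retention rate; the 63-per-group requirement and the 20% dropout rate are assumed values, and `enrollment_target` is a hypothetical helper.

```python
from math import ceil


def enrollment_target(n_analyzable, dropout_rate):
    """Participants to enroll so that n_analyzable remain after dropout."""
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    return ceil(n_analyzable / (1 - dropout_rate))


required_n = 63   # assumed per-group n from the power analysis
dropout = 0.20    # assumed attrition rate

enroll = enrollment_target(required_n, dropout)
naive = ceil(required_n * (1 + dropout))   # common shortcut that falls short

print("enroll per group:", enroll)
print("naive n * (1 + rate):", naive)
```

Multiplying by (1 + rate) looks similar but leaves the retained sample below the requirement once the dropout rate is applied to the larger enrolled group.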
Using the wrong variance estimate is equally dangerous. If prior studies measured a different population or used different instruments, their variance may not transfer. Underestimating variability inflates apparent power; overestimating it wastes resources on excess enrollment.
Finally, some planners pick one-tailed tests to juice power without theoretical justification. Reviewers catch this gambit, and it can undermine credibility. If you can't defend directionality before data collection, stick with two-tailed and budget for the extra sample accordingly.
Power Planning Q&A
What's the relationship between power and Type II error?
Power equals 1 minus beta, where beta is the Type II error rate—the probability of failing to reject a false null. At 0.80 power, beta is 0.20, meaning a 20% chance of missing a real effect. Raising power lowers beta but requires larger samples or bigger effects.
Can I calculate post-hoc power after the study?
Statisticians discourage "observed power" analysis because it's mathematically redundant with the p-value. A non-significant result always implies low observed power. Instead, report confidence intervals and discuss the range of effects consistent with your data.
How do I handle multiple comparisons?
Each comparison inflates familywise Type I error. Bonferroni or other adjustments lower alpha per test, which reduces power per comparison. Power analysis for multiple endpoints is more complex—consider simulation or consulting a statistician.
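For a first-pass feel of what a Bonferroni correction costs, the sketch below recomputes the per-group n at the corrected per-test alpha. It assumes three endpoints, d = 0.5, a two-sided two-sample test, and a normal approximation. It gives each endpoint 80% power on its own, which is not the same as 80% power to detect all of them.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha, power=0.80):
    """Approximate per-group n for a two-sided, two-sample test (z approximation)."""
    z = NormalDist()
    return ceil(2 * ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2)


family_alpha = 0.05
n_endpoints = 3                                # hypothetical number of comparisons
per_test_alpha = family_alpha / n_endpoints    # Bonferroni correction

n_single = n_per_group(0.5, family_alpha)
n_bonferroni = n_per_group(0.5, per_test_alpha)

print("one endpoint:", n_single, "per group")
print(f"{n_endpoints} endpoints, Bonferroni:", n_bonferroni, "per group")
```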
What if my effect size is uncertain?
Run sensitivity analyses across a range of plausible effect sizes. Report the sample sizes needed at the low, mid, and high ends. This transparency helps funders understand risk and supports adaptive designs that adjust enrollment based on interim data.
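A sensitivity table of this kind takes only a loop. The sketch below uses Cohen's benchmark values as stand-ins for the low, mid, and high scenarios; in practice these would come from domain knowledge. It relies on a normal approximation for a two-sided, two-sample test at alpha = 0.05 and 80% power, so exact t-based figures run about one subject per group higher.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample test (z approximation)."""
    z = NormalDist()
    return ceil(2 * ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2)


# Placeholder scenarios; replace with plausible effects for your field.
scenarios = {"low": 0.2, "mid": 0.5, "high": 0.8}
required = {label: n_per_group(d) for label, d in scenarios.items()}

for label, d in scenarios.items():
    print(f"{label:>4} (d = {d}): {required[label]} per group")
```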
Does power analysis work for non-normal data?
Standard formulas assume normality or rely on large-sample approximations. For skewed or ordinal data, simulation-based power analysis or specialized methods (e.g., for rank tests) are more accurate. The calculator here covers common z and t scenarios.
Limitations of these power formulas
Idealized conditions: the formulas assume complete data, no protocol deviations, and no measurement error beyond what σ captures. Real studies lose power to dropout and missing data, so divide the required n by (1 − expected attrition rate) to get an enrollment target before fielding.
Normality and balance: the standard formulas assume normality (or large enough n for the CLT) and equal group sizes. Unbalanced designs and non-normal outcomes need bespoke calculation, often via simulation.
Single comparison: multiple comparisons aren't handled here. If you're testing several endpoints, the corrected α (Bonferroni or BH-FDR) drives the per-test power calculation.
Fixed designs only: this page does power for fixed-design studies. Adaptive designs, group-sequential trials, and interim analyses need specialized software.
Note: don't estimate effect size from your own pilot data and feed it back into the power calculation. Pilot estimates are noisy and selection-biased upward, so the resulting n is systematically too small. Anchor on a domain-driven "smallest effect of interest" instead. G*Power is the standard free tool; R's pwr package and Python's statsmodels.stats.power match it for the common designs.
Sources & References
The mathematical formulas and statistical power concepts used in this calculator are based on established statistical theory and authoritative academic sources:
- NIST/SEMATECH e-Handbook: Power and Sample Size - Authoritative reference from the National Institute of Standards and Technology.
- G*Power Documentation: G*Power - Industry-standard power analysis software documentation.
- Cohen (1988): Statistical Power Analysis for the Behavioral Sciences - Seminal book on effect sizes and power analysis.
- Penn State STAT 500: Power - University course material on statistical power concepts.
- Statistics By Jim: Statistical Power - Practical explanations of power and sample size calculations.
Power planning: working questions
What's a good level of statistical power?
0.80 is the conventional minimum, set informally by Cohen and adopted by most journal guidelines. 0.90 if you can afford it, especially when missing the effect would be costly (drug trials, policy decisions). Below 0.80, you have a meaningful chance of running a study, finding a non-significant result, and concluding nothing when a real effect was actually present. Above 0.95 buys little additional protection at high recruitment cost. The right answer depends on the cost of a Type II error in your specific context.
How do I increase the power of a study?
Four levers, in roughly decreasing order of how often they apply. Increase n: the test statistic's signal-to-noise ratio scales with √n, so quadrupling sample size roughly doubles it. Increase the effect size you're looking for (often not under your control). Reduce noise: tighter measurement, blocking on covariates, or paired designs that subtract within-subject variability. Loosen α: from 0.01 to 0.05, for example, if the field accepts the looser threshold. Pre-registration ties you to a specific n before data collection, which keeps power honest.
How does effect size relate to power?
Power rises with effect size, so the required sample size falls: bigger effect, less data needed for the same power. Cohen's d of 0.5 (medium) at α = 0.05 needs about n = 64 per group for 80% power. d = 0.2 (small) at the same α and power needs n ≈ 393 per group. d = 0.8 (large) needs n ≈ 26. The trap is that you don't know d in advance, you assume it. If you guess too high (the typical error), your study comes out underpowered for the real effect. Anchor on a domain-driven smallest effect of interest, not a pilot estimate.
How do Cohen's d benchmarks (small/medium/large) work?
Cohen (1988) suggested d = 0.2 as small, d = 0.5 as medium, d = 0.8 as large for behavioral science. The benchmarks are field-relative. In clinical trials, d = 0.3 may be clinically meaningful and worth detecting; in physics, d = 2 is unremarkable. Use the benchmarks for sanity-checking only, not for setting effect size targets blindly. The right input to a power calculation is the smallest effect you'd care about detecting, derived from domain knowledge, not from Cohen's table.
Type I vs Type II error, in plain terms?
Type I (α): you reject the null when it was actually true. False alarm. The conventional 0.05 threshold means you accept a 5% false-positive rate. Type II (β): you fail to reject the null when it was actually false. Missed signal. Power = 1 − β, so power = 0.80 means a 20% Type II rate. The two trade off: lowering α (stricter significance threshold) reduces Type I but raises Type II for a given n. α is fixed by design; β depends on the true effect size, which you don't know.
How do I do a power analysis for a t-test?
Three inputs determine the fourth. For a two-sided two-sample t at α = 0.05 and power 0.80, a handy rule of thumb is n ≈ 16 / d² per group. d = 0.5 gives n = 64, d = 0.3 gives n ≈ 178, d = 0.8 gives n = 25. Exact noncentral-t calculations land within a few subjects of these (n ≈ 26 at d = 0.8, for example); R's pwr::pwr.t.test() and statsmodels' tt_ind_solve_power do the exact version, and G*Power has the same formulas behind a GUI. For paired t-tests, the variance is on the difference scores, so d_paired and d_unpaired aren't directly comparable. Use the right d for the right design.
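The 16 / d² shortcut can be traced back to the normal-approximation sample-size formula. The derivation below is a sketch under that approximation (two-sided α = 0.05, power 0.80, equal groups), using the standard normal quantiles z₀.₉₇₅ ≈ 1.960 and z₀.₈₀ ≈ 0.842:

```latex
n_{\text{per group}}
  \approx \frac{2\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}}{d^{2}}
  = \frac{2\,(1.960 + 0.842)^{2}}{d^{2}}
  \approx \frac{15.7}{d^{2}}
  \approx \frac{16}{d^{2}}
```

Rounding 15.7 up to 16 partly offsets the small extra sample the t distribution demands, which is why the shortcut stays close to exact answers.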
What if I can't recruit enough subjects?
Don't run an underpowered study without acknowledging it. Options: target a larger smallest effect of interest if only a larger effect would change the decision anyway, increase α, switch to a paired or repeated-measures design that reduces noise, or accept lower power and report it. Some journals require minimum power thresholds, so check before starting. Equivalence testing (TOST) flips the question: instead of detecting an effect, you test whether the effect is below a meaningful bound, which has its own power requirements but different framing.
Why is post-hoc power analysis controversial?
Post-hoc power computed from your observed effect size is mathematically a function of the p-value. A result sitting right at the significance threshold has post-hoc power of about 50%, so a non-significant result will always come out below roughly 50% and a significant one above it. Post-hoc power therefore adds no information beyond the p-value itself. Hoenig and Heisey (2001) is the canonical critique. What's actually useful: a sensitivity analysis that says "with this n and α, we had 80% power to detect d = 0.5 or larger," which describes what the study could have found regardless of what it actually found.
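The redundancy is easy to demonstrate. The sketch below computes "observed power" for a two-sided z-test directly from the p-value, with no other input; the z-test setting and the function name `observed_power` are assumptions chosen to keep the illustration short.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Post-hoc power of a two-sided z-test, written as a function of p alone."""
    z = NormalDist()
    z_obs = z.inv_cdf(1 - p_value / 2)   # |z| implied by the two-sided p-value
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Treat the observed effect as if it were the true effect.
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)


results = {p: observed_power(p) for p in (0.01, 0.05, 0.20, 0.50)}
for p, power in results.items():
    print(f"p = {p:.2f} -> observed power ~ {power:.2f}")
```

Since the p-value alone determines the answer, reporting observed power next to it tells the reader nothing new.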