Estimate Statistical Power and Required Sample Size
Explore statistical power and sample size for simple z/t tests on means. Compute power given sample size, or find the sample size needed to achieve a target power. Visualize power curves to understand tradeoffs.
Define Effect Size (Practical, Not Just Statistical)
A power calculator for hypothesis tests asks you to specify the effect size you want to detect before you run the study. Effect size is the magnitude of the difference between null and alternative hypotheses—maybe a 5-point IQ gain, a 2-millimeter reduction in tumor diameter, or a 0.3-unit shift in customer satisfaction score. Choosing this number forces you to think about what matters practically, not just statistically.
Standardized effect sizes like Cohen's d express the difference in units of standard deviation. A d of 0.2 is considered small, 0.5 medium, and 0.8 large. These benchmarks come from behavioral science, where Cohen cataloged typical study outcomes. But domain matters: a "small" effect in psychology might be clinically meaningful in medicine or trivial in physics. Use field-specific baselines when available.
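As a quick sketch of the arithmetic (the raw difference and standard deviation here are assumed for illustration), Cohen's d is simply the raw difference divided by the standard deviation:

```python
# Hypothetical inputs: a 5-point IQ gain against a population SD of 15
mean_diff = 5.0  # assumed raw difference between groups
sd = 15.0        # assumed population standard deviation

d = mean_diff / sd  # Cohen's d: the difference expressed in SD units
print(f"Cohen's d = {d:.2f}")  # 0.33, between "small" (0.2) and "medium" (0.5)
```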
Picking an unrealistically large effect size will give you optimistic power estimates and undersized samples. The study then fails to detect a real but smaller effect, wasting resources. Picking too small an effect inflates sample requirements beyond budget. The sweet spot is the minimum effect that would actually change a decision—what some call the "smallest effect of interest."
Suppose a new drug costs twice as much as the current standard. A 5% improvement might not justify the price; only a 15% improvement would. That 15% becomes your target effect size. Power analysis then tells you how many patients you need to detect it reliably. The calculation ties statistical planning to real-world stakes.
Alpha, Tails, and What You're Testing
Alpha is the probability of rejecting the null hypothesis when it's actually true—a false positive or Type I error. Setting alpha to 0.05 means you accept a 5% chance of crying wolf. Regulatory agencies sometimes require alpha at 0.01 or even 0.005 for confirmatory trials, trading off power for stricter control of false claims.
One-tailed tests concentrate all rejection probability in a single direction. If theory firmly predicts a new therapy can only help—not harm—a one-tailed test at alpha 0.05 puts the entire 5% in the upper tail. This boosts power for detecting improvements but blinds you to worsening effects. Two-tailed tests split alpha evenly, detecting differences in either direction at the cost of slightly lower power per direction.
Most journal guidelines recommend two-tailed tests unless you can justify one-tailed a priori. "I expected the new method to be better" after seeing the data doesn't count—that's p-hacking. Pre-registration of one-tailed hypotheses protects against post-hoc rationalization.
Alpha, tails, and effect size interact in the power formula. Lowering alpha from 0.05 to 0.01 shrinks the rejection region, requiring larger samples to maintain the same power. Switching from two-tailed to one-tailed at fixed alpha increases power but forfeits the ability to detect an effect in the opposite direction. Understand these tradeoffs before committing to a design.
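A minimal sketch of these tradeoffs, using the standard normal-approximation sample-size formula (the effect size and power target are assumed for illustration):

```python
import math
from scipy.stats import norm

def n_required(d, alpha=0.05, power=0.80, two_tailed=True):
    """Approximate n for a one-sample z-test to detect standardized effect d."""
    z_alpha = norm.ppf(1 - alpha / 2) if two_tailed else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)  # quantile corresponding to the target power
    return math.ceil(((z_alpha + z_beta) / d) ** 2)

d = 0.5  # assumed medium effect
print(n_required(d, alpha=0.05))                    # two-tailed baseline: 32
print(n_required(d, alpha=0.01))                    # stricter alpha: 47
print(n_required(d, alpha=0.05, two_tailed=False))  # one-tailed: 25
```

Tightening alpha from 0.05 to 0.01 raises the requirement from 32 to 47 subjects, while going one-tailed drops it to 25: exactly the tradeoffs described above.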
Power Curve: How n Changes Detection
A power curve plots power—probability of detecting a true effect—against sample size or effect size. At small n, power hovers near alpha because you'd barely do better than chance. As n grows, the curve rises steeply, then flattens as power approaches 1.0. Most studies aim for power around 0.80, meaning an 80% chance of finding a real effect of the specified size.
The shape reveals diminishing returns. Jumping from n = 50 to n = 100 might lift power from 0.55 to 0.80—a big gain. But going from n = 200 to n = 400 might only bump power from 0.95 to 0.99. After a point, adding more data yields marginal benefit while doubling costs. The curve helps identify the "elbow" where investment efficiency peaks.
You can also fix n and vary effect size. Larger effects are easier to detect, so power climbs rapidly for big differences. The curve then shows the range of effects your study can reliably capture. If your minimum detectable effect exceeds what's practically meaningful, redesign before wasting resources.
Visualizing power curves clarifies tradeoffs better than a single number. Grant reviewers appreciate seeing how the proposed sample handles a range of plausible effects. Presenting curves also guards against "just significant" thinking—0.80 power is conventional, but 0.90 or 0.95 may be warranted for high-stakes decisions.
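The curve itself is easy to sketch numerically. Here is a minimal version for a one-sample two-tailed z-test, with an assumed effect size of 0.3:

```python
import math
from scipy.stats import norm

def power_z(d, n, alpha=0.05):
    """Approximate two-tailed power of a one-sample z-test
    (ignores the negligible opposite-tail term)."""
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(d * math.sqrt(n) - z_crit)

d = 0.3  # assumed standardized effect
for n in (25, 50, 100, 200, 400):
    print(f"n = {n:3d}: power = {power_z(d, n):.2f}")
```

The jump from n = 25 to n = 100 is dramatic, while beyond n = 200 the curve is nearly flat: the diminishing returns described above.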
Solve for n vs Solve for Power (Two Modes)
Power calculators typically offer two modes. In "solve for n" mode, you supply effect size, alpha, and target power—say, 0.80—and the calculator returns the sample size needed. This is standard for grant proposals and protocol planning, where you need a concrete recruitment target.
In "solve for power" mode, you fix sample size—perhaps constrained by budget or available patients—and the calculator tells you the power you'll achieve. If the answer comes back at 0.45, you face a hard choice: accept high risk of missing a real effect, or cut costs elsewhere to boost enrollment.
Some researchers run both modes iteratively. They start by asking how many subjects they can realistically recruit, check power, then adjust effect size assumptions or alpha to see if acceptable power is attainable. This exploratory dance often reveals whether the study is even feasible under budget constraints.
Neither mode changes the underlying relationship: power depends on effect size, variability, alpha, and n. Solving for one fixes the others. Understanding this constraint prevents magical thinking—you can't conjure power from thin air without bigger effects or bigger samples.
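The two modes are just two rearrangements of the same approximation. A sketch with assumed inputs:

```python
import math
from scipy.stats import norm

ALPHA, D = 0.05, 0.4  # assumed two-tailed alpha and standardized effect
z_crit = norm.ppf(1 - ALPHA / 2)

# Mode 1: solve for n given a target power of 0.80
n = math.ceil(((z_crit + norm.ppf(0.80)) / D) ** 2)
print(f"Solve for n: need n = {n}")  # 50

# Mode 2: solve for power given a budget-capped n of 30
power = norm.cdf(D * math.sqrt(30) - z_crit)
print(f"Solve for power: n = 30 gives power = {power:.2f}")  # 0.59
```

Both runs use the same relationship; fixing three of the four quantities determines the fourth.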
Common Planning Mistakes (Too-Optimistic Inputs)
Researchers often plug in optimistic effect sizes pulled from pilot studies or published literature. But pilot studies are noisy, and published effects suffer from winner's curse: only the largest estimates clear the publication bar. Using inflated inputs produces undersized samples that lack the power to detect the true, smaller effect.
Another mistake is ignoring attrition. If 20% of enrolled participants drop out, your final sample shrinks accordingly. Power analysis should target the post-attrition n, not the enrollment figure. Ignoring dropouts can turn a well-powered design into an underpowered mess once data collection finishes.
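The attrition adjustment is simple division. A sketch with an assumed dropout rate:

```python
import math

n_required = 128     # post-attrition n from the power analysis (assumed)
dropout_rate = 0.20  # expected fraction of participants who drop out

# Enroll enough that the expected number of completers still hits the target
n_enroll = math.ceil(n_required / (1 - dropout_rate))
print(n_enroll)  # 160
```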
Using the wrong variance estimate is equally dangerous. If prior studies measured a different population or used different instruments, their variance may not transfer. Underestimating variability inflates apparent power; overestimating it wastes resources on excess enrollment.
Finally, some planners pick one-tailed tests to juice power without theoretical justification. Reviewers catch this gambit, and it can undermine credibility. If you can't defend directionality before data collection, stick with two-tailed and budget for the extra sample accordingly.
Power Planning Q&A
What's the relationship between power and Type II error?
Power equals 1 minus beta, where beta is the Type II error rate—the probability of failing to reject a false null. At 0.80 power, beta is 0.20, meaning a 20% chance of missing a real effect. Raising power lowers beta but requires larger samples or bigger effects.
Can I calculate post-hoc power after the study?
Statisticians discourage "observed power" analysis because it's mathematically redundant with the p-value. A non-significant result always implies low observed power. Instead, report confidence intervals and discuss the range of effects consistent with your data.
How do I handle multiple comparisons?
Each comparison inflates familywise Type I error. Bonferroni or other adjustments lower alpha per test, which reduces power per comparison. Power analysis for multiple endpoints is more complex—consider simulation or consulting a statistician.
What if my effect size is uncertain?
Run sensitivity analyses across a range of plausible effect sizes. Report the sample sizes needed at the low, mid, and high ends. This transparency helps funders understand risk and supports adaptive designs that adjust enrollment based on interim data.
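A sensitivity table like this takes only a few lines with the normal-approximation formula; the three effect-size scenarios below are assumed for illustration:

```python
import math
from scipy.stats import norm

# Required n at two-tailed alpha = 0.05 and power = 0.80
z_sum = norm.ppf(0.975) + norm.ppf(0.80)
for d in (0.2, 0.3, 0.5):  # assumed low / mid / high scenarios
    print(f"d = {d}: n = {math.ceil((z_sum / d) ** 2)}")
```

The spread (197 vs 88 vs 32 subjects) shows how sharply the requirement depends on the assumed effect.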
Does power analysis work for non-normal data?
Standard formulas assume normality or rely on large-sample approximations. For skewed or ordinal data, simulation-based power analysis or specialized methods (e.g., for rank tests) are more accurate. The calculator here covers common z and t scenarios.
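For non-normal data, a simulation sketch is often more trustworthy than a closed formula. This hypothetical example estimates power for a Mann-Whitney test on lognormal outcomes; all design parameters and the true shift are assumed:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
alpha, n, n_sims = 0.05, 60, 2000  # assumed design: 60 per group, 2000 simulations

rejections = 0
for _ in range(n_sims):
    # Skewed outcomes with an assumed true shift of 0.4 on the log scale
    control = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    treated = rng.lognormal(mean=0.4, sigma=1.0, size=n)
    _, p = mannwhitneyu(control, treated, alternative="two-sided")
    rejections += p < alpha

est_power = rejections / n_sims
print(f"Estimated power: {est_power:.2f}")
```

The same skeleton works for any test and any data-generating process you can simulate: generate data under the alternative, run the test, and count rejections.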
Limitations & Assumptions
• Simplified Formulas: This calculator uses standard closed-form power formulas that assume idealized conditions. Real studies face missing data, protocol deviations, and measurement error that reduce actual power below theoretical estimates.
• Normality and Equal Variance: Power formulas assume normally distributed data or large samples where the central limit theorem applies. For two-sample tests, equal variances and equal sample sizes per group are typically assumed. Violations affect accuracy.
• Effect Size Uncertainty: Power calculations are highly sensitive to effect size assumptions. Small changes in assumed effect dramatically alter required sample sizes. Consider sensitivity analyses across plausible ranges.
• No Adjustment for Practical Constraints: This tool does not account for dropout rates, interim analyses, multiple comparisons, or adaptive designs essential for clinical trial planning.
Important Note: This calculator is for educational and informational purposes only. It is intended to demonstrate how statistical power analysis works mathematically, not to support clinical trial design, regulatory submissions, or high-stakes research planning. Professional power analysis requires proper consideration of dropout rates, interim analyses, multiple comparisons, and regulatory requirements. For real studies, use dedicated software (G*Power, PASS, nQuery, R packages) and consult qualified biostatisticians.
Sources & References
The mathematical formulas and statistical power concepts used in this calculator are based on established statistical theory and authoritative academic sources:
• NIST/SEMATECH e-Handbook: Power and Sample Size - Authoritative reference from the National Institute of Standards and Technology.
• G*Power Documentation - Industry-standard power analysis software documentation.
• Cohen (1988): Statistical Power Analysis for the Behavioral Sciences - Seminal book on effect sizes and power analysis.
• Penn State STAT 500: Power - University course material on statistical power concepts.
• Statistics By Jim: Statistical Power - Practical explanations of power and sample size calculations.
Frequently Asked Questions
Common questions about statistical power, power analysis, sample size calculation, z-test power, t-test power, effect size, type II error, and how to use this calculator for homework and study design practice.
What is statistical power?
Statistical power is the probability that a test will correctly reject a false null hypothesis (detect a true effect). It equals 1 − β, where β is the Type II error rate (false negative). Power of 80% means that if there really is an effect, you have an 80% chance of detecting it with your test.
What is the difference between z-tests and t-tests?
Z-tests are used when the population standard deviation (σ) is known or when sample sizes are very large. T-tests are used when σ is unknown and must be estimated from the sample, which is more common in practice. For large samples (n > 30), z and t tests give very similar results.
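The difference shows up in the critical values. A quick check for a hypothetical one-sample test with n = 50:

```python
from scipy.stats import norm, t

alpha = 0.05
df = 49  # one-sample t-test with n = 50

z_crit = norm.ppf(1 - alpha / 2)
t_crit = t.ppf(1 - alpha / 2, df)
print(f"z critical: {z_crit:.3f}")  # 1.960
print(f"t critical: {t_crit:.3f}")  # 2.010
```

At n = 50 the t cutoff is only about 2.5% larger than the z cutoff; at small n the gap is much wider (t.ppf(0.975, 9) is roughly 2.26), which is why the distinction matters most for small samples.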
When should I use a one-sided vs two-sided test?
Use a two-sided test when you want to detect effects in either direction (μ ≠ μ₀). Use a one-sided test only when you have strong prior reason to expect the effect in a specific direction (e.g., a new treatment can only help, not hurt). One-sided tests have more power but risk missing effects in the unexpected direction.
Why does power increase with sample size?
Larger samples provide more precise estimates of the population mean, reducing the standard error. This makes it easier to distinguish the true effect from random noise. The relationship is nonlinear: doubling n doesn't double power, and there are diminishing returns as n gets very large.
What factors affect power?
Four main factors affect power: (1) Effect size - larger effects are easier to detect; (2) Sample size - more data means more precision; (3) Significance level (α) - higher α increases power but also false positives; (4) Variability (σ) - less noise makes effects clearer. These factors are interconnected: fixing three determines the fourth.
Why is this tool not enough for clinical trial planning?
Clinical trials require much more sophisticated power analysis accounting for dropout rates, interim analyses, multiple comparisons, adaptive designs, and regulatory requirements. This tool uses simplified formulas and normal approximations. For real studies, use professional software (G*Power, PASS, nQuery) and consult with biostatisticians.
What assumptions does this calculator make?
This calculator assumes: (1) Normally distributed data or large samples; (2) Independent observations; (3) For two-sample tests: equal variances and equal sample sizes per group; (4) Known or well-estimated standard deviation; (5) Simple random sampling. Violations of these assumptions may affect the accuracy of power calculations.
How should I interpret the power curve?
The power curve shows how power changes as you vary sample size or effect size. For sample size curves: find where the curve crosses your target power (typically 80%) to determine required n. For effect size curves: see how sensitive your test is to detecting effects of different magnitudes. The curve also shows diminishing returns.
Related Tools
Z-Score & P-Value Calculator
Convert between z-scores and p-values for hypothesis testing
Normal Distribution Calculator
Calculate probabilities and quantiles under the normal curve
Confidence Interval Calculator
Build confidence intervals for means and proportions
T-Test Calculator
Run one-sample, two-sample, and paired t-tests
CI for Proportions
Compute confidence intervals for proportions using Wald and Wilson methods
Correlation Significance
Test whether a correlation coefficient is statistically significant