Compare Three or More Group Means With ANOVA

Perform one-way analysis of variance to compare means across multiple groups with F-test statistics.

Last Updated: February 13, 2026

Three or more groups, and you want to know whether at least one mean differs from the others. Running pairwise t-tests inflates the family-wise error rate fast: three groups means three comparisons and roughly a 14% chance of a spurious "significant" at α = 0.05 even when all means are truly equal. ANOVA wraps the comparison into one F-test against H₀: μ₁ = μ₂ = … = μₖ.

The F statistic is the ratio of between-group variance to within-group variance: F = MSB / MSW. Under the null and the standard assumptions (independence, within-group normality, equal variances), F follows an F-distribution with (k − 1, N − k) degrees of freedom and the p-value is the right tail. The page returns F, p, the full ANOVA table (SSB, SSW, SST, MSB, MSW), and η² as effect size. ANOVA flags that something differs but not which groups; for the where, run a post-hoc test. Tukey HSD when n's and variances roughly match, Games-Howell when variances differ.

Group Entry Rules (Unequal n Allowed)

If you're comparing three or more groups and want to know whether any of them differ on average, the one-way ANOVA calculator handles it. Unlike running multiple t-tests (which inflates your false positive rate), ANOVA tests all groups at once with a single F-test. But first, you need to enter your data correctly.

Each group needs at least two observations. You can have different sample sizes across groups. ANOVA handles unequal n, though equal sizes make the F-test more robust to unequal variances. Enter numeric values separated by commas or one per line. The calculator labels groups automatically, but you can rename them if that helps interpretation.

Common setup examples: three teaching methods with 25, 28, and 22 students each; four fertilizer treatments applied to different numbers of plots; five dosage levels in a clinical trial with varying enrollment. The groups must be independent—different subjects in each group, no crossover or repeated measures.

Entry checklist: At least 3 groups total. At least 2 values per group. Numeric data only. Independent groups (same subjects can't appear in multiple groups). Missing values should be removed beforehand.
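The checklist can be sketched as a small pre-check. This is an illustration only, not the calculator's own code: `parse_values` and `validate_groups` are hypothetical helper names, and the group labels and scores are made up.

```python
import math
import re

def parse_values(text):
    """Split on commas or newlines and convert each token to float.

    Non-numeric tokens such as "NA" raise ValueError, which enforces
    the "numeric data only" rule.
    """
    tokens = [t.strip() for t in re.split(r"[,\n]", text)]
    # Blank tokens (trailing commas, empty lines) are skipped, not read as zeros.
    return [float(t) for t in tokens if t]

def validate_groups(groups):
    """Return a list of problems; an empty list means the input passes."""
    problems = []
    if len(groups) < 3:
        problems.append("need at least 3 groups")
    for name, values in groups.items():
        if len(values) < 2:
            problems.append(f"group {name!r} needs at least 2 values")
        if any(not math.isfinite(v) for v in values):
            problems.append(f"group {name!r} contains missing or infinite values")
    return problems

# Made-up input: commas in two groups, one value per line in the other.
raw = {"Method A": "72, 85, 90", "Method B": "65\n70\n68\n74", "Method C": "88, 91"}
groups = {name: parse_values(text) for name, text in raw.items()}
print(validate_groups(groups))
```

Independence is a property of the study design, so no check on the numbers themselves can confirm it.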

ANOVA Table: SS, MS, F, and df

The ANOVA table breaks down total variation into two pieces: variation between groups (due to different group means) and variation within groups (individual scatter around each group's mean). If the between-group variation is large relative to the within-group noise, you have evidence that the groups differ.

Source         | SS   | df    | MS           | F
Between Groups | SS_B | k − 1 | SS_B / (k−1) | MS_B / MS_W
Within Groups  | SS_W | N − k | SS_W / (N−k) |
Total          | SS_T | N − 1 |              |

SS_Between: Σ n_i × (x̄_i − x̄_grand)²

SS_Within: ΣΣ (x_ij − x̄_i)²

SS_Total: SS_Between + SS_Within

k is the number of groups, N is the total sample size across all groups. MS (mean square) equals SS divided by its degrees of freedom. The F-statistic is the ratio of MS_Between to MS_Within. Under the null hypothesis (all group means equal), F should be around 1. Large F values suggest group differences.
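As a minimal sketch of the table arithmetic, the function below computes each entry from raw group data using the formulas above. The function name `anova_table` and the three small groups are made up for illustration, and it assumes the within-group variation is not exactly zero.

```python
def anova_table(groups):
    """One-way ANOVA sums of squares, df, mean squares, and F."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between: squared distance of each group mean from the grand mean, weighted by n_i.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within: scatter of each observation around its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within  # assumes ss_within > 0
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "ss_total": ss_between + ss_within,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "f": ms_between / ms_within,
    }

# Made-up data with unequal group sizes (n = 2, 3, 4).
table = anova_table([[2, 4], [5, 6, 7], [7, 8, 8, 7]])
print(table)
```

For this data the group means are 3, 6, and 7.5 around a grand mean of 6, giving SS_Between = 27, SS_Within = 5, SS_Total = 32, and F = 13.5 / (5/6) ≈ 16.2 on (2, 6) degrees of freedom.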

Post-Result Reading: What p-Value Means Here

The p-value tells you: if all groups truly had the same population mean, what's the probability of seeing an F-statistic at least as large as yours? A small p-value (typically below 0.05) indicates your data would be unusual under that assumption, leading you to reject the null hypothesis of equal means.
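For exactly three groups (numerator df = 2) the right tail of the F distribution has a closed form, so you can check a p-value by hand. The sketch below relies on that special case only. The function name is made up, and for other numerators you would normally use a statistics library (for example `scipy.stats.f.sf`) rather than this shortcut.

```python
def f_right_tail_df2(f_stat, df_within):
    """P(F >= f_stat) for an F distribution with (2, df_within) df.

    Valid ONLY when the numerator df is 2, i.e. for three groups.
    """
    if f_stat < 0:
        raise ValueError("F cannot be negative")
    return (1 + 2 * f_stat / df_within) ** (-df_within / 2)

# Three groups, 90 observations in total: df = (2, 87).
for f_stat in (1.0, 3.1, 7.42):
    print(f_stat, round(f_right_tail_df2(f_stat, 87), 4))
```

The tail probability shrinks as F grows: with these df, an F near 1 leaves a p-value well above 0.05, while an F above 7 puts it near 0.001.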

Critical point: A significant ANOVA only tells you that at least one group differs from the others. It doesn't tell you which groups. If you get p = 0.002 with four groups, you know something is going on, but you don't yet know whether Group A differs from B, C, or D specifically.

That's where post-hoc tests come in. Common choices include Tukey's HSD (controls family-wise error rate while comparing all pairs), Bonferroni correction (conservative but simple), and Scheffé's method (flexible for complex contrasts). This calculator provides the omnibus F-test; for post-hoc comparisons, you'd run follow-up analyses.

Common mistake: Running multiple t-tests after ANOVA without correction. If you compare 4 groups pairwise (6 comparisons) at α = 0.05 each, your overall false positive rate balloons to about 26%. Post-hoc tests exist precisely to control this.
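The inflation figure comes from 1 − (1 − α)^m, where m = k(k − 1)/2 is the number of pairwise comparisons. A minimal sketch follows, treating the comparisons as independent. That is an approximation, because pairwise tests share data, and the function name is made up.

```python
from math import comb

def familywise_error(k, alpha=0.05):
    """Number of pairwise comparisons among k groups and the chance of
    at least one false positive, assuming independent tests."""
    m = comb(k, 2)
    return m, 1 - (1 - alpha) ** m

for k in (3, 4, 5):
    m, fwer = familywise_error(k)
    print(f"{k} groups: {m} comparisons, family-wise error ≈ {fwer:.3f}")
```

Three groups give about 0.143 and four groups about 0.265, matching the 14% and 26% figures quoted on this page.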

Effect Size: Eta-Squared in Context

Eta-squared (η²) measures how much of the total variance in your outcome is explained by group membership. It's calculated as SS_Between divided by SS_Total. A value of 0.10 means 10% of the variation in scores is accounted for by which group subjects belong to.

η² = SS_Between / SS_Total

Cohen's benchmarks for η²: roughly 0.01 is small, 0.06 is medium, 0.14 is large. But these are rough guides. In tightly controlled lab experiments, even 5% explained variance might be substantial. In noisy real-world data, you might need 20%+ to care practically.

η² ≈ 0.01: small effect
η² ≈ 0.06: medium effect
η² ≥ 0.14: large effect

Note: η² is slightly biased upward as an estimate of the population effect. Some researchers prefer omega-squared (ω²), which adjusts for this bias. For one-way ANOVA with reasonable sample sizes, the difference is usually small.
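To see how far the two measures can drift apart, here is a sketch that computes both from the sums of squares. It uses the standard one-way form ω² = (SS_Between − (k − 1)·MS_Within) / (SS_Total + MS_Within). The function names, the hard cut-offs in `cohen_label`, and the "below small" label are illustrative choices, not part of Cohen's benchmarks.

```python
def eta_squared(ss_between, ss_total):
    return ss_between / ss_total

def omega_squared(ss_between, ss_total, k, n_total):
    """Bias-adjusted effect size for a one-way design."""
    ms_within = (ss_total - ss_between) / (n_total - k)
    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

def cohen_label(eta_sq):
    """Rough label using the 0.01 / 0.06 / 0.14 benchmarks as cut-offs."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "below small"

# Made-up example: 3 groups, 63 observations, SS_Between = 10, SS_Total = 100.
eta = eta_squared(10, 100)
omega = omega_squared(10, 100, k=3, n_total=63)
print(round(eta, 3), round(omega, 3), cohen_label(eta))
```

Here η² is 0.10 while ω² comes out near 0.069, so the adjustment is visible even at about 20 observations per group.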

Assumptions and Diagnostics to Check

ANOVA isn't assumption-free. Violating these conditions can inflate false positives, reduce power, or produce misleading F-statistics. Here's what to verify before trusting your results.

Independence

Observations must be independent within and across groups. If the same subject appears in multiple groups (repeated measures) or subjects within a group influence each other (clustering), standard one-way ANOVA is inappropriate. Use repeated measures ANOVA or mixed models instead.

Normality

Data within each group should be approximately normal. With larger samples (15–20+ per group), the Central Limit Theorem provides some protection. With smaller samples, check histograms or Q-Q plots. Severe skewness or heavy outliers warrant attention.

Homogeneity of Variance

Groups should have similar variances (homoscedasticity). Levene's test can check this formally. A rough rule: if the largest group variance is more than 4× the smallest, and sample sizes are unequal, results become unreliable. Welch's ANOVA doesn't assume equal variances.
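The 4× rule of thumb is easy to screen for before running the test. This is a quick sketch with made-up names and data. It is not Levene's test, and it assumes every group has a nonzero sample variance.

```python
from statistics import variance

def variance_ratio(groups):
    """Largest sample variance divided by the smallest."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)  # assumes no group has zero variance

def needs_welch(groups, threshold=4.0):
    """Flag the risky combination: variance ratio above the threshold AND unequal n."""
    unequal_n = len({len(g) for g in groups}) > 1
    return variance_ratio(groups) > threshold and unequal_n

groups = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0], [5.0, 6.0, 7.0]]
print(round(variance_ratio(groups), 2), needs_welch(groups))
```

The middle group's variance (about 6.67) is more than 4× the others (1.0) and the group sizes differ, so this data set would be flagged.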

What breaks this test: Repeated measures on the same subjects (needs repeated measures ANOVA), highly unequal variances with unequal n (use Welch's ANOVA), severe non-normality with small samples (consider Kruskal-Wallis), or dependent observations (needs multilevel models).

ANOVA Explained in Plain Words

Why not just run multiple t-tests?

Every test at α = 0.05 has a 5% chance of a false positive. With 4 groups, you'd run 6 pairwise t-tests. Even if no real differences exist, the probability of at least one false positive rises to about 26%. ANOVA tests all groups simultaneously with a single F-test, keeping the error rate at your chosen α.

What does a significant ANOVA actually tell me?

It tells you that at least one group's mean differs from the others. That's it. It doesn't identify which group, how many groups differ, or the direction of differences. For that, you need post-hoc tests (Tukey, Bonferroni, etc.) or planned contrasts.

What if my ANOVA isn't significant but I expected a difference?

Non-significance doesn't prove equality—it means you lack sufficient evidence of a difference. Possible reasons: true effect is small, sample sizes are too small to detect it (low power), or there really is no meaningful difference. Check your effect size. If η² is moderate but p is high, you may simply need more data.

Can I use ANOVA with only 2 groups?

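Technically yes: ANOVA with 2 groups gives F = t², where t is the equal-variance (pooled) t statistic, and both tests return the same two-sided p-value. But most people use a t-test for two groups because it's more intuitive and allows one-tailed tests. ANOVA makes more sense when you have 3+ groups.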
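A quick numeric check of F = t² on two made-up groups, using the pooled-variance t statistic. The function names are illustrative.

```python
from statistics import mean

def pooled_t(a, b):
    """Equal-variance (pooled) two-sample t statistic."""
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (ma - mb) / (pooled_var * (1 / len(a) + 1 / len(b))) ** 0.5

def f_two_groups(a, b):
    """One-way ANOVA F for exactly two groups (numerator df = 1)."""
    ma, mb = mean(a), mean(b)
    grand = (sum(a) + sum(b)) / (len(a) + len(b))
    ms_between = len(a) * (ma - grand) ** 2 + len(b) * (mb - grand) ** 2
    ss_within = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    return ms_between / (ss_within / (len(a) + len(b) - 2))

a, b = [1.0, 2.0, 3.0], [3.0, 4.0, 5.0]
t = pooled_t(a, b)
f = f_two_groups(a, b)
print(round(t ** 2, 6), round(f, 6))
```

Both printed values are 6.0 for this data. The identity does not hold for Welch's unequal-variance t-test.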

How do I report ANOVA results?

Include the F-statistic, degrees of freedom (both between and within), p-value, and effect size. Example: "A one-way ANOVA revealed significant differences among the three training methods, F(2, 87) = 6.42, p = 0.003, η² = 0.13. Post-hoc Tukey tests showed Method A outperformed Method C (p = 0.001)."

What's the difference between one-way and two-way ANOVA?

One-way ANOVA has one factor (grouping variable). Two-way ANOVA has two factors and can test for interactions between them. If you're comparing teaching methods across different grade levels, that's two factors—method and grade—and requires two-way ANOVA.

Limitations of one-way ANOVA

Independence: required across all observations and groups. The most common silent violation is treating observations from the same subject as independent. If your design has nesting, repeated measures, or clustering, this calculator won't give you the right p. Use a mixed-effects model (R's lme4 or statsmodels.regression.mixed_linear_model in Python).

Omnibus only: a significant F means at least one group differs, not which one. Run Tukey HSD when n's and variances are roughly equal, Games-Howell when variances differ.

Equal variances and within-group normality: matter at small n per group (below 15-20). Welch's ANOVA handles heteroscedasticity. Kruskal-Wallis is the rank-based fallback for non-normal data.

Note: For one-way independent designs this calculator is fine. For anything more structured (factorial, blocked, hierarchical), R's aov() and car::Anova() or statsmodels' AnovaRM and mixedlm are the standard tools. The ASA Statement on p-values (2016) applies here just as it does anywhere else.

ANOVA in practice: working questions

Why ANOVA instead of running pairwise t-tests?

Family-wise error inflation. Three groups means three pairwise comparisons, four groups means six. At α = 0.05 per test, the probability of at least one false positive across three comparisons is roughly 14%. ANOVA wraps the omnibus comparison into one F-test that holds the overall α at 0.05. Once F is significant, you run a post-hoc procedure that controls the family-wise error: Tukey HSD, Bonferroni, Scheffé, or Holm-Bonferroni. The post-hoc step is what tells you which groups differ; ANOVA only flags that something does.

How do I choose a post-hoc test after a significant F?

Tukey HSD when sample sizes and variances are roughly equal. It's the standard choice and controls family-wise error tightly. Games-Howell when variances differ across groups; it's the heteroscedastic counterpart of Tukey. Bonferroni when you have a small number of pre-specified comparisons (it's conservative for many comparisons). Holm-Bonferroni is uniformly more powerful than plain Bonferroni and is a free upgrade. Dunnett when comparing several groups against one control. Don't run pairwise t-tests with no correction after ANOVA.
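To make the Bonferroni versus Holm comparison concrete, here is a sketch that adjusts a list of raw pairwise p-values both ways. The three p-values are made up. This does not compute Tukey HSD or Games-Howell, which need the studentized range distribution and are better left to a statistics package.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p gets multiplier m, the next m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Taking the running maximum keeps adjusted p-values monotone.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

raw = [0.01, 0.04, 0.03]  # hypothetical p-values for A-B, A-C, B-C
print(bonferroni(raw))
print(holm(raw))
```

Bonferroni turns these into roughly 0.03, 0.12, 0.09, and Holm into roughly 0.03, 0.06, 0.06. Holm's adjusted values are never larger than Bonferroni's, which is why it costs nothing to prefer it.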

What does eta-squared (η²) tell me?

η² = SSB / SST, the fraction of total variance explained by group membership. Cohen's benchmarks (1988): 0.01 small, 0.06 medium, 0.14 large. η² is upward-biased at small samples. Partial η² does not fix this (in a one-way design it is identical to η²); ω² (omega-squared) and ε² (epsilon-squared) are the bias-corrected alternatives. ω² = (SSB − (k − 1)·MSW) / (SST + MSW) is closer to the population effect size and can be slightly negative when the effect is genuinely zero. Report an effect size alongside the F statistic and p-value, especially with large samples, where significance can outpace meaningful effect.

What if my groups have unequal variances?

Use Welch's ANOVA, which doesn't assume homogeneity of variance. R: oneway.test(y ~ group) defaults to Welch's. SPSS exposes it as a checkbox. Standard ANOVA tolerates moderate variance differences when group sizes are roughly equal, but with unequal variances and unequal n, the Type I error rate can be substantially off: inflated when the smaller groups have the larger variances, conservative when the larger groups have the larger variances. Levene's test or Brown-Forsythe screens for homogeneity, but defaulting to Welch's avoids the screening step entirely.
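For readers without R or SPSS, the Welch statistic can be sketched with the standard library alone. This follows the usual textbook form of Welch's test (weights n_i / s_i², adjusted denominator df) and is a sketch, not a replacement for a tested implementation. The function name and data are made up.

```python
from statistics import mean, variance

def welch_anova(groups):
    """Welch's F statistic with its numerator and denominator df."""
    k = len(groups)
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    # Precision weights: groups with small variance or large n count more.
    w = [ni / variance(g) for ni, g in zip(n, groups)]
    w_total = sum(w)
    weighted_mean = sum(wi * mi for wi, mi in zip(w, m)) / w_total

    numerator = sum(wi * (mi - weighted_mean) ** 2 for wi, mi in zip(w, m)) / (k - 1)
    lam = sum((1 - wi / w_total) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df_num = k - 1
    df_den = (k ** 2 - 1) / (3 * lam)  # usually not an integer
    return f_stat, df_num, df_den

groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [8.0, 9.0, 10.0]]
f_stat, df_num, df_den = welch_anova(groups)
print(round(f_stat, 3), df_num, round(df_den, 3))
```

On this balanced, equal-variance data the classical test gives F = 12 on (2, 6) df, while Welch's version gives F ≈ 10.29 on (2, 4) df: a small price in power when the assumption holds, in exchange for protection when it doesn't.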

Repeated measures vs one-way ANOVA, when does each apply?

One-way assumes independent observations across groups: each subject contributes to exactly one group. Repeated measures applies when the same subjects are measured under all conditions, so observations are paired across treatments. Treating repeated-measures data as if it were one-way independent inflates the variance estimate and robs the test of power. R: aov(y ~ treatment + Error(subject/treatment)) for repeated measures, or lme4::lmer(y ~ treatment + (1 | subject)) for the mixed-effects equivalent. SPSS has a separate repeated-measures ANOVA module.

How do I read the F statistic and df?

F = MSB / MSW, the ratio of between-group variance to within-group variance. Numerator df = k − 1 where k is the number of groups. Denominator df = N − k. F is reported as F(df_num, df_den) = X.XX. Under H₀ (all means equal), F follows the F distribution with those df, and the p-value is the right-tail. F < 1 means within-group variance exceeds between-group variance, which can never be significant at conventional α levels; F much greater than 1 is what produces small p-values.

How do I report a one-way ANOVA in a results section?

Standard format: "F(df_num, df_den) = X.XX, p = .XXX, η² = .XX." Example: "A one-way ANOVA showed a significant effect of treatment on yield, F(2, 87) = 7.42, p = .001, η² = .15. Tukey HSD post-hoc tests indicated that treatment B differed significantly from controls (p = .002, mean difference = 4.3, 95% CI [1.7, 6.9])." Include the omnibus F, the post-hoc results that drove the conclusion, and effect sizes (η² for the omnibus, mean differences with CIs for the pairwise comparisons).
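If you produce many of these lines, a small formatter keeps the style consistent. The sketch below is one way to do it. The function names are made up, the "p < .001" floor is a common reporting convention rather than something this page prescribes, and the unrounded inputs are illustrative values chosen to round to the example above.

```python
def fmt_stat(value, digits):
    """Format a statistic that cannot exceed 1 without its leading zero."""
    return f"{value:.{digits}f}".lstrip("0")

def report_anova(f_stat, df_num, df_den, p, eta_sq):
    """Build the omnibus part of a results sentence."""
    p_text = "p < .001" if p < 0.001 else f"p = {fmt_stat(p, 3)}"
    return f"F({df_num}, {df_den}) = {f_stat:.2f}, {p_text}, η² = {fmt_stat(eta_sq, 2)}"

print(report_anova(7.42, 2, 87, 0.00106, 0.146))
print(report_anova(15.3, 3, 40, 0.0000012, 0.534))
```

Post-hoc results, mean differences, and confidence intervals still need to be written out by hand next to the omnibus line.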

What if my data are non-normal or have outliers?

ANOVA is fairly robust to non-normality at moderate group sizes (n ≥ 15-20 per group), thanks to the CLT. Severe skew or outliers at small n can distort the F-test. Kruskal-Wallis is the rank-based non-parametric alternative; Dunn's test is its post-hoc counterpart. Permutation ANOVA (R's coin::oneway_test) drops the normality assumption, but it still relies on exchangeability under H₀, so it is not a remedy for unequal variances. Don't quietly trim outliers; document any data exclusions, justify them, and ideally pre-register the rule.
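A minimal sketch of the Kruskal-Wallis H statistic, to show what "rank-based" means in practice. Ties get average ranks, but the tie correction factor is omitted, so treat the result as approximate when many values repeat. The function names and data are made up.

```python
from math import exp

def midranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = average
        start = end + 1
    return ranks

def kruskal_wallis_h(groups):
    """H statistic from pooled ranks (no tie correction)."""
    pooled = [x for g in groups for x in g]
    ranks = midranks(pooled)
    n_total = len(pooled)
    total, offset = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[offset:offset + len(g)])
        total += rank_sum ** 2 / len(g)
        offset += len(g)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# With 3 groups, H is referred to chi-square with 2 df, whose right tail is exp(-H/2).
p_approx = exp(-h / 2)
print(round(h, 3), round(p_approx, 4))
```

This perfectly separated toy data gives H = 7.2 and an approximate p of about 0.027. The chi-square approximation is rough at three observations per group, so exact tables or a library routine are the safer choice for small samples.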