Chi-Square Tests With Expected Counts and Residuals
Perform chi-square tests for independence and goodness of fit with detailed statistical analysis.
Chi-square is for counts, not measurements. You have categorical data (frequencies in cells) and want to test whether the observed pattern matches a hypothesized distribution (goodness-of-fit) or whether two categorical variables are independent of each other (test of independence). Both reduce to χ² = Σ (O − E)² / E with degrees of freedom that depend on which test you ran.
Goodness-of-fit takes one variable plus a hypothesized distribution; df = k − 1 where k is the number of categories. Test of independence takes a contingency table; df = (r − 1)(c − 1), with expected counts E_ij = (row total · column total) / grand total. The p-value is the right tail of χ² at that df. One assumption that matters in practice: small expected counts make the χ² approximation unreliable. Cochran's rule (no more than 20% of cells with E < 5, none with E < 1) is the usual cutoff. When a table fails it, switch to Fisher's exact test for 2×2 tables or pool sparse categories.
Choose Test: Goodness-of-Fit vs Independence
If you're working with categorical data—counts in categories rather than continuous measurements—the chi-square test calculator determines whether observed frequencies match what you'd expect. But there are two distinct tests, and picking the wrong one gives you nonsense results.
Goodness-of-fit test: Use this when you have a single categorical variable and want to check if your observed distribution matches some expected distribution. Does a die roll fairly (each face 1/6)? Do survey responses match the company's claimed demographic proportions? You provide observed counts and specify what you expect.
Test of independence: Use this when you have two categorical variables and want to know if they're related. Is political party preference associated with education level? Does treatment type relate to outcome category? You build a contingency table (cross-tabulation) and test whether the row and column variables are independent or associated.
Decision rule: One variable with expected proportions = goodness-of-fit. Two variables in a cross-tabulation = independence. Both involve comparing observed to expected frequencies, but the expected values come from different places.
Expected Counts and Degrees of Freedom
Expected counts represent what you'd see if the null hypothesis were true. How they're calculated depends on which test you're running.
Goodness-of-fit:
E_i = n × p_i (expected count for category i)
df = k − 1 (where k = number of categories)
Test of independence:
E_ij = (Row_i total × Column_j total) / Grand total
df = (r − 1) × (c − 1) (where r = rows, c = columns)
For goodness-of-fit, you supply the expected proportions (e.g., 1/6 for each die face) and multiply by sample size. For independence, expected values are calculated from the marginal totals—what you'd predict for each cell if the variables were unrelated.
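As a rough illustration of the two calculations above, here is a minimal Python sketch that computes expected counts both ways. The function names and the sample numbers (60 die rolls and a 2×3 table) are made up for this example and are not part of the calculator.

```python
def gof_expected(n, proportions):
    """Goodness-of-fit: E_i = n * p_i for each category."""
    return [n * p for p in proportions]


def independence_expected(table):
    """Independence: E_ij = row_i total * column_j total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical inputs: a fair die rolled 60 times, and a 2x3 table of counts.
die_expected = gof_expected(60, [1 / 6] * 6)
table = [[20, 30, 50], [30, 20, 50]]
table_expected = independence_expected(table)

print(die_expected)
print(table_expected)
```

Note that the goodness-of-fit expectations come entirely from the hypothesis, while the independence expectations are built from the observed margins.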
Degrees of freedom work differently too. In goodness-of-fit, once you know k−1 category counts, the last one is fixed (they must sum to n). In independence, with r rows and c columns, (r−1)×(c−1) cells can vary freely once you fix the margins.
Watch out: Degrees of freedom directly affect your p-value. Using the wrong df can push a non-significant result to significant or vice versa. Double-check your category and table dimensions.
χ² Statistic and p-Value Interpretation
The chi-square statistic measures overall discrepancy between observed and expected frequencies. Each cell or category contributes (O − E)² / E to the total. Squaring makes every contribution non-negative; dividing by E scales each squared gap by the expected count, so the same absolute gap counts for more in a cell where little was expected.
χ² = Σ (O_i − E_i)² / E_i
Larger χ² values mean bigger mismatches between what you observed and what you expected under the null. The p-value tells you: if the null were true, what's the probability of seeing a χ² at least this large? Chi-square tests are always one-tailed (right-tailed), because only large χ² values indicate deviation—small values just mean good fit.
If p < 0.05 (or your chosen α), you reject the null. For goodness-of-fit, that means the observed distribution differs from expected. For independence, it means the two variables are associated—knowing one tells you something about the other.
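The statistic and its right-tail p-value can be sketched with the Python standard library alone. The helper names `chi_square_statistic` and `chi2_sf` are invented for this illustration, the tail function only handles whole-number degrees of freedom, and the observed and expected counts are hypothetical. In practice a statistics library would normally supply the tail probability.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells or categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf(x, df):
    """Right-tail probability P(X >= x) for chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        k, tail = 2, math.exp(-half)             # closed form for df = 2
    else:
        k, tail = 1, math.erfc(math.sqrt(half))  # closed form for df = 1
    # Step up two degrees of freedom at a time:
    # Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        tail += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return tail


# Hypothetical 2x3 table flattened row by row, with its expected counts.
observed = [20, 30, 50, 30, 20, 50]
expected = [25, 25, 50, 25, 25, 50]
df = (2 - 1) * (3 - 1)

stat = chi_square_statistic(observed, expected)
p_value = chi2_sf(stat, df)
reject = p_value < 0.05  # alpha = 0.05 is a choice, not a rule

print(stat, df, p_value, reject)
```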
Interpreting association: A significant independence test means the variables are related, but it doesn't measure how strongly. For that, look at effect size measures like Cramér's V or phi coefficient (for 2×2 tables).
Cell Residuals: Where the Mismatch Lives
A significant chi-square result tells you something deviates from expectation, but not where. Standardized residuals pinpoint which cells or categories contribute most to the discrepancy.
Standardized residual = (O − E) / √E
Under the null, these residuals are approximately normal with mean 0 and a variance somewhat below 1, so they can be read roughly like z-scores. Values beyond ±2 are notable; beyond ±3, strong. Positive residuals mean observed exceeds expected; negative means observed falls short.
Residual > +2
More observed than expected. This cell or category is overrepresented.
Residual < −2
Fewer observed than expected. This cell or category is underrepresented.
Example: In a survey, you find χ² is significant. Residuals show that young voters are overrepresented for Party A (+2.8) and underrepresented for Party C (−2.4). Now you know where the association lies, not just that one exists.
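A small sketch of the residual calculation for a contingency table follows, assuming a hypothetical 2×2 table. It returns the (O − E) / √E residual described above together with the adjusted version, which also divides by √((1 − row share)(1 − column share)). The function name `cell_residuals` is invented for this example.

```python
import math


def cell_residuals(table):
    """Return ((O - E) / sqrt(E) residuals, adjusted residuals) for each cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    plain, adjusted = [], []
    for i, row in enumerate(table):
        plain_row, adjusted_row = [], []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            z = (o - e) / math.sqrt(e)
            # Adjusted residual: rescale by the row and column shares.
            scale = math.sqrt((1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            plain_row.append(z)
            adjusted_row.append(z / scale)
        plain.append(plain_row)
        adjusted.append(adjusted_row)
    return plain, adjusted


# Hypothetical counts, not taken from any real survey.
table = [[30, 10], [20, 40]]
plain, adjusted = cell_residuals(table)

for row in plain:
    print([round(z, 2) for z in row])
for row in adjusted:
    print([round(z, 2) for z in row])
```

In a 2×2 table every adjusted residual has the same magnitude, which equals the square root of the uncorrected χ² statistic, so residuals are most informative for larger tables.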
Assumption Warnings (Small Expected Cells)
The chi-square test relies on the chi-square distribution as an approximation. That approximation breaks down when expected counts are too small.
Expected Frequency Rule
The traditional guideline: all expected values should be at least 5. A more lenient version allows up to 20% of cells below 5, but none below 1. When violated, p-values become unreliable.
Independence of Observations
Each observation must be counted in exactly one cell. The same subject can't appear multiple times. If you have paired data (same subjects measured twice), use McNemar's test, not chi-square.
Count Data Only
Enter raw counts (frequencies), not percentages, proportions, or averages. If you have 30% and 70%, convert them to actual counts given your sample size.
Workarounds for small expected values: (1) Combine adjacent categories to increase expected counts—but only if it makes conceptual sense. (2) For 2×2 tables, use Fisher's exact test instead. (3) For larger tables, simulation-based methods (Monte Carlo p-values) bypass the chi-square approximation entirely.
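Before choosing a workaround, it helps to check the expected counts against the lenient guideline (at most 20% of cells below 5, none below 1). The sketch below is one way to do that; `cochran_check` and both sample tables of expected counts are hypothetical.

```python
def cochran_check(expected):
    """Check a table of expected counts against the 20% / minimum-1 guideline."""
    cells = [e for row in expected for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return {
        "min_expected": min(cells),
        "share_below_5": share_below_5,
        "ok": share_below_5 <= 0.20 and min(cells) >= 1,
    }


# Hypothetical expected counts: one sparse table, one comfortable table.
sparse = cochran_check([[12.0, 8.0], [3.0, 2.0]])
comfortable = cochran_check([[25.0, 25.0, 50.0], [25.0, 25.0, 50.0]])

print(sparse)
print(comfortable)
```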
Reading χ² output, in practice
What's the difference between chi-square and t-test?
T-tests compare means of continuous variables. Chi-square tests compare frequencies of categorical variables. If you're asking "are the averages different?" use a t-test. If you're asking "are the proportions or distributions different?" use chi-square.
My expected counts are below 5. Can I still run the test?
The chi-square approximation may be inaccurate. Options: combine categories if sensible, use Fisher's exact test for 2×2 tables, or use Monte Carlo simulation for larger tables. Don't just ignore the warning—your p-value could be misleading.
How do I measure effect size for chi-square?
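For 2×2 tables, use phi (φ = √(χ²/n)). For larger tables, use Cramér's V (V = √(χ²/(n × min(r−1, c−1)))). Both range from 0 to 1, where 0 means no association and values closer to 1 mean stronger association. Cohen's benchmarks (0.1 small, 0.3 medium, 0.5 large) apply to φ and to V when min(r−1, c−1) = 1; for larger tables the matching V thresholds are lower, because Cohen's w equals V × √min(r−1, c−1).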
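The effect size formula is a one-liner in code. The sketch below uses the figures from this page's smoking-status reporting example (χ² = 14.7, N = 412, a 2×2 table); the function name `cramers_v` is invented for the illustration.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V; for a 2x2 table this equals phi."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


v = cramers_v(14.7, 412, 2, 2)
print(round(v, 2))
```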
Does a significant chi-square prove causation?
No. It shows association—the variables are related—but not which one causes the other, or whether both are caused by something else. Establishing causation requires experimental design, temporal ordering, and ruling out confounders.
How do I report chi-square results?
Include the chi-square value, degrees of freedom, p-value, sample size, and effect size. Example: "A chi-square test of independence revealed a significant association between education level and voting preference, χ²(4) = 18.3, p = 0.001, Cramér's V = 0.27."
Can I use chi-square for more than two categories or variables?
Yes. Goodness-of-fit handles any number of categories. Independence tests handle r × c tables with any number of rows and columns. Degrees of freedom adjust accordingly. Just watch for sparse cells with small expected counts in large tables.
Limitations of the χ² approximation
Small expected counts: the χ² approximation breaks when expected counts are small. Cochran's rule of thumb: no more than 20% of cells with E < 5, and none with E < 1. Below that, use Fisher's exact for 2×2 tables or Monte Carlo simulation for larger ones.
Counts, not percentages: the test wants raw frequencies, not rates or percentages. Converting back from a percentage requires the total n, and any rounding in the reported percentages carries into the reconstructed counts.
One cell per subject: each subject must contribute to exactly one cell. Paired or repeated categorical data needs McNemar's test (for 2×2 paired) or Cochran's Q (for matched k×2).
Association ≠ causation: a significant χ² says variables are associated. Confounders explain plenty of significant results. Causal claims need a different framework.
Note: one pitfall worth flagging is adding observations after a non-significant result and re-running the test as a "follow-up analysis." That is optional stopping in disguise. For small expected counts, R's chisq.test() with simulate.p.value = TRUE computes a Monte Carlo p-value instead of relying on the χ² approximation. scipy.stats.chi2_contingency is the SciPy counterpart of chisq.test() for contingency tables.
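For readers working in Python, a minimal call to scipy.stats.chi2_contingency might look like the sketch below. It assumes SciPy is installed, and the 2×2 counts are hypothetical. Passing correction=False turns off Yates' continuity correction, which SciPy otherwise applies to 2×2 tables, so the statistic matches the plain Σ (O − E)² / E formula.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts (rows: group, columns: outcome).
observed = [[30, 10], [20, 40]]

# Returns the statistic, p-value, degrees of freedom, and expected counts.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

print(chi2, p, dof)
print(expected)
```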
Sources
- NIST/SEMATECH e-Handbook: Chi-Square Test
- Penn State STAT 500: Chi-Square Tests Module
- Cohen, J. (1988): Statistical Power Analysis for the Behavioral Sciences (Cramér's V benchmarks)
Chi-square testing: working questions
How do I compute expected counts for a contingency table?
For a test of independence with rows i and columns j, the expected count in cell (i, j) is E_ij = (row total · column total) / grand total. The expected counts represent what you'd see if the row and column variables were truly independent. Then χ² = ΣΣ (O_ij − E_ij)² / E_ij summed across all cells, with df = (r − 1)(c − 1). For a 2×2 contingency table that gives df = 1. R's chisq.test() returns the expected counts in the result object as $expected.
What if my expected counts are below 5?
The χ² approximation gets unreliable. Cochran's rule of thumb (1954): no more than 20% of cells should have expected counts under 5, and no cell should have expected count under 1. Below that threshold, switch to Fisher's exact test for 2×2 tables, or to a Monte Carlo simulation for larger tables. R: chisq.test(table, simulate.p.value = TRUE) does the simulation cleanly. fisher.test(table) handles the exact case. For very small samples, exact methods are not just preferred, they're required if you want a defensible p.
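To make the simulation idea concrete, here is a sketch of a Monte Carlo p-value for a contingency table using only the Python standard library. It shuffles column labels across observations, which keeps both margins fixed while breaking any association. This is an illustration of the general approach rather than a reimplementation of R's routine, and the function names, the seed, and the two small tables are all hypothetical.

```python
import random


def chi2_stat(table):
    """Pearson chi-square statistic for a table with positive margins."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum(
        (o - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i, row in enumerate(table)
        for j, o in enumerate(row)
    )


def monte_carlo_p(table, n_sim=2000, seed=0):
    """Share of shuffled tables whose statistic is at least the observed one."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    # Expand the table into one (row label, column label) pair per observation.
    row_labels = [i for i, row in enumerate(table) for o in row for _ in range(o)]
    col_labels = [j for row in table for j, o in enumerate(row) for _ in range(o)]
    observed_stat = chi2_stat(table)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(col_labels)  # margins stay fixed, association is destroyed
        sim = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(row_labels, col_labels):
            sim[i][j] += 1
        if chi2_stat(sim) >= observed_stat - 1e-9:
            hits += 1
    # Count the observed table itself so the estimate is never exactly zero.
    return (hits + 1) / (n_sim + 1)


p_weak = monte_carlo_p([[3, 1], [1, 3]])      # hypothetical sparse table
p_strong = monte_carlo_p([[10, 0], [0, 10]])  # hypothetical perfect split

print(p_weak, p_strong)
```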
Goodness-of-fit vs test of independence, when do I use each?
Goodness-of-fit takes one categorical variable and tests against a hypothesized distribution. Example: rolling a die 600 times to test if it's fair (expected counts 100 per face). df = k − 1 where k is the number of categories. Test of independence takes a contingency table with two categorical variables and asks whether they're related. Example: smoking status by lung-cancer outcome. df = (r − 1)(c − 1). Both compute χ² = Σ (O − E)² / E, but the way E is calculated differs. Goodness-of-fit takes E from your hypothesis. Independence computes E from the marginal totals.
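The die example can be worked through in a few lines. The observed counts below are invented for illustration; the critical value 11.0705 is the upper 5% point of χ² with 5 degrees of freedom.

```python
# Hypothetical face counts from 600 rolls of a die.
observed = [95, 108, 92, 110, 97, 98]
n = sum(observed)
expected = [n / 6] * 6            # fair die: 100 per face

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1            # k - 1
critical_value = 11.0705          # chi-square, df = 5, alpha = 0.05
reject = stat > critical_value

print(stat, df, reject)
```

With these made-up counts the statistic falls well below the critical value, so the fair-die hypothesis is not rejected.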
Chi-square vs Fisher's exact test, which?
Chi-square uses the χ² approximation, which assumes large enough expected counts. Fisher's exact computes the exact probability under the null using the hypergeometric distribution and works for any sample size. For 2×2 tables with small expected counts, Fisher's exact is the standard choice (fisher.test() in R). For larger tables, exact computation gets expensive but Monte Carlo simulation handles it. The cost of using Fisher's instead of χ² when both apply is mild conservatism (slightly larger p), so for borderline cases, use Fisher's and report it.
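The hypergeometric calculation behind Fisher's exact test is short enough to sketch directly for a 2×2 table. The version below computes a two-sided p-value by summing the probabilities of all tables that are no more likely than the observed one, a common definition of "two-sided" for this test. The function name and the sample tables are hypothetical.

```python
from math import comb


def fisher_exact_2x2(table):
    """Two-sided Fisher exact p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    total_ways = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(r1, x) * comb(r2, c1 - x) / total_ways

    lowest, highest = max(0, c1 - r2), min(r1, c1)
    p_observed = prob(a)
    # Small relative tolerance so ties are not lost to floating-point noise.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )


p_small = fisher_exact_2x2([[3, 1], [1, 3]])      # hypothetical sparse table
p_split = fisher_exact_2x2([[10, 0], [0, 10]])    # hypothetical perfect split

print(p_small, p_split)
```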
How do I read standardized residuals to find which cells drive the result?
Standardized residual: z_ij = (O_ij − E_ij) / √E_ij (R's chisq.test() returns these as $residuals and calls them Pearson residuals). Adjusted residuals additionally divide by √((1 − row proportion)(1 − column proportion)); R returns these as $stdres. |z| above 2 flags a cell as contributing meaningfully to the χ² total; |z| above 3 is strong. The sign tells direction: positive means observed exceeded expected, negative means it fell short. After a significant omnibus test, the residuals tell you which cells drove the rejection. Report the largest absolute residuals alongside the test statistic and p.
Why is df = (r − 1)(c − 1) for a contingency table?
Once you fix the row and column totals (the marginals), most cells are determined. In an r × c table, you can freely choose values in (r − 1)(c − 1) cells. The remaining cells are forced by the marginals. That number of freely varying cells is the degrees of freedom. For 2×2 the formula gives df = 1, for 3×4 it gives df = 6. Yates' continuity correction adjusts the test statistic for 2×2 tables specifically; the degrees of freedom stay at 1.
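The same number can be reached by counting parameters instead of cells, which may be a useful cross-check. The short derivation below is a standard argument, written out here as a sketch: an unrestricted table has rc − 1 free cell probabilities, and the independence model estimates r − 1 row proportions and c − 1 column proportions.

```latex
\begin{aligned}
\text{free cell probabilities, unrestricted} &= rc - 1 \\
\text{free parameters under independence} &= (r - 1) + (c - 1) \\
df &= (rc - 1) - (r - 1) - (c - 1) \\
   &= rc - r - c + 1 \\
   &= (r - 1)(c - 1)
\end{aligned}
```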
How do I report a chi-square test in a results section?
Standard format: "χ²(df, N = total) = X.XX, p = .XXX, V = X.XX." Cramér's V is the effect size analog for chi-square, ranging [0, 1] with 0.1 small, 0.3 medium, 0.5 large (Cohen 1988). Example: "A chi-square test of independence between smoking status and outcome was significant, χ²(1, N = 412) = 14.7, p < .001, V = 0.19." Report the largest standardized residuals if the table has more than two cells contributing notably. APA 7 expects effect size, which is why Cramér's V matters.
Can I use chi-square on paired or repeated categorical data?
No. Chi-square assumes independent observations across cells. For paired binary data (the same subject categorized before and after), use McNemar's test. The test statistic is (b − c)² / (b + c) where b and c are the discordant cell counts. For more than two matched binary measurements on the same subjects, Cochran's Q generalizes McNemar. R has mcnemar.test() and DescTools::CochranQTest. Running standard chi-square on paired data treats correlated observations as independent, so its variance estimate is wrong and the resulting p-value cannot be trusted.
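McNemar's statistic is simple enough to compute by hand, as the sketch below shows. It uses the uncorrected formula given above and the fact that a χ² right tail with df = 1 can be written with the complementary error function. The discordant counts are hypothetical. R's mcnemar.test() applies a continuity correction by default, so its output will differ slightly from this sketch unless correct = FALSE is set.

```python
import math


def mcnemar(b, c):
    """Uncorrected McNemar statistic and its chi-square (df = 1) p-value."""
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # right tail of chi-square, df = 1
    return stat, p_value


# Hypothetical discordant pairs: 15 changed one way, 5 the other.
stat, p_value = mcnemar(15, 5)
print(stat, p_value)
```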