Chi-Square Tests With Expected Counts and Residuals
Perform chi-square tests for independence and goodness of fit with detailed statistical analysis.
Choose Test: Goodness-of-Fit vs Independence
If you're working with categorical data—counts in categories rather than continuous measurements—the chi-square test calculator determines whether observed frequencies match what you'd expect. But there are two distinct tests, and picking the wrong one gives you nonsense results.
Goodness-of-fit test: Use this when you have a single categorical variable and want to check if your observed distribution matches some expected distribution. Does a die roll fairly (each face 1/6)? Do survey responses match the company's claimed demographic proportions? You provide observed counts and specify what you expect.
Test of independence: Use this when you have two categorical variables and want to know if they're related. Is political party preference associated with education level? Does treatment type relate to outcome category? You build a contingency table (cross-tabulation) and test whether the row and column variables are independent or associated.
Decision rule: One variable with expected proportions = goodness-of-fit. Two variables in a cross-tabulation = independence. Both involve comparing observed to expected frequencies, but the expected values come from different places.
Expected Counts and Degrees of Freedom
Expected counts represent what you'd see if the null hypothesis were true. How they're calculated depends on which test you're running.
Goodness-of-fit:
E_i = n × p_i (expected count for category i)
df = k − 1 (where k = number of categories)
Test of independence:
E_ij = (Row_i total × Column_j total) / Grand total
df = (r − 1) × (c − 1) (where r = rows, c = columns)
For goodness-of-fit, you supply the expected proportions (e.g., 1/6 for each die face) and multiply by sample size. For independence, expected values are calculated from the marginal totals—what you'd predict for each cell if the variables were unrelated.
Degrees of freedom work differently too. In goodness-of-fit, once you know k−1 category counts, the last one is fixed (they must sum to n). In independence, with r rows and c columns, (r−1)×(c−1) cells can vary freely once you fix the margins.
Watch out: Degrees of freedom directly affect your p-value. Using the wrong df can push a non-significant result to significant or vice versa. Double-check your category and table dimensions.
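The formulas above can be sketched in a few lines of Python. The 2×3 table below is invented for illustration; expected counts come from the marginal totals and degrees of freedom from the table dimensions.

```python
# Expected counts and degrees of freedom for a test of independence.
# The observed table is a made-up 2x3 example.
observed = [
    [20, 30, 50],
    [30, 20, 50],
]

row_totals = [sum(row) for row in observed]        # [100, 100]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 50, 100]
grand_total = sum(row_totals)                      # 200

# E_ij = (Row_i total x Column_j total) / Grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# df = (r - 1) x (c - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)
```

For this table, each row of expected counts works out to [25, 25, 50] and df = 2.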
χ² Statistic and p-Value Interpretation
The chi-square statistic measures overall discrepancy between observed and expected frequencies. Each cell or category contributes (O − E)² / E to the total. Squaring ensures all contributions are positive; dividing by E scales each deviation relative to its expected count, so the same absolute difference counts for more in a cell with a small expectation.
χ² = Σ (O_i − E_i)² / E_i
Larger χ² values mean bigger mismatches between what you observed and what you expected under the null. The p-value tells you: if the null were true, what's the probability of seeing a χ² at least this large? Chi-square tests are always one-tailed (right-tailed), because only large χ² values indicate deviation—small values just mean good fit.
If p < 0.05 (or your chosen α), you reject the null. For goodness-of-fit, that means the observed distribution differs from expected. For independence, it means the two variables are associated—knowing one tells you something about the other.
Interpreting association: A significant independence test means the variables are related, but it doesn't measure how strongly. For that, look at effect size measures like Cramér's V or phi coefficient (for 2×2 tables).
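As a quick check of the formula, `scipy.stats.chisquare` runs a right-tailed goodness-of-fit test in one call; the three observed counts below are an assumed example, tested against equal expected proportions.

```python
from scipy.stats import chisquare

# Hypothetical counts in three categories; null hypothesis: equal proportions.
observed = [20, 30, 10]              # n = 60, so 20 expected per category

stat, p_value = chisquare(observed)  # f_exp defaults to a uniform split

# chi^2 = (0 + 100 + 100) / 20 = 10.0 with df = k - 1 = 2
# The p-value is right-tailed: P(chi^2 with 2 df >= 10.0)
```

Here χ² = 10.0 and p ≈ 0.0067, so at α = 0.05 you would reject the hypothesis of equal proportions.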
Cell Residuals: Where the Mismatch Lives
A significant chi-square result tells you something deviates from expectation, but not where. Standardized residuals pinpoint which cells or categories contribute most to the discrepancy.
Standardized residual = (O − E) / √E
Residuals follow approximately a standard normal distribution under the null. Values beyond ±2 are notable; beyond ±3, strong. Positive residuals mean observed exceeds expected; negative means observed falls short.
Residual > +2
More observed than expected. This cell or category is overrepresented.
Residual < −2
Fewer observed than expected. This cell or category is underrepresented.
Example: In a survey, you find χ² is significant. Residuals show that young voters are overrepresented for Party A (+2.8) and underrepresented for Party C (−2.4). Now you know where the association lies, not just that one exists.
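Standardized residuals are easy to compute by hand from a contingency table. The 2×2 counts below are invented for illustration, with expected values taken from the marginal totals as in the independence test.

```python
import math

# Made-up 2x2 contingency table.
observed = [[30, 10],
            [20, 40]]

row_t = [sum(r) for r in observed]
col_t = [sum(c) for c in zip(*observed)]
n = sum(row_t)

def expected(i, j):
    return row_t[i] * col_t[j] / n

# Standardized residual: (O - E) / sqrt(E)
residuals = [[(observed[i][j] - expected(i, j)) / math.sqrt(expected(i, j))
              for j in range(len(col_t))]
             for i in range(len(row_t))]
```

For this table the residuals are roughly [[+2.24, −2.24], [−1.83, +1.83]]: the top-left cell is beyond +2 and therefore notably overrepresented.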
Assumption Warnings (Small Expected Cells)
The chi-square test relies on the chi-square distribution as an approximation. That approximation breaks down when expected counts are too small.
Expected Frequency Rule
The traditional guideline: all expected values should be at least 5. A more lenient version allows up to 20% of cells below 5, but none below 1. When violated, p-values become unreliable.
Independence of Observations
Each observation must be counted in exactly one cell. The same subject can't appear multiple times. If you have paired data (same subjects measured twice), use McNemar's test, not chi-square.
Count Data Only
Enter raw counts (frequencies), not percentages, proportions, or averages. If you have 30% and 70%, convert them to actual counts given your sample size.
Workarounds for small expected values: (1) Combine adjacent categories to increase expected counts—but only if it makes conceptual sense. (2) For 2×2 tables, use Fisher's exact test instead. (3) For larger tables, simulation-based methods (Monte Carlo p-values) bypass the chi-square approximation entirely.
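For a 2×2 table with small expected counts, workaround (2) is one line with SciPy. The counts below are a hypothetical sparse table where several expected values fall below 5.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table with tiny counts.
table = [[8, 2],
         [1, 5]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")

# The exact p-value needs no chi-square approximation, so it remains
# valid even with these small cells. odds_ratio here is the sample
# odds ratio (8*5)/(2*1) = 20.
```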
Chi-Square Questions, Answered
What's the difference between chi-square and t-test?
T-tests compare means of continuous variables. Chi-square tests compare frequencies of categorical variables. If you're asking "are the averages different?" use a t-test. If you're asking "are the proportions or distributions different?" use chi-square.
My expected counts are below 5. Can I still run the test?
The chi-square approximation may be inaccurate. Options: combine categories if sensible, use Fisher's exact test for 2×2 tables, or use Monte Carlo simulation for larger tables. Don't just ignore the warning—your p-value could be misleading.
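The Monte Carlo option can be sketched with only the standard library: simulate data under the null, recompute χ² each time, and see how often the simulated statistic meets or beats the observed one. The counts below are an assumed small-sample example with equal expected proportions.

```python
import random

random.seed(0)

# Hypothetical goodness-of-fit data; null hypothesis: equal proportions.
observed = [1, 3, 11]
n, k = sum(observed), len(observed)
expected = [n / k] * k

def chi2_stat(counts):
    return sum((o - e) ** 2 / e for o, e in zip(counts, expected))

obs_stat = chi2_stat(observed)  # 11.2 for these counts

# Simulate the null: draw n observations uniformly at random,
# recompute chi-square, and count exceedances.
sims = 20_000
hits = 0
for _ in range(sims):
    draws = random.choices(range(k), k=n)
    counts = [draws.count(j) for j in range(k)]
    if chi2_stat(counts) >= obs_stat:
        hits += 1

p_mc = (hits + 1) / (sims + 1)  # +1 keeps the estimate away from exactly zero
```

The simulated p-value needs no distributional approximation, so it stays trustworthy even with expected counts this small.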
How do I measure effect size for chi-square?
For 2×2 tables, use phi (φ = √(χ²/n)). For larger tables, use Cramér's V (V = √(χ²/(n × min(r−1, c−1)))). Both range from 0 to 1, where 0 means no association and values closer to 1 mean stronger association. Cohen's benchmarks: 0.1 small, 0.3 medium, 0.5 large.
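Both effect sizes follow directly from the χ² value, so they can be computed with plain Python. The 2×2 table below is invented; note that for a 2×2 table, phi and Cramér's V coincide.

```python
import math

# Invented 2x2 table.
observed = [[30, 10],
            [20, 40]]

row_t = [sum(r) for r in observed]
col_t = [sum(c) for c in zip(*observed)]
n = sum(row_t)

# Uncorrected chi-square statistic from the expected counts.
chi2 = sum((observed[i][j] - row_t[i] * col_t[j] / n) ** 2
           / (row_t[i] * col_t[j] / n)
           for i in range(len(row_t)) for j in range(len(col_t)))

phi = math.sqrt(chi2 / n)                        # phi = sqrt(chi2 / n)
r, c = len(observed), len(observed[0])
v = math.sqrt(chi2 / (n * min(r - 1, c - 1)))    # Cramer's V
```

Here χ² ≈ 16.67 and phi = V ≈ 0.41, a medium-to-large association by Cohen's benchmarks.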
Does a significant chi-square prove causation?
No. It shows association—the variables are related—but not which one causes the other, or whether both are caused by something else. Establishing causation requires experimental design, temporal ordering, and ruling out confounders.
How do I report chi-square results?
Include the chi-square value, degrees of freedom, p-value, sample size, and effect size. Example: "A chi-square test of independence revealed a significant association between education level and voting preference, χ²(4) = 18.3, p = 0.001, Cramér's V = 0.27."
Can I use chi-square for more than two categories or variables?
Yes. Goodness-of-fit handles any number of categories. Independence tests handle r × c tables with any number of rows and columns. Degrees of freedom adjust accordingly. Just watch for sparse cells with small expected counts in large tables.
Limitations and Scope
• Expected frequency rule: When expected counts fall below 5 in many cells, the chi-square approximation becomes unreliable. Use Fisher's exact test or simulation methods.
• Association ≠ causation: Significant results indicate variables are related, not that one causes the other. Confounding variables may explain the association.
• Count data only: Chi-square requires raw frequencies, not percentages or continuous measurements. Converting percentages to counts requires knowing the total sample size.
• Independence of observations: Each subject should contribute to exactly one cell. For paired or repeated data, different methods apply (McNemar's test).
Note: This calculator is for educational purposes. For research, clinical, or policy applications, verify with statistical software and consider exact tests when expected counts are small.
Sources
- NIST/SEMATECH e-Handbook: Chi-Square Test
- Penn State STAT 500: Chi-Square Tests Module
- Cohen, J. (1988): Statistical Power Analysis for the Behavioral Sciences — Cramér's V benchmarks
Frequently Asked Questions
Common questions about chi-square tests, goodness-of-fit tests, independence tests, expected frequencies, standardized residuals, assumptions, and how to use this calculator for homework and statistics practice.
What does the Chi-Square statistic measure?
The chi-square statistic measures the overall discrepancy between observed and expected frequencies. It sums the squared differences between observed and expected values, divided by the expected values: χ² = Σ(O-E)²/E. A larger χ² indicates a greater deviation from what was expected under the null hypothesis. The value itself has no inherent meaning until compared to the chi-square distribution with the appropriate degrees of freedom.
When should expected frequencies be at least 5?
The rule that expected frequencies should be ≥ 5 is a guideline to ensure the chi-square approximation is valid. When expected counts are too low, the chi-square distribution doesn't accurately model the test statistic. If you have expected values < 5, consider: (1) combining adjacent categories to increase expected counts, (2) using Fisher's exact test for 2×2 tables, or (3) using simulation-based methods. Some sources accept 80% of cells having E ≥ 5 with none below 1.
What is a standardized residual in chi-square analysis?
A standardized residual is (Observed - Expected) / √Expected, showing how much each cell contributes to the overall χ² statistic. Residuals > 2 or < -2 indicate cells with notable deviations from expectation. Positive residuals mean observed counts exceed expected; negative means fewer than expected. Examining residuals helps identify which specific categories or cells are driving a significant result.
What is the difference between independence and association?
Independence means two variables are not related—knowing one tells you nothing about the other. The chi-square test of independence tests whether the row and column variables in a contingency table are independent. If we reject independence, we conclude there's an association (relationship) between the variables. However, association doesn't imply causation—it just means the variables vary together in some systematic way.
How do I choose between goodness-of-fit and independence tests?
Use goodness-of-fit when you have ONE categorical variable and want to compare observed frequencies to a specific expected distribution (e.g., testing if a die is fair). Use the test of independence when you have TWO categorical variables and want to test if they're related (e.g., is there an association between education level and voting preference). The goodness-of-fit compares to a theoretical distribution; independence compares to what you'd expect if variables were unrelated.
Why is the chi-square test always right-tailed?
The chi-square test is always right-tailed because the χ² statistic can only be positive (it's a sum of squared terms). Large χ² values indicate large deviations from expected frequencies, which is evidence against the null hypothesis. Small χ² values (near zero) indicate observed frequencies are close to expected, supporting the null hypothesis. There's no such thing as 'too good' a fit in the left tail.
Can I use chi-square for continuous data?
The chi-square test is designed for categorical (count) data, not continuous measurements. If you have continuous data, you could: (1) categorize it into bins and use chi-square, though this loses information, (2) use parametric tests like t-tests or ANOVA for comparing means, or (3) use non-parametric tests like Mann-Whitney or Kruskal-Wallis. Converting continuous to categorical should be done thoughtfully, as bin choice affects results.
What's the relationship between chi-square and correlation?
Chi-square tests for association between categorical variables, while correlation (like Pearson's r) measures linear relationships between continuous variables. For ordinal categorical data, you might use Spearman's correlation or Kendall's tau instead. A significant chi-square tells you variables are associated but doesn't measure strength or direction. For 2×2 tables, Cramér's V or phi coefficient can measure association strength after a significant chi-square.
How do I handle a 2×2 contingency table?
For 2×2 tables, the standard chi-square test works but has special considerations: (1) with small expected counts (< 5), use Fisher's exact test instead; (2) some apply Yates' continuity correction to reduce chi-square slightly, though this is debated; (3) the degrees of freedom are always 1 for 2×2 tables. You can also use odds ratios or relative risk to describe the association. McNemar's test is used for paired 2×2 data.
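These 2×2 options can be compared side by side with SciPy; the table is a made-up example. `chi2_contingency` applies Yates' correction by default on 2×2 tables, and the sample odds ratio follows from the cell counts.

```python
from scipy.stats import chi2_contingency

# Made-up 2x2 table.
table = [[30, 10],
         [20, 40]]

# Yates' continuity correction is on by default for 2x2 tables in SciPy.
chi2_yates, p_yates, df, _ = chi2_contingency(table, correction=True)
chi2_plain, p_plain, _, _ = chi2_contingency(table, correction=False)

# Sample odds ratio: (a*d) / (b*c)
(a, b), (c, d) = table
odds_ratio = (a * d) / (b * c)
```

The corrected statistic is always a bit smaller than the uncorrected one, which makes the corrected test slightly more conservative.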
What should I report from a chi-square analysis?
A complete chi-square report should include: (1) χ² value and degrees of freedom: χ²(df) = value, (2) p-value, (3) sample size (N), (4) a description of what was tested, (5) whether the result was significant at your chosen α level, (6) for independence tests, consider reporting an effect size like Cramér's V, and (7) mention any cells with low expected frequencies. For example: 'χ²(2) = 8.45, p = 0.015, N = 150. There was a significant association between variables.'
Related Math & Statistics Tools
T-Test Calculator
Compare means with one-sample, two-sample, and paired t-tests
Probability Toolkit
Compute various probability calculations
Normal Distribution
Calculate probabilities under the normal curve
Descriptive Stats
Calculate mean, median, standard deviation, and more
Confidence Interval
Build confidence intervals for means and proportions
Z-Score / P-Value
Convert between z-scores and probabilities
Poisson Distribution
Calculate probabilities for count-based rare events
Sample Size for Proportions
Calculate required sample size for proportion-based statistical tests