Confusion Matrix Calculator

Calculate precision, recall, F1 score, and classification metrics from TP/FP/TN/FN counts. Evaluate binary and multiclass models with confusion matrices for machine learning and data science projects.

Classification Metrics

Analyze classification performance with confusion matrices and metrics

Which Metric Answers Your Actual Question

Your fraud model flags 200 transactions per day. The PM asks “Is it working?” but that question has at least three versions: are we catching most fraud (recall), are the alerts mostly real (precision), or is the overall picture reasonable (accuracy)? A confusion matrix lays out TP, FP, TN, and FN so you can compute whichever metric matches the question. Without it you are guessing which flavour of “good” you are measuring.

The mistake that wastes weeks: optimising for accuracy on an imbalanced dataset. If 1% of transactions are fraudulent, a model that always says “legit” scores 99% accuracy and catches zero fraud. The confusion matrix exposes this instantly — the TP cell is empty. Precision, recall, F1, and MCC each answer different questions; choosing the wrong one sends the project in the wrong direction from the start.
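The trap is easy to demonstrate. Here is a minimal Python sketch (counts are hypothetical) of an always-"legit" model evaluated on 1% fraud prevalence:

```python
# An "always negative" model on imbalanced data: 100 frauds in 10,000
# transactions, none of them flagged. Hypothetical counts.
TP, FP, TN, FN = 0, 0, 9900, 100

accuracy = (TP + TN) / (TP + FP + TN + FN)
recall = TP / (TP + FN) if (TP + FN) else 0.0
# Precision is 0/0 here: no positive predictions were made at all.
precision = TP / (TP + FP) if (TP + FP) else float("nan")

print(f"accuracy={accuracy:.0%}, recall={recall:.0%}")  # 99% accuracy, 0% recall
```

The empty TP cell is exactly what the confusion matrix surfaces and accuracy hides.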

Precision–Recall Trade-Off When You Move the Threshold

A classifier that outputs probabilities becomes a binary decision only after you pick a threshold. At 0.5, your spam filter might have 92% precision and 78% recall. Lower the threshold to 0.3 and recall jumps to 90% — but precision drops to 80% because you are now flagging more borderline messages. The confusion matrix shifts: FN shrinks while FP grows. Every threshold yields a different 2×2 table.

The right threshold is not a statistical question — it is a business one. In cancer screening, a missed positive (FN) is far more costly than a false alarm (FP), so you push recall high even if precision suffers. In email blocking, a legitimate message in the spam folder (FP) annoys users more than one spam getting through (FN), so you protect precision. Map error costs before tuning the threshold, not after.

One practical check: plot precision and recall across thresholds (PR curve) alongside the confusion matrix at a few key points — 0.3, 0.5, 0.7. This lets stakeholders see the trade-off as a concrete table of TP/FP/FN counts, not an abstract curve.
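A small sketch of that check, sweeping a few thresholds over hypothetical (probability, label) pairs and printing the resulting 2×2 counts:

```python
# Build the confusion matrix at a given threshold from scored examples.
# `scored` is a list of (predicted_probability, true_label) pairs; the
# values below are toy data, not from a real model.
def confusion_at(threshold, scored):
    tp = fp = tn = fn = 0
    for p, y in scored:
        pred = p >= threshold
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and not y:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

scored = [(0.9, 1), (0.8, 1), (0.65, 0), (0.6, 1),
          (0.4, 1), (0.35, 0), (0.2, 0), (0.1, 0)]

for t in (0.3, 0.5, 0.7):
    tp, fp, tn, fn = confusion_at(t, scored)
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    print(f"t={t}: TP={tp} FP={fp} TN={tn} FN={fn}  P={prec:.2f} R={rec:.2f}")
```

Even on eight toy points the pattern shows: lowering the threshold shrinks FN and grows FP, trading precision for recall.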

Why 99% Accuracy Can Mean a Useless Model

Class imbalance is the root cause. If only 50 out of 5,000 test samples are positive, a model that never predicts positive gets 99% accuracy, zero recall, and undefined precision. The confusion matrix shows TN = 4,950, FN = 50, TP = 0, FP = 0. Accuracy hides the failure; recall surfaces it.

Two metrics handle imbalance better than accuracy. Balanced accuracy averages recall across classes, so a majority-class-only model scores 50% instead of 99%. MCC (Matthews Correlation Coefficient) uses all four cells and returns a value between −1 and +1, where 0 means no better than random. MCC = 0 for the all-negative model, which is far more honest than 99%.

Rule of thumb: if positive prevalence is below 10%, do not report accuracy at all. Lead with precision, recall, and F1 for the minority class, and include MCC as a single-number summary. Stakeholders who see “99% accuracy” will assume the model works; stakeholders who see “MCC = 0” will not.
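Both imbalance-aware metrics are a few lines to compute. A sketch (counts hypothetical; real pipelines would typically use scikit-learn's `matthews_corrcoef` and `balanced_accuracy_score`):

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; returns 0.0 when the
    denominator is zero, matching the usual convention."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def balanced_accuracy(tp, fp, tn, fn):
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (recall + specificity) / 2

# The all-negative model from above: 50 positives in 5,000 samples.
print(mcc(0, 0, 4950, 50))                # 0.0 — no better than random
print(balanced_accuracy(0, 0, 4950, 50))  # 0.5 — not 99%
```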

Reading the AUC Curve Alongside the Confusion Matrix

The ROC curve plots true positive rate (recall) against false positive rate across all thresholds. AUC summarises the curve into a single number: 0.50 means random guessing, 1.00 means perfect separation. But AUC does not tell you which threshold to use — that is the confusion matrix’s job. A model with AUC = 0.92 might still have poor precision at the threshold you actually deploy, because AUC averages over thresholds you will never choose.

Use AUC to compare models during development (Model A vs. Model B), then use the confusion matrix at your chosen threshold to evaluate deployment readiness. If two models have similar AUC but different confusion matrices at the operating threshold, the matrices are what matter for the decision.

One caveat: under severe class imbalance, AUC can look good even when the model is poor. The precision-recall curve (AUPRC) is more informative in those cases because it focuses on the positive class. Report both when your positive rate is below 5%.
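AUC has a compact interpretation worth knowing: it is the probability that a randomly chosen positive scores higher than a randomly chosen negative. A pure-Python sketch on toy scores (a real evaluation would use scikit-learn's `roc_auc_score`):

```python
# ROC AUC via the rank-sum (Mann-Whitney) identity: the fraction of
# positive/negative pairs the model orders correctly, ties counted half.
def auc(scored):
    pos = [p for p, y in scored if y]
    neg = [p for p, y in scored if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy (probability, label) pairs, not from a real model.
scored = [(0.9, 1), (0.8, 1), (0.65, 0), (0.6, 1),
          (0.4, 1), (0.35, 0), (0.2, 0), (0.1, 0)]
print(auc(scored))
```

Note the function never mentions a threshold, which is precisely why AUC cannot tell you which threshold to deploy.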

Multiclass Matrices and Per-Class F1 Breakdown

My 5-class model has 82% overall accuracy. Is that good enough?
Not without checking per-class recall. If Class D has 40% recall while the others sit above 90%, the model is failing on Class D — and the 82% overall hides it. The multiclass confusion matrix shows exactly where the off-diagonal mass concentrates: Class D being confused with Class B in 35% of cases, for example.

Should I report macro-F1 or weighted-F1?
Macro-F1 treats every class equally — useful when minority classes matter as much as majority classes. Weighted-F1 weights by support (class frequency), so large classes dominate. If Class D has 50 samples and Class A has 2,000, macro-F1 gives them equal voice; weighted-F1 barely notices Class D. Choose based on whether all classes carry equal business importance.

The matrix shows symmetric confusion between two classes. What should I do?
If Class B and Class C confuse each other roughly equally, the features you have cannot distinguish them well. Options: add features that separate them, merge them into one class if the distinction is not meaningful, or accept the error and document it. The confusion matrix diagnoses the problem; domain knowledge prescribes the fix.
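The per-class breakdown described above is a short loop over the matrix. A sketch with hypothetical counts, constructed so class C is heavily confused with class B:

```python
# Per-class precision/recall/F1 and macro-F1 from a 3-class confusion
# matrix. Rows = actual class, columns = predicted class.
matrix = [
    [50,  3,  2],   # actual A
    [ 4, 40,  6],   # actual B
    [ 1, 14, 10],   # actual C — mostly mispredicted as B
]
names = ["A", "B", "C"]

f1s = []
for i, name in enumerate(names):
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                 # rest of the row
    fp = sum(row[i] for row in matrix) - tp  # rest of the column
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    f1s.append(f1)
    print(f"class {name}: precision={prec:.2f} recall={rec:.2f} F1={f1:.2f}")

print(f"macro-F1 = {sum(f1s) / len(f1s):.2f}")
```

Class C's 40% recall drags macro-F1 down while leaving overall accuracy (100/130 ≈ 77%) looking merely mediocre rather than broken.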

Confusion Matrix, F1, MCC, and Balanced Accuracy Equations

Four equations cover most evaluation needs:

Precision and Recall
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score
F1 = 2 × Precision × Recall / (Precision + Recall)
Equivalent: F1 = 2TP / (2TP + FP + FN)
Matthews Correlation Coefficient
MCC = (TP×TN − FP×FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
Range: −1 (total disagreement) to +1 (perfect), 0 = random
Balanced Accuracy
BA = (Recall + Specificity) / 2
Specificity = TN / (TN + FP)

Units note: all metrics are dimensionless ratios between 0 and 1 (or −1 to +1 for MCC). TP, FP, TN, FN are raw integer counts from the evaluation set.
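The equations translate directly to code, and the two F1 forms can be cross-checked against each other. A sketch using arbitrary example counts:

```python
# Verify that F1 from precision/recall equals the direct count form
# F1 = 2TP / (2TP + FP + FN). Counts are arbitrary examples.
TP, FP, FN = 96, 180, 24

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_from_pr = 2 * precision * recall / (precision + recall)
f1_direct = 2 * TP / (2 * TP + FP + FN)

assert abs(f1_from_pr - f1_direct) < 1e-12  # algebraically identical
print(round(f1_direct, 3))
```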

Fraud-Detection Model Evaluation With 10,000 Transactions

Scenario: You evaluate a fraud classifier on 10,000 transactions. 120 are actually fraudulent. The model at threshold 0.5 produces: TP = 96, FN = 24, FP = 180, TN = 9,700.

Step 1 — Core metrics.
Precision = 96 / (96 + 180) = 34.8%. Recall = 96 / (96 + 24) = 80.0%. F1 = 2 × 0.348 × 0.800 / (0.348 + 0.800) = 0.485. Accuracy = (96 + 9,700) / 10,000 = 97.96%.

Step 2 — Imbalance-aware metrics.
Specificity = 9,700 / (9,700 + 180) = 98.2%. Balanced accuracy = (80.0% + 98.2%) / 2 = 89.1%. MCC = (96×9,700 − 180×24) / √[(276)(120)(9,880)(9,724)] ≈ 0.52. Accuracy looks excellent at 98%, but MCC at 0.52 and precision at 35% tell the real story: roughly two out of three fraud alerts are false alarms.

Step 3 — Decision.
If the compliance team can handle 276 daily alerts (TP + FP) and the cost of a missed fraud exceeds the cost of investigating a false alarm, 80% recall at 35% precision may be acceptable. If alert fatigue is the concern, raise the threshold to 0.7 — precision will improve, but some frauds will slip through. The confusion matrix at each threshold gives the concrete numbers to make that call.
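The three steps above can be reproduced in a few lines of Python, using the scenario's counts:

```python
import math

# Counts from the fraud scenario: 10,000 transactions, 120 frauds.
TP, FN, FP, TN = 96, 24, 180, 9700

precision = TP / (TP + FP)
recall = TP / (TP + FN)
accuracy = (TP + TN) / (TP + FP + TN + FN)
specificity = TN / (TN + FP)
balanced_acc = (recall + specificity) / 2
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(f"precision={precision:.1%} recall={recall:.1%} accuracy={accuracy:.2%}")
print(f"balanced_acc={balanced_acc:.1%} mcc={mcc:.2f}")
print(f"daily alerts={TP + FP}")
```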

Sources

scikit-learn — Classification Metrics: Precision, recall, F1, MCC, and confusion matrix computation with worked examples.

NCBI — The Advantages of MCC Over F1 and Accuracy: Why MCC is more informative than accuracy on imbalanced datasets.

Google ML Crash Course — Classification: Threshold tuning, ROC curves, and precision-recall trade-offs for binary classifiers.

Machine Learning Mastery — Confusion Matrix: Practical walkthrough of confusion matrix interpretation with imbalanced and multiclass examples.

Frequently Asked Questions about Confusion Matrices

What is a confusion matrix in simple terms?
A confusion matrix is a table that summarizes how well a classification model performs by comparing its predictions to the actual true labels. For binary classification (two classes), it's a 2×2 table showing four counts: True Positives (correct positive predictions), False Positives (incorrect positive predictions), True Negatives (correct negative predictions), and False Negatives (incorrect negative predictions). For multiclass problems, it expands into a larger table. The confusion matrix reveals not just overall accuracy, but the specific types of errors your model makes.
What do TP, FP, TN, and FN mean?
TP (True Positive): The model predicted positive, and the actual class was positive—a correct positive prediction. FP (False Positive): The model predicted positive, but the actual class was negative—an incorrect positive prediction (also called a Type I error or false alarm). TN (True Negative): The model predicted negative, and the actual class was negative—a correct negative prediction. FN (False Negative): The model predicted negative, but the actual class was positive—an incorrect negative prediction (also called a Type II error or miss).
What is the difference between accuracy, precision, and recall?
Accuracy is the proportion of all predictions that were correct: (TP + TN) / (TP + FP + TN + FN). It's simple but misleading for imbalanced data. Precision is the proportion of positive predictions that were actually positive: TP / (TP + FP). It answers: 'Of all things I called positive, how many were right?' Recall (also called sensitivity or true positive rate) is the proportion of actual positives that were correctly identified: TP / (TP + FN). It answers: 'Of all actual positives, how many did I catch?' Precision focuses on false alarms (FP), recall focuses on misses (FN).
When should I care more about precision vs recall?
It depends on the cost of errors. Prioritize precision when false positives are costly or annoying—e.g., spam filters (don't block real emails), or recommending products (don't show irrelevant items). Prioritize recall when false negatives are costly or dangerous—e.g., disease detection (don't miss any cases), fraud detection (don't let frauds slip through), or safety systems (don't fail to detect hazards). Many problems require balancing both, which is where F1 score (harmonic mean of precision and recall) comes in.
What is F1 score and when is it useful?
F1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). It balances both metrics, giving a single number that considers both false positives and false negatives. F1 is especially useful for imbalanced datasets where accuracy is misleading, and when you care about both precision and recall but don't want to favor one over the other. F1 ranges from 0 to 1, with 1 being perfect. However, always check precision and recall separately—F1 alone can hide important trade-offs.
Why is accuracy misleading for imbalanced datasets?
In imbalanced datasets, one class vastly outnumbers the other (e.g., 99% negatives, 1% positives). A naive model that always predicts the majority class achieves high accuracy (99%) but is useless—it catches zero positives. For example, a fraud detector that labels everything 'not fraud' has 99% accuracy but 0% recall. Solution: Use precision, recall, F1, and balanced accuracy to evaluate minority-class performance. These metrics reveal whether your model is actually learning or just predicting the majority class.
How do I interpret specificity and sensitivity in this calculator?
Sensitivity (also called recall or true positive rate) is TP / (TP + FN)—the proportion of actual positives correctly identified. High sensitivity means you're catching most positives (few false negatives). Specificity (also called true negative rate) is TN / (TN + FP)—the proportion of actual negatives correctly identified. High specificity means you're correctly ruling out negatives (few false positives). These terms are common in medical testing: sensitivity = 'don't miss any diseases,' specificity = 'don't misdiagnose healthy people.'
What is the difference between macro and micro averages in multiclass problems?
Macro-average computes the metric (precision, recall, F1) for each class separately, then takes the unweighted average. It treats all classes equally, regardless of size—good for class-balanced insights. Micro-average pools all TP, FP, FN across all classes, then computes a single metric. It weighs by class frequency—larger classes contribute more. When to use: Use macro-average when all classes are equally important (e.g., rare disease classes matter as much as common ones). Use micro-average when larger classes are more important or when you want an overall 'average prediction quality.'
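The difference is easiest to see in code. A sketch with hypothetical counts for a 1000-sample majority class and a 10-sample minority class:

```python
# Macro vs micro F1 on two classes of very different size.
# Counts are hypothetical; class 1 is the 10-sample minority.
stats = {0: {"tp": 950, "fp": 30, "fn": 50},
         1: {"tp": 4,   "fp": 20, "fn": 6}}

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Macro: average the per-class F1 scores — each class gets equal voice.
macro = sum(f1(**s) for s in stats.values()) / len(stats)

# Micro: pool the raw counts first, then compute one F1.
tp = sum(s["tp"] for s in stats.values())
fp = sum(s["fp"] for s in stats.values())
fn = sum(s["fn"] for s in stats.values())
micro = f1(tp, fp, fn)

print(f"macro-F1={macro:.2f} micro-F1={micro:.2f}")
```

Micro-F1 barely registers the minority class's failure, while macro-F1 is dragged down by it, which is the whole point of choosing between them.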
Can I use this tool for multiclass confusion matrices?
Yes! This calculator supports multiclass classification. Select 'Multiclass' from the Classification Type dropdown, then enter counts in a confusion matrix grid (each cell [i, j] = number of actual class i predicted as class j). The tool computes per-class precision, recall, and F1, plus macro, micro, and weighted averages. You can also add or remove classes dynamically using the +/− Class buttons. This is useful for problems with 3+ classes, like image classification (cat/dog/bird), sentiment analysis (positive/neutral/negative), or text categorization.
How should I present confusion matrix metrics in reports, slides, or homework?
Be clear and transparent: (1) Define your classes and which is 'positive' (e.g., 'positive = disease present'). (2) Show the confusion matrix table (with labels on rows/columns). (3) Report key metrics with context: 'Accuracy 87%, Precision 90%, Recall 85%, F1 0.87.' (4) Explain trade-offs: 'High recall ensures we catch most frauds, but lower precision means some false alarms.' (5) Show normalization or rates (e.g., 'Of 100 actual positives, we caught 85') for interpretability. (6) If comparing models, show side-by-side metrics. (7) Acknowledge limitations (small sample size, class imbalance). Clear presentation builds credibility and invites constructive feedback.
What is Matthews Correlation Coefficient (MCC)?
MCC is a single metric that considers all four cells of the binary confusion matrix: TP, TN, FP, FN. It's computed as MCC = (TP×TN − FP×FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). MCC ranges from -1 to +1: +1 = perfect prediction, 0 = random guessing, -1 = total disagreement. MCC is especially useful for imbalanced datasets because it's robust to class imbalance and gives a balanced measure of prediction quality even when one class dominates. Think of it as a 'correlation between predicted and actual labels.'
What is balanced accuracy and when should I use it?
Balanced accuracy is the average of recall (sensitivity) and specificity: (Recall + Specificity) / 2. It weighs the positive and negative classes equally, making it useful for imbalanced datasets where raw accuracy is misleading. For example, if a model has 90% recall (catches most positives) but only 50% specificity (misclassifies many negatives), balanced accuracy is 70%—a fairer summary than raw accuracy, which might be high simply because negatives dominate. Use balanced accuracy when you want a single metric that treats both classes fairly.
How do I choose between multiple models using confusion matrices?
Evaluate each model on the same test set, compute confusion matrices and metrics (precision, recall, F1, etc.), then compare side-by-side. Steps: (1) Identify your priority metric based on problem costs (e.g., recall for disease detection, precision for spam filtering, F1 for balanced problems). (2) Use the calculator for each model's confusion matrix. (3) Compare metrics: which model has higher recall? Higher precision? Best F1? (4) Consider trade-offs: a model with slightly lower F1 but much higher recall might be better for safety-critical problems. (5) If metrics are close, check robustness (test on multiple folds, datasets) or business impact (which errors cost more?).
Why does my model have high accuracy but low F1?
This usually happens with imbalanced data. If 95% of examples are negative, a model that always predicts negative achieves 95% accuracy but has 0% recall and undefined precision (no positive predictions). F1 score, which depends on precision and recall, will be 0 or undefined. Diagnosis: Check your confusion matrix—if FN is large (many missed positives) or FP is large (many false alarms), accuracy looks good but F1 reveals poor performance. Solution: Tune your model to increase recall (catch more positives) or precision (reduce false alarms), or use techniques like resampling, cost-sensitive learning, or threshold adjustment.
Can I use this calculator to compare different classification thresholds?
Yes, conceptually. If you have a probabilistic classifier (outputs probabilities), changing the decision threshold (e.g., from 0.5 to 0.3) changes which predictions are positive, thus changing TP, FP, TN, FN and all derived metrics. How to use: For each threshold, compute the confusion matrix (TP, FP, TN, FN) from your predictions, then input those counts into the calculator to see precision, recall, F1, etc. Advanced: The calculator shows ROC and PR curves (if data supports it), which visualize how metrics change across all thresholds—use these to pick the threshold that best balances your business trade-offs.

