Confusion Matrix Calculator
Calculate precision, recall, F1 score, and classification metrics from TP/FP/TN/FN counts. Evaluate binary and multiclass models with confusion matrices for machine learning and data science projects.
Classification Metrics
Analyze classification performance with confusion matrices and metrics
Which Metric Answers Your Actual Question
Your fraud model flags 200 transactions per day. The PM asks "Is it working?" but that question has at least three versions: are we catching most fraud (recall), are the alerts mostly real (precision), or is the overall picture reasonable (accuracy)? A confusion matrix lays out TP, FP, TN, and FN so you can compute whichever metric matches the question. Without it you are guessing which flavour of "good" you are measuring.
The mistake that wastes weeks: optimising for accuracy on an imbalanced dataset. If 1% of transactions are fraudulent, a model that always says "legit" scores 99% accuracy and catches zero fraud. The confusion matrix exposes this instantly: the TP cell is empty. Precision, recall, F1, and MCC each answer different questions; choosing the wrong one sends the project in the wrong direction from the start.
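As a minimal sketch (plain Python on raw counts, no particular library assumed), here is how the headline metrics fall out of the four cells, and how the "always legit" model from the 1% example scores:

```python
# Minimal sketch: headline metrics from raw confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Precision is undefined when the model makes no positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# An "always legit" model on 10,000 transactions with 1% fraud:
# TP and FP are both zero because it never predicts positive.
acc, prec, rec = metrics(tp=0, fp=0, tn=9_900, fn=100)
print(acc)  # 0.99 -- looks excellent
print(rec)  # 0.0  -- catches zero fraud
```

The empty TP cell drives recall to zero no matter how high accuracy climbs.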
Precision-Recall Trade-Off When You Move the Threshold
A classifier that outputs probabilities becomes a binary decision only after you pick a threshold. At 0.5, your spam filter might have 92% precision and 78% recall. Lower the threshold to 0.3 and recall jumps to 90%, but precision drops to 80% because you are now flagging more borderline messages. The confusion matrix shifts: FN shrinks while FP grows. Every threshold yields a different 2×2 table.
The right threshold is not a statistical question; it is a business one. In cancer screening, a missed positive (FN) is far more costly than a false alarm (FP), so you push recall high even if precision suffers. In email blocking, a legitimate message in the spam folder (FP) annoys users more than one spam getting through (FN), so you protect precision. Map error costs before tuning the threshold, not after.
One practical check: plot precision and recall across thresholds (PR curve) alongside the confusion matrix at a few key points (0.3, 0.5, 0.7). This lets stakeholders see the trade-off as a concrete table of TP/FP/FN counts, not an abstract curve.
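A rough sketch of that check, with hypothetical scores and labels standing in for real model output:

```python
# Sketch: confusion-matrix counts at a few candidate thresholds.
def confusion_at(scores, labels, threshold):
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        pred = s >= threshold
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

# Hypothetical model scores and ground-truth labels.
scores = [0.95, 0.80, 0.65, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0]

for t in (0.3, 0.5, 0.7):
    tp, fp, tn, fn = confusion_at(scores, labels, t)
    print(f"threshold {t}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

Even on seven toy samples the pattern is visible: raising the threshold trades FP away for FN, and each row of output is one 2×2 table stakeholders can read directly.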
Why 99% Accuracy Can Mean a Useless Model
Class imbalance is the root cause. If only 50 out of 5,000 test samples are positive, a model that never predicts positive gets 99% accuracy, zero recall, and undefined precision. The confusion matrix shows TN = 4,950, FN = 50, TP = 0, FP = 0. Accuracy hides the failure; recall surfaces it.
Two metrics handle imbalance better than accuracy. Balanced accuracy averages recall across classes, so a majority-class-only model scores 50% instead of 99%. MCC (Matthews Correlation Coefficient) uses all four cells and returns a value between −1 and +1, where 0 means no better than random. MCC = 0 for the all-negative model, which is far more honest than 99%.
Rule of thumb: if positive prevalence is below 10%, do not report accuracy at all. Lead with precision, recall, and F1 for the minority class, and include MCC as a single-number summary. Stakeholders who see "99% accuracy" will assume the model works; stakeholders who see "MCC = 0" will not.
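A small sketch of both imbalance-aware metrics, applied to the 5,000-sample split above (by convention, an MCC whose denominator is zero is reported as 0):

```python
import math

def balanced_accuracy(tp, fp, tn, fn):
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (recall + specificity) / 2

def mcc(tp, fp, tn, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention: undefined MCC reported as 0
    return (tp * tn - fp * fn) / denom

# All-negative model: TN = 4,950, FN = 50, TP = FP = 0.
print(balanced_accuracy(0, 0, 4950, 50))  # 0.5
print(mcc(0, 0, 4950, 50))                # 0.0
```

Both numbers immediately contradict the 99% accuracy headline.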
Reading the AUC Curve Alongside the Confusion Matrix
The ROC curve plots true positive rate (recall) against false positive rate across all thresholds. AUC summarises the curve into a single number: 0.50 means random guessing, 1.00 means perfect separation. But AUC does not tell you which threshold to use; that is the confusion matrix's job. A model with AUC = 0.92 might still have poor precision at the threshold you actually deploy, because AUC averages over thresholds you will never choose.
Use AUC to compare models during development (Model A vs. Model B), then use the confusion matrix at your chosen threshold to evaluate deployment readiness. If two models have similar AUC but different confusion matrices at the operating threshold, the matrices are what matter for the decision.
One caveat: under severe class imbalance, AUC can look good even when the model is poor. The precision-recall curve (AUPRC) is more informative in those cases because it focuses on the positive class. Report both when your positive rate is below 5%.
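For intuition, AUC can be computed without drawing any curve at all, via its rank interpretation: the probability that a randomly chosen positive outranks a randomly chosen negative (ties count half). This is a sketch of that equivalence, not the trapezoid-over-the-curve method libraries typically use:

```python
def auc(scores, labels):
    """ROC AUC via the rank interpretation: fraction of
    (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One misranked pair out of four -> AUC = 0.75.
print(auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))  # 0.75
```

This framing also explains why AUC says nothing about any single threshold: it only scores the ordering of examples, never a particular cut point.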
Multiclass Matrices and Per-Class F1 Breakdown
My 5-class model has 82% overall accuracy. Is that good enough?
Not without checking per-class recall. If Class D has 40% recall while the others sit above 90%, the model is failing on Class D, and the 82% overall hides it. The multiclass confusion matrix shows exactly where the off-diagonal mass concentrates: Class D being confused with Class B in 35% of cases, for example.
Should I report macro-F1 or weighted-F1?
Macro-F1 treats every class equally, which is useful when minority classes matter as much as majority classes. Weighted-F1 weights by support (class frequency), so large classes dominate. If Class D has 50 samples and Class A has 2,000, macro-F1 gives them equal voice; weighted-F1 barely notices Class D. Choose based on whether all classes carry equal business importance.
The matrix shows symmetric confusion between two classes. What should I do?
If Class B and Class C confuse each other roughly equally, the features you have cannot distinguish them well. Options: add features that separate them, merge them into one class if the distinction is not meaningful, or accept the error and document it. The confusion matrix diagnoses the problem; domain knowledge prescribes the fix.
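The per-class breakdown and both averages can be sketched directly from a raw multiclass matrix. The 3-class matrix below is hypothetical, with the third class deliberately confusable with the second:

```python
# Hypothetical 3-class confusion matrix: rows = true class, cols = predicted.
cm = [
    [90,  5,  5],   # class 0
    [10, 70, 20],   # class 1
    [ 5, 25, 20],   # class 2: small support, often confused with class 1
]

def per_class_f1(cm):
    k = len(cm)
    f1s, supports = [], []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # rest of the row
        fp = sum(cm[r][c] for r in range(k)) - tp  # rest of the column
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        supports.append(sum(cm[c]))
    return f1s, supports

f1s, supports = per_class_f1(cm)
macro = sum(f1s) / len(f1s)
weighted = sum(f * s for f, s in zip(f1s, supports)) / sum(supports)
for c, (f, s) in enumerate(zip(f1s, supports)):
    print(f"class {c}: F1={f:.3f} (support {s})")
print(f"macro-F1={macro:.3f}  weighted-F1={weighted:.3f}")
```

Because the weak class is also the smallest, weighted-F1 comes out higher than macro-F1 here; the gap between the two averages is itself a quick imbalance diagnostic.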
Confusion Matrix, F1, MCC, and Balanced Accuracy Equations
Four equations cover most evaluation needs:
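Written out explicitly (precision and recall shown together as the confusion-matrix pair):

```latex
\mathrm{Precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN}

F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
           {\mathrm{Precision} + \mathrm{Recall}}

\mathrm{Balanced\ Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)

\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
```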
Units note: all metrics are dimensionless ratios between 0 and 1 (or −1 to +1 for MCC). TP, FP, TN, FN are raw integer counts from the evaluation set.
Fraud-Detection Model Evaluation With 10,000 Transactions
Scenario: You evaluate a fraud classifier on 10,000 transactions. 120 are actually fraudulent. The model at threshold 0.5 produces: TP = 96, FN = 24, FP = 180, TN = 9,700.
Step 1: Core metrics.
Precision = 96 / (96 + 180) = 34.8%. Recall = 96 / (96 + 24) = 80.0%. F1 = 2 × 0.348 × 0.800 / (0.348 + 0.800) = 0.485. Accuracy = (96 + 9,700) / 10,000 = 97.96%.
Step 2: Imbalance-aware metrics.
Specificity = 9,700 / (9,700 + 180) = 98.2%. Balanced accuracy = (80.0% + 98.2%) / 2 = 89.1%. MCC = (96 × 9,700 − 180 × 24) / √[(276)(120)(9,880)(9,724)] ≈ 0.52. Accuracy looks excellent at 98%, but MCC at 0.52 and precision at 35% tell the real story: roughly two out of three fraud alerts are false alarms.
Step 3: Decision.
If the compliance team can handle 276 daily alerts (TP + FP) and the cost of a missed fraud exceeds the cost of investigating a false alarm, 80% recall at 35% precision may be acceptable. If alert fatigue is the concern, raise the threshold to 0.7: precision will improve, but some frauds will slip through. The confusion matrix at each threshold gives the concrete numbers to make that call.
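The whole walkthrough reduces to a few lines, so the arithmetic can be checked mechanically (values rounded as in the text):

```python
import math

# Raw counts from the fraud scenario at threshold 0.5.
TP, FN, FP, TN = 96, 24, 180, 9_700

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FP + TN + FN)
specificity = TN / (TN + FP)
balanced = (recall + specificity) / 2
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.4f} balanced={balanced:.3f} mcc={mcc:.2f}")
```

Rerunning this with counts taken at a 0.7 threshold gives the concrete before/after table for the alert-fatigue discussion.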
Sources
scikit-learn – Classification Metrics: Precision, recall, F1, MCC, and confusion matrix computation with worked examples.
NCBI – The Advantages of MCC Over F1 and Accuracy: Why MCC is more informative than accuracy on imbalanced datasets.
Google ML Crash Course – Classification: Threshold tuning, ROC curves, and precision-recall trade-offs for binary classifiers.
Machine Learning Mastery – Confusion Matrix: Practical walkthrough of confusion matrix interpretation with imbalanced and multiclass examples.
Frequently Asked Questions about Confusion Matrices
What is a confusion matrix in simple terms?
What do TP, FP, TN, and FN mean?
What is the difference between accuracy, precision, and recall?
When should I care more about precision vs recall?
What is F1 score and when is it useful?
Why is accuracy misleading for imbalanced datasets?
How do I interpret specificity and sensitivity in this calculator?
What is the difference between macro and micro averages in multiclass problems?
Can I use this tool for multiclass confusion matrices?
How should I present confusion matrix metrics in reports, slides, or homework?
What is Matthews Correlation Coefficient (MCC)?
What is balanced accuracy and when should I use it?
How do I choose between multiple models using confusion matrices?
Why does my model have high accuracy but low F1?
Can I use this calculator to compare different classification thresholds?
Explore More Data Science & ML Tools
Master Machine Learning & Model Evaluation
Build essential skills in classification metrics, model diagnostics, and data-driven performance evaluation for ML projects
Explore All Data Science Tools