MathIsimple

Confusion Matrix, Precision, Recall, and F1: A Practical Medical Screening Guide

Why 95% accuracy can still mean a useless model

Evaluation Metrics
Confusion Matrix
Precision
Recall
F1 Score

A model can report 95% accuracy and still be unusable.

That sounds dramatic until you look at an imbalanced medical screening task. If only a few patients are positive, predicting "healthy" for everyone can still produce a high accuracy score while missing every case that matters.

Why 95% Accuracy Can Still Mean a Useless Model

If 95 out of 100 patients are healthy, a model that always predicts "healthy" gets 95% accuracy — while catching zero actual disease cases. Accuracy is blind to class imbalance.

A Cancer Screening Example

Suppose a classifier is tested on 25 patients:

  • Actual positive: 5 patients
  • Actual negative: 20 patients

The model produces:

  • True Positive (TP) = 4
  • False Positive (FP) = 2
  • False Negative (FN) = 1
  • True Negative (TN) = 18
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP = 4             | FN = 1             |
| Actual Negative | FP = 2             | TN = 18            |
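These four counts can be tallied directly from label lists. As a minimal sketch, the lists below are hypothetical, constructed only to reproduce the example's counts (1 = positive, 0 = negative):

```python
# Hypothetical labels matching the 25-patient example: 5 positive, 20 negative.
y_true = [1] * 5 + [0] * 20
# Predictions: 4 positives caught, 1 missed, 2 false alarms among the negatives.
y_pred = [1, 1, 1, 1, 0] + [1, 1] + [0] * 18

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(tp, fp, fn, tn)  # 4 2 1 18
```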

Accuracy: The First Number People Quote

Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = \frac{4 + 18}{25} = \frac{22}{25} = 0.88

The model is 88% accurate. That sounds good.

But accuracy answers only one question: "How often is the model correct overall?" It does not tell you whether the model is good at catching the rare class.
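The accuracy formula maps directly to code. A minimal sketch, plugging in the counts from the example:

```python
# Confusion-matrix counts from the screening example.
tp, fp, fn, tn = 4, 2, 1, 18

# Accuracy: fraction of all predictions that are correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 2))  # 0.88
```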

Precision: If the Model Says Positive, Can You Trust It?

Precision = \frac{TP}{TP + FP} = \frac{4}{4 + 2} = \frac{4}{6} \approx 0.667

Precision is about false alarms. Here, when the model predicts positive, it is correct about 66.7% of the time.
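In code, precision only looks at the predicted-positive column. A minimal sketch with the example's counts:

```python
# Of the 6 patients flagged positive, 4 actually are.
tp, fp = 4, 2

precision = tp / (tp + fp)
print(round(precision, 3))  # 0.667
```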

Recall: How Many Real Cases Did We Catch?

Recall = \frac{TP}{TP + FN} = \frac{4}{4 + 1} = \frac{4}{5} = 0.80

Recall is about missed cases. The model catches 80% of the truly positive patients.
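Recall, by contrast, only looks at the actual-positive row. The same sketch for the example's counts:

```python
# Of the 5 truly positive patients, 4 were caught and 1 was missed.
tp, fn = 4, 1

recall = tp / (tp + fn)
print(recall)  # 0.8
```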

When Recall Matters Most

In cancer screening, fraud detection, or safety monitoring, recall is often the number people care about most. A false alarm is annoying. A missed case can be fatal.

F1 Score: Balancing Precision and Recall

F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.667 \times 0.80}{0.667 + 0.80} \approx 0.727

The F1 score punishes models that are strong on one side and weak on the other. It is useful when both precision and recall matter.
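Because F1 is a harmonic mean, a low value on either side drags the score down. A minimal sketch using the exact fractions from the example:

```python
# Exact precision and recall from the example (4/6 and 4/5).
precision, recall = 4 / 6, 4 / 5

# Harmonic mean: dominated by the weaker of the two.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.727
```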

F2 Score: When Recall Matters More Than Precision

F_2 = \frac{(1 + 2^2)PR}{2^2 P + R} = \frac{5 \times 0.667 \times 0.80}{4 \times 0.667 + 0.80} \approx 0.769

F2 gives more weight to recall. That makes it a better fit for domains where missing a positive sample is substantially worse than flagging a false one.
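F2 is one point on the general F-beta family, where beta controls how much recall outweighs precision. A sketch of that general formula (the `f_beta` helper is illustrative, not from any particular library):

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 4 / 6, 4 / 5
print(round(f_beta(p, r, 1), 3))  # 0.727  (F1)
print(round(f_beta(p, r, 2), 3))  # 0.769  (F2, recall-weighted)
```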

All Metrics in One View

| Metric    | Value | Best For                       |
|-----------|-------|--------------------------------|
| Accuracy  | 0.88  | Balanced datasets              |
| Precision | 0.667 | Reducing false positives       |
| Recall    | 0.80  | Reducing false negatives       |
| F1        | 0.727 | Balancing precision and recall |
| F2        | 0.769 | Recall-heavy decision making   |
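In practice you would not hand-compute these. A sketch cross-checking every number with scikit-learn (assumes scikit-learn is installed; the label lists are hypothetical, built to reproduce TP = 4, FP = 2, FN = 1, TN = 18):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, fbeta_score)

# Hypothetical labels matching the example's confusion matrix.
y_true = [1] * 5 + [0] * 20
y_pred = [1, 1, 1, 1, 0] + [1, 1] + [0] * 18

print(accuracy_score(y_true, y_pred))                 # 0.88
print(round(precision_score(y_true, y_pred), 3))      # 0.667
print(recall_score(y_true, y_pred))                   # 0.8
print(round(f1_score(y_true, y_pred), 3))             # 0.727
print(round(fbeta_score(y_true, y_pred, beta=2), 3))  # 0.769
```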

Key Takeaways

  • Accuracy is not wrong: it's just incomplete for imbalanced problems.
  • Precision vs Recall is a tradeoff: improving one often hurts the other.
  • F1 is your default: when you need a single score that reflects both.
  • Use the confusion matrix first: TP, FP, FN, TN tell the full story before any derived metric.

Want to Go Deeper on ML Evaluation?

Explore ROC curves, AUC, cross-validation, and more in our comprehensive Machine Learning course.
