MathIsimple

Confusion Matrix, Precision, Recall, and F1: A Practical Medical Screening Guide

Why 95% accuracy can still mean a useless model

Evaluation Metrics
Confusion Matrix
Precision
Recall
F1 Score

A model can report 95% accuracy and still be unusable.

That sounds dramatic until you look at an imbalanced medical screening task. If only a few patients are positive, predicting "healthy" for everyone can still produce a high accuracy score while missing every case that matters.

Why 95% Accuracy Can Still Mean a Useless Model

If 95 out of 100 patients are healthy, a model that always predicts "healthy" gets 95% accuracy — while catching zero actual disease cases. Accuracy is blind to class imbalance.

A Cancer Screening Example

Suppose a classifier is tested on 25 patients:

  • Actual positive: 5 patients
  • Actual negative: 20 patients

The model produces:

  • True Positive (TP) = 4
  • False Positive (FP) = 2
  • False Negative (FN) = 1
  • True Negative (TN) = 18
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| Actual Positive | TP = 4             | FN = 1             |
| Actual Negative | FP = 2             | TN = 18            |
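These four counts can be tallied directly from label lists. As a minimal sketch, the lists below are hypothetical, constructed only to reproduce the example's counts (1 = positive, 0 = negative):

```python
# Hypothetical labels matching the 25-patient example: 5 positive, 20 negative.
y_true = [1] * 5 + [0] * 20
# Predictions: 4 positives caught, 1 missed, 2 false alarms among the negatives.
y_pred = [1, 1, 1, 1, 0] + [1, 1] + [0] * 18

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(tp, fp, fn, tn)  # 4 2 1 18
```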

Accuracy: The First Number People Quote

Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = \frac{4 + 18}{25} = \frac{22}{25} = 0.88

The model is 88% accurate. That sounds good.

But accuracy answers only one question: "How often is the model correct overall?" It does not tell you whether the model is good at catching the rare class.
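The accuracy formula maps directly to code. A minimal sketch, plugging in the counts from the example:

```python
# Confusion-matrix counts from the screening example.
tp, fp, fn, tn = 4, 2, 1, 18

# Accuracy: fraction of all predictions that are correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 2))  # 0.88
```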

Precision: If the Model Says Positive, Can You Trust It?

Precision = \frac{TP}{TP + FP} = \frac{4}{4 + 2} = \frac{4}{6} \approx 0.667

Precision is about false alarms. Here, when the model predicts positive, it is correct about 66.7% of the time.
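In code, precision only looks at the predicted-positive column. A minimal sketch with the example's counts:

```python
# Of the 6 patients flagged positive, 4 actually are.
tp, fp = 4, 2

precision = tp / (tp + fp)
print(round(precision, 3))  # 0.667
```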

Recall: How Many Real Cases Did We Catch?

Recall = \frac{TP}{TP + FN} = \frac{4}{4 + 1} = \frac{4}{5} = 0.80

Recall is about missed cases. The model catches 80% of the truly positive patients.
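Recall, by contrast, only looks at the actual-positive row. The same sketch for the example's counts:

```python
# Of the 5 truly positive patients, 4 were caught and 1 was missed.
tp, fn = 4, 1

recall = tp / (tp + fn)
print(recall)  # 0.8
```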

When Recall Matters Most

In cancer screening, fraud detection, or safety monitoring, recall is often the number people care about most. A false alarm is annoying. A missed case can be fatal.

F1 Score: Balancing Precision and Recall

F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.667 \times 0.80}{0.667 + 0.80} \approx 0.727

The F1 score punishes models that are strong on one side and weak on the other. It is useful when both precision and recall matter.
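Because F1 is a harmonic mean, a low value on either side drags the score down. A minimal sketch using the exact fractions from the example:

```python
# Exact precision and recall from the example (4/6 and 4/5).
precision, recall = 4 / 6, 4 / 5

# Harmonic mean: dominated by the weaker of the two.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.727
```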

F2 Score: When Recall Matters More Than Precision

F_2 = \frac{(1 + 2^2)PR}{2^2 P + R} = \frac{5 \times 0.667 \times 0.80}{4 \times 0.667 + 0.80} \approx 0.769

F2 gives more weight to recall. That makes it a better fit for domains where missing a positive sample is substantially worse than flagging a false one.
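F2 is one point on the general F-beta family, where beta controls how much recall outweighs precision. A sketch of that general formula (the `f_beta` helper is illustrative, not from any particular library):

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 4 / 6, 4 / 5
print(round(f_beta(p, r, 1), 3))  # 0.727  (F1)
print(round(f_beta(p, r, 2), 3))  # 0.769  (F2, recall-weighted)
```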

All Metrics in One View

| Metric    | Value | Best For                       |
|-----------|-------|--------------------------------|
| Accuracy  | 0.88  | Balanced datasets              |
| Precision | 0.667 | Reducing false positives       |
| Recall    | 0.80  | Reducing false negatives       |
| F1        | 0.727 | Balancing precision and recall |
| F2        | 0.769 | Recall-heavy decision making   |
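In practice you would not hand-compute these. A sketch cross-checking every number with scikit-learn (assumes scikit-learn is installed; the label lists are hypothetical, built to reproduce TP = 4, FP = 2, FN = 1, TN = 18):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, fbeta_score)

# Hypothetical labels matching the example's confusion matrix.
y_true = [1] * 5 + [0] * 20
y_pred = [1, 1, 1, 1, 0] + [1, 1] + [0] * 18

print(accuracy_score(y_true, y_pred))                 # 0.88
print(round(precision_score(y_true, y_pred), 3))      # 0.667
print(recall_score(y_true, y_pred))                   # 0.8
print(round(f1_score(y_true, y_pred), 3))             # 0.727
print(round(fbeta_score(y_true, y_pred, beta=2), 3))  # 0.769
```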

Key Takeaways

  • Accuracy is not wrong: it's just incomplete for imbalanced problems.
  • Precision vs Recall is a tradeoff: improving one often hurts the other.
  • F1 is your default: when you need a single score that reflects both.
  • Use the confusion matrix first: TP, FP, FN, TN tell the full story before any derived metric.

Want to Go Deeper on ML Evaluation?

Explore ROC curves, AUC, cross-validation, and more in our comprehensive Machine Learning course.
