A model can report 95% accuracy and still be unusable.
That sounds dramatic until you look at an imbalanced medical screening task. If only a few patients are positive, predicting "healthy" for everyone can still produce a high accuracy score while missing every case that matters.
## Why 95% Accuracy Can Still Mean a Useless Model
If 95 out of 100 patients are healthy, a model that always predicts "healthy" gets 95% accuracy — while catching zero actual disease cases. Accuracy is blind to class imbalance.
## A Cancer Screening Example
Suppose a classifier is tested on 25 patients:
- Actual positive: 5 patients
- Actual negative: 20 patients
The model produces:
- True Positive (TP) = 4
- False Positive (FP) = 2
- False Negative (FN) = 1
- True Negative (TN) = 18
|  | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP = 4 | FN = 1 |
| Actual Negative | FP = 2 | TN = 18 |
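These four counts can be recovered from label lists with a few lines of plain Python. The `y_true` and `y_pred` lists below are hypothetical, constructed only to match the 25-patient example above (1 = positive, 0 = negative):

```python
# Hypothetical labels matching the example: 5 positives, 20 negatives.
y_true = [1] * 5 + [0] * 20
# Predictions chosen to yield 4 TP, 1 FN, 2 FP, 18 TN.
y_pred = [1] * 4 + [0] * 1 + [1] * 2 + [0] * 18

# Tally each cell of the confusion matrix.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(tp, fp, fn, tn)  # 4 2 1 18
```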
## Accuracy: The First Number People Quote

Accuracy = (TP + TN) / Total = (4 + 18) / 25 = 0.88. An 88% accurate model sounds good.
But accuracy answers only one question: "How often is the model correct overall?" It says nothing about whether the model is good at catching the rare class.
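A quick check of the arithmetic, using the counts from the confusion matrix above:

```python
# Counts from the confusion matrix above.
tp, fp, fn, tn = 4, 2, 1, 18

# Accuracy: fraction of all predictions that are correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(round(accuracy, 2))  # 0.88
```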
## Precision: If the Model Says Positive, Can You Trust It?

Precision measures false alarms: Precision = TP / (TP + FP) = 4 / (4 + 2) ≈ 0.667. When this model predicts positive, it is correct about 66.7% of the time.
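In code, using the same counts:

```python
tp, fp = 4, 2

# Precision: of all positive predictions, how many were right?
precision = tp / (tp + fp)  # 4 / 6
print(round(precision, 3))  # 0.667
```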
## Recall: How Many Real Cases Did We Catch?

Recall measures missed cases: Recall = TP / (TP + FN) = 4 / (4 + 1) = 0.80. The model catches 80% of the truly positive patients.
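And the corresponding calculation:

```python
tp, fn = 4, 1

# Recall: of all actual positives, how many did we catch?
recall = tp / (tp + fn)  # 4 / 5
print(round(recall, 2))  # 0.8
```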
## When Recall Matters Most
In cancer screening, fraud detection, or safety monitoring, recall is often the number people care about most. A false alarm is annoying. A missed case can be fatal.
## F1 Score: Balancing Precision and Recall

The F1 score is the harmonic mean of precision and recall: F1 = 2 × Precision × Recall / (Precision + Recall) ≈ 0.727. It punishes models that are strong on one side and weak on the other, which makes it useful when both precision and recall matter.
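Plugging in the precision and recall computed above:

```python
precision, recall = 4 / 6, 4 / 5

# Harmonic mean: low if either precision or recall is low.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.727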
## F2 Score: When Recall Matters More Than Precision

F2 is the F-beta score with β = 2, which weights recall more heavily than precision: F2 = 5 × Precision × Recall / (4 × Precision + Recall) ≈ 0.769. That makes it a better fit for domains where missing a positive case is substantially worse than raising a false alarm.
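The general F-beta formula covers both scores; a minimal sketch:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: recall is weighted beta times as heavily as precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 4 / 6, 4 / 5
print(round(f_beta(precision, recall, beta=2), 3))  # 0.769
print(round(f_beta(precision, recall, beta=1), 3))  # 0.727 (F1 as a special case)
```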
## All Metrics in One View
| Metric | Value | Best For |
|---|---|---|
| Accuracy | 0.88 | Balanced datasets |
| Precision | 0.667 | Reducing false positives |
| Recall | 0.80 | Reducing false negatives |
| F1 | 0.727 | Balancing precision and recall |
| F2 | 0.769 | Recall-heavy decision making |
## Key Takeaways
- Accuracy is not wrong: it's just incomplete for imbalanced problems.
- Precision vs Recall is a tradeoff: improving one often hurts the other.
- F1 is your default: when you need a single score that reflects both.
- Use the confusion matrix first: TP, FP, FN, TN tell the full story before any derived metric.