This is the kind of case study that reminds you that machine learning progress rarely comes from one magic setting.
The task was straightforward on paper: predict heart disease from 13 clinical features. The evaluation metric was ROC-AUC. But the path from a decent baseline to a strong result came from a chain of practical decisions, not one dramatic trick.
Why the Problem Was Harder Than It Looked
- Several features were categorical, so naive numeric handling could distort relationships.
- Clinical variables interact in nonlinear ways.
- Small gains in AUC are difficult once the baseline is already competent.
The Three Levers That Moved the Score
Feature Engineering With Medical Logic
Good feature engineering was not about randomly manufacturing columns. It was about asking whether a transformation had a plausible medical or statistical interpretation.
- Ratio features: blood pressure relative to age, heart rate relative to age
- Interaction features: chest pain with age, vessels with age
- Binned features: age groups, heart-rate groups
More features don't always help — but meaningful extra views of the same patient profile can help tree ensembles split better.
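The feature families above can be sketched in a few lines of pandas. This is a minimal illustration, not the author's actual pipeline: the column names (`age`, `trestbps`, `thalach`, `cp`, `ca`) follow the common heart-disease dataset convention and are assumptions here, as are the bin edges.

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add ratio, interaction, and binned views of the raw columns."""
    out = df.copy()
    # Ratio features: clinical measurements relative to age
    out["bp_per_age"] = out["trestbps"] / out["age"]
    out["hr_per_age"] = out["thalach"] / out["age"]
    # Interaction features: category-like severity scaled by age
    out["cp_x_age"] = out["cp"] * out["age"]
    out["ca_x_age"] = out["ca"] * out["age"]
    # Binned features: coarse groups that trees can split on directly
    out["age_group"] = pd.cut(out["age"], bins=[0, 40, 55, 70, 120], labels=False)
    out["hr_group"] = pd.cut(out["thalach"], bins=[0, 120, 150, 250], labels=False)
    return out
```

Each new column is a different view of the same patient profile, which is exactly what gives a tree ensemble more useful split candidates.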
Moving Toward CatBoost
A big improvement came from using a model that handles categorical structure more naturally. CatBoost reduced the amount of brittle manual encoding and gave the model a cleaner way to exploit mixed data types.
When your tabular dataset includes a meaningful share of category-like features, model choice becomes part of preprocessing strategy.
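For contrast, here is a sketch of the brittle manual route that CatBoost makes unnecessary. The column names are hypothetical; the point is that one-hot encoding bakes the observed category set into the feature matrix, while CatBoost accepts the raw columns via its `cat_features` argument and encodes them internally.

```python
import pandas as pd

# Manual route: one-hot encode category-like columns before a generic model.
# Brittle because the resulting column set depends on which categories
# happen to appear in the training split.
df = pd.DataFrame({"cp": [0, 1, 2], "thal": ["normal", "fixed", "normal"]})
encoded = pd.get_dummies(df, columns=["cp", "thal"])

# CatBoost-style route (sketch): keep the columns as-is and declare them,
# e.g. CatBoostClassifier(cat_features=["cp", "thal"]).
```

With the manual route, a category seen only at inference time silently misaligns the columns; letting the model own the encoding removes that failure mode.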
Multi-Seed, Multi-Fold Ensembling
The final setup used 5 random seeds × 10 folds, averaging 50 submodels.
Different folds see slightly different validation landscapes, and different seeds change the training trajectory. Averaging those models reduces variance and produces smoother final predictions.
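The seeds-times-folds loop is simple to express. This sketch uses scikit-learn with `LogisticRegression` standing in for CatBoost and synthetic data standing in for the competition set; the structure — 5 seeds × 10 folds, averaging 50 test-set predictions — is the part that transfers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_test = X[:50]  # stand-in for the real test set

SEEDS, FOLDS = 5, 10
test_preds = []
for seed in range(SEEDS):
    skf = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=seed)
    for train_idx, _ in skf.split(X, y):
        # Each (seed, fold) pair trains on a slightly different subset
        model = LogisticRegression(max_iter=1000, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        test_preds.append(model.predict_proba(X_test)[:, 1])

# Final prediction: simple mean over all 50 submodels
final = np.mean(test_preds, axis=0)
```

Averaging predicted probabilities (rather than hard labels) is what smooths the final ranking that AUC rewards.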
Why Early Stopping Helped More Than It Seems
Early stopping creates hidden training diversity.
One fold may stop at 1,800 iterations. Another at 2,700. Those are not the same model, even if the hyperparameters are nominally identical. That hidden diversity makes the ensemble more robust — and it's free.
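You can observe this effect directly. The sketch below uses `GradientBoostingClassifier` as a stand-in, since its `n_iter_no_change` parameter enables early stopping on an internal validation split, analogous to CatBoost's `early_stopping_rounds` with an eval set; after fitting, `n_estimators_` records where training actually stopped for each fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=400, n_features=13, random_state=0)

stopped_at = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in skf.split(X, y):
    model = GradientBoostingClassifier(
        n_estimators=500, learning_rate=0.05,
        n_iter_no_change=10, validation_fraction=0.2, random_state=0,
    )
    model.fit(X[train_idx], y[train_idx])
    stopped_at.append(model.n_estimators_)  # where early stopping cut training

# stopped_at typically differs fold to fold: nominally identical
# hyperparameters, genuinely different models.
```

That per-fold variation in stopping point is the "free" diversity the ensemble inherits.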
A Representative Parameter Direction
loss_function = "Logloss"
eval_metric = "AUC"
learning_rate = 0.02
depth = 6
l2_leaf_reg = 5
iterations = 4000
bootstrap_type = "Bernoulli"
subsample = 0.8
early_stopping_rounds = 150
Notice the pattern: lower learning rate, higher iteration cap, then let early stopping decide where useful training actually ends.
The Transferable Workflow
1. Start with a baseline you can explain.
2. Identify the specific bottleneck: encoding, variance, capacity, or validation instability.
3. Choose the next change for a reason, not because it is fashionable.
4. Verify whether the gain survives cross-validation, not just a lucky split.
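Step 4 is mechanical with scikit-learn's `cross_val_score`: score a candidate change under the same CV scheme as the baseline and compare mean AUC, not a single split. Synthetic data and `LogisticRegression` are placeholders here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Evaluate under the same CV scheme used for the baseline; a gain that
# only appears on one fold is noise, not signal.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc")
mean_auc, std_auc = scores.mean(), scores.std()
```

Reporting the fold standard deviation alongside the mean makes it obvious when an apparent improvement sits inside the noise band.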
That mindset is more transferable than any one leaderboard trick. In practical machine learning, the best improvements often come from a series of small, well-justified decisions that compound.