This is the kind of case study that reminds you that machine learning progress rarely comes from one magic setting.
The task was straightforward on paper: predict heart disease from 13 clinical features. The evaluation metric was ROC-AUC. But the path from a decent baseline to a strong result came from a chain of practical decisions, not one dramatic trick.
Why the Problem Was Harder Than It Looked
- Several features were categorical, so naive numeric handling could distort relationships.
- Clinical variables interact in nonlinear ways.
- Small gains in AUC are difficult once the baseline is already competent.
The Three Levers That Moved the Score
Feature Engineering With Medical Logic
Good feature engineering was not about randomly manufacturing columns. It was about asking whether a transformation had a plausible medical or statistical interpretation.
- Ratio features: blood pressure relative to age, heart rate relative to age
- Interaction features: chest pain with age, vessels with age
- Binned features: age groups, heart-rate groups
More features don't always help — but meaningful extra views of the same patient profile can help tree ensembles split better.
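The feature families above can be sketched in a few lines of pandas. This is a minimal illustration, not the author's actual pipeline: the column names (`age`, `trestbps`, `thalach`, `cp`, `ca`) follow the common heart-disease dataset convention and are assumptions here, as are the bin edges.

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add ratio, interaction, and binned views of the raw columns."""
    out = df.copy()
    # Ratio features: clinical measurements relative to age
    out["bp_per_age"] = out["trestbps"] / out["age"]
    out["hr_per_age"] = out["thalach"] / out["age"]
    # Interaction features: category-like severity scaled by age
    out["cp_x_age"] = out["cp"] * out["age"]
    out["ca_x_age"] = out["ca"] * out["age"]
    # Binned features: coarse groups that trees can split on directly
    out["age_group"] = pd.cut(out["age"], bins=[0, 40, 55, 70, 120], labels=False)
    out["hr_group"] = pd.cut(out["thalach"], bins=[0, 120, 150, 250], labels=False)
    return out
```

Each new column is a different view of the same patient profile, which is exactly what gives a tree ensemble more useful split candidates.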
Moving Toward CatBoost
A big improvement came from using a model that handles categorical structure more naturally. CatBoost reduced the amount of brittle manual encoding and gave the model a cleaner way to exploit mixed data types.
When your tabular dataset includes a meaningful share of category-like features, model choice becomes part of preprocessing strategy.
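For contrast, here is a sketch of the brittle manual route that CatBoost makes unnecessary. The column names are hypothetical; the point is that one-hot encoding bakes the observed category set into the feature matrix, while CatBoost accepts the raw columns via its `cat_features` argument and encodes them internally.

```python
import pandas as pd

# Manual route: one-hot encode category-like columns before a generic model.
# Brittle because the resulting column set depends on which categories
# happen to appear in the training split.
df = pd.DataFrame({"cp": [0, 1, 2], "thal": ["normal", "fixed", "normal"]})
encoded = pd.get_dummies(df, columns=["cp", "thal"])

# CatBoost-style route (sketch): keep the columns as-is and declare them,
# e.g. CatBoostClassifier(cat_features=["cp", "thal"]).
```

With the manual route, a category seen only at inference time silently misaligns the columns; letting the model own the encoding removes that failure mode.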
Multi-Seed, Multi-Fold Ensembling
The final setup used 5 random seeds × 10 folds, averaging 50 submodels.
Different folds see slightly different validation landscapes, and different seeds change the training trajectory. Averaging those models reduces variance and produces smoother final predictions.
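The seeds-times-folds loop is simple to express. This sketch uses scikit-learn with `LogisticRegression` standing in for CatBoost and synthetic data standing in for the competition set; the structure — 5 seeds × 10 folds, averaging 50 test-set predictions — is the part that transfers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_test = X[:50]  # stand-in for the real test set

SEEDS, FOLDS = 5, 10
test_preds = []
for seed in range(SEEDS):
    skf = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=seed)
    for train_idx, _ in skf.split(X, y):
        # Each (seed, fold) pair trains on a slightly different subset
        model = LogisticRegression(max_iter=1000, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        test_preds.append(model.predict_proba(X_test)[:, 1])

# Final prediction: simple mean over all 50 submodels
final = np.mean(test_preds, axis=0)
```

Averaging predicted probabilities (rather than hard labels) is what smooths the final ranking that AUC rewards.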
Why Early Stopping Helped More Than It Seems
Early stopping creates hidden training diversity.
One fold may stop at 1,800 iterations. Another at 2,700. Those are not the same model, even if the hyperparameters are nominally identical. That hidden diversity makes the ensemble more robust — and it's free.
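You can observe this effect directly. The sketch below uses `GradientBoostingClassifier` as a stand-in, since its `n_iter_no_change` parameter enables early stopping on an internal validation split, analogous to CatBoost's `early_stopping_rounds` with an eval set; after fitting, `n_estimators_` records where training actually stopped for each fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=400, n_features=13, random_state=0)

stopped_at = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in skf.split(X, y):
    model = GradientBoostingClassifier(
        n_estimators=500, learning_rate=0.05,
        n_iter_no_change=10, validation_fraction=0.2, random_state=0,
    )
    model.fit(X[train_idx], y[train_idx])
    stopped_at.append(model.n_estimators_)  # where early stopping cut training

# stopped_at typically differs fold to fold: nominally identical
# hyperparameters, genuinely different models.
```

That per-fold variation in stopping point is the "free" diversity the ensemble inherits.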
A Representative Parameter Direction
loss_function = "Logloss"
eval_metric = "AUC"
learning_rate = 0.02
depth = 6
l2_leaf_reg = 5
iterations = 4000
bootstrap_type = "Bernoulli"
subsample = 0.8
early_stopping_rounds = 150
Notice the pattern: lower learning rate, higher iteration cap, then let early stopping decide where useful training actually ends.
The Transferable Workflow
1. Start with a baseline you can explain.
2. Identify the specific bottleneck: encoding, variance, capacity, or validation instability.
3. Choose the next change for a reason, not because it is fashionable.
4. Verify whether the gain survives cross-validation, not just a lucky split.
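Step 4 is mechanical with scikit-learn's `cross_val_score`: score a candidate change under the same CV scheme as the baseline and compare mean AUC, not a single split. Synthetic data and `LogisticRegression` are placeholders here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Evaluate under the same CV scheme used for the baseline; a gain that
# only appears on one fold is noise, not signal.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="roc_auc")
mean_auc, std_auc = scores.mean(), scores.std()
```

Reporting the fold standard deviation alongside the mean makes it obvious when an apparent improvement sits inside the noise band.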
That mindset is more transferable than any one leaderboard trick. In practical machine learning, the best improvements often come from a series of small, well-justified decisions that compound.