Most beginner machine learning bugs are not model bugs. They are data preprocessing bugs.
People love to jump from read_csv() straight to model.fit(). That feels productive for about thirty seconds. Then the errors start: missing values, strange categories, wildly different scales, leakage from the test set, or a result that somehow looks "too good to be true."
Think Like a Doctor Before You Think Like a Model
A good preprocessing workflow looks less like "feature engineering magic" and more like a medical checkup: inspect, clean, encode, scale only when needed, then split in the right order.
The 5 Steps
Explore Before You Touch Anything
Before cleaning, ask four basic questions:
- What columns do I have?
- Which are numeric, categorical, datetime, or IDs?
- Where are the missing values?
- Do any values look impossible or suspicious?
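A few lines of pandas answer all four questions at once. This is a minimal sketch on a made-up toy frame; the column names (`user_id`, `age`, `plan`, `signup`) are placeholders for whatever your CSV actually contains:

```python
import pandas as pd

# Hypothetical toy dataset standing in for whatever read_csv() gives you.
df = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age": [34, None, 29, 41],
    "plan": ["free", "pro", "free", "pro"],
    "signup": pd.to_datetime(["2023-01-05", "2023-02-11",
                              "2023-02-12", "2023-03-01"]),
})

print(df.dtypes)                   # which columns are numeric, categorical, datetime, IDs?
print(df.isna().sum())             # where are the missing values?
print(df.describe(include="all"))  # anything impossible or suspicious?
```

Note that `user_id` shows up as numeric even though it is an identifier; dtypes tell you what pandas inferred, not what the column means.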
Clean Without Pretending It Is Perfect
- Missing values: often contain signal — absence may mean something.
- Duplicates: can make a model look better if copies land in both train and test.
- Impossible values: negative age or 900 logins in a minute should trigger investigation.
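A sketch of that triage, with made-up column names and thresholds. The key habit is to flag suspicious rows before touching them, and to mark impossible values as missing rather than silently dropping them:

```python
import pandas as pd

# Hypothetical frame with exact duplicate rows and an impossible negative age.
df = pd.DataFrame({
    "age": [34, 34, -2, 29],
    "logins": [3, 3, 5, 900],
})

# Exact copies can land in both train and test and inflate scores.
df = df.drop_duplicates()

# Flag suspicious rows for investigation instead of deleting them outright.
suspicious = df[(df["age"] < 0) | (df["logins"] > 100)]

# Impossible values become missing, to be handled by an explainable rule later.
df["age"] = df["age"].mask(df["age"] < 0)
```

The `logins > 100` cutoff is an arbitrary illustration; in practice the threshold comes from domain knowledge, not from the code.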
Encode Features the Model Can Actually Read
- Ordinal features: preserve order if the categories have a real ranking.
- Nominal features: use one-hot encoding when order does not exist.
- High-cardinality categories: naive one-hot can explode dimensionality.
Rule: encoding should match the meaning of the feature, not just the convenience of the library.
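One way to keep encoding aligned with meaning is to spell the ordinal ranking out explicitly and reserve one-hot for genuinely unordered categories. A minimal pandas sketch, with invented `size` and `color` columns:

```python
import pandas as pd

# Hypothetical features: one ordinal column, one nominal column.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],   # has a real ranking
    "color": ["red", "blue", "red", "green"],        # no inherent order
})

# Ordinal: map categories to their rank explicitly, so the order is documented.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# Nominal: one-hot encode, one indicator column per category.
df = pd.get_dummies(df, columns=["color"])
```

With many distinct colors, `get_dummies` would produce one column per value, which is exactly the dimensionality explosion the high-cardinality warning refers to.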
Scale Only When Scale Matters
Not every algorithm cares about feature scale:
- Needs scaling: kNN, SVM, logistic regression, neural networks, PCA
- Usually doesn't: decision trees, random forests, gradient boosting trees
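For the algorithms that do care, standardization puts every feature on the same footing. A small sketch with invented numbers (age in years next to income in dollars) showing why distance-based models need this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on wildly different scales: age vs. income.
X = np.array([[25, 40_000],
              [35, 90_000],
              [45, 60_000]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Each column now has mean ~0 and unit variance, so a distance-based
# model like kNN no longer lets the income column dominate.
```

Before scaling, any Euclidean distance between rows is driven almost entirely by income, because its raw values are a thousand times larger than age.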
Split Data at the Right Time
If you normalize, impute, or select features using the full dataset before the train-test split, the model has already "peeked" at the test set. The reported performance will be artificially optimistic.
The Leakage Trap: Correct Code Order
Data Leakage: The Silent Model Poison
Fitting a scaler on the full dataset before splitting means test-set statistics "leaked" into training. Always split first, then fit preprocessors on the training set only.
```python
# Correct order — no leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)
scaler.fit(X_train)  # fit on train only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
A Practical Preprocessing Checklist
- Run head, info, describe, and missing-value counts.
- Remove obvious duplicates and impossible values.
- Impute missing values with a rule you can explain.
- Encode categories according to meaning, not habit.
- Split train and test before fitting preprocessors.
- Scale only if the algorithm depends on geometry or gradients.
- Wrap the full preprocessing path in a pipeline if possible.
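The checklist above can be sketched as a single scikit-learn pipeline. Everything here is illustrative: the column names, the toy data, and the choice of median imputation and logistic regression are assumptions, not the only valid setup. What the pipeline guarantees is the ordering: `fit()` learns imputation statistics, scaling statistics, and category lists from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values and a categorical column.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 50, 38, 27],
    "income": [40_000, 90_000, 60_000, np.nan, 52_000, 80_000, 70_000, 45_000],
    "plan":   ["free", "pro", "free", "pro", "free", "pro", "pro", "free"],
})
y = [0, 1, 0, 1, 0, 1, 1, 0]

# Split first, so every preprocessor below sees training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0
)

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X_train, y_train)          # all statistics learned from train only
preds = model.predict(X_test)        # test set is transformed, never fitted on
```

Because the preprocessors live inside the pipeline, there is no way to accidentally fit them on the full dataset: the structure enforces the split order from the previous section.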
Preprocessing rarely gets the headline, but it decides whether the model is solving the real problem or just reacting to a messy spreadsheet.