Most beginner machine learning bugs are not model bugs. They are data preprocessing bugs.
People love to jump from read_csv() straight to model.fit(). That feels productive for about thirty seconds. Then the errors start: missing values, strange categories, wildly different scales, leakage from the test set, or a result that somehow looks "too good to be true."
Think Like a Doctor Before You Think Like a Model
A good preprocessing workflow looks less like "feature engineering magic" and more like a medical checkup: inspect, clean, encode, scale only when needed, then split in the right order.
The 5 Steps
Explore Before You Touch Anything
Before cleaning, ask four basic questions:
- What columns do I have?
- Which are numeric, categorical, datetime, or IDs?
- Where are the missing values?
- Do any values look impossible or suspicious?
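A few lines of pandas answer all four questions at once. This is a minimal sketch on a made-up toy frame; the column names (`user_id`, `age`, `plan`, `signup`) are placeholders for whatever your CSV actually contains:

```python
import pandas as pd

# Hypothetical toy dataset standing in for whatever read_csv() gives you.
df = pd.DataFrame({
    "user_id": [101, 102, 103, 104],
    "age": [34, None, 29, 41],
    "plan": ["free", "pro", "free", "pro"],
    "signup": pd.to_datetime(["2023-01-05", "2023-02-11",
                              "2023-02-12", "2023-03-01"]),
})

print(df.dtypes)                   # which columns are numeric, categorical, datetime, IDs?
print(df.isna().sum())             # where are the missing values?
print(df.describe(include="all"))  # anything impossible or suspicious?
```

Note that `user_id` shows up as numeric even though it is an identifier; dtypes tell you what pandas inferred, not what the column means.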
Clean Without Pretending It Is Perfect
- Missing values: often contain signal — absence may mean something.
- Duplicates: can make a model look better if copies land in both train and test.
- Impossible values: negative age or 900 logins in a minute should trigger investigation.
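A sketch of that triage, with made-up column names and thresholds. The key habit is to flag suspicious rows before touching them, and to mark impossible values as missing rather than silently dropping them:

```python
import pandas as pd

# Hypothetical frame with exact duplicate rows and an impossible negative age.
df = pd.DataFrame({
    "age": [34, 34, -2, 29],
    "logins": [3, 3, 5, 900],
})

# Exact copies can land in both train and test and inflate scores.
df = df.drop_duplicates()

# Flag suspicious rows for investigation instead of deleting them outright.
suspicious = df[(df["age"] < 0) | (df["logins"] > 100)]

# Impossible values become missing, to be handled by an explainable rule later.
df["age"] = df["age"].mask(df["age"] < 0)
```

The `logins > 100` cutoff is an arbitrary illustration; in practice the threshold comes from domain knowledge, not from the code.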
Encode Features the Model Can Actually Read
- Ordinal features: preserve order if the categories have a real ranking.
- Nominal features: use one-hot encoding when order does not exist.
- High-cardinality categories: naive one-hot can explode dimensionality.
Rule: encoding should match the meaning of the feature, not just the convenience of the library.
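One way to keep encoding aligned with meaning is to spell the ordinal ranking out explicitly and reserve one-hot for genuinely unordered categories. A minimal pandas sketch, with invented `size` and `color` columns:

```python
import pandas as pd

# Hypothetical features: one ordinal column, one nominal column.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],   # has a real ranking
    "color": ["red", "blue", "red", "green"],        # no inherent order
})

# Ordinal: map categories to their rank explicitly, so the order is documented.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# Nominal: one-hot encode, one indicator column per category.
df = pd.get_dummies(df, columns=["color"])
```

With many distinct colors, `get_dummies` would produce one column per value, which is exactly the dimensionality explosion the high-cardinality warning refers to.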
Scale Only When Scale Matters
Not every algorithm cares about feature scale:
- Needs scaling: kNN, SVM, logistic regression, neural networks, PCA
- Usually doesn't: decision trees, random forests, gradient boosting trees
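For the algorithms that do care, standardization puts every feature on the same footing. A small sketch with invented numbers (age in years next to income in dollars) showing why distance-based models need this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on wildly different scales: age vs. income.
X = np.array([[25, 40_000],
              [35, 90_000],
              [45, 60_000]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Each column now has mean ~0 and unit variance, so a distance-based
# model like kNN no longer lets the income column dominate.
```

Before scaling, any Euclidean distance between rows is driven almost entirely by income, because its raw values are a thousand times larger than age.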
Split Data at the Right Time
If you normalize, impute, or select features using the full dataset before the train-test split, the model has already "peeked" at the test set. The reported performance will be artificially optimistic.
The Leakage Trap: Correct Code Order
Data Leakage: The Silent Model Poison
Fitting a scaler on the full dataset before splitting means test-set statistics "leaked" into training. Always split first, then fit preprocessors on the training set only.
```python
# Correct order — no leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)
scaler.fit(X_train)  # fit on train only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
A Practical Preprocessing Checklist
- Run head, info, describe, and missing-value counts.
- Remove obvious duplicates and impossible values.
- Impute missing values with a rule you can explain.
- Encode categories according to meaning, not habit.
- Split train and test before fitting preprocessors.
- Scale only if the algorithm depends on geometry or gradients.
- Wrap the full preprocessing path in a pipeline if possible.
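The checklist above can be sketched as a single scikit-learn pipeline. Everything here is illustrative: the column names, the toy data, and the choice of median imputation and logistic regression are assumptions, not the only valid setup. What the pipeline guarantees is the ordering: `fit()` learns imputation statistics, scaling statistics, and category lists from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values and a categorical column.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 50, 38, 27],
    "income": [40_000, 90_000, 60_000, np.nan, 52_000, 80_000, 70_000, 45_000],
    "plan":   ["free", "pro", "free", "pro", "free", "pro", "pro", "free"],
})
y = [0, 1, 0, 1, 0, 1, 1, 0]

# Split first, so every preprocessor below sees training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0
)

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X_train, y_train)          # all statistics learned from train only
preds = model.predict(X_test)        # test set is transformed, never fitted on
```

Because the preprocessors live inside the pipeline, there is no way to accidentally fit them on the full dataset: the structure enforces the split order from the previous section.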
Preprocessing rarely gets the headline, but it decides whether the model is solving the real problem or just reacting to a messy spreadsheet.