MathIsimple

Why Feature Scaling Matters: Three Cases Where the Same Data Gives Opposite Results

How unscaled features can flip kNN, distort PCA, and stall neural networks

Tags: Feature Scaling · Standardization · kNN · PCA · Neural Networks

The same dataset can produce completely different results depending on whether you scale the features.

That is not a theoretical edge case. It happens every day with k-nearest neighbors, PCA, and neural networks.

A Tiny Dataset With Two Very Different Units

Consider a churn prediction toy dataset with two features:

  • X₁: spending in the last 30 days (values around $9,000)
  • X₂: login count in the last 30 days (values 1–18)
| User | Spend ($) | Logins | Label  |
|------|-----------|--------|--------|
| A    | 9,100     | 1      | Churn  |
| B    | 9,500     | 18     | Retain |
| C    | 9,200     | 17     | ?      |

Intuitively, user C behaves much more like user B: their login counts are almost identical. But the raw spending values are roughly 500× larger than the login counts, so spending dominates any distance computation.

Failure 1: kNN Can Flip the Prediction

Without Scaling — Wrong Nearest Neighbor

Euclidean distance is dominated by the spending column, whose differences are measured in hundreds of dollars, while login differences are at most 17:

d(C, A) = \sqrt{(9200 - 9100)^2 + (17 - 1)^2} \approx 101.3

d(C, B) = \sqrt{(9200 - 9500)^2 + (17 - 18)^2} \approx 300.0

Raw kNN says C is closest to A → predicts Churn. That's wrong.

After Min-Max Scaling — Correct Result

x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

After scaling, both features lie in [0, 1] and the login count carries real weight. C becomes closer to B → predicts Retain. The prediction flipped.
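The flip is easy to reproduce. A minimal pure-Python sketch using the three users from the table (no libraries assumed, helper names are my own):

```python
from math import sqrt

# Toy users from the table above: (spend in $, logins)
A, B, C = (9100.0, 1.0), (9500.0, 18.0), (9200.0, 17.0)

def dist(p, q):
    """Plain Euclidean distance between two 2-D points."""
    return sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Raw distances: spending dominates, so C looks closer to A (Churn).
raw_to_A, raw_to_B = dist(C, A), dist(C, B)   # ~101.3 vs ~300.0

def min_max(points):
    """Scale each of the two features to [0, 1] over the observed range."""
    lo0, hi0 = min(p[0] for p in points), max(p[0] for p in points)
    lo1, hi1 = min(p[1] for p in points), max(p[1] for p in points)
    return [((p[0] - lo0) / (hi0 - lo0), (p[1] - lo1) / (hi1 - lo1))
            for p in points]

As_, Bs, Cs = min_max([A, B, C])
# After scaling, C's nearest neighbor flips to B (Retain).
print(raw_to_A < raw_to_B)            # True  (raw: nearest is A)
print(dist(Cs, As_) < dist(Cs, Bs))   # False (scaled: nearest is B)
```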

Failure 2: PCA Thinks "Bigger Variance" Means "More Important"

PCA maximizes variance along principal components. If one feature is measured in thousands and another in tens, the larger-scale feature can dominate the first principal component — even when it is not the most informative signal.

Unscaled

Spending variance overwhelms login variance. PC1 mostly just captures spending.

Scaled

PCA compares features on fairer footing. Both contribute meaningfully to PC1.

PCA inherits your preprocessing choices. It does not know which unit is "right."
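One way to check this on the toy data is to compare the PC1 loadings before and after standardization. A minimal sketch, using the closed-form leading eigenvector of a 2×2 covariance matrix (tan 2θ = 2b / (a − c)); the helper names are assumptions:

```python
from math import atan2, cos, sin, sqrt

spend  = [9100.0, 9500.0, 9200.0]   # users A, B, C
logins = [1.0, 18.0, 17.0]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cov(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pc1(xs, ys):
    """Leading eigenvector of the 2x2 covariance matrix [[a, b], [b, c]]."""
    a, b, c = var(xs), cov(xs, ys), var(ys)
    theta = 0.5 * atan2(2 * b, a - c)
    return cos(theta), sin(theta)

def zscore(xs):
    m, s = sum(xs) / len(xs), sqrt(var(xs))
    return [(x - m) / s for x in xs]

w_raw = pc1(spend, logins)                   # ~(0.999, 0.033): PC1 is spending
w_std = pc1(zscore(spend), zscore(logins))   # ~(0.707, 0.707): balanced
print(w_raw, w_std)
```

With raw units, spending's variance (~43,000) dwarfs the login variance (91), so PC1 is essentially the spending axis; after z-scoring both variances are 1 and the loadings equalize.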

Failure 3: Neural Networks Can Stall Early

Gradient Saturation

Large-magnitude inputs push activations into saturated regions of sigmoid or tanh. Gradients shrink, learning stalls.

z = w_1 x_1 + w_2 x_2 + b

If x₁ ≈ 9,000 while x₂ ≈ 10, the first term dominates z regardless of the weights. The network spends its early training correcting for bad scaling instead of learning real structure.
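The stall is visible with nothing but the sigmoid's derivative, σ′(z) = σ(z)(1 − σ(z)). The weights below are illustrative assumptions, not fitted values:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: maximal (0.25) at z = 0, tiny for |z| >> 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

w1 = w2 = 0.01                   # same small initial weights in both cases

z_raw = w1 * 9000 + w2 * 10      # raw inputs: z = 90.1, deep in the tail
z_scaled = w1 * 0.9 + w2 * 0.5   # min-max-scaled inputs: z = 0.014

print(sigmoid_grad(z_raw))       # effectively zero: learning stalls
print(sigmoid_grad(z_scaled))    # ~0.25: near the sigmoid's maximum gradient
```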

Which Scaling Method Should You Choose?

| Method                    | Best For                                       | Watch Out For                 |
|---------------------------|------------------------------------------------|-------------------------------|
| Min-max scaling           | Bounded features, distance models, neural nets | Sensitive to outliers         |
| Standardization (z-score) | PCA, regression, SVM, many general workflows   | Still affected by heavy tails |
| Robust scaling            | Data with significant outliers                 | Loses some density info       |
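The trade-offs are easy to see on a feature with a single outlier. A minimal stdlib-only sketch of the three methods; the toy data is an assumption:

```python
import statistics

data = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0,
        16.0, 17.0, 18.0, 19.0, 500.0]       # ten inliers, one outlier

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    m, s = statistics.mean(xs), statistics.stdev(xs)
    return [(x - m) / s for x in xs]

def robust(xs):
    """Center by the median, scale by the interquartile range."""
    med = statistics.median(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return [(x - med) / (q3 - q1) for x in xs]

# The outlier crushes the min-max-scaled inliers into a sliver of [0, 1]...
mm = min_max(data)
print(max(mm[:10]))                 # all ten inliers land below 0.02

# ...while robust scaling keeps the inliers well spread out.
rb = robust(data)
print(max(rb[:10]) - min(rb[:10]))  # inlier spread stays above 1.0
```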

The Simple Rule

"If the algorithm depends on distance, variance, or gradient stability, scaling is usually not optional."

Trees are mostly scale-invariant. kNN, PCA, and neural nets are not. That one preprocessing choice can decide whether the model tells a coherent story or a misleading one.
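The tree claim follows from how splits are chosen: a split threshold depends only on the ordering of a feature's values, and any monotone rescaling (min-max, z-score) preserves that ordering. A quick check, with assumed toy values:

```python
spend = [9100.0, 9500.0, 9200.0, 9350.0]

# Min-max scaling is monotone, so it cannot reorder the values...
scaled = [(x - min(spend)) / (max(spend) - min(spend)) for x in spend]

order_raw = sorted(range(len(spend)), key=lambda i: spend[i])
order_scaled = sorted(range(len(spend)), key=lambda i: scaled[i])

# ...so every candidate tree split induces the same partition of the data.
print(order_raw == order_scaled)   # True
```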

Master Feature Preprocessing

Go from raw data to model-ready features — covering scaling, encoding, selection, and dimensionality reduction in our full ML course.
