Master the essential vocabulary of machine learning with practical housing price dataset examples
Understanding these fundamental terms is essential for working with machine learning. We'll use our housing price dataset to illustrate each concept.
- **Dataset**: A collection of data used for training and testing machine learning models.
- **Feature (attribute)**: A characteristic or property that describes an object.
- **Feature value**: The specific value of a feature for a given sample.
- **Dimensionality**: The number of features describing each sample.
- **Feature space**: The space formed by all features, also called the attribute space or input space.
- **Label**: The output or result associated with a sample.
- **Label space**: The space formed by all possible labels, also called the output space.
- **Instance**: A single object's input features, without the label.
- **Sample (example)**: An instance together with its label.
- **Training set**: The collection of samples used to train the model.
- **Test set**: The collection of samples used to evaluate model performance.
- **Validation set**: The set used for model selection and hyperparameter tuning, preventing "cheating" on the test set.
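To make these terms concrete, here is a minimal sketch, assuming NumPy and scikit-learn are available (the feature values are illustrative), that represents houses as instances, attaches labels to form samples, and splits the data into training, validation, and test sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Each row is an instance: its feature values for
# [sqft, bedrooms, bathrooms, location score]
X = np.array([
    [2400, 4, 2.5, 8.5],
    [1200, 2, 1.0, 4.2],
    [2800, 5, 3.0, 9.1],
    [1800, 3, 2.0, 7.3],
    [1000, 2, 1.0, 3.8],
    [2100, 3, 2.0, 6.9],
])
# One label per instance; an instance plus its label is a sample
y = np.array([450, 185, 580, 325, 145, 390])  # price in $k

print("dimensionality:", X.shape[1])  # 4 features per sample

# Hold out a test set, then carve a validation set from the remainder
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=1, random_state=0)
```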
Feature engineering is the process of selecting, extracting, and transforming features to improve model performance (see the sketch below). It includes:

- **Feature selection**: choosing the most relevant features for the task (e.g., keeping square footage and location score for predicting house value)
- **Feature extraction**: deriving new features from existing ones (e.g., computing bedrooms per 1,000 sqft from the bedroom count and square footage)
- **Feature transformation**: converting features to better representations (e.g., normalizing square footage to a 0-1 scale)
Warning: high-dimensional feature spaces can suffer from the "curse of dimensionality". As the number of dimensions grows, data becomes sparse and models require exponentially more samples to maintain performance.
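A minimal sketch of all three steps in plain NumPy (the derived "bedrooms per 1,000 sqft" feature is just an illustrative choice):

```python
import numpy as np

# Columns: sqft, bedrooms, bathrooms, location score
X = np.array([
    [2400, 4, 2.5, 8.5],
    [1200, 2, 1.0, 4.2],
    [1800, 3, 2.0, 7.3],
])

# Selection: keep only sqft and location score
X_selected = X[:, [0, 3]]

# Extraction: derive a new feature, bedrooms per 1,000 sqft
beds_per_ksqft = X[:, 1] / (X[:, 0] / 1000.0)

# Transformation: min-max normalize sqft to the [0, 1] range
sqft = X[:, 0]
sqft_normalized = (sqft - sqft.min()) / (sqft.max() - sqft.min())
```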
ML tasks are categorized based on the type of label (target) we're trying to predict.
**Classification**: The label takes discrete (categorical) values, and we want to assign each sample to a specific category.

**Binary classification**: Only two possible classes. Examples: (High-Value House, Standard-Value House), (Positive Class, Negative Class), (+1, -1).
| ID | Sqft | Beds | Baths | Loc Score | Rating |
|---|---|---|---|---|---|
| 1 | 2400 | 4 | 2.5 | 8.5 | Excellent |
| 2 | 1200 | 2 | 1 | 4.2 | Fair |
| 3 | 2800 | 5 | 3 | 9.1 | Excellent |
| 4 | 1800 | 3 | 2 | 7.3 | Good |
| 5 | 1000 | 2 | 1 | 3.8 | Fair |
Showing 5 of 20 samples
**Multiclass classification**: More than two possible classes, e.g., (Urban, Suburban, Rural) or (Excellent, Good, Fair, Poor), as in the Rating column above. This is more complex than binary classification because the model must distinguish among multiple categories.
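Both flavors use the same training API; only the label set changes. A minimal sketch assuming scikit-learn, using the five table rows above (the binary High Value/Standard assignment is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Features from the table: sqft, bedrooms, bathrooms, location score
X = np.array([
    [2400, 4, 2.5, 8.5],
    [1200, 2, 1.0, 4.2],
    [2800, 5, 3.0, 9.1],
    [1800, 3, 2.0, 7.3],
    [1000, 2, 1.0, 3.8],
])

# Binary labels: two classes only (illustrative assignment)
y_binary = ["High Value", "Standard", "High Value", "Standard", "Standard"]
# Multiclass labels: the Rating column from the table
y_multi = ["Excellent", "Fair", "Excellent", "Good", "Fair"]

binary_clf = DecisionTreeClassifier(random_state=0).fit(X, y_binary)
multi_clf = DecisionTreeClassifier(random_state=0).fit(X, y_multi)

new_house = [[2000, 3, 2.0, 8.0]]
print(binary_clf.predict(new_house))  # predicted class from {High Value, Standard}
print(multi_clf.predict(new_house))   # predicted rating from {Excellent, Good, Fair}
```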
**Regression**: The label takes continuous (numerical) values, and we want to predict a quantity or measurement. Instead of classifying a house as "High Value" or "Standard", we predict its exact price (the $100k-$900k range).
| ID | Sqft | Beds | Baths | Loc Score | Price |
|---|---|---|---|---|---|
| 1 | 2400 | 4 | 2.5 | 8.5 | $450k |
| 2 | 1200 | 2 | 1 | 4.2 | $185k |
| 3 | 2800 | 5 | 3 | 9.1 | $580k |
| 4 | 1800 | 3 | 2 | 7.3 | $325k |
| 5 | 1000 | 2 | 1 | 3.8 | $145k |
Showing 5 of 20 samples
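The same five rows, now with the continuous price as the label. A minimal regression sketch assuming scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: sqft, bedrooms, bathrooms, location score
X = np.array([
    [2400, 4, 2.5, 8.5],
    [1200, 2, 1.0, 4.2],
    [2800, 5, 3.0, 9.1],
    [1800, 3, 2.0, 7.3],
    [1000, 2, 1.0, 3.8],
])
y = np.array([450, 185, 580, 325, 145])  # the Price column, in $k

model = LinearRegression().fit(X, y)
prediction = model.predict([[2000, 3, 2.0, 7.0]])
print(f"predicted price: ${prediction[0]:.0f}k")
```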
**Clustering**: Labels are empty (unlabeled data), and we want to automatically group similar samples together. Example: finding natural groupings of houses (e.g., luxury homes, family homes, starter homes) without predefined labels.
| ID | Sqft | Beds | Baths | Loc Score | Label |
|---|---|---|---|---|---|
| 1 | 2400 | 4 | 2.5 | 8.5 | ? |
| 2 | 1200 | 2 | 1 | 4.2 | ? |
| 3 | 2800 | 5 | 3 | 9.1 | ? |
| 4 | 1800 | 3 | 2 | 7.3 | ? |
| 5 | 1000 | 2 | 1 | 3.8 | ? |
No labels are available; the algorithm must find the patterns on its own.
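A minimal clustering sketch with scikit-learn's KMeans. The choice of three clusters is an assumption (hoping for luxury/family/starter groupings), and the features are scaled first so square footage does not dominate the distance metric:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Unlabeled houses: sqft, bedrooms, bathrooms, location score
X = np.array([
    [2400, 4, 2.5, 8.5],
    [1200, 2, 1.0, 4.2],
    [2800, 5, 3.0, 9.1],
    [1800, 3, 2.0, 7.3],
    [1000, 2, 1.0, 3.8],
])

# Scale features so distances aren't dominated by square footage
X_scaled = StandardScaler().fit_transform(X)

# Ask for 3 groups; cluster indices are arbitrary identifiers
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)  # cluster index for each house
```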
Based on label availability, we categorize machine learning into different paradigms:
**Supervised learning**: All samples have labels. The model learns from labeled examples to make predictions. Example: learning to predict house ratings when the rating of every training house is known.

**Unsupervised learning**: No samples have labels. The model discovers patterns and structure in the data. Example: grouping houses without knowing their ratings and finding natural categories.

**Semi-supervised learning**: Few samples have labels and many do not; the approach combines labeled and unlabeled data. Example: having price labels for only 20 houses but features for 1,000 houses.

**Learning with noisy labels**: Labels exist but may be incorrect, so the model must be robust to label noise. Example: some houses mislabeled as "Excellent" when they are actually "Fair".
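A minimal sketch contrasting the supervised and semi-supervised setups on synthetic data. scikit-learn's LabelPropagation is just one possible choice of semi-supervised algorithm; its API marks unlabeled samples with -1:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)
# 100 synthetic houses: sqft, bedrooms, bathrooms, location score
X = rng.uniform([800, 1, 1.0, 1.0], [3000, 5, 3.0, 10.0], size=(100, 4))

# Supervised setting: every sample labeled (1 = High Value, 0 = Standard)
y_full = (X[:, 0] > 1800).astype(int)

# Semi-supervised setting: only 10 labels known, the rest marked unknown (-1)
y_semi = np.full(100, -1)
y_semi[:10] = y_full[:10]

model = LabelPropagation(kernel="knn", n_neighbors=5).fit(X, y_semi)
print(model.transduction_[:20])  # labels inferred for the unlabeled houses
```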
The fundamental goal of machine learning is generalization ability: the model's capacity to handle future, unseen samples effectively. We use historical data (the training set) to approximate the model's generalization ability. The training set is our "experience," and we hope this experience carries over to new situations. Example: we train on 20 houses and hope the learned patterns apply to thousands of future listings.
**Independent and Identically Distributed (I.I.D.)**: We assume training and test samples are drawn independently from the same underlying distribution.
This assumption allows us to use training data statistics to estimate test performance. If violated, the model may fail on new data.
In reality, data often suffers from distribution shift: the training and test data come from different distributions, violating the I.I.D. assumption. Example: training on houses from one city but predicting prices in another, where the same features correspond to very different prices.
Solutions: Domain Adaptation and Transfer Learning techniques help models adapt when the training and test distributions differ.
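A small simulation of distribution shift on synthetic data (the price-per-sqft numbers are invented for illustration): a model fit on one market scores well on i.i.d. test data but poorly once the test distribution moves:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def make_market(n_houses, price_per_sqft):
    """Generate houses whose price ($k) is roughly linear in sqft."""
    sqft = rng.uniform(800, 3000, n_houses)
    price = price_per_sqft * sqft + rng.normal(0, 10, n_houses)
    return sqft.reshape(-1, 1), price

X_train, y_train = make_market(200, 0.20)  # training market: $200/sqft
X_iid, y_iid = make_market(100, 0.20)      # same distribution as training
X_shift, y_shift = make_market(100, 0.30)  # shifted market: $300/sqft

model = LinearRegression().fit(X_train, y_train)
print("i.i.d. test R^2:  ", round(model.score(X_iid, y_iid), 3))      # near 1.0
print("shifted test R^2: ", round(model.score(X_shift, y_shift), 3))  # far worse
```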
**Hypothesis space**: All possible hypotheses (models/rules) that could explain the data. In concept learning, we search through this space to find hypotheses consistent with the training data.

**Version space**: The subset of the hypothesis space consisting of all hypotheses consistent with the training data.
Multiple hypotheses may fit the training data perfectly, but they might make different predictions on new data.
**Inductive bias**: The learning algorithm's preference for certain types of hypotheses during the learning process. When multiple hypotheses are consistent with the training data, we need a preference to choose among them.
Example: Preferring simpler models over complex ones (Occam's Razor), or preferring linear relationships over non-linear ones
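A minimal sketch of why inductive bias matters, with invented numbers: a straight line and a degree-4 polynomial can both explain five nearly-linear training points, yet they disagree sharply on a new input, so the algorithm's preference between them decides the prediction:

```python
import numpy as np

# Training data: sqft (thousands) -> price ($100k); nearly linear
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.5, 2.2, 3.1, 3.7, 4.6])

# Two hypotheses consistent with the training data:
line = np.polyfit(x, y, 1)     # simple bias: prefer a straight line
quartic = np.polyfit(x, y, 4)  # complex bias: passes exactly through all 5 points

# Both fit the training data well, but diverge on an unseen house
x_new = 4.0
print("line:   ", np.polyval(line, x_new))     # continues the trend
print("quartic:", np.polyval(quartic, x_new))  # can swing far off the trend
```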
"If an algorithm performs better than another on some problems, there must exist other problems where the second algorithm performs better."
In other words, there is no universally best learning algorithm. Each algorithm has its strengths and weaknesses, and which one works best depends on the specific problem.
It's meaningless to discuss "which learning algorithm is better" without considering the specific problem. The key is understanding which algorithm is suitable for which types of problems - this is why we study various algorithms and their characteristics.
- **Data terms**: Features, labels, samples, and training/test sets are fundamental ML vocabulary
- **Task types**: Classification (discrete labels), regression (continuous labels), clustering (unlabeled data)
- **Learning paradigms**: Supervised (all labeled), unsupervised (no labels), semi-supervised (some labeled)
- **Goal**: Generalization to unseen data, not just memorizing training examples
- **No Free Lunch**: No single algorithm works best for all problems