MathIsimple

Basic Terminology & Concepts

Master the essential vocabulary of machine learning with practical housing price dataset examples

Module 2 of 4
Beginner Level
50-70 min

Core Data-Related Terminology

Understanding these fundamental terms is essential for working with machine learning. We'll use our housing price dataset to illustrate each concept.

Dataset

A collection of data used for training and testing machine learning models

Example: Our housing dataset with 200 property sales

Feature (Attribute)

A characteristic or property that describes an object

Example: Square footage, bedrooms, bathrooms, year built, location score

Attribute Value

The specific value of a feature for a given sample

Example: Sqft = 2400, Bedrooms = 4, Location Score = 8.5

Sample Dimensionality

The number of features describing each sample

Example: Our housing data has 6 dimensions (6 features)

Feature Space

The space formed by all features, also called attribute space or input space

Example: 6-dimensional space for our housing features

Label (Target)

The output or result associated with a sample

Example: Price: $450,000 or Rating: 'Excellent' for houses

Label Space

The space formed by all possible labels, also called output space

Example: Continuous price range ($100k-$900k) or categorical {Excellent, Good, Fair}

Instance (Example)

A single object's input features without the label

Example: One house's characteristics and measurements

Sample (Training Example)

Instance + Label together

Example: House characteristics WITH price or rating

Training Set

Collection of samples used to train the model

Example: 150 houses with known prices

Test Set

Collection of samples used to evaluate model performance

Example: 30 houses to test price prediction accuracy

Validation Set

The set used for model selection and hyperparameter tuning; it prevents 'cheating' by keeping the test set untouched until the final evaluation

Example: 20 houses for choosing the best model
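The three splits can be sketched in a few lines of Python. The 150/20/30 sizes follow the examples above; the random seed and the use of index placeholders are illustrative assumptions:

```python
# Hypothetical sketch: splitting a 200-sample housing dataset into
# training, validation, and test sets.
import random

random.seed(42)                 # fixed seed so the shuffle is reproducible
samples = list(range(200))      # stand-ins for 200 property records
random.shuffle(samples)         # shuffle first to avoid any ordering bias

train_set = samples[:150]       # 150 houses: fit the model
val_set   = samples[150:170]    # 20 houses: choose hyperparameters
test_set  = samples[170:]       # 30 houses: final, untouched evaluation

print(len(train_set), len(val_set), len(test_set))  # 150 20 30
```

Because the three slices never overlap, performance measured on the test set reflects genuinely unseen data.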

Feature Engineering

Feature engineering is the process of selecting, extracting, and transforming features to improve model performance. It includes:

Feature Selection

Choosing the most relevant features for the task (e.g., selecting square footage and location for house value)

Feature Extraction

Deriving new features from existing ones (e.g., computing a house's age from its year built)

Feature Transformation

Converting features to better representations (e.g., normalizing square footage to a 0-1 scale)

Warning: High-dimensional feature spaces can lead to the "curse of dimensionality" - as dimensions increase, data becomes sparse and models require exponentially more samples to maintain performance.
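A minimal sketch of the three feature-engineering steps on toy housing records. The field names, the reference year 2024, and the particular features selected are illustrative assumptions:

```python
# Toy housing records (values are illustrative, not the course dataset).
houses = [
    {"sqft": 2400, "beds": 4, "year_built": 1995, "loc_score": 8.5},
    {"sqft": 1200, "beds": 2, "year_built": 1978, "loc_score": 4.2},
    {"sqft": 2800, "beds": 5, "year_built": 2010, "loc_score": 9.1},
]

# Feature extraction: derive a new feature (house age) from an existing one.
# 2024 is an assumed "current year".
for h in houses:
    h["age"] = 2024 - h["year_built"]

# Feature transformation: min-max normalize sqft into the [0, 1] range.
lo = min(h["sqft"] for h in houses)
hi = max(h["sqft"] for h in houses)
for h in houses:
    h["sqft_norm"] = (h["sqft"] - lo) / (hi - lo)

# Feature selection: keep only the features judged most relevant.
selected = [{k: h[k] for k in ("sqft_norm", "loc_score", "age")} for h in houses]
print(selected[0])
```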

Machine Learning Task Types

ML tasks are categorized based on the type of label (target) we're trying to predict.

Classification Tasks

Discrete Labels

The label takes discrete (categorical) values. We want to assign each sample to a specific category.

Binary Classification

Only two possible classes. Examples: (High Value House, Standard Value House), (Positive Class, Negative Class), (+1, -1)

ID | Sqft | Beds | Baths | Loc Score | Rating
---|------|------|-------|-----------|----------
1  | 2400 | 4    | 2.5   | 8.5       | Excellent
2  | 1200 | 2    | 1     | 4.2       | Fair
3  | 2800 | 5    | 3     | 9.1       | Excellent
4  | 1800 | 3    | 2     | 7.3       | Good
5  | 1000 | 2    | 1     | 3.8       | Fair

Showing 5 of 20 samples

Multi-Class Classification

More than two possible classes. Example: (Urban, Suburban, Rural) or (Excellent, Good, Fair, Poor)

More complex than binary classification as the model must distinguish between multiple categories
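As a concrete sketch, even a 1-nearest-neighbor rule can assign a discrete rating to a new house. The toy data mirrors the sample table above; the distance function and its feature scaling are assumptions:

```python
# Minimal 1-nearest-neighbor classifier: predict a discrete rating
# from (sqft, loc_score). Data mirrors the sample table above.
train = [
    ((2400, 8.5), "Excellent"),
    ((1200, 4.2), "Fair"),
    ((2800, 9.1), "Excellent"),
    ((1800, 7.3), "Good"),
    ((1000, 3.8), "Fair"),
]

def predict(x):
    # Scale sqft down by 1000 so both features contribute comparably
    # (an assumed, crude normalization).
    def dist(a, b):
        return ((a[0] - b[0]) / 1000) ** 2 + (a[1] - b[1]) ** 2
    return min(train, key=lambda s: dist(s[0], x))[1]

print(predict((2600, 8.8)))  # nearest neighbors are "Excellent" houses
```

The same code handles binary or multi-class labels; only the set of possible ratings changes.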

Regression Tasks

Continuous Labels

The label takes continuous (numerical) values. We want to predict a quantity or measurement.

Example: Predicting House Prices

Instead of classifying as "High Value" or "Standard", we predict the exact house price ($100k-$900k range)

ID | Sqft | Beds | Baths | Loc Score | Price
---|------|------|-------|-----------|------
1  | 2400 | 4    | 2.5   | 8.5       | $450k
2  | 1200 | 2    | 1     | 4.2       | $185k
3  | 2800 | 5    | 3     | 9.1       | $580k
4  | 1800 | 3    | 2     | 7.3       | $325k
5  | 1000 | 2    | 1     | 3.8       | $145k

Showing 5 of 20 samples
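A minimal regression sketch: fitting a least-squares line from square footage alone to price, using the five sample rows above (a real model would use all features):

```python
# Simple one-variable linear regression: price ($k) as a function of sqft.
# Data comes from the five sample rows above.
data = [(2400, 450), (1200, 185), (2800, 580), (1800, 325), (1000, 145)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Closed-form least squares: slope = cov(x, y) / var(x).
slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / \
        sum((x - mean_x) ** 2 for x, _ in data)
intercept = mean_y - slope * mean_x

# Predict a continuous value for an unseen 2000 sqft house.
print(round(slope * 2000 + intercept))  # predicted price in $k
```

Unlike classification, the output is a number on a continuous scale, not a category.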

Clustering Tasks

No Labels

Samples carry no labels (unlabeled data). We want to automatically group similar samples together.

Example: Grouping Similar Houses

Find natural groupings in houses (e.g., luxury homes, family homes, starter homes) without predefined labels

ID | Sqft | Beds | Baths | Loc Score | Label
---|------|------|-------|-----------|------
1  | 2400 | 4    | 2.5   | 8.5       | ?
2  | 1200 | 2    | 1     | 4.2       | ?
3  | 2800 | 5    | 3     | 9.1       | ?
4  | 1800 | 3    | 2     | 7.3       | ?
5  | 1000 | 2    | 1     | 3.8       | ?

No rating labels available - the algorithm must find patterns on its own
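A tiny k-means sketch (k = 2, using square footage only) shows how an algorithm can discover such groupings without any labels. The starting centers, the choice of k, and the single-feature simplification are assumptions:

```python
# Tiny k-means (k = 2) on square footage alone; no labels are used.
sqfts = [2400, 1200, 2800, 1800, 1000]
centers = [1000.0, 2800.0]  # assumed starting centers

for _ in range(10):  # a few refinement iterations (converges quickly here)
    # Assignment step: attach each house to its nearest center.
    clusters = [[], []]
    for s in sqfts:
        idx = 0 if abs(s - centers[0]) <= abs(s - centers[1]) else 1
        clusters[idx].append(s)
    # Update step: move each center to the mean of its cluster.
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(clusters[0]), sorted(clusters[1]))
```

The algorithm separates smaller houses from larger ones purely from the feature values, with no rating or price ever provided.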

Learning Paradigms

Based on label availability, we categorize machine learning into different paradigms:

Supervised Learning

All samples have labels. The model learns from labeled examples to make predictions.

Includes: Classification & Regression

Example: Learning to predict house ratings when the rating of every training house is known

Unsupervised Learning

No samples have labels. The model discovers patterns and structure in the data.

Includes: Clustering & Dimensionality Reduction

Example: Grouping houses without knowing their ratings, discovering natural categories

Semi-Supervised Learning

Few samples have labels, many don't. Combines labeled and unlabeled data.

Use case: When labeling is expensive or time-consuming

Example: Having price labels for only 20 houses but features for 1,000 houses

Noisy Label Learning

Labels exist but may be incorrect. Model must be robust to label noise.

Challenge: Improving robustness under noisy conditions

Example: Some houses mislabeled as 'Excellent' when they're actually 'Fair'

The Goal: Generalization Ability

The fundamental goal of machine learning is generalization ability - the model's capacity to handle future, unseen test samples effectively.

Key Principle

We use historical data (training set) to approximate the model's generalization ability. The training set is our "experience," and we hope this experience generalizes to new situations.

Example: We train on 150 houses and hope the learned patterns apply to thousands of future listings

I.I.D. Assumption

Independent and Identically Distributed: We assume training and test samples are drawn from the same underlying distribution and are independent of each other.

This assumption allows us to use training data statistics to estimate test performance. If violated, the model may fail on new data.

Distribution Shift Problem

In reality, data often suffers from distribution shift - training and test data come from different distributions, violating the I.I.D. assumption.

Example: Training on houses from one city but testing on houses from another city, where prices follow different patterns

Solutions: Domain Adaptation and Transfer Learning techniques help models adapt when the training and test distributions differ.

Concept Learning Basics

Hypothesis Space

All possible hypotheses (models/rules) that could explain the data. In concept learning, we search through this space to find hypotheses consistent with the training data.

Version Space

A subset of the hypothesis space: the "set of hypotheses consistent with the training data."

Multiple hypotheses may fit the training data perfectly, but they might make different predictions on new data.

Inductive Bias

The learning algorithm's preference for certain types of hypotheses during the learning process. When multiple hypotheses are consistent with training data, we need a preference to choose one.

Example: Preferring simpler models over complex ones (Occam's Razor), or preferring linear relationships over non-linear ones
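This preference matters because hypotheses that agree on the training data can disagree sharply off it. The sketch below (all numbers illustrative) compares a degree-4 polynomial that passes exactly through five training points against a simple least-squares line, then evaluates both beyond the data:

```python
# Five illustrative training points that are roughly linear.
pts = [(1.0, 1.5), (2.0, 1.8), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]

def lagrange(x):
    # Degree-4 Lagrange interpolation: fits all five points exactly.
    total = 0.0
    for i, (xi, yi) in enumerate(pts):
        term = yi
        for j, (xj, _) in enumerate(pts):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# The "Occam" hypothesis: a least-squares line through the same points.
mx = sum(x for x, _ in pts) / len(pts)
my = sum(y for _, y in pts) / len(pts)
slope = sum((x - mx) * (y - my) for x, y in pts) / \
        sum((x - mx) ** 2 for x, _ in pts)

def line(x):
    return my + slope * (x - mx)

# Both hypotheses are plausible on the training range, but they diverge
# dramatically at x = 8, outside the training data.
print(round(line(8.0), 2), round(lagrange(8.0), 2))
```

An inductive bias toward the simpler hypothesis is what lets the learner commit to one of these predictions rather than the other.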

No Free Lunch (NFL) Theorem

"If an algorithm performs better than another on some problems, there must exist other problems where the second algorithm performs better."

In other words, there is no universally best learning algorithm. Each algorithm has its strengths and weaknesses, and which one works best depends on the specific problem.

Practical Implication

It's meaningless to discuss "which learning algorithm is better" without considering the specific problem. The key is understanding which algorithm is suitable for which types of problems - this is why we study various algorithms and their characteristics.

Key Takeaways

Data Terms: Features, labels, samples, training/test sets are fundamental ML vocabulary

Task Types: Classification (discrete), Regression (continuous), Clustering (unlabeled)

Learning Paradigms: Supervised (all labeled), Unsupervised (no labels), Semi-supervised (some labeled)

Goal: Generalization to unseen data, not just memorizing training examples

No Free Lunch: No single algorithm works best for all problems