Lesson 5.3: Data Distributions & Outliers


Discover the patterns in your data! Learn to identify different types of distributions, spot outliers, and understand how they affect your statistical analysis.

Learning Objectives

Identify different types of data distributions

Recognize and analyze outliers

Understand skewness and its effects

Evaluate data quality and reliability

Types of Data Distributions

Normal Distribution

• Bell-shaped curve

• Symmetric around the mean

• Mean = Median = Mode

• Most data near the center

Skewed Distribution

• Right-skewed: Tail extends right

• Left-skewed: Tail extends left

• Mean, median, and mode usually differ

• Asymmetric shape

Uniform Distribution

• All values equally likely

• Flat, rectangular shape

• No single peak (no clear mode)

• Equal frequency across range

Bimodal Distribution

• Two distinct peaks

• Two modes

• Often indicates two groups

• Valley between peaks
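One way to see which of these shapes a dataset has is to count values per bin and print the counts as rows of marks. The sketch below is only an illustration: `bin_counts` and `print_histogram` are hypothetical helper names, the sample is made up to show a bimodal pattern, and the code assumes whole-number values.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def print_histogram(values, bin_width):
    """Print one row of '#' marks per bin so the shape is visible as text."""
    counts = bin_counts(values, bin_width)
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        label = f"{start}-{start + bin_width - 1}"
        print(f"{label:<8}{'#' * counts.get(start, 0)}")

# Hypothetical sample, made up for illustration: two clusters of values.
sample = [51, 53, 55, 56, 58, 59, 62, 64, 71,
          82, 84, 85, 86, 87, 88, 89, 93, 95]
print_histogram(sample, 10)
```

With this sample, the rows for the 50s and the 80s are the longest and the row for the 70s is the shortest, which is the two-peaks-with-a-valley pattern described for a bimodal distribution.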

Identifying Outliers

What are Outliers?

• Data points that are significantly different from other observations

• Values that fall far outside the normal range

• Can be caused by measurement errors, rare events, or data entry mistakes

• Can significantly affect statistical measures

Example 1: Test Scores with Outlier

Data: 85, 88, 92, 89, 87, 91, 90, 88, 86, 45

Step 1: Calculate statistical measures

Mean = 84.1

Median = 88

Range = 47

Step 2: Identify the outlier

The score of 45 is much lower than all other scores

It is 40 points below the next lowest score (85)

Step 3: Impact analysis

Without outlier: Mean ≈ 88.4, Median = 88

The outlier lowered the mean by more than 4 points but left the median unchanged
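The calculation in Example 1 can be checked with a few lines of Python. This is a minimal sketch using the standard `statistics` module. The helper name `summarize` is an assumption made for illustration, and the mean is rounded to one decimal place.

```python
from statistics import mean, median

scores = [85, 88, 92, 89, 87, 91, 90, 88, 86, 45]

def summarize(values):
    """Return the mean (1 decimal place), median, and range of a list."""
    return {
        "mean": round(mean(values), 1),
        "median": median(values),
        "range": max(values) - min(values),
    }

with_outlier = summarize(scores)
# Drop the suspected outlier (45) to see how much each measure moves.
without_outlier = summarize([s for s in scores if s != 45])

print("With 45:   ", with_outlier)
print("Without 45:", without_outlier)
```

Comparing the two summaries side by side shows which measures are sensitive to the single low score: the mean and the range change noticeably, while the median stays where it was.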

Example 2: IQR Method for Outliers

Data: 12, 15, 18, 20, 22, 25, 28, 30, 35, 50

Step 1: Find Q1, Q2 (median), and Q3

Q1 = 18 (25th percentile)

Q2 = 23.5 (50th percentile)

Q3 = 30 (75th percentile)

Step 2: Calculate IQR

IQR = Q3 - Q1 = 30 - 18 = 12

Step 3: Find outlier boundaries

Lower boundary = Q1 - 1.5 × IQR = 18 - 18 = 0

Upper boundary = Q3 + 1.5 × IQR = 30 + 18 = 48

Step 4: Identify outliers

The value 50 is above the upper boundary (48)

Therefore, 50 is an outlier
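The four steps above can be sketched in Python. The function names `quartiles` and `iqr_outliers` are hypothetical. The sketch assumes the "median of each half" convention for quartiles, because that convention reproduces Q1 = 18 and Q3 = 30 from this example. Other conventions, including the interpolation used by some library functions, may give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Assumption: with an odd number of values, the middle value is left
    out of both halves. Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                  # values below the median
    upper = ordered[len(ordered) - half:]   # values above the median
    return median(lower), median(ordered), median(upper)

def iqr_outliers(values, k=1.5):
    """Return (lower boundary, upper boundary, values outside them)."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return low, high, [v for v in values if v < low or v > high]

data = [12, 15, 18, 20, 22, 25, 28, 30, 35, 50]
print(quartiles(data))      # Q1, Q2, Q3
print(iqr_outliers(data))   # boundaries and flagged values
```

The multiplier `k=1.5` is the one used in the example. It is kept as a parameter so that a stricter or looser rule can be tried without changing the function.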

Skewness Analysis

Example 3: Right-Skewed Distribution

Data: Income levels in a neighborhood: $30k, $35k, $40k, $45k, $50k, $55k, $60k, $65k, $70k, $200k

Statistical Measures:

Mean = $65,000

Median = $52,500

Mode = None (every value occurs only once)

Skewness Analysis:

Mean > Median → Right-skewed

The high income ($200k) pulls the mean to the right

Interpretation:

Most people earn around $30k-$70k, but one person earns much more, creating a long tail to the right.

Example 4: Left-Skewed Distribution

Data: Test scores: 20, 85, 88, 90, 92, 94, 95, 96, 97, 98

Statistical Measures:

Mean = 85.5

Median = 93

Most scores are high (85-98)

Skewness Analysis:

Mean < Median → Left-skewed

The low score (20) pulls the mean to the left

Interpretation:

Most students scored very well (85-98), but one student scored much lower, creating a long tail to the left.
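The mean-versus-median comparison used in Examples 3 and 4 can be written as a small function. This is a sketch of that rule of thumb only, with `skew_direction` as a hypothetical name. It works for clear cases like these two datasets but is not a formal skewness measure and can mislead on unusual distributions.

```python
from statistics import mean, median

def skew_direction(values):
    """Guess the skew direction from the mean-versus-median rule of thumb."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "roughly symmetric"

incomes_k = [30, 35, 40, 45, 50, 55, 60, 65, 70, 200]  # Example 3, in $1000s
scores = [20, 85, 88, 90, 92, 94, 95, 96, 97, 98]      # Example 4

for name, values in [("incomes", incomes_k), ("scores", scores)]:
    print(name, mean(values), median(values), skew_direction(values))
```

In both datasets a single extreme value drives the result, so it is worth looking at the data itself before concluding that the whole distribution is skewed.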

Data Quality Assessment

Good Data Quality

• Consistent measurement units

• No unexplained outliers

• Reasonable distribution shape

• Mean and median close together

Poor Data Quality

• Many extreme outliers

• Inconsistent units or scales

• Unusual distribution patterns

• Large gaps between mean and median

Example 5: Data Quality Check

Scenario: Analyzing heights of 8th grade students: 150cm, 155cm, 160cm, 165cm, 170cm, 175cm, 180cm, 185cm, 190cm, 250cm

Statistical Analysis:

Mean = 178cm

Median = 172.5cm

Range = 100cm

Quality Issues:

• 250cm is an extreme outlier (unrealistic for an 8th grader)
• Large gap between mean and median
• Range is unusually large
• Likely data entry error

Recommendation:

Investigate the 250cm measurement - it is likely a data entry mistake (for example, a mistyped 150cm or 160cm).
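A check like the one in Example 5 can be partly automated by comparing each value against a plausible range. In the sketch below, the limits of 120cm and 210cm are assumptions chosen for illustration, not values from the lesson, and `implausible_values` is a hypothetical helper name. The function only flags values for investigation and does not remove anything.

```python
from statistics import mean, median

# Assumed plausibility limits for this illustration only.
MIN_PLAUSIBLE_CM = 120
MAX_PLAUSIBLE_CM = 210

def implausible_values(values, low, high):
    """Return the values that fall outside the closed range [low, high]."""
    return [v for v in values if not low <= v <= high]

heights_cm = [150, 155, 160, 165, 170, 175, 180, 185, 190, 250]

suspects = implausible_values(heights_cm, MIN_PLAUSIBLE_CM, MAX_PLAUSIBLE_CM)
print("Mean:", mean(heights_cm), "Median:", median(heights_cm))
print("Range:", max(heights_cm) - min(heights_cm))
print("Values to investigate:", suspects)
```

A range check of this kind depends on knowing what the measurements represent, so the limits should come from the subject matter rather than from the data being checked.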

Common Mistakes to Avoid

❌ Mistake 1: Automatically removing all outliers

Outliers might be legitimate data points. Always investigate before removing them.

❌ Mistake 2: Using only the mean for skewed data

For skewed distributions, the median is often a better measure of central tendency.

❌ Mistake 3: Ignoring distribution shape

The shape of the distribution tells you important information about your data.

Practice Problems

Problem 1:

Identify the outlier in: 12, 15, 18, 20, 22, 25, 28, 30, 35, 100

Solution:

The outlier is 100. It is far larger than every other value and lies above the IQR upper boundary of 48 (Q1 = 18, Q3 = 30, IQR = 12).

Problem 2:

Is this distribution right-skewed or left-skewed: Mean = 45, Median = 50?

Solution:

Left-skewed. When mean < median, the distribution has a long tail to the left.

Problem 3:

Which measure is most affected by outliers: mean, median, or mode?

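Solution:

The mean is most affected by outliers because every value, including an extreme one, enters its calculation directly. The median depends only on the order of the values, and the mode only on how often each value occurs.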
