
How to Handle Outliers in Machine Learning

Published at: 10/13/2024
Categories: machinelearning, beginners, datascience, ai
Author: Ashwin Kumar

Outliers are unusual data points that stand out from the rest of your data because they are either much higher or much lower than the rest. Imagine a classroom where most students score between 50 and 80 marks on a test, but one student scores 5, and another scores 100. These extremely different scores are examples of outliers.

In real-world data, outliers are common, and how you handle them can significantly impact your results. Let’s break down some simple techniques for dealing with outliers, with examples and coding demos to help you get started.

What is an Outlier?

Before we jump into the techniques, let’s define what an outlier is. In simple terms, an outlier is a value in a dataset that’s far away from the average or the majority of the other values. For example, in a class of students where most are 18-22 years old, if someone is 50 years old, they would be considered an outlier.

Why Deal with Outliers?

Outliers can distort your results, make your analysis less accurate, and lead to wrong conclusions. For instance, imagine you're trying to find the average income of a neighborhood, but a billionaire lives there. Their income would skew the average, giving you a false impression of the neighborhood’s wealth.
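
To make the effect concrete, here is a small sketch with made-up incomes (the numbers are illustrative assumptions, not real data). It compares the mean and the median before and after one very large income joins the list.

```python
import statistics

# Hypothetical annual incomes for a small neighborhood (illustrative values only)
incomes = [40_000, 45_000, 50_000, 55_000, 60_000]

# Same neighborhood after one extremely high income is added
incomes_with_outlier = incomes + [1_000_000_000]

mean_before = statistics.mean(incomes)
mean_after = statistics.mean(incomes_with_outlier)
median_before = statistics.median(incomes)
median_after = statistics.median(incomes_with_outlier)

print(f"Mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"Median: {median_before:,.0f} -> {median_after:,.0f}")
```

The mean jumps by several orders of magnitude while the median barely moves, which is why median-based methods show up later in this article.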

Common Techniques to Deal with Outliers

Let’s explore a few simple and effective techniques to deal with outliers. We'll also include a coding demo to show how to use each technique.

1. Z-Score Method

What it does: The Z-score method tells you how far a value is from the mean (average) of your data, measured in standard deviations: Z = (X - Mean) / Standard Deviation. A common rule of thumb treats a value more than 3 standard deviations away from the mean as an outlier. A Z-score table can help you look up how rare a given score is under a normal distribution.

When to use: When your data is normally distributed (bell-shaped curve).


Example:

Imagine you have the heights of 100 people. Most of them are between 150 cm and 180 cm, but one person is 250 cm tall. This is an outlier.

Coding Demo:

import pandas as pd
import numpy as np

# Sample data: heights of people (in cm); a fixed seed keeps the run reproducible
np.random.seed(42)
data = pd.DataFrame({'Height': np.random.normal(170, 10, 100)})

# Adding an outlier
data.loc[0, 'Height'] = 250  # This is the outlier

# Calculate the Z-scores
data['Z_score'] = (data['Height'] - data['Height'].mean()) / data['Height'].std()

# Identifying outliers (Z-score > 3 or Z-score < -3)
outliers = data[np.abs(data['Z_score']) > 3]

print("Outliers:")
print(outliers)

2. IQR Method (Interquartile Range)

What it does: The IQR method calculates the range within which the middle 50% of your data lies. It helps identify outliers by finding values that fall significantly outside this range.

How it works:

  1. Calculate the first quartile (Q1): The 25th percentile of the data.
  2. Calculate the third quartile (Q3): The 75th percentile of the data.
  3. Find the IQR: Subtract Q1 from Q3.

IQR = Q3 - Q1

  4. Determine the outlier boundaries:
    • Lower Bound: Q1 - 1.5 × IQR
    • Upper Bound: Q3 + 1.5 × IQR
  5. Identify outliers: Any data point below the lower bound or above the upper bound is an outlier.

Example: In a survey of people’s monthly expenses, if most spend between $500 and $1500 but a few spend over $4000, those high expenses are outliers.

Coding Demo:

import pandas as pd
import numpy as np

# Sample data for monthly expenses: most are between 500 and 1500, two are far higher
data = {
    'Monthly Expenses': [500, 600, 700, 800, 900, 1000, 1200, 1500, 4000, 5000]
}

# Create DataFrame
df = pd.DataFrame(data)

# Calculate Q1 and Q3
Q1 = df['Monthly Expenses'].quantile(0.25)
Q3 = df['Monthly Expenses'].quantile(0.75)
IQR = Q3 - Q1

# Calculate bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df['Monthly Expenses'] < lower_bound) | (df['Monthly Expenses'] > upper_bound)]

print("Identified Outliers using IQR:")
print(outliers)
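
Once outliers are identified, one common way to handle them is to drop them. The following is a minimal sketch of that step, written as a standalone helper; the function name `remove_iqr_outliers`, the `k` parameter, and the sample values are assumptions made for illustration.

```python
import numpy as np

def remove_iqr_outliers(values, k=1.5):
    """Return only the values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Boolean mask: True for values that fall within the bounds
    mask = (arr >= lower) & (arr <= upper)
    return arr[mask]

# Hypothetical monthly expenses with two unusually high values
expenses = [500, 600, 700, 800, 900, 1000, 1200, 1500, 4000, 5000]
cleaned = remove_iqr_outliers(expenses)
print(cleaned)
```

Dropping rows is only one option. Capping and transforming, covered in later sections, keep every row and change the values instead.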

3. Modified Z-Score

What it does: The modified Z-score is similar to the Z-score but is more robust against outliers. It uses the median and the median absolute deviation (MAD) to calculate how far a data point is from the median.

How it works:

  1. Calculate the median of the dataset.
  2. Compute the absolute deviation from the median for each data point.
  3. Calculate the median of those absolute deviations (MAD).
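  4. Calculate the modified Z-score for each data point, using the formula below.
  5. Identify outliers: Any data point whose absolute modified Z-score exceeds a chosen threshold is an outlier. The coding demo uses 3; 3.5 is another commonly used cutoff.

Modified Z-score = 0.6745 × (X - Median) / MAD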

X: This represents the specific data point you are evaluating. It could be any individual observation in your dataset.

Median: This is the middle value of your dataset when it is sorted.

MAD (Median Absolute Deviation): This is a measure of variability that quantifies how much the values in a dataset deviate from the median.

0.6745: This constant is a scaling factor used to make the modified Z-score comparable to the standard normal distribution.
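
As a quick check on where that constant comes from, the sketch below uses Python's standard-library `statistics.NormalDist` to compute the 75th percentile of the standard normal distribution. For normally distributed data, the MAD is about 0.6745 times the standard deviation, so multiplying by 0.6745 and dividing by MAD puts the score on roughly the same scale as a regular Z-score.

```python
from statistics import NormalDist

# 75th percentile of the standard normal distribution (mean 0, standard deviation 1)
constant = NormalDist(mu=0, sigma=1).inv_cdf(0.75)
print(round(constant, 4))
```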

Example: In a group of people's daily steps, if most walk between 2000 and 10000 steps but a few walk 30000 steps, those high step counts could be outliers.

Coding Demo:

import pandas as pd
import numpy as np

# Sample data for daily steps
steps_data = {
    'Daily Steps': [2000, 3000, 5000, 7000, 9000, 10000, 15000, 30000]
}

# Create DataFrame
df_steps = pd.DataFrame(steps_data)

# Calculate median and MAD
median = df_steps['Daily Steps'].median()
mad = np.median(np.abs(df_steps['Daily Steps'] - median))

# Calculate modified Z-scores
df_steps['Modified Z'] = 0.6745 * (df_steps['Daily Steps'] - median) / mad

# Identify outliers
outliers_modified = df_steps[np.abs(df_steps['Modified Z']) > 3]

print("\nIdentified Outliers using Modified Z-Score:")
print(outliers_modified)
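
To see why the median-based score is described as more robust, the sketch below compares both scores on a small made-up sample (the values are assumptions; the threshold of 3 is carried over from the demos). A single extreme value inflates the mean and the standard deviation, which shrinks its own ordinary Z-score, while the median and MAD are barely affected.

```python
import statistics

# Hypothetical small sample with one obvious extreme value
values = [10, 11, 12, 11, 10, 12, 11, 100]

mean = statistics.mean(values)
std = statistics.stdev(values)  # sample standard deviation, like pandas .std()
median = statistics.median(values)
mad = statistics.median([abs(v - median) for v in values])

z_scores = [(v - mean) / std for v in values]
modified_z = [0.6745 * (v - median) / mad for v in values]

flagged_by_z = [v for v, z in zip(values, z_scores) if abs(z) > 3]
flagged_by_modified_z = [v for v, m in zip(values, modified_z) if abs(m) > 3]

print("Flagged by Z-score:         ", flagged_by_z)
print("Flagged by modified Z-score:", flagged_by_modified_z)
```

With only eight points, no ordinary Z-score computed with the sample standard deviation can exceed about 2.47, so the threshold of 3 never fires, while the modified score flags 100. Note that if the MAD is zero (more than half the values are identical), the formula divides by zero, so that case needs separate handling.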

4. Box Plot Visualization

What it does: A box plot visually displays the distribution of your data, making it easy to spot outliers. The box represents the interquartile range (IQR), and any points outside the “whiskers” (lower and upper bounds) are considered outliers.

Example: In analyzing the heights of basketball players, you might find that most players fall between 180 cm and 210 cm, but a few exceed 230 cm, clearly visible in a box plot.


Coding Demo:

import matplotlib.pyplot as plt

# Sample data for heights (in cm); 250 lies beyond the upper whisker
heights = [180, 185, 188, 190, 192, 195, 198, 200, 205, 210, 250]

# Create box plot
plt.boxplot(heights)
plt.title('Box Plot of Heights')
plt.ylabel('Height (cm)')
plt.show()

5. Winsorization

What it does: Winsorization involves capping the outlier values to reduce their influence without completely removing them. For example, you might replace extreme high values with the next highest non-outlier value.

Example: In a dataset of home prices, if one home is listed at $10 million while most are under $1 million, you might replace $10 million with the highest non-outlier price to maintain a realistic range.

Coding Demo:

import pandas as pd
import numpy as np

# Winsorization example
data_prices = {
    'Home Prices': [150000, 200000, 250000, 300000, 10000000]  # One extreme outlier
}

# Create DataFrame
df_prices = pd.DataFrame(data_prices)

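# Winsorization: cap outliers at the 95th percentile.
# interpolation='lower' picks an actual data value; with only five rows the default
# linear interpolation would put the cap near 8 million, still close to the outlier.
cap = df_prices['Home Prices'].quantile(0.95, interpolation='lower')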
df_prices['Capped Prices'] = np.where(df_prices['Home Prices'] > cap, cap, df_prices['Home Prices'])

print("\nData After Winsorization:")
print(df_prices)
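
Winsorization is often applied to both tails at once. The sketch below is one way to do that with `np.percentile` and `np.clip`; the 5th and 95th percentile limits and the sample prices are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical home prices: 20 typical values plus one extreme listing
prices = np.array(list(range(150_000, 350_000, 10_000)) + [10_000_000], dtype=float)

# Limits at the 5th and 95th percentiles
lower, upper = np.percentile(prices, [5, 95])

# Values below `lower` are raised to it, values above `upper` are lowered to it
winsorized = np.clip(prices, lower, upper)

print("Limits:", lower, upper)
print("Max before:", prices.max(), "Max after:", winsorized.max())
```

The row count stays the same; only the extreme values move to the limits. If SciPy is available, `scipy.stats.mstats.winsorize` offers a ready-made alternative.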

6. Log Transformation

What it does: Log transformation reduces the effect of extreme values by applying a logarithmic scale to the data. This is particularly useful for positively skewed data.

Example: In analyzing incomes, where most values are clustered around a certain range, log transformation can help normalize the data and make it easier to analyze.


Coding Demo:

import pandas as pd
import numpy as np

# Sample income data
income_data = {
    'Annual Income': [20000, 30000, 50000, 80000, 200000, 1000000]  # Includes a large outlier
}

# Create DataFrame
df_income = pd.DataFrame(income_data)

# Apply log transformation
df_income['Log Income'] = np.log(df_income['Annual Income'])

print("\nData After Log Transformation:")
print(df_income)
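
Two practical details are worth a sketch: `np.log(0)` evaluates to negative infinity, and values on the log scale usually need converting back at some point. Assuming the data is non-negative, `np.log1p` (which computes log(1 + x)) and its inverse `np.expm1` handle both; the income values below are illustrative.

```python
import numpy as np

# Hypothetical incomes, including a zero that plain np.log would map to -inf
incomes = np.array([0, 20_000, 30_000, 50_000, 80_000, 200_000, 1_000_000], dtype=float)

log_incomes = np.log1p(incomes)   # log(1 + x), defined at x = 0
restored = np.expm1(log_incomes)  # inverse transform: exp(y) - 1

print(log_incomes.round(2))
print(np.allclose(restored, incomes))
```

The transform keeps the ordering of the values but compresses the gap between the largest income and the rest.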

Conclusion

Outliers are a natural part of data, but how you handle them can make a big difference in your analysis. By using techniques like Z-score, IQR, modified Z-score, box plots, winsorization, and log transformation, you can effectively manage outliers and improve the accuracy of your insights. Remember, the choice of technique depends on your data's characteristics and the specific context of your analysis.

Tips

  • Always visualize your data before and after handling outliers to understand their impact.
  • Consider the context of your data: sometimes, outliers are valid observations that should be kept for analysis.

Happy Coding ❤️
