How to deal with outliers

* Outliers badly affect the mean and standard deviation of the dataset, which can produce statistically erroneous results.
* They increase the error variance and reduce the power of statistical tests.
* If the outliers are non-randomly distributed, they can decrease normality.
* Most machine learning algorithms do not work well in the presence of outliers, so it is desirable to detect and treat them.
* They can also violate the basic assumptions of regression, ANOVA, and other statistical models.


For all these reasons we must be careful about outliers and treat them before building a statistical or machine learning model. The following techniques are commonly used to deal with outliers:


1. Deleting observations

2. Transforming values

3. Imputation

4. Separately treating


DELETING OBSERVATIONS:

We delete outlier values if they are due to data entry errors or data processing errors, or if the outlier observations are very few in number. We can also trim the data at both ends to remove outliers. However, deleting observations is not a good idea when the dataset is small.

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
train = pd.read_csv('../input/cost-of-living/cost-of-living-2018.csv')
sns.boxplot(train['Cost of Living Index'])
plt.title("Box Plot before outlier removing")
plt.show()
def drop_outliers(df, field_name):
    # Compute the IQR fences once, before any rows are removed,
    # so the second drop does not use percentiles of the mutated data
    q1, q3 = np.percentile(df[field_name], [25, 75])
    iqr = 1.5 * (q3 - q1)
    df.drop(df[df[field_name] > (q3 + iqr)].index, inplace=True)
    df.drop(df[df[field_name] < (q1 - iqr)].index, inplace=True)
drop_outliers(train, 'Cost of Living Index')
sns.boxplot(train['Cost of Living Index'])
plt.title("Box Plot after outlier removing")
plt.show()

TRANSFORMING VALUES:


Transforming variables can also reduce the influence of outliers, since the transformed values shrink the variation caused by extreme values. Common transformations include:


1. Scaling

2. Log transformation

3. Cube Root Normalization

4. Box-Cox transformation


* These techniques compress the values in the dataset into a smaller range.
* If the data has many extreme values or is skewed, transformation helps make the data closer to normal.
* However, these techniques do not always give the best results.
* There is no loss of data with these methods.
* Among these methods, the Box-Cox transformation often gives the best result.

#Scaling
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn import preprocessing
train = pd.read_csv('../input/cost-of-living/cost-of-living-2018.csv')
plt.hist(train['Cost of Living Index'])
plt.title("Histogram before Scalling")
plt.show()
scaler = preprocessing.StandardScaler()
train['Cost of Living Index'] = scaler.fit_transform(train['Cost of Living Index'].values.reshape(-1,1))
plt.hist(train['Cost of Living Index'])
plt.title("Histogram after Scalling")
plt.show()
#Log Transformation
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
train = pd.read_csv('../input/cost-of-living/cost-of-living-2018.csv')
sns.histplot(train['Cost of Living Index'], kde=True)  # distplot is removed in recent seaborn
plt.title("Distribution plot before Log transformation")
sns.despine()
plt.show()
train['Cost of Living Index'] = np.log(train['Cost of Living Index'])
sns.histplot(train['Cost of Living Index'], kde=True)
plt.title("Distribution plot after Log transformation")
sns.despine()
plt.show()
#cube root Transformation
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
train = pd.read_csv('../input/titanic/train.csv')
plt.hist(train['Age'])
plt.title("Histogram before cube root Transformation")
plt.show()
train['Age'] = (train['Age']**(1/3))
plt.hist(train['Age'])
plt.title("Histogram after cube root Transformation")
plt.show()
#Box-Cox transformation
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from scipy import stats
train = pd.read_csv('../input/cost-of-living/cost-of-living-2018.csv')
sns.boxplot(train['Rent Index'])
plt.title("Box Plot before Box-Cox transformation")
plt.show()
# boxcox requires strictly positive input; with lmbda=None SciPy fits lambda itself
train['Rent Index'], fitted_lambda = stats.boxcox(train['Rent Index'])
sns.boxplot(train['Rent Index'])
plt.title("Box Plot after Box-Cox transformation")
plt.show()

IMPUTATION


As with imputation of missing values, we can also impute outliers, using the mean, the median, or zero. Since we are imputing rather than deleting, there is no loss of data. The median is usually most appropriate because it is not affected by outliers.

#mean imputation
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
train = pd.read_csv('../input/titanic/train.csv')
sns.boxplot(train['Age'])
plt.title("Box Plot before mean imputation")
plt.show()
q1 = train['Age'].quantile(0.25)
q3 = train['Age'].quantile(0.75)
iqr = q3-q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
m = np.mean(train['Age'])
# Replace values outside the IQR fences with the mean
train.loc[(train['Age'] < Lower_tail) | (train['Age'] > Upper_tail), 'Age'] = m
sns.boxplot(train['Age'])
plt.title("Box Plot after mean imputation")
plt.show()   
#median imputation
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
train = pd.read_csv('../input/titanic/train.csv')
sns.boxplot(train['Age'])
plt.title("Box Plot before median imputation")
plt.show()
q1 = train['Age'].quantile(0.25)
q3 = train['Age'].quantile(0.75)
iqr = q3-q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
med = train['Age'].median()  # pandas skips NaNs; np.median would return NaN here
# Replace values outside the IQR fences with the median
train.loc[(train['Age'] < Lower_tail) | (train['Age'] > Upper_tail), 'Age'] = med
sns.boxplot(train['Age'])
plt.title("Box Plot after median imputation")
plt.show()          
#Zero value imputation
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
train = pd.read_csv('../input/titanic/train.csv')
sns.boxplot(train['Age'])
plt.title("Box Plot before Zero value imputation")
plt.show()
q1 = train['Age'].quantile(0.25)
q3 = train['Age'].quantile(0.75)
iqr = q3-q1
Lower_tail = q1 - 1.5 * iqr
Upper_tail = q3 + 1.5 * iqr
# Replace values outside the IQR fences with zero
train.loc[(train['Age'] < Lower_tail) | (train['Age'] > Upper_tail), 'Age'] = 0
sns.boxplot(train['Age'])
plt.title("Box Plot after Zero value imputation")
plt.show()            

SEPARATELY TREATING


If there are a significant number of outliers and the dataset is small, we should treat them separately in the statistical model. One approach is to treat the two groups as different populations: build an individual model for each group and then combine the outputs. However, this technique becomes tedious when the dataset is large.
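The group-wise approach above can be sketched as follows. This is an illustrative example on synthetic data (the IQR-based split and the simple linear fits are assumptions for the sketch, not the kernel's exact method):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: a linear trend plus a handful of points from a shifted regime
x = rng.uniform(0, 10, 200)
y = 3 * x + rng.normal(0, 1, 200)
y[:10] += 100  # plant a small group of extreme observations

df = pd.DataFrame({'x': x, 'y': y})

# Split the data into two groups using the usual 1.5 * IQR fences on the target
q1, q3 = df['y'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df['y'] < q1 - 1.5 * iqr) | (df['y'] > q3 + 1.5 * iqr)
inliers, outliers = df[~mask], df[mask]

# Fit a separate simple linear model to each group; at prediction time,
# each observation would be routed to the model of the regime it belongs to
slope_in, intercept_in = np.polyfit(inliers['x'], inliers['y'], 1)
slope_out, intercept_out = np.polyfit(outliers['x'], outliers['y'], 1)
```

Fitting per-group models keeps the extreme observations from distorting the main model while still using their information, at the cost of maintaining two models.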


CONCLUSION


1. The median is the best measure of central tendency when the data contains outliers or is skewed.
2. The Winsorization method (percentile capping) is often a better outlier treatment technique than the others.
3. Median imputation removes the influence of outliers without discarding any rows.
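Winsorization / percentile capping, mentioned above but not demonstrated earlier, can be sketched as follows on synthetic data (the 5th/95th percentile caps are an illustrative choice, not a fixed rule):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.normal(50, 10, 1000))
s.iloc[:5] = [200.0, 250.0, -100.0, 300.0, -150.0]  # plant extreme values

# Winsorization / percentile capping: clip values to chosen percentiles
# instead of dropping them, so no rows are lost
lower, upper = s.quantile([0.05, 0.95])
capped = s.clip(lower, upper)
```

Unlike deletion, every observation is retained; the extremes are simply pulled in to the cap values, which limits their leverage on the mean and variance.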


Outliers are one of the major problems in machine learning: if you neglect them, your model's performance can suffer. In this kernel, I have tried to cover the main topics related to outliers, outlier detection, and outlier treatment techniques.


Please note that some of the techniques mentioned in this kernel may not give the best result every time, so be careful when you detect or impute outliers.

