2. Z-SCORE


Using the Z-score method, we can find out how many standard deviations a value lies away from the mean.


The figure on the left shows the area under the normal curve and how much of that area each standard deviation range covers.
* About 68% of the data points lie within ±1 standard deviation of the mean.
* About 95% of the data points lie within ±2 standard deviations of the mean.
* About 99.7% of the data points lie within ±3 standard deviations of the mean.
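The percentages above can be checked numerically. The following is a minimal sketch, assuming a perfectly normal distribution; the helper name `area_within` is made up for illustration and uses only the standard library's `math.erf`.

```python
import math

def area_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    # For a normal distribution, P(|Z| <= k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k} standard deviation(s): {area_within(k):.4f}")
```

Real data is rarely exactly normal, so these percentages are a guideline rather than a guarantee.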


Z-score formula

$$\text{Zscore} = \frac{X - \text{Mean}}{\text{StandardDeviation}}$$
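As a quick worked example of the formula, here is a sketch on a small made-up sample (the numbers are illustrative and not taken from the house prices data). It assumes the population standard deviation, which is what `numpy`'s `std` computes by default.

```python
import numpy as np

# Made-up sample, chosen so that the mean and standard deviation are round numbers.
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = values.mean()   # 5.0
sd = values.std()      # 2.0 (population standard deviation, ddof=0)

# Apply the formula to every value at once.
z_scores = (values - mean) / sd
print(z_scores)
```

The value 9 is 4 units above the mean of 5, and with a standard deviation of 2 its z-score is 2, so it sits two standard deviations above the mean.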


If the absolute value of a data point's z-score is greater than 3, the point lies outside the range that covers about 99.7% of normally distributed data. This indicates that the value is quite different from the other values, so it is treated as an outlier.

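import pandas as pd
import numpy as np

train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

def Zscore_outlier(df):
    # Keep the list local so repeated calls do not accumulate old results.
    out = []
    m = np.mean(df)
    sd = np.std(df)  # population standard deviation (ddof=0)
    for i in df:
        z = (i - m) / sd
        # Flag values more than 3 standard deviations from the mean.
        if np.abs(z) > 3:
            out.append(i)
    print("Outliers:", out)
    return out

outliers = Zscore_outlier(train['LotArea'])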
Outliers: [50271, 159000, 215245, 164660, 53107, 70761, 53227, 46589, 115149, 53504, 45600, 63887, 57200
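The same check can also be written without an explicit loop. The following is a sketch, not the method used above: the function name `zscore_outliers_vectorized`, the `threshold` parameter, and the sample data are all made up for illustration. It passes `ddof=0` to pandas' `std` so the standard deviation matches what `np.std` computes.

```python
import pandas as pd

def zscore_outliers_vectorized(series, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    # ddof=0 gives the population standard deviation, matching np.std.
    z = (series - series.mean()) / series.std(ddof=0)
    return series[z.abs() > threshold]

# Made-up sample: twenty ordinary values and one extreme value.
sample = pd.Series([8, 9, 10, 11, 12] * 4 + [100])
result = zscore_outliers_vectorized(sample)
print("Outliers:", result.tolist())
```

With the data loaded earlier, the call would look like `zscore_outliers_vectorized(train['LotArea'])`. Returning a filtered Series instead of a list keeps the original index, which makes it easier to locate the flagged rows.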
