Skip to content

Outliers Detection in PySpark #2 – Interquartile Range

In the first part, I talked about what Data Quality, Anomaly Detection and Outliers Detection are and what’s the difference between outliers detection and novelty detection.
In this part, I will talk about a very known and easy method to detect outliers called Interquartile Range.

Introduction

The Interquartile Range method, also known as IQR, was developed by John Widler Turky, an American mathematician best known for development of the FFT algorithm and box plot.

IQR is a measure of statistical dispersion, which is equal to the difference between the 75th percentile and the 25th percentile.
In other words:

    \[ IQR = Q3 - Q1 \]

How Interquartile Range works

Representation of the Interquartile Range - Wikipedia
Representation of the Interquartile Range – Wikipedia

IQR is a fairly interpretable method, often used to draw Box Plots and display the distribution of a dataset.

IQR Can also be used to detect outliers in a few easy and straightforward steps:

  1. Calculate the 1st quartile Q1.
  2. Calculate the 3rd quartile Q3.
  3. Calculate IQR = Q3 - Q1.
  4. Calculate the bounds:
    1. Lower bound: Q1 - 1.5 * IQR
    2. Upper bound: Q3 + 1.5 * IQR
  5. Flag any points outside the bounds as suspected outliers.

If you’re wondering why John chose 1.5 * IQR and not some other value, it can be explained in the Wikipedia picture and also this post.

If we look at the picture, we can see that the lower bound and the upper bound correspond, respectively, to -2.698\sigma and 2.698\sigma.
This reminds me of the 68-95-99.7 rule, which says that about 99.7% of points are within 3\sigma and that the rest can be considered as outliers.

PySpark Implementation

This PySpark implementation can work on any Dataframe and all of its columns. The Dataframe should have a unique column used to identify outliers.

Imports

Calculating the bounds

Firstly, we calculate the lower and upper bounds for each column:

Flagging outliers

Then we flag outliers based on the calculated bounds. For each column, we create a new column named columnName_outlier that contains yes or no.

Conclusion

To sum up, IQR or Interquartile Range is a very interpretable method to detect outliers. In this blog post, I explained how the method works and showed how we can easily implement it in PySpark.

Published inAlgorithmsData Mining

1
Leave a Reply

avatar
1 Comment threads
0 Thread replies
0 Followers
 
Most reacted comment
Hottest comment thread
0 Comment authors
Recent comment authors
  Subscribe  
newest oldest most voted
Notify of
trackback

[…] Outliers Detection in PySpark #2 – Interquartile Range […]