Skip to content

Outliers Detection in PySpark #3 – K-means

In parts #1 and #2 of the “Outliers Detection in PySpark” series, I talked about Anomaly Detection, Outliers Detection and the interquartile range (boxplot) method.
In this third and last part, I will talk about how one can use the popular K-means clustering algorithm to detect outliers.

K-means

K-means is one of the easiest and most popular unsupervised algorithms in Machine Learning for Clustering. It aims to cluster data points into different K clusters in an iterative process.

Most importantly, the algorithm is parametric. It needs K, the number of clusters and, sometimes, the maximum number of iterations so that it doesn’t run forever.

How it works

  1. Generate K random centroids.
  2. Associate each data point to the nearest centroid.
  3. Re-generate the K centroids.
  4. Repeat steps 2 and 3 until nothing changes (or other conditions, since it doesn’t guarantee an optimal solution).
K-means illustration by David Runyan
K-means Illustration – Introduction to Clustering (David Runyan)

Using K-means to detect outliers

Although it’s not the best of solutions, K-means can actually be used to detect outliers. The idea is very simple: After constructing the clusters, we flag points that are far as outliers. In other words, we consider points that are far from the centroid of the cluster they belong to (distance-wise) as outliers.

Above all, this technique is also parametric and expects you to provide the fraction of outliers the data contains, which isn’t always possible.

Here are the steps:

  1. Run the K-means on all the data points.
  2. For each point:
    1. Predict the cluster they belong to.
    2. Calculate the distance between the point and the centroid of that cluster.
  3. Based on a given fraction, flag outliers.

Illustration

The following images are generated using this website.

K-means - outliers detection illustration 1
Outliers detection – Illustration #1
K-means - outliers detection illustration 2
Outliers detection – Illustration #2

PySpark Implementation

Firstly, we’ll run K-means on all the data points and predict the clusters they belong to:

Secondly, calculate the distance between each point and the centroid of the cluster it belongs to:

Lastly, calculate the number of outliers based on the given fraction and flag outliers:

To sum up, data quality is an important side of data mining overall. I focused on Anomaly detection (Outliers detection specifically) because that’s what I was working on these months and I saw how important it was in real use cases, such as banking systems.

In these posts I introduced what these terms meant and explain 2 easy and popular algorithms: Interquartile Range and K-means clustering.
Hope you enjoyed the series!

Published inAlgorithmsData Mining

Leave a Reply

avatar
  Subscribe  
Notify of