Outliers Detection in PySpark #1

Over the last few months, while working on my graduation project, I had the chance to learn a lot about Data Quality, Anomaly Detection and especially Outliers Detection.
In this series, I will explain what outliers are, the difference between novelty detection and outliers detection, and how we can detect outliers using different algorithms.
All code will be implemented in PySpark.

In this first part, I will cover what Data Quality, Anomaly Detection and Outliers Detection are, and how outliers detection differs from novelty detection.

Data Quality

Data quality refers to the overall utility of a dataset(s) as a function of its ability to be easily processed and analyzed for other uses, usually by a database, data warehouse, or data analytics system.

Informatica - What is Data Quality?

Many people don’t realize the importance of Data Quality in Big Data.
A good example of this is Machine Learning projects that use whatever data happens to be available. Training ML models on bad data will decrease their accuracy and quality.

The data we ingest and extract every day must be validated and checked. Quality data is useful data, and to be of high quality, the data must be consistent and unambiguous.
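To make "validated and checked" a bit more concrete, here is a minimal sketch of a rule-based check. It is written in plain Python rather than PySpark so that it stays self-contained, and everything in it is an assumption made for illustration: the `records` sample, the `validate` helper and the three rules (unique id, non-missing country, plausible age) are hypothetical and not taken from any real pipeline.

```python
# Hypothetical records, made up for illustration only.
records = [
    {"id": 1, "age": 34, "country": "FR"},
    {"id": 2, "age": -5, "country": "FR"},   # impossible age
    {"id": 3, "age": 41, "country": None},   # missing value
    {"id": 3, "age": 41, "country": "DE"},   # same id as the previous row
]


def validate(rows):
    """Return (id, problem) pairs for every rule a row breaks."""
    problems = []
    seen_ids = set()
    for row in rows:
        # Consistency: an id should identify exactly one row.
        if row["id"] in seen_ids:
            problems.append((row["id"], "duplicate id"))
        seen_ids.add(row["id"])
        # Completeness: the country must be present.
        if row["country"] is None:
            problems.append((row["id"], "missing country"))
        # Plausibility: the age must fall in an assumed valid range.
        if not 0 <= row["age"] <= 120:
            problems.append((row["id"], "age out of range"))
    return problems


issues = validate(records)
print(issues)
```

Real projects would usually express such rules as filters or aggregations on a DataFrame, but the idea is the same: state what "consistent and unambiguous" means for your data, then check every record against it.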

Anomaly Detection

In Data Mining, Anomaly Detection is the identification of dataset elements that do not follow an expected pattern and/or don’t look like most of that dataset’s elements.
There are 3 main types of anomalies:

Point anomalies: If a single dataset element/row/instance is flagged as an anomaly with respect to the other elements/rows/instances in that dataset, then it’s called a point anomaly.
Contextual anomalies: These are anomalies that are only anomalous in a specific context. For example, in time series data, a value can be normal at one time of the year and anomalous at another.
Collective anomalies: These are anomalies made up of multiple data elements, which may not be anomalies individually. For instance, a set of consecutive actions on a computer can be flagged as an attack.
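As an illustration of the first type, here is a minimal plain-Python sketch that flags point anomalies with a modified z-score built on the median and the median absolute deviation (MAD). The choice of this score, the `point_anomalies` helper, the 3.5 threshold and the sample `readings` are all assumptions made for the example. It is only one of many possible ways to flag a point anomaly.

```python
from statistics import median


def point_anomalies(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that is not
    # dragged around by the extreme values we are trying to find.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure deviations against
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data is roughly normally distributed.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical sensor readings with one obviously odd value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
flagged = point_anomalies(readings)
print(flagged)
```

Contextual and collective anomalies need more than this: the first requires the context (for example the season) as an input, and the second requires scoring groups of elements instead of single values.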

Outliers Detection

The term outliers detection is used more in statistics than the term anomaly detection, although most of the time the two refer to the same thing.
You often hear of outliers in the context of ML models and datasets, where you need to remove them.

Outliers Detection vs Novelty Detection

Outliers detection assumes that the given data already contains outliers, so it tries to fit the regions where the training data is the most concentrated and ignores the observations that deviate from them.
Novelty detection assumes that the training data does not contain outliers. It’s used when we want to decide whether new, unseen observations are outliers.
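The difference can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the algorithms covered later in the series: the outlier detector is fitted on contaminated data and therefore uses robust statistics (median and MAD), while the novelty detector trusts its training data and uses the mean and standard deviation. The function names, thresholds and sample values are all hypothetical.

```python
from statistics import mean, median, stdev


def fit_outlier_detector(data, threshold=3.5):
    """Outlier detection: the data may be contaminated, so fit robustly."""
    med = median(data)
    mad = median(abs(x - med) for x in data)
    if mad == 0:
        raise ValueError("data has no spread around its median")
    return lambda x: 0.6745 * abs(x - med) / mad > threshold


def fit_novelty_detector(clean_data, threshold=3.0):
    """Novelty detection: the training data is assumed to be clean."""
    mu, sigma = mean(clean_data), stdev(clean_data)
    return lambda x: abs(x - mu) / sigma > threshold


# Outlier detection: fit on the data and flag elements of that same data.
contaminated = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
is_outlier = fit_outlier_detector(contaminated)
outliers = [x for x in contaminated if is_outlier(x)]

# Novelty detection: fit on clean data, then score new observations.
clean = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
is_novel = fit_novelty_detector(clean)
novelties = [x for x in [10.4, 12.0] if is_novel(x)]

print(outliers, novelties)
```

The thing to notice is where the flagged values come from: `outliers` is a subset of the training data itself, while `novelties` only ever contains observations the detector did not see during fitting.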

Conclusion

Data Quality is a very important aspect of Big Data and should be taken very seriously. Ensuring the quality of your data will benefit you in many ways. Detecting anomalies as soon as possible is a must nowadays.
In the next part, I’ll present a couple of algorithms used to detect outliers.
