Category: Data Mining

Posts

When I was looking for K-means use cases, I found out about color quantization, a very interesting one. I implemented it in Python and was wondering whether it would be as easy to implement in ML.NET. All the code is available in this GitHub repository.

What is color quantization

Color quantization is the use of quantization, a lossy compression technique, in color spaces in order to reduce the number of unique colors in an image.
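To make the idea concrete, here is a minimal sketch of color quantization with K-means, using only the Python standard library. It is not the code from the repository mentioned above: the function names `quantize_colors` and `nearest`, the farthest-point initialization and the fixed iteration count are all assumptions chosen to keep the example short and deterministic.

```python
def nearest(pixel, palette):
    """Return the index of the palette color closest to pixel (squared Euclidean distance)."""
    return min(
        range(len(palette)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(pixel, palette[i])),
    )


def quantize_colors(pixels, k, iterations=10):
    """Replace every RGB pixel by the nearest of k palette colors found by K-means.

    Hypothetical helper for illustration; `pixels` is a list of (r, g, b) tuples.
    """
    # Assumed initialization: start with the first pixel, then repeatedly add
    # the pixel that is farthest from the colors chosen so far.
    palette = [pixels[0]]
    while len(palette) < k:
        palette.append(
            max(
                pixels,
                key=lambda p: min(
                    sum((a - b) ** 2 for a, b in zip(p, c)) for c in palette
                ),
            )
        )

    for _ in range(iterations):
        # Assignment step: group pixels by their nearest palette color.
        clusters = [[] for _ in range(k)]
        for p in pixels:
            clusters[nearest(p, palette)].append(p)
        # Update step: move each palette color to the mean of its group
        # (an empty group keeps its previous color).
        palette = [
            tuple(sum(channel) / len(members) for channel in zip(*members))
            if members
            else old
            for members, old in zip(clusters, palette)
        ]

    # Round the palette to integer RGB values and recolor the image.
    palette = [tuple(round(c) for c in color) for color in palette]
    return [palette[nearest(p, palette)] for p in pixels]


# Four pixels: two shades of red and two shades of blue, reduced to 2 colors.
image = [(250, 0, 0), (255, 10, 5), (0, 0, 250), (5, 10, 255)]
quantized = quantize_colors(image, k=2)
```

After the call, `quantized` has the same length as `image` but contains only two distinct colors: both reds collapse to one palette entry and both blues to the other. On a real image, a library implementation of K-means would be a more practical choice than this hand-rolled loop.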
In parts #1 and #2 of the “Outliers Detection in PySpark” series, I talked about Anomaly Detection, Outliers Detection and the interquartile range (boxplot) method. In this third and last part, I will talk about how one can use the popular K-means clustering algorithm to detect outliers.

K-means

K-means is one of the easiest and most popular unsupervised Machine Learning algorithms for clustering.
In the first part, I talked about what Data Quality, Anomaly Detection and Outliers Detection are, and about the difference between outlier detection and novelty detection. In this part, I will talk about a well-known and easy method for detecting outliers called the Interquartile Range.

Introduction

The Interquartile Range method, also known as IQR, was developed by John Wilder Tukey, an American mathematician best known for the development of the FFT algorithm and the box plot.
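As a rough illustration of the IQR method, the sketch below flags values that fall outside the usual box plot fences, in plain Python rather than PySpark. The function name `iqr_outliers`, the conventional multiplier of 1.5 and the "inclusive" quantile method are assumptions made for this example, not details taken from the post.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; k=1.5 is the conventional choice.
    """
    # Quartiles of the data; the "inclusive" method treats the values as the
    # whole population rather than a sample (an assumption of this sketch).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [10, 12, 11, 13, 12, 11, 14, 95]
outliers = iqr_outliers(data)
print(outliers)  # [95]
```

Here Q1 is 11 and Q3 is 13.25, so the IQR is 2.25 and the fences are 7.625 and 16.625; only 95 lies outside them.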
Over the last few months, while working on my graduation project, I had the chance to learn a lot about Data Quality, Anomaly Detection and especially Outliers Detection. In this series, I will explain what outliers are, the difference between novelty and outlier detection, and how we can detect outliers using different algorithms.
Have you ever wondered how Amazon suggests items to buy when we’re looking at a product (labeled as “Frequently bought together”)? For example, when checking a GPU (e.g. GTX 1080), Amazon will tell you that the GPU, an i7 CPU and RAM are frequently bought together. This makes sense, because a lot of people buy their components together when building a desktop PC.