Learning the clustering algorithms by doing

Dr. Saptarsi Goswami
2 min read · Jul 21, 2020

Learning the clustering algorithms by running simple experiments

Clustering is applied across many domains to understand the overall variability structure of data, whether the objects are customers, students, genes, ECG signals, images, or stock prices. Even in the era of deep learning, clustering can give you invaluable insight into the distribution of the data, which can affect your design decisions. Here we give a brief overview of five clustering algorithms, along with links to the Kaggle notebooks where we have run some experiments.

K-Means: It is the simplest of these algorithms and works on distances. It is a partitioning algorithm. Two known issues are that a) we need to know the value of ‘k’ in advance and b) it is affected by outliers. Some experiments are done in the following Kaggle notebook.

https://www.kaggle.com/saptarsi/kmeans-dbs-sg

We start by explaining how k-means works, then show with an example why scaling is required. We also discuss how the Sum of Squared Errors (SSE), plotted against k in an elbow plot, is used to find a suitable value of k. A simple example with ‘iris’ follows.
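The scaling and elbow steps can be sketched roughly as follows. This is a minimal illustration assuming scikit-learn is available, not a copy of the notebook; the range of k and the random_state are arbitrary choices made here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the four numeric iris features (the species labels are not used).
X = load_iris().data

# Standardise so that every feature contributes comparably to the distances.
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means for k = 1..8 and record the SSE (scikit-learn calls it inertia_).
ks = list(range(1, 9))
sse = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    sse.append(model.inertia_)

# The "elbow" is the k after which the SSE stops dropping sharply.
for k, value in zip(ks, sse):
    print(f"k={k}: SSE={value:.1f}")
```

Plotting ks against sse gives the elbow plot; the printed values are enough to see where the curve flattens.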

Corresponding Videos:

Theory: https://www.youtube.com/watch?v=FFhmNy0W4tE

Hands-on: https://www.youtube.com/watch?v=w0CTqS_KFjY

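K-Medoid: K-Means is not robust against outliers because each cluster centre is a mean. For a single attribute the robust alternative is the median. When we have more than one attribute and want an overall median, we use the medoid: the actual data point with the smallest total distance to all other points in the cluster. Some experiments are done using the following notebook.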

https://www.kaggle.com/saptarsi/kmedoid-sg

The main thing we discuss is how to measure the quality of clusters when labels are unavailable (Silhouette Width) and when they are available (Purity). A short example with extreme observations (outliers) comparing K-Means and K-Medoid follows.
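To see why a medoid resists outliers better than a mean, here is a small sketch on made-up points (the five points are invented for illustration). It only computes the centre of a single group; it is not the full K-Medoid algorithm.

```python
import numpy as np

# Five 2-D points: four close together and one extreme observation.
points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0], [50.0, 50.0]])

# The mean (the centre k-means would use) is dragged towards the outlier.
mean_centre = points.mean(axis=0)

# Pairwise Euclidean distances between all points.
diffs = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))

# The medoid is the data point with the smallest total distance to the others.
medoid = points[dist.sum(axis=1).argmin()]

print("mean  :", mean_centre)
print("medoid:", medoid)
```

The mean lands at (11.2, 11.2), far from every real observation, while the medoid stays on one of the four clustered points.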

Corresponding Videos:

Theory: https://www.youtube.com/watch?v=q3plVFIgjGQ

Hands-on: https://www.youtube.com/watch?v=L1ykPtlonAU
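The two quality measures mentioned above can be sketched as follows, assuming scikit-learn is available. silhouette_score is a scikit-learn function; purity is not part of scikit-learn, so the purity helper below is a hypothetical function written for this illustration. K-Means is used only to produce some cluster labels to score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def purity(true_labels, cluster_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    majority_total = 0
    for cluster in np.unique(cluster_labels):
        members = true_labels[cluster_labels == cluster]
        # Size of the largest true class inside this cluster.
        majority_total += np.unique(members, return_counts=True)[1].max()
    return majority_total / len(true_labels)


iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Silhouette needs only the data and the cluster labels (no ground truth).
print("silhouette:", round(silhouette_score(X, labels), 3))
# Purity needs the true labels as well.
print("purity    :", round(purity(iris.target, labels), 3))
```

Silhouette width lies between -1 and 1, and purity between 0 and 1; higher is better for both.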

DBSCAN: The most appealing feature of DBSCAN is that it identifies some points as noise points, that is, points which would have distorted the overall clusters had they been included in the clustering exercise. DBSCAN, K-Medoid and K-Means are compared on the moons and concentric circles datasets. The following notebook contains simple experiments.

https://www.kaggle.com/saptarsi/dbscan-sg

Corresponding Videos:

Theory: https://www.youtube.com/watch?v=RZg51WH7caQ

Hands-on: https://www.youtube.com/watch?v=A8OnRH42hWE
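A comparison on the moons data might look like the sketch below, assuming scikit-learn is available. The values eps=0.2 and min_samples=5, and the sample size and noise level, are assumptions chosen for this illustration rather than settings taken from the notebook. K-Medoid is left out because it is not part of core scikit-learn.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons with a little Gaussian noise.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# DBSCAN grows clusters from dense regions; points in no dense region get label -1.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-means is asked for two clusters and splits the plane by distance to centres.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

n_clusters = len(set(db_labels) - {-1})   # clusters found, excluding noise
n_noise = int((db_labels == -1).sum())    # points flagged as noise

# Adjusted Rand index against the known moon membership (1.0 is a perfect match).
ari_dbscan = adjusted_rand_score(y_true, db_labels)
ari_kmeans = adjusted_rand_score(y_true, km_labels)

print(f"DBSCAN : {n_clusters} clusters, {n_noise} noise points, ARI={ari_dbscan:.2f}")
print(f"K-Means: ARI={ari_kmeans:.2f}")
```

DBSCAN is not told the number of clusters; it follows the density of the points, which is why it can trace the curved moon shapes.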

Agglomerative Clustering: K-Means and K-Medoid are partition-based and DBSCAN is density-based, while this is a hierarchical clustering method that reveals more about the overall structure through a dendrogram. One of the important considerations in hierarchical clustering is which distance to use when merging two clusters: the minimum one (single linkage), the maximum one (complete linkage) or the average one (average linkage). Various experiments are done on the Concentric Circle, Half Moon, Gaussian and seeds datasets. A notebook containing some of the experiments is as follows:

https://www.kaggle.com/saptarsi/agglomarative-clustering-sg

Corresponding Videos:

Theory: https://www.youtube.com/watch?v=RZg51WH7caQ

Hands-on: https://www.youtube.com/watch?v=z7jXh_RzL_k
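The effect of the merging distance can be sketched on concentric circles as follows, assuming scikit-learn is available. The dataset parameters are assumptions chosen for this illustration, and the dendrogram itself is not drawn here.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric circles; the inner one has half the radius of the outer one.
X, y_true = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

# Try the three merging rules: minimum (single), maximum (complete), average.
scores = {}
for linkage in ("single", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(X)
    # Adjusted Rand index against the known circle membership.
    scores[linkage] = adjusted_rand_score(y_true, labels)

for linkage, score in scores.items():
    print(f"{linkage:>8}: ARI={score:.2f}")
```

Single linkage chains neighbouring points together, so it tends to follow each ring, while complete and average linkage favour compact groups and tend to cut across the rings.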

Expectation Maximization: Here it is applied through Gaussian Mixture Modelling (GMM). Basically, each cluster can be thought of as coming from a different multivariate normal distribution. This is a soft clustering method, in contrast to the four methods above, which make hard assignments. The following notebook contains some simple experiments.

https://www.kaggle.com/saptarsi/gmm-sg

Corresponding Videos:

Theory: https://www.youtube.com/watch?v=LaL0BfvUurs

Hands-on: https://www.youtube.com/watch?v=A8OnRH42hWE
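Soft clustering with a Gaussian mixture can be sketched as follows, assuming scikit-learn is available. The synthetic two-blob data is made up for this illustration and is not the data used in the notebook.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two 2-D Gaussian blobs centred at (0, 0) and (4, 4).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[4.0, 4.0], scale=1.0, size=(100, 2)),
])

# Fit a two-component mixture; the parameters are estimated with the EM algorithm.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# Soft assignment: one membership probability per component for every point.
probs = gmm.predict_proba(X)

# Hard assignment, if needed: the most probable component for each point.
hard = gmm.predict(X)

print("membership probabilities of the first point:", probs[0].round(3))
print("estimated component means:")
print(gmm.means_.round(2))
```

Each row of probs sums to 1, so a point lying between the two blobs gets a split membership instead of a single label.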


Written by Dr. Saptarsi Goswami

Asst Prof — CS Bangabasi Morning Clg, Lead Researcher University of Calcutta Data Science Lab, ODSC Kolkata Chapter Lead
