Learning clustering algorithms by doing

Learning clustering algorithms by running simple experiments
Clustering is widely used across application domains to understand the overall variability structure of the data, be it clustering of customers, students, genes, ECG signals, images, or stock prices. Even in the era of deep learning, clustering can give you invaluable insight into the distribution of the data, which can affect your design decisions. Here, we give an overview of five clustering algorithms along with links to Kaggle notebooks where we have done some simple experiments.
K-Means: It is the simplest of all the algorithms and works based on distances. It is a partitioning algorithm. Some of its issues are a) we need to know the value of 'k' and b) it is affected by outliers. Some experiments are done in the following Kaggle notebook.
https://www.kaggle.com/saptarsi/kmeans-dbs-sg
We start by explaining how k-means works, then describe with an example why scaling is required. We also discuss how the Sum of Squared Errors, plotted as an elbow plot, is used to find the optimal value of k. A simple example with the 'iris' dataset follows; a minimal code sketch is also given after the video links below.
Corresponding Videos:
Theory: https://www.youtube.com/watch?v=FFhmNy0W4tE
Hands-on: https://www.youtube.com/watch?v=w0CTqS_KFjY
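A minimal sketch of the idea (not the notebook's exact code): k-means on the iris data with standard scaling and an elbow plot of the sum of squared errors. The range of k values is illustrative.

```python
# K-means with scaling and an elbow plot on the iris data
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)  # scaling keeps one feature from dominating the distances

sse = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    sse.append(km.inertia_)  # inertia_ is the within-cluster sum of squared errors

plt.plot(k_values, sse, marker='o')
plt.xlabel('k')
plt.ylabel('SSE')
plt.title('Elbow plot for choosing k')
plt.show()
```

The "elbow" of the curve, where the SSE stops dropping sharply, suggests a reasonable value of k.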
K-Medoid: K-Means is not robust against outliers. We therefore move from the mean to the median; when we have more than one attribute and want an overall median-like representative, that is called a medoid. Some experiments are done in the following notebook.
https://www.kaggle.com/saptarsi/kmedoid-sg
The main thing we discuss is how to measure the quality of clusters when labels are unavailable (Silhouette Width) and when they are available (Purity). A short example with extreme observations (outliers) comparing K-Means and K-Medoid follows, and a code sketch of the two measures appears after the video links below.
Corresponding Videos:
Theory: https://www.youtube.com/watch?v=q3plVFIgjGQ
Hands-on: https://www.youtube.com/watch?v=L1ykPtlonAU
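A rough sketch of the two quality measures discussed above, using the KMedoids implementation from the scikit-learn-extra package (an assumption; the notebook may use a different implementation). Silhouette width needs no ground truth, while purity does.

```python
# Silhouette width (no labels needed) and purity (labels needed) for k-medoids
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score, confusion_matrix
from sklearn_extra.cluster import KMedoids  # assumes scikit-learn-extra is installed

X, y = load_iris(return_X_y=True)
labels = KMedoids(n_clusters=3, random_state=0).fit_predict(X)

# Silhouette width: internal measure, usable when true labels are unavailable
print('Silhouette width:', silhouette_score(X, labels))

# Purity: external measure; for each cluster take the count of its most
# frequent true class, sum over clusters, and divide by the number of points
cm = confusion_matrix(y, labels)
purity = np.max(cm, axis=0).sum() / cm.sum()
print('Purity:', purity)
```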
DBSCAN: The most promising thing about DBSCAN is that it identifies some points as noise points, points which, if included in the clustering exercise, would have distorted the overall clusters. DBSCAN, K-Medoid and K-Means are compared on the moons and concentric circles datasets. The following notebook contains simple experiments, and a small code sketch follows the video links below.
https://www.kaggle.com/saptarsi/dbscan-sg
Corresponding Videos:
Theory: https://www.youtube.com/watch?v=RZg51WH7caQ
Hands-on: https://www.youtube.com/watch?v=A8OnRH42hWE
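A minimal sketch comparing DBSCAN with k-means on the two-moons data; the eps and min_samples values are illustrative, not tuned.

```python
# DBSCAN vs k-means on the two-moons data; DBSCAN marks noise points with -1
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print('DBSCAN clusters found:', n_clusters)
print('Noise points:', np.sum(db_labels == -1))
```

Plotting the two label sets side by side shows that k-means cuts the moons in half, while DBSCAN follows their shape and sets aside the stray points as noise.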
Agglomerative Clustering: The above three are partition-based clustering methods, while this is a hierarchical clustering method that reveals more about the overall structure through a dendrogram. One of the important considerations in hierarchical clustering is which distance to use when merging two clusters: the minimum one, the maximum one, or the average one. Various experiments are done on the concentric circles, half moons, Gaussian data and the seeds datasets; a code sketch of the linkage choices follows the video links below. A notebook containing some of the experiments is as follows:
https://www.kaggle.com/saptarsi/agglomarative-clustering-sg
Corresponding Videos:
Theory: https://www.youtube.com/watch?v=RZg51WH7caQ
Hands-on: https://www.youtube.com/watch?v=z7jXh_RzL_k
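A small sketch of the linkage choices mentioned above (single = minimum distance, complete = maximum, average = average) and a dendrogram, using SciPy on toy Gaussian data rather than the notebook's datasets.

```python
# Hierarchical clustering with three linkage criteria and a dendrogram
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(30, 2)),   # two well-separated Gaussian blobs
               rng.normal(5, 1, size=(30, 2))])

for method in ('single', 'complete', 'average'):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into two clusters
    print(method, 'cluster sizes:', np.bincount(labels)[1:])

# The dendrogram shows the full merge structure of the data
dendrogram(linkage(X, method='average'))
plt.title('Dendrogram (average linkage)')
plt.show()
```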
Expectation Maximization: It is implemented here through Gaussian Mixture Modelling. Basically, each cluster can be thought of as coming from a different multivariate normal distribution. This is a soft clustering method, unlike the previous four methods. Below is a notebook containing some simple experiments, followed by a short code sketch.
https://www.kaggle.com/saptarsi/gmm-sg
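A minimal sketch of soft clustering with a Gaussian mixture on iris: each point gets a probability of membership in every cluster rather than a single hard label.

```python
# Gaussian Mixture Model: hard labels vs soft membership probabilities
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

hard_labels = gmm.predict(X)        # hard assignment, comparable to k-means
soft_labels = gmm.predict_proba(X)  # soft assignment: one probability per cluster
print(soft_labels[:3].round(3))     # first three rows of membership probabilities
```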
Corresponding Videos: