Learning clustering algorithms by doing

Learning clustering algorithms by running simple experiments
Clustering is widely used across application domains to understand the overall variability structure of the data, be it clustering of customers, students, genes, ECG signals, images, or stock prices. Even in the era of deep learning, clustering can give you invaluable insight into the distribution of the data, which can affect your design decisions. Here, we give an overview of five clustering algorithms along with links to Kaggle notebooks where we have done some simple experiments.
K-Means: It is the simplest of all the algorithms and works based on distances. It is a partitioning algorithm. Some of its issues are a) we need to know the value of 'k' and b) it is affected by outliers. Some experiments are done in the following Kaggle notebook.
https://www.kaggle.com/saptarsi/kmeans-dbs-sg
We start by explaining how k-means works, then describe with an example why scaling is required. We also discuss how the Sum of Squared Errors, plotted as an elbow plot, is used to find the optimal value of k. A simple example with the 'iris' dataset follows; a minimal code sketch is also given after the video links below.
Corresponding Videos:
Theory: https://www.youtube.com/watch?v=FFhmNy0W4tE
Hands-on: https://www.youtube.com/watch?v=w0CTqS_KFjY
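A minimal sketch of the idea (not the notebook's exact code): k-means on the iris data with standard scaling and an elbow plot of the sum of squared errors. The range of k values is illustrative.

```python
# K-means with scaling and an elbow plot on the iris data
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)  # scaling keeps one feature from dominating the distances

sse = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    sse.append(km.inertia_)  # inertia_ is the within-cluster sum of squared errors

plt.plot(k_values, sse, marker='o')
plt.xlabel('k')
plt.ylabel('SSE')
plt.title('Elbow plot for choosing k')
plt.show()
```

The "elbow" of the curve, where the SSE stops dropping sharply, suggests a reasonable value of k.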
K-Medoid: K-Means is not robust against outliers. We therefore move from the mean to the median; when we have more than one attribute and want an overall median-like representative, that is called a medoid. Some experiments are done in the following notebook.
https://www.kaggle.com/saptarsi/kmedoid-sg
The main thing we discuss is how to measure the quality of clusters when labels are unavailable (Silhouette Width) and when they are available (Purity). A short example with extreme observations (outliers) comparing K-Means and K-Medoid follows, and a code sketch of the two measures appears after the video links below.
Corresponding Videos:
Theory: https://www.youtube.com/watch?v=q3plVFIgjGQ
Hands-on: https://www.youtube.com/watch?v=L1ykPtlonAU
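A rough sketch of the two quality measures discussed above, using the KMedoids implementation from the scikit-learn-extra package (an assumption; the notebook may use a different implementation). Silhouette width needs no ground truth, while purity does.

```python
# Silhouette width (no labels needed) and purity (labels needed) for k-medoids
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score, confusion_matrix
from sklearn_extra.cluster import KMedoids  # assumes scikit-learn-extra is installed

X, y = load_iris(return_X_y=True)
labels = KMedoids(n_clusters=3, random_state=0).fit_predict(X)

# Silhouette width: internal measure, usable when true labels are unavailable
print('Silhouette width:', silhouette_score(X, labels))

# Purity: external measure; for each cluster take the count of its most
# frequent true class, sum over clusters, and divide by the number of points
cm = confusion_matrix(y, labels)
purity = np.max(cm, axis=0).sum() / cm.sum()
print('Purity:', purity)
```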
DBSCAN: The most promising thing about DBSCAN is that it identifies some points as noise points, points which, if included in the clustering exercise, would have distorted the overall clusters. DBSCAN, K-Medoid and K-Means are compared on the moons and concentric circles datasets. The following notebook contains simple experiments, and a small code sketch follows the video links below.
https://www.kaggle.com/saptarsi/dbscan-sg
Corresponding Videos:
Theory: https://www.youtube.com/watch?v=RZg51WH7caQ
Hands-on: https://www.youtube.com/watch?v=A8OnRH42hWE
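A minimal sketch comparing DBSCAN with k-means on the two-moons data; the eps and min_samples values are illustrative, not tuned.

```python
# DBSCAN vs k-means on the two-moons data; DBSCAN marks noise points with -1
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print('DBSCAN clusters found:', n_clusters)
print('Noise points:', np.sum(db_labels == -1))
```

Plotting the two label sets side by side shows that k-means cuts the moons in half, while DBSCAN follows their shape and sets aside the stray points as noise.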
Agglomerative Clustering: The above three are partition-based clustering methods, while this is a hierarchical clustering method that reveals more about the overall structure through a dendrogram. One of the important considerations in hierarchical clustering is which distance to use when merging two clusters: the minimum one, the maximum one, or the average one. Various experiments are done on the concentric circles, half moons, Gaussian data and the seeds datasets; a code sketch of the linkage choices follows the video links below. A notebook containing some of the experiments is as follows:
https://www.kaggle.com/saptarsi/agglomarative-clustering-sg
Corresponding Videos:
Theory: https://www.youtube.com/watch?v=RZg51WH7caQ
Hands-on: https://www.youtube.com/watch?v=z7jXh_RzL_k
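A small sketch of the linkage choices mentioned above (single = minimum distance, complete = maximum, average = average) and a dendrogram, using SciPy on toy Gaussian data rather than the notebook's datasets.

```python
# Hierarchical clustering with three linkage criteria and a dendrogram
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(30, 2)),   # two well-separated Gaussian blobs
               rng.normal(5, 1, size=(30, 2))])

for method in ('single', 'complete', 'average'):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into two clusters
    print(method, 'cluster sizes:', np.bincount(labels)[1:])

# The dendrogram shows the full merge structure of the data
dendrogram(linkage(X, method='average'))
plt.title('Dendrogram (average linkage)')
plt.show()
```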
Expectation Maximization: It is implemented here through Gaussian Mixture Modelling. Basically, each cluster can be thought of as coming from a different multivariate normal distribution. This is a soft clustering method, unlike the previous four methods. Below is a notebook containing some simple experiments, followed by a short code sketch.
https://www.kaggle.com/saptarsi/gmm-sg
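A minimal sketch of soft clustering with a Gaussian mixture on iris: each point gets a probability of membership in every cluster rather than a single hard label.

```python
# Gaussian Mixture Model: hard labels vs soft membership probabilities
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

hard_labels = gmm.predict(X)        # hard assignment, comparable to k-means
soft_labels = gmm.predict_proba(X)  # soft assignment: one probability per cluster
print(soft_labels[:3].round(3))     # first three rows of membership probabilities
```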
Corresponding Videos: