Hierarchical Clustering#

Hierarchical clustering is an unsupervised learning technique that groups data points into a nested hierarchy of clusters. Unlike other clustering techniques such as k-means and DBSCAN, hierarchical clustering does not require the number of clusters to be specified in advance. Instead, it builds a tree-like structure, called a dendrogram, that represents the hierarchical relationships between the clusters.

How Does Hierarchical Clustering Work?#

There are two main types of hierarchical clustering: agglomerative and divisive. In agglomerative (bottom-up) hierarchical clustering, we start with each data point as its own cluster and repeatedly merge the closest pair of clusters until a desired number of clusters, or some other stopping criterion, is reached. In divisive (top-down) hierarchical clustering, we start with all data points in a single cluster and repeatedly split clusters until each data point is in its own cluster or the stopping criterion is met. Agglomerative clustering is the more common of the two in practice, and a minimal sketch of its merge process is shown below.
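To make the agglomerative process concrete, here is a minimal sketch using SciPy's linkage and dendrogram functions (the tiny dataset is an assumption made purely for illustration): it builds the full merge hierarchy and plots it as a dendrogram.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# A tiny illustrative dataset (assumed for this sketch)
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Agglomerative linkage: each point starts as its own cluster and the
# closest pair of clusters is merged at every step ('ward' merges the
# pair that least increases within-cluster variance)
Z = linkage(X, method='ward')

# The dendrogram shows the full merge hierarchy; cutting it at a chosen
# height yields a flat clustering
dendrogram(Z)
plt.xlabel('Sample index')
plt.ylabel('Merge distance')
plt.show()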

Advantages#

The main advantage of hierarchical clustering is that it provides a flexible and intuitive way to explore and visualize the clustering structure of the data. Moreover, hierarchical clustering can handle clusters with non-convex shapes, which is a common limitation of other clustering algorithms such as k-means; a rough illustration follows.
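As a rough illustration of the non-convex case (the make_moons dataset and the single-linkage setting are assumptions chosen for this sketch, not part of the original example), single-linkage agglomerative clustering can separate two interleaving half-moons that k-means typically splits incorrectly:

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import AgglomerativeClustering

# Two interleaving half-moons: a classic non-convex clustering problem
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# Single linkage measures cluster distance by the closest pair of
# points, so it can follow the curved shape of each moon
moon_labels = AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=moon_labels)
plt.show()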

Disadvantages#

However, hierarchical clustering also has some disadvantages, such as its sensitivity to the scale of the features and its high computational cost on large datasets (the standard agglomerative algorithm works from the pairwise distance matrix, so memory grows quadratically with the number of samples). In addition, the choice of the linkage method, which determines how the distance between clusters is measured, can significantly affect the results, making it difficult to select the best method for a particular dataset; one common way to address both issues is sketched below.
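Here is a small sketch of how one might address both issues; standardizing the features with StandardScaler and comparing linkage methods via the silhouette score are illustrative choices for this sketch, assumptions rather than prescriptions.

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Sample data (assumed for this sketch)
X, _ = make_blobs(n_samples=150, n_features=2, centers=3, random_state=0)

# Standardize features so no single feature dominates the distance
# computation (addresses the sensitivity to scale)
X_scaled = StandardScaler().fit_transform(X)

# The linkage parameter controls how the distance between two clusters
# is defined; comparing a few options with an internal metric such as
# the silhouette score is one way to pick among them
for linkage_method in ('ward', 'complete', 'average', 'single'):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage_method).fit_predict(X_scaled)
    print(linkage_method, silhouette_score(X_scaled, labels))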

Example Code#

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Generate sample data for clustering
X, y = make_blobs(n_samples=150, n_features=2, centers=3, random_state=0)

# Create an instance of AgglomerativeClustering
agg_cluster = AgglomerativeClustering(n_clusters=3)

# Fit the model to the data
agg_cluster.fit(X)

# Extract the cluster labels
labels = agg_cluster.labels_

# Plot the data points with different colors based on the cluster labels
plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()

Conclusion#

Despite its limitations, hierarchical clustering remains a useful tool for exploring and analyzing the clustering structure of the data, especially when used in combination with other clustering techniques. To implement hierarchical clustering in Python, the scikit-learn library provides the AgglomerativeClustering class. This class can be used in the same way as other clustering algorithms, such as k-means and DBSCAN, by calling the fit method on the dataset and then reading the cluster labels from the labels_ attribute.

In conclusion, hierarchical clustering is a powerful unsupervised learning technique that groups data points into a nested hierarchy of clusters. While it is sensitive to feature scaling and becomes expensive on large datasets, it is a valuable tool for exploratory analysis, particularly when the number of clusters is not known in advance.

Where to Learn More#

I’ve covered Hierarchical Clustering in-depth in the following course:

Cluster Analysis and Unsupervised Machine Learning in Python