K-Means Clustering#

K-Means Clustering is a widely used unsupervised machine learning algorithm for partitioning a set of n data points into k clusters, where k is a user-specified number. The algorithm works by iteratively assigning each data point to one of the k clusters, with the goal of minimizing the sum of squared distances between the data points and their respective cluster centroids. The cluster centroids are calculated as the mean of the data points in each cluster, and are updated after each iteration of the algorithm. The process continues until either the cluster assignments no longer change or a maximum number of iterations is reached.
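The objective described above is commonly written as the within-cluster sum of squares. The notation here is an assumption introduced for illustration and does not come from the original text: C_j denotes the set of points assigned to cluster j, and \mu_j denotes that cluster's centroid.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Each assignment step and each centroid update can only lower J or leave it unchanged, which is why the iteration eventually stops changing the assignments.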

The K-Means Clustering Algorithm#

The steps of the K-Means Clustering algorithm are as follows:

  1. Initialization: Choose the number of clusters k and randomly initialize k centroids.

  2. Assignment: Assign each data point to the closest centroid based on the Euclidean distance between the data point and each centroid.

  3. Recalculation: Recalculate the centroids by taking the mean of all the data points assigned to each cluster.

  4. Repeat steps 2 and 3 until the cluster assignments no longer change or a maximum number of iterations is reached.
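The steps above can be sketched in plain NumPy. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, its parameters, and the choice to seed the centroids with k randomly selected data points are assumptions made for this sketch.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means sketch (hypothetical helper, not a library API)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: use k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once the cluster assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recalculation: each centroid becomes the mean of its points;
        # an empty cluster keeps its previous centroid.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X_demo, k=2)
print(labels)
print(centroids)
```

Keeping the previous centroid for an empty cluster is one simple policy among several; library implementations may handle that case differently.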

Advantages#

  1. Simplicity: The K-Means Clustering algorithm is straightforward and easy to implement.

  2. Speed: Each iteration costs roughly O(n·k·d) time for n data points, k clusters, and d dimensions, so the algorithm scales well to large datasets.

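  3. Versatility: K-Means Clustering can be applied to any data that can be represented as numerical feature vectors. Categorical features must first be encoded numerically, because a mean is not defined for raw categories; variants such as k-modes handle categorical data directly.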

Disadvantages#

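  1. Sensitivity to Initialization: The K-Means Clustering algorithm converges to a local minimum that depends on the initial choice of centroids, so it can produce different results each time it is run. Smarter seeding such as k-means++ and keeping the best of several restarts reduce this effect.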

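  2. Assumes spherical clusters: Because each point is assigned by Euclidean distance to a centroid, the algorithm implicitly assumes roughly spherical clusters of similar size, which may not hold for elongated or irregularly shaped groups.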

  3. Determining k: The number of clusters k must be specified in advance, which can be difficult to determine for complex data sets. Heuristics such as the elbow method or the silhouette score can help guide the choice.
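The three limitations above can be observed with a short scikit-learn experiment. This is a rough sketch under assumptions: the datasets, seeds, the range of k values, and the use of the adjusted Rand index and silhouette score as yardsticks are choices made for illustration, and the exact numbers printed will depend on the library version.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=0.60)

# 1. Sensitivity to initialization: a single random start per run,
#    so the final inertia may differ from seed to seed.
seed_inertias = []
for seed in range(5):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    seed_inertias.append(km.inertia_)
print("inertia per seed:", [round(v, 2) for v in seed_inertias])

# 2. Non-spherical clusters: two interleaved half-moons. Agreement with the
#    true grouping is measured by the adjusted Rand index (1.0 = perfect).
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)
moon_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
moons_ari = adjusted_rand_score(y_moons, moon_labels)
print(f"adjusted Rand index on two moons: {moons_ari:.2f}")

# 3. Determining k: compare candidate values of k by mean silhouette score.
silhouettes = {}
for k in range(2, 8):
    k_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, k_labels)
best_k = max(silhouettes, key=silhouettes.get)
print("silhouette by k:", {k: round(s, 3) for k, s in silhouettes.items()})
print("k with the highest silhouette score:", best_k)
```

The silhouette score is only one heuristic; it is worth checking its suggestion against domain knowledge or a plot of the data.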

Example Code#

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate sample data
X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=0.60)

# Plot the generated data
plt.scatter(X[:,0], X[:,1], s=50)
plt.show()

# Fit the KMeans model and get a cluster label for each point.
# n_init is set explicitly because its default differs across scikit-learn versions.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
y_pred = kmeans.fit_predict(X)

# Plot the resulting clusters
plt.scatter(X[:,0], X[:,1], c=y_pred, cmap='viridis')
plt.show()
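After fitting, the model exposes the learned centroids and the final value of the objective. The snippet below is a small sketch that regenerates the same sample data so it runs on its own; the variable names are chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as in the example above.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=0.60)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# One row per cluster, one column per feature.
print("centroids:\n", kmeans.cluster_centers_)
# Sum of squared distances from each point to its assigned centroid.
print("inertia:", kmeans.inertia_)
# Number of iterations used by the best run.
print("iterations:", kmeans.n_iter_)
```

The `inertia_` attribute is the quantity the algorithm minimizes, so it is a convenient number to compare across runs with the same k.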

Conclusion#

K-Means Clustering is a simple and effective unsupervised learning algorithm for partitioning a data set into k clusters. Despite its limitations, K-Means Clustering is widely used due to its simplicity, speed, and versatility. It is a useful technique for exploring patterns and relationships in data and for discovering meaningful subgroups within a dataset.

Where to Learn More#

I’ve covered K-Means Clustering in depth in the following course:

Cluster Analysis and Unsupervised Machine Learning in Python

And we apply K-Means in the following courses:

Unsupervised Deep Learning in Python

Machine Learning and AI: Support Vector Machines in Python