K-Nearest Neighbors#
K-Nearest Neighbors (KNN) is a simple yet powerful machine learning algorithm used for both classification and regression problems. It is based on the idea that data points close to each other in feature space are likely to be similar. "Training" consists of storing a set of labeled data points; to classify a new, unlabeled point, the algorithm finds the k labeled points nearest to it and assigns it the most common label among those k neighbors.
How Does K-Nearest Neighbors Work?#
In KNN, the distance between data points is used to measure their similarity. The most commonly used distance measure is Euclidean distance, calculated as the square root of the sum of the squared differences between corresponding feature values.
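For example, for feature vectors a = (1, 2, 3) and b = (4, 6, 3), the distance is sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = 5. A minimal NumPy sketch (the vectors are made up for illustration):

import numpy as np
# Two example feature vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
# Euclidean distance: square root of the sum of squared differences
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # 5.0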
Once the distances have been computed, the k nearest neighbors are found by sorting the training points by distance and selecting the k with the smallest distances. The new data point's label is then determined by a majority vote among those k neighbors.
KNN can be used for both classification and regression problems. In a classification problem, the majority vote of the k nearest neighbors is used to assign the new data point a class label. In a regression problem, the mean or median of the k nearest neighbors is used to predict a continuous value for the new data point.
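To make the mechanics concrete, here is a from-scratch sketch of KNN prediction; the function name and the task argument are illustrative choices, not part of any library:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k training points with the smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_labels = y_train[nearest]
    if task == "classification":
        # Classification: majority vote among the k nearest neighbors
        return Counter(neighbor_labels).most_common(1)[0][0]
    # Regression: mean of the k nearest neighbors' target values
    return neighbor_labels.mean()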
Advantages of KNN#
Simple and easy to understand: KNN is based on an intuitive idea, which makes it straightforward to understand and implement.
No assumptions about the data distribution: Unlike parametric algorithms such as linear or logistic regression, KNN makes no assumptions about how the data is distributed, which lets it capture non-linear patterns.
Versatile: KNN can be used for both classification and regression problems, making it a versatile algorithm that can be applied to a wide range of use cases.
Fast training time: Training is essentially instantaneous, since KNN only needs to store the training set; it is a "lazy" learner that defers all computation to prediction time.
Disadvantages of KNN#
High memory usage: KNN requires storing the entire training dataset, which can result in high memory usage for large datasets.
Slow prediction time: Each prediction requires computing the distances between the new data point and all of the training data points, which can be time-consuming for large datasets.
Sensitive to irrelevant features: KNN is sensitive to irrelevant or differently scaled features, since they can dominate the distance calculation and lead to incorrect predictions (feature scaling, sketched after this list, is a common mitigation).
High dimensionality can be a problem: KNN can perform poorly on high-dimensional datasets; as the number of features grows, distances between points become less informative (the "curse of dimensionality"), and accuracy suffers.
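A common mitigation for the distance-related issues above is to standardize the features before fitting, so that no single feature dominates the distance calculation. A sketch using scikit-learn's make_pipeline and StandardScaler (reusing the X_train and y_train names from the classification example below):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Standardize each feature to zero mean and unit variance, then apply KNN
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# knn_scaled.fit(X_train, y_train) then works like any other scikit-learn classifier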
Example Code for Classification#
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load the iris dataset
iris = load_iris()
# Convert the dataset into a pandas dataframe
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["Target"] = iris.target
# Assign the features and target
X = df.drop("Target", axis=1)
y = df["Target"]
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the KNN Classifier model with k=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# Predict the target on the test set
y_pred = knn.predict(X_test)
# Calculate the accuracy score
acc = accuracy_score(y_test, y_pred)
print("Accuracy Score:", acc)
Example Code for Regression#
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
# Generating a random dataset
np.random.seed(0)
X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - np.random.rand(16))  # add noise to every 5th target value
# Splitting the dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fitting the KNN regression model
regressor = KNeighborsRegressor(n_neighbors=3)
regressor.fit(X_train, y_train)
# Predicting the target values for the test data
y_pred = regressor.predict(X_test)
# Plot the results, sorting the test points by X so the prediction line is drawn left to right
order = X_test.ravel().argsort()
plt.scatter(X_test, y_test, color='red', label='Actual')
plt.plot(X_test[order], y_pred[order], color='blue', label='Predicted')
plt.title("KNN Regression (k = 3)")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
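Beyond inspecting the plot, the quality of the fit can be quantified with standard regression metrics on the test set, for example:

from sklearn.metrics import mean_squared_error, r2_score
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))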
Conclusion#
KNN is a simple and powerful algorithm that can be applied to both classification and regression problems. It has many advantages, such as being simple and easy to understand, making no assumptions about the data distribution, and being versatile. However, it also has some disadvantages, such as high memory usage, slow prediction times, sensitivity to irrelevant features, and potential issues in high-dimensional datasets. Despite its disadvantages, KNN is still a popular algorithm in the field of machine learning, and can be a good choice for many use cases, especially when working with small to medium-sized datasets.
Where to Learn More#
We cover KNN in depth in the following course: