K-Nearest Neighbors (KNN) is a straightforward algorithm that stores all available instances and classifies new instances based on a similarity measure. It is versatile and can be used for classification or regression problems. In this article, we will explore how to perform KNN classification using the Scikit-Learn library in Python.
Understanding the KNN Algorithm
The KNN algorithm works by identifying the 'k' closest training examples in the feature space of a query instance and predicting the label by majority vote (for classification) or by averaging neighbor values (for regression). It is a lazy learner: it does not build an explicit model during training, and all the real computation is deferred to prediction time.
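To make the idea concrete, here is a minimal from-scratch sketch of the classification case (the toy data and the function name knn_predict are illustrative, not part of any library): compute the distance from the query to every training point, take the 'k' nearest, and return the majority label.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training example
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: two clusters, labeled 0 and 1
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([1.1, 1.0])))  # → 0
print(knn_predict(X, y, np.array([5.1, 5.0])))  # → 1
```

Note that nothing is "learned" here: the entire training set is carried around and scanned at prediction time, which is exactly what makes KNN lazy.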
Installing Scikit-Learn
Before diving into code, ensure that you have Scikit-Learn installed. You can install it using pip if it’s not already installed:
pip install scikit-learn
Implementation of KNN Classification
Let's implement a basic KNN classifier. We'll use a popular dataset, the Iris dataset, available directly from Scikit-Learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load Iris Dataset
iris = load_iris()
X, y = iris.data, iris.target
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Initialize KNN with k=3
knn = KNeighborsClassifier(n_neighbors=3)
# Train the model
knn.fit(X_train, y_train)
# Predicting
predictions = knn.predict(X_test)
# Evaluate accuracy
print("Accuracy: {:.2f}%".format(accuracy_score(y_test, predictions) * 100))
Explaining the Code
The code begins by importing necessary classes and functions. We then load the Iris dataset and split it into training and testing sets with a 70-30 ratio. The features are standardized using StandardScaler, which helps improve the KNN algorithm's performance by ensuring each feature contributes equally to the distance calculations.
A KNeighborsClassifier object is created with n_neighbors=3, meaning the algorithm will consider the three nearest neighbors to classify a data point. After training the model, we predict the outcomes and calculate accuracy to evaluate the classifier's performance.
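As a quick standalone usage example of this pipeline, a single new flower can be classified by applying the same fitted scaler before calling predict. The measurements below happen to match the first sample in the Iris dataset (a setosa); for simplicity this sketch refits the scaler and classifier on the full dataset.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Refit the same scaler + KNN pipeline on the full Iris data
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_scaled, iris.target)

# Measurements (cm) for one new flower: sepal length/width, petal length/width
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

# Crucially, transform with the ALREADY-FITTED scaler, never fit_transform
prediction = knn.predict(scaler.transform(new_flower))
print(iris.target_names[prediction[0]])  # → setosa
```

Reusing the fitted scaler matters: fitting a fresh scaler on a single sample would map every feature to zero and produce meaningless distances.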
Choosing the Right Value of K
The choice of 'k' is crucial in KNN classification. A small 'k' is sensitive to noise and tends to overfit, whereas a very large 'k' smooths the decision boundary too much and may overlook smaller local patterns. A common practice is to test a range of 'k' values and choose the one that gives the highest accuracy on a validation set.
import matplotlib.pyplot as plt
# Trying different k values
k_range = range(1, 26)
accuracies = []
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
accuracies.append(accuracy_score(y_test, predictions))
# Plotting
plt.plot(k_range, accuracies)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')
plt.title('K-value vs Accuracy')
plt.show()
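Because tuning 'k' directly against the test set risks overfitting the evaluation, a more robust approach is to cross-validate on the training data. The sketch below uses Scikit-Learn's cross_val_score with scaling placed inside a pipeline, so each fold is scaled independently of the others:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# 5-fold cross-validated accuracy for each candidate k
mean_scores = {}
for k in range(1, 26):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    mean_scores[k] = scores.mean()

# Pick the k with the best mean cross-validation accuracy
best_k = max(mean_scores, key=mean_scores.get)
print("Best k:", best_k,
      "CV accuracy: {:.2f}%".format(mean_scores[best_k] * 100))
```

Putting the scaler in the pipeline avoids a subtle leak: fitting the scaler on the full dataset before cross-validating would let information from the held-out folds influence the training folds.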
Pros and Cons of KNN
The principal advantage of KNN is its simplicity and interpretability. It makes no assumptions about the underlying data distribution and works well on small, clean datasets. However, KNN can be slow at prediction time on large datasets, since it stores every training example and must compute distances to all of them. It is also sensitive to irrelevant features and to the scale of the data, which is why preprocessing steps such as feature scaling are important.
Conclusion
K-Nearest Neighbors is a simple yet effective classification technique, especially for problems with small to moderate dataset sizes. With Scikit-Learn, its implementation is straightforward, making it a useful tool for rapid testing and prototyping in your machine learning projects. Always apply proper preprocessing and experiment with different 'k' values to optimize your model's performance.