Classification
This algorithm computes the distances between an input data point and every data point of a training dataset. Then, it yields the label associated with the training data point that is closest to the input. So, to use this algorithm, we need to calculate distances between vectors (or data points), and the Euclidean distance function is a natural way to do so. For two vectors \( x, y \in \mathbb{R}^d \), their Euclidean distance is defined as $$\|x - y\| = \sqrt{\sum_{i=1}^d (x_i - y_i)^2}.$$ For our use case, we can omit the square root to simplify the computation: the square root is monotonically increasing, so the point with the smallest squared distance is also the point with the smallest distance.
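The claim that the square root can be dropped can be checked on a small example. The sketch below uses made-up toy vectors (the names `points` and `query` are illustrative, not part of the tutorial's dataset) and compares the nearest point found with and without the square root.

```python
import numpy as np

# Toy data: three 2-D points and one query (illustrative values only).
points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
query = np.array([1.0, 2.0])

# Squared Euclidean distances: sum of squared coordinate differences.
squared = np.sum(np.square(points - query), axis=1)
# True Euclidean distances: square root of the squared distances.
true_dist = np.sqrt(squared)

# Both rankings select the same nearest point.
nearest_squared = int(np.argmin(squared))
nearest_true = int(np.argmin(true_dist))
print(nearest_squared, nearest_true)
```

Both indices agree, which is all the nearest neighbor search needs.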
The digits dataset from sklearn exposes the images through the images attribute and the labels through the target attribute.
from sklearn.datasets import load_digits
# loading the digits dataset
digits = load_digits()
Now, let us flatten the images to compute the distances between their vector representations.
# get the number of images in the datasets
num_samples = len(digits.images)
# reshape the matrix from (num_samples x 8 x 8) to (num_samples x 64)
data = digits.images.reshape((num_samples, -1))
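To make the flattening step concrete, here is a minimal sketch on two made-up 2 x 2 "images" (illustrative values, not taken from the digits dataset). Passing -1 asks NumPy to infer the remaining dimension.

```python
import numpy as np

# Two fake 2x2 "images" (illustrative values only).
images = np.array([[[1, 2], [3, 4]],
                   [[5, 6], [7, 8]]])

# -1 lets NumPy infer the second dimension: 2 * 2 = 4 values per image.
flat = images.reshape((len(images), -1))
print(flat.shape)
print(flat)
```

Each row of `flat` is one image read row by row, which is the vector representation used for the distance computation.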
Then, we split the data into training and testing datasets. For that, we will use the train_test_split function from sklearn.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=.25, shuffle=False)
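With shuffle=False the samples keep their original order, so the test set is simply the tail of the data. The sketch below imitates that behaviour with plain slicing on toy arrays. It is an approximation under the stated assumption about how the split sizes are rounded, not sklearn's actual implementation, and the toy_ names are hypothetical.

```python
import math
import numpy as np

# Eight toy samples with labels (illustrative values only).
toy_X = np.arange(16).reshape((8, 2))
toy_y = np.arange(8)

# Assumption: without shuffling, the test set is the last
# ceil(test_size * n) samples and the training set is everything before it.
test_size = 0.25
n_test = math.ceil(test_size * len(toy_X))
n_train = len(toy_X) - n_test

toy_X_train, toy_X_test = toy_X[:n_train], toy_X[n_train:]
toy_y_train, toy_y_test = toy_y[:n_train], toy_y[n_train:]
print(toy_y_train, toy_y_test)
```

Because nothing is shuffled, the result is the same on every run.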
2) Compute Euclidean distances:- As mentioned above, we will drop the square root because it does not change which training point is closest.
import numpy as np
def euclidean_dist(x, y):
    return np.sum(np.square(x - y))
3) Find the nearest neighbor:- calculate the distances between the input to be classified and every point of the training dataset, then return the index of the closest one.
def find_NN(x, dataset):
    distances = [euclidean_dist(x, y) for y in dataset]
    return np.argmin(distances)
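The same search can also be written without a Python loop. The sketch below is one possible vectorized variant that relies on NumPy broadcasting. The name `find_nn_vectorized` and the toy arrays are hypothetical and only serve as illustration.

```python
import numpy as np

def find_nn_vectorized(x, dataset):
    # Broadcasting subtracts x from every row of dataset at once.
    diffs = dataset - x
    # Squared Euclidean distance of each row to x.
    distances = np.sum(np.square(diffs), axis=1)
    # Index of the row with the smallest distance.
    return np.argmin(distances)

toy_dataset = np.array([[0.0, 0.0], [5.0, 5.0], [2.0, 1.0]])
toy_input = np.array([2.0, 2.0])
print(find_nn_vectorized(toy_input, toy_dataset))
```

It should return the same index as the loop-based version and is usually faster on large datasets, since the work happens inside NumPy.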
4) Classify:- return the label of the nearest neighbor.
def classify(x, dataset):
    idx = find_NN(x, dataset)
    # y_train is the global array of training labels; dataset must be X_train
    return y_train[idx]
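Since the classifier above reads the labels from a global variable, it only works when the dataset passed in is the training set. A possible variant, sketched here under the hypothetical name `classify_with_labels`, takes the labels as an explicit argument so that the points and their labels always travel together.

```python
import numpy as np

def classify_with_labels(x, dataset, labels):
    # Squared Euclidean distance from x to every row of dataset.
    distances = np.sum(np.square(dataset - x), axis=1)
    # Label of the closest training point.
    return labels[np.argmin(distances)]

# Toy training set with two points and their labels (illustrative only).
toy_points = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_labels = np.array([3, 7])
print(classify_with_labels(np.array([9.0, 8.0]), toy_points, toy_labels))
```

With the tutorial's data, the equivalent call would pass X_train and y_train as the last two arguments.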
sample_test = 210
test_digit = X_test[sample_test]
test_label = y_test[sample_test]
pred = classify(test_digit, X_train)
print(f"Predicted digit {pred}, Correct digit {test_label}" )
import matplotlib.pyplot as plt
plt.imshow(test_digit.reshape((8,8)), cmap=plt.cm.gray)
plt.show()
Output: Predicted digit 2, Correct digit 2
errors = [y_test[idx] != classify(X_test[idx], X_train)
          for idx in range(len(y_test))]
print(f'The error percentage is {np.sum(errors) / len(y_test) * 100:.2f}%')
Output: The error percentage is 3.78%
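The error percentage is the number of misclassified test samples divided by the number of test samples, times 100. A minimal sketch of that arithmetic on made-up labels (these are illustrative values, not results from the digits dataset):

```python
import numpy as np

# Illustrative labels only; not results from the digits dataset.
toy_true = np.array([2, 7, 1, 8])
toy_pred = np.array([2, 7, 4, 8])

# Elementwise comparison yields a boolean array; True marks an error.
toy_errors = toy_true != toy_pred
# The mean of a boolean array is the fraction of True entries.
error_pct = np.mean(toy_errors) * 100
print(f"The error percentage is {error_pct:.2f}%")
```

One wrong prediction out of four gives an error of 25.00%.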