This repository contains the code and example implementations for my Medium article on building k-Nearest Neighbors from scratch and evaluating it with k-Fold Cross-Validation, which is also built from scratch.
For the PyPI package version, please refer to this repository
Neighbors (Image Source: Freepik)
k-Nearest Neighbors, kNN for short, is a very simple but powerful technique for making predictions. The principle behind kNN is to use the “most similar historical examples to the new data.” The basic steps, with a small code sketch after the list, are:
- Choose a value for k
- Find the distance from the new point to each record in the training data
- Get the k nearest neighbors
- Make predictions:
  - For a classification problem, the new data point belongs to the class that most of its neighbors belong to.
  - For a regression problem, the prediction can be the average or weighted average of the labels of the k nearest neighbors.
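A minimal sketch of these steps for the classification case, assuming Euclidean distance and majority voting (the function names are illustrative and not taken from the article's code):

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, new_point, k=3):
    # 1. distance from the new point to every training record
    distances = [(euclidean_distance(row, new_point), label)
                 for row, label in zip(train_X, train_y)]
    # 2. keep the k nearest neighbors
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
    # 3. classification: majority vote over the neighbors' labels
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# toy example: two features per record, binary labels
train_X = [[1.0, 2.0], [2.0, 3.0], [8.0, 8.0], [9.0, 7.0]]
train_y = [0, 0, 1, 1]
print(knn_predict(train_X, train_y, [8.5, 7.5], k=3))  # -> 1
```

For regression, the last line of `knn_predict` would instead average the neighbors' labels.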
Finally, we evaluate the model using the k-Fold Cross-Validation technique.
This technique involves randomly dividing the dataset into k groups, or folds, of approximately equal size. One fold is kept for testing and the model is trained on the remaining k-1 folds; this is repeated so that each fold serves as the test set once.
5-fold cross-validation. The blue block is the fold used for testing. (Image Source: sklearn documentation)
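A from-scratch sketch of this procedure, assuming the data fits in plain Python lists; `predict_fn` is a hypothetical stand-in for any per-record classifier, such as the `knn_predict` sketch above:

```python
import random

def k_fold_split(n_records, k=5, seed=42):
    # shuffle the record indices, then cut them into k folds of roughly equal size
    random.seed(seed)
    indices = list(range(n_records))
    random.shuffle(indices)
    fold_size = n_records // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # hand out any leftover records to the first folds
    for j, idx in enumerate(indices[k * fold_size:]):
        folds[j].append(idx)
    return folds

def cross_validate(X, y, predict_fn, k=5):
    # each fold takes one turn as the test set; the remaining folds form the training set
    scores = []
    for fold in k_fold_split(len(X), k):
        test_idx = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in test_idx]
        train_y = [y[i] for i in range(len(X)) if i not in test_idx]
        correct = sum(predict_fn(train_X, train_y, X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return scores
```

The mean of the returned per-fold accuracies gives the overall cross-validation score.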
The datasets used here are taken from the UCI Machine Learning Repository
The Car Evaluation and Breast Cancer datasets contain text attributes. Since the classifier cannot run on text attributes, we need to convert the categorical input features to numeric values. This is done using the LabelEncoder from sklearn.preprocessing. LabelEncoder can be applied to a DataFrame or a list, and it encodes labels with values between 0 and n_classes-1.
Applying LabelEncoder to an entire DataFrame:

```python
from sklearn import preprocessing
import pandas as pd

# data holds the raw records with categorical (text) columns
df = pd.DataFrame(data)
# encode every column of the DataFrame to integer labels
df = df.apply(preprocessing.LabelEncoder().fit_transform)
```
Applying LabelEncoder to a list:

```python
# encode a single list of categorical values to integer labels
labels = preprocessing.LabelEncoder().fit_transform(inputList)
```
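For instance (the values below are illustrative, not taken from the repository's datasets):

```python
from sklearn import preprocessing

inputList = ["low", "med", "high", "med"]
labels = preprocessing.LabelEncoder().fit_transform(inputList)
# classes are ordered alphabetically: high -> 0, low -> 1, med -> 2
print(labels)  # [1 2 0 2]
```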
- More info on Cross-Validation can be found here
- kNN
- kFold Cross Validation