Assignment 1
Due Date
 October 13th
Problem Statement
Your task is to experiment and compare 2 variants of the kNearest Neighbor algorithm using the Wine Quality Data Set.
This assignment can be completed by teams of \(2\) students.
Your Work

[30 pts] Correct implementations of two knn versions. You can use a single Python
class
, taking different parameters to indicate the version in the constructor method. Feel free to use the notebook provided in class as starter code.vanilla
: this is the method presented in classball
: same principle asvanilla
, but instead of using the majority of \(k\) nearest neighbors, use the majority vote of all examples \(\vec{x}\) such that \(\text{distance}(\vec{x}',\vec{x}) \leq \epsilon\). You can treat \(\epsilon\) as a hyperparameter to this method, instead of \(k\)

[50 pts] Perform hyperparameter selection using the
trainvalidtest
splits as shown in class. It is important to note, that the test set cannot be used during the hyperparameter selection process, use the validation set instead. Experiment with at least 10 different values for \(k\) and \(\epsilon\), and the following distance functions:euclidean
,cityblock
,cosine
,chebyshev
,sqeuclidean
(this list can be different if you go for extra points). If you are using thecdist
function, refer to the documentation. find the best \(k\) and distance function for
vanilla
 find the best \(\epsilon\) and distance function for
ball
 find the best \(k\) and distance function for
For each of the two cases above (vanilla and ball), generate two heatmaps. One showing accuracies on the training set, and the other showing accuracies on the validation set. You will use this information to select the best pair of hyperparameters on each case. Each axis of the heatmap should represent a different hyperparameter value. The heatmap can be created using matplotlib (see example).
 [20 points] After selecting the best hyperparameters, report the accuracy of your best classifier on the test set. At this point, you can actually merge the training and validation set to estimate the test set accuracy.
Dataset
The dataset you will use for your experiments is the Wine Quality Data Set available at the UCI ML Repository. You may want to inspect the features and decide to preprocess (e.g. normalization) before applying machine learning. The repository provides two datasets, related to red and white vinho verde wine samples, from the north of Portugal, The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009]).
For this assignment you will only work with one of both datasets. Pick either red or white data.
Bonus Points
[25 pts] If you want to walk the extramile. Bonus points will be awarded if you manage to implement a faster version of the kNN classifier by using a kdtree data structure. Scikitlearn provides an implementation of kdtrees that you can use. In this case, please be aware that the distance functions may be a bit different from the ones available for cdist
, check DistanceMetric.
Submission Instructions
You will submit your work as a Jupyter Notebook file via Gradescope. You can download a .ipynb
file from Google's Colab. Make sure all cells have been properly run before you save it, and the output is being shown.
You are strongly encouraged to add cells
(sections) to your notebook, containing further explanations for your code, and more importantly, for discussing and analyzing the results you obtained.