
Assignment 1

Due Date

  • October 13th

Problem Statement

Your task is to experiment with and compare two variants of the k-Nearest Neighbor (k-NN) algorithm using the Wine Quality Data Set.

This assignment can be completed by teams of \(2\) students.

Your Work

  • [30 pts] Correct implementations of two k-NN versions. You can use a single Python class, taking a parameter in the constructor method to indicate the version. Feel free to use the notebook provided in class as starter code; a minimal sketch of such a class is also shown after this list.

    • vanilla: this is the method presented in class
    • ball: same principle as vanilla, but instead of taking the majority vote of the \(k\) nearest neighbors, take the majority vote of all training examples \(\vec{x}\) such that \(\text{distance}(\vec{x}',\vec{x}) \leq \epsilon\), where \(\vec{x}'\) is the query point. You can treat \(\epsilon\) as a hyperparameter of this method, in place of \(k\)
  • [50 pts] Perform hyperparameter selection using the train-valid-test splits as shown in class. Note that the test set cannot be used during the hyperparameter selection process; use the validation set instead. Experiment with at least 10 different values for \(k\) and \(\epsilon\), and with the following distance functions: euclidean, cityblock, cosine, chebyshev, sqeuclidean (this list can be different if you go for the extra points). If you are using the cdist function, refer to the documentation.

    • find the best \(k\) and distance function for vanilla
    • find the best \(\epsilon\) and distance function for ball

For each of the two cases above (vanilla and ball), generate two heatmaps: one showing accuracies on the training set, and the other showing accuracies on the validation set. You will use this information to select the best pair of hyperparameters in each case. Each axis of the heatmap should correspond to one hyperparameter (e.g., the distance function on one axis and the values of \(k\) or \(\epsilon\) on the other). The heatmaps can be created using matplotlib (see example); a sketch of this kind of grid search is also shown after this list.

  • [20 pts] After selecting the best hyperparameters, report the accuracy of your best classifier on the test set. At this point, you may merge the training and validation sets to train the final classifier used to estimate the test set accuracy.
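To make the expected interface concrete, here is a minimal sketch of a single class covering both variants. The class name `KNNClassifier`, the `mode`/`k`/`epsilon`/`metric` parameter names, the default values, and the empty-ball fallback are illustrative assumptions, not part of the starter notebook.

```python
import numpy as np
from collections import Counter
from scipy.spatial.distance import cdist


class KNNClassifier:
    """k-NN with two modes: 'vanilla' (k nearest neighbors) and 'ball'
    (all neighbors within distance epsilon of the query).
    Parameter names and defaults are illustrative placeholders."""

    def __init__(self, mode="vanilla", k=5, epsilon=1.0, metric="euclidean"):
        self.mode = mode
        self.k = k
        self.epsilon = epsilon
        self.metric = metric

    def fit(self, X, y):
        # k-NN is a lazy learner: training just stores the data.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        # Pairwise distances: rows are query points, columns are training points.
        D = cdist(np.asarray(X, dtype=float), self.X_, metric=self.metric)
        preds = []
        for row in D:
            if self.mode == "vanilla":
                # Indices of the k closest training points.
                idx = np.argsort(row)[: self.k]
            else:
                # 'ball': every training point within epsilon of the query;
                # fall back to the single nearest point if the ball is empty.
                idx = np.where(row <= self.epsilon)[0]
                if idx.size == 0:
                    idx = np.array([np.argmin(row)])
            # Majority vote among the selected neighbors' labels.
            preds.append(Counter(self.y_[idx]).most_common(1)[0][0])
        return np.array(preds)
```

With this interface, `KNNClassifier(mode="ball", epsilon=0.5, metric="cityblock")` would select the ball variant, and accuracy on a split can be computed as `np.mean(clf.predict(X_valid) == y_valid)`.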
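And here is a minimal sketch of the kind of grid search and heatmaps described above, for the vanilla variant. It assumes the `KNNClassifier` sketch above and precomputed `X_train`/`y_train`/`X_valid`/`y_valid` splits; the grid values and plotting details are placeholders, and the ball variant works the same way with \(\epsilon\) in place of \(k\).

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder grid; swap ks for a list of epsilon values to tune the ball variant.
metrics = ["euclidean", "cityblock", "cosine", "chebyshev", "sqeuclidean"]
ks = list(range(1, 21, 2))  # 10 values of k

train_acc = np.zeros((len(metrics), len(ks)))
valid_acc = np.zeros((len(metrics), len(ks)))

for i, metric in enumerate(metrics):
    for j, k in enumerate(ks):
        clf = KNNClassifier(mode="vanilla", k=k, metric=metric).fit(X_train, y_train)
        train_acc[i, j] = np.mean(clf.predict(X_train) == y_train)
        valid_acc[i, j] = np.mean(clf.predict(X_valid) == y_valid)

# One heatmap per split: rows are distance functions, columns are values of k.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, acc, split in zip(axes, [train_acc, valid_acc], ["training", "validation"]):
    im = ax.imshow(acc, cmap="viridis")
    ax.set_xticks(range(len(ks)))
    ax.set_xticklabels(ks)
    ax.set_yticks(range(len(metrics)))
    ax.set_yticklabels(metrics)
    ax.set_xlabel("k")
    ax.set_title(f"Accuracy on the {split} set")
    fig.colorbar(im, ax=ax)
plt.show()

# Pick the best (distance, k) pair using validation accuracy only.
best_i, best_j = np.unravel_index(np.argmax(valid_acc), valid_acc.shape)
print("best:", metrics[best_i], ks[best_j], valid_acc[best_i, best_j])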

Dataset

The dataset you will use for your experiments is the Wine Quality Data Set, available at the UCI ML Repository. You may want to inspect the features and decide whether to preprocess them (e.g. normalization) before applying machine learning. The repository provides two datasets, related to red and white vinho verde wine samples from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009]).

For this assignment you will work with only one of the two datasets: pick either the red or the white wine data. A short loading and preprocessing sketch is shown below.
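For example, a minimal loading and splitting sketch, assuming the red-wine file winequality-red.csv (distributed by UCI as a semicolon-separated CSV) and an illustrative 60/20/20 split. The proportions, random seed, and choice of standardization are assumptions; adjust them to match the splits shown in class.

```python
import numpy as np
import pandas as pd

# Red-wine file as distributed by the UCI repository; the CSV is ';'-separated.
df = pd.read_csv("winequality-red.csv", sep=";")
X = df.drop(columns="quality").to_numpy(dtype=float)
y = df["quality"].to_numpy()

# Illustrative 60/20/20 train/valid/test split after shuffling.
rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
n_train, n_valid = int(0.6 * len(X)), int(0.2 * len(X))
train_idx, valid_idx, test_idx = np.split(perm, [n_train, n_train + n_valid])
X_train, y_train = X[train_idx], y[train_idx]
X_valid, y_valid = X[valid_idx], y[valid_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Standardize using statistics from the training split only, since the feature
# scales differ widely and k-NN is distance-based.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mu) / sigma
X_valid = (X_valid - mu) / sigma
X_test = (X_test - mu) / sigma
```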

Bonus Points

[25 pts] If you want to go the extra mile, bonus points will be awarded if you manage to implement a faster version of the k-NN classifier by using a kd-tree data structure. Scikit-learn provides an implementation of kd-trees that you can use. In this case, please be aware that the available distance functions may differ somewhat from the ones available for cdist; check DistanceMetric. A sketch of how the tree could be used is shown below.
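A minimal sketch of how scikit-learn's KDTree could replace the brute-force distance computation. It assumes the standardized splits from the dataset sketch above; the values k=5 and \(\epsilon = 0.5\) and the empty-ball fallback are placeholders, and metrics such as cosine are generally not supported by kd-trees, so consult the DistanceMetric documentation before reusing your grid of distance functions.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KDTree

# Build the tree once over the (standardized) training data.
tree = KDTree(X_train, metric="euclidean")

# Vanilla variant: indices of the k nearest training points for each query.
# k=5 is a placeholder value.
_, knn_idx = tree.query(X_valid, k=5)
knn_pred = np.array([Counter(y_train[idx]).most_common(1)[0][0] for idx in knn_idx])

# Ball variant: indices of all training points within radius epsilon of each
# query (r=0.5 is a placeholder), falling back to the single nearest training
# point when the ball is empty.
ball_idx = tree.query_radius(X_valid, r=0.5)
ball_pred = np.array([
    Counter(y_train[idx]).most_common(1)[0][0] if idx.size > 0
    else y_train[tree.query(x.reshape(1, -1), k=1)[1][0, 0]]
    for x, idx in zip(X_valid, ball_idx)
])
```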

Submission Instructions

You will submit your work as a Jupyter Notebook file via Gradescope. You can download a .ipynb file from Google's Colab. Make sure all cells have been run and their output is shown before you save the file.

You are strongly encouraged to add cells (sections) to your notebook containing further explanations of your code and, more importantly, discussing and analyzing the results you obtained.