Assignment 3

Due Date

  • Nov 5th

Problem Statement

Your task is to perform model selection to solve a binary classification task. The dataset is one of the datasets publicly available at the UCI ML Repository. The goal is to predict whether income exceeds 50K/yr based on census data.

This assignment can be completed by teams of 2 students.

Notes

Please carefully read this section for implementation details.

  • You will use Kaggle to test your implementation. Visit the Kaggle competition set up for this assignment to download the dataset and to upload your predictions. The contest will remain open until the assignment's due date. Registration for this competition is restricted to those with access to this link.

  • For this assignment you must use the dataset files provided on the Kaggle competition. There are two data files for training (xTrain.csv and yTrain.csv) and one data file for testing (xTest.csv). You can use the training files in any way you want to perform model selection (e.g. train/valid/test split or cross-validation). After selecting your best model, use the test file xTest.csv to calculate predictions. Note that this file contains only the inputs; your model must predict the labels. Your predictions must be submitted to Kaggle for evaluation. Follow the competition instructions on how to format and upload your predictions.
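
    As a sketch of this workflow (the hold-out split, the model, and the helper name here are illustrative choices, not requirements; the submission file format must follow the actual competition instructions):

    ```python
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import f1_score

    def select_and_predict(X, y, X_test):
        """Hold out a validation set, fit one candidate model,
        and return test predictions plus the validation F1 score."""
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=0)
        model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
        val_f1 = f1_score(y_val, model.predict(X_val))
        return model.predict(X_test), val_f1

    # With the real data you would load the provided files, e.g.:
    # X = pd.read_csv("xTrain.csv"); y = pd.read_csv("yTrain.csv").iloc[:, 0]
    # X_test = pd.read_csv("xTest.csv")
    # preds, val_f1 = select_and_predict(X, y, X_test)
    # then write preds to a CSV in the format the competition specifies.
    ```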

  • The original dataset contains about 75% negative instances (salary less than or equal to 50K) and 25% positive instances (salary greater than 50K). This is an example of an imbalanced dataset.

  • Some of the features are discrete and others are continuous. We advise you to perform feature encoding (e.g. sklearn.preprocessing.OrdinalEncoder) on the categorical columns. Depending on the ML method you select, feature scaling may also be necessary (e.g. sklearn.preprocessing.StandardScaler; note that sklearn.preprocessing.Normalizer rescales each sample, not each feature).
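
    A minimal sketch of applying different preprocessing to categorical and continuous columns (the column names below are made up and do not match the census data):

    ```python
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OrdinalEncoder, StandardScaler

    # Toy frame standing in for the census data (hypothetical columns).
    df = pd.DataFrame({
        "age": [25, 38, 52],
        "workclass": ["Private", "State-gov", "Private"],
    })

    pre = ColumnTransformer([
        ("cat", OrdinalEncoder(), ["workclass"]),  # integer-encode categoricals
        ("num", StandardScaler(), ["age"]),        # standardize continuous columns
    ])
    Xt = pre.fit_transform(df)  # one encoded column, one scaled column
    ```

    In a full pipeline this transformer would be fit on the training data only and then applied to xTest.csv, to avoid leaking test statistics into preprocessing.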

  • The evaluation metric for this competition is Mean F1-Score. The F1 score combines precision and recall. Precision is the ratio of true positives to all predicted positives. Recall is the ratio of true positives to all actual positives. The F1 score weights recall and precision equally, and a good model will maximize both precision and recall simultaneously. Thus, moderately good performance on both will be favored over extremely good performance on one and poor performance on the other.
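
    These definitions can be checked on a small hand-worked example (TP = 2, FP = 1, FN = 1 below):

    ```python
    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

    p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
    r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 2/3
    f1 = f1_score(y_true, y_pred)        # harmonic mean 2pr/(p+r) = 2/3
    ```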

  • You are allowed to use any of Python's ML libraries, but you can only use ML algorithms covered in class, including any of their variants (e.g. Gradient Boosting). The list of allowed algorithms includes k-NN, decision trees, bagging, and boosting methods.
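
    One way to compare hyperparameters for an allowed algorithm is a cross-validated grid search scored by F1 (a sketch on synthetic data; the grid and fold count are illustrative choices):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import GradientBoostingClassifier

    # Toy imbalanced data (~75% negatives) standing in for the real training set.
    X, y = make_classification(n_samples=300, weights=[0.75], random_state=0)

    # Tune a boosting model by 3-fold cross-validated F1.
    search = GridSearchCV(
        GradientBoostingClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
        scoring="f1",
        cv=3,
    )
    search.fit(X, y)
    # search.best_params_ and search.best_score_ summarize the winning setting.
    ```

    The same pattern applies to k-NN, decision trees, or bagging by swapping the estimator and its parameter grid.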

Submission Instructions

After finalizing the assignment, you will submit your files via Gradescope. Please use exactly the formats and naming conventions below. Your submission should include:

  • source files, all source files used in your solution. These can be either python scripts (.py) or jupyter notebooks (.ipynb). Please do not upload data files. If you are uploading jupyter notebooks, please make sure all cells are run prior to saving the file.
  • hw3.pdf, your write-up.

Your write-up should contain:

We strongly encourage you to generate plots from your scripts and include them in the relevant sections of your write-up.

  • the names of the people in your team (and each member's contribution)
  • [30 pts] a clear description of the model selection strategy used (e.g. train/valid/test split, cross-validation, feature/data pre-processing, etc.)
  • [40 pts] a description of all machine learning algorithms and hyperparameters used or attempted (may include plots of performance statistics or model comparisons)
  • [30 pts] details of the model used for your final submission. Include all final hyperparameters and public and private score(s) (from Kaggle).

Remember that students caught cheating or plagiarizing will receive no credit. Additional actions may also be taken, including a failing grade in the class or referral of the case for disciplinary action.