# Assignment 3

• Due: Nov 5th

## Problem Statement

Your task is to perform model selection to solve a binary classification task. The dataset is one of the datasets publicly available at the UCI ML Repository. The goal is to predict whether income exceeds 50K/yr based on census data.

This assignment can be completed by teams of 2 students.

## Notes

• You will use Kaggle to test your implementation. Visit the Kaggle competition set up for this assignment to download the dataset and to upload your predictions. The competition will remain open until the assignment's due date. Registration for this competition is restricted to those with access to this link.

• For this assignment you must use the dataset files provided on the Kaggle competition. There are two data files for training (xTrain.csv and yTrain.csv) and one data file for testing (xTest.csv). You can use the training files in any way you want to perform model selection (e.g. a train/valid/test split or cross-validation). After selecting your best model, use the test file xTest.csv to calculate predictions. Note that this file contains only the inputs; your model must predict the labels. Your predictions must be submitted to Kaggle for evaluation. Follow the competition instructions on how to format and upload your predictions.
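The model-selection step described above can be sketched as a small helper that cross-validates a list of candidate classifiers, refits the winner, and predicts on the test inputs. This is a minimal sketch, not a required implementation; the candidate list, fold count, and scoring choice are all yours to decide.

```python
from sklearn.model_selection import cross_val_score

def select_and_predict(X, y, X_test, models, cv=5):
    """Return predictions from the model with the best mean CV F1 score.

    `models` is any list of sklearn-style classifiers (e.g. instances of
    the algorithms allowed for this assignment).
    """
    best_score, best_model = -1.0, None
    for model in models:
        score = cross_val_score(model, X, y, cv=cv, scoring="f1").mean()
        if score > best_score:
            best_score, best_model = score, model
    best_model.fit(X, y)  # refit the winning model on all training data
    return best_model.predict(X_test)
```

With the competition files, `X`, `y`, and `X_test` would come from xTrain.csv, yTrain.csv, and xTest.csv (e.g. via `pandas.read_csv`), after whatever encoding your chosen models need.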

• The original dataset contains about 75% negative instances (salary less than or equal to 50K) and 25% positive instances (salary greater than 50K). This is an example of an imbalanced dataset.

• Some of the features are discrete and some others are continuous. We advise you to perform feature encoding (e.g. sklearn.preprocessing.OrdinalEncoder) on the categorical columns. Depending on the ML method you select, feature scaling may also be necessary (e.g. sklearn.preprocessing.StandardScaler; note that sklearn.preprocessing.Normalizer rescales individual samples rather than features).
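The two preprocessing steps above can be illustrated on toy columns (the category names here are made up for illustration; use the actual census columns from the dataset):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Categorical column: each category is mapped to an integer code
# (categories are sorted alphabetically by default).
cat = np.array([["clerical"], ["sales"], ["clerical"], ["tech"]])
cat_encoded = OrdinalEncoder().fit_transform(cat)

# Continuous column: standardize to zero mean / unit variance, which
# matters for distance-based methods such as k-NN.
cont = np.array([[25.0], [40.0], [31.0], [58.0]])
scaled = StandardScaler().fit_transform(cont)
```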

• The evaluation metric for this competition is Mean F1-Score. The F1 score combines precision and recall. Precision is the ratio of true positives to all predicted positives. Recall is the ratio of true positives to all actual positives. The F1 score weights recall and precision equally, and a good model will maximize both precision and recall simultaneously. Thus, moderately good performance on both will be favored over extremely good performance on one and poor performance on the other.
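The definitions above translate directly into code. This hand-rolled version (equivalent to `sklearn.metrics.f1_score` for binary labels) makes the precision/recall trade-off explicit:

```python
def f1_score_binary(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall, so it is high
    # only when both are high.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```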

• You are allowed to use any of Python's ML libraries, but you may only use ML algorithms covered in class, including any of their variants (e.g. Gradient Boosting). The list of allowed algorithms includes k-NN, decision trees, bagging, and boosting methods.
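A convenient way to compare the allowed algorithms on equal footing is to cross-validate each one with the competition metric. The hyperparameters below are illustrative starting points, not tuned values:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# One candidate per allowed algorithm family; tune these yourself.
candidates = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(max_depth=8),
    "bagging": BaggingClassifier(n_estimators=50),
    "gradient boosting": GradientBoostingClassifier(),
}

def rank_models(X, y, cv=5):
    """Return {name: mean CV F1 score} for each candidate, best first."""
    scores = {name: cross_val_score(m, X, y, cv=cv, scoring="f1").mean()
              for name, m in candidates.items()}
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))
```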

## Submission Instructions

Please use exactly the following formats and naming conventions.

• source files, all source files used in your solution. These can be either Python scripts (.py) or Jupyter notebooks (.ipynb). Please do not upload data files. If you are uploading Jupyter notebooks, make sure all cells have been run before saving the file.
• hw3.pdf, your write-up.

You are reminded that students caught cheating or plagiarizing will receive no credit. Additional actions may also be taken, including a failing grade in the class or referral of the case for disciplinary action.