Homework #6: Classification

DUE: Sunday 2/22 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

To get ready for this assignment, open up a VM shell, and type the following command:

/opt/datacourse/sync.sh

Next, type the following commands to create a working directory for this homework. Here we use hw06 under your shared/ directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/hw06/ ~/shared/hw06/
cd ~/shared/hw06/

1. Laptop vs. Phone Revisited

In this exercise we are going to revisit the task from Lab #5 of deciphering reviews (guessing whether each review is for a laptop or a mobile phone). This time you are going to write your own Naive Bayes classifier for this task.

In the reviews/ subdirectory, you will find a (poorly formatted) CSV file named reviews.csv. First, you need to run the Python script prepare.py, which parses this file, constructs the objects (e.g., feature matrix and outcome vector) to be used by your classifier, and writes them into a .pik file for faster loading by your subsequent analysis:

cd ~/shared/hw06/reviews/
python prepare.py

For a description of what these objects mean, see comments in classify.py.
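
As a rough illustration of the caching step described above, the sketch below shows how Python's pickle module can write objects to a file and read them back. The file name example.pik and the toy objects are made up for this illustration; the actual file name and object layout are defined by prepare.py and classify.py.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the real objects (hypothetical, for illustration only).
feature_names = ["battery", "keyboard", "screen"]
feature_matrix = [[1, 0, 1], [0, 1, 1]]   # one row per review
outcomes = [0, 1]                          # e.g., 0 = phone, 1 = laptop

# Hypothetical file name; prepare.py chooses the real one.
path = os.path.join(tempfile.gettempdir(), "example.pik")

# Write all objects in one call (pickle requires binary mode).
with open(path, "wb") as f:
    pickle.dump((feature_names, feature_matrix, outcomes), f)

# Later, a separate script can load them without re-parsing the CSV.
with open(path, "rb") as f:
    loaded_names, loaded_matrix, loaded_outcomes = pickle.load(f)

print(loaded_names, loaded_matrix, loaded_outcomes)
```

Because the objects are read back exactly as they were written, any change to how they are built only takes effect after the file is regenerated.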

Your job is to modify and complete the file classify.py to implement Naive Bayes classification according to the specification given in the code. You need to implement the algorithm yourself; do NOT use the Naive Bayes implementations provided by Python's sklearn package. To run the classification algorithm, type:

python classify.py

In addition to accuracy, note that the program will also print out the most important features used by this classifier, assuming that you have implemented the algorithm correctly according to the specification. This information can be very useful for debugging. Do your top features make intuitive sense?
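
For orientation only, here is a toy sketch of the general Naive Bayes idea on made-up data. It assumes a multinomial word-count model with add-one smoothing, which may differ from the specification in classify.py; all names below (train, log_prior, log_likelihood, predict) are hypothetical and not taken from the assignment code. The point it illustrates is that scores are summed in log space, so multiplying many small probabilities does not underflow.

```python
import math
from collections import Counter

# Toy training data (made up): tokenized reviews with labels.
train = [
    (["great", "keyboard", "screen"], "laptop"),
    (["heavy", "keyboard"], "laptop"),
    (["great", "camera", "screen"], "phone"),
    (["small", "camera"], "phone"),
]

vocab = sorted({w for words, _ in train for w in words})
labels = sorted({label for _, label in train})

# Count documents per class and word occurrences per class.
doc_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in labels}
for words, label in train:
    word_counts[label].update(words)

def log_prior(label):
    # Fraction of training documents with this label, in log space.
    return math.log(float(doc_counts[label]) / len(train))

def log_likelihood(word, label):
    # Add-one (Laplace) smoothing keeps unseen words from giving log(0).
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + 1.0) / (total + len(vocab)))

def predict(words):
    # Sum logs instead of multiplying probabilities to avoid underflow.
    # Words outside the training vocabulary are ignored.
    scores = {}
    for label in labels:
        scores[label] = log_prior(label) + sum(
            log_likelihood(w, label) for w in words if w in vocab
        )
    return max(scores, key=scores.get)

print(predict(["keyboard", "screen"]))
print(predict(["camera"]))
```

On this toy data the first call prints laptop and the second prints phone, since "keyboard" appears only in laptop reviews and "camera" only in phone reviews.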

WHAT TO SUBMIT: Submit your modified classify.py file as well as the result of running it. To save the result to a file named reviews-output.txt, you can use the shell command python classify.py > reviews-output.txt.

2. Predicting Party Affiliation by Votes

So far in our classification assignments, we have provided most of the code for extracting features. Now it is your turn to get your hands dirty and implement feature extraction from data. For this problem, we will use the congress database, first introduced in Homework #2. You should still have this database on your VM (you can check it with psql congress); if not, follow the instructions in Homework #2 to recreate this database.

Your task for this problem is to predict, using the votes in the House cast by a Representative in the 2014 session, whether this Representative is Republican. Under the congress/ subdirectory, the script classify.py is ready to go. It uses Python sklearn's multivariate Bernoulli Naive Bayes classification, which models each Representative's voting history as a bit vector, where each component corresponds to what the Representative did for each vote. Your job is to complete the script prepare.py. What your code needs to produce is explained by the comments in prepare.py.
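
To make the bit-vector idea concrete, here is a minimal sketch on made-up vote records. The record layout, the identifiers, and the encoding (1 for "Yea", 0 for anything else, including a missing vote) are assumptions for illustration; the encoding your code must actually produce is the one described in the comments in prepare.py.

```python
# Toy vote records (made up): (representative_id, vote_id, vote_cast).
records = [
    ("rep_a", "v1", "Yea"),
    ("rep_a", "v2", "Nay"),
    ("rep_b", "v1", "Nay"),
    ("rep_b", "v2", "Yea"),
    ("rep_b", "v3", "Yea"),
]

# Fix a column order so every vector puts the same vote in the same position.
vote_ids = sorted({vote_id for _, vote_id, _ in records})
column = {vote_id: i for i, vote_id in enumerate(vote_ids)}

# Start each row as all zeros; a vote with no record stays 0.
vectors = {}
for rep, vote_id, cast in records:
    row = vectors.setdefault(rep, [0] * len(vote_ids))
    # Assumed encoding: 1 for "Yea", 0 for anything else.
    row[column[vote_id]] = 1 if cast == "Yea" else 0

print(vote_ids)
print(vectors)
```

Here rep_a has no record for v3, so that component stays 0. Whether a missing vote should be treated the same as a "Nay" is a design decision that the real specification settles.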

To run the classification algorithm, follow the usual sequence of steps:

cd ~/shared/hw06/congress/
python prepare.py
python classify.py

The program will first run a particular train-test split to print out more specific information (for debugging); then it will run 10-fold cross validation on the entire dataset.
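
As a rough picture of what 10-fold cross validation does, the sketch below splits item indices into k folds, using each fold once as the test set and the rest as the training set. The function name k_fold_indices is hypothetical, and the sketch does not shuffle the data; classify.py may split the data differently.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n items."""
    # Spread the remainder so that fold sizes differ by at most one.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

# Example: 25 items split into 10 folds.
folds = list(k_fold_indices(25, 10))
for train, test in folds:
    print(len(train), len(test))
```

Every item appears in exactly one test fold, so averaging the accuracy over all folds uses the entire dataset for evaluation without ever testing on data the model was trained on.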

Remember to rerun prepare.py if you make any modification to it (otherwise classify.py will get the stale .pik file).

WHAT TO SUBMIT: Submit your modified prepare.py file as well as the result of running the classification algorithm. To save the result to a file named congress-output.txt, you can use the shell command python classify.py > congress-output.txt.