DUE: Sunday 2/22 11:59pm
HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select
compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.
WHAT TO SUBMIT: Nothing is required for this part.
To get ready for this assignment, open up a VM shell, and type the following command:
Next, type the following commands to create a working directory for this homework. Here we use hw06 under your shared/ directory, but feel free to change it to another location.
cp -pr /opt/datacourse/assignments/hw06/ ~/shared/hw06/
cd ~/shared/hw06/
In this exercise we are going to revisit the task from Lab #5 of deciphering reviews (guessing whether each review is for a laptop or a mobile phone). This time you are going to write your own Naive Bayes classifier for this task.
Under the reviews/ subdirectory, you will find a (poorly formatted) CSV file named reviews.csv. First, you need to run the Python script prepare.py, which parses this file, constructs the objects (e.g., feature matrix and outcome vector) to be used by your classifier, and writes them into a .pik file for faster loading by your subsequent analysis:
cd ~/shared/hw06/reviews/
python prepare.py
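For readers unfamiliar with pickling, here is a minimal sketch of the save-then-load pattern that a .pik file suggests. It assumes the standard pickle module; the object names, the dictionary layout, and the file name example.pik are hypothetical stand-ins and may not match what prepare.py actually writes.

```python
import os
import pickle
import tempfile


def save_objects(path, objects):
    """Write a dict of named objects to a single pickle file."""
    with open(path, "wb") as f:
        pickle.dump(objects, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_objects(path):
    """Read back the dict written by save_objects."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-ins for a feature matrix and an outcome vector.
features = [[1, 0, 2], [0, 3, 0]]
outcomes = [0, 1]

# "example.pik" is an illustrative name, not the file used by the assignment.
path = os.path.join(tempfile.mkdtemp(), "example.pik")
save_objects(path, {"features": features, "outcomes": outcomes})
loaded = load_objects(path)
```

Loading a pickle is typically much faster than re-parsing the CSV, which is why the parsing step only needs to run once.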
For a description of what these objects mean, see comments in
Your job is to modify and complete the file classify.py to implement Naive Bayes classification according to the specification given in the code. You need to implement the algorithm yourself; do NOT use the Naive Bayes implementations provided by Python's sklearn package. To run the classification algorithm, type:
In addition to accuracy, note that the program will also print out the most important features used by this classifier, assuming that you have implemented the algorithm correctly according to the specification. This information can be very useful for debugging. Do your top features make intuitive sense?
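To make the idea of "most important features" concrete, here is a toy sketch of one common Naive Bayes variant: multinomial with add-one (Laplace) smoothing, ranking features by the difference in per-class log-likelihood. This is an assumption for illustration only; the specification in classify.py is authoritative and may define the model, the smoothing, and the feature ranking differently. All function names and the tiny dataset are hypothetical.

```python
import math


def train_nb(X, y, alpha=1.0):
    """X: list of rows of non-negative word counts; y: list of class labels."""
    n_features = len(X[0])
    classes = sorted(set(y))
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each feature within class c, with add-alpha smoothing.
        totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return classes, log_prior, log_likelihood


def predict_nb(model, x):
    """Return the class with the highest log-posterior score for row x."""
    classes, log_prior, log_likelihood = model

    def score(c):
        return log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))

    return max(classes, key=score)


def top_features(model, names, c, other, k=1):
    """Features whose log-likelihood most favors class c over class other."""
    _, _, ll = model
    ranked = sorted(range(len(names)),
                    key=lambda j: ll[c][j] - ll[other][j], reverse=True)
    return [names[j] for j in ranked[:k]]


# Hypothetical toy data: label 0 = laptop review, label 1 = phone review.
names = ["battery", "keyboard", "screen", "call"]
X = [[1, 3, 1, 0], [0, 2, 1, 0], [1, 0, 1, 3], [2, 0, 1, 2]]
y = [0, 0, 1, 1]
model = train_nb(X, y)
```

On this toy data, "keyboard" is the feature that most favors the laptop class and "call" the one that most favors the phone class, which is the kind of sanity check the question above is asking for.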
WHAT TO SUBMIT: Submit your modified classify.py file as well as the result of running it. To save the result to a file named reviews-output.txt, you can use the shell command:
python classify.py > reviews-output.txt
So far in our classification assignments, we have provided most of the code for extracting features. Now it is your turn to get your hands dirty and implement feature extraction from data. For this problem, we will use the congress database, first introduced in Homework #2. You should still have this database on your VM (you can check it with psql congress); if not, follow the instructions in Homework #2 to recreate this database.
Your task for this problem is to predict, using the votes in the House cast by a Representative in the 2014 session, whether this Representative is Republican. Under the congress/ subdirectory, the script classify.py is ready to go. It uses Python sklearn's multivariate Bernoulli Naive Bayes classification, which models each Representative's voting history as a bit vector, where each component corresponds to what the Representative did on one vote. Your job is to complete the script prepare.py. What your code needs to produce is explained by the comments in
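The bit-vector idea described above can be sketched as follows. This is only an illustration under stated assumptions: the (person, vote, choice) tuple layout, the "Yea" label, and the choice to encode a missing or non-Yea vote as 0 are all hypothetical, and the actual congress schema and the output that prepare.py must produce may differ.

```python
def build_vote_matrix(records, positive="Yea"):
    """Turn a list of (person_id, vote_id, choice) tuples into a 0/1 matrix.

    Row i corresponds to person_ids[i]; column j corresponds to vote_ids[j].
    A cell is 1 if that person cast `positive` on that vote, else 0
    (absent or other choices are treated as 0 in this sketch).
    """
    person_ids = sorted({p for p, _, _ in records})
    vote_ids = sorted({v for _, v, _ in records})
    row = {p: i for i, p in enumerate(person_ids)}
    col = {v: j for j, v in enumerate(vote_ids)}
    matrix = [[0] * len(vote_ids) for _ in person_ids]
    for p, v, choice in records:
        if choice == positive:
            matrix[row[p]][col[v]] = 1
    return person_ids, vote_ids, matrix


# Hypothetical records; rep_c did not vote on v1.
records = [
    ("rep_a", "v1", "Yea"), ("rep_a", "v2", "Nay"),
    ("rep_b", "v1", "Nay"), ("rep_b", "v2", "Yea"),
    ("rep_c", "v2", "Yea"),
]
person_ids, vote_ids, matrix = build_vote_matrix(records)
```

Fixing a sorted order for rows and columns keeps every Representative's vector aligned on the same votes, which a Bernoulli model requires.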
To run the classification algorithm, follow the usual sequence of steps:
cd ~/shared/hw06/congress/
python prepare.py
python classify.py
The program will first run a particular train-test split to print out more specific information (for debugging); then it will run 10-fold cross validation on the entire dataset.
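As a reminder of what 10-fold cross validation does, here is a minimal pure-Python sketch. It is not the code used by classify.py; the function names, the shuffling seed, and the fit/predict callback interface are assumptions made for illustration, and the sketch assumes at least k examples so that no fold is empty.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k folds over n examples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Average test accuracy over k folds.

    fit(X_train, y_train) returns a model; predict(model, x) returns a label.
    """
    accuracies = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

Each example is used for testing exactly once and for training in the other nine folds, so the averaged accuracy depends less on any single train-test split.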
Remember to rerun
prepare.py if you make any modification to it (otherwise
classify.py will get the stale
WHAT TO SUBMIT: Submit your modified prepare.py file as well as the result of running the classification algorithm. To save the result to a file named congress-output.txt, you can use the shell command:
python classify.py > congress-output.txt