Lab #6: Tweaking Classifiers

DUE: Monday 2/22 11:59pm

HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci216 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing members of your team who are present at the lab. To earn extra credit for the lab challenge (Part 2), you must get your solutions to all parts of the lab checked off in class, but there is no need to submit anything.

0. Getting Ready

To get ready for this lab, fire up your virtual machine and enter the following command:

/opt/datacourse/sync.sh

This command will download the files that we are going to use for this lab.

Next, type the following commands to create a working directory for this lab. Here we use lab06 under your shared directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/lab06/ ~/shared/lab06/
cd ~/shared/lab06/

1. Selecting Votes as Features for Party Classification

We are going to continue with Homework #6, Part 2, where we predict the party affiliation of Representatives based on the votes they cast in 2014. The subdirectory congress/ contains a complete implementation of a Bernoulli Naive Bayes classifier. To get ready, run:

cd ~/shared/lab06/congress/
python prepare.py

The feature extraction code has been extended to read in an SQL file pick.sql, which selects a (pretty arbitrary) subset of 10 votes as features to be used for prediction (instead of using all of the hundreds of votes cast in 2014). Run the classifier, and you will notice that it still performs pretty well with just these 10 features:

python classify.py
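
To see the idea behind the classifier in isolation, here is a toy sketch of Bernoulli Naive Bayes on binary vote features. It uses scikit-learn's BernoulliNB rather than the course's own implementation in congress/, and the tiny data set is made up purely for illustration.

# Toy sketch (not the course code): Bernoulli Naive Bayes on binary votes.
from sklearn.naive_bayes import BernoulliNB

# Each row is one Representative, each column one picked vote
# (1 = yea, 0 = nay); labels are party affiliations. All made up.
X = [[1, 0, 1],
     [1, 1, 1],
     [0, 0, 1],
     [0, 1, 0]]
y = ['D', 'D', 'R', 'R']

clf = BernoulliNB()
clf.fit(X, y)
print(clf.predict([[1, 0, 0]]))  # predict a party from a new row of votes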

Your job is to modify pick.sql to select 10 votes as features such that the classifier performs very poorly with these 10 votes. The query must return 10 vote ids, and they must be for votes cast by the House in the 2014 session. (Whenever you modify pick.sql, you just need to rerun classify.py; there is no need to rerun prepare.py.)

Raise your hands to have your answer checked by the course staff when you get the classifier to perform poorly. Explain, intuitively, why your choice leads to poor accuracy. You get extra credit worth 5% of a homework grade if your classifier's accuracy is below 70%. The team that achieves the lowest accuracy the fastest will also win a prize.

2. Newsgroup Classification Challenge

We use the textual contents of the messages to predict which newsgroups the messages were posted to. For the purpose of this exercise, we limit ourselves to six newsgroups on the following subjects:

  • cryptography
  • medical
  • electronics
  • space
  • politics, mideast
  • politics, misc

We have several options for classification. Your job is to explore these options and find the best one for the task at hand. Our measure of classifier performance is the F-measure, the harmonic mean of precision and recall. The higher the F-measure, the better.
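
Concretely, if \(P\) denotes precision and \(R\) recall, then \(F = \frac{2PR}{P+R}\); the F-measure is high only when both precision and recall are high.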

You can run the classification algorithm using the following command:

cd ~/shared/lab06/newsgroups/
python classify.py

You will see the performance of our default 1NN (one-nearest-neighbor) classifier. The f1-score (F-measure) is just 0.332. Not very impressive, is it?

Luckily, you get to explore a lot of possible ways to improve this result. You do so by running classify.py with different command-line options, and by modifying another file lab.py to specify your feature extractor and classifier. (You won't need to modify classify.py or prepare.py.) Here is a summary of what you can try:

  • We consider three classifiers (Multinomial Naive Bayes, k Nearest Neighbor, and Support Vector Machines), each with its own hyperparameters. You can specify which you want by modifying the code for the function get_classifier() in lab.py; we have some sample code and explanation there already. Feel free to try other classifiers on your own if you want to! (See the sketch after this list for a rough idea of what this code can look like.)
  • Note that under get_classifier() in lab.py, we also have an example of automatically tuning SVM hyperparameters using "grid search". Study the example code so you can use grid search to systematically explore parameter tuning for the other two classifiers as well.
  • By modifying get_vectorizer() in lab.py, you can extract as features either TF-IDF scores or raw term counts.
  • classify.py has a number of command-line options that you might find useful. You can use them in combination.
    • --report will print a detailed classification report, with the breakdown of precision and recall for each class.
    • --select=N will select only the top N features using the \(\chi^2\)-test.
    • --scale will scale the features, which may be useful to some classifiers, such as SVM with RBF kernel. Scaling a big sparse matrix might make it dense and overflow your memory, so use it with caution, or in combination with --select.
    • --top10 prints the ten most discriminative terms per class. This option only makes sense for certain classifiers, such as Naive Bayes.
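
To make these options concrete, here is a rough sketch of what lab.py might contain. It assumes, as the sample code there suggests, that get_vectorizer() and get_classifier() each simply return a scikit-learn object; the hyperparameter values below are arbitrary starting points, not recommendations, and you should follow the conventions of the provided lab.py where they differ.

# A sketch of possible lab.py contents (assumed, not the provided file).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def get_vectorizer():
    # TF-IDF scores as features; return CountVectorizer() instead
    # if you want raw term counts.
    return TfidfVectorizer(stop_words='english')

def get_classifier():
    # Option 1: Multinomial Naive Bayes; alpha is the smoothing parameter.
    # return MultinomialNB(alpha=0.1)

    # Option 2: k Nearest Neighbor; k (n_neighbors) is the main knob.
    # return KNeighborsClassifier(n_neighbors=5)

    # Option 3: SVM tuned by grid search: every combination of settings in
    # param_grid is evaluated by cross-validation, and the best is kept.
    param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
    return GridSearchCV(SVC(), param_grid)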

Once you get your F-measure above 0.65, raise your hands to get your result checked off by course staff. If you manage to get your F-measure above 0.79, you receive extra credit worth 5% of a homework grade. The team that achieves the highest F-measure the fastest will also win a prize.