Homework #7: Clustering

DUE: Sunday 3/1 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

To get ready for this assignment, open up a VM shell, and type the following command:

/opt/datacourse/sync.sh

Next, type the following commands to create a working directory for this homework. Here we use hw07 under your shared/ directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/hw07/ ~/shared/hw07/
cd ~/shared/hw07/

1. Clustering Reviews

In this exercise we are going to revisit the task of deciphering reviews. In Lab #5, we asked you to guess whether a review was for a laptop or a mobile phone. This time we are not telling you which categories of products these reviews correspond to. Instead, you are going to find out the categories of products yourself by clustering the reviews.

In the hw07/ subdirectory, you will find a CSV file named reviews.csv, which contains 250 products and one review for each product. We have provided a Python script, kmeans.py, that clusters the reviews using sklearn's built-in k-means clustering implementation. Read the script carefully. Note:

- We set n_clusters=2 just as an example, not because there are only 2 clusters in the data.
- Features are created for each review using the TfidfVectorizer.

The script clusters the data and prints (a) a score related to the objective function being minimized, (b) all the ids and the predicted cluster they belong to, and (c) the cluster centers.
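As a rough illustration of that pipeline (not the contents of kmeans.py itself), the sketch below vectorizes a handful of made-up reviews with TfidfVectorizer, runs KMeans, and prints the same three kinds of output. The sample reviews, the cluster_reviews helper, and the choices of stop_words, n_init, and random_state are all assumptions made for this example; the provided script reads reviews.csv and may report its score differently (for instance, KMeans.score returns the negative of the inertia shown here).

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in reviews; the provided script reads reviews.csv instead.
reviews = [
    "battery life on this laptop is great",
    "the laptop keyboard and screen are excellent",
    "this phone has a good camera and signal",
    "phone calls are clear and the camera is sharp",
]


def cluster_reviews(texts, n_clusters=2, seed=0):
    """Turn texts into TF-IDF features and fit k-means on them."""
    features = TfidfVectorizer(stop_words="english").fit_transform(texts)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    model.fit(features)
    return model


model = cluster_reviews(reviews)

# (a) Objective value: sum of squared distances to the nearest center.
print("inertia:", model.inertia_)

# (b) Each review's index and the cluster it was assigned to.
for review_id, label in enumerate(model.labels_):
    print(review_id, label)

# (c) Cluster centers: one row per cluster, one column per TF-IDF term.
print("centers shape:", model.cluster_centers_.shape)
```

Cluster labels are arbitrary integers, so the same grouping can appear with the labels swapped between runs that use different seeds.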

Your task is to answer the following questions:
- What is your estimate for the best k for this dataset?
- Justify your k value using the output of the clustering.

HINT: You can modify kmeans.py to loop over many other k values and compute measures/visualize the clusters to help you decide what the best k is. You can also choose to use a different clustering algorithm if you want.
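One possible way to follow the hint is sketched below: it fits k-means for several k values and records two common measures, the inertia (for an "elbow" check) and the silhouette score. This is only a sketch under assumptions: the toy reviews, the scan_k helper, and the parameter choices are invented for the example, and the silhouette score is just one of several measures you could use; it is not prescribed by the assignment.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical stand-in reviews; replace with the texts from reviews.csv.
reviews = [
    "battery life on this laptop is great",
    "the laptop keyboard and screen are excellent",
    "fast laptop with a bright screen",
    "this phone has a good camera and signal",
    "phone calls are clear and the camera is sharp",
    "the phone battery drains during calls",
    "these headphones have deep bass and clear sound",
    "comfortable headphones with rich sound",
]


def scan_k(texts, k_values, seed=0):
    """Fit k-means once per k and return (k, inertia, silhouette) tuples."""
    features = TfidfVectorizer(stop_words="english").fit_transform(texts)
    results = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        model.fit(features)
        # Silhouette ranges from -1 to 1; higher means tighter, better
        # separated clusters. It requires 2 <= k <= n_samples - 1.
        silhouette = silhouette_score(features, model.labels_)
        results.append((k, model.inertia_, silhouette))
    return results


results = scan_k(reviews, k_values=range(2, 5))
for k, inertia, silhouette in results:
    print(f"k={k}  inertia={inertia:.3f}  silhouette={silhouette:.3f}")
```

Inertia tends to keep falling as k grows, so it is usually read by looking for the point where the decrease levels off, whereas the silhouette score can be compared directly across k values.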

WHAT TO SUBMIT: Submit a single file k.txt. The first line should read k=<your answer>. Your justification starts on the second line.