CompSci 101 : Fall 2013

HOWTO: Recommender

Reading Data

Each data reading module you write will have a function getData that returns two values: a list and dictionary.

A toy example, FoodReader.py and foodratings_example.txt, is given for this assignment. This code is provided as an example of what a Reader module should do and can be used to test the functions in your Recommender module. Also, feel free to use parts of this code in your modules.

BookReader

In this module, you will write one function, getData, that reads data from two files, about book titles and their ratings, and return that data in the common format to be used by the Recommender module. You may write as many helper functions as you want.

As input, you are given two CSV files: one of unique book titles and one of ratings for each book by unique students; where each line in the file corresponds to one complete piece of "data". Each file is in the following format:

Titles	Ratings
bookTitle #1 bookTitle #2 ...	ratername,rating for book #1,rating for book #2,... ratername,rating for book #1,rating for book #2,... ...

The ratings for each student are given as numbers where the i^th number is the rating for the i^th book in the titles file. For example, if the first three lines of titles file contain the following data:

The Hitchhiker's Guide To The Galaxy
Watership Down
The Five People You Meet in Heaven

and the first two lines of the ratings file contain the following data:

student1001,5,0,0,...
student1002,5,5,0,...

then that means user ‘student1001’ rated ‘The Hitchhiker's Guide To The Galaxy’ with a 5 (really liked it), ‘Watership Down’ with a 0 (did not read it), and ‘The Five People You Meet in Heaven’ with a 0 (did not read it). Similarly user ‘student1002’ rated ‘The Hitchhiker's Guide To The Galaxy’ with a 5 (really liked it), ‘Watership Down’ with a 5 (really liked it), and ‘The Five People You Meet in Heaven’ with a 0 (did not read it).

As output, your function returns two sequences, a list of the book titles (string) and a dictionary whose keys are the rater's ID (string) and whose values are a list of their ratings of the books (int):

List	Dictionary
[ 'bookTitle', 'bookTitle', ... ]	{ 'ratername' : [ int rating for book #1,int rating for book #2,... ] 'ratername' : [ int rating for book #1,int rating for book #2,... ] ... }

Thus, if your module function is used as below:

import BookReader 
itlist,rdict = BookReader.getData('books.txt', 'bookratings.txt')
print(itlistl[:3]) 
print('---ratings---')
print(rdict['student1001'])

You should see the following output:

The Hitchhiker's Guide To The Galaxy
Watership Down
The Five People You Meet in Heaven
---ratings---
[5,0,0,0,0,0,0,1,0,1,-3,5,0,0,0,5,5,0,0,0,0,5,0,0,0,0,0,0,0,0,1,3,0,1,0,-5,0,0,5,5,0,5,5,5,0,5,5,0,0,0,5,5,5,5,-5]

Recommender

In this module, you will write the three functions shown below. You may write as many helper functions as you want.

Averages

This function returns a list of tuples where the first element is the item being rated (string) and the second element is the average rating for all those who have rated the item (float). The list should be sorted so that the highest rated item is first, ties should be sorted alphabetically. In calculating averages you should not count raters who give a value of 0 meaning "not rated". Divide by (n+1) where n is the number of non-zero raters to ensure you do not get a division-by-zero error for an item that no one rates.

The parameters are the list of items (string) and the dictionary that was returned by a reading module's getData function.

Similarities

This function returns a list of tuples where the first element is a rater's name (string) and the second element is a similarity index (int). The list is sorted with the highest similarity index first, ties should be sorted alphabetically. Similarity should be calculated for the user whose name is a parameter using dot-products as described below.

The parameters are the name of a rater (string) and the dictionary that was returned by a reading module's getData function. The rater whose name is the parameter should not be evaluated as how similar she is to herself, i.e., the list returned should have one less element than the number of elements in ratings since the rater is not judged as similar to himself.

A similarity measure can be calculated by finding the dot-product of two rating-lists. For example, for the rating lists [-3,0,5,3] and [-1,3,0,5] the similarity is -3*-1 + 0*3 + 5*0 + 3*5 where each corresponding element of the lists are multiplied and summed. This yields a similar measure of 3+15 = 18. For the lists [-3,0,5,3] and [3,0,-3,3] the similarity measure is -3*3 + 0*0 + 5*-3 + 3*3 = -9 + -15 + 9 = -15. The rater with [-1,3,0,5] is closer to [-3,0,5,3] than is the rater with [3,0,-3,3] since the measures are 18 and -15, respectively. The idea is that two negative or two positive ratings make users closer than do a negative and a positive rating.

The arithmetic result of summing the corresponding products is called the dot-product and is actually related to a measure of the angle between two ratings in a mathematical ratings space.

The parameters are the list returned by similarities, the list of items returned by getData, the dictionary returned by getData, and a number that indicates how many ratings from simList should be used. Scores are calculated using the top count entries from the list simList, so that if count is 1 use only the closest rater's ratings and if count is len(simList) use them all.

The idea is to weight the ratings of similar raters more than the ratings of those with whom you don't agree. Consider these ratings, for example for a user whose ratings are [5,3,-5].

[1,5,-3]
[5,-3,5]
[1,3,0]

The similarity measures are

1*5 + 5*3 + -3*-5 = 35
5*5 + -3*3 + 5*-5 = -9
1*5 + 3*3 + 0*-5 = 14

So the first set of ratings should be weighted most and the second set of ratings least because of how similar these raters are to us and our ratings.

We do this by accumulating a weighted sum as follows:

35 * [1,5,-3] = [ 35, 175,-105]
-9 * [5,-3,5] = [-45,  27, -45]
14 * [1,3,0]  = [ 14,  42,   0]
--------------------------------
                [  4, 244,-150]

This means that the best choice for us is the second item whose score is 244, the next is the first item whose score is 4, and the least-recommended is the last item whose score is -150.

MovieReader

In this module, you will write one function, getData, that reads data from just one file and returns that data in the common format to be used by the Recommender module. You may write as many helper functions as you want.

As input, you are given one CSV file where each line in the file corresponds to one complete piece of "data": the rater's name, the movie title, and the the rating for that movie. Specifically, the file is in the following format:

Ratings
ratername,movie title,rating for movie ratername,movie title,rating for movie ...

There are many ways to to read this data, but you will not know the total number of movies until after you have read the entire file. Thus, you will either need to read the file twice or store the data as its read in some way so you can create the dictionary you will return after reading the entire file. You will also have to choose an order for the movies before assigning the ratings in the dictionary to make sure the i^th number is the rating for the i^th movie and that 0 is assigned to where the user did not rate a movie.

As output, your function returns two sequences, a list of the movie titles (string) and a dictionary whose keys are the rater's ID (string) and whose values are a list of their ratings of the movies (int).