For data files see the data directory or snarf the assignment. You are not given any already-written Python code.
You are asked to write several Python modules for this assignment. Details are here and in the howto pages.
Here is the input file.
Sarah Lee (DivinityCafe)(3) (IlForno)(3) (TheSkillet)(-3) (LoopPizzaGrill)(3) (FarmStead)(3) (Tandoor)(5) (PandaExpress)(-3) Melanie (McDonalds)(1) (Tandoor)(3) (DivinityCafe)(5) (TheCommons)(3) (TheSkillet)(1) (IlForno)(3) (PandaExpress)(3) J J (TheSkillet)(1) (McDonalds)(1) (LoopPizzaGrill)(-1) (Tandoor)(3) (FarmStead)(1) (PandaExpress)(1) Sly one (TheCommons)(3) (Tandoor)(3) (DivinityCafe)(5) (TheSkillet)(3) (IlForno)(1) (LoopPizzaGrill)(3) Sung-Hoon (LoopPizzaGrill)(5) (McDonalds)(1) (Tandoor)(-3) (IlForno)(-1) (TheSkillet)(-3) (FarmStead)(-1) (PandaExpress)(3) (TheCommons)(1) Nana Grace (IlForno)(3) (LoopPizzaGrill)(-5) (DivinityCafe)(5) (McDonalds)(-1) (TheCommons)(3) (Tandoor)(1) Harry (McDonalds)(-3) (TheCommons)(5) (DivinityCafe)(5) (FarmStead)(3) (TheSkillet)(1) (LoopPizzaGrill)(-1) (PandaExpress)(-5) Wei (FarmStead)(1) (McDonalds)(-1) (DivinityCafe)(1) (LoopPizzaGrill)(3) (TheCommons)(3) (Tandoor)(5)
Sarah Lee is the first rater. She rated seven places. DivinityCafe was rated a 3, IlForno was rated a 3, TheSkillet was rated -3, LoopPizzaGrill was rated 3, FarmStead was rated a 3, Tandoor was rated a 5, and PandaExpress was rated -3.
Melanie is the next rater. She rated McDonalds a 1, Tandoor a 3, etc.
The restaurants may not be in the same order for each rater, and a rater may only rate a few restaurants. Any they do not rate you should assign a 0.
The function processdata should return a list of the unique items, that might be this list (your list may have a different ordering):
['IlForno', 'TheCommons', 'FarmStead', 'DivinityCafe', 'PandaExpress', 'TheSkillet', 'Tandoor', 'LoopPizzaGrill', 'McDonalds']
You also will return a dictionary. It might look like this (may not be the same order):
{'Sung-Hoon': [-1, 1, -1, 0, 3, -3, -3, 5, 1], 'Wei': [0, 3, 1, 1, 0, 0, 5, 3, -1], 'Sly one': [1, 3, 0, 5, 0, 3, 3, 3, 0], 'Nana Grace': [3, 3, 0, 5, 0, 0, 1, -5, -1], 'Melanie': [3, 3, 0, 5, 3, 1, 3, 0, 1], 'J J': [0, 0, 1, 0, 1, 1, 3, -1, 1], 'Harry': [0, 5, 3, 5, -5, 1, 0, -1, -3], 'Sarah Lee': [3, 0, 3, 3, -3, -3, 5, 3, 0]}
You will have to think about how you will process the data. You will need to first know all the unique restaurants. Once you know them, you could put them in a list and then use that ordering for rating restaurants. As you process the initial data, you may want to store it so you can then process it a second time once you know how many different restaurants there are.
Then for each rater you could create an initial list of ratings as all 0's. As you process the data, you could update the appropriate rating.
For example, with this file there are 9 restaurants. You could initialize each key value pair in the dict with the value [0,0,0,0,0,0,0,0,0]. Assume the ordering of the restaurants your program comes up with is:
['IlForno', 'TheCommons', 'FarmStead', 'DivinityCafe', 'PandaExpress', 'TheSkillet', 'Tandoor', 'LoopPizzaGrill', 'McDonalds']
Then when you process Sarah Lee, you would update her fourth slot to be 3, since Sarah Lee rated DivinityCafe a 3. You would update her first slot to be 3, since Sarah Lee rated IlForno a 3. You would update her sixth slot to be -3, since she rated TheSkillet a -3.
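The two-pass idea described above can be sketched roughly as follows. This is only an illustration, not the required solution: the helper names parse_pairs and build_ratings are hypothetical, and the sketch assumes each rater's ratings have already been separated from the rater's name as a string of (Restaurant)(rating) pairs. Look at the real file to decide how to split names from ratings.

```python
import re


def parse_pairs(text):
    """Turn '(IlForno)(3) (TheSkillet)(-3)' into [('IlForno', 3), ('TheSkillet', -3)].

    Assumes ratings always appear as (name)(integer) pairs.
    """
    found = re.findall(r"\(([^)]*)\)\((-?\d+)\)", text)
    return [(item, int(value)) for item, value in found]


def build_ratings(parsed):
    """parsed is a list of (rater, [(item, rating), ...]) tuples.

    First pass: collect the unique items in the order first seen.
    Second pass: give each rater a list of 0's and fill in the slots they rated.
    """
    items = []
    for _, pairs in parsed:
        for item, _ in pairs:
            if item not in items:
                items.append(item)

    slot = {item: index for index, item in enumerate(items)}
    ratings = {}
    for rater, pairs in parsed:
        row = [0] * len(items)          # 0 means "not rated"
        for item, value in pairs:
            row[slot[item]] = value
        ratings[rater] = row
    return items, ratings
```

Storing the parsed data in a list first is what makes the second pass possible: the length of each rater's list is not known until every restaurant has been seen.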
For book information, you will read data from two files and combine the data.
Here are the first eight lines from the file AllBooksAuthors.txt. Each book title is on one line, with the book number followed by a period first, then the title. The next line is the same book number followed by a period, then the author.
1.Postmortem
1.Patricia Cornwell
2.The Secret Adversary
2.Agatha Cristie
3.The Firm
3.John Grisham
4.The Hitchhiker's Guide To The Galaxy
4.Douglas Adams
...
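One possible way to pair up the title and author lines is sketched here. The function name read_books is hypothetical, and the sketch assumes the file strictly alternates title line, author line, with nothing but the number and a period in front of each.

```python
def read_books(lines):
    """Combine alternating 'N.Title' and 'N.Author' lines into 'Title,Author' strings."""
    cleaned = [line.strip() for line in lines if line.strip()]
    books = []
    for i in range(0, len(cleaned) - 1, 2):
        # split(".", 1) splits only at the first period, so periods
        # inside a title or an author's name are left alone
        title = cleaned[i].split(".", 1)[1]
        author = cleaned[i + 1].split(".", 1)[1]
        books.append(title + "," + author)
    return books
```

Keeping the books in file order matters here, because the ratings file lists each rater's ratings in that same order.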
Here are the first few lines in the file AllBooksRatings.txt. The first ten lines represent a rater: the word RATER, followed by a colon, the rater's name, and another colon, followed by ratings on this line and the next nine lines. Ratings are separated by colons, but there is no colon at the end of any line. There will always be at least one rating on each line that contains ratings.
For the ratings for Rus you can see that
Rus did not rate the first book Postmortem. She rated the second book The Secret
Adversary a 3, and did not rate the next ten books. Canra follows Rus
and rated the book Postmortem a 1, she did not rate the next two books
and then rated The Hitchhiker's Guide to the Galaxy a 5.
Note that 0 means no rating. Also note that Canra's ratings are spread over 13 lines. Each rater has their ratings spread over one or more lines, but every rater has the same number of ratings, which equals the number of books.
RATER:Rus:0:3:0:0
0:0:0:0:0:0:0
0:3:3:3:0:0:-3:0:0:0
1:0:0:0:0:0
0:0:0:1:0:0:1:5
0:1:0:5:0:0:0:0
0:0:5
5:0:0:0:0:0:0:-3
0:0:0:0
0:0:1:3:5:3:3
RATER:Canra:1:0:0:5:3:0:0:0
0
0:5
0:3:1:0:0:0
1:0:1:0:3:0:0
0:0:-3:0
5:0
0:0:0:5:5:0:1
0:-5:5:0:3:3
0:5:5:5:0
0:0:0:0:5:5:0:1:0
1:0:0:1:3:5
-1:3
RATER:Dos:-1:3:0:0:0:0:0:0:0:0
0:0
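Because a rater's ratings can run over any number of lines, one approach is to start a new list whenever a line begins with RATER and keep appending to it until the next RATER line. The sketch below assumes exactly the format described above; the function name parse_book_ratings is hypothetical.

```python
def parse_book_ratings(lines):
    """Return a dict mapping each rater's name to one flat list of int ratings."""
    ratings = {}
    current = None                      # name of the rater being read
    for line in lines:
        parts = line.strip().split(":")
        if parts[0] == "RATER":
            current = parts[1]
            ratings[current] = []
            parts = parts[2:]           # the rest of this line is ratings
        if current is None:
            continue                    # skip anything before the first RATER
        # blank lines split into [''], so empty pieces are filtered out
        ratings[current].extend(int(p) for p in parts if p != "")
    return ratings
```

A useful check after reading is that every list has the same length as the list of books.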
The function processdata should return a list of the unique items, that might be this list (in this case it makes sense to have the same order as the file):
['Postmortem,Patricia Cornwell', 'The Secret Adversary,Agatha Cristie', 'The Firm,John Grisham', "The Hitchhiker's Guide To The Galaxy,Douglas Adams", 'Watership Down,Richard Adams', 'The Five People You Meet in Heaven,Mitch Albom', 'Speak,Laurie Halse Anderson', 'I Know Why the Caged Bird Sings,Maya Angelou', 'Thirteen Reasons Why,Jay Asher', 'Foundation Series,Isaac Asimov', 'The Sisterhood of the Travelling Pants,Ann Brashares', 'A Great and Terrible Beauty,Libba Bray', ... [NOT ALL SHOWN] ... ]
And also return a dictionary of ratings, only partly shown below, just two entries are shown.
dict [('ender', [0, 0, 0, 5, 0, 0, 0, 0, 0, 5, 0, 0, 3, 0, 5, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 3, 0, 5, 0, 3, 0, 0, 1, 3, 5, 1, 3] ), ('Leah', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), ... ]
Here are the first few lines of the file AllMoviesRatings.txt. Each movie rated is on three lines and has three pieces of information: the first line is the rater, the second line is the movie title, and the third line is the rating. For example, on the first line student1367 rated the movie on the second line, "Star Trek Beyond", and gave it a 3 (from the third line).
student1367
Star Trek Beyond
3
student1367
Rogue One
3
student1367
Moano
1
student1367
The Edge of Seventeen
3
student1367
The Revenant
5
student1367
Blade Runner 2049
5
student1046
The Good Dinosaur
3
...
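Reading the file three lines at a time might look something like this sketch. It assumes the rater, title, rating pattern never breaks, and the function name process_movie_lines is hypothetical.

```python
def process_movie_lines(lines):
    """Group lines into (rater, title, rating) triples, then build the list and dict."""
    cleaned = [line.strip() for line in lines if line.strip()]
    triples = [
        (cleaned[i], cleaned[i + 1], int(cleaned[i + 2]))
        for i in range(0, len(cleaned) - 2, 3)
    ]

    # first pass: unique movie titles, in the order first seen
    items = []
    for _, title, _ in triples:
        if title not in items:
            items.append(title)

    # second pass: a list of 0's per rater, updated with each rating
    ratings = {}
    for rater, title, value in triples:
        if rater not in ratings:
            ratings[rater] = [0] * len(items)
        ratings[rater][items.index(title)] = value
    return items, ratings
```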
The function processdata should return a list of the unique items, that might be this list (yours may be in a different order). Here are some of the movies, not all are shown.
['Knight and Day', 'The Butterfly Effect', '50 First Dates', 'Love Actually', 'Date Night', 'Unstoppable', 'Tooth Fairy', 'Secretariat', 'A Nightmare on Elm Street', 'Kill Bill: Vol. 2', 'The Simpsons Movie', 'Rocky Balboa', 'The Town', 'The Ring', .....]
And also return a dictionary of ratings, partly shown below. For example, student1250 rated "The Butterfly Effect" a 3 and "50 First Dates" a 1. They did not rate "Knight and Day".
[('student1250', [0, 3, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 5, 0, 5, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -3, 0, 1, 0, -3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 5, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, -3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 3, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 3, 0, 0, 1, 0, 3, 0, 0, 0, 0, 3, 0, 3, 3, 0, 0, 0, 0, 5, 0, 0, 0]), ('student1251', [0, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 5, 3, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 5, 3, 0, 3, 3, -5, 1, 0, 0, 5, 0, 5, 0, 3, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -3, 0, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 5, 0, 0, 0, 0, -3, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 5, 5, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 3, 0, 0, 0]), ('student1252', [0, 5, -3, 3, 3, 1, 0, 0, 1, 3, 3, 0, 0, 0, 5, 3, 1, 1, 0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 0, 0, 5, 0, 1, 1, 0, 1, 5, 3, 5, 3, 0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 3, 0, 0, 5, 0, 0, 0, 0, 3, 5, 1, 3, 0, 0, 3, 3, 5, 3, 3, 5, 0, 0, 0, 0, 0, 1, 0, 0, 1, 3, 0, 0, 1, 5, 5, 5, 0, 5, 0, 0, 0, 0, 3, 0, 0, 3, 0, 3, 5, 3, 0, 1, 3, 0, 1, 0, 5, 0, 3, -3, 0, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 3, -3, 0, 1, 3, 0, 0, 3, 3, 0, 0, 0, 0, 3, 0, 3, 3, 5, 3, 5, 3, 3, 3, 1, 0]), ('student1253', [0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, -3, 0, 0, 0, 0, 0, 0, 0, 5, -3, 3, 5, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 5, 5, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 5, 0, 
0, 0, 0, 0, 0, 0, 3, 3, 0, 0]), ('student1254', [0, 0, 3, 1, 3, 5, 3, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 3, 5, 0, 3, 0, 1, 3, 0, 0, 5, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 3, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0, 3, 5, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 1, 0, 3, 1, 0, 3, 0, 3, 0, 5, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 3, 0, 5, 3, 0, 3, 1, 0, 0, 0, 5, 0, 3, 0, 3, 5, 0, 3, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 3, 0, 1, 0, 0, 1, 0, 0, 3, 0, 0, 0, 3, 3, 3, 0, 0, 3, 0, 1, 0, 0, 5, 5, 0, -3]), ....
Each data-reading module you write has a function processdata that returns two values: a list and a dictionary.
First, a toy example about food is shown below. Feel free to use parts of this code in your module. This module shows how to process a file that contains data for one rater; the rater's name is the first line of the file. A sample data file in this format might look like the following. This is a contrived example: the data is in a different format from the real files, and real examples have multiple users. With multiple users, all of them would need to have the foods in the same order so that the ratings in rdict would be in the same order for every user.
Charlie
vanilla milkshake:5
burrito:4
butterflied leg of lamb:-3
eggplant parmesan:1
Code to process this file:
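A module along these lines could be written as in the sketch below. It is a reconstruction under assumptions, not necessarily the course's own listing: the helper parse_single_rater is hypothetical, and the sketch assumes the name is on the first line and every later line has the form food:rating.

```python
def parse_single_rater(lines):
    """Build the item list and the ratings dictionary for a one-rater file."""
    cleaned = [line.strip() for line in lines if line.strip()]
    rater = cleaned[0]                  # first line is the rater's name
    items = []
    scores = []
    for line in cleaned[1:]:
        # rsplit at the last colon, in case a food name contains a colon
        food, value = line.rsplit(":", 1)
        items.append(food)
        scores.append(int(value))
    rdict = {rater: scores}
    return items, rdict                 # two values, returned together as a tuple


def processdata(filename):
    """Open the named file and return (list of foods, dictionary of ratings)."""
    with open(filename) as source:
        return parse_single_rater(source.readlines())
```

A caller would unpack both values at once, for example items, rdict = processdata("food.txt"), where the file name is only a placeholder.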
The example code above shows a hypothetical module for rating food for a single rater named Charlie. The function has a parameter that's the name of a file storing foods and ratings as shown above.
In a real example the data ratings would be stored in one or more of the files to
be read.
This example shows how to return two values from a function, essentially
returning a tuple.
NOTE: The AllFoodRatings.txt file is in a slightly different format than
this. Look at the file to figure out how to process it.
This is the main program for restaurants. You will use code from RecommenderForAll and ProcessAllFood. To accomplish this you will need to import the functions you want to use.
from RecommenderForAll import averages
from RecommenderForAll import similarities
from RecommenderForAll import recommended
from ProcessAllFood import processData
Then you will call processData to process the data file. Here is an outline of what you might do. (NOT ALL CODE SHOWN)
foodfile = "AllFoodRatings.txt"
fooditems, fooddict = processData(foodfile)
...
resultavg = averages(fooditems,fooddict)
...
person1 = "Sung-Hoon"
resultsim = similarities(person1, fooddict)
...
resultrec = recommended(resultsim, fooditems, fooddict,3)
...
person1 = "Sarah Lee"
resultsim = similarities(person1, fooddict)
...
resultrec = recommended(resultsim, fooditems, fooddict,3)
...
The output would then be the following. (Floats should be shown for averages. Don't worry about the number of digits displayed.)
RESTAURANTS
IlForno TheCommons FarmStead DivinityCafe PandaExpress TheSkillet Tandoor LoopPizzaGrill McDonalds
RATER and their Ratings:
Sung-Hoon [-1, 1, -1, 0, 3, -3, -3, 5, 1]
Wei [0, 3, 1, 1, 0, 0, 5, 3, -1]
Sly one [1, 3, 0, 5, 0, 3, 3, 3, 0]
Nana Grace [3, 3, 0, 5, 0, 0, 1, -5, -1]
Melanie [3, 3, 0, 5, 3, 1, 3, 0, 1]
J J [0, 0, 1, 0, 1, 1, 3, -1, 1]
Harry [0, 5, 3, 5, -5, 1, 0, -1, -3]
Sarah Lee [3, 0, 3, 3, -3, -3, 5, 3, 0]
OPTIONAL OUTPUT ABOVE HERE, BUT A GOOD IDEA TO PRINT TO LOOK AT

Restaurants and their average ratings
-------------------------------------
('DivinityCafe', 4.0)
('TheCommons', 3.0)
('Tandoor', 2.4285714285714284)
('IlForno', 1.8)
('FarmStead', 1.4)
('LoopPizzaGrill', 1.0)
('TheSkillet', 0.0)
('PandaExpress', -0.2)
('McDonalds', -0.3333333333333333)

Ratings similar to Sung-Hoon
----------------------------------
Wei 1
Sly one -1
Melanie -2
Sarah Lee -6
J J -14
Harry -24
Nana Grace -29

Recommendations for Sung-Hoon with 3 most similar raters
------------------------------------------------------------
('FarmStead', 1.0)
('LoopPizzaGrill', 0.0)
('Tandoor', -1.3333333333333333)
('McDonalds', -1.5)
('TheCommons', -2.0)
('TheSkillet', -2.5)
('IlForno', -3.5)
('DivinityCafe', -4.666666666666667)
('PandaExpress', -6.0)

Ratings similar to Sarah Lee
----------------------------------
Wei 40
Sly one 33
Harry 33
Melanie 27
Nana Grace 14
J J 9
Sung-Hoon -6

Recommendations for Sarah Lee with 3 most similar raters
------------------------------------------------------------
('Tandoor', 149.5)
('TheCommons', 128.0)
('DivinityCafe', 123.33333333333333)
('FarmStead', 69.5)
('TheSkillet', 66.0)
('LoopPizzaGrill', 62.0)
('IlForno', 33.0)
('McDonalds', -69.5)
('PandaExpress', -165.0)

Ratings similar to Melanie
----------------------------------
Sly one 49
Nana Grace 45
Wei 28
Sarah Lee 27
Harry 23
J J 14
Sung-Hoon -2

Recommendations for Melanie with 3 most similar raters
------------------------------------------------------------
('DivinityCafe', 166.0)
('TheSkillet', 147.0)
('TheCommons', 122.0)
('Tandoor', 110.66666666666667)
('IlForno', 92.0)
('FarmStead', 28.0)
('LoopPizzaGrill', 2.0)
('McDonalds', -36.5)
This is similar to RecommenderFood. You need to print out a different number of items. See the requirements.
The functions described below go in the module RecommenderForAll.py that you write. You'll also need to call these functions and document the results you get.
averages(itemlist,dictratings) -- returns a list of tuples where each tuple includes an item being rated and the average rating for all those who've rated the item. The list should be sorted so that the highest rated item is first. Each tuple will contain a string (item) and a float (average), with the string first.
The parameters are the list of items and the dictionary that are returned by a reading module's processdata function. In calculating averages you should not count raters who give a value of 0, meaning "not rated". Before dividing by n, where n is the number of raters who gave a non-zero rating, you should check that n is not 0. If it is 0 you do not divide; the result is just 0 for an item that no one rated.
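The description above could be turned into code roughly as follows. This is a sketch rather than a reference solution; it assumes every ratings list has the same length as itemlist, and it leaves the order of tied items unspecified beyond what the sort happens to produce.

```python
def averages(itemlist, dictratings):
    """Return [(item, average), ...] sorted with the highest average first."""
    result = []
    for index, item in enumerate(itemlist):
        # keep only real ratings; 0 means "not rated"
        rated = [row[index] for row in dictratings.values() if row[index] != 0]
        if rated:
            average = sum(rated) / len(rated)
        else:
            average = 0.0               # no one rated this item, so do not divide
        result.append((item, average))
    result.sort(key=lambda pair: pair[1], reverse=True)
    return result
```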
similarities(name, dictratings) -- returns a list of two-tuples where each tuple contains a rater-name (string) and a similarity-index (int). The list is sorted with the most-similar rater first. Similarity should be calculated for the user whose name is a parameter, using dot-products as described below.
The parameters are a string (the name of a rater) and a dictionary of ratings as returned from processdata.
The rater whose name is the parameter should not be compared with themselves, i.e., the list returned should have one fewer element than the number of raters in the dictionary.
A similarity measure can be calculated by finding the dot-product of two rating-lists. For example, for the rating lists [-3,0,5,3] and [-1,3,0,5] the similarity is -3*-1 + 0*3 + 5*0 + 3*5, where corresponding elements of the lists are multiplied and the products are summed. This yields a similarity measure of 3+15 = 18. For the lists [-3,0,5,3] and [3,0,-3,3] the similarity measure is -3*3 + 0*0 + 5*-3 + 3*3 = -9 + -15 + 9 = -15. The rater with [-1,3,0,5] is closer to [-3,0,5,3] than is the rater with [3,0,-3,3], since the measures are 18 and -15, respectively. The idea is that two negative or two positive ratings make users closer than do a negative and a positive rating.
The arithmetic result of summing the corresponding products is called the dot-product and is actually related to a measure of the angle between two ratings in a mathematical ratings space.
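A dot-product based version might be sketched like this. The helper name dot_product is hypothetical, and the sketch assumes name is a key in dictratings and that all rating lists have the same length.

```python
def dot_product(first, second):
    """Multiply corresponding elements and add up the products."""
    return sum(a * b for a, b in zip(first, second))


def similarities(name, dictratings):
    """Return [(other rater, similarity), ...] with the most similar rater first."""
    mine = dictratings[name]
    result = [
        (other, dot_product(mine, row))
        for other, row in dictratings.items()
        if other != name                # a rater is not compared with themselves
    ]
    result.sort(key=lambda pair: pair[1], reverse=True)
    return result
```

Using the lists from the example, a rater with [-3,0,5,3] compared against [-1,3,0,5] and [3,0,-3,3] gets similarities 18 and -15.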
recommended(simlist,itemlist,dictratings,n) -- returns a list of recommended items. The parameter itemlist is the list of items returned by processdata, as is the dictionary dictratings. The parameter simlist is the list returned by similarities, and n is a number that indicates how many ratings from simlist should be used.
The idea is to weight the ratings of similar raters more than the ratings of those with whom you don't agree. Consider these ratings, for example, for a user whose ratings are [5,3,-5].
[1,5,-3]
[5,-3,5]
[1,3,0]
The similarity measures are
1*5 + 5*3 + -3*-5 = 35
5*5 + -3*3 + 5*-5 = -9
1*5 + 3*3 + 0*-5 = 14
So we should weight the first set of ratings most and the second set of ratings least because of how similar these raters are to us and our ratings.
We do this by accumulating a weighted sum as follows:
35 * [1,5,-3]  = [ 35, 175, -105]
-9 * [5,-3,5]  = [-45,  27,  -45]
14 * [1,3,0]   = [ 14,  42,    0]
---------------------------------
                 [  4, 244, -150]
                    /3   /3    /2
---------------------------------
                  1.33 81.33  -75
Note that in the last step we divide each column by the number of nonzero ratings in that column. The last column was -105 + -45 = -150. There were only two entries that were nonzero, so we divide -150 by 2 and the result is -75.
This means that the best choice for us is the second item, whose score is 81.33; the next is the first item, whose score is 1.33; and the least-recommended is the last item, whose score is -75.
The list returned is sorted from most-recommended to least-recommended and is a list of tuples where the first element is the name of an item and the second element is the score (a float) for that item. Scores are calculated using n entries from the list simlist, so that if n==1 we use only the closest rater's ratings and if n==len(simlist) we use them all.
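The weighted-sum calculation can be sketched as below. One detail is an assumption: an item that none of the n chosen raters rated is left out of the result, which is how the sample output for Melanie appears to behave. Check the requirements before relying on that.

```python
def recommended(simlist, itemlist, dictratings, n):
    """Return [(item, score), ...] sorted from most to least recommended."""
    totals = [0] * len(itemlist)        # weighted sum for each item
    counts = [0] * len(itemlist)        # how many nonzero ratings went into it
    for rater, weight in simlist[:n]:   # only the n most similar raters
        for index, rating in enumerate(dictratings[rater]):
            if rating != 0:
                totals[index] += weight * rating
                counts[index] += 1

    result = []
    for index, item in enumerate(itemlist):
        if counts[index] > 0:           # assumption: skip items nobody rated
            result.append((item, totals[index] / counts[index]))
    result.sort(key=lambda pair: pair[1], reverse=True)
    return result
```

With the three raters from the worked example and n equal to 3, the totals are [4, 244, -150] and the counts are [3, 3, 2], which gives the same scores as the table.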