DUE: Tuesday 3/31 11:59pm
HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select
compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.
WHAT TO SUBMIT: Nothing is required for this part.
Start up your VM and issue the following commands, which will install
mrjob python package and create a working directory for your assignment:
/opt/datacourse/sync.sh cp -pr /opt/datacourse/assignments/hw10/ ~/hw10/
To get you started, We have provided a sample file
tweet_sm.txt containing a fairly small number of tweets and a simple program
hashtag_count.py that counts the total number of tweets and the total number of uses of hashtags.
You can execute the MapReduce program locally on your VM, without using a cluster, as follows:
python hashtag_count.py tweets_sm.txt
As a rule of thumb, you should always test and debug your MapReduce program locally on smaller datasets, before you attempt it on a big cluster on Amazon---it will cost you money!
If you haven't done so, follow the instructions here to sign up for Amazon AWS. You do not need to create new VM instances on Amazon at this time. You will simply use your own course VM to program; later on, when you run your code, you can tell it to automatically make use of additional computing resources, remotely, on Amazon AWS (if your course VM is hosted by Amazon, it will be separate from the additional computing resources).
Next, you need to make sure that your VM can access the Amazon AWS key pair, which you obtained by following the instructions above. Make a copy of the file
datacourse.pem in the assignment directory inside the VM. Run the following command in the assignment directory inside VM:
chmod go= datacourse.pem
Now, edit the file
mrjob.conf in the assignemnt working directory on your VM. Replace the first two fields with your own access key id and secret access key. Make sure there is a space between each colon and its ensuing value in
mrjob.conf; it's touchy about that!
Now, you are ready to run a program on Amazon EMR (Elastic MapReduce using Hadoop)! Just type the following command:
python hashtag_count.py -c mrjob.conf -r emr tweets_sm.txt
If you get an invalid SSH key error, it might be a region match error. Key pairs are bound to specific regions. You can check the region in the upper right of the AWS Management Console, and make sure it matches what's specified in
You will see lots of diagonistic info flying by. The program will actually take longer to run on Amazon than on your local machine! That's expected, because of the various overhead involved. MapReduce on a cluster is really meant for problems much, much larger than this one.
mrjob.conf, you can optionally tweak:
num_ec2_instances, i.e., the number of Amazon machines) on which to run your program. For massive data you will need a larger cluster to finish in a reasonable amount of time. But don't go overboard because more machines imply more money will be charged to your account.
ec2_instance_type:), which is currently set to
m1.small. This basic machine should suffice for us. Fancier machine types cost more.
bootstrap_actions). Generally speaking, machines with more cores can run more concurrent tasks. You can leave this out and just trust EMR's default setting.
jobconf). You can usually leave them unspecified; EMR/Hadoop will pick reasonable defaults based on what it thinks of as a reasonable unit amount of work per task. Even if you specify them, Hadoop may choose to ignore them in some cases, so think of them as only optimization "hints".
In this exercise, you will write a MapReduce program to find the 50 most popular hashtags from a file containing approximately 3.5 million tweets (a couple hours in the twitterverse).
We strongly recommend that you debug first locally before running on the full file in Amazon. You can make a copy of the example code:
cp hashtag_count.py topk.py
topk.py to suit your needs. When running on Amazon, you can use the following files:
Small test file: s3://cs216-spring2015/twitter/tweets_sm.txt
Big file: s3://cs216-spring2015/twitter/tweets-full/*
You can give these file URLs to your python program directly, e.g.:
python topk.py -c mrjob.conf -r emr s3://cs216-spring2015/twitter/tweets-full/*
Hint 1: Your approach should work on much bigger datasets than what we are using here. Keep in mind the tips from class about potential problems when computing on data in parallel.
Hint 2: You can create standard python data structures within MRJob functions or class instances. Keep in mind that these will be stored in-memory for each mapper or reducer, so if used, any such structure should be kept small.
WHAT TO SUBMIT: Submit a plain-text file named
topk.txt with your results. You need to submit the results on the big file (the small file is there in case you need to debug your program). Submit also your code (
topk.py) including comments explaining the code.