DUE: Sunday 4/5 11:59pm
HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216
and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.
WHAT TO SUBMIT: Nothing is required for this part.
To get ready for this assignment, open up a VM shell, and type the following command:
/opt/datacourse/sync.sh
Next, type the following commands to create a working directory for this homework. Here we use hw11
under your shared/
directory, but feel free to change it to another location.
cp -pr /opt/datacourse/assignments/hw11/ ~/shared/hw11/
cd ~/shared/hw11/
First, study the code in in pagerank.py
, which we went over in class. You can run this code, as is, on the Web graph of Stanford Web pages (from http://snap.stanford.edu/data/web-Stanford.html). To test the code locally on a tiny subgraph, run the command:
python pagerank.py tiny-web-Stanford.txt > out-tiny.txt
This command will save the output results in out-tiny.txt
. Each result line lists the page id, its PageRank, and then the list of page ids that it points to. The full dataset, web-Stanford.txt
, has 281903 nodes, and will take much longer to run:
python pagerank.py web-Stanford.txt > out-full.txt
1) Your first job is to try pagerank.py
on Amazon EMR. To do so, you will need the two files datacourse.pem
and mrjob.conf
from Homework #10. Assuming that you have working versions of these files in your Homework #10 directory, you can just copy them to your working directory as follows:
cp -p ~/shared/hw10/datacourse.pem ~/shared/hw10/mrjob.conf ~/shared/hw11/
Feel free to tweak mrjob.conf
as you see fit (such as increasing num_ec2_instances
). Then, you can just run pagerank.py
as follows:
python pagerank.py -c mrjob.conf -r emr s3://cs216-spring2015/web/web-Stanford.txt > out-full.txt
Here, for your convenience, we have put a copy of the input file on Amazon S3 here at the s3://
URL above.
2) Your next job here is to retool pagerank.py
to accomplish two things:
Note that we set the number of PageRank iterations to 4; there is no need to tweak it for the purpose of this problem. Also, to help to debug, you will find output files produced by our sample solution in ref-tiny.txt
and ref-full.txt
.
WHAT TO SUBMIT: For (1), a text file 1.txt
commenting on the relative speeds of local vs. Amazon executions; for (2), your modified pagerank.py
, as well as the output files out-tiny.txt
and out-full.txt
.