Homework #11: Graph and MapReduce

DUE: Sunday 4/5 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

To get ready for this assignment, open up a VM shell, and type the following command:

/opt/datacourse/sync.sh

Next, type the following commands to create a working directory for this homework. Here we use hw11 under your shared/ directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/hw11/ ~/shared/hw11/
cd ~/shared/hw11/

1. PageRank in MapReduce

First, study the code in pagerank.py, which we went over in class. You can run this code, as is, on the Web graph of Stanford Web pages (from http://snap.stanford.edu/data/web-Stanford.html). To test the code locally on a tiny subgraph, run the command:

python pagerank.py tiny-web-Stanford.txt > out-tiny.txt

This command will save the output results in out-tiny.txt. Each result line lists the page id, its PageRank, and then the list of page ids that it points to. The full dataset, web-Stanford.txt, has 281,903 nodes and will take much longer to run:

python pagerank.py web-Stanford.txt > out-full.txt
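
If you have not looked at pagerank.py recently, the general shape of a PageRank step in mrjob is worth keeping in mind. The following is only a minimal sketch of a single iteration, not the actual pagerank.py; the node format, the class name, and the constants D and N are assumptions for illustration:

# Minimal sketch of one PageRank iteration in mrjob (NOT the actual
# pagerank.py). Node records are assumed to be JSON values of the form
# {"rank": r, "links": [...]}; D and N are illustrative constants.
from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol

D = 0.85      # damping factor (assumed value)
N = 281903    # number of nodes in web-Stanford.txt

class MRPageRankStep(MRJob):
    INPUT_PROTOCOL = JSONProtocol
    OUTPUT_PROTOCOL = JSONProtocol

    def mapper(self, node_id, node):
        # Pass the graph structure through to the reducer...
        yield node_id, ('node', node)
        # ...and send each outgoing neighbor its share of this node's
        # rank. (A dead end contributes nothing here; problem (2) below
        # changes that.)
        if node['links']:
            share = node['rank'] / len(node['links'])
            for dest in node['links']:
                yield dest, ('rank', share)

    def reducer(self, node_id, values):
        node, incoming = {'links': []}, 0.0
        for tag, payload in values:
            if tag == 'node':
                node = payload
            else:
                incoming += payload
        # Random-surfer update: follow a link with probability D,
        # teleport to a random page with probability 1 - D.
        node['rank'] = (1 - D) / N + D * incoming
        yield node_id, node

if __name__ == '__main__':
    MRPageRankStep.run()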

1) Your first job is to try pagerank.py on Amazon EMR. To do so, you will need the two files datacourse.pem and mrjob.conf from Homework #10. Assuming that you have working versions of these files in your Homework #10 directory, you can just copy them to your working directory as follows:

cp -p ~/shared/hw10/datacourse.pem ~/shared/hw10/mrjob.conf ~/shared/hw11/

Feel free to tweak mrjob.conf as you see fit (such as increasing num_ec2_instances). Then, you can just run pagerank.py as follows:

python pagerank.py -c mrjob.conf -r emr s3://cs216-spring2015/web/web-Stanford.txt > out-full.txt

For your convenience, we have put a copy of the input file on Amazon S3 at the s3:// URL above.
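
If you do increase capacity, the relevant part of mrjob.conf might look roughly like the following. This is only a sketch: the option names follow mrjob's EMR runner of this vintage, the values are illustrative, and your copy from Homework #10 should already contain the credential settings it needs:

runners:
  emr:
    ec2_key_pair: datacourse           # assumed to match datacourse.pem
    ec2_key_pair_file: datacourse.pem
    num_ec2_instances: 4               # more instances than the default
    ec2_instance_type: m1.medium       # illustrative instance type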

2) Your next job is to retool pagerank.py to accomplish two things:

  • The current implementation assumes that when a random surfer reaches a "dead end," she immediately teleports to a random page (i.e., Option 1 on slide 13 of Lecture 11). Change the implementation so that when a random surfer reaches a "dead end," with probability d she simply stays on the same page (i.e., Option 2 on the same slide).
  • Modify the code so that it returns only the 20 pages with the highest PageRank. Each result line should list just the page id and its PageRank. (One possible shape for both changes is sketched after this list.)
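
To be concrete, here is one way the two changes might fit together, building on the single-iteration sketch from part (1). This is a sketch under assumed structure and names, not the actual pagerank.py:

# Sketch of both modifications (assumed structure and names, NOT the
# actual pagerank.py): Option-2 dead-end handling in the mapper, plus
# a final step that keeps only the top 20 pages.
import heapq

from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol
from mrjob.step import MRStep

D = 0.85      # damping factor (assumed value)
N = 281903    # number of nodes in web-Stanford.txt

class MRTopPageRank(MRJob):
    INPUT_PROTOCOL = JSONProtocol

    def mapper(self, node_id, node):
        yield node_id, ('node', node)
        if node['links']:
            share = node['rank'] / len(node['links'])
            for dest in node['links']:
                yield dest, ('rank', share)
        else:
            # Option 2: a dead end sends its whole rank back to itself;
            # the damping step in the reducer then keeps the surfer
            # there with probability D.
            yield node_id, ('rank', node['rank'])

    def reducer(self, node_id, values):
        node, incoming = {'links': []}, 0.0
        for tag, payload in values:
            if tag == 'node':
                node = payload
            else:
                incoming += payload
        node['rank'] = (1 - D) / N + D * incoming
        yield node_id, node

    # Final step: funnel every node to a single reducer, keep the 20
    # highest ranks, and emit just (page id, PageRank).
    def top_mapper(self, node_id, node):
        yield None, (node['rank'], node_id)

    def top_reducer(self, _, ranked):
        for rank, node_id in heapq.nlargest(20, ranked):
            yield node_id, rank

    def steps(self):
        return ([MRStep(mapper=self.mapper, reducer=self.reducer)
                 for _ in range(4)] +
                [MRStep(mapper=self.top_mapper, reducer=self.top_reducer)])

if __name__ == '__main__':
    MRTopPageRank.run()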

Note that we set the number of PageRank iterations to 4; there is no need to tweak it for this problem. Also, to help you debug, you will find output files produced by our sample solution in ref-tiny.txt and ref-full.txt.
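
Since the order of MapReduce output lines is not guaranteed, comparing sorted files is a convenient sanity check (assuming your output uses the same line format as the reference files):

diff <(sort out-tiny.txt) <(sort ref-tiny.txt)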

WHAT TO SUBMIT: For (1), a text file 1.txt commenting on the relative speeds of local vs. Amazon executions; for (2), your modified pagerank.py, as well as the output files out-tiny.txt and out-full.txt.