Lab #11: Graph Sampling

DUE: Monday 4/6 11:59pm

Let's Go, Duke!

HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci216 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing members of your team who are present at the lab. To earn extra credit for the lab challenge, you must get your solutions checked off in class, but you don't need to submit anything.

0. Getting Ready

To get ready for this lab, fire up your virtual machine and enter the following command:

/opt/datacourse/sync.sh

This command will download the files which we are going to use for this lab.

Next, type the following commands to create a working directory for this lab. Here we use lab11 under your shared directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/lab11/ ~/shared/lab11/
cd ~/shared/lab11/

1. Downsizing the Facebook Graph

In the working directory of this lab, you will find a file facebook.xml.gz representing a sample of the Facebook social network obtained by a survey conducted by Stanford researchers. This graph consists of \(4039\) nodes (users) and \(88234\) edges (friendships). Your job for this lab is to further downsample the graph to \(400\) nodes, while trying to preserve the degree distribution and clustering coefficient distribution as much as possible.

The working directory contains a Python program sample.py to get you started. It already implements three sampling procedures. Besides node sampling and edge sampling discussed in the lab, it also implement a "highest-degree sampling," which basically return the subgraph induced by the nodes with the highest degrees (which was the procedure used in generating the smaller Facebook graph for Lab #8). The code also provides a template for random jump sampling. You should replace the implementation of random_jump_sample() with your own; also free feel to tweak parameter \(d\) to your liking.

To test your sampling procedure, run:

python sample.py

It will try each of the four sampling procedures in turn and print out out some information useful for debugging. Observe how these procedures compare (note that the stub implementation of random_jump_sample() simply returns the whole graph, so its result isn't meaningful until you replace it with your own implementation).

As discussed in the lab, we measure the qualtiy of the sample using the D-statistic (aka Kolmogorov–Smirnov statistic), which computes the "difference" between two distributions. The smaller this statistic the better. Since samples are random, a better comparison would be to run multiple trials and compare the average qualities. You can do so by running:

python sample.py -s -t 20

GETTING CHECKED OFF: Once you feel that you have an answer, raise your hands to get your result checked off by course staff. A correct implementation of random jump sampling should produce a D-statistic lower than \(0.45\) for both degree and cluster coefficient distributions.

WINNING: The teams will be ranked by the average D-statistic for both degree and cluster coefficient distributions over 20 trials. If there is still a tie, it will be broken by the time of the answer. You might want to implement your sampling procedure in order to beat random jump sampling.

EXTRA CREDITS: If you manage to get your D-statistic to below \(0.25\) for both cases, you will receive extra credits worth 10% of a homework.