Lab #8: Graph Detective

DUE: Monday 3/16 11:59pm

HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci216 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing members of your team who are present at the lab. To earn extra credit for the lab challenge, you must get your solutions checked off in class, but you don't need to submit anything.

0. Getting Ready

To get ready for this lab, fire up your virtual machine and enter the following command:

/opt/datacourse/sync.sh

This command will download the files which we are going to use for this lab.

Next, type the following commands to create a working directory for this lab. Here we use lab08 under your shared directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/lab08/ ~/shared/lab08/
cd ~/shared/lab08/

1. Which is Which?

In the working directory of this lab, you will find a bunch of graphs in the GraphML format: 1.xml.gz, 2.xml.gz, etc. These are compressed XML files. They are actually human-readable---you can use the command less to view them (e.g., type the command less 1.xml.gz). Their format is fairly self-explanatory, and you don't need to understand the details about this format for this lab.

These files correspond to the following graphs. Can you tell which one is which? Of course, the ordering of files has been scrambled. Also, in the case of real-world graphs below, the node ids have been permuted and don't correspond to the ids in the original datasets.

(Random-Sparse) A random graph with \(p < 1/50\). See Lecture #8, slides 11-14 for the definition of a random graph. Recall that \(p\) is the probability that an edge exists between a given pair of nodes.

(Random-Dense) A random graph with \(p > 1/4\).

(BA-1) A Barabasi-Albert graph (preferential attachment model) with \(m = 1\). See Lecture #8, slides 24-27 for the definition of Barabasi-Albert model of graph construction. Recall that nodes are added one at a time, and each node will try to preferentially attach itself to \(m\) existing nodes; those already with high degrees will be preferred.

(BA-3) A Barabasi-Albert graph with \(m = 3\).

(Congress) The nodes in this graph represent Representatives and roll calls in the House in 2013-2014. If a Representative \(u\) voted "Yea" or "Aye" in a roll call \(v\), then we draw an edge between \(u\) and \(v\); otherwise, there is no edge between \(u\) and \(v\). To obtain a graph with \(N\) nodes, we pick them uniformly at random and consider edges among them. The data came from govtrack.us (in fact it's in our congress PostgreSQL database).

(Coauthor) The nodes in this graph represent authors in the field of data mining. There is an edge between two nodes if the two authors have written at least one paper together. To obtain a graph with \(N\) nodes, we pick \(N\) nodes with the highest degrees in the original graph, and then consider just these \(N\) nodes and the edges among them. The data came from the WSDM Data Challenge.

(Facebook) The nodes in this graph are Facebook users and edges indicate friendships. The dataset is derived from a sample of the Facebook social network obtained by a survey conducted by Stanford researchers. To further downsample the graph to \(N\) nodes, we use the same procedure as what we do for the coauthorship graph: just pick \(N\) nodes with the higest degrees, and then consider just them and the edges among them.

TOOL OF TRADE: To find out which file corresponds to which graph, you are free to use any tool and run any kind of analysis on the files. For your convenience, however, we've included the script graph.py that you used in Homework #8. (Hint: In fact, graph.py alone should suffice for this lab.)

To analyze 1.xml.gz, for example, just issue the command:

python graph.py 1.xml.gz

The two figures (degree and distance distributions) will be automatically saved in two PDF files (with prefix 1.xml). Alternatively, you can just run graph.py in a batch mode (which saves you the trouble of clicking on windows):

python graph.py --batch 1.xml.gz > 1.xml-report.txt

(Remember to change both occurrences of 1.xml above to match the file you want to analyze.) This command will save the two figures in PDF files and the textual output in the .txt file, which you can look at further.

GETTING CHECKED OFF: Once you feel that you have an answer, raise your hands to get your result checked off by course staff. To discourage random guessing:

  • You must give a "complete" answer, i.e., one in which every file is identified as one of the graphs.
    • It is okay to guess if you are still unsure about some graphs, but guess strategically.
  • You will be told how many correct matches you have in your answer, but not which ones.
  • You have three "lifelines." A lifeline allows you to get a yes/no answer to a question of the form "is our answer for file \(i\) correct?"

WINNING: The teams will be ranked by the number of correct matches in their best answer. Ties are broken by the number of lifelines used. If there is still a tie, it will be broken by the time of the answer.

EXTRA CREDITS: If you correctly identifying all but no more than two graphs, you receive extra credits worth 10% of a homework.