DUE: Monday 3/16 11:59pm
HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select
compsci216 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file
team.txt listing members of your team who are present at the lab. To earn extra credit for the lab challenge, you must get your solutions checked off in class, but you don't need to submit anything.
To get ready for this lab, fire up your virtual machine and enter the following command:
This command will download the files which we are going to use for this lab.
Next, type the following commands to create a working directory for this lab. Here we use
lab08 under your
shared directory, but feel free to change it to another location.
cp -pr /opt/datacourse/assignments/lab08/ ~/shared/lab08/ cd ~/shared/lab08/
In the working directory of this lab, you will find a bunch of graphs in the GraphML format:
2.xml.gz, etc. These are compressed XML files. They are actually human-readable---you can use the command
less to view them (e.g., type the command
less 1.xml.gz). Their format is fairly self-explanatory, and you don't need to understand the details about this format for this lab.
These files correspond to the following graphs. Can you tell which one is which? Of course, the ordering of files has been scrambled. Also, in the case of real-world graphs below, the node ids have been permuted and don't correspond to the ids in the original datasets.
(Random-Sparse) A random graph with \(p < 1/50\). See Lecture #8, slides 11-14 for the definition of a random graph. Recall that \(p\) is the probability that an edge exists between a given pair of nodes.
(Random-Dense) A random graph with \(p > 1/4\).
(BA-1) A Barabasi-Albert graph (preferential attachment model) with \(m = 1\). See Lecture #8, slides 24-27 for the definition of Barabasi-Albert model of graph construction. Recall that nodes are added one at a time, and each node will try to preferentially attach itself to \(m\) existing nodes; those already with high degrees will be preferred.
(BA-3) A Barabasi-Albert graph with \(m = 3\).
(Congress) The nodes in this graph represent Representatives and roll calls in the House in 2013-2014. If a Representative \(u\) voted "Yea" or "Aye" in a roll call \(v\), then we draw an edge between \(u\) and \(v\); otherwise, there is no edge between \(u\) and \(v\). To obtain a graph with \(N\) nodes, we pick them uniformly at random and consider edges among them. The data came from govtrack.us (in fact it's in our
congress PostgreSQL database).
(Coauthor) The nodes in this graph represent authors in the field of data mining. There is an edge between two nodes if the two authors have written at least one paper together. To obtain a graph with \(N\) nodes, we pick \(N\) nodes with the highest degrees in the original graph, and then consider just these \(N\) nodes and the edges among them. The data came from the WSDM Data Challenge.
(Facebook) The nodes in this graph are Facebook users and edges indicate friendships. The dataset is derived from a sample of the Facebook social network obtained by a survey conducted by Stanford researchers. To further downsample the graph to \(N\) nodes, we use the same procedure as what we do for the coauthorship graph: just pick \(N\) nodes with the higest degrees, and then consider just them and the edges among them.
TOOL OF TRADE: To find out which file corresponds to which graph, you are free to use any tool and run any kind of analysis on the files. For your convenience, however, we've included the script
graph.py that you used in Homework #8. (Hint: In fact,
graph.py alone should suffice for this lab.)
1.xml.gz, for example, just issue the command:
python graph.py 1.xml.gz
The two figures (degree and distance distributions) will be automatically saved in two PDF files (with prefix
1.xml). Alternatively, you can just run
graph.py in a batch mode (which saves you the trouble of clicking on windows):
python graph.py --batch 1.xml.gz > 1.xml-report.txt
(Remember to change both occurrences of
1.xml above to match the file you want to analyze.) This command will save the two figures in PDF files and the textual output in the
.txt file, which you can look at further.
GETTING CHECKED OFF: Once you feel that you have an answer, raise your hands to get your result checked off by course staff. To discourage random guessing:
WINNING: The teams will be ranked by the number of correct matches in their best answer. Ties are broken by the number of lifelines used. If there is still a tie, it will be broken by the time of the answer.
EXTRA CREDITS: If you correctly identifying all but no more than two graphs, you receive extra credits worth 10% of a homework.