Homework #8: Graph Data

DUE: Monday 3/16 11:59pm (extended)

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

To get ready for this assignment, open up a VM shell, and type the following command:

/opt/datacourse/sync.sh

Next, type the following commands to create a working directory for this homework. Here we use hw08 under your shared/ directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/hw08/ ~/shared/hw08/
cd ~/shared/hw08/

1. From Bipartite to Social

For this assignment, you will build and analyze a social network graph for characters in the Marvel Universe such as Captain America, Spider-Man, Iron Man, etc.

We've put raw data in character_issues.tsv. This is a file of tab-separated values, where each value is enclosed in double quotes. Each line lists a character (e.g., "FROST, CARMILLA") and a comic book issue (e.g., "AA2 35"), meaning that this character appears in the issue. Overall, this file specifies a bipartite graph, which consists of two disjoint sets of nodes (characters and issues in this case), with edges going across these sets but never within each set.

We are going to build a social network graph where nodes are the characters, and there is an edge between two characters if they ever appear together in the same issue. Your job is write a program to convert data in character_issues.tsv into a file characters.tsv of the same format for the social network graph. For example, suppose characters \(A\) and \(B\) appear together in at least one issue. Then, in characters.tsv, you should have a single line with \(A\) and \(B\), specifying an undirected edge \((A, B)\). Do not have multiple lines for \((A, B)\), and do not list \((B, A)\) if you already list \((A, B)\) (by convention, you can require that the first node precedes the second node lexicographically).

There are many ways to code up this conversion. We have provided a code template, convert.py, in the assignment directory, as a starting point. Modify this file such that running the command

python convert.py

will produce the correct characters.tsv as specified above. We have also provided a file characters-ref.tsv, which is the official answer that you can check our output against.

NOTE: The ordering of lines in the output files may be different even though their contents may be the same. To compare contents, you may use the command:

diff <(sort characters.tsv) <(sort characters-ref.tsv)

WHAT TO SUBMIT: Your convert.py. There is no need to submit your characters.tsv.

2. Getting to Know the Marval Social Network

In this part of the assignment, you will use Python graph_tool to analyze the Marval social network graph you obtained in Part 1. (If you didn't manage to get Part 1 to work, don't worry; you can still work on the rest of the homework by substituting characters-ref.tsv for characters.tsv in the instructions below.)

We have prepared a Python program graph.py in the assignment directory, which computes the degree distribution, distance distribution (and diameter), and clustering coefficients for an undirected graph. By default, running

python graph.py

will generate and analyze a random graph of \(1000\) nodes with edge probability \(p=0.01\). It will show some plots and store copies of them in .pdf files. Note that you need to close the plot window to let the program proceed to the next analysis. If you ever get tired of clicking, you can disable the display using the command-line option --batch (you will still get the .pdf files):

python graph.py --batch

Study the code in graph.py carefully to learn how to use graph_tool for graph analytics and how to generate various plots. (Ignore the last part of the code about "my clustering coefficients" for now.) If you want to learn more about graph_tool, check out its "quick start" guide, as well as general documentation.

Your job is to use graph.py to analyze the Marval social network graph by running:

python graph.py characters.tsv

Examine the various characteristics you have computed for the Marval social network graph. What is the most "unusual"? Why would it be considered unusual and is there an explanation for this unusual characteristic?

WHAT TO SUBMIT: A plain-text file 2.txt with your response to the questions above.

3. Computing Clustering Coefficients

Your job for this part of the assignment is to implement, by yourself, the computation of cluster coefficients for all nodes in an undirected graph. Modify the function my_compute_clustering_coeffients in graph.py for this purpose. This function is supposed to return a list or array object, whose \(i\)-th entry stores the clustering coefficient for the \(i\)-th node in the given graph \(G\).

Uncomment the last several lines of graph.py to test your code. Note that graph.py will first call graph_tool's built-in function for computing clustering coefficients; you should compare its output with that produced by your code.

WHAT TO SUBMIT: Your modified graph.py.