# Homework #4: Basic Stats and Probability¶

DUE: Sunday 2/8 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

# 0. Getting Started¶

WHAT TO SUBMIT: Nothing is required for this part.

For this assignment, we will install quite a bit more software. Before you proceed, make sure that your host is connected to a power source and a fast network, and allocate yourself enough time (say 30 minutes) to complete the setup without interrupting it.

To get ready for this assignment, open up a VM shell, and type the following command:

/opt/datacourse/sync.sh

Next, type the following commands to create a working directory for this homework. Here we use hw04 under your shared/ directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/hw04/ ~/shared/hw04/
cd ~/shared/hw04/

# 1. Probabilities¶

There are 64 students in our class. We had one round of random team assignment and will soon have a second round. In each round, we randomly divide the students into 16 groups of size 4 each.

(A) $A$ and $B$ are teammates in Round 1. What's the probability that they are teammates again in Round 2?

(B) What's probability that no pair of teammates in Round 1 are teammates again in Round 2?

NOTE: You can arrive at your answer either analytically or computationally (by simluation). With the latter approach, you simulate the process many times, and see what fraction of these times the event of interest occcurs. In the prob/ subdirectory, you will find some example simulation code, which computes the expected percentage of boys in a population consisting of a given number of families. For example:

cd ~/shared/hw04/prob/
python sim-ferengi.py 1 5000

computes the expected percentage of boys in one family using 5000 simulations. Play around with different numbers of families and see how the percentage changes. Feel free to use the code as a template and modify it to suite your needs.

WHAT TO SUBMIT: Submit one file containing your answers to all questions below. Name this file 1-prob with appropriate extension (e.g., .txt for plain text or .pdf for PDF). We prefer plain text, but it's also fine if you want to handwrite your answers and scan them into a PDF file, or if you write your answers in LaTeX and compile them into PDF. If you use the simultation approach, submit your code as well, and explain which file does what.

# 2. Testing against Benford's Law¶

Suppose that you measure some naturally-occurring phenomenon, such as the land area of cities or lakes, or the price of stocks on the stock market. You can plot a histogram of of the first digits of your measurements, showing how frequent each first digit (1-9) is.

Your measurements were made in some arbitrary units, such as square miles or dollars. What if you made the measurements in some other units, such as acres or euros? You would not expect the shape of the histogram to differ just because you changed the units of measurement, because the units are arbitrary and unrelated to the underlying phenomenon. If the shape of the histogram did change, that could indicate that the data were fraudulent. This approach has been used to detect fraud, especially in accounting and economics but also in science.

Data are called scale-invariant if measuring them in any scale (say, meters vs. feet vs. inches) yields similar results, such as the same histogram of first digits. Many natural processes produce scale-invariant data. Benford's Law states the remarkable fact there is only one possible histogram that results from all of these processes! Let $P(d)$ be the probability of the given digit $d \in \{1, 2, \ldots, 9\}$ being the first one in a measurement. Then, we have $P(d) = \log_{10}(d+1) - \log_{10}(d)$.

NOTE: Benford's law only holds when the data has no natural limit nor cutoff. For example, it would not hold for grades in a class (which are in the range 0% to 100% or 0.0 to 4.0) nor for people's height in centimeters (where almost every value would start with 1 or 2). If you are interested, Wikipedia has more information about Benford's Law. (You don't need to read the Wikipedia article in order to complete this problem, however.)

(A) In the benford/ subdirectory, you will find a script test-and-plot.py that parses population numbers from the U.S. census, plots the distribution of their first digits side-by-side with the "theoretical" distribution, and performs hypothesis testing (using a $\chi^2$ test) to get a sense of how well the observed first-digit frequencies conform to the theoretical distribution. Try running this script and see what it does:

cd ~/shared/hw04/benford/
python test-and-plot.py

Closing the figure window will exit the script. It will also produce a PDF version of the figure in the same directory.

The p-value computed by the supplied script is wrong! The theoretical distribution in this case should be governed by Benford's Law, but the current implementation uses the uniform distribution. Your task is to modify the benford() function in the script so it reflects Benford's Law instead. Run your modified script and report the p-value. Is it close to 1 or 0? What does that tell you?

(B) Next, you are going to modify the script further to analyze data from the 2009 Iranian election in election-iran-2009.csv. (You might want to save some backup versions of the script.) The numbers of interest in the .csv file are the vote counts for candidates Ahmadinejad, Rezai, Karrubi, and Mousavi. Note that these numbers have commas in them. You might find Python's csv.DictReader useful.

Consider the first-digit distribution of all candidates' vote counts, as well as that of each candidate's vote counts. Compare each of these distributions to Benford Law. Which distribution gives you the lowest p-value? How does it compare with the magic p-value of 0.05, which people often use to "reject" the hypothesis that data follows the theoretical distribution?

WHAT TO SUBMIT: Submit a plain-text file named 2-benford.txt with your answers to questions above. In addition, submit the .pdf file produced by your script for the distribution in (B) with the lowest p-value, as well as the .py script itself.

# 3. Exploring MovieLens Data¶

In the movielens/ subdirectory, you will find data about 100,000 ratings from 1000 users on 1700 movies, made available by grouplen.org. The raw files can found under ml-100k/ (note that we have removed some of the files that are irrelevant for this assignment). We have provided a setup script (setup.sh) that will create a database called movielens for you to explore, but feel free to use any other tool (e.g., Excel). To run the setup script, use the following commands:

cd ~/shared/hw04/movielens/
./setup.sh

If you plan to use the database, take a look at create.sql: you will notice that we have created a helper view movie_genre that converts the "one-hot" encoding of genres of the input data into a more "relational" format.

WHAT TO SUBMIT: For (A) and (B) below, submit your code (be it SQL, Python, or any other language) as well as output; name the files 3A and 3B with appropriate extensions. For (C), submit a plain-text file 3C.txt (with your answer and explanation), as well as any code used.

(A) Compute the mean and standard deviation of ratings by movie genre and reviewer gender. (In PostgreSQL, there is an aggregation function called STDDEV.) Sort the results by mean in descending order. The top three rows of your output should look like this:

    genre     | gender |        avg         |         stddev
--------------+--------+--------------------+------------------------
film_noir    | M      | 3.9732937685459941 | 0.95831397959902417403
war          | M      | 3.8263282008600361 |     1.0689702552913549
war          | F      | 3.7811786203746003 |     1.1162307132041698

(B) Plot a histogram for the distribution of average ratings for movies. In other words, each number in the dataset corresponds to one movie's average rating. (There is a sample script histo-movies.py that demonstrates how a python script talks to the database using psycopg2.)

(C) Suppose a user rates the two movies To Be or Not to Be (1942) and Bad Taste (1987). What are the odds that this user is female, based on the data given?

The most straightforward method would just look at all users who have rated these two movies, compute the breakdown of the number of reviewers by gender, and use that ratio to predict the gender of the mysterious reviewer. Does this method work in this case? If not, what can you do?

Hint: To make this problem a bit less open-ended, you may assume the following (but feel free to come up with your own assumptions):

• We will only use the following information in making our bet: movies(id, title), ratings(user_id, movie_id), and users(id, gender).

• Given a user's gender, the chances that the user rates any two given movies are independent. In other words, P($U$ rates movies $A$ and $B$ | $U$ is female) = P($U$ rates $A$ | $U$ is female) $\times$ P($U$ rates $B$ | $U$ is female) (and the same goes for male).