Homework #3: Relational Data Processing

DUE: Monday 1/27 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci290 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

To get ready for this assignment, fire up your virtual machine, open up a terminal, and type the following command:


This command will (among other things) download and install the PostgreSQL database server, which we are going to use for this homework.

Next, type the following commands to create a working directory (hw03 under your home directory) for this homework:

cp -pr /opt/datacourse/assignments/hw03-template/ ~/hw03/
cd ~/hw03/

The last command will take quite a while---you will see lots of text flying by. It will download, from govtrack, YAML and JSON data files about all past and current members of the Congress as well as bills introduced and votes cast by the 113th Congress. It will also parse these data files, transform them into the relational format, and load them into a database called congress.

Once the above commands finish running, you should be able to access the database by issuing the following command:

psql congress

Before continuing with the rest of the homework, we recommend that you go through the simple tutorial on the course website to familiarize yourself with psql command-line interface.

1. Querying the Congress Database

WHAT TO SUBMIT: For each of the questions (A, B, ...) below, submit a plain-text .sql file (1A.sql, 1B.sql, ...). Note that you can enter comments in SQL using -- (anything following -- on the same line will be ignored as comments). If a query does not run or returns an incorrect answer, please document it in the comments.

To get a sense of what's in the congress database, study the file create.sql (which specifies the database schema) as well as the .dat files under the data/ subdirectory (which contains the structured data files parsed from raw data to be loaded into the database---there is one for each table). These files can be opened and viewed as plain-text files (see shell tips for help on working with plain-text files).

Also feel free to explore the data in congress by issuing SQL queries inside psql. Write SQL queries to answer the following questions. (We recommend a two-window setup demonstrated in class and described in the tutorial.)

(A) List all persons (past and present members of the Congress) whose last name is Kennedy.

(B) List all persons with last name Kennedy who served in the Senate, together with their terms. Output only first name, last name, birthday, state, party, as well as the start and end dates of the term. Here are some example output rows (the answer has more):

 first_name | last_name |  birthday  | state |  party   | start_date |  end_date  
 John       | Kennedy   | 1917-05-29 | MA    | Democrat | 1953-01-03 | 1958-08-24
 John       | Kennedy   | 1917-05-29 | MA    | Democrat | 1959-01-07 | 1960-09-01
 Robert     | Kennedy   | 1925-11-20 | NY    | Democrat | 1965-01-04 | 1968-10-14

(C) For each state, count the current number of House Representatives. List the 10 states with largest number Representatives (in descending order).

(D) In 2013, House passed a resolution to repeal the Patient Protection and Affordable Care Act. Find out which Democrats voted for this resolution.

You should be able to find this vote by looking for votes rows with type being 'On Passage of the Bill' and subject containing strings 'repeal' and 'Affordable'. For the purpose of this question, you may assume that all Representatives of interest to this question are still current members (actually, three Representatives who voted are longer current, but all three are Republicans).

NOTE: In SQL, you can do simple string matching using the LIKE operator. For example, subject LIKE '%health%' checks if subject contains the string 'health' (% matches any sequence of zero or more characters). Alternatively, in PostgreSQL, you can do regular expression matching using the ~ operator. For example, subject ~ '.*repeal.*' is effectively the same as subject LIKE '%health%'.

(E) Write a SQL query using WITH to accomplish the following.

In the WITH clause, define a temporary view called person_party_votes(vote_id, person_id, party, vote), which basically augments the information in person_votes with the voter's party at the time of the vote. To ensure correct answer, you need to handle the case where the person is no longer a current member (e.g., John Kerry), as well as the case where the person's party affiliation has changed.

In the query that follows, compute the breakdown of votes by party for each vote_id. Here are some example output rows (the answer certainly has more rows):

       vote_id        |    party    |     vote     | count 
 s100-113.2013        | Democrat    | Nay          |    40
 s100-113.2013        | Democrat    | Yea          |    13
 s100-113.2013        | Independent | Nay          |     2
 s100-113.2013        | Republican  | Nay          |     1
 s100-113.2013        | Republican  | Yea          |    44

2. Detective Work on Vote Correlation

In Lecture #3, we showed how queries computing vote correlations for David Price (D-NC) with Nancy Pelosi (D-CA) and Eric Cantor (R-VA) revealed unexpected results (slides 24-27). Can you explain why it seems that Price votes more often with Cantor than with Pelosi, when in reality it should obviously be the other way around?

For your convenience, the file price-pelosi.sql in your working directory for this homework contains the SQL query used in the class. You may edit this file to run variations of this query to help you debug.

WHAT TO SUBMIT: A plain-text file named price.txt. Give your explanation in this file, together with any SQL queries and results that support your explanation.

3. Making Claims and Checking Facts

Your job for this part is to find an interesting claim base on our congress database that may be "checked" by others in the lab on Tuesday. The following properties are desirable for a claim:

  1. The correctness of the claim can be verified by SQL queries over the congress database alone, without drawing from external data sources.

  2. The claim sounds interesting enough (that somebody might feel like tweeting it). (Okay, this is purely subjective, but try your best.)

  3. The claim may be technically correct, but nonetheless misleading. Others will need to come up with non-trivial couterarguments to refute what the claim implies.

The claim about Price vs. Pelosi and Price vs. Cantor in Part 2 of this homework meets (a) and (b) above, but doesn't meet (c) because it is technically incorrect.

One example of a claim meeting (a), (b), and to some extent (c), would be

The average age of U.S. Representatives in Alaska is a whopping 81!

True, but somewhat misleading---Alaska has only one Representative in the House.

Real-world examples of claims meeting (a), (b), and (c) can be found in the back-and-forth attacks in election campaigns. A TV ad in the 2010 elections claimed that Jim Marshall, a Democratic incumbent from Georgia "voted the same as Republican leaders 65 percent of the time." This comparison was made with Republican Leader John Boehner over the votes in 2010. If we look at the history since 2007, however, the number would have been only 56 percent, which is not very high considering the fact that even the Democratic Whip, Jim Clyburn, voted 44 percent of the time with Boehner during that period. In fact, this ad was in response to an earlier ad attacking Marshall for voting with Nancy Pelosi "almost 90 percent of the time," which, not surprisingly, also tailored the claim in ways to further its own argument. You can read more about such claims here at FactCheck.org.

Your job is to make an interesting claim from our congress database, which contains all historical members of the Congress but voting records for only 2013. If your claim is misleading at the same time, that's even better---in that case, please also provide a counterargument to your claim (similar to the counterarguments made in the two examples above).

WHAT TO SUBMIT: A plain-text file named claim.txt. Structure it in three sections: 1) The claim in English (and, if it is intended to mislead, what you are trying to imply). 2) One more more SQL queries over the congress database to establish the correctness of your claim. 3) A counterargument to your claim, together with supporting SQL queries, if your claim is intended to mislead.