Homework #1: Self-Introduction and OpenRefine Tutorial

DUE: Monday 1/13 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci290 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

1. Self-Introduction

WHAT TO SUBMIT: Submit a plain-text file named intro.txt.

Write a paragraph about yourself that tells us a bit about your background and goals for this course. Specifically, please state the following:

As explained on the course website, we are not expecting you to be an expert in all of the disciplines related to data, and you are not going to be "graded" on your background. Feel free to answer "no" to some of the questions above. We are doing this survey to help us tailor the course to your background and interests, and to help you form project teams and ideas.

2. OpenRefine Tutorial

WHAT TO SUBMIT: You don't need to submit anything for this part of the homework.

OpenRefine is a powerful data wrangling tool. Data is often messy: people make mistakes entering and collecting it, or the data you get is in the wrong format for what you want to do with it. OpenRefine was built to deal with these kinds of problems and to bring data into the shape you need.

This tutorial originally comes from http://unurl.org/cbclean. It's been cleaned up and modified a bit.

For this exercise we'll use:

Once you’ve downloaded what we’ll need, let’s start.

Let’s first look at the dataset on the website; fortunately, the website gives us a nice preview of the data. The data records the tenders (proposals/bids) awarded in Bosnia and Herzegovina, scraped for the period 01.04.2013 - 31.10.2013. There are several things wrong with this dataset. The amounts always include currency symbols and are formatted for human readers, which makes them hard for computers to parse. The date fields contain additional information and are not clean dates. Company names may be inconsistent throughout the document, even for the same company.
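
To make the three problems concrete, here is a minimal Python sketch of the kind of cleanup each one calls for. It is not part of the tutorial and not how OpenRefine works internally. The sample formats are assumptions: amounts are assumed to look like "1.234,56 KM" (dot for thousands, comma for decimals), and dates are assumed to be embedded as dd.mm.yyyy, matching the period shown above. The function names are hypothetical.

```python
import re
from datetime import date, datetime
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Strip currency text and convert a '1.234,56 KM'-style string to a number."""
    digits = re.sub(r"[^0-9.,]", "", text)  # drop currency symbols and spaces
    # Assumed convention: '.' groups thousands, ',' marks the decimals.
    digits = digits.replace(".", "").replace(",", ".")
    return Decimal(digits)


def parse_date(text: str) -> date:
    """Pull the first dd.mm.yyyy date out of a longer string."""
    match = re.search(r"\b(\d{2}\.\d{2}\.\d{4})\b", text)
    if match is None:
        raise ValueError(f"no date found in {text!r}")
    return datetime.strptime(match.group(1), "%d.%m.%Y").date()


def normalize_company(name: str) -> str:
    """Build a crude key so spelling variants of one name compare equal."""
    key = name.lower()
    key = re.sub(r"[^\w\s]", "", key)  # remove punctuation
    return " ".join(key.split())       # collapse runs of whitespace
```

With these helpers, two differently typed spellings of a company (for example, different capitalization and spacing) map to the same key, which is roughly the idea behind clustering similar values before merging them.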


Cleaning Up the Data

3. Using OpenRefine to Clean up a Congressional Member Listing

NOTE: We encourage you to attempt this part, but it is optional. You don't need it to get a "V" (90%) on this homework, but you will need it to get an "E" (100%). While we will go through much of this exercise in the lab on Tuesday, doing this exercise by yourself will give you an advantage when tackling the "challenge" in the lab on Tuesday.

WHAT TO SUBMIT: Submit a plain-text file named congress.txt. Type (or copy and paste from OpenRefine) your answer to each question below into that file. Please clearly number and delineate the answers to different questions.

govtrack.us contains a wealth of data on the U.S. Congress. You can get a JSON feed of congressional members from https://www.govtrack.us/api/v2/person. Import the feed into OpenRefine, and use the data therein (not from any other sources!) to answer the following questions:

  1. For each possible party affiliation (R for Republicans, D for Democrats, and I for independents), list the number of members with that affiliation.
  2. How many members were born after 1950?
  3. Find the 10 longest-serving members.
  4. Check your answers above against sources like Wikipedia. Are your answers correct with respect to reality? What does that say about the govtrack.us data source you used?
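
If you want to double-check the counts that OpenRefine's facets report, one option is to recompute them outside the tool. The sketch below is only an illustration under stated assumptions: the records are made-up placeholders, and the field names "party" and "birthday" are hypothetical and may not match the structure of the real feed. It also reads "born after 1950" as a birth year strictly greater than 1950.

```python
from collections import Counter
from datetime import date

# Made-up placeholder records; the field names "party" and "birthday" are
# assumptions for illustration and may not match the real feed.
members = [
    {"name": "Member A", "party": "R", "birthday": "1948-03-02"},
    {"name": "Member B", "party": "D", "birthday": "1961-11-30"},
    {"name": "Member C", "party": "D", "birthday": "1950-12-31"},
    {"name": "Member D", "party": "I", "birthday": "1975-06-15"},
]


def party_counts(records):
    """Count how many records carry each party code."""
    return Counter(r["party"] for r in records)


def born_after(records, year):
    """Count records whose birth year is strictly greater than `year`."""
    return sum(
        1 for r in records
        if date.fromisoformat(r["birthday"]).year > year
    )


print(party_counts(members))
print(born_after(members, 1950))
```

Comparing an independent tally like this against the facet counts is a quick way to catch a mis-parsed column or a filter left switched on.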