Homework #1: Self-Introduction and OpenRefine Tutorial¶

DUE: Monday 1/13 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci290 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

1. Self-Introduction¶

WHAT TO SUBMIT: Submit a plain-text file named intro.txt.

Write a paragraph about yourself that tells us a bit about your background and goals for this course. Specifically, please state the following:

How comfortable are you with programming? (You can list some courses you have taken or programming languages you know.)
How comfortable are you with stats and math? (You can list some recent courses you have taken.)
Do you have a knack for creating visualizations and websites?
Do you have experience, expertise, or passion in any application domain (e.g., health, public policy, journalism, sports, etc.) that can benefit from data?
Do you have particular datasets in mind you would like to work on for the course project?
Is there anything specific that you would like to learn from this course?

As explained on the course website, we are not expecting you to be an expert in all of the disciplines related to data, and you are not going to be "graded" by your background. Feel free to answer "no" to some of the questions above. We are doing this survey to help us tailor the course towards your background and interest, and to help you form project teams and ideas.

2. OpenRefine Tutorial¶

WHAT TO SUBMIT: You don't need to submit anything for this part of the homework.

OpenRefine is a powerful data wrangling tool. Data often comes along messy. Either people make mistakes entering and collecting data, or the data you got is in the wrong format for what you want to do with it. OpenRefine was built to deal with these kinds of problems and to bring data into the shape you need.

This tutorial originally comes from http://unurl.org/cbclean. It's been cleaned up and modified a bit.

For this exercise we'll use:

OpenRefine: Download and install from http://openrefine.org
A dataset: Download the CSV from: http://data.okfn.org/community/mihi-tr/bosnia-tenders

Once you’ve downloaded what we’ll need: Let’s start.

Let’s first look at the dataset on the website---fortunately the website gives us a nice preview of the data. The data is a recording of the tenders (proposals/bids) awarded in Bosnia and Herzegovina, scraped for the period of 01.04.2013 - 31.10.2013. There are several things wrong with this dataset. The amounts always have the currency symbols with them and are formatted so humans can read them---computers struggle with it. The dates contain additional information and are not really the dates. Company names may be inconsistent throughout the document---even for the same company.

Importing

Start refine and click on “create project”
Now choose the CSV file you’ve downloaded from the link above and click “next”
You’ll see a preview screen - showing you how refine would interpret the data. We’ll need to change a few things:
- First select “CSV” as the format
- Click in the “character encoding” + field and select “UTF-8” for the encoding. This should make your preview look good.
Next name the project and click on Create Project.

You will see the data in Refine. Notice a few things: Data is shown by rows and columns as you know it from a spreadsheet---you can also edit the single cells---although you generally avoid this in refine. Further: You don’t see all the data, you only see a sub-section. This is because refine is made for large amounts of data and your browser would work too hard to display it. Once you familiarized yourself with the interface let’s go and clean up the data.

Cleaning Up the Data

The date always says the same things before the actual date starts, we can easily remove this. To do so select the column options for the “date” column (the blue triangle) and “Edit Cells -> Transform...”
A new window appears asking you for an “expression” how to transform the cells. An expression is very similar to a formula in spreadsheets. We’ll need to remove stuff, so we’ll use the “replace” expression: when you change your expression you’ll see the preview in the bottom change with it.
One nice things about expression is: We can chain things to do: notice the “,” at the end, let’s remove that as well.
Now click on “OK” to apply this to all the cells. However, there is still something strange about the data: There is a lot of white spaces in the field (you can see this by clicking to edit one of the cells).
Luckily this is so common, that refine has build in functions to remove them: Select “Edit cells -> Common Transforms -> Trim leading and trailing white spaces” from the column options of the “date” column.
Great---now let’s deal with our next problem-child: The amount of money paid. You can see that the numbers are formatted nicely readable (in a layout that’s typical for non-english speaking countries, where a comma is used as decimal separator and a dot is used to tell blocks apart). Let’s reformat this.
The first we want to do is get the currency out into a different column (there could be cases where the currency is different). We’ll do this by adding a column: Select “edit column -> add column based on this column” from the column options of the “amount” column. The menu you get is very similar to the one you got on the transformation - refine wants to know how we want to “transform” this column. We’ll use a match expression (don’t worry about the part that looks like a comic character swearing. It tells the computer to search for letters and extract them).
- These patterns are called regular expressions. They are very handy in parsing and extracting data. If you haven't used them before, you should check out this Coursera lecture: https://class.coursera.org/nlp/lecture/125. It's only 12 minutes.
If you click OK you’ll have a new column created.
Now let’s clean up the amount column, we’ll need to remove the dots, all letters and spaces - we can do this using a replace expression as we’ve done above. This time it looks a little different….
Last thing: we want to change the comma to a dot - for the computer to understand it - so we’ll add another replace in the back.
Now everything left is to convert it to a number. This is done by adding .toNumber()
Click “OK” and you’ll have the amount column cleaned up.
Perfect - the last thing we want to do is to make sure the issuers of tenders are always referred to with the same name. Let’s look at some of the names. To look at all the names mentioned - we’ll do a “facet”: Facets are refines way of filtering data - but they are much more powerful. Let’s do a facet on the “tenderer” column - click on the column options then select “Facet -> text facet” You will get a text facet of the column - the text facet allows you to see all options and filter by them. However, this is not what we’ll do: We’ll use the clustering button to find tenderers that should be the same but written slightly different.
Clicking on the Cluster Button will open a new menu. Select “Key Collison” as a method and “cologne-phonetic” as a keying function This will show some tenderers that have some problems.
If you check the “merge?” item you can merge them together.
Click on “Merge Selected and re-cluster” to merge them - switch between the keying functions to merge entries that belong together.
One thing you might have noticed looking through the tenderer names was: some of them are all capitals, some of them are the first letters caps etc. We can easily change this for all of them. If we want them to be titlecase, we can simply do so by using one of the “common transforms…”
Select “edit cells -> Common Transforms -> to titlecase” from the column options of the tenderer column:

3. Using OpenRefine to Clean up a Congressional Member Listing¶

NOTE: We encourage you to attempt this part but it is optional. You don't need it to get a "V" (90%) on this homework but you will need it to get an "E" (100%). While we will go through much of this excerise in the lab on Tuesday, doing this exercise by yourself will give you an advantage when tackling the "challenge" in the lab on Tuesday.

WHAT TO SUBMIT: Submit a plain-text file named congress.txt. Type (or copy and paste from OpenRefine) your answer to each question below into that file. Please clearly number and delineate the answers to different questions.

govtrack.us contains a wealth of data on the U.S. Congress. You can get a JSON feed of congressional members from https://www.govtrack.us/api/v2/person. Import the feed into OpenRefine, and use the data therein (not from any other sources!) to answer the following questions:

For each possible party affiliation (R for Republicans, D for Democrats, and I for independents), list the number of members with that affiliation.
How many members are born after 1950?
Find the 10 longest serving members.
Check your answers above against sources like Wikipedia. Are your answers correct with respect to reality? What does that say about the govtrack.us data source you used?