Homework #3: Record Linkage and Fuzzy Matching

DUE: Sunday 2/1 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

To get ready for this assignment, get a VM shell, and type the following command:

/opt/datacourse/sync.sh

Next, type the following commands to create a working directory for this homework. Here we use hw03 under your shared directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/hw03/ ~/shared/hw03/
cd ~/shared/hw03/

Next, create the restaurants database that you need for this homework:

./setup.sh

Note that the last two lines in this shell script installs two PostgreSQL "extensions" that allows you to compute Levenshtein distance (levenshtein(string1, string2)) and Jaccard similarity (similarity(string1, string2)) that we discussed in class.

You can now use the database of restaurants:

psql restaurants

The restaurants(id, name, addr, city, type) table in the restaurants database has been merged improperly! It contains data from two different restaurant listing services, and there are duplicates among the rows, e.g.:

6,aqua,252 california st.,san francisco,seafood
54,aqua,252 california st.,san francisco,american (new)

Your goal in this homework is to figure out which pairs of restaurant rows are duplicates of each other. You can write a SQL query to find all such pairs (identified by the id's); e.g., (6,54). Obviously, your query doesn't need to list (6,6) (as the same row always matches itself), and it only needs to list one of (6,54) and (54,6) (because the relationship is symmetric).

In your homework directory, we have provided a simple example matching procedure in match.sql that does not work very well. Understand what it does before moving on to the next part.

1. Quantifying the Error in Matching

First, we need a way to assess the quality of the matching procedure in match.sql. We have provided the true matching pairs of id's in table official_matches(id1, id2) (where id1<id2). We would like to compute the precision, recall, and f1-score of the output of match.sql against the true set of matches.

We have provided a shell script called test.sh to automate this process. Simply type the following command from your VM shell prompt (and not within psql!):

./test.sh match.sql

But the output looks all wrong at this moment! That's because test.sh uses views TP, FP, and FN in measures.sql to compute the true positives, false positives, and false negatives, respectively. But currently, these definitions are wrong---they simply return whatever is in the matches table, which holds the result of match.sql.

Edit measures.sql so it correctly computes TP, FP, and FN. To help you debug, if your definitions are correct, the output of running ./test.sh match.sql (the default one we provided) should contain the following lines:

true positives: 48
  (to show true positives: SELECT * FROM TP_rows;)
false positives: 1
  (to show false positives: SELECT * FROM FP_rows;)
false negatives: 13
  (to show false negatives: SELECT * FROM FN_rows;)

HINT: In order to compute the true positives, false positives, and false negatives you might want to familiarize yourself with the following set operations in SQL: UNION, INTERSECT, and EXCEPT. You can find them here.

WHAT TO SUBMIT: The completed measures.sql file.

2. Improving Your Matching Procedure

Now that you can quantify how well your matching performs, we challenge you to find the query that achieves the highest f1-score.

HINT: To get high F1-scores, you might consider using string operations that PostgreSQL supports. You can find them here.

WHAT TO SUBMIT The match.sql file with your improved matching procedure, as well as the output of running ./test.sh match.sql in a separate text file named output.txt.