Homework #3: Record Linkage and Fuzzy Matching

DUE: Monday 2/2 11:59pm

HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci216 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing members of your team who are present at the lab.

0. Getting Started

To get ready for this assignment, get a VM shell, and type the following command:

/opt/datacourse/sync.sh

Next, type the following commands to create a working directory for this homework. Here we use lab03 under your shared directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/lab03/ ~/shared/lab03/
cd ~/shared/lab03/

1. More Restaurant Matching

In the homework you worked with the restaurants(id,name,addr,city,type) table and found duplicate entries. Well, we only gave you a little over half the records to work with. In the lab you will now work with the full restaurants database. Go into the restaurants-full directory within the lab03 folder and create the restaurants-full database:

./setup.sh

Now you can use the full restaurants database:

psql restaurants-full

We have provided three sample solutions -- match1.sql, match2.sql, match3.sql -- for Homework #3 (these are picked from your submissions!).

A) Compute the f1 score for these solutions on smaller restaurants database from the homework. You can use the command ./test-small.sh match.sql (substitute match.sql with the solution want to test). Which one performs the best?

B) Now compute the f1 score for these solutions on the full restaurants-full database. You can use the command ./test-full.sh match.sql. Which one now performs the best?

C) Can you explain what is going on?

2. Challenge: Matching Products

Now that you are experts at record linkage, we present to you the following challenge. We have sets of product listings:

Go into the lab03/products sub-folder and create the products database:

./setup.sh

Write a query (in match.sql) that achieves the best f1 score.

WHAT TO SUBMIT The match.sql file with your improved matching procedure, as well as the output of running ./test.sh match.sql in a separate text file named output.txt.