DUE: Monday 2/2 11:59pm
HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select compsci216 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file team.txt listing members of your team who are present at the lab.
To get ready for this assignment, get a VM shell, and type the following command:
/opt/datacourse/sync.sh
Next, type the following commands to create a working directory for this lab. Here we use lab03 under your shared directory, but feel free to change it to another location.
cp -pr /opt/datacourse/assignments/lab03/ ~/shared/lab03/
cd ~/shared/lab03/
In the homework you worked with the restaurants(id,name,addr,city,type) table and found duplicate entries. However, we gave you only a little over half of the records to work with. In this lab you will work with the full restaurants database. Go into the restaurants-full directory within the lab03 folder and create the restaurants-full database:
./setup.sh
Now you can use the full restaurants database:
psql restaurants-full
We have provided three sample solutions for Homework #3: match1.sql, match2.sql, and match3.sql (these are picked from your submissions!).
A) Compute the f1 score for these solutions on the smaller restaurants database from the homework. You can use the command ./test-small.sh match.sql (substitute match.sql with the solution you want to test). Which one performs the best?
B) Now compute the f1 score for these solutions on the full restaurants-full database. You can use the command ./test-full.sh match.sql (again substituting the solution you want to test). Which one now performs the best?
C) Can you explain what is going on?
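As a reminder of what the test scripts report, here is a minimal Python sketch of an f1 computation. It assumes the score is the standard harmonic mean of precision and recall over sets of matched id pairs; the actual test scripts may compute it differently, and the function name and the sample pairs below are hypothetical.

```python
def f1_score(predicted, truth):
    """Return the f1 score of predicted match pairs against true match pairs.

    Assumption: a match is an (id1, id2) tuple and the score is the
    standard harmonic mean of precision and recall.
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0  # also covers empty inputs, avoiding division by zero
    precision = true_positives / len(predicted)  # share of reported pairs that are correct
    recall = true_positives / len(truth)         # share of true pairs that were reported
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: 3 reported pairs, 4 true pairs, 2 in common.
predicted = [(1, 2), (3, 4), (5, 6)]
truth = [(1, 2), (3, 4), (7, 8), (9, 10)]
score = f1_score(predicted, truth)
print(round(score, 4))  # precision 2/3, recall 1/2 -> 0.5714
```

Because f1 combines precision and recall, a solution can lose points either by reporting pairs that are not true duplicates or by missing pairs that are.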
Now that you are experts at record linkage, we present to you the following challenge. We have two sets of product listings:
amazon(id,title,description,manufacturer,price)
listing 1363 products, and
google(id,title,description,manufacturer,price)
listing 3226 products.
Go into the lab03/products sub-folder and create the products database:
./setup.sh
Write a query (in match.sql) that achieves the best f1 score.
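One possible starting point is to compare product titles by the words they share. The sketch below illustrates that idea in Python rather than SQL, using Jaccard similarity over lowercased title tokens; it is an illustration of the technique under stated assumptions, not the expected solution. The function names, the 0.5 threshold, and the sample rows are all hypothetical, and your match.sql would need to express a comparable rule in SQL.

```python
import re


def tokens(title):
    """Split a title into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Jaccard similarity of two titles: shared tokens / all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_products(amazon, google, threshold=0.5):
    """Return (amazon_id, google_id) pairs whose titles are similar enough.

    Assumption: each input is a list of (id, title) tuples. The threshold
    is a hypothetical value that would need tuning against the f1 score.
    """
    return [
        (aid, gid)
        for aid, atitle in amazon
        for gid, gtitle in google
        if jaccard(atitle, gtitle) >= threshold
    ]


# Hypothetical sample rows, not taken from the lab database.
amazon = [(1, "Adobe Photoshop CS3"), (2, "Microsoft Office 2007 Home")]
google = [(10, "adobe photoshop cs3 upgrade"), (11, "Intuit QuickBooks Pro")]
matches = match_products(amazon, google)
print(matches)  # [(1, 10)]
```

Raising the threshold tends to improve precision at the cost of recall, and lowering it does the opposite, so the best f1 score usually comes from trying several values. Other columns, such as manufacturer and price, may also help to separate near-identical titles.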
WHAT TO SUBMIT: The match.sql file with your improved matching procedure, as well as the output of running ./test.sh match.sql in a separate text file named output.txt.