DUE: Monday 2/2 11:59pm
HOW/WHAT TO SUBMIT: All files should be submitted through WebSubmit. Only one of your team members needs to submit on behalf of the team. On the WebSubmit interface, make sure you select
compsci216 and the appropriate lab number. You can submit multiple times, but please have the same team member resubmit all required files each time. To earn class participation credit, submit a text file
team.txt listing members of your team who are present at the lab.
To get ready for this assignment, get a VM shell, and type the following command:
Next, type the following commands to create a working directory for this homework. Here we use
lab03 under your
shared directory, but feel free to change it to another location.
cp -pr /opt/datacourse/assignments/lab03/ ~/shared/lab03/ cd ~/shared/lab03/
In the homework you worked with the
restaurants(id,name,addr,city,type) table and found duplicate entries. Well, we only gave you a little over half the records to work with. In the lab you will now work with the full restaurants database. Go into the
restaurants-full directory within the
lab03 folder and create the
Now you can use the full restaurants database:
We have provided three sample solutions --
match3.sql -- for Homework #3 (these are picked from your submissions!).
A) Compute the f1 score for these solutions on smaller
restaurants database from the homework. You can use the command
./test-small.sh match.sql (substitute
match.sql with the solution want to test). Which one performs the best?
B) Now compute the f1 score for these solutions on the full
restaurants-full database. You can use the command
./test-full.sh match.sql. Which one now performs the best?
C) Can you explain what is going on?
Now that you are experts at record linkage, we present to you the following challenge. We have sets of product listings:
amazon(id,title,description,manufacturer,price) listing 1363 products, and
google(id,title,description,manufacturer,price) listing 3226 products.
Go into the
lab03/products sub-folder and create the
Write a query (in
match.sql) that achieves the best f1 score.
WHAT TO SUBMIT The
match.sql file with your improved matching procedure, as well as the output of running
./test.sh match.sql in a separate text file named