DUE: Sunday 4/12 11:59pm
HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select
compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.
WHAT TO SUBMIT: Nothing is required for this part.
To get ready for this assignment, open up a VM shell, and type the following command:
Next, type the following commands to create a working directory for this homework. Here we use
hw12 under your
shared/ directory, but feel free to change it to another location.
cp -pr /opt/datacourse/assignments/hw12/ ~/shared/hw12/ cd ~/shared/hw12/
You have collected a huge set \(S_2\) of IP addresses who visited your website. Your partner website has similarly collected \(S_1\). You want to see which of your visitors also visited your partner website. Your partner website doesn't want to ship you the entire \(S_1\), but has agreed to give you a Bloom filter constructed from \(S_1\), whose size is a just tiny fraction of \(S_1\). Your job is to implement a Bloom Filter that does the job.
First, type the following command to generate the datasets:
This command will generate three files:
probe.txt are \(S_1\) and \(S_2\), respectively (with IP address per line);
answerkeys.pk contains information needed to validate your implementation (including a list of
False indicating, for each IP in
probe.txt, whether it really appears in
You should modify the file
bloom.py to complete the implementation and tuning of the Bloom filter. In particular:
BloomFilterclass. Note that we have constructed (in
__init__()) a list of independent hash functions that you can use in these methods. However, feel free to explore other choices of hash functions.
To run the Bloom filter code and store its output in a file named
python bloom.py | tee result.txt
We also encourage you to measure the test running time for each \(k\)---while \(k\) does not affect the storage requirement, it does affect the running time significantly. However, beware that Python is not designed for execution efficiency, and you might find your code very slow compared with, for example, Python's built-in
set for intersection (if space is not a concern and you have full access to both sets). The reason is that much of the Python built-in library uses an underlying C/C++ implementation. In practice, you might want to implement your Bloom filter in C/C++ as well as and just call it from Python; that should make running time of Bloom filter more competitive.
WHAT TO SUBMIT: Your