DUE: Sunday 4/12 11:59pm
HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216
and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.
WHAT TO SUBMIT: Nothing is required for this part.
To get ready for this assignment, open up a VM shell, and type the following command:
/opt/datacourse/sync.sh
Next, type the following commands to create a working directory for this homework. Here we use hw12
under your shared/
directory, but feel free to change it to another location.
cp -pr /opt/datacourse/assignments/hw12/ ~/shared/hw12/
cd ~/shared/hw12/
You have collected a huge set \(S_2\) of IP addresses who visited your website. Your partner website has similarly collected \(S_1\). You want to see which of your visitors also visited your partner website. Your partner website doesn't want to ship you the entire \(S_1\), but has agreed to give you a Bloom filter constructed from \(S_1\), whose size is a just tiny fraction of \(S_1\). Your job is to implement a Bloom Filter that does the job.
First, type the following command to generate the datasets:
python prepare.py
This command will generate three files: build.txt
and probe.txt
are \(S_1\) and \(S_2\), respectively (with IP address per line); answerkeys.pk
contains information needed to validate your implementation (including a list of True
or False
indicating, for each IP in probe.txt
, whether it really appears in build.txt
).
You should modify the file bloom.py
to complete the implementation and tuning of the Bloom filter. In particular:
add()
and check()
for the BloomFilter
class. Note that we have constructed (in __init__()
) a list of independent hash functions that you can use in these methods. However, feel free to explore other choices of hash functions.To run the Bloom filter code and store its output in a file named result.txt
, type:
python bloom.py | tee result.txt
We also encourage you to measure the test running time for each \(k\)---while \(k\) does not affect the storage requirement, it does affect the running time significantly. However, beware that Python is not designed for execution efficiency, and you might find your code very slow compared with, for example, Python's built-in set
for intersection (if space is not a concern and you have full access to both sets). The reason is that much of the Python built-in library uses an underlying C/C++ implementation. In practice, you might want to implement your Bloom filter in C/C++ as well as and just call it from Python; that should make running time of Bloom filter more competitive.
WHAT TO SUBMIT: Your bloom.py
and result.txt
files.