Homework #12: Bloom Filter

DUE: Sunday 4/12 11:59pm

HOW TO SUBMIT: Submit the required files for all problems (see WHAT TO SUBMIT under each problem below) through WebSubmit. On the WebSubmit interface, make sure you select compsci216 and the appropriate homework number. You can submit multiple times, but please resubmit files for all problems each time.

0. Getting Started

WHAT TO SUBMIT: Nothing is required for this part.

To get ready for this assignment, open up a VM shell, and type the following command:


Next, type the following commands to create a working directory for this homework. Here we use hw12 under your shared/ directory, but feel free to change it to another location.

cp -pr /opt/datacourse/assignments/hw12/ ~/shared/hw12/
cd ~/shared/hw12/

1. Intersecting Sets of IP Addresses

You have collected a huge set \(S_2\) of IP addresses who visited your website. Your partner website has similarly collected \(S_1\). You want to see which of your visitors also visited your partner website. Your partner website doesn't want to ship you the entire \(S_1\), but has agreed to give you a Bloom filter constructed from \(S_1\), whose size is a just tiny fraction of \(S_1\). Your job is to implement a Bloom Filter that does the job.

First, type the following command to generate the datasets:

python prepare.py

This command will generate three files: build.txt and probe.txt are \(S_1\) and \(S_2\), respectively (with IP address per line); answerkeys.pk contains information needed to validate your implementation (including a list of True or False indicating, for each IP in probe.txt, whether it really appears in build.txt).

You should modify the file bloom.py to complete the implementation and tuning of the Bloom filter. In particular:

  • Implement the methods add() and check() for the BloomFilter class. Note that we have constructed (in __init__()) a list of independent hash functions that you can use in these methods. However, feel free to explore other choices of hash functions.
  • Modify the end of the file to explore different settings of the Bloom filter parameter \(k\) and pick the one that's the best. Do not change the size \(n\) of the Bloom filter.

To run the Bloom filter code and store its output in a file named result.txt, type:

python bloom.py | tee result.txt

We also encourage you to measure the test running time for each \(k\)---while \(k\) does not affect the storage requirement, it does affect the running time significantly. However, beware that Python is not designed for execution efficiency, and you might find your code very slow compared with, for example, Python's built-in set for intersection (if space is not a concern and you have full access to both sets). The reason is that much of the Python built-in library uses an underlying C/C++ implementation. In practice, you might want to implement your Bloom filter in C/C++ as well as and just call it from Python; that should make running time of Bloom filter more competitive.

WHAT TO SUBMIT: Your bloom.py and result.txt files.