Duke DBGroup Logo

Data-intensive Computing Systems
Programming Assignment 3

Course information
Course schedule and notes
Extra Materials
The deadline for this assignment is Monday, Oct 3, 5.00 PM.

The objective of this programming assignment is to better understand how Hadoop works. For this assignment, you should know how to start a Hadoop cluster on Amazon EC2. You can use the software that we have provided for this purpose --- see the Extra Materials section on the class page.

You will copy a dataset into HDFS (the Hadoop Distributed FileSystem) in the Hadoop cluster you started on Amazon EC2. We will be providing you a data generator that can generate data like the /cps216/common/TPC-DS/customer.dat and /cps216/common/TPC-DS/web_sales.dat datasets you used in programming assignment 1 (data generation instructions). This data generator can be run on the Hadoop cluster you started on Amazon EC2. The generator will also copy data in parallel into HDFS on the cluster. You are welcome to use other datasets you like as long as they satisfy three properties: (a) they go to 10GB or more in size, (b) you can meaningfully run WordCount, and (c) you don't have to copy the dataset back and forth between the Hadoop cluster on EC2 and your local machine at Duke every time (Amazon will charge for network traffic between EC2 machines and your local machine at Duke or elsewhere outside EC2; but note that Amazon will not charge for traffic between EC2 machines and Amazon S3 storage).

You need two MapReduce programs for this assignment. Program 1 is WordCount. You can use the WordCount program that you wrote for assignment 1. Just ensure that the WordCount program will work for the dataset that you will be working with. For example, datasets like /cps216/common/TPC-DS/customer.dat that you worked with in assignment 1 have the "|" symbol as the separator between words. For example, the first record in /cps216/common/TPC-DS/customer.dat is:


Program 2 is of your choice, but it should come from the workload that you identified as part of Project 1.

The goal of this assignment is as follows: for each of the two programs, and an input dataset of size at least 10GB (GigaBytes), run the program with different number of reducers and three Hadoop cluster sizes: 2 slave nodes, 4 slave nodes, and 8 slave nodes. For each program, you should plot the running times observed for each Hadoop cluster size in a graph similar to the graph on the right side of the 25th slide (titled: Hadoop 75GB TeraSort) in the "How MapReduce Works (in Hadoop)" lecture slides (see: lecture slides). Note that the horizontal axis of this graph shows the number of reducers. The vertical axis shows the running time of the program. One way to get the running time is to use the JobTracker interface that is started by default on the 50030 port on the master node of the Hadoop cluster. Keep in mind that, on reboot, the JobTracker will remove information about the jobs it ran so far.

A natural question you may have is what different numbers of reducers you should consider in each case. We leave that for you to find out since the goal of this programming assignment is to better understand how Hadoop works. What we want to see is the trend as the number of reducers is varied in each case. For example, notice the trend in the right-side graph on the 25th slide in "How MapReduce Works (in Hadoop)". The running time starts off as very high for a small number of reducers, then it drops, then it goes up, and then it keeps coming down. You may not need to go up to 300 reducers or run the programs for each and every single value of the number of reducers. Choose wisely to understand the trend. Each "*" symbol in the right-side graph on the 25th slide shows one run of the program. Only 11 runs were done to understand the trend in this case.

Apart from your code, you should submit six graphs: one for each combination of program (WordCount, Your program from Project 1) and Hadoop cluster size (2/4/8 slave nodes).

For your convenience, we will be providing you the software for an "experiment harness". See here. This harness simplifies the task of running a MapReduce program on Hadoop repeatedly for different settings. Usage instructions will be posted and the TA will demonstrate its use in the TA sessions (basic harness instructions, running experiments instructions).