Data-intensive Computing Systems
Programming Assignment 1

The first programming assignment has three parts: A, B, and C. The deadline for this assignment is Sept 16, 5:00 PM.

Part A

Chapter 2 in the handout on Hadoop gives a MapReduce program that computes the maximum temperature per year for the NCDC dataset. Write a program to compute the average temperature per year for the same dataset. Is there a simple way to add a combiner to your program?
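As a hint on the combiner question: an average of averages is generally not the overall average, so a combiner cannot simply average its inputs. One common approach (not necessarily the one intended by the handout) is to have the mapper emit partial (sum, count) pairs, which can be merged in any order. The sketch below simulates this in plain Python rather than Hadoop; the function names and the sample readings are made up for illustration.

```python
from collections import defaultdict

def map_record(year, temp):
    # Emit a partial aggregate (sum, count) instead of the raw temperature.
    return year, (temp, 1)

def combine(pairs):
    # Merge partial aggregates component-wise. This step is associative and
    # commutative, so it can serve as the combiner and inside the reducer.
    total, count = 0, 0
    for s, c in pairs:
        total += s
        count += c
    return total, count

def reduce_average(pairs):
    # Only the reducer performs the division.
    total, count = combine(pairs)
    return total / count

def run(splits):
    # splits: one list of (year, temp) records per simulated map task.
    grouped = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for year, temp in split:
            key, value = map_record(year, temp)
            local[key].append(value)
        for key, values in local.items():
            grouped[key].append(combine(values))  # combiner output per map task
    return {key: reduce_average(values) for key, values in grouped.items()}

# Hypothetical readings: two map tasks see the same year.
splits = [[(1950, 0), (1950, 20)], [(1950, 40)]]
result = run(splits)
```

With these sample readings the per-task averages are 10 and 40, whose mean is 25, while the (sum, count) approach yields the correct overall average of 20.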

You can use the NCDC data at /cps216/common/NCDC to test your program.

Part B

In this part, you will write a MapReduce program to do word count (i.e., count the number of occurrences of each word) for the datasets stored at /cps216/common/TPC-DS on the Duke Hadoop cluster. Note that the words in these datasets are separated by the delimiter string "|" (the pipe character), not by whitespace. For example, the first line of /cps216/common/TPC-DS/customer.dat is:


The first word on this line is "1". The second word is "AAAAAAAABAAAAAAA". The third word is "980124", and so on.
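The tokenizing step can be sketched as follows, in plain Python rather than Hadoop. The sample line is only a prefix built from the three words quoted above, not the actual record. Skipping empty fields is an assumption; the assignment does not say whether an empty field counts as a word. If you write the mapper in Java, keep in mind that String.split takes a regular expression, so the pipe has to be escaped (for example "\\|").

```python
from collections import Counter

def tokenize(line):
    # Split on the literal pipe character.
    # Assumption: empty fields (e.g. "||" or a trailing "|") are not words.
    return [word for word in line.rstrip("\n").split("|") if word]

def word_count(lines):
    # Sequential stand-in for the map (tokenize) and reduce (sum) phases.
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts

# Illustrative prefix only; the real line has more fields.
sample = "1|AAAAAAAABAAAAAAA|980124|"
words = tokenize(sample)
```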

Part C

This part involves some fun MapReduce processing using the White House Visitor Log. You can find the dataset at:

First, download this dataset and copy it to HDFS (e.g., to the /usr/research/home/USERNAME directory, where USERNAME is replaced with your user name). Use the copyFromLocal command described at:

The attributes in this dataset are described at:
Also, you can see a spreadsheet of the data at:

You are required to write efficient MapReduce programs to find the following information:

(i) The 10 most frequent visitors (NAMELAST, NAMEFIRST, NAMEMID) to the White House.

(ii) The 10 most frequently visited people (visitee_namelast, visitee_namefirst) in the White House.

(iii) The 10 most frequent visitor-visitee combinations.

(iv) Some other interesting statistic that you can think of.

Throughout this programming assignment, do not restrict your programs to a single reduce task; they should produce correct results when run with multiple reduce tasks.
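One possible way to compute a top-10 list with several reduce tasks is sketched below in plain Python; it is an illustration under stated assumptions, not a required design. Because every key is routed to exactly one reducer, each reducer holds complete counts for its keys, so the global top 10 must lie among the reducers' local top-10 lists. The partition function, the parameter names, and the visitor names are hypothetical.

```python
import heapq
import zlib
from collections import Counter

def partition(key, num_reducers):
    # Deterministic stand-in for a hash partitioner.
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers

def top_k(records, k=10, num_reducers=4):
    # Phase 1: each simulated reducer counts only the keys routed to it.
    reducers = [Counter() for _ in range(num_reducers)]
    for key in records:
        reducers[partition(key, num_reducers)][key] += 1
    # Each reducer emits only its local top k.
    candidates = []
    for counts in reducers:
        candidates.extend(heapq.nlargest(k, counts.items(), key=lambda kv: kv[1]))
    # Phase 2: merge at most k * num_reducers candidates into the global top k.
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])

# Hypothetical (NAMELAST, NAMEFIRST, NAMEMID) keys.
records = ([("DOE", "JOHN", "")] * 5
           + [("ROE", "JANE", "Q")] * 3
           + [("POE", "ED", "A")] * 1)
top_two = top_k(records, k=2, num_reducers=3)
```

The final merge handles only a small candidate list, so it is cheap even if it runs in a single task, while the expensive counting stays spread across all reducers.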