
Data-intensive Computing Systems
Programming Assignment 1

The first Programming Assignment has three parts: A, B, and C. The deadline for this assignment is Sept 16, 5:00 PM.

Part A

Chapter 2 in the handout on Hadoop gives a MapReduce program that computes the maximum temperature per year for the NCDC dataset. Write a program to compute the average temperature per year for the same dataset. Is there a simple way to add a combiner to your program?
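One way to think about the combiner question is sketched below in plain Java, without any Hadoop dependency. It assumes the combiner emits a partial (sum, count) pair rather than a finished average, because an average of averages is generally wrong when the groups differ in size. The class and method names (AverageCombinerSketch, SumCount, combine, reduce) and the sample readings are hypothetical and are not taken from the handout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of averaging through (sum, count) partial aggregates. */
final class AverageCombinerSketch {

    /** Partial aggregate: what a combiner could emit instead of a finished average. */
    static final class SumCount {
        final long sum;
        final long count;

        SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        /** Merging is associative and commutative, so it can be applied zero or more times. */
        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        double average() {
            return (double) sum / count;
        }
    }

    /** Combiner step: collapse one map task's readings for a year into one partial. */
    static SumCount combine(List<Integer> temperatures) {
        long sum = 0;
        for (int t : temperatures) {
            sum += t;
        }
        return new SumCount(sum, temperatures.size());
    }

    /** Reducer step: merge all partials for a year, then divide exactly once. */
    static double reduce(List<SumCount> partials) {
        SumCount total = new SumCount(0, 0);
        for (SumCount p : partials) {
            total = total.merge(p);
        }
        return total.average();
    }

    public static void main(String[] args) {
        // Hypothetical readings for one year, split across two map tasks.
        List<SumCount> partials = new ArrayList<>();
        partials.add(combine(Arrays.asList(10, 20, 30)));
        partials.add(combine(Arrays.asList(50)));
        System.out.println(reduce(partials)); // prints 27.5
    }
}
```

In this example, averaging the two per-task averages (20 and 50) would give 35, while the true average of the four readings is 27.5. Carrying the count alongside the sum is what keeps the result correct no matter how many times the combiner runs.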

You can use the NCDC data at /cps216/common/NCDC to test your program.

Part B

In this part, you will write a MapReduce program to do word count (i.e., count the number of occurrences of each word) for the datasets stored at /cps216/common/TPC-DS on the Duke Hadoop cluster. Note that the words in these datasets are separated by the delimiter string "|" (pipe character). For example, the first line of /cps216/common/TPC-DS/customer.dat is:

1|AAAAAAAABAAAAAAA|980124|7135|32946|2452238|2452208|Mr.|Javier|Lewis|Y|9|12|1936|CHILE||Javier.Lewis@VFAxlnZEvOx.org|2452508|

The first word on this line is "1". The second word is "AAAAAAAABAAAAAAA". The third word is "980124", and so on.
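The tokenizing step of a map function for this format might look like the plain-Java sketch below. The pipe character is a regular-expression metacharacter in Java, so the sketch quotes it before splitting. Skipping empty fields (such as the one between "CHILE" and the e-mail address) is an assumption, since the assignment does not say whether an empty field counts as a word. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Plain-Java sketch of splitting pipe-delimited lines into words and counting them. */
final class PipeWordCountSketch {

    // "|" means alternation in a regex, so it is quoted to match the literal character.
    private static final Pattern PIPE = Pattern.compile(Pattern.quote("|"));

    /** Map-side logic: split one line on "|" and add 1 for each non-empty word. */
    static void countLine(String line, Map<String, Integer> counts) {
        for (String word : PIPE.split(line)) {
            if (word.isEmpty()) {
                continue; // assumption: empty fields ("||") are not counted as words
            }
            counts.merge(word, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        // The first line of customer.dat, as quoted in the assignment text.
        countLine("1|AAAAAAAABAAAAAAA|980124|7135|32946|2452238|2452208|Mr.|Javier|Lewis|Y|9|12|1936"
                + "|CHILE||Javier.Lewis@VFAxlnZEvOx.org|2452508|", counts);
        System.out.println(counts);
    }
}
```

In an actual Hadoop job the map function would emit a (word, 1) pair for each token instead of updating a local map, and the per-word summation would happen in the reducer (and optionally in a combiner).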

Part C

This part involves some fun MapReduce processing using the White House Visitor Log. You can find the dataset at:
http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0827.csv

First download this dataset and copy it to HDFS (e.g., in the /usr/research/home/USERNAME directory, where USERNAME is replaced with your user name). Use the copyFromLocal command described at:
http://hadoop.apache.org/common/docs/r0.20.2/hdfs_shell.html

The attributes in this dataset are described at:
http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Key-1209.txt
Also, you can see a spreadsheet of the data at:
http://www.whitehouse.gov/briefing-room/disclosures/visitor-records

You are required to write efficient MapReduce programs to find the following information:

(i) The 10 most frequent visitors (NAMELAST, NAMEFIRST, NAMEMID) to the White House.

(ii) The 10 most frequently visited people (visitee_namelast, visitee_namefirst) in the White House.

(iii) The 10 most frequent visitor-visitee combinations.

(iv) Some other interesting statistic that you can think of.
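Queries (i) to (iii) differ mainly in which columns form the grouping key. The plain-Java sketch below shows one way to build such a key from a record by looking up column names in the header row. It makes several assumptions that should be checked against the real file: fields are split on plain commas with no quoted commas inside them, names are normalized to upper case, and "|" is used as the separator inside the key. The class name, method names, and sample record are hypothetical.

```java
import java.util.Locale;

/** Plain-Java sketch of building a composite grouping key from named CSV columns. */
final class VisitorKeySketch {

    /** Find a column by name, ignoring case and surrounding whitespace. */
    static int indexOf(String[] header, String column) {
        for (int i = 0; i < header.length; i++) {
            if (header[i].trim().equalsIgnoreCase(column)) {
                return i;
            }
        }
        throw new IllegalArgumentException("missing column: " + column);
    }

    /** Join the chosen columns of one record into a single key string. */
    static String key(String[] header, String[] record, String... columns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) {
                sb.append('|');
            }
            int idx = indexOf(header, columns[i]);
            // Rows shorter than the header are treated as having empty trailing fields.
            String value = idx < record.length ? record[idx] : "";
            sb.append(value.trim().toUpperCase(Locale.ROOT));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical header subset and record; the real file has more columns.
        String[] header = "NAMELAST,NAMEFIRST,NAMEMID,visitee_namelast,visitee_namefirst".split(",", -1);
        String[] record = "Doe,Jane,Q,Smith,Pat".split(",", -1);
        System.out.println(key(header, record, "NAMELAST", "NAMEFIRST", "NAMEMID")); // prints DOE|JANE|Q
        System.out.println(key(header, record, "visitee_namelast", "visitee_namefirst")); // prints SMITH|PAT
    }
}
```

For query (iii), the same helper could be called with all five column names, so that one counting job structure serves all three queries.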

Throughout this programming assignment, do not limit your programs to run with a single reduce task only.
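For the top-10 queries in Part C, one common pattern that works with several reduce tasks is to have each reducer keep only its own 10 largest counts and then merge those small candidate lists in a final step. The plain-Java sketch below illustrates the idea with a size-bounded min-heap. It assumes an earlier counting job has already placed the full count for each key in exactly one reducer's output, which is what makes the per-reducer pruning safe. The class name, helper names, and sample counts are hypothetical, and this is only one of several possible designs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Plain-Java sketch of top-k selection that can be applied per reducer and again when merging. */
final class TopKSketch {

    // Orders entries by ascending count, so the heap's head is the smallest entry kept so far.
    private static final Comparator<Map.Entry<String, Long>> BY_COUNT =
            (a, b) -> Long.compare(a.getValue(), b.getValue());

    static Map.Entry<String, Long> entry(String key, long count) {
        return new AbstractMap.SimpleEntry<>(key, Long.valueOf(count));
    }

    /** Return the k entries with the largest counts, in descending order of count. */
    static List<Map.Entry<String, Long>> topK(Iterable<Map.Entry<String, Long>> entries, int k) {
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(BY_COUNT);
        for (Map.Entry<String, Long> e : entries) {
            heap.add(e);
            if (heap.size() > k) {
                heap.poll(); // evict the current smallest so at most k entries remain
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(BY_COUNT.reversed());
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical outputs of two reducers; each key appears in exactly one list.
        List<Map.Entry<String, Long>> reducer0 = Arrays.asList(entry("A", 5), entry("B", 2), entry("E", 4));
        List<Map.Entry<String, Long>> reducer1 = Arrays.asList(entry("C", 9), entry("D", 1));

        // Each reducer prunes locally; the final step sees at most k entries per reducer.
        List<Map.Entry<String, Long>> candidates = new ArrayList<>(topK(reducer0, 2));
        candidates.addAll(topK(reducer1, 2));
        System.out.println(topK(candidates, 2));
    }
}
```

Entries with equal counts are kept or evicted in no particular order in this sketch, so a real solution may want an explicit tie-breaking rule.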