
Data-intensive Computing Systems
Programming Assignment

This programming assignment has three parts: A, B, and C.

Part A

In this part, you will write a MapReduce program that performs a word count (i.e., counts the number of occurrences of each word) over the datasets obtained by uncompressing one or more of the compressed files found in the Freebase Data Dump. Add a combiner to your program.
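To make the role of the combiner concrete, here is a minimal in-process sketch in Python. It is not Hadoop code: the function names (map_words, combine, reduce_counts, word_count) and the whitespace tokenization are assumptions made for illustration, and each inner list stands in for the input split of one map task.

```python
from collections import Counter, defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def combine(pairs):
    """Combiner: pre-aggregate one map task's output before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return local.items()


def reduce_counts(word, partial_counts):
    """Reduce step: sum all partial counts received for one word."""
    return word, sum(partial_counts)


def word_count(splits):
    """Run map -> combine -> shuffle -> reduce over a list of input splits."""
    shuffled = defaultdict(list)
    for split in splits:  # each split stands in for one map task
        mapped = (pair for line in split for pair in map_words(line))
        for word, partial in combine(mapped):
            shuffled[word].append(partial)  # one record per word per map task
    return dict(reduce_counts(w, c) for w, c in shuffled.items())


if __name__ == "__main__":
    print(word_count([["a b a", "c a"], ["b b"]]))
```

Because addition is associative and commutative, the combiner can reuse the same summing logic as the reducer. Its benefit is that each map task ships one record per distinct word instead of one record per occurrence.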

Part B

In this part, you will write MapReduce programs to process the two datasets provided in the Wikipedia page-to-page link database.
  1. Write a MapReduce program that will output all pages with no outlinks.
  2. Write a MapReduce program that will output all pages with no inlinks.
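
One possible way to approach both tasks is to tag every page with the roles it plays and let the reducer check which tags are missing. The sketch below simulates this in plain Python. It assumes, purely for illustration, that one dataset supplies the list of page IDs and the other supplies lines of the form "source: target1 target2 ..."; check the actual file formats before relying on this. All function names are hypothetical.

```python
from collections import defaultdict


def map_page(page_id):
    """Map over the page list: mark that the page exists."""
    yield page_id, "PAGE"


def map_links(line):
    """Map over one link line, assumed to look like 'source: t1 t2 ...'."""
    source, _, rest = line.partition(":")
    targets = rest.split()
    if targets:
        yield source.strip(), "OUT"  # the source has at least one outlink
    for target in targets:
        yield target, "IN"  # each target has at least one inlink


def reduce_page(page_id, tags):
    """Reduce step: report whether the page lacks outlinks and inlinks."""
    seen = set(tags)
    return "OUT" not in seen, "IN" not in seen


def pages_without_links(page_ids, link_lines):
    """Return (pages with no outlinks, pages with no inlinks), both sorted."""
    shuffled = defaultdict(list)
    for page_id in page_ids:
        for key, tag in map_page(page_id):
            shuffled[key].append(tag)
    for line in link_lines:
        for key, tag in map_links(line):
            shuffled[key].append(tag)
    no_out, no_in = [], []
    for page_id in sorted(shuffled):
        lacks_out, lacks_in = reduce_page(page_id, shuffled[page_id])
        if lacks_out:
            no_out.append(page_id)
        if lacks_in:
            no_in.append(page_id)
    return no_out, no_in


if __name__ == "__main__":
    print(pages_without_links(["1", "2", "3", "4"], ["1: 2 3", "2: 3"]))
```

Emitting a "PAGE" tag for every known page matters because a page that never appears in the link data would otherwise produce no key at all and be silently dropped from both outputs.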

Part C

This part involves some fun MapReduce processing using the White House Visitor Log. You can find the dataset at:
http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0827.csv

First, download this dataset and copy it to HDFS with the copyFromLocal command (hadoop fs -copyFromLocal).

The attributes in this dataset are described at:
http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Key-1209.txt
Also, you can see a spreadsheet of the data at:
http://www.whitehouse.gov/briefing-room/disclosures/visitor-records

You are required to write efficient MapReduce programs to find the following information:

(i) The 10 most frequent visitors (NAMELAST, NAMEFIRST, NAMEMID) to the White House.

(ii) The 10 most frequently visited people (visitee_namelast, visitee_namefirst) in the White House.

(iii) The 10 most frequent visitor-visitee combinations.
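
A top-10 query does not have to funnel every record through one reducer. The Python sketch below simulates one possible two-phase plan: several reducers each count their own partition of keys and keep only a local top k, and a small final step merges the surviving candidates. The column names come from the assignment text; the function names, the CRC32 partitioner, the tie-breaking rule, and the sample rows are assumptions made up for illustration, not part of the dataset or of Hadoop.

```python
import csv
import io
import zlib
from collections import Counter

VISITOR = ("NAMELAST", "NAMEFIRST", "NAMEMID")
VISITEE = ("visitee_namelast", "visitee_namefirst")


def map_row(row, fields):
    """Map step: emit (tuple of the chosen name fields, 1) for one CSV record."""
    yield tuple(row[f].strip() for f in fields), 1


def partition(key, num_reducers):
    """Deterministic partitioner: every copy of a key goes to the same reducer."""
    return zlib.crc32("|".join(key).encode("utf-8")) % num_reducers


def top_k(counts, k):
    """Return the k largest (key, count) pairs, breaking ties by key."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def most_frequent(rows, fields, k=10, num_reducers=4):
    # Phase 1: each reducer counts its own keys and keeps only a local top k.
    reducers = [Counter() for _ in range(num_reducers)]
    for row in rows:
        for key, one in map_row(row, fields):
            reducers[partition(key, num_reducers)][key] += one
    candidates = {}
    for counts in reducers:
        candidates.update(top_k(counts, k))
    # Phase 2: merge at most num_reducers * k candidates into the global top k.
    return top_k(candidates, k)


if __name__ == "__main__":
    # Made-up rows, only to show the call shape.
    sample = io.StringIO(
        "NAMELAST,NAMEFIRST,NAMEMID\n"
        "DOE,JANE,A\n"
        "ROE,RICHARD,\n"
        "DOE,JANE,A\n"
    )
    print(most_frequent(csv.DictReader(sample), VISITOR, k=2))
```

The plan is correct because each key is counted in exactly one partition, so every member of the global top k must also appear in some reducer's local top k. Passing VISITEE, or VISITOR + VISITEE, as the fields argument would cover items (ii) and (iii) in the same way.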

Throughout this programming assignment, do not restrict your programs to a single reduce task.