Data-intensive Computing Systems
This programming assignment has three parts: A, B, and C.
Part A
In this part, you will write a MapReduce program to do word count (i.e., count the number of occurrences of each word) for the datasets obtained by uncompressing one or more of the compressed files found in the Freebase Data Dump. Add a combiner to your program.
Part B
In this part, you will write MapReduce programs to process the two datasets given at: Wikipedia page-to-page link database.
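For the word count in Part A, the role of the combiner may be easier to see in a small local simulation. The sketch below is not Hadoop code: the function names (`map_words`, `combine`, `reduce_counts`, `word_count`) and the two-split input are hypothetical, and tokenizing on whitespace is an assumption. It only illustrates that each mapper's output can be pre-aggregated before the shuffle without changing the final counts.

```python
from collections import Counter, defaultdict
from itertools import chain


def map_words(line):
    """Mapper: emit (word, 1) for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_counts(pairs):
    """Reducer: sum the partial counts received for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(splits):
    """Run map + combine once per input split, then reduce over all output."""
    combined = [
        combine(chain.from_iterable(map_words(line) for line in split))
        for split in splits
    ]
    return reduce_counts(chain.from_iterable(combined))


# Two hypothetical input splits, each a list of lines.
splits = [["a b a", "b a"], ["a c"]]
counts = word_count(splits)
```

The combiner is safe here because addition is associative and commutative, so summing partial sums gives the same totals as summing the individual ones. In this toy input the first split emits five (word, 1) pairs from the mapper but only two pairs after combining.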
Part C
This part involves some fun MapReduce processing using the White House Visitor Log. You can find the dataset at: http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0827.csv
First download this dataset and copy it to HDFS using the copyFromLocal command. The attributes in this dataset are described at:
You are required to write efficient MapReduce programs to find the following information:
(i) The 10 most frequent visitors (NAMELAST, NAMEFIRST, NAMEMID) to the White House.
(ii) The 10 most frequently visited people (visitee_namelast, visitee_namefirst) in the White House.
(iii) The 10 most frequent visitor-visitee combinations.
Throughout this programming assignment, do not restrict your programs to running with a single reduce task.
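One possible way to compute a top-10 list without forcing all counting through one reducer is sketched below as a local simulation, again not Hadoop code. It assumes a hash partitioner that sends each key to exactly one reducer, so every reducer holds complete counts for its own keys and needs to emit only its local top 10. A small final merge then picks the global top 10 from at most 10 candidates per reducer. The names `map_visitor`, `partition`, `reduce_top_n`, `top_visitors`, the value of `NUM_REDUCERS`, and the sample rows are all hypothetical. Only the column names come from the task description.

```python
import heapq
import zlib
from collections import Counter

NUM_REDUCERS = 3  # assumption: any value greater than 1 illustrates the idea
TOP_N = 10


def map_visitor(row):
    """Mapper for task (i): key each record on the visitor's full name."""
    return (row["NAMELAST"], row["NAMEFIRST"], row["NAMEMID"]), 1


def partition(key, num_reducers=NUM_REDUCERS):
    """Deterministic hash partitioner: a given key always maps to one reducer."""
    return zlib.crc32("|".join(key).encode("utf-8")) % num_reducers


def reduce_top_n(pairs, n=TOP_N):
    """Reducer: total the counts for its keys and keep only the local top n."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])


def top_visitors(rows, n=TOP_N):
    """Simulate map -> partition -> per-reducer top n -> final merge."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for row in rows:
        key, one = map_visitor(row)
        buckets[partition(key)].append((key, one))
    candidates = [kv for bucket in buckets for kv in reduce_top_n(bucket, n)]
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])


def make_row(last, first, mid):
    """Build a made-up record containing only the columns used here."""
    return {"NAMELAST": last, "NAMEFIRST": first, "NAMEMID": mid}


# Made-up rows for illustration only; not taken from the real dataset.
rows = (
    [make_row("DOE", "JANE", "A")] * 3
    + [make_row("ROE", "RICHARD", "")] * 2
    + [make_row("POE", "EDGAR", "A")]
)
top_two = top_visitors(rows, n=2)
```

For tasks (ii) and (iii), only the mapper key would change, for example to (visitee_namelast, visitee_namefirst) or to the visitor and visitee fields together. In an actual Hadoop setup the final merge might be a second, much smaller job or a step in the driver; which one is appropriate here is a judgment call the sketch does not settle.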