-----------------------------------------------------
Instructions for using the local Hadoop cluster
-----------------------------------------------------

Some details of the current Duke CS Hadoop Cluster
-----------------------------------------------------

The master node is hadoop21.cs.duke.edu, with the JobTracker monitoring page here:
http://hadoop21.cs.duke.edu:50030/jobtracker.jsp
and the HDFS (Hadoop Distributed File System) monitoring page here:
http://hadoop21.cs.duke.edu:50070/dfshealth.jsp
[NOTE: You can only access the JobTracker and HDFS pages from within the CS trusted network]

There are 14 slave nodes: hadoop22 and hadoop24 through hadoop36. Each node has about 2GB of main memory and 30GB of local disk.

-----------------------------------------------------
1. How to access the cluster
-----------------------------------------------------

Before logging into the cluster, please make sure you have ssh installed on your machine. (You can search for "openssh" and install it.)

- ssh USERNAME@hadoop21.cs.duke.edu
  [Use your user name and password for the Duke CS Research Cluster. By default, these are the same as for your CS account]

- export PATH=$PATH:/usr/research/home/shivnath/JAVA/jdk1.6.0_17/bin/
  [Add the Java JDK to your PATH so that you can compile and run Java programs from anywhere]

- export HADOOP_HOME=/usr/research/home/rozemary/HADOOP/hadoop-0.20.203.0
  [Set the environment variable HADOOP_HOME to the Hadoop installation directory]

(Tip: If you don't want to set these variables each time you log in, put the two export commands above in ~/.bash_profile. If you don't have this file in your home directory, create one! If you are using a CS lab administered machine, you should not modify ~/.bash_profile; put them in ~/.my-bash_profile instead. See the example at the end of this section.)

- cd ${HADOOP_HOME}
  [Go to the Hadoop installation directory]

- bin/hadoop dfs -lsr ~/
  [Lists the contents of your home directory on HDFS. By default, your home directory is /usr/research/home/USERNAME]

(Note: The first time you use the cluster, the command above may show nothing. You can create a temporary directory in your home directory, which you can delete later, with bin/hadoop fs -mkdir ~/temp, and then list your home directory again.)

Plenty of help is available on the supported commands:
bin/hadoop -help
bin/hadoop dfs -help

For a listing of HDFS File System Shell commands, see:
http://hadoop.apache.org/common/docs/current/file_system_shell.html
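For example, a minimal ~/.bash_profile (or ~/.my-bash_profile on a CS lab administered machine) that sets up both variables would contain just the two export lines shown above:

    # ~/.bash_profile -- sourced at login; sets up Java and Hadoop for every session
    export PATH=$PATH:/usr/research/home/shivnath/JAVA/jdk1.6.0_17/bin/
    export HADOOP_HOME=/usr/research/home/rozemary/HADOOP/hadoop-0.20.203.0

After editing the file, either log out and back in or run "source ~/.bash_profile" for the changes to take effect in the current session.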
-----------------------------------------------------
2. Start playing with Hadoop
-----------------------------------------------------

Now we are going to compile a Hadoop program and run it.

- Some sample data and code have been placed at /usr/research/home/rozemary/hadoop_example_data/ and /usr/research/home/rozemary/hadoop_example_code/
  [Note: these directories are not on HDFS]

- bin/hadoop dfs -mkdir ~/ncdc_data
  [Create the ncdc_data directory in your home directory on HDFS]
  [Note: if you want to delete this directory, run: bin/hadoop dfs -rmr ~/ncdc_data]

- bin/hadoop dfs -copyFromLocal /usr/research/home/rozemary/hadoop_example_data/NCDC/sample.txt ~/ncdc_data/
  [Copy the local sample.txt file to the ncdc_data directory in your home directory on HDFS]

- mkdir /usr/research/home/USERNAME/hadoop_code
  [Create a local directory in which to write and compile MapReduce code. Replace USERNAME with your user name]

- cd /usr/research/home/USERNAME/hadoop_code/
  [Change to that directory. Replace USERNAME with your user name]

- mkdir MaxTemperature
  [Local directory for the Java source files]

- cp /usr/research/home/rozemary/hadoop_example_code/MaxTemperature/* MaxTemperature/
  [Copy the Java source files]

- mkdir MaxTemperature_classes
  [Local directory for the Java class files]

Ensure that Java version 6 (i.e., 1.6.0+) is available. For example, "javac -version" gives javac 1.6.0_17 in my case.

- javac -classpath $HADOOP_HOME/hadoop-core-0.20.203.0.jar -d MaxTemperature_classes MaxTemperature/*.java
  [Compile the Java files]

- jar -cvf MaxTemperature.jar -C MaxTemperature_classes/ .
  [Create a jar file containing the class files for the MapReduce program. Don't miss the "." at the end of the command. If everything has gone correctly so far, the file MaxTemperature.jar will be present in the current directory]

- cd ${HADOOP_HOME}
  [Get back to the Hadoop base directory]

- bin/hadoop dfs -cat ~/ncdc_data/sample.txt
  [Ensure that sample.txt has been copied correctly. This file has 5 lines.]

- bin/hadoop fs -rmr ~/myoutput
  [~/myoutput is the output directory for MaxTemperature. Remove the output directory if it exists; Hadoop will not run the job if the output directory already exists]

- bin/hadoop jar /usr/research/home/USERNAME/hadoop_code/MaxTemperature.jar MaxTemperature ~/ncdc_data/sample.txt ~/myoutput
  [Run the MapReduce program. Replace USERNAME with your user name]

- bin/hadoop dfs -lsr ~/myoutput
  [Shows the files generated by the job you submitted]

- bin/hadoop dfs -cat ~/myoutput/part-r-00000
  [Shows the output of the MapReduce program]

-----------------------------------------------------
3. Some other notes
-----------------------------------------------------

Some commonly used datasets can be found at /cps216/common on HDFS. Remember to delete (large) files you will not use again.

You can also run Hadoop on your own machine. Here are easy-to-follow instructions on how to set up Hadoop on a single machine:
http://hadoop.apache.org/common/docs/current/single_node_setup.html
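Once you have a local installation, you can try the same MaxTemperature example in standalone (local) mode, where Hadoop runs in a single JVM and reads and writes the local filesystem instead of HDFS. A minimal sketch, assuming you have unpacked hadoop-0.20.203.0 locally and copied sample.txt and MaxTemperature.jar from the cluster (the /path/to/... paths below are placeholders for wherever you put those files):

    # run from the root of the local Hadoop installation; standalone mode
    # needs no HDFS or JobTracker, only JAVA_HOME set in conf/hadoop-env.sh
    cd hadoop-0.20.203.0
    bin/hadoop jar /path/to/MaxTemperature.jar MaxTemperature /path/to/sample.txt /path/to/myoutput

    # in standalone mode the output directory is created on the local disk
    cat /path/to/myoutput/part-r-00000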