CCL Home Software Community Operations |
Hadoop and Map-Reduce in the CCLHadoop is an open-source implementation of the Map-Reduce concept of data processing. Hadoop is currently running on the DISC cluster at Notre Dame. Hadoop is good at a restricted class of data intensive processing workloads. For more computation intensive workloads, consider using our campus Condor Pool.Getting StartedFirst, ask Prof. Thain to create an account for you on the DISC Cluster.Then, ssh to the DISC head node: ssh disc01.crc.nd.edu(Note that many Hadoop instructions found online ask you to set a bunch of environment variables. Don't do that here, everything should be set up already.) Now, try the following commands, which list the Hadoop filesystem, upload a file to your home directory, and then read it back: hadoop fs -ls / hadoop fs -ls /users/YOURNAME hadoop fs -put /usr/share/dict/linux.words /users/YOURNAME/words hadoop fs -ls /users/YOURNAME hadoop fs -cat /users/YOURNAME/words | lessNote that Hadoop does its best to keep everyone's file separate, it does not have strong authentication, so you should assume anything you put into it can be accessed by anyone else on the system. Your private directory is simply there as a convenience. Be a good citizen, and do not mess around with other people's data.
Example of Map-Reduce Using JavaHere is a very brief introduction to Map-Reduce using Java. If you are not a Java programmer, see the next section on Streaming.I have already uploaded the complete text of Tolstoy's "War and Peace" to the system under /public/warandpeace.txt. You are going to use WordCount.java to compute the frequncy of words in the novel. Begin by downloading the source of WordCount.java to your machine. Now, compile into WordCount.jar it as follows: javac -classpath `hadoop classpath` WordCount.java jar -cvf WordCount.jar *.classTo perform a Map-Reduce job, run hadoop with the jar option and specify the input file and a new directory for output files: hadoop jar WordCount.jar WordCount /public/warandpeace.txt /users/YOURNAME/outputsNow, your outputs are stored under /users/YOURNAME/outputs in Hadoop: hadoop fs -ls /users/YOURNAME/outputs hadoop fs -cat /users/YOURNAME/outputs/part-r-00000
Example of Map-Reduce Using Streaming Using the streaming mode, you can run Map-Reduce programs where the mapper and reducer are ordinary programs written in whatever language you like. Data is passed between programs in plain ASCII format, where each line consists of a key string, a tab character, a value string, and a newline.For example, if you want to compute the frequency of words, you could write a mapper and reducer in Perl like this: cat words.txt | ./WordCountMap.pl | sort | ./WordCountReduce.pl > output.txtThen, to run it all in Hadoop, run a command like this: (all one one line) hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files WordCountMap.pl,WordCountReduce.pl -input /public/warandpeace.txt -output /users/YOURNAME/output2 -mapper WordCountMap.pl -reducer WordCountReduce.pl |