Hadoop and Map-Reduce in the CCL

CCL Home

Software

Community

Operations

Hadoop is an open-source implementation of the Map-Reduce concept of data processing. Hadoop is currently running on the DISC cluster at Notre Dame. Hadoop is good at a restricted class of data intensive processing workloads. For more computation intensive workloads, consider using our campus Condor Pool.

CCL Hadoop Filesystem Status

CCL Map-Reduce (YARN) Status

This document gives a very brief introduction. For more information, see here:

Hadoop File System User's Guide

Hadoop Map-Reduce Tutorial

Getting Started

First, ask Prof. Thain to create an account for you on the DISC Cluster.

Then, ssh to the DISC head node:

ssh disc01.crc.nd.edu

(Note that many Hadoop instructions found online ask you to set a bunch of environment variables. Don't do that here, everything should be set up already.)

Now, try the following commands, which list the Hadoop filesystem, upload a file to your home directory, and then read it back:

hadoop fs -ls /
hadoop fs -ls /users/YOURNAME
hadoop fs -put /usr/share/dict/linux.words /users/YOURNAME/words
hadoop fs -ls /users/YOURNAME
hadoop fs -cat /users/YOURNAME/words | less

Note that Hadoop does its best to keep everyone's file separate, it does not have strong authentication, so you should assume anything you put into it can be accessed by anyone else on the system. Your private directory is simply there as a convenience. Be a good citizen, and do not mess around with other people's data.

Example of Map-Reduce Using Java

Here is a very brief introduction to Map-Reduce using Java. If you are not a Java programmer, see the next section on Streaming.

I have already uploaded the complete text of Tolstoy's "War and Peace" to the system under /public/warandpeace.txt. You are going to use WordCount.java to compute the frequncy of words in the novel. Begin by downloading the source of WordCount.java to your machine. Now, compile into WordCount.jar it as follows:

javac -classpath `hadoop classpath` WordCount.java 
jar -cvf WordCount.jar *.class

To perform a Map-Reduce job, run hadoop with the jar option and specify the input file and a new directory for output files:

hadoop jar WordCount.jar WordCount /public/warandpeace.txt /users/YOURNAME/outputs

Now, your outputs are stored under /users/YOURNAME/outputs in Hadoop:

hadoop fs -ls /users/YOURNAME/outputs
hadoop fs -cat /users/YOURNAME/outputs/part-r-00000

Example of Map-Reduce Using Streaming

Using the streaming mode, you can run Map-Reduce programs where the mapper and reducer are ordinary programs written in whatever language you like. Data is passed between programs in plain ASCII format, where each line consists of a key string, a tab character, a value string, and a newline.

For example, if you want to compute the frequency of words, you could write a mapper and reducer in Perl like this:

WordCountMap.pl

WordCountReduce.pl

You can easily test your mapper and reducer locally on a small amount of data like this:

cat words.txt | ./WordCountMap.pl | sort | ./WordCountReduce.pl > output.txt

Then, to run it all in Hadoop, run a command like this: (all one one line)

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
    -files   WordCountMap.pl,WordCountReduce.pl
    -input   /public/warandpeace.txt
    -output  /users/YOURNAME/output2
    -mapper  WordCountMap.pl
    -reducer WordCountReduce.pl