CCL | Software | Download | Manuals | Forum | Papers
CCL Home

Research

Software Community Operations

Hadoop and Map-Reduce in the CCL

Hadoop is an open-source implementation of the Map-Reduce concept of data processing. Hadoop is currently running on the DISC cluster at Notre Dame. Hadoop is good at a restricted class of data intensive processing workloads. For more computation intensive workloads, consider using our campus Condor Pool.
  • CCL Hadoop Filesystem Status
  • CCL Map-Reduce (YARN) Status
  • This document gives a very brief introduction. For more information, see here:
  • Hadoop File System User's Guide
  • Hadoop Map-Reduce Tutorial

  • Getting Started

    First, ask Prof. Thain to create an account for you on the DISC Cluster.

    Then, ssh to the DISC head node:

    ssh disc01.crc.nd.edu
    
    (Note that many Hadoop instructions found online ask you to set a bunch of environment variables. Don't do that here, everything should be set up already.)

    Now, try the following commands, which list the Hadoop filesystem, upload a file to your home directory, and then read it back:

    hadoop fs -ls /
    hadoop fs -ls /users/YOURNAME
    hadoop fs -put /usr/share/dict/linux.words /users/YOURNAME/words
    hadoop fs -ls /users/YOURNAME
    hadoop fs -cat /users/YOURNAME/words | less
    
    Note that Hadoop does its best to keep everyone's file separate, it does not have strong authentication, so you should assume anything you put into it can be accessed by anyone else on the system. Your private directory is simply there as a convenience. Be a good citizen, and do not mess around with other people's data.


    Example of Map-Reduce Using Java

    Here is a very brief introduction to Map-Reduce using Java. If you are not a Java programmer, see the next section on Streaming.

    I have already uploaded the complete text of Tolstoy's "War and Peace" to the system under /public/warandpeace.txt. You are going to use WordCount.java to compute the frequncy of words in the novel. Begin by downloading the source of WordCount.java to your machine. Now, compile into wordcount.jar it as follows:

    javac -classpath `hadoop classpath` WordCount.java 
    jar -cvf WordCount.jar *.class
    
    To perform a Map-Reduce job, run hadoop with the jar option and specify the input file and a new directory for output files:
    hadoop jar wordcount.jar WordCount /public/warandpeace.txt /users/YOURNAME/outputs
    
    Now, your outputs are stored under /users/YOURNAME/outputs in Hadoop:
    hadoop fs -ls /users/YOURNAME/outputs
    hadoop fs -cat /users/YOURNAME/outputs/part-r-00000
    


    Example of Map-Reduce Using Streaming

    Using the streaming mode, you can run Map-Reduce programs where the mapper and reducer are ordinary programs written in whatever language you like. Data is passed between programs in plain ASCII format, where each line consists of a key string, a tab character, a value string, and a newline.

    For example, if you want to compute the frequency of words, you could write a mapper and reducer in Perl like this:

  • WordCountMap.pl
  • WordCountReduce.pl
  • You can easily test your mapper and reducer locally on a small amount of data like this:
    cat words.txt | ./WordCountMap.pl | sort | ./WordCountReduce.pl > output.txt
    
    Then, to run it all in Hadoop, run a command like this: (all one one line)
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
        -input   /public/warandpeace.txt
        -output  /users/YOURNAME/output2
        -mapper  WordCountMap.pl
        -file    WordCountMap.pl
        -reducer WordCountReduce.pl
        -file    WordCountReduce.pl