
Makeflow Tutorial

  1. Getting Started
    1. Login to the CRC Head Node
    2. Download, Build and Install CCTools
    3. Set Environment Variables
  2. Makeflow Example
    1. Setup
    2. Running with Local (Multiprocess) Execution
    3. Running on a Batch System
  3. Running Makeflow with Work Queue
  4. Exercise
    1. Running with Work Queue Workers on SGE and Condor

This tutorial walks you through installing CCTools in your CRC home directory and running some distributed computation examples using Makeflow.

Getting Started

Login to the CRC Head Node

For this tutorial, we assume you have SSH access to the Notre Dame CRC login nodes:

ssh newcell.crc.nd.edu

NOTE: If you are part of Notre Dame and do NOT have a CRC account, you can do the tutorial on cclsubmit00.cse.nd.edu using Condor instead.

ssh cclsubmit00.cse.nd.edu

Further instructions for running on cclsubmit00.cse.nd.edu and Condor are below.

NOTE: If you are NOT part of Notre Dame, you can still do most of the tutorial on a local Linux machine: everything except the section where workers are submitted to CRC/Condor.

Download, Build, and Install CCTools

Navigate to the download page in your browser to review the most recent versions: http://www.nd.edu/~ccl/software/download.shtml

Download a copy of CCTools 4.0.2

wget http://www.nd.edu/~ccl/software/files/cctools-4.0.2-source.tar.gz
tar xzf cctools-4.0.2-source.tar.gz

Build and Install CCTools

cd ~/cctools-4.0.2-source
./configure --prefix ~/cctools && make install
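A quick sanity check that the build and installation succeeded is to ask one of the installed binaries for its version with the -v flag:

~/cctools/bin/makeflow -v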

Set Environment Variables

You will need to add your CCTools directory to your $PATH:

setenv PATH "${HOME}/cctools/bin:${PATH}"
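If your login shell is bash rather than tcsh, the equivalent is:

export PATH="${HOME}/cctools/bin:${PATH}"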

NOTE for non-CRC users: You will need to set the range of ports that Work Queue may use on cclsubmit00.cse.nd.edu:

setenv WORK_QUEUE_LOW_PORT 9100
setenv WORK_QUEUE_HIGH_PORT 9500
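Again, bash users would write:

export WORK_QUEUE_LOW_PORT=9100
export WORK_QUEUE_HIGH_PORT=9500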

Makeflow Example

Setup

Set Up a Sandbox for this Tutorial

If you have a CRC account, create the sandbox in your home directory; SGE can access your home directory through the shared filesystem.

mkdir ~/cctools-tutorial
cd ~/cctools-tutorial

Note for non-CRC users: Create the sandbox in /tmp, since Condor does not have permission to access your home directory.

mkdir /tmp/cctools-tutorial-$USER
cd /tmp/cctools-tutorial-$USER

Create a sub-directory for storing Makeflow files.

mkdir makeflow-simple
cd makeflow-simple

Download simulation.py, which is the application executable for this exercise, and the Makeflow script, which defines the workflow:

wget http://www.nd.edu/~ccl/software/tutorials/ndtut13/simple/simulation.py
wget http://www.nd.edu/~ccl/software/tutorials/ndtut13/simple/Makeflow

The Makeflow script should look like:

$ cat Makeflow
input.txt:
	LOCAL /bin/echo Hello World > input.txt

A: simulation.py input.txt
	python simulation.py 1 < input.txt > A

B: simulation.py input.txt
	python simulation.py 2 < input.txt > B

C: simulation.py input.txt
	python simulation.py 3 < input.txt > C

D: simulation.py input.txt
	python simulation.py 4 < input.txt > D
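Each rule uses Make-like syntax: the output files, a colon, the input files, and then the command that produces the outputs, indented on the following line. The LOCAL keyword forces a command to run on the submitting machine regardless of which batch system is selected. As a purely hypothetical example (not one of this tutorial's files), a rule that gathered the four simulation outputs into a single file could be written as:

output.all: A B C D
	cat A B C D > output.all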

Running with Local (Multiprocess) Execution

Here we're going to tell Makeflow to dispatch the jobs using regular local processes (no distributed computing!). This is basically the same as regular Unix Make using the -j flag.

makeflow -T local

If everything worked out correctly, you should see:

$ makeflow -T local
/bin/echo Hello World > input.txt
python simulation.py 4 < input.txt > D
python simulation.py 3 < input.txt > C
python simulation.py 2 < input.txt > B
python simulation.py 1 < input.txt > A
nothing left to do.
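By default, local execution runs as many tasks concurrently as the machine permits. If you need to cap the parallelism (for example, on a shared head node), makeflow accepts a -j flag limiting the number of simultaneous local jobs:

makeflow -T local -j 2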

Running on a Batch System

If you have a CRC account: The following command tells Makeflow to dispatch jobs using the SGE batch submission system (qsub, qdel, qstat, etc.).

makeflow -T sge

Note for non-CRC users: Have Makeflow dispatch jobs to the Condor batch submission system (condor_submit, condor_rm, condor_q, etc.).

makeflow -T condor

You will see the following output:

nothing left to do.

Well... that's not right. Nothing was run! We need to clean out the generated output files and logs so Makeflow starts from a clean slate again:

makeflow -c

We see it deleted the files we generated in the last run:

$ makeflow -c
deleted file D
deleted file C
deleted file B
deleted file A
deleted file input.txt
deleted file ./Makeflow.makeflowlog

Now let's try again:

makeflow -T sge

Note for non-CRC users: Use Condor instead.

makeflow -T condor

We get the output we expect:

/bin/echo Hello World > input.txt
python simulation.py 4 < input.txt > D
python simulation.py 3 < input.txt > C
python simulation.py 2 < input.txt > B
python simulation.py 1 < input.txt > A
nothing left to do.

Notice that the output is no different from local execution: Makeflow is built to be execution-engine agnostic, so a task behaves the same whether it runs locally or remotely.

In this case, we can confirm that the job was run on another host by looking at the output produced by the simulation:

$ cat D
Running on host d6copt096.crc.nd.edu
Starting 2013 30 Sep 18:01:14
x = 2.0
x = 1.41421356237
x = 1.189207115
x = 1.09050773267
x = 1.04427378243
Finished 2013 30 Sep 18:01:19

Here we see that the task ran on node d6copt096.crc.nd.edu. It took 5 seconds to complete.

Running Makeflow with Work Queue

In the example above, the submission and wait times for the Makeflow tasks will vary with the latencies of the underlying batch submission platform (SGE). To avoid long submission and wait times, Makeflow can instead be run with Work Queue, which excels at handling short, low-latency jobs.

Here, we will start Makeflow, which will set up a Work Queue master on an arbitrary port using -p 0.

makeflow -c
makeflow -T wq -p 0

You should see output like this:

listening for workers on port 1024.
/bin/echo Hello World > input.txt
python simulation.py 4 < input.txt > D
python simulation.py 3 < input.txt > C
python simulation.py 2 < input.txt > B
python simulation.py 1 < input.txt > A
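With -p 0, Makeflow picks any free port. If you would rather have a fixed, predictable port, you can pass one explicitly; 9123 below is just an arbitrary example (non-CRC users should pick one inside the WORK_QUEUE_LOW_PORT/WORK_QUEUE_HIGH_PORT range set earlier):

makeflow -T wq -p 9123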

To start a work_queue_worker for this master, open another terminal window and log into the corresponding login node:

ssh newcell.crc.nd.edu

Note for non-CRC users: Login to cclsubmit00.cse.nd.edu

ssh cclsubmit00.cse.nd.edu

Add the CCTools directory to your $PATH as before:

setenv PATH "${HOME}/cctools/bin:${PATH}"

Now, start a work_queue_worker for this Makeflow by giving it the port the master is listening on. Let us also enable debugging output with -d all.

work_queue_worker -t 10 -d all localhost XXXX
...

replacing XXXX with the port the Makeflow master is listening on. When the tasks are finished, the worker will quit after its 10-second idle timeout (-t 10).
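Even on a single node, you can run tasks in parallel by starting several workers against the same master, for example by launching a couple more in the background (again replacing XXXX with the master's port):

work_queue_worker -t 10 localhost XXXX &
work_queue_worker -t 10 localhost XXXX &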

Note that the Makeflow still executed locally since the work_queue_worker was run on the same node (newcell or cclsubmit00, respectively) as the Makeflow master. The exercise below shows how to start several work_queue_worker processes on the SGE cluster.

Exercise

Running Makeflow with Work Queue Workers on SGE and Condor

The goal of this exercise is to set up Work Queue workers on SGE and Condor compute nodes. Here we submit the workers using the sge_submit_workers and condor_submit_workers scripts. To learn more about the options and arguments of these submission scripts, run:

sge_submit_workers -h

Note for non-CRC users: Use condor_submit_workers to submit workers to Condor.

condor_submit_workers -h

In this exercise, we will also use the catalog server and a project name so workers can find the master without being given the master's hostname and port. We assign the Makeflow a project name with the -N option, which takes a string argument; the submit scripts then hand the same project name to the workers via their -M option, as shown below.

makeflow -c
sge_submit_workers -t 300 -M $USER 2
makeflow -T wq -N $USER

Note for non-CRC users: Submit workers to Condor.

makeflow -c
condor_submit_workers -t 300 -M $USER 2
makeflow -T wq -N $USER

The shell substitutes $USER with your username, which is therefore used as the project name for your Makeflow.
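For reference, once a project name is in use, a single worker can also be started by name from any machine that can reach the catalog server, with no hostname or port at all. The sketch below assumes this release's worker options, where -a enables the catalog lookup and -N names the project; check work_queue_worker -h for your version:

work_queue_worker -t 300 -a -N $USER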

While Makeflow is running, you can look at the current Work Queue projects.

$ work_queue_status
PROJECT      NAME                 PORT  WAITING  BUSY  COMPLETE  WORKERS
[...]
nd-tutorial  newcell.crc.nd.edu   9001        4     0         0        0

work_queue_status is a useful tool for checking on your various Work Queue projects. Its output includes data on the tasks and workers each master is managing. Here we see that the project has 4 tasks waiting but no workers connected yet.
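You can also watch the submitted worker jobs from the batch system's side with its usual tools (SGE shown below; Condor users would run condor_q instead):

qstat -u $USER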