Makeflow Tutorial

This tutorial will have you install CCTools and will take you through some distributed computation examples using Makeflow.

  1. Getting Started
    1. Download, Build, and Install CCTools
    2. Set Environment Variables
  2. Makeflow Example
    1. Setup
    2. Running with Local (Multiprocess) Execution
  3. Running Makeflow with Work Queue
    1. Running with a single Work Queue Worker
  4. Running Makeflow with Work Queue using Project Names
  5. Running Makeflow with Work Queue on SGE cluster
    1. Running with Multi-slot Work Queue Workers
  6. Exercise: Running Makeflow directly on SGE Cluster

Getting Started

First, log in to the head node of XSEDE's Lonestar, where you will install CCTools and run this tutorial.

ssh USERNAME@login1.ls4.tacc.utexas.edu

Please replace USERNAME with the username on the sheet of paper handed out to you, and use the password given there as well.

NOTE: Please feel free to use your own XSEDE account if you have one.

Download, Build, and Install CCTools

wget http://www.nd.edu/~ccl/software/files/cctools-4.0rc1-source.tar.gz
tar xzf cctools-4.0rc1-source.tar.gz
cd ~/cctools-4.0rc1-source
./configure --prefix ~/cctools && make install

Set Environment Variables

You will need to add your CCTools directory to your $PATH:

export PATH=~/cctools/bin:${PATH}
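
If you would like this setting to persist across future logins, you can optionally append the same line to your shell startup file (a convenience only; not required for this tutorial):

echo 'export PATH=~/cctools/bin:${PATH}' >> ~/.bashrc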

If these commands executed successfully, the following command should print out the version string (4.0.0rc1) of Makeflow and information from its build step.

$ makeflow -v
makeflow version 4.0.0rc1-TRUNK (released 07/19/2013)
Built by train110@login2.ls4.tacc.utexas.edu on Jul 19 2013 at 12:51:26
System: Linux login2.ls4.tacc.utexas.edu 2.6.18-194.32.1.el5_TACC #2 SMP
Configuration: --prefix /home1/0000/train110/cctools

Makeflow Example

Setup

mkdir ~/cctools-tutorial
mkdir ~/cctools-tutorial/makeflow-simple
cd ~/cctools-tutorial/makeflow-simple

Download simulation.py, which is our application executable for this exercise, and the Makeflow script, which defines the workflow:

wget http://www.nd.edu/~ccl/software/tutorials/xsede13/makeflow/simulation.py
wget http://www.nd.edu/~ccl/software/tutorials/xsede13/makeflow/Makeflow

The Makeflow script should look like:

$ cat Makeflow
input.txt:
    LOCAL /bin/echo Start Simulation > input.txt

A: simulation.py input.txt
    python simulation.py 1 < input.txt > A

B: simulation.py input.txt
    python simulation.py 2 < input.txt > B

C: simulation.py input.txt
    python simulation.py 3 < input.txt > C

D: simulation.py input.txt
    python simulation.py 4 < input.txt > D

Running with Local (Multiprocess) Execution

Here we're going to tell Makeflow to dispatch the jobs using regular local processes (no distributed computing!). This is basically the same as regular Unix Make using the -j flag.

makeflow -T local

If everything worked out correctly, you should see:

$ makeflow -T local
/bin/echo Start Simulation > input.txt
python simulation.py 4 < input.txt > D
python simulation.py 3 < input.txt > C
python simulation.py 2 < input.txt > B
python simulation.py 1 < input.txt > A
nothing left to do.

Running Makeflow with Work Queue

The above example illustrated how Makeflows are easy to write and execute. However, running a Makeflow locally can incur performance penalties when the number of tasks exceeds the number of cores available for local execution.

In this section, we will run the above Makeflow on Lonestar's resources, using Work Queue to distribute its tasks to Lonestar nodes for parallel execution.

Work Queue provides a master-worker execution framework in which a master program dispatches jobs to worker instances deployed on various resources.

Here, we will start Makeflow to run as a Work Queue master on an arbitrary port using -p 0:

makeflow -T wq -p 0

You will get as output:

$ makeflow -T wq -p 0
nothing left to do.

Well...that's not right. Nothing was run! The output files from the previous local run already exist, so Makeflow has nothing to do. We need to clean out the generated output files and logs so Makeflow starts from a clean slate again:

makeflow -c

We see it deleted the files we generated in the last run:

$ makeflow -c
deleted file D
deleted file C
deleted file B
deleted file A
deleted file input.txt
deleted file ./Makeflow.makeflowlog

Now, let us rerun the Makeflow with Work Queue. We should see output like this:

$ makeflow -T wq -p 0
listening for workers on port 1024.
/bin/echo Start Simulation > input.txt
python simulation.py 4 < input.txt > D
python simulation.py 3 < input.txt > C
python simulation.py 2 < input.txt > B
python simulation.py 1 < input.txt > A

The output of the Makeflow shows the port on which it is listening for connections from workers. Remember the value of this port since we will be using it later!

Running with a single Work Queue Worker

To start a single worker for this master, open another terminal window and log in to the Lonestar head node with your given username:

ssh USERNAME@login1.ls4.tacc.utexas.edu

As before, replace USERNAME with the username provided to you and use the corresponding password.

Add the CCTools directory to your $PATH as before:

export PATH=~/cctools/bin:${PATH}

Now, start a work_queue_worker for this Makeflow by giving it the port the master is listening on. Let us also enable debugging output using -d all:

work_queue_worker -t 10 -d all localhost XXXX
...

replacing XXXX with the port the Makeflow master is listening on. When the tasks are finished, the worker should quit due to the 10 second timeout.
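
If you would like the worker to wait longer for new work before exiting, raise the idle timeout; for example (the 300-second value below is just an illustrative choice):

work_queue_worker -t 300 -d all localhost XXXX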

Note that the Makeflow still executed locally since the work_queue_worker was run on the same node as the Makeflow master. The section below describes how to start several work_queue_worker processes in Lonestar's SGE cluster.

Running Makeflow with Work Queue using Project Names

Keeping track of port numbers and specifying them correctly when starting workers for each master can get painful quickly. To overcome this, we have a project name feature that can help workers find Work Queue masters without using port numbers.

This is accomplished using a catalog server that matches project names to the network address of the Work Queue masters. When a project name is specified for a Work Queue master, it advertises this project name along with its location to the catalog server.

On the other end, Work Queue workers are given the project name of the master they should connect to. The workers then query the catalog server to obtain the locations of masters with that project name.

In Makeflow, the project name can be set using the -N command line option. The -a option will force Makeflow to advertise its project name to the catalog server.

Let us rerun the Makeflow by specifying a project name for it:

makeflow -c
makeflow -T wq -p 0 -d all -a -N $USER.makeflow

This assigns a project name based on the username with which you logged in.

Similarly, we will use the -a and -M options to specify the project name of the master to which the workers should connect:

work_queue_worker -t 10 -d all -a -M $USER.makeflow

Note that we no longer have to specify the hostname and port of the Work Queue master created by the Makeflow.
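
As an optional sanity check, the work_queue_status tool shipped with CCTools queries the catalog server and should list your master under its project name:

work_queue_status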

Running Makeflow with Work Queue on SGE cluster

Here, we are going to show how to run the Makeflow master and Work Queue workers on Lonestar's SGE cluster.

First, we submit the Makeflow master itself as an SGE job. We create a script containing the commands to run so that we can submit it with qsub:

echo -e '#!/bin/sh\n makeflow -c\n makeflow -T wq -p 0 -a -N $USER.makeflow' > sge_makeflow.sh

The script file for SGE should look like this:

$ cat sge_makeflow.sh
#!/bin/sh
makeflow -c
makeflow -T wq -p 0 -a -N $USER.makeflow

Submit the script to SGE using qsub as follows:

qsub -q development -l h_rt=00:10:00 -cwd -V -pe 12way 12 sge_makeflow.sh

Note the various arguments provided in the qsub command. These arguments are required by the Lonestar cluster for each job submission.
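
For reference, here is a rough gloss of the submit options (based on standard SGE usage on Lonestar; consult the Lonestar user guide for the authoritative descriptions):

# -q development     : submit to the development queue
# -l h_rt=00:10:00   : request a 10-minute wall-clock time limit
# -cwd               : run the job from the current working directory
# -V                 : export the current environment variables to the job
# -pe 12way 12       : request the "12way" parallel environment with 12 slots
qsub -q development -l h_rt=00:10:00 -cwd -V -pe 12way 12 sge_makeflow.sh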

Next, we similarly submit a worker for this master on SGE by creating a script:

echo -e '#!/bin/sh\n work_queue_worker -t 10 -a -M $USER.makeflow' > sge_worker.sh

The worker script should look like this:

$ cat sge_worker.sh
#!/bin/sh
work_queue_worker -t 10 -a -M $USER.makeflow

Submit this script to SGE using qsub:

qsub -q development -l h_rt=00:10:00 -cwd -V -pe 12way 12 sge_worker.sh

Note that this starts a single worker for the master, so the tasks in the Makeflow will be executed sequentially. To execute tasks in parallel, start multiple workers by running the above qsub command multiple times, as sketched below.
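
For example, a small shell loop (a sketch; choose whatever worker count you need) submits the worker script four times, one SGE job per worker:

for i in 1 2 3 4; do
    qsub -q development -l h_rt=00:10:00 -cwd -V -pe 12way 12 sge_worker.sh
done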

Alternatively, you can make use of our sge_submit_workers script (provided in the CCTools package) to submit workers to the SGE cluster. To submit 4 workers to the SGE cluster, so that the 4 independent tasks defined in this Makeflow are executed in parallel, do:

sge_submit_workers -p "-q development -l h_rt=00:10:00 -V -pe 12way 12" -t 10 -M $USER.makeflow 4

To monitor the completion of the submitted master and worker jobs, we will watch the status of these jobs using SGE's qstat interface:

watch "qstat -u $USER"

When the Makeflow master job completes, we can confirm that the tasks were run on another host by looking at the output produced by the simulation:

$ cat D
Ran on host c317-303.ls4.tacc.utexas.edu
Starting 2013 21 Jul 12:56:50
The input file I was given contains: ['Start Simulation\n']
Starting intensive simulation of the collision of intergalactic forces in the outer cosmos!
x = 2.0
x = 1.41421356237
x = 1.189207115
x = 1.09050773267
x = 1.04427378243
Finished 2013 21 Jul 12:56:55

Here we see that the task that creates output D was run by a worker on node c317-303.ls4.tacc.utexas.edu.

Running with Multi-slot Work Queue Workers

The SGE cluster in Lonestar allocates each submitted job a dedicated 12-core node (the 12way argument in the SGE submit options). However, each worker, by default, advertises a single slot and runs one single-core task at a time. This wastes resources on Lonestar nodes, which can run 12 single-core tasks simultaneously or a single 12-core task.

To avoid this waste, we provide two features that enable a Makeflow to harness all the available cores on a node.

The first feature allows the rules in the Makeflow file to be annotated with the resources they require. Work Queue then allocates tasks to workers with the appropriate resources, maximizing the utilization of the allocated nodes.

You can download the annotated version of the above Makeflow using:

wget -O Makeflow http://www.nd.edu/~ccl/software/tutorials/xsede13/annotated_makeflow/Makeflow

In the Makeflow example used in this tutorial, each simulation task uses a single core. So we annotate each task to require a single core using the CORES tag. Since these are also small tasks, we annotate them to use 1 MB of memory using the MEMORY tag and 1 MB of disk using the DISK tag, as shown below.

$ cat Makeflow
input.txt:
    LOCAL /bin/echo Start Simulation > input.txt

CORES=1
MEMORY=1
DISK=1

A: simulation.py input.txt
    python simulation.py 1 < input.txt > A

B: simulation.py input.txt
    python simulation.py 2 < input.txt > B

C: simulation.py input.txt
    python simulation.py 3 < input.txt > C

D: simulation.py input.txt
    python simulation.py 4 < input.txt > D

Resubmit the sge_makeflow.sh script (created in the previous step) to run the Makeflow with the resource annotations:

qsub -q development -l h_rt=00:10:00 -cwd -V -pe 12way 12 sge_makeflow.sh

The second feature tells each worker to advertise and use all the resources (cores, memory, disk) available on its allocated node. For example, we can specify that workers submitted to Lonestar's SGE cluster use 12 cores, since Lonestar allocates 12 dedicated cores to each job. The --cores option of work_queue_worker specifies the number of cores, or slots, for the worker to advertise and use.

From the other open terminal, submit a multi-slot worker to SGE that utilizes all the cores present (12) on its allocated node.

To do so, we first create the worker script for SGE that starts workers with the --cores option:

echo -e '#!/bin/sh\n work_queue_worker --cores all -t 10 -a -M $USER.makeflow' > sge_multicore_worker.sh

The worker script should look like this:

$ cat sge_multicore_worker.sh
#!/bin/sh
work_queue_worker --cores all -t 10 -a -M $USER.makeflow

Submit this script to SGE using qsub:

qsub -q development -l h_rt=00:10:00 -cwd -V -pe 12way 12 sge_multicore_worker.sh

Note that the worker, by default, will advertise and utilize the memory and disk space that it observes to be available on its allocated node.

This worker will execute the 4 independent single-core tasks defined in this Makeflow in parallel on the allocated cores.

Exercise: Running Makeflow directly on SGE Cluster

Makeflow can also be set up to dispatch its jobs directly to a cluster or grid (Condor, SGE, Moab, Torque) without using Work Queue. In this case, Makeflow invokes the corresponding batch submission interface (qsub, msub, qdel, etc.) to submit and monitor its jobs.

Makeflow supports several batch systems for direct job dispatch and execution via the -T option, including condor and sge, in addition to the wq (Work Queue) mode used above.

Before running again, remember to clean up the output from the last Makeflow run:

makeflow -c

The following command tells Makeflow to dispatch jobs using the SGE batch submission system on Lonestar.

makeflow -T sge --batch-options "-q development -l h_rt=00:10:00 -V -pe 12way 12"

The --batch-options argument passes the submission options required by Lonestar when submitting jobs to its SGE cluster, as described above.

You should see this output:

$ makeflow -T sge --batch-options "-q development -l h_rt=00:10:00 -V -pe 12way 12"
/bin/echo Start Simulation > input.txt
python simulation.py 4 < input.txt > D
python simulation.py 3 < input.txt > C
python simulation.py 2 < input.txt > B
python simulation.py 1 < input.txt > A
nothing left to do.

Notice that the output is no different from using local execution. Makeflow is built to be execution engine agnostic. There is no difference between executing the task locally or remotely.

After this Makeflow completes, we can confirm the tasks were run on a resource in the Lonestar SGE cluster by looking at the output:

$ cat D
Ran on host c301-321.ls4.tacc.utexas.edu
Starting 2013 21 Jul 12:59:24
The input file I was given contains: ['Start Simulation\n']
Starting intensive simulation of the collision of intergalactic forces in the outer cosmos!
x = 2.0
x = 1.41421356237
x = 1.189207115
x = 1.09050773267
x = 1.04427378243
Finished 2013 21 Jul 12:59:29