HOWTO: Execute Jobs in Chirp

A Chirp filesystem is normally used to connect a program running on one machine to its data stored on another machine. However, Chirp can also be used to ship the program to the data and execute it on the server. This can provide a significant performance improvement for I/O-intensive codes.

(Of course, remote job execution is potentially a security concern, so job execution is disabled by default. To enable job execution, you must run your Chirp server with the -X command line option. While running, the job executes in an identity box that limits file access to that specified by the ACLs.)

This document describes how to execute jobs from the command line, and how to write programs that execute jobs.

Executing Jobs from the Command Line

To execute a remote job from the command line, you must create a directory, give yourself the x permission on that directory, upload the necessary program, and then invoke job_run. For example:

% chirp server.somewhere.edu
connected to server as unix:fred
chirp> cd mydata
chirp> put haystack.txt
chirp> put /usr/bin/grep
chirp> setacl /mydata unix:fred rwldax
chirp> job_run grep needle haystack.txt

You will then see the state of the job as it progresses:

 jobid 6 created
 jobid 6 submitted.
 jobid 6 completed with exit code 0
 jobid 6 removed.

By default, job_run will place the standard output and error into files named stdout.txt and stderr.txt and retrieve them when the job is complete. The input and output can be directed into other files using the > and < symbols.

Note that job_run displays a job id number. If you stop the chirp tool with Control-C, the job will still be running. You can reconnect to the server and, using the job id, examine the job's status with job_list, wait for it to complete with job_wait, or kill it with job_kill. Regardless of how the job completes, a record of its completion is left behind until you invoke job_remove.

Writing Programs that Execute Jobs

Chirp uses a transactional model of job execution. This model is chosen so that communication failures between the client and server will not leave the system in an inconsistent state. The state diagram is shown to the right. The following symbolic constants correspond to each of the states:

CHIRP_JOB_STATE_BEGIN

CHIRP_JOB_STATE_IDLE

CHIRP_JOB_STATE_RUNNING

CHIRP_JOB_STATE_COMPLETE

CHIRP_JOB_STATE_FAILED

CHIRP_JOB_STATE_KILLED

There are also six functions that affect job execution:

chirp_reli_job_begin(
    const char *host,
    const char *cwd,
    const char *input,
    const char *output,
    const char *error,
    const char *cmdline,
    time_t stoptime );

chirp_reli_job_commit(
    const char *host,
    INT64_T jobid,
    time_t stoptime );

chirp_reli_job_wait(
    const char *host,
    INT64_T jobid,
    struct chirp_job_state *state,
    int wait_time,
    time_t stoptime );

chirp_reli_job_kill(
    const char *host,
    INT64_T jobid,
    time_t stoptime );

chirp_reli_job_remove(
    const char *host,
    INT64_T jobid,
    time_t stoptime );

chirp_reli_job_list(
    const char *host,
    chirp_joblist_t callback,
    void *arg,
    time_t stoptime );

The client creates a job by calling chirp_reli_job_begin, specifying the program to be run. This causes the server to create a new job in the INITIAL state (CHIRP_JOB_STATE_BEGIN) and return its jobid to the client. Next, the client must invoke chirp_reli_job_commit, which puts the job into the IDLE state, allowing it to run. The server may have multiple jobs queued at any given time, and uses an internal algorithm (currently first-come, first-served) to decide which to run. When a job runs, the server moves it to the RUNNING state. If the job runs to completion, either by exiting normally or by crashing due to a signal, it reaches the COMPLETE state. If the job cannot be executed at all (e.g. the program specified is not an executable binary), it reaches the FAILED state. If the server crashes and restarts, the job is placed back in the IDLE state and has the opportunity to run again. The owner of the job (or the server super-user) may issue chirp_reli_job_kill, which causes a job in the INITIAL, IDLE, or RUNNING state to be forcibly terminated and moved to the KILLED state.
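The transitions described above can be sketched as a small state function. This is only an illustration of the model, not code from the Chirp server: the event names and the job_next_state helper are hypothetical, the numeric values of the state constants are stand-ins, and the source states for the failure and restart transitions are assumptions marked in the comments.

```c
/* Stand-in values so the sketch compiles without the Chirp headers;
   the real numeric values are not given in this document. */
enum {
    CHIRP_JOB_STATE_BEGIN,
    CHIRP_JOB_STATE_IDLE,
    CHIRP_JOB_STATE_RUNNING,
    CHIRP_JOB_STATE_COMPLETE,
    CHIRP_JOB_STATE_FAILED,
    CHIRP_JOB_STATE_KILLED
};

/* Hypothetical event names, invented for this illustration. */
enum {
    JOB_EVENT_COMMIT,     /* client calls chirp_reli_job_commit */
    JOB_EVENT_SCHEDULE,   /* server selects the job to run */
    JOB_EVENT_EXIT,       /* program exits normally or dies on a signal */
    JOB_EVENT_EXEC_ERROR, /* program could not be executed at all */
    JOB_EVENT_RESTART,    /* server crashed and restarted */
    JOB_EVENT_KILL        /* owner or super-user calls chirp_reli_job_kill */
};

/* Return the state that follows `state` when `event` occurs.
   An event that does not apply leaves the state unchanged. */
int job_next_state(int state, int event)
{
    switch (event) {
    case JOB_EVENT_COMMIT:
        return state == CHIRP_JOB_STATE_BEGIN ? CHIRP_JOB_STATE_IDLE : state;
    case JOB_EVENT_SCHEDULE:
        return state == CHIRP_JOB_STATE_IDLE ? CHIRP_JOB_STATE_RUNNING : state;
    case JOB_EVENT_EXIT:
        return state == CHIRP_JOB_STATE_RUNNING ? CHIRP_JOB_STATE_COMPLETE : state;
    case JOB_EVENT_EXEC_ERROR:
        /* Assumption: the failure is noticed when the server tries to
           start the job, so it may come from IDLE or RUNNING. */
        if (state == CHIRP_JOB_STATE_IDLE || state == CHIRP_JOB_STATE_RUNNING)
            return CHIRP_JOB_STATE_FAILED;
        return state;
    case JOB_EVENT_RESTART:
        /* Assumption: only a RUNNING job is moved back to IDLE. */
        return state == CHIRP_JOB_STATE_RUNNING ? CHIRP_JOB_STATE_IDLE : state;
    case JOB_EVENT_KILL:
        if (state == CHIRP_JOB_STATE_BEGIN ||
            state == CHIRP_JOB_STATE_IDLE ||
            state == CHIRP_JOB_STATE_RUNNING)
            return CHIRP_JOB_STATE_KILLED;
        return state; /* terminal states are not affected */
    default:
        return state;
    }
}
```

Note that COMPLETE, FAILED, and KILLED never appear as a source state that changes: once a job reaches one of them, no event in this sketch moves it elsewhere.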

The chirp_reli_job_wait function makes the caller wait until the job reaches one of the three terminal states (COMPLETE, FAILED, KILLED), the wait_time timeout expires, or the server decides to return prematurely. A timeout of zero can be used to return the job's status immediately. Regardless of which condition is reached, chirp_reli_job_wait fills in a chirp_job_state structure with the current state of the job:

struct chirp_job_state {
        INT64_T jobid;
        char    command[CHIRP_PATH_MAX];
        char    owner[CHIRP_PATH_MAX];
        int     state;
        int     exit_code;
        time_t  submit_time;
        time_t  start_time;
        time_t  stop_time;
        int     pid;
};

Note that the return code of chirp_reli_job_wait only indicates whether the job status was successfully returned. A return value >=0 indicates the job state was retrieved, and a return value <0 indicates it was not. The caller MUST look at the state field of the structure to determine whether the job has completed.
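The distinction between the return code and the state field can be sketched as a polling loop. This is a minimal sketch under several assumptions: the typedef for INT64_T, the value of CHIRP_PATH_MAX, the numeric state values, and the INT64_T return type of the wait call are stand-ins so the code compiles without the Chirp headers. The helper wait_until_done and the stub fake_job_wait are hypothetical; the stub imitates a server whose job finishes on the third poll.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Stand-ins for definitions that normally come from the Chirp headers.
   The values are assumptions, not the library's definitions. */
typedef int64_t INT64_T;
#define CHIRP_PATH_MAX 1024
enum {
    CHIRP_JOB_STATE_BEGIN, CHIRP_JOB_STATE_IDLE, CHIRP_JOB_STATE_RUNNING,
    CHIRP_JOB_STATE_COMPLETE, CHIRP_JOB_STATE_FAILED, CHIRP_JOB_STATE_KILLED
};

struct chirp_job_state {
    INT64_T jobid;
    char    command[CHIRP_PATH_MAX];
    char    owner[CHIRP_PATH_MAX];
    int     state;
    int     exit_code;
    time_t  submit_time;
    time_t  start_time;
    time_t  stop_time;
    int     pid;
};

/* Same parameter list as chirp_reli_job_wait; the return type is assumed. */
typedef INT64_T (*job_wait_fn)(const char *host, INT64_T jobid,
                               struct chirp_job_state *state,
                               int wait_time, time_t stoptime);

/* True for the three terminal states. */
int is_terminal(int s)
{
    return s == CHIRP_JOB_STATE_COMPLETE ||
           s == CHIRP_JOB_STATE_FAILED ||
           s == CHIRP_JOB_STATE_KILLED;
}

/* Hypothetical helper. Returns 0 once the job is in a terminal state
   (details in *out), -1 if a status query failed, and 1 if max_polls
   queries were made and the job was still not finished. */
int wait_until_done(job_wait_fn wait_fn, const char *host, INT64_T jobid,
                    struct chirp_job_state *out, int wait_time,
                    time_t stoptime, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (wait_fn(host, jobid, out, wait_time, stoptime) < 0)
            return -1; /* status not retrieved; *out is not meaningful */
        if (is_terminal(out->state))
            return 0;  /* only the state field says the job is done */
    }
    return 1;
}

/* Stub standing in for the real call: reports RUNNING twice, then
   COMPLETE. It always "succeeds", like a wait that returns early. */
int fake_polls = 0;
INT64_T fake_job_wait(const char *host, INT64_T jobid,
                      struct chirp_job_state *state,
                      int wait_time, time_t stoptime)
{
    (void)host; (void)wait_time; (void)stoptime;
    memset(state, 0, sizeof(*state));
    state->jobid = jobid;
    fake_polls++;
    state->state = (fake_polls < 3) ? CHIRP_JOB_STATE_RUNNING
                                    : CHIRP_JOB_STATE_COMPLETE;
    return 0;
}
```

In a real client, chirp_reli_job_wait would be passed in place of fake_job_wait, provided its actual prototype matches the assumed function pointer type. A caller that treated the first successful return as completion would stop while the stub still reports RUNNING.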

Unlike in Unix, the state of a completed job remains available on the server and can be viewed multiple times with chirp_reli_job_wait. This means a communication error while retrieving the status does not lose it or leave the system inconsistent. The caller must explicitly remove the state with chirp_reli_job_remove. However, the server retains the freedom to remove the state after an excessive amount of time (currently one week) has passed since the job completed.