Using Condor at Notre Dame

These instructions will get you started using Condor at Notre Dame. You can learn a lot more about Condor in general at the Condor web site. First, log into a machine that has Condor installed. A large number of machines on campus are already set up with Condor. If you are a student doing coursework on Condor, you can access Condor from the CSE student[00-03].cse.nd.edu machines. If you have a CRC account, you can use the front-end CRC machine condorfe.crc.nd.edu to access Condor. You can also use the CCL submit machine cclsubmit.cse.nd.edu (email dthain@nd.edu for access). If you would like your own machine set up as a Condor submitter, ask your system administrator to install Condor with these instructions.

To check whether the condor installation is accessible from your path, try:

which condor_q
If you get a message such as "/usr/bin/which: no condor_q...", then we recommend modifying your .bashrc configuration by adding the following lines:
cctools_setup=/afs/crc.nd.edu/group/ccl/software/cclsetup.sh
if [[ -f "$cctools_setup" ]]
then
    . $cctools_setup
    cclimport condor current
fi
Close your terminal and log in again. The cclimport command will select the Condor build that matches the Linux version of the machine you are using.

To see the machines available in the ND Condor pool, you can view the Condor status web page, or you can run the condor_status command:

condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
 
vm1@hedwig.cs LINUX       INTEL  Owner      Idle       0.220   501  0+00:00:10
vm2@hedwig.cs LINUX       INTEL  Owner      Idle       0.000   501  0+00:00:11
wombat00.csel LINUX       INTEL  Owner      Idle       0.010   121  0+00:00:14
...
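If the full listing is too long to be useful, condor_status accepts options to summarize or filter it. The flags below are standard HTCondor options; the exact output will depend on the state of the pool when you run them:

```shell
# Summarize the pool instead of listing every slot
condor_status -total

# Show only slots currently available to run jobs
condor_status -avail

# Filter with a ClassAd constraint, e.g. only 64-bit Linux machines
condor_status -constraint 'OpSys == "LINUX" && Arch == "X86_64"'
```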

To submit a batch job to Condor, you must create a submission file and then run the condor_submit command. Try creating this sample submit file as /tmp/YOURNAME/test.submit. (Make sure that you really do put it in /tmp/YOURNAME/test.submit.)

universe = vanilla
executable = /bin/echo
arguments = hello condor
output = test.output
should_transfer_files = yes
when_to_transfer_output = on_exit
log = test.logfile
queue 
Now, to submit the job to Condor, execute:
cd /tmp/YOURNAME
condor_submit test.submit
Submitting job(s)...
1 job(s) submitted to cluster 49603.
Once the job is submitted, you can use condor_q to look at the status of the jobs in your queue. By default condor_q shows the total number of jobs in various states:
-- Schedd: disc01.crc.nd.edu : <10.32.74.140:9618?... @ 02/10/21 10:12:52
OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
dthain   ID: 49603    8/26 17:21      0     1       0      0      1 49603.0
If you want to look at the details of individual jobs, use condor_q -nobatch.
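A few other condor_q options are often handy when you are watching or debugging jobs. These are standard HTCondor flags (the job id below is the one from this example):

```shell
# One line per job instead of the batched summary
condor_q -nobatch

# Show only held jobs, along with the reason they were held
condor_q -hold

# Ask the scheduler why a particular job is not running yet
condor_q -analyze 49603.0
```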

If you run condor_q quickly enough, you will see your job idle:

-- Schedd: disc01.crc.nd.edu : <10.32.74.140:9618?... @ 02/10/21 10:12:52
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
49603    dthain          8/26 17:21   0+00:00:00 I  0   0.0  echo hello condor
If you decide to cancel a job, use condor_rm and the job id:
condor_rm 49603.0
Job 49603.0 marked for removal.
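condor_rm can also act on more than one job at a time. These are standard usages; substitute your own cluster id:

```shell
# Remove a single job by cluster.process id
condor_rm 49603.0

# Remove every job in a cluster at once
condor_rm 49603

# Remove all of your own jobs
condor_rm -all
```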
Note about email: Despite what the Condor manual says, you will not receive email when a job is complete. This feature has been disabled at Notre Dame due to our email security configuration.

Because you will certainly want to run many jobs at once via Condor, you can easily modify your submit file to run a program with tens or hundreds of variations. Change the queue command to queue several jobs at once, and use the $(PROCESS) macro to vary the parameters with the job number.

universe = vanilla
executable = /bin/echo
arguments = hello $(PROCESS)
output = test.output.$(PROCESS)
error = test.error.$(PROCESS)
should_transfer_files = yes
when_to_transfer_output = on_exit
log = test.logfile
queue 10
Now, when you run condor_submit, you should see something like this:
condor_submit test.submit
Submitting job(s)..........
10 job(s) submitted to cluster 9.
Note in this case that "cluster" means "a bunch of jobs", where each job is named 9.0, 9.1, 9.2, and so forth. In this next example, condor_q shows that cluster 9 is halfway complete, with job 9.5 currently running.
condor_q -nobatch

-- Schedd: disc01.crc.nd.edu : <10.32.74.140:9618?... @ 02/10/21 10:12:52
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.5   dthain          8/26 17:46   0+00:00:01 R  0   0.0  echo hello 5
   9.6   dthain          8/26 17:46   0+00:00:00 I  0   0.0  echo hello 6
   9.7   dthain          8/26 17:46   0+00:00:00 I  0   0.0  echo hello 7
   9.8   dthain          8/26 17:46   0+00:00:00 I  0   0.0  echo hello 8
   9.9   dthain          8/26 17:46   0+00:00:00 I  0   0.0  echo hello 9
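The $(PROCESS) macro can also select per-job input files, not just arguments. A hypothetical sketch, assuming you have already created ten files named input.0 through input.9 in the submit directory (the filenames and the use of /bin/sort here are illustrative, not part of the original example):

```
universe = vanilla
executable = /bin/sort
arguments = input.$(PROCESS)
transfer_input_files = input.$(PROCESS)
output = sorted.$(PROCESS)
error = test.error.$(PROCESS)
should_transfer_files = yes
when_to_transfer_output = on_exit
log = test.logfile
queue 10
```

Each job in the cluster then receives its own input file and writes its own output file.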

Important note about AFS:

In the example above, the submit file and all of the job's details were stored in /tmp/YOURNAME on your local disk. Condor simply moved the necessary files back and forth in order to run your jobs. If instead you store your data files in AFS (i.e., your home directory), Condor cannot access them, because it will not have your AFS Kerberos ticket.

If you want Condor to be able to read any data out of AFS, you must change the ACLs on the necessary directories to allow any machine on campus to read the data. This is fine for non-sensitive data. Here's how:

fs setacl ~/my/data/directory nd_campus rl
If you want Condor to be able to write to AFS, you must change the ACLs to allow any machine on campus to write to that directory. Of course, this is a security risk, and should not be done without some careful thought.
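To inspect or adjust the ACLs, the standard AFS fs commands look like this (the directory path is a placeholder; "write" is the AFS alias for the rlidwk permission set):

```shell
# Show the current ACL on a directory
fs listacl ~/my/data/directory

# Allow any campus machine to read (rl = read + lookup)
fs setacl ~/my/data/directory nd_campus rl

# Allow any campus machine to write as well; think carefully before doing this
fs setacl ~/my/data/directory nd_campus write
```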

There is much more to Condor. Please read the manual to learn more.

Users and administrators of Condor at Notre Dame are encouraged to subscribe to the condor-discuss mailing list.