Resource Monitor User's Manual

Last edited: August 2013

resource_monitor is Copyright (C) 2013 The University of Notre Dame.
All rights reserved.
This software is distributed under the GNU General Public License.
See the file COPYING for details.

Overview

resource_monitor is a tool that monitors the computational resources used by the process created by the command given as an argument, and by all of its descendants. The monitor works 'indirectly', that is, by observing how the environment changes while a process is running; therefore all the information reported should be considered an estimate (this is in contrast with direct methods, such as ptrace). It has been tested on Linux, FreeBSD, and Darwin, and can be used in stand-alone mode, or automatically with makeflow and Work Queue applications.

resource_monitor generates up to three log files: a summary file with the maximum values of the resources used, a time series that shows the resources used at given time intervals, and a list of files that were opened during execution.

Maximum resource limits can be specified in a file, or as a string given at the command line. If one of the resources goes over its specified limit, the monitor terminates the task and reports which resources exceeded their respective limits.

On systems that support it, resource_monitor wraps some libc functions to obtain a better estimate of the resources used. In contrast, resource_monitorv disables this wrapping, which means, among other things, that it can only monitor the root process, not its descendants.

Installation

The resource_monitor is included in the current development version of CCTools. For installation, please follow these instructions.

Running resource_monitor

Stand-alone mode

Simply type:

    % resource_monitor -- ls

This will generate three files describing the resource usage of the command "ls". These files are resource-pid-PID.summary, resource-pid-PID.series, and resource-pid-PID.files, in which PID is the corresponding process id.

Alternatively, we can specify the output names and the sampling interval:

    % resource_monitor -O log-sleep -i 2 -- sleep 10

The previous command will monitor "sleep 10" at two-second intervals, and will generate the files log-sleep.summary, log-sleep.series, and log-sleep.files.

Currently, the monitor does not support interactive applications. That is, if a process issues a read call on standard input, and standard input has not been redirected, then the process tree is terminated. This is likely to change in future versions of the tool.

Makeflow mode

If you already have a makeflow file, you can activate the resource_monitor by giving the -M flag to makeflow with a desired output directory, for example:

    % makeflow -Mmonitor_logs Makeflow

In this case, makeflow wraps every command-line rule with the monitor, and writes the resulting logs, one set per rule, in the directory monitor_logs.
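As an illustration, a one-rule makeflow file could look as follows (the file name example.makeflow and the rule itself are hypothetical, chosen only for this example); running it with -Mmonitor_logs wraps the /bin/echo command of that single rule with the monitor:

    # example.makeflow: a single rule that produces hello.txt
    hello.txt:
        /bin/echo hello > hello.txt

    % makeflow -Mmonitor_logs example.makeflow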

Work-queue mode

From Work Queue:

    q = work_queue_create(port);
    work_queue_enable_monitoring(q, "some-log-file");

This wraps every task with the monitor, and appends all generated summary files into the file some-log-file. Currently, only summary reports are generated from Work Queue.
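As a sketch, a minimal Work Queue master program that enables monitoring could look like the following; the summary file name and the task command are placeholders chosen for this example, and error handling is kept to a minimum:

    #include <stdio.h>
    #include "work_queue.h"

    int main() {
        /* Create a queue listening on the default Work Queue port. */
        struct work_queue *q = work_queue_create(WORK_QUEUE_DEFAULT_PORT);
        if(!q) {
            fprintf(stderr, "could not create work queue\n");
            return 1;
        }

        /* Every task submitted to q is now wrapped with resource_monitor;
           each task's summary report is appended to this file (name is illustrative). */
        work_queue_enable_monitoring(q, "wq-monitor.summary");

        /* Submit a placeholder task. */
        struct work_queue_task *t = work_queue_task_create("sleep 10");
        work_queue_submit(q, t);

        /* Wait for outstanding tasks, then clean up. */
        while(!work_queue_empty(q)) {
            struct work_queue_task *done = work_queue_wait(q, 5);
            if(done) work_queue_task_delete(done);
        }

        work_queue_delete(q);
        return 0;
    }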

Monitoring with Condor

Unlike the previous examples, when using the resource_monitor directly with condor, you have to ship the resource_monitor with the job (here, as the executable), and declare the generated log files as output files. For example, consider the following submission file:

    universe = vanilla
    executable = /bin/echo
    arguments = hello condor
    output = test.output
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    log = condor.test.logfile
    queue

This can be rewritten, for example, as:

    universe = vanilla
    executable = /path/to/resource_monitor
    arguments = -O echo-log -- /bin/echo hello condor
    output = test.output
    transfer_output_files = echo-log.summary, echo-log.series, echo-log.files
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    log = condor.test.logfile
    queue

Output Format

The summary file has the following format:

    command:                   [the command line given as an argument]
    start:                     [seconds at the start of execution, since the epoch, float]
    end:                       [seconds at the end of execution, since the epoch, float]
    exit_type:                 [one of normal, signal or limit, string]
    signal:                    [number of the signal that terminated the process; only present if exit_type is signal, int]
    limits_exceeded:           [resources over the limit; only present if exit_type is limit, string]
    exit_status:               [final status of the parent process, int]
    max_concurrent_processes:  [the maximum number of processes running concurrently, int]
    wall_time:                 [seconds spent during execution, end - start, float]
    cpu_time:                  [user + system time of the execution, in seconds, float]
    virtual_memory:            [maximum virtual memory across all processes, in MB, int]
    resident_memory:           [maximum resident size across all processes, in MB, int]
    swap_memory:               [maximum swap usage across all processes, in MB, int]
    bytes_read:                [number of bytes read from disk, int]
    bytes_written:             [number of bytes written to disk, int]
    workdir_number_files_dirs: [total maximum number of files and directories of all the working directories in the tree, int]
    workdir_footprint:         [size in MB of all working directories in the tree, int]

The time-series log has a row per time sample. For each row, the columns have the following meaning:

    wall_clock                 [the sample time, since the epoch, in microseconds, int]
    concurrent_processes       [concurrent processes at the time of the sample, int]
    cpu_time                   [accumulated user + kernel time, in microseconds, int]
    virtual_memory             [current virtual memory size, in MB, int]
    resident_memory            [current resident memory size, in MB, int]
    swap_memory                [current swap usage, in MB, int]
    bytes_read                 [accumulated number of bytes read, int]
    bytes_written              [accumulated number of bytes written, int]
    workdir_number_files_dirs  [current number of files and directories, across all working directories in the tree, int]
    workdir_footprint          [current size of working directories in the tree, in MB, int]
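For illustration only, a summary for a short, mostly idle command might look roughly like the following; every value below is made up for the example and will differ from run to run:

    command: sleep 10
    start: 1376409600.000000
    end: 1376409610.050000
    exit_type: normal
    exit_status: 0
    max_concurrent_processes: 1
    wall_time: 10.050000
    cpu_time: 0.010000
    virtual_memory: 10
    resident_memory: 1
    swap_memory: 0
    bytes_read: 0
    bytes_written: 0
    workdir_number_files_dirs: 2
    workdir_footprint: 0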

Specifying Resource Limits

The limits file should contain lines of the form:

    resource: max_value

It may contain any of the following fields, in the same units as defined for the summary file:

    max_concurrent_processes, wall_time, cpu_time, virtual_memory, resident_memory,
    swap_memory, bytes_read, bytes_written, workdir_number_files_dirs, workdir_footprint

Thus, for example, to automatically kill a process after one hour, or if it is using 5GB of swap, we can create the following file, limits.txt:

    wall_time: 3600
    swap_memory: 5120

In makeflow we then specify:

    % makeflow -Mmonitor_logs --monitor-limits=limits.txt

Or with condor:

    universe = vanilla
    executable = /path/to/resource_monitor
    arguments = -O matlab-script-log --limits-file=limits.txt -- matlab < script.m
    output = matlab.output
    transfer_output_files = matlab-script-log.summary, matlab-script-log.series, matlab-script-log.files
    transfer_input_files = script.m, limits.txt
    should_transfer_files = yes
    when_to_transfer_output = on_exit
    log = condor.matlab.logfile
    queue
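The same limits file can also be used in stand-alone mode, for example (the output prefix log-matlab is only illustrative):

    % resource_monitor -O log-matlab --limits-file=limits.txt -- matlab < script.m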

For More Information

For the latest information about resource_monitor please subscribe to our mailing list.