Resource Monitor User's Manual
Overview
resource_monitor is a tool to monitor the computational resources used by the process created by the command given as an argument, and all its descendants. The monitor works indirectly, that is, by observing how the environment changed while a process was running, therefore all the information reported should be considered just as an estimate (this is in contrast with direct methods, such as ptrace). It works on Linux, and it can be used in three ways:
- Stand alone mode, directly calling the
resource_monitor
executable. - Activating the monitoring modes of makeflow and work queue applications.
- As a python module to monitor single function evaluations.
resource_monitor generates up to three log files: a JSON encoded summary file with the maximum values of resource used and the time they occurred, a time-series that shows the resources used at given time intervals, and a list of files that were opened during execution.
Additionally, resource_monitor may be set to produce measurement snapshots according to events in some files (e.g., when a file is created, deleted, or a regular expression pattern appears in the file.). Maximum resource limits can be specified in the form of a file, or a string given at the command line. If one of the resources goes over the limit specified, then the monitor terminates the task, and reports which resource went over the respective limits.
In systems that support it, resource_monitor wraps some libc functions to obtain a better estimate of the resources used.
Installing
See the Installation Instructions for the Cooperative Computing Tools package. Then, make sure to set your PATH
appropriately.
Running resource_monitor
On a terminal, type:
resource_monitor -O mymeasurements -- ls
This will generate the file mymeasurements.summary, with the resource
usage of the command ls
. Further:
resource_monitor -O mymeasurements --with-time-series --with-inotify -O mymeasurements -- ls
will generate three files describing the
resource usage of the command ls
. These files are mymeasurements.summary
, mymeasurements.series
, and mymeasurements.files
, in which PID
represents the corresponding process id. By default, measurements are taken
every second, and each time an event such as a file is opened, or a process
forks, or exits. We can specify the output names, and the sampling intervals:
resource_monitor -O log-sleep -i 2 -- sleep 10
The previous command will monitor sleep 10
, at two second intervals, and will
generate the files log-sleep.summary
, log-sleep.series
, and
log-sleep.files
. The monitor assumes that the application monitored is not
interactive.
To change this behaviour use the -f
switch:
resource_monitor -O my.summary -f -- /bin/sh
Output Format
The summary is JSON encoded and includes the following fields:
Field | Description |
---|---|
command | the command line given as an argument. |
start | time at start of execution, since the epoch. |
end | time at end of execution, since the epoch. |
exit_type | one of normal , signal or limit (a string). |
signal | number of the signal that terminated the process. Only present if exit_type is signal. |
cores | maximum number of cores used in a small time window. |
cores_avg | number of cores as cpu_time/wall_time. |
gpus | maximum number of gpus used. |
exit_status | final status of the parent process. |
max_concurrent_processes | the maximum number of processes running concurrently. |
total_processes | count of all of the processes created. |
wall_time | duration of execution, end - start. |
cpu_time | user+system time of the execution |
virtual_memory | maximum virtual memory across all processes. |
memory | maximum resident size across all processes. |
swap_memory | maximum swap usage across all processes |
bytes_read | amount of data read from disk. (in MB) |
bytes_written | amount of data written to disk. (in MB) |
bytes_received | amount of data read from network interfaces. (in MB) |
bytes_sent | amount of data written to network interfaces. (in MB) |
bandwidth | maximum bandwidth used. (in Mbps) |
total_files | total maximum number of files and directories of all the working directories in the tree. |
disk | size of all working directories in the tree. |
limits_exceeded | resources over the limit with -l, -L options (JSON object). |
peak_times | seconds from start when a maximum occured (JSON object). |
snapshots | List of intermediate measurements, identified by snapshot_name (JSON object). |
The time-series log has a row per time sample. For each row, the columns have the following meaning:
Field | Description |
---|---|
wall_clock | the sample time, since the epoch, in microseconds. |
cpu_time | accumulated user + kernel time, in microseconds. |
cores | current number of cores used. |
max_concurrent_processes | concurrent processes at the time of the sample. |
virtual_memory | current virtual memory size, in MB. |
memory | current resident memory size, in MB. |
swap_memory | current swap usage, in MB. |
bytes_read | accumulated number of bytes read, in bytes. |
bytes_written | accumulated number of bytes written, in bytes. |
bytes_received | accumulated number of bytes received, in bytes. |
bytes_sent | accumulated number of bytes sent, in bytes. |
bandwidth | current bandwidth, in bps. |
total_files | current number of files and directories, across all working directories in the tree. |
disk | current size of working directories in the tree, in MB. |
Specifying Resource Limits
Resource limits can be specified with a JSON object in a file in the same
format as the output format . Only resources specified in the file are
enforced. Thus, for example, to automatically kill a process after one hour,
or if it is using 5GB of swap, we can create the following file limits.json
:
{ "wall_time": [3600, "s"], "swap_memory": [5, "GB"] }
and set limits to the execution with:
resource_monitor -O output --monitor-limits=limits.json -- myapp `
Snapshots
The resource_monitor can be directed to take snapshots of the resources used according to the files created by the processes monitored. The typical use of monitoring snapshots is to set a watch on a log file, and generate a snapshot when a line in the log matches a pattern.
Snapshots are specified via a JSON-encoded file with the following syntax:
{
"FILENAME": {
"from-start":boolean,
"from-start-if-truncated":boolean,
"delete-if-found":boolean,
"events": [
{
"label":"EVENT_NAME",
"on-create":boolean,
"on-truncate":boolean,
"on-pattern":"REGEXP",
"count":integer
},
{
"label":"EVENT_NAME",
...
}
]
},
"FILENAME": {
...
},
...
Field | Type | Description | Default |
---|---|---|---|
FILENAME | string | Name of a text file to watch. | |
from-start | boolean | If FILENAME exits when the monitor starts running, process from line 1. | false |
from-start-if-truncated | boolean | If FILENAME is truncated, process from line 1. | true |
delete-if-found | boolean | Delete FILENAME when found. | false |
events | array | See following table |
Events fields:
Field | Type | Description | Default |
---|---|---|---|
label | string | Name that identifies the snapshot. Only alphanumeric, -, and _ characters are allowed. | required |
on-create | boolean | Take a snapshot every time the file is created. | false |
on-delete | boolean | Take a snapshot every time the file is deleted. | false |
on-truncate | boolean | Take a snapshot when the file is truncated. | false |
on-pattern | boolean | Take a snapshot when a line matches the regexp pattern. | |
count | integer | Maximum number of snapshots for this label. | -1 (no limit) |
As an example, assume that 'myapp' goes through three stages during execution: start, processing, and analysis, and that it indicates the current stage by writing a line to 'my.log' of the form '# STAGE'. We can direct the resource_monitor to take a snapshot at the beginning of each stage as follows:
File: snapshots.json:
{ "my.log": {
"events":[
{ "label":"file-created", "on-creation":true },
{ "label":"started", "on-pattern":"^# START" },
{ "label":"end-of-start", "on-pattern":"^# PROCESSING" },
{ "label":"end-of-processing", "on-pattern":"^# ANALYSIS" },
{ "label":"file-deleted", "on-deletion":true }]}}
resource_monitor -O output --snapshots- file=snapshots.json -- myapp
Snapshots are included in the output summary
file as an array of JSON objects under the key snapshots
. Additionally, each
snapshot is written to a file output.snapshot.N
, where N
is 0,1,2,...
As another example, the monitor can generate a snapshot every time a particular file is created. The monitor can detected this file, generate a snapshot, and delete the file to get ready for the next snapshot. In the following example the monitor takes a snapshot everytime the file please-take-a-snapshot is created:
{
"please-take-a-snapshot":
{
"delete-if-found":true,
"events":[
{
"label":"manual-snapshot",
"on-create":true
}
]
}
}
Integration with other CCTools
Makeflow mode
If you already have a makeflow file, you can activate the resource_monitor by
giving the --monitor
option to makeflow with a desired output directory, for
example:
makeflow --monitor monitor_logs Makeflow
In this case, makeflow wraps every command line rule with the monitor, and
writes the resulting logs per rule in the directory monitor_logs
.
Work-queue mode
From Work Queue in python, monitoring is activated with:
import work_queue as wq
q = wq.WorkQueue(port)
q.enable_monitoring()
Limits for a task are set by defining a category
of tasks. All tasks in a
category are assumed to use a similar quantity of resources:
# creating a category by assigning maximum resources:
q.specify_category_max_resources('my-category', {"cores": 1, "memory":512})
t = wq.Task(...)
t.specify_category('my-category')
...
t = q.wait(5)
if t:
print("cores: {}".format(t.resources_measured.cores))
print("memory: {}".format(t.resources_measured.memory))
print("disk: {}".format(t.resources_measured.disk))
print("bytes_read: {}".format(t.resources_measured.bytes_read))
...
if t.limits_exceeded:
# any resource above a specified limit is different from -1:
if t.limits_exceeded.memory != -1:
...
Similarly, in C:
q = work_queue_create(port);
/* wraps every task with the monitor, and appends all generated summary files
* into the file `some-log- file. */
work_queue_enable_monitoring(q, "some-log-file", /* kill tasks on exhaustion */ 1);
...
struct work_queue_task *t = work_queue_wait(q, 5);
if(t) {
/* access resources measured with t->resources_measured->{cores,disk,memory,...} */
/* and limits exceeded with: */
if(t->resources_measured->limits_exceeded) {
if(t->resources_measured->limits_exceeded->cores != -1) { ... }
}
}
Monitoring with Condor
Unlike the previous examples, when using the resource_monitor
directly with
condor , you have to specify the resource_monitor
as an input file, and
the generated log files as output files. For example, consider the following
submission file:
universe = vanilla
executable = matlab
arguments = -r "run script.m"
output = matlab.output
transfer_input_files=script.m
should_transfer_files = yes
when_to_transfer_output = on_exit
log = condor.matlab.logfile
queue
This can be rewritten for monitoring as:
universe = vanilla
executable = resource_monitor
arguments = -O matlab-resources --limits-file=limits.json -r "run script.m"
output = matlab.output
transfer_input_files=script.m,limits.json,/path/to/resource_monitor
transfer_output_files=matlab-resources.summary
should_transfer_files = yes
when_to_transfer_output = on_exit
log = condor.matlab.logfile queue
Monitoring functions in python
With the resource_monitor
python
module
python module, function evaluations can be monitored with resource limits
enforcement.
To monitor already defined functions, use the monitored
function. This
creates a new function that returns a tuple of the original result, and a
dictionary of the resources used:
import resource_monitor
def my_sum(a,b):
return a + b
my_sum_monitored = resource_monitor.monitored()(my_sum)
original_result = my_sum(1,2)
(monitored_result, resources) = my_sum_monitored(1,2)
print('function used ' + str(resources['cores']) + ' cores')
assert(original_result, monitored_result)
Or more directly, use it as decorator:
import resource_monitor
@resource_monitor.monitored()
def my_sum_decorated(a,b):
return a + b
(monitored_result, resources) = my_sum_decorated(1,2)
With the function resource_monitor.monitored
, we can specify resource limits
to be enforced. For example, if we simply want to enforce for a function not to
use for more than a megabyte of memory:
import resource_monitor
@resource_monitor.monitored(limits = 'memory': 1024, return_resources = False)
def my_sum_limited(a,b):
return a + b
try:
# Note that since we used return_resources = False, the return value of the
# function is not modified:
x = my_sum_limited(1,2)
except resource_monitor.ResourceExhaustion as e:
print(e.resources.limits_exceeded)
For a list of all the resources that can be monitored and enforced, please consult the documentation of the module.
Further, a function callback can be specified. This callback will be executed at each measurement. As an example, we can use a callback to send messages to a server with the resources measured:
import resource_monitor
# monitor callback function example
# a callback function will be called everytime resources are measured.
# arguments are:
# - id: unique identifier for the function invocation
# - fun_name: string with the name of the function
# - step: resource sample number (1 for the first, 2 for the second, ..., -1 for the last)
# - resources: dictionary with resources measured
def send_udp_message(id, fun_name, step, resources):
""" Send a UDP message with the results of a measurement. """
import socket
import json
finished = True if step == -1 else False
exhaustion = True if resources.get('limits_exceeded', False) else False
msg = {'id': id, 'function': fun_name, 'finished': finished, 'resource_exhaustion': exhaustion, 'resources': resources}
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(json.dumps(msg).encode(), ('localhost', 9800))
# Create the monitored function addint the callback. Also, set the interval measurements to 5s, instead of the default 1s
@resource_monitor.monitored(callback = send_udp_message, interval = 5, return_resources = False)
def my_function_monitored(...):
...
my_function_monitored(...)
# in another script, run the server as:
import socket
import pickle
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('localhost', 9800))
while True:
data, addr = sock.recvfrom(1024)
print("message: ", pickle.loads(data))
Warning
The monitored function and the callback are executed in a different process from the calling environment. This means that they cannot modify variables from the calling environment.
For example, the following will not work as you may expect:
# Note: This code does not work!!!
import resource_monitor
function_has_run = False
resources_series = []
def my_callback(id, fun_name, step, resources):
resources_series.append(resources)
@resource_monitor.monitored(callback = my_callback):
def my_function():
function_has_run = True
my_function()
# here function_has_run is still False, resources_series is [].
Please see an example here that shows how to construct a time series of the resources, and makes it available to the calling environment.
Further Information
For more information, please see Getting Help or visit the Cooperative Computing Lab website.
Copyright
CCTools is Copyright (C) 2022 The University of Notre Dame. This software is distributed under the GNU General Public License Version 2. See the file COPYING for details.