Setting up a minimal CMS production system for newcomers¶
Introduction¶
The production system processes data that is either generated by the CMS detector or simulated for it. The workload of the system is described by requests, which specify:
- Input data, if any. (Simulations may not have input data.)
- Sequential steps the data will go through. (With names such as RECO, GEN, SIM, etc.)
- Number of lumisections. A lumisection is an interval of time (in 2020, about 23 s).
- Number of events per lumisection. An event is a beam crossing inside the CMS detector.
In this guide we do not care too much about the physical meaning of a lumisection, or of steps such as RECO. They are only important as parameters that modify the size of the request during testing. (E.g., while testing new code, you may want to process only a handful of lumisections and events; you would not want to wait hours just to find a syntax error.)
A very important thing to remember is that a request only specifies what to do, not how and when to do it, nor where the data comes from or where it will be processed.
The how, when, and where is set by two subsystems:
- Core services. The core services have four main responsibilities:
  - Maintain the state of the request, that is, determine whether the request is new, running, completed, etc. This is done by the `reqmgr2` service.
  - Split the request into processing chunks, called workqueue elements. The split is done by the `workqueue` service. The workload of a request is highly parallel: for the most part each lumisection can be processed independently of the others. The core services group lumisections into workqueue elements so that each element has some desired characteristic; in 2020, each group should take about 8 hours of processing. (Our test requests will run much shorter than this.)
  - Maintain the frontend of the system, such as the http interface. This is done by the `frontend` service.
  - Maintain the records for all of the requests (the `couchdb` service).
- WMAgent. The WMAgent has three main responsibilities:
  - Transform the workqueue elements into concrete commands that can be executed. This means staging any data needed, and creating batch jobs (specifically condor jobs).
  - Maintain the state of the batch jobs. (Did a job finish? Should it be retried? etc.)
  - Serve as a foreman to the compute nodes that actually execute the jobs. In the real production system these compute nodes are machines running CMSglideinWMS that connect to the WMAgent. For simplicity, in this guide we will not deal with CMSglideinWMS; instead, the jobs will execute on the same machine that runs the WMAgent. This is enough for the small tests we will run.
Execution happens somewhere else
In the real production system neither the core services nor the WMAgent actually executes the workload. The core services process descriptions of the workloads, and the WMAgent transforms the descriptions into executable jobs, but the actual payload of the jobs is executed via CMSglideinWMS running on thousands of other hosts.
Note
There are more core services than the ones mentioned here.
In the real production system there is a single instance of the core services (cmsweb.cern.ch) and many instances of the WMAgent. Roughly, WMAgents are grouped into teams, and requests can be routed to be processed by particular teams. This is done so that requests can take advantage of, among other things, data locality or particular hardware. For this guide we will have a single WMAgent, so we will make sure that all requests are routed to it.
Both the core services and the WMAgent are collections of processes. This means that there is no single program to which you can point and say 'this is the WMAgent'. Instead, each process in the collection has a responsibility, such as creating jobs, collecting results, or making sure other processes are alive. These processes communicate with each other via databases running alongside the core services and the WMAgent.
Keeping names straight
Both the core services and the WMAgent live in a git repository called WMCore. In this git repository there are Python subsystems, one of which is also called WMCore, and which for the most part implements the core services. The WMAgent is implemented in the Python subsystem WMComponent, which is a sibling of the Python subsystem WMCore. Files in WMComponent use code from the module WMCore.
Prerequisites¶
- Ability to create virtual machines.
- Ability to create a grid user certificate. The instructions in this guide assume you have a CERN account when setting up the certificates.
- Ability to ssh to lxplus.cern.ch and to access /afs/cern.ch.
  - If you are not authorized to ssh to lxplus.cern.ch, file a ticket with the CERN Service Desk.
  - If the connection drops immediately after you enter your password, you are probably not subscribed to the AFS service. You can subscribe here: https://resources.web.cern.ch/resources/Manage/AFS/Subscribe.aspx?login=YOUR_CERN_USERID
Common Setup¶
We are going to set up two virtual machines, which we will call `USER-wmcore` and `USER-wmagent`. Replace USER with your CERN account id. On `USER-wmcore` we are going to run the core services and an xrootd server, and on `USER-wmagent` the WMAgent. The xrootd server is where jobs will write the results of their computation, and where computation steps share information. We set it up this way because, if you are reading this guide, you most likely do not have write access to the official CMS servers. (And since we are testing new code, that is a good thing: we do not have to worry about disrupting the public systems.)
Setting up user certificates¶
We need to create host certificates for `USER-wmcore.cern.ch`, as well as set up some common software to manage them. Before we can create the host certificates, we need to install a user certificate. This certificate will live in your afs home, in the directory `~/.globus`.
To check whether you already have a valid user certificate, type:
openssl x509 -subject -dates -noout -in ~/.globus/usercert.pem
If you get an error such as `unable to load certificate`, or your current date falls outside the `notBefore` and `notAfter` dates printed, then you need to install a new certificate. For CERN:
- Request a new grid user certificate here: https://ca.cern.ch/ca
- Follow the prompts to install the certificate in your web browser.
- Go to the certificate preferences and export your certificate. (In Firefox, select your certificate and click `Backup...`. This will generate the file `myCertificate.p12`.)
- Copy the certificate to your home in afs, e.g., `scp myCertificate.p12 USER@lxplus.cern.ch:`
The `.p12` file has both the public and private keys. We separate them into the public `usercert.pem` and the private `userkey.pem` by logging into `lxplus.cern.ch` and running:
mkdir -p ~/.globus
cd ~/.globus
rm -f usercert.pem userkey.pem
openssl pkcs12 -in ~/myCertificate.p12 -clcerts -nokeys -out usercert.pem
openssl pkcs12 -in ~/myCertificate.p12 -nocerts -out userkey.pem
chmod 400 userkey.pem
chmod 400 usercert.pem
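If you want to double-check that the extracted certificate and key belong together, one way is to compare their public moduli (the second command will prompt again for the key password):
# The two md5 digests should be identical:
openssl x509 -noout -modulus -in ~/.globus/usercert.pem | openssl md5
openssl rsa -noout -modulus -in ~/.globus/userkey.pem | openssl md5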
Setting up the core services¶
Warning
Unless otherwise noted, run the commands as a regular user. We will use `sudo` when needed.
Creating the virtual machine¶
Register your ssh key pair¶
When you create a virtual machine through https://openstack.cern.ch, the only way you can access that machine is via ssh using a preregistered key. You cannot use a login/password, so if you do not register a key pair, the virtual machine will be inaccessible.
- To create a key pair, ssh to `lxplus.cern.ch`, type `ssh-keygen`, and accept the defaults. Note that you may already have a key (check with `ls ~/.ssh/id_rsa.pub`), in which case you may want to use that one, or generate a key with another name (see the sketch after this list).
- To register your key:
  - Go to https://openstack.cern.ch.
  - Click on `Key Pairs`, under the heading `Compute` (https://openstack.cern.ch/project/key_pairs).
  - Click on `Import Public Key`.
  - Give your key a name, and copy the contents of the public part of the key pair (i.e. the contents of `~/.ssh/id_rsa.pub`) into the `Public key` box.
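If you prefer a key with another name, a minimal sketch is (the file name `openstack_key` is arbitrary):
# Generate a dedicated key pair for openstack:
ssh-keygen -t rsa -f ~/.ssh/openstack_key
# Print the public part, to be pasted into the Public key box:
cat ~/.ssh/openstack_key.pub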
Launching the virtual machine instance¶
- Go to https://openstack.cern.ch.
- Click on `Instances`, under the heading `Compute` (https://openstack.cern.ch/project/instances).
- Click on `Launch an instance`.
- Fill the instance name with: `USER-wmcore`.
- For `Source`, click on the up arrow of `CC7 Test - x86-64`.
- For `Flavor`, select `m2.medium`.
- For `Key Pair`, select the name of the key you registered in the step above.
- Finally, click on `Launch Instance`.
After about 10 minutes, ssh to `lxplus.cern.ch` and try to log in to your vm with: `ssh USER-wmcore.cern.ch`. If your key does not have the default name, use `ssh -i ~/.ssh/OTHERNAME USER-wmcore.cern.ch`.
If you cannot connect to your virtual machine
The virtual machines exist in a CERN private network. A command such as `ssh USER-wmcore.cern.ch` will only work if your network access comes from within CERN. From outside CERN you first need to ssh to `lxplus.cern.ch` and then ssh to your virtual machine from there.
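If your OpenSSH client is recent enough, you can also jump through lxplus in a single command, for example:
# Use lxplus as a jump host to reach the VM directly:
ssh -J USER@lxplus.cern.ch USER@USER-wmcore.cern.ch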
Setting up host certificates¶
Warning
You first need to set up your user certificate as described above.
The easiest way to create a host certificate for `USER-wmcore.cern.ch` is to use a deployment script that lives in the repository git://github.com/dmwm/deployment.git.
The recommended instructions (taken from https://cms-http-group.web.cern.ch/cms-http-group/tutorials/environ/vm-setup.html) are:
sudo -l
sudo yum -y install git.x86_64
mkdir -p /tmp/foo
cd /tmp/foo
git clone git://github.com/dmwm/deployment.git cfg
cfg/Deploy -t dummy -s post $PWD system/devvm
rm -fr /tmp/foo
sudo yum install nano
sudo yum install unzip.x86_64 zip.x86_64
sudo yum install libXcursor libXrandr libXi libXinerama
The previous step generated `/data/certs/host{cert,key}.pem` by making a request to https://ca.cern.ch/ca. This process used the key `~/.globus/userkey.pem` to authenticate to the service running at https://ca.cern.ch/ca. The certificate and key are also copied to `/etc/grid-security/host{cert,key}.pem`.
Then, we will let the core services authenticate as your user when needed:
mkdir -p /data/auth/wmcore
cp ~/.globus/usercert.pem /data/auth/dmwm-service-cert.pem
# Decrypt your key, because the core services cannot enter the
# key password. (There must be a safer and better way to do this. Investigating...)
openssl rsa -in ~/.globus/userkey.pem -out /data/auth/dmwm-service-key.pem
# warning: you may be tempted to use symbolic links instead of making a copy of
# the file. Do not do this if your home lives in /afs. When you log out the
# files will become inaccessible and the link will be broken.
Finally, we set the correct permissions for all of the keys and certificates:
sudo chgrp -R _sw /data/auth
sudo chmod ug=r,o-rwx $(find /data/auth -type f)
sudo chmod u=rwx,g=rx,o-rwx $(find /data/auth -type d)
Custom virtual machine configuration¶
Turning off the firewall¶
If your virtual machines are running on the CERN openstack, they are hidden from inbound traffic from the rest of the world. You could add rules to allow your custom WMAgent to contact your custom core services, but for our test purposes it is easier to simply turn off the firewall:
sudo systemctl stop firewalld
sudo systemctl disable firewalld
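As a quick sanity check, you can confirm the firewall is stopped and will not restart on reboot:
# Should report "inactive" and "disabled", respectively:
sudo systemctl is-active firewalld
sudo systemctl is-enabled firewalld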
Now we are ready to install the core services. First we will deploy the core services, then modify the appropriate configuration files, and finally start the services.
Installing the core services (WMCore)¶
Obtaining the core services source code¶
(These instructions were adapted from https://github.com/dmwm/WMCore/wiki/Deploy-central-services-in-a-VM.)
First we need to clone the WMCore repository. It is recommended that you use the directory names and paths shown, as it will ease communication when you ask for help, and also their values may be hardcoded somewhere.
cd /data
git clone git://github.com/dmwm/deployment.git cfg
cd cfg
Now you need to choose a deployment version. When developing, this is the point where you would create git branches and apply your modifications. For now, we will use a release version (the value of RELEASE) used in production.
export RELEASE=HG1912e
cd /data/cfg && git reset --hard ${RELEASE}
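If you are unsure which release tags are available, you can list them first (tags follow the HGyymm naming shown above):
# List the deployment release tags (names like HG1912e):
cd /data/cfg && git tag -l 'HG19*'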
Deploying the core services¶
The deployment repository has a script to deploy the services. Run it as:
REPO="-r comp=comp"
ARCH=slc7_amd64_gcc630
cd /data
/data/cfg/admin/InstallDev -R comp@${RELEASE} -A ${ARCH} -s image -v ${RELEASE} -a ${PWD}/auth ${REPO} -p "admin frontend couchdb reqmgr2 reqmgr2ms workqueue reqmon t0_reqmon acdcserver"
This installs the core services at `/data/srv/${RELEASE}`, with a link `/data/srv/current` pointing to `/data/srv/${RELEASE}`.
This also sets some tasks to execute periodically as cron jobs. You can see them now with:
crontab -l
Once the core services are deployed, we need to copy the key and certificate to locations where the services expect them by default:
# The lines with chmod may fail if this is the first time you copy the certificates. This is ok.
sudo chmod 660 /data/srv/current/auth/{reqmgr2,workqueue,acdcserver,reqmon,t0_reqmon,reqmgr2ms}/dmwm-service-{cert,key}.pem
sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/acdcserver/
sudo chown _acdcserver._config /data/srv/current/auth/acdcserver/dmwm-service-{cert,key}.pem
sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/reqmgr2/
sudo chown _reqmgr2._config /data/srv/current/auth/reqmgr2/dmwm-service-{cert,key}.pem
sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/reqmgr2ms/
sudo chown _reqmgr2ms._config /data/srv/current/auth/reqmgr2ms/dmwm-service-{cert,key}.pem
sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/reqmon/
sudo chown _reqmon._config /data/srv/current/auth/reqmon/dmwm-service-{cert,key}.pem
sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/t0_reqmon/
sudo chown _t0_reqmon._config /data/srv/current/auth/t0_reqmon/dmwm-service-{cert,key}.pem
sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/workqueue/
sudo chown _workqueue._config /data/srv/current/auth/workqueue/dmwm-service-{cert,key}.pem
sudo chmod 440 /data/srv/current/auth/{reqmgr2,workqueue,acdcserver,reqmon,t0_reqmon,reqmgr2ms}/dmwm-service-{cert,key}.pem
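Equivalently, the copies and ownership changes above can be written as a loop over the services:
# Same files and same per-service owners as the explicit commands above:
for service in reqmgr2 reqmgr2ms workqueue reqmon t0_reqmon acdcserver; do
    sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/${service}/
    sudo chown _${service}._config /data/srv/current/auth/${service}/dmwm-service-{cert,key}.pem
done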
Modifying configuration files¶
There are configuration files we will need to modify. Some of them live in `/data/cfg`, while others live in the installation directory. We make a copy of them in our afs home directory so that we do not have to worry about committing their changes in our git branches, or losing them when trying a new release.
mkdir -p ~/local-cfg/core
Whitelisting the custom WMAgent¶
The first changes relate to allowing our custom WMAgent to work with our custom core services. First we copy the appropriate configuration files:
for service in {reqmgr2,reqmon,workqueue}; do
cp /data/srv/${RELEASE}/config/${service}/config.py ~/local-cfg/core/${service}_config.py
done
# Note that we use the config files installed at
# /data/srv/${RELEASE}/config/${service}/config.py instead of those at
# /data/cfg/config/${service}/config.py.
# This is because the deployment script fills in some values, such as the
# hostname of the machine that will run the core services.
We now declare our custom WMAgent in our copy of the configuration files. The way we do this is to find the place where an official WMAgent host is named (in this case `vocms0127`), and add the hostname of our agent. Here we assume that the host that will run the WMAgent has a name of the form `USER-wmagent`. Thus, in the following line, replace USER with your CERN user id:
sed -i 's/or HOST.startswith("vocms0127"):/or HOST.startswith("vocms0127") or HOST.startswith("USER-wm"):/g' ~/local-cfg/core/{reqmgr2,reqmon,workqueue}_config.py
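To confirm the substitution took effect, you can grep for your agent's hostname in the copied files (again with USER replaced by your CERN user id):
# Each file should now contain a line mentioning your agent's hostname:
grep -n 'USER-wm' ~/local-cfg/core/{reqmgr2,reqmon,workqueue}_config.py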
Give the correct production permissions to your CERN ID and your WMAgent¶
When users connect to the core services via http, they will be asked to provide a certificate for authentication. This certificate is the one you installed in your web browser in a previous step of this guide. Using this authentication, the user gains roles according to the file `/data/srv/state/frontend/etc/authmap.json`. This file is a copy of the global file used for the whole CERN infrastructure, and lists the roles granted to users and machines.
If you are reading this guide, most likely you do not have the correct roles to operate the actual production system. Thus, we will add such roles to `authmap.json`. However, note that since user ids and machines are added to and deleted from `authmap.json` by CERN all the time, this file is updated by a cron job every four minutes. This means that any changes you make to this file would be lost.
The first thing we need to do is to disable the cron job:
crontab -e
Comment out the line (add a `#` at the start of the line) that reads:
*/4 * * * * . /data/srv/current/apps/frontend/etc/profile.d/init.sh && PYTHONPATH=/data/srv/current/auth/frontend:$PYTHONPATH /data/srv/current/config/frontend/mkauthmap -c /data/srv/current/config/frontend/mkauth.conf -o /data/srv/state/frontend/etc/authmap.json
Now, back in the terminal, we obtain a copy of `authmap.json` with the command:
(. /data/srv/current/apps/frontend/etc/profile.d/init.sh && PYTHONPATH=/data/srv/current/auth/frontend:$PYTHONPATH /data/srv/current/config/frontend/mkauthmap -c /data/srv/current/config/frontend/mkauth.conf -o ~/local-cfg/core/authmap.json)
Edit the file ~/local-cfg/core/authmap.json and:
Make sure that the line which has your CERN userid has the required roles for production. For most users, this means replacing:
"ROLES": {"user": ["group:users"]}
with
"ROLES": {"user": ["group:users"], "admin": ["group:reqmgr"], "data-manager": ["group:reqmgr"], "production-operator": ["group:dataops"]}
After the line with your userid, insert the following lines to give read/write permissions to the hosts running your custom installations. Remember to replace USER with your CERN user id:
{"DN": "/DC=ch/DC=cern/OU=computers/CN=USER-wmcore.cern.ch", "ID": null, "LOGIN": "USER-wmcore@cern.ch", "NAME": "USER-wmcore Service", "ROLES": {"data-manager": ["group:reqmgr"], "production-operator": ["group:dataops"], "user": ["group:users"]}},
{"DN": "/DC=ch/DC=cern/OU=computers/CN=USER-wmagent.cern.ch", "ID": null, "LOGIN": "USER@cern.ch", "NAME": "USER-wmagent Service", "ROLES": {"data-manager": ["group:reqmgr"], "production-operator": ["group:dataops"], "user": ["group:users"]}},
Read and write access permissions to the core services¶
(Adapted from users.db constructed from: https://github.com/dmwm/WMCore/wiki/vm-access-to-other-user)
(See also /data/srv/current/config/frontend/deploy, which generated the defaults.)
As explained above, a user or machine authenticating with a certificate gains roles as dictated by the `authmap.json` file. For the core services, we also need to declare what a certain role entails, such as being able to create a request, access statistics of the workflows, etc.
These permissions are maintained in a sqlite database used by the frontend core service.
We set the correct permissions as follows:
cp /data/srv/current/auth/frontend/users.db ~/local-cfg/core/users.db
sqlite3 ~/local-cfg/core/users.db
At the sqlite3 prompt, type the following queries. Remember to change USER to your CERN userid:
DELETE FROM cms_name WHERE name = 'T3_NX_Foobar';
INSERT INTO cms_name VALUES (1, 'T3_NX_AnyNameYouWant');
DELETE FROM site WHERE name = 'Foobar';
INSERT INTO site VALUES (1, 'USER-wmcore');
DELETE FROM user_group WHERE name = 'foo';
INSERT into user_group values (1, 'reqmgr');
INSERT into user_group values (2, 'dataops');
DELETE FROM role WHERE title = 'bar';
INSERT INTO role VALUES (1, 'admin');
INSERT INTO role VALUES (2, 'production-operator');
INSERT INTO role VALUES (3, 'data-manager');
-- The script at /data/srv/current/config/frontend/deploy added with id 1 a line
-- with CN=USER, where USER is your CERN userid.
-- You can see this with: SELECT * FROM contact WHERE id = 1;
INSERT INTO contact VALUES (2, 'Service', 'USER-wmcore', 'USER-wmcore', '/DC=ch/DC=cern/OU=computers/CN=USER-wmcore.cern.ch');
INSERT INTO contact VALUES (3, 'Service', 'USER-wmagent', 'USER-wmagent', '/DC=ch/DC=cern/OU=computers/CN=USER-wmagent.cern.ch');
-- For the table group_responsibility, the columns are (contact, role,
-- user_group). thus, to give our user, custom services and WMAgent (ids 1, 2
-- and 3 in table contact) permissions to create/read requests, and read/write
-- statistics (the production-operator and dataops roles) we add:
-- USER with role 'admin' of group 'reqmgr'. Commented out as already added by .../frontend/deploy
-- INSERT INTO group_responsibility VALUES (1, 1, 1);
-- USER with role 'production-operator' of group 'dataops'
INSERT INTO group_responsibility VALUES (1, 2, 2);
-- USER with role 'data-manager' of group 'reqmgr'
INSERT INTO group_responsibility VALUES (1, 3, 1);
-- USER-wmcore and USER-wmagent with roles 'production-operator' of
-- group 'dataops', and 'data-manager' of group 'reqmgr'.
INSERT INTO group_responsibility VALUES (2, 2, 2);
INSERT INTO group_responsibility VALUES (3, 2, 2);
INSERT INTO group_responsibility VALUES (2, 3, 1);
INSERT INTO group_responsibility VALUES (3, 3, 1);
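Before deploying the modified database, you can inspect the rows we just added from the shell:
# Dump the tables we edited; run this from the shell, outside the sqlite3 prompt:
sqlite3 ~/local-cfg/core/users.db 'SELECT * FROM contact; SELECT * FROM group_responsibility;'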
Deploying our custom configuration files¶
Warning
You may be tempted to use symbolic links to the contents of `~/local-cfg` instead of copying the files. If `~/local-cfg` lives in /afs, this will not work once you close your terminal, as /afs will become inaccessible from `USER-wmcore`.
In this step we copy our custom configuration files to the correct locations. Remember to set the `RELEASE` variable as appropriate, as we did when obtaining the core services source code.
# core services
for service in {reqmgr2,reqmon,workqueue}; do
sudo cp ~/local-cfg/core/${service}_config.py /data/srv/${RELEASE}/config/${service}/config.py
sudo chown _sw:_config /data/srv/${RELEASE}/config/${service}/config.py
done
# authmap.json
cp ~/local-cfg/core/authmap.json /data/srv/state/frontend/etc/authmap.json
# sqlite db
sudo cp ~/local-cfg/core/users.db /data/srv/${RELEASE}/auth/frontend/users.db
sudo chown _sw:_config /data/srv/${RELEASE}/auth/frontend/users.db
Starting/Stopping the core services¶
The core services can be controlled via the script `/data/cfg/admin/InstallDev`. This script has to be executed from the `/data` directory. For example, to check the status of the services:
Now, to start (or restart) the core services:
(cd /data && /data/cfg/admin/InstallDev -s start)
Note, however, that when you stop the services, as in:
(cd /data && /data/cfg/admin/InstallDev -s stop)
some python processes may survive and will need to be killed by hand (see Removing an installation below).
You can also (re)start particular services. For example, to only restart the `reqmgr2` service:
(cd /data && /data/cfg/admin/InstallDev -s start:reqmgr2)
Connecting to the frontend¶
You can interact with the core services via https. If your workstation/laptop is at CERN, you can simply access them by pointing your web browser to:
https://USER-wmcore.cern.ch
Warning
The `https://` part is important; otherwise you will get a `Bad request` error message.
If you are not at CERN, you need to make an ssh tunnel that redirects a given local address to that of the core services. You can do this for example as follows:
ssh -L localhost:8443:USER-wmcore.cern.ch:443 USER@lxplus.cern.ch
In the previous command we are mapping the local address and port `localhost:8443` to the address and port `USER-wmcore.cern.ch:443`. Port `443` is the official port for https, and port `8443` is an arbitrary convention when tunneling https. (You could change 8443 to another port number. Note that ports below 1025 would require `sudo ssh ...`, which is not recommended.)
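If you open this tunnel often, you can store it in your ssh client configuration instead of retyping the command; the alias `wmcore-tunnel` below is arbitrary:
# Append a tunnel shortcut to your ssh client configuration:
cat >> ~/.ssh/config <<'EOF'
Host wmcore-tunnel
    HostName lxplus.cern.ch
    User USER
    LocalForward localhost:8443 USER-wmcore.cern.ch:443
EOF
# Then opening the tunnel is simply:
ssh wmcore-tunnel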
Now you can point your web browser to:
https://localhost:8443
Once the web page loads, you can click on the information of several services. Only the links for `Request Manager 2`, `Request Monitor`, and `Work Queue` will be functional. When you click on any of these links you will be asked to accept a host certificate exception, and then to confirm the user certificate that you want to use. This is the certificate you installed in your browser in a previous step.
If, when clicking on a link, you get the message:
{"error":"unauthorized","reason":"Authorisation failed"}
then your user id has not been correctly configured in the `authmap.json` file, which we edited in a previous step of this guide.
Debugging information and logs¶
The logs of the core services are located at `/data/srv/logs`. In particular, at this early stage you want to pay attention to:
- frontend/error_log_YYYYMMDD.log: Here you will see, among others, errors with certificates. If so, check that your certificates are installed correctly.
- frontend/access_log_YYYYMMDD.log: This will show you all the connection attempts, together with which certificates are being used per connection.
- reqmgr2/reqmgr2-YYYYMMDD.log: Eventually you will see the lifetime of a request here. For this setup, errors that you may see include:
  - `unable to set private key file`: There is an error with the certificate. For example, the file may not exist, or it is prompting for a password. Check the steps on user and host certificates above.
  - `CouchForbidden: CouchForbidden`: You may be using the incorrect key/certificate. In particular, check the frontend error log. One possibility is that you are using the host key created earlier rather than a user key. (The frontend will complain with `rejecting non-vo certificate`.)
- workqueue/workqueue-YYMMDD.log: Certificate errors will be similar to reqmgr2.
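To follow a log while you test, `tail -f` is usually enough, for example (the date suffix assumes the naming shown above):
# Follow today's frontend error log (adjust the path for other services):
tail -f /data/srv/logs/frontend/error_log_$(date +%Y%m%d).log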
Removing an installation¶
To remove your installation, say to try another one, first stop the services and clear the cron jobs. Then, to start with a clean installation, remove the install directory of the release, the logs, and the state:
export RELEASE=HG1912e
# Stop the services:
(cd /data && /data/cfg/admin/InstallDev -s stop)
# Clear the cron jobs
crontab -r
# Kill all python programs that may have survived .../InstallDev -s stop
killall python
# Finally remove the installation directory, logs, and core services state:
cd /data/srv
sudo rm -rf current enabled ${RELEASE} logs state
Setting up xrootd¶
Our test system includes an xrootd server. This is because the outputs of the different steps in a request are staged-out to xrootd servers at CERN. As a new developer, you most likely will not have write access to those servers, thus we will redirect all these write operations to a server we can control.
This xrootd server could be installed on the machine that runs the core services, on the one that runs the WMAgent, or even on a third machine. We choose to install it together with the core services so that we do not have to create a third machine, but also so that we can test that jobs can stage-out to a machine that is not running them. (Remember, our jobs will run on the machine running the WMAgent, unlike the real system where the WMAgent merely creates the jobs.)
Install xrootd via the commands:
sudo yum install xrootd
sudo yum install xrootd-client # needed only for debugging
We will configure the xrootd server as standalone. This is done with the following configuration file:
cat > /tmp/xrootd-standalone.cfg <<'EOF'
# This xrootd server will export paths that start with /eos:
#
all.export /eos
# The adminpath and pidpath variables indicate where the pid and various
# IPC files should be placed
all.adminpath /var/spool/xrootd
all.pidpath /var/run/xrootd
# Allow checksums to be performed (needed by the xrdfs chksum command).
xrootd.chksum adler32
# Useful debug information with: tail -f /var/log/xrootd/standalone/xrootd.log
xrootd.trace all
EOF
sudo mv /tmp/xrootd-standalone.cfg /etc/xrootd/xrootd-standalone.cfg
Warning
The above sets the xrootd server to be world writable. I am still investigating an easy way to add authorization via grid proxies: https://twiki.cern.ch/twiki/bin/view/LCG/XrootdTpc#Installation
As you can see from the configuration file, the xrootd server will only make visible paths that start with `/eos`. In the actual production system these paths live in EOS, CERN's low-latency disk-based storage service. In our test instances we do not have EOS, but keeping this prefix makes life a little easier:
sudo mkdir -p /var/xrootd/eos
sudo mkdir -p /var/log/xrootd/standalone
sudo chown xrootd.xrootd /var/xrootd/eos
sudo chown xrootd.xrootd /var/log/xrootd/standalone
sudo ln -s /var/xrootd/eos /eos
Finally, we enable and start the xrootd server:
sudo systemctl enable xrootd@standalone
sudo systemctl start xrootd@standalone
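To verify that the server accepts writes, you can copy a small test file into it and list it back (this uses the xrootd-client package installed above; the file name is arbitrary):
# Write a test file through xrootd and list the /eos export:
echo "hello" > /tmp/hello.txt
xrdcp /tmp/hello.txt root://localhost//eos/hello.txt
xrdfs localhost ls -l /eos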
Setting up the WMAgent¶
The WMAgent will run in its own virtual machine. Besides the WMAgent components, we need to install:
- HTCondor: The WMAgent creates HTCondor jobs for the workload of a request. In the actual production system, WMAgent only creates the jobs and waits for machines running CMSglideinWMS to connect and execute the jobs. In our test setup the machine running WMAgent will also execute the jobs.
- Singularity: This is needed to execute the jobs. In the actual production system WMAgent does not need singularity, but it is a requirement for machines running CMSglideinWMS.
- CVMFS: This is a read-only filesystem. CVMFS repositories provide the configuration files and software of the commands that execute the workload of the request. It is a requisite for the machines running WMAgent and CMSglideinWMS.
Creating the virtual machine for the agent¶
Follow the same instructions as for the core services, but with the following changes:
- As a hostname use: `USER-wmagent.cern.ch`.
- As a flavor use: `m2.large`.
Once the virtual machine is available, you will need to also follow the instructions for the custom configuration.
Finally, we will use condor and cvmfs available from the Open Science Grid (OSG) repositories. You can obtain the repositories with the following commands:
sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install https://repo.opensciencegrid.org/osg/3.5/osg-3.5-el7-release-latest.rpm
Setting up singularity¶
sudo yum install singularity
All jobs will be run inside a singularity container. We will do this by wrapping each job with a script called `singularity_wrapper.sh`. This script is part of CMSglideinWMS; however, we will use the version used in CMSConnect. From a terminal in `USER-wmagent.cern.ch`:
mkdir /data/scripts
cd /data/scripts
curl -O https://raw.githubusercontent.com/CMSConnect/tutorial-interactivegpus/master/singularity_wrapper.sh
chmod 755 singularity_wrapper.sh
Setting up CVMFS¶
Install CVMFS via yum:
sudo yum install cvmfs
Once the software is installed, there are a couple of configuration files we need to modify. First, we need to let the kernel know what to do when it encounters a path that starts with `/cvmfs`. We do this by adding the line `/cvmfs /etc/auto.cvmfs` to the end of the file `/etc/auto.master`. (The file `auto.cvmfs` is installed as part of cvmfs.)
# sed command:
# - $ matches the last line of the file
# - a is the sed append command; the text that follows is appended at the indicated location
# - -i modifies the file in place
sudo sed -e '$a/cvmfs /etc/auto.cvmfs' -i /etc/auto.master
We also need to create a cvmfs configuration file:
cat > /tmp/default.local <<'EOF'
CVMFS_REPOSITORIES="cms.cern.ch"
CVMFS_QUOTA_LIMIT=20000
CVMFS_HTTP_PROXY="http://ca-proxy.cern.ch:3128;http://ca-proxy1.cern.ch:3128|http://ca-proxy2.cern.ch:3128|http://ca-proxy3.cern.ch:3128|http://ca-proxy4.cern.ch:3128|http://ca-proxy5.cern.ch:3128"
EOF
sudo mv /tmp/default.local /etc/cvmfs/default.local
Finally, we restart the `autofs` service to activate the cvmfs configuration:
sudo systemctl restart autofs
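You can check that the cms.cern.ch repository mounts correctly with, for example:
# Probe the configured repositories and trigger the automount:
cvmfs_config probe
ls /cvmfs/cms.cern.ch | head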
Setting up condor¶
Install condor via yum:
sudo yum install condor
Now we need to customize the condor configuration. First we need to obtain the IP address of the machine running the agent. From `USER-wmagent`, run the command:
ip -f inet addr show eth0
and look for the `inet` entry. It should look something like `188.185.XXX.XXX`. Ignore the trailing `/XX`.
Edit `/etc/condor/condor_config.local`, for example with:
vi /etc/condor/condor_config.local
and change `COLLECTOR_NAME` to `USER-wmagent_collector`, and `CONDOR_HOST` to the IP address we found in the step above. Also, make sure that the line with `DAEMON_LIST` looks something like:
DAEMON_LIST = MASTER, COLLECTOR, SCHEDD, STARTD, NEGOTIATOR
In the actual production system `STARTD` would be missing, as the machines that run WMAgent usually do not themselves run jobs.
Still editing `/etc/condor/condor_config.local`, add these lines at the end of the file:
## All the following class ads are for startd, the condor process that runs jobs.
# Treat all the resources of the machine as partitionable slots. (E.g., to be
# able to run 4 one-core jobs at the same time.) Without this, any job would be
# assigned all the resources.
NUM_SLOTS = 4
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = true
# All the CMS production jobs will be run through singularity. All the
# singularity options are set by the wrapper script `singularity_wrapper.sh`.
USER_JOB_WRAPPER = /data/scripts/singularity_wrapper.sh
HAS_SINGULARITY = True
OSG_SINGULARITY_PATH = "/bin/singularity"
# Singularity images are provided by OSG via /cvmfs:
OSG_SINGULARITY_IMAGE_DEFAULT = "/cvmfs/singularity.opensciencegrid.org/bbockelm/cms:rhel6"
OSG_SINGULARITY_IMAGE = "/cvmfs/singularity.opensciencegrid.org/bbockelm/cms:rhel6"
# In the actual production system, singularity_wrapper.sh passes the following
# options to CMSglideinWMS. While our test system does not use CMSglideinWMS,
# `singularity_wrapper.sh` does expect them.
GLIDEIN_REQUIRED_OS = "any"
GLIDEIN_CMSSite = "T2_CH_CERN"
STARTD_ATTRS = HAS_SINGULARITY, OSG_SINGULARITY_PATH, OSG_SINGULARITY_IMAGE, OSG_SINGULARITY_IMAGE_DEFAULT, GLIDEIN_REQUIRED_OS, GLIDEIN_CMSSite, $(STARTD_ATTRS)
Finally, we enable and start the condor service:
sudo systemctl enable condor
sudo systemctl start condor
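Once condor is running, you can check that the daemons are up and that the slot we configured is advertised:
# The collector should list the machine, and the (still empty) queue should print:
condor_status
condor_q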
Deploying the WMAgent¶
These are the deployment steps, in outline form:
- Create /data/admin/wmagent/env.sh and the .secrets file.
- Download the deployment script:
cd /data/srv
wget -nv https://raw.githubusercontent.com/dmwm/WMCore/1.2.8/deploy/deploy-wmagent.sh
- At line 337 of the script (tweak configuration), add: cp /data/srv/configs/config.py ${MANAGE_DIR}/config.py
- Run the deployment script, replacing USER with your CERN user id:
sh deploy-wmagent.sh -w 1.2.8 -d HG1912d -t USER-wmagent -c USER-wmcore.cern.ch
- Start the agent:
source /data/admin/wmagent/env.sh
$manage start-agent
- Check that /data/srv/wmagent/v1.2.8/install/couchdb/certs/cert.pem and key.pem point to /data/certs/hostcert.pem and hostkey.pem.
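To see whether the agent components came up, you can look at their logs; assuming the standard layout of this deployment, they live under the install directory:
# Each WMAgent component writes a ComponentLog under the install directory:
ls /data/srv/wmagent/v1.2.8/install/wmagent/
tail -n 20 /data/srv/wmagent/v1.2.8/install/wmagent/*/ComponentLog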
Starting/Stopping and resetting the WMAgent¶
source /data/admin/wmagent/env.sh
$manage stop-agent
$manage stop-services
Where to look for debugging information¶
Testing¶
Injecting a request into the system¶
Based on https://github.com/dmwm/WMCore/wiki/Injecting%2C-assigning-and-validating-test-requests
mkdir ~/test_injection
cd ~/test_injection
curl -O https://raw.githubusercontent.com/dmwm/WMCore/master/test/data/ReqMgr/inject-test-wfs.py
PRE=MonB; python inject-test-wfs.py -m DMWM -c ${PRE}Cam -r ${PRE}Req -u "https://USER-wmcore.cern.ch" -f TaskChain_ProdMinBiasSmall.json -t USER-wmagent
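As elsewhere, USER above stands for your CERN user id. After the injection succeeds, the request should show up in the Request Manager 2 interface of your core services (see the frontend section above), and once the agent creates jobs for it you can follow them on the agent:
# On USER-wmagent, list the batch jobs created by the agent:
condor_q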