Setting up a minimal CMS production system for newcomers

Introduction

The production system processes data either generated by the CMS detector or simulated from it. The workload of the system is described by requests, which specify:

  • Input data, if any. (Simulations may not have input data.)
  • Sequential steps the data will go through. (With names such as RECO, GEN, SIM, etc.)
  • Number of lumisections. A lumisection is an interval of time (as of 2020, about 23 s).
  • Number of events per lumisection. An event is a beam crossing inside the CMS detector.

In this guide we do not care too much about the physical meaning of a lumisection, or of the steps such as RECO. They are only important as parameters to modify the size of the request during testing. (E.g., while testing new code, you may want to process only a handful of lumisections and events. You would not want to wait hours to find a syntax error.)
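To make this concrete, here is a rough sketch of what a small test request boils down to. The field names below are only illustrative, not the exact ReqMgr2 schema; the real test workflows (such as TaskChain_ProdMinBiasSmall.json, used at the end of this guide) carry many more parameters.

cat > /tmp/request-sketch.json <<'EOF'
{
  "InputDataset": null,
  "Steps": ["GEN", "SIM", "RECO"],
  "Lumisections": 2,
  "EventsPerLumi": 100
}
EOF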

A very important thing to remember is that a request only specifies what to do, not how and when to do it, nor where the data comes from or where it will be processed.

The how, when, and where are set by two subsystems:

  • Core services. The core services have four main responsibilities:

    • Maintain the state of the request. That is, determine whether the request is new, running, completed, etc. This is done by the reqmgr2 service.
    • Split the request into processing chunks, called workqueue elements. The split is done by the workqueue service. The workload of a request is highly parallel: for the most part, each lumisection can be processed independently of the others. The core services group lumisections into workqueue elements so that each element has some desired size; as of 2020, each element should take about 8 hours of processing. (Our test requests will run much shorter than this.)
    • Maintain the frontend of the system, such as the http interface. This is done by the frontend.
    • Maintain the records for all of the requests (couchdb service).
  • WMAgent

    • Transform the workqueue elements into concrete commands that can be executed. This means staging any data needed, and creating batch jobs (specifically condor jobs).
    • Maintain the state of the batch jobs. (Did a job finish? Should it be retried? etc.)
    • Serve as a foreman to the compute nodes that actually execute the jobs. In the real production system these compute nodes come from machines running CMSglideinWMS that connect to the WMAgent. For simplicity, in this guide we will not deal with CMSglideinWMS; instead, the jobs will execute on the same machine that runs the WMAgent. This will be enough for the small tests we will run.

Execution happens somewhere else

In the real production system neither the core services nor the WMAgent actually executes the workload. The core services process descriptions of the workloads, and the WMAgent transforms the descriptions into executable jobs, but the actual payload of the jobs is executed via CMSglideinWMS running on thousands of other hosts.

Note

There are more core services than the ones mentioned here.

In the real production system there is a single instance of the core services (cmsweb.cern.ch), and many instances of the WMAgent. Roughly, WMAgents are grouped into teams, and requests can be routed to be processed by particular teams. This is done so that requests can take advantage of, among other things, data locality or particular hardware. For this guide we will have a single WMAgent, so we will make sure that all requests are routed to it.

Both the core services and the WMAgent are collections of processes. This means that there is not a single program to which you can point and say 'this is the WMAgent'. Instead, each process in the collection has a responsibility, such as creating jobs, collecting results, or making sure other processes are alive. These processes communicate with each other via databases running alongside the core services and the WMAgent.

Keeping names straight

Both the core services and the WMAgent live in a git repository called WMCore. Inside this repository there are Python packages, one of which is also called WMCore and, for the most part, implements the core services. The WMAgent is implemented in the Python package WMComponent, which is a sibling of the WMCore package. Files in WMComponent use code from the WMCore package.
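If you want to see this layout for yourself, a quick optional check is to clone the repository and list its Python packages; the paths below assume the current layout of dmwm/WMCore:

git clone https://github.com/dmwm/WMCore.git /tmp/WMCore-src
ls /tmp/WMCore-src/src/python
# Among the entries you should see both WMCore and WMComponent.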

Prerequisites

  • Ability to create virtual machines.

  • Ability to create a grid user certificate. The instructions in this guide assume you have a CERN account when setting up the certificates.

  • Ability to ssh to lxplus.cern.ch, and have access to /afs/cern.ch.

    • If you are not authorized to ssh to lxplus.cern.ch, file a ticket with the CERN Service Desk.
    • If the connection drops immediately after you enter your password, you are probably not subscribed to the AFS service. You can subscribe here: https://resources.web.cern.ch/resources/Manage/AFS/Subscribe.aspx?login=YOUR_CERN_USERID

Common Setup

We are going to set up two virtual machines, which we will call USER-wmcore and USER-wmagent. Replace USER with your CERN account id.

On USER-wmcore we are going to run the core services and an xrootd server, and on USER-wmagent the WMAgent. The xrootd server is where jobs will write the results of their computation, and where computation steps share information. We set it up this way because, if you are reading this guide, you most likely do not have write access to the official CMS servers. (And since we are testing new code, that is a good thing: we do not have to worry about disrupting the public systems.)

Setting up user certificates

We need to create host certificates for USER-wmcore.cern.ch, as well as set up some common software to manage them. Before we can create the host certificates, we need to install a user certificate. This certificate will live in your afs home, in the directory ~/.globus.

To check whether you already have a valid user certificate, type:

openssl x509  -subject -dates -noout  -in ~/.globus/usercert.pem

If you get an error such as unable to load certificate, or if your current date falls outside the notBefore and notAfter dates printed, then you need to install a new certificate. For CERN:

  1. Request a new grid user certificate here: https://ca.cern.ch/ca
  2. Follow the prompts to install the certificate in your web browser.
  3. Go to the certificate preferences and export your certificate. (In Firefox, select your certificate and click Backup.... This will generate the file myCertificate.p12.)
  4. Copy the certificate to your home in afs, e.g., scp myCertificate.p12 USER@lxplus.cern.ch:

The .p12 file contains both the public and private keys. We separate them into the public usercert.pem and the private userkey.pem by logging into lxplus.cern.ch and running:

mkdir -p ~/.globus
rm -f  ~/.globus/usercert.pem  ~/.globus/userkey.pem
openssl pkcs12 -in myCertificate.p12 -clcerts -nokeys -out ~/.globus/usercert.pem
openssl pkcs12 -in myCertificate.p12 -nocerts -out ~/.globus/userkey.pem
chmod 400 ~/.globus/userkey.pem
chmod 400 ~/.globus/usercert.pem
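As an optional sanity check (not part of the official instructions), you can verify that the extracted certificate and key belong together by comparing their modulus digests; openssl will ask for the key password:

openssl x509 -noout -modulus -in ~/.globus/usercert.pem | openssl md5
openssl rsa  -noout -modulus -in ~/.globus/userkey.pem  | openssl md5
# The two digests should be identical.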

Setting up the core services

Warning

Unless otherwise noted, run the commands as a regular user. We will use sudo when needed.

Creating the virtual machine

Register your ssh key pair

When you create a virtual machine through https://openstack.cern.ch, the only way you can access that machine is via ssh using a preregistered key. You cannot use a login/password, so if you do not register a key pair, the virtual machine will be inaccessible.

  • To create a key pair, ssh to lxplus.cern.ch, type ssh-keygen, and accept the defaults. Note that you may already have a key (check with ls ~/.ssh/id_rsa.pub), in which case you may want to use that one, or generate a key with another name.

  • To register your key:

    • Go to https://openstack.cern.ch.
    • Click on Key Pairs, under the heading Compute (https://openstack.cern.ch/project/key_pairs).
    • Click on Import Public Key.
    • Give your key a name, and copy the contents of the public part of the key pair (i.e. the contents of ~/.ssh/id_rsa.pub) into the Public key box.

Launching the virtual machine instance

  1. Go to https://openstack.cern.ch.
  2. Click on Instances, under the heading Compute https://openstack.cern.ch/project/instances
  3. Click on Launch an instance
    1. Fill instance name with: USER-wmcore.
    2. For Source click on the up arrow of CC7 Test - x86-64.
    3. For Flavor select m2.medium.
    4. For Key Pair select the name of the key you registered in the step above.
    5. Finally click on Launch Instance.

After about 10 minutes, ssh to lxplus.cern.ch and try to log in to your vm with: ssh USER-wmcore.cern.ch. If your key does not have the default name, use ssh -i ~/.ssh/OTHERNAME USER-wmcore.cern.ch.

If you cannot connect to your virtual machine

The virtual machines exist in a CERN private network. A command such as ssh USER-wmcore.cern.ch will only work if your network access comes from CERN. From outside CERN you first need to ssh to lxplus.cern.ch and then ssh to your virtual machine.
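If you have the registered private key available on your workstation (or you use ssh agent forwarding), a single command with a jump host also works, assuming a reasonably recent OpenSSH:

ssh -J USER@lxplus.cern.ch USER@USER-wmcore.cern.ch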

Setting up host certificates

Warning

You first need to setup your user certificate as described above.

The easiest way to create a host certificate for USER-wmcore.cern.ch is to use a deployment script that lives in the repository git://github.com/dmwm/deployment.git.

The recommended instructions (taken from https://cms-http-group.web.cern.ch/cms-http-group/tutorials/environ/vm-setup.html) are:

sudo -l
sudo yum -y install git.x86_64
mkdir -p /tmp/foo
cd /tmp/foo
git clone git://github.com/dmwm/deployment.git cfg
cfg/Deploy -t dummy -s post $PWD system/devvm
rm -fr /tmp/foo
sudo yum install nano
sudo yum install unzip.x86_64 zip.x86_64
sudo yum install libXcursor libXrandr libXi libXinerama

The previous step generated /data/certs/host{cert,key}.pem by making a request to https://ca.cern.ch/ca. This process used the key ~/.globus/userkey.pem to authenticate to the service running on https://ca.cern.ch/ca. The certificate and key are also copied to /etc/grid-security/host{cert,key}.pem.
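You can check the freshly created host certificate with the same openssl command we used for the user certificate:

openssl x509 -subject -dates -noout -in /data/certs/hostcert.pem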

Then, we will let the core services authenticate as your user when needed:

mkdir -p /data/auth/wmcore
cp ~/.globus/usercert.pem /data/auth/dmwm-service-cert.pem

# Decrypt your key, because the core services will not be able to enter the
# key password. (There must be a safer and better way to do this. Investigating...)
openssl rsa -in ~/.globus/userkey.pem -out /data/auth/dmwm-service-key.pem

# warning: you may be tempted to use symbolic links instead of making a copy of
# the file. Do not do this if your home lives in /afs. When you log out the
# files will become inaccessible and the link will be broken.

Finally, we set the correct permissions for all of the keys and certificates:

sudo chgrp -R _sw /data/auth
sudo chmod ug=r,o-rwx $(find /data/auth -type f)
sudo chmod u=rwx,g=rx,o-rwx $(find /data/auth -type d) 
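Two quick optional checks: the decrypted service key should load without prompting for a password, and the files under /data/auth should be readable only by you and the _sw group:

openssl rsa -noout -check -in /data/auth/dmwm-service-key.pem
ls -lR /data/auth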

Custom virtual machine configuration

Turning off the firewall

If your virtual machines are running on the CERN OpenStack, they are hidden from inbound traffic from the rest of the world. You could add rules to allow your custom WMAgent to contact your custom core services, but for our test purposes it is easier to simply turn off the firewall:

sudo systemctl stop firewalld 
sudo systemctl disable firewalld
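To confirm that the firewall is really off (the first command should report not running):

sudo firewall-cmd --state
sudo systemctl is-enabled firewalld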

Now we are ready to install the core services. First we will deploy the core services, then modify the appropriate configuration files, and finally start the services.

Installing the core services (WMCore)

Obtaining the core services source code

(These instructions were adapted from https://github.com/dmwm/WMCore/wiki/Deploy-central-services-in-a-VM.)

First we need to clone the WMCore repository. It is recommended that you use the directory names and paths shown, as it will ease communication when you ask for help, and also their values may be hardcoded somewhere.

cd /data
git clone git://github.com/dmwm/deployment.git cfg
cd cfg

Now you need to choose a deployment version. When developing, this is where you will create git branches and apply your modifications. For now, we will use a release version (the value of RELEASE) used in production.

export RELEASE=HG1912e
cd /data/cfg && git reset --hard ${RELEASE}

Deploying the core services

The deployment repository we cloned into /data/cfg has a script to deploy the services. Run it as:

REPO="-r comp=comp"
ARCH=slc7_amd64_gcc630
cd /data
/data/cfg/admin/InstallDev -R comp@${RELEASE} -A ${ARCH} -s image -v ${RELEASE} -a ${PWD}/auth ${REPO} -p "admin frontend couchdb reqmgr2 reqmgr2ms workqueue reqmon t0_reqmon acdcserver"

This installs the core services at /data/srv/${RELEASE}, with a link /data/srv/current pointing to /data/srv/${RELEASE}.

This also installs some tasks to execute periodically as cron jobs. You can see them now with:

crontab -l

Once the core services are deployed, we need to copy the key and certificate to locations where the services expect them by default:

# The lines with chmod may fail if this is the first time you copy the certificates. This is ok.

sudo chmod 660 /data/srv/current/auth/{reqmgr2,workqueue,acdcserver,reqmon,t0_reqmon,reqmgr2ms}/dmwm-service-{cert,key}.pem
sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/acdcserver/
sudo chown _acdcserver._config /data/srv/current/auth/acdcserver/dmwm-service-{cert,key}.pem
sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/reqmgr2/
sudo chown _reqmgr2._config /data/srv/current/auth/reqmgr2/dmwm-service-{cert,key}.pem
sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/reqmgr2ms/
sudo chown _reqmgr2ms._config /data/srv/current/auth/reqmgr2ms/dmwm-service-{cert,key}.pem
sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/reqmon/
sudo chown _reqmon._config /data/srv/current/auth/reqmon/dmwm-service-{cert,key}.pem
sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/t0_reqmon/
sudo chown _t0_reqmon._config /data/srv/current/auth/t0_reqmon/dmwm-service-{cert,key}.pem
sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/workqueue/
sudo chown _workqueue._config /data/srv/current/auth/workqueue/dmwm-service-{cert,key}.pem
sudo chmod 440 /data/srv/current/auth/{reqmgr2,workqueue,acdcserver,reqmon,t0_reqmon,reqmgr2ms}/dmwm-service-{cert,key}.pem
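Equivalently, and perhaps easier to re-run after a redeployment, the same copies and ownership changes can be written as a loop (a sketch; on a re-run you may still need the chmod 660 from above first):

for service in acdcserver reqmgr2 reqmgr2ms reqmon t0_reqmon workqueue; do
    sudo cp /data/auth/dmwm-service-{cert,key}.pem /data/srv/current/auth/${service}/
    sudo chown _${service}._config /data/srv/current/auth/${service}/dmwm-service-{cert,key}.pem
done
sudo chmod 440 /data/srv/current/auth/{acdcserver,reqmgr2,reqmgr2ms,reqmon,t0_reqmon,workqueue}/dmwm-service-{cert,key}.pem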

Modifying configuration files

There are configuration files we will need to modify. Some of them live in /data/cfg, while others live in the installation directory. We make a copy of them in our afs home directory so that we do not have to worry about committing their changes in our git branches, or losing them when trying a new release.

mkdir -p ~/local-cfg/core

Whitelisting the custom WMAgent

The first changes relate to allowing our custom WMAgent to work with our custom core services. First we copy the appropriate configuration files:

for service in {reqmgr2,reqmon,workqueue}; do
    cp /data/srv/${RELEASE}/config/${service}/config.py ~/local-cfg/core/${service}_config.py
done

# Note that we use the config files installed at
# /data/srv/${RELEASE}/config/${service}/config.py instead of those at
# /data/cfg/config/${service}/config.py.
# This is because the deployment script fills in some values, such as the
# hostname of the machine that will run the core services.

We now declare our custom WMAgent in our copy of the configuration files. The way we do this is to find the place where an official WMAgent host is named (in this case vocms0127), and add the hostname of our agent.

Here we assume that the host that will run the WMAgent has the name of the form USER-wmagent. Thus, in the following line, replace USER with your CERN user id:

sed -i 's/or HOST.startswith("vocms0127"):/or HOST.startswith("vocms0127") or HOST.startswith("USER-wm"):/g' ~/local-cfg/core/{reqmgr2,reqmon,workqueue}_config.py

Give the correct production permissions to your CERN ID and your WMAgent

When users connect to the core services via https, they will be asked to provide a certificate for authentication. This certificate is the one you installed in your web browser in a previous step of this guide. Using this authentication, the user gains roles according to the file /data/srv/state/frontend/etc/authmap.json. This file is a copy of the global file used for the whole CERN infrastructure, and lists the roles granted to users and hosts.

If you are reading this guide, you most likely do not have the correct roles to operate the actual production system, so we will add such roles to authmap.json. However, since user ids and machines are added to and deleted from authmap.json by CERN all the time, this file is updated by a cron job every four minutes, which means that any changes you make to it will be lost.

The first thing we need to do is to disable the cron job:

crontab -e

Comment out the line (add a # at the start of the line) that reads:

*/4 * * * * . /data/srv/current/apps/frontend/etc/profile.d/init.sh && PYTHONPATH=/data/srv/current/auth/frontend:$PYTHONPATH /data/srv/current/config/frontend/mkauthmap  -c /data/srv/current/config/frontend/mkauth.conf -o /data/srv/state/frontend/etc/authmap.json

Now, back in the terminal, we obtain a copy of authmap.json with the command:

(. /data/srv/current/apps/frontend/etc/profile.d/init.sh && PYTHONPATH=/data/srv/current/auth/frontend:$PYTHONPATH /data/srv/current/config/frontend/mkauthmap  -c /data/srv/current/config/frontend/mkauth.conf -o ~/local-cfg/core/authmap.json)

Edit the file ~/local-cfg/core/authmap.json and:

Make sure that the line which has your CERN userid has the required roles for production. For most users, this means replacing:

"ROLES": {"user": ["group:users"]}

with

"ROLES": {"user": ["group:users"], "admin": ["group:reqmgr"], "data-manager": ["group:reqmgr"], "production-operator": ["group:dataops"]}

After the line with your userid, insert the following lines to give read/write permissions to the hosts running your custom installations. Remember to replace USER with your CERN user id:

{"DN": "/DC=ch/DC=cern/OU=computers/CN=USER-wmcore.cern.ch", "ID": null, "LOGIN": "USER-wmcore@cern.ch", "NAME": "USER-wmcore Service", "ROLES": {"data-manager": ["group:reqmgr"], "production-operator": ["group:dataops"], "user": ["group:users"]}},
{"DN": "/DC=ch/DC=cern/OU=computers/CN=USER-wmagent.cern.ch", "ID": null, "LOGIN": "USER@cern.ch", "NAME": "USER-wmagent Service", "ROLES": {"data-manager": ["group:reqmgr"], "production-operator": ["group:dataops"], "user": ["group:users"]}},
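authmap.json must remain valid JSON after hand-editing. A quick check (this simply parses the file and prints a confirmation):

python -c "import json, os; json.load(open(os.path.expanduser('~/local-cfg/core/authmap.json'))); print('authmap.json parses OK')"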

Read and write access permissions to the core services

(Adapted from users.db constructed from: https://github.com/dmwm/WMCore/wiki/vm-access-to-other-user)

(See also /data/srv/current/config/frontend/deploy, which generated the defaults.)

As explained above, a user or machine authenticating with a certificate gains roles as dictated by the authmap.json file. For the core services, we need to declare what a certain role entails, such as being able to create a request, access statistics of the workflows, etc.

These permissions are maintained in a sqlite database used by the frontend core service.

We set the correct permissions as follows:

cp /data/srv/current/auth/frontend/users.db ~/local-cfg/core/users.db
sqlite3 ~/local-cfg/core/users.db

At the sqlite3 prompt, type the following statements. Remember to change USER to your CERN userid:

DELETE FROM cms_name WHERE name = 'T3_NX_Foobar';
INSERT INTO cms_name VALUES (1, 'T3_NX_AnyNameYouWant');

DELETE FROM site WHERE name = 'Foobar';
INSERT INTO site VALUES (1, 'USER-wmcore');

DELETE FROM user_group WHERE name = 'foo';
INSERT into user_group values (1, 'reqmgr');
INSERT into user_group values (2, 'dataops');

DELETE FROM role WHERE title = 'bar';
INSERT INTO role VALUES (1, 'admin');
INSERT INTO role VALUES (2, 'production-operator');
INSERT INTO role VALUES (3, 'data-manager');

-- The script at /data/srv/current/config/frontend/deploy added with id 1 a line
-- with CN=USER, where USER is your CERN userid.
-- You can see this with: SELECT * FROM contact WHERE id = 1;
INSERT INTO contact VALUES (2, 'Service', 'USER-wmcore', 'USER-wmcore', '/DC=ch/DC=cern/OU=computers/CN=USER-wmcore.cern.ch');
INSERT INTO contact VALUES (3, 'Service', 'USER-wmagent', 'USER-wmagent', '/DC=ch/DC=cern/OU=computers/CN=USER-wmagent.cern.ch');

-- For the table group_responsibility, the columns are (contact, role,
-- user_group).  thus, to give our user, custom services and WMAgent (ids 1, 2
-- and 3 in table contact) permissions to create/read requests, and read/write
-- statistics (the production-operator and dataops roles) we add:

-- USER with role 'admin' of group 'reqmgr'.  Commented out as already added by .../frontend/deploy
-- INSERT INTO group_responsibility VALUES (1, 1, 1);

-- USER with role 'production-operator' of group 'dataops'
INSERT INTO group_responsibility VALUES (1, 2, 2);

-- USER with role 'data-manager' of group 'reqmgr'
INSERT INTO group_responsibility VALUES (1, 3, 1);

-- USER-wmcore and USER-wmagent with roles 'production-operator' of
-- group 'dataops', and 'data-manager' of group 'reqmgr'.
INSERT INTO group_responsibility VALUES (2, 2, 2);
INSERT INTO group_responsibility VALUES (3, 2, 2);
INSERT INTO group_responsibility VALUES (2, 3, 1);
INSERT INTO group_responsibility VALUES (3, 3, 1);
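After typing .quit to leave sqlite3, you can verify the new rows from the shell, for example:

sqlite3 ~/local-cfg/core/users.db 'SELECT * FROM contact;'
sqlite3 ~/local-cfg/core/users.db 'SELECT * FROM group_responsibility;'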

Deploying our custom configuration files

Warning

You may be tempted to use symbolic links to the contents of ~/local-cfg instead of copying the files. If ~/local-cfg lives in /afs, this will not work once you close your terminal, as /afs will become inaccessible from USER-wmcore.

In this step we copy our custom configuration files to the correct locations. Remember to set the RELEASE variable as appropriate, as we did earlier.


# core services
for service in {reqmgr2,reqmon,workqueue}; do
    sudo cp ~/local-cfg/core/${service}_config.py /data/srv/${RELEASE}/config/${service}/config.py
    sudo chown _sw:_config /data/srv/${RELEASE}/config/${service}/config.py
done

# authmap.json
cp ~/local-cfg/core/authmap.json /data/srv/state/frontend/etc/authmap.json   

# sqlite db
sudo cp ~/local-cfg/core/users.db  /data/srv/${RELEASE}/auth/frontend/users.db
sudo chown _sw:_config /data/srv/${RELEASE}/auth/frontend/users.db

Starting/Stopping the core services

The core services can be controlled via the script /data/cfg/admin/InstallDev. This script has to be executed from the /data directory. For example, to check the status of the services:

(cd /data && /data/cfg/admin/InstallDev -s status)

Now, to start (or restart) the core services:

(cd /data && /data/cfg/admin/InstallDev -s start)

Finally, to stop the core services:

(cd /data && /data/cfg/admin/InstallDev -s stop)

You can also (re)start particular services. For example, to only restart the reqmgr2 service:

(cd /data && /data/cfg/admin/InstallDev -s start:reqmgr2)

Connecting to the frontend

You can interact with the core services via https. If your workstation/laptop is at CERN, you can simply access them pointing your web browser to:

https://USER-wmcore.cern.ch

Warning

The https:// part is important, otherwise you will get a Bad request error message.

If you are not at CERN, you need to make an ssh tunnel that redirects a given local address to that of the core services. You can do this for example as follows:

ssh -L localhost:8443:USER-wmcore.cern.ch:443 USER@lxplus.cern.ch

In the previous command we map the local address and port localhost:8443 to the address and port USER-wmcore.cern.ch:443. Port 443 is the official port for https, and port 8443 is a common convention when tunneling https. (You could change 8443 to another port number; note that ports below 1025 would require sudo ssh ..., which is not recommended.)

Now you can point your web browser to:

https://localhost:8443

Once the web page loads, you can click on the information of several services. Only the links for Request Manager 2, Request Monitor and Work Queue will be functional. When you click on any of these links you will be asked to accept a host certificate exception, and then to confirm the user certificate that you want to use. This is the certificate you installed in your browser earlier in this guide.

If when clicking on a link you get the message:

{"error":"unauthorized","reason":"Authorisation failed"}

then your user id has not been correctly configured in the authmap.json file, which we set up above.
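You can also exercise the authentication from the command line (from lxplus, or any host where your user certificate is available). With a correct authmap.json, the reply should not contain the Authorisation failed document; the exact output is HTML and not particularly readable, but the point is whether the error above appears:

curl -sk --cert ~/.globus/usercert.pem --key ~/.globus/userkey.pem \
     https://USER-wmcore.cern.ch/reqmgr2 | head
# curl will prompt for the password of userkey.pem.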

Debugging information and logs

The logs of the core services are located at /data/srv/logs. At this early stage you want to pay particular attention to the following (see the tail example after this list):

  • frontend/error_log_YYYYMMDD.log: Here you will see, among others, errors with certificates. If so, check that your certificates are installed correctly.
  • frontend/access_log_YYYYMMDD.log: This will show you all the connection attempts, together with which certificates are being used per connection.
  • reqmgr2/reqmgr2-YYYYMMDD.log: Eventually you will see the lifetime of a request here. For this setup, errors that you may see include:
    • unable to set private key file: There is an error with the certificate. For example, the file may not exist, or it is prompting for a password. Check the user and service certificate steps above.
    • CouchForbidden: You may be using the incorrect key/certificate. In particular, check the frontend error log. One possibility is that you are using the host key created earlier, rather than a user key. (The frontend will complain with rejecting non-vo certificate.)
  • workqueue/workqueue-YYYYMMDD.log: Certificate errors will be similar to those of reqmgr2.
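For example, assuming the date-stamped file names described above, you can follow the frontend and reqmgr2 logs with:

tail -f /data/srv/logs/frontend/error_log_$(date +%Y%m%d).log \
        /data/srv/logs/reqmgr2/reqmgr2-$(date +%Y%m%d).log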

Removing an installation

To remove your installation and start with a clean one (say, to try another release), stop the services, clear the cron jobs, and remove the installation directory, logs, and state:

export RELEASE=HG1912e

# Stop the services:
(cd /data && /data/cfg/admin/InstallDev -s stop)

# Clear the cron jobs
crontab -r

# Kill any python processes that may have survived .../InstallDev -s stop
killall python

# Finally remove the installation directory, logs, and core services state
# (these live under /data/srv):
cd /data/srv
sudo rm -rf current enabled ${RELEASE} logs state

Setting up xrootd

Our test system includes an xrootd server. This is because the outputs of the different steps in a request are staged out to xrootd servers at CERN. As a new developer, you most likely will not have write access to those servers, so we will redirect all these write operations to a server we control.

This xrootd server could be installed in the machine that runs the core services, or the one that runs the WMAgent, or even a third machine. We choose to install it together with the core services so that we do not have to create a third machine, but also so that the we can test that jobs can stage-out to a machine that is not running them. (Remember, our jobs will run on the machine running the WMAgent, unlike the real system where the WMAgent simply creates the jobs.)

Install xrootd via the command:

sudo yum install xrootd
sudo yum install xrootd-client    # needed only for debugging

We will configure the xrootd server as standalone. This is done with the following configuration file:

cat > /tmp/xrootd-standalone.cfg <<'EOF'
# This xrootd server will export paths that start with /eos:
#
all.export /eos

# The adminpath and pidpath variables indicate where the pid and various
# IPC files should be placed
all.adminpath /var/spool/xrootd
all.pidpath /var/run/xrootd

# Allow checksums to be performed (needed by the xrdfs chksum command).
xrootd.chksum adler32

# Useful debug information with: tail -f /var/log/xrootd/standalone/xrootd.log
xrootd.trace all
EOF

sudo mv /tmp/xrootd-standalone.cfg /etc/xrootd/xrootd-standalone.cfg

Warning

The above sets the xrootd server to be world writable. I am still investigating an easy way to add authorization via grid proxies: https://twiki.cern.ch/twiki/bin/view/LCG/XrootdTpc#Installation

As you can see from the configuration file, the xrootd server will only export paths that start with /eos. In the actual production system these paths live in EOS, CERN's disk-based low-latency storage service. In our test instances we do not have EOS, but keeping this prefix makes life a little easier:

sudo mkdir -p /var/xrootd/eos 
sudo mkdir -p /var/log/xrootd/standalone

sudo chown xrootd.xrootd /var/xrootd/eos
sudo chown xrootd.xrootd /var/log/xrootd/standalone

sudo ln -s /var/xrootd/eos /eos

Finally, we enable and start the xrootd server:

sudo systemctl enable xrootd@standalone
sudo systemctl start xrootd@standalone
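To check that the server accepts writes (this is where the xrootd-client package installed above comes in handy), copy a small file in and list it back:

echo "hello xrootd" > /tmp/xrootd-test.txt
xrdcp /tmp/xrootd-test.txt root://USER-wmcore.cern.ch//eos/xrootd-test.txt
xrdfs USER-wmcore.cern.ch ls -l /eos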

Setting up the WMAgent

The WMAgent will run in its own virtual machine. Besides the WMAgent components, we need to install:

  • HTCondor: The WMAgent creates HTCondor jobs for the workload of a request. In the actual production system, WMAgent only creates the jobs and waits for machines running CMSglideinWMS to connect and execute the jobs. In our test setup the machine running WMAgent will also execute the jobs.
  • Singularity: This is needed to execute the jobs. In the actual production system WMAgent does not need singularity, but it is a requirement for machines running CMSglideinWMS.
  • CVMFS: This is a read-only filesystem. CVMFS repositories provide the configuration files and software of the commands that execute the workload of the request. It is a requisite for the machines running WMAgent and CMSglideinWMS.

Creating the virtual machine for the agent

Follow the same instructions as for the core services, but with the following changes:

  • As a hostname use: USER-wmagent.cern.ch.
  • As a flavor use: m2.large.

Once the virtual machine is available, you will also need to follow the instructions for the custom virtual machine configuration (turning off the firewall).

Finally, we will use condor and cvmfs as provided by the EPEL and Open Science Grid (OSG) repositories. You can obtain the repositories with the following commands:

sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install https://repo.opensciencegrid.org/osg/3.5/osg-3.5-el7-release-latest.rpm

Setting up singularity

sudo yum install singularity

All jobs will run inside a singularity container. We achieve this by wrapping each job with a script called singularity_wrapper.sh. This script is part of CMSglideinWMS; however, we will use the version used in CMSConnect. From a terminal on USER-wmagent.cern.ch:

mkdir /data/scripts
cd /data/scripts
curl -O https://raw.githubusercontent.com/CMSConnect/tutorial-interactivegpus/master/singularity_wrapper.sh
chmod 755 singularity_wrapper.sh
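Once CVMFS is configured (next subsection), a quick sanity check is to run a trivial command inside the same image that the condor configuration below points to:

singularity exec /cvmfs/singularity.opensciencegrid.org/bbockelm/cms:rhel6 cat /etc/redhat-release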

Setting up CVMFS

Install CVMFS via yum:

sudo yum install cvmfs

Once the software is installed, there are a couple of configuration files we need to modify. First, we need to let the kernel know what to do when it encounters a path that starts with /cvmfs. We do this by adding the line /cvmfs /etc/auto.cvmfs at the end of the file /etc/auto.master. (The file auto.cvmfs is installed as part of cvmfs.)

# sed command:
# - $ matches the last line of the file
# - a: the sed append command; whatever text follows is appended at that location
# - -i: modify the file in place
sudo sed -e '$a/cvmfs /etc/auto.cvmfs' -i /etc/auto.master

We also need to create a cvmfs configuration file:

cat > /tmp/default.local <<'EOF'
CVMFS_REPOSITORIES="cms.cern.ch"
CVMFS_QUOTA_LIMIT=20000
CVMFS_HTTP_PROXY="http://ca-proxy.cern.ch:3128;http://ca-proxy1.cern.ch:3128|http://ca-proxy2.cern.ch:3128|http://ca-proxy3.cern.ch:3128|http://ca-proxy4.cern.ch:3128|http://ca-proxy5.cern.ch:3128"
EOF

sudo mv /tmp/default.local /etc/cvmfs/default.local

Finally, we restart the autofs service to activate the cvmfs configuration:

sudo systemctl restart autofs
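To verify that the cms.cern.ch repository mounts correctly:

cvmfs_config probe cms.cern.ch
ls /cvmfs/cms.cern.ch | head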

Setting up condor

Install condor via yum:

sudo yum install condor

Now we need to customize the condor configuration. First we need to obtain the IP address of the machine running the agent. From USER-wmagent, run the command:

ip -f inet addr show eth0

and look for the inet entry. It should look something like 188.185.XXX.XXX. Ignore the trailing /XX.

Edit /etc/condor/condor_config.local, for example:

sudo vi /etc/condor/condor_config.local

and change COLLECTOR_NAME to USER-wmagent_collector, and CONDOR_HOST to the IP address we found in the step above. Also, make sure that the line with DAEMON_LIST looks something like:

DAEMON_LIST = MASTER, COLLECTOR, SCHEDD, STARTD, NEGOTIATOR

In the actual production system STARTD would be missing, as the machines that run the WMAgent usually do not run jobs themselves.

Still editing /etc/condor/condor_config.local, add these lines at the end of the file:


## All the following class ads are for startd, the condor process that runs jobs.

# Treat all the resources of the machine as partitionable slots (e.g., to be
# able to run four one-core jobs at the same time). Without this, any job would
# be assigned all the resources.
NUM_SLOTS = 4
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = true

# All the CMS production jobs will be run through singularity. All the
# singularity options are set by the wrapper script `singularity_wrapper.sh`.

USER_JOB_WRAPPER = /data/scripts/singularity_wrapper.sh
HAS_SINGULARITY = True
OSG_SINGULARITY_PATH = "/bin/singularity"

# Singularity images are provided by OSG via /cvmfs:
OSG_SINGULARITY_IMAGE_DEFAULT = "/cvmfs/singularity.opensciencegrid.org/bbockelm/cms:rhel6"
OSG_SINGULARITY_IMAGE = "/cvmfs/singularity.opensciencegrid.org/bbockelm/cms:rhel6"

# In the actual production system, singularity_wrapper.sh  passes the following
# options to CMSglideinWMS. While our test system does not use CMSglideinWMS,
# `singularity_wrapper.sh` does expect them.
GLIDEIN_REQUIRED_OS = "any"
GLIDEIN_CMSSite = "T2_CH_CERN"

STARTD_ATTRS = HAS_SINGULARITY, OSG_SINGULARITY_PATH, OSG_SINGULARITY_IMAGE, OSG_SINGULARITY_IMAGE_DEFAULT, GLIDEIN_REQUIRED_OS, GLIDEIN_CMSSite, $(STARTD_ATTRS)

Finally, we enable and start the condor service:

sudo systemctl enable condor
sudo systemctl start condor
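After a minute or so the startd should register with the collector. You can check that a partitionable slot is advertised, together with the singularity attributes we added:

condor_status
condor_status -long | grep -E 'HAS_SINGULARITY|OSG_SINGULARITY_IMAGE|GLIDEIN_CMSSite'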

Deploying the WMAgent

First, create /data/admin/wmagent/env.sh and the agent secrets file (.secrets). Then download the deployment script:

cd /data/srv

wget -nv https://raw.githubusercontent.com/dmwm/WMCore/1.2.8/deploy/deploy-wmagent.sh

In deploy-wmagent.sh, at line 337 (the tweak-configuration step), add: cp /data/srv/configs/config.py ${MANAGE_DIR}/config.py

sh deploy-wmagent.sh -w 1.2.8 -d HG1912d -t USER-wmagent -c USER-wmcore.cern.ch

source /data/admin/wmagent/env.sh
$manage start-agent

Check that /data/srv/wmagent/v1.2.8/install/couchdb/certs/cert.pem and key.pem point to /data/certs/hostcert.pem and /data/certs/hostkey.pem.
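To see whether the agent components came up, the manage script also provides a status command (assuming your version of the script supports it):

source /data/admin/wmagent/env.sh
$manage status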

Starting/Stopping and resetting the agent

source /data/admin/wmagent/env.sh
$manage stop-agent
$manage stop-services

Where to look for debugging information

Testing

Injecting a request into the system

Based on https://github.com/dmwm/WMCore/wiki/Injecting%2C-assigning-and-validating-test-requests

mkdir -p ~/test_injection
cd ~/test_injection
curl -O https://raw.githubusercontent.com/dmwm/WMCore/master/test/data/ReqMgr/inject-test-wfs.py

PRE=MonB; python inject-test-wfs.py -m DMWM -c ${PRE}Cam -r ${PRE}Req -u "https://USER-wmcore.cern.ch" -f TaskChain_ProdMinBiasSmall.json -t USER-wmagent

Checking a request progress

Debugging condor jobs