A Case Study in Preserving a CMS Application (TauRoast) with Parrot

This webpage records our work on data and software preservation from Fall 2013 to Spring 2014. Our work is supported by DASPOS. For more information about our lab, please check The Cooperative Computing Lab.

(1) Overview of TauRoast

Within the ongoing investigation of the Higgs boson at the CMS detector, part of the LHC at CERN, the Higgs production in association with two top quarks allows measuring the Higgs coupling strength to top quarks. As the Higgs boson is too short-lived to be detected itself, it has to be reconstructed from its decay products.

TauRoast searches for cases where the Higgs boson decays to two tau leptons. Since tau leptons are very short-lived, they are not observed directly but through the decay products that they generate. The analysis must therefore search for detector events that show a signature of decay products compatible with both hadronic tau and top decays. Properties of such events are used to distinguish the events of interest (Higgs decays) from all other events and are also used in further statistical analysis.

For more information about the code and data sources of TauRoast, please check here.

(2) Execution Environment of TauRoast

Hardware Architecture: X86_64;     Kernel: Linux 2.6.18;     OS: RedHat 5.10

CPU Cores: 64;     Memory Space: 125GB;     Disk Space: 204GB

(3) Create one Self-Contained Package for the Application

To help preserve an application, we created a Parrot Packaging Tool Suite based on Parrot, which is included in CCTools. You can get the source code of CCTools from our GitHub repository here, or download the appropriate CCTools binaries from our Download webpage here.

(3.1) Generate a Dependency List for the Application from one Successful Execution

To figure out the underlying file dependencies and execution environment, Parrot can record the names of all the files accessed during the execution of a program. This is enabled by the --name-list dependencylist option. Whenever a filename is resolved by the Parrot name resolver, it is recorded in the dependencylist file, together with the type of system call used to access it.

The command used to generate the dependency list for TauRoast is as follows:

% parrot_run --name-list namelist.full /bin/bash ~/script-v4.sh

The source code of script-v4.sh is here.

After executing this command, the names of all the accessed files are recorded in the file called namelist.full. Each line of the namelist has the format filename|system-call-type. For example, /usr/bin/ls|stat means that the file /usr/bin/ls was accessed using the stat system call.
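Because each line has the form filename|system-call-type, ordinary text tools can summarize the list. The following is a minimal sketch, not part of the Parrot Packaging Tool Suite: the function name count_by_syscall is hypothetical, and it assumes that the system call type is the text after the last `|` on each line.

```shell
# Hypothetical helper (not part of CCTools): summarize a Parrot name list.
# Each input line is expected to look like: /usr/bin/ls|stat
count_by_syscall() {
    # Use the last '|'-separated field as the system call type and count
    # how many recorded accesses used each type.
    awk -F'|' 'NF >= 2 { count[$NF]++ }
               END { for (t in count) print t, count[t] }' "$1" | sort
}

# Usage: count_by_syscall namelist.full
```

This kind of summary is only for inspecting the list; the packaging steps that follow do not depend on it.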

To repeat the step, you need to use the cctools source code under the following commit id: ca9d3c38c6e8c105a18bc50869c985242d1e84fa

For more information about parrot_run, please check here.

(3.2) Remove the duplicate items from the namelist file

The namelist file created above contains duplicate items, because one file may be accessed multiple times during the execution of a program. To shrink the namelist file, we can remove the duplicates. For example:

% sort -u namelist.full > namelist
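Note that sort -u removes only lines that are identical in both fields, so a file accessed through several system call types still appears once per type. If you only want to inspect the set of distinct files, a sketch like the following may help. The function name unique_paths is hypothetical, and the sketch assumes that the system call type itself contains no `|` character; its output is meant for inspection and is not a replacement for the namelist file used in the later steps.

```shell
# Hypothetical helper: list each accessed path once, ignoring the call type.
unique_paths() {
    # Drop the trailing "|system-call-type" field, then de-duplicate.
    sed 's/|[^|]*$//' "$1" | sort -u
}

# Usage: unique_paths namelist | wc -l
```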

(3.3) Create a file including all the environment variables

First, run the following command to write all the environment variables into a file named env-list. Each line corresponds to one environment variable and has the format <name>=<value>.

% env > env-list

Then convert env-list into the format setenv <name> "<value>" with the following command:

% env-process.sh -p env-list

env-process.sh creates a file named env-setting, which is a list of environment variables. Each line corresponds to one environment variable and has the format setenv <name> "<value>".
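The source of env-process.sh is not reproduced on this page, so the following is only a sketch of the conversion described above, not the script itself. The function name to_setenv is hypothetical. It assumes each variable fits on one line and that values contain no double quotes; lines without an `=` are skipped.

```shell
# Hypothetical sketch of the <name>=<value> to setenv conversion (bash).
to_setenv() {
    local line name value
    # Read <name>=<value> lines from standard input.
    while IFS= read -r line; do
        case "$line" in
            *=*) ;;          # looks like an assignment, keep it
            *) continue ;;   # skip anything else
        esac
        # Split at the first '=' only, so values containing '=' stay intact.
        name=${line%%=*}
        value=${line#*=}
        printf 'setenv %s "%s"\n' "$name" "$value"
    done
}

# Usage: to_setenv < env-list > env-setting
```

Splitting at the first `=` matters because values such as option strings often contain further `=` characters.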

(3.4) Create a package including all the dependencies, common mountlist, and the environment file

Create a package including all the dependencies, common mountlist, and the environment file through the following command:

% /bin/bash package-utility.sh --namelist namelist --env env-setting --path /tmp/package

package-utility.sh copies all the files listed in the namelist file into /tmp/package while preserving their directory paths, copies env-setting into the package, and creates a file called common-mountlist that lists all the mount points which are not included in the package.

The package format of this version:

(a) All the accessed directories and files will be copied into the package, like /etc, /bin, /lib and so on.

(b) common-mountlist: records all the mount points which are not included in the package, such as /proc, /dev, /sys.

(c) env-setting: the list of environment variables. Each line corresponds to one environment variable and has the format setenv <name> "<value>".

The size of the package is: 21GB.
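To illustrate what copying files "while preserving their directory paths" can look like, here is a minimal sketch. It is not the source of package-utility.sh; the function name copy_from_namelist is hypothetical. It assumes that the namelist records absolute paths, and it copies regular files only (symbolic links are followed), whereas a complete tool would also have to handle directories and links.

```shell
# Hypothetical sketch: copy the files named in a namelist into a package
# directory, recreating each file's original directory path beneath it (bash).
copy_from_namelist() {
    local namelist=$1 dest=$2 entry path
    while IFS= read -r entry; do
        path=${entry%|*}                 # strip the trailing |system-call-type
        [ -f "$path" ] || continue       # this sketch copies regular files only
        mkdir -p "$dest$(dirname "$path")"
        cp -p "$path" "$dest$path"       # -p keeps mode and timestamps
    done < "$namelist"
}

# Usage: copy_from_namelist namelist /tmp/package
```

With this layout, /usr/bin/ls ends up at /tmp/package/usr/bin/ls, so the package directory can later stand in for the root directory.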

(3.5) Add repeat-hep.sh into the package; repeat-hep.sh helps create the mountlist file

The source code of repeat-hep.sh is here.

(3.6) Output of TauRoast

The output of TauRoast is here.

(4) How to Distribute the Package

The package can be distributed in the format of TAR or TGZ.
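As a sketch of how such an archive could be produced with tar: the function name pack_package and the paths in the usage line are hypothetical and are not taken from our scripts.

```shell
# Hypothetical helper: pack a package directory into a gzip-compressed tar.
pack_package() {
    local pkg_dir=$1 archive=$2
    # -C switches to the parent directory first, so the archive stores
    # paths relative to it (package-hep/...) instead of absolute paths.
    tar -czf "$archive" -C "$(dirname "$pkg_dir")" "$(basename "$pkg_dir")"
}

# Usage: pack_package /tmp/package-hep /tmp/package-hep.tgz
```

Storing relative paths lets the recipient unpack the archive under any directory.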

(5) How to Repeat one Application within the Package

(5.1) Create one virtual machine with the following configuration

Kernel version: 2.6.18;     Hardware platform: x86_64;     Operating system: GNU/Linux

Note: We can use `centos5_64_bitCentos 5.10 - 64 Bit` provided on the website: https://ndcloudfe.crc.nd.edu//vmTemplates.php
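Once the virtual machine is running, its platform can be compared against this configuration. The following is a minimal sketch using uname; the function name check_platform is hypothetical, and matching the kernel release by prefix is an assumption made here for illustration.

```shell
# Hypothetical helper: compare this machine with an expected platform (bash).
check_platform() {
    # $1: expected machine architecture, $2: expected kernel release prefix
    local arch kernel
    arch=$(uname -m)
    kernel=$(uname -r)
    [ "$arch" = "$1" ] || { echo "architecture is $arch, expected $1"; return 1; }
    case "$kernel" in
        "$2"*) echo "platform matches: $arch, kernel $kernel" ;;
        *) echo "kernel is $kernel, expected $2*"; return 1 ;;
    esac
}

# Usage: check_platform x86_64 2.6.18
```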

(5.2) Download the package used to repeat the HEP experiment

If wget is not yet installed on the virtual machine, first install it through the following command:

% yum -y install wget

Download the package through the following command:

Suppose you put this tar file under /root

% cd /root

% tar xvf package-hep.tar

After this, the path of your package is: /root/package-hep

(5.3) Generate the mountlist file

Generate the mountlist through the following command. The -p parameter must be the path of your package; the -m parameter is the location where the mountlist will be written, which you choose.

% /bin/bash /root/package-hep/repeat-hep.sh -p /root/package-hep -m /root/mountlist
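To give an idea of what a mountlist can contain, here is a sketch. It is not the source of repeat-hep.sh, and it rests on several assumptions: that each mountlist line pairs a path seen by the application with the real location that serves it, that the root directory is redirected into the package, that the package stays visible under its own path, and that common-mountlist holds either one mount point per line or an already paired line. The function name make_mountlist is hypothetical.

```shell
# Hypothetical sketch of generating a mountlist for a package (bash).
make_mountlist() {
    local pkg=$1 out=$2
    {
        # Assumption: redirect the root directory into the package ...
        printf '/ %s\n' "$pkg"
        # ... keep the package itself reachable under its real path ...
        printf '%s %s\n' "$pkg" "$pkg"
        # ... and pass through the mount points left out of the package.
        # A single-field line is mapped onto itself; a paired line is kept.
        awk 'NF == 1 { print $1, $1 } NF >= 2 { print $1, $2 }' \
            "$pkg/common-mountlist"
    } > "$out"
}

# Usage: make_mountlist /root/package-hep /root/mountlist
```

Because the first line depends on where the package was unpacked, the mountlist has to be generated on the target machine instead of being shipped inside the package.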

(5.4) Repeat the experiment with the help of parrot

To do this, cctools must be installed on your machine. Here, I installed cctools into ~/cctools and added its path to $PATH.

#enter into tcsh

% /bin/tcsh

#set the environment variables from the file called `env-setting` inside the package

% source /root/package-hep/env-setting

#repeat the experiment with the help of parrot; -m: the mountlist file you generated in step (5.3); -l: the absolute path of the dynamic loader, made up of the absolute path of the package directory followed by lib64/ld-linux-x86-64.so.2

% ~/cctools/bin/parrot_run -m /root/mountlist -w /afs/crc.nd.edu/user/h/hmeng -l /root/package-hep/lib64/ld-linux-x86-64.so.2 /bin/tcsh ~/script-v4.sh