Two Case Studies in Preserving High Energy Physics Applications

This webpage records our work on data and software preservation from Spring 2014 to Spring 2015. Our work is supported by DASPOS. For more information about our lab, please check The Cooperative Computing Lab.

The Parrot Packaging Tool is based on our work from Fall 2013 to Spring 2014 (here). The key changes are as follows:

A) We merge file dependency tracking and environment variable tracking into a single step by adding the --env-list option to parrot_run.

B) The package generation tool, called parrot_package_create, is reimplemented from scratch in C, with system calls grouped into categories to speed up the packaging procedure. The old version was implemented in bash.

All file-related system calls are divided into two categories: special_syscall and others. special_syscall includes: "open_object", "bind32", "connect32", "bind64", "connect64", "truncate", "link1", "mkalloc", "lsalloc", "whoami", "md5", "copyfile1", "copyfile2". Files involved in a special_syscall get the copy degree fullcopy; files involved in other system calls get the copy degree metadatacopy. The following system calls were fullcopy in the earlier shell implementation but are now metadatacopy: "lstat", "stat", "follow_symlink", "link2", "symlink2", "readlink", "unlink".

Note: this change reduces the size of the TauRoast package from 21GB to 18GB, and cuts the packaging time for TauRoast from about 90 minutes to about 30 minutes.

C) We design a new utility, called parrot_package_run, to make it easier to repeat an application within a package. For the root user, a utility called chroot_package_run can be used to repeat an application within a jail constructed from a preserved package.
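
The copy-degree rule described under (B) can be sketched in a few lines of Python. This is an illustration only, not the C code of parrot_package_create: the names SPECIAL_SYSCALLS, copy_degree, and file_copy_degree are hypothetical, and the rule that a file touched by several system calls takes the strongest degree is an assumption that the text above does not state.

```python
# System calls listed above as special_syscall; their files are copied in full.
SPECIAL_SYSCALLS = frozenset({
    "open_object", "bind32", "connect32", "bind64", "connect64",
    "truncate", "link1", "mkalloc", "lsalloc", "whoami", "md5",
    "copyfile1", "copyfile2",
})


def copy_degree(syscall: str) -> str:
    """Return "fullcopy" for a special syscall and "metadatacopy" otherwise."""
    return "fullcopy" if syscall in SPECIAL_SYSCALLS else "metadatacopy"


def file_copy_degree(syscalls) -> str:
    """Combine the degrees of every syscall recorded for one file.

    Assumption (not stated in the text): one special syscall is enough
    to make the whole file fullcopy.
    """
    if any(copy_degree(s) == "fullcopy" for s in syscalls):
        return "fullcopy"
    return "metadatacopy"
```

For example, copy_degree("stat") gives "metadatacopy" while copy_degree("open_object") gives "fullcopy", which matches the move of "stat" out of the fullcopy group described above.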

Our work illustrates that by combining light-weight virtualization techniques with software delivery mechanisms, complex applications, such as HEP applications, can be captured, invariantly preserved, and practically distributed and re-used.

(1) Case Study 1: TauRoast

(1.1) Overview of TauRoast

Within the ongoing investigation of the Higgs boson at the CMS detector, part of the LHC at CERN, the Higgs production in association with two top quarks allows measuring the Higgs coupling strength to top quarks. As the Higgs boson is too short-lived to be detected itself, it has to be reconstructed from its decay products.

TauRoast searches for cases where the Higgs boson decays to two tau leptons. Since the tau leptons are very short-lived, they are not observed directly, but through the decay products that they generate. So, the analysis must search for detector events that show a signature of decay products compatible with both hadronic tau and top decays. Properties of such events are used to distinguish the events of interest (Higgs decays) from all other events and are also used in further statistical analysis.

For more information on the code and data sources of TauRoast, please check here.

(1.2) Execution Environment of TauRoast

Hardware Architecture: X86_64;     Kernel: Linux 2.6.32;     OS: RedHat 6.6

CPU Cores: 64;     Memory Space: 125GB;     Disk Space: 204GB

(1.3) Create one Self-Contained Package for the Application

To help preserve an application, we created the Parrot Packaging Tool Suite based on Parrot, which is included in CCTools. You can get the source code of CCTools from our GitHub repository (here), or download the appropriate CCTools binaries from our Download webpage (here).

If you are using an older branch of the source code, your branch should be later than commit 3214e873132007762fbedc8d8d2998d7b63054d8.

If you are using the binary version, make sure your CCTools version is >= 4.2.

(1.3.1) Generate a Dependency List for the Application from one Successful Execution

To figure out the underlying file dependencies and execution environment, Parrot can record the names of all files accessed during the execution of a program (the --name-list dependencylist option) and the environment variables of the program (the --env-list envlist option). Whenever a filename is resolved by the Parrot name resolver, it is recorded in the dependencylist file together with the type of the system call that accessed it. The command used to generate the dependency list and environment list of the TauRoast application is as follows:

% parrot_run --name-list namelist --env-list envlist /bin/tcsh ~/script-v4.csh

The source code of script-v4.csh is here.

After this command finishes, the names of all accessed files are recorded in the file called namelist, and all the environment variables are recorded in the file called envlist. Each entry in namelist has the format filename|system-call-type. For example, /usr/bin/ls|stat means that the file /usr/bin/ls was accessed using the stat system call. Each item in envlist has the format <name>=<value>.
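
To make the two record formats concrete, here is a minimal Python sketch that reads entries of the form filename|system-call-type and <name>=<value>. The function names parse_namelist and parse_envlist are hypothetical and are not part of CCTools. How items are delimited inside the real envlist file is not stated here, so parse_envlist takes items that have already been split, and treating malformed entries as skippable is an assumption.

```python
from collections import defaultdict


def parse_namelist(lines):
    """Map each file name to the set of system-call types recorded for it."""
    accesses = defaultdict(set)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Split on the last "|" so a "|" inside a file name stays in the name.
        filename, sep, syscall = line.rpartition("|")
        if not sep:
            continue  # assumption: skip entries without a "|" separator
        accesses[filename].add(syscall)
    return dict(accesses)


def parse_envlist(items):
    """Turn <name>=<value> items into a dict, splitting on the first "="."""
    env = {}
    for raw in items:
        item = raw.rstrip("\n")
        name, sep, value = item.partition("=")
        if sep and name:
            env[name] = value  # the value may itself contain "="
    return env
```

A typical use would be parse_namelist(open("namelist")), which groups repeated accesses to the same file so that each file appears once with every system-call type seen for it.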

For more information on parrot_run, please check here.

(1.3.2) Generate a Package Containing all the Dependencies

Once parrot_run has recorded the accessed files (--name-list) and the environment variables (--env-list), parrot_package_create can generate a package containing all of them. parrot_package_create takes the same --name-list and --env-list parameters as parrot_run. The --package-path parameter specifies the location of the package.

% parrot_package_create --name-list namelist --env-list envlist --package-path /tmp/package

After this command finishes, a package is generated at /tmp/package. The total size of the directory is about 18GB.

For more information on parrot_package_create, please check here.

(1.4) How to Distribute the Package

Currently, there are two ways to ship a package. First, host the package on a website and share the download URL with others. Second, convert the package into a Docker image, push the image to Docker Hub, and share the image ID and name with others.

Note: Before distributing the package to a new place, whether on the same machine or on a different one, first create a tar file. Do not directly copy the whole directory.
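
As a rough illustration of this note, the sketch below packs a package directory into a tarball and unpacks it again with Python's standard tarfile module. The helper names pack_package and unpack_package are hypothetical, and gzip compression is an arbitrary choice. The sketch also assumes that the tarball comes from a trusted source, since it extracts without filtering.

```python
import os
import tarfile


def pack_package(package_dir, tar_path):
    """Write package_dir into a gzip-compressed tarball at tar_path."""
    # Keep only the base name so the archive unpacks into one top-level
    # folder instead of recreating the full original path.
    arcname = os.path.basename(os.path.normpath(package_dir))
    with tarfile.open(tar_path, "w:gz") as tar:
        # tarfile stores symbolic links as links by default
        # (dereference=False) rather than copying their targets.
        tar.add(package_dir, arcname=arcname)
    return tar_path


def unpack_package(tar_path, dest_dir):
    """Extract the tarball under dest_dir and return dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    # Assumption: the archive is trusted. Newer Python versions accept a
    # "filter" argument; "fully_trusted" keeps the classic behavior.
    has_filters = hasattr(tarfile, "fully_trusted_filter")
    kwargs = {"filter": "fully_trusted"} if has_filters else {}
    with tarfile.open(tar_path, "r:*") as tar:
        tar.extractall(dest_dir, **kwargs)
    return dest_dir
```

With the paths used on this page, pack_package("/tmp/package", "/tmp/package.tar.gz") on the source machine and unpack_package("/tmp/package.tar.gz", "/tmp") on the destination would recreate /tmp/package. The tarball should be written outside the package directory so that it does not end up inside itself.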

(1.5) How to Repeat one Application within the Package

Corresponding to the two distribution methods above, there are two ways to repeat an application within the package. If the package was distributed as a tarball, first download the tarball and extract it to a directory such as /tmp/package. Then you can repeat the application using the following command:

% parrot_package_run --package-path /tmp/package /bin/tcsh ~/script-v4.csh

For more information on parrot_package_run, please check here.

If you have root access, you can also turn the package into a jail and run the application within it. We provide a tool called chroot_package_run to do this:

% chroot_package_run --package-path /tmp/package /bin/tcsh ~/script-v4.csh

For more information on chroot_package_run, please check here.

To repeat the application through Docker, please check the Docker documentation: here

(1.6) Another Light-weight Virtualization Packaging Tool: PTU

PTU is designed to create a package of an application by recording all of the binaries, libraries, scripts, data files, and environment variables used by a program. PTU uses the CDE technology to observe system calls, but takes a snapshot of every file at the point of access. In addition to files, PTU records metadata about the execution environment, such as kernel versions, application versions, and dynamic library versions, by using standard Unix commands. PTU also records provenance in the form of a graph that describes how each file is created or consumed by processes within the application. Because PTU is focused solely on the problem of preservation, it can achieve lower overhead than Parrot when remote data access is not a requirement. For instructions on packaging and repeating an application with PTU, please check here.

(2) Case Study 2: Athena