Completed Project: DASPOS: Data and Software Preservation for Open Science

PIs: Michael Hildreth, Robert Gardner, Douglas Thain, Mark Neubauer, Jaroslaw Nabrzyski

The DASPOS project (Data and Software Preservation for Open Science) was a multi-disciplinary effort between six universities and the Fermi and Brookhaven national laboratories, aimed at understanding the problems of knowledge preservation in data-intensive sciences such as experimental particle physics, astrophysics, and genomics. The team included physicists, digital librarians, computer scientists, and other experts from different fields of research. Stated simply, the intellectual goals of the project were: (1) determine what must be saved in order to preserve the different aspects of a complicated, multi-step data analysis for reproducibility and re-use; (2) determine how to save these elements so that they can be archived, searched, and re-used; and (3) demonstrate a prototype preservation system that does this.

The research followed several paths. One aspect focused on how to capture all of the information needed to re-run a given process, including the operating system, the input data, and all external database connections. Several solutions were explored: some based on tracing the system calls of the process to find all of its dependencies, and several based on Linux containers. The relative performance of the different techniques was assessed, with the Linux container approach (embodied, for example, by Docker containers) given a slight edge due to its ease of use and available infrastructure. A sketch of the system-call tracing idea appears below.

A second aspect of the research was to understand how to describe what is being done in a computational step of an analysis. Such a description is necessary if the material is to be searched for and retrieved from an archive, or if another person is to understand what was done and re-use some of its elements. The studies on this aspect of the project produced several new metadata vocabularies, including one that describes a "computational step" in a complex analysis and one that describes a "detector final state" in High Energy Physics (HEP), the first such description to be recorded.

In collaboration with the IT and SIS groups at CERN, we have been involved in building the CERN Analysis Preservation Portal (CAP) and the REANA analysis platform, both of which incorporate DASPOS research and represent the achievement of the original goals of the proposal. CAP allows individual researchers to store a wealth of pertinent information about their analysis, some of it collected automatically from their LHC experiment. Executables, scripts, and data can also be stored; in particular, individual computational steps can be described and captured, currently using container technology. The metadata description used to archive the information is based on the DASPOS work. The REANA back-end can re-assemble complete analysis workflows from the archived information and re-instantiate them using workflow engines implemented by the DASPOS and CERN teams. The required infrastructure is quite generic and includes many commodity elements that can orchestrate container-based applications on distributed high-throughput computing systems. We have demonstrated the functionality of this system using sample analyses from the LHCb, ATLAS, and CMS experiments at the LHC: analyses preserved in the CAP portal can be re-run inside the REANA infrastructure and produce results identical to the original processing.
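As a concrete illustration of the system-call tracing approach, the following Python sketch runs a command under strace and records the files it successfully opens. This is a minimal approximation of the idea, assuming a Linux host with strace installed; it is not the actual DASPOS tooling, and it deliberately ignores the many other system calls that can touch the filesystem.

    import re
    import subprocess
    import sys

    # Match successful open()/openat() calls in strace output, e.g.
    #   openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY) = 3
    OPEN_RE = re.compile(r'open(?:at)?\(.*"([^"]+)".*=\s*(-?\d+)')

    def trace_dependencies(command):
        """Return the set of paths successfully opened while running command."""
        result = subprocess.run(
            ["strace", "-f", "-e", "trace=open,openat"] + command,
            stderr=subprocess.PIPE,  # strace writes its trace to stderr
            text=True,
        )
        deps = set()
        for line in result.stderr.splitlines():
            m = OPEN_RE.search(line)
            if m and int(m.group(2)) >= 0:  # keep only successful opens
                deps.add(m.group(1))
        return deps

    if __name__ == "__main__":
        # Usage: python trace_deps.py <program> [args...]
        for path in sorted(trace_dependencies(sys.argv[1:])):
            print(path)

In practice a capture tool must also record the operating system image, environment variables, and network endpoints, which is part of why the container-based approach proved easier to work with.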
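To make the idea of a metadata vocabulary concrete, here is a hypothetical record describing a single computational step. The field names are assumptions chosen for readability; they illustrate the kind of information such a vocabulary captures (environment, inputs, outputs, command), not the published DASPOS schema.

    import json

    # Hypothetical "computational step" record; field names are illustrative,
    # not the actual DASPOS vocabulary.
    step = {
        "name": "event-selection",
        "description": "Select candidate events from reconstructed data",
        "environment": {
            "container_image": "python:3.8",  # hypothetical pinned image
            "operating_system": "linux",
        },
        "inputs": ["reco_events.root"],
        "outputs": ["selected_events.root"],
        "command": "python select_events.py reco_events.root selected_events.root",
    }

    print(json.dumps(step, indent=2))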
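The re-instantiation step can be sketched in the same spirit: given an ordered list of archived steps, each naming a container image and a command, a driver replays them in sequence. The following toy loop shows the concept using Docker directly; it is not the REANA engine, and the images and commands are hypothetical.

    import os
    import subprocess

    # Hypothetical archived steps: each names a container image and a command.
    steps = [
        {"image": "python:3.8", "command": "python select_events.py"},
        {"image": "python:3.8", "command": "python fit_and_plot.py"},
    ]

    def replay(steps, workdir=None):
        """Re-run each archived step inside its container, in order."""
        workdir = os.path.abspath(workdir or os.getcwd())  # docker needs absolute paths
        for i, step in enumerate(steps, start=1):
            print(f"step {i}: {step['command']} ({step['image']})")
            subprocess.run(
                ["docker", "run", "--rm",
                 "-v", f"{workdir}:/work", "-w", "/work",
                 step["image"], "sh", "-c", step["command"]],
                check=True,  # stop if a step fails
            )

    if __name__ == "__main__":
        replay(steps)

With pinned images and identical inputs, replaying the same steps should reproduce the original outputs, which is the property the CAP/REANA demonstrations verified on the LHC sample analyses.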
