PRUNE: The Preserving Run Environment | The Cooperative Computing Lab

Prune is designed to preserve the evolution of scientific workflows so that other researchers can easily verify or extend them. Workflows are also executed through Prune, so that it has all the information necessary to restore a workflow as it was at any point in time, such as when its results were used in a publication.

Some other preservation solutions either force the user to build the workflow from specific low-level operations or automatically preserve those low-level operations. Prune gives the user full control over the granularity at which operations are defined in the workflow. This makes it much easier for a human to understand the workflow after the fact.

Despite this flexibility, Prune enables preservation by providing a framework in which the researcher explicitly states, in advance, all data, software, and hardware dependencies for any given operation in the workflow. As a workflow executes, intermediate data can be deleted with the knowledge that it can be regenerated later if needed, which allows Prune to execute workflows with reduced storage requirements. Even final published results can be deleted as the workflow evolves, because the data required to regenerate those results is retained. All of this can happen in the background, with no risk of accidentally deleting data that might be needed in the future. Prune assumes that the data about the workflow and the software used to execute it consume much less space than the data generated at each stage of the workflow. If this is the case, Prune could store a workflow as it evolves over many years.
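The idea of deleting derived data and regenerating it on demand can be illustrated with a minimal Python sketch. This is not Prune's actual API: the `Repository` class, its methods, and the environment label `hypothetical-env-v1` are all assumptions made for illustration. The sketch keeps source data and the recipe for each operation permanently, and treats every derived result as a cache entry that is safe to evict.

```python
import hashlib
import json


class Repository:
    """Hypothetical store: sources and recipes are kept, derived data may be evicted."""

    def __init__(self):
        self.sources = {}  # original input data, always retained
        self.recipes = {}  # how each derived item is produced, always retained
        self.cache = {}    # derived data, safe to delete at any time

    def add_source(self, data):
        key = hashlib.sha1(data).hexdigest()
        self.sources[key] = data
        return key

    def add_operation(self, func, inputs, environment):
        # The dependencies are declared up front: function, inputs, environment.
        spec = json.dumps(
            {"func": func.__name__, "inputs": inputs, "env": environment},
            sort_keys=True,
        )
        key = hashlib.sha1(spec.encode("utf-8")).hexdigest()
        self.recipes[key] = (func, inputs)
        return key

    def get(self, key):
        if key in self.sources:
            return self.sources[key]
        if key not in self.cache:
            # Regenerate on demand, recursively rebuilding any evicted inputs.
            func, inputs = self.recipes[key]
            self.cache[key] = func(*[self.get(i) for i in inputs])
        return self.cache[key]

    def evict(self, key):
        self.cache.pop(key, None)


def sort_lines(data):
    return b"\n".join(sorted(data.splitlines())) + b"\n"


def first_line(data):
    return data.splitlines()[0]


repo = Repository()
env = "hypothetical-env-v1"  # stands in for a declared software/hardware environment
raw = repo.add_source(b"pear\napple\nfig\n")
sorted_id = repo.add_operation(sort_lines, [raw], env)
result_id = repo.add_operation(first_line, [sorted_id], env)

published = repo.get(result_id)

# Both the intermediate data and the final result can be dropped...
repo.evict(sorted_id)
repo.evict(result_id)

# ...and rebuilt later from the retained source data and recipes.
regenerated = repo.get(result_id)
```

In this sketch only `sources` and `recipes` must survive long term, which mirrors the assumption that workflow descriptions are much smaller than the data they generate.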

Both content-based and derivation-based identifiers are stored in the Prune repository. They are used to detect and prevent duplicate execution and storage. This can be done in an ad hoc, distributed manner across repositories, and in some cases the identifiers can even detect logical equivalence when files differ bitwise because of timestamps or intentional randomness based on statistical models. Additional, human-readable names can be assigned in the Python script that describes the Prune operations making up a workflow.
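The difference between the two kinds of identifier can be sketched as follows. The function names, the choice of SHA-1, and the example command are assumptions for illustration, not Prune's actual scheme. A content-based identifier hashes the bytes of a file, while a derivation-based identifier hashes the description of how the file was produced.

```python
import hashlib
import json


def content_id(data):
    """Identifier derived only from the bytes of a file."""
    return hashlib.sha1(data).hexdigest()


def derivation_id(command, input_ids, environment_id):
    """Identifier derived from how a file is produced, not from its bytes."""
    recipe = json.dumps(
        {"cmd": command, "inputs": input_ids, "env": environment_id},
        sort_keys=True,
    )
    return hashlib.sha1(recipe.encode("utf-8")).hexdigest()


raw_id = content_id(b"3\n1\n2\n")
env_id = content_id(b"hypothetical environment image")

# Two runs of the same operation whose outputs embed different timestamps.
out_a = b"# generated 2016-01-01\n1\n2\n3\n"
out_b = b"# generated 2016-06-30\n1\n2\n3\n"

run_a = derivation_id("sort input > output", [raw_id], env_id)
run_b = derivation_id("sort input > output", [raw_id], env_id)

bitwise_equal = content_id(out_a) == content_id(out_b)  # the bytes differ
logically_equal = run_a == run_b                        # the recipe is identical
```

Here the content identifiers disagree because of the embedded timestamp, while the derivation identifiers match, so a repository that consults both could recognize the second run as a duplicate before executing or storing it.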

Related Publications

PRUNE: A Preserving Run Environment for Reproducible Computing

Peter Ivie and Douglas Thain

In IEEE Conference on e-Science, 2016

doi: 10.1109/eScience.2016.7870886

@inproceedings{prune-escience-2016,
  author = {Ivie, Peter and Thain, Douglas},
  title = {{PRUNE: A Preserving Run Environment for Reproducible Computing}},
  booktitle = {{IEEE Conference on e-Science}},
  year = {2016},
  note = {{doi: 10.1109/eScience.2016.7870886}},
  cclpaperid = {930},
  keywords = {workqueue, prune, daspos},
}

Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?

Douglas Thain, Peter Ivie, and Haiyan Meng

In 12th International Conference on Digital Preservation (iPres), 2015

doi: 10.7274/R0CZ353M

@inproceedings{techniques-ipres-2015,
  author = {Thain, Douglas and Ivie, Peter and Meng, Haiyan},
  title = {{Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?}},
  booktitle = {{12th International Conference on Digital Preservation (iPres)}},
  year = {2015},
  note = {{doi: 10.7274/R0CZ353M}},
  cclpaperid = {921},
  keywords = {parrot, prune, umbrella, daspos},
}