All Software | The Cooperative Computing Lab

TaskVine is a task scheduler for building large-scale, data-intensive, dynamic workflows that run on HPC clusters, GPU clusters, and commercial clouds. As tasks access external data sources and produce their own outputs, more and more data is pulled into local storage on workers. This data is used to accelerate future tasks and avoid recomputing existing results. Data gradually grows "like a vine" through the cluster. TaskVine is our third-generation workflow system, built on our twenty years of experience creating scalable applications in fields such as high energy physics, bioinformatics, molecular dynamics, and machine learning.
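
The reuse idea described above can be illustrated with a small sketch. This is not the TaskVine API: the WorkerCache class, its methods, and the keying scheme are hypothetical, and the "local storage" here is just an in-memory dictionary. It only shows how keeping earlier results next to the worker lets a repeated task skip recomputation.

```python
import hashlib
import json


class WorkerCache:
    """Toy stand-in for a worker's local storage (hypothetical class)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(name, inputs):
        # Identify a task by its name and inputs (an assumed keying scheme).
        blob = json.dumps([name, inputs], sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def run(self, name, func, inputs):
        k = self.key(name, inputs)
        if k in self._store:
            # The result is already held locally, so reuse it.
            self.hits += 1
            return self._store[k]
        # First time this task is seen: compute and keep the result.
        self.misses += 1
        result = func(*inputs)
        self._store[k] = result
        return result


cache = WorkerCache()
first = cache.run("sum", lambda *xs: sum(xs), [1, 2, 3])
second = cache.run("sum", lambda *xs: sum(xs), [1, 2, 3])
```

In this sketch the second call is served from the cache, so the task function runs only once.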

The Floability Project is an NSF-funded research project to enable the rapid and portable deployment of notebooks expressing complex scientific workflows across a wide range of cyberinfrastructure. The key technical challenge is that workflows are incomplete: the code by itself cannot be moved between facilities without accurately capturing the software dependencies, required datasets, and capabilities of the underlying cluster hardware. Our research team at the University of Notre Dame, the University of Missouri-Columbia, and the University of Illinois is developing solutions to discover, express, and deploy the complete set of dependencies needed for complex scientific workflows.

Work Queue is a framework for building large distributed applications that span thousands of machines drawn from clusters, clouds, and grids. Work Queue applications are written in Python, Perl, or C using a simple API that allows users to define tasks, submit them to the queue, and wait for completion. Tasks are executed by a general worker process that can run on any available machine. Each worker calls home to the manager process, arranges for data transfer, and executes the tasks. A wide variety of scheduling and resource management features are provided to enable the efficient use of large fleets of multicore servers. The system tolerates many kinds of failures, allowing for dynamically scalable and robust applications.
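
The define, submit, and wait pattern described above can be sketched with the Python standard library. This is not the Work Queue API: the Task class and run_all function are hypothetical names, and a local thread pool stands in for remote workers. It is only meant to show the shape of an application written in this style.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Task:
    """A unit of work: a function plus its arguments (hypothetical type)."""
    func: Callable[..., Any]
    args: tuple = ()
    result: Any = None
    error: Optional[Exception] = None


def run_all(tasks, workers=4):
    """Submit every task, then collect each one as it completes."""
    done = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit phase: hand every task to the pool.
        pending = {pool.submit(t.func, *t.args): t for t in tasks}
        # Wait phase: tasks come back in completion order, not submit order.
        for future in as_completed(pending):
            task = pending[future]
            try:
                task.result = future.result()
            except Exception as exc:
                # Record the failure instead of aborting the whole run.
                task.error = exc
            done.append(task)
    return done


# Define tasks: here, squaring a few small integers.
tasks = [Task(pow, (n, 2)) for n in range(5)]
finished = run_all(tasks)
squares = sorted(t.result for t in finished)
```

Recording a failure on the task rather than raising it mirrors, in a very reduced form, the idea that one failed task should not bring down the application.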

Makeflow is a production-ready workflow system for executing large, complex scientific applications on clusters, clouds, and grids. Its language is similar to Make, allowing users to easily define workflows as directed acyclic graphs (DAGs) of tasks, from simple chains to thousands of jobs. Makeflow is portable across local machines, public clouds, batch systems, and container environments, enabling seamless migration between platforms. It is highly scalable and fault-tolerant, capable of running millions of jobs for extended periods, and provides analysis tools for monitoring and visualizing workflow performance. Makeflow is widely used in fields such as data mining, physics, image processing, and bioinformatics.
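
To make the DAG idea concrete, here is a minimal sketch of how Make-style rules imply an execution order. It is not Makeflow itself or its file format: the rules dictionary, file names, and commands are hypothetical, and the ordering comes from Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical Make-style rules: target -> (sources, command).
rules = {
    "out.txt": (["a.txt", "b.txt"], "cat a.txt b.txt > out.txt"),
    "a.txt": (["input.dat"], "./step_a input.dat > a.txt"),
    "b.txt": (["input.dat"], "./step_b input.dat > b.txt"),
}


def execution_order(rules):
    """Return targets in an order that respects their dependencies."""
    # Each target depends on its sources; graphlib orders predecessors first.
    graph = {target: set(sources) for target, (sources, _) in rules.items()}
    order = TopologicalSorter(graph).static_order()
    # Drop plain input files, which have no rule to run.
    return [name for name in order if name in rules]


order = execution_order(rules)
for target in order:
    print(target, "<-", rules[target][1])
```

In this example a.txt and b.txt do not depend on each other, so a workflow system is free to run them in parallel; only out.txt must wait for both. A cyclic set of rules would make TopologicalSorter raise graphlib.CycleError, which is why the graph must be acyclic.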

JX (JSON Expressions) is an expression language for unstructured data. Adding to the standard JSON data description language, it provides operators, variables, functions, list comprehensions, and other conveniences to generate and query complex documents. JX is used throughout the CCTools to describe and query data.
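
As a rough analogy in Python, not JX syntax, the sketch below generates a JSON document from variables and a list comprehension instead of writing every entry by hand. The variable names, the "rules" layout, and the commands are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical parameters, standing in for variables in an expression language.
n_jobs = 3
prefix = "sim"

# A list comprehension expands one template into several similar entries.
document = {
    "rules": [
        {
            "command": f"./{prefix} --seed {i} > out.{i}.txt",
            "outputs": [f"out.{i}.txt"],
        }
        for i in range(n_jobs)
    ]
}

# The evaluated result is ordinary JSON.
text = json.dumps(document, indent=2)
```

Changing n_jobs regenerates the whole document, which is the kind of convenience that plain JSON, having no variables or operators, cannot offer.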

The Resource Monitor (RM) is used to accurately capture the resource consumption (CPU, RAM, I/O, disk, GPU, etc.) of applications running in distributed systems. Production applications are typically not single processes, but complex assemblies of scripts, libraries, and processes written in multiple languages. The resource monitor tracks all components accurately and provides the enforcement needed to execute applications reliably at scale.

Parrot is a transparent user-level virtual filesystem that allows any ordinary program to be attached to many different remote storage services. Parrot captures the system calls (open, read, write, stat, etc.) of an application through the ptrace interface, and redirects them to remote services such as HDFS, iRODS, Chirp, and FTP. This allows one to construct custom distributed filesystems on clusters without requiring special privileges.

Chirp is a personal user-level distributed filesystem that can be used to export existing data into distributed systems. Chirp enables unprivileged users to share space securely, efficiently, and conveniently. When combined with Parrot, Chirp allows users to create custom wide-area distributed filesystems that span high performance computing clusters.

Research Prototypes

We have also developed a number of research software prototypes that are released as open source:

PRUNE	The Preserving Run Environment for reproducible computing.
Umbrella	A configuration language for generating reproducible execution environments.
Confuga	An active storage cluster filesystem for scientific workflows.
AWE	The Accelerated Weighted Ensemble is a system for large-scale molecular dynamics.
SAND	The Scalable Assembler at Notre Dame (SAND) is a set of modules that augment the Celera genome assembler.
AllPairs	A specialized framework for running massive-scale pairwise comparisons found in machine learning, genomics, and biometrics.
Wavefront	A specialized framework for running very large dynamic programming problems found in game theory, economics, and bioinformatics.
FTSH	The Fault Tolerant Shell.
AllocFS	A kernel-level filesystem with hierarchical space allocation.
SubID	A user-level implementation of hierarchical user identities.