Recent News in the CCL

CCL Home

Research

Software

Community

Operations

Recent News in the CCL

TaskVine Insights - Storage Management: Depth-Aware Pruning

Modern scientific workflows often span tens or hundreds of thousands of tasks, forming deep DAGs (directed acyclic graphs) that handle large volumes of intermediate data. The large number of tasks primarily results from ever-growing experimental datasets and their partitioning into various data chunks. Each data chunk is applied to a subgraph to form a branch, and different branches typically have an identical graph structure. Collectively, these form a gigantic computing graph. This graph is then fed into a WMS (workflow management system) which manages task dependencies, distributes the tasks across a HPC (high performance computing) cluster, and retrieves outputs when they are available. Below is the general architecture of the data partitioning and workflow computation process.

TaskVine serves as the core WMS component in such an architecture. The unique feature of TaskVine lies in its advantage of leveraging NLS (node-local storage) to alleviate I/O burdens toward the centralized PFS (parallel file system). This allows all intermediate data (or part of it) to be stored and accessed on individual workers' local disks, and peer-to-peer transfers may be performed on demand.

However, one of the challenges of using NLS is that the per-node storage capacity is limited. When requesting workers in the campus-held HPC cluster at the University of Notre Dame, the typical served worker has about 500 GB, which is much less than that of the PFS that is at the scale of hundreds of TB. Below is an example where intermediate data are generated over the course of computing the graph and are accumulated on individual workers.

On the one hand, we must have a way to promptly delete stale data that will not be needed in the future to free up space to avoid disk exhaustion on individual workers.

In TaskVine, we have the option to enable aggressive pruning for a workflow computation to save space. This allows us to promptly identify which files become stale and are safe to be detached from the system. Once recognized, the TaskVine manager directs all workers holding the file replicas to remove that replica from their own cache, and downstream tasks will no longer declare the pruned data as their inputs. This option is called aggressive pruning because no data redundancy is retained.

For example, the following figure shows the aggressive pruning for a given graph. File a is pruned when all the consumer tasks, namely Task 2, 3, and 4, have completed and their outputs have been updated. File b is pruned when the only consumer Task 6 has completed and the corresponding output File h has been created. At the time of T, four previously generated files have been pruned from the system, and the other four files are retained because there are some tasks needing them as inputs.

However, things become complicated when we consider worker failures at runtime. In an HPC environment, there is no guarantee that your allocated workers can sustain until your computation finishes, and worker failures or evictions are not rare. When workers exit at runtime, their carried data are lost as well. If pruning is too aggressive, a single worker eviction may force a long-chain recomputation.

Below is an example of this behavior. At the time of T, Worker 2 is suddenly evicted. Task 9 and Task 10 are interrupted and collected back to the manager. Meanwhile, the intermediates stored on Worker 2, namely File e, f, g, and h, are also lost. Some of them have additional replicas on other workers, but File f does not, so it is completely lost because of the eviction. Due to the data loss, Task 9 cannot run until its input File f is recreated. To regenerate File f, we must rerun Task 7. However, Task 7 cannot run because its input File c has been pruned. Thus, we are forced to work all the way back to the top of the graph to regenerate the lost file. Worse yet, we may ultimately end up having the longest path in the graph recomputed, prolonging the graph makespan and undermining the scientific outcome.

So, to reserve certain redundancies to minimize recomputation in response to worker failures, TaskVine provides the option to tune the depth of ancestors for each file before it can be pruned.

In the implementation, pruning is triggered at the moment a task finishes, when its outputs are checked for eligibility to be discarded. Each file carries a pruning depth parameter that specifies how many generations of consumer tasks must finish before the file can be pruned. Depth-1 pruning is equivalent to aggressive pruning, while depth-k pruning (k is greater than or equal to 2) retains the file until every output of its consumer tasks has itself satisfied the depth-(k-1) criterion. Formally, the pruning time is defined as: P(f) equals the minimum of P-sub-k(g) for all g that are outputs of f's consumers.

where Cons(F) denotes the consumer tasks of file F and Out(T) the outputs of task T. This recursive evaluation avoids expensive global scans by localizing pruning decisions to task completion events, with overhead proportional only to the chosen depth. As k increases, more ancestor files are preserved, providing redundancy and shrinking recovery sub-DAGs, whereas smaller k values reduce storage usage at the expense of higher recovery cost.

The figure below shows the improvement after applying depth-2 pruning, where we are able to preserve more data redundancy at runtime. At time T, Worker 2 is evicted, File f is completely lost, and we must recover it in order to rerun Task 9. Thanks to depth-2 pruning, File c can be found on another worker, so we can restart from there. This way, we avoid the need to go all the way back, though at the risk of consuming more local storage on individual workers.

Depth-aware pruning can work in conjunction with two other fault tolerance mechanisms in TaskVine: peer replication, in which every intermediate file is replicated across multiple workers, and checkpointing, in which a portion of intermediates are written into the PFS for persistent storage. Subsequent articles will cover these topics to help readers better understand the underlying tradeoffs and know how to select the appropriate parameters to suit their own needs.

TaskVine users can explore the manual to see how to adjust these parameters to suit their needs for the tradeoff between storage efficiency and fault tolerance. For depth-aware pruning, we recommend using aggressive pruning (depth-1 pruning) as the general and default setting to save storage space to the largest extent, and increase the depth parameter when worker failures are not uncommon and additional data redundancy is needed.

Tue, 02 Dec 2025 19:52:00 +0000

Your First Distributed Workflow on ACCESS CI: A Grad Student’s Checklist with TaskVine

What This Guide Is (and Isn’t)

If you’re a student who wants to run real distributed computing experiments—without fighting half a dozen HPC manuals—you’re in the right place. This post is a quick, practical checklist to help you:

get an ACCESS CI allocation,
choose the right resources (Purdue Anvil, Stampede3, etc.),
log in for the first time, and
run a minimal TaskVine workflow across real HPC nodes.

This is not a replacement for the official docs. ACCESS CI, Anvil, Stampede3, and TaskVine all have excellent detailed documentation. You’ll find direct links throughout this guide.

Think of this post as your “start here” map—a student-friendly path from I’ve never used ACCESS to I ran my distributed workflow on an NSF funded HPC machine.

What Is ACCESS CI

ACCESS CI (Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support) is a nationwide, NSF-funded program that gives U.S.-based researchers free computing and storage resources. If you’re doing academic research—even as a first-year grad student or undergrad—you can request allocations and run your workflows on top-tier HPC systems.

When you request an allocation, you create an ACCESS project and choose a project type:

Explore — Small, quick-start allocations for testing resources or class projects.
Discover — Moderate allocations for active research or larger coursework.
Accelerate — Mid-scale allocations for sustained, collaborative research.
Maximize — Large-scale, high-impact research requiring major resources.

Most students start with Discover or Explore, and approvals usually take only a few days.

ACCESS CI vs OSPool (super short version)

OSPool is another free national computing service—part of the OSG/OSDF ecosystem—that provides high-throughput compute on opportunistic resources. You log in to a single access point and submit your workflow, and OSPool automatically distributes your tasks to whatever compute resources are available across the country.

ACCESS CI, in contrast, uses an allocation and credit-based model. You request an allocation, receive credits, and then “spend” those credits on specific HPC systems (like Anvil or Stampede3). You get scheduled jobs, predictable queue behavior, and access to fixed hardware configurations (e.g., large-memory nodes, GPUs, high-speed filesystems).

In short:

OSPool is great for workloads that are embarrassingly parallel, tolerant of variable performance, and don’t require specific node types.
ACCESS CI is great when you need predictable resources, specific hardware (like GPUs or large-memory nodes), or access to a particular HPC environment.

Checklist Overview (The Whole Journey at a Glance)

This is the full path we’ll walk through in the rest of the post:

Create an ACCESS CI account
Find the resources you need for your project
Prepare and submit your request
Pick a site (exchange credits)
Set up SSH keys and log in to the sites
Install TaskVine
Run your first TaskVine workflow
Debugging & Next Steps

Where these steps are documented

Steps 1–4 are well-documented on the ACCESS CI site:
👉 https://allocations.access-ci.org/get-your-first-project
Step 5 (logging in, SSH, MFA, environment modules) varies by site and is documented in each system’s user guide (e.g., Anvil, Stampede3).
Steps 6 and 7 (installing TaskVine and running a TaskVine workflow) are covered in the official TaskVine manual.

This guide brings all of them together in one place, filling in the gaps so you can follow a smooth end-to-end workflow—from getting an allocation to running your very first distributed TaskVine program on an NSF HPC machine.

Step 1 — Create an ACCESS CI Account

There are two ways to register for an ACCESS account, but the fastest option is to log in with your existing university identity. Most U.S. universities are supported—just choose your campus from the dropdown, click Log On, and follow the email-verification step.

If your university isn’t listed, you can register by creating an ACCESS-specific username, password, and Duo MFA.

Full instructions are here:

👉 https://operations.access-ci.org/identity/new-user

Step 2 — Choosing Resources for Your Project

ACCESS CI offers many HPC systems—CPUs, GPUs, large-memory nodes, and cloud resources. The easiest way to browse them is the ACCESS Resource Catalog:

👉 https://allocations.access-ci.org/resources

How to pick the right resource

Choose the simplest site that fits your needs.
For most introductory TaskVine workflows, standard CPU-only systems are sufficient.
Consider credit cost.
Each site charges different ACCESS credits per CPU-hour or GPU-hour.
If you don’t need GPUs or large-memory nodes, pick a CPU-only site to make your credits last longer.

Match the site to your workflow.
- CPU-only workloads → pick sites that offer regular CPU nodes (no GPUs needed).
- GPU workloads → choose GPU-capable systems.
- High-memory jobs → pick sites offering LM/XL nodes.
Need help choosing?
You can ask the ACCESS team for personalized resource suggestions here:
👉 https://ara.access-ci.org/

For this guide, we’ll use Purdue Anvil and TACC Stampede3 because they are beginner-friendly and well-documented.

Step 3 — Prepare and Submit Your Request

To submit an Explore request, you only need a few pieces of information and some simple PDFs. The full instructions are here:

👉 https://allocations.access-ci.org/get-your-first-project

What to prepare

A project title
A short overview describing what you plan to run and why you need ACCESS
Your own information (you are the PI for this ACCESS project)
Optional: grant number if your work is funded
Two PDF files:
- Your CV or résumé (max 3 pages)
- If you are a graduate student PI, a brief advisor support letter (PDF)
- (For class projects, upload a syllabus instead of a letter)
Available Resources

When the form asks which resources you want, simply check “ACCESS Credits.”

Submitting

Once everything is ready:

Go to 👉 https://allocations.access-ci.org/
Click Request New Project
Select Explore ACCESS
Fill out the form, upload your PDFs, and click Submit

Explore requests are usually approved by the next business day, so keep an eye on your email.

Step 4 — Pick a Site (Exchange Credits)

Once your Explore project is approved, you’ll see it listed in the My Projects section of the Allocations Portal.

How to exchange credits

Open your project and go to Credits + Resources.
Click Add a Resource.
In the dropdown, search for the site you selected in Step 2 (e.g., a CPU-only site).
Enter how many SUs, Node Hours, or GPU Hours you want to allocate.

What this means

Exchanging credits “activates” your access to that cluster.
After approval, you’ll be able to log in and submit jobs there.

How long it takes

Exchange requests are usually approved within a few days (up to one week).

Step 5 — Set Up SSH Keys and Log In

Each ACCESS site has its own login process, but the most popular systems—Anvil and Stampede3—are very well documented. On your Credits + Resources page, you’ll see a small book icon next to every site. Clicking it takes you directly to that site’s official documentation.

For convenience, here are the two sites we are using in this guide:

Anvil Documentation:

👉 https://www.rcac.purdue.edu/anvil#docs

Stampede3 Documentation:

👉 https://docs.tacc.utexas.edu/hpc/stampede3/

You should always follow the official documentation, because login steps (especially MFA) may change after this blog is written.
Here is a brief summary of what to expect:

Anvil (Purdue) — Quick Summary

Visit Anvil OnDemand:
👉 https://ondemand.anvil.rcac.purdue.edu
Log in and set up ACCESS Duo MFA.
Use the OnDemand interface to upload your SSH public key.

Then you can log in via SSH:


ssh -l x-ACCESSusername anvil.rcac.purdue.edu

Important: Your Anvil username

Your Anvil username starts with x-, e.g., x-mislam5.
This is not the same as your ACCESS username.
You can see your exact Anvil username on your Credits + Resources page next to the Anvil resource entry.

Notes from the official docs

Anvil does not accept passwords — only SSH keys.
Your ACCESS password will not work for SSH.
SSH keys must be created or uploaded through the OnDemand interface.

Stampede3 (TACC) — Quick Summary

Log in to the TACC Accounts Portal:
https://accounts.tacc.utexas.edu/login?redirect_url=profile
Accept the usage agreement and set up TACC’s MFA.

Then SSH into the system:


ssh myusername@stampede3.tacc.utexas.edu

You’ll be prompted for your password and MFA token.

Full instructions:
👉 https://docs.tacc.utexas.edu/hpc/stampede3/#access

Step 6 — Install TaskVine

Once you can log in to your HPC site, the next step is installing TaskVine.
TaskVine is part of the CCTools suite and is easiest to install through Conda.

Most HPC sites provide their own Conda modules, but these are often outdated.
We recommend installing a fresh Miniforge in your home directory and using that environment for TaskVine.

Official installation guide:
👉 https://cctools.readthedocs.io/en/stable/install/

Here’s the quick version:

Install CCTools (TaskVine) with Conda


# Create your environment (run once)
conda create -n cctools-env -y -c conda-forge --strict-channel-priority python ndcctools

# Activate the environment (run every time)
conda activate cctools-env

After this, TaskVine is ready to use on your HPC site.

Step 7 — Run Your First TaskVine Workflow

Now that TaskVine is installed, let’s run your very first distributed workflow.
We’ll use the TaskVine Quickstart example from the official docs, with one small change:

Instead of choosing a fixed port, we set port=0, which lets TaskVine automatically pick an available port.
We give the manager a name (tv-quickstart-blog) so vine_factory can discover it without you typing hostnames or ports.

Create a directory on a shared filesystem (e.g., $HOME on Anvil or /work/... on Stampede3).
Inside it, create a file named quickstart.py and paste:

# quickstart.py

import ndcctools.taskvine as vine

# Create a new manager
m = vine.Manager(name ="tv-quickstart-blog", port=0)
print(f"Listening on port {m.port}")

# Declare a common input file to be shared by multiple tasks.
f = m.declare_url("https://www.gutenberg.org/cache/epub/2600/pg2600.txt", cache="workflow")

# Submit several tasks using that file.
print("Submitting tasks...")
for keyword in ['needle', 'house', 'water']:
    task = vine.Task(f"grep {keyword} warandpeace.txt | wc")
    task.add_input(f, "warandpeace.txt")
    task.set_cores(1)
    m.submit(task)

# As they complete, display the results:
print("Waiting for tasks to complete...")
while not m.empty():
    task = m.wait(5)
    if task:
        print(f"Task {task.id} completed with result {task.output}")

print("All tasks done.")

Run the manager

Open one terminal (or one tmux pane) and activate your environment:


conda activate cctools-env
python quickstart.py

Leave this running.
The manager will start, pick an open port, and publish itself to the TaskVine catalog.

Start workers using vine_factory
Instead of running workers manually, we use vine_factory to submit workers through Slurm.

Basic command:

vine_factory -Tslurm --manager-name tv-quickstart-blog


However, each cluster usually requires a few Slurm batch options, which you pass using --batch-options.

Stampede3 example

Stampede3 generally expects a partition and a wall-clock limit:

vine_factory -Tslurm --manager-name tv-quickstart-blog \
  --batch-options "-p icx -t 01:00:00"


More details:
👉 https://docs.tacc.utexas.edu/hpc/stampede3/#running

Anvil example

Anvil requires selecting a partition (queue). For example:

vine_factory -Tslurm --manager-name tv-quickstart-blog \
  --batch-options "-p wide"


More details:
👉 https://www.rcac.purdue.edu/knowledge/anvil/run/partitions

What happens next


vine_factory submits Slurm jobs that run Vine workers on compute nodes.


When those workers start, they automatically connect to the manager using the name tv-quickstart-blog (through the TaskVine catalog).


The manager sends tasks to each worker as they connect.

In the first terminal, you will eventually see the results. Each task is computed on a node in the cluster. Depending on the queue and resource availability, it may take some time for workers to start and complete the tasks.

Debugging & Next Steps
If you’ve made it this far and your workflow ran successfully — congratulations!

You’ve just deployed a real distributed TaskVine workflow on an ACCESS HPC cluster.

But if things didn’t work on the first try, here are the most common issues and how to fix them.
Things That Can Go Wrong (and How to Fix Them)
1. Workers are running, but no tasks ever complete

You checked squeue and see the Vine worker jobs are active, but the manager terminal never shows progress.
This almost always means:

the compute node cannot connect back to the manager’s TCP port.
To confirm, run a simple TCP test Slurm job (outside TaskVine) that tries to connect from a compute node → login node. If that fails, workers will never reach the manager.


This problem depends on site network rules, so test it early.
2. Missing or incorrect Slurm batch options
Every cluster has slightly different expectations.

Check the site’s documentation for required options like:


partition (-p)


wall time (-t)


constraints


account or project flags




If workers never start or get rejected, this is usually the reason.
3. Vine Factory should use a shared filesystem
The worker scripts need to access some files provided by vine_factory.

To keep things simple and reliable, you should either:


run vine_factory from a shared filesystem (one that is mounted on the compute nodes), or


set --scratch-dir to a directory on a shared filesystem.




Check your site’s documentation to see which partitions/directories are visible from compute nodes. For example, some clusters share $HOME, while others require you to use a /work or /project directory.
What’s Next
If your quickstart worked—great! You now have a full TaskVine workflow running on an ACCESS CI cluster.
Here are a few things you can try next:


Read the TaskVine manual to explore caching, retries, and performance tuning


Increase the number of workers and test scaling


Try larger or real research datasets


Experiment with additional ACCESS CI sites





TaskVine grows with your research, and you now have the foundation to scale up your workflows with confidence.

Tue, 25 Nov 2025 21:27:00 +0000

Scaling SADE (Safety Aware Drone Ecosystem): A Hybrid UAV Simulation System for High-Fidelity Research

Autonomous drones are moving into increasingly complex, real-world environments where safety, compliance, and reliability have to be built in from the start. That's the motivation behind SADE—the Safety Aware Drone Ecosystem. SADE brings together physics-accurate simulation and photorealistic rendering, paired with PX4-based autonomy, to give researchers and developers a realistic, repeatable space to design, test, and validate UAV behavior before it ever leaves the lab.

Under the hood, SADE combines Gazebo for flight dynamics and physics with Unreal Engine 5 for immersive visuals, augmented by Cesium for globe-scale terrain that closely mirrors real geography. PX4 anchors flight control, and Mavlink backed by a "Mavlink Router" keeps telemetry and commands flowing reliably across multiple endpoints. This setup makes it possible to stream camera feeds from the simulated environment back into the autopilot loop, exercise computer vision pipelines under challenging conditions, and verify perception-driven behaviors in a safe, controlled feedback cycle.

The architecture is designed for scale and clarity. A React front end, integrated with Cesium-JS, lets users define zones, geometry, and mission parameters visually. Those configurations are saved by the backend and published to an MQTT broker, which the workflow runner listens to. From there, the runner launches isolated Docker Compose networks to run concurrent simulations without interference. In these stacks, PX4, Gazebo, pose-sending utilities, and shepherd tools (suas-engine) coordinate flight behavior, while an Unreal Engine container—accelerated via NVIDIA's container toolkit—renders the world. A Pixel Streaming signaling server brings UE5 to the browser, and the Mavlink Router consolidates telemetry to a single ground control station for consistent ops.

One of SADE's most useful ideas is the "SADE Zone." These are policy-aware regions that translate rules directly into drone behavior—think altitude caps, no-capture constraints for sensitive areas, or mandatory permission requests before entering and when exiting. Instead of treating compliance as something to bolt on later, SADE embeds policy into the simulation itself so drones "self-behave" according to the rules of the environment. That blend of physical realism and operational policy is essential for regulated and safety-critical use cases.

From a user standpoint, the SADE UI is where everything comes together. It gives you a flexible way to draw zones, set rules, and turn those constraints into autonomous responses. By capturing precise coordinates and dimensions, the UI drives compliance automatically, freeing teams to focus on algorithm design, perception, and mission planning. This approach is especially powerful for rare or risky scenarios that are hard to reproduce in the field but vital for building robust autonomy. SADE is built to handle multiple users and runs at once, intelligently distributing CPU and GPU resources so experiments remain responsive at scale. Looking ahead, the roadmap emphasizes richer workflows, larger multi-vehicle scenarios, and deeper policy enforcement to narrow the gap between simulation and real operations.

If you're working on UAV autonomy, simulation tooling, or regulated operations and want to explore policy-aware zones in photo real environments, check out the demo: https://sade.crc.nd.edu/

Tue, 18 Nov 2025 19:36:00 +0000

Wrangling Massive Tasks Graphs with Dynamic Hierarchical Composition

On Thursday, Octobor 30, research engineer Ben Tovar presented our recent work on accelerating the execution of High Energy Physics (HEP) workflows at the PyHEP 2025 Workshop, a hybrid workshop held at CERN. The presentation centered on an execution schema called Dynamic Data Reduction (DDR) that runs on top of TaskVine.

Current HEP analysis tools, like Coffea, provides users with an easy way to express the overall workflow and leverage local vectorization on column-oriented data. However, this often requires expressing the entire computation graph statically from the start. This introduces several issues, such as graph generation overhead which may take several hours longer than the actual computation needed, and the creation of computation units that do not fit the resources available.

With a DDR, we take advantage of the structure inherent in many HEP applications where when processing multiple collision events, the accumulation (reduction) step is typically both associative and commutative. This means that it is unnecessary to pre-determine which processed events are reduced together and can leverage factors such availability of data location. Further, the number of events processed together can respond dynamically to the resources available, and datasets can be processed independently.

In the DDR application stack, TaskVine acts as the execution platform that distributes the computation to the cluster.

As an example, we ran Cortado, a HEP application that processes 419 datasets, 19,631 files, and 14TB of data (totaling 12,000 million events) in about 5.5 hours using over 1600 cores at any one time. During the run some of these cores had to be replaced because of resources eviction.

For more information, please visit the DDR pipy page at https://pypi.org/project/dynamic-data-reduction/

Tue, 11 Nov 2025 18:21:00 +0000

TaskVine Insights - Storage Management: Disk Load Shifting

The University of Notre Dame operates an HTCondor cluster with roughly 20,000 cores for scientific computing. The system consists of heterogeneous machines accessible to all users and is managed by the Center for Research Computing (CRC). As newer, more advanced nodes are added, the cluster continues to expand its computational capacity, enabling increasingly sophisticated research. TaskVine primarily runs on this HTCondor environment, connecting hundreds of workers and executing thousands of tasks while optimizing data locality, disk usage, and scheduling efficiency to maximize throughput. The screenshot below shows the cluster’s activity at 12:10 p.m. on November 4, from this website.

Leveraging node-local storage (NLS) to cache data locally and reduce I/O bottleneck on the centralized parallel filesystem (PFS) is a core feature of TaskVine. However, in a heterogeneous cluster where nodes differ in hardware and architecture, NLS can be both a strength and a liability. While it provides high I/O bandwidth for intermediate data, heterogeneity often leads to imbalance: some workers may run out of disk space while others remain half empty. This imbalance can cause worker crashes, unnecessary recomputation, and wasted cluster resources.

Disk load skew can arise from several factors. One major cause is cluster heterogeneity. As newer and more powerful compute nodes are added to an HPC cluster, they naturally execute tasks faster and produce more data, leading to uneven disk consumption.

To evaluate how heterogeneity contributes to this imbalance, we executed 20,000 identical tasks involving both computation and disk I/O on 100 workers, each with 16 cores. The figure below shows the number of completed tasks per worker over time. Some workers consistently outperformed others: by the end of the workflow, the fastest worker completed around 350 tasks, while the slowest managed only about 100, which is nearly a fourfold difference. As these faster nodes process tasks more quickly, they accumulate intermediate data at a higher rate and thus face an increased risk of local disk exhaustion.

On the other hand, the imbalance can also arise from the algorithm that determines task placement and data storage. To minimize the overhead of transferring data between nodes, a common approach is to schedule a task where its input data already resides. Once the task completes, its child tasks are often dispatched to the same node to exploit data locality. Over time, this strategy can lead to a subset of nodes accumulating a large amount of intermediate data.

To examine this effect, we ran a DAG-structured high-energy physics (HEP) workflow, DV5, consisting of approximately 250,000 tasks on 100 workers, each with 16 cores, and monitored the storage consumption on each node over time. We performed two runs: one prioritized assigning tasks to workers that already had the input files, while the other favored workers with the most available disk space, shown in the following two figures, respectively.

The results were somewhat surprising: despite the theoretical expectation that the latter policy should yield a more balanced distribution, both runs showed similarly skewed disk usage patterns. One possible explanation is that task scheduling occurs very rapidly and at high concurrency, so when a task becomes eligible, only a few workers are available. As a result, tasks naturally tend to execute on workers that just completed their parent tasks, maintaining locality even without explicit intent.

So, we must acknowledge that disk load imbalance naturally arises from cluster heterogeneity and therefore must be handled carefully. TaskVine addresses this issue using two complementary techniques: Redundant Replica Removal and Disk Load Shifting. The following sections detail each of these approaches.

1. Redundant Replica Removal

Because TaskVine inputs can migrate between workers to satisfy locality, it is quite common for several nodes to end up caching the same temporary file; the effect is most pronounced on the faster machines, which finish more tasks per unit time and accumulate many more staged inputs than their slower peers. To keep this extra baggage from overwhelming NLS, the TaskVine manager intervenes every time a task completes. It walks the task’s input list, and for each temporary file that now has more replicas than the configured target it inspects the workers currently holding that file.

Nothing is deleted until every replica reports the READY state, guaranteeing that an in-flight transfer is not disrupted, and the manager double-checks that none of the workers are still executing tasks that depend on the file. Only after these safety checks does it queue up the redundant replicas for removal, restoring balance without jeopardizing correctness. The number of redundant replicas is calculated as the difference between the current number of replicas and the user-specified target, ensuring that at least one replica always remains available for future use. Workers holding these excess replicas are ranked by their free cache space, prioritizing cleanup on those with more available storage to maintain balanced disk utilization.

2. Disk Load Shifting

In TaskVine, this technique is triggered when a worker reports that a temporary file has been cached. When this happens, the manager’s cache-update handler inspects the newly staged replica and, if shifting is enabled, scans the worker pool to find a lighter host. Only workers that are active, with transfer capability, and do not already hold the file are considered. Candidates that would end up heavier than the source after receiving the file are skipped, leaving the lightest eligible worker to take the replica so that each transfer moves free space toward balance instead of swapping hot spots.

The migration reuses TaskVine’s existing peer-transfer pipeline. The destination streams the file directly from the source, the replica catalog tracks its state from CREATING to READY, and both workers update their transfer counters for admission control. Once the new replica is confirmed, the original worker releases its redundant copy to the cleanup routine, reclaiming disk space that just served its purpose. The work involved is modest, requiring only a single hash-table scan and one network transfer per staged replica, but the payoff is immediate: fast workers stay ahead of their disk usage, slower nodes lend idle capacity, and heterogeneous clusters keep their node-local storage evenly balanced without reducing throughput.

The following figure compares the NLS usage across all workers over time in the DV5 workflow, before and after enabling the two techniques.

After enabling Redundant Replica Removal and Disk Load Shifting, the NLS usage among workers became much more balanced. As shown in the bottom figure, storage consumption across nodes stayed within a narrow range under 10 GB, compared to over 20 GB and a significant skew before optimization. This indicates that the two techniques effectively prevented disk hotspots and improved overall resource utilization. In terms of overhead, the pre-optimization run completed in 206.85 seconds, while the optimized run took 311.92 seconds, indicating that the additional data transfers introduced a noticeable slowdown.

Both techniques are implemented on the TaskVine manager side in C, but from the user’s perspective they are simple to enable. After creating a manager object through the Python interface, for example:

m = vine.Manager(port=[9123, 9130]),

you can activate them individually with:

m.tune("clean-redundant-replicas", 1) and m.tune("shift-disk-load", 1).

While these modes are effective, they are not always recommended, since the additional data transfers and computations may introduce overhead and reduce overall throughput. However, if your workflow runs on disk-constrained nodes or workers are being evicted due to insufficient storage and you cannot request more disk space, enabling these options can significantly improve stability and performance.

Tue, 04 Nov 2025 21:22:00 +0000

Simulating Digital Agriculture in Near Real-Time with xGFabric

Advanced scientific applications in digital agriculture require coupling distributed sensor networks with high-performance computing facilities, but this integration faces significant challenges. Sensor networks provide low-performance, unreliable data access from remote locations, while HPC facilities offer enormous computing power through high-latency batch processing systems. For example, the Citrus Under Protective Screening (CUPS) project monitors environmental conditions in agricultural facilities to detect protective screening damage and requires real-time computational fluid dynamics simulations to guide interventions. Traditional networking approaches struggle to bridge these contrasting computing domains with the low latency and high reliability needed for near real-time decision making.

To address this challenge, we developed xGFabric, an end-to-end system that couples sensor networks with HPC facilities through private 5G wireless networks. The key innovation is using 5G network slicing to provide reliable, high-throughput connectivity between remote sensors and centralized computing resources. xGFabric integrates the CSPOT distributed runtime system for delay-tolerant networking and the Laminar dataflow programming environment to manage the entire pipeline from data collection to simulation execution. The system masks network interruptions, device failures, and batch-queuing delays by storing all program state in persistent logs, allowing computations to pause and resume seamlessly when resources become available.

To demonstrate the system's effectiveness, we deployed a working prototype connecting sensors at the University of Nebraska-Lincoln via private 5G networks to HPC facilities at multiple institutions. Their evaluation showed that the private 5G network achieved uplink throughput up to 65.97 Mbps and scaled effectively across multiple devices. Message latency through the CSPOT system averaged just 101 milliseconds over 5G, meeting the application's real-time requirements. The OpenFOAM CFD simulation completed in approximately 7 minutes using 64 cores, enabling the system to generate updated environmental predictions within the 30-minute decision window required by agricultural operators.

Overall, xGFabric demonstrated that private 5G networks can successfully bridge the gap between edge sensor deployments and HPC resources for time-critical scientific applications. By providing a unified software stack across all device scales and leveraging 5G's low latency and reliability, the system enables "in-the-loop" high-performance computing for real-time decision support in digital agriculture and other domains requiring coupled sensor-simulation workflows.

This work will be presented by Liubov Kurafeeva at the XLOOP workshop at SC '25 in St. Louis, Missouri

For more information you can visit our website at: https://sites.google.com/view/xgfabric

Tue, 28 Oct 2025 17:56:00 +0000

Undergraduate Researcher Showcases PLEDGE Project at APANAC 2025 in Panama

On Thursday, October 2, 2025, undergraduate student Andrés Iglesias attended APANAC 2025, the National Congress dedicated to Science and Technology held in Panama. Andrés participated in the poster session with “PLEDGE: Accelerating Data Intensive Scientific Applications with Consistency Contracts.”

Andrés joined the Cooperative Computing Lab (CCL) at the University of Notre Dame as a summer research student through the iSURE program, spending three months on campus. During his time at Notre Dame, he contributed to the initial implementation of the PLEDGE Tracer, based on Colin’s previous work, which observes a scientific application and generates a consistency contract. He also worked on the PLEDGE Enforcer, which ensures the scientific application respects the consistency contract at runtime.

We are proud of Andrés’s contributions and delighted to see his work showcased at a prestigious national conference!

Tue, 21 Oct 2025 17:37:00 +0000

Reducing Overhead of LLM-integrated Applications on GPU Clusters with Parsl+TaskVine

Large Language Models (LLMs) are becoming a key tool for scientific discovery, but using them on High-Performance Computing (HPC) clusters is challenging due to the limitations of traditional resource allocation methods. For instance, static allocation, which assigns a dedicated set of GPUs for a task, is a rigid system. This can lead to long queues of frustrated users and wasted resources, as the allocated GPUs sit idle while waiting for the next job to start. Meanwhile, opportunistic allocation allows tasks to use available, but not guaranteed, resources. While this improves overall cluster utilization, it's problematic for LLM applications. The initial loading of a multi-billion-parameter LLM is a time-consuming process, and since tasks in an opportunistic environment can be preempted at any moment, this expensive startup often has to be repeated from scratch.

To solve this, we propose a new technique called Pervasive Context Management. The core idea is to decouple the LLM initialization context from the actual inference tasks and keep this context persistent on GPUs until it is no longer needed. This transforms the high startup cost into a one-time, amortizable expense. When a task is preempted, it can be quickly rescheduled to another GPU that already has the necessary context loaded, eliminating the need to re-initialize the model. Our Parsl+TaskVine system can also transfer existing context between nodes to bootstrap new GPUs, reducing data transfer time and avoiding bottlenecks.

To demonstrate the effectiveness of this approach, we transformed a fact-verification application to use Pervasive Context Management and conducted a comprehensive evaluation. Our results show a significant improvement in performance on both static and opportunistic resources. By enabling Pervasive Context Management, the end-to-end execution time of our application was reduced by 72.1%, from 3 hours to just 48 minutes, using the same number of GPUs. The application also scaled efficiently on up to 32.8% of all GPUs (up to 186 GPUs) in the cluster, further reducing the execution time to a mere 13 minutes.

Additionally, Pervasive Context Management helps users avoid the complex problem of tuning the inference batch size. Because the expensive startup cost is now a one-time event, the application's performance becomes much more stable regardless of the batch size chosen. This removes the burden of manual tuning and ensures near-optimal execution. In summary, our findings show that this new approach is a viable solution for running high-throughput LLM inference applications efficiently on heterogeneous and opportunistic HPC clusters.

Tue, 14 Oct 2025 20:39:00 +0000

TaskVine Insights: Storage Management – PFS vs. NLS

There are two primary storage layers when running workflows in HPC environments:

Parallel File System (PFS): A shared file system accessible to all users in the same cluster, such as VAST, Lustre, BeeGFS, and CephFS.
Node-Local Storage (NLS): Each worker’s local disk, accessed directly without relying on the network, usually the temporary directory of the local file system.

PFS and NLS each have their own advantages and disadvantages.

PFS is convenient because it can be easily accessed by users. It usually has large capacity, often hundreds of terabytes, making it an ideal choice for storing big datasets. It is also stable and reliable, ensuring that data is not lost. However, the main drawback of PFS is that its I/O bandwidth is shared among many users, so it can become saturated when many jobs perform I/O at the same time, turning data access into a major bottleneck for parallel computation.

In contrast, NLS provides isolated storage on distributed workers. Each worker has its own local disk and can read and write data directly. This allows the total I/O bandwidth to aggregate across all workers, and data transfer between nodes happens through peer-to-peer communication. This design effectively reduces the I/O contention that occurs on the PFS and helps workflows scale to larger sizes. On the flip side, it also has limitations. Its capacity is small, typically a few hundred gigabytes per node, and it is less reliable, because node failures are unpredictable, and when a node is preempted or goes offline, all data stored on that node is lost, which poses a risk to users.

The figure below shows the performance difference between PFS and NLS. We test concurrent reads and writes from 1 to 128 threads and measure their average bandwidth. The PFS test runs directly on the parallel file system, while the NLS test runs on 8 workers, each with 16 cores. It is clear that running parallel I/O on NLS provides much higher average bandwidth. Each thread maintains about 1 GB/s of throughput, whereas the average bandwidth on PFS drops sharply as concurrency increases.

The key takeaway from this comparison is that running large-scale or data-intensive workflows on HPC systems requires relying on NLS for its high aggregate I/O bandwidth, ensuring that the entire workflow is not slowed down by a deluge of reads and writes.

Our team has been studying and improving how to better leverage NLS to accelerate large-scale workflow computations in HPC systems through TaskVine, a workflow management system we have been developing over the past few years. TaskVine’s key advantage is its ability to use each worker’s NLS to reduce I/O contention on the PFS, enabling faster data access and quicker workflow completion. It also employs a range of data management techniques and strategies to ensure that NLS is used effectively and efficiently, keeping data safe and handling unpredictable node failures with care.

This blog series shares how we manage data carefully, address the challenges of using NLS when local disk space is limited and nodes are prone to failures, and achieve massive scalability.

Stay tuned for upcoming technical insights, code examples, and updates!

Tue, 07 Oct 2025 18:06:00 +0000

eScience 2025: Liberating the Data Aware Scheduler to Achieve Locality in Layered Scientific Workflow Systems

On September 16 graduate student Colin Thomas presented the paper titled: Liberating the Data Aware Scheduler to Achieve Locality in Layered Scientific Workflow Systems at the 21st IEEE International Conference on eScience in Chicago, Illinois.

This work engages mutliple topics including workflow systems, data management, and task scheduling. The title describes two key components; The Data Aware Scheduler, and Layered Scientific Workflow Systems. Data aware schedulers are capable of understanding task data dependencies and making task scheduling decisions based on that information, primarily to benefit from data locality when there is intermediate workflow data which is already cached somewhere in the cluster. It is beneficial to schedule tasks who consume this data to the same site in which the data was created, so that the dependencies to not have to move through the network, or perhaps even out of memory.

The term "layered workflow system" is a way to describe multiple popular workflow systems in the HPC community such as Parsl and Dask. These workflow systems consist of two primary components. The DAG manager and executor. The DAG manager understands the workflow composition and data dependencies. The executor receives tasks from the DAG manager and uses its understanding of the cluster and available resources to place tasks on their execution sites.

The primary argument of the paper is highlighting the obstacles created by using a data aware scheduler in this layered execution scheme. If we take a DAG such as the one described by Figure 1, we can easily identify opportunities for data locality in the groups of sequentially dependent tasks. However the data aware scheduler is not privy to this picture of the DAG. Rather the scheduler, or executor, is only aware of tasks which are ready to run, while the DAG manager withholds future task information until they are ready to run as well. This forces a data aware scheduler to make last-minute scheduling decisions based on individual tasks. In many cases it may occupy a node in which a later task would have been better suited due to data locality opportunities.

The paper shows an implementation of a modified Parsl DAG manager and TaskVine data aware scheduler which passes through intermediate-dependent sequential tasks to the data aware scheduler before all of them are ready to run. This allows TaskVine to identify the group dependencies and the ideal execution pattern, and schedule these task groups in batches rather than on an individual basis.

The result of this increases the data locality achieved on the 2 workflows in the evaluation. In addition it reduces the total number of scheduling operations by a factor of the average task group size.

The link to the full paper can be found below:

https://ccl.cse.nd.edu/research/papers/liberating-escience-2025.pdf

Wed, 01 Oct 2025 14:55:00 +0000

Floability at eScience 2025: Making Notebooks Portable with Backpacks Across HPC Clusters

Grad student Saiful Islam presented our paper “Backpacks for Notebooks: Enabling Containerized Notebook Workflows in Distributed Environments” at the 2025 IEEE eScience Conference in Chicago, Illinois.

Notebooks have become the de facto interface for scientific computing, but moving from a local notebook to large-scale HPC clusters is far from seamless—especially when the notebook contains a distributed workflow that must be submitted across multiple nodes. A notebook file alone doesn’t capture the full execution context—it misses environment specifications, data locations, and resource requirements. As a result, the same notebook often runs on one system but fails on another.

This paper introduces the concept of a backpack—a lightweight companion that travels with the notebook and captures everything needed for execution. A backpack makes explicit the software environment, data sources, and resource requirements that are often left implicit in code. With Floability, our implementation of backpack specifications, backpacks transform ordinary notebooks into portable, reproducible workflows that can execute across heterogeneous HPC clusters with zero to minimal code modification.

We evaluated Floability on three representative scientific workflows—distributed image convolution, climate trend analysis, and high-energy physics data analysis—running them across five heterogeneous HPC systems (Notre Dame CRC, Purdue Anvil, UT Stampede3, OSPool, and AWS). In each case, backpacks successfully captured the required software and data dependencies, provisioned worker environments, and reproduced execution without code changes. While runtime varied due to site-specific infrastructure like schedulers and storage, all workflows completed consistently, demonstrating that backpacks enable portable, reproducible, and scalable execution of notebook workflows across diverse HPC environments.

For all the details, please check out our paper here:

Md Saiful Islam, Talha Azaz, Raza Ahmad, A D M Shahadat Hossain, Furqan Baig, Shaowen Wang, Kevin Lannon, Tanu Malik, and Douglas Thain, Backpacks for Notebooks: Enabling Containerized Notebook Workflows in Distributed Environments, IEEE Conference on eScience, pages 9, September 2025.

Project website: https://floability.github.io

Example backpacks: https://github.com/floability/floability-examples Tue, 23 Sep 2025 16:30:00 +0000

Workshop on Harmonizing Python Workflows at IEEE e-Science 2025

We helped host the Workshop on Harmonizing Python Workflows at IEEE International Conference on e-Science on Monday, 15 Sep 2025.

This workshop is one component of our NSF POSE project to explore the creation of an open source ecosystem encompassing a variety of workflow technologies, including the Parsl Project (Kyle Chard at the University of Chicago), the RADICAL tools (Shantenu Jha, Rutgers University), and TaskVine (Douglas Thain, University of Notre Dame).

Workshop attendees conducted several working groups where they identified key barriers to workflow creation, deployment, and adoption; proposed activities for an open source ecosystem to support these needs; and provided feedback on the possible structure sof an ecosystem.

The organizers will be following up soon with a workshop report and draft proposal for next steps.

Thank you to everyone who participated!

Tue, 16 Sep 2025 17:10:00 +0000

Welcome Back, Colin!

This past summer, 4th year PhD student Colin Thomas completed an internship at the National Energy Research Scientific Computing Center (NERSC) located at the Lawrence Berkeley National Laboratory.

Colin worked with a team of researchers and fellow interns to develop and deploy an Inference-as-a-Service (IaaS) platform for particle physics experiments including DUNE and ATLAS. Colin organized a network of services deployed on Kubernetes and the Perlmutter supercomputer which enabled remote scientists to run their applications and perform analysis of their data. The IaaS deployment consisted of metrics collection capabilities used to profile the system, identify bottlenecks, and inform the system about the need to scale the compute resources to meet the demands of users in real-time. The team profiled multiple inference-serving technologies such as Ray Serve and NVIDIA Triton. The team effort resulted in a number of successes including tests with scientists from multiple institutions, the gathering of valuable data from the profiles of Ray and Triton, and the system prototype which can serve as an example for future work in scientific inference serving.

Colin shared with us his wonderful experience during the summer and also gave an engaging talk about the work he has done during our first team meeting. Below is the poster he created and presented. We are excited to learn new ideas and look forward to more inspiring discussions ahead.

Tue, 09 Sep 2025 20:34:00 +0000

New Semester, New Faces

The new semester is here, and we’re excited to welcome three new colleagues and roll out a clear plan for the months ahead.

New faces. Lax joins as a first-year Ph.D. student. Ryan just completed his M.S. in our lab and begins his Ph.D. this semester. Abby joins as a first-year M.S. student. We’re glad to have them on board!

Each semester we adjust our routines and schedules to keep the lab running smoothly and to support a diverse team. This semester we’re trying a few new approaches. Colin will organize our weekly team talks with a schedule planned about a month ahead, host practice sessions for conference presentations, and invite remote guests for ~30-minute Zoom talks and discussion. Jin will handle outreach and support: a short blog every Tuesday, cross-posting to LinkedIn and other channels, encouraging GitHub Discussions as a support path, and serving as the first responder for incoming technical questions. Saiful will lead the website migration to a Jekyll site on GitHub, move over most existing content, convert prior posts into native Jekyll, and switch our papers database from PHP + database to a JSON dataset. Ben, the most experienced and supportive engineer who knows every line of the codebase, will guide release engineering: review PRs and give feedback, prioritize and maintain issues, keep integration tests in working order, and plan and ship regular software releases.

This semester, alongside our long-standing systems research, we will explore practical ways to pair our work with LLMs. For example, improving developer documentation, identifying code invariants, visualizing logs, and troubleshooting workflows. Ground rules: you are responsible for your output; do not create extra work for others; use ND-authorized tools such as Gemini and Gemini-CLI to protect data. This week each person will pick a modest task and tool to explore. Next week we will report on methods, results, and observations.

Project highlights.

CMS Computing (Barry, Jin, Ben, Alan): keep the current TaskVine release stable and usable; share successes with the HEP community; publish the new cms-taskvine-example repository; advance an IPDPS/TPDS paper on dynamic data reduction.
NSF CSSI Floability (Saiful, Abby, Ben): keep the release stable, usable, and documented; attach straightforward data handling for existing applications; grow users and collaborations; build interactive visualization of notebook workflows to support troubleshooting and performance; then extend the approach to new notebook and workflow models.
NSF Consistency Contracts (Colin + Jin): complete and release the basic toolchain to measure, summarize, and enforce; choose baseline techniques that exploit application behavior.
DOE XGFabric (Ryan, Thanh): capture the full software stack; port to multiple ACCESS sites and troubleshoot issues as they appear; evaluate scale, responsiveness, and resource use; develop solutions for HPC batch queue delays.
NSF HARMONY: run a fall workshop on workflow collaboration with the eScience conference; capture examples across Parsl-TaskVine (astro), DaskVine (HEP), and RADICAL (xgfabric).
NASA-SADE (Lucas, Lax): demonstrate the simulation infrastructure at the September Year 2 review; evaluate OS scalability and performance; begin integrating native SADE components.

We’ll post a short update every Tuesday and syndicate it to LinkedIn and other channels. Questions and ideas are welcome on GitHub Discussions.

Author: Jin Zhou (jzhou24@nd.edu)

Lab PI: Douglas Thain (dthain@nd.edu)

Lab Website: https://ccl.cse.nd.edu/

GitHub: https://github.com/cooperative-computing-lab/cctools

Wed, 03 Sep 2025 03:48:00 +0000

CCL Team at GCASR 2025

Members of the CCL team traveled to Chicago, Illinois on May 8 to attend GCASR 2025 (12th Greater Chicago Area Systems Research Workshop).

CCL team members presented their work in both poster sessions.

Barry Sly-Delgado presented: Task Graph Restructuring via Function Based Annotation For Large-Scale Scientific Applications.

Colin Thomas presented: Enabling Tailored Optimizations for Scientific Workflows through I/O Tracing.

Md. Saiful Islam presented: Floability: Enabling Portable Scientific Workflows Across HPC Facilities.

Jin Zhou presented: Effectively Exploiting Node-Local Storage For Data-Intensive Scientific Workflows.

Thu, 22 May 2025 15:13:00 +0000

Reshaping High Energy Physics Applications Using TaskVine @ SC24

Barry Sly-Delgado presented our paper titled: "Reshaping High Energy Physics Applications for Near-Interactive Execution Using TaskVine" at the 2024 Supercomputing Conference in Atlanta, Georgia. This paper investigates the necessary steps to convert long-running high-throughput high energy physics applications to high concurrency. This included incorporating new functionality within TaskVine. The paper presents the speedup gained as changes were incorporated to the workflow execution stack for application DV3. We eventually achieve a speedup of 13X.

Configurations for each workflow execution stack as improvements were made.

Starting with stack 1. The first change incorporated was to the storage system where initial data sets are stored. This change showed little improvement, taking 3545s runtime to 3378s. This change is minimal as much of the data handling during application execution is related to intermediate results. Initially, with stack 1 and 2, intermediate data movement is handled via a centralized manager.

data movement during application execution between Work Queue and TaskVine. With TaskVine, the most data exchanged between any two nodes tops off around 4GB (the manger is node 0). With Work Queue the most data transferred is 40GB

Incorporated in stack 3 is a change of scheduler, TaskVine. Here, TaskVine allows for intermediate results to be stored on node-local storage and transferred between peer compute nodes. This relieves strain on the centralized manager and allows it to schedule tasks more effectively. This change drops the runtime to 730s.

CDF of task runtimes within the application per execution paradigm. With Function Calls, individual tasks execute faster.

Our final improvement changes the previous task execution paradigm within TaskVine. Initially "PythonTasks" serialized functions along with arguments and distributed them to compute nodes to execute individual tasks. Under this paradigm, the python interpreter would be invoked for each individual task. Our new task execution paradigm, "Function Calls" stands up a persistent Python process, "Library Task", that contains function definitions that can be invoked via individual function calls. Thus, invocations of the Python interpreter are reduced from per-task to per-compute-node. This change reduces runtime to 272s for a 13X speedup from our initial configuration.

Application execution comparison between stack configurations.

Tue, 14 Jan 2025 15:47:00 +0000

Shepherd Paper at WORKS/SC 2024

Grad student Saiful Islam presented our paper on Shepherd at the 19th Workshop on Workflows in Support of Large-Scale Science at Supercomputing 2024 in Atlanta, Georgia.

Shepherd is a local workflow manager that enables a fleet of actions and services to functionally behave as a single task. This allows us to seamlessly deploy these local workflows into HPC clusters at large scale. For example, consider a workflow for large-scale distributed drone simulation. This workflow includes steps such as preprocessing, creating a configuration, running simulations, and post-processing. We aim to deploy the "run simulation" step in HPC clusters at large scale. However, each "run simulation" step includes a local workflow that needs to perform specific actions and start services like Gazebo, PX4, and Pose Sender. These services and actions depend on each other’s internal states. For example, when Gazebo reaches the "ready" state, the nth PX4 service is started. Without Shepherd, the workflow becomes as shown in the following figure:

With Shepherd, the entire "run simulation" step becomes a single task of running Shepherd with a specified configuration. Shepherd starts the required actions and services, manages dependencies, and ensures the graceful shutdown of services when they are no longer needed. This enables complex workflows of actions and services to be seamlessly integrated with a larger distributed workflow manager, as shown in the following figure:

The paper discusses the architecture and design principles of Shepherd, showcasing its application to large-scale drone simulations and integration testing.

Shepherd uses a YAML-based workflow description to define tasks, dependencies, and execution conditions. It monitors logs and file generation to infer internal states and manage lifecycles effectively. Additionally, it generates three visualizations post-execution for debugging and documentation. For instance, the figure below illustrates a timeline of a 100-drone simulation distributed across 25 nodes, with each Shepherd instance managing 4 drones. A zoomed-in view highlights how components execute at varying times across nodes, which becomes challenging without awareness of service readiness. Shepherd simplifies this with YAML-based configurations and internal state tracking.

For all the details, please check out our paper here:

Md Saiful Islam, Douglas Thain, Shepherd: Seamless Integration of Service Workflows into Task-Based Workflows through Log Monitoring, 19th Workshop on Workflows in Support of Large-Scale Science at ACM Supercomputing, pages 1-8, November, 2024.

Tue, 07 Jan 2025 23:50:00 +0000

Data Pruning Mechanism in Daskvine

In our recent work, we introduced a file pruning technique into DaskVine to address challenges in managing intermediate files in DAG-based task graphs. This technique systematically identifies and removes intermediate stale files—those no longer needed by downstream tasks—directly from worker nodes. By freeing up storage in real time, file pruning not only enables workers to process more computational tasks with limited disk space but also makes applications feasible that were previously constrained by storage limitations.

Specifically, the pruning algorithm monitors task execution in the graph, categorizing tasks as waiting or completed. Once a task finishes, it prunes its parent tasks’ output files if all dependent child tasks are done and immediately submits dependent tasks for execution when their inputs are ready.

To evaluate the effectiveness of our file pruning technique, we utilized four key metrics:

File Retention Rate (FRR): FRR represents the ratio of a file's retention time on storage to the total workflow execution time. It indicates how long files remain in storage relative to the workflow's progress. A shorter FRR reflects more efficient pruning. The following figures show the FRR over all intermediate files, each producing by one specific task. The left graph (FCFS) shows higher and more uniform FRR, where files remain on storage for a significant portion of the workflow. In contrast, the right graph (FCFS with Pruning) shows a clear reduction in FRR for most files, which proves that pruning reduces storage usage by removing stale files promptly.

Accumulated Storage Consumption (ASC) and Accumulated File Count (AFC): ASC measures the total amount of storage consumed across all worker nodes throughout the workflow execution, while AFC tracks the total number of files retained across all worker nodes during the workflow. In the following, the left graph shows that both ASC and AFC steadily increase throughout the workflow, peaking at 1082.94 GB and 299,787 files, respectively. This highlights significant storage pressure without pruning. The right graph demonstrates dramatic reductions, with peaks at 326.78 GB for ASC and 80,239 files for AFC. This confirms that pruning effectively reduces storage consumption and file count by promptly removing unnecessary intermediate files.

Worker Storage Consumption (WSC): WSC reflects the storage consumption of individual workers at any point during the workflow execution. It helps assess the balance of storage usage among workers. The left graph shows a peak WSC of 24.95 GB, indicating high storage pressure on individual workers. In contrast, the right graph shows a significantly lower peak of 7.07 GB, demonstrating that pruning effectively reduces storage usage per worker and balances the load across the cluster.

Mon, 09 Dec 2024 16:10:00 +0000

TaskVine + Parsl Integration

The Cooperative Computing Lab team has an ongoing collaboration with the Parsl Project, maintaining the TaskVine Executor for use with the Parsl workflow system. Using the TaskVine executor involves expressing an application using the Parsl API, where tasks are created and managed using the Parsl Data Flow Kernel. Tasks are passed to the executor which makes final scheduling decisions.

The TaskVine executor offers a number of features involving data management and locality scheduling. The distinction between the TaskVine executor and other available executors is largely the awareness of data dependencies and the use of node-local storage. There are a number of features to the TaskVine executor that have been recently added or are in progress.

One prominent new feature is the extension of TaskVine function invocations to the TaskVine executor. Function invocations reduce the overhead of starting new tasks, considerably benefiting workflows with many short-running tasks. TaskVine function calls have been described with greater detail in this article by Thanh Son Phung. This benefit to task startup latency is especially useful for the Parsl/TaskVine executor, since Parsl is a python-native application. Normally each task must consider starting a new python process and loading all modules used by Parsl as well as the user's application. With short running tasks, such as ones resembling quick function calls, the overhead of loading this information from shared storage, or transferring it to local storage and starting a new python process can be greater than the running time of the task itself. TaskVine function invocations allow this runtime environment information to be effectively cached, and multiple tasks to run sequentially in the same python process. A molecular dynamics application using the Parsl/TaskVine executor is shown below, with L1 being simple tasks, and L2 using TaskVine function invocations.

Other new changes to the TaskVine executor include the addition of the tune_parameters to the executor configuration. TaskVine is highly configurable by users to suit the needs of their particular application or cluster resources. Adding this option to the TaskVine executor allows users to automatically replicate intermediate data, configure the degree of replication desired, disable peer transfers, and several in-depth scheduling options which fit different specific task dispatch and retrieval patterns.

Sun, 01 Dec 2024 21:49:00 +0000

Accelerating Function-Centric Applications via Reusable Function Context in Workflow Systems

Modern applications are increasingly being written in high-level programming languages (e.g., Python) via popular parallel frameworks (e.g., Parsl, TaskVine, Ray) as they help users quickly translate an experiment or idea into working code that is easily executable and parallelizable on HPC clusters or supercomputers. Figure 1 shows the typical software stack of these frameworks, where users wrap computations into functions, which are sent to and managed by a parallel library as a DAG of tasks, and these tasks eventually are scheduled by an execution engine to execute on a remote worker node.

A traditional way to execute functions remotely is to translate them into executable tasks by serializing functions and their associated arguments into input files, such that these functions and arguments are later reconstructed on remote nodes for execution. While this way fits function-centric applications naturally into well understood task-based workflow systems, it brings a hefty penalty to short-running functions. A function now takes extra time for its states to be sent and reconstructed on a remote node which are then unnecessarily destroyed at the end of that function’s execution.

At HPDC 2024, graduate students Thanh Son Phung and Colin Thomas proposed the idea that function contexts, or states, should be decoupled from function’s actual execution code. This removes the overhead of repeatedly sending and reconstructing a function’s state for execution, and allows functions of the same type to share the same context. The rest of the work then addresses how a workflow system can treat a function as a first-class citizen by discovering, distributing, and retaining such context from a function. Figure below shows the execution time of the Large-Scale Neural Network Inference application, totaling 1.6 million inferences separated into 100k tasks, with increasing levels of context sharing (L1 is no sharing, L3 is maximum sharing). Decoupling the inference function’s context from the inference massively reduces the execution time of the entire workflow by 94.5%, from around 2 hours to approximately 7 minutes.

An interested reader can find more details about this work in the paper:Thanh Son Phung, Colin Thomas, Logan Ward, Kyle Chard, and Douglas Thain. 2024. Accelerating Function-Centric Applications by Discovering, Distributing, and Retaining Reusable Context in Workflow Systems. In Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '24). Association for Computing Machinery, New York, NY, USA, 122–134. https://doi.org/10.1145/3625549.3658663

Mon, 11 Nov 2024 16:07:00 +0000

A New Visualization Tool for TaskVine Released

We released a web-based tool to visualize runtime logs produced by TaskVine, available on Github. Using this tool involves two main steps. First, the required data in CSV format must be generated for the manager, workers, tasks, and input/output files. After saving the generated data in the directory, users can start a port on their workstation to view detailed information about the run. This approach offers two key advantages: the generated data can be reused multiple times, minimizing the overhead of regeneration, and users can also develop custom code to analyze the structured data and extract relevant insights.

For example, the first section describes the general information of this run, including the start/end time of the manager, how many tasks are submitted, how many of them succeeded or failed, etc.

The second section describes the manager's storage usage through its lifetime, the a-axis starts from when the manager is started, and ends when the manager is terminated, the y-axis is in MB unit, and such pattern is applied to all diagrams in this report.

The third section is the table of all workers' information, which is basically grabbed from the csv files from the backend, but this enables users to easily sort by their interested columns.

The fourth section is the storage consumption among all workers. Several buttons in the top are provided, to turn the y-axis to a percentage unit, or to highlight one worker that is of interest.

The fifth section is the number of connected workers throughout the manager's lifetime, hovering the mouse on a point shows the information of a connected/disconnected worker.

The sixth section shows the number of concurrent workers throughout the manager's lifetime.

The seventh section is table of completed/failed tasks and their information.

The eighth section is the execution time distribution of different tasks. Those with a lower index on the left side of the x-axis are submitted earlier, while those on the right are submitted later. A CDF can be seen by clicking the button on the top.

The nineth section shows the task count, average execution time and max execution time of different task categories. One category can comprise multiple tasks that have a similar behavior, such as having the same function name but with different inputs.

The tenth section demonstrates the general runtime distribution of tasks, the y-axis is the worker-slot pair. In the following example, we have 64 workers and each with 20 cores. One can zoom in the diagram and hover to see the detailed information of one task, which is particularly useful when examining outliers.

The last section is about the structure of the compute graphs. Nodes in the graph are tasks with an index label, while edges are the dependencies between input/output files of tasks. Weights on those task->file edges are the execution time of tasks, while those on the file->task edges are the waiting time starting from when a file is produced to when it is consumed by a consumer task.

This tool works well under the scale of hundreds of thousands of tasks, but for large runs, which may have millions of tasks, the online visualization tool may be unable to process such amount of data because the data transferring bottleneck between backend and frontend. Under such case, or just for convenience, we recommend the users to use tools under the pyplot directory, which is more lightweight and uses traditional matplotlib and seaborn to draw diagrams. Detailed explanations are provided under the README file.

Mon, 04 Nov 2024 23:34:00 +0000

Integrating TaskVine with Merlin

Graduate student, Barry Sly-Delgado, completed a summer internship onsite at Lawrence Livermore National Laboratory where he worked on integrating TaskVine with Merlin, an executor for machine learning workflows. Barry worked as a member of the WEAVE team under Brian Gunnarson and Charles Doutriaux.

Previously, Merlin used Celery to distribute tasks across a compute cluster. With TaskVine's addition, utilization of in-cluster resources (bandwidth, disk) is available for workflow execution. Existing Merlin specifiacations can use TaskVine as a task scheduler with little change to the specification itself.

Merlin works with TaskVine by utilizing the Vine Stem, a DAG manager that borrows the concepts of groups and chains to create workflows from Celery. With this, the Vine Stem sends tasks (Merlin Steps) to the TaskVine manager for execution. Execution of these tasks eventually create a directory hierarchy that previous Merlin workflows already do. In addition to the workflow specification, a Merlin Specification contains specifications for starting workers, which are submitted via a batch system (HTCondor, UGE,Slurm)

Architecture of Merlin with TaskVine

Sample Merlin Specification Block With TaskVine as Task Server

The TaskVine task server option will be included in an upcoming release of Merlin. We would be happy to find more use cases so please check it out once released!

Wed, 16 Oct 2024 19:14:00 +0000

Introducing Shepherd: Simplifying Integration of Service Workflows into Task-Based Workflows

We are pleased to announce the release of Shepherd, an open-source tool designed to streamline the integration of service workflows into task-based workflows. Shepherd enables users to manage applications that require dynamic dependencies and state monitoring, ensuring that each component in a workflow starts only after its dependencies have reached specified states.

What Is Shepherd?

Shepherd is a local workflow manager and service orchestrator that launches and monitors applications according to a user-defined configuration. It bridges the gap between persistent services and traditional task-based workflows by treating services as first-class tasks. This approach allows for seamless integration of services that run indefinitely or until explicitly stopped, alongside actions that run to completion.

By analyzing log outputs and file contents, Shepherd tracks the internal states of tasks without the need to modify their original code. This capability is particularly useful when integrating persistent services into workflows where tasks may depend on the internal states of these services rather than their completion.

Key Features

Services as Tasks: Treats both actions (tasks that run to completion) and services (persistent tasks that require explicit termination) as tasks within a workflow, allowing integration into task-based workflow managers.
Dynamic Dependency Management: Initiates tasks based on the internal states of other tasks, enabling complex state-based dependencies. Supports both "any" and "all" dependency modes for flexible configurations.
State Monitoring: Continuously monitors the internal state of each task by analyzing standard output and specified files, updating states in real-time based on user-defined patterns.
Graceful Startup and Shutdown: Ensures tasks start only when their dependencies are met and provides controlled shutdown mechanisms to maintain system stability and prevent data loss.
Logging and Visualization: Generates comprehensive logs and state transition data, aiding in performance analysis and debugging.
Integration with Larger Workflows: By encapsulating service workflows into single tasks, Shepherd enables integration with larger distributed workflow managers, enhancing workflow flexibility and reliability.

Documentation and Source Code

For detailed documentation, examples, and source code, please visit the Shepherd GitHub Repository.

Mon, 14 Oct 2024 13:55:00 +0000

TaskVine at ParslFest 2024

On September 26-27 members of the CCL team attended ParslFest 2024 in Chicago, Illinois to speak about TaskVine and connect with our ongoing collaborators at the Parsl Project.

Grad student Colin Thomas delivered a talk titled "Parsl/TaskVine: Interactions between DAG Manager and Workflow Executor."

The talk highlighted recent developments in the implementation of TaskVine Temporary Files within the Parsl/TaskVine Executor. Temporary files allow users to express intermediate data in their workflows. Intermediate data are items produced by tasks, and consumed by tasks. With the use of TaskVine Temporary Files, this intermediate data will remain in the cluster for the duration of a workflow. Keeping this data in the cluster can benefit a workflow that produces intermediate files of considerable quantity or size.

In addition to adding temporary files to the Parsl/TaskVine executor, the talk presented ongoing work on the development of intermediate data-based "task-grouping" in TaskVine and extending it to the TaskVine executor. The concept of grouping related tasks may sound familiar to Pegasus users or other pre-compiled DAG workflow systems. Currently task-grouping is focused on scheduling sequences of tasks which contain intermediate dependencies. The desired outcome is for TaskVine to interpret pieces of the DAG to identify series of sequentially-dependent tasks in order to schedule them on the same worker, such that intermediate data does not need to travel to another host. Extending this capability to Parsl brought about an interesting challenge which questioned the relationship between DAG manager and executor.

The problem is that in the typical workflow executor scheme, such as with Parsl or Dask, the executor will receive tasks from the DAG manager only when they are ready to run. In order for TaskVine to identify sequentially dependent tasks it needs to receive tasks which are future dependencies of those that are ready to run.

The solution to this came about as a special implementation of a Data Staging Provider in Parsl, which is the mechanism for determining whether data dependencies are met. When the TaskVine staging provider encounters temporary files, it will lie to the Parsl DFK (Data Flow Kernel) in saying the file exists. Therefore Parsl will send all temporary-dependent tasks to TaskVine without waiting for the files to actually materialize. This offers TaskVine relevant pieces of the DAG to inspect and identify dependent sequences.

The impact of task grouping was evaluated on a benchmark application. Each benchmark run consists of 20 task sequences of variable length. Each task in a sequence produces and consumes an intermediate file of variable size. The effect on running time is shown while adjusting these parameters. Grouping tasks was shown to benefit the performance across a variety of sequence lengths, and when intermediate data size exceeds 500MB.

In addition to the positive benchmark results, this talk promoted discussion about this non-traditional interaction between DAG manager and executor. The DAG manager has insight into future tasks and their dependencies. The executor possesses knowledge about data locality. Combining this information to make better scheduling decisions proved to have some utility, yet it challenged the typical scheme of communication between Parsl and the TaskVine executor.

Mon, 07 Oct 2024 14:16:00 +0000

Predicting Resources of Tasks in Dynamic Workflows with Bucketing Algorithms at IPDPS 2024

Thanh Son Phung will present Adaptive Task-Oriented Resource Allocation for Large Dynamic Workflows on Opportunistic Resources at the International Parallel and Distributed Processing Symposium 2024.

Non-technical users running dynamic workflows usually have no knowledge of resource (e.g., cores, memory, disk, GPUs) consumption of tasks within a given workflow. However, this omission of information can bring substantial performance degradation as it is important for workflow execution engines to know the exact amount of resources each task takes and make scheduling decisions accordingly.

In the figure above, an application submits many invocations of two functions: f and g. The invocations are transformed into tasks through many layers and scheduled on worker nodes. Note that the resource specification of each task may be unknown.

Since a manual resource specification for each task is unreliable, an automated solution is required. Such solution needs to be (1) general-purpose (can work with any workflow), (2) prior-free (no prior information is used as workflows may change over each run), (3) online (collect information and run as the workflow runs), and (4) robust (performant on many distributions and unexpected changes). The below figure shows two workflows: ColmenaXTB (top row) and TopEFT (bottom row). Resource types range from core, memory, disk, and execution time from left to right.

To predict resources of each task, we first collect resource records of completed tasks, group them into buckets of tasks based on their similarity in resource consumption, and probabilistically choose a bucket to allocate the next task, as shown in the example below.

We introduce two bucketing algorithms, namely Greedy Bucketing and Exhaustive Bucketing, which share the same principle of resource prediction and only differ in the way buckets of tasks are computed.

We compare the two algorithms with five alternative algorithms on 7 workflows with 3 resource types. Results show that Greedy Bucketing and Exhaustive Bucketing consistently outperform other algorithms by yielding the highest resource efficiency in the majority of cases.

Further details can be found in the paper:

Thanh Son Phung, Douglas Thain, Adaptive Task-Oriented Resource Allocation for Large Dynamic Workflows on Opportunistic Resources, IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 12, May, 2024.

Mon, 04 Mar 2024 15:24:00 +0000