Because TaskVine inputs can migrate between workers to satisfy locality, it is quite common for several nodes to end up caching the same temporary file; the effect is most pronounced on the faster machines, which finish more tasks per unit time and accumulate many more staged inputs than their slower peers. To keep this extra baggage from overwhelming NLS, the TaskVine manager intervenes every time a task completes. It walks the task’s input list, and for each temporary file that now has more replicas than the configured target it inspects the workers currently holding that file.
Nothing is deleted until every replica reports the READY state, guaranteeing that an in-flight transfer is not disrupted, and the manager double-checks that none of the workers are still executing tasks that depend on the file. Only after these safety checks does it queue up the redundant replicas for removal, restoring balance without jeopardizing correctness. The number of redundant replicas is calculated as the difference between the current number of replicas and the user-specified target, ensuring that at least one replica always remains available for future use. Workers holding these excess replicas are ranked by their free cache space, prioritizing cleanup on those with more available storage to maintain balanced disk utilization.
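To make the selection concrete, the following is a minimal Python sketch of the pruning step described above, assuming simple dictionaries for the replica catalog; the real logic lives in the manager's C code, and every name here (prune_redundant_replicas, free_cache, and so on) is a hypothetical stand-in rather than an actual TaskVine symbol.

# Illustrative sketch only: the real pruning logic is in the TaskVine manager's C code.
def prune_redundant_replicas(temp_inputs, replicas, states, in_use, free_cache, target):
    """Return (worker, file) pairs whose replicas can be safely removed.

    temp_inputs: temporary files used by the task that just completed
    replicas:    file -> set of workers currently caching it
    states:      (worker, file) -> replica state, e.g. "CREATING" or "READY"
    in_use:      (worker, file) -> True if a running task on that worker still reads the file
    free_cache:  worker -> free cache space in bytes
    target:      configured replica target
    """
    removals = []
    for f in temp_inputs:
        holders = replicas.get(f, set())
        excess = len(holders) - target              # redundant count = current minus target
        if excess <= 0:
            continue
        # Safety checks: every replica must be READY (no in-flight transfer), and no
        # worker may still be executing a task that depends on the file.
        if any(states[(w, f)] != "READY" for w in holders):
            continue
        if any(in_use.get((w, f), False) for w in holders):
            continue
        # Rank holders by free cache space, cleaning up first on workers with more room.
        for w in sorted(holders, key=lambda w: free_cache[w], reverse=True)[:excess]:
            removals.append((w, f))
    return removals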
In TaskVine, Disk Load Shifting is triggered when a worker reports that a temporary file has been cached. When that report arrives, the manager’s cache-update handler inspects the newly staged replica and, if shifting is enabled, scans the worker pool for a lighter host. Only workers that are active, capable of peer transfers, and not already holding the file are considered. Candidates that would end up heavier than the source after receiving the file are skipped, so the lightest eligible worker takes the replica and each transfer moves free space toward balance instead of swapping hot spots.
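As a rough illustration of how a destination is chosen, here is a small Python sketch that assumes a worker’s “weight” is simply its used node-local storage and that the candidate list has already been filtered to active, transfer-capable workers; the function and parameter names are hypothetical, not the manager’s actual internals.

# Illustrative sketch only: picks the lightest eligible worker for a newly cached temp file.
def pick_shift_destination(source, file_size, candidates, holders, used_cache):
    """Return the lightest eligible worker to receive the replica, or None.

    source:     worker that just cached the file
    file_size:  size of the file in bytes
    candidates: active, transfer-capable workers
    holders:    workers that already hold a replica of the file
    used_cache: worker -> bytes of node-local storage currently in use
    """
    best = None
    for w in candidates:
        if w == source or w in holders:
            continue
        # Skip candidates that would end up heavier than the source after receiving
        # the file, so each transfer moves free space toward balance.
        if used_cache[w] + file_size > used_cache[source]:
            continue
        if best is None or used_cache[w] < used_cache[best]:
            best = w
    return best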
The migration reuses TaskVine’s existing peer-transfer pipeline. The destination streams the file directly from the source, the replica catalog tracks its state from CREATING to READY, and both workers update their transfer counters for admission control. Once the new replica is confirmed, the original worker releases its redundant copy to the cleanup routine, reclaiming the disk space it no longer needs. The work involved is modest, just a single hash-table scan and one network transfer per staged replica, but the payoff is immediate: fast workers keep their disk usage in check, slower nodes lend their idle capacity, and heterogeneous clusters keep node-local storage evenly balanced without reducing throughput.
The following figure compares the NLS usage across all workers over time in the DV5 workflow, before and after enabling the two techniques.
After enabling Redundant Replica Removal and Disk Load Shifting, NLS usage among workers became much more balanced. As shown in the bottom figure, storage consumption across nodes stayed within a narrow band under 10 GB, compared to a spread of more than 20 GB and a pronounced skew before optimization. This indicates that the two techniques effectively prevented disk hotspots and improved overall resource utilization. The balance does come at a cost: the pre-optimization run completed in 206.85 seconds, while the optimized run took 311.92 seconds, roughly 50% longer due to the additional data transfers.
Both techniques are implemented on the TaskVine manager side in C, but from the user’s perspective they are simple to enable. After creating a manager object through the Python interface, for example:
m = vine.Manager(port=[9123, 9130])
you can activate them individually with:
m.tune("clean-redundant-replicas", 1) and m.tune("shift-disk-load", 1).
While these modes are effective, they are not always recommended, since the additional data transfers and computations may introduce overhead and reduce overall throughput. However, if your workflow runs on disk-constrained nodes or workers are being evicted due to insufficient storage and you cannot request more disk space, enabling these options can significantly improve stability and performance.