SAND - Scalable Assembly at Notre Dame | The Cooperative Computing Lab

SAND is a set of modules for genome assembly that are built atop the Work Queue platform for large-scale distributed computation on clusters, clouds, or grids. SAND was designed as a modular replacement for the conventional overlapper in the Celera assembler, with the overlapping work separated into two distinct steps: candidate filtering and alignment.
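The two-step split can be illustrated with a toy sketch. This is not SAND's implementation: it assumes that candidate filtering means keeping only pairs of reads that share at least one k-mer, and it stands in a simple exact suffix/prefix match for a real alignment kernel. All function names and parameters here (`filter_candidates`, `overlap_length`, `find_overlaps`, `k`, `min_len`) are hypothetical.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_candidates(reads, k):
    """Step 1 (assumed filter): keep pairs of reads sharing at least one k-mer."""
    index = {}
    for name, seq in reads.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)
    candidates = set()
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            candidates.add((a, b))
    return sorted(candidates)


def overlap_length(a, b, min_len):
    """Step 2 (toy alignment): longest exact suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def find_overlaps(reads, k=4, min_len=4):
    """Run the expensive alignment step only on pairs that pass the cheap filter."""
    overlaps = []
    for a, b in filter_candidates(reads, k):
        for x, y in ((a, b), (b, a)):  # try both orientations of the pair
            n = overlap_length(reads[x], reads[y], min_len)
            if n:
                overlaps.append((x, y, n))
    return overlaps
```

The point of the split is that the cheap filter discards most pairs, so the costly alignment runs only on the small set of plausible candidates, and each surviving pair can be aligned independently of the others.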

To use SAND, you start your assembly process as normal, then run a lightweight worker program on as many other machines as you can access. You can start the workers manually, run them on the cloud, or submit them to batch systems like Condor or SGE. SAND will organize the machines into a workforce that, under the right conditions, can speed up assembly tasks several hundred-fold.
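The master/worker pattern described above can be sketched locally. This is only an illustration under stated assumptions: a thread pool stands in for remote workers, and it does not use the Work Queue API. The names `make_tasks`, `run_task`, `run_master`, and `toy_score`, and the `batch_size` parameter, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_score(a, b):
    """Stand-in for an alignment kernel: count positions where a and b agree."""
    return sum(x == y for x, y in zip(a, b))


def make_tasks(pairs, batch_size):
    """Split the list of candidate pairs into independent batches (tasks)."""
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]


def run_task(batch, reads, align):
    """What one worker does: align every pair in its batch."""
    return [(a, b, align(reads[a], reads[b])) for a, b in batch]


def run_master(pairs, reads, align, workers=4, batch_size=2):
    """Hand batches to a pool of workers and gather results in task order."""
    tasks = make_tasks(pairs, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda batch: run_task(batch, reads, align), tasks)
        return [r for batch_result in results for r in batch_result]
```

Because each batch is independent, adding workers shortens the run until the master's dispatch overhead or data transfer becomes the bottleneck, which is one reason the achievable speedup depends on conditions.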

The output of SAND has been validated for correctness on the Anopheles gambiae, Sorghum bicolor, and Homo sapiens datasets listed below.

Sample Data

The following are the datasets used for evaluating SAND in our various publications. The .cfa data format is binary Compressed FAsta, which can be converted to/from plain text FASTA files using sand_compress_reads and sand_uncompress_reads.

(Note: We are in the middle of restoring these datasets from backup. The small, medium, and large datasets are available for download. The repeat files are currently being regenerated. The human dataset is still being restored.)

Sequence Data	Repeat Data	Number of Reads	Compressed Size	Notes
small.cfa	small.repeats	101617	21MB	Small subset of Anopheles gambiae.
medium.cfa	medium.repeats	2586385	642MB	Full set of reads from the Anopheles gambiae Mopti form.
large.cfa	large.repeats	7915277	1.7GB	Simulated reads from the Sorghum bicolor genome.
human.cfa	human.repeats	31257852	7.1GB	Venter Homo sapiens genome.

Related Publications

A Framework for Scalable Genome Assembly on Clusters, Clouds, and Grids

Christopher Moretti, Andrew Thrasher, Li Yu, Michael Olson, Scott Emrich, and Douglas Thain

IEEE Transactions on Parallel and Distributed Systems, 2012

doi: 10.1109/TPDS.2012.80


@article{assembly-tpds,
  author = {Moretti, Christopher and Thrasher, Andrew and Yu, Li and Olson, Michael and Emrich, Scott and Thain, Douglas},
  title = {{A Framework for Scalable Genome Assembly on Clusters, Clouds, and Grids}},
  journal = {{IEEE Transactions on Parallel and Distributed Systems}},
  volume = {23},
  number = {12},
  year = {2012},
  note = {{doi: 10.1109/TPDS.2012.80}},
  cclpaperid = {100},
  keywords = {workqueue, sand},
}

Highly Scalable Genome Assembly on Campus Grids

Christopher Moretti, Michael Olson, Scott Emrich, and Douglas Thain

In Many-Task Computing on Grids and Supercomputers (MTAGS), 2009

doi: 10.1145/1646468.1646480


@inproceedings{assembly-mtags09,
  author = {Moretti, Christopher and Olson, Michael and Emrich, Scott and Thain, Douglas},
  title = {{Highly Scalable Genome Assembly on Campus Grids}},
  booktitle = {{Many-Task Computing on Grids and Supercomputers (MTAGS)}},
  year = {2009},
  note = {{doi: 10.1145/1646468.1646480}},
  cclpaperid = {82},
  keywords = {sand},
}

Scalable Modular Genome Assembly on Campus Grids

Christopher Moretti, Michael Olson, Scott Emrich, and Douglas Thain

Technical Report 2009-04, University of Notre Dame, Computer Science and Engineering Department, 2009


@techreport{assembly-tr,
  author = {Moretti, Christopher and Olson, Michael and Emrich, Scott and Thain, Douglas},
  title = {{Scalable Modular Genome Assembly on Campus Grids}},
  institution = {{University of Notre Dame, Computer Science and Engineering Department}},
  number = {2009-04},
  year = {2009},
  cclpaperid = {77},
  keywords = {workqueue, sand},
}