SeqAlign: Hardware Acceleration of DNA Sequence Alignment

dalke · on Oct 6, 2013

Did anything ever become of that work? Based on the project link to http://opencores.org/websvn,listing,seqalign , I see that the last modification time is 2009-08-17 18:21:06 GMT, so it's been 4 years.

BTW, there are more details at http://chrisfenton.com/wp-content/uploads/2009/08/final_repo... , which has a date one day before the above "last modified" time.

tensor · on Oct 6, 2013

As far as I know, people generally use more sophisticated algorithms that run on commodity hardware. I remember hearing about this many years ago, but never actually saw it in the wild.

epistasis · on Oct 6, 2013

Smith-Waterman is the "gold-standard," but for speedy heuristic DNA alignment these days, BWA and Bowtie are probably the two most common mappers, and they're both based on the ideas of the FM-index [1]. BLAST and BLAT were previously the two most commonly used, both hash-based aligners, and they're still used today for one-off database searches, as they are more accurate, particularly for long sequences.

[1] http://en.wikipedia.org/wiki/FM-index
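
For anyone who hasn't seen the FM-index idea before, here is a minimal sketch of backward search over a Burrows-Wheeler transform. It is a toy, not how BWA or Bowtie are implemented: the function names (`bwt`, `build_fm`, `count_matches`) are made up for illustration, the transform is built by sorting rotations (only sensible for tiny inputs), and the occurrence table is stored uncompressed.

```python
from collections import Counter

def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # sentinel; sorts before A/C/G/T in ASCII
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def build_fm(bwt_str):
    """Build the two tables that backward search needs."""
    counts = Counter(bwt_str)
    # C[c] = number of characters in the text strictly smaller than c
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = occurrences of c in bwt_str[:i]
    occ = {c: [0] * (len(bwt_str) + 1) for c in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ

def count_matches(pattern, C, occ, n):
    """Count exact occurrences of pattern; n is len(bwt_str)."""
    lo, hi = 0, n  # half-open range of suffix-array rows
    for ch in reversed(pattern):  # consume the pattern right to left
        if ch not in C:
            return 0
        # two occ lookups per pattern character
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("GATTACAGATTACA")
C, occ = build_fm(b)
print(count_matches("GATTACA", C, occ, len(b)))
```

Each pattern character narrows a range of suffix-array rows using two lookups into `occ`, which is why the search cost depends on the pattern length rather than the genome length.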

hyperbovine · on Oct 6, 2013

But see: http://snap.cs.berkeley.edu. BWT-based aligners hark back to a time when it was not cheap to own enough RAM to store an entire (human) genome seed lookup table in memory. That is no longer the case (you need about 64 GB). Also, hash aligners perform better as the read length increases.

epistasis · on Oct 6, 2013

Well, BLAT, for example, also stores the entire seed table in memory; it just has to use fewer, shorter, and more frequently occurring seeds. BWT requires something like two random RAM accesses per base pair, which is incredibly slow. With a large enough seed table, hash-based aligners can get down to just a few random accesses per 100-base-pair read.
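
To make the access-count argument concrete, here is a rough sketch of a hash seed table. The names (`build_seed_table`, `candidate_positions`) and the tiny reference string are invented for the example, and this is not SNAP's or BLAT's actual data structure; it only shows that a read split into non-overlapping k-mers needs about `len(read) / k` table lookups.

```python
from collections import defaultdict

def build_seed_table(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(reference) - k + 1):
        table[reference[i:i + k]].append(i)
    return table

def candidate_positions(read, table, k):
    """Look up non-overlapping seeds; each lookup is one hash-table access."""
    votes = defaultdict(int)
    lookups = 0
    for offset in range(0, len(read) - k + 1, k):
        lookups += 1
        for pos in table.get(read[offset:offset + k], ()):
            # pos - offset is where the read would start in the reference
            votes[pos - offset] += 1
    return dict(votes), lookups

reference = "ACGTACGTTAGCCGATAGGCTTAACGGATC"
table = build_seed_table(reference, 4)
votes, lookups = candidate_positions(reference[8:24], table, 4)
print(votes, lookups)
```

Longer seeds mean fewer lookups per read and fewer spurious hits per lookup, at the cost of a bigger table, which is the memory trade-off being discussed here.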

Has SNAP been published? I had heard about it but not seen it used in practice anywhere.

epistasis · on Oct 6, 2013

BTW, thank you for the pointer to the repository! I hadn't realized that it was publicly available at all. There are a few niceties that would be good, such as reading gzip directly (named pipes are a hassle). But it's also very cool that they're using HugeTLB. I'm always glad when good systems people apply themselves to bioinformatics; they can really squeeze the most out of recent machines!

tensor · on Oct 6, 2013

If you are interested in high accuracy for highly diverged DNA, take a look at:

http://www.ncbi.nlm.nih.gov/pubmed/20733242 https://github.com/akhudek/feast

Sadly, I haven't seen this research applied anywhere. I'm not sure if it's a lack of advertising, or that there is nothing interesting at extreme divergences.

usamec · on Oct 6, 2013

It is much better to speed up alignment using heuristics (e.g. use hashing to find matching k-mers and do dynamic programming only on a small portion of the data) than by using faster hardware. Look here: http://bowtie-bio.sourceforge.net/index.shtml or here: http://mummer.sourceforge.net/
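
A minimal sketch of that seed-and-extend idea, under some loud assumptions: the function names, the scoring parameters (match 2, mismatch -1, linear gap -2), `k`, and `pad` are all arbitrary choices for illustration, and neither Bowtie nor MUMmer works exactly like this. The point is only that the quadratic Smith-Waterman step runs on a short window around each k-mer hit instead of on the whole reference.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b, O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def seed_and_extend(read, reference, k=8, pad=4):
    """Return (best score, window start) using k-mer hits to pick windows."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    best = (0, None)
    seen = set()  # window starts that were already scored
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = max(0, pos - offset - pad)
            if start in seen:
                continue
            seen.add(start)
            # dynamic programming only on a read-sized window, not the genome
            window = reference[start:pos - offset + len(read) + pad]
            score = smith_waterman(read, window)
            if score > best[0]:
                best = (score, start)
    return best

reference = "ACGTTGCATGACCGATAGGCTTAACGGATCCGTAGCTAGGATCCATTGACGTCAAGGTCA"
read = reference[20:44]
print(seed_and_extend(read, reference))
```

If no k-mer of the read occurs in the reference, the sketch returns `(0, None)`; that lost sensitivity is the price of the heuristic compared with running full Smith-Waterman everywhere.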

TheLegace · on Oct 6, 2013

I'm curious what people would think if they had this in their computers, laptops, and game systems, so they could actively choose to help large-scale computation for DNA sequencing or any other big scientific problem they could lend computation to.

gren · on Oct 6, 2013

Yeah! Like some hidden script in popular websites doing some computation with WebCL.

oakwhiz · on Oct 6, 2013

I wonder if it would somehow be possible to use this with the protein folding game Foldit. http://fold.it/portal/

Some puzzles start off with a sequence alignment phase.

epistasis · on Oct 6, 2013

These particular algorithms are commonly used with proteins' amino acid sequences for general database searches. But since the database of protein sequences is far, far smaller than typical DNA search problems, more sophisticated and computationally expensive algorithms such as Pair HMMs [1] or Profile HMMs [2] can be used, and for fine-tuning of 3D model threading they would be a much better option, since the problem is so small.

[1] http://ai.stanford.edu/~serafim/CS262_2008/notes/lecture8.pd...

[2] http://www.biology.wustl.edu/gcg/hmmanalysis.html

kevinwuhoo · on Oct 6, 2013

Structural alignment algorithms are (generally) much different, and structural alignment is a much more difficult problem. The "most popular" (at least in my experience) methods are combinatorial extension [1] and DALI [2]. But I'm sure it could be done, no doubt about it. However, most people, probably besides the ones at the PDB, usually don't do massive structural alignment tasks. Developing and maintaining such a device probably isn't feasible.

[1]: http://peds.oxfordjournals.org/content/11/9/739

[2]: http://www.sciencemag.org/content/273/5275/595