
Darwin: a genomics co-processor provides up to 15,000x acceleration - godelmachine
https://blog.acolyer.org/2018/04/19/darwin-a-genomics-co-processor-provides-up-to-15000x-acceleration-on-long-read-assembly/
======
klmr
> The long read technology comes with a drawback though – high error rates in
> sequencing of between 15%-40%

These error rate estimates are _seriously_ outdated (the quoted number is from
a 2015 paper but it was obsolete pretty much as soon as that paper was
published). Long-read technologies are evolving rapidly, so this is important.
The state of the art is working with error rates of at most 15% (but usually
much less), which, after correction, go down below 8% [1].

This is crucial because it makes the difference between a successful and a
failed assembly: 40% error rate essentially means that you need different
algorithms to perform sequence assembly (as the article notes, error
correction takes up “orders of magnitude” more time than actual assembly).

I’d therefore be curious how this Darwin setup performs against conventional,
state-of-the-art sequence assembly with state-of-the-art long read sequencing
data.

[1]
[https://www.nature.com/articles/nbt.4060](https://www.nature.com/articles/nbt.4060)

~~~
stochastic_monk
Read page 8 [206] of this paper to see the numbers they're providing. The
PacBio reads are listed at up to 15% error rate. Their experiments used
synthetic data with 15-40% error rates.

A few comments:

1. This was published in ASPLOS, not a comp bio journal, which should raise
some eyebrows.

2. As another commenter said, error correction requires a reference, either
by alignment, k-mer filtering, or consensus. However, this is usually part of
any assembly pipeline.

3. Like you said, they failed to compare against actual state-of-the-art
alignment tools. They do cite Canu, but they don't use it as their baseline.
The issue here is that the current best long-read assemblers use
locality-sensitive hashing. The Canu paper even states that miniasm can
produce a higher-quality assembly for CHM1 in 1/400 the CPU time (though
their discussion of the table selectively ignores this fact). The fact that a
combination of algorithm and implementation can already provide a 400x
speedup makes this result seem significantly less impressive, and a
comparison of CPU time is slightly unfair to miniasm considering how well it
scales across threads, which would make the other tools look even worse by
comparison. Additionally, for problems where you really need the improvement
in performance, the GPU RAM limitations and communication overhead would
likely incur significant penalties to their method. It's not simply a matter
of how many reads per second you can process, because it's a quadratic
overlap problem.
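
To make the locality-sensitive-hashing point concrete: tools in the MHAP/minimap
family compare small MinHash sketches of each read's k-mer set and only pass
likely-overlapping pairs on to expensive base-level alignment. A minimal, purely
illustrative Python sketch (the function names and the k / num_hashes values are
mine, not taken from any real tool):

```python
import hashlib
from itertools import combinations

def kmers(seq, k=16):
    """All k-mers of a read (toy k; real tools tune this)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(seq, k=16, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum hash
    over the read's k-mers. Similar k-mer sets give similar signatures."""
    sig = []
    for salt in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{salt}:{km}".encode(), digest_size=8).digest(),
                "big")
            for km in kmers(seq, k)))
    return sig

def candidate_overlaps(reads, threshold=0.2, **kw):
    """Keep only read pairs whose signatures agree on >= threshold of
    positions (an estimate of k-mer Jaccard similarity); only these
    candidates would go on to expensive alignment."""
    sigs = {name: minhash_signature(seq, **kw) for name, seq in reads.items()}
    hits = []
    for (a, sa), (b, sb) in combinations(sigs.items(), 2):
        sim = sum(x == y for x, y in zip(sa, sb)) / len(sa)
        if sim >= threshold:
            hits.append((a, b, sim))
    return hits
    # Real implementations bucket signatures into an index instead of
    # comparing all pairs, which is what actually breaks the quadratic cost.
```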

~~~
klmr
Their 40% number comes from the paper I referenced (i.e. that’s the one the
Darwin paper references). I completely agree with your point 2. I’m not sure I
agree with your first point: I think it’s pretty reasonable for it to have
been published in that venue.

------
tradedash
Serious question: if the processing problems are classified as NP, then
couldn't that processing be outsourced to a blockchain? Why have miners solve
useless problems with no lasting impact, like "the number of leading zeros in
a SHA hash", instead of processing data such as this? I can easily see the
mining problem in a given blockchain being based on real scientific problems
that need solving.

I'm looking for someone to help me understand why the above isn't possible.

~~~
JD557
I thought about this as well before, but the problem is that you don't want
just ANY NP problem: you want an NP problem (it doesn't even have to be
NP-complete) where you can easily adjust the difficulty.

How do you adjust the difficulty on something like this?

Also, you always have to calculate the "hashes" of the blocks. I guess you
could encode a block as a graph and its "hash" could be the shortest path
that visits all the edges, but how useful would that be?
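
For context on the difficulty knob: in hash-based proof of work, difficulty is
just the number of leading zero bits the block hash must have, and each extra
bit doubles the expected work. A toy sketch (the payload and nonce encoding
here are made up, not any real chain's block format):

```python
import hashlib
from itertools import count

def mine(block_data: bytes, difficulty_bits: int) -> int:
    """Find a nonce so that SHA-256(block_data || nonce) has at least
    difficulty_bits leading zero bits. Expected work doubles per extra bit."""
    target = 1 << (256 - difficulty_bits)   # hash must fall below this value
    for nonce in count():
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

# mine(b"block payload", 20) takes ~2**20 hash attempts on average; bumping to
# 21 bits doubles that. There is no equally smooth, cheaply verifiable knob
# for "assemble this genome" or "factor this number".
```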

When I looked into this in the past, I stumbled into Gridcoin[1], but I don't
know how they deal with the difficulty/usefulness of the solutions.

[1]: [https://www.gridcoin.us/](https://www.gridcoin.us/)

~~~
tradedash
Thanks for the link. I'll look into it!

> you can easily adjust the difficulty.

Take prime factorization, for instance, which, like you say, isn't even
NP-complete: you can easily adjust the difficulty by increasing the size of
the number in question.

~~~
JD557
I'm not sure if that's completely true for prime factorization. There are
numbers that are larger, yet easier to factorize.

For example, I think that a power of two, like 1024, is much faster to
factorize than, let's say, 1001 (7 × 11 × 13). (I'm not quite sure this
example is true.)
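
A small illustration of that point: trial division strips out small prime
factors almost instantly, so a number built only from small primes (like any
power of two) stays easy no matter how large it gets, while a similar-sized
product of two large primes is the genuinely hard case. This is a toy sketch,
not a serious factoring routine:

```python
def trial_division(n: int) -> list[int]:
    """Factor n by repeated trial division. Fast when every prime factor is
    small; hopeless when n is the product of two large primes."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(trial_division(1001))            # [7, 11, 13] -- instant
print(len(trial_division(2 ** 1024)))  # 1024 factors of 2 -- also instant
# trial_division(p * q) for two random 512-bit primes would never finish:
# size alone isn't the difficulty knob, the factor structure is.
```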

Here's a more detailed description:
[https://mathoverflow.net/questions/249266/classes-of-numbers...](https://mathoverflow.net/questions/249266/classes-of-numbers-that-are-easy-to-factorize-using-classical-computers)

------
dekhn
There have been a lot of co-processors in the genomics and sequencing world.
They have all failed for obvious business reasons. It's not even clear that
any genomics problem that exists today can't be solved with conventional
hardware. Most genomics programs use only a few percent of the hardware's
capacity, so I think more software tuning and design is the right approach.

~~~
searine
I agree. The problem in genomics isn't how fast alignment is, it's where to
put the gobs and gobs of data.

The bottleneck in speed is how fast you can flip a DNA sequencer, and with an
Illumina NovaSeq that bottleneck is currently a few days per run. That gives
plenty of time to process each run's data on an HPC cluster with current
software for alignment and SNP calling.

However, when each runfolder takes up 30 TB and you're flipping 20 runfolders
a week, the scale of storage gets large, fast.
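
Putting rough numbers on that (these are just the figures from the comment
above, not official NovaSeq specs):

```python
# Back-of-the-envelope storage growth from the numbers quoted above.
runfolder_tb = 30           # per runfolder (commenter's figure)
runs_per_week = 20

weekly_tb = runfolder_tb * runs_per_week   # 600 TB per week
yearly_pb = weekly_tb * 52 / 1000          # ~31 PB per year, before replication
print(f"{weekly_tb} TB/week, ~{yearly_pb:.0f} PB/year")
```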

~~~
klmr
That’s why compression of sequencing data is becoming so important. CRAM is
nowhere near the theoretical limit (CRAM achieves ~2x over BAM; commercial
products currently achieve up to 6x).

(COI disclaimer: I work for such a company.)

~~~
dekhn
It's not clear you want to store your sequence data maximally compressed
unless you have it archived. If you're doing live work on the data, it's
better to have it modestly compressed with a fast decompressor, and a good
index so you can minimize your total access.

~~~
klmr
That’s plausible but it turns out not to be true. Disk read latency is larger
than the overhead of our decompression, even on fast storage media. As a
consequence, better compression actually leads to (slight) performance
improvements. Additionally, genomic data is often accessed via relatively slow
network storage on clusters or, worse, via the internet. Increasing throughput
trumps all other considerations here. You’re right concerning random access
and indexing, of course.
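
The arithmetic behind that claim (and behind the "moderate compression"
counterpoint below) can be sketched with a toy model and made-up numbers:
effective read throughput is roughly storage bandwidth times compression
ratio, capped by how fast you can decompress:

```python
def effective_throughput_mb_s(storage_mb_s, ratio, decompress_mb_s):
    """Rough model: compressed bytes arrive at storage_mb_s and expand by
    `ratio`, but output can never exceed the decompressor's speed."""
    return min(storage_mb_s * ratio, decompress_mb_s)

# Slow network storage (illustrative numbers only):
print(effective_throughput_mb_s(100, 2, 1500))   # lighter compression -> 200 MB/s
print(effective_throughput_mb_s(100, 6, 1500))   # heavier compression -> 600 MB/s

# Fast local NVMe, where the decompressor becomes the bottleneck:
print(effective_throughput_mb_s(3000, 6, 1500))  # capped at 1500 MB/s
print(effective_throughput_mb_s(3000, 2, 4000))  # lighter, faster codec wins
```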

~~~
dekhn
I counter your argument: I have constructed systems that did this. Disk read
latency doesn't matter for streaming reads.

The actual math for this for a product at scale is interesting; I built such a
product and did the math, and we found that moderate compression gave higher
rates. However, this is based on production-class infrastructure.

~~~
dekhn
I should also mention: using more CPU to decompress highly compressed data
means you have less CPU available (in a multithreaded app in a
resource-managed environment) for other work. Since after decompressing you're
going to be doing a bunch of other work (and a lot of that work will be going
on simultaneously in other threads), using less CPU to decompress can give
higher throughput. But, again, this is a very complex problem.

------
Scaevolus
Slides from a talk by the author that give a good introduction to the
landscape and the previous algorithms:
[https://platformlab.stanford.edu/Seminar%20Talks/Yatish_Tura...](https://platformlab.stanford.edu/Seminar%20Talks/Yatish_Turakhia.pdf)

------
daemonk
Cool. But I think the current state of the art is using minhashes (minimap2)
to avoid doing expensive all-vs-all alignments. There is still a place for
more sensitive alignments, though.

------
noobermin
Very interesting write-up too! I often read that computational biology is a
different beast from the rest of the computational landscape (which usually
revolves around solving PDEs, fundamentally), and this was a nice picture into
at least one problem on that side of the pond. A "best substring" search, to
me, sounds closer to the kind of problems CS people think about than what
computational cats do.

------
m3kw9
“Up to x% faster”: another marketing term that used to work.

------
lewisinc
Oh no, someone should tell them that Darwin is the name of the kernel inside
iOS/macOS.

~~~
_of
... as well as a British biologist.

~~~
ShinTakuya
That was clearly the joke.

