
Optimizing CRISPR: Sub-second searches on a 3 billion base genome - sajithw
http://benchling.engineering/optimizing-crispr/
======
tstactplsignore
It seems to me as though the authors are unfamiliar with standard
bioinformatics practice - it's not like this is an unsolved problem, and the
existing solutions are faster, more computationally elegant, and more
flexible. Actually computing and searching for all strings within an edit
distance of X of the query is an incredibly poor way of solving the alignment
problem (guess how long their strategy takes when you increase the edit
distance to 5?), and it throws out 40+ years of research on sequence
alignment. The actual solutions solve this in tenths of a second in vastly
superior ways.
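
For scale: the number of distinct 20-mers within Hamming distance k of a
query is sum(C(20, i) * 3^i) for i = 0..k - and that's mismatches only;
allowing indels makes the blow-up worse. A quick back-of-the-envelope (my
own illustration, not their code):

    // Count 20-mers within Hamming distance k of a fixed query:
    // choose i positions to change, 3 alternative bases at each.
    #include <cstdint>
    #include <cstdio>

    int main() {
        const uint64_t n = 20;
        uint64_t binom = 1, pow3 = 1, total = 0;
        for (uint64_t i = 0; i <= 5; ++i) {
            total += binom * pow3;
            printf("k <= %llu: %llu candidate strings\n",
                   (unsigned long long)i, (unsigned long long)total);
            binom = binom * (n - i) / (i + 1);  // C(n, i) -> C(n, i+1)
            pow3 *= 3;
        }
        return 0;  // k <= 5 is already ~4.2 million strings per query
    }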

BLAT [0] is the most obvious and preferred solution, designed for alignment
against a reference sequence - a 20 bp search against the human genome should
be essentially instantaneous. BLAST [1] is more versatile and a bit slower
than BLAT, but would also align sequences of this size against the human
genome in ~1-2 seconds; it's the traditional solution to the alignment
problem and has no license restrictions. BWA [2] and Bowtie [3] could also be
adapted to the task by tweaking their default settings (they're optimized for
performing this task with larger strings, on the order of thousands of times
per second).
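
If you want to see for yourself, a standalone BLAST run for short queries
looks something like this (database name is whatever you built locally, and
parameters would need tuning):

    # blastn-short tunes word size and scoring for queries under ~30 bp;
    # the permissive e-value is needed because short hits score weakly
    blastn -task blastn-short -word_size 7 -evalue 1000 \
           -query guides.fa -db hg38 -outfmt 6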

More generally, it would not be difficult to re-implement any of the
algorithms behind these software packages if the authors really wanted to.
It's weird - this is the second post I've seen recently where software folks
now working in the bioinformatics space seem completely unaware of both the
basic algorithms we use in computational biology and their common
implementations, like Smith-Waterman and Burrows-Wheeler. These are
complicated problems with 40+ years of research behind them, and the actual
solutions are elegant, fast algorithms that solve the problem in a far
superior way within reasonable computational time.

[0] [http://genome.ucsc.edu/cgi-bin/hgBlat](http://genome.ucsc.edu/cgi-bin/hgBlat)

[1] [http://blast.ncbi.nlm.nih.gov/](http://blast.ncbi.nlm.nih.gov/)

[2] [http://bio-bwa.sourceforge.net/](http://bio-bwa.sourceforge.net/)

[3] [http://bowtie-bio.sourceforge.net/index.shtml](http://bowtie-bio.sourceforge.net/index.shtml)

~~~
vineetg
Thanks for bringing up all of these alternatives. We definitely would have
preferred using an existing solution over building our own.

Unfortunately, a lot of the existing software is not intended for the search
we're trying to do, or does not perform well under these conditions. We did in
fact experiment with some of them before building our own. Bowtie, for
example, doesn't allow more than 3 mismatches, and is also intended for
alignments where the number of matches is very small (close to 1).
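
For reference, the bowtie (1.x) invocation we'd have wanted looks roughly
like this - the -v mismatch-only mode is hard-capped at 3 (index and file
names are just placeholders):

    # -v 3: report alignments with at most 3 mismatches (3 is the max)
    # -a:   report all alignments, not just the best one
    # -f:   queries are FASTA (our 20 bp guides)
    bowtie -v 3 -a -f hg38_index guides.fa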

Since we need to be able to support multiple genomes (see Josh's comment), the
amount of RAM we need to run a particular set of alignments is relatively
important. Tools like BLAT (which also seems intended for sequences longer
than 25 bases) need to keep the entire genome index in memory. This means we
would need to spin up a lot more servers to handle parallel requests,
especially with different genomes.

FWIW our search is only a couple hundred lines of C++, and runs with very
little memory.

~~~
tstactplsignore
Fair points about both the memory requirement and Bowtie - also, 20 bp is
_really_ small, so while BLAT seems to find 20 bp hits correctly, it doesn't
find 20 bp hits with mismatches. Why be so memory-stringent, though? This is
genomics, after all... If you're averse to rolling your own, I'd look into
BWA-MEM as a solution (it has more adjustable mismatch scoring than Bowtie),
or figure out how Primer-BLAST manages to do specificity checks for small
sequences so well.
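
Untested, but something like this is where I'd start with BWA-MEM for 20 bp
queries (parameters would need tuning; file names are placeholders):

    # -k: lower the minimum seed length from the default of 19
    # -T: lower the minimum output score so short hits aren't filtered out
    # -a: output all alignments, not just the primary one
    bwa mem -k 11 -T 16 -a hg38.fa guides.fa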

~~~
joshma
We care about reducing memory usage because it gives us greater flexibility in
our infrastructure. We're offering this to hundreds of scientists a day - not
huge numbers in the grand scheme of things, but usage is bursty and comes from
paying customers. This means we can mix CRISPR work in with our other
processing infrastructure, and we can choose between many servers with low
memory vs. a few servers with high memory.

Maybe genomics doesn't have to be so memory hungry. 8)

------
joehilton
This is exciting stuff. However, the sad takeaway for me is that the broken
patent system is already stifling what can be done with this innovation.

Consider this: the patent was awarded to a group that could not have invested
more than a few thousand dollars of incremental time and resources (in fact,
probably the majority of the costs were in the patent application and process
itself). And yet the license is worth billions.

Patents were created - and the US Patent Charter still states this - to
encourage and enhance the economic stature of the nation. Instead we use
patents to throttle it. Imagine if this patent were only good for a few years,
or capped at a license fee commensurate with the incremental investment needed
to produce and validate the research (even if this fee were 3x, 5x, 10x, etc.
of the costs). Everyone could contribute to the work and the pace of
innovation would accelerate. Instead we've got a couple of universities (and,
inevitably, follow-on corporate licensors) locking it down for all but the
publicly funded and litigious.

There is so much opportunity out there, there are so many brilliant minds,
eager innovators, and great startups. Why do we shoot ourselves in the foot
with patent nonsense that hasn't been significantly rethought since (in the
US) its 18th-century English law origins?

~~~
eggie
I would not stress about this. Patenting works of nature seems increasingly
tenuous even in this broken patent environment. Will we really be able to
place patents on ubiquitous biological systems that are 500 million years old
and present in billions of copies even in our own guts?

------
Ono-Sendai
3 billion base pairs * 2 bits/pair = 750 MB of data. So you could put it in
GPU memory and do some kind of brute-force search on the GPU.
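
A minimal sketch of the packing that arithmetic assumes (32 bases per 64-bit
word; my own illustration):

    #include <cstdint>
    #include <string>
    #include <vector>

    // Pack A/C/G/T into 2 bits each; anything else is treated as 'T' here,
    // a real version would handle N/ambiguity codes separately.
    std::vector<uint64_t> pack(const std::string& seq) {
        auto code = [](char c) -> uint64_t {
            switch (c) { case 'A': return 0; case 'C': return 1;
                         case 'G': return 2; default:  return 3; }
        };
        std::vector<uint64_t> words((seq.size() + 31) / 32, 0);
        for (size_t i = 0; i < seq.size(); ++i)
            words[i / 32] |= code(seq[i]) << (2 * (i % 32));
        return words;
    }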

~~~
joshma
Yep, GPUs were the first thing we looked at. :) (And to be fair, our current
approach is pretty close to a brute-force search!)

The main reasons we decided against a GPU-based approach were cost and
scalability:

\- We support a dozen+ reference genomes (e.g. for different species), and
plan to support a lot more (including eventually supporting custom genomes
that users provide). Assuming we want to support a few concurrent searches
against the same genome, we'd need a few GPUs per genome, and this gets
expensive pretty quickly on AWS.

\- Our fleet becomes non-homogeneous: if machine X fails, we need to restore
machine X' with the same set of genomes.

\- If certain genomes are more popular than others, we'll likely have GPUs
spun up that aren't being used much (only one lab might be investigating a
certain genome, for example). I suppose you could swap genomes in and out of
memory as they're accessed, but again it's more complex to manage resources.

\- Our current approach allows us to add genomes ad-hoc - hypothetically, you
could point us to your own genome on S3 and we'd be able to work with it.

We hint at it towards the end, but we're actually switching to AWS Lambda soon
- based on early calculations, it could cost us as little as $50 a month to
run everything!

------
TheLoneWolfling
Three ideas:

First, try the bitap algorithm.

Second, you can encode the search as a DFA - look up the Aho–Corasick
algorithm. Then just run the DFA over the genome. It means that you don't need
to match every string at every position. If you've read AAAAA and your string
starts with CCCCC with an edit distance of 4, you can skip ahead for a while
before you need to start reading again.

Third, you could build a suffix tree (O(n) preprocessing), and then use the
standard fuzzy string matching algorithm using suffix trees on it.
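
A minimal Shift-Or (bitap) sketch for the first idea - exact matching only;
the Wu-Manber extension keeps one state word per allowed error to handle
mismatches:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <string>

    // Report every position where `pat` (<= 64 chars) matches `text` exactly.
    void shift_or(const std::string& text, const std::string& pat) {
        uint64_t mask[256];
        memset(mask, 0xFF, sizeof(mask));          // 1 = "mismatch" everywhere
        for (size_t i = 0; i < pat.size(); ++i)
            mask[(unsigned char)pat[i]] &= ~(1ULL << i);
        uint64_t state = ~0ULL;
        const uint64_t accept = 1ULL << (pat.size() - 1);
        for (size_t i = 0; i < text.size(); ++i) {
            // bit j of state is 0 iff pat[0..j] matches the text ending here
            state = (state << 1) | mask[(unsigned char)text[i]];
            if ((state & accept) == 0)
                printf("match ending at %zu\n", i);
        }
    }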

------
pbnjay
Why did BLAST not work for you? In the comments here you keep mentioning
memory, but we've never had that be the bottleneck (and we use BLAST for
things much larger than the human genome).

I'm really concerned that the team behind a bioinformatics tool is talking
about searching sequences without even a mention of BLAST. It should have been
solution #1!

------
dluan
This looks awesome! Just watched the brief intro video - how do you guys
calculate the off-target effects?

~~~
vineetg
(Author of the blog post here)

We use the scoring function published by Hsu et al[1], which most scientists
seem to be using. This function takes into account both the number of
mismatches and where they occur in the guide. There's a more readable version
here: [http://crispr.mit.edu/about](http://crispr.mit.edu/about) .

[1]
[http://www.nature.com/nbt/journal/v31/n9/full/nbt.2647.html](http://www.nature.com/nbt/journal/v31/n9/full/nbt.2647.html)

~~~
danieltillett
Interesting that indels are not considered. We found with RNA/DNA
hybridisation that probes with indels bind with high efficiency [1].

[1]
[http://www.ncbi.nlm.nih.gov/pubmed/20649647](http://www.ncbi.nlm.nih.gov/pubmed/20649647)

------
gopalv
Nice. A bitset & Rabin-Karp search.

Both winning strategies, but I suspect you can push it much further and scan
much faster if you're hitting memory bandwidth (ACGT = 2-bit space).
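
For instance (my own sketch, nothing to do with their actual code): keep a
rolling 40-bit window of the last 20 bases, so each new base costs two shifts
and a compare - or feed the window into a rolling hash, Rabin-Karp style, to
handle many queries at once:

    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Slide a 20-base (40-bit) window over `text`; exact match against one
    // query packed into the low 40 bits of `query40`.
    void scan(const std::string& text, uint64_t query40) {
        const uint64_t window_mask = (1ULL << 40) - 1;
        auto code = [](char c) -> uint64_t {
            switch (c) { case 'A': return 0; case 'C': return 1;
                         case 'G': return 2; default:  return 3; }
        };
        uint64_t window = 0;
        for (size_t i = 0; i < text.size(); ++i) {
            window = ((window << 2) | code(text[i])) & window_mask;
            if (i >= 19 && window == query40)
                printf("exact hit ending at %zu\n", i);
        }
    }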

Of all the big-data problems I see, nothing feels as close to being a personal
problem as genetic data.

------
nimish
36GB is totally fine to store in memory, but then it wouldn't have led to this
seriously awesome article on algorithm design.

~~~
joshma
Haha, Vineet appreciates the compliment, I think...

I mentioned it in a comment below, but our constraints are a bit different
once you start supporting multiple genomes - we could've been clearer about
that in the post for sure.

------
fradg
Nice post. What tools do you use to understand which line of the code is
creating a bottleneck? For instance, the "if statement has too many false
positives" in Solution #4. Do you locate these bottlenecks simply by manual
inspection of the code?

~~~
vineetg
A lot of these are actually pretty easy to spot with tools like gcov. To
determine false positives, we can just look at how many times each "if"
statement was hit, and compare those to the final count.
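
The workflow is roughly (file names here are just placeholders):

    g++ --coverage -O0 search.cpp -o search
    ./search genome.2bit guides.txt
    gcov search.cpp    # writes search.cpp.gcov, with a hit count per line

Comparing the per-line hit counts against the final match count tells you how
often each filter fires spuriously.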

------
TheLoneWolfling
I don't suppose anyone has a sample set of data that I could use to play
around with various techniques for this?

I'd randomly generate it, but I don't know what the statistics should be - and
that makes a huge difference for branch prediction / etc.

------
ticking
Have you thought about using the Hamming distance, instead of the array?

It should give you the same answer in ~3 CPU instructions on registers instead
of 2 array lookups and arithmetic in RAM: an XOR, a Hamming weight (popcount),
and an equality check.
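
Roughly like this (my sketch, assuming the post's 2-bit packing - note the
extra fold so a base that differs in both of its bits isn't counted twice):

    #include <cstdint>

    // Count mismatched bases between two 2-bit-packed 20-mers (low 40 bits).
    int base_mismatches(uint64_t a, uint64_t b) {
        uint64_t x = a ^ b;                               // differing bits
        uint64_t per_base = (x | (x >> 1)) & 0x5555555555555555ULL;
        return __builtin_popcountll(per_base);            // 1 bit per mismatched base
    }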

~~~
vineetg
We actually do this in our "expensive" check. The reason we use the 2 arrays
is that with 2 array lookups we can check for matches across all 200 guides. I
probably could've made that clearer in the post - the array contains matches
for ALL of the guides.

------
daemonk
PAM sites? I am working on a genome assembly now and I want to try to identify
potential gRNA sites around genic regions. This post is pretty helpful.

~~~
vineetg
I left PAM sites out of the blog post (it's actually mentioned briefly in the
footnotes), as it made the problem slightly more complicated.

The final algorithm actually keeps track of the last 20 bases + PAM length,
and checks both the edit distance and PAM before deciding if something is a
match. The Benchling CRISPR tool will do this for you :)
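
Very roughly - this isn't our actual code, and it assumes the canonical
SpCas9 NGG PAM plus a packing where the newest base sits in the low 2 bits
(G = 2):

    #include <cstdint>

    // Check for an NGG PAM in the 3 bases that follow the guide, i.e. the
    // low 6 bits of the rolling 2-bit window (the N matches anything).
    bool has_ngg_pam(uint64_t window) {
        return ((window >> 2) & 0x3) == 2   // middle PAM base is G
            &&  (window       & 0x3) == 2;  // final PAM base is G
    }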

~~~
daemonk
Couldn't you use one of the available short-read aligners (BWA, Bowtie, ...)
for this too? Most of the aligners use some kind of FM-index for indexing the
genome.

~~~
vineetg
We messed around with bowtie - it seems like most of these are optimized for
the number of alignments being small (i.e. close to 1). Unfortunately, the
number of matches for a 20 base guide on the human genome is closer to 1000.

------
keithwhor
Nice work. Great to see more bitstrings being used in genomics. Did I give you
guys some ideas? ;)

------
abecedarius
I'd have tried agrep on the bit-packed sequence.
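
Something like this, at least on the plain-text sequence - you'd have to
strip the FASTA newlines first, since agrep matches per record (the pattern
is just an example 20-mer; agrep caps the error count at 8):

    agrep -3 ACGTACGTACGTACGTACGT genome_oneline.txt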

------
imaginenore
Perfect problem for a GPU. Even a naive solution should be crazy fast.

