
Parallel C++ on AWS Lambda for CRISPR - binarynate
https://benchling.engineering/powering-crispr-with-aws-lambda-f22c151a1ffc
======
zbjornson
We have a similar application (parallelized C++ code operating on large files,
for bioinformatics even) and ended up down the reverse path: started on
Lambda, moved to our own RPC system. Lambda got super expensive (in part
because there was no way to reuse a worker while it was doing async work like
an S3 download), couldn't parallelize nearly enough without hitting AWS hard-
limit quotas, had significantly lower CPU perf and didn't provide a way to
cache files (like a genome in this case). Spot/preemptible instances keep our
costs down while letting us keep a few hundred servers up at a time.

~~~
cobookman
I feel like serverless is only beneficial for low volume / low traffic
workloads.

The only other perk I've found is that the serverless billing model makes it
easier to estimate costs.

~~~
reconbot
I found it lowered our server cost incredibly for a high volume, read heavy
site. It allowed us to scale in response to increased traffic (not instantly,
that would be a lie: spikes in the thousands of requests a second are not
handled well, but over a few minutes it catches up without issue) and not have
to provision and spin up servers. We were constantly over-provisioned before,
and now it's a much lower but moving margin.

~~~
cobookman
You can do that today though with K8s and auto-scaling. Not sure how well
autoscaling works on AWS, but on GCP it's a breeze.

------
niklasrde
Lambda is an exciting platform, but it does require some bending to get it to
work for certain use cases. This reminds me of some problems we solved last
year [0], which should not have been problems for a fairly straightforward
application.

[0]: [https://iplayer.engineering/evaluating-tensorflow-models-in-...](https://iplayer.engineering/evaluating-tensorflow-models-in-aws-lambda-c0e06cf23d87)

~~~
ComputerGuru
I read that article before and just reread it now - thanks for writing it up,
but I still have the same question I did the first time around. You posit:

> Because we run 200 invocations or so in parallel we’ll only need to download
> the model once and save it there

From my own reading of the Lambda docs, it seems that a _simultaneous_ request
for the same Lambda may or may not spin up a new container, i.e. while serial
requests within the 15 minute timeout will likely reuse the same instance with
the same frozen/cached tmp data, parallel requests come with no such
guarantee.

Was this your finding in practice? If so, were your CloudWatch “keep warm”
events a series of just 1 Lambda invocation ~10/15min apart, or 200
simultaneous requests serially spaced to keep the instances spun up?

~~~
niklasrde
The CloudWatch "keep warm" events are just one invocation, yes.

I do not remember having had issues with it, but honestly, I don't think I
actually have stats on that anymore.

I've just checked in S3, but it doesn't look like we have request or data
transfer metrics enabled on the model bucket. I may enable those next week to
monitor the effectiveness of our strategy better.

~~~
ComputerGuru
I imagine if the code were properly written to deal with race conditions you
wouldn’t notice any issues either way besides an increased latency for some
requests.
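To make that concrete, here is a minimal Python sketch of a download cache that tolerates cold containers and interrupted downloads. It is only an illustration, not the code from the linked article: `ensure_cached` and the `download` callback are hypothetical names, and the cache path (for example something under `/tmp`) is an assumption. A warm container finds the file and skips the download; a fresh container just pays the download latency once.

```python
import os
import tempfile


def ensure_cached(cache_path, download):
    """Return cache_path, calling download(dest_path) only if the file is missing.

    `download` is a hypothetical callback that writes the model to dest_path
    (for example, a thin wrapper around an S3 client).
    """
    if os.path.exists(cache_path):
        # Warm container: an earlier invocation already fetched the file.
        return cache_path

    cache_dir = os.path.dirname(cache_path) or "."
    os.makedirs(cache_dir, exist_ok=True)

    # Download to a unique temp file in the same directory, so a timeout or
    # crash never leaves a half-written file at cache_path.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    os.close(fd)
    try:
        download(tmp_path)
        # os.replace is atomic when source and target are on one filesystem,
        # so readers see either no file or the complete file.
        os.replace(tmp_path, cache_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return cache_path
```

If two writers ever did race on the same path, both would download and the last rename would win, which costs extra latency but never exposes a partial file.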

Good luck!

------
andrewon
What's the difference between this CRISPR search problem and DNA sequence
alignment? There has been extensive development in the latter, and it is highly
optimized. The author seems to be coming up with a solution from scratch.

[https://en.wikipedia.org/wiki/Sequence_alignment](https://en.wikipedia.org/wiki/Sequence_alignment)

~~~
vineetg
Original author here.

You're right - conceptually the CRISPR search problem and DNA sequence
alignment are related. In both, you're looking for places where two (or more)
sequences are very similar. I would say there are two major differences.

The first is in the goal of the search. Typically, alignment tools try to find
the _best_ positional alignment for two (or more) sequences. The CRISPR search
problem is to find _every possible_ match above some similarity threshold.

There are also a few constraints on the CRISPR search problem that allow us to
make this much faster than a general DNA sequence alignment tool:

1) We know that guide sizes tend to be very small (~20bp).

2) Part of the guide must match exactly (the PAM site), allowing us to
restrict our search even further.

3) We don't need to worry about insertions or deletions in our search.

Using those three constraints, we can do this search a lot faster than a more
general DNA alignment tool!
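To illustrate how those three constraints shrink the problem, here is a minimal Python sketch. It is not the implementation from the article (which is in C++), just a naive scan under stated assumptions: the function name `find_guide_sites`, the `NGG` PAM pattern placed directly after the guide, and the `max_mismatches` threshold are all illustrative choices, and only the forward strand is scanned.

```python
def find_guide_sites(genome, guide, pam="NGG", max_mismatches=3):
    """Return (position, mismatches) for every site where `guide` matches
    with at most `max_mismatches` substitutions and is followed by the PAM.

    'N' in the PAM pattern matches any base; all other PAM bases must match
    exactly. Positions are 0-based offsets into `genome`.
    """
    genome = genome.upper()
    guide = guide.upper()
    pam = pam.upper()
    glen, plen = len(guide), len(pam)
    hits = []
    for start in range(len(genome) - glen - plen + 1):
        pam_site = genome[start + glen:start + glen + plen]
        # Constraint 2: the PAM must match exactly, so most candidate
        # positions are rejected before the guide is compared at all.
        if any(p != "N" and p != b for p, b in zip(pam, pam_site)):
            continue
        # Constraints 1 and 3: the guide is short and there are no indels,
        # so a position-by-position comparison with early exit is enough.
        mismatches = 0
        for g, b in zip(guide, genome[start:start + glen]):
            if g != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


if __name__ == "__main__":
    # Toy example with a 4bp "guide" so the result is easy to check by hand.
    print(find_guide_sites("ACGTAGGTTACCTTGGACGTAAA", "ACGT", max_mismatches=1))
```

Unlike an aligner that reports the single best alignment, this returns every site under the threshold, which is the "every possible match" goal described above.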

~~~
inciampati
Honestly, you are taking a big risk designing a DNA search algorithm from
scratch. It's akin to the risk people take when they roll their own crypto.
There are aspects of this that you may not be considering, and it tends to be
best to rest on the extensive work in the field rather than assume it is a
trivial problem.

How do you deal with natural variation in the genome? Can you be sure your
gRNA doesn't target an essential locus in some percentage of people who carry
a particular allele? The data to solve this is out there (1000 Genomes for
instance).

Edit: excuse me, I appreciate that you are using a collection of whole genomes
as your target. Will this reliably scale to thousands or millions of genomes
and likely recombinations between them?

------
hawktheslayer
Articles like this are why I return to HN every day. There are 3 terms in the
title that I know well, but put them together and they are something
completely new to me. And the cost savings here are pretty remarkable.

------
w8rbt
I hope Lambda supports C++ natively someday. That would be awesome. Glad to
see Go support was added recently too.

------
akhilcacharya
Wouldn't this be easier to do in Fargate?

