
Building a Regex Search Engine for DNA - joshma
http://benchling.engineering/dna-regex-search/?hn
======
lambda
I don't know if there was any strict requirement to use Python, but you
mention the difficulty in finding a library to parse regular expressions.

One of the many great features of the Rust regex crate
([https://doc.rust-lang.org/regex/regex/index.html](https://doc.rust-lang.org/regex/regex/index.html))
is that there's actually a regex_syntax crate that provides this parsing
support ([https://doc.rust-lang.org/regex/regex_syntax/index.html](https://doc.rust-lang.org/regex/regex_syntax/index.html)),
which can be used for purposes like this - building alternative kinds of
matching engines.

Rust's regex library also has pretty good DFA and NFA matching engines, plus
an Aho-Corasick engine for parts of a regex that decompose into a set of
simple strings to match, as well as fairly well-optimized exact substring
search to speed up expressions that contain a substring that must match
exactly. That makes it fairly fast for matches like your "TAG(A|C|T|G)GG"
example.
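For the curious, that decomposition is easy to sketch in Python - this is
just an illustration of the idea, not the crate's actual API:

```python
from itertools import product

def expand_alternation(parts):
    """Expand a pattern given as a list of per-position choice lists
    into every exact string it can match."""
    return ["".join(p) for p in product(*parts)]

# TAG(A|C|T|G)GG as a list of per-position choices
pattern = [["TAG"], ["A", "C", "T", "G"], ["GG"]]
literals = expand_alternation(pattern)
# -> ['TAGAGG', 'TAGCGG', 'TAGTGG', 'TAGGGG']
# An Aho-Corasick-style engine can now search for this small set of
# exact strings instead of running a general regex engine.
```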

Another, even more radical, change would be to build a finite state transducer
of your data set, which you can then do regex matching on directly by taking
the intersection of an automaton representing the regex and the FST
representing your data (which is itself an automaton). I learned about FSTs
from the author of the Rust regex crate:
[http://blog.burntsushi.net/transducers/](http://blog.burntsushi.net/transducers/).
An FST is actually used internally for certain types of searches in
Elasticsearch/Lucene
([https://www.elastic.co/guide/en/elasticsearch/reference/curr...](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html)).
I don't know whether these would be suited to this problem as-is, by just
building an FST from your set of sequences, or maybe by using the
overlapping-subsequences idea to build FSTs from a larger number of shorter
strings, but it would be an interesting project to explore.
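As a toy illustration of the intersection idea - with a plain dict-based trie
standing in for a real FST, and a per-position character-set list standing in
for the regex automaton (all names here are made up):

```python
def build_trie(words):
    """A dict-of-dicts trie over the sequences; an FST plays a similar
    role but shares suffixes and compresses far better."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def search(trie, pattern, prefix=""):
    """Walk the trie and the pattern 'automaton' in lockstep: only
    branches whose next character the pattern allows survive, so whole
    subtrees of the data set are pruned at once."""
    if not pattern:
        return [prefix] if "$" in trie else []
    allowed, rest = pattern[0], pattern[1:]
    hits = []
    for ch, child in trie.items():
        if ch != "$" and ch in allowed:
            hits += search(child, rest, prefix + ch)
    return hits

seqs = ["TAGAGG", "TAGCGG", "TACCGG", "GGGTAC"]
trie = build_trie(seqs)
# TAG(A|C|T|G)GG as per-position allowed-character sets
pattern = ["T", "A", "G", "ACTG", "G", "G"]
matches = search(trie, pattern)
# -> ['TAGAGG', 'TAGCGG'] (in some order)
```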

~~~
joshma
Insanely cool, thanks for sharing - this would be interesting as a
replacement for our final filter (if I'm interpreting correctly, a trigram
index would still speed up the initial search).

And yeah, Python is a strict requirement since the rest of our code is in
Python - figured it wasn't worth throwing in another language when a rough
parser worked.

~~~
burntsushi
> (if I'm interpreting correctly, a trigram index would still speed up the
> initial search)

Your trigram approach is roughly isomorphic to the FST approach. The main
benefit of an FST is that it can store a huge number of keys in a very small
amount of space that is still searchable without a separate decompression
phase. In the case of trigrams with a small alphabet, an FST probably isn't
a big win. But if, say, you wanted to move to a bigger N-gram for some
reason, then perhaps an FST might help you out. It's true that Rust's fst
library has a way to apply regexes to the FST for very fast search (Lucene
also has this), but its subjects are terms in the FST, which in your case
corresponds to trigrams. This means you still have to go through the "extract
a set of trigrams from a regex to pre-filter on" dance.
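That dance might look something like this in Python - a minimal sketch,
assuming you've already expanded the regex into its literal alternatives:

```python
def trigrams(s):
    """All overlapping 3-character windows of a string."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def prefilter_trigrams(literals):
    """Trigrams shared by every literal the regex can match. Any
    sequence matching the regex must contain all of these, so they can
    be ANDed together in the inverted-index query. (Trigrams present
    in only some alternatives would have to be ORed per-literal.)"""
    common = trigrams(literals[0])
    for lit in literals[1:]:
        common &= trigrams(lit)
    return common

# The four literals TAG(A|C|T|G)GG can match
literals = ["TAGAGG", "TAGCGG", "TAGTGG", "TAGGGG"]
required = prefilter_trigrams(literals)
# -> {'TAG'}: the only trigram common to all four alternatives
```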

And yeah, using the regex-syntax crate from Python would have been a fair
amount of work. In particular, it still needs a C API before you can use it
from Python, which would probably be a fair amount of grunt work.

In any case, as a computational biologist in a former life, this is very cool
work. Nice job! :-)

------
daemonk
Bioinformatician here with a couple of years of bench experience. Nice
writeup, but what would be the use case here? I would imagine something like
automatically annotating motifs/restriction sites and then searching for
those would be more useful than searching for near-exact sequences?

For example, finding all plasmids with a certain primer sequence, specific
restriction sites, promoter motif...

~~~
theophrastus
Protein Structural Chemist (i guess) here and I'd like an even higher level
'regular expression' which would allow one to search DNA sequences likely to
code for protein domains with likely secondary structure, (something like a
string to search for "Secstruc" within "Exon Structure" [1]). The "likely"
will always be a problem in these sorts of attempts; any genomics regular
expressions should allow for the option of the statistical best hit.

[1]
[http://www.rcsb.org/pdb/protein/P00918](http://www.rcsb.org/pdb/protein/P00918)

~~~
kusmi
Have you looked at T-coffee? They have this new feature called expresso which
might do what you want.

------
tmaly
I was going to say there was a faster way than regex, but I saw you were not
actually using regex to parse in the way I first thought.

I work in a different domain, and I find custom state machines to be the
preferred way to do matching over large sets of data. Of course, things like
compressed bitmap indices are also helpful.

~~~
joshma
Yeah, I'm not too happy about how we're using regex inside a Groovy script for
the final matching - when we scale this to support larger matches we'll likely
hit memory issues trying to parse the entire string at once, and I've been
playing around with the idea of streaming the bases into a state machine. Is
that similar to what you were referring to?

Do you have a way of generating state machines, or are they hand coded? Is it
a perf optimization, or does it help integrating domain-specific logic?
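To make the streaming idea concrete, here's roughly what I have in mind - a
naive multi-state sketch over the article's TAG(A|C|T|G)GG example, not
anything we actually run:

```python
# Each pattern position is the set of bases allowed there.
PATTERN = ["T", "A", "G", "ACGT", "G", "G"]

def stream_matches(bases):
    """Feed bases one at a time through a small state machine without
    ever materializing the whole sequence. A state is the number of
    pattern positions matched so far; we track every in-progress
    partial match simultaneously."""
    active = set()   # states of partial matches
    matches = []     # exclusive end indices of completed matches
    for i, base in enumerate(bases):
        nxt = set()
        for state in active | {0}:  # {0}: a new match may start here
            if base in PATTERN[state]:
                if state + 1 == len(PATTERN):
                    matches.append(i + 1)
                else:
                    nxt.add(state + 1)
        active = nxt
    return matches

# TAGAGG occupies indices 2..7, so the match ends at index 8
print(stream_matches("CCTAGAGGTT"))  # -> [8]
```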

~~~
malisper
What were some of the query features that caused you to choose Elasticsearch
over pg_trgm? pg_trgm supports regular expressions out of the box.

~~~
joshma
We use ES for the rest of Benchling search already, so that was the main
reason why we stuck with ES (ngram indexing is pretty standard there).

As for why we were using ES for the rest of search: we were using it for
things like language stemming, matching terms with "slop", things like
{'minimum_should_match': '3<67%'} (require exact matches for 3 or fewer
terms, 67% if more than that), searching _all to match anything in the
document, etc. I think a bunch of these (maybe even all) you can do in PG,
but it was
way easier to get going with ES.
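E.g., a query body using that setting might look like this (the field name
and query text here are made up, not from our actual schema):

```python
# Hypothetical ES match query illustrating '3<67%': with 3 or fewer
# terms every term must match; with more, at least 67% of them must.
query = {
    "query": {
        "match": {
            "description": {
                "query": "quick brown fox jumps over",
                "minimum_should_match": "3<67%",
            }
        }
    }
}
```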

ES is also distributed, which makes scaling up and doing maintenance a lot
easier - a bunch of times we just threw new nodes at it and removed old nodes
that had gone bad.

------
wrong_variable
Could HN help me with this?

I remember reading this really wonderful article on the Levenshtein distance
on HN 2-3 months ago.

Basically, I have difficulty understanding why the Wagner-Fischer algorithm
works - the intuition behind it. I remember it used dynamic programming, but
I really want to read that article again.
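For reference, the algorithm itself is tiny; it's the intuition for why the
recurrence is valid that I'm after:

```python
def levenshtein(a, b):
    """Wagner-Fischer: dp[i][j] is the edit distance between the first
    i chars of a and the first j chars of b. The key observation: an
    optimal edit script for the two prefixes must end in a deletion,
    an insertion, or a match/substitution, so each cell depends only
    on its three neighbors."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                          # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                          # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete a[i-1]
                           dp[i][j - 1] + 1,          # insert b[j-1]
                           dp[i - 1][j - 1] + cost)   # match/substitute
    return dp[-1][-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```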

The format was similar to Medium, with wonderful diagrams :( I tried
searching for it on HN, but searching on HN is really difficult, and it was
my own fault for not saving it to Pocket.

I would really appreciate the help since it was one of those "better
explained" gems.

~~~
alixaxel
Also curious, could it be one of these websites where they explain algorithms
visually?

Have you tried
[https://hn.algolia.com/?query=levenshtein&sort=byPopularity&...](https://hn.algolia.com/?query=levenshtein&sort=byPopularity&prefix&page=0&dateRange=pastYear&type=story)
?

~~~
wrong_variable
Oh wow! I didn't know there was a handy search engine for HN.

Basically, it explained the 1-dimensional Levenshtein distance first.

That blew my mind, since it's much easier to think about the 1d case and then
apply it to a 2d matrix.

But I remember reading only half-way through the article before I had to do
something else :( sorry - but I really want to read it again since it will
really help me out at my work!

EDIT -

FOUND IT!

Thanks a lot for that! It was a godsend!

[http://davedelong.tumblr.com/post/134367865668/edit-distance...](http://davedelong.tumblr.com/post/134367865668/edit-distance-and-edit-steps)

Here is the link - you just made my afternoon!

EDIT 2 -

Someone should write a book which explains all maths/algorithms concepts
using just links to blog posts, ordered by HN popularity - just a thought :)

------
twotwotwo
Part of the way down the post links to Russ Cox's excellent writeup of how
Google used this general sort of technique (turn a regexp into a bunch of
conditions involving three-symbol sequences, search an index for them, work
from there) to do regexp searching scalably for Code Search:
[https://swtch.com/~rsc/regexp/regexp4.html](https://swtch.com/~rsc/regexp/regexp4.html)

You can peruse his open-source (re)implementation of the approach at
[https://github.com/google/codesearch](https://github.com/google/codesearch) .
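The core of the technique is small enough to sketch in a few lines of Python
(a toy version, nothing like the real implementation):

```python
from collections import defaultdict

def trigrams(s):
    """All overlapping three-symbol windows of a string."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(docs):
    """Inverted index: trigram -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for t in trigrams(text):
            index[t].add(doc_id)
    return index

def candidates(index, literal, n_docs):
    """AND the posting lists of the literal's trigrams: any doc
    containing the literal must contain every one of its trigrams.
    Survivors still need a real substring/regex check, since they
    might contain the trigrams in the wrong order."""
    ids = set(range(n_docs))
    for t in trigrams(literal):
        ids &= index.get(t, set())
    return ids

docs = ["GATTACA", "TAGAGGCC", "CCCGGG"]
idx = build_index(docs)
hits = candidates(idx, "TAGAGG", len(docs))
# -> {1}: only docs[1] contains all of TAG, AGA, GAG, AGG
```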

------
rurban
You are still searching over chars (256 values, 8 bits) when you only need 4
values (2 bits)? You know that cache misses dominate the search, so you need
to compress the chars to 2 bits and search in L1-cache-sized blocks.
Especially the inverted index will be much faster to search if compressed.
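E.g., the packing itself, shown in Python just to illustrate the encoding
(for the speed benefit you'd of course do this in C):

```python
# 2-bit packing of DNA: 4 bases per byte instead of 1, so the same
# cache line holds 4x as many bases during a scan.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq):
    """Pack a string of A/C/G/T into bytes, 4 bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq):
        out[i // 4] |= CODE[ch] << (2 * (i % 4))
    return bytes(out)

def unpack(data, n):
    """Recover the first n bases from packed bytes."""
    return "".join(BASE[(data[i // 4] >> (2 * (i % 4))) & 3]
                   for i in range(n))

s = "TAGAGGCT"
packed = pack(s)                      # 2 bytes instead of 8
assert unpack(packed, len(s)) == s    # round-trips losslessly
```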

Furthermore, those search terms are so highly specialized that you can easily
JIT the regex.

Why use super-slow Python and not a fast language? I thought DNA matching was
a big and important enough problem that you shouldn't toy around with
dinosaurs.

[Edit: 3 -> 2 bit]

~~~
joshma
Hm, unless I'm misunderstanding, the bottleneck here isn't running regex over
all the sequences - it's in filtering out a large % of sequences before
running a regex over the remaining ones. You're storing trigrams (3-byte keys)
in an inverted index - compressing them won't get you much other than space
savings. Re: caching, we're using Elasticsearch's bitmap caching to speed
things up. If you search for something twice, the second time it'll take the
cached bitmap from the first search and use that instead.

We use fast languages when we need them. Python is just parsing a regex to
generate constraints - for DNA search, parsing a 1000-char regex is super fast
and rarely an issue.

In this case, we're already at sub 100ms searches (usually sub 50ms) so I
don't see much benefit from playing around with L1 caches and JIT-ing when
higher-level structure already gives the perf characteristics we need.

------
alixaxel
Nice article!

BTW, Go has a very complete RE2 parser:
[https://golang.org/pkg/regexp/syntax/](https://golang.org/pkg/regexp/syntax/)
- it even handles simplifications for you. I'm actually using it together
with an RDPT in namegrep.com.

And in JavaScript you also have ret.js:
[https://github.com/fent/ret.js/](https://github.com/fent/ret.js/).

------
noir_lord
Absolutely fascinating write-up - stuff like this is why I still love HN. As
a web developer who mostly writes line-of-business stuff, this kind of
programming is a long way from what I do, but it reminds me just how big the
field is.

