

How I Wrote an Ultra-Fast DNA Sequence Alignment Algorithm in JavaScript - jebus989
https://medium.com/@keithwhor/nbeam-how-i-wrote-an-ultra-fast-dna-sequence-alignment-algorithm-in-javascript-c199e936da

======
tstactplsignore
This is a neat JS library with some possible applications, but if you're only
matching substrings without gaps, then you're basically string searching, not
performing alignment. Alignment is an entirely different and more complicated
computational problem with decades of algoritm development (see Smith-
Waterman, Needleman-Wunsch, Burrows-Wheeler transform) and implementation
going into it, and it is far more biologically useful and practical than
substring searching. Considering this, in practice heuristic implementations
like BLAST, FASTA, and lastz are used to do most big alignment searching, and
that's not even considering multiple sequence alignment, which is another
problem into itself. In other words, it's a useful JS library and interesting
implementation, but it isn't alignment.

~~~
keithwhor
The algorithm maps every position of seqSearch to seqGenome, aggregates the
results, and returns them. It is _not_ merely a string search. (If it were,
the Booyer-Moore [1] algorithm would outperform this significantly, binary or
not, and we'd call it a day.) It is aligning seqSearch to seqGenome and
returning the best results, in order.

Alignment is _not_ restricted to "gapped" alignments only (or there wouldn't
be a need for such a distinction!).

You can modify the algorithm to include gapped alignments (by introducing gaps
in your search sequence), as it supports a "gap" character ("-"). The gap
character will just always count as "not a match."

[1]
[http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_sear...](http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm)

~~~
tstactplsignore
So, ungapped alignment is just substring searching with degeneracy, which is
fine. While that is its own computational problem and may even have the
occasional biological application, sequence alignment is defined as a gapped
alignment problem (see the Wikipedia page, for example [1], which defines gap
insertion as a critical step, and all alignment variants on the page are
gapped alignments)

I am not sure you understand what gapped alignment is- it is not the alignment
of a sequence with known gaps, but an algorithm which determines the best
placement of gaps in a query sequence to obtain the highest matching score.
This is a very different problem than the one you just described, and is
essentially "the hard part" of sequence alignment. [1]
[http://en.m.wikipedia.org/wiki/Sequence_alignment](http://en.m.wikipedia.org/wiki/Sequence_alignment)

~~~
keithwhor
Hmm, while I understand the problem of gapping is traditionally the hard part,
I'm under the impression that the argument you're putting forward is primarily
one of semantics.

"In bioinformatics, a sequence alignment is a way of arranging the sequences
of DNA, RNA, or protein to identify regions of similarity that may be a
consequence of functional, structural, or evolutionary relationships between
the sequences." (From the article you linked.)

Gapped sequence alignment is certainly far more robust (and biologically
relevant - insertions are a common error - when comparing sequences across
organisms) than ungapped, and a much harder problem, but as for the definition
of "alignment" itself, I don't _believe_ I've misnamed anything here.

If we're going to be overly pedantic about the use of the word "alignment"
that's fine, but I'm not sure it's a worthwhile debate to have. A quick search
for "ungapped sequenced alignment" returns a great deal of results on Google
[1]. So if I am mistaken, I'm certainly not the first (nor do I believe I'll
be the last.)

Furthermore, there's nothing preventing anybody from using the methods
described here from implementing an ungapped sequence alignment tool that
outperforms tools that only use string comparisons. :)

[1]
[https://www.google.com/search?q=ungapped%20sequence%20alignm...](https://www.google.com/search?q=ungapped%20sequence%20alignment).

~~~
lightlike
you shouldn't dismiss this as an argument over semantics if your understanding
of the term differs from researchers' use of the term. if you introduce "a
fast tool for XYZ" and researchers understand XYZ to mean A, where you
understand it to mean B, then the tool is not useful for researchers to
perform what they know as XYZ.

tools like BLAST are extremely sophisticated and have been under development
for decades, and I'm fairly confident they've moved past naive string
comparisons by now.

~~~
keithwhor
Fair. Though I'm not convinced "ungapped sequence alignment" is particularly
confusing to a researcher, considering there are tools and papers that have
existed for decades using this description [1][2][3]. Though the algorithm
described in my article is extremely focused on raw performance (and
relatively naive with scoring), I would still choose to categorize it as
primarily a tool that deals with ungapped sequence alignment, specifically
supporting IUPAC degenerate nucleotide sequences. Thus, I believe the initial
argument is, indeed, overly pedantic.

And to be clear, nowhere am I comparing what I've developed to BLAST. (They
have very different applications.)

[1]
[http://schneider.ncifcrf.gov/paper/malign/](http://schneider.ncifcrf.gov/paper/malign/)

[2]
[http://www.ncbi.nlm.nih.gov/pubmed/9697204](http://www.ncbi.nlm.nih.gov/pubmed/9697204)

[3]
[http://www.ncbi.nlm.nih.gov/pubmed/15130540](http://www.ncbi.nlm.nih.gov/pubmed/15130540)

------
xaa
My colleague (an editor at Bioinformatics journal) and I were joking the other
day about how every other paper title, especially about aligners, seems to
include the word "fast" in it. This takes it to a new level.

As a learning exercise, this is interesting and fine. I am trying very hard to
suppress the inner "reviewer" right now. Walking away...not comparing this to
existing algorithms which are implemented in highly optimized C/C++, or CUDA,
or even hardware. About why other authors would go to such extraordinary
lengths if high-level languages are suitable. Not going to ask how conclusions
can be drawn about the suitability of Javascript for very computationally-
intensive tasks without solving the actual alignment problem rather than a
subset, let alone comparing to existing tools. Not going to ask about the
application of the tool. It's not a paper. I'm breathing. OK.

~~~
keithwhor
This was certainly not intended as an academic submission, merely a
description of a method used. You can breathe. :)

The initial point of the tool / algorithm was to find _all_ potential binding
sites for a DNA-binding domain of a protein in 100kbp - 1Mbp genome. (Even
those with sequence identity ~50% or less.) This is provided you have a
consensus sequence that contains ambiguous nucleotides. (For example, roughly
discerned from a sequence logo.) It quickly turned into development of a
general bioinformatics library in JavaScript, and a chance to see how far and
fast I could push V8 at doing these sequence comparisons.

I would love (at some point) to go into significantly more detail and compare
what I've written here with existing tools. If you're willing to offer
mentorship or guidance (or know somebody who would be), it would be fantastic
to present the information in a more thoroughly peer-reviewed context.
Otherwise, the post (and associated library) are meant primarily as learning
tools for both biologists and developers.

~~~
xaa
Alas, alignment is not in my area of expertise, nor of the people I know
directly. I'm a heavy user, but not a developer of aligners (except a toy
one). You could consider contacting Cole Trapnell or someone well known for
making a widely used aligner and propose a collaboration, but that is a hard
cold call. On the plus side, it means I probably wouldn't be your reviewer :)
(but editors do sometimes get desperate...)

Your initial problem, if you frame it as the desire to simply enumerate all
the degenerate sequences and loci, could be solved any number of ways as other
commenters have mentioned. Probably I would reach first for a regex. But sure,
no crime in learning to implement a new algorithm while also testing the
limits of V8. Probably half of my grad school time was spent that way ;)

I think your best bet if you wanted to publish would be to find a use case for
in-browser alignment. It would be hard to answer the obvious question, "why
not server-side?" though. But who knows, people are somewhat often taking old
bioinformatics algorithms and saying "but now you can do it on your phone!!".
And they do publish.

But as for matching both speed and accuracy of state-of-the-art aligners with
Javascript, it is my considered, scholarly opinion that you have no chance in
hell. So you shouldn't present it that way. It would be unnecessary to compare
(at least for speed) if you weren't making claims about speed.

------
searine
Except this is totally reinventing the wheel.

If I want to do fast genome alignment, I'll use lastz which is already blazing
fast and written in C. If I want to just look for homology, I'll use blast.

There isn't much need for improved alignment algorithms. If it's a big job,
most bioinformatics have access to clusters. If it is a small job, who cares
about speed.

~~~
iten
To be fair, to my knowledge, LASTZ, BLAST, or BLAT don't treat IUPAC ambiguous
bases (K, Y, R, etc.) in the way the OP was looking for. That's not to blame
the tools, since they have a _very_ good reason not to (they build an index on
the target first, and treating ambiguous bases properly would increase the
size of the index).

That said, I wonder if grep wouldn't be much faster, since this program is
only looking for exact matches, which are easily transformed into a regex by
replacing the ambiguous nucleotides with something like (A|T|C).

------
discardorama
I liked the exposition, but it would have been nice if the author had compared
his implementation with something more meaty. Like Sustik-Moore[1], for
instance.

[1] [http://www.cs.utexas.edu/users/moore/publications/sustik-
moo...](http://www.cs.utexas.edu/users/moore/publications/sustik-moore.pdf)

~~~
mbreese
It would have been good to compare the JS implementation to a C
implementation. Given the algorithm, it shouldn't be too bad. I'd suspect that
the C version would work much better, but I could be surprised.

~~~
keithwhor
Funny anecdote, my first implementation of the algorithm was actually in Go
(for concurrency sake). It was an earlier version and wasn't as optimized as
the one available here, but I found it actually performed 10% _slower_ at
matching nucleotides as compared the same algorithm running in V8. At the
time, I attributed it to V8 "writing" better-optimized instructions than I
possibly could.

------
AdamTReineke
I'm not familiar with the ASM.js spec[0] to know how hard it would be, but I
am curious to know if there would be a perf improvement if you could get it to
run in that mode. Nice write-up!

[0] [http://asmjs.org/spec/latest/](http://asmjs.org/spec/latest/)

------
Ultimatt
Not sure I get why the author of the article didn't just use the well known
Bitap algorithm?
[http://en.wikipedia.org/wiki/Bitap_algorithm](http://en.wikipedia.org/wiki/Bitap_algorithm)

------
hashtree
With io.js coming on to the scene, I've recently revisited my thoughts on
using JavaScript for scientific/statistical computing. With ES6, V8 Turbofan
coming down the pipe, and enough support for serious functional programming
(e.g. fantasy land, transducers, ramda, etc), I've decided to shift from using
Scala/Haskell/R to JavaScript. I think there is a bright future in these areas
and a lot of room for devs to make a name for themselves doing this kind of
development. R is particularly easy to beat on the performance front, from the
libs I have rewritten to JS to date.

~~~
codygman
What have you gained going from Scala/Haskell/R to Javascript? What have you
lost?

~~~
collyw
integers for starters

------
evanpw
I thought Atwood's Law [1] was a joke, but this is seriously interesting. I
remember when Java was just a fun little toy language for making web pages,
and I'm still surprised every time someone mentions it as a serious number-
crunching platform. (I've heard that the entire Nasdaq matching engine is a
single-threaded Java application.) Is JavaScript going to end up the same way?

[1] [http://blog.codinghorror.com/the-principle-of-least-
power/](http://blog.codinghorror.com/the-principle-of-least-power/)

~~~
pimlottc
For what it's worth, the Chicago Mercantile Exchange's matching engine is
written in Java as well.

------
jerven
Super neat JS work! I have similar code for doing GC/AT counts in java. Using
longs and popcount one can do 16 nucleotide per cycle. Also what is nice is
that one can easily use a memory mapped byte buffer as long buffer to run
through the code. To go faster than that AVX needs to be used/jitted in.

What it really shows is that the FASTA format is just terrible for
computational efficiency :(

[https://gist.github.com/JervenBolleman/d1430d0549028861504c](https://gist.github.com/JervenBolleman/d1430d0549028861504c)

------
Rhapso
this is the commonly used BLAST/alignment tool:
[http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearc...](http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_PROG_DEF=blastn&BLAST_SPEC=GlobalAln&LINK_LOC=BlastHomeLink)

for comparison.

~~~
mbreese
BLAST can handle many more cases than the method listed here...

------
daemonk
Cool. I saw this the other day when you posted it at BioStar (I am Damian Kao
at BioStar). Next step is to establish some alignment metrics (some kind of
p-value).

~~~
keithwhor
There's work I've started on this, actually. It's available in the README of
the NtSeq repository[1]. Near the bottom. The probability becomes a mess (or
at least, I can't figure out the correct model) when you start including
degenerate nucleotides, and I didn't want to release a half-done statistical
analysis tool, so I axed it from the current release.

[1] [https://github.com/keithwhor/NtSeq](https://github.com/keithwhor/NtSeq)

------
tresil
Very exciting work and a fantastic write up! Looking forward to giving this a
shot.

------
yatoomy
Loving the rise of JS algs. Keep it up!

~~~
collyw
It sounds like a poor choice of language for the task to me.

