
A faster way to compute edit distance does not exist - dnt404-1
http://www.bostonglobe.com/ideas/2015/08/10/computer-scientists-have-looked-for-solution-that-doesn-exist/tXO0qNRnbKrClfUPmavifK/story.html
======
a3_nm
The title of the Boston Globe article and of the HN submission is misleading.

The article does not prove that subquadratic edit distance algorithms do not
exist. It only proves this conditional on another unproven complexity
hypothesis (SETH, the Strong Exponential Time Hypothesis). In other words, it
shows that if such algorithms exist, then faster algorithms exist for another
problem (SAT for CNF formulae with certain size bounds), for which we haven't
been able to find them yet. See the original article
[http://arxiv.org/abs/1412.0348](http://arxiv.org/abs/1412.0348), whose title
is more accurate.

This is important work, of course, and it provides more evidence that faster
edit distance computation is not possible, but that has not been proven yet.

~~~
brudgers
There are really two notions of proof here [and hence truths and facts]. The
article uses proof in a scientific sense, where P!=NP is accepted as empirical
fact by the community of computer scientists in the same manner that
biologists take natural selection as empirical fact.

To put it another way, representing the world as if P!=NP is no less
reasonable than representing the world as if Newtonian physics are applicable.

~~~
a3_nm
Theoretical CS is usually not thought of as an experimental or empirical
field. When a paper in that field claims "X does not exist", this has to mean
that the nonexistence of X was proven, not that a body of empirical evidence
against the existence of X has been collected. Such arguments are instead
called "plausibility arguments", "heuristics", etc.

There is such a thing as experimental truth, but it applies to experimental
fields, and theoretical CS is usually not thought of as one of them. I still
think it is misleading to claim that the nonexistence of something has been
proven when the proof is conditional on another hypothesis.

~~~
brudgers
Computer scientists behave as if P!=NP, and Donald Knuth will send a check to
anyone who comes up with a polynomial-time general solution to 3-SAT.

Or to put it another way, P!=NP is a _de facto_ if not _de jure_ axiom of
theoretical computer science. The burden of justification is on anyone whose
conclusions are not premised on P!=NP, just as it is on a biologist who takes
issue with evolution by saying "it's only a theory."

------
im3w1l
>The fact that edit distance is only computable in quadratic time has big
implications for genomics. The human genome contains about 3 billion base
pairs. To compute the edit distance between two human genomes takes 3 billion-
squared operations (to appreciate how big that is, write a 9 followed by 18
zeroes). That’s a number of operations, explains Piotr Indyk of MIT, that is
“computationally infeasible.” Which is to say that our best computers chugging
away for a really long time still won’t generate an answer.

As was commented last time this was posted: Just because it's not possible to
do better in the _general case_ doesn't mean it isn't possible to do better in
_your case_. For all we know, human genomes may not actually be 100% random
strings, but may have some kind of structure that makes it possible to do it
faster. Even when solving the problem exactly.

~~~
new299
Human genomes are very much not random strings. In any case, when calculating
the edit distance between two human genomes, the two are likely largely
similar, and if memory serves there are suffix tree/array based algorithms for
bounded edit-distance calculations.

Moreover, I'm not sure it would ever actually be useful to compute the edit
distance between two full genomes. Edit distances and local/global alignments
are useful for short fragments (genes, reads, etc.) but not for the full
sequence.

~~~
dekhn
The challenge with comparing genomes is that you can't just do edit distance
using a naive substitution table. It has to be probabilistic and contextual
(for it to be scientifically useful), because genome sequences are actually
sampled estimates of the true genome sequence. You have to tolerate errors,
and when you do that, the exploration space gets huge.

Instead of aligning two genomes this way, people normally start with a very
well calibrated and fairly accurate reference sequence, then align to that
structure probabilistically, and "call the joint variants".

This is actually far more interesting than the raw hard problems that
computer scientists love to talk about theoretically. One of the reasons it's
so interesting is that different genomes differ by operations that are much
more complex than simple edit distance: for example, large regions can be
excised and placed, in reverse order, in completely different parts of the
genome. Trying to use a DP edit distance calculation to detect those is
futile.

[http://bcb.io/2014/10/07/joint-calling/](http://bcb.io/2014/10/07/joint-calling/)

------
jongraehl
The name "Wagner-Fischer" is new to me. I usually cite this as Levenshtein
[1965]. Even the Wagner-Fischer wp page notes that Levenshtein was first by a
long shot.
[https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorit...](https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm)
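For anyone who hasn't seen it, the algorithm both names refer to is a small
dynamic program; a minimal Python sketch of the quadratic table-filling:

    def levenshtein(s1, s2):
        # Classic O(m*n) dynamic program for edit distance
        # (unit-cost insertions, deletions, substitutions).
        m, n = len(s1), len(s2)
        prev = list(range(n + 1))  # row 0: distance from "" to s2[:j]
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                curr[j] = min(prev[j] + 1,         # deletion
                              curr[j - 1] + 1,     # insertion
                              prev[j - 1] + cost)  # substitution or match
            prev = curr
        return prev[n]

    assert levenshtein("kitten", "sitting") == 3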

~~~
markbnj
Yeah this struck me as well. I've always heard it described as Levenshtein
edit distance.

------
leni536
It's a nice theoretical result. In practice there are some quirks, though:

- I assume this is worst-case complexity. While this is useful, the average
time may be much lower for certain data (like the genome).

- I assume the theorem doesn't restrict the input strings in any way. If you
put restrictions on the input, then there could be a faster algorithm than the
general one. I don't know if this can be applied to the genome, though.

- People are interested in the constant factor too.

~~~
codeflo
Also, a fast approximation with a good error bound might be good enough for
many applications in biology.

------
nomercy400
Assuming a quadratic algorithm, and that 13 million base pairs can be solved
in 7 seconds on a GTX 275 (according to a paper on edit distance computation),
is it correct to assume that 3 billion base pairs can be solved in
(3 billion)^2 / ((13 million)^2 / 7), which is about 373,000 seconds, or
roughly 104 hours? Isn't that fast enough for a quadratic algorithm that
samples DNA to be useful?
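Spelled out (taking the 7-second GTX 275 figure at face value):

    ops_per_second = 13e6 ** 2 / 7        # throughput implied by the cited result
    seconds = 3e9 ** 2 / ops_per_second   # quadratic scaling to 3 billion base pairs
    print(seconds, seconds / 3600)        # ≈ 372,781 s, about 104 hours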

~~~
dr_zoidberg
I think the article is talking about the "full-fledged" problem, while the
computation you describe could be using a set of optimizations aimed
particularly at the base-pair edit distance problem.

For example, I could assume that there are no deletions and insertions, only
changes, and then the problem becomes embarrassingly parallel and extremely
easy to express in assembly-like operations. So I could parallelize it on a
GPU by splitting the string into fixed-length substrings and get huge gains.
The problem is, can it safely be assumed that there are no deletions and
insertions?
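To illustrate, under that (probably unsafe) no-indels assumption the problem
collapses to Hamming distance, where every position is independent; a toy
NumPy sketch standing in for the GPU version:

    import numpy as np

    def hamming(a: bytes, b: bytes) -> int:
        # Substitutions-only "edit distance": count mismatching positions.
        # Only meaningful when the strings have equal length, i.e. no indels.
        assert len(a) == len(b)
        x = np.frombuffer(a, dtype=np.uint8)
        y = np.frombuffer(b, dtype=np.uint8)
        # No data dependencies between positions, so this comparison can be
        # chunked across threads or GPU blocks.
        return int(np.count_nonzero(x != y))

    print(hamming(b"GATTACA", b"GACTACA"))  # 1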

Wikipedia has a list of some modifications that can also be used to speed up
the calculation under certain constraints:
[https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorit...](https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm#Possible_modifications)

------
sanxiyn
Previously here:
[https://news.ycombinator.com/item?id=9698785](https://news.ycombinator.com/item?id=9698785)

------
omouse
That was refreshing: a regular newspaper publishing an article about computer
science. And it does a fairly good job. I wish I could see something related
to CS, rather than to web apps and startups, in the Canadian newspapers.

------
gtrubetskoy
An article on edit distance without mentioning the person who invented it -
Levenshtein.
[https://en.wikipedia.org/wiki/Vladimir_Levenshtein](https://en.wikipedia.org/wiki/Vladimir_Levenshtein)

------
CephalopodMD
This is interesting, but according to the article it proves nothing
unconditional, since we don't know whether P == NP. We are no closer to
knowing whether a faster way to compute edit distance exists. All we know now
is that the question is tied to a class of problems that we _think_ are
probably hard. It is still possible that we might find a faster solution to
the problem.

I'd like to request a title change - "New proof that a faster way to compute
edit distance might be tied up in the P vs. NP problem"

------
nemo44x
Hence the need and desire for approximation algorithms. Often, you can give up
a tiny amount of precision in exchange for dramatically increased performance
- if you can make an assumption.

For instance, computing the cardinality of a value over a huge dataset
(distributed or otherwise) takes memory proportional to the number of distinct
values, which may approach the size of the dataset, for 100% precision.
Implementing a cardinality approximation algorithm like HyperLogLog++ lets you
get a result pretty close to accurate, but with far fewer resources. There are
many other such algorithms, and I think it is important to consider this
trade-off when it is appropriate.

Of course, it makes an assumption that the data is distributed more or less
evenly.
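Not HyperLogLog itself, but a toy k-minimum-values sketch shows the same
trade-off in a few lines: memory stays at O(k) no matter how large the stream
is, at the cost of roughly 1/sqrt(k) relative error (the k and the hash
choice here are just for illustration):

    import hashlib, heapq

    def kmv_estimate(stream, k=256):
        # Hash each item to [0, 1) and keep the k smallest distinct hashes.
        # If the k-th smallest is h, the distinct count is about (k - 1) / h.
        heap, seen = [], set()  # max-heap (negated) of the k smallest hashes
        for item in stream:
            h = int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8).digest(), "big"
            ) / 2 ** 64
            if h in seen:
                continue
            if len(heap) < k:
                heapq.heappush(heap, -h)
                seen.add(h)
            elif h < -heap[0]:
                seen.discard(-heapq.heappushpop(heap, -h))
                seen.add(h)
        if len(heap) < k:
            return len(heap)  # fewer than k distinct values: count is exact
        return int((k - 1) / -heap[0])

    # ~100,000 distinct values estimated with a few KB of state:
    print(kmv_estimate(f"user-{i % 100000}" for i in range(500000)))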

------
amelius
Now we can start looking for a more precisely defined problem that _can_ be
solved faster than in quadratic time.

Or in other words, for certain classes of inputs, we can find a faster
algorithm. We only need to find/define those classes.

------
wolfgke
Does a similar result hold if we don't allow inserts, deletions, and
substitutions (the definition of edit distance), but only inserts and
deletions?

~~~
evandijk70
How do you define an edit distance (the number of operations required to
transform one string to the other) if you don't allow substitutions?

E.g., what's the edit distance between these two strings?

AAAA BBBB

Traditionally, it would be 4, but it's unclear if you allow only inserts and
deletions.

~~~
wolfgke

      > Eg. what's the edit distance between these two strings?
      > 
      > AAAA BBBB
      > 
      > Traditionally, it would be 4, but it's unclear if you
      > allow only inserts and deletions.
    

It is >= 8, since you surely have to remove 4 As and insert 4 Bs. Since these
8 transformations indeed transform AAAA into BBBB, it is exactly 8.

It is very easy to generalize the Wagner-Fischer algorithm
([https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorit...](https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm))
to this "simpler" variant of edit distance, but this algorithm will still have
a (worst-case) runtime of O(m*n), where m and n are the lengths of the two
strings.

Of course, the question remains: Is there a (worst-case) faster algorithm?
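A minimal Python sketch of that generalization: it is the same table with the
substitution case removed (equivalently, the answer is m + n - 2*LCS(s1, s2)):

    def indel_distance(s1, s2):
        # Edit distance with insertions and deletions only, no substitutions.
        # Same O(m*n) table as Wagner-Fischer, minus the substitution move.
        m, n = len(s1), len(s2)
        prev = list(range(n + 1))
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                if s1[i - 1] == s2[j - 1]:
                    curr[j] = prev[j - 1]           # match costs nothing
                else:
                    curr[j] = 1 + min(prev[j],      # delete from s1
                                      curr[j - 1])  # insert into s1
            prev = curr
        return prev[n]

    assert indel_distance("AAAA", "BBBB") == 8  # the example above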

~~~
jongraehl
Don't know, but you still end up aligning common subsequences (advanced
monotonically in both seqs) and you still have the same order-of-computing-
cells allowed in the standard m*n dynamic program. Your restricted problem is
basically asking for the standard unix line-diff.

------
erikb
> Edit distance is useful, but also very time-consuming to calculate.

So I guess we will soon see a lot of crypto based on calculating edit
distances?

~~~
Sharlin
No. Quadratic speed is way too fast for crypto - besides, asymmetric
cryptography relies on problems that are difficult to solve but _candidate
solutions are easy to validate_! This is what the complexity class NP is
about. As far as I know there's no way to check a solution for edit distance
any faster than it takes to compute it in the first place. Also, the new proof
just shows that edit distance is quadratic _if P!=NP_.

~~~
Scea91
I get what you mean but 'easy to validate' in the context of NP means 'can be
validated in polynomial time'. It seems to me that you imply that edit
distance is harder to validate than problems in NP. Edit distance itself is in
NP since P is in NP.

~~~
Sharlin
I mean that there needs to be a fundamental difference between the difficulty
of finding a solution and validating one. There's none in case of edit
distance.

~~~
wolfgke

      > I mean that there needs to be a fundamental difference between the 
      > difficulty of finding a solution and validating one. There's none in 
      > case of edit distance.
    

If you consider the decision problem "Is the edit distance <= d?", there is
(according to this result) a fundamental difference, though only a polynomial
one.

To see this, consider the following argument:

Clearly the edit distance between strings of lengths m and n is bounded by
m+n, so d <= m+n. Thus if we found a way to transform the string s1 into s2
using d steps, we can write those steps down as a certificate of length
O(m+n) (worst case), which can be verified in linear time.

On the other hand (according to this result), you need Omega(m*n) steps (worst
case) to find such a certificate.
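A sketch of what such a certificate might look like (the script format is made
up for illustration; a naive applier is used for clarity, but the certificate
has length O(m+n) and can in principle be checked in linear time):

    def apply_script(s, script):
        # Replay an edit script (the certificate) on s; checking that the
        # result equals the target and that len(script) <= d validates the
        # claim "edit distance <= d" without searching the O(m*n) table.
        out = list(s)
        for op, pos, ch in script:
            if op == "sub":
                out[pos] = ch
            elif op == "ins":
                out.insert(pos, ch)
            elif op == "del":
                out.pop(pos)
        return "".join(out)

    script = [("sub", 0, "s"), ("sub", 4, "i"), ("ins", 6, "g")]
    assert apply_script("kitten", script) == "sitting"
    assert len(script) <= 3  # the claimed distance d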

------
larsga
This is misleading twice over. First, as a3_nm notes, this result depends on
an unproven conjecture, so the impossibility isn't actually proven yet.

Secondly, it _is_ possible to do better than the Wagner-Fischer algorithm in
certain cases. I figured this out myself and was about to write a paper on it
when I realized that other researchers had beat me to it by a couple of
decades.

If the two strings are equal, you can establish that the edit distance is 0
in linear time, obviously. If you consider how the algorithm fills out the
grid along the diagonal, you'll see that if there's only a one-letter
difference, you only actually need to fill out small parts of the grid around
that difference.

This is generalizable, so the running time actually depends on the distance
between the strings.
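That observation is usually credited to Ukkonen; a Python sketch of the banded
version, which fills only cells within a band of the diagonal and so runs in
roughly O(d*n) time when the true distance d fits in the band:

    def banded_distance(s1, s2, band):
        # Correct whenever the true edit distance is <= band; cells outside
        # the band can never lie on an optimal path of that cost.
        m, n = len(s1), len(s2)
        if abs(m - n) > band:
            return None  # distance certainly exceeds the band
        INF = float("inf")
        prev = [j if j <= band else INF for j in range(n + 1)]
        for i in range(1, m + 1):
            curr = [INF] * (n + 1)
            if i <= band:
                curr[0] = i
            for j in range(max(1, i - band), min(n, i + band) + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                curr[j] = min(prev[j] + 1,          # deletion
                              curr[j - 1] + 1,      # insertion
                              prev[j - 1] + cost)   # substitution or match
            prev = curr
        return prev[n] if prev[n] <= band else None

    assert banded_distance("kitten", "sitting", band=3) == 3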

~~~
podgib
In practice, yes, the computation time required to compute edit distance will
depend on the actual data. Similarly, 'sorting' a list that is already sorted
can be done in linear time, and near-sorted lists can be sorted much faster
than randomly ordered ones.

This is certainly useful in practice, but it doesn't affect whether the worst-
or average-case complexity is quadratic. I'd like to see a quadratic (or
exponential, etc.) time problem that couldn't be solved faster in many cases.

