
Computer scientists prove that a 40-year-old algorithm is optimal - kercker
http://newsoffice.mit.edu/2015/algorithm-genome-best-possible-0610
======
ruggeri
Am I reading this wrong? The title is wrong, right?

The paper says that if a strongly sub-quadratic solution exists then the
exponential time hypothesis (that SAT cannot be solved in subexponential time)
is invalidated.

That's very interesting, but _it's not a proof_ that no strongly sub-
quadratic solution exists for SED.

Note that the exponential time hypothesis _is strictly stronger than_ P!=NP. Even
if SAT can't be solved in poly time, that doesn't mean it can't be solved in
subexponential time. There are functions that lie between polynomial and
exponential...

Of course, the paper was careful to explain it, but the media summary...

Edit: I was interested to learn about the notion of _strongly subquadratic_;
there are O(n^2/log n) solutions to SED, but this paper is casting doubt on
solutions with time complexity O(n^(2-delta)) for any delta > 0. Another
commenter below mentions such a (mildly) subquadratic method.

~~~
anonetal
I haven't seen too many results that rely on SETH, so I just did a bit of
research. Ryan Williams (from Stanford) at least doesn't believe SETH is true
-- here is a nice talk by him on SETH:
[http://www.imsc.res.in/~vraman/exact/ryan.pdf](http://www.imsc.res.in/~vraman/exact/ryan.pdf)

~~~
Sniffnoy
Interesting to note slide 5 there -- "For many polynomial time problems,
improving the best known algorithms, even slightly, implies ¬SETH or ¬ETH."

So apparently the edit-distance result discussed here is part of a history of
similar results.

Edit: More detail on this in slides 11 and forward.

------
necessity
The Wagner-Fischer (1970) algorithm they mention (AKA Needleman-Wunsch,
Lowrance-Wagner, etc.) is quadratic in time and space. It might never get any
better than that in time - an improved version with linear space complexity
was presented by Hirschberg in 1975 - but that doesn't mean execution times
can't be improved. Today there are many variants of the same family of
algorithms with ever-improving execution times. Some adapt the algorithm to
massively parallel setups with hundreds of GPUs, others use speculative
execution, some use approximations, and so forth.
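
For reference, here's a minimal sketch of the quadratic DP being discussed (my
own illustration in Python, not code from the paper). It keeps only two rows,
which already gives linear space for the distance itself; Hirschberg's trick is
what you need if you also want to recover the alignment in linear space:

    def edit_distance(a, b):
        # Classic Wagner-Fischer dynamic program: O(len(a) * len(b)) time.
        if len(a) < len(b):
            a, b = b, a  # keep the rows as short as possible
        prev = list(range(len(b) + 1))  # row 0: distance from "" to prefixes of b
        for i, ca in enumerate(a, start=1):
            cur = [i] + [0] * len(b)
            for j, cb in enumerate(b, start=1):
                cur[j] = min(prev[j] + 1,                           # delete ca
                             cur[j - 1] + 1,                        # insert cb
                             prev[j - 1] + (0 if ca == cb else 1))  # substitute/match
            prev = cur
        return prev[-1]

    print(edit_distance("kitten", "sitting"))  # 3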

~~~
throwaway12357
> that’s disappointing, since a computer running the existing algorithm would
> take 1,000 years to exhaustively compare two human genomes.

I did some quick googling [1] and

> Our algorithm divides the problem into independent `quadrants' ...

> Our results show that our GPU implementation is up to 8x faster when
> operating on a large number of sequences.

It's still soul crushing. Why did our genome have to be that long :(

BTW, do you have numbers for setups with hundreds of GPUs?

I'm also left wondering about results using stochastic solutions. On how
accuracy and problem size relate.

[1]
[http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=tru...](http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6339593)

~~~
daveloyall
I'm confused by this. 1,000 years seems a bit steep to me.

Suppose you have two files, `bob.genome` and `mary.genome`. Let's say they are
1gb each [1].

I think I can diff two 1gb files in less than 1,000 years.

diff(1) shows "deletions, insertions, and substitutions".

Therefore, I don't believe it. Yet. What did I miss?

1. [http://stackoverflow.com/questions/8954571/how-much-memory-w...](http://stackoverflow.com/questions/8954571/how-much-memory-would-be-required-to-store-human-dna) (Rounded up because Fermi estimation [2].)

2. [https://what-if.xkcd.com/84/](https://what-if.xkcd.com/84/)

~~~
mmarx
> diff(1) shows "deletions, insertions, and substitutions".

diff(1) doesn't give you a _minimal_ set of edits to apply to go from one file
to the other, just _a_ set of edits.

~~~
darkmighty
Also, I think he's picturing two almost-equal files. In that case the average
running time should be way lower, no? (I believe the quadratic time is worst
case)

~~~
daveloyall
1,000 years? Really?

Isn't it more likely that somebody misquoted "slow, like a half an hour" as
"slow, like a THOUSAND YEARS"?

~~~
mathattack
They're quoting exponential (2^N), not quadratic (N^2) time.

 _If on some machine a quadratic-time algorithm took, say, a hundredth of a
second to process 100 elements, an exponential-time algorithm would take about
100 quintillion years._

~~~
mmarx
That is a different section of the article; the thousand years clearly
refers to the edit distance computation, which is quadratic.
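
A rough back-of-the-envelope (my assumptions, not numbers from the article):

    n = 3e9           # assumed human genome length in bases
    cells = n * n     # full quadratic DP table: ~9e18 cell updates
    rate = 1e9        # assumed cell updates per second on one core
    years = cells / rate / (3600 * 24 * 365)
    print(round(years))  # ~285 years; at 1e8 updates/s it's ~2,850

So a naive, single-core, cell-by-cell run of the quadratic algorithm lands in
the centuries-to-millennia range, which is the ballpark the article is quoting.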

------
davmre
Dick Lipton (GATech) has a blog post discussing this paper:
[https://rjlipton.wordpress.com/2015/06/01/puzzling-
evidence/](https://rjlipton.wordpress.com/2015/06/01/puzzling-evidence/).

------
mabbo
Short version: computer scientists have shown that if the Edit Distance
problem can be solved in sub-quadratic time, then SAT can be solved in sub-
exponential time, and therefore P=NP.

~~~
sanxiyn
The exponential time hypothesis implies P!=NP, but the converse need not hold. It is
possible that SAT takes subexponential but superpolynomial time.

~~~
mabbo
Ah, a very valid point!

This sort of thing is exactly what I love about Computer Science.

------
sanxiyn
The article clearly states this, but the title doesn't, so N.B.: the proof is
conditional on the exponential time hypothesis, i.e., that SAT requires exponential
time. Of course we don't know whether it does.

~~~
paulfr
What the article doesn't state clearly is that this assumes the _strong_
exponential time hypothesis: it assumes that SAT cannot be solved in time
1.9999^n -- in other words that it's impossible to do better than the brute-
force algorithm, which has complexity 2^n (up to polynomial factors). That's a
very strong assumption.
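
For reference, the usual statements of the two hypotheses (standard
formulations, paraphrased):

    ETH:  there is some epsilon > 0 such that 3-SAT cannot be solved
          in O(2^(epsilon * n)) time, where n is the number of variables.
    SETH: for every delta > 0 there is a k such that k-SAT cannot be
          solved in O((2 - delta)^n) time.

The edit-distance lower bound needs the second, stronger one.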

------
dfan
Here's the paper:
[http://arxiv.org/abs/1412.0348](http://arxiv.org/abs/1412.0348)

~~~
meteorfox
Thanks! I don't understand why the MIT News article has no links to the
sources.

~~~
sanxiyn
It actually does. The link is on the right sidebar.

------
jbapple
The authors of the paper note (though this MIT News summary omits it) that there
is a well-known algorithm for edit distance that runs in subquadratic time. In
fact, it runs in O(n^2/log^2 n) on a word RAM. It uses a method sometimes
called "Four Russians" or "shaving a log".

Of course, this does not invalidate the results; I mention it only to dispel
the notion a reader might get from this MIT News summary that no subquadratic
algorithm is likely to be found.

It's also interesting reading; just Google for "edit distance" and "Four
Russians" and you'll find many summaries.

------
dietrichepp
> But it also means that computer scientists can stop agonizing about whether
> they can do better.

This is incorrect... because a "better" algorithm might have 99.9% accuracy
but be millions of times faster.

~~~
CHY872
That's pedantic and also incorrect. The problem of whether an exact solution
can be found efficiently can be put to rest. You might not choose to use this
algorithm, but that's not really the point of algorithms research. You've just
picked a strawman to criticise the wording on.

~~~
mcherm
I don't think so. I think he is pointing out a _really_, _really_ important
thing that many overlook because of the way we have come to teach computer
science.

When evaluating different algorithms, there are lots of criteria we could use.
How often is it correct? How long does it take to run? How difficult is it for
people to read? How often does it use the letter "r" which happens to be
broken on my keyboard?

Computer science made ENORMOUS strides by picking out a specific criterion
(running time on the computers of the time) and finding a way to make it
mathematically rigorous (asymptotic performance analysis, Big-O notation, and
all of the related mechanisms that we learn in computer science classes). This
was TREMENDOUSLY valuable, and by turning the whole power of mathematical
analysis loose on the problem it formed the modern field of algorithm analysis
and completely transformed how we build computers.

But we need to remember that this is based on one particular simplification of
how computers work. It assumes a von Neumann architecture where execution steps
are the key criterion. We have extended this framework to consider things like
parallel execution... that was fruitful also. More recently, we've been
noticing that our actual machines are no longer dominated by the "steps" in
the algorithm, but most often are dominated by memory usage, so we have turned
the same formalism onto the use of memory -- but still (in my opinion) lack a
rigorous approach for analyzing the _combination_ of steps taken and memory
usage.

Even more significant is the fact that there are OTHER things we could trade
off. One of those is accuracy. Look at some of the research on probabilistic
algorithms: you will find that there are some incredible gains to be made with
losses in accuracy that are well within the bounds of what is acceptable for
most uses. Yet this isn't covered in an introductory computer science course,
so many programmers are not even aware of the option. With so much of
traditional algorithm analysis already mapped out by the past several decades
of researchers, much of the fertile ground in the near future will, I believe,
lie in investigating these other sorts of tradeoffs.
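
As one small, concrete (though not probabilistic) illustration of trading
exactness for speed on this very problem: if you only care whether the edit
distance is small, the standard thresholded/banded DP answers exactly when the
distance is at most k and gives up otherwise, doing only ~2k+1 cells of real
work per row instead of a full quadratic table. A sketch of my own, assuming a
simple Ukkonen-style band:

    def banded_edit_distance(a, b, k):
        # Exact edit distance if it is <= k, otherwise None.
        # Only cells with |i - j| <= k are evaluated; this simple version still
        # allocates whole rows for clarity.
        n, m = len(a), len(b)
        if abs(n - m) > k:
            return None
        INF = k + 1  # stands in for "known to be larger than k"
        prev = [j if j <= k else INF for j in range(m + 1)]  # DP row 0
        for i in range(1, n + 1):
            cur = [INF] * (m + 1)
            if i <= k:
                cur[0] = i
            for j in range(max(1, i - k), min(m, i + k) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[j] = min(prev[j - 1] + cost,  # match / substitute
                             prev[j] + 1,         # delete a[i-1]
                             cur[j - 1] + 1)      # insert b[j-1]
            prev = cur
        return prev[m] if prev[m] <= k else None

    print(banded_edit_distance("kitten", "sitting", 3))  # 3
    print(banded_edit_distance("kitten", "sitting", 2))  # None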

~~~
CHY872
No - there are many fruitful formalisations of parallel programming (the CLRS
chapter is pretty good) and indeed, of non-von Neumann architectures (check
out balancing, comparison networks etc etc).

This sort of work is not reliant on the von Neumann model - it quantifies the amount of
work you need to do. Coincidentally, the algorithm listed parallelises very
well; it has a good span.

Of course it is useful to avoid galactic algorithms, but the article is
_totally_ correct. It would be incorrect to read its conclusion as anything
but what it actually says. As a general principle, sure, but it shouldn't be a
criticism of this article.

------
judemelancon
The paper mentions that an algorithm published in 1980 runs in O( (n/log(n))^2 ) time, which
is better than the Wagner-Fischer algorithm mentioned in the article, even if
it's not strongly subquadratic.

~~~
Retric
O( (n/log(n))^2 ) > O(n^1.99999999999999) for sufficiently large n

~~~
judemelancon
Yes, the fact that it's > O(n^k) for all k in [1,2) is why it's not strongly
subquadratic. It's still less than O(n^2).
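
A quick way to see the comparison: n^2/log^2(n) >= n^(2-delta) exactly when
n^delta >= log^2(n), and for any fixed delta > 0 that holds for all
sufficiently large n. So the 1980 bound eventually sits above every
n^(2-delta), yet below n^2: subquadratic, but not strongly so.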

------
beefman
The edit distance between two strings L and S, where |L| >= |S|, is at most
|L|. For any number n and string W, a DFA can be built which accepts an input
IFF its edit distance to W is < n. Such a DFA can be built and run in O(|W|)
time.[1]

What's the complexity of a 'binary search' for the edit distance between L and
S using this method? The DFA construction complexity grows very fast with
respect to n, but we only need log(|L|) DFAs. And the search cut points can be
skewed to amortize the growth with respect to n (i.e. first n << |L|/2).

[1] see page 63 of
[https://scholar.google.com/scholar?cluster=99460367496861516...](https://scholar.google.com/scholar?cluster=9946036749686151606)

------
rbanffy
Well... I guess now it's up to the hardware engineers to make it faster... ;-)

------
zerokk
This is so wrong. Just because it's NP-hard (optimization problem, right?) it
doesn't mean that there aren't very-very-very good heuristic algorithms that
can make the search significantly faster than exponential in the vast majority
of the cases. Take a look at SAT solvers and see how deep the rabbit hole
really goes.

~~~
epistasis
Heuristic algorithms can never be "optimal". The article is precise and
correct. I use heuristic algorithms for all my DNA alignment, and for all my
probabilistic modeling, but I do that with the knowledge that I'm not
necessarily finding an optimal solution; I'm only finding a "pretty good"
solution, like the SAT solvers do.

------
anacleto
I love 'proofs' which rely on an unproven component.

