
Overlap Detection in strings, with code - ColinWright
http://neil.fraser.name/news/2010/11/04/?
======
archangel_one
I would have found the article much better if there were more discussion of the
"highly efficient indexOf function". It sidesteps the whole discussion of
algorithms to appeal to some magic black box with no explanation of how it
works. In this case the speed advantage might very well be because indexOf is
implemented in C, whereas his KMP implementation is all in Python. That's
still a valid reason to use the approach if it means the result is faster, but
I'd have preferred to see some more discussion of what was going on behind the
scenes.
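For the curious, the indexOf trick can be sketched in a few lines. This is a
reconstruction of the general idea, not necessarily the article's exact code:
grow a candidate overlap and let the native substring search skip past
impossible alignments.

```python
def common_overlap(text1, text2):
    """Length of the longest suffix of text1 that is a prefix of text2."""
    n = min(len(text1), len(text2))
    if n == 0:
        return 0
    text1, text2 = text1[-n:], text2[:n]  # overlap can't exceed either string
    if text1 == text2:
        return n
    best, length = 0, 1
    while True:
        pattern = text1[-length:]      # candidate overlap
        found = text2.find(pattern)    # the fast, native "indexOf"
        if found == -1:
            return best
        length += found                # jump past impossible alignments
        if found == 0 or text1[-length:] == text2[:length]:
            best = length
            length += 1
```

Most of the work happens inside str.find, which runs as optimized C in
CPython - consistent with the suspicion that the win is implementation speed
as much as algorithmic cleverness.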

~~~
tomdeakin
This kind of discussion is lacking in the article. It seems as if the OP's
goal is optimisation, and of course finding the right algorithm with the
lowest complexity is the first port of call. However, the indexOf function
seems to rely on properties of the data which may not be present (like the
forced example in the middle).

There are other algorithms besides KMP for this too. Off the top of my head,
I'm fairly sure the linear-time suffix tree algorithms could also solve it.
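For instance, KMP's own machinery can be aimed directly at the overlap
question without a suffix tree: the prefix (failure) function of
text2 + separator + text1, read at the last position, gives the longest
suffix of text1 that is a prefix of text2, in time linear in the combined
length. A sketch of mine (not from the article), assuming a separator
character that occurs in neither string:

```python
def kmp_overlap(text1, text2):
    """Longest suffix of text1 that is a prefix of text2, via KMP."""
    s = text2 + "\x00" + text1   # "\x00" assumed absent from both strings
    pi = [0] * len(s)            # pi[i]: longest proper prefix of s ending at i
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]        # fall back to the next shorter border
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi[-1]                # prefix of text2 that ends a suffix of text1
```

Note this is linear in the total input size, not in the size of the overlap.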

------
rogerbinns
A company I worked for was started by someone doing this kind of work in
biology; in particular, trying to find matches in ACTG strings when there is a
huge volume of data.

The startup was doing network compression - put an appliance on either side of
a WAN link and transparently compress the data in between. The core loop of
that is examining IP and TCP traffic and determining whether you have seen
something like it before and how much of a match you have. If there is a
match, you can just send the match details to the other side rather than the
actual data. Typical compression ratios were 20:1, ie 95% - you could stuff 20
times as much data through the link.

The standard zip algorithm has a window of 32kb (ie it looks roughly 32kb
behind to see if data has been repeated). bzip2 uses 900kb. That is usually
less than a second or two of traffic on standard WAN links! The founder's
algorithm worked over hundreds of megabytes.

(The company name was Peribit Networks - you'll find Wikipedia rather
unhelpful.)

------
rflrob
It may be worth pointing out that the OP's appeal to biology may not be quite
so relevant. Speaking as a bioinformaticist, most of the time we're interested
in matching strings with some number of mismatches allowed. In that case,
something like the Smith-Waterman algorithm is going to provide much more
interesting results. Given the choice between matching 400 basepairs exactly,
or 399 basepairs followed by one mismatch followed by 399 more basepairs, I'd
much rather take the latter, but the indexOf approach wouldn't be able to tell
me that.
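To make that concrete, here is a minimal score-only sketch of Smith-Waterman
local alignment (my own illustration, with toy scoring parameters): a single
mismatch costs one penalty instead of cutting the match short, so the
399+1+399 case scores nearly as well as a perfect match.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local-alignment score between strings a and b (score only)."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: never go below zero, ie restart the match.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

A full implementation would also do traceback to recover the alignment itself;
real tools use banded or vectorized variants of this recurrence.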

~~~
cperciva
Also, indels. Mismatches are the easy part -- in my thesis I show how, with
linear-time preprocessing, you can solve the overlap problem with a constant
proportion of mismatches in O(sqrt(overlap length)) time.

------
kilburn
The algorithms used in this article are _extremely_ bad for this use case!
Realize that the guy is just trying to find the _longest overlap_ between two
strings. That is, the longest part at the end of text1 that is repeated
exactly at the beginning of text2.

With that in mind, it is _obvious_ that there is an algorithm linear in the
size of the overlap to compute it. For instance:

    def commonOverlapSanely(text1, text2):
        right = len(text1) - 1
        left = 0
        maxleft = len(text2)
        while left < maxleft and right >= 0:
            if text1[right] != text2[left]:
                break
            right -= 1
            left += 1
        return left

Instead of using this _very simple_ algorithm, the author employs three
different approaches:

1. A "naive" version that keeps copying increasingly large portions of the
original strings and checking the whole portion each time. This obviously
makes the algorithm quadratic in the size of the overlap, thus leading to very
bad performance.

2. The KMP algorithm, which is designed to search for occurrences of one
string anywhere within another. This is a silly thing to do here, because the
resulting algorithm is linear in the size of the longest input string instead
of in the size of the actual overlap.

3. An "indexOf" monstrosity that is so wrong it is hard to even explain.
Carefully inspect the code and cry.

Now, please tell me if I missed something...

~~~
ColinWright
OK, as requested: You've missed something.

I've just run a few naive tests and all three versions on the linked page give
the same result on every test, and yours gives different results on some of
the tests. In short, your code doesn't work.

So you've missed something.

