

String overlap detection in Python - RiderOfGiraffes
http://neil.fraser.name/news/2010/11/04/

======
glenjamin
As I understand this, the KMP function uses an optimal algorithm, while the
indexOf version uses a suboptimal algorithm but takes advantage of a native
(C) implementation found in the standard library.

It follows that the fastest approach should be to implement the KMP algorithm
natively; using Pyrex rather than C should make this fairly trivial.
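
For context, a sketch of what the indexOf version looks like (my
reconstruction of the approach in the article; names are mine). The heavy
lifting is done by str.find(), which is the native C routine in question:

    def common_overlap(text1, text2):
        # Length of the longest suffix of text1 that is a
        # prefix of text2, found via the native str.find().
        if not text1 or not text2:
            return 0
        # Truncate the longer string; at most min(len) chars can overlap.
        if len(text1) > len(text2):
            text1 = text1[-len(text2):]
        elif len(text1) < len(text2):
            text2 = text2[:len(text1)]
        # Quick check for the degenerate case.
        if text1 == text2:
            return len(text1)
        best = 0
        length = 1
        while True:
            pattern = text1[-length:]
            found = text2.find(pattern)
            if found == -1:
                return best
            length += found
            if found == 0 or text1[-length:] == text2[:length]:
                best = length
                length += 1

E.g. common_overlap("fire at will", "will smith") returns 4, the length of
the "will" overlap.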

~~~
cracki
besides Pyrex, there's also Cython (I don't know if one superseded the other
or what exactly their relationship is)

~~~
wisty
Pyrex is the original. Cython is the fork by people who wanted to add more
features.

I think their relationship is reasonable, but not reasonable enough for the
two projects to link to each other.

------
btilly
His implementation of the KMP algorithm starts by constructing a table for
the whole of one string, to use in matching the other. But the full table is
seldom needed. I wonder how much of the running time goes into that
construction, and how much of a difference constructing only part of it would
make.
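
For reference, the table in question is KMP's failure function; a minimal
sketch (names mine) of the O(m) construction whose up-front cost is being
asked about:

    def kmp_table(pattern):
        # table[i] = length of the longest proper prefix of
        # pattern[:i+1] that is also a suffix of it.
        table = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            # Fall back through shorter prefixes until one extends.
            while k > 0 and pattern[i] != pattern[k]:
                k = table[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            table[i] = k
        return table

A lazy variant could fill entries only as the matcher actually consults
them, which is the partial construction suggested here.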

------
drallison
The problem statement, "How does one detect when two strings overlap?", is
inadequate. If "overlap" means "at least one substring in common", a trivial
test for overlap would be to determine whether they have a character in
common.
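
A sketch of that trivial test under the "common substring" reading
(hypothetical helper name):

    def share_a_character(a, b):
        # Any common substring implies a common character,
        # so a set intersection suffices: O(len(a) + len(b)).
        return bool(set(a) & set(b))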

~~~
drallison
I do not understand why this comment was down-voted. When analyzing
algorithms, a clear and correct statement of the problem is necessary, and
the one made here is clearly inadequate. For example, the problem calls for
an algorithm to "detect" overlap, while the algorithms described produce a
list of the overlapped strings. Moreover, there is no definition (other than
the programs listed) of what is meant by "overlap". Consider "abc" and
"abcd". The overlapping strings might be "a", "ab", "abc", "bc", and "c"
under one definition, or simply "abc" under another.
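
To make the two readings concrete, a throwaway sketch (hypothetical helper
names; the exact output depends on which definition you pick):

    def suffix_prefix_overlaps(a, b):
        # Suffixes of a that are prefixes of b: the article's reading.
        return [a[-k:] for k in range(1, min(len(a), len(b)) + 1)
                if a[-k:] == b[:k]]

    def common_substrings(a, b):
        # All non-empty substrings the two strings share.
        subs = {a[i:j] for i in range(len(a)) for j in range(i + 1, len(a) + 1)}
        return sorted(s for s in subs if s in b)

    print(suffix_prefix_overlaps("abc", "abcd"))  # ['abc']
    print(common_substrings("abc", "abcd"))       # ['a', 'ab', 'abc', 'b', 'bc', 'c']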

------
endtime
>The IndexOf algorithm is also O(n)

The author doesn't seem to understand big O. His indexOf approach is a linear
algorithm in the average case, but not in the worst case - the author himself
demonstrates this.
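
As a hedged illustration of that worst case (my test input, not the
author's): runs of a single repeated character force the find-based loop to
re-verify longer and longer patterns, so the time roughly quadruples when
the input doubles.

    import timeit

    # common_overlap() is the sketch from the first comment above.
    for n in (500, 1000, 2000):
        text1 = "a" * n
        text2 = "a" * (n - 1) + "b"   # dodge the equality shortcut
        t = timeit.timeit(lambda: common_overlap(text1, text2), number=5)
        print(n, round(t, 4))          # grows ~quadratically, not linearly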

>Computer Science (always suspect any discipline which feels the need to add
the word 'science' to its name) is often an exercise in compromises.

This kind of pissed me off, which was probably the intent. CS is about math,
not implementation. Once you start engineering things of course you have to
take practical considerations into account (such as the classes of strings
your algorithm is likely to be fed as input), but that doesn't invalidate (or
devalue) proofs which make no claims about practical usage.

~~~
btilly
_> The IndexOf algorithm is also O(n)

The author doesn't seem to understand big O. His indexOf approach is a linear
algorithm in the average case, but not in the worst case - the author himself
demonstrates this._

I suspect that you have the surprisingly common misconception that Big-O is
about worst case scenarios. This is absolutely not true. Big-O is a notation
to describe the growth of functions. The functions themselves can describe
anything you want, including either the average OR the worst case. The usage
you complain about is therefore correct.

See <http://en.wikipedia.org/wiki/Big_O_notation> for verification.
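
For reference, the definition in question says nothing about algorithms or
worst cases; it is purely a statement about the growth of a function:

    f(n) = O(g(n))  iff  there exist c > 0 and n0 such that
    |f(n)| <= c * |g(n)|  for all n >= n0

Plug in any f you like: worst-case time, average-case time, memory, error
terms.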

 _> Computer Science (always suspect any discipline which feels the need to
add the word 'science' to its name) is often an exercise in compromises.

This kind of pissed me off, which was probably the intent. CS is about math,
not implementation._

It is actually a very unoriginal quip. I've been repeating it for years.
Furthermore, it is a quip with a lot of truth. Sciences are fields of study
that use experiments to test hypotheses and refine theories. As you point
out, CS is a lot like math, and unlike science, in that it is a field of
knowledge built in large part on our ability to construct ideas on top of
each other and prove things about them, rather than on experiment.

In short "not science" is descriptive, not pejorative.

~~~
endtime
>I suspect that you have the surprisingly common misconception that Big-O is
about worst case scenarios...

I appreciate what you're saying here, but I think this part of the Wikipedia
page is relevant:

>A description of a function in terms of big O notation usually only provides
an upper bound on the growth rate of the function. Associated with big O
notation are several related notations, using the symbols o, Ω, ω, and Θ, to
describe other kinds of bounds on asymptotic growth rates.

Upper bound implies worst case. I _was_ going to say that big Theta is average
case, but after some brief research I actually think I was mistaken on that.

~~~
plinkplonk
"Upper bound implies worst case."

No it doesn't. You can talk about upper bounds on average case running time.
The function under consideration can be _any_ function, best case, worst case,
average case, or _any other_ function.

O-notation is used to represent "asymptotic upper bound" on _any_ function.
(Omega-notation for lower bounds, Big Theta for simultaneous lower _and_ upper
bounds and so on, on _any_ function).

The upper or lower bounds of functions have nothing to do with the
"case-iness" (best/worst/average, etc.) of algorithm running times.

The "meaning" of the function (or for that matter the argument "n"), is
completely orthogonal to its bounds. O(n) by itself does not say anything
about what the function represents. The function does not have to be about
algorithms.

E.g., it could represent the error between an approximation and the actual
calculated value. You can use O notation to bound such functions, to state
upper bounds on the rate of growth of the error of an approximation.
(Gilbert Strang uses O notation this way in his book "Calculus", for
example.)
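
A concrete instance of that usage (standard Taylor-remainder notation, not
from Strang's exact text):

    sin(x) = x - x^3/6 + O(x^5)   as x -> 0

i.e. the error of the cubic approximation is bounded by c * |x|^5 near zero.
There is an upper bound here, but no algorithm and no "worst case input"
anywhere.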

As another example, Knuth applies this notation to dozens of non-algorithm-
running-time functions in Chapter 9 of Concrete Mathematics.

Even when using such notation to discuss algorithm running times, you really
have to say O(n log n) worst-case (or best-case, or average-case) running
time, for example, if you want to be clear.

Thus (for example), from Cormen et al's "Introduction to Algorithms" book,
emphasis mine (section 3.1 for anyone interested):

"The Big_Theta(n^2) bound _on the worst case running time of insertion sort_
does not imply a Big_Theta(n^2) bound on the running time of the insertion
sort on every input.... when the input is already sorted _insertion sort runs
in Big_Theta(n) time_ "

Big-Theta is applied to both best case _and_ worst case running time functions
in the above paragraph and the author clearly specifies what he is applying
the bound to in each case.
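
For anyone who wants the concrete picture behind that quote, a standard
insertion sort sketch with the two cases marked:

    def insertion_sort(a):
        for i in range(1, len(a)):
            key = a[i]
            j = i - 1
            # Already sorted input: this loop body never executes,
            # so the whole sort is Theta(n).
            # Reverse-sorted input: it executes i times for each i,
            # so the whole sort is Theta(n^2).
            while j >= 0 and a[j] > key:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = key
        return a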

In other words, btilly is right (about how O notation can be used to
represent any function, not just worst-case running-time functions) and you
are wrong (when you say "Upper bound implies worst case"). As btilly states,
this is a common misconception.

Another common abuse of the notation is to use Big-O notation when Big-Theta
is intended.

~~~
endtime
>No it doesn't. You can talk about upper bounds on average case running time.

Hmm. At first read I agreed with this, but now I don't again. A bound is a
theoretical limit. If certain inputs to a function cause it to exceed your
"bound" then your "bound" isn't a bound.

I do now agree that "big O" applies to average case - not that I ever intended
to deny that average case analysis is a thing; it was just a question of
terminology. I think the way to resolve what btilly said with the Wikipedia
quote I gave is that the "usually" in "big O notation _usually_ only provides
an upper bound" gives rise to the common misconception.

>The function under consideration can be any function, best case, worst case,
average case, or any other function.

Could you clarify this? "Average case" describes a class of inputs, not a
function. Perhaps I'm just experiencing a parsing error.

~~~
plinkplonk
">No it doesn't. You can talk about upper bounds on average case running time.

Hmm. At first read I agreed with this, but now I don't again."

You should really just read some textbooks (not being patronizing, just saying
that explaining this whole thing from scratch on HN takes too much space and
time).

Average Case bounds _are_ used in algorithm analysis.

E.g. Lemma 5.2 from Cormen et al's "Introduction to Algorithms":

"Assuming that the the candidates are presented in a random order, algorithm
Hire-Assistant has an average case total hiring cost of O(c ln n)"

Hmm that seems to be an upper bound (hence the O) on an average case. Cormen
knows something you don't. ;-)

Earlier in the chapter: "Thus we are, in effect, averaging the running time
over all possible inputs. When reporting such a running time, we will refer
to it as the average case running time".

Thus, irrespective of your "agreement", stating the bounds on an average
case [whatever] _is_ valid. Sure, you have to define what "average" means.
In the statement above, "average case" is defined as a "random order of
presentation", i.e. as a probability distribution represented by a random
variable. In other words, the distribution of the input is a part of the
definition, but the _bounds_ are still on a function.
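
A quick empirical check of that lemma (my simulation, not from the book).
The number of hires is the number of left-to-right maxima in a random
permutation, whose expectation is the harmonic number H_n ~ ln n:

    import random
    from math import log

    def hires(ranks):
        best, count = -1, 0
        for r in ranks:
            if r > best:              # better than everyone so far: hire
                best, count = r, count + 1
        return count

    n, trials = 1000, 2000
    avg = sum(hires(random.sample(range(n), n)) for _ in range(trials)) / trials
    print(avg, log(n) + 0.5772)       # both come out near 7.5 for n = 1000

So the average case hiring cost is indeed O(c ln n): an upper bound on an
average.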

Which brings me to

">The function under consideration can be any function, best case, worst case,
average case, or any other function.

Could you clarify this? "Average case" describes a class of inputs, not a
function. Perhaps I'm just experiencing a parsing error."

You are being pedantic here. In the very first sentence I said (emphasis
new): "You can talk about upper bounds on average case _running time_."

So yes, a mapping of n to average_running_time_with_n_inputs _is_ a
function, often a recurrence. The definition of average case just constrains
the n inputs somehow. It doesn't change the meaning of "bound". And as shown
above, from Cormen (and stated by btilly above), "average case running time
of operation foo is O(blah)" is perfectly valid.

Strictly speaking, the O notation on the RHS of such an "equation" shouldn't
be using the equals sign; it should be using an "element of" sign from set
theory, the terms on both sides are really sets, and so on. All this is
explained in most good texts.
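
I.e., the pedantically correct reading is set membership:

    O(g) = { f : there exist c > 0, n0 with |f(n)| <= c * |g(n)|
                 for all n >= n0 }

    f ∈ O(g)   rather than   f = O(g)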

I also suspect you learned O(n) strictly through algorithms (and not math),
and _maybe_ (just guessing here) that's why you are confused about how the
various terms map to this or that aspect of algorithms. (I quote you from
your post above: "Upper bound implies worst case. I was going to say that
big Theta is average case, .." etc.)

Again, a good book on the fundamentals is the best way to resolve this (vs.
arguing on HN). I suggest "Concrete Mathematics"; Chapter 9 explains O,
Theta, etc. with functions, no algorithms in sight.

Once you learn the underlying math, applying it to algorithmic analysis is
trivial. Going the other way might lead to fundamental errors (though it
need not, really), as demonstrated, for example, in your reply to btilly.

Now, with respect, I will exit this thread. I've given plenty of references
to the proper definition and use of O, Theta, etc. notation in both
algorithmic and non-algorithmic contexts in my posts (Strang, Knuth, Cormen,
etc.).

If you want to follow up and read those, that's fine. If not, that is fine
too. _Arguing_ over mathematical notation without first reading the texts is
(imho) a waste of everyone's time.

This thread is getting too long. Over and out.

