
On bananas and string matching algorithms - cjbprime
http://www.wabbo.org/blog/2014/22aug_on_bananas.html
======
StefanKarpinski
Really interesting post and good debugging work. A couple of take-aways:

1\. This is one reason it's a good idea to use signed ints for lengths even
though they can never be negative. Signed 64-bit ints have plenty of range for
any array you're actually going to encounter. It may also be evidence that
it's a good idea for mixed signed/unsigned arithmetic to produce signed
results rather than unsigned: signed tends to be value-correct for "small"
values (less than 2^63), including negative results; unsigned sacrifices
value-correctness on all negative values to be correct for _very_ large
values, which is a less common case – here it will never happen since there
just aren't strings that large.

2\. If you're going to use a fancy algorithm like two-way search, you really
ought to have a lot of test cases, especially ones that exercise corner cases
of the algorithm. 100% coverage of all non-error code paths would be ideal.
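
One way to approach the testing suggested in point 2 is a brute-force comparison against a naive reference, over every short string on a tiny alphabet. The sketch below assumes Rust and checks the standard library's `str::find`. The helper names and the alphabet `abn` are illustrative choices, not taken from the post, and exhaustive short inputs do not by themselves guarantee full path coverage.

```rust
/// Reference implementation: try every starting offset. Quadratic, but easy to trust.
fn naive_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.len() > haystack.len() {
        return None;
    }
    // The subtraction cannot underflow because of the length check above.
    (0..=haystack.len() - needle.len()).find(|&i| &haystack[i..i + needle.len()] == needle)
}

/// All byte strings over `alphabet` with length 0..=max_len.
fn all_strings(alphabet: &[u8], max_len: usize) -> Vec<Vec<u8>> {
    let mut out: Vec<Vec<u8>> = vec![Vec::new()];
    let mut frontier: Vec<Vec<u8>> = vec![Vec::new()];
    for _ in 0..max_len {
        let mut next: Vec<Vec<u8>> = Vec::new();
        for s in &frontier {
            for &c in alphabet {
                let mut t = s.clone();
                t.push(c);
                next.push(t);
            }
        }
        out.extend(next.iter().cloned());
        frontier = next;
    }
    out
}

/// Count the (haystack, needle) pairs where `str::find` disagrees with the reference.
fn count_mismatches(max_haystack: usize, max_needle: usize) -> usize {
    let alphabet = b"abn"; // ASCII only, so byte offsets and str offsets agree
    let needles = all_strings(alphabet, max_needle);
    let mut mismatches = 0;
    for h in all_strings(alphabet, max_haystack) {
        let hs = std::str::from_utf8(&h).unwrap();
        for n in &needles {
            let ns = std::str::from_utf8(n).unwrap();
            if hs.find(ns) != naive_find(&h, n) {
                mismatches += 1;
            }
        }
    }
    mismatches
}
```

Calling `count_mismatches(5, 3)` compares a few thousand pairs, including the empty needle and needles longer than the haystack.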

~~~
wyager
> This is one reason it's a good idea to use signed ints for lengths

I disagree. Types are a form of documentation, and a constraint on proper
program behavior. Length types can't be negative; that doesn't make any sense.
If you end up with a negative length, you are doing something wrong. One
should just write correct code and use the proper length type. In C that's
`size_t`, and in Rust it's `uint`.

~~~
yongjik
The problem is that C (and many other languages) only have the notion of
"signed" vs. "unsigned" values, and nothing else. For example, if 'len' is a
variable that holds the length of something, then its range is defined by (len
>= 0). So far so good. But what is the range of the expression (len - 1)?

In a perfect world, its range is {-1, 0, 1, ...} (up to some maximum value),
but C(++) doesn't have such a notion. If we used uint, -1 would silently change
to UINT_MAX. Oops.
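
A minimal sketch of the `len - 1` problem in Rust, whose `usize` plays the role of C's `size_t`. The function names are hypothetical. `wrapping_sub` spells out the modular behaviour that C's unsigned subtraction applies silently, while `checked_sub` reports the out-of-range case instead of producing a huge index.

```rust
/// What unsigned arithmetic in C does implicitly: wrap modulo 2^N.
fn last_index_wrapping(len: usize) -> usize {
    len.wrapping_sub(1)
}

/// Same computation, but "there is no last index" becomes an explicit `None`.
fn last_index_checked(len: usize) -> Option<usize> {
    len.checked_sub(1)
}
```

With `len == 0` the wrapping version yields `usize::MAX`, and the checked version yields `None`, which the caller then has to handle.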

~~~
wyager
>But what is the range of the expression (len - 1)?

Ill-defined, which is as it should be. Naturals are not closed under
subtraction or negation, and therefore you should never negate or subtract a
length.

~~~
thaumasiotes
You subtract lengths from other lengths all the time. How far have you gone
recently? How much taller are you than your wife? Subtracting a length from
another length isn't a problem at all; a problem would be if you wanted to
subtract a length from a weight.

This

> Naturals are not closed under subtraction or negation, and therefore you
> should never negate or subtract a length.

assumes that a paramount goal in having types is to make function signatures
like the following impossible:

    
    
        public static String format(String format, Object... args);
    

Infinite-dimensional vectors (in which the first element is a String) sure
aren't closed under format -- it maps them to scalars! But somehow string
formatting is wildly popular. Or think of

    
    
        public Color(int r, int g, int b);
        public Color(int rgb);
    

Are you comfortable having a mapping from ℤ to Color, _and_ from ℤ^3 to Color?
If so, why does it bother you less than a mapping from ℤ ∩ [0, ∞) to ℤ ∩ [-1,
∞)?

Try a simple rewording of your argument: Hash tables aren't closed under
access. Therefore, you should never retrieve values from a hash table.

~~~
wyager
>You subtract lengths from other lengths all the time.

Indeed. Notice that all your examples are physical lengths. This is of a
different type than the naturals (digital lengths). We generally consider
physical lengths to be in the domain of Reals, usually defined along a
particular path or dimension. This is closed under subtraction, while naturals
are not.

> a problem would be if you wanted to subtract a length from a weight.

That is _also_ a problem, not the only problem.

>assumes that a paramount goal in having types is to make function signatures
like the following impossible: [vomit-inducing variadic signature]

I would sure like it if function signatures like that were impossible :)

> Infinite-dimensional vectors (in which the first element is a String) sure
> aren't closed under format

It's very generous to call a set of arguments to a variadic function a "vector"
at all, and they certainly aren't infinite-dimensional. They are practically
bounded by machine memory.

>it maps them to scalars

How?

>But somehow string formatting is wildly popular.

Lots of mathematically ill-defined things are wildly popular.

>Hash tables aren't closed under access.

Not sure what you mean by this.

~~~
thaumasiotes
> We generally consider physical lengths to be in the domain of Reals, usually
> defined along a particular path or dimension. This is closed under
> subtraction

This is wrong. We consider physical lengths to be nonnegative real numbers,
which are not closed under subtraction. Trying to use negative numbers for
physical lengths will get you very funny looks.

> It's very generous to call a set of arguments to variadic function a
> "vector" at all

It's not an element of a mathematical vector space, but it's a vector in the
common sense of an ordered collection of values. The other examples I gave
you, like (int, int, int), are vectors even in the algebraic sense.

> I would sure like it if function signatures like that were impossible :)

That was the signature for Java's String.format, the equivalent of C's sprintf
(which, like String.format, maps an infinite-dimensional vector to a single
string). If you really didn't understand this, you can tell that the return
type from String.format is a scalar by looking at the return type declaration,
"String". To understand the precise mapping, I can only recommend you read the
javadoc.

>> But somehow string formatting is wildly popular.

> Lots of mathematically ill-defined things are wildly popular.

I'm willing to grant this, but it's certainly not relevant to string
formatting.

>> Hash tables aren't closed under access.

> Not sure what you mean by this.

Accessing a hash table isn't guaranteed to get you another hash table. Just
like subtracting two positive numbers won't necessarily get you a positive
number.

~~~
eru
Subtracting two lengths doesn't give you a length, it gives you a difference of
lengths, which can be any real number.

So by analogy subtracting two unsigned ints should give you a signed int by
default (unless you specifically ask for the unsafe unsigned int, and then
it's on you to make sure the preconditions are met).
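
The "difference is not a length" idea can be sketched in Rust by giving subtraction a signed, wider result type. The function names are hypothetical, and the sketch assumes `usize` is at most 64 bits wide, so that `i128` holds every possible difference exactly.

```rust
/// Difference of two lengths as a signed value (can be negative).
/// Assumes `usize` is at most 64 bits wide, so the casts to `i128` are lossless.
fn signed_diff(a: usize, b: usize) -> i128 {
    a as i128 - b as i128
}

/// When only the magnitude matters, the result is a length again.
fn distance(a: usize, b: usize) -> usize {
    a.abs_diff(b)
}
```

Here `signed_diff(3, 5)` is `-2`, while `distance(3, 5)` is `2`.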

------
srean
There are many comments discussing whether length should be unsigned or
signed. There are arguments for and against it. This is a prototypical carpet
bump: you squash it in one place and it raises its head somewhere else.

I see it _not_ as a question whether lengths should be signed or unsigned but
whether subtraction, assignment etc should be polymorphic w.r.t signed and
unsigned. I think the issue here is the polymorphic variants of these binary
operators are inherently risky.

Casting gets a little tedious, but languages that do not have operator
overloading should disallow subtraction from unsigned and subtraction that
returns unsigned. You either cast it up, or if possible reorder the
expression/comparison so that the problem goes away. Even assignment can be a
problem. Ocaml can get irritating because it takes this view but I think it is
safer this way. It is very hard to be always vigilant about the unsigned
signed issue, but hopefully a compiler error will mitigate risks, not
completely, but it is better than nothing.

That leaves languages that allow operator overloading, in those cases if you
are overloading an operator you better know what you are doing, and watch out
for cases where the operator is not closed.

~~~
judk
What is the signed value of 0-UINT_MAX?

There is no way to handle this issue on machine types, unless part of the
range is reserved for overflow and checked. But performance nuts intentionally
discard safety for performance.

~~~
srean
As I suggested there should not be a '-' that returns unsigned. If you want to
cast it back to unsigned you would have to do it explicitly, and ideally a
checked cast, though I am not averse to a more verbose unchecked cast.

I am not averse to binary operators with wrap around / modulo arithmetic, but
they should look different from the usual '+', '-' etc. This only applies
to languages that do not allow overloading operators.
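
As a sketch of what this could look like, here is the `0 - UINT_MAX` case in Rust with 32-bit operands. The function names are hypothetical. The subtraction returns a wider signed type, the conversion back to unsigned is explicit and checked, and wrap-around arithmetic has its own visibly different spelling (`wrapping_sub`). For operands that already fill the widest machine type, a wider result type is needed, which is the cost judk points at.

```rust
use std::convert::TryFrom; // already in the prelude on the 2021 edition

/// Subtraction on 32-bit unsigned operands that returns a wider signed result.
fn sub_widening(a: u32, b: u32) -> i64 {
    i64::from(a) - i64::from(b)
}

/// Explicit, checked conversion back to unsigned: `None` if negative or too large.
fn to_unsigned_checked(value: i64) -> Option<u32> {
    u32::try_from(value).ok()
}

/// Modulo-2^32 subtraction, spelled differently from the ordinary `-`.
fn sub_modular(a: u32, b: u32) -> u32 {
    a.wrapping_sub(b)
}
```

`sub_widening(0, u32::MAX)` is `-4294967295`, which `to_unsigned_checked` rejects, whereas `sub_modular(0, u32::MAX)` quietly returns `1`.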

~~~
gweinberg
I agree. It seems to me that saying unsigned - unsigned is unsigned is just
asking for trouble.

------
zackmorris
I've seen some very strange things in my career. I find posts like this
delightful, because I can point to them and say "see, here is proof that even
code written with the best of intentions can still have bugs."

Programmers tend to fall into (at least) two camps: the skeptics and the
pragmatists.

Sometimes when I report a finding, programmers accuse me in one way or another
of messing something up because “that can’t possibly be failing.” Those are
the skeptics, using incredulousness almost like a shield to protect their
worldview. They tend to have an up-close/what’s right in front of them
approach to programming, are prolific and usually take a positive stance on
programming.

At other times, reporting a finding is met with resignation, almost like
“please work around it because we just don’t need this right now”. Those are
the pragmatists, taking the long view/forest for the trees approach, knowing
that programming is more than the sum of its parts but also that it’s a
miracle it even works at all. They are the daydreamers and sometimes are
perceived as negative or defeatist.

I was a pragmatist for as long as I could remember, but had a change of heart
working with others in professional settings. I saw that certain things like
databases or unix filesystems could largely be relied upon to work in a
deterministic manner, like they created a scaffolding that helps one stay
grounded in reality. They help one command a very mathematical/deliberate
demeanor, and overcome setbacks by treating bugs as something to be expected
but still tractable.

But here is one of those bugs, where the floor seemed to fall out from under
our feet. One day I mentioned that “SSL isn’t working” and about half the
office flipped out on me and the other half rolled their eyes and had me track
it down:

[https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=694667](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=694667)

The gist of it is that OpenSSL was failing when the MTU was 1496 instead of
1500, because of path MTU discovery failing and SSL thinking it was a MITM
attack and closing (at least, that is how I remember it, I am probably futzing
some details).

That was odd behavior to me, because I see SSL as something that should be a
layer above TCP and not bother with the implementation details of underlying
layers. It should operate under the assumption that there is always a man in
the middle. If you can open a stream to a server, you should be able to send
secure data over it.

Anyway, we fixed the router setting and got back to work. All told I probably
lost a day or two of productivity, because the office network had been running
just fine for years and so I discounted it for a long time until I ruled out
every other possibility. I’ve hit crazy bugs like this at least once a year,
and regret not documenting them I suppose. Usually by the time they are fixed,
people have forgotten the initial shock, but they still remember that you are
the one weirdo who always seems to break everything.

~~~
cjbprime
Yeah, I like the phrase "suspecting a compiler bug is the principle of last
resort". (Maybe there should be an "unless you're using Rust" thrown in
there.)

It's not that compilers can't have bugs. It's that you're more fallible than
the compiler, and should use that likelihood when deciding where to
investigate to fix the bug you're seeing.

~~~
Retric
I remember a long day tracking down a bug in some low level networking code
that turned out to be bad RAM on a test machine. I remember thinking "This is
probably what going crazy feels like." But, I still trusted the compiler more
than the HW.

------
sitkack
Why is this code so damn fancy? Shouldn't the fanciness be offset by proofs or
extended testing? Open loop!

~~~
userbinator
Agreed, this seems like a case of excessive complexity. I hadn't even heard of
"two-way" before, but I don't think an algorithm the paper claims to be
"remarkably simple" should require _twenty-five pages_ to describe.

"two-way" isn't all that fast either, so I don't see any good reason for its
use; in the benchmarks I could find, the simpler BMH easily beats it (and
optimised BMH variants go even faster):
[http://volnitsky.com/project/str_search/benchmark-4.5.1.png](http://volnitsky.com/project/str_search/benchmark-4.5.1.png)
[http://blog.phusion.nl/2010/12/06/efficient-substring-search...](http://blog.phusion.nl/2010/12/06/efficient-substring-searching/)
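
For reference, a sketch of Boyer-Moore-Horspool (the "BMH" above) on byte strings in Rust. The function name is hypothetical, and this version compares each window with a plain slice comparison rather than scanning right to left, so it illustrates the bad-character shift table and not a tuned implementation. It makes no claim about relative speed.

```rust
/// Boyer-Moore-Horspool substring search over bytes.
/// Returns the offset of the first match, if any.
fn horspool_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let n = haystack.len();
    let m = needle.len();
    if m == 0 {
        return Some(0);
    }
    if m > n {
        return None;
    }
    // shift[b]: how far the window may move when its last byte is `b`.
    // Bytes that do not occur in needle[..m-1] allow a full shift of `m`.
    let mut shift = [m; 256];
    for (i, &b) in needle[..m - 1].iter().enumerate() {
        shift[b as usize] = m - 1 - i;
    }
    let mut pos = 0;
    // `n - m` cannot underflow because of the length check above.
    while pos <= n - m {
        if &haystack[pos..pos + m] == needle {
            return Some(pos);
        }
        pos += shift[haystack[pos + m - 1] as usize];
    }
    None
}
```

For example, `horspool_find(b"banana", b"nana")` returns `Some(2)`.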

~~~
sitkack
It would be interesting to put a trace on `strstr` for a couple days and see
what the distribution of arguments looks like.

ToDo: Profile libc for various workloads, record arguments and code coverage

------
alayne
I try to never look at GPL/LGPL code if I'm going to implement something at
work or under another license.

Sorry for suggesting a good practice to avoid legal liability.

~~~
JadeNB
> Sorry for suggesting a good practice to avoid legal liability.

Why is this good practice? Has anyone ever been exposed to a legal liability
that could be credibly attributed to their merely _looking at_ (as opposed
to using) such code? (Although my suspicion is that the answer is 'no', this
is an honest question.)

~~~
cjbprime
I think the idea is that it's going to be difficult, having looked at the
code, to implement a version that is not similar enough to the original to
trigger a copyright "derivative work".

For example, I believe it used to be the case (just a few years ago) that
Microsoft employees were contractually prohibited from looking at GPL'd code.

But if alayne is planning on bringing that up every time someone shares a post
that might contain GPL code, alayne is quickly going to become the most
tedious person at the party, and that's why the downvotes are happening.

If your situation prevents you from studying and learning from GPL'd code,
then just don't study it, and don't brag about not studying it.

~~~
JadeNB
To be clear, I didn't mean "why might there be _fear_ about this?" (which I
can understand, though it sounds a bit hysterical to me), but rather "has this
ever actually _happened_ (outside of a lawyer's dreams / nightmares)?", as a
weaker version of the question "is avoiding this really good practice?".

P.S. Perhaps a good defence against a claim of copyright infringement would be
"you can tell that my version isn't a copy, because it correctly finds 'nana'
as a substring of 'banana'"? ;-)

~~~
cjbprime
I see. I'm not aware of any case law either.

I don't think it's good practise, and agree that it's hysterical, and think it
stems from unthinkingly believing Microsoft's anti-OSS claims during their
decade-long freakout about open source and the GPL.

I think that attempting to convince people not to participate in sharing and
studying code with the rest of the world is an evil thing that they did,
especially since the GPL was the most popular OSS license at the time.

~~~
alayne
I'm absolutely not anti-OSS. I personally think it is against the spirit of
the GPL to copy code that is GPLed, though.

I'm just saying be careful about how you do your engineering. This seemed like
an inexperienced developer and he mentioned glibc vocally along with the
comment "I'll admit that originally I copied this logic from glibc's
implementation without fully understanding it, but I've now taken a closer
look at the Two-Way paper and understand what it does." It's certainly not
good software engineering to copy an implementation like that. It is good that he
read the paper.

Now I regret even bringing it up. I don't feel that I deserve your ad hominem
comments or characterizations of my motivations.

~~~
JadeNB
> I personally think it is against the spirit of the GPL to copy code that is
> GPLed, though.

Why? I assume that you don't mean _literatim_ copying, which is obviously the
whole point of OSS, but rather copying with modification; but, even then, a
quick read suggests that
[http://www.gnu.org/copyleft/gpl.html#section5](http://www.gnu.org/copyleft/gpl.html#section5)
explicitly _permits_ this activity, as long as you meet certain conditions.
Maybe I should have understood an implicit insertion "to copy, _without
acknowledgement_ , code that is GPLed"?

