
Fix Boyer-Moore searcher with the Rytter correction - HenryR
https://github.com/microsoft/STL/pull/724
======
bumbledraven
What is a minimal test case (needle + haystack) for which the old code yields
a different result than the new code?

~~~
StephanTLavavej
The smallest test case appears to be a needle of "aaa" and a haystack of
"xxaaa", where VS 2019 16.5's boyer_moore_searcher will report "not found",
while the corrected code finds the needle at offset 2 in the haystack.

~~~
jameshart
Seems astonishing that nobody has run into this in the wild. How does it do on
finding ‘www’ in ‘[http://www.example.com/](http://www.example.com/), for
example?

~~~
StephanTLavavej
I conclude that usage of std::boyer_moore_searcher is relatively low, despite
this C++17 feature shipping in VS 2017 15.3 (August 2017).

"www" triggers the bug like all 3-character repeats, but Boyer-Moore manages
to find it in "[http://www.example.com/"](http://www.example.com/") despite
the damaged delta2 table, because the incorrect shift value is never
exercised. However, a haystack of
"[https://www.example.com/"](https://www.example.com/") triggers the bug,
because now the "www" is at an offset of 8.

------
oefrha
The papers:

\- Boyle, Moore (1977)
[https://www.cs.utexas.edu/users/moore/publications/fstrpos.p...](https://www.cs.utexas.edu/users/moore/publications/fstrpos.pdf)

\- Knuth, Morris, Pratt (1977)
[https://pdfs.semanticscholar.org/4479/9559a1067e06b5a6bf052f...](https://pdfs.semanticscholar.org/4479/9559a1067e06b5a6bf052f8f10637707928f.pdf)

\- Rytter (1980)
[https://doi.org/10.1137/0209037](https://doi.org/10.1137/0209037) (paywalled
without subscription, but you know where to find it _cough scihub cough_ ).

------
dang
Submitted title was "A C++17 STL bug that can be attributed to Knuth, and its
40-year-old fix", which is nice additional context, but can now go here rather
than there.

------
tacitusarc
What I really enjoy about this merge request are these comments:

> Rename _Suffix_fn to _Fx, again following the papers. Note that in the usage
> below, there is a semantic change: _Suffix_fn stored 0-based values, while
> _Fx stores 1-based values. This undoes a micro-optimization (_Suffix_fn
> avoided unnecessarily translating between the 0-based and 1-based domains),
> but with the increased usage of f in the Rytter correction, I wanted greater
> correspondence with the published algorithms in order to verify the
> implementation.

And

> Rename 1-based _Idx to 1-based _Jx. (While the code was correct, I found it
> confusing that _Idx was 0-based in other loops but 1-based in this loop.)

Naming is one of the most strangely difficult aspects of programming, but if
there is one rule it's be consistent. I really hate code that uses a variable
name to mean one thing in one place and other somewhere else. I just learned
what that meant! Why change it?! And yet this is common, and sometimes a
comment about it in code review might get a response like "the code works, why
does it matter?" This. This is why it matters. It causes confusion, which
makes code difficult to verify, which causes bugs.

~~~
ahelwer
I've implement a few string search algorithms and translating between 1- and
0- based offset functions was the source of a huge number of bugs, as you
might expect. For some reason a lot of sources present the algorithms as
1-based.

~~~
ivanbakel
>For some reason a lot of sources present the algorithms as 1-based.

Because that's a natural mathematical model. "the n-th element" being at
position n is handy and natural. What's nuts is that array indexing based on
pointer offsets has made its way into nearly all high-level languages.

~~~
derefr
One might say that computing doesn't care about ordinal numbers, only cardinal
numbers.

On an infinite tape, there is no "first byte", only "a byte placed where the
tape already was when you started" (i.e. at offset=0.)

Similarly, there is no "first byte" of a random-access memory, because the
ordering of memory is not guaranteed (i.e. some memories have bit-planes, some
memories have banks, some memories are big/little-endian, etc.) The only thing
you can say about a memory (and thereby about a RAM-word machine, or about an
array) is what exists at address 0 of it, at address 1 of it, etc. It would be
incorrect to describe address 0 as "the first byte" of memory, as this would
impose a canonical iteration order.

~~~
DeathArrow
>On an infinite tape, there is no "first byte", only "a byte placed where the
tape already was when you started" (i.e. at offset=0.)

Maybe, but human brain likes things starting at 1.

------
dvt
What I find most interesting about this bugfix is simply how terrible of a
source Wikipedia is and, at the same time, how valuable professional insight
can be (e.g. someone that knows what they're talking about). And to add insult
to injury, someone quickly edited Wikipedia (for the "street cred" I'm sure):

> The original paper contained errors for computing the table of pattern
> shifts which was corrected by Wojciech Rytter in 1980 [1]

But that's _still actually wrong_. Computing the table was not even discussed
in the "original paper" \-- see the papers linked by @oefrha in this thread.
They're two different 1977 papers!

[1] [https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-
sea...](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-
search_algorithm)

~~~
ahelwer
Yeah I wrote that Wikipedia article, as much as a single person can. Used the
textbook _Strings, Trees, and Sequences_ by Gusfield; it never made mention of
Rytter. I wouldn't say Wikipedia is a "terrible" source, though; compared to
what? Papers? Textbooks? During the course of my career I've found major
technical errors in about half of CS papers. Now the error is being documented
& fixed in the Wikipedia article.

My main takeaway was that people should avoid BM in favor of KMP as much as
possible if you're writing a string search function, because you're guaranteed
to have some bugs when implementing BM. Writing your own string search
function seems nearly on the same level as implementing your own cryptographic
functions, though.

Interestingly someone reported a bug in my BM sample code on the wikipedia
article talk page some years ago (still not yet fixed) which might be related
to this Rytter issue. And someone forked my wikipedia article code repo
yesterday, which I thought was very weird; must be related to this.

I guess coming to a further defence of wikipedia: as you read technical books
& papers (or philosophy!) you see knowledge is extremely unevenly-distributed
in society. Much of it is locked up inside these dense, expensive books (if
you're lucky!) and even more of it is sitting undigested in papers written for
people with five years of postgrad education, and _even more_ is in expert's
brains and never put to paper. These barriers stay up for a long time, decades
even. We just witnessed one such barrier falling. Wikipedia is an optimal
resting place for such liberated knowledge, and I would go so far as to say
knowledge that is not on Wikipedia is not yet liberated.

~~~
StephanTLavavej
> you're guaranteed to have some bugs when implementing BM. Writing your own
> string search function seems nearly on the same level as implementing your
> own cryptographic functions, though.

And that’s the Curse Of The Standard Library Implementers: we’re the ones that
have to implement Boyer-Moore, floating-point string conversions, etc. so
everyone else can use them :-)

~~~
battery_cowboy
Why aren't there projects amongst the standard library maintainers to make
some standard tests and maybe even APIs? That way you'd know that every
library implements each algorithm correctly.

~~~
StephanTLavavej
We actually do use libc++'s test suite, which has found several bugs, but not
this one (apparently because they don't yet have std::boyer_moore_searcher).

------
praptak
Lol, I was taught by Rytter like 20 years ago. He did much more in text
algorithms than just correcting B-M. The two books by Crochemore & Rytter are
basically _the_ books on text algorithms.

~~~
burntsushi
> the books on text algorithms.

And also "Flexible Pattern Matching in Strings" by Navarro and Raffinot.

------
skyde
This guy is a hero! I have so much respect for his work.

By i’m still wondering if this bug could have been detected by automatic
fuzzing testing ?

~~~
StephanTLavavej
I believe that fuzz testing would have found it, yes. I'm not sure if fuzz
testing would have been guided towards highly repetitive patterns, given the
relative lack of branches in the table construction code, but the incorrectly
handled patterns can be very short so fuzz testing should have stumbled upon
them (given a pattern that generated an incorrect table, finding a haystack
which doesn't work is merely a matter of having the needle occur at a specific
offset).

Currently we don't fuzz test MSVC's STL, but that is an extremely interesting
area to explore in the future. I played around with this a little while
working on std::from_chars() (and found no bugs), but it involved actually
porting the code to Linux to be compiled with American Fuzzy Lop, so I didn't
permanently add such test coverage to our repo.

~~~
marvy
It seems AFL might have been ported to Windows:

[https://github.com/ivanfratric/winafl](https://github.com/ivanfratric/winafl)

(don't know; just found this two minutes ago)

~~~
StephanTLavavej
Very interesting, thanks!

------
zvrba
> However, the published algorithm for dd' was incorrect! This was discovered
> and fixed by Rytter in 1980, which we weren't aware of until we received a
> bug report.

Fits what Knuth said: "Beware of bugs in the above code; I have only proved it
correct, not tried it."

------
karma_fountain
Nice fix! It's been a while since I've done c++ so does the following code

    
    
      const auto elapsed = steady_clock::now() - start;
      if (elapsed > 10s) {
    

actually use a units quantifier on the 10, i.e. does 10s mean 10 seconds? Is
that a library thing? Cool if so.

~~~
andijcr
that's a user defined literal, a language feature:
[https://en.cppreference.com/w/cpp/language/user_literal](https://en.cppreference.com/w/cpp/language/user_literal)

the std::chrono library reserved some suffixes (under
std::literals::chrono_literals) for time units
[https://en.cppreference.com/w/cpp/symbol_index/chrono_litera...](https://en.cppreference.com/w/cpp/symbol_index/chrono_literals)

~~~
StephanTLavavej
And this works because I have `using namespace std;` in the test, which makes
all of the Standard Library's "user"-defined literals available. So does
`using namespace std::literals;` (without dragging in other names). There's
also the extremely fine-grained `using namespace std::chrono_literals;` (no
need to type `using namespace std::literals::chrono_literals;` due to how
inline namespaces work) which drags in just the chrono duration literals.
Finally, `using namespace std::chrono;` drags in all of the chrono names and
the chrono literals.

The reason that a using-directive of some kind is necessary is that UDLs are
provided by operators, and directly spelling out the operator in order to
namespace-qualify it would be self-defeating. (Many people dislike `using
namespace std;` and I understand their point, but when one works on the STL
all day every day, this using-directive is sure nice for test code.)

My exhaustive test case for this is (demonstrating all the different ways that
chrono literals can be accessed):
[https://github.com/microsoft/STL/blob/31419650d472932dd24b9c...](https://github.com/microsoft/STL/blob/31419650d472932dd24b9c5602707b0454f61fb2/tests/std/tests/Dev11_0000000_user_defined_literals/test.cpp#L77-L122)

------
rurban
Nice, but using decade old and overly slow string search algorithms still is
questionable. There are hundreds of faster algos, the fastest and most stable
currently being epsm (2013).
[https://www.dmi.unict.it/~faro/smart/algorithms.php?algorith...](https://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=EPSM&code=epsm)

------
mcbain
A while back I filed an issue on the MSVC std::vector implementation as the
behaviour of a weird corner case differed from GCC/Clang.

The reply I got from STL himself was fantastic, pointing out that it was
[paraphrasing here] a wart in the spec, and that neither were actually
incorrect.

To my lasting regret, I didn't keep an _ahem_ offsite backup of some of those
corp emails. Hence I can't recall the exact details.

~~~
StephanTLavavej
This sounds like the vector<list<unique_ptr<T>>> fiasco, where a vector
contains elements that are movable-only (you can't copy a list<unique_ptr<T>>,
only move it), yet with throwing move constructors (as MSVC's list, set, and
other node-based containers dynamically allocate sentinel nodes when
default/move-constructed - this is a behavior difference from GCC's libstdc++
and Clang's libc++, where both dynamically-allocated and container-internal
sentinel nodes are permitted by the Standard). The doom begins when STL
containers like list (also vector itself) don't have "constrained copy
constructors" \- that is, list(const list&) always appears to be available, as
far as type traits like is_copy_constructible are concerned, even if that copy
constructor can't actually compile. The doom accelerates when vector<Elem>
needs to ask the question "are you nothrow-movable? If so, I can move you
efficiently. Otherwise, are you copyable? If so, I can fall back to copying
you during reallocation, in order to preserve the strong exception guarantee.
Otherwise, you're throwing movable-only, so I'll move you and you get the
basic guarantee". The doom is complete when list<unique_ptr<T>> correctly
reports that it isn't nothrow-move-constructible (it would throw due to
allocating a sentinel), yet incorrectly claims to be is_copy_constructible.
Therefore the vector tries to copy the list<unique_ptr<T>> and that fails to
instantiate, resulting in an epic compiler error.

This situation is a fiasco because there is no good way out of it
(constraining container copy constructors isn't possible because the Standard
permits incomplete element types for many containers, and users use them for
other containers even though they technically shouldn't, weakening EH
guarantees is potentially problematic, changing the nature of sentinel nodes
is ABI-breaking and affects iterator invalidation guarantees). It is also
highly memorable, which is why I can guess what it was :-)

~~~
mcbain
I only just noticed this reply - many apologies.

Yes, that's the one! Our fix was explicitly deleting the copy constructor. In
essence a one-liner for us, but you can't see the iceberg of an issue beneath
the surface.

Thanks once again. :)

------
skyde
Any one know why the Turbo Boyer-Moore algorithm is not more popular. Any case
when Boyer Moore would be better?

~~~
MattPalmer1086
There are literally hundreds of string search algorithms published now
(including a large number of Boyer Moore variants).

There's a great resource called SMART for those interested in exploring this,
with implementations of many of them and a simple benchmarking tool.

[https://smart-tool.github.io/smart/](https://smart-tool.github.io/smart/)

Although Boyer Moore is well known, there are generally faster algorithms
available these days.

For short patterns, SHIFTOR will tend to be a lot faster than BM. Even the
much simpler variant of BM, Horspool is usually faster than BM.

This highlights an interesting tension in search algorithm design. Often a
simpler algorithm outperforms one with theoretically higher performance, but
not always!

~~~
rurban
Fastest and best is the latest, EPSM. For all cases, given you are on x86_64
or aarch64 with sse4.2

The cited implementation has only minor problems, which I fixed in my fork.

------
kazinator
> _I find it very curious that it isn 't constantly mentioned when explaining
> Boyer-Moore_

Papers need "superseded by ..." type forward references, like IETF RFC's. Or
rather "corrected by", in this case.

Maybe important algorithms need to be summarized in RFC-like documents.

~~~
buckminster
Exactly the same problem occurs with legislation. For years this was solved by
third-parties publishing enormous annotated legal references. Nowadays the
government (in the UK anyway) publishes legislation online incorporating all
the edits made by later legislation.

Knuth's TAOCP serves as an annotated third-party reference, except that in
this case he's not a third-party. I don't have a copy to hand to see if it has
the correct version of the algorithm.

------
NCG_Mike
I remember Dr Dobbs, way back when, having an article on the search algorithm.
I think they had C code for it.

I did a version in 68K on the Mac (MPW Shell days) for a SGML editor I was
working on.

------
kzrdude
So there is potentially a simpler fix than the Rytter correction, is that also
yet to discover in published papers somewhere, or do we need to ask Knuth
himself?

~~~
StephanTLavavej
Apparently there are multiple fixes, some of which are said to be simpler. See
[https://github.com/microsoft/STL/issues/727](https://github.com/microsoft/STL/issues/727)
and the cited comments within, which quote Knuth. (I'm learning a lot by
having this PR appear on HN!)

------
fithisux
Is Dlang affected?

