

Railgun: a fast strstr(3)-like function - silentbicycle
http://www.sanmayce.com/Railgun/index.html

======
StefanKarpinski
Ugh. Licensed under "Code Project Open License":

[http://www.codeproject.com/info/cpol10.aspx](http://www.codeproject.com/info/cpol10.aspx)

Good luck figuring out what this is legal to use with.

~~~
duskwuff
There are several insane clauses to this license, but the worst is probably
§5f, which states that:

    
    
        You agree not to use the Work for illegal, immoral or improper purposes,
        or on pages containing illegal, immoral or improper material.
    

Good luck figuring out what that even means.

------
scaramanga
It seems difficult to verify any of the claims made. Are there comparisons to
other algorithms? Any analysis? A description of the algorithm?

Maybe they're there and I missed them because the website made my eyes bleed.

~~~
ye
He has a ton of benchmarks on that page:

    
    
        Searching for Pattern('an',2bytes) into String(206908949bytes) line-by-line ...
    
        strstr_Microsoft_hits/strstr_Microsoft_clocks: 1212509/544
        strstr_Microsoft performance: 248KB/clock
        StrnglenTRAVERSED: 138478024 bytes
    
        strstr_GNU_C_Library_hits/strstr_GNU_C_Library_clocks: 1212509/359
        strstr_GNU_C_Library performance: 376KB/clock
        StrnglenTRAVERSED: 138478024 bytes
    
        Railgun_Doublet_hits/Railgun_Doublet_clocks: 1212509/321
        Railgun_Doublet performance: 421KB/clock
        StrnglenTRAVERSED: 138478024 bytes
    
        Railgun_Quadruplet_8Triplet_hits/Railgun_Quadruplet_8Triplet_clocks: 1212509/335
        Railgun_Quadruplet_8Triplet performance: 403KB/clock
        StrnglenTRAVERSED: 138478024 bytes
    
        Railgun_Mischa_8Triplet_hits/Railgun_Mischa_8Triplet_clocks: 1212509/348
        Railgun_Mischa_8Triplet performance: 388KB/clock
        StrnglenTRAVERSED: 138478024 bytes
    
        BNDM_32_hits/BNDM_32_clocks: 1212509/505
        BNDM_32 performance: 267KB/clock
        StrnglenTRAVERSED: 138478024 bytes
    
        ...
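
For reading the dump above: the "KB/clock" figures are consistent with dividing the StrnglenTRAVERSED byte count by the clock count and then by 1024, truncating at each step. The sketch below reproduces that arithmetic. The function name `kb_per_clock` is hypothetical, and what unit a "clock" is on the author's machine is not stated in the dump, so treat this as a way to check the numbers, not as the benchmark's actual code.

```c
#include <stdint.h>

/* Hypothetical helper: throughput in KB per clock, using truncating
 * integer division (assumed, because it matches the quoted figures). */
uint64_t kb_per_clock(uint64_t bytes_traversed, uint64_t clocks)
{
    if (clocks == 0)
        return 0; /* avoid division by zero; no meaningful rate */
    return bytes_traversed / clocks / 1024;
}
```

With 138478024 bytes traversed, 544 clocks gives 248, 359 clocks gives 376, and 321 clocks gives 421, matching the strstr_Microsoft, strstr_GNU_C_Library and Railgun_Doublet lines.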

~~~
acqq
A 'ton' of code and dumps in that form is a problem in itself. Georgi, if you
ever happen to read this, I hope you'll get what I mean.

------
Sanmayce
@StefanKarpinski

The article is licensed under CPOL, not the code. Railgun is licenseless, one
developer working for Mozilla advised me to put it under BSD or public domain
- which is guess what: just another license, all my etudes/tools/functions are
100% FREE, not as pseudo-copylefters understand and try to sell their "Free"
- which is ridiculous, especially the free beer part, if I am to share my joy
with my buddies I buy beers and give them for free UNCONDITIONALLY.

The bottom-line: Railgun is people's choice 'memmem', if you ever face the
possibility to go to jail, just call me I will tell the judges some copyleft
sagas of my own, that is to educate them how university professors are funded
with people's money (not only) and any derivative of those
algorithms/implementations should follow the same licenselessness - a nifty
word - everything else is just one perverted game for money, as I like to say
hypocrisy in action.

Regards to all, and no, my endless dumps are not to obstruct the usage, quite
the contrary - to provide field feedback - to give thorough comparisons, had I
had more than one computer I would have dumped several times more stats.

Best, Georgi

~~~
StefanKarpinski
Thanks for the clarification. The algorithm is very clever and the performance
quite impressive. It would be great if the legal status of the code were
clearer so that there was more chance of it making its way into usage by
people. Saying that the code is "free" in an article is not really
sufficiently clear to alleviate legal concerns people might have. The word
"license" is never mentioned anywhere in the page, while the code project
version [1] appears to state that the article and its code are under the very
unfortunate CPOL. If you want to make this code public domain – which is
great, btw – then I recommend that you put it up on GitHub (or BitBucket, or
something) with a LICENSE.md file that explicitly states that it is "public
domain". Use that phrase verbatim – it will greatly alleviate the uncertainty
and doubt about its legal status. Thanks for the cool algorithm.

p.s. I fully agree that code from publicly funded academic work should be open
source – ideally under a very liberal license like BSD or MIT.

[1] [http://www.codeproject.com/Articles/250566/Fastest-strstr-like-function-in-C](http://www.codeproject.com/Articles/250566/Fastest-strstr-like-function-in-C)

------
dubcanada
That webpage is just wow...

It looks pretty awesome, I think it could be ported to PHP fairly easily.

------
codezero
This site is faster and has some code:
[http://www.codeproject.com/Articles/250566/Fastest-strstr-like-function-in-C](http://www.codeproject.com/Articles/250566/Fastest-strstr-like-function-in-C)

------
robinhoode
Do projects like this ever get included into the mainstream? Would this be an
appropriate candidate for inclusion into PHP's standard library?

~~~
dannypgh
I would have hoped (but have no idea) that PHP simply uses libc's strstr in
their implementation of strstr. If this is the case, then this would need to
be included in the relevant libc for the platform you're using PHP on.

I was going to have that be my entire comment here, but I figured this was
easy enough to check - so I pulled php 5.5.7 source, and because of option
parsing complexities strstr ends up being implemented in terms of php_memnstr,
which is a macro for zend_memnstr, which in turn calls memchr and memcmp
repeatedly in a loop. So, no, libc's strstr doesn't seem to be used.

I'm a little unsure whether or why this has to be so complex, but after a
quick dip the water doesn't seem inviting enough for me to follow up.
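
For anyone curious what "calls memchr and memcmp repeatedly in a loop" looks like in practice, here is a minimal sketch of that general pattern. It is not PHP's actual zend_memnstr source; the name `memnstr_sketch` and its signature are assumptions made for illustration. It works on explicit lengths, so embedded NUL bytes are not a problem.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not PHP's code): find needle in the byte range
 * [hay, hay + hay_len) using memchr to locate candidate positions and
 * memcmp to verify each one. Returns NULL when there is no match. */
const char *memnstr_sketch(const char *hay, size_t hay_len,
                           const char *needle, size_t needle_len)
{
    if (needle_len == 0)
        return hay;
    if (needle_len > hay_len)
        return NULL;

    const char *p = hay;
    /* Last position at which the needle could still start. */
    const char *last = hay + (hay_len - needle_len);

    while (p <= last) {
        /* Jump to the next occurrence of the needle's first byte. */
        p = memchr(p, needle[0], (size_t)(last - p) + 1);
        if (p == NULL)
            return NULL;
        /* Candidate found: compare the whole needle. */
        if (memcmp(p, needle, needle_len) == 0)
            return p;
        p++;
    }
    return NULL;
}
```

Calling `memnstr_sketch("hello world", 11, "world", 5)` returns a pointer to the `w`. The worst case is still proportional to the product of the two lengths, which is the gap the skip-based algorithms try to close.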

~~~
maffydub
With regard to your comment about complexity, the cunning thing here is that
these algorithms find a substring in a string very quickly, often without even
looking at every character in the string.

For example, Boyer-Moore
([http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm](http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm))
starts by comparing the last character of the substring against the string.
If that character matches, it compares the preceding characters in turn. If
it does not match, it can skip ahead by several characters (possibly even the
length of the substring, depending on how the match failed). How far to skip
is a bit complicated, but can be calculated in advance from the substring.

Consider searching for a substring consisting of 1000 'a's. Boyer-Moore starts
by looking at the 1000th (1-indexed) character. If it's an 'a', it then walks
back and checks the 999th, 998th etc. However, if it's not an 'a', it can
immediately skip on to examine the 2000th character, i.e. only looking at 1 in
every 1000 characters. As you can imagine, this can be very fast!
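
The skip-ahead idea can be sketched with the simpler Horspool variant of Boyer-Moore, which keeps only a per-byte skip table. This is a minimal illustration, not Railgun's code and not full Boyer-Moore; the name `horspool_sketch` and its signature are assumptions.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: Boyer-Moore-Horspool search over explicit lengths.
 * Returns a pointer to the first match, or NULL if there is none. */
const char *horspool_sketch(const char *hay, size_t hay_len,
                            const char *needle, size_t needle_len)
{
    size_t skip[UCHAR_MAX + 1];

    if (needle_len == 0)
        return hay;
    if (needle_len > hay_len)
        return NULL;

    /* A byte that does not occur in the needle allows a full-length skip. */
    for (size_t i = 0; i <= UCHAR_MAX; i++)
        skip[i] = needle_len;
    /* A byte that occurs (other than in the last position) allows a
     * shorter skip that aligns its last occurrence with the text. */
    for (size_t i = 0; i + 1 < needle_len; i++)
        skip[(unsigned char)needle[i]] = needle_len - 1 - i;

    size_t pos = 0;
    while (pos <= hay_len - needle_len) {
        /* Look at the text byte under the needle's last position first. */
        unsigned char tail = (unsigned char)hay[pos + needle_len - 1];
        if (tail == (unsigned char)needle[needle_len - 1] &&
            memcmp(hay + pos, needle, needle_len) == 0)
            return hay + pos;
        pos += skip[tail];
    }
    return NULL;
}
```

For a needle of 1000 'a's, every byte other than 'a' gets a skip of 1000, so a mismatch at the 1000th character moves the next check straight to the 2000th, as described above. The price is the 256-entry table built before the search starts.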

The Railgun implementation seems to be a combination of improved Boyer-Moore
(Boyer-Moore-Horspool-Sunday) with Rabin-Karp (which uses hashing). My
understanding is that these algorithms complement each other, so if you have
an input string that is particularly inefficient with one algorithm, it
automatically picks the other one.

Since many programs have string-searching in their innermost loops, spending
some time optimizing this function can be worthwhile.

~~~
mediocregopher
I think your parent was referring to why php's implementation is so complex.

~~~
dannypgh
Indeed. PHP is often criticized for having inconsistently named or patterned
libraries, and the response is usually "PHP is a relatively light wrapper to a
bunch of C libraries" -- In fact I saw 3 versions of strstr in PHP core -
strstr, mb_strstr, and grapheme_strstr. I guess I would have expected one of
them to be a somewhat thin wrapper around libc's strstr.

As another commentator pointed out, libc's strstr assumes NUL-terminated
strings. Maybe php's doesn't? Which seems a bit odd to me in light of the
explanation of PHP's genesis as having roots in C, but stranger things...

I'm not surprised at all that libc's strstr would be complex.

------
jaytaylor
The site seems to be getting hammered, and once it loaded I struggled to read
it due to the low-contrast font/bg-color selection.

Here is a gist of the code:
[https://gist.github.com/jaytaylor/8102304](https://gist.github.com/jaytaylor/8102304)

~~~
acqq
The site has 5 MB of png files which show... well nothing relevant to the
topic. And the content of the page is mostly unfiltered output of some
strangely presented program pieces.

I would be glad to read that Sanmayce reads this or some similar input and
then starts to think about making his output really more accessible. But I
guess he likes it as it is. Bon Appétit.

~~~
gwu78
"I would be glad that [webpage author] ... starts to think about making his
output really more accessible."

This is how I feel when I hit a webpage that offers zero content without
having to execute JavaScript in a browser first.

For whatever it's worth, this page loads in less than 1 second and looks fine
in my text-only browser.

I guess in this case the web developer has chosen to recklessly punish users
who never disable images or JavaScript, in the same way some web developers
recklessly punish users who never enable such "essential features".

If the user's objective is to read and perhaps download some source code (as
in this "article"), there is arguably no reason that images or JavaScript
should be necessary.

"recklessly" here means the web developer does not intend to make users suffer
but he knows some users will suffer if he makes a certain design choice and,
knowing this, he makes that choice anyway.

------
devicenull
Thought this was part of
[https://www.cloudflare.com/railgun](https://www.cloudflare.com/railgun) at
first...

------
nwmcsween
There are trade-offs: it may be 'better', but it requires (I think) O(nm) space,
while something like two-way strstr requires O(1).

