
Why GNU grep is fast - ColinWright
http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
======
js2
129 days ago - <http://news.ycombinator.com/item?id=2393587>

Never to be discussed without bringing up The Treacherous Optimization -
<http://ridiculousfish.com/blog/archives/2006/05/30/old-age-and-treachery/>

Edit: and TTO's discussion - <http://news.ycombinator.com/item?id=1627367>

------
y0ghur7_xxx
Previous discussions:

<http://news.ycombinator.com/item?id=2393587>

<http://news.ycombinator.com/item?id=1626305>

------
tekyon
<http://news.ycombinator.com/item?id=2822041> Wasn't this from you?

~~~
ColinWright
Yes, absolutely. I figure that instead of endless repetition of content-free
crap, I'd rather see the very occasional revisit of something that's got
serious content. In part I look forward to seeing how this particular
submission fares.

Nice to see you register to make a contribution - it's an important question
to ask, and I appreciate that you've taken the effort to do so (and I've
upvoted you).

~~~
dfc
Do you think something has changed that makes it worth revisiting this post?
Or do you think it's just a good post to revisit regularly?

I had not read it in a while and enjoyed reading it again. There were
definitely some things I had forgotten, but I am wondering if something has
changed/happened that should change my reading of the thread.

------
p9idf
Fun fact: GNU grep is slow at UTF-8. Try

      time LANG=C grep asdf < les_misérables.txt > /dev/null

versus

      time LANG=en_US.UTF-8 grep asdf < les_misérables.txt > /dev/null

~~~
_delirium
Hmm, strange. One of the stated design features of UTF-8 was that Boyer-Moore
could still be directly applied (<http://www.ietf.org/rfc/rfc2279.txt>). Is
grep doing something unnecessarily suboptimal here, or is that RFC's comment
correct but not the whole story?
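The RFC's claim can be sanity-checked directly: a byte-wise Boyer-Moore (the
Horspool variant, sketched in Python below purely for illustration, not
grep's actual code) works on UTF-8 input without decoding it, as long as the
pattern is matched byte-for-byte:

```python
# Minimal Boyer-Moore-Horspool over raw bytes. Because UTF-8 is a
# byte-oriented encoding, the same skip table works on UTF-8 text
# without decoding it first.
def bmh_find(haystack: bytes, needle: bytes) -> int:
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table: how far we may shift when the last byte of
    # the current window is the byte `b`.
    skip = {b: m - i - 1 for i, b in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        pos += skip.get(haystack[pos + m - 1], m)
    return -1

text = "Les misérables sont asdf parmi nous".encode("utf-8")
assert bmh_find(text, b"asdf") == text.find(b"asdf")
```

So the skipping itself survives UTF-8 fine; what it cannot give you for free
is locale-aware character equivalence.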

~~~
Sharlin
The problem is probably that deciding whether two Unicode characters (where
one character may be composed of multiple code points) are equivalent is, in
the general case, MUCH more involved than doing a simple byte comparison.

<http://en.wikipedia.org/wiki/Unicode_equivalence>

~~~
_delirium
True, but in the special case where your search string is composed entirely of
7-bit ASCII characters (the characters representable as 1-byte in UTF-8), like
the example, shouldn't the character-equivalence logic still be easy?

~~~
Someone
I do not think that accented e is a 7-bit ASCII character. Even if it is, the
text being searched may contain the character coded as a multi-byte sequence.

~~~
_delirium
That was just in the filename; the search string in the example is 'asdf'. And
UTF-8 I believe is purposely designed so that 7-bit ASCII characters _can't_
appear anywhere else in the stream except when representing themselves (even
as part of multibyte characters)--- every other byte in a UTF-8 stream must
have the high-order bit set.
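That property is easy to verify empirically; a throwaway check in Python
(illustrative only, nothing to do with grep's internals):

```python
# In UTF-8, every lead and continuation byte of a multibyte character
# has the high-order bit set, so a pure-ASCII byte like 'a' can only
# ever appear in the stream as the ASCII character 'a' itself.
multibyte = "é".encode("utf-8")
assert all(b >= 0x80 for b in multibyte)

# Consequently a byte-wise search for an ASCII pattern can never land
# a false match starting inside a multibyte character.
encoded = "les misérables".encode("utf-8")
assert b"asdf" not in encoded
```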

My guess is that p9idf has it right, and grep is just converting everything to
wchar_t first, rather than trying to do any sort of clever searching directly
on the UTF-8 byte stream.

~~~
forgotusername
There's nothing to prevent this from being implemented, other than that it'd
be a huge hacky mess. It would require sidestepping iconv (or however grep
does it) when LC_ALL is one of a specific set of strings, activating some
special cases, and then, building on those special cases, further scanning
the input pattern (which may not be just a literal string - how might this
work with character classes/ranges?) to ensure it is 7-bit, in order to
achieve the desired speedup.
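The 7-bit pre-check itself would be the easy part; something like this
hypothetical helper (the name and interface are made up for illustration):

```python
# Hypothetical pre-check a grep-like tool could use: the fast byte-wise
# path is only safe when the literal pattern is pure 7-bit ASCII.
def pattern_is_7bit(pattern: bytes) -> bool:
    # In UTF-8, bytes below 0x80 only ever encode ASCII characters.
    return all(b < 0x80 for b in pattern)

assert pattern_is_7bit(b"asdf")
assert not pattern_is_7bit("misérables".encode("utf-8"))
```

The hard part is everything around it: character classes, ranges, and the
locale plumbing.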

Or if your input data is sufficiently ASCII-ish, and so is your search
pattern, then why not just force the process locale to C and avoid the whole
mess to begin with?

I'm suddenly left wondering how the "." regex syntax functions in the face of
surrogates when handling UTF-8.

------
_delirium
This is vaguely relevant to one of the recent discussions on HN about null-
terminated strings: Boyer-Moore doesn't gain you these advantages if you have
null-terminated strings, unless you also know the length, for the same reason
that this email gives for not searching for \n initially: you can't skip any
bytes during the search if you're looking for a 1-byte match.
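A quick way to see this is to look at the skip table itself: in a
Horspool-style search (sketched in Python for illustration), the maximum
possible skip is bounded by the pattern length, so a 1-byte pattern can
never skip anything.

```python
# The largest jump a Horspool bad-character table can ever make is the
# pattern length, so searching for a 1-byte pattern (such as '\n', or a
# NUL terminator) must examine every byte.
def max_skip(needle: bytes) -> int:
    m = len(needle)
    skip = {b: m - i - 1 for i, b in enumerate(needle[:-1])}
    return max([m] + list(skip.values())) if m else 0

assert max_skip(b"\n") == 1      # scanning for newline: no skipping
assert max_skip(b"asdf") == 4    # longer pattern: up to 4-byte jumps
```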

------
DanielRibeiro
_The key to making programs fast is to make them do practically nothing_

Priceless...

~~~
kragen
While that has been true for 70 years, it is ceasing to be true now; today you
can often make your program faster by causing it to do more _computation_ in
order to avoid _communication_ , or in order to tolerate communication
latency, which may involve doing _more_ communication, or occasionally in
order to tolerate node failures, so that you can take advantage of more nodes.

~~~
burgerbrain
In the first case, you are doing less (communication) to go faster.

In the second case, you are doing less (computation) to go faster.

The fewer expensive operations you have to do (whatever they may be), the
faster your program is. That has been true for the past 70 years, and will
continue to be true as long as computing remains in the slightest bit
recognizable to men of our field as it currently stands.

~~~
jleader
70 years later, and we're still all "men"?

~~~
burgerbrain
Oh get over yourself.

      b : the human race : humankind

<http://mw1.merriam-webster.com/dictionary/man>

~~~
jleader
Yes, but that definition you quote is for "man" (often written as the more
archaic "mankind"), not "men". You're actually using definition a.1: "an
individual human; especially : an adult male human", in the plural, to refer
to adult male humans of our field. Admittedly a pedantic point; I wouldn't
bother if it didn't contribute to a real-world problem.

~~~
burgerbrain
Men is nothing more than a pluralization of man. In this context it means
members of "man".

<https://secure.wikimedia.org/wiktionary/en/wiki/men>

Now take your self-righteous offtopicness and shove off. I contribute
nothing to the problems of society by exercising freedom of vocabulary. You
want to see real problems? Go look at pictures of the shelled remains of
children in Syria. Your suggestion that I am contributing to a "real-world
problem" is insulting.

Even if we want to consider the furthering of systemic sexism through benign
use of vocabulary a "real-world problem", you are more to blame than I by
trying to convert an innocent phrase into an example of hatred. Words only
have the power that _you_ allow them to. I accept no responsibility for your
oversensitivity.

~~~
jtheory
Meh, it's a smaller problem than most, but it's also far easier to solve than
most. Why avoid making the world a tiny bit better if it costs you nothing? If
making minor tweaks to our word choices could avoid violence to Syrian
children, too, that'd be grand, but alas...

This is not exactly a response to you, BTW (since I can't imagine you'd be
convinced) but for other readers.

------
ColinWright
To the observation that this has been submitted before: agreed.
Contributions to those previous submissions are now closed, however, because
they are sufficiently far in the past, and I think this subject is worth
re-visiting occasionally.

FWIW: I've upvoted the comments about earlier submissions - I'd like to
reward such behavior. Not least, searching for previous discussions often
unearths genuinely useful material and should be encouraged.

~~~
antics
Yeah. You're the guy who reposted that Reinvigorated Programmer article again
a week or so back, right? Something like, "Old Post, What's changed?"

Just a friendly reminder, you may have forgotten, here's the profile blurb you
have up right now:

> I'm tired of the repetition, [...] old material being recycled,

> as I write this, Lockhart's Lament has been posted again.

EDIT: formatting error

~~~
ColinWright
Yup, exactly so. To quote a slightly different bit of it:

        ... banal comments, old material being recycled,
        and general lack of interesting stuff.

In short, the vast majority of repeated material is drivel. Fluff. Content-
free. I'm trying to restore the balance a little with more technical,
challenging and deep material.

This is probably controversial, but it's my last-ditch effort to actually
get some technical material back into _Hacker_ News.

I honestly don't expect it will succeed, but then I can finally depart with a
completely clear conscience that I tried my best and found that the so-called
"Hacker News" no longer contains much I care about.

I hope I'm wrong, but I had to try the experiment.

PS: And FWIW - I've upvoted you.

~~~
pgbovine
random (drivel) hack idea ... create hybrid pages with iframes that combine a
meaty technical article (e.g., a russ cox blog post) with a fluffy tech gossip
article, and post these hybrid articles to HN. e.g., "Fast string matching
with DFAs / new music social network for your pets launch party pics!"

------
iam
Great article - really makes you think about O(n) substring searching,
doesn't it?

It would've been nice to drill down into why mmap makes a difference in
particular: why not use something like lseek to skip characters?

~~~
thristian
My guess: modern CPUs have hardware support for pausing user-level code and
switching to kernel mode to service a page fault, but the syscall interface
requires that the transition to kernel mode be done "manually", as it were.

Also, if you're reading from a file with lseek() and read(), you have to
marshal data into the struct the kernel expects, make the syscall, and the
kernel copies the data into the user-space buffer you provide. If you're
reading from a file with mmap(), the page cache just appears in your address-
space, no copying necessary.
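A minimal illustration of the mmap side in Python (a sketch on a POSIX-ish
system, not how grep itself is written; the scratch file and its contents
are made up for the demo):

```python
import mmap
import os
import tempfile

# Write a small scratch file to search.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello grep world\n" * 1000)
os.close(fd)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # The page cache is mapped straight into our address space:
        # no read() calls, no copies into a user-space buffer.
        offset = m.find(b"grep")

os.remove(path)
print(offset)  # → 6
```

The kernel still faults pages in on demand behind the scenes, but those
transitions are handled by the hardware page-fault path rather than by
explicit syscalls from the program.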

On the other hand, I was curious why the default had changed from "use mmap"
to "don't use mmap". Turns out, the grep 2.6.3 documentation states:

 _--mmap: This option is ignored for backwards compatibility. It used to
read input with the `mmap' system call, instead of the default `read' system
call. On modern systems, `--mmap' rarely if ever yields better performance._

~~~
iam
That could explain why using mmap is faster than read in general, but what
about skipping bytes? It sounds to me like if the seek distance is less than a
page size, the kernel would need to fill the entire page with data.

Then again, hard drives return minimum-sized blocks of data anyway, sized in
kilobytes. So I'm guessing the "skipping" is a performance benefit only
insofar as user space goes.

One would really have to grep a long string to be able to skip more than a
block/page!

