
Make grep 50x faster - iamtechaddict
https://blog.x-way.org/Linux/2013/12/15/Make-grep-50x-faster.html
======
agf
If you take a look at the comments on the article, most of that speedup is
because the LANG=C command was run second and the files were cached.

His estimate accounting for that was 7x, but this is clearly not a benchmark
that was carefully thought through.

------
pmelendez
I believe that's what ack-grep[1] and the silver searcher (AKA ag)[2] do
underneath.

I would recommend giving those alternatives a try. I haven't had to go back
to grep since I started using ack-grep (and now ag).

[1] [http://beyondgrep.com/](http://beyondgrep.com/)

[2] [http://geoff.greer.fm/2011/12/27/the-silver-searcher-
better-...](http://geoff.greer.fm/2011/12/27/the-silver-searcher-better-than-
ack/)

~~~
cs02rm0
Maybe, but the silver searcher still seems a _lot_ faster than grep even with
this trick (about 0.5s vs 32s on an arbitrary test).

------
anilshanbhag
I am curious what happens when the commands are run in the reverse order,
with the LANG=C variation first. I suspect some of the speedup is because the
first run had just brought the file into memory.
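
A rough way to check this, as a sketch only: warm the page cache first, then
time both variants in both orders. The helper names (timed_grep,
compare_orders) and the en_US.UTF-8 fallback are made up for illustration, and
date +%s only resolves whole seconds, so this is only meaningful on a file
about as large as the one in the article.

```shell
# Hypothetical helper: run the article's grep under a given LANG and print
# the elapsed whole seconds. Note that LANG has no effect if LC_ALL is set.
timed_grep() {
    lang=$1
    file=$2
    start=$(date +%s)
    # grep exits 1 when nothing matches; that is not an error here.
    LANG=$lang grep -i e "$file" > /dev/null || true
    end=$(date +%s)
    echo "LANG=$lang $((end - start))s"
}

# Read the file once so no run pays for the disk read, then time
# default-then-C followed by C-then-default.
compare_orders() {
    file=$1
    # Assumption: fall back to en_US.UTF-8 when LANG is unset.
    default_lang=${LANG:-en_US.UTF-8}
    cat "$file" > /dev/null
    timed_grep "$default_lang" "$file"
    timed_grep C "$file"
    timed_grep C "$file"
    timed_grep "$default_lang" "$file"
}
```

Running compare_orders big.log prints four lines. If the LANG=C lines are
faster regardless of position, the locale matters; if all four are about
equal, the original difference was probably the cache.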

~~~
ye
Not only that, the search branches are sitting in the CPU caches.

~~~
pygy_
Wouldn't the cache be flushed by the OS and other background processes in the
meantime, plus the display updates in the shell and the shell history?

------
iagooar
This is what I get on a Precise 32bit Ubuntu:

    
    
      stuff$ du -sh big.log
      2.8G	big.log
    
      stuff$ time grep -i e big.log > /dev/null
      real	0m30.228s
      user	0m12.213s
      sys	0m3.228s
    
      stuff$ time LANG=C grep -i e big.log > /dev/null
      real	0m30.130s
      user	0m12.105s
      sys	0m3.308s
    
    

What is LANG=C supposed to do?

~~~
simias
Maybe your default locale is already C?

I'm still surprised that TFA can claim such a speedup. I would have thought IO
speed was the bottleneck when you grep through a large amount of data.

As another poster mentioned, I wonder if the speedup is mainly due to disk
caching in RAM during the 2nd run.
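
To check whether the default locale is already C, here is a minimal sketch
that resolves which setting actually governs character handling, following the
POSIX precedence of LC_ALL over LC_CTYPE over LANG. The function name
effective_ctype is hypothetical, and the final fallback to C is an assumption
(POSIX leaves the default implementation-defined).

```shell
# Print the locale name that applies to character classification,
# using the POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        echo "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        echo "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        echo "$LANG"
    else
        # Assumption: treat "nothing set" as the C locale.
        echo "C"
    fi
}
```

If LC_ALL is set, prefixing the command with LANG=C changes nothing, which
would be another way to end up with identical timings.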

~~~
masklinn
FWIW my locale is en_GB.utf-8 and I also get no difference (in fact, the
version _with locale_ is slightly faster than without), with GNU grep 2.14 on
OSX 10.9.

The built-in BSD grep (2.5.1-FreeBSD) also runs in 30% of the time GNU grep
does.

------
blassium
There was a really interesting post on here a while back on GNU grep vs BSD
grep (2010)[1]

The improvement mentioned here also has to do with the Boyer-Moore algorithm.
When switching the locale from LANG=whatever to LANG=C, we're reducing the
size of the lookup table to a fraction of what it previously was. In this
case, the fraction is 1/50th, but, as the author said, this will vary between
patterns and platforms.

[1] [http://lists.freebsd.org/pipermail/freebsd-
current/2010-Augu...](http://lists.freebsd.org/pipermail/freebsd-
current/2010-August/019310.html)

------
comex
Note that, at least as of GNU grep 2.14, if you don't use -i, the discrepancy
doesn't show up, so it's smart enough to recognize that the UTF-8 search can
be correctly performed as a byte search. I suspect the case-insensitive
version can also be done correctly much faster, though.

~~~
acdha
> it's smart enough to recognize that the UTF-8 search can be correctly
> performed as a byte search

It shouldn't be that simple: it'd also need to confirm that the pattern
wouldn't match any combining characters, or normalization would still be
necessary.

~~~
Joeri
The non-leading UTF-8 bytes are all in an easily detected range (0x80 to
0xBF) that doesn't overlap with ASCII.
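
A small sketch of that property, assuming od is available. The helper names
hex_bytes and classify_byte are hypothetical, and the "lead" bucket is loose:
it also catches the few high byte values that are never valid in UTF-8.

```shell
# Print the bytes read from stdin as space-separated lowercase hex.
hex_bytes() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Classify one byte, given as two hex digits, by its top bits.
classify_byte() {
    b=$((0x$1))
    if [ "$b" -lt 128 ]; then
        echo ascii          # 0xxxxxxx: 0x00 to 0x7F
    elif [ "$b" -lt 192 ]; then
        echo continuation   # 10xxxxxx: 0x80 to 0xBF
    else
        echo lead           # 11xxxxxx: 0xC0 and above
    fi
}
```

For example, printf '\303\251' | hex_bytes (the two bytes of "é") prints
c3 a9, and classify_byte reports c3 as lead and a9 as continuation. Neither
byte falls in the ASCII range, so an ASCII pattern byte can never match in
the middle of a multi-byte character.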

~~~
acdha
Yes - it's not hard to do but it does require someone to remember to check
before attempting the optimization.

------
tszming
See:

[1]
[https://news.ycombinator.com/item?id=3337411](https://news.ycombinator.com/item?id=3337411)

[2]
[http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance...](http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance-
win/)

~~~
acqq
And especially:

[http://rg03.wordpress.com/2009/09/09/gnu-grep-is-slow-on-
utf...](http://rg03.wordpress.com/2009/09/09/gnu-grep-is-slow-on-utf-8/)

"Update on 2010/10/28: GNU grep is no longer slow on UTF-8. The problem was
fixed with the release of GNU grep 2.7. The rest of the article can now be
considered obsolete."

------
dfc
I did not see any version numbers, or whether we are discussing BSD grep or
GNU grep. The grep in OSX is ridiculously slow. Whenever anyone says grep is
slow, the first thing I ask is whether they are using OSX, and the answer is
almost always yes. GNU grep is a lot faster.

That being said, there was a bug with grep and UTF-8 a little while back.
Debian lists the bug as present in 2.6 and fixed in 2.8:

"grep ." pathologically slow in UTF-8 locales -- [http://bugs.debian.org/cgi-
bin/bugreport.cgi?bug=604408](http://bugs.debian.org/cgi-
bin/bugreport.cgi?bug=604408)

------
nullanvoid
There was a write-up about this a while ago:
[http://www.inmotionhosting.com/support/website/ssh/speed-
up-...](http://www.inmotionhosting.com/support/website/ssh/speed-up-grep-
searches-with-lc-all)

