
2000x performance win - tilt
http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance-win/
======
luriel
This is not really a 2000x performance "win"; this is GNU grep(1) being so bad
at dealing with UTF-8 that it is basically unusable.

The POSIX locale system is a nightmare, but the implementation in GNU
coreutils is even worse: when it is not insanely slow, it is plain broken.

I wasted days tracking down a bug when we moved a script from FreeBSD to a
Linux system that had a UTF-8 locale: even for plain ASCII input, GNU awk
would match lines erratically unless the locale was set back to C! I'm sure
this bug has been fixed by now, but it is far from the only serious flaw I
have found in the UTF-8 support of GNU tools. These days it seems the only
way to be safe is to set the locale to C at the top of all your scripts,
which is quite sad.

This is also required because, due to POSIX locale insanity, the actual
behavior of tools as basic as ls(1), and more fundamentally of regexp
matching, changes based on the locale, making scripts that don't set their
locale at the top unreliable and unportable.

Another option is to use a toolkit with decent UTF-8 support, like the one
from Plan 9, which one can use on *nix systems via Plan 9 from User Space
( <http://plan9.us> ) or 9base ( <http://tools.suckless.org/9base> ). If you
use those, your scripts will also be much more portable, as there are all
kinds of incompatibilities among, for example, awk implementations.

~~~
dododo
here's another "fun" grep locale oddity:

    
    
       $ echo HI | LANG=en_US.utf8 grep '^[a-z]'
       HI
       $ echo HI | LANG=C grep '^[a-z]'
       $
    

apparently en_{GB,US}.utf8 orders a-z like aAbBcC..zZ.

    
    
       $ echo ZI | LANG=en_US.utf8 grep '^[a-z]'
       $

~~~
iso8859-1
This is what I get:

    
    
        $ echo HI | LANG=C grep '^[a-z]'
        $ echo HI | LANG=en_US.utf8 grep '^[a-z]'
        $ 
    

How come?

~~~
p9idf
I was able to reproduce the bug. It could be a version thing.

    
    
      ; grep --version
      GNU grep 2.6.3
      ; echo A | LANG=en_US.utf8 grep '[a-z]'
      A

~~~
Ives
No, I have the same version but get a different result. I also have the
en_US.utf8 locale installed.

------
Nitramp
Does anyone understand what the bug in grep was?

UTF-8 regular expression matching shouldn't be different from ASCII at all, as
far as I can tell. In UTF-8, every byte by itself can be identified as a start
byte or trail byte, so if you want to match a regular expression, you don't
even have to care about UTF-8 in any way. Any legal character in the regexp
you want to look for can only match at a legal character start in the
haystack.
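
Nitramp's self-synchronization point can be sketched in a few lines of Python
(an illustrative example, not from the thread): every byte of a UTF-8 stream
can be classified in isolation, with no surrounding context.

```python
def utf8_byte_kind(b: int) -> str:
    """Classify a single byte of a UTF-8 stream without any context."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"   # 10xxxxxx: trail byte, never starts a character
    return "lead"               # 11xxxxxx: starts a multi-byte character

# Every byte of "héllo" (where é is 0xC3 0xA9) is classifiable in isolation:
kinds = [utf8_byte_kind(b) for b in "héllo".encode("utf-8")]
# kinds == ["ascii", "lead", "continuation", "ascii", "ascii", "ascii"]
```

Because a trail byte can never be mistaken for a start byte, a matcher that
lands mid-character can immediately tell it has done so.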

~~~
4ad
GNU grep uselessly converts UTF-8 to an internal multibyte character
representation.

You are completely right, matching regular expressions with UTF-8 text is the
same as with ASCII text, one of the reasons why UTF-8 is so good.

~~~
omgtehlion
You are both wrong. If you take an arbitrary byte in an ASCII stream, you can
be sure it is a whole char, but if you take an arbitrary byte in a UTF-8
stream, you can get the start of a char, the end of a char, or a whole char.

~~~
4ad
Obviously choosing an arbitrary offset in the byte stream might not align on
a rune boundary, but UTF-8 is great precisely because you can determine that.
You can find the start/end boundaries in any subset of the byte stream, as
opposed to most other encodings.

The issue is completely unrelated to the fact that matching regexps in UTF-8
text is the same as matching regexp in ASCII text. The regular expression tool
doesn't even need to care that the text is UTF-8. It's just byte comparisons,
the tool doesn't even need to be aware of rune boundaries.
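
A minimal Python sketch of this property (with made-up example text): because
a well-formed needle's first byte is always a lead or ASCII byte, and those
never equal continuation bytes, a plain byte-level search can only match at
character boundaries of the haystack.

```python
# Byte-level substring search on UTF-8 text: no decoding, no rune
# awareness, yet a match can only start on a character boundary.
needle = "é".encode("utf-8")            # b'\xc3\xa9'
haystack = "caféine".encode("utf-8")

pos = haystack.find(needle)             # plain byte comparison
assert pos == 3                         # 'c', 'a', 'f' are one byte each
# Decoding the prefix confirms the match lands on a character boundary:
assert haystack[:pos].decode("utf-8") == "caf"
```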

~~~
colomon
Don't some of the common cases of searching require the ability to quickly
skip forward N characters for high efficiency? That's simple pointer
arithmetic in the ASCII case, but requires reading each byte skipped in UTF-8,
right?

(I'm particularly thinking of Boyer-Moore --
[http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_sear...](http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm)
\-- but I'm sure there are other examples as well.)
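
For illustration, here is a byte-oriented Boyer-Moore-Horspool (a simplified
variant of Boyer-Moore) in Python. It is a sketch, not grep's actual
implementation, but it shows that the skip table can be built over raw bytes,
so the skips stay simple pointer arithmetic even on UTF-8 input.

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Boyer-Moore-Horspool over raw bytes; returns match index or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table: for each byte in the needle (except the last),
    # the distance from its last occurrence to the end of the needle.
    shift = {b: m - i - 1 for i, b in enumerate(needle[:-1])}
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        # Skip is measured in bytes; no character decoding needed.
        i += shift.get(haystack[i + m - 1], m)
    return -1

# Works unchanged on UTF-8 text, since skips are byte offsets:
h = "ångström units".encode("utf-8")
assert horspool_find(h, "ström".encode("utf-8")) == h.find("ström".encode("utf-8"))
```

The self-synchronizing property of UTF-8 guarantees the byte-level match also
falls on a character boundary, so skipping bytes rather than characters loses
nothing.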

------
tantalor
In case anybody else was wondering what LANG=C means,

    
    
      > The 'C' locale is defined as the "default" locale
      > for applications, meaning that their strings are
      > displayed as written in the initial code (without
      > passing through a translation lookup).
      > 
      > There is nothing special about the C locale, except
      > that it will always exist and applies no string
      > replacements.
      
      -Malcolm Tredinnick  
    

[http://mailman.linuxchix.org/pipermail/techtalk/2002-Novembe...](http://mailman.linuxchix.org/pipermail/techtalk/2002-November/013691.html)
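
The "always exists" guarantee is visible from Python's standard locale module
(an illustrative sketch; setting any *named* locale such as en_US.UTF-8 could
fail on a machine where it isn't installed, but "C" never does):

```python
import locale

# The C locale is the one locale guaranteed to exist on every system,
# so pinning a program to it always succeeds:
locale.setlocale(locale.LC_ALL, "C")
current = locale.setlocale(locale.LC_ALL)  # query without changing
# current == "C"
```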

~~~
tantalor

      > If the locale value is "C" or "POSIX", the POSIX locale is used 
      
      - The Single UNIX ® Specification, Version 2
    

<http://pubs.opengroup.org/onlinepubs/7908799/xbd/envvar.html>

------
bjoernbu
As far as I know, grep uses Boyer-Moore for string matching when possible.
Given a variable-width encoding such as UTF-8, plain Boyer-Moore isn't
possible, but it is the asymptotically fastest algorithm known.

Hence even a perfect version of grep will be slower by arbitrarily large
factors, depending on the input.

So while there may be problems, expecting no difference, or no significant
difference, between encodings is not correct either.

~~~
dexen
_> Given a variable-width encoding such as UTF-8, plain Boyer-Moore isn't
possible (...)_

That you get UTF-8 input and produce UTF-8 output doesn't imply you are better
off using UTF-8 for processing. Translating UTF-8 to fixed-width UTF-32 and
back has linear complexity and takes a small, fixed amount of memory. The only
trade-off comes when processing /very/ long lines -- up to four times more
memory would be used for the buffer.

As mentioned in other posts, Unicode requires normalization of certain
character combinations into other characters, so you'll be processing all
input characters anyway. Just prefix an extra step to it, not even a separate
loop.

And so you can do Boyer-Moore with Unicode at very little extra cost :-)

Some text-intensive programs on Plan 9, including grep, internally use a
fixed-width format called `Rune' for Unicode, exactly for reasons of
performance. UTF-8 input is translated into strings of Runes for processing
and translated back for output.
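
The Rune approach can be sketched in Python, whose str type is already a
fixed-width view of code points (the example text is made up): one linear
pass decodes the bytes, the search runs over code points, and the result maps
back to a byte offset for output.

```python
# Plan 9's Rune idea in miniature: decode once, search fixed-width,
# translate the answer back to bytes.
text_utf8 = "søk på Plan 9".encode("utf-8")

runes = text_utf8.decode("utf-8")        # linear-time UTF-8 -> code points
i = runes.find("på")                     # index counted in characters
byte_i = len(runes[:i].encode("utf-8"))  # map back to a byte offset

# The two indexes differ because "ø" is two bytes but one rune:
assert i == 4 and byte_i == 5
```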

------
gravitronic
For getting dtrace-like traces under Linux I strongly suggest getting oprofile
(<http://oprofile.sourceforge.net/news/>) working on your target machine. I've
used it on a PPC embedded board and it worked wonderfully. Do not optimize
that which you have not measured.

------
mgedmin
Is this really a bug in grep, rather than a bug in Solaris's libc? I've never
seen grep so slow, and I've been using UTF-8 locales for _years_.

I'm not denying that grep was buggy (there's a link to a bug on grep's bug
tracker that was closed more than a year ago), but I'm surprised at the
magnitude of the slowdown.

~~~
pdw
The official GNU grep used to be absurdly slow at UTF-8. Linux distributors
very quickly noticed this and fixed it when they switched to UTF-8 by default.
But GNU grep maintenance was essentially dormant for years and these patches
were only integrated in 2010.

For an old, unpatched GNU grep a 2000x slowdown is quite believable.

------
lordlarm
It was actually 4000x faster by using grep's built-in counter ("-c") instead
of "wc -l".

Are there any risks with just updating to the latest version of grep instead
of using the LANG=C hack?

------
iradik
Anyone able to verify the author's claim?

I only get a 2x improvement when switching LANG on Redhat linux on EC2:

    
    
      % export LANG=C
      % time grep done nohup.out | wc -l
      152929
    
      real    0m0.343s
      user    0m0.233s
      sys     0m0.112s
    
      % export LANG=en_US.UTF-8
      % time grep done nohup.out | wc -l
      152931
    
      real    0m0.771s
      user    0m0.673s
      sys     0m0.100s
    
      % grep --version
      GNU grep 2.6.3

Author is using grep 2.5.3, I am using 2.6.3, so not testing the same thing.

------
kennystone
I enjoyed seeing your debugging process and picked up a few tricks.

