Hacker News new | past | comments | ask | show | jobs | submit login

Nice work. But your story is incomplete if you don't include comparison with icgrep (parabix.costar.sfu.ca). Although icgrep is still an active research project, it is faster in many cases and has broader Unicode support (full Unicode level 1 of UTS #18, plus many level 2 features). For example, try the '\N{SMIL(E|ING)}' search that finds lines containing emoji characters with SMILE or SMILING in their Unicode name. icgrep also correctly applies Unicode character class intersection with expressions such as [\p{Greek}&&\p{Lu}], while ripgrep fails to meet UTS 18 level 1 requirements by interpreting '&&' as literal characters.

Nevertheless, we appreciate the challenge that ripgrep presents to our performance story. We definitely see some cases in which rg achieves better performance taking advantage of fixed strings somewhere in the pattern. We'll have to work on that... But for patterns based on Unicode classes (e.g., \p{Greek}, icgrep can be much faster, especially on large files. (We are only focused on big data applications -- icgrep has significant dynamic compilation overhead). It also does very well in cases involving alternations and ambiguity.

The icgrep performance story is based primarily on a new bitwise data parallel regular expression algorithm working off the Parabix transform representation of text. See our ICA3PP 2015 or PACT 2014 papers.

It might be fair to say that icgrep is not yet polished enough for inclusion in your study. I just added our first implementation of -r/-R flags last night and we certainly haven't yet handled .gitignore, etc. But if you want to understand the truth about regular expression performance, I think that the data point represented by icgrep (and its continuing development) needs to be included.




Benchmarking aside, I think the icgrep approach is very interesting.

The appeal of this sort of bit parallel stuff is that you don't get performance variability. It's all fun and games with literal optimizations until you pick the wrong literal - all you have to do is stuff the file in the down-thread benchmark with a huge pile of "Holmes Holmes Holmes Lestrade Lestrade Lestrade" etc and suddenly both rg and Hyperscan will look a lot dumber and icgrep will keep going at the same speed.

Unfortunately, we've never figured out a way to dispense with literal-based matching for the large scale cases that we work with. When someone gives you a dozen regexes - or 12,000 - bit parallel approaches go into the weeds (as do most regex-oriented techniques; DFAs explode and bit-parallel NFAs - regardless of organization - become ponderous).


Speaking as the leader of the Hyperscan project (https://github.com/01org/hyperscan), I'd say you might not be the only project feeling a bit neglected in the performance comparison here.

It's nice to be mentioned - and even called out by name as the inventor of Teddy (!) - but we're always even more pleased when someone else measures Hyperscan, as it minimizes the prospect of us embarrassing ourselves by posting some self-serving and/or outlandish microbenchmark.

Also it saves everyone time reading the obligatory Intel legal disclaimers...


We are very interested in tackling the multiple-pattern regular expression problem and will definitely want to use Hyperscan as a comparator.

Any help in setting up a study with both patterns and data sets would be most appreciated!


This sounds interesting. I suspect there are many alternate approaches to multiple regex.

Sadly, regex benchmarking is a sewer. There are two classes of multiple regex benchmarks: public ones and good ones, and not much intersection between the two. Synthetic pattern generation can be manipulated to say whatever you want it to say, Snort patterns aren't intended to be run simultaneously (so putting a big pile of them into a sack and running them is of arguable use), and most vendors guard proprietary signature sets closely (we have thousands, but all the good ones are customer confidential).

That being said, there are some paths forward here. Let's talk.


Certainly. I dropped an e-mail at the hyperscan account.

By the way, if you try out icgrep and use the -DumpASM option, you may notice a very unusual characteristic: the generated code is almost completely dominated by AVX2 instructions!


I did a brief performance comparison with icgrep here: https://github.com/BurntSushi/ripgrep/issues/63 --- Yes, it does a little better on things like \p{Greek} (and has even better Unicode support than ripgrep) but is missing literal optimizations, which are a ridiculously common case for search tools like this. (As you acknowledged.) Just about every single person who has come to me with a performance comparison of ripgrep has used a simple literal pattern, because that's what people search. You need to knock that case out of the park to be competitive. Even sift and pt do this to some extent.

Hyperscan is another top-notch regex implementation worth looking at too, for example. With that said, it does look like icgrep could be added to the single-file benchmark at least, but it is by far the hardest piece of software to actually get installed. (Your build instructions required me to compile llvm.)

> But your story is incomplete if you don't include comparison with icgrep

My story is incomplete for a very very large number of reasons. icgrep is only one of them. As I said in the blog post, the benchmarks are not only biased, but curated.

Of course, I agree icgrep is worth looking into. It's on my list to learn more about.

> It also does very well in cases involving alternations and ambiguity.

So does ripgrep. :-)

    $ ls -hl OpenSubtitles2016.raw.en
    -rw-r--r-- 1 andrew users 9.3G Sep 10 11:51 OpenSubtitles2016.raw.en
    
    $ time rg '\w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade' OpenSubtitles2016.raw.en | wc -l
    26464
    
    real    0m30.606s
    user    0m29.987s
    sys     0m0.567s
    
    $ time icgrep '\w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade' OpenSubtitles2016.raw.en | wc -l
    26464
    
    real    0m41.346s
    user    0m40.770s
    sys     0m0.513s
    
    $ time rg -i '\w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade' OpenSubtitles2016.raw.en | wc -l
    27370
    
    real    0m17.611s
    user    0m17.100s
    sys     0m0.510s
    
    $ time icgrep -i '\w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade' OpenSubtitles2016.raw.en | wc -l
    27370
    
    real    0m43.715s
    user    0m43.193s
    sys     0m0.523s
In the case insensitive search, ripgrep isn't actually doing any literal optimizations, but it is in the first case. Namely, it sees that ` ` is required in every alternate. Since ` ` is so common, this ends up slowing down the search. (This shows how literal optimizations aren't necessarily so simple, because it's easy to make a mistake like I've done and cause search to be slower.)

If we change ` ` to `\s+`, we can see ripgrep regain its performance precisely because it can't do any literal optimizations:

    $ time rg '\w+\s+Holmes|\w+\s+Watson|\w+\s+Adler|\w+\s+Moriarty|\w+\s+Lestrade' OpenSubtitles2016.raw.en | wc -l
    26464
    
    real    0m17.607s
    user    0m17.117s
    sys     0m0.490s
    
    $ time icgrep '\w+\s+Holmes|\w+\s+Watson|\w+\s+Adler|\w+\s+Moriarty|\w+\s+Lestrade' OpenSubtitles2016.raw.en | wc -l
    26464
    
    real    0m46.368s
    user    0m45.883s
    sys     0m0.483s
    
    $ time rg -i '\w+\s+Holmes|\w+\s+Watson|\w+\s+Adler|\w+\s+Moriarty|\w+\s+Lestrade' OpenSubtitles2016.raw.en | wc -l
    27370
    
    real    0m17.583s
    user    0m17.080s
    sys     0m0.493s
    
    $ time icgrep -i '\w+\s+Holmes|\w+\s+Watson|\w+\s+Adler|\w+\s+Moriarty|\w+\s+Lestrade' OpenSubtitles2016.raw.en | wc -l
    27370
    
    real    0m48.662s
    user    0m48.127s
    sys     0m0.503s
Nevertheless, ripgrep outperforms icgrep in every case.

Now, with all that said, I bet this is my bias showing. I'm confident you can find other cases (we don't already know about) where icgrep beats ripgrep. :-)

BTW, the corpus I used can be got here: http://opus.lingfil.uu.se/OpenSubtitles2016/mono/OpenSubtitl... (The one I used in my benchmark is a sample of this. For these benchmarks here, I used the full file.) The Cyrllic corpus is here: http://opus.lingfil.uu.se/OpenSubtitles2016/mono/OpenSubtitl...


Running the same tests on our test machine, I am seeing dramatically different performance results. Our test machine has AVX2, but this can't be the whole story.

cameron@cs-osl-10:~/ripgrep/datadir/subtitles$ time rg '\w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade' OpenSubtitles2016.raw.en | wc -l 26464

real 1m8.380s user 1m6.211s sys 0m2.006s cameron@cs-osl-10:~/ripgrep/datadir/subtitles$ time icgrep '\w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade' OpenSubtitles2016.raw.en | wc -l 26464

real 0m11.899s user 0m9.212s sys 0m2.125s cameron@cs-osl-10:~/ripgrep/datadir/subtitles$ time rg -i '\w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade' OpenSubtitles2016.raw.en | wc -l 27370

real 0m31.891s user 0m29.559s sys 0m2.119s

cameron@cs-osl-10:~/ripgrep/datadir/subtitles$ time icgrep -i '\w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade' OpenSubtitles2016.raw.en | wc -l 27370

real 0m14.359s user 0m11.765s sys 0m2.211s

Here is the processor info for (only showing the first processor). cameron@cs-osl-10:~/ripgrep/datadir/subtitles$ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 61 model name : Intel(R) Core(TM) i3-5010U CPU @ 2.10GHz stepping : 4 microcode : 0x16 cpu MHz : 2076.621 cache size : 3072 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 20 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap bogomips : 4189.93 clflush size : 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management:


Strange. I'm not sure how to explain the results either. Is there something I'm supposed to do to enable icgrep to use AVX2? I followed the build instructions in the README verbatim. Here's my cpu info (which is quite new and does have AVX2):

    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 79
    model name      : Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz
    stepping        : 1
    microcode       : 0xb00001d
    cpu MHz         : 1267.578
    cache size      : 20480 KB
    physical id     : 0
    siblings        : 16
    core id         : 0
    cpu cores       : 8
    apicid          : 0
    initial apicid  : 0
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 20
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
    bugs            :
    bogomips        : 6398.91
    clflush size    : 64
    cache_alignment : 64
    address sizes   : 46 bits physical, 48 bits virtual
    power management:


Are you using icgrep1.0? That may explain it.

My reports are from our current development version r5163.

  cameron@cs-osl-10:~/ripgrep/datadir/subtitles$ perf stat -e instructions:u,cycles:u,branch-misses:u icgrep1.0 -i -c '\w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade' OpenSubtitles2016.raw.en 
  27370
  Performance counter stats for 'icgrep1.0 -i -c \w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade OpenSubtitles2016.raw.en':
   252,725,532,395      instructions:u            #    2.75  insns per cycle        
    91,867,444,975      cycles:u                 
       283,661,301      branch-misses:u                                             
      46.570725331 seconds time elapsed

   
  cameron@cs-osl-10:~/ripgrep/datadir/subtitles$ perf stat -e instructions:u,cycles:u,branch-misses:u rg -i -c '\w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade' OpenSubtitles2016.raw.en 
  27370
  Performance counter stats for 'rg -i -c \w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade OpenSubtitles2016.raw.en':
    84,296,004,027      instructions:u            #    1.38  insns per cycle        
    61,298,903,577      cycles:u                 
           510,918      branch-misses:u                                             
      31.962195024 seconds time elapsed

  cameron@cs-osl-10:~/ripgrep/datadir/subtitles$ perf stat -e instructions:u,cycles:u,branch-misses:u icgrep -i -c '\w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade' OpenSubtitles2016.raw.en 
  27370
  Performance counter stats for 'icgrep -i -c \w+ Holmes|\w+ Watson|\w+ Adler|\w+ Moriarty|\w+ Lestrade OpenSubtitles2016.raw.en':
    42,064,581,840      instructions:u            #    1.94  insns per cycle        
    21,723,251,095      cycles:u                 
        47,953,756      branch-misses:u                                             
      13.301160493 seconds time elapsed


    [andrew@Cheetah icgrep-build] pwd
    /home/andrew/clones/icgrep1.0/icgrep-devel/icgrep-build
    [andrew@Cheetah icgrep-build] ./icgrep --version
    LLVM (http://llvm.org/):
      LLVM version 3.5.0svn
      Optimized build.
      Built Sep 24 2016 (11:27:32).
      Default target: x86_64-unknown-linux-gnu
      Host CPU: x86-64
If the file path is any indication, it looks like I'm using `icgrep1.0`. But the file path also has `icgrep-devel` in it. So I don't know. When I get a chance, I guess I'll try to figure out how to compile the devel version. (It seemed like that was what I was doing, by checking out the source, but maybe not.)


Yes, you have icgrep 1.0. The current development version has about the same build process, sorry that it is such a pain. It is available by svn checkout as follows.

  svn co http://parabix.costar.sfu.ca/svn/icGREP/icgrep-devel
AVX2 is autodetected and used if available and enabled by the operating system/hypervisor (Although icgrep1.0 can be compiled to use AVX2, it had some issues).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: