
Sift: Grep on steroids - sciurus
https://sift-tool.org/
======
trymas
Can someone write a TL;DR of how this works?

I've read a bit about ag, which is rather fast compared to anything else in
the wild, and it seems that it uses every trick available to make searching as
fast as possible. Yet sift claims to be up to 8 times as fast. How?

EDIT: also I remember that the last time this was posted here (or on
/r/programming), the benchmark was considered inaccurate. A few checks on a
few folders (from big to small) showed sift running 5 to 20 times slower than
ag (though probably still faster than grep and/or ack).

EDIT2:
[https://github.com/svent/sift/issues/7](https://github.com/svent/sift/issues/7)
\- about benchmarks. Maybe the numbers are accurate, but they were obtained
under rather specific conditions, with specific cases where sift can plausibly
be faster. In general development use it will be rather slow.

~~~
snordlast
It's basically equivalent to:

    
    
      $ export LC_ALL=C; grep ...
    

The regular grep works on Unicode, which can be 10 times slower. Sift works on
plain bytes. Also, sift's benchmark is somewhat inflated. Test it on your own
data.
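A rough Python illustration of the bytes-vs-Unicode distinction (illustrative only, not what grep or sift actually execute internally):

```python
import re

text = ("filler " * 100 + "needle\n") * 50
blob = text.encode()

# Byte-oriented search: a straight scan over raw bytes, no decoding,
# no locale tables -- roughly the LC_ALL=C fast path.
assert blob.find(b"needle") != -1

# Unicode-aware, case-insensitive matching has to decode the input and
# consult case-folding tables, which is where locale-aware grep spends
# its extra time.
assert re.search(r"NEEDLE", text, re.IGNORECASE) is not None
```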

~~~
agumonkey
There was an article about Haskell vs C where Haskell appeared slow at first
because its string type was Unicode while C's was ASCII. When the C code was
made to use Unicode, it was suddenly a bit slower than Haskell's. Details
matter.

~~~
hawkice
To be perhaps even more fair, Haskell's String type is actually a singly
linked list of 32-bit Unicode characters, so you'd want C to use Unicode and
Haskell to use ByteString (or Text).

------
ggreer
I'm the author of ag. Sift has been mentioned to me a few times, and I finally
felt the need to comment on it.

Sift is a neat tool and I hope people get a lot of use out of it. That said, I
don't think the chosen benchmarks are indicative of real-world performance.
For example, I just downloaded the latest Linux kernel source, built it, and
ran sift and ag on it. This is on a Core i7 4770K with 32GB of RAM and an
Intel 330 series SSD. Times are medians of 5 runs each. FS cache was hot.

    
    
        ggreer@lithium:~/Downloads/linux-4.3% du -sh                                      
        12G .
        ggreer@lithium:~/Downloads/linux-4.3% time ~/code/sift/sift -n --group line_number
        ...
        ~/code/sift/sift -n --group line_number  10.03s user 1.59s system 625% cpu 1.856 total
        ggreer@lithium:~/Downloads/linux-4.3% time ag line_number
        ...
        ag line_number  0.68s user 1.03s system 459% cpu 0.371 total
    

0.37s vs 1.86s. You can see the output of both commands in my gist[1]. Ag
obeys .gitignore/hgignore/etc by default, so it shows fewer results.[2] Since
sift can't search files with more than 2^18 characters between newlines, it
prints out quite a few error messages. This typically happens with binary
files and minified css/js.

You may be wondering why I added command line options to sift. They enable
grouping and line numbers, making the output similar to ag's. This may seem
cosmetic, but it actually affects performance: displaying line numbers means
you have to count newlines, which requires reading each matching file in its
entirety.
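A quick Python sketch of that cost (illustrative only, not ag's actual C code): to report a line number for a match, you have to scan every byte that precedes it.

```python
def line_of(data: bytes, offset: int) -> int:
    # 1-based line number of a match at byte `offset`: everything before
    # the match has to be read so the newlines can be added up.
    return data[:offset].count(b"\n") + 1

blob = b"foo\nbar\nline_number here\nbaz\n"
off = blob.find(b"line_number")
assert line_of(blob, off) == 3
```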

One last bit of benchmarking: If I tell ag to do an unrestricted search
(meaning both ag and sift search the same files) and I tell sift to format its
output like ag (meaning both count line numbers), the results are much closer.
Sift takes 1.85 seconds. Ag takes 1.60 seconds. But unlike sift, ag doesn't
bail on big files without newlines. It dutifully reports all the matching .o
files.

Really though, sift and ag are trying to solve different problems. I designed
ag to be a code searching tool. From what I can tell, sift is meant to be more
of a grep replacement. Instead of worrying so much about which one is faster,
use the right tool for the job.

1\.
[https://gist.github.com/ggreer/96ed22bab57d1e791b9d](https://gist.github.com/ggreer/96ed22bab57d1e791b9d)

2\. I consider this a feature. The goal of ag is to find what you're looking
for, not to show everything remotely related to what you're looking for. If
the default search doesn't find anything, you can quickly add the "-a" option.

~~~
Touche
I love ag and use it daily. One problem I constantly run into is searching for
something and getting minified JavaScript in the results, which is unreadable
and eats up all the scrollable space in my terminal. I'd like to ignore all
files with super-long lines in them; any options for this sort of problem?

~~~
z1mm32m4n
Wow, I had never thought about that. I've run into issues with minified
JavaScript files as well, but never really thought about how much it annoyed
me.

To "ignore" super-long lines, we can just use a Unix filter:

    
    
        ag --color --group <pattern> | cut -c -100
    

The --color and --group flags force ag not to change its output just because
we're piping to something that isn't a tty. The cut command simply clamps all
lines at 100 characters; feel free to change this length.

This is pretty long, so we could write a wrapper function or something:

    
    
        agc() {
          ag --color --group "$@" | cut -c -100
        }
    

Stick it in your zshrc and you're good to go.

~~~
Touche
Perfect, thank you.

------
pzone
I'm not going to jump until I see some sort of technical explanation of what
Sift does that Ag is missing. It just seems implausible.

As for the additional features, I am not looking forward to memorizing more
command line crap. Anyone want to write an Emacs/helm package?

~~~
jedisct1
Plus, typing sift is twice as long as ag.

~~~
pmoriarty

      alias s=sift
      alias a=ag
    

Now they are even.

------
Albright
The "Conditions" described on the "Samples" page look promising - it looks
like you can search for a pattern that is within X lines of another pattern.
That's useful for finding lines used only in certain contexts. AFAIK, there's
no way to do that with ag.

[https://sift-tool.org/samples](https://sift-tool.org/samples)

~~~
mturmon
The samples page you cite is helpful.

The other two affordances that caught my attention on that page are (optional)
multiline captures, e.g. for XML entities that might span lines, and the
--replace modifier, which can transform a captured bit of text, simplifying
the

    
    
      grep var= FILE | sed 's/var=\(.*\);/\1/'
    

usage pattern.
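For comparison, the capture-and-transform that pipeline performs is a single regex substitution; in Python terms (sift's actual --replace syntax may differ):

```python
import re

line = "var=hello;"
# What the grep | sed pipeline above extracts: the captured group.
assert re.sub(r"var=(.*);", r"\1", line) == "hello"
```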

------
searchfaster
Nope, definitely not as fast as ag in my tests on my source tree.

~~~
binarycrusader
Some data would be nice so that we can see the difference.

------
Nadya
I see a dash of black magic was used... I'm extremely curious what
optimizations and algorithms were used for such performance gains.

 _> This takes about 1 second - sift processed 40 million records / 1 GB of
data in just one second._

------
StavrosK
I see everyone here using ag, does anyone know how it compares to ack-grep? I
use the latter, but I tried ag and it was pretty much identical, as far as I
could tell?

~~~
has2k1
ag is meant to be functionally identical. It's just a lot faster.

~~~
StavrosK
Ah, thank you.

------
lqdc13
OK, just tried it and it's not great for all occasions.

E.g. I have a dir with 2 million files with an average size of about 3
kilobytes.

    
    
        ls -f /data/ | wc -l
        2183060
    
    

So I try to find a pattern:

    
    
        time find "/data/" -name "*txt" -type f -print0 | LC_ALL=C xargs -0 -P 6 -n 40 fgrep -iq "let the right one in"
        
        real	0m3.755s
        user	0m3.085s
        sys	        0m17.764s
    

Then try sift with same command:

    
    
        time find "/data/" -name "*txt" -type f -print0 | LC_ALL=C xargs -0 -P 6 -n 40 sift -iq "let the right one in"
    
        real	 0m35.956s
        user	 0m55.170s
        sys	        3m32.120s
    
    

Or try sift by itself:

    
    
        time sift -rq --files="*txt" "let the right one in" /data/
    

And it is:

    
    
        real	0m9.298s
        user	0m29.620s
        sys	        0m12.371s
    
    

Sticking with grep for now.

------
NelsonMinar
The benchmarks claim it searched 35GB of data in 0.6s. I must not be
understanding what that means, because that's nearly 60 GB/second. The web
page notes everything's cached, so there's no disk I/O. But isn't 60 GB/second
about the bandwidth of consumer RAM?

~~~
eternauta3k
You don't have to look at all the data: for some search patterns, if a prefix
doesn't match you can skip ahead.
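The classic way to skip ahead is a Boyer-Moore-style bad-character table. A minimal Horspool sketch in Python (illustrative only; whether sift's matcher does exactly this is not something I'm claiming):

```python
def horspool(haystack: bytes, needle: bytes) -> int:
    # Bad-character shift table built from all but the last needle byte:
    # on a mismatch, jump ahead by up to len(needle) bytes instead of one.
    m = len(needle)
    shift = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    i = 0
    while i + m <= len(haystack):
        if haystack[i:i + m] == needle:
            return i
        # Shift based on the byte under the window's last position.
        i += shift.get(haystack[i + m - 1], m)
    return -1

text = b"the quick brown fox jumps over the lazy dog"
assert horspool(text, b"lazy") == text.find(b"lazy")
assert horspool(text, b"cat") == -1
```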

~~~
tedunangst
The pattern only had a nine-byte prefix; that's not a lot of skipping. And
it's far less than a cache line.

~~~
glandium
And the only way to read 35GB of data at 60GB/s is to have it all in RAM in
the first place. That's a lot of RAM to have.

------
SlyShy
Curious. The table on this page
([https://sift-tool.org/info](https://sift-tool.org/info)) indicates it
performs even faster than ag. That blows my mind a little because ag already
feels so fast.

------
nattaylor
Sift was 6x slower on my first test:

    
    
        time sift -z ',R,' 20151104-00.gz | wc -l
        210740
        
        real    0m43.725s
        user    1m9.984s
        sys     0m10.695s
        
        time zgrep ',R,' 20151104-00.gz | wc -l
        210740
        
        real    0m7.262s
        user    0m7.509s
        sys     0m1.395s
    

~~~
etep
Did you drop caches in between tests (or alternately, did you warm up the
cache for both)?

------
andrewchambers
Monster function in the implementation
[https://github.com/svent/sift/blob/master/matching.go#L28](https://github.com/svent/sift/blob/master/matching.go#L28)

~~~
nemo1618
Not to mention it intentionally ignores errors returned by os.Open. Dear lord.
Go shoves those errors in your face for a reason.

------
clumsysmurf
A nice feature is easily being able to find a match preceded and / or followed
by something else within N lines. I always found that cumbersome with other
tools.

Great for looking through Android code, when you want to find API call X used
along with API call Y.

------
mjcohen
I got a 404 on the language showdown.

------
kylek
Off-topic but...

> No installation. just download a single file.

Is it really so hard for people to provide .rpm/.deb these days? Are man pages
extinct as well?

~~~
Spivak
I think most developers, mostly correctly, assume that if their software is
good and properly licensed, it'll make its way into distros' repositories on
its own.

Having a simple build process is more helpful to packagers than something
already built since distros are just going to repackage it anyway. Also, few
developers are intimately familiar with their OS's packaging standards so
they're just asking for obscure incompatibilities.

Man pages though -- for the love of god we need to keep them alive. No they're
not fun to write but they're perhaps the most useful tool in the Linux
ecosystem. I write them for all of my projects but I assume I'm in the
minority at this point.

~~~
eru
The manpage format isn't even all that great. Their main benefit is that they
are available at your fingertips in a common format at the console.

~~~
Spivak
You're absolutely right, and the syntax is even worse, but the real issue is
that we have a standards problem. There's incredible value in having a
standard format for technical documentation which is small, concise, easily
accessible to everyone, and most importantly has the same layout. If we
abandon man pages then every project will ship a different incompatible
documentation system which would make the Linux ecosystem overall worse off.

More and more documentation is being pushed to the web, and/or the offline
version is just a dump of the website. That's not _bad_, but everyone has a
different interpretation of what good documentation looks like, it requires a
relatively modern web browser to view, users have to relearn how each
project's docs are organized, and projects adopt the mentality that complete
documentation is something to update every now and then rather than part of
each release.

~~~
eru
Oh, I like man pages, and have written some for my own little projects. I wish
we had more man pages for our internal tools at Google.

GNU's `info' was supposed to be the replacement. But I seldom look at its
documentation.

------
smegel
Everyone who had a gut feeling it was written in Go gets a dollar.

------
meshko
Only search in HTML files: sift -x html pattern

Not a Unix tool, I see.

~~~
eru
You don't like GNU grep either?
[https://stackoverflow.com/questions/221921/use-grep-exclude-include-syntax-to-not-grep-through-certain-files](https://stackoverflow.com/questions/221921/use-grep-exclude-include-syntax-to-not-grep-through-certain-files)

~~~
meshko
There is no such thing as a file extension; it's just a special case of a
file-matching pattern.
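Put another way, an "extension" filter is just a glob; a minimal Python illustration:

```python
from fnmatch import fnmatch

files = ["index.html", "app.js", "notes.html.bak"]
# "Only HTML files" is the glob *.html -- a file-matching pattern,
# nothing more special than that.
assert [f for f in files if fnmatch(f, "*.html")] == ["index.html"]
```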

~~~
eru
So what? If you want to follow the `do one thing well' philosophy, grep (and
similar tools) shouldn't even support matching more than one file. That's what
find and xargs are for.

