$ brew info grep
GNU grep, egrep and fgrep
/usr/local/Cellar/grep/3.3 (21 files, 885.3KB) *
Poured from bottle on 2019-01-03 at 12:23:37
$ time /usr/local/bin/ggrep "foobarbaz" application.log
$ time /usr/bin/grep "foobarbaz" application.log
There's nothing odd about the file: just straight, boring application logs, and each line starts with the logging level in capital letters. So I can use a regex anchored to the start of the line, which is about as optimal as you can get:
$ time /usr/local/bin/ggrep -c "^INFO" application.log
$ time /usr/bin/grep -c "^INFO" application.log
In my experience, this isn't the most pathological example, but it's close. Note that this also applies to other tools that leverage grep under the hood, like zgrep and friends.
I've specifically aliased grep to ggrep in my shell so that I avoid BSD grep whenever I can:
$ alias grep
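For reference, a sketch of what that setup might look like in a shell rc file (assuming Homebrew's grep formula, which installs the GNU binary with a g prefix):

```shell
# Sketch for ~/.zshrc or ~/.bashrc: prefer GNU grep when it is installed.
# Assumes Homebrew's "grep" formula, which provides the g-prefixed ggrep.
if command -v ggrep >/dev/null 2>&1; then
  alias grep='ggrep'
fi
```

After sourcing this, `alias grep` shows the substitution and interactive invocations pick up GNU grep transparently; scripts that call `/usr/bin/grep` by path are unaffected.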
Try something like tr:
time gtr a b < application.log > /dev/null
time tr a b < application.log > /dev/null
$ time gtr a b < http_10x.log > /dev/null
$ time tr a b < http_10x.log > /dev/null
$ shuf /var/log/syslog > shuf-syslog
$ time sort shuf-syslog > /dev/null
$ time LC_ALL=C sort shuf-syslog > /dev/null
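The gap comes from locale-aware collation: in a UTF-8 locale, sort compares strings using the locale's collation rules, while LC_ALL=C reduces each comparison to plain byte order. A tiny illustration of the behavioral difference, using a made-up file name:

```shell
# C-locale sort is a raw byte comparison, so all uppercase ASCII
# (0x41-0x5A) sorts before any lowercase ASCII (0x61-0x7A).
printf 'banana\nApple\napple\n' > /tmp/fruit.txt

LC_ALL=C sort /tmp/fruit.txt
# Apple
# apple
# banana
```

Locale-aware output typically interleaves cases instead ("Apple" next to "apple"), and computing those collation keys is where the extra CPU time goes.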
I once got annoyed with sort's speed, threw together a parallel external-sort program that dramatically outperformed it, then realized the difference was not as dramatic with LC_ALL=C... oh well, it was a fun afternoon project anyway.
I really appreciate this property of UTF-8, and I highly recommend doing a pen-and-paper exercise to see why it preserves order :-)
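A quick way to convince yourself without pen and paper: sorting UTF-8 text byte-wise (LC_ALL=C) gives the same order as sorting by Unicode code point. The characters below are A (U+0041, one byte), z (U+007A, one byte), é (U+00E9, two bytes) and € (U+20AC, three bytes):

```shell
# Byte-wise sort of UTF-8: higher code points encode with higher lead
# bytes, so raw byte order coincides with code-point order.
printf '€\né\nz\nA\n' | LC_ALL=C sort
# A
# z
# é
# €
```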
It wouldn't work if you want any kind of normalisation, though, especially handling composition/decomposition.
~ % which grep
~ % which sed
~ % which ls
Now, getting it to report line numbers will kill it, since you then have to scan every character.
Or, roughly: grepping for '^x' instead of 'x' should result in less work whenever the line does not start with x. One fewer comparison for every character after the first.
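Concretely, with a couple of made-up log lines:

```shell
# Anchored pattern: roughly speaking, a line can be rejected as soon
# as its first character is not 'I', without scanning the rest of it.
printf 'INFO start\nDEBUG cache miss\nINFO done\n' > /tmp/app.log

grep -c '^INFO' /tmp/app.log   # matches only at line start -> 2
grep -c 'INFO'  /tmp/app.log   # must also consider INFO mid-line
```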
Instead, treat the file as a collection of characters and just look for patterns. Which may include the line break character.
Does that make sense?
Edit: I should also add that you should look up burntsushi's posts. It turns out some of this has become out of date, and it largely depends on the size of what you are searching.
Way back when, the first thing I'd do on any non-GNU host was to install a full GNU userland. I could write a book on my issues with GNU tools, but they are on balance preferable to the alternatives. All IMHO, of course.
I'm uncertain whether it's the patent clause or the fact that Apple prevents you from modifying and running changed software (such as on the iPhone).
What's funny is that I think they are technically in violation of the GPL (their modifications to the GPLv2 bash are not entirely distributed).
Is there much else they could do? I don't see how Apple could reasonably ship GPLv3 code...
If anyone knows other tools where the GNU/Homebrew alternative is much better, please chime in (tr already came up in another comment).
In this case isn't the "already in RAM" test a more accurate reflection of performance anyway, as we are talking about the performance of grep and not the IO subsystem?
There are many cases where grep's performance won't be bottlenecked by IO, or at least not impacted directly by a bottleneck there: essentially anywhere the input comes from another program. Even if that upstream task is IO-bound, it might be sending output to grep much faster than it is consuming input (perhaps gunzip <file | grep searchtext).
And in the case of searching a log file interactively, it is common that you won't run grep over the file just once in a short space of time, instead running it a couple of times as you refine your search, so for most of the runs it will be in cache (assuming the file is not too large).
Nearly every SSD listed achieves well over 1GB/s in an actual benchmark, not just on a spec sheet. And these are just boring old off-the-shelf consumer drives. Nothing crazy.
So yeah maybe not over 400MB/s, but all of them are over 200MB/s. Sequential speeds really spiked as densities kept increasing.
Note that you're not going to get this with SATA SSDs, you need NVMe, it's a 5x difference in throughput and IOPS.
That's a very common use-case with grep. Either grepping a file you recently wrote, or running grep multiple times as you refine the regex, at which point the files will be in the FS cache.
It's also possible that the file is cached in memory (I ran grep a few times through the file before I carried out the specific measurements).
> #1 trick: GNU grep is fast because it AVOIDS LOOKING AT EVERY INPUT BYTE.
TFA is incredibly short, and will explain it much better than I can.
This would not help, since the backing storage doesn't provide that kind of resolution. It would end up reading the entire file anyway, unless your search string is on the order of an actual block.
As the other reply mentioned though, it's just that MacBook SSDs are that fast.
Somewhat confusing since it has to look at every byte to find the newlines. They are using a pretty specific definition of "look".
> Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!
I assume the Boyer Moore preprocessor reads a lot of bytes also.
Not disputing that it's more efficient, but there's no magic: it avoids reading some bytes when and if it can.
It can whenever you don't ask for line numbers, can't it?
> It probably looks for newlines after a match is found
Probably, yeah. Counting number of newlines in a range, without caring just where they fall, can probably be pretty darned fast with vector instructions. Idk if that's worth the additional pass, though.
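That deferred-counting idea can be sketched with standard tools: find the byte offset of the match first, then count newlines only in the bytes before it. The file name and pattern here are made up for illustration:

```shell
printf 'alpha\nbeta\ngamma\n' > /tmp/sample.txt

# grep -b -m1 reports the byte offset of the first match, then stops.
off=$(grep -b -m1 'gamma' /tmp/sample.txt | cut -d: -f1)

# Line number = newlines among the preceding bytes, plus one.
# This is the pass that can be deferred until a match is actually found.
lineno=$(( $(head -c "$off" /tmp/sample.txt | tr -cd '\n' | wc -c) + 1 ))
echo "$lineno"   # 3, matching what grep -n would report
```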
Another way of looking at it is just to consider the ^ another character (plus one-off special handling for the start of the file).
If we assume something like 100 MB/s sustained for spinning disks, that's a lot of disks to get to 2.41 GB/s even ignoring overheads.
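Back-of-the-envelope, in shell arithmetic (treating 2.41 GB/s as 2410 MB/s and ignoring RAID and filesystem overhead):

```shell
# 2410 MB/s target throughput / 100 MB/s per spinning disk
echo $(( 2410 / 100 ))   # 24 spindles, as a lower bound before overhead
```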
This test was hitting the OS disk cache.
Samsung's PM1725b (https://www.samsung.com/semiconductor/ssd/enterprise-ssd/MZP...) has a Seq. Read of 6300 MB/s and Seq. Write of 3300 MB/s.