I have to disagree, and I can actually use Boyer-Moore as an example of this:
Split the string you're searching into four roughly equal pieces, with an overlap so that a match straddling a boundary is still guaranteed to be found. Then run Boyer-Moore on those pieces in parallel. E.g. if you're searching for FOO, you'd split the string like this (piece 1 ends just before the last O, and piece 2 starts a couple of characters earlier, so the full FOO still falls inside one piece):
piece 1:  ...7890abcdeFO
piece 2:              FOOghijk...
Now, what we know is that in the worst case this requires more operations: the O may be detected by the search in piece 1, which will then check whether the character in front of it is an O as well. The search in piece 2 will examine that same O again, so there's redundant computation going on. However, this is theoretically faster (and I'd guess it's practically faster for large files) even though it does more total work.
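To make the splitting concrete, here is a rough sketch of the idea (illustration only, not production code): it uses memmem() -- a GNU/BSD extension -- as the per-chunk matcher instead of a hand-rolled Boyer-Moore, and plain pthreads for the parallelism; the overlap arithmetic is the interesting part. Compile with -pthread.

/* Sketch: split a buffer into overlapping chunks and search them in
 * parallel.  Each chunk overlaps the next by strlen(pattern) - 1 bytes,
 * so a match straddling a chunk boundary is still found (by the later
 * chunk).  memmem() stands in for a real Boyer-Moore matcher. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NCHUNKS 4

struct job {
    const char *start;    /* beginning of this chunk                */
    size_t      len;      /* chunk length, including the overlap    */
    const char *pattern;
    const char *hit;      /* first match within this chunk, or NULL */
};

static void *search_chunk(void *arg)
{
    struct job *j = arg;
    j->hit = memmem(j->start, j->len, j->pattern, strlen(j->pattern));
    return NULL;
}

int main(void)
{
    const char haystack[] = "................7890abcdeFOOghijk................";
    const char *pattern = "FOO";
    size_t n = strlen(haystack), m = strlen(pattern);
    size_t base = n / NCHUNKS;

    pthread_t tid[NCHUNKS];
    struct job jobs[NCHUNKS];

    for (int i = 0; i < NCHUNKS; i++) {
        size_t off = (size_t)i * base;
        size_t len = (i == NCHUNKS - 1) ? n - off         /* last chunk runs to the end    */
                                        : base + (m - 1); /* others overlap by m - 1 bytes */
        if (off + len > n)
            len = n - off;
        jobs[i] = (struct job){ haystack + off, len, pattern, NULL };
        pthread_create(&tid[i], NULL, search_chunk, &jobs[i]);
    }

    for (int i = 0; i < NCHUNKS; i++) {
        pthread_join(tid[i], NULL);
        if (jobs[i].hit)
            printf("chunk %d: match at offset %td\n", i, jobs[i].hit - haystack);
    }
    return 0;
}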
Mozilla is not big because it's full of useless crap. Mozilla is big because your needs are big. Your needs are big because the Internet is big. There are lots of small, lean web browsers out there that, incidentally, do almost nothing useful.
- Roll your own unbuffered input using raw system calls.
A small toy project I wrote last year was a modification of GNU grep that did the opposite -- it aggressively prefetched ahead of the file it was currently reading. This helped performance dramatically on fragmented data (e.g. tons of small files).
For most typical greps (at least of files as opposed to standard input), "grep" is likely disk-bound, not CPU-bound.
(Note: I mean literal prefetch, not actually reading the file from disk. This is important because file input is a blocking operation in UNIX -- the thread blocks when read() is called and can only be resumed when the read is complete, unlike the case of output. This prevents the filesystem from reordering or merging multiple reads unless they come simultaneously from different threads. This is why reads are often slower than writes on typical data.)
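For the curious: one way to issue that kind of prefetch hint is posix_fadvise(POSIX_FADV_WILLNEED), which starts readahead in the background and returns immediately. A rough sketch of the general idea (simplified, not the actual hack):

/* While grepping file i, hint the kernel to start reading file i+1.
 * posix_fadvise(POSIX_FADV_WILLNEED) only initiates readahead and
 * returns right away; the pages land in the page cache, so the disk
 * works while we are still busy with the current file. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void process_file(const char *name, int fd)
{
    char buf[65536];
    ssize_t n;
    long total = 0;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* ...the actual pattern search over buf[0..n) would go here... */
        total += n;
    }
    printf("%s: %ld bytes scanned\n", name, total);
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        int fd = open(argv[i], O_RDONLY);
        if (fd < 0)
            continue;

        /* Ask for readahead on the next file before scanning this one. */
        if (i + 1 < argc) {
            int next = open(argv[i + 1], O_RDONLY);
            if (next >= 0) {
                posix_fadvise(next, 0, 0, POSIX_FADV_WILLNEED);
                close(next);  /* prefetched pages stay in the page cache */
            }
        }

        process_file(argv[i], fd);
        close(fd);
    }
    return 0;
}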
Regarding this bit: "Roll your own unbuffered input using raw system calls. / A small toy project I wrote last year was a modification of GNU grep that did the opposite -- it aggressively prefetched ahead of the file it was currently reading. "
I think you misunderstood. "Roll your own unbuffered input [....]" does not mean "do not read ahead"; it means "do not use stdio". Stdio buffers have a lot of overhead (relatively speaking) in support of features that grep does not need.
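As a rough illustration of the difference (my sketch, not grep's actual code): "rolling your own" just means read()ing straight into one big buffer of your choosing and searching it in bulk, with none of stdio's per-character machinery.

/* Raw, unbuffered input: one read() system call per block, into a
 * buffer whose size you control, searched in bulk (here: counting
 * newlines with memchr) rather than character by character. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ (256 * 1024)

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
        return 1;

    static char buf[BUFSZ];
    ssize_t n;
    long newlines = 0;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* Scan the whole block at once instead of one getc() at a time. */
        for (char *p = buf; (p = memchr(p, '\n', buf + n - p)) != NULL; p++)
            newlines++;
    }

    printf("%ld lines\n", newlines);
    close(fd);
    return 0;
}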
Wow, I didn't realise grep had a separate --mmap option. Interesting.
I don't think it would make much difference in the parent's case of many, small, fragmented files because if you're mmapping each file in turn and it's not cached, it still needs to be loaded from disk - it just happens in a page fault instead of the read() call.
Possibly if you mmapped all of the files and then used madvise() or something to prefetch in front of where you are in the list of files. Maybe grep does that, I don't know?
I guess the case where that technique would help is actually when you have a combination of (a) many files, (b) a computationally expensive pattern match (even just -i is a measurable hit) and (c) largeish files.
Because on many small files and a simple match, the disk I/O is still going to be the major component - even if you prefetch you still can't get around needing to load all the file contents from disk.
Possibly if you mmapped all of the files and then used madvise() or something to prefetch in front of where you are. Maybe grep does that, I don't know?
It doesn't do that, and that's the first method I used in my hack.
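For reference, a generic sketch of that mmap + madvise approach (heavily simplified, certainly not the actual patch; the pattern FOO is just a placeholder): map each file, and before scanning the current one, madvise() the next one so the kernel starts faulting its pages in.

/* Scan a list of files via mmap, asking the kernel to prefetch the
 * next file's pages (MADV_WILLNEED) while the current one is scanned. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct mapping { char *addr; size_t len; };

static struct mapping map_file(const char *path)
{
    struct mapping m = { NULL, 0 };
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return m;

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.addr = p;
            m.len = (size_t)st.st_size;
        }
    }
    close(fd);  /* the mapping stays valid after close() */
    return m;
}

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    struct mapping prefetched = map_file(argv[1]);

    for (int i = 1; i < argc; i++) {
        struct mapping cur = prefetched;

        /* Map the next file and hint the kernel to start paging it in. */
        prefetched = (i + 1 < argc) ? map_file(argv[i + 1])
                                    : (struct mapping){ NULL, 0 };
        if (prefetched.addr)
            madvise(prefetched.addr, prefetched.len, MADV_WILLNEED);

        if (cur.addr) {
            if (memmem(cur.addr, cur.len, "FOO", 3))
                printf("%s: match\n", argv[i]);
            munmap(cur.addr, cur.len);
        }
    }
    return 0;
}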
"mmap" is not a general solution for grep. It's nice when it can be used but grep must be able to grep files much to large to reasonably mmap and various kinds of streams.
Yes, but you would think that a reasonable default would be to use both, i.e. mmap by default, and then if the file is larger than some MAX (based on memory use and file size), use read.
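Something along these lines, say (the threshold is made up and the pattern is a placeholder; a real implementation would share the match loop):

/* Use mmap for small-enough regular files, fall back to a read() loop
 * for streams, devices, and files above an arbitrary size threshold. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MMAP_MAX (256L * 1024 * 1024)  /* arbitrary cut-off */

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0)
        return 1;

    int found = 0;

    if (S_ISREG(st.st_mode) && st.st_size > 0 && st.st_size <= MMAP_MAX) {
        /* Small-enough regular file: map it and search it in one go. */
        char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            found = memmem(p, (size_t)st.st_size, "FOO", 3) != NULL;
            munmap(p, (size_t)st.st_size);
        }
    } else {
        /* Stream, huge file, etc.: plain read() loop.  (For brevity this
         * ignores matches that straddle a buffer boundary.) */
        char buf[65536];
        ssize_t n;
        while (!found && (n = read(fd, buf, sizeof buf)) > 0)
            found = memmem(buf, (size_t)n, "FOO", 3) != NULL;
    }

    close(fd);
    printf(found ? "match\n" : "no match\n");
    return 0;
}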
Less than a year old. HN really needs to alert you to dupes when you submit... or does it? I know it doesn't allow you to post a dupe for a cool-off period after the original has been posted...
There is a dupe checker, but only for exact matches (and when one is found, instead of giving you a warning it automatically lodges your submission as an upvote of the existing link). The OP of this thread appended a # to the end of the URL to bypass the filter.
Thank you, although not everyone agrees. Many of the duplicate notifications get downvoted, although they sometimes get re-upvoted later. Here are two recent ones:
then HN needs to implement URL normalization and equivalence from RFC3986. These are fresh in my mind because I extended urllib in Python to support them last week.
Working out whether two URLs are equivalent isn't an easy problem, but the RFC goes some way towards solving it and picks up easy things like a trailing #.
Thanks to hashbangs (or really just the prevalence of Ajax-only sites), it's now a lot harder to tell whether two pages with #something in the URL are effectively the same.
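Dropping only an empty fragment is still safe, though -- a hashbang URL has a non-empty fragment. A minimal sketch of that kind of cheap normalization (illustration only; a real RFC 3986 normalizer also handles default ports, percent-encoding case, dot-segments and so on):

/* Normalize a URL in place, covering only the cheapest cases:
 * lowercase the scheme and host, and drop an *empty* fragment (a
 * trailing "#").  Hashbang fragments are non-empty and are left alone. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

static void normalize_url(char *url)
{
    char *p = url;

    /* Lowercase the scheme (everything before "://"). */
    char *scheme_end = strstr(url, "://");
    if (scheme_end) {
        for (; p < scheme_end; p++)
            *p = (char)tolower((unsigned char)*p);
        p = scheme_end + 3;
    }

    /* Lowercase the host (up to the next '/', '?', '#' or end). */
    for (; *p && *p != '/' && *p != '?' && *p != '#'; p++)
        *p = (char)tolower((unsigned char)*p);

    /* Drop an empty fragment: a '#' with nothing after it. */
    size_t len = strlen(url);
    if (len > 0 && url[len - 1] == '#')
        url[len - 1] = '\0';
}

int main(void)
{
    char a[] = "HTTP://Example.COM/story#";
    char b[] = "http://example.com/story";

    normalize_url(a);
    normalize_url(b);
    printf("%s\n%s\nequivalent: %s\n", a, b, strcmp(a, b) == 0 ? "yes" : "no");
    return 0;
}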
No cool-off period (although that is a good idea). I just submitted a good read and it had already been posted 616 days ago, so no one will see it.
Interestingly enough, awk is many times faster for an inverted grep (grep -v) than grep is. Get a large file and test yourself!
grep -v regex
awk '$0 !~ /regex/ {print}'
This is possibly largely because this optimization doesn't help in that case:
GNU grep AVOIDS BREAKING THE INPUT INTO LINES.
awk is very fast at breaking the input into lines (that's what it spends most of its day doing!). I still don't understand why grep -v is so much slower, though. (I was searching a 100+MB log file when I discovered this.)
This is discussed best in the opening chapter of O'Reilly's Beautiful Code: http://oreilly.com/catalog/9780596510046. Google Books has a readable snippet of the section at http://bit.ly/g34QRh, but I highly recommend buying the book because it is a great read.
For vi users, I recommend this plugin: https://github.com/mileszs/ack.vim. I like it better than simply replacing grep with ack. This has made me a lot more productive, as I can do complex searches MUCH faster!
The other day I was actually surprised at how slow GNU grep was. I wanted to count how many lines in a log file contained a given domain. grep -c <domain> <file> took 25 seconds on a 1.7GB log file, whilst agrep took only 5 seconds.
One thing to check: what is your current locale? GNU grep is much, much slower when it's multibyte-aware, even if you're searching for an ASCII string. Try repeating your test with LC_ALL=C grep (and don't forget to take the file cache into account).
You can check the current values of your locale with `locale`.
I'm still puzzled as to why it's so much slower with UTF-8, though. I guess GNU grep is naively converting back and forth between representations? There's nothing in UTF-8 that should prevent it from doing this efficiently. Even with complex patterns, it's possible to search through the file in about the same time as with a simple pattern. E.g. aspell can build an FSA where each transition is O(1), making the search time more or less independent of the search pattern.
I assume the problem is that you can't skip ahead X characters on multibyte data without parsing the bytes, because you don't know how big a character is. So you may have to read every byte.
But you can skip ahead without parsing all the bytes, since UTF-8 is self-synchronizing. The only scenario I can envision is that GNU grep wants to perform Unicode normalization, to catch equivalent codepoints, and that this is implemented inefficiently.
Edit: Granted, since I'm using -c, it has to look at all the bytes to find all the newlines.
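To illustrate what self-synchronizing buys you: continuation bytes are recognizable in isolation (they all look like 10xxxxxx), so after jumping to an arbitrary byte offset you can find the next character boundary by inspecting at most three bytes. A toy example (assumes valid UTF-8 input):

/* UTF-8 self-synchronization: skip to any byte offset, then step
 * forward past continuation bytes (pattern 10xxxxxx) to land on the
 * start of the next character. */
#include <stdio.h>
#include <string.h>

/* Return the offset of the first character boundary at or after pos. */
static size_t resync(const unsigned char *s, size_t len, size_t pos)
{
    while (pos < len && (s[pos] & 0xC0) == 0x80)  /* continuation byte? */
        pos++;
    return pos;
}

int main(void)
{
    /* "naïve café" in UTF-8: each accented character takes two bytes. */
    const unsigned char text[] = "na\xc3\xafve caf\xc3\xa9";
    size_t len = strlen((const char *)text);

    /* Jump blindly to byte 3, which is the middle of the "ï"... */
    size_t pos = resync(text, len, 3);

    /* ...and resync lands on the next whole character, the 'v'. */
    printf("resynchronized at byte %zu: '%c'\n", pos, text[pos]);
    return 0;
}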
What language could you possibly be using here? The first line looks like C, but then in C on most systems, open returns an int (a file descriptor). Maybe you wrote open() but you meant fopen()? An easy mistake, but then how are we to interpret the syntax in the 3rd line? f.read(...)? What is this I don't even.
> I don't know how many people's code I've optimized...
data analysis, i.e. reading in large files, but overall it's pretty fast anyway. Furthermore, you don't read the whole file into memory, so your program only uses a small amount of memory while running. I'd work with grad students who would read 1-2GB of data into a process to analyze it, and they'd wonder why their system slowed down.
The best thing about it is that the OS does the caching. So, say I analyze a file: I run a program with 4 arguments that maps the file to memory. The next time it runs, the file is still in memory, so it doesn't bother re-reading the original file. (Try doing a recursive grep on a directory, then try it again right after.)
Thanks! That leaves the question: when is mmap slower, or is it always faster? And why isn't mmap the default usage pattern if it's better in most cases (and easier to use to boot)?
Mike has only given part of the answer there. GNU grep obviously is very I/O tuned. And GNU grep optimizes the special cases of constant strings and of regexps that (more or less) start with a constant string -- but there's more!
GNU grep is also fast for most commonly encountered non-constant-string regexps (even those that don't start with a constant string) because its regexp engine avoids backtracking by doing an on-the-fly conversion (of many patterns) to a DFA. These extra cases are algorithmically neat and when you want them, you're glad they're there -- but they are less sexy in benchmarks because the most common use case in the real world, by far, is a search for a constant string.
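To illustrate the no-backtracking point: once the pattern has been turned into a DFA, the matcher consumes each input byte exactly once, with a single table lookup per byte, no matter how hairy the pattern was. Here is a toy, hand-built DFA for inputs containing "foo" (GNU grep builds its tables from the regexp on the fly; this hard-coded one is only meant to show the shape of the inner loop):

/* Tiny hand-built DFA recognizing inputs that contain "foo".  The
 * matching loop does one table lookup per input byte and never backs
 * up, which is the property that makes DFA matching fast. */
#include <stdio.h>

enum { START, SAW_F, SAW_FO, MATCHED, NSTATES };

static unsigned char delta[NSTATES][256];

static void build_table(void)
{
    for (int c = 0; c < 256; c++) {
        delta[START][c]   = START;
        delta[SAW_F][c]   = START;
        delta[SAW_FO][c]  = START;
        delta[MATCHED][c] = MATCHED;   /* accepting state is absorbing */
    }
    delta[START]['f']  = SAW_F;
    delta[SAW_F]['f']  = SAW_F;        /* "ff" still ends with an 'f' */
    delta[SAW_F]['o']  = SAW_FO;
    delta[SAW_FO]['f'] = SAW_F;
    delta[SAW_FO]['o'] = MATCHED;
}

static int contains_foo(const char *s)
{
    int state = START;
    for (; *s; s++)                               /* each byte read once */
        state = delta[state][(unsigned char)*s];  /* one lookup per byte */
    return state == MATCHED;
}

int main(void)
{
    build_table();
    printf("%d\n", contains_foo("ffofoox"));  /* prints 1 */
    printf("%d\n", contains_foo("fo of o"));  /* prints 0 */
    return 0;
}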
How far could you go in a discussion like this before BSD grep could reasonably be considered a derived work of GNU grep?
The discussion is going deeply into how GNU grep is implemented, so it's clearly not a clean-room reverse-engineering kind of situation. On the other hand, nothing being discussed could be the subject of copyright, as only ideas and algorithms are put forth and no code is shown. How careful do you have to be to be sure?
I'm assuming you're asking when BSD grep has to be released under the GPL? As the BSD license is slightly more restrictive than the GPL, I imagine that the GPL would have to be adopted if there was significant code usage, which there doesn't appear to be in this case.
The case where code is reused would make it clear-cut that you need to GPL it. You can relicense BSD code (without the advertisement clause) under the GPL, but not the other way around.
My doubt is at what point you would need to GPL it even though you haven't actually reused any code, just described the GPL'd code to someone who then reimplemented it in BSD grep. This case still seems pretty benign, as only general algorithms and approaches are being shared, but it made me consider where the line might be.
I'd be surprised if the GNU people claimed use of algorithms was sufficient to warrant the forced adoption of GPL, as that goes against everything they stand for. Rather, I suspect a direct transfer of code would be the minimum necessary condition for the FSF to raise a stink.
This case is clearly not over the line, but it seems to me that the minimum can't be direct use of code. If someone documented the architecture and algorithms used in the Linux kernel thoroughly enough and handed that off to someone else who reimplemented it, I assume it would be hard to argue the result wasn't a derived work.
When LWN publishes a description of how the VM or the scheduler works, they are effectively describing the code. But maybe that's the same as a research paper on algorithms, where reimplementation is clearly possible. Maybe for it to be a derived work you'd have to describe the code so exactly that you'd effectively be translating the whole thing into another form.
I anticipated that response, but was on my way out the door and had no time to clarify.
From the point of view of the FSF, the BSD license is more restrictive, as it grants fewer freedoms to users. From the BSD point of view, GPL grants fewer freedoms to developers, and is thus more restrictive.
In the context of the question asked, I was approaching it from the POV of the GPL's authors, given that the question was whether the BSD code would have to adopt the GPL license. I apologize for my inarticulateness.
This is the Ultimate Truth of optimizing computer programs, and it seems so few people understand it.
"Why can't you make Python faster!?" Because Python does a lot of stuff without you asking.