I tested string searching algorithms a few years ago[0][1], and brute force was surprisingly performant, enough that picking a string searching algorithm becomes an engineering tradeoff. It found a sentence fragment at the end of Moby Dick within 8ms, 7 times slower than Boyer-Moore. But most of us aren't searching Moby Dick; we're searching an HTTP header, some user input, or a paragraph of a document. Even long-winded users won't write Moby Dick into your <textarea>. Most languages and libraries use brute force for this reason - initializing and using Boyer-Moore may be slower than brute-forcing the text. Sometimes "practically nothing" is just a brute-force search.
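For reference, here's roughly what that brute-force search looks like in C (a sketch, not the exact code from my posts):

    #include <stddef.h>
    #include <string.h>

    /* Brute force in its entirety: slide the needle along one position
       at a time. No setup, no tables - on a short haystack this can
       finish before Boyer-Moore has even filled its skip table. */
    const char *brute_search(const char *hay, size_t n,
                             const char *needle, size_t m)
    {
        if (m == 0 || m > n) return NULL;
        for (size_t i = 0; i + m <= n; i++)
            if (memcmp(hay + i, needle, m) == 0)
                return hay + i;
        return NULL;
    }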
But Boyer-Moore is the perfect choice for Grep! The likely inputs on Unix are all huge: log files, entire directory trees, output pipes from loud programs, etc. The cost of initializing a small skip table is overwhelmed by the cost of I/O and the potential volume of text. It's not surprising that they've gone to some lengths to optimize the core inner loops and the I/O in that context.
> Brute Force was surprisingly performant, enough that picking a string searching algorithm becomes an engineering tradeoff. It found a sentence fragment at the end of Moby Dick within 8ms, 7 times slower than Boyer-Moore.
I think that line of thinking is actually symptomatic of producing the kind of software that eats up our present-day powerhouses and makes them dog slow.
> "Even long-winded users won't write Moby Dick into your <textarea>"
Which in most cases is already much longer than the threshold below which the setup cost of BM exceeds the gain, assuming the match is found on average halfway into the text.
Typically you hit that point when the 'haystack' is about 2,000 characters and the 'needle' is longer than about 4 to 5 characters; longer 'needles' or longer 'haystacks' increase the advantage.
So the HTTP header is probably one situation where you'd be quicker using brute force, but in the other two instances it is quite possible that BM is already faster.
This assumes that you are going to the trouble of initializing your skip table once and re-using it. Now you have state to maintain beyond the life of the function call, or else your performance is slower than brute-force. That, in addition to probably needing to write the function in the first place, means you've got all sorts of bugs to find and fix.
And anyway, what are you doing searching HTTP headers in anything more than a one-off script? More likely, you are parsing the whole header and sticking it in a hash table. So, not only aren't you searching, but even if you were, that's not the hard part. And even that is dwarfed by the application that's going to service the HTTP request. (Unless you are Google, in which case you don't need my advice.)
Searching HTTP headers is not your bottleneck. Use your language's built in string search. Premature optimization makes code slower.
I really haven't seen this problem, like, ever. All coders are vastly more likely to preoptimise the hell out of everything, and we end up with the opposite problem - thousands of hours of wasted programmer time won't even reach 1 saved hour of user time. Bear in mind as well that the actual speed of execution isn't always the main cause of perceived slowness - if something runs 10 times as slowly, but runs in the background and never causes the user to wait, it's actually running infinitely faster, from the user's perspective.
A much better solution is to write things in the easiest way for coders to change - that way, when something is found to be the actual cause of slowness, anyone can easily go in and optimise it or move it to a background thread. Optimising EVERYTHING in the hopes of obtaining speed is a fool's errand - due to the 90/10 rule, 90% of the code you optimise will never be the bottleneck.
I'm not arguing for anything, I'm stating a fact. Users choose more but slower features or in web app terms business people choose more but slower features and throw more hardware at it.
I'd prefer it your way too. One other thing, we're assuming good programmers and that's not something I'd bet on at most places.
Or rather, compared to a server log file that hasn't been properly rotated. I wrote a program a while back to incrementally parse a daily log file which was never rotated. It got to the point where it would take twenty minutes for the program to just skip the lines that had been parsed previously. When those who had the power to do so started rotating the file, things sped up tremendously.
Barring that, why not use fseek() and do a binary search to get to where you wanted to go? Even better, save the prior offset somewhere and jump there immediately on T+1?
That, I should have done. I thought to save the number of lines previously read so that I could skip them; I don't know why it didn't occur to me to just save the whole damn offset.
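A minimal sketch of what that looks like (the paths are made up for illustration):

    #include <stdio.h>

    #define STATE_FILE "/var/tmp/parser.offset"  /* made-up path */
    #define LOG_FILE   "/var/log/daily.log"      /* made-up path */

    int main(void)
    {
        long offset = 0;

        /* Recover where the previous run stopped, if we saved it. */
        FILE *state = fopen(STATE_FILE, "r");
        if (state) {
            if (fscanf(state, "%ld", &offset) != 1) offset = 0;
            fclose(state);
        }

        FILE *log = fopen(LOG_FILE, "r");
        if (!log) return 1;
        fseek(log, offset, SEEK_SET);  /* jump straight past parsed lines */

        char line[4096];
        while (fgets(line, sizeof line, log)) {
            /* ... parse one line ... */
        }

        /* Save the whole damn offset for the next run (T+1). */
        offset = ftell(log);
        fclose(log);
        state = fopen(STATE_FILE, "w");
        if (state) { fprintf(state, "%ld\n", offset); fclose(state); }
        return 0;
    }

A robust version would also notice when the file shrank (i.e. finally got rotated) and reset the offset to zero.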
I used this trick the other day -- compute byte range partitions on a large file and use fseek when processing each. Is there a standard unix program that can output an arbitrary byte range from a seekable input? At first blush dd(1) looked like the ticket, but block-oriented operation means extra invocations to deal with a byte range that's not necessarily block-aligned.
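(dd with bs=1 does it in one invocation, but copies a byte at a time.) Lacking a standard tool, a few lines of C with fseek()/fread() sidestep the alignment issue entirely; a sketch, with argument handling kept minimal:

    #include <stdio.h>
    #include <stdlib.h>

    /* Usage: byterange FILE OFFSET LEN
       Writes LEN bytes starting at byte OFFSET of FILE to stdout. */
    int main(int argc, char **argv)
    {
        if (argc != 4) return 1;
        FILE *f = fopen(argv[1], "rb");
        if (!f) return 1;

        long offset = atol(argv[2]);
        long len = atol(argv[3]);
        if (fseek(f, offset, SEEK_SET) != 0) return 1;

        char buf[1 << 16];
        while (len > 0) {
            size_t want = len < (long)sizeof buf ? (size_t)len : sizeof buf;
            size_t got = fread(buf, 1, want, f);
            if (got == 0) break;            /* EOF or read error */
            fwrite(buf, 1, got, stdout);
            len -= (long)got;
        }
        fclose(f);
        return 0;
    }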
I thought this sort of thing was common knowledge? Searching and sorting a certain number of items under a threshold is fastest through brute force. The problem is figuring out what that threshold might be and then have the sorting and searching programs make use of it.
I remember this topic being discussed back in undergrad.
>> initializing and using Boyer-Moore may be slower than brute-forcing the text. Sometimes "practically nothing" is just a brute-force search.
> and you are basing this claim on…?
Think of it as an up-front cost that you expend before doing any actual work. So BM is slower in some cases, and faster in others. In most short cases the cost of setting up outweighs the advantages of the more complex algorithm and then you're better off to brute-force it.
The Moby-Dick example, however, is not such a case.
Just like it takes longer to prepare an offset press to make a single copy of a page compared to just slapping it in to the photo-copier. But if you need a million copies, you can't beat (rotary) offset.
edit: I've reconstructed the original comment this replied to because I'm not happy about the recent trend in comment deletions (see below); that's the fourth time in a few days that I've come across this. To the author of the parent of this comment (you know who you are), why did you delete your comment after receiving 3 serious replies?
Because he was probably downmodded (appropriately) for his comment, and considered it an embarrassment.
I'm not fond of such deletions, but I think it's good for people to be able to retract their statements when they realize they don't reflect well on themselves.
Perhaps a better solution would be to allow the original comment to be "retracted" such that the author information is removed but the comment remains, so people can still understand the thread of the conversation.
Retracting just the owner means you basically allow anonymous posting (except the mods still know who wrote it, and anyone who read it before the username got wiped). That opens up a whole new can of worms.
The fact that computers use instructions in our universe?
Seriously, if you don't understand what he's basing his claim on, it's because you don't understand the Boyer-Moore algorithm. If you know how Boyer-Moore works, the reason is obvious. Instead of snarkily replying, why don't you take the same amount of time to read the Wikipedia article on Boyer-Moore and see why brute force may be faster in some cases?
(Since you've already shown yourself to be lazy, however, I'll explain it: Boyer-Moore precomputes lookup tables from the "needle" you're looking for - a bad-character table the size of the alphabet, plus a good-suffix table the size of the needle; if your haystack is smaller than twice your alphabet size, and it frequently is, then Boyer-Moore is practically guaranteed to take more time than brute force.)
Searching for "b" in "ab". Just two comparisons, compared to allocating, zeroing, and initializing a lookup table, and then doing a brute force search anyways. There's a fuzzy point where one becomes better than the other. Implementations written to be fast may brute-force length=1 strings, and implementations written for simplicity may not.
Boyer-Moore is one of the examples that made me realize clearly that on the larger scale of programmer competence I'm nobody special. Some algorithms show such out-of-the-box thinking that it blows your mind.
The most interesting thing is that most pattern matching algorithms up to then got slower with longer match strings, but Boyer-Moore actually got faster!
To quote Majikthise: "Bloody hell, now that is what I call thinking."...
> The most interesting thing is that most pattern matching algorithms up to then got slower with longer match strings, but Boyer-Moore actually got faster!
Another interesting bit of trivia: In the first chapter of my thesis I present a string matching algorithm with almost exactly the same asymptotic running time as BM -- but where BM performs exact matching using no precomputed index, my algorithm performs matching with mismatches using an index.
With an index, of course, exact matching is O(log N) time -- in a peculiar way, the "cost" of inexact matching is one index worth of efficiency.
[EDIT: On second thought, is this last comment meaningful at all? I'm not sure, but it's almost 4AM so I'm not going to figure it out now.]
One thing to remember, is that almost all of the unix utility programs were developed for computers that had FAR less horsepower than even a modest desktop PC of today. In the early days, literally every byte of RAM was precious as was every CPU instruction. So in that sense, necessity was the mother of invention, and developers were accustomed to working with that mindset.
Most of UNIX's utilities were written with high regard for their memory use. When GNU decided to clone them all, they actually stipulated that their developers were forbidden from even reviewing UNIX source code. Convention had them write the programs with speed in mind instead, and now we have these discussions.
I don't think you should optimize for space or speed at each other's expense without figuring out if the trade-off is worth it.
Space and speed will always be conflicting goals during optimizations (unless you're very lucky), and for practical use the sweet spot is usually somewhere in between the two extremes.
To blindly optimize for speed will result in very wasteful behaviour when it comes to memory usage; to blindly optimize for space will result in terrible performance.
Smart software realizes when space is available and free for the taking and will use it to gain speed; it will also realize when space is at a premium and economize on its usage.
On another note, accessing all that memory costs cycles and is almost certainly going to trash your cache; chances are that if you manage to reduce your memory footprint, eventually you'll find that instead of losing speed you're gaining speed.
Open Office, started up with a blank word processor document, uses 83 megs of RAM; something tells me that could be a whole lot less without sacrificing functionality or speed.
I agree completely that one should not optimize for one area without taking in account the penalties incurred on the other areas. I hope I didn't give that impression.
One of the major reasons GNU developed the habit of writing utilities that were significantly different in their implementation was to make sure no UNIX/BSD code could leak in: at the time the project was underway, BSD was stuck in that whole USL vs BSDi suit, and BSD's code was tentatively deemed 'non-free'.
One also needs to account for the time period these utilities were written, and for what computer system they were originally written for. Both memory and speed were expensive. Writing these utilities probably required much more thought to the smaller details than one would think about today when writing say a word processor, even though one could benefit by doing so.
Having lots of memory can serve as a poor-man's SSD. At least on a Linux system files are read from the disk and left in memory until that memory is needed for something else. This can increase responsiveness by reducing the need to go to disk for frequently used filesystem data.
The stable version of Chrome leaks memory like crazy on Linux. I was frequently using over 1gb according to about:memory and have seen it use up to 3gb. 6.0.495.0 dev seems to be better about it.
Adobe After Effects is a good example of a fairly standard app that will happily chew through 8-12 gigs of RAM when working on moderately complicated videos.
I also know people who work with fluid dynamics and the new workstations they bought for work have 48 gigs of RAM.
Think about what Adobe After Effects is doing, though: non-linear video editing. A bit more than a decade ago, you couldn't do that outside of Hollywood. A bit more than two decades ago, and even Hollywood couldn't do it - everyone still used celluloid.
I'm not sure that either video editing or fluid dynamics count as bloatware in my book, because the problems are inherently complex. Word processing, however...
I did non-linear video editing on my PC in the late 90s. There were certainly some limitations since the AVI format was stupid (used a signed integer for an index limiting you to 2GB file size), but my machine then was a 133MHz pentium with probably 32 or 64MB of RAM. Granted I was only working with half-height VHS (640x240), but it worked.
I haven't tried a lot of video editing software today (since I no longer pirate software like I used to), but what I have used is not huge progress from what I did in the late 90s. The only big step is that you can work with compressed streams directly, which is nice, but expected since even a modest machine today is expected to be able to decompress the latest MPEG spec in realtime.
What are you putting in those tabs? I frequently have 50+ tabs open in Chrome and rarely get to more than about 1GB total. Sometimes some long running web apps will leak enough (Gmail, I'm looking at you...) to bring a single tab up in the 100-200MB+ range, but that's rare.
Thanks, I didn't know that algorithm and your comment made me look it up.
For anyone else who skimmed the article and doesn't know the algorithm, check it out at http://en.wikipedia.org/wiki/Boyer-Moore. It's a very quick read to get the basic idea, and it's as good as jacquesm cracks it up to be.
Actually I posted this article yesterday (http://news.ycombinator.com/item?id=1624402) but it went unnoticed. The author was working on a search feature for his hex editor and tried to outperform GNU grep by using Boyer-Moore. After multiple attempts, he finally discovered the hard way all the tricks used by Mike Haertel to make GNU grep run faster than his implementation.
I read that one years ago; thanks for the repost. I missed it on the submissions, they go by so fast now. I think we're too late to bring it up though, it would need quite a large number of upvotes to come back after dropping down this far.
They're about the same amount of trippiness. Boyer-Moore makes the same kind of jump table, except Boyer-Moore starts comparing at the end of the search string.
BM is my favourite algorithm, but I haven't seen any others that are as elegant and powerful. What are the other examples you're thinking of?
(BM is especially cute because the basic idea is so simple: do it backwards. Quicksort is very clever. Arithmetic coding is mindbendingly cool, but the algorithm isn't simple enough - for this old brain anyway.)
Levenshtein, DCTs, Wavelets, Phong shading, flood fill and so on. There are so many really insightful algorithms.
But BM stands out for me because it is taking the opposite approach where that was entirely non-obvious in spite of lots of people having looked at that problem for a very long time (and plenty of those people were anything but stupid).
For the record, I've written an operating system (multi-tasking, multi-user, message passing), a window manager to go with it, a bunch of applications to go with the window manager and a whole pile of commercial software besides that.
And in spite of all that I think I can make the above statement in confidence, that algorithm has something very unique and it exhibits a level of thinking that goes beyond the ordinary world of programming.
> I'd say thinking up Boyer-Moore is nothing special. It's just a logical progression of ideas. If you think about the problem long enough, you would come up with the same solution, or maybe even a better one.
What's stopping you?
I'm sure we'd love to see the Palish algorithm that beats BM by a significant margin and that does not build upon it.
A complete side-note, but it's hard to figure out where else to put this.
HN should not allow you to delete a post once it has a child comment. I find that people sometimes delete entire stacks of comments that have children (as in this case) when they don't seem to like how the argument is going. (This is not super-common, but it happens often enough that I've noticed it.) It's enormously frustrating for readers - not to mention childish on the part of the deleters.
Now, as for the rest of you kids, get off my lawn.
It seems to be something that has been 'figured out' and is now spreading. I'll theorize that it has to do with people who are afraid of the down-mod mob and want to avoid their karma taking a hit after posting something silly, or who are afraid of being found a decade from now by a Google search from some prospective employer.
I could restore the deleted bits (without naming the author) if that's desired?
There are people that are simply better at this stuff than I am, I'm not going to belittle my own work by paying homage to theirs. That does not mean that I can't improve relative to where I am today though, and I strive very hard to do so. Maybe one day I will be that good.
It's the difference between arrogance and objectivity.
I'm more than happy creating the stuff I do, every now and then I come up with something that is new (at least, new to me, see http://news.ycombinator.com/item?id=1353259), but that does not mean that I'm going to see this as anything other than the stroke of genius that it was.
If it really was that easy we could all improve on every major algorithm out there just by staring at it long enough and that's exactly why BM is so neat, it wasn't just a simple incremental improvement on someone else's code.
So, I challenge you: come up with an improvement on the status quo prior to BM that does not use the principles embodied in BM, that performs on par or better, and that is not part of the current literature.
I'll tip you $250 if you manage to do that, that should be easy money for you. And I'm good for the cash.
You will probably get a nice wikipedia entry to boot.
Why would being objective about someone else's achievements relative to your own discourage anybody? That's a motivation to try harder, not to give up; after all, it shows you what is possible.
Just like hearing Yehudi Menuhin play should not cause you to burn your violin but to practice.
I'm not putting algorithms on a pedestal, I'm putting that particular example of all the algorithms available up as the one that I ran into which taught me that I (still!) have a lot to learn.
jacquesm seems like an accomplished guy. I believe he created internet video feeds. I just bookmarked his HN comment feed along with tptacek because it's a great way to find the good technical threads on HN.
When jacquesm is in awe of some developer, I find it inspirational, to think: hey, he's a mortal, maybe I can be that good too one day.
> The key to making programs fast is to make them do practically nothing. ;-)
is a paraphrase of something I posted here a long time ago:
> You can't make programs run faster, you can only make them do less.
While not entirely true (and Ph.D. theses have been written about the corners where it's wrong) it's an excellent start when you have to make a program run faster.
One counter-example I can think of is that of Judy trees - http://judy.sourceforge.net/ which uses more instructions to try to keep the data in cache for as long as possible.
A (CPU) cache-line fill is additional time required to do a read reference from RAM when a word is not found in cache. In today's computers the time for a cache-line fill is in the range of 50..2000 machine instructions. Therefore a cache-line fill should be avoided when fewer than 50 instructions can do the same job.
> One counter-example I can think of is that of Judy trees - http://judy.sourceforge.net/ which uses more instructions to try to keep the data in cache for as long as possible.
It's not a counter-example. The cost of running a program includes the cost to access memory as well as the cost of executing instructions. It also includes the cost to access disk/flash.
BTW, the HAT-trie is supposedly a speed improvement over Judy and all other known data structures for that problem. I don't yet understand it well enough to evaluate that claim.
You can make them do less, but you can also make them spend less time doing nothing at all. Concurrency optimizations and prefetching fall into this category.
One of the issues of Apple's Develop magazine a long time ago explained the Boyer-Moore, the Tuned Boyer-Moore, and the Self-Tuning Boyer-Moore, which is what I used for a C library I was working on at the time.
Boyer-Moore is a great algorithm, but it's for fixed-string searching. grep, in the general case, handles regular expression searching. Is there some way to extend Boyer-Moore to regular expressions that I'm unaware of? Or is GNU grep's use of Boyer-Moore limited to when the search string is just a literal?
In practice, most regular expressions contain some literal strings. You can use Boyer-Moore to anchor the match, then do a full regexp match from there.
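A sketch of that idea in C, with strstr() standing in for the Boyer-Moore pass and POSIX regexec() doing the confirmation. The hard part in a real grep - extracting a literal that every match must contain from an arbitrary regexp - is assumed away here, and the function name is made up:

    #include <regex.h>
    #include <string.h>

    /* Find a line matching `re`, using a fast literal scan to skip
       lines that can't possibly match (assumes every match contains
       `literal`). Returns 1 if some line matches. */
    int anchored_grep(const char *text, const char *literal,
                      const regex_t *re)
    {
        const char *p = text;
        while ((p = strstr(p, literal)) != NULL) {
            /* Expand the hit to its enclosing line... */
            const char *start = p;
            while (start > text && start[-1] != '\n') start--;
            const char *end = strchr(p, '\n');
            size_t len = end ? (size_t)(end - start) : strlen(start);

            /* ...and run the full regexp on just that line. */
            char line[4096];
            if (len < sizeof line) {
                memcpy(line, start, len);
                line[len] = '\0';
                if (regexec(re, line, 0, NULL, 0) == 0)
                    return 1;
            }
            p = end ? end + 1 : start + len;  /* next line (or EOF) */
        }
        return 0;
    }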
...and the classical answer to implementing full Regex is a finite state automaton, correct? At least, there is a one-to-one correspondence.
I'm curious, though, about what tricks are used in actual implementations to speed things up, and what modern Regex features necessitate climbing further up the Chomsky Hierarchy. (I seem to recall reading about features getting slipped into Regex engines that made them no longer finite state, but can't recall what they were, right now.)
Check out Shift-Or: used in agrep, and another example of a very clever algorithm. Rather than translating a non-deterministic finite state automaton to a deterministic one, it uses the boolean operations of the hardware to simulate the NFA directly. Result: linear-time matching for patterns no longer than the bit-length of a machine's registers, i.e., 32 bytes on x86, 64 bytes on x86_64, etc.
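For the curious, here's a C sketch of Shift-Or doing exact matching; agrep's versions extend the same one-line state update to character classes and to matching with errors:

    #include <stdint.h>
    #include <string.h>

    /* Shift-Or (bitap): one bit of state per pattern position, so the
       pattern is limited to 64 bytes here. A 0 bit at position i means
       "pattern[0..i] matches, ending at the current text byte". */
    long shift_or_search(const char *text, long n, const char *pat)
    {
        long m = (long)strlen(pat);
        if (m == 0 || m > 64) return -1;

        uint64_t mask[256];
        for (int c = 0; c < 256; c++) mask[c] = ~0ULL;
        for (long i = 0; i < m; i++)
            mask[(unsigned char)pat[i]] &= ~(1ULL << i);

        uint64_t state = ~0ULL;
        for (long j = 0; j < n; j++) {
            state = (state << 1) | mask[(unsigned char)text[j]];
            if ((state & (1ULL << (m - 1))) == 0)
                return j - m + 1;          /* match ends at text[j] */
        }
        return -1;                         /* no match */
    }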
Is there a simple generalization of Boyer Moore to regular expressions? Can something be said about the optimality of string searching, for example by assuming a probability distribution of input texts?
"The key to making programs fast is to make them do practically nothing. ;-)"
In a perfect world, I would prefer many programs that do practically nothing and work together, to a monolithic program that does everything (and even contains a kitchen sink!) but is slow.
So come on fellow developers, let's make a bunch of nothing!
Lots of little programs that can be easily chained together.
I think that's usually described as "the Unix philosophy." It's not limited to GNU (nor does it originate with GNU). See, for example, the Wikipedia article on "Unix philosophy"[1]:
Doug McIlroy, the inventor of Unix pipes and one of the founders of the Unix tradition, summarized the philosophy as follows:[2]
This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
This is usually abridged to "Write programs that do one thing and do it well".
An external program communicating over pipes can let the OS handle buffering, run on another process core, be swapped out for another program that speaks the same protocol (perhaps in a faster language), won't crash the whole system, etc.
A program running as a library subroutine has a bit less overhead, and can use more context (library-native data structures, rather than piped text), but this is also usually more language-specific. Working within a "full environment" language like Smalltalk has a lot of advantages, but it also needs comprehensive libraries for your problem domain, or you're back to using external programs.
There's an insightful aside about this in Joe Armstrong's _Programming Erlang_, in the chapter about ports - Erlang code can load foreign code as linked-in libraries, but a buggy library will make the whole system unstable in a way that code running in a foreign process and communicating via message passing will not. He argues for running code in an external process (a "port") by default.
Of course, having a comprehensive (but low-level) library in C with wrappers in higher-level languages is an option. High-level languages' type systems / object models can be very different, though, and it takes experience to translate a C API to feel native to Python/Lua/Ruby/etc.
It also works to structure a program as a C library, but provide a small standalone program which gives it a command line interface. SQLite and Lua are good examples of the latter approach.
I (still) think that message passing is a vastly undervalued mechanism. Erlang is a really interesting language, by the way; I wish I had more time to devote to learning it.
One strategy that I've used before is to write every program as a library, then put a very thin command-line program in front of it. All new functionality goes into the library, then gets exposed.
That gives the best of both worlds. I have my usable program. And later I can easily integrate it into any other library that can benefit.
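Something like this, as a minimal C sketch (names made up; in a real project the library half would live in its own file behind a header):

    #include <ctype.h>
    #include <stdio.h>

    /* Library half: all the functionality, callable from anywhere. */
    long count_words(FILE *in)
    {
        long words = 0;
        int c, in_word = 0;
        while ((c = fgetc(in)) != EOF) {
            if (isspace(c)) in_word = 0;
            else if (!in_word) { in_word = 1; words++; }
        }
        return words;
    }

    /* CLI half: nothing but I/O plumbing in front of the library. */
    int main(void)
    {
        printf("%ld\n", count_words(stdin));
        return 0;
    }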
One particular area where the external program bit is annoying is when you're querying or setting system/package parameters or defaults on a unix machine; there must be 100 incompatible ways of doing that.
Usually the only way to get the job done is to escape to a shell, which really feels kludgy.
I can see some of the charm of 'images' such as used by smalltalk.
Or maybe the appeal of glue languages like Bash or (one style of) Perl. It is very nice to have the ability to treat arbitrary programs like libraries for your program. (Unfortunately, because of the GNU/BSD split - among other things - this style of programming is also completely brittle. One non-standard flag, and boom.)
I suppose the trick, then, would be to take a bare system call and somehow automatically promote it to a process in the operating system?
You could do this for a given function signature, perhaps. But it's not clear how it would work in the general case. Is there some universal function signature that makes sense for any kind of API, that makes sense both for pipes and for the semantics of the specific program?
You're on the way to re-inventing Erlang I think ;)
That would be one way to do it, but system calls are generally assumed to be atomic; promoting them to process status would be quite tricky.
Multi-threaded kernels effectively do some of this by allowing you to execute multiple system calls in parallel.
The typical way in which an 'image' based system works is that all your persistence (including compiled code) continues to reside in the image; over time you get the same kind of build-up that you typically see in file systems, only there, there is no difference between 'programs' and 'subroutines'.
But that's not 'natural', that's an external process with a whole pile of start-up and shut-down overhead. Incidentally, some of the worst C code I've ever seen used piped unix shell commands all over the place, as if there is no penalty to doing this.
That's true. I think the tricky part is providing segregation where it's needed. That is, the problem is in supporting systems and programs which don't conform to the standard. Process segregation and scheduling is one way of solving this problem. Another is to simply enforce the design uniformly across the entire system. IIRC Plan9 does this slightly, but a better example may be a system which uses e.g., Haskell as the base language and simply requires all programs (modules) to be interfaced with the system with public facing symbols. A 'grep' IO monad may be an interesting way of linking and segregation.
It does seem that GNU is bigger on it than the BSDs these days, though. The modern BSD approach is to write libraries, and then have shell utilities just be frontends on those libraries (sometimes with multiple front ends sharing the same library).
Stop! Whoever crosseth the bridge of Death, must answer first these questions three, ere the other side he see!
"What... is your name?"
"Sir Brian of Bell."
"What... is your quest?"
"I seek the Holy Grail."
"What... are four lowercase letters that are not legal flag arguments to the Berkeley UNIX version of `ls'?"
"I, er.... AIIIEEEEEE!"
ack (http://betterthangrep.com) is a great alternative to grep, although certainly slower as it's written in Perl. Great for working with smaller files such as source code, though.
Speaking as a guy who writes in assembly too, I'm not sure that much can be gained by BM, at least on modern out-of-order processors. The cost of comparing every byte is low, and skipping a few bytes can cost more in additional instructions (which can stall the pipeline) than simply scanning every byte for the first character would. Other speedups (mmap) sound reasonable.
Plan9 grep is competitive in my simple test, often faster.
It is also immune from the pathological data that will kill GNU grep - see http://swtch.com/~rsc/regexp/ and the BUGS section of your local GNU grep.
    #!/usr/local/plan9/bin/rc

    fn 9grep { /usr/local/plan9/bin/grep '2010-[0-9][0-9]-23 02:01:57' /home/maht/lighttpd.error.log > /dev/null }
    fn ggrep { /usr/bin/grep '2010-[0-9][0-9]-23 02:01:57' /home/maht/lighttpd.error.log > /dev/null }
    fn mgrep { /usr/bin/grep --mmap '2010-[0-9][0-9]-23 02:01:57' /home/maht/lighttpd.error.log > /dev/null }

    switch($1) {
    case -9
        9grep
    case -g
        ggrep
    case -m
        mgrep
    case *
        ls -l /home/maht/lighttpd.error.log
        time /tmp/gtest -9
        time /tmp/gtest -g
        time /tmp/gtest -m
    }

    /tmp/gtest
    -rw-r--r-- 1 www wheel 1113325534 Aug 23 18:51 /home/maht/lighttpd.error.log
    23.67 real 3.88 user 3.74 sys    (9grep)
    24.28 real 0.63 user 3.89 sys    (ggrep)
    23.09 real 0.56 user 3.87 sys    (mgrep)
It looks like all three of your grep commands there are I/O-bound (where are you getting a machine with much less than a gig of RAM, but 50 megabytes per second of disk streaming? Is this an old server with a RAID?), but plan9 grep uses six times as many CPU cycles as the other greps.
I wouldn't call that "competitive", even if system-call overhead does knock that crippling slowdown down to less than a factor of two.
Also, what kind of kernel are you running there that would peg your CPU with a mere 300 megabytes per second of disk I/O? Is DMA disabled on your disk or something? Surely not, because there aren't any IDE PIO modes that are anywhere close to 50 megabytes per second.
[0] http://www.jakevoytko.com/blog/2007/12/11/fun-with-string-se... I've declared bankruptcy on broken TeX and code examples... WordPress mangles them every few updates.
[1] http://www.lysium.de/blog/index.php?/archives/201-Fun-With-S... A few improvements to the code in my post