
> For a Perl-type problem (scanning and parsing big files), Perl is very fast.

I think it's a matter of what you're comparing it to.

Compared to using Perl for a general-purpose problem, Perl for scanning/parsing is fast.

Compared to scanning/parsing with C, Perl is not fast.

    $ ruby -e '1.upto(1000000) { |n| puts "This is line number #{n}" }' > file
    $ time perl -ne 'print if /number 12345/' < file
    [...]
    real    0m0.193s
    user    0m0.189s
    sys     0m0.004s
    $ time grep "number 12345" file
    [...]
    real    0m0.023s
    user    0m0.019s
    sys     0m0.005s
I gave Perl every possible advantage here. I didn't actually write any Perl except a regular expression, which is delegated immediately to C. I didn't even write the loop in Perl; I let Perl's implicit read loop (the -n flag) handle that. And still the C program is almost 10x faster.
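For reference, the C side of this comparison doesn't have to be grep; a hand-rolled fixed-string scan is only a few lines. A minimal, purely illustrative sketch (function names are invented; strstr does the matching):

```c
#include <stdio.h>
#include <string.h>

/* Return nonzero if the line contains the fixed pattern anywhere. */
int line_matches(const char *line, const char *pattern) {
    return strstr(line, pattern) != NULL;
}

/* Echo each matching line from fp to stdout, like grep -F. */
void scan(FILE *fp, const char *pattern) {
    char line[4096];
    while (fgets(line, sizeof line, fp))
        if (line_matches(line, pattern))
            fputs(line, stdout);
}
```

Called as `scan(stdin, "number 12345")`, this mirrors the grep invocation above for fixed strings, though real grep adds substantially more optimization (Boyer-Moore skipping, buffered I/O, etc.).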

Note: these test runs are from Linux. On OS X the Perl results were almost the same, but grep was inexplicably much slower; it seems to hang after it has already dumped all of its output. Basically, grep on OS X appears to be badly broken somehow.




I'm always wary of these kinds of sub-second benchmarks, because more often than not you've accidentally measured just the compilation and startup times.

I might have some bias, though, from speeding up crusty old Perl CGI web apps with multi-second request times down to less than 100ms simply by keeping the Perl processes persistent with mod_fcgid or whatever.


> I'm always wary of these kinds of sub-second benchmarks, because more often than not you've accidentally measured just the compilation and startup times.

I increased the iteration count by 10x and observed exactly the same pattern:

    $ time perl -ne 'print if /number 12345/' < file
    [...]
    real    0m1.661s
    user    0m1.586s
    sys     0m0.076s
    $ time grep "number 12345" file
    [...]
    real    0m0.188s
    user    0m0.132s
    sys     0m0.056s


Cool, thanks for entertaining my superstitions :)


I recall anecdotal reports that Perl was faster than egrep in some cases. I never tested it myself and that was a long time ago—could be a bug that is long since fixed.


I think the takeaway is that Perl is "fast enough" for Perl-type problems. Writing a complicated file parser in C would be a nightmare.


When I have a one-time computation job that takes an hour to write and two hours to run in Perl, but in C takes 10 hours to write and half an hour to run, then Perl is faster than C.

And these one-time/rare/short jobs are much more frequent than intense, high-throughput C code like the nginx web server or the Node.js JavaScript runtime.


I see your point: C is definitely the choice for long-running jobs or jobs that will be run more than a handful of times. In my experience, however, I'm writing a lot of one-off scripts that take 30 seconds tops to run, so Perl wins out pretty hard over C.


What you are seeing is the difference between regex engines and their capabilities: grep focuses on pure speed and optimizing the common case, while Perl focuses on versatility.

I see very similar results between Perl and grep, and you can see this by also including egrep, which allows slightly more complex expressions:

    [root@stats ~]# time perl -ne 'print if /number 123456/' < /tmp/file
    [...]
    real    0m1.990s
    user    0m1.937s
    sys     0m0.049s
    [root@stats ~]# time grep "number 123456" /tmp/file
    [...]
    real    0m0.158s
    user    0m0.115s
    sys     0m0.035s
    [root@stats ~]# time egrep "number 123456" /tmp/file
    [...]
    real    0m0.150s
    user    0m0.127s
    sys     0m0.023s
But what happens if we use a slightly more complex expression?

    [root@stats ~]# time perl -ne 'print if /number [1]23456/' < /tmp/file
    [...]
    real    0m1.989s
    user    0m1.925s
    sys     0m0.047s
    [root@stats ~]# time grep "number [1]23456" /tmp/file
    [...]
    real    0m1.402s
    user    0m1.366s
    sys     0m0.022s
    [root@stats ~]# time egrep "number [1]23456" /tmp/file
    [...]
    real    0m1.414s
    user    0m1.382s
    sys     0m0.031s
The difference becomes much less pronounced. What if we make the expression just a bit more complex?

    [root@stats ~]# time perl -ne 'print if /number [1]23456[0-9]*/' < /tmp/file
    [...]
    real    0m1.950s
    user    0m1.910s
    sys     0m0.039s
    [root@stats ~]# time grep "number [1]23456[0-9]*" /tmp/file
    [...]
    real    0m9.353s
    user    0m9.307s
    sys     0m0.037s
    [root@stats ~]# time egrep "number [1]23456[0-9]*" /tmp/file
    [...]
    real    0m9.539s
    user    0m9.483s
    sys     0m0.045s
So now we have the Perl regex engine fairly static across the extra complexity, while grep and egrep see order-of-magnitude time increases and are much slower than Perl at this point. I suspect your first benchmark was the result of a specific optimization grep has that Perl doesn't, or it may be that grep was able to switch to using a DFA regex engine for that first case, while Perl doesn't bother with a completely different regex implementation for special cases like that.

Anecdata: I needed to process a large amount of XML a while back, to the point where a week spent testing and optimizing XML parsing libraries in Perl was worth it, because it could shave weeks or months off the processing time. The winner? A regex that captured attributes and content and assigned name/value pairs directly out to a hash. This was only possible because the XML was highly normalized, but it was actually over 10 times faster than the closest competitor for XML parsing I could find, and I checked all the libXML, libXML2, and SAX libraries I could get my hands on.

In the end, it was something as simple as the following approximation:

    while (my ($doc) = $xml =~ /$get_record_xml_re/g) {
        my %hash = $doc =~ /$record_begin_re$capture_name_and_value_pairs_re$record_end_re/g;
        process_record( \%hash );
    }


Your results are interesting and I'd be curious to know why the grep degrades so badly on that last regex.

But the original benchmark was ridiculously biased in favor of Perl by not actually doing anything in Perl.

If Perl is actually being competitive in the unfair benchmark, the benchmark should be made more fair by actually putting some logic in Perl, and writing the equivalent logic in C. At that point, you would start to see C win again (modulo any inherent inefficiencies in grep's regex engine).
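To make "putting some logic in" concrete: a fairer version of the task might extract and total the matched numbers rather than just echoing lines. A hypothetical sketch of the C side of such a benchmark (the function name and the summing "logic" are invented for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Extract the integer following "number " in a line, or -1 if absent.
   This stands in for the "some logic" a fair benchmark would add. */
long extract_number(const char *line) {
    const char *p = strstr(line, "number ");
    if (p == NULL)
        return -1;
    return strtol(p + strlen("number "), NULL, 10);
}
```

The Perl side would do the equivalent inside the -n loop (e.g. `$total += $1 if /number (\d+)/`), so both programs now match *and* compute something with the match.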

> This was only possible because the XML was highly normalized

Another way of putting this is: your regex wasn't actually an XML parser. Things that are actually XML parsers were slower. This is not too surprising.

A 10x slowdown does surprise me somewhat. It doesn't surprise me that you beat libXML or any library that builds the XML structure into a complete DOM before you can process the first record. It does surprise me that you beat SAX by 10x. SAX does have some inefficiency built in, like how it turns attributes into a dictionary internally before giving them to the application. That would probably mean that SAX bindings for Perl to a C parser would have to take a full SAX attribute hash and turn it into a Perl attribute hash. Still, 10x is pretty bad.


> Your results are interesting and I'd be curious to know why the grep degrades so badly on that last regex.

I really do think it has to do with grep swapping out regex implementations based on the features needed. The last regex matches a variable-length string, so it may trigger a much more complex and/or CPU-intensive regex engine.

> If Perl is actually being competitive in the unfair benchmark, the benchmark should be made more fair by actually putting some logic in Perl, and writing the equivalent logic in C. At that point, you would start to see C win again (modulo any inherent inefficiencies in grep's regex engine).

I see your point, but I think it's less relevant than you suppose. Regular expressions are first-class citizens in Perl, just as much as arrays and hashes. This doesn't just mean the syntax has some niceties: you can actually call Perl code within the regex itself[1], and even use this feature to build a more complex regular expression as you parse[2]. Complaining that Perl uses a regex and that isn't Perl is sort of like complaining that Perl is using hashes and insisting any fair benchmark between C and Perl should stick to arrays.

> Another way of putting this is: your regex wasn't actually an XML parser. Things that are actually XML parsers were slower. This is not too surprising.

Yes. I didn't want to give the impression I wrote a general-purpose XML parser that beat all the C implementations I could find. I still think it's interesting that well-formed regular expressions are performant enough in this circumstance to make them a preferred alternative among the many options. I could have written a simple parser in C that would have been faster, but the solution I ended up with is quite fast, robust, and very, very easy to debug.

> That would probably mean that SAX bindings for Perl to a C parser would have to take a full SAX attribute hash and turn it into a Perl attribute hash. Still, 10x is pretty bad.

I think it's more related to the fact that a sufficiently optimized regex implementation ends up doing something conceptually very close to what C code would do: stepping through a char array looking for the record beginning and ending markers and the content between them, then stepping through each record looking for data items and saving the key and value of each. The main benefit of using a regex in Perl is that I get what is probably a fairly close approximation (all things considered) of a C implementation of that without having to write any C, and without having to marshal data back and forth to a library, with very simple and concise code.
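A sketch of the "C code that steps through a char array" being described, purely illustrative (the key="value" delimiter scheme and the function name are hypothetical, not the actual XML format from the anecdote):

```c
#include <stdio.h>
#include <string.h>

/* Step through a record looking for key="value" and copy the value
   into out. Returns 1 on success, 0 if the key is absent or the
   value doesn't fit. */
int get_attr(const char *record, const char *key, char *out, size_t outsz) {
    char needle[64];
    snprintf(needle, sizeof needle, "%s=\"", key);   /* e.g. id=" */
    const char *start = strstr(record, needle);
    if (start == NULL)
        return 0;
    start += strlen(needle);
    const char *end = strchr(start, '"');            /* closing quote */
    if (end == NULL || (size_t)(end - start) >= outsz)
        return 0;
    memcpy(out, start, (size_t)(end - start));
    out[end - start] = '\0';
    return 1;
}
```

This is roughly what a capturing regex like /id="([^"]*)"/ does under the hood; the Perl version simply lets the engine generate the stepping for you.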

There are other tasks which obviously aren't going to be nearly as efficient in Perl, but this exchange was spurred by someone talking about Perl being very fast for a Perl-type problem, which this definitely is.

1: http://perldoc.perl.org/perlre.html#(%3f%7b-code-%7d)

2: http://perldoc.perl.org/perlre.html#(%3f%3f%7b-code-%7d)


> I really do think it has to do with grep swapping out regex implementations based on features needed.

That wouldn't explain why one regex engine is 5x faster than another. Only looking at the regex engines themselves would tell you that.

> Complaining that Perl uses a regex and it isn't Perl

I'm not complaining that it uses a regex, I'm complaining that it doesn't do anything else.

A representative Perl program would use regexes and contain some logic that processes the results of those regexes.

> I think it's more related to the fact that the actions of the regex parsing implementation when optimized sufficiently is very close in implementation to C code that steps through a char array

I used to think that, but it is really not true unless your regex engine contains a JIT compiler.

Specialized machine code for a text parser (which is what you would get from writing C) is significantly faster than generic NFA/DFA code. In these tests, an average of 65% of runtime was saved when the regex engine included a JIT (i.e. the specialized code was over twice as fast): http://sljit.sourceforge.net/pcre.html


> That wouldn't explain why one regex engine is 5x faster than another. Only looking at the regex engines themselves would tell you that.

It could definitely explain it, but it may not be the best explanation given the facts. I'll definitely concede that it's pure conjecture.

> A representative Perl program would use regexes and contain some logic that processes the results of those regexes.

Sure, depending on what you want to show. Nobody is trying to say Perl is as fast or faster than C, just that relatively, it's fast for the development cost it requires.

> > I think it's more related to the fact that the actions of the regex parsing implementation when optimized sufficiently is very close in implementation to C code that steps through a char array

> I used to think that, but it is really not true unless your regex engine contains a JIT compiler.

I think we're referring to different things, which is mostly my fault for being loose with my terminology. I really only meant close to C in a conceptual manner, which yields some performance benefit by keeping a large chunk of the looping and the storing of specific chunks of text low-level and inside the interpreter. I wasn't trying to imply the regex engine's cost was negligible or that the actual machine operations were comparable in any large way.


> > regex parsing implementation when optimized sufficiently is very close in implementation to C code

> I used to think that, but it is really not true unless your regex engine contains a JIT compiler.

The P6 Rules engine is written in NQP so it gets JIT'd on the JVM and MoarVM backends.

Of course it'll be many years before the engine is seriously optimized but it's a good start.

It'll be interesting to see if the code gen of this next-gen regex engine gets good enough in 2015 to make its advantages (most notably the grapheme-by-default design) actually pay dividends.



