I showed this to my roommate, and he mentioned that the lexer in his product is written the same way. The reason, apparently, is that it's a really simple way of implementing a finite state machine.
Wikipedia mentions the pattern as well:
"The lex/flex family of generators uses a table-driven approach which is much less efficient than the directly coded approach. With the latter approach the generator produces an engine that directly jumps to follow-up states via goto statements. Tools like re2c and Quex have proven (e.g. article about re2c) to produce engines that are between two to three times faster than flex produced engines.[citation needed] It is in general difficult to hand-write analyzers that perform better than engines generated by these latter tools." (http://en.wikipedia.org/wiki/Lexical_analysis)
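To make "directly coded" concrete, here's a toy sketch of a scanner in that style, where each DFA state is a label and each transition is a goto. This is just an illustration of the technique the quote describes, in miniature; it is not PHP's actual lexer, and the token names are made up for the example.

```c
#include <ctype.h>

/* Token kinds for this toy scanner (illustrative only). */
enum tok { TOK_NUM, TOK_IDENT, TOK_EOF, TOK_ERR };

/* Directly coded DFA: each state is a label, each transition a goto.
 * Advances *pp past the token it recognizes. */
enum tok scan(const char **pp) {
    const char *p = *pp;

start:
    if (*p == ' ')  { p++; goto start; }   /* skip blanks */
    if (*p == '\0') { *pp = p; return TOK_EOF; }
    if (isdigit((unsigned char)*p)) { p++; goto in_num; }
    if (isalpha((unsigned char)*p) || *p == '_') { p++; goto in_ident; }
    *pp = p + 1;
    return TOK_ERR;

in_num:
    if (isdigit((unsigned char)*p)) { p++; goto in_num; }
    *pp = p;
    return TOK_NUM;

in_ident:
    if (isalnum((unsigned char)*p) || *p == '_') { p++; goto in_ident; }
    *pp = p;
    return TOK_IDENT;
}
```

A table-driven generator like flex would instead loop over a transition table indexed by (state, character); the goto version avoids that indirection, which is where the claimed speedup comes from.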
Mostly, I'm curious about my second question. What's the state of testing the core of PHP? I'm totally unfamiliar with the implementation.
I understand that refactoring the lexer may be a dumb idea, but it gets less dumb if the refactor is backed by hundreds of tests, and far more dumb if the only safety net is having been battle-tested over time.
Yawn. Sometimes, code is both messy and perfectly functional. At that point, the best thing for the business/community is to spend time thinking about something else until you need to make changes to that chunk of code.
The best you can come up with is how tangled the code in a hand-written lexer looks? I may have skimmed a bit towards the end, but did you actually find a bug?
That code is really bad. Dunno about you, but to me this handcrafted entanglement looks like the crap that any professional programmer should be ashamed of:
if ( .. )
{
    for ( .. )
    {
label:
        ...
    }
}
else
{
    if ( .. )
    {
        ..
    }
    else
        goto label;
    ..
}
The actual code is even worse than this: there's a second goto that jumps from the top `for` into one of the `else` blocks at the bottom, from where control can go back into the `for` via yet another goto... I can sort of see how this code came about, but there is absolutely no excuse for checking in something this f*cked up.
If they wanted to show off, they should've just used the coroutines :-)
It has been explained that this is just a way of implementing a fast lexical scanner. This is not the kind of code you come up with by hand; it is a mechanical implementation of a state machine.
Mechanical or not, this is a very untidy way of implementing a "fast lexical scanner". Even Duff's device looks cleaner and more elegant in comparison. And it's not as if there is a shortage of other implementation options that are just as fast.
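For reference, Duff's device is the classic example of a switch jumping into the middle of a loop body to unroll a copy. A sketch (the usual textbook form, assuming count > 0):

```c
#include <stddef.h>

/* Duff's device: the switch jumps into the middle of an unrolled
 * do-while, handling the remainder iterations first, then copying
 * eight bytes per pass. Assumes count > 0. */
void duff_copy(char *to, const char *from, size_t count) {
    size_t n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

At least here the weirdness is confined to one small, well-known idiom, rather than gotos criss-crossing a whole function.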
Didn't look like he pointed out any bugs - but that's not really the main point of a code review. It's all well and good for the compiler to understand the code and have it execute properly, but is it human-readable? Or maintainable? Having looked at the source file, I'd say that particular method is neither.
This guy is not doing a fair review. He is just badmouthing the code without providing alternatives. The code doesn't look good, but a constructive review would be much more appreciated by everyone. By acting this way, he is just making a fool of himself.