No I'm genuinely asking. Isn't some of the stuff here already being done by other languages, or is Facebook really breaking new ground here? (yes lbrandy I saw your earlier comment)
Facebook probably has millions of lines of PHP code, PHP code proven to work. This isn't just Facebook.com, but internal websites, their bug-tracker software (which is open source), analytics tooling, and so on. Rewriting all of that would be _expensive_.
Running it on stock Zend PHP would, judging by the benchmarks they get with HipHop, require _way_ more hardware at Facebook's scale, so again, very expensive.
PHP developers are abundant, PHP is a relatively easy language to teach, and resources are everywhere. Because of that abundance, PHP developers (even at Facebook quality) are probably less expensive than developers in other languages. Easier hiring and lower-cost developers: cheaper.
If you read the comments, you'll find the answer further down. They do address it. At their scale, all languages/platforms break and they would need to invest the same amount of time and effort getting it to work.
It won't make a difference if you switch to the latest hipster language. You'll still need an effort of this magnitude to get it to scale the way they need it to.
Furthermore, rewriting an entire code-base is a monumental undertaking that is both expensive and complex. Especially for something as large and diverse as Facebook. It also makes zero business sense and Facebook is a business.
Why should they take a risk with a new code-base, when what they're doing is clearly working? PHP just works.
Changing languages at the scale of Facebook means that everyone, internal or external, needs to know the new language.
That transition is a big process, and it usually costs far more in training and lost productivity than having a dedicated team improve the performance of the existing tools.
It's called PHC http://phpcompiler.org and it's open source.
It was Paul Biggar's PhD thesis http://blog.paulbiggar.com/archive/a-rant-about-php-compiler...
It just needs some community contributions to catch up.
There were two cool things about phc: that it had a really advanced static analyzer, and that it compiled to modules compatible with zend. However, these are somewhat mutually exclusive, and making them work well together is an open problem (though I had some ideas).
I moved on from phc, and there's no-one left in the community really. I now run https://circleci.com (applying my compiler knowledge in a different way). I tried to pass the reins on, but nobody stepped up. Hard to find people who love PHP but are also able to hack on compilers.
I worry the php community is losing or has lost some of the really smart folks.
Suhosin (Stefan Esser's project) is also fading away, with no updates to work with PHP 5.4, and that's really sad, especially considering the radical performance improvements and memory-use reductions in 5.4 and 5.5 versus 5.2-5.3.
Disclaimer: I don't think I've looked at the PHP source in 3 years; it might have changed significantly, but I suspect it has not. I'd love to be corrected with some detail!
My main point was that switch()-based dispatch isn't that bad. I'm surprised that this recent literature distinguishes between switch dispatch and token-threaded dispatch, since as Mike Pall notes, "Tail-merging and CSE will happily join all these common tails of each instruction and generate a single dispatch point," so the goto-spaghetti of token-threaded dispatch is likely not even worth it. (http://article.gmane.org/gmane.comp.lang.lua.general/75426)
> Indirect threading is the fastest simple technique
I'm confused; the paper itself says: "However, because of a level of indirection, indirect-threaded dispatch is not be as efficient as direct-threaded dispatch."
Even direct-threaded dispatch still results in an indirect branch for every VM instruction, it just tries to save a table lookup over token-threading.
Ultimately the most common and practical dispatch techniques seem to boil down to either a single indirect branch or replicated indirect branches.
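For readers following along, here is a minimal sketch of the switch-based dispatch being discussed (the opcodes and program are made up, not from any particular VM): the whole interpreter is one loop with a single shared dispatch point, which typically compiles to a bounds check plus one indirect jump.

    /* Minimal switch-dispatch bytecode interpreter (illustrative only). */
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };   /* hypothetical opcodes */

    static void run(const uint8_t *pc) {
        int64_t stack[64];
        int sp = 0;
        for (;;) {                   /* one loop, one shared dispatch point */
            switch (*pc++) {         /* usually a bounds check + indirect jump */
            case OP_PUSH:  stack[sp++] = *pc++;            break;
            case OP_ADD:   sp--; stack[sp-1] += stack[sp]; break;
            case OP_PRINT: printf("%lld\n", (long long)stack[--sp]); break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(prog);   /* prints 5 */
        return 0;
    }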
Regarding the threaded code: the literature seems to be horribly inconsistent in this regard. Token threaded code might actually refer to indirect-threaded code depending on what the author has read. I was actually quite surprised to read what "indirect threaded code" originally meant when I read Debaere and van Campenhout's 1990 book "Interpretation and Instruction Path co-processing".
Regarding the effect of CSE & tail-merging: I have only seen GCC do this once, by disabling basic block reordering (-fno-reorder-blocks, due to a so-called "software trace cache"), otherwise this has never happened to me. Since I have nothing but the highest respect for Mike Pall, I guess that he might have experienced this for Lua, which has--to the best of my knowledge--fewer than 40 instructions. I will try to verify this on the weekend.
I should have been more clear; that comment was in reference to some of the techniques the paper discusses which do give up portability, namely "inline-threaded dispatch" and "context-threaded dispatch."
My overall points are: switch() isn't that bad, and most practical dispatch techniques ultimately boil down to either a single indirect branch or replicated indirect branches.
> Regarding the effect of CSE & tail-merging: I have only seen GCC to do this once
I've never seen it happen with my own eyes, but it appears the Python guys experienced this also at one point and had to tweak the flags -fno-crossjumping and -fno-gcse. http://bugs.python.org/issue4753
> I will try to verify this on the weekend.
Do you have a blog? I'd love to hear what you discover.
I second this. I'd be very interested to know who you (sb) are in real life.
> If you're giving up portability, you might as well go all the way and create a JIT.
Well no. A JIT is a massive massive amount of work, an interpreter is not. Don't throw the baby out with the bathwater. Besides, you can make a damn portable token-threaded dispatch for supported platforms (which means anything using GCC, LLVM or Visual Studio, which means basically anything that exists today), falling back to switch statements for unsupported platforms.
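For illustration, a minimal sketch of that kind of portable setup (hypothetical opcodes, not taken from any real VM): with GCC/Clang's labels-as-values extension each handler ends in its own indirect jump, and the same source falls back to a plain switch loop on other compilers.

    /* Token-threaded dispatch via computed gotos, with a switch fallback. */
    #include <stdint.h>

    enum { OP_INC, OP_DEC, OP_HALT };           /* hypothetical opcodes */

    static int run(const uint8_t *pc) {
        int acc = 0;
    #if defined(__GNUC__) || defined(__clang__)
        static void *table[] = { &&do_inc, &&do_dec, &&do_halt };
        #define DISPATCH() goto *table[*pc++]   /* one indirect branch per handler */
        DISPATCH();
    do_inc:  acc++; DISPATCH();
    do_dec:  acc--; DISPATCH();
    do_halt: return acc;
    #else
        for (;;) {
            switch (*pc++) {                    /* single shared dispatch point */
            case OP_INC:  acc++; break;
            case OP_DEC:  acc--; break;
            case OP_HALT: return acc;
            }
        }
    #endif
    }

    int main(void) {
        const uint8_t prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
        return run(prog);   /* exits with status 1 */
    }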
> Techniques like what the paper calls "inline-threaded dispatch" seem of limited usefulness, since they give neither the speed of a JIT nor the portability of a bytecode VM.
I forget the numbers, but let's say it's 20%. It gives a 20% speed boost for a few dozen lines of code. Hardly "limited usefulness".
> My main point was that switch()-based dispatch isn't that bad.
It's portable and easy to implement. It is not the fastest dispatch mechanism. It will generally be the bottleneck in an otherwise-efficient interpreter. It does not compare well to most other dispatch techniques, and is only useful if you demand that absolutely only ANSI C will do.
> I'm surprised that this recent literature distinguishes between switch dispatch and token-threaded dispatch, since as Mike Pall notes, "Tail-merging and CSE will happily join all these common tails of each instruction and generate a single dispatch point," so the goto-spaghetti of token-threaded dispatch is likely not even worth it. (http://article.gmane.org/gmane.comp.lang.lua.general/75426)
Firstly, the work on interpreters was done in the 80s and 90s. Really awesome optimizing compilers post-date that considerably.
Secondly, Pall does not say those dispatching techniques are equivalent. What he says is that interpreters in C are difficult, and you have to fight the optimizer. If you wrote them in assembly, as Pall did, you would not find that the techniques were equivalent.
> Ultimately the most common and practical dispatch techniques seem to boil down to either a single indirect branch or replicated indirect branches.
It's certainly common to use a single indirect branch. However, it is suboptimal. People like speed.
Finally, note that my original point was that switch-based dispatch is slow, but that better techniques in PHP are not worth it, because the rest of the interpreter is so much slower.
> [Inline-threaded dispatch] gives a 20% speed boost, for a few dozen lines of code. Hardly "limited usefulness".
Unless I'm missing something, I think you would be very hard-pressed to generate basic blocks of native code in only a few dozen lines of platform-dependent code. I'm not seeing how you could do it without implementing your own code generator (which would take far more than a dozen lines). Is there some clever way of leveraging the C compiler's output that I'm missing?
Also, once you are generating basic blocks of native code, I would consider it a JIT at that point, even if it is not as sophisticated as a JIT that's doing register allocation and optimization. Writing this sort of JIT that uses a static register allocation and a fixed instruction sequence for each bytecode op doesn't seem that much more work than writing an interpreter in assembly language (especially if you use Mike Pall's excellent framework DynASM: http://luajit.org/dynasm.html).
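For a rough sense of what that looks like in practice, here is a toy "template JIT" in C for a made-up two-op bytecode: each op gets a fixed x86-64 instruction sequence that is copied into an executable buffer and patched with its immediate (x86-64/Linux only; the names and bytecode are invented for illustration, and a real JIT obviously also needs branches, guards, relocation, and so on).

    /* Toy "template JIT": one fixed x86-64 code sequence per bytecode op,
       copied into an executable buffer in program order, then called. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    enum { OP_CONST, OP_ADD_IMM, OP_RET };          /* hypothetical bytecode */
    struct insn { uint8_t op; uint32_t imm; };      /* decoded instruction   */

    static uint8_t *emit(uint8_t *p, const uint8_t *tmpl, size_t n, uint32_t imm) {
        memcpy(p, tmpl, n);                         /* copy the fixed template */
        if (n >= 5) memcpy(p + n - 4, &imm, 4);     /* patch trailing imm32    */
        return p + n;
    }

    int main(void) {
        static const uint8_t T_CONST[] = { 0xB8, 0, 0, 0, 0 };  /* mov eax, imm32 */
        static const uint8_t T_ADD[]   = { 0x05, 0, 0, 0, 0 };  /* add eax, imm32 */
        static const uint8_t T_RET[]   = { 0xC3 };              /* ret            */

        struct insn prog[] = { { OP_CONST, 40 }, { OP_ADD_IMM, 2 }, { OP_RET, 0 } };

        uint8_t *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;

        uint8_t *p = buf;
        for (size_t i = 0; i < sizeof prog / sizeof prog[0]; i++) {
            switch (prog[i].op) {
            case OP_CONST:   p = emit(p, T_CONST, sizeof T_CONST, prog[i].imm); break;
            case OP_ADD_IMM: p = emit(p, T_ADD,   sizeof T_ADD,   prog[i].imm); break;
            case OP_RET:     p = emit(p, T_RET,   sizeof T_RET,   0);           break;
            }
        }

        int (*fn)(void) = (int (*)(void))buf;
        printf("%d\n", fn());                       /* prints 42 */
        return 0;
    }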
> It's certainly common to use a single indirect branch. However, it is suboptimal.
I don't believe that a single indirect branch is a priori slower; while replicated dispatch makes better use of the branch predictor, it also results in less optimal i-cache usage (the LuaJIT code notes this trade-off, but observes that replicated dispatch is 20-30% faster in practice: http://repo.or.cz/w/luajit-2.0.git/blob/2ad9834df6fe47c7150d...).
I'm not saying switch is better, what I'm saying is that you can't reliably replicate dispatch in C (at least according to Mike's message that I cited earlier), which makes switch() a totally reasonable choice for interpreters written in C.
By the way, I'm not just making stuff up, I have written a JIT of my own: https://github.com/haberman/upb/blob/master/upb/pb/decoder_x...
Indeed, that was much too harsh and I retracted it. I thought I might have retracted it before anyone noticed, but I guess not. My bad, sorry about that.
> Unless I'm missing something, I think you would be very hard-pressed to generate basic blocks of native code in only a few dozen lines of platform-dependent code.
Whoops, I thought we were still talking about token-threading. I believe people implement token threading by fiddling with the compiler output. John Aycock had a paper about doing this for Perl or Python, as I recall (also, since you're into JITs, you might enjoy his "A Brief History of Just-in-Time").
> Also, once you are generating basic blocks of native code, I would consider it a JIT at that point,
Yes, people often consider this to be a JIT. However, it's much, much easier to write than a "real" JIT. The complexity of writing a method JIT or trace JIT is very, very high. This could probably be done in an afternoon by a great hacker.
Don't underestimate the work of writing an assembly-language interpreter like Mike Pall did. Most incredibly hardcore compiler guys with whom I have discussed it were simply blown away by it. It might be one of the most hardcore feats in VM implementation history. He manually register-allocated across dynamic control-flow boundaries, for god's sake!!
> reasonable choice for interpreters written in C
But not a fast one.
Indeed, I've bookmarked it, thanks! I couldn't find the other paper you mentioned about fiddling with compiler output; would be very interested in seeing that.
LuaJIT2 was the first interpreter and JIT that I ever studied deeply, so while I am very impressed by it, I don't have a great understanding of how it differs from previous work. How is its interpreter substantially different or more impressive than previous interpreters written in assembly language? Don't all assembly language interpreters need a fixed register allocation?
I've been a big fan of Mike's work for a long time and helped him get some Google funding: http://google-opensource.blogspot.com/2010/01/love-for-luaji...
By the way, when I was searching for the John Aycock paper I came across your dissertation which I've also bookmarked. I see that you went to Trinity College Dublin -- I visited Dublin a few years ago and what a beautiful university that is! Looks like there's a lot of interesting JIT-related research coming from there lately.
Traditionally, there were two implementation strategies: compilers and interpreters. JITs didn't become mainstream until Sun bought HotSpot and put it into the 2nd version of Java.
So you had to choose between compilers and interpreters. Compilers led to optimization, fast object code, but were complex to implement (especially multiplatform). Interpreters were simple to implement, very portable, but were very very slow.
Obviously there is more to both, but until recently, basically this is how people thought about language implementation. Considerations like a REPL, slow compilation speed, ease of prototyping, etc, were a complete side show (perhaps people cared, but you'd rarely see it discussed).
When all of the dynamic languages were implemented, they all used interpreters, written in C. They all used a simple dispatch loop into the opcode implementations, and let the C compiler do the heavy lifting. All the research into fast interpreters (the David Gregg/Anton Ertl stuff, for example) looked at the instruction set (register vs. stack) and the dispatch type. So when making interpreters fast, there were only 4 strategies:
- make a compiler,
- use better dispatch,
- rewrite using a register-based instruction set,
- make a JIT.
Making a JIT is lunacy of course, because JITs are ridiculously hard and they're not portable. So the fact that Pall was making a 1-man JIT (LuaJIT1) was incredible.
But that he made an interpreter that was as fast as a JIT was even more insane. In Trinity, all of us language/JIT/scripting language people were in one room, and when we heard about this we were just amazed. Nobody had even thought about the stuff - it was all brand new and novel in a field that had barely seen anything novel in decades! Until that point, basically all interpreters were one big while loop.
> How is its interpreter substantially different or more impressive than previous interpreters written in assembly language?
I wouldn't know, since I've not heard of any mainstream interpreters in assembly. I can only imagine that they were exactly the same as C interpreters: essentially a while loop with a switch statement, just written in assembly.
I find it amusing that you started at LuaJIT2. I would liken it to studying modern war, then wondering "why didn't they just use drone strikes at the Somme" :) Looking back from LuaJIT2, interpreters must seem really, really primitive.
Don't get me wrong, I think LuaJIT2's interpreter is great, but interpreters before LuaJIT2 weren't complete crap, either. Many emulators, for example, have very good interpreters written in assembly (some aim to be cycle-accurate).
title: "Converting Python Virtual Machine Code to C."
Probably the de facto fastest interpreter is the one in Sun's HotSpot Java virtual machine. There is a paper by Robert Griesemer detailing some of it, but AFAIR it is a hand-coded, optimized assembly interpreter that does not do too badly. LuaJIT's interpreter is doing quite well, too. (Mike Pall said in the LtU thread about trace-based compilation that the LuaJIT2 interpreter is mostly on par with the LuaJIT1 compiler, or at least not much slower.)
EDIT: replaced "LuaJIT1 interpreter" with more accurate compiler as pointed out by haberman.
The LuaJIT2 interpreter is on par with the LuaJIT1 JIT compiler. I don't believe LuaJIT1 had a separate interpreter -- it was just a JIT compiler that used the main Lua interpreter.
Even though the potential is not as big as for Java, Forth, and OCaml interpreters (where people frequently report 2x speedups), Python, for example, gains between 20 and 45%. But somebody already replied to a similar inquiry and said that ANSI C compatibility is more important than the increase in performance. (Python uses conditional compilation to get both.)
For example, for the pidigits benchmark:
The PHP version uses the GMP extension
And Ruby does it in pure Ruby
So basically it's a C vs Ruby benchmark.
Many people complained on the forum; take a look.
The Debian shootout is interesting, but don't take it as an absolute truth.
Please show where I said that, or admit that you are putting words into my mouth.
If you believe the shootout results can reliably show that one language is faster than another, say so. Otherwise, de facto you are saying you can't rely on the results.
Your words -- not mine.
I'm stopping this dance here. Feel free to make a claim about the reliability of the results, and whether you can use them to say that one language is faster than another. Then we can dance some more, but otherwise I'm out.
> Feel free to make a claim...
The only point of "this dance" was to help people understand that you are putting words into my mouth.
Please say what you mean by "a language is faster than another".
Please say what in The Java™ Language Specification tells us that "language is faster than another."
(That's the context of the conversation, see http://news.ycombinator.com/item?id=4851781)
That's a different claim.
>> It doesn't even make sense <<
That depends on what `notJim` meant by "to compare speeds of programming languages".
`notJim` might have thought that everyone would understand he was talking about particular language implementations and particular tasks and particular programs.
The pi-digits page also shows a PHP program which does not use the GMP extension.
Why do you think no one has contributed a Ruby program that uses GMP? Is that too hard to do using Ruby?
The benchmarks game is the absolute truth about the reported measurements.
Never got a chance to say thx, so THX!
Unlike trace trees, tracelets contain no control flow. A tracelet is just a type-specialized basic block. This means that the JIT output does not combinatorially explode when we encounter polymorphism. In many ways, our approach is the opposite of tracing; we JIT the first time we hit any code at all. And we look terrible on loopy microbenchmark code, but run FB's multi-million-LOC application very well.
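To give a concrete feel for the shape of that (a schematic sketch only; the names and representation here are invented, not HHVM's actual IR or output): a tracelet can be thought of as type guards at entry, followed by straight-line code specialized for exactly the types observed the first time through, with a side exit to a generic path when a guard fails.

    /* Schematic of a type-guarded, type-specialized basic block (illustrative C,
       not HHVM's actual representation). */
    #include <stdio.h>

    typedef enum { T_INT, T_DOUBLE } Type;
    typedef struct { Type type; union { long i; double d; } u; } Value;

    /* Slow path: handles whatever type combination shows up (abridged), or
       could kick off compilation of a new specialization for those types. */
    static Value generic_add(Value a, Value b) {
        double x = (a.type == T_INT) ? (double)a.u.i : a.u.d;
        double y = (b.type == T_INT) ? (double)b.u.i : b.u.d;
        Value c = { T_DOUBLE, { .d = x + y } };
        return c;
    }

    /* Specialization of "c = a + b", compiled the first time this code was hit,
       assuming the operand types observed then were (int, int). */
    static Value add_tracelet_int_int(Value a, Value b) {
        if (a.type != T_INT || b.type != T_INT)       /* guards at tracelet entry   */
            return generic_add(a, b);                 /* side exit on guard failure */
        Value c = { T_INT, { .i = a.u.i + b.u.i } };  /* straight-line body         */
        return c;
    }

    int main(void) {
        Value a = { T_INT, { .i = 40 } }, b = { T_INT, { .i = 2 } };
        printf("%ld\n", add_tracelet_int_int(a, b).u.i);   /* fast path: prints 42 */
        return 0;
    }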
So it avoids trace explosion, and I would describe this as a sort-of local (in the compiler sense of "basic block") version of a method compiler. Can you compare the approaches?
The systems that it has the most in common with are actually binary translators; e.g., VMware's software x86 hypervisor, or the Dynamo system. Those systems run basic block at a time to solve a number of problems; e.g., disambiguation of code and data. We're basic-block-at-a-time for a different reason: closure under type inference.
So I'm curious as to why a stack-based instruction set was used in the VM design?
Many of the advantages of register-based designs (ability to optimize by rewriting the program at the bytecode level, ability to write faster interpreters, ability to map bytecode registers to physical registers, etc.) weren't particularly attractive to us because we knew we were going to build an x64 JIT that did its own analysis and optimization to take advantage of type information observed at run time.
Thus, we drafted a stack-based design for HipHop bytecode. It captured PHP's semantics correctly and happened to fit in fairly well with PHP's evaluation order, so we ran with it and here we are.
Would using a register-based bytecode not have been useful for the x64 JIT?
1) A stack-based ISA is still more space-compact. (AFAIR the Shi et al. paper mentions something like a 40+% increase in space requirement for the instructions.)
2) The performance improvement of a register-based ISA is only visible for interpreters that suffer most from instruction dispatch. PHP is a rather complex programming language that is most certainly not bound by instruction dispatch at all. So I guess it could very well make sense to stick with a simple stack-based ISA, which incidentally is also easier to compile from the AST.
For the sake of completeness: a stack architecture emits many instructions just to push operands onto the operand stack. A register-based interpreter does not need those. Hence, the overall number of dispatches is lower, and if dispatch cost is your bottleneck, the overall performance increases. OTOH, you need more space, because in addition to the opcodes, you need to specify/encode which registers to take operands from and put results into. (Hint: quadruple code and the like.)
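To make that concrete, here is a hypothetical encoding of `$c = $a + $b` both ways, sketched as C byte arrays (the opcode names are invented for illustration):

    /* Hypothetical encodings of "c = a + b"; opcode names are invented. */

    /* Stack-based: four instructions, so four dispatches, but operands are
       implicit (everything goes through the operand stack). */
    enum { PUSH_LOCAL, ADD, STORE_LOCAL };
    static const unsigned char stack_code[] = {
        PUSH_LOCAL, 0,    /* push local #0 (a)     */
        PUSH_LOCAL, 1,    /* push local #1 (b)     */
        ADD,              /* pop two, push the sum */
        STORE_LOCAL, 2    /* pop into local #2 (c) */
    };

    /* Register-based: one instruction, so one dispatch, but it must encode all
       three operands explicitly (quadruple-style); over a whole program those
       explicit operands are where the extra space comes from. */
    enum { ADD_RRR = 0x10 };
    static const unsigned char reg_code[] = {
        ADD_RRR, 2, 0, 1  /* r2 = r0 + r1, i.e. c = a + b */
    };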
2) But this is exactly my point. Zend PHP uses a register-based bytecode. The HipHop JIT could choose anything - why choose stack-based - it's not only different, but also worse! If I had to guess, I'd say it was chosen to make the HipHop interpreter easier to implement, and then it was legacy code when creating the JIT, so it was easier to keep it than to replace it.
Pretty sure V8 doesn't, as it has no interpreter at all: it has a fast JIT and a slow JIT, but it only ever runs native code.
I think a trace JIT still gives you a lot of bang for the buck and is (in theory at least) easier to implement. Two known projects using trace compilation are LuaJIT2 (usually well known) and the Dalvik VM's JIT compiler (not so well known; you needed to watch the Google I/O 2010 announcement to hear about it.)
IonMonkey is also a JIT. However, instead of just doing translations directly from bytecode to native code, it builds up an SSA-based IR, does some optimizations (last I checked, loop-invariant code motion, range analysis, global value numbering, on-stack replacement to be able to jump to JITted code in the middle of a hot loop, and some other small optimizations). It then does register allocation using LSRA. It takes advantage of type inference to aggressively type-specialize these optimizations.
This is all more expensive than just blatting out native code, but the code performs better.
In IonMonkey-enabled Firefox (Firefox 18+), the compilation strategy is to interpret, then use JM on hot functions. Particularly hot functions will get IM compiled.
This is similar to what "crankshaft", v8's version, does. The compilation strategy is quite similar, but v8 has no interpreter--it has two compilations: a fast, non-optimized one and a slow, optimized one.
Trace-based compilation is...I would say... more complicated than a traditional compiler approach. It is certainly differently-complicated than the usual compiler problem. Recording traces requires some relatively complicated state and significantly muddles up the interpreter. Removing the tracer from the Mozilla code base made things substantially cleaner--the tracer was holding back the method-based JITs.
Fortunately for Dalvik, most applications actually spend most of their time in native code.
The paper (behind ACM paywall): http://dl.acm.org/citation.cfm?doid=2388936.2388956
Out of curiosity, if there's anyone here involved with HipHop at Facebook: has there ever been a discussion about just shifting from PHP to a more performant language, or is it a case of still reaping the benefits of PHP in terms of dropping a developer in and not worrying about skill sets?
We already shift lots of things out of PHP into more performant languages (C++, Java, etc.) when we are building services and/or extensions and/or other computationally intensive things.
But I think more to your point (and if not, let me indulge in the strawman), since this comes up frequently on reddit/hn: php is not somehow -uniquely- broken. While PHP might be uniquely broken as a language (ha ha), it's not -uniquely- broken as a platform/runtime. Everything you think is performant is broken at a large enough scale. Put another way, it's not just about the switching costs of rewriting in some other language. If we could magically snap our fingers and convert the entire codebase from php to some other language, that's just the beginning of understanding, tweaking, and occasionally rebuilding everything about the runtime/libraries/etc to make it work.
At the end of the day the language itself is given way too much attention in discussions like this.
From my experience dealing with PHP performance warts: you can make it somewhat fast, but man, the insides are so messed up that every time I get into the core and try to do anything I want to rip my hair out.
That's a big part of my point, but Facebook seems to have invested a lot in the HipHop compiler and VM as a way to tackle the rough spots of using PHP. All languages are rough in places in their own way, but I was more curious about whether anyone has gone "this is great, but maybe we're addressing the symptoms and not the cause". But you're right about the total investment being way, way larger than just a code-base rewrite.
Interesting to learn that you do use multiple languages for different things. So much gets made of HipHop and Facebook's use of PHP that I didn't think to guess that different levels of the service might be getting built with different tools, at least not at scale.
They let people upload comments and pictures. Anything fancy is done mostly in the client or am I overlooking something?
Fixed it for you. Scale makes simple things difficult.
HipHop alone should get Facebook's development team praise.
The problem is that the interpreter is shite.
We just normalized HPHPc performance to make the graph easier to read. HPHPc actually got considerably faster over the same period as well; since both systems share a runtime, many changes helped both.
HPHPc probably got about 20% faster over 2012, even though nobody was actively working on it, mostly through happy side effects of work that was directed at HHVM.
As to whether it would have been possible to get better performance on HPHPc, I'll let people with more knowledge answer.
The benefit of having a single runtime for development and production (and production-like staging environments) is huge, though. There may also be benefits in the push process, which, while already fast compared to most other companies', will continue to benefit from increased speed.
PyPy is taking a radically different approach from what HHVM is doing (and really from what almost all other dynamic language systems are doing), and it's a fascinating system. Part of what's exciting about it is that it seems like it should be applicable to other languages with less effort than most other JITs, and we wanted to understand its potential for a language like PHP. We asked Maciej Fijalkowski (hi, Maciej, if you're reading!) to help us do a research prototype to see what the first few roadblocks would look like, and Maciej did a great job. Just because we didn't scrap our current project and shift all our resources to PyPy should not be seen as a negative reflection on PyPy at all.
CentOS 6.3 x64: https://github.com/facebook/hiphop-php/wiki/Building-and-ins...
Ubuntu 12.04 x64: https://github.com/facebook/hiphop-php/wiki/Building-and-ins...
Also wondering what APC methods you support.
Edit: notably not short array syntax, at least yet.
The PHP development environment at Facebook is unlike that found at any other company I'm aware of, and common PHP pitfalls and legacy code are actively "linted" out of the system (as an automated part of the code review process). It still wouldn't be my choice of language to use outside of Facebook, but I have no issue working on the PHP code base here (although I rarely venture into it).
Over time, much logic and processing has been offloaded to backend services, with PHP increasingly playing a (largely parallelised) dispatch, post-process, and render role. That will likely continue.
With all that said, it is questionable whether moving off of it would be a worthwhile investment - which language/environment would you think a good target would be, and why?
I'd imagine people signing up to work at FB do so despite the fact they use PHP, not because of it.
In my experience, the top-tier developers enjoy challenges and solving problems, and languages are just tools they use to defeat those challenges and solve those problems. Sometimes it's a language they really enjoy using. Sometimes it's a situation where that language isn't a suitable one.
I imagine it's neither "despite" nor "because" of PHP use; any top-tier developer is going to use the tools available to him/her, rather than pick an employer in order to use a specific set of tools. Lower-quality developers may do so, but that's probably because they're less likely to know the right tool for the job (eg someone may hate functional programming and avoid jobs that use it, but that's because they don't know when FP is the ideal way to solve the problem)
But of course, anyone who's worked on production systems knows that all the mathematical ideals in the world don't stop bugfixes from becoming a giant hairy mess. It's more your culture, skill, and process that allow or prevent that from getting refactored into a better solution, and that determine how long it takes. Any language can be part of a clean or messy codebase, although some tend to skew towards one side or another - but I'd wager that's as much a factor of barrier to entry as anything else.
Actual top-tier developers don't fret much about it. Complaining about some language's warts is mostly for the bike-shedding and blogging types. PG, who writes about blub languages and who-is-cool lists, used Perl at Viaweb too (which is as inelegant a mess as PHP is, but at the time was good for web work).
Top tier developers get shit done in whatever language. And most of them rarely ever comment or blog.
Second, what evidence do you base your statement "hiring talent at FB has to be getting harder" on? If you're talking about market cap / upside, Google has had no problems hiring (well, relative to other tech companies, anyway), despite a market cap 3.5x+ that of FB today.
Lastly, FB is not all PHP. A lot of services are written in C++ or Java and use Thrift to talk to the frontend. Thrift allows a more gradual movement off PHP, by moving isolated components to other languages individually.