

A rant about PHP compilers in general and HipHop in particular. - pbiggar
http://blog.paulbiggar.com/archive/a-rant-about-php-compilers-in-general-and-hiphop-in-particular/

======
bensummers
Aside from the interesting technical details, the interesting part of this
article is how much difference a brand name makes. He had difficulty getting
people interested in PHP compilers, then Facebook announces one, and it's the
new hotness.

This is a problem many startups have. Because people use brands as a shortcut
to determine whether something is of sufficient quality to be worth taking the
time to investigate, being a new unknown is a bit of a hurdle to overcome.

~~~
jcoby
it's not really so much a brand name as just getting the concept out there.
php developers don't realize that there are options when it comes to the
runtime.

i would guess that the majority of them wouldn't even know to look for phpc or
any alternative. there just wasn't any news about anything like HPHP before
HPHP itself. if you want to run php, you go to php.net or install it from a
package and go from there.

combine the above with very little actual need, very little public information
on how to build a large site, and a lack of maintainer interest, and phpc just
won't get major traction.

I would have loved to have found phpc several years ago when i was still
working on a large site that had performance problems.

~~~
robryan
It wouldn't really surprise me if a lot of the people freelancing in PHP
wouldn't know enough about the interpreter to see any advantage in compiled
code.

~~~
jcoby
a freelancer wouldn't know and if they did they wouldn't care. (i am a
freelancer.)

we're talking about websites that are in the top 5% of all sites (by traffic)
that would even need to think about using something like phpc or hphp. the
number of those that would be willing to use a freelancer to handle their
scaling issues is probably in the single digits at best.

~~~
pbiggar
On a tangent, I wonder why lots of people call it something other than "phc".
I said phc a lot, it's on the website as phc, but I've heard phpc,
phpcompiler, and tons of variations. I'm curious why.

------
cpr
I love this kind of meaty thinkpiece from someone who clearly knows what he's
talking about. So rare.

I wonder if LLVM isn't getting mature enough. Yes, the Unladen Swallow folks
are having trouble with it, but it sounds like the most important troubles are
related to dealing with Python runtime realities, so those wouldn't
necessarily apply to another runtime system.

~~~
pbiggar
It's amazing how close all of the scripting language run-times are, in terms
of design and implementation. I think using LLVM with PHP would be worse,
since the PHP interpreter is so badly written.

I expect that LLVM will be mature real soon now.

------
raganwald
There were a few really interesting technical observations in this that made
it worthwhile from my perspective.

------
jrockway
_PHP doesn’t really need a JIT. Server side programs in PHP don’t do a great
deal of dynamic stuff, and it would be incredibly rare to load some random
code at run-time, so a JIT wouldn’t be all that useful._

Uh, what? JIT compilers even improve the performance of _C_. Loading code at
runtime and having a JIT are totally orthogonal.

If you're a language like PHP where you can't guess if a value is going to be
a hash or the number 42 until runtime, JIT techniques are essential to get any
reasonable performance. (Ah, here comes the argument, "PHP apps are never CPU-
bound.")

It is odd that the article claims that JITs for dynamic languages are never
successful; there are V8 and TraceMonkey, and Parrot optimizes Perl 6 pretty
well. Java is fast because of its JIT. The author claims that LLVM's JIT
compilation doesn't work, but that's untrue; LLVM's JIT works just fine for C.
LLVM's C runs faster than GCC's statically-compiled C, anyway.

~~~
ori_b
LLVM isn't used as a JIT for C in any existing compilers (clang, llvm-gcc,
etc.) as far as I know; it's used to emit static object code. (If I'm wrong,
of course, I'd appreciate a pointer to the projects.)

~~~
jrockway
Dunno what the defaults are, but I'm fairly sure I've compiled C applications
with JIT enabled.

~~~
scott_s
I'm not sure what that would even mean. JIT is an optimization for virtual
machines. It compiles VM code (such as Java bytecode) or interpreted code to
machine code at runtime. C already compiles to machine code.

In order to do anything resembling JIT, you would have to insert a runtime
where one does not already exist.

~~~
jrockway
LLVM's compiler suite compiles C to LLVM bitcode, and the LLVM (low-level
virtual machine) runs that code.

~~~
foldr
I don't think LLVM provides any such virtual machine, despite its name. It has
a code generator, but no VM as such.

~~~
maximilian
I'm almost positive that LLVM provides both. You always target LLVM bytecode
with your frontend, but then you can choose either to:

- run the bytecode using LLVM, with or without the JIT.

- compile the LLVM bytecode into x86 machine code.

This is what clang does (afaik) and is a reasonably nice use case, because
you write a static language that targets LLVM bytecode and end up with a
_fast_ program, because LLVM will output optimized x86 binaries.

However, I'm pretty sure that efforts like Unladen Swallow attempt to use
LLVM's virtual machine in the hope that it will become very fast as they
optimize it (especially the JIT part).

~~~
nostrademons
Yeah, you just instantiate an ExecutionEngine and feed it your LLVM bytecode,
and it'll give you back C function pointers that you can call. The
ExecutionEngine is either a JIT (which'll compile your bytecode into native
code and give you that back) or it's an interpreter (where the function
pointer you get back is an entry point into the interpreter).
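
For the record, a hedged sketch of that flow, using LLVM's C++ API of
roughly that era (2.x); headers and exact signatures have moved around
between LLVM versions:

    #include <vector>
    #include "llvm/LLVMContext.h"
    #include "llvm/Module.h"
    #include "llvm/DerivedTypes.h"
    #include "llvm/ExecutionEngine/ExecutionEngine.h"
    #include "llvm/ExecutionEngine/JIT.h"  // linking this in enables the JIT
    #include "llvm/Support/IRBuilder.h"
    #include "llvm/Target/TargetSelect.h"

    using namespace llvm;

    int main() {
        InitializeNativeTarget();
        LLVMContext ctx;
        Module *mod = new Module("demo", ctx);

        // Build IR for: int add(int a, int b) { return a + b; }
        std::vector<const Type*> ints(2, Type::getInt32Ty(ctx));
        FunctionType *fty = FunctionType::get(Type::getInt32Ty(ctx), ints, false);
        Function *f = Function::Create(fty, Function::ExternalLinkage, "add", mod);
        IRBuilder<> b(BasicBlock::Create(ctx, "entry", f));
        Function::arg_iterator args = f->arg_begin();
        Value *x = &*args; ++args;
        Value *y = &*args;
        b.CreateRet(b.CreateAdd(x, y));

        // The ExecutionEngine JIT-compiles on demand and hands back a raw
        // function pointer, callable like any C function.
        ExecutionEngine *ee = EngineBuilder(mod).create();
        int (*add)(int, int) = (int (*)(int, int)) ee->getPointerToFunction(f);
        return add(2, 3) == 5 ? 0 : 1;
    }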

------
DrJokepu
I believe pretty much anything Paul Biggar has to say about dynamic language
interpreters / compilers. His talk at StackOverflow DevDays London about
dynamic languages was a great eye-opener for me and a fantastic experience.

~~~
barrkel
I was at the same talk, but I got the impression that because his perspective
comes from trying to do static analysis and optimization, he doesn't fully
appreciate some of the capabilities of dynamic languages. The overall effect
was a static typer missing the point.

But as he says in this post, PHP code doesn't necessarily take advantage of
these capabilities in practice.

~~~
pbiggar
Yeah, PHP isn't Python or Ruby, which are properly dynamic. I think those
languages are too far gone in their particular dynamic direction to be saved.
But I'm convinced you can build a language that feels dynamic, but whose meta-
programming is largely compile-time. Even the funky Python I see (metaclasses
and decorators) could largely be compiled if the language rules were slightly
tweaked.

I'm interested: which capabilities don't I appreciate? (BTW, my background is
very much being a static-typer, missing the point, and slowly realizing it).

~~~
barrkel
First off, I'm pretty much a static typer myself, so I'm not necessarily the
best advocate for the dynamic case. But by the same token, I can at least
speak to some of the problems of static typing.

First up: I'm talking about object-oriented approaches, with polymorphism and
dynamic dispatch being the core abstraction tool.

If you try to introduce more static typing into a system with polymorphism,
you're trying to push type decorations a little further, so that they follow
where the expressions go and you can reason about more of the program.
Parametric polymorphism, or generic types and methods (henceforth called
generics to avoid confusion with OO polymorphism), is a key tool. Generics
let you better reason about values whose types cross method boundaries. With
generic methods, you can usually infer the type parameters. With generic
types, the type arguments can serve as extra documentation. A Java
ArrayList<E> seems clearly superior to a raw ArrayList at first glance.

But all sorts of problems start coming up when you mix polymorphism with
generics. Type inference isn't so simple anymore, as ad-hoc choices need to
be made. Instantiations of generic types have no necessary subtyping
relationship, so you have to reintroduce dynamism through the back door via
base classes or interfaces that use a top type (java.lang.Object or whatever)
to re-unify types; or you dabble in variance. And because you want to do more
with arguments whose types are type parameters, you bring in type parameter
constraints, or type classes, or other means of generically handling values.

So perhaps you have use-site variance or declaration-site variance combined
with type classes combined with polymorphism. Suddenly using or declaring
generic types has gotten quite a bit harder. Getting type information to flow
into the right places starts requiring confusing tricks, particularly when
you want the methods of an ancestor class to return a type related to the
concrete instantiation type, something like this in C# syntax:

        // so you want a GetThis method that returns a correctly
        // typed thing for the current instance
        class Base<T> where T: Base<T> { public T GetThis() { ??? } }
        class Desc: Base<Desc> {}
    

When the static typing advocate has tangled himself up in a knot like this (or
deeper, e.g. when you're passing other type arguments around across multiple
classes and hierarchies, just to get the damn types right), the dynamic typer
shakes his head, and says "don't you see - it's all just objects!".
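
(The canonical "resolution" of the GetThis puzzle only underlines this. In
C++ clothing the same F-bounded pattern is CRTP, and the only way out is an
unchecked cast the compiler must take on faith -- a sketch, not anyone's
production code:

    template <typename T>
    struct Base {
        // Trusts that T really is the dynamic type of *this; nothing stops
        // "struct Evil : Base<Desc> {}", where this cast would be wrong.
        T& getThis() { return static_cast<T&>(*this); }
    };

    struct Desc : Base<Desc> {};

    int main() {
        Desc d;
        Desc& same = d.getThis();  // statically typed as Desc&, no cast here
        (void)same;
        return 0;
    }

The type system has to be bribed with a runtime-unverified cast just to
express "the type of the current instance".)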

One of the paradoxes is that declaratively specifying, at compile-time, all
the types that may reach different parts of your program at runtime, is harder
than writing the program to build the object graph at runtime in the first
place, and treating the occasional runtime type error as a bug that will be
eliminated.

Now to the other side of the story, meta-programming. I agree that you can do
lots of meta-programming at compile-time, but there are two problems, as I see
it. One: compile-time is too early to be making decisions. What do you do when
you can't afford to restart the process, but you still need to change the
running code and how it works? This isn't a theoretical concern; I've
architected such a server system in the past. It could keep on handling
requests relating to client sessions associated with a previous version of
the server-side behaviour, but new client sessions would use the latest
version of the server-side behaviour. Rolling over the server farm from one
version to the next didn't require tricks with load balancers and restarting
app servers, etc.; it was all built into the system.

(Actually, the architecture was really interesting, and I should write it up
in more detail one day. Another key fallout of this approach was that the
session data, which was (usually) stored on disk and was a bit like a Lisp or
Smalltalk image only typically 20K in size, had a pointer (a URL) inside it
telling the server where to get the behaviour for this session. This meant
that problem sessions could be post-mortemed quite effectively: the old
session could be resurrected, stepped through, etc.)

And the second problem of compile-time metaprogramming: the formalisms you
use to munge the types, particularly static types from which the compiler
will later infer further things, may in practice be quite different from the
imperative constructs programmers are used to using to implement behaviour.
When they want to add a method to a class, they want to do it using the same
tools (or close to them) that they'd use for adding an item to a list or hash
table.

Perhaps the meta-programming is implemented as a plugin to the compiler, or a
macro system, where the code is modifying the compiler's own definitions of
the types etc. But now we've introduced an abstraction level into the chain
of source code -> executable text that most programmers are not familiar
with, and may be intellectually ill-equipped to deal with - not because
they're stupid or incompetent, but rather because they simply don't need to
know. Consider e.g. hygienic macros and the problem of which symbol refers to
which level of abstraction, the compilation level or the runtime level; the
distinction between the levels you're at when you're quoting and unquoting,
in Scheme terms. Consider the confusion of this poor fellow whom I helped
out:

http://stackoverflow.com/questions/326321/how-do-i-create-this-expression-tree-in-c

When you consider the tangled mess of static typing you can get into when you
mix polymorphism with generics, and then add in the ability to munge code and
types such that the compiler sees the modified code / types, hopefully you
can see that it's easy to add more complexity than the benefit you get out.

------
arantius
> I’ve heard the argument "you don’t need a compiler, since PHP is rarely the
> bottleneck" for many years. I think its complete bollox. ...

> Unless your PHP server is sitting there idling (which is probably the case
> for many PHP servers out there) ...

Don't these two statements (abbreviated in my quotes above, but directly one
after the other as quoted) directly contradict each other? My personal
experience is also that the CPU is never the bottleneck in (well-written) PHP
apps. You're almost always waiting for the DB or the network/IO latency of
something else, like memcache.

In fact, I know at my last job (handling ~100s of millions of dynamic PHP
requests/day) that PHP's gzip compression was turned on specifically because
we had gobs of spare CPU, so using it to save bandwidth was a win both for
speed and for the bill.

~~~
barrkel
Here we go again.

If you have lots of spare CPU on machines in a web server farm, you have too
many machines. Even though I/O is the bottleneck for any single request's
response time, CPU is generally the bottleneck on total load.

~~~
jcoby
that's a double-edged sword though. if you don't have enough spare cpu, you
won't be able to handle any sort of spike in usage, say from an advertising
campaign that works a little too well, or if a server goes down.

and once you start maxing out boxes, the load hits even harder, since the
site loads slower. you very quickly run out of resources, and suddenly you're
SOL and everybody's mad that the site won't load.

honestly, it's hard to point at any one source of problems when scaling a
site. cpu, ram, and i/o are all equally problematic, and if you under-plan
any one of those it can take your site down into a spiral of doom.

you really must allocate your resources for peak load, not for average or
nominal load, and then allow for some failures.

~~~
barrkel
Sure. But under max load, that gzip will be adding to your problems.

And of course, graceful failure is an art in itself. You have to start turning
down requests before you actually hit peak capacity, or you risk getting stuck
in a thrashing pit.
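
The admission-control part can be as simple as this sketch (my illustration,
with a made-up threshold): refuse new work once in-flight requests cross a
line set below measured capacity, so the box sheds load instead of entering
the regime where every request slows down every other one:

    #include <atomic>

    std::atomic<int> in_flight(0);
    const int kShedThreshold = 800;  // hypothetical: below measured capacity

    bool try_admit() {
        if (in_flight.fetch_add(1) >= kShedThreshold) {
            in_flight.fetch_sub(1);
            return false;            // turn the request away early (e.g. a 503)
        }
        return true;
    }

    void on_request_done() { in_flight.fetch_sub(1); }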

------
maxklein
The guy seems petty. The name of the tool is hphp; if you drop in an i and an
o, you have hiphop, which is a perfectly fine name considering Django and all
the other music-named tools. It is my belief that the dislike of the name
comes from a dislike of the type of music.

And rewriting a negative comment on your blog? That's just not done. Delete
it, but don't rewrite it ever.

~~~
pbiggar
> And rewriting a negative comment on your blog? That's just not done. Delete
> it, but don't rewrite it ever.

Why not? It's my blog. It's not like I hid it. Ultimately, to cultivate a good
community, you have to have strict rules. Now I'm not exactly curating a
massive community here, but it seems like an effective way of discouraging
the kind of dickheadedness that was his original comment.

~~~
maxklein
It's your blog, but it's not your comment.

~~~
pbiggar
That's an interesting perspective. Does the poster "own" the comment on my
blog? Do I? Since he has no control over it, is it mine? Is it still his if he
posted anonymously? Does the fact that he put my email instead of his in the
email box hand control over to me?

All interesting philosophical questions. However, since I have editorial
control, and determined that he was acting anti-socially, it seems perfectly
fine to me to scribble over it (though I could see a problem if I wasn't
transparent about it).

~~~
jemfinch
People own their own creations. The poster owns (the copyright on) his
comment even if he posts it to your blog, unless you've made an agreement
otherwise (e.g., in your blog's terms of service).

You control what your blog displays, though. There is probably a technical
legal difference between deleting a comment (definitely right and legally
permissible) and modifying a comment (thus creating a derivative work of a
copyrighted work, which you may not be licensed to do).

------
nraynaud
I think this person is not really informed about JITs.

Knowing the code at the beginning of the run doesn't tell you its runtime
profile at all, and that profile is something ahead-of-time LLVM code
generation can't be aware of.

I think the people who saw only the suits in front of Java and C# didn't see
what happened on the JVM/CLR side, where a lot of things were shaken up.

~~~
pbiggar
I'll admit I haven't ever written a JIT, but I wouldn't say I'm not informed
about them. It's a question of tradeoffs: JITs can help you get good run-time
performance if your types are static for some large portion of run-time, but
not really known at compile-time. I would allege that in PHP those types are
in fact available at compile-time. I think my research showed I could prove a
type for about 60% of variables/fields/etc., and tie it down to {int,float}
or {object,null} in most other cases. I guess I forgot to put that in.
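
(To illustrate what "tie it down to {int,float}" means -- a minimal sketch in
my own notation, not phc's actual representation: the analysis tracks a set
of possible runtime types per variable, and joins the sets where control flow
merges.)

    #include <bitset>

    enum TypeBit { TInt, TFloat, TString, TObject, TNull, NumTypes };
    typedef std::bitset<NumTypes> TypeSet;

    // Merge the facts arriving on two incoming control-flow edges.
    TypeSet join(TypeSet a, TypeSet b) { return a | b; }

    // A variable with exactly one possible type can be handled statically,
    // no JIT required.
    bool is_monomorphic(TypeSet t) { return t.count() == 1; }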

~~~
nraynaud
You are speaking about monomorphic specialization. In JVM benchmarks, around
80% of code is specialized (the concrete type is guessed correctly);
moreover, the JVM keeps the code for various specializations around if it
needs to. And the good part is that it doesn't do any of this work when it's
useless.

The barrier to entry for a JIT is steep, but you already spent four years on
your static compiler, so basically you had that time to make/reuse/specialize
a JIT interpreter that would just be a drop-in replacement.

~~~
pbiggar
I would understand the term "monomorphic specialization" to mean "can be many
dynamic types, but in practice only has one dynamic type". What I'm implying
is that the vast majority of values only have one static type, if your static
analysis is good enough.

The question of "which is better" wasn't really a factor in my decision to
build a static compiler instead of a JIT. Building a JIT as a solo effort for
a complex language like PHP would probably have prevented me from doing
enough research to get a PhD. More importantly, I wasn't interested in JITs;
I was interested in static analysis and (static) optimization.

------
lanstein
> The PHP interpreter is also quite memory hungry, as interpreters go. Any PHP
> value in your program uses 68 bytes of overhead [6]. An array of a million
> values takes over 68 MB.

Well, that is obscene.
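
For a sense of where overhead on that scale comes from, here is a
hypothetical boxed representation (my illustration, not PHP's actual
layout): every array slot is a refcounted box hanging off a hash bucket, and
PHP arrays are ordered hashes, hence the extra links.

    #include <cstdio>
    #include <cstdint>

    struct Box {                       // one dynamically typed value
        union { int64_t i; double d; void *ptr; } payload;
        uint32_t refcount;
        uint8_t  type;
        uint8_t  is_ref;
    };                                 // ~16 bytes with padding

    struct Bucket {                    // one hash-table slot owning that value
        uint64_t hash;
        Box     *value;
        char    *key;
        Bucket  *chain_next;           // collision chain
        Bucket  *order_next;           // insertion-order list
        Bucket  *order_prev;
    };                                 // ~48 bytes on 64-bit

    int main() {
        std::printf("~%zu bytes per element, before allocator headers\n",
                    sizeof(Box) + sizeof(Bucket));
        return 0;
    }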

~~~
Erwin
Python (and probably every other dynamically typed language that doesn't use
the tagged-pointers trick) has the same issue, even worse on 64-bit.
Allocating [sys.maxint-1 for x in xrange(1000 * 1000)] takes up 36 megabytes
of memory.

Allocating dict((x, float(sys.maxint-1)) for x in xrange(1000 * 1000)), which
is closer to PHP's combined arrays-dictionaries and values which I believe
are always floats (unless I'm thinking of Perl or Javascript?), takes up 97
megabytes.

Of course, if you really need 1 million floating point values that are not
going to be None and that are contiguously indexed 0 to 999999, you could do
array.array('d', (float(sys.maxint-1) for x in xrange(1000 * 1000))), which
has no overhead.

But for a random collection of objects with attributes and lists and
dictionaries, the overhead exists. I've got 15 gigabytes of Python objects on
this box across a couple dozen processes, and sometimes I dream of a time
where every bit was accounted for, where I could trace every bit back to
where it was set and why.

As for the "tagged pointers" trick -- basically taking advantage of the fact
that pointers to objects always have their bottom bits set to 0, and saying
that if they are set to 01, for example, the top 32 or 62 bits are an
immediate integer -- I think someone tried it out in Python a few years ago,
and the result was that it was no win overall, though it might have decreased
memory usage occasionally.
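
The trick itself fits in a few lines; a self-contained sketch (assuming the
usual arithmetic right shift on signed values):

    #include <cstdint>
    #include <cassert>

    typedef uintptr_t value_t;

    // Aligned object pointers have their low bits clear, so a low bit of 1
    // can mark the rest of the word as an immediate integer.
    inline bool     is_int(value_t v)    { return v & 1; }
    inline value_t  box_int(intptr_t i)  { return ((uintptr_t)i << 1) | 1; }
    inline intptr_t unbox_int(value_t v) { return (intptr_t)v >> 1; }
    inline void    *as_object(value_t v) { return (void *)v; }  // tag bits 0

    int main() {
        value_t v = box_int(42);
        assert(is_int(v) && unbox_int(v) == 42);  // no heap allocation needed
        return 0;
    }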

~~~
lanstein
I was thinking more about the overhead of building arrays of small strings.
Depending on how you measure memory usage, Python is 3-4x more efficient at
building lists of one-byte strings than PHP is wrt memory usage.

------
tlrobinson
Is there anything particularly novel about HPHP that would be patentable?

~~~
pbiggar
I don't know. I don't think there is enough information released about it.

