
When pigs fly: optimising bytecode interpreters - lelf
https://badootech.badoo.com/when-pigs-fly-optimising-bytecode-interpreters-f64fb6bfa20f
======
0xcde4c3db
Although the article is framed around language implementation, the underlying
principle is also valid for interpreting CPU instruction sets. I read quite a
few years ago that Apple at one point abandoned a particular 68K->PowerPC JIT
because an optimized interpreter was faster at running real-world applications
for cache-related reasons (keep in mind that this was the early 90s, when
mainstream L1 caches were 4/8/16KB and L2 cache was off-chip).

~~~
mehrdadn
There were JITs back then?!

~~~
blattimwind
JIT in today's sense is _at least_ 50 years old (translating regexes to
machine code at runtime).

~~~
mehrdadn
Damn!

------
kwindla
Lovely article!

Following up on the section about threaded code, Andrew W. Appel's book
_Compiling With Continuations_ really blew my mind and changed how I think
about the interconnections between compilation, optimization, and language
design. There are many ways into that deep set of ideas; I recommend the Appel
book as one of them!

      https://books.google.com/books/about/Compiling_with_Continuations.html?id=0Uoecu9ju4AC

And, of course, if there's an influential meme in CS, there's a Haskell paper
with a clever title that references it. :-)

      https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/compiling-without-continuations.pdf

~~~
toolslive
You should, however, also read the reply paper: [https://www.microsoft.com/en-us/research/wp-content/uploads/...](https://www.microsoft.com/en-us/research/wp-content/uploads/2007/10/compilingwithcontinuationscontinued.pdf?from=https%3A%2F%2Fresearch.microsoft.com%2F%7Eakenn%2Fsml%2FCompilingWithContinuationsContinued.pdf)

with an even lovelier title.

~~~
naasking
And yet another: [https://www.cs.purdue.edu/homes/rompf/papers/cong-preprint20...](https://www.cs.purdue.edu/homes/rompf/papers/cong-preprint201811.pdf)

------
blattimwind
Excellent article.

> Everyone knows that pigs can’t fly — just like everyone thinks they know
> that bytecode interpreters, as a technology for executing high-level
> languages, can’t be sped up without resorting to labour-intensive dynamic
> compilation.

[citation needed]

PHP (in recent versions) is a good example of a bytecode VM that's quite
quick. So we already have a spectrum that probably covers an order of
magnitude or so, just in the basic speed of bytecode execution.

~~~
willvarfar
Some interpreters are faster than others. But let's list some interpreters
alongside their (usually JIT) compiled equivalents:

    
    
        cpython vs pypy
        lua vs luajit
        dalvik vs art (which isn't jit)
        zend vs hhvm
    

Okay, that last one: PHP 7 is reckoned to be faster than HHVM on some sites.

My gut feeling, though, is that this is because of inefficiencies in PHP, many
inherent in the language design, rather than because the PHP 7 developers were
the only people who knew how to write a fast interpreter.

It seems a sound generalisation that you get another order of magnitude by
moving from interpretation to compilation.

~~~
tyingq
I think it's pretty impressive that PHP 7 was roughly on par with HHVM
performance-wise, yet a lot more backwards-compatible, and without FB-level
funding and resources.

They've also been able to iterate: subsequent 7.x releases had notable
improvements over prior ones:
[https://www.phoronix.com/scan.php?page=news_item&px=PHP-7.3-...](https://www.phoronix.com/scan.php?page=news_item&px=PHP-7.3-Performance-Benchmarks)

~~~
yellowapple
I wonder how that parity might change now that HHVM is dropping support for
PHP entirely and only supporting Hack?

~~~
tyingq
Not sure how relevant it will be for anyone but FB. As far as I know, everyone
that went to HHVM went back to PHP.

------
mratsim
I'm surprised their threaded interpreter was slower than the switch-based
one; this doesn't match my own benchmarks, even on post-Haswell CPUs.

Also, this is not true on non-x86 hardware, and I unfortunately didn't test
on AMD.

The article should at least cite this paper regarding Haswell branch
prediction:
[https://hal.inria.fr/hal-01100647/document](https://hal.inria.fr/hal-01100647/document)
"Branch Prediction and the Performance of Interpreters - Don’t Trust
Folklore", 2015, Rohou et al.

Also I'm leaving my interpreter optimization resources here with decades of
research papers and highlights of the techniques I found most interesting when
implementing a fast VM: [https://github.com/status-im/nimbus/wiki/Interpreter-optimiz...](https://github.com/status-im/nimbus/wiki/Interpreter-optimization-resources)

~~~
vkazanov
Yeah, it's a _very_ good collection you've got there! Some of the papers I've
read, others are in the queue.

I was also surprised to see the switch-based solution winning here. But I was
even more surprised to see how a simple change (primitive stack-top caching)
suggested by one of the readers radically changed my perf benchmarks: the
threaded interpreter was the fastest interpreter again.

------
kayamon
If you really want to speed up your bytecode interpreter, switch to a
register-based VM instead of a stack-based one.

------
Animats
_"it is entirely possible to speed up the work of such interpreters by a
factor of at least 1½."_

Story of CPython.

