
Implementing a fast interpreter without resorting to assembly - mad
http://www.emulators.com/docs/nx25_nostradamus.htm
======
fexl
Brilliant. Fascinating how he slides from switch dispatching to function
pointers, deals with the problems there, then does that thing where each
handler returns a pointer to the next handler. But that's just skimming the
surface of the brilliance there. The intricacies of his observations have a
sparkling beauty.
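
Roughly, the progression he walks through looks like this -- a minimal sketch
with made-up opcodes, not his actual code:

    /* Toy 3-opcode VM; names and layout are illustrative only. */
    enum { OP_INC, OP_DEC, OP_HALT };
    typedef struct { int acc; const unsigned char *pc; } VM;

    /* Stage 1: classic switch dispatch -- one central branch per step. */
    int run_switch(VM *vm) {
        for (;;) {
            switch (*vm->pc++) {
            case OP_INC:  vm->acc++; break;
            case OP_DEC:  vm->acc--; break;
            default:      return vm->acc;   /* OP_HALT */
            }
        }
    }

    /* Stages 2 and 3: a table of handler function pointers, where each
       handler decodes the next opcode and returns the next handler. The
       void* casts work around C's inability to declare a function pointer
       type returning its own type (common practice, not strict ISO C). */
    typedef void *(*Handler)(VM *vm);
    static void *op_inc(VM *vm), *op_dec(VM *vm), *op_halt(VM *vm);
    static const Handler table[] = { op_inc, op_dec, op_halt };

    static void *op_inc(VM *vm)  { vm->acc++; return (void *)table[*vm->pc++]; }
    static void *op_dec(VM *vm)  { vm->acc--; return (void *)table[*vm->pc++]; }
    static void *op_halt(VM *vm) { (void)vm; return NULL; }

    int run_chained(VM *vm) {
        Handler h = table[*vm->pc++];
        while (h)                  /* each call returns the next handler */
            h = (Handler)h(vm);
        return vm->acc;
    }

The payoff is that dispatch no longer funnels through one central,
hard-to-predict branch.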

~~~
kabdib
I've worked with him. He's a smart guy. :-)

------
haberman
> and thus debunk one of the greatest virtual machine myths of all - the blind
> faith in the use of jitting as a performance accelerator.

It's not blind faith; it's demonstrated in pretty much every situation where
an interpreter and a JIT are both available for the same language. All of the
fastest VMs out there today (V8, LuaJIT, the JVM, etc.) are based on a JIT.

> The performance level made possible by the Nostradamus Distributor also
> means that [interpreter-based] Bochs could, purely interpreted, pretty much
> match the performance of [JIT-based] QEMU, although I admit that is still
> left to actually implement and try out.

This sounds like blind faith to me.

~~~
sb
Just for the record: while I agree at least in principle with your assessment
of interpreters vs. JIT compilers, the situation seems far from clear, and I
don't think the final word has been spoken on this topic.

In my opinion, most of the benchmark suites out there give an unfair advantage
to JIT compilers. For example, all numeric JavaScript and Python benchmarks
can be heavily optimized by JIT compilers, essentially removing all of their
interpreters' weaknesses: (un-)boxing, dynamic typing, and, in the case of
Python, reference counting, plus the interpreter overhead itself (i.e.,
instruction dispatch). Many of the benchmarks are numerical in nature, too,
even though actual workloads are usually non-numerical. So it might very well
be that your actual workload does not use any of the fancy numerical
operations that a JIT can optimize heavily. In such a case, the additional
memory consumed by the code caches and by a generational garbage collector may
in fact not buy you any practical speedup over a sophisticated interpreter
using just reference counting (which is known to be very space efficient).
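
To make the (un-)boxing point concrete, here is what a single addition costs a
dynamic-language interpreter -- a minimal sketch in C with made-up names, not
any particular VM's code:

    /* Boxed, tagged values as a dynamic-language interpreter sees them.
       Every add pays for tag checks plus (un-)boxing -- exactly the work a
       specializing JIT removes from hot numeric loops. */
    typedef enum { TAG_INT, TAG_DOUBLE /*, TAG_OBJ, ... */ } Tag;
    typedef struct {
        Tag tag;
        union { long i; double d; } as;
    } Value;

    Value vm_add(Value a, Value b) {
        Value r;
        if (a.tag == TAG_INT && b.tag == TAG_INT) {   /* tag checks...        */
            r.tag  = TAG_INT;
            r.as.i = a.as.i + b.as.i;                 /* ...unbox, add, rebox */
        } else {
            r.tag  = TAG_DOUBLE;                      /* simplified coercion  */
            r.as.d = (a.tag == TAG_INT ? (double)a.as.i : a.as.d)
                   + (b.tag == TAG_INT ? (double)b.as.i : b.as.d);
        }
        return r;
    }

A JIT that has observed both operands to be integers can keep them unboxed in
machine registers and emit a single add (plus a guard), which is exactly the
advantage the numeric benchmarks measure.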

Aside from this unfair skewing of benchmarks towards numerical computation,
there are other points to consider in the JIT vs. interpreter discussion, such
as energy consumption. Does a programming language implementation using a JIT
subsystem require more or less energy than an interpreter? (I am positive that
some companies have already measured this, but there are AFAIK no publications
on this important question.)

Summing up, I think that--as is so often the case in computer science--which
of the two techniques gives the best results depends heavily on a) what
trade-offs you/your customers are willing to make [space vs. time] and b) the
actual characteristics of your workload [numerical vs. non-numerical].

------
sb
I have just read the post and have to say that I would appreciate a more
accurate description of the testing methodology. Some of the techniques
described are well known in interpreter optimization, and his description of a
"threaded code" interpreter is actually not consistent with what the term
usually means (viz., threaded code interpreters have nothing to do with
"threading" the decode of the successor instruction into the operation
implementation of the current instruction; they just move instruction decode
and dispatch into each operation implementation).
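
For reference, the usual meaning looks something like this -- a minimal sketch
using the GCC computed-goto extension, with made-up opcodes:

    /* A direct-threaded dispatch loop in the usual sense: the decode-and-
       dispatch step is replicated at the end of every operation instead of
       sitting in one central loop. Requires the GCC/Clang computed-goto
       extension. */
    int run_threaded(const unsigned char *pc) {
        static void *labels[] = { &&do_inc, &&do_dec, &&do_halt };
        int acc = 0;

        goto *labels[*pc++];          /* initial dispatch */
    do_inc:
        acc++;
        goto *labels[*pc++];          /* decode+dispatch inside the op */
    do_dec:
        acc--;
        goto *labels[*pc++];
    do_halt:
        return acc;
    }

Replicating the dispatch also gives every opcode its own indirect branch,
which tends to predict better than a single shared dispatch branch.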

Aside from that, there have been papers detailing the software pipelining
approach (cf. Hoogerbrugge et al.: "A code compression system based on
pipelined interpreters", 1999), but I cannot for the life of me imagine that
the loop shown there is faster than a JIT compiler, for a very simple reason:
each of the interpreter operations is invoked via a function pointer, which
means the compiler emits an indirect branch instruction to implement it. Now,
these indirect branch instructions are _very_ costly, and even a simple JIT
compiler could just take the actual values (callee target addresses) from the
input program and generate direct call instructions instead. (And I am not
even talking about inlining the function bodies of the operation
implementations.)
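
Concretely, the shape of loop I mean is something like this (a sketch with
made-up names, not the paper's code):

    #include <stddef.h>

    /* Every operation is invoked through a function pointer, so the compiler
       has to emit an indirect call (through a register or memory operand)
       for each step. */
    typedef struct VM VM;
    typedef void (*Op)(VM *vm);

    void run(VM *vm, const Op *program, size_t n) {
        for (size_t i = 0; i < n; i++)
            program[i](vm);           /* indirect call per operation */
    }
    /* A trivial JIT knows each program[i] at code-generation time, so it can
       emit a direct call to that handler -- or inline the handler body --
       removing the indirect branch entirely. */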

~~~
dfox
An indirect branch is not too costly when it is correctly predicted, and
thanks to the proliferation of C++ code, modern CPUs are very good at
predicting indirect branches when they mostly lead to the same target.

------
radarsat1
I'm not sure I understand the attraction of writing a fast interpreter, as
opposed to writing a JIT or ahead-of-time compiler. If you're going to the
trouble of optimizing the hell out of your interpreter, and compiling will
always result in faster execution, why not go directly from an easy, one-off
slow implementation to writing a compiler, instead of spending your time
optimizing the execution loop?

~~~
Retric
Writing an interpreter can be a lot simpler than writing an equivalent
compiler. Suppose you have an old game you want to play that was written in
hand-crafted 8080 assembly. You can emulate the old CPU at reasonable speed
and get a complete working solution. But if it uses, say, self-modifying code,
there is really no way to statically map 8080 instructions to x86(-64)
instructions.

A more common case, though, is that an interpreter lets you debug systems.
Even if you can't run an iPhone app at full speed on your desktop, you can
inspect what's going on in detail with an emulator in a way you can't really
do with compiled code.

------
dexen
The content of the website is beyond awesome.

<http://www.emulators.com/docs/nx33_qemu_0125.htm> first caught my eye,

then <http://www.emulators.com/docs/nx30_bochs245.htm>.

------
wolfgangK
Too bad I'm coming late to the party, but those interested in this post might
also be interested in the test implementations I made a few months ago to
benchmark various design parameters of interpreters (bytecode size, the GCC
computed-goto extension, ...) in high-level C++ (i.e., boost::variant<>). I
don't have the numbers at hand, but I remember the results were very CPU
dependent. You can try it for yourself; it is only a few small self-contained
programs:
https://github.com/scientific-coder/Computer-Languages/tree/master/interpreting

