
Making Ruby Faster - timr
http://omniref.com/blog/blog/2014/11/12/making-ruby-faster/
======
kazinator
Though optimizations in an interpreter are valiant efforts, the way to
optimize the language is to develop a compiler.

There are good reasons not to start doing hairy optimizations in the
interpreter: a simple interpreter establishes the reference semantics (which
is helpful when compiled and interpreted behavior differ), and interpreter
optimizations become moot once you have a compiler.

About this Ruby optimization: one thing that stands out is that there doesn't
appear to be any way to turn it off. If such a hard-coded interpreter
optimization breaks, the only way to try something without the optimization is
to revert to a build of the interpreter which didn't have it. That may not be
possible, so then you may have to build the current interpreter, but with that
change reverted.

Compilers usually have switches for selecting various optimizations. Of
course, a similar switch in an interpreter has a run-time impact: a "do this
optimization" flag has to be checked each time there is an opportunity to do
that optimization.

~~~
jeffreyrogers
Mike Pall, the author of the LuaJIT interpreter/compiler, suggests that a good
interpreter can get a large portion of the gains you'd get from a compiler. At
one point the LuaJIT interpreter (not the regular Lua interpreter) was within
about a factor of 2 of the JIT. I think the JIT has since improved, but you can
still get significant benefits from simply improving the interpreter. (I don't
have a reference for this and can't look it up now, so maybe someone else can
help me on this... I do, however, remember reading something by Mike Pall that
said this. I think it was on Lambda the Ultimate.)

Of course, the best thing to do is to optimize the interpreter and then JIT
compile the CPU intensive stuff.

~~~
riffraff
"LuaJIT's interpreter (!) beats V8's JIT compiler in 6 out of 8 benchmarks and
is not too far off in another one"

http://lambda-the-ultimate.org/node/3851?a=1#comment-57761

~~~
mraleph
This was back in 2010 when V8 did not really have an optimizing compiler -
V8's "compiler" was a baseline one essentially gluing together individual
interpretation patterns.

Also any cross-language comparison should be done very accurately - because we
are talking about different language semantics and different benchmark
implementations.

~~~
riffraff
Well, the point wasn't "an interpreter is always faster than a JIT", but "a
good interpreter can get a large portion of the gains you'd get from a
compiler".

If you prefer apples to apples, quoting Mike Pall again[0]

"the LJ1 JIT compiler is not much faster than the LJ2 interpreter, sometimes
it's worse".

[0]: http://lambda-the-ultimate.org/node/3851#comment-57646

~~~
kazinator
If improvements to A give you gains, while some other B yields even more
gains, then you can always say that the A gains are "a portion of" the B
gains.

A compiler that only beats interpretation by 2:1 is either a poor compiler, or
something else is going on, like most of the work actually being done by
subroutines whose performance is not affected by the compilation (because, for
instance, they are written in C and intrinsic to the language run-time).

There do not have to be explicit calls to such functions. For instance,
compiled arithmetic that is heavy on large bignums will probably not be much
faster than its interpreted version, because cycles are actually spent in
processing the arrays of bignum digits (or "limbs"), which is done in some
bignum library code. The code being compiled looks innocuous; it just has
formulas like (a+b)*c, but these turn into bignum library calls. Since the
bignum library is written in C and compiled, the calls run equally fast
whether called by interpreted or compiled code. That's where most of the time
is spent, and so compiling the interpreted code makes no difference overall,
even if 90% or more of the time spent there is knocked out by the compiler.
(Amdahl's Law.)
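
The effect is easy to see in plain Ruby; here's a minimal sketch (timings
will of course vary by machine):

```ruby
require "benchmark"

# The expression below looks innocuous, but with bignum operands every
# + and * becomes a call into CRuby's C bignum routines, where nearly
# all the cycles are spent regardless of how the caller runs.
a = 10**10_000
b = 10**10_000
c = 3

time = Benchmark.realtime { 1_000.times { (a + b) * c } }

result = (a + b) * c
# result == 6 * 10**10_000; the digit-crunching is done in C, so a
# compiler for the Ruby layer could only shave the thin dispatch on top.
```

Profile first: if the hot loop is already inside a C routine, compiling the
Ruby that calls it buys almost nothing.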

------
munificent
Really neat article, though it makes me wonder why they didn't just remove
`sym_equal` entirely and replace all usages of it with `rb_obj_equal`. At the
very least, that `#define` needs a comment saying it _must_ have the same
_identity_ as `rb_obj_equal` or a perf loss will happen.

~~~
timr
Well, among other things, you can override the object equality check in Ruby
-- so you can't simply rely on pointer identity as a check of equality.

That said, yeah, it'd be nice if you could just make the assumption that if
two objects point to the same memory, they're _de facto_ identical...
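
For illustration, a sketch of what overriding equality looks like (the
`Money` class here is hypothetical):

```ruby
# Equality is just a method in Ruby, so any class can redefine it;
# the interpreter therefore can't assume == means pointer identity.
class Money
  attr_reader :cents

  def initialize(cents)
    @cents = cents
  end

  def ==(other)
    other.is_a?(Money) && cents == other.cents  # value equality
  end
end

a = Money.new(100)
b = Money.new(100)
a == b       # => true: the overridden method compares values
a.equal?(b)  # => false: equal? is always object identity
```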

~~~
jrochkind1
You can make that assumption for symbols. And that is in fact what the code
was doing -- both before and after optimization. The optimization was about
something else, weirder, that was still making string comparison (where you
can't just check the pointer) faster even though it ought to be slower. Did
you read the article?
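
In plain Ruby terms, a sketch of the distinction being drawn (assuming frozen
string literals are not enabled):

```ruby
# Symbols are interned: every occurrence of :foo denotes the same
# object, so comparing symbols is just an identity check.
sym_fast = :foo.equal?(:foo)        # => true, no content comparison needed

# Equal-looking string literals are distinct objects, so == on strings
# must actually compare their bytes.
str_identity = "foo".equal?("foo")  # => false
str_content  = "foo" == "foo"       # => true, via byte comparison
```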

~~~
timr
_" The optimization was about something else, weirder, that was still making
string comparison (where you can't just check the pointer) faster even though
it ought to be slower. Did you read the article?"_

I'm the author of the article.

The optimization here is that the check for plain-old symbol equality is
happening without the need for a full ruby method dispatch. You can't make
that assumption in the general case, because it's possible to override
equality in Ruby, which then requires more work.

If you look at the full source code for the method in question, you'll see
that it does special checks for Fixnums, Floats, and Strings, then a check
for the _default_ object equality (i.e., does the comparison use
rb_obj_equal?), and then, finally, it falls back to a full method call.
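
A rough Ruby rendering of that check order (the name `fast_equal` is made up;
the real logic is C inside the interpreter, not Ruby):

```ruby
# Hypothetical sketch of the order of checks described above.
def fast_equal(recv, obj)
  case recv
  when Integer, Float, String
    recv == obj                     # special-cased built-in comparisons
  else
    if recv.method(:==).owner == BasicObject
      recv.equal?(obj)              # default ==: a plain identity check
    else
      recv == obj                   # overridden ==: full Ruby method call
    end
  end
end
```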

------
jaredcwhite
Every tiny little thing like this that gets further optimized in the MRI will
result in major performance gains in advanced frameworks/codebases down the
road. So this is great news. I'm looking forward to a near-future time when
the claim Ruby=Slow! is much harder to make.

~~~
dkarapetyan
There is an inherent overhead to all dynamically typed languages like Ruby and
Python: method dispatch. Overcoming this problem is not easy, and even JRuby,
with all the muscle of the JVM JIT, is still pretty slow compared to the
statically typed equivalent in Java or any other language that doesn't have to
pay for method look-ups at run-time.
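
A small sketch of why the look-up can't simply be skipped: one call site can
hit different methods depending on the run-time types involved (the `Vec`
class is made up for illustration).

```ruby
# Even + is an ordinary method; the interpreter must figure out which
# #+ to run each time (inline caches help, but the check remains).
class Vec
  attr_reader :x, :y

  def initialize(x, y)
    @x, @y = x, y
  end

  def +(other)
    Vec.new(x + other.x, y + other.y)
  end
end

def add(a, b)
  a + b  # one call site, many possible targets at run-time
end

add(1, 2)                          # dispatches Integer#+
add(Vec.new(1, 2), Vec.new(3, 4))  # same site, dispatches Vec#+
```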

The situation is not hopeless though and projects like Truffle/Graal
(https://wiki.openjdk.java.net/display/Graal/Truffle+FAQ+and+Guidelines)
are pushing the boundary of what is possible in terms of performance for
dynamically typed languages. There is also PyPy and RPython
(http://tratt.net/laurie/blog/entries/fast_enough_vms_in_fast_enough_time)
which again leverages some neat JIT techniques to make things fast.

~~~
vanderZwan
> _There is an inherent overhead to all dynamically typed languages like Ruby
> and Python, method dispatch_

I can't claim to fully understand how they pull it off (not being that
familiar with compiler internals), but I thought Julia didn't suffer from this
problem?

~~~
ihnorton
Julia is really designed to _feel_ dynamic while minimizing the run-time
dynamicity the compiler must account for. Type inference and method
specialization play a big role, and you can read the Julia papers (on
julialang.org) for a discussion of many of the design choices.

Many modern JIT compilers use multiple techniques for aggressive
specialization of code sections that meet optimization heuristics (whether
tracing-based or otherwise), so carefully written JavaScript (under V8 and
others), Lua (under LuaJIT), and Python (under PyPy and others) can be quite
fast. On these platforms there are usually implicit or explicit language
subsets and coding styles required to get maximum performance out of the
compiler (as with Julia: it is possible to write slow code in any language).
For example, asm.js is an optimization-friendly, explicit subset of
JavaScript.

------
pgz
Great article! What I don't understand is: why can't they make the compiler
inline the function instead of inlining it manually?

~~~
timr
This is more than just C function inlining -- the Ruby interpreter, when it
does a method call, ends up doing a lot of expensive bookkeeping at runtime to
maintain its own internal call stack.

This optimization allows the interpreter to completely bypass a _Ruby_ method
call, which is a big win.

~~~
pgz
I meant having something like:

    if (check_cfunc(ci->me, rb_obj_equal)) {
        return rb_obj_equal(recv, obj);
    }

and trusting the compiler to inline rb_obj_equal (or using inline), although I
reckon you probably don't want rb_obj_equal to always be inlined.

~~~
timr
Oh, sorry...wasn't clear.

The problem with just doing the rb_obj_equal call for everything is that if
people override eql? in their Ruby code, you need to call that overridden
method, instead.

That requirement means the comparison operator needs to be more complex.

------
mrinterweb
My impression was that ruby symbols were defined once and mapped to the same
memory space. If I am correct in my assumption, wouldn't it be more efficient
to pass the arguments in as pointers and compare the equality of the pointers?

~~~
djur
That's actually what they're doing. VALUE is a typedef for uintptr_t, so
rb_obj_equal is just testing whether the pointers are the same.

~~~
mrinterweb
Thanks for looking into that, Matt. It has been a long time since I did any
real C programming, and I could not believe that they would be passing by
value for this kind of equality check. The word "VALUE" is a little deceiving
considering that it is actually a uintptr_t.

~~~
stormbrew
It is uintptr_t in size, but it's actually a bit more complicated than that.
If the 'value' is a Fixnum, the integer is shifted left one bit, the low bit
is set to 1, and the result is stored directly in the VALUE [1].

And then there are a few other bit patterns (all with the low bit set to
zero) that also mean embedded values [2], including symbols, which are
actually an index into a string table rather than an actual object pointer.

So a lot of the time, VALUE is anything but a misnomer. Also, the name for
this pattern is "tagged pointer". It's one of the most notable things Ruby
borrowed from Emacs Lisp [3].

Also, the entire point of symbols is that they don't need to be dereferenced
(or strcmp'd) to compare them. That's not the slow part of symbol comparison.
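
The Fixnum tagging is even visible from plain Ruby on MRI (this is an
MRI-specific implementation detail, so treat it as illustrative only):

```ruby
# On MRI, a Fixnum n is encoded in the VALUE itself as (n << 1) | 1,
# and object_id happens to expose that encoding.
small_ids = (0..4).map { |n| n.object_id }  # => [1, 3, 5, 7, 9] on MRI

# Symbols likewise need no dereference to compare: the same symbol is
# always the same VALUE.
same_symbol = :foo.object_id == :foo.object_id  # => true
```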

[1] https://github.com/ruby/ruby/blob/trunk/include/ruby/ruby.h#L234

[2] https://github.com/ruby/ruby/blob/trunk/include/ruby/ruby.h#L383

[3] https://www.gnu.org/software/emacs/manual/html_node/elisp/Object-Internals.html

------
pmontra
Almost off-topic because it's not directly related to symbol performance in
Ruby, but by coincidence I was reading a 2011 post from the Rubinius site
(http://rubini.us/2011/02/17/rubinius-what-s-next/) a couple of minutes
before seeing this on HN.

About halfway through the post there is a code snippet that explains why Ruby
method calls must be slower than those of statically typed languages. It
also explains how Rubinius was addressing that three years ago.

Hopefully they'll optimize that in all language implementations.

~~~
donpdonp
Also almost off-topic, but the lead Rubinius dev just posted a great writeup
on Rubinius 3.0, which could be a possible future of Ruby in general.

http://rubini.us/2014/11/12/rubinius-3-0-part-3-the-instructions/

~~~
pmontra
Interesting read. Thank you.

------
skratlo
Makes me wonder why the C compiler (gcc?) isn't clever enough here and isn't
inlining all that nonsense. At least I would expect a
SufficientlySmartCompiler™ to do so. Having to inline by hand just proves how
dark an age we're living in.

~~~
masklinn
> Makes me wonder why the C compiler (gcc?) isn't clever enough here and
> isn't inlining all that nonsense.

There's an indirection through a function pointer (ci->me). The manual
optimisation is there to check whether the function pointer matches a known
baseline function, and to apply that baseline directly instead of invoking the
"generic" dispatch machinery (which involves setting up new call frames et
al.).

If you look at the whole function, there are a few special cases before that
which statically dispatch comparisons between numbers or strings, so here
we're in the "general" case, and one last optimisation available is to check
whether equality has been overridden at all.

~~~
chrisseaton
People are often confused about how a JIT can be faster than a static compiler
- this is a great example of why that can be the case. A dynamic compiler can
speculatively inline through the function pointer, whereas in a static
compiler that is not tractable with what we currently know about compilers.

~~~
gsg
That's not really true - it's just (possibly partial) defunctionalisation. The
problem isn't that we don't know how to do it, but that the necessary whole
program architecture has various drawbacks.

See Stalin and MLton for examples of a static compiler performing such
analyses.

~~~
chrisseaton
Consider the case of a program which applies a processing function to pixels
in an image. Which processing function to run depends on a command-line
parameter. How would whole-program analysis help you know which function you
are going to use? But a JIT will see you keep calling the same function and
inline it. Not even profile-directed feedback will help you if each time you
run the program you use a different function.
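
The scenario can be sketched in Ruby (the filter names and `process` helper
are hypothetical):

```ruby
# The filter is chosen from input at run-time, so no whole-program
# analysis can resolve the call statically; a JIT, though, observes one
# target dominating the loop and can speculatively inline it.
FILTERS = {
  "invert"   => ->(px) { 255 - px },
  "brighten" => ->(px) { [px + 40, 255].min },
}

def process(pixels, filter)
  pixels.map { |px| filter.call(px) }  # an opaque call to a static compiler
end

filter = FILTERS["invert"]
process([0, 100, 255], filter)  # => [255, 155, 0]
```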

I know Stalin and MLton but not the research you mention - can you point me at
any papers?

~~~
gsg
It's true that whole-program compilation doesn't cover speculation (and many
other cases of dynamism, like running code that you download or construct at
runtime). But it does allow inlining through a function pointer as in the OP,
which you suggested is impossible for a static compiler.

The classic paper on defunctionalisation is Reynolds' "Definitional
Interpreters for Higher-Order Programming Languages". There's also a huge
whack of papers at http://mlton.org/References, some of which go into MLton's
compilation strategy (I don't remember which ones to point you at, though).

------
cliffordheath
"Symbols are this unique, quasi-string construct" - not even a little bit
unique. Interned atoms were a core feature in the very first Lisp
interpreters, and the idea has been copied in many languages since.

