Very nice to see improvement in Ruby's GC. Back in the Ruby 1.8 days, I read the source of the GC implementation and I was less than impressed. I should take the time to revisit that code and see what kind of improvements have been made.
I've been hacking on a little scripting language lately. To see how its performance compares, I ran a few benchmarks against other similar (dynamically typed, bytecode compiled) languages: Lua, LuaJIT (interpreted), Python, and Ruby.
Like others, I had internalized "Ruby is slow" through osmosis. But the version of Ruby I happen to have on my machine is 2.0.0.
In my little benchmarks, it turns out Ruby is one of the fastest. I haven't compared to 1.8.7, but I'm guessing that was much slower.
>> "To see how its performance compares, I run a few benchmarks against ... LuaJIT (interpreted)"
Why would you use LuaJIT in interpreted mode?
That defeats the main reason for using LuaJIT, and you'll see a massive improvement in speed if you run it with the JIT enabled.
It should also be noted that even in its slower interpreted mode, LuaJIT was still the fastest at completing your benchmarks compared to all the other languages/implementations. It will run even faster if you don't run it in interpreted mode.
Running LuaJIT interpreted is a deliberate choice.
Some platforms, in particular iOS and game consoles, do not allow generated machine code. I want my language to be usable for games, and that's one reason I'm using bytecode. (Simplicity is the main reason.)
Given that, I think it's most helpful for me to compare Wren's performance to other bytecode language implementations. Comparing to JIT compiled languages is a bit apples/oranges. They tend to be something like an order of magnitude faster (and an order of magnitude more complex!).
Meanwhile, LuaJIT's interpreted mode is very fast as you note. I think it's one of the fastest bytecode interpreters around, so it's a great target for Wren to aim for.
If you want a lot more detail on my thoughts about this, see here:
What makes Ruby (MRI in particular) slow is metaprogramming done at runtime (dynamically defining methods and classes), its global interpreter lock (which prevents proper parallelism), and the single-threaded, super-slow GC.
Integer math is fast because they're using tagged pointers and thus don't need to allocate real objects for them. Method calls also are reasonably optimized, although they don't have inline caches afaik.
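To make the tagged pointer idea concrete, here is a minimal sketch in Python. It assumes a 64-bit word where the low bit is 1 for a small integer and 0 for a heap reference. The helper names (tag_int, untag_int, add_tagged) are made up for illustration and are not MRI's actual functions; the point is only that an integer lives inside the word itself, so no object has to be allocated for it.

```python
# Hypothetical tagging scheme: low bit 1 = small integer, low bit 0 = heap reference.
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1
FIXNUM_BITS = WORD_BITS - 1                 # one bit is spent on the tag
FIXNUM_MAX = (1 << (FIXNUM_BITS - 1)) - 1
FIXNUM_MIN = -(1 << (FIXNUM_BITS - 1))


def tag_int(n):
    """Pack a small integer into a machine word without allocating an object."""
    if not FIXNUM_MIN <= n <= FIXNUM_MAX:
        raise OverflowError("does not fit in a tagged word; a real VM would box it")
    return ((n << 1) | 1) & WORD_MASK


def is_tagged_int(word):
    """A single bit test tells integers apart from heap references."""
    return word & 1 == 1


def untag_int(word):
    """Recover the integer: reinterpret the word as signed, then shift the tag away."""
    if not is_tagged_int(word):
        raise TypeError("word is a heap reference, not a tagged integer")
    if word >= 1 << (WORD_BITS - 1):
        word -= 1 << WORD_BITS              # two's complement reinterpretation
    return word >> 1


def add_tagged(a, b):
    """Integer addition that never touches the heap."""
    return tag_int(untag_int(a) + untag_int(b))


if __name__ == "__main__":
    forty_two = add_tagged(tag_int(2), tag_int(40))
    print(untag_int(forty_two))
```

The cost is one bit of integer range, and a real implementation has to fall back to a heap-allocated big integer when a result overflows, which this sketch only signals with an exception.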
You're basically testing a few tiny bits of the runtime that happen to be reasonably fast. The big frameworks like Rails do a lot of stuff that is horribly inefficient. Combined with the aforementioned single-threadedness and slow GC, that leads to Ruby being slow in real-world applications, even if a few microbenchmarks are OK for an interpreted language.
This changed in 1.9. Ruby is now compiled into a stack machine based bytecode. Ruby 1.8 used the AST internally, you're right there. This made e.g. method calls very slow compared to the new byte code representation.
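For anyone who hasn't seen the two approaches side by side, here is a toy sketch in Python. The node shapes and opcode names (PUSH, ADD, MUL) are invented for illustration and have nothing to do with Ruby's real instruction set. It evaluates the same expression once by walking the tree recursively and once by compiling it to a flat instruction list that a stack machine runs in a simple loop.

```python
# Toy AST: ("num", value) | ("add", left, right) | ("mul", left, right)

def eval_ast(node):
    """Tree-walking evaluation: one recursive call per node, every time it runs."""
    kind = node[0]
    if kind == "num":
        return node[1]
    left = eval_ast(node[1])
    right = eval_ast(node[2])
    if kind == "add":
        return left + right
    if kind == "mul":
        return left * right
    raise ValueError("unknown node kind: %r" % (kind,))


def compile_ast(node, code=None):
    """Flatten the tree once into postfix instructions for a stack machine."""
    if code is None:
        code = []
    kind = node[0]
    if kind == "num":
        code.append(("PUSH", node[1]))
    elif kind in ("add", "mul"):
        compile_ast(node[1], code)
        compile_ast(node[2], code)
        code.append((kind.upper(), None))
    else:
        raise ValueError("unknown node kind: %r" % (kind,))
    return code


def run(code):
    """Stack machine: a flat loop over instructions, no recursion over the tree."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "MUL":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError("unknown opcode: %r" % (op,))
    return stack.pop()


if __name__ == "__main__":
    expr = ("add", ("num", 1), ("mul", ("num", 2), ("num", 3)))
    print(eval_ast(expr), run(compile_ast(expr)))
```

In Python both versions are slow, so this says nothing about real timings. It only shows the structural difference: the compiled form is a linear array the interpreter can step through, which is the kind of representation that is easier to dispatch quickly in C.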
That's correct. I still think Ruby 1.8 would be a useful data point just because it's a widely used language that many people are familiar with, but in this case I'm comparing it to Ruby 2.0, so it is apples/apples.
For what it's worth, I look at these benchmarks as validating feasibility more so than really trying to win some performance fight between languages. In order to argue that my language is suitable for real-world use, I need to show its performance is comparable to other languages that are widely used. From that angle, Ruby is a fine comparison, regardless of how it's implemented.
As someone woefully uninformed about these things, why can't GC be implemented as a separate thread, maybe with a lower priority than the primary interpreter? Would a separate thread not be able to count references to objects or something?
> Parallel tracing needs an assumption that "do not move (free) memory
> area except sweeping timing". Current CRuby does.
> For example: "ary << obj". Yes, the CRuby's memory management strategy
> (assumption) is different from normal interpreters.
GC needs to know about all references in the program. If the mutator (the program) is running concurrently with the collector, it's difficult (though not impossible) for the collector to construct this consistent view.
Many styles are implemented that way. Check out Java for (many) examples.
Naively, with Ruby's mark-sweep collector, you might want to do the sweep phase in parallel, adding garbage blocks to the free list. You could do the mark phase as well, with I believe a pause to make the marks a consistent snapshot.
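For reference, the two phases look roughly like this. This is a single-threaded sketch in Python with made-up Obj and Heap classes, not CRuby's collector, and it does no actual parallelism. It is only meant to show which data each phase touches.

```python
class Obj:
    """A heap object with outgoing references and a mark bit."""

    def __init__(self, name):
        self.name = name
        self.refs = []
        self.marked = False


class Heap:
    def __init__(self):
        self.objects = []     # every allocated object
        self.free_list = []   # names of reclaimed objects, standing in for free slots

    def alloc(self, name):
        obj = Obj(name)
        self.objects.append(obj)
        return obj

    def mark(self, roots):
        """Mark phase: flag everything reachable from the roots."""
        worklist = list(roots)
        while worklist:
            obj = worklist.pop()
            if obj.marked:
                continue
            obj.marked = True
            worklist.extend(obj.refs)

    def sweep(self):
        """Sweep phase: unmarked objects go to the free list, marks are reset."""
        survivors = []
        for obj in self.objects:
            if obj.marked:
                obj.marked = False
                survivors.append(obj)
            else:
                self.free_list.append(obj.name)
        self.objects = survivors

    def collect(self, roots):
        self.mark(roots)
        self.sweep()


if __name__ == "__main__":
    heap = Heap()
    a, b, c = heap.alloc("a"), heap.alloc("b"), heap.alloc("c")
    a.refs.append(b)          # a -> b is reachable, c is garbage
    heap.collect(roots=[a])
    print([o.name for o in heap.objects], heap.free_list)
```

Once marking is done, the objects the sweep reclaims are ones the program can no longer reach, which is why sweeping is the easier phase to hand to another thread. Marking reads the same reference graph the program is busy mutating, hence the need for a pause or some other way to get a consistent snapshot.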