Other than that, anything that improves performance is a welcome change.
He is a member of ruby-core too, so I guess if enough people in core like these patches they should go right in. However, I don't see the OOBGC patch in that list.
So yeah, it should be quite different.
I've been hacking on a little scripting language lately. To see how its performance compares, I ran a few benchmarks against other similar (dynamically typed, bytecode compiled) languages: Lua, LuaJIT (interpreted), Python, and Ruby.
Like others, I had internalized "Ruby is slow" through osmosis. But the version of Ruby I happen to have on my machine is 2.0.0.
In my little benchmarks, it turns out Ruby is one of the fastest. I haven't compared to 1.8.7, but I'm guessing that was much slower.
If you want the gory details, here are the results of running them against Lua 5.3.2, LuaJIT 2.0.2, Python 2.7.5, and Ruby 2.0.0. (Yes, I should try against Python 3. I will.):
benchmark - language     score   time   relative to wren (lower is faster)
binary_trees - wren       3193   0.31s
binary_trees - lua        1366   0.73s   233.72%
binary_trees - luajit     6256   0.16s    51.04%
binary_trees - python     1376   0.73s   231.96%
binary_trees - ruby       3003   0.33s   106.34%
fib - wren                3078   0.32s
fib - lua                 2785   0.36s   110.52%
fib - luajit              6900   0.14s    44.62%
fib - python              1331   0.75s   231.37%
fib - ruby                3548   0.28s    86.77%
for - wren                6080   0.16s
for - lua                 1990   0.08s    50.71%
for - luajit              5914   0.02s    13.24%
for - python              2825   0.35s   215.25%
for - ruby                6595   0.15s    92.19%
method_call - wren        4707   0.21s
method_call - lua         1674   0.60s   281.17%
method_call - luajit      4221   0.24s   111.53%
method_call - python       767   1.30s   613.69%
method_call - ruby        3061   0.33s   153.77%
Why would you use LuaJIT in interpreted mode?
That defeats the main reason for using LuaJIT, and you'd see a massive improvement in speed if you ran it with the JIT enabled.
It should also be noted that even in LuaJIT's slower interpreted mode, it still completed your benchmarks faster than all the other languages/implementations. It would run even faster with the JIT enabled.
Some platforms, in particular iOS and game consoles, do not allow generated machine code. I want my language to be usable for games, and that's one reason I'm using bytecode. (Simplicity is the main reason.)
Given that, I think it's most helpful for me to compare Wren's performance to other bytecode language implementations. Comparing to JIT compiled languages is a bit apples/oranges. They tend to be something like an order of magnitude faster (and an order of magnitude more complex!).
Meanwhile, LuaJIT's interpreted mode is very fast as you note. I think it's one of the fastest bytecode interpreters around, so it's a great target for Wren to aim for.
If you want a lot more detail on my thoughts about this, see here:
Integer math is fast because they're using tagged pointers and thus don't need to allocate real objects for integers. Method calls are also reasonably optimized, although they don't have inline caches afaik.
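Roughly, the trick looks like this; a minimal C sketch, loosely modeled on CRuby's Fixnum encoding (the names here are mine, not CRuby's):

    #include <stdint.h>
    #include <stdio.h>

    /* Tagged-pointer integers: heap pointers are aligned, so their low
     * bit is always 0; a value with the low bit set to 1 is decoded as
     * a small integer instead of a pointer. No heap allocation needed. */
    typedef uintptr_t Value;

    static int is_fixnum(Value v) { return v & 1; }
    static Value fixnum_new(intptr_t n) { return ((uintptr_t)n << 1) | 1; }
    /* Arithmetic right shift of a signed value is the usual behavior
     * on mainstream compilers, which is what real VMs rely on too. */
    static intptr_t fixnum_to_int(Value v) { return (intptr_t)v >> 1; }

    /* Addition never touches the heap, so it creates no GC pressure. */
    static Value fixnum_add(Value a, Value b) {
        return fixnum_new(fixnum_to_int(a) + fixnum_to_int(b));
    }

    int main(void) {
        Value a = fixnum_new(40), b = fixnum_new(2);
        if (is_fixnum(a) && is_fixnum(b))
            printf("%ld\n", (long)fixnum_to_int(fixnum_add(a, b))); /* 42 */
        return 0;
    }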
You're basically testing a few tiny bits of the runtime that happen to be reasonably fast. The big frameworks like Rails do a lot of stuff that is horribly inefficient. Combined with the aforementioned single-threadedness and slow GC, this leads to Ruby being slow in real-world applications, even if a few microbenchmarks are OK for an interpreted language.
I haven't followed Ruby development, but the last time I heard, Ruby 1.8 and prior used a tree-walking interpreter: simple to implement, but not fast at all.
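For contrast, a tree-walking interpreter evaluates the syntax tree directly; here's a toy C sketch (purely illustrative, not Ruby 1.8's actual code). Every operation is a recursive call plus a branch, where a bytecode VM would run a flat instruction array in a tight loop:

    #include <stdio.h>

    /* Toy tree-walking evaluator for arithmetic expressions. */
    typedef enum { NODE_NUM, NODE_ADD, NODE_MUL } NodeType;

    typedef struct Node {
        NodeType type;
        double num;             /* used by NODE_NUM */
        struct Node *lhs, *rhs; /* used by NODE_ADD / NODE_MUL */
    } Node;

    static double eval(const Node *n) {
        switch (n->type) {
        case NODE_NUM: return n->num;
        case NODE_ADD: return eval(n->lhs) + eval(n->rhs);
        case NODE_MUL: return eval(n->lhs) * eval(n->rhs);
        }
        return 0; /* unreachable */
    }

    int main(void) {
        /* (2 + 3) * 4 */
        Node two   = { NODE_NUM, 2, 0, 0 };
        Node three = { NODE_NUM, 3, 0, 0 };
        Node four  = { NODE_NUM, 4, 0, 0 };
        Node add   = { NODE_ADD, 0, &two, &three };
        Node mul   = { NODE_MUL, 0, &add, &four };
        printf("%g\n", eval(&mul)); /* 20 */
        return 0;
    }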
For what it's worth, I look at these benchmarks as validating feasibility more so than really trying to win some performance fight between languages. In order to argue that my language is suitable for real-world use, I need to show its performance is comparable to other languages that are widely used. From that angle, Ruby is a fine comparison, regardless of how it's implemented.
> Parallel tracing needs an assumption that "do not move (free) memory
> area except sweeping timing". Current CRuby does.
> For example: "ary << obj". Yes, the CRuby's memory management strategy
> (assumption) is different from normal interpreters.
If you allocate an object somewhere *after* that area has been scanned, the GC will treat that object as garbage, and then you will have memory corruption, leading to segfaults and other nice things.
While parallel mark/sweep is certainly doable (Java does this), it's not easy to get right.
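To make the hazard concrete, here's a rough C sketch of the write barrier a concurrent or incremental mark phase needs. Everything here (the heap layout, the names) is hypothetical, not CRuby's actual internals:

    #include <stdio.h>

    typedef struct Object {
        int marked;         /* has the collector scanned this object? */
        struct Object *ref; /* a single outgoing reference, for brevity */
    } Object;

    /* Gray stack: objects the collector still has to (re)scan. */
    static Object *gray[64];
    static int gray_top = 0;
    static void push_gray(Object *o) { o->marked = 1; gray[gray_top++] = o; }

    /* The barrier: if an already-scanned ("black") parent gains a
     * reference to an unscanned ("white") child, re-shade the child so
     * the collector sees it. Without this, the child gets swept while
     * still reachable, which is exactly the corruption described above. */
    static void write_barrier(Object *parent, Object *child) {
        if (parent->marked && child && !child->marked)
            push_gray(child);
    }

    /* Every mutating store in the VM (e.g. `ary << obj`) has to go
     * through the barrier while a mark is in progress. */
    static void store_ref(Object *parent, Object *child) {
        write_barrier(parent, child);
        parent->ref = child;
    }

    int main(void) {
        Object root  = { 1, NULL }; /* already scanned by the collector */
        Object fresh = { 0, NULL }; /* allocated after the scan */
        store_ref(&root, &fresh);   /* barrier rescues `fresh` */
        printf("fresh marked: %d\n", fresh.marked); /* prints 1 */
        return 0;
    }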
Concurrency, as I understand it, is dividing things into tasks that can run (to some degree) independently of each other.
Parallelism is two or more concurrent tasks running simultaneously.
Thus if you have one GC thread and one application thread, you have the possibility of them running in parallel.
But in the context of GC (in my understanding) the distinction goes like this:
* concurrent GC is when (some phases of) the garbage collection happen in a separate thread from the main program.
* parallel GC is when (some phases of) the collection are performed by multiple threads.
So e.g. in a mark & sweep collector you can have a background thread that performs the mark phase concurrently, or you can have a parallel collector that uses multiple threads to perform the mark phase.
In MRI you can't do the latter, but it could be possible to do the former, according to the thread referenced earlier.
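For what it's worth, in code the distinction boils down to something like this pthreads sketch of a parallel mark phase; it's illustrative only, not how any real collector is structured (a concurrent mark would instead be one such loop running alongside the application thread):

    #include <pthread.h>
    #include <stdio.h>

    /* Parallel mark: several threads drain a shared work list of
     * objects to trace. The shared cursor is the synchronization point. */
    #define N_OBJECTS 1000
    #define N_WORKERS 4

    static int marked[N_OBJECTS];
    static int next_item = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *mark_worker(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            int i = next_item < N_OBJECTS ? next_item++ : -1;
            pthread_mutex_unlock(&lock);
            if (i < 0) return NULL;
            marked[i] = 1; /* stand-in for tracing object i */
        }
    }

    int main(void) {
        pthread_t workers[N_WORKERS];
        for (int i = 0; i < N_WORKERS; i++)
            pthread_create(&workers[i], NULL, mark_worker, NULL);
        for (int i = 0; i < N_WORKERS; i++)
            pthread_join(workers[i], NULL);

        int total = 0;
        for (int i = 0; i < N_OBJECTS; i++) total += marked[i];
        printf("marked %d objects with %d threads\n", total, N_WORKERS);
        return 0;
    }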
Here is a good introductory talk: http://www.infoq.com/presentations/Understanding-Java-Garbag...
Naively, with Ruby's mark-sweep collector, you might want to do the sweep phase in parallel, adding garbage blocks to the free list. You could do the mark phase as well, though I believe you'd need a pause to make the marks a consistent snapshot.
 http://www.oracle.com/technetwork/java/javase/tech/g1-intro-..., for one.
EDIT: what I mean is that there is still a need for locks, but a lot of work can be done concurrently.
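To make that concrete, here's a minimal pthreads sketch of a parallel sweep, assuming a flat heap of fixed-size blocks (not CRuby's actual layout). Each worker builds a private free list, so the lock is only taken once per worker, to splice the private list into the global one:

    #include <pthread.h>
    #include <stdio.h>

    #define HEAP_SIZE 1024
    #define N_WORKERS 4
    #define REGION    (HEAP_SIZE / N_WORKERS)

    typedef struct Block { int marked; struct Block *next_free; } Block;

    static Block heap[HEAP_SIZE];
    static Block *global_free_list = NULL;
    static pthread_mutex_t free_list_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *sweep_region(void *arg) {
        int start = (int)(long)arg * REGION;
        Block *local = NULL; /* private free list: no locking needed */

        for (int i = start; i < start + REGION; i++) {
            if (heap[i].marked) {
                heap[i].marked = 0;        /* survivor: reset for next cycle */
            } else {
                heap[i].next_free = local; /* garbage: reclaim the block */
                local = &heap[i];
            }
        }

        /* One short critical section: splice local list onto global. */
        pthread_mutex_lock(&free_list_lock);
        if (local) {
            Block *tail = local;
            while (tail->next_free) tail = tail->next_free;
            tail->next_free = global_free_list;
            global_free_list = local;
        }
        pthread_mutex_unlock(&free_list_lock);
        return NULL;
    }

    int main(void) {
        /* Pretend the mark phase kept every third block alive. */
        for (int i = 0; i < HEAP_SIZE; i++) heap[i].marked = (i % 3 == 0);

        pthread_t w[N_WORKERS];
        for (long i = 0; i < N_WORKERS; i++)
            pthread_create(&w[i], NULL, sweep_region, (void *)i);
        for (int i = 0; i < N_WORKERS; i++)
            pthread_join(w[i], NULL);

        int freed = 0;
        for (Block *b = global_free_list; b; b = b->next_free) freed++;
        printf("reclaimed %d of %d blocks\n", freed, HEAP_SIZE);
        return 0;
    }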