

The Broken Promises of MRI/REE/YARV - ice799
http://timetobleed.com/the-broken-promises-of-mrireeyarv/

======
iam
I think this is a problem that exists across any VM that implements a GC, not
just Ruby.

.NET CLR has the exact same problem (perhaps a harder one, since CLR has a
moving GC), so anytime they touch GC references (pointers to objects that are
collectible) it's always wrapped in an explicit GC stack frame (think GC
struct that lives on the stack). Furthermore, all reads/writes are carefully
done with macros (which of course expand to volatile + some other stuff) to
make sure the compiler doesn't optimize it away.

On the one hand, this is nice because they don't need to scan the C-stack (it
scans the VM stack and the fake GC frame stacks -- well it's one stack but you
skip the native C frames), on the other hand this means that any time a GC
object is used in C code (ok, actually it's C++) they have to be real careful
to guard it.

Of course bugs crop up all the time where an object gets collected when it
shouldn't have been; it happens so often that there is a name for it -- "GC Hole".

Astute readers and users of p/invoke may remark that they don't have to set up
any "GC frames" -- that is because this complicated scheme is not exposed
outside of the CLR source. Regular users of .NET who want to marshal pointers
between native/managed can simply request that a GC reference gets pinned, at
which point I'm mostly sure it won't get collected until it's unpinned.

The bad news is I'm almost positive there is nothing you can do with just C
here to make this problem go away. You'd want stuff to magically just happen
under the hood, and C++ is the right way to go for that.

It's probably possible to create an RAII style C++ GC smart pointer that would
be 99% foolproof at the expense of some performance. It gets a little bit
trickier if we are doing a moving collector. I am thinking it could ref/unref
at creation/destruction, and disallow any direct raw pointer usage not to
shoot yourself in the foot.
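
A minimal sketch of that ref/unref idea, assuming a toy heap that just counts how many handles root each object. `ToyHeap`, `Object` and `Handle` are made-up names for illustration, not the CLR's or any real VM's API, and a moving collector would need more than this (the handle would have to be updated when the object moves):

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical toy types; nothing here is a real VM's API.
struct Object { int payload; };

// Tracks how many live handles root each object.
class ToyHeap {
public:
    void ref(const Object* o) { ++roots_[o]; }
    void unref(const Object* o) {
        auto it = roots_.find(o);
        if (it != roots_.end() && --it->second == 0) roots_.erase(it);
    }
    // A collector would treat every object in this table as a root.
    bool is_rooted(const Object* o) const { return roots_.count(o) != 0; }
private:
    std::unordered_map<const Object*, int> roots_;
};

// RAII handle: roots the object on construction, unroots on destruction.
class Handle {
public:
    Handle(ToyHeap& heap, Object* obj) : heap_(&heap), obj_(obj) { heap_->ref(obj_); }
    Handle(const Handle& other) : heap_(other.heap_), obj_(other.obj_) { heap_->ref(obj_); }
    Handle& operator=(const Handle& other) {
        if (this != &other) {
            other.heap_->ref(other.obj_);  // ref first, in case both name the same object
            heap_->unref(obj_);
            heap_ = other.heap_;
            obj_ = other.obj_;
        }
        return *this;
    }
    ~Handle() { heap_->unref(obj_); }

    // Member access goes through the handle; there is no get() for the raw pointer.
    Object* operator->() const { return obj_; }
private:
    ToyHeap* heap_;
    Object* obj_;
};
```

The object stays in the root table for exactly as long as some `Handle` to it is in scope, which is the "magically happens under the hood" part that plain C can't give you.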

Of course the people writing the GC still need to worry about this..

~~~
tsuyoshi
Anyone who has written an extension to a garbage-collected language in C will
have run into this issue. Personally I've written extensions for Guile, OCaml,
Ruby, MLton, and Java, and all of them have tricky rules for making your C
code safe for garbage collection. Using volatile is the wrong way to do this
though... this tells me that the people figuring this stuff out for Ruby don't
really know C that well.

------
thibaut_barrere
I do appreciate the technicality of the article, but I'm not sure I agree
with the first point of the conclusion: how does it make MRI (and related)
'fatally flawed' though? (real question).

What makes it 1/ irreversible and 2/ bad for today's users?

EDIT: as well, I wouldn't stop using Ruby because of that; I would use JRuby
or Rubinius or IronRuby (if I understand well, these ones are not affected?)

~~~
dlikhten
I fail to see how this is an all hands abandon ship issue. If it's a critical
issue in all 3 interpreters they should be fixed asap if possible. At worst
with a flag.

If rubinius/ironruby/jruby have no issues, this may become moot eventually as
rubinius is gaining lots of traction recently and is becoming faster with each
release, outperforming standard ruby vms in many cases.

~~~
evanphx
Neither Rubinius nor JRuby (and probably IronRuby too) have this issue because
they all use accurate garbage collection rather than conservative. Accurate
requires much more bookkeeping since all pointers must always be properly
identified, but if you start writing a system with accurate GC, it's pretty
easy. Bugs like this are a direct result of a conservative GC strategy (and
these bugs, as I'm sure you got reading Joe's post, really really suck to
find).

~~~
pmjordan
This class of subtle bugs exists whether or not your GC is accurate as soon as
you take the red pill and leave the VM environment. If you forget to add your
C pointer to the accurate GC's root set, you're just as dead. Related story:
<http://news.ycombinator.com/item?id=217189>

~~~
evanphx
But that is by definition a tractable problem because the source will show
that the root set isn't being used properly. (additionally, in practice this
proves to be a rare and easy to fix bug)

------
davesims
This post is a weird mix of careful technical analysis and douchey,
Zed Shaw-style hysterical overstatement.

However, I would like to see Matz' response to the recommended steps for a fix
at the end. Sounds like a reasonable goal to add for Ruby 2.0.

Note to self: Listening to Papoose while writing a technical blog post turns
your otherwise important observations into a Chicken Littleish,
end-of-the-world rant.

~~~
Nelson69
I kind of branded it a bit "douchey" at first too but then as I thought about
it, it seemed remarkably restrained considering he debugged this issue. It's
not like this happened all the time, had to get kind of lucky and build and
calibrate a system just right to capture it.

I don't intend this to be an inflammatory question, I'm sort of a perpetual
ruby novice, it's never been my day job and I've never managed to sort of
catch up with the community, as soon as I feel pretty good with something I
find it's been obsoleted a couple times. I like it but how does the community
at large deal with stuff like this? This guy found a real bug and invested
some time in it, do other rubyists just deal with crashes and restart their
stuff? Do they just consider it part of "being on the cutting edge?" Or do
they not even notice?

~~~
msbarnett
In practice crashes due to this issue simply do not occur very often. I think
I've had the VM segfault twice in the last two or three years.

That's what makes the hyperbolic tone of this article so douchey; he wrote up
an interesting dissection of an edge case issue as though it were an ongoing
catastrophe, mostly just to inject a bunch of chest-thumping rock-star bravado
that added nothing of value to the discussion.

~~~
knewter
I've actually had an ungodly metric ton of ruby segfaults in the past month or
so, and almost never before that. At least one of them has definitely been
GC-related - see "therubyracer is not thread safe" for one problem I've been
running into. You also have to use PassengerSpawnMethod conservative to avoid
GC-related failures in passenger with rails 3.1.

I'm not sure if those are both related to this or not, but I've had
drastically more segfaults lately than in my past 6 years of ruby programming.
It's getting pretty bad imo.

~~~
riffraff
but how much of that is the interpreter's fault?

I know I can't run typhoeus + thin on 1.9.2 on OSX as it reliably crashes
every ten minutes and I have no clue on how to debug it, but it is not a
problem with the interpreter, it's a problem with external libraries.

------
kingkilr
I think this goes to a pretty simple point: anything you have to do by hand
you will eventually get wrong. Thus, to a first approximation anything that
can be automated, probably ought to. To show off this principle I'm going to
show off some of the PyPy source code:
[https://bitbucket.org/pypy/pypy/src/default/pypy/module/sele...](https://bitbucket.org/pypy/pypy/src/default/pypy/module/select/interp_epoll.py)

This is the implementation of `select.epoll`. Some things you'll notice: there
are no GC details (allocations outside the GC of C level structs are handled
nicely with a context manager), and we have a declarative (rather than
imperative) mechanism for specifying argument parsing to Python level methods.
This ensures consistency in readability as well as error handling, etc.

------
wingo
Cute. The Boehm-Demers-Weiser collector has GC_reachable_here for this reason.
Guile has scm_remember_upto_here since before it switched to libgc. I'm sure
other systems have their things too.

That said, I like Handle, the RAII thing that V8 uses. It also allows for
compacting collection. Too bad C doesn't do RAII.

~~~
thibaut_barrere
.Net has GCHandle [1] and I believe the JVM calls to JNI have a similar
mechanism (GetXXCritical [2])

[1] <http://www.shafqatahmed.com/2008/05/memory-control.html>

[2]
[http://publib.boulder.ibm.com/infocenter/javasdk/v5r0/index....](http://publib.boulder.ibm.com/infocenter/javasdk/v5r0/index.jsp?topic=/com.ibm.java.doc.diagnostics.50/diag/understanding/jni_copypin.html)

------
wonnage
Can someone dissect this a little more? My understanding is the pointer to str
never gets written to the stack, and so str on the heap might get freed before
zstream_append_input makes use of it. But how could the GC see this/what is
the faulty assumption?

~~~
fhars
The point is that the GC _cannot_ see that and so assumes that the object is
no longer referenced and can be freed. A conservative collector works by
scanning the live memory of the process for things that look like pointers
into the same live memory and then assumes that all objects that are not the
target of any of these pointers are garbage. Tough luck if the only reference
to a live object lives in a register.
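
A rough sketch of that marking step, as a toy model and not MRI's actual collector. `roots` stands in for whatever words the collector scans, `heap_objects` for the start addresses of allocated objects, and both names are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

using Word = std::uintptr_t;

// Returns the object start addresses that some scanned word points at.
// Anything in `heap_objects` but missing from the result is treated as garbage.
std::set<Word> conservative_mark(const std::vector<Word>& roots,
                                 const std::set<Word>& heap_objects) {
    std::set<Word> live;
    for (Word w : roots) {
        // In this toy model a word "looks like a pointer" only if it
        // equals the start address of an allocated object.
        if (heap_objects.count(w)) live.insert(w);
    }
    return live;
}
```

With objects at 0x1000 and 0x2000 and scanned words {7, 0x1000, 0x2008}, only 0x1000 is marked. The word 0x2008 points somewhere inside the second object, but this model doesn't count that as a reference, so the second object would be freed.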

~~~
ice799
registers are scanned, too. the bug is not that the ref is in a register. the
bug is that there are no refs anywhere. not on the stack and not in any
register.

~~~
yellowredblack
This statement confused the heck out of me (wow! magic free memory) but of
course, the pointers are being held to the contents of the memory, just not to
the start of the object, which is what the GC cares about.

Perhaps the GC could be modified to track pointers not just to the head of an
object but to any address within it. Alternatively, C-coders working with Ruby
could just say "I'm using this gc object" before calling C code.
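
For the first option, here is a sketch of what "any address within it" could look like, assuming the GC keeps a table of allocations keyed by start address. `AllocTable` and `find_containing_object` are hypothetical names, not anything in Ruby:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

using Addr = std::uintptr_t;

// Hypothetical table of live allocations: start address -> size in bytes.
using AllocTable = std::map<Addr, std::size_t>;

// Returns the start address of the allocation containing `p`, or 0 if none.
Addr find_containing_object(const AllocTable& table, Addr p) {
    auto it = table.upper_bound(p);     // first allocation starting after p
    if (it == table.begin()) return 0;  // p lies below every allocation
    --it;                               // last allocation starting at or before p
    return (p - it->first < it->second) ? it->first : 0;
}
```

The price is a range lookup for every scanned word instead of an exact match, and presumably more garbage kept alive by integers that happen to land inside some object.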

I don't see this as a fatal flaw at all. Sounds like it's just a bug. Now if,
as many here assert, this bug is present all over the Ruby VM, then that's
pretty unfortunate. Is that the case, or just hyperbole?

------
xpaulbettsx
So, what this really seems to boil down to, is:

The Ruby C API is returning objects that are not correctly reference-counted
for a short period of time and are incorrectly subject to GC.

This doesn't seem fatal to me, just not reasonably fixable from the GC side.
It _might_ be true, that a new API is needed to hold refs in the C side.

~~~
benblack
I am apparently in that foolish minority that believes language runtimes
should not segfault/corrupt themselves while running correct code. That this
problem requires significant effort just to hack around, while actually fixing
it would take a major architectural change, is what elevates this from mere
"lolwut?" to fatally flawed. There are good alternative runtimes for Ruby,
such as the JVM and the CLR, that do not suffer from this problem. Y'all
should use them.

Funktacularly yours,

Lil' B

~~~
davesims
If edge case segfaults were fatal flaws Windows should never have shipped. I
say 'edge case' because obviously there are millions of lines of Ruby code
running for years on MRI/YARV/REE that have not encountered this error often
enough to cause the kind of breathless panic you seem to think is appropriate.

BTW the CLR is not a good alternative runtime for Ruby, might not ever be:
[http://www.zdnet.com/blog/microsoft/whats-next-for-microsoft...](http://www.zdnet.com/blog/microsoft/whats-next-for-microsofts-ironruby/7034)

You did good work here -- don't hurt your credibility with overstatement.

~~~
jjore
Well, the problem here is that gems using C are often going to be
memory-corruptingly buggy unless either the gem source is updated to declare
the proper parts volatile, or Ruby's own C API is reworked to evolve this bug
out of existence, in which case gems would have to be updated to use the new
API anyway.

Both problems are hard, and the current state of affairs is apparently that,
some random amount of the time, we'll get memory corruption bugs.

~~~
KirinDave
It's worse than that. We don't actually know where it occurs. There are
clearly some gems where it does, but it could also be occurring elsewhere in
the VM.

Just figuring this out is a non-trivial project.

------
CPlatypus
"Very few people out there know that the volatile type qualifier exists"? Only
if there are "very few" kernel programmers, embedded programmers, and others
who have used C for anything low-level and/or multi-threaded. Otherwise, no.
Sorry, but knowing about it doesn't make you special.

"Volatile" is the wrong fix, by the way. That's just depending on yet another
non-required behavior. There is in fact no further reference to "str" between
the function call and the reassignment at the start of the next iteration, so
there's nothing for "volatile" to chew on. This particular version of this
particular compiler just happens to add an extra pair of stack operations in
this case, but it's not truly required to. A real fix would not only mark the
variable as volatile but also add a reference after the function call. The
same "(void)str;" type of statement that's often used to suppress "unused
argument/variable" warnings should count as a reference to force correct
behavior here.
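
A sketch of the "reference after the call" idea, written as a variation: instead of marking the variable itself volatile, it stores the pointer into a volatile sink after the last raw use. `ManagedString`, `keep_alive` and `append_bytes` are made-up names for illustration, not Ruby's API, and whether this is enough on a given compiler is an assumption, not a guarantee:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a GC-managed string: the collector only
// recognizes pointers to this header, not to the buffer it owns.
struct ManagedString {
    std::size_t len;
    char* data;
};

// A store to a volatile object is an observable side effect, so the
// compiler has to have the stored value on hand at that point.
static const void* volatile g_keep_alive_sink = nullptr;

inline void keep_alive(const void* obj) { g_keep_alive_sink = obj; }

// Copies the string's bytes into `out`. Only the buffer pointer is needed
// during the copy, so without the keep_alive call the compiler is free to
// discard `str` itself before the copy runs.
std::size_t append_bytes(const ManagedString* str, char* out) {
    const char* raw = str->data;
    std::size_t n = str->len;
    std::memcpy(out, raw, n);  // a real GC could run during work like this
    keep_alive(str);           // use of the header after the last raw use
    return n;
}
```

The point is only that `str` has a use after the code that works on the raw buffer, so a reference to the header has to survive at least that long.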

------
softbuilder
Well plus one for a blog post with a theme song, anyway.

