The article mentions that Unicorn's out-of-band garbage collection is problematic because the way it works - running the GC after every request, and requiring turning off the normal GC - is overkill. But the community is working on a better solution.
In particular, Aman Gupta described this very same problem and created a gem that improves out-of-band garbage collection by only running it when it is actually necessary and by not requiring you to turn off the normal GC. Phusion Passenger (even the open source version) already integrates with this improved out-of-band garbage collector through a configuration option. This is all described here: http://blog.phusion.nl/2014/01/31/phusion-passenger-now-supp...
Just one caveat: it's still a work in progress. There's currently 1 known bug open that needs reviewing.
Besides the GC stuff, Phusion Passenger also has a very nice feature called passenger_max_requests. It allows you to automatically restart a process after it has processed the given number of requests, thereby lowering its peak memory usage. As far as I know, Unicorn and Puma don't support this (at least not out of the box; whether there are third party tools for this, I don't know). And yes, this feature is in the open source version.
Based on what I've seen in NodeJS, isn't it about time that Ruby has some kind of temporary "Buffer" class that represents data in a way intrinsically different from String?
This would allow explicit clearing of the data in a way that would break how String works, but in the context of Buffer it would be allowed.
If the Ruby GC isn't cutting it for you, maybe old-school memory management is the way to go, right?
From the article, the issue is that the Ruby GC is triggered on total number of objects, and not total amount of used memory.
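To make that object-count-vs-bytes distinction concrete, here's a minimal sketch (assuming MRI with the bundled objspace extension; exact numbers will vary):

# A single String holding ~100MB of malloc'd data still only occupies
# one slot in the object heap whose counts drive GC scheduling.
require "objspace"

big = "a" * (100 * 1024 * 1024)

p ObjectSpace.count_objects[:T_STRING]  # live String slots (thousands of small objects)
p ObjectSpace.memsize_of(big)           # bytes held by this one string (~100MB)
p GC.count                              # how many GCs have actually run so far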
The article claims this got worse with the new generational GC algorithm, which sought to minimize the amount of work the GC has to do during the collection pause. By marking long-lived objects as probably OK, the GC has fewer objects to scan (and free) on each minor cycle.
The problem, then, is that if you allocate some massive strings and they get marked as "old", they might never get collected by the GC. It's not clear to me if the article's author sees it this way, but apparently there are bugs where some classes of objects get marked "old" by accident, and it sounds like that's exacerbated by the existing interpreter architecture.
Work on the Ruby interpreter is weirdly silo'ed off and mostly done by Japanese developers, so there's a significant barrier to entry for any enterprising C developer to roll her sleeves up and get hacking.
Anyhow, my understanding is that the answer to your question is no, it would not help outside of really specific optimizations. It depends on what this Buffer class would do. Would it let you mark buffers as being young? Or avoid extra malloc calls because you have special information about the size of your strings? It kind of depends on how the generational algorithm was implemented. At this stage my knowledge grows thin.
> From the article, the issue is that the Ruby GC is triggered on total number of objects, and not total amount of used memory.
Which, it should be noted, has actually always been a bit of a problem with Ruby's GC. Particularly when it comes to C extensions that can't or won't use certain patterns to hint to the VM about how much memory they're really using.
I remember a really common memory issue with ImageMagick's extension along these lines back in the 1.8.x days. ImageMagick allocated huge objects and gave the Ruby VM very little insight into that fact because it used its own malloc. You'd wind up with a Ruby app that was perfectly normal in terms of memory use but didn't trigger GCs often enough, so the IM objects wouldn't get cleaned up in a timely way and it looked like you had a huge leak. You'd then spend days trying to figure it out.
> Work on the Ruby interpreter is weirdly silo'ed off and mostly done by Japanese developers, so there's a significant barrier to entry for any enterprising C developer to roll her sleeves up and get hacking.
This is wrong. The Ruby developers welcome contributions in any form. They also have various resources to get started:
Official Contributing Guide: http://ruby-doc.org/core-2.1.1/doc/contributing_rdoc.html
Ruby Hacking Guide: http://ruby-hacking-guide.github.io/
Book on ruby internals: http://www.amazon.com/Ruby-Under-Microscope-Illustrated-Internals/dp/1593275277
RubySpecs: http://rubyspec.org/
Further, in case you are stuck, you can post on the mailing lists; someone will surely help you get started.
It's entirely possible that it's changed in the last few years, but at the very least what he said was once very true. There has traditionally been a very real and very painful language barrier to the ruby core team.
But, to be fair, it's a bit of a goose and gander kind of situation. People everywhere else in the world have to deal with that kind of situation all the time.
It has, very significantly. There is still a Japanese-only mailing list (ruby-dev), but the English one (ruby-core) has significantly more traffic. No decisions are made in ruby-dev that are not also discussed in ruby-core.
In addition, it's only that mailing list that's split; the bug tracker is in English, and the help is all in English (with one or two Japanese translations).
It's actually never been easier to contribute to Ruby.
I'm talking about side-stepping the GC engine completely by allowing a form of quasi-manual control over buffer data. Being able to call a method that actually immediately releases buffer data would, I think, help considerably, rather than hoping and praying that the GC eventually gets around to clearing it up.
It might also be possible to create a sort of "weak reference" String-type class where you can manually ditch the data associated with it without wrecking other references to it; they'd just revert to an empty string.
Seems like this could be done without having to get down and dirty in the GC itself.
There's nothing really stopping you from doing this with a cext or just using a String with all its encodings set to BINARY. I don't really see why it would be helpful here, though. The Ruby VM is essentially abdicating the difficulties of managing a large block heap to the platform's implementation, this wouldn't really change with a buffer class that could be similarly large.
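For what it's worth, here's a minimal sketch of that idea in plain Ruby (the Buffer name and API are made up for illustration; String#clear truncates the payload, though whether the malloc'd memory goes back to the OS right away is up to MRI and the allocator):

# A thin "Buffer"-ish wrapper around a BINARY-encoded String with an
# explicit release method, rather than waiting on the GC.
class Buffer
  def initialize(bytes = "")
    @data = bytes.dup.force_encoding(Encoding::BINARY)
  end

  def <<(chunk)
    @data << chunk.dup.force_encoding(Encoding::BINARY)
    self
  end

  def to_s
    @data
  end

  # Explicitly drop the payload; the wrapper object itself survives.
  def release!
    @data.clear
    self
  end
end

buf = Buffer.new
buf << ("a" * (10 * 1024 * 1024))
buf.release!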
> running the GC after every request, and requiring turning off the normal GC - is overkill.
Sort of reminiscent of Erlang's per-process GC. Although, given what I've seen of Ruby's internals, I'm sure the similarities stop at the very highest conceptual levels.
Ruby 2.1.1 introduces a new environment variable, RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR, that helps to mitigate the heap growth caused by the generational garbage collector added in 2.1. By setting this variable to a value lower than the default of 2 (we are using the suggested value of 1.3) you can indirectly force the garbage collector to perform more major GCs, which reduces heap growth.
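A quick way to sanity-check whether the tuning is actually biting is to compare major vs. minor GC counts (a rough sketch; the script name is made up, and the GC.stat keys below are the ones MRI 2.1 exposes):

# Run as e.g.: RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.3 ruby gc_check.rb
# A lower factor should show major GCs happening more often relative to minors.
p ENV["RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR"]

100_000.times { "a" * 1024 }   # churn out some short-lived garbage

stats = GC.stat
p minor: stats[:minor_gc_count], major: stats[:major_gc_count]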
This needs to be upvoted more. We went through similar pains when upgrading a rather big app to Ruby 2.1.1, and setting RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR to 1.3 did help reduce the memory usage of Unicorn workers by almost 100MB.
I am not saying this is the actual fix, and I would like to hear from the OP whether he tried this environment variable and if it had any impact.
TL;DR New generational garbage collector of Ruby has some issues and ruby-core is working to fix it.
We're using this and found it works well at slowing down heap growth; however, we still end up with the same maximum memory usage as before, it just takes longer to get there.
No, you should be using Puma.[0] Unicorn is a work of art and it is incredibly simple, but with that comes a lot of waste. Puma won't fix this problem, as it lies in MRI, but you will be able to run far fewer processes, so total memory consumption won't be such an issue.
Ruby 2.1 comes with an asterisk. It's a lot faster but you should take some time to tune the GC to your application's needs. Aman Gupta[1] has some excellent posts on his blog about how to do this. On Rails apps that I have upgraded from 2.0 to 2.1 we have seen around 25% (and up to as high as 50% in some places) decrease in response times. The GC oddities will all certainly get better in Ruby 2.2 (and maybe even in minor releases of 2.1 but I doubt it).
Being multi-process is the "waste". Each process has its own heap, its own copy of the AST and bytecode, etc. Even with preforking and copy-on-write, the overhead introduced by processes is still significantly bigger than running many threads within 1 process.
Of course, this doesn't have to be a problem for everyone. Whether threads are advantageous depends a lot on the workload.
I think a little extra memory in exchange for a whole lot of safety and code simplicity is a fair tradeoff. If you don't have enough memory to spawn a few hundred subprocesses I'd suggest you tackle that first.
The first step of scaling an app is not to make sure you have a server with at least 44GB of RAM.
If we say a typical Rails process is 150MB, then consider several hundred of them...
150MB * 300 = 45,000MB, or roughly 44GB
So, while your point is valid, in that there is value in safety and code simplicity, we're not talking just "a little" memory. Consider your use case, that's all.
Your point is about economics, not scaling per se. A program that requires lots of memory may be wasteful, but it could still very well scale (in terms of its ability to service requests under increasing load).
Nevertheless, the memory requirements of a multiprocess server are not a linear function of the footprint of the master process. fork(2) on modern Unices (including Linux, since, like forever) has copy-on-write semantics for a child's memory usage. So when you fork a new process, the memory overhead is relatively small (only the page tables are copied, not the heap, and never the program text or shared libraries).
Only when a child process modifies or allocates a new heap page or exec()s another program will it incur additional memory overhead. So the additional memory required by a child Unicorn process (assuming the parent preloads the application server code) won't usually be anywhere near 150MB (and if there's a memory leak it's trivial to free it up by killing the child after it services a request).
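As a rough illustration of the preload-then-fork pattern that argument relies on (a minimal sketch, not Unicorn's actual code; BIG_TABLE just stands in for the preloaded app):

# The parent allocates the expensive state once; forked children share
# those pages copy-on-write and only pay for pages they actually write to.
BIG_TABLE = Array.new(1_000_000) { |i| i.to_s }

pids = 4.times.map do
  fork do
    # Reading BIG_TABLE does not copy it; only writes would.
    puts "worker #{Process.pid} sees #{BIG_TABLE.size} entries"
    sleep 1
  end
end

pids.each { |pid| Process.wait(pid) }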
Building a JVM is harder than building a Ruby VM. I was asking from the perspective of user code. Shared memory requires mutexes in both threaded and forked code - this is not an advantage of multiple processes. In fact, multiprocess shared memory is generally harder to work with than threaded shared memory.
We're talking about Ruby here, though. MRI threading has concurrency limitations, so processes are better here IMO.
I tend to think out of the box for multiprocess shared memory - Kyoto Cabinet and Redis can be helpful here, to name a few. Even SysV shared memory is pretty easy to use.
It varies from app to app and what the underlying hardware is. On "real" hardware, meaning not virtual CPUs and virtual threads, I've seen one Puma worker replace 24 Unicorns.
Most cloud (god I hate that word) providers will be giving you vCPUs, in those cases usually one Puma will replace 4-8 Unicorn workers.
Your mileage may vary based on whether your app is more IO or CPU bound. Most Rails apps are incredibly IO bound, though.
Yes. But support for Aman Gupta's improved out-of-band garbage collector is in the open source version.
The Enterprise version also happens to support out-of-band garbage collection even for multithreaded apps, which as far as I know is not supported by Puma (yet?)
I totally agree with codinghorror that a blog without comments is not a blog, and this is a prime example. No way to respond to the author without jumping through crazy hoops.
As for the bait-filled title: Ruby's GC is ready for production and is being used in production in plenty of places. Just don't go expecting rainbows if you disable it. And expect memory doubling with 2.1.1 if left untuned; you can choose how much you want the memory to increase relative to 2.0.
First, sorry we don't have comments yet. I'm the author, so you no longer have to be enraged. Perhaps just annoyed.
Second, with Ruby 2.1, out of the box, this code:
while true do
  "a" * (1024 ** 2)
end
leads to infinite process growth. There should be no need for "memory doubling" to run this code -- you're generating throwaway strings of identical size. Similar code, in other languages, does not lead to out of control memory consumption.
Also, the problem isn't caused by disabling the GC. This happens in stock Ruby 2.1. Disabling the GC (and running it once per request) is the fix for the problem.
> There should be no need for "memory doubling" to run this code -- you're generating throwaway strings of identical size.
How do you know they are throwaway?
Your code involves two method calls. While it would be unlikely that someone has overridden them, given that they are core String and Fixnum methods, it is perfectly possible. E.g.:
class String
  alias :old :*

  def * right
    $store ||= []
    $store << self
    old(right)
  end
end

"foo" * 5
"bar" * 3
p $store
In other words, even with seemingly innocent calls like that, it takes extra work to be able to reuse that memory without a full GC pass. It's certainly not impossible, but it's not there yet.
So I agree with you in principle, but this is one of those areas where the malleability of Ruby objects makes it tricky to optimize.
(One possible approach is to set aside a bit to indicate "has at least once been stored somewhere it can escape from", as a poor man's escape analysis: if your object is only ever stored in local variables that have not been captured by a lambda or passed as arguments, then it can't escape higher than where it was created, so you can take shortcuts; otherwise you'd still need a full GC pass.)
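To illustrate what "escaping" means there, a tiny made-up example:

def no_escape
  tmp = "a" * 1024          # only ever held in a local, never captured or passed on:
  tmp.length                # in principle the VM could reclaim tmp eagerly here
end

$callbacks = []

def escapes
  tmp = "a" * 1024
  $callbacks << -> { tmp }  # captured by a lambda stored globally: tmp escapes,
  tmp.length                # so only a real GC pass can safely reclaim it
end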
My blog doesn't have a comment section. It's because I sometimes post controversial opinions and I don't want the garbage that often pollutes comment sections ending up permanently attached to my writing. You can't delete overly negative comments without seeming like a fascist, so what do you do?
I prefer to submit my posts to HN or other sites where people can comment and not have their comments permanently attached.
As for Jeff Atwood's comparison of a blog without comments to a pulpit, that's ridiculous. Does he also hate books and essays? My blog is simply a collection of essays.
Heroku Ruby guy here. I recommend Puma. Especially if your app is threadsafe. Even if it isn't you can still run multiple workers and only one thread and you get protection against slow clients (that Unicorn doesn't have). https://devcenter.heroku.com/articles/deploying-rails-applic...
I admire your recent work with threads in Ruby with the Puma libraries and the threaded in memory queue (I was working on something similar myself, I keep forgetting to get back to it though). Most companies I see using Heroku are always wondering if their bill could be lower and most of the time they are using preforking servers and background job processors.
I look forward to your talk at Ancient City Ruby next week!
Any idea why the docs[1] recommend using Unicorn instead?
> On Cedar, we recommend Unicorn as the webserver. Regardless of the webserver you choose, production apps should always specify the webserver explicitly in the Procfile.
Wow, this is awesome. You should add a link to the Puma guide [1] as I've been wanting to make use of Puma workers for a while but have always been worried about running out of memory.
Unsure how it is these days, but sometime back there were pretty huge issues with the rolling restart function (cluster mode pretty much didn't work, and would cause downtime) and UNIX socket handling in Puma (would hold onto sockets, would have to login to servers to clear the issue). This caused a decent bit of downtime on occasions.
I would expect these to be fixed by now, and for certain setups it simply wouldn't be an issue, but definitely put a big caveat on "Use Puma!"
Performance-wise we saw it actually be worse than Unicorn at my former company. There were a few issues, but the biggest one was no out-of-band GC at the time, so GC time was part of the request and something a user saw. There were some use cases where it should have destroyed Unicorn but still strangely lagged (threaded, high-IO operations; it should have been able to do far more with less, since Puma continues to serve while Unicorn workers will wait). Memory usage did seem better.
We ended up back on Unicorn after a few weeks of fighting. I would imagine it to be better now (no longer deal with servers at all, so no idea), but I found the dissonance between everyone heaping love on Puma at the time and the actual experience jarring.
It seemed like a fantastic project, but at the time only for certain projects and certain infrastructure setups while Unicorn was fairly bulletproof.
Of course all of this is old knowledge and with the people who are involved in the project I expect things are far better these days. Caveat lector and such. :D
There is no "default Heroku server" you get exactly what you specify. If you specify nothing you get what Rails runs by default which is webrick and happens to be awful in production. It's not that "Heroku has a bad default" here but rather Rails' default web server is not intended to run in Production. Most of our docs try to emphasize that you should be using another server other than webrick. Some people want to just bang out a proof of concept and not worry about configuring Puma or Unicorn, for them webrick is fine.
Does Heroku reach out to people that are running WEBrick in production? Or do you have a report or something showing how many are running WEBrick in production? I agree that it takes a lot of ignoring of warnings and documentation to run WEBrick in production, and Heroku's docs do a great job educating people, but nonetheless I meet people (usually 1-2 people startups where tech founder isn't very strong) that are running WEBrick in prod :(
We tend to make side optimization recommendations whenever we notice something is wrong and you open a support ticket. We could add it to the warnings at the end, but I can safely say most people don't read them. Still worth having for those that don't, good suggestion.
There is no 'default' Heroku server on Cedar; it runs what your app asks to run. If that's Thin, you'll be running on Thin. However, Heroku recommends Unicorn.
I can't imagine why a company that charges a ton for RAM would recommend a preforking application server ;) Also, when using vCPUs versus using real CPUs there's a noticeable drop off in thread performance so I doubt Puma performs all that great on Heroku :(
I recommend Puma and work for Heroku, though thread safety is still a huge issue. It also pre-forks in hybrid mode, which makes it super awesome. Also it takes care of slow clients!
In fact, without specifying anything else, a Rails app deployed to Heroku will default to using webrick. Consequently, at Waza 2013 they revealed that webrick is the most used server at Heroku, and is 3x more popular than Thin. [0]
http://www.isrubyfastyet.com/ seems to show that Rails+JRuby performance remains disappointingly bad. The memory usage is kind of nuts too, not sure why it's so high.
I think a lot of Rubyists unfortunately don't write thread safe code :( I've worked on many Rails apps that have some mutable state on the ApplicationController and the models. Also, developing on the JVM requires a change in tooling that a lot of Rubyists aren't used to, as well as the slow JVM start up times which can make TDD with JRuby a pain. JRuby is consistently getting better, though.
> mutable state on the ApplicationController and the models
Well, mutable state on the models seems pretty fair, unless I'm misunderstanding you. It's rather unforgivable on ApplicationController though.
Edit: and the slow startup time doing TDD with JRuby was my primary complaint when I developed in it. Although iirc you can use nailgun or something to keep the JVM active and mitigate that pain.
Is the JRuby environment sufficiently different that you can't TDD against MRI and then save JRuby for final validation testing and deployment? They shouldn't be that different, unless you're not writing pure ruby code.
Alternatively, Nailgun should reduce a lot of that pain. Effectively you have a running server instance of JVM and run some clients on it. That gives you JVM speed advantages without the slow startup. JRuby natively supports it: https://github.com/jruby/jruby/wiki/JRubyWithNailgun
The downside is that you can leak memory if you have circular references. The programmer needs to take care to not let this happen. (This is essentially the cause of the Javascript memory leaks in IE, although the reference counting only occurred at the boundary between Javascript objects and browser objects.)
I believe python handles this by occasionally running a garbage collector that looks for leaked objects.
Objective C offers weak pointers (pointers that "don't count" and zero themselves out when all the other pointers to the object disappear), which can aid a programmer in avoiding circular references.
Finally, I believe it has been shown that reference counting can be slower than a well-implemented garbage collector, but I suspect this depends on your specific workload.
Reference counting is also the reason why Python requires a GIL. Incrementing a reference count is an opportunity for a race condition (another thread could read the refcount in between your read and write), which means that every reference count needs to be protected by a lock. Either you do fine-grained locking for each object, or you add a GIL for the whole interpreter. The former will absolutely kill performance (not only do you need to increment a refcount with every assignment or function call, you need to take a lock). The latter makes the whole interpreter thread-hostile.
You're off in the weeds. A refcount can be incremented/decremented with a simple atomic compare-and-swap, and that's exactly what most refcounting systems do.
> The downside is that you can leak memory if you have circular references. The programmer needs to take care to not let this happen.
Not entirely true[1]:
> CPython currently uses a reference-counting scheme with (optional) delayed detection of cyclically linked garbage, which collects most objects as soon as they become unreachable, but is not guaranteed to collect garbage containing circular references.
The "most object" and "not guaranteed" refer to[2]:
> Objects that have __del__() methods and are part of a reference cycle cause the entire reference cycle to be uncollectable, including objects not necessarily in the cycle but reachable only from it. Python doesn’t collect such cycles automatically because, in general, it isn’t possible for Python to guess a safe order in which to run the __del__() methods. If you know a safe order, you can force the issue by examining the garbage list, and explicitly breaking cycles due to your objects within the list.
Note that this is Python 2 thing; in Python 3, the behavior has changed, and cycles with __del__ methods get collected. I include it because many people still use Python 2. As for Python 2, however, I believe this case is rare, and requires both an object to be involved in a cycle (a bit rare) and have a __del__ method (I've never written such a method in many years of Python).
> Objective C offers weak pointers
Python also offers weak pointers.
> Finally, I believe it has been shown that reference counting can be slower than a well-implemented garbage collector
My understanding is that this is true, but workload dependent. I think this mostly comes from being able to quickly allocate memory, because your heap is just allocating by moving a pointer the required number of bytes through the heap. (Allocation is just an addition, mostly.) When a collection happens, objects not collected move to a different heap (the next generation), and the heap used for allocation is emptied out. The amount of work and moving depends on the number of objects remaining, which tends to be low, since objects are short lived.
The advantage to pure refcounting, in languages such as C++, is that collection is deterministic. In the rare case that you have a cycle, you do have to handle it manually (that's the tradeoff), but I find these are extremely rare. Memory management in modern C++ is a non-issue most of the time. And it prevents bugs like:
data = open(filename, 'rb').read()
I see this bug far too often. One could argue that in Python, it'll get collected via refcount, but this isn't guaranteed by the language. I've seen the same bug in Java/C# code, and there are no refcounts to save you there. (Of course, these languages have a with/using/etc. block, and it's an easy fix, yet nonetheless, these bugs are frequent.)
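The Ruby analogue, for what it's worth (placeholder filename; the block form closes the handle deterministically, the one-liner leaves it to GC/finalizers):

# Leaks the file descriptor until the File object is eventually finalized:
data = File.open("some_file", "rb").read

# Closes the handle as soon as the block returns:
data = File.open("some_file", "rb") { |f| f.read }

# Or just:
data = File.binread("some_file")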
I would suspect that it's because a ruby ARC would require thinking more about references (the strong/weak paradigm would, I think, need to come into play) which isn't really a concept in Ruby. I think it would be difficult to ensure code written without arc would still run properly.
But I agree with you. I much prefer ARC over any GC I've come into contact with.
Performance suffers. Every write requires an adjustment to an object's reference count, and if you have objects that can be cross-thread an atomic test-set at that. The underlying allocator has to deal with fragmentation, because there is no copying. Furthermore, every object grows a count field.
Another vote for puma. We're using Unicorn in our production environment right now with the unicorn-worker-killer gem, but our initial tests with ruby 2.1 and puma in dev/QA are going well so we're looking to move to that setup in the near future.
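For anyone curious, the unicorn-worker-killer wiring looks roughly like this in config.ru (a sketch from memory of the gem's README; treat the middleware names and constants as assumptions and check the README):

# Restart a worker after a randomized number of requests, and/or once
# its memory use crosses a threshold, instead of letting it bloat forever.
require "unicorn/worker_killer"

use Unicorn::WorkerKiller::MaxRequests, 3072, 4096
use Unicorn::WorkerKiller::Oom, 192 * (1024**2), 256 * (1024**2)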
Oh Unicorn... a little GC tuning goes a long way, and isn't that hard... For most apps the time spent in GC, while annoying, is far and away not the bottleneck.
... especially since Ruby 2.0 as the GC is much, much faster.
I'll be honest, I don't think the memory bloat is that problematic if you design the app well (business logic definitely prevents this sometimes)... but in general you can pass most of that off to a background worker, or pre-cache responses so you aren't bloating your app server instances/threads.
Couldn't this problem be largely solved the same way C# does it, with its GC.AddMemoryPressure function? If you're using lots of unmanaged memory, you simply inform the GC when you're allocating and freeing it, so that in can be taken into account.
I haven't needed to maintain a production service using the JVM, but I've seen coworkers debug one where the service was producing too much garbage for the GC to keep up with, and then falling over with an OOM error.
No, using another language wouldn't have solved the issue. Modifying the code to reduce the amount of garbage produced was one of the solutions. Deleting the code not too long after was the ultimate solution.
Of course, the JVM is insecure by design; that's the whole point of a VM, really: to run anything no matter the context, so there will always be the same ways to escape the sandbox: http://seclists.org/fulldisclosure/2013/Jul/172 The only thing keeping it secure is legions of devs and an "application server" propagating trust.
I really do not understand what the Java EE version history page has to do with the security of the Java runtime.
As for the other link, I think you have mixed things up a little. Those vulnerabilities are ways to bypass the Java sandbox, when Java is running in the browser as an applet. This is really not comparable to Ruby, as it cannot run in the browser and does not even have a sandbox functionality as far as I know.
Show me a way to remotely execute code on a machine just because it is running Java, and you will have my fullest attention :-)
I mean security can always be improved but you are comparing apples to apples and declaring one of the two to be an orange without decent evidence or support.