What causes Ruby memory bloat? (joyfulbikeshedding.com)
184 points by adamors 6 days ago | 69 comments

The author never made clear if they are measuring virtual memory usage, or physical memory usage. Having a lot of virtual memory does not "cost" much: it's just having permission to use a lot of memory if you so wish. It's possible to have huge chunks of your virtual address space with no physical memory backing them. Physical memory is the amount of physical RAM your process is using.

Memory managers tend to greedily request a lot of virtual memory because there's usually little harm in doing so, and the act of asking the kernel for more permission (read: more virtual memory) is slow.

Minor quibble: it's not accurate to call `malloc()` and family in glibc the "operating system's memory allocator." That is the memory allocator for C, and Ruby just so happens to be implemented in C. The glibc allocator will use the system calls `brk()`, `sbrk()` and/or `mmap()` to request memory from the kernel (http://man7.org/linux/man-pages/man2/brk.2.html; http://man7.org/linux/man-pages/man2/mmap.2.html). None of this changes the article's punchline, though.


I wrote a bit about this, with some small GC.stat hacks to manage it a tad, where I talk about how Ruby has its own heap and manages garbage collection lower down in this post: https://ginxh.io/2018-04-18/high-performance-priority-queues...

Even fluentd had this issue: its default heap size was lower than what it needed to initialize itself, causing hiccups, until RUBY_GC_HEAP_GROWTH_FACTOR was tuned in the source code.

Ruby memory bloat is everywhere. Being familiar with GC.stat, and being able to tune Ruby applications as you test them, is a good habit to have if you develop or work with Ruby-based tools.
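
For instance, a minimal way to watch the relevant numbers while load-testing (a sketch; the stat keys vary slightly across Ruby versions, and the RUBY_GC_* knobs are environment variables read at interpreter boot):

  p GC.stat.values_at(:heap_live_slots, :heap_free_slots, :minor_gc_count, :major_gc_count)
  # then relaunch with, e.g.: RUBY_GC_HEAP_GROWTH_FACTOR=1.25 ruby app.rb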

Julia Evans, an SRE at Stripe, took a sabbatical to work on a Ruby memory profiler. Her updates are here: https://jvns.ca/categories/ruby-profiler/

This is a great blog post; it's what I would have wanted to know if I'd gone a bit deeper into this myself. Good stuff.


>it's not accurate to call `malloc()` and family in glibc the "operating system's memory allocator."

I watched an interview with Bryan Cantrill some time ago and one of the things he mentioned is that libc is considered part of the operating system for other unixes, and Linux is the only one that decided to redefine operating system to exclude libc.


As per ISO C, libc is part of the compiler.

Whatever else ends up there, like POSIX support, is implementation specific.


I wonder why the ISO C standard specifies who is responsible for implementing libc. That seems screwy at best.

Regardless, afaik OpenBSD and FreeBSD both ship libc as part of the operating system. It makes a lot of sense, really, because libc is bound to need to make syscalls, which differ strongly depending on the kernel, whereas the user interface of libc is pretty much the same across any operating system.


Would memory allocation also be hardware dependent and thus vary as well depending on what hardware the OS decides to support?

You mean like devices without MMUs? I think that might be the case but I'm not sure. Under Linux you need a special kernel and libc to handle such devices IIRC.

Because not every OS is a UNIX clone, and C also targets bare metal.

- afaik freestanding C specifies no standard library at all (except a few pure headers), so bare metal targets providing libc are going beyond the standard.

- I'm just saying it doesn't need to specify who implements the standard library. I've read the standard before and don't really recall this coming up; I'm just assuming this assertion is accurate.

(And of course, the overwhelming reality is that libc is provided by none of the compilers except Microsoft C, since Clang and GCC tend to be used with glibc.)


There is a world beyond gcc, clang and MSVC.

Just off the top of my head: MikroElektronika, TI, Green Hills, Embarcadero, IBM, HP, Unisys, Intel, PGI.


Even then, the comments regarding "freestanding" C runtimes are the same. Targeting bare metal puts the standard library squarely out of scope of the standard.

Author here. I measured RSS with swap disabled, so physical memory usage.

> Minor quibble: it's not accurate to call `malloc()` and family in glibc the "operating system's memory allocator."

I know. I just explained it like that in order to keep the material digestible to a wider audience.


Thank you for the clarification.

> I measured RSS with swap disabled, so physical memory usage.

That sounds correct to me. You can get more details from /proc/<pid>/smaps.
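
For example, a rough sketch that totals the private dirty pages of a process from smaps (field names as in proc(5); pass a PID, or it defaults to the current process):

  pid = ARGV[0] || "self"
  dirty_kb = File.readlines("/proc/#{pid}/smaps")
                 .grep(/^Private_Dirty:/)
                 .sum { |line| line.split[1].to_i }
  puts "Private_Dirty total: #{dirty_kb} kB"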

There are a couple of other issues I found confusing in your text. If you intentionally make simplifications, I would recommend adding at least a footnote to indicate so.

Linux has only one heap. But even though the heap is declared to be of a certain maximum size using brk(), that does not mean all of its pages are really in RAM (as before, swapping aside). A page only gets physically allocated when it is first accessed, and it can be deallocated again using madvise() if the program knows it is no longer needed. malloc_trim() calls madvise(). So you are correct: it does not only move the top of the heap (it probably worked that way many years ago). In general the Linux heap can be sparsely mapped to RAM, but if you use RSS, your measurement takes that into account already.

The arenas are a feature of glibc. They are documented (to some degree) in the man pages, so I would not call them magic. Also malloc_info() prints how they are used. It appears to me that arena 0 is on the Linux heap. If a program is multithreaded, glibc will call mmap() to create an additional memory area to be used for the additional arenas.

While avoiding mutex contention might be a good thing if the threads are allocating many small blocks, the overhead of arenas might not be justified for rarely allocated bigger areas, which is probably how the Ruby heap implementation uses them. So limiting the number of arenas might indeed be beneficial; it's just a typical trade-off one needs to understand. Not that that is easy, but "magic" is the wrong description IMHO.

Let's call the glibc functionality above the system allocator (I'm not sure whether this is the correct name or whether it even has an official name). However, glibc has yet another, completely different allocator: the mmap allocator. For allocations bigger than MMAP_THRESHOLD, glibc will not use the heap or any of the additional arenas, but will directly make a completely new memory mapping from the Linux kernel. Again, these will be lazily/sparsely allocated to RAM. The mmap allocator does not use arenas at all.

I am not a regular Ruby user, so I don't know whether the Ruby heap uses glibc's system allocator or the mmap allocator, or what mix of both. (Mixing them is transparent to a program.) However, from your description I would guess that it mostly uses the system allocator; otherwise arenas would not make any difference, and AFAIK malloc_trim() has no effect on the mmap allocator at all. Have you considered changing either the setting of MMAP_THRESHOLD or the Ruby interpreter code such that the Ruby heap uses only the mmap allocator?
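
If you want to experiment with that, glibc exposes the threshold via mallopt() and via the MALLOC_MMAP_THRESHOLD_ environment variable. A minimal Fiddle sketch, assuming glibc (M_MMAP_THRESHOLD is -3 in glibc's malloc.h):

  require "fiddle"

  libc = Fiddle.dlopen(nil)  # the running process, which links glibc
  mallopt = Fiddle::Function.new(libc["mallopt"],
              [Fiddle::TYPE_INT, Fiddle::TYPE_INT], Fiddle::TYPE_INT)
  M_MMAP_THRESHOLD = -3                      # from glibc's malloc.h
  mallopt.call(M_MMAP_THRESHOLD, 64 * 1024)  # serve allocations >= 64 KiB via mmap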

It appears to me that having the Ruby heap management on top of the glibc system allocator is just not a good idea. The system allocator tries to be a good compromise for a widely varying spectrum of applications. But the Ruby heap management is one very specific case, partially duplicating the work that the glibc allocator does. I'd guess having the Ruby heap run closer to the operating system should improve things. The mmap allocator of glibc might be an easy way to achieve that. Otherwise the Ruby heap should use mmap() directly, because it should know best how it wants to use the memory. But that would be much more difficult to implement, of course.

P.S. Your visualizer looks impressive. Unfortunately I did not have time to study it in detail. Did you make sure that the code never accesses a page that has been mapped but never accessed by Ruby? Because that would dirty the page and increase the RSS.


> Having a lot of virtual memory does not "cost" much

To some extent, I agree, at least on 64 bit systems (on 32 bit systems, there was the risk of running out of address space).

However, the memory in question here is, for the most part, not freshly allocated, but was in use once. This means that it used to be backed by physical memory, and before that backing can be withdrawn, the page has to be written to disk.

It seems to me that the best solution would be to call madvise(MADV_FREE) on these regions, in which case they can be unbacked without further ado. I'm somewhat surprised that the memory allocator does not do this itself already.
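
For reference, the shape of that call (a minimal sketch, assuming Linux on x86-64; the PROT_*, MAP_* and MADV_FREE constants are hard-coded from the kernel headers, and MADV_FREE needs Linux >= 4.5):

  require "fiddle"

  libc = Fiddle.dlopen(nil)
  mmap = Fiddle::Function.new(libc["mmap"],
           [Fiddle::TYPE_VOIDP, Fiddle::TYPE_SIZE_T, Fiddle::TYPE_INT,
            Fiddle::TYPE_INT, Fiddle::TYPE_INT, Fiddle::TYPE_LONG],
           Fiddle::TYPE_VOIDP)
  memset = Fiddle::Function.new(libc["memset"],
           [Fiddle::TYPE_VOIDP, Fiddle::TYPE_INT, Fiddle::TYPE_SIZE_T],
           Fiddle::TYPE_VOIDP)
  madvise = Fiddle::Function.new(libc["madvise"],
           [Fiddle::TYPE_VOIDP, Fiddle::TYPE_SIZE_T, Fiddle::TYPE_INT],
           Fiddle::TYPE_INT)

  len = 1 << 20                              # 1 MiB
  ptr = mmap.call(nil, len, 3, 0x22, -1, 0)  # PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS
  memset.call(ptr, 0, len)                   # touch every page so it becomes resident (dirty)
  madvise.call(ptr, len, 8)                  # MADV_FREE: kernel may reclaim the pages lazily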


> However, the memory in question here is, for the most part, not freshly allocated, but was in use once.

Because the author does not differentiate between virtual and physical memory, I can’t agree with that. It’s quite possible most of the memory “freed” was never in use. In which case, there’s not much benefit. And there is the probable downside that allocation heavy applications will pay a lot more.


> It’s quite possible most of the memory “freed” was never in use.

Given the allocation patterns shown, it would seem to me that if two blocks are still in use, it's fairly likely that the region in between was also in use once (there may be exceptions due to pools etc., but generally memory is parcelled out in a linear fashion).

> And there is the probable downside that allocation heavy applications will pay a lot more.

What would the cost be? All that would happen is that the free pages are marked as clean. I'm sure that's not entirely free, but bound to be considerably cheaper than paging the page out and in again.


> What would the cost be? All that would happen is that the free pages are marked as clean.

That's a minor page fault. You get an OS-level exception, switch to kernel mode, process the page fault by marking the page as loaded, then switch back to user mode. That is expensive if you do it a lot.

> but bound to be considerably cheaper than paging the page out and in again.

Yes, of course, a minor page fault is cheaper than a major page fault. But both are more expensive than no page fault, which is what happens if you just never free the page to the OS and there's plenty of available physical memory.


That's what jemalloc does (besides avoiding fragmentation) and one of the reasons why people use it. http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms

> The author never made clear if they are measuring virtual memory usage, or physical memory usage. Having a lot of virtual memory does not "cost" much: it's just having permission to use a lot of memory if you so wish. It's possible to have huge chunks of your virtual address space with no physical memory backing them. Physical memory is the amount of physical RAM your process is using.

Yes, that remains very unclear. Only physical memory is interesting in the end; address space exhaustion is not an issue on most systems today, and large pages are typically not used. The graphs have "virtual", "dirty" and "clean" in them.

"clean" normally means that the page in RAM is just a copy of a mass storage. If the kernel runs short of memory, it can use this page for other purposes. No harm done, except that the system gets slower when it needs to page in the same page later.

"dirty" normally means that the page in RAM is not backed by mass storage. The kernel must keep it reserved for the current purpose in order not to lose data.

If we assume that the author's application does not do swapping (it really shouldn't for 230 MB; otherwise the machine is seriously overloaded and really slow), all heap in use is always dirty. The amount of dirty memory equals the usage of physical RAM.

I am not sure which Linux tool can easily report the use of physical RAM for the heap. Would RssAnon from /proc/self/status be a good approximation? It certainly contains the stack, too. But that should not grow a lot unless you have infinite recursion.
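
A quick way to read it (RssAnon appears in /proc/<pid>/status on Linux >= 4.5):

  puts File.read("/proc/self/status")[/^RssAnon:\s*(\d+ kB)/, 1]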


Here are some benchmarks on an 8-year-old, 4-core i7 running OSX Sierra. Parsing a 115MB log file for lines containing a 15-character word (regex: \b\w{15}\b) we have:

  LANG / TIME* / RAM

  JS (Node 11.11) / 8.4s / 100MB
  Ruby 2.6 / 19.1s / 14.8MB
  PHP 7.3 / 4.4s / 5.6MB
  Python 3.7 / 24.7s / 4.2MB
  Perl 5.26 / 14.0s / 1.0MB
*These figures are for runtime, i.e. with startup time deducted. Ruby's startup time (0.55s) is much longer than the other languages' (Python: 0.06s, Perl: 0.02s, PHP: 0.14s).

Ruby's memory usage is 3.5 times that of Python for only a 30% speed gain. Perl uses 1/15 of the RAM used by Ruby and is about 27% faster, but it could be argued that Perl 5's lack of built-in OO accounts for some of this ..... until you look at PHP, which has built-in OOP and uses 2/5 of the RAM used by Ruby whilst performing over 4 times as fast.

I love that Ruby is designed for programmer happiness, but the shine starts to wear off when you look at its memory usage. Slow is bearable, as it's only marginal, but Ruby's memory usage, it seems, is an order of magnitude higher. Matz's goal of making Ruby 3 times faster is only half the battle, maybe even only a third. If an increase in speed comes at the expense of even greater memory use, then Ruby will not survive.


If Ruby was performant enough to disrupt the web app industry ten years ago, it is performant enough today. Ruby's survival is not dependent on trivial fluctuations in benchmark numbers.

Ruby is just as shiny as it was in 2006, if not more so. If you feel happy writing log parsers in JavaScript or PHP well then all power to you, but I'd rather chew on a shoe.

I do my data processing in Ruby, if there's a performance issue I'll fork out to a little Go widget. But that's when I need a factor ten improvement or more.


Ruby disrupted those who were never exposed to AOLServer or Zope; the rest of us just had a sense of déjà vu.

Could very well be. I was a young developer when the Rails hype happened; to me it was disrupting clumsy Java and .NET web architectures, and the ugly ad hoc PHP methodology. I had never heard of AOLServer or Zope; no one ever told me about them.

The problem for Ruby on Rails is that a lot of its good ideas were eventually borrowed while it remained encumbered by Ruby's performance. No one develops web apps the way they did on Java before Rails came out. A lot of good ideas from Ruby and RoR became mainstream and popular, but implemented on faster platforms.

"it could be argued that Perl 5's lack of built-in OO"

That's interesting to me. I consider bless() and package namespaces to be "built in OO."

Is there some OO concept that combo doesn't provide?

Also, for "Parsing a 115Mb log file for lines containing a 15-character word" ? Wouldn't that just be:

perl -ne 'print if /\b\w{15}\b/'

I'm surprised that's not faster.


I was trying to create a level playing field as PHP and Python don't lend themselves to one-liners as well as Ruby and Perl.

If blessing a hashref is real OO why has so much energy been expended on Moose and its offspring? Before I left, back in 2013, there was also much gnashing of teeth over whether Perl needed a MOP so I don't think everyone agreed that bless() is enough.


"If blessing a hashref is real OO why has so much energy been expended on Moose and its offspring?"

That's a good question. It's always mystified me. A little boilerplate doesn't bother me. As far as I can tell, it's because people didn't like writing the constructor boilerplate and getters/setters. Maybe for the "isa" pretend types?


If memory mattered that much we’d all still be using Perl.

Some of us still are. And not just because of memory. :)

Memory may be cheap on a VPS but on AWS, GCP and Azure it sure ain't.

Which method did you use to match the line in Ruby? I noticed String[/\b\w{15}\b/] performs way better than String.match(/\b\w{15}\b/).

The Ruby code is:

  IO.foreach('logs1.txt') {|x| puts x if /\b\w{15}\b/.match? x }

What happens if you lift the regexp out of the block and reuse the same regexp instance?

Ruby's smart enough to freeze regexen by default. I think that landed around 2.0/2.1/2.2

   pry(main)> 3.times { puts /i_am_a_reg_ex/.object_id }
   70131978677720
   70131978677720
   70131978677720

Tried:

  regex = Regexp.new '\b\w{15}\b'
  IO.foreach('logs1.txt') {|x| puts x if regex.match? x }
... but performance and RAM usage were identical.

It could be argued anyway that String[/\b\w{15}\b/] is not idiomatic.

Are you interested in sharing the log file if it doesn't contain anything sensitive or a reasonable facsimile with scripts?

Why was the memory usage so high for the node/js version? Was the whole file being loaded into memory? Care to share the code?

  var fs = require('fs'), byline = require('byline');
  var stream = byline(fs.createReadStream('logs1.txt', {encoding: 'utf8'}));
  stream.on('data', function(line) { if (line.match(/\b\w{15}\b/)) console.log(line); });

That regex should have been declared as a variable on its own line.

Not sure that comparing base memory usage on tiny programs that aren’t including any commonly used libraries is that useful.

I'd argue that excluding libraries gives a clearer impression of the relative performance of the languages themselves.

In classic glibc form, malloc_trim has been freeing OS pages in the middle of the heap since 2007, but this is documented nowhere. Even the function comment in the source code itself is inaccurate.

https://stackoverflow.com/questions/15529643/what-does-mallo...


I'm used to a random comment on stack overflow being the source of truth for angular, django and the like. But this is the first time I've seen it for glibc!

The reason glibc malloc doesn't like to free random pages in the middle of a mapping is probably that it inflates the number of PTEs needed to describe it. Say you have one mapping and free a page in the middle of it - now you need two PTEs to describe that. Similarly, the mapping shown in the last image probably requires a few dozen PTEs to accommodate the holes.

That isn't free (it requires cache & TLB space), but it's entirely possible that Ruby is slow enough on the interpreter and data model level for this to not matter much.

Edit: Turns out malloc_trim doesn't actually modify the mapping but rather uses madvise(DONTNEED), so a higher address resolution cost probably only materializes under memory pressure.


It really makes sense that something like this would be the case.

All the experts say "oh, Ruby uses lots of memory for [reason] and it can't really be fixed", so no one even tries.

Until someone comes along who is either motivated, smart, or ignorant(!) enough to try to fix it anyway, and finds that the commonly accepted answer was wrong.

This happens all the time, especially in science. Trust, but verify, I suppose.


> All the experts say "oh, Ruby uses lots of memory for [reason] and it can't really be fixed", so no one even tries

This isn’t true at all. It’s well understood that jemalloc 3.x exhibits lower resident set size because it more readily releases pages.

This idea has been around for at least 3 years: https://bugs.ruby-lang.org/issues/12236


In the context of Rails applications hosted on EC2, I've not found Ruby's memory usage to really be an issue. In my experience most Rails apps range between 150-500MB per instance.

My current employer typically uses M5 instances, which have a ratio of 1 vCPU : 4 GiB RAM.

Running Unicorn, you'll probably want only about 1.5 worker processes per vCPU. Even a memory-heavy Rails app is probably only going to utilise ~20% of the available memory.

Running threaded Puma, you probably want only a single process per vCPU and maybe 5-6 threads. In my apps, running 5 threads per process typically increases the memory of the process by 20%. So in that instance you'd only utilise ~15% of the available memory on an M5 instance.
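
To make the arithmetic concrete, a back-of-envelope sketch (the 500MB heavy-app figure is from above; the m5.xlarge shape of 4 vCPU / 16 GiB is an assumption for illustration):

  procs, per_proc_mb, thread_overhead = 4, 500, 1.20  # 1 Puma process per vCPU, 5 threads each
  used_gib = procs * per_proc_mb * thread_overhead / 1024.0
  puts used_gib / 16.0  # => ~0.15, i.e. ~15% of the instance's RAM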

If you are having memory issues on Rails, then a quick win is upgrading your Ruby version. I saw a 5-10% drop in memory usage with each major version step: 2.3.x -> 2.4.x -> 2.5.x.

Also if it is an old app, check you've not built up cruft in your Gemfile. Removing unused gems can be another quick win for reducing memory usage.


I recently spent weeks investigating a very similar issue in a Haskell program, digging deep into its memory manager and glibc's malloc.c (where I found multiple bugs, suggesting that much of the code in there has not had a proper review in decades).

Writing a memory visualiser is exactly what I needed and planned to do next, so this is a great contribution for anybody working on problems like this.


I know that a Ruby shop naturally wants to use Ruby for everything, but when the job described is:

a simple multithreaded HTTP proxy server written in Ruby (which serves our DEB and RPM packages)

then I would reach for Linux ipvs or haproxy, and apache or nginx to do the serving. Good tools already exist for these things, it's a shame not to use them.

(And we have a Ruby dev group, so please don't accuse us of having a phobia or hatred of Ruby.)


I think since Hongli is the main developer of Phusion Passenger, it's logical for him to want to explore the problem space.

While I agree with you, I also see no harm in experimentations and changes to the status quo for the sake of trying to build something better.


This is a pretty old piece of software; it might predate nginx. Or if it doesn't, it probably does some non-trivial URL rewriting or other logic that would be a pain to do in nginx.

In any case it's just something thrown together to solve a need quickly and effectively. The performance characteristics might not even have been an issue, they just caught his eye.


nginx has been around since 2004. haproxy since 2001. httpd since 1995.....

Re-inventing the wheel is almost certainly harder than learning the configuration syntax of any of these.


Writing an http proxy is 5 lines in node.js, and not much more in Ruby. I could do it in either without even consulting a reference. I spent hours learning nginx configuration, and could spend hours more. If all I need is a simple proxy with some logic, why do it? It's not reinventing the wheel; it's building a wheel that's good enough, using the materials you've got.
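
For instance, a bare forwarding proxy with WEBrick (stdlib until Ruby 3.0, a gem after) is about four lines. This is a sketch, not the author's actual proxy, which adds its own serving logic:

  require "webrick/httpproxy"

  proxy = WEBrick::HTTPProxyServer.new(Port: 8080)
  trap("INT") { proxy.shutdown }
  proxy.start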

I love it. When you think you know what the issue is,

  * prove it is really the issue
  * fix it
  * prove that you fixed it

Tracking the MRI enhancement at: https://bugs.ruby-lang.org/issues/15667

This was a really interesting read. It makes me wonder though - are other languages affected by this? I haven't heard any similar reports from, say, Java or Python.

It's a long known problem with anything using 'real' threads and glibc malloc (to some extent, any malloc really). As an example and contrary to wozer's comment, Java is affected by this: https://github.com/prestodb/presto/issues/8993

I have experienced similar memory fragmentation problems with Python 2.x.

Java has a compacting garbage collector. So it should not be affected.


Except that heap fragmentation was not the issue the author identified...


Excellent post! I didn't quite understand why MALLOC_ARENA_MAX=2 outperforms this solution (slightly) and what the trade-offs are exactly ... can anyone shed more light on this?

I'd have to look more deeply myself, but it likely means larger allocations up front, to avoid having to call out to the OS for smaller allocations later. If you know you're going to use the larger allocations anyway, it'll probably be almost no trade-off; but if you aren't sure, you'll probably use more than you needed.

It doesn't mean larger allocations, it means more allocations. glibc will try to use different arenas for different OS threads to avoid lock contention; if you allow a lot of arenas, then malloc will allocate a lot of chunks of memory, assuming you have lots of threads requesting memory.
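
Since glibc reads the cap once at process start, MALLOC_ARENA_MAX has to be in the environment when the process launches, e.g. (a sketch; app.rb is a placeholder):

  pid = spawn({ "MALLOC_ARENA_MAX" => "2" }, "ruby", "app.rb")
  Process.wait(pid)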

It looks like OP is victim of something similar to https://sourceware.org/bugzilla/show_bug.cgi?id=23416, and malloc_trim is only papering over it.

Source code of the visualizer: https://github.com/FooBarWidget/heap_dumper_visualizer

Best thing is this can be called directly from Ruby with FFI.
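
Something like this, assuming glibc (a sketch using the stdlib's Fiddle rather than the ffi gem):

  require "fiddle"

  libc = Fiddle.dlopen(nil)
  malloc_trim = Fiddle::Function.new(libc["malloc_trim"],
                  [Fiddle::TYPE_SIZE_T], Fiddle::TYPE_INT)
  malloc_trim.call(0)  # returns 1 if memory was released back to the OS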


I call GC.start() manually in various places. It seems to tame memory usage.

When you call GC.start you force a round of garbage collection. I assume running GC.start is slow, but afterwards it gives you some runtime advantages, though only for a while, until the heap fills up again. Instead of forcing collections manually, you can tune RUBY_GC_HEAP_GROWTH_FACTOR (checking GC.stat after initializing and load-testing a few times) so that enough space is allocated up front for the system to run: you want to avoid both initializing memory that is never utilized and having too small a heap, which triggers too many garbage collection runs. Both extremes involve expensively slow kernel system calls.
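
For instance, to see what a manual collection buys you (a sketch):

  before = GC.stat[:heap_free_slots]
  GC.start
  puts "heap_free_slots: #{before} -> #{GC.stat[:heap_free_slots]}"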

I hope someone submits this as an issue to the Ruby repo, because this is awesome.


