
What causes Ruby memory bloat? - adamors
https://www.joyfulbikeshedding.com/blog/2019-03-14-what-causes-ruby-memory-bloat.html
======
scott_s
The author never made clear if they are measuring _virtual_ memory usage, or
_physical_ memory usage. Having a lot of virtual memory does not "cost" much:
it's just having permission to use a lot of memory if you so wish. It's
possible to have huge chunks of your virtual address space with no physical
memory backing them. Physical memory is the amount of physical RAM your
process is using.

Memory managers tend to greedily request a lot of virtual memory because
there's usually little harm in doing so, and the act of asking the kernel for
more permission (read: more virtual memory) is slow.

Minor quibble: it's not accurate to call `malloc()` and family in glibc the
"operating system's memory allocator." That is the memory allocator for C, and
Ruby just so happens to be implemented in C. The glibc allocator will use the
system calls `brk()`, `sbrk()` and/or `mmap()` to request memory from the
kernel ([http://man7.org/linux/man-pages/man2/brk.2.html](http://man7.org/linux/man-pages/man2/brk.2.html);
[http://man7.org/linux/man-pages/man2/mmap.2.html](http://man7.org/linux/man-pages/man2/mmap.2.html)).
Nothing really changes per the punchline.
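The virtual-vs-physical distinction is easy to see for yourself. A minimal Linux-only sketch (the method name is made up, and it just parses /proc/self/status): VmSize is address space the process has permission to use, VmRSS is what's actually backed by physical RAM.

```ruby
# Linux-only: compare this process's virtual size (VmSize) with its
# resident set size (VmRSS), both reported in kB by /proc/self/status.
def memory_kb
  status = File.read("/proc/self/status")
  {
    virtual:  status[/^VmSize:\s+(\d+) kB/, 1].to_i,
    resident: status[/^VmRSS:\s+(\d+) kB/, 1].to_i
  }
end

mem = memory_kb
puts "virtual: #{mem[:virtual]} kB, resident: #{mem[:resident]} kB"
```

On any normal process the virtual figure comes out well above the resident one, which is exactly the gap the article's measurements need to be careful about.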

~~~
microtherion
> Having a lot of virtual memory does not "cost" much

To some extent, I agree, at least on 64 bit systems (on 32 bit systems, there
was the risk of running out of address space).

However, the memory in question here is, for the most part, not freshly
allocated, but was in use once. This means that it used to be backed by
physical memory, and before that backing can be withdrawn, the page has to be
written to disk.

It seems to me that the best solution would be to call madvise(MADV_FREE) on
these regions, in which case they can be unbacked without further ado. I'm
somewhat surprised that the memory allocator does not do this itself already.

~~~
scott_s
> However, the memory in question here is, for the most part, not freshly
> allocated, but was in use once.

Because the author does not differentiate between virtual and physical memory,
I can’t agree with that. It’s quite possible most of the memory “freed” was
never in use. In which case, there’s not much benefit. And there is the
probable downside that allocation-heavy applications will pay a lot more.

~~~
microtherion
> It’s quite possible most of the memory “freed” was never in use.

Given the allocation patterns shown, it would seem to me that if two blocks
are still in use, it's fairly likely that the region in between was also in
use once (there may be exceptions due to pools etc., but generally memory is
parcelled out in a linear fashion).

> And there is the probable downside that allocation heavy applications will
> pay a lot more.

What would the cost be? All that would happen is that the free pages are
marked as clean. I'm sure that's not entirely free, but it's bound to be
considerably cheaper than paging the page out and in again.

~~~
scott_s
> What would the cost be? All that would happen is that the free pages are
> marked as clean.

That's a minor page fault. You get an OS-level exception, switch to kernel
mode, process the page fault by marking the page as loaded, then switch back
to user mode. That gets expensive if you do it a lot.

> but bound to be considerably cheaper than paging the page out and in again.

Yes, of course, a minor page fault is cheaper than a major page fault. But
both are more expensive than _no_ page fault, which is what happens if you
just never free the page to the OS and there's plenty of available physical
memory.

------
cutler
Here are some benchmarks on an 8-year-old, 4-core i7 running OSX Sierra.
Parsing a 115Mb log file for lines containing a 15-character word (regex:
\b\w{15}\b) we have:

    
    
      LANG / TIME* / RAM
    
      JS (Node 11.11) / 8.4s / 100Mb
      Ruby 2.6 / 19.1s / 14.8Mb
      PHP 7.3 / 4.4s / 5.6Mb
      Python 3.7 / 24.7s / 4.2Mb
      Perl 5.26 / 14.0s / 1.0Mb
    

*These figures are for runtime, i.e. with startup time deducted. Ruby's startup time (0.55s) is much longer than the other languages' (Python: 0.06s, Perl: 0.02s, PHP: 0.14s).
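The benchmark scripts themselves weren't posted; for reference, the Ruby side of such a scan could be as small as this sketch (method name and file handling are my assumptions, only the regex comes from the comment above):

```ruby
# Count the lines of a log file containing a 15-character word,
# using the benchmark's regex \b\w{15}\b.
def count_long_word_lines(path)
  File.foreach(path).count { |line| line.match?(/\b\w{15}\b/) }
end
```

Note `\b\w{15}\b` only matches words of exactly 15 characters, since there is no word boundary inside a longer run of word characters.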

Ruby's memory usage is 3.5 times that of Python for only a 30% speed gain.
Perl uses 1/15 of the RAM used by Ruby and is 35% faster, but it could be
argued that Perl 5's lack of built-in OO accounts for some of this ..... until
you look at PHP, which has built-in OOP and uses 2/5 of the RAM used by Ruby
whilst running 4.3 times as fast.

I love that Ruby is designed for programmer happiness, but the shine starts to
wear off when you look at its memory usage. Slow is bearable, as it's only
marginal, but Ruby's memory usage seems to be orders of magnitude higher.
Matz's goal of making Ruby 3 times faster is only half the battle, maybe even
only a third. If an increase in speed comes at the expense of even greater
memory use then Ruby will not survive.

~~~
bluedino
If memory mattered that much we’d all still be using Perl.

~~~
ryl00
Some of us still are. And not just because of memory. :)

------
wrs
In classic glibc form, malloc_trim has been freeing OS pages in the middle of
the heap since 2007, but this is documented nowhere. Even the function comment
in the source code itself is inaccurate.

[https://stackoverflow.com/questions/15529643/what-does-malloc-trim0-really-mean](https://stackoverflow.com/questions/15529643/what-does-malloc-trim0-really-mean)

~~~
anitil
I'm used to a random comment on stack overflow being the source of truth for
angular, django and the like. But this is the first time I've seen it for
glibc!

------
blattimwind
The reason glibc malloc doesn't like to free random pages in the middle of a
mapping is probably because that inflates the number of PTEs needed to
describe it. Say you have one mapping and free a page in the middle of it -
now you need two PTEs to describe that. Similarly, the mapping shown in the last
image probably requires a few dozen PTEs to accommodate the holes.

That isn't free (it requires cache & TLB space), but it's entirely possible
that Ruby is slow enough on the interpreter and data model level for this to
not matter much.

Edit: Turns out malloc_trim doesn't actually modify the mapping but rather
uses madvise(DONTNEED), so a higher address resolution cost probably only
materializes under memory pressure.

------
spectre256
It really makes sense that something like this would be the case.

All the experts say "oh, Ruby uses lots of memory for [reason] and it can't
really be fixed", so no one even tries.

Until someone comes along who is either motivated, smart, or ignorant(!)
enough to try to fix it anyway, and finds that the commonly accepted answer
was wrong.

This happens all the time, especially in science. Trust, but verify, I
suppose.

~~~
jashmatthews
> All the experts say "oh, Ruby uses lots of memory for [reason] and it can't
> really be fixed", so no one even tries

This isn’t true at all. It’s well understood that jemalloc 3.x exhibits lower
resident set size because it more readily releases pages.

This idea has been around for at least 3 years: [https://bugs.ruby-lang.org/issues/12236](https://bugs.ruby-lang.org/issues/12236)

------
guy_c
In the context of Rails applications hosted on EC2, I've not found Ruby's
memory usage to really be an issue. In my experience most Rails apps range
between 150-500MB per instance.

My current employer typically uses M5 instances which have a ratio of 1 vCPU :
4 GiB Ram.

Running Unicorn you'll probably only want about 1.5 worker processes per vCPU.
Even a memory-heavy Rails app is probably only going to utilise ~20% of the
available memory.

Running threaded Puma, you probably want only a single process per vCPU and
maybe 5-6 threads. In my apps running 5 threads per process typically
increases memory of the process by 20%. So in that instance you'd only utilise
15% of the available memory on a M5 instance.

If you are having memory issues on Rails, then a quick win is upgrading your
Ruby version. I saw a 5-10% drop in memory usage with each major version:
2.3.x -> 2.4.x -> 2.5.x.

Also if it is an old app, check you've not built up cruft in your Gemfile.
Removing unused gems can be another quick win for reducing memory usage.

------
xsmasher
I love it. When you think you know what the issue is,

    
    
      * prove it is really the issue
      * fix it
      * prove that you fixed it

------
dsr_
I know that a Ruby shop naturally wants to use Ruby for everything, but when
the job described is:

> a simple multithreaded HTTP proxy server written in Ruby (which serves our
> DEB and RPM packages)

then I would reach for Linux ipvs or haproxy, and apache or nginx to do the
serving. Good tools already exist for these things, it's a shame not to use
them.

(And we have a Ruby dev group, so please don't accuse us of having a phobia or
hatred of Ruby.)

~~~
tinco
This is a pretty old piece of software; it might predate nginx. Or if it
doesn't, it probably does some non-trivial URL rewriting or other logic that
would be a pain to do in nginx.

In any case it's just something thrown together to solve a need quickly and
effectively. The performance characteristics might not even have been an
issue, they just caught his eye.

~~~
Twirrim
nginx has been around since 2004. haproxy since 2001. httpd since 1995.....

Re-inventing the wheel is almost certainly harder than learning the
configuration syntax of any of these.

~~~
tinco
Writing an HTTP proxy is 5 lines in node.js, and not much more in Ruby. I
could do it in either without even consulting a reference. I spent hours
learning nginx configuration, and could spend hours more. If all I need is a
simple proxy with some logic, why do it? It's not reinventing the wheel, it's
building a wheel that's good enough using the materials you've got.
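For what it's worth, a bare-bones forwarding proxy really is short in Ruby. A sketch with stdlib sockets only (no HTTP parsing, one thread per connection; the function name and structure are mine, not from any particular codebase):

```ruby
require 'socket'

# Minimal TCP forwarding proxy: accept on listen_port and pipe bytes in
# both directions to upstream_host:upstream_port until either side closes.
def run_proxy(listen_port, upstream_host, upstream_port)
  server = TCPServer.new(listen_port)
  loop do
    down = server.accept
    Thread.new(down) do |client|
      up = TCPSocket.new(upstream_host, upstream_port)
      pumps = [
        Thread.new { IO.copy_stream(client, up) rescue nil; up.close_write rescue nil },
        Thread.new { IO.copy_stream(up, client) rescue nil; client.close_write rescue nil }
      ]
      pumps.each(&:join)
      [client, up].each { |s| s.close rescue nil }
    end
  end
end
```

A real HTTP proxy needs header rewriting, keep-alive handling and so on, which is presumably where the "some logic" in the comment above lives.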

------
nh2
I recently spent weeks investigating a very similar issue in a Haskell
program, diving deep into its memory manager and glibc's malloc.c (where I
found multiple bugs, suggesting that much code in there appears never to have
gotten proper review in decades).

Writing a memory visualiser is exactly what I needed and planned to do next,
so this is a great contribution for anybody working on problems like this.

------
sams99
Tracking the MRI enhancement at: [https://bugs.ruby-lang.org/issues/15667](https://bugs.ruby-lang.org/issues/15667)

------
jplayer01
This was a really interesting read. It makes me wonder though - are other
languages affected by this? I haven't heard any similar reports from, say,
Java or Python.

~~~
wozer
I have experienced similar memory fragmentation problems with Python 2.x.

Java has a compacting garbage collector. So it should not be affected.

~~~
favorited
Except that heap fragmentation was _not_ the issue the author identified...

------
xtracto
Interesting, I read about this from here:
[https://medium.com/@floriendrees/a-poodle-is-a-dog-ruby-is-not-rails-83f8d1bb4f0e](https://medium.com/@floriendrees/a-poodle-is-a-dog-ruby-is-not-rails-83f8d1bb4f0e)

------
gingerlime
Excellent post! I didn’t quite understand why MALLOC_ARENA_MAX=2 outperforms
this solution (slightly), and what the trade-offs are exactly ... can anyone
shed more light on this?

~~~
simcop2387
I'd have to look more deeply myself, but it likely means larger allocations up
front, to avoid having to call out to the OS to make smaller allocations. So
if you know you're going to use the larger allocations anyway, there's
probably almost no trade-off; but if you aren't sure, you'll probably use more
than you needed.

~~~
mbell
It doesn't mean larger allocations, it means more allocations. glibc will try
to use different arenas for different OS threads to avoid lock contention; if
you allow a lot of arenas, then malloc will allocate a lot of chunks of memory,
assuming you have lots of threads requesting memory.

------
glandium
It looks like OP is a victim of something similar to
[https://sourceware.org/bugzilla/show_bug.cgi?id=23416](https://sourceware.org/bugzilla/show_bug.cgi?id=23416),
and malloc_trim is only papering over it.

------
aboutruby
Source code of the visualizer:
[https://github.com/FooBarWidget/heap_dumper_visualizer](https://github.com/FooBarWidget/heap_dumper_visualizer)

Best thing is this can be called directly from Ruby with FFI.
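A Linux/glibc-only sketch of that FFI call, using only the stdlib Fiddle bindings (no gems; the variable names are mine). It assumes the process links glibc, where the signature is `int malloc_trim(size_t pad)`:

```ruby
require 'fiddle'

# Call glibc's malloc_trim(0) from Ruby to ask the allocator to release
# free heap memory back to the kernel. Linux/glibc only.
libc = Fiddle::Handle.new   # dlopen(NULL): symbols of the running process
malloc_trim = Fiddle::Function.new(
  libc['malloc_trim'],      # int malloc_trim(size_t pad)
  [Fiddle::TYPE_SIZE_T],
  Fiddle::TYPE_INT
)
released = malloc_trim.call(0)  # 1 if memory was released, 0 otherwise
```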

------
dnprock
I call GC.start() manually in various places. It seems to tame memory usage.

~~~
stdcli
When you call GC.start() you force a round of garbage collection. I assume
running GC.start() is slow, but it gives you some runtime advantages
afterwards, at least until the heap fills up again. To avoid having to call
GC.start() and manually force collections, you can inspect GC.stat after
initializing and load testing a few times, then tune
RUBY_GC_HEAP_GROWTH_FACTOR so that enough space is allocated for the system to
run on its own, while avoiding both initializing unutilized memory and having
too small a heap, which triggers too many garbage collection runs - both
extremes requiring expensively slow kernel system calls.
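The inspect-then-tune loop starts with GC.stat. A minimal sketch of forcing a collection and reading the relevant counters (key names as in MRI 2.1+; RUBY_GC_HEAP_GROWTH_FACTOR itself is an environment variable read at interpreter startup, not something set through GC.stat):

```ruby
# Observe a forced major GC through MRI's counters.
before = GC.stat
GC.start                    # forces a full (major) collection
after = GC.stat

puts "GC runs:         #{before[:count]} -> #{after[:count]}"
puts "live heap slots: #{after[:heap_live_slots]}"
```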

------
ykevinator
I hope someone submits this as an issue to the ruby repo, bc this is awesome.

