
Copy-on-write friendly Python garbage collection
https://engineering.instagram.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf
======
ComputerGuru
Not directly pertaining to the article, but in relation to the comments here:
I feel it's necessary to point out that the real problem with GC isn't
average processing time or maximum simultaneous requests, but rather the
large standard deviation in response times that results from GC pauses (not
to mention the poor cache friendliness, memory evictions, and more that
follow from them - but those are reflected in the average run times, so we
can ignore them in this context).
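
One way to see that variance directly in CPython is to time every collection
with the gc.callbacks hook (part of the standard gc module since Python 3.3);
a minimal sketch:

    import gc
    import time

    pauses = []
    _start = [0.0]

    def on_gc(phase, info):
        # CPython calls registered callbacks at the "start" and "stop" of
        # every collection; `info` says which generation was collected.
        if phase == "start":
            _start[0] = time.perf_counter()
        else:
            pauses.append(time.perf_counter() - _start[0])

    gc.callbacks.append(on_gc)

Sampling `pauses` under real traffic shows the tail of the distribution, not
just the mean.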

As a related metric, sometimes it's not your average response times and scale
that matter but how much headroom you've got for a sudden influx of traffic.
If memory/pagefile thrashing and hard faults bring your system screeching to
a halt, you won't be happy no matter your scale when you can't benefit from
the eyeballs and page clicks coming your way.

~~~
rbehrends
Eh, incremental garbage collection is technology that has been around since
the 1980s/1990s. For purely sequential systems (which is what we're talking
about here), it is pretty easy to make GC soft real-time [1]. We are talking
about maximum pause times well below just the network latency between the
client and the web server.

Cache friendliness is a double-edged sword. There are plenty of use cases
where a GC can be more cache-friendly than manual memory management, in
particular if you have a generational GC with a bump allocator.

[1] What makes life (considerably) harder is when you have multiple threads
sharing a heap, but that's not at issue here.

~~~
CoolGuySteve
If it’s pretty easy, where are all these soft real-time garbage collectors?

~~~
rbehrends
Note that I was specifically talking about the sequential case, not multi-
threaded languages.

OCaml has had one since the 1990s [1]; it's a fairly standard generational,
compacting collector with incremental collection for the old generation. Lua
has had an incremental GC since version 5.1 (in 2006); as it's frequently used
as a scripting language for video games, it's safe to assume that pause times
aren't much of an issue.

The problem is that virtually every language since then has pretty much
decided to have all threads use a global shared heap. Once you do that, you
run into all kinds of challenges, such as root discovery from thread stacks
without stopping the world. That said, there are plenty of languages that have
this option, anyway; it's simply more challenging, not impossible.

Languages that are single-threaded or that maintain thread-local heaps don't
have this problem. Python and Ruby (unlike Lua) have issues for historical
reasons (they started out with basic reference counting and mark-and-sweep
collection, respectively, and then had to maintain backwards compatibility
[2]).

Intermediate designs (having both thread-local heaps _and_ shared heaps at the
same time) are also possible, but that design space hasn't been explored much.

[1] [http://prl.ccs.neu.edu/blog/2016/05/24/measuring-gc-latencies-in-haskell-ocaml-racket/#ocaml-version](http://prl.ccs.neu.edu/blog/2016/05/24/measuring-gc-latencies-in-haskell-ocaml-racket/#ocaml-version)

[2] I think that in principle one could make the cycle detector in Python
incremental (it's basically a form of trial deletion); Ruby eventually got an
incremental GC for its major generations in 2.2, but I believe there are still
some inherent limitations due to the lack of write barriers in C code.
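
For anyone unfamiliar with what the cycle detector adds on top of reference
counting, a minimal illustration using the standard gc module:

    import gc

    class Node:
        def __init__(self):
            self.ref = None

    # Build a reference cycle; pure refcounting can never free it, because
    # each object permanently holds a reference to the other.
    a, b = Node(), Node()
    a.ref, b.ref = b, a
    del a, b

    # The generational collector (the "cycle detector") finds the
    # unreachable cycle via trial deletion and reclaims it.
    print(gc.collect())  # number of unreachable objects found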

------
IgorPartola
A more general question: why is it so typical for a Python web worker to have
its memory grow indefinitely with each request processed? I don't expect the
kind of performance I can get out of C, with its ability to allocate memory
only on the stack or to pre-allocate all the buffers I'd need. And I have
found a number of memory leaks in Python libraries, and especially in poorly
written Python C modules, over the years. But still, I would expect something
like a simple idiomatic Django application to not have its memory footprint
grow indefinitely. Is this just life now? Are there any good tools out there
for figuring out what part of the application keeps requesting objects that
don't get destroyed? Are corporations people?

~~~
sametmax
> why is it so typical for a Python web worker to have its memory grow
> indefinitely with each request processed?

It's not. The cases where it happened to me during the last 10 years and
dozens of web projects spanning multiple frameworks were generally me doing
something stupid:

- using a mutable object as a default value or in a class attribute

- doing something in __del__ (see the sketch after this list)

- keeping DEBUG=True in Django (which is well known for causing memory
leaks)
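
The __del__ item deserves a concrete illustration: before Python 3.4 (PEP
442), a reference cycle containing any object with a finalizer was deemed
uncollectable and parked in gc.garbage forever. A minimal sketch:

    import gc

    class Leaky:
        def __del__(self):
            pass  # any finalizer at all used to trigger this behaviour

    # A cycle between two objects that define __del__:
    a, b = Leaky(), Leaky()
    a.ref, b.ref = b, a
    del a, b

    gc.collect()
    # On Python < 3.4 the cycle lands in gc.garbage and is never freed;
    # on 3.4+, PEP 442 runs the finalizers and reclaims the cycle.
    print(gc.garbage)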

> Is this just life now?

Nope. I have big Django apps running (like streaming video sites serving
500k users/day) and it doesn't happen.

> Are there any good tools out there for figuring out what part of the
> application keeps requesting objects that don’t get destroyed?

Yes, of course: [http://tech.labs.oliverwyman.com/blog/2008/11/14/tracing-python-memory-leaks/](http://tech.labs.oliverwyman.com/blog/2008/11/14/tracing-python-memory-leaks/)

~~~
misterbowfinger
> - using a mutable object as a default value or in a class attribute

Python noob here. Can you expand on that a little? Or perhaps link a post? I
can't think of an example or of how it'd impact memory.

~~~
nitely
Here [0]. A mutable object used as a default value is created only once, when
the function is defined, and the same instance is then shared across every
call. For iterables (list, set, etc.) the behaviour is particularly
surprising because the object accumulates every item ever added (i.e., a
list that grows/leaks with every call to the function). If you use a good
IDE (like PyCharm), it will warn you about this.

[0] [http://docs.python-guide.org/en/latest/writing/gotchas/](http://docs.python-guide.org/en/latest/writing/gotchas/)
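
A minimal demonstration of the gotcha, and the idiomatic fix:

    def append_to(item, bucket=[]):    # the list is created once, at def time
        bucket.append(item)
        return bucket

    print(append_to(1))  # [1]
    print(append_to(2))  # [1, 2] - the same list, shared across calls

    # The idiomatic fix: use None as a sentinel and allocate per call.
    def append_to_fixed(item, bucket=None):
        if bucket is None:
            bucket = []
        bucket.append(item)
        return bucket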

------
lathiat
Aaron Patterson recently gave a talk about doing similar work for Ruby, with
good technical detail - he's quite good at explaining these kinds of issues
for people less familiar with them. He explains how and why you want to be
copy-on-write friendly, what that means for Ruby's GC, and the impact of his
work.

I watched the RubyConf AU version:
[https://www.youtube.com/watch?v=nAEt36XNtAE&t=2482s](https://www.youtube.com/watch?v=nAEt36XNtAE&t=2482s)

But it looks like there was a version at RubyConf as well that may or may
not be better/different:
[https://www.youtube.com/watch?v=8Q7M513vewk](https://www.youtube.com/watch?v=8Q7M513vewk)

------
kasabali
I recently read their previous post, and now this one, and I really enjoyed
both. I appreciate their practical approach and how they solve their issues
with minimal changes.

On the other hand, I am disappointed that this change went into upstream
Python instead of properly solving the problem by making the reference
counting implementation genuinely CoW friendly, so that all applications
would benefit without needing to carefully call a special function.

------
eeks
Managing large memory objects, shared or not, is a very common problem with
garbage-collected languages. The solution of "hiding" a region from the GC
has been around for quite some time in languages such as Java, C#, and even
OCaml. That being said, it's a very enjoyable little write-up.

------
vosper
When exactly is this new GC behavior useful? When forking processes? Is it
something we'd see used in the multiprocessing or concurrent.futures modules?

~~~
bpicolo
Memory is typically the scaling bottleneck for Python web server workers. One
way to cut back drastically on it is to load as much as possible at startup
and then fork the request-serving processes. The problem is that Python's
refcounting causes a lot of copy-on-write faults for data that's really just
used as read-only data.

This change allows you to run more workers with less memory.

Instagram has had a few good posts about how they've approached this problem,
here's another: [https://engineering.instagram.com/dismissing-python-
garbage-...](https://engineering.instagram.com/dismissing-python-garbage-
collection-at-instagram-4dca40b29172)
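
A minimal, POSIX-only sketch of that preload-then-fork pattern built on the
gc.freeze() API this work introduced (available since Python 3.7);
load_application and serve_requests are hypothetical stand-ins:

    import gc
    import os

    NUM_WORKERS = 4  # illustrative

    def load_application():
        # Hypothetical: import modules, build caches, load config...
        return {"routes": {}, "config": {}}

    def serve_requests(app):
        # Hypothetical per-worker request loop.
        pass

    app = load_application()

    # Move every object allocated so far into the permanent generation.
    # The collector never traverses those objects again, so it stops
    # writing to their GC bookkeeping fields and dirtying shared pages.
    gc.freeze()

    for _ in range(NUM_WORKERS):
        if os.fork() == 0:      # child worker
            serve_requests(app)
            os._exit(0)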

~~~
sametmax
> Memory is typically the scaling bottleneck for Python workers for web
> servers.

Wut?

When I have performance problems, my RAM is full of Varnish cache,
Redis-stored objects and Postgres buffers.

Python is like, low, low on the list.

Again, most people are NOT Instagram or Google. Those have a very atypical
load.

~~~
ComputerGuru
> Again, most people are NOT Instagram or Google

And, again, can we stop with these pointless comparisons? Past one front-end
server, the only cost that matters is how many requests a single node can
handle and how much that node costs. If you're not on a VPS and you have your
front-end HTTP cache correctly configured, then it doesn't matter how much
smaller than Google you are; comparisons _are_ valid (although, of course,
the fewer servers you have, the more you can afford to spend on them; though
you probably aren't making as much money as IG/FB/Google either...).

~~~
sametmax
The only metric that matters is the economic one. The question is: how much
do those performance constraints cost you?

------
sigmonsays
One web request should not be served by one process at a time. This is a
terrible software architecture, and it's entirely the fault of Python. I
love Python, but it is not suitable for backend work at scale. It's simply
too costly in terms of resources required. While it's interesting to see
improvements in the GC, it makes equal if not more sense to migrate away
from Python. Just my $0.02.

~~~
sametmax
> One web request should not be served by one process at a time

It doesn't have to be. You can use asyncio and handle many requests at a
time in one process.
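
For instance, a bare-bones single-process server on asyncio's standard
start_server API (a sketch, not production code):

    import asyncio

    async def handle(reader, writer):
        await reader.readline()  # consume the request line; no real parsing
        writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        await writer.drain()
        writer.close()

    async def main():
        # One event loop in one process multiplexes many concurrent
        # connections; no per-request process or thread is needed.
        server = await asyncio.start_server(handle, "127.0.0.1", 8080)
        async with server:
            await server.serve_forever()

    asyncio.run(main())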

> I love python, but it is not suitable for backend work at scale.

What's "at scale"?

Most projects I work on IRL, including for banks and government
administrations, never reach a level of scale where this is even remotely a
problem. Too many people think they are Google.

~~~
ComputerGuru
Do you mean intranet bank apps? Or public facing sites?

~~~
sametmax
It doesn't matter. In France, there are 70 million people. Remove the old
and the young, and the ones who are not clients of your bank. Then spread
the usage over the week, then over 24 hours, and divide by the number of
requests that actually hit your Python backend (so no cache hits, no static
files, etc.). What do you get? 1,000 requests/s, tops? That's nothing.

Bank sites are not YouTube.
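
To make that estimate concrete, a back-of-envelope with purely illustrative
numbers (none of them come from the comment):

    population = 70_000_000
    clients = population * 0.10       # assume ~10% are active online clients
    visits_per_week = 2               # assume two visits per client per week
    dynamic_reqs_per_visit = 20       # requests that actually hit Python

    weekly = clients * visits_per_week * dynamic_reqs_per_visit
    avg_rps = weekly / (7 * 24 * 3600)
    print(f"{avg_rps:.0f} req/s on average")  # ~463; even a 5x peak is ~2,300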

Besides, operations on your bank account don't even hit the Python backend,
but a dedicated system. Usually some COBOL dinosaur they froze, wrapped in a
Java service, and exposed through a REST-ish API so that the rest of their
system can use it without ever having to touch it again.

~~~
figgis
And if you really wanted to squeeze as much performance as possible out of
Python, there are plenty of options, like epoll...

