
Dismissing Python Garbage Collection at Instagram - shivawu
https://engineering.instagram.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172
======
chubot
This is a great post. I like how they walked through all the steps, and
especially the use of the "perf" tool.

Ruby has a patch to do the same thing -- increase sharing by moving reference
counts out of the object itself:

Index here:

[http://www.rubyenterpriseedition.com/faq.html#what_is_this](http://www.rubyenterpriseedition.com/faq.html#what_is_this)

First post in a long series:

[http://izumi.plan99.net/blog/index.php/2007/07/25/making-rubys-garbage-collector-copy-on-write-friendly/](http://izumi.plan99.net/blog/index.php/2007/07/25/making-rubys-garbage-collector-copy-on-write-friendly/)

I think these patches or something similar may have made it into Ruby 2.0:

[http://patshaughnessy.net/2012/3/23/why-you-should-be-excited-about-garbage-collection-in-ruby-2-0](http://patshaughnessy.net/2012/3/23/why-you-should-be-excited-about-garbage-collection-in-ruby-2-0)

[https://medium.com/@rcdexta/whats-the-deal-with-ruby-gc-and-copy-on-write-f5eddef21485#.10aa2bnnw](https://medium.com/@rcdexta/whats-the-deal-with-ruby-gc-and-copy-on-write-f5eddef21485#.10aa2bnnw)

The Dalvik VM (now replaced by ART) also did this to run on phones with 64 MiB
of memory:

[https://www.youtube.com/watch?v=ptjedOZEXPM](https://www.youtube.com/watch?v=ptjedOZEXPM)

I think PHP might do it too. It feels like Python should be doing this as
well.
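
To make the copy-on-write problem concrete, here's a minimal, Linux-only sketch (mine, not from the article or any of the patches above): after a fork, merely iterating over shared objects writes each object's refcount field, dirtying the pages that hold them.

    import os

    def private_dirty_kb():
        # Sum of Private_Dirty across this process's mappings (Linux-specific).
        total = 0
        with open("/proc/self/smaps") as f:
            for line in f:
                if line.startswith("Private_Dirty:"):
                    total += int(line.split()[1])  # value is in kB
        return total

    # Build a large object graph in the parent; after fork() its pages are shared.
    data = [str(i) for i in range(1000000)]

    pid = os.fork()
    if pid == 0:  # child
        before = private_dirty_kb()
        for item in data:  # a "read-only" walk over the shared data...
            pass           # ...still bumps every object's refcount, dirtying pages
        after = private_dirty_kb()
        print("child dirtied ~%d kB just by iterating" % (after - before))
        os._exit(0)
    else:
        os.waitpid(pid, 0)

Keeping the counts (or mark bits) outside the objects confines those writes to a much smaller, denser set of pages.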

------
stcredzero
It's basically "cheating" at GC by exploiting a very narrow use case. I saw a
trick like this at Smalltalk Solutions in 2000 with a 3D game debugging tool.
The "GC" simply threw everything away on each frame tick.

Someone needs to come up with something like a functional language based on a
trick like this. Or maybe a meta-language akin to RPython, so people can write
domain-specific little languages for doing things like serving web requests,
combined with a domain-specific "cheating" GC that can get away with doing much
less work than a full general-purpose GC.

Couldn't a pure functional programming environment be structured to allow for
such GC "cheating"?

~~~
true_religion
The Erlang VM's GC does exactly this. Memory is scoped in a closure around a
process, and any time the process goes down, all its memory is thrown away.

Also, any time a function's scope terminates, all its memory immediately goes
away.

This can be done because it's a functional language with immutable data
structures.

~~~
dsp1234
I work in VBScript inside of classic ASP, and the statement _"Also, any time a
function's scope terminates, all its memory immediately goes away"_ is true
there. I doubt anyone would suggest that VBScript is a functional language
with immutable data structures.

Additionally, all memory used by a page is cleaned up when the page is
finished processing. This has more to do with "memory is scoped" than with
immutable data structures.

~~~
philosopheer
I think you mean memory usage is lexically scoped, because no pointers to
blocks are saved or returned? My point (without wanting to think about it too
hard) is that just as "functional" brings some useful baggage with it, so does
lexical scope.

------
seangrogg
I find Instagram's engineering blog really awesome (I especially like their
content on PostgreSQL). It also seems like they managed to implement a solid
solution to a problem they were facing.

That being said, I wonder if their team considered implementing a different
language that was meant to work without GC overhead. I'm all for working with
something you're familiar with, but it seems like they've hit the point where
they know enough of the problem's surface area that they should be able to
start optimizing for more than just 10% efficiency by turning off a selling
point of safer languages.

~~~
ris
> That being said, I wonder if their team considered implementing a different
> language that was meant to work without GC overhead.

Er, hold on there. As discussed in the article, Python only uses the GC to
cope with cases the refcount can't handle, which in some situations turns out
to be a relatively small amount of work. The performance problem here didn't
actually come from any "fundamental overhead" of GC, just an incidental
implementation detail.

Deciding to jump languages just because a 10% performance gain pointed them in
that direction (which, as I've discussed above, it didn't) would be
ill-advised.

> by turning off a selling point of safer languages

I don't know if I'd say GC is a _selling point_ of a safer language; it
doesn't really have a lot to do with safety in itself, and the authors
certainly didn't remove any language safety by disabling the GC.
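
For anyone curious what those leftover cases look like, here's a tiny illustration (mine, not from the article): a reference cycle keeps both objects' refcounts above zero, so only the cycle collector can reclaim them.

    import gc

    class Node:
        def __init__(self):
            self.other = None

    a, b = Node(), Node()
    a.other, b.other = b, a  # reference cycle: refcounts can never reach zero
    del a, b                 # both Nodes are now unreachable, but not yet freed

    print(gc.collect())      # the cycle collector reports the garbage it found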

------
elvinyung
Nice. I worked on something like this at an internship. I wrote a Unicorn-like
preload-fork multiprocess server in Ruby (for other reasons).

I realized that the workload (which involved a large amount of long-lived
static data on the heap) would have seen enormous memory savings, if only we
weren't running on Ruby 1.9, whose mark-and-sweep GC set a mark bit inside
every object during the mark phase, dirtying the copy-on-write pages.

I briefly experimented with turning off GC and periodically killing workers.
Thankfully, in _that_ situation, all we actually had to do was upgrade to Ruby
2.2, which does have a proper CoW-friendly incremental GC algorithm.

`fork` is awesome.

------
kbd
One of their issues was that Python runs a final GC call before process exit.
Why _does_ Python run that final GC call if the process is exiting anyway?

~~~
conistonwater
Perhaps it's to make sure all the objects' finalizers are run, so that
everything gets released, not just memory and file descriptors? The
finalizers are user-defined, so they could include side effects that the GC
would not know about.
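
A small sketch of that point (mine, not from the article, and using Python 3 semantics): a finalizer with a side effect that only a collection, such as the one at interpreter shutdown, will ever trigger, because refcounting alone can't free objects stuck in a cycle.

    import gc

    class Session:
        def __del__(self):
            print("flushing buffered writes")  # a side effect the GC can't know about

    a, b = Session(), Session()
    a.peer, b.peer = b, a  # cycle: plain refcounting will never free these
    del a, b

    # Nothing has been flushed yet; a collection (explicit here, or the final
    # one at exit) is what runs the finalizers.
    gc.collect()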

~~~
chris_7
Does Python guarantee that finalizers are called at all? Most languages don't,
I thought.

~~~
ambivalence
PEP 442 describes the situation in good detail, including how it was in
Python 2 and how Python 3 improves on it:

[https://www.python.org/dev/peps/pep-0442/](https://www.python.org/dev/peps/pep-0442/)

------
n00b101
> Instagram can run 10% more efficiently

Seems quite risky/costly for a mere 10% computational efficiency gain. If
you're going to change the memory model of a programming language, you might
as well shoot for a _10x_ improvement instead of 10%.

~~~
krn1p4n1c
At scales like Instagram/FB, 10% is a big improvement.

~~~
slackoverflower
Then 10x should be even better for their users! It would be a noticeable
difference at that point.

~~~
owaislone
At their scale, it's not about making it faster for the users. It's already
fast enough. It's about making the clusters more efficient to save money. At
this scale, they have already made things a lot faster for their users by
throwing money at it.

------
bsaul
Is it just me, or does this look like the typical example of a short-term hack
that will blow up in your face pretty quickly and turn your life into a
constant stream of low-level tinkering?

I suppose people at Instagram didn't just stop there, but are also planning a
more long-term solution for optimizing their stack (aka migration to a more
performant language).

~~~
topspin
> turn your life in a constant stream of low-level tinkering

It appears to me this is the fate of anyone that has to make garbage
collection scale. [1] I guess it's all worth it; build fast, win big and then
struggle with the GC as an exercise in technical debt once you can afford
enough staff to focus on the problem. Dreary.

The insight that, due to reference counting, GC turns reads into writes is
interesting; I think I've encountered that before but can't remember where. As
the gap between CPU throughput and RAM latency grows, this becomes an ever
greater point of pain.

> aka migration to a more performant language

That happens occasionally with small systems, and it usually works. But big
operations with complex code bases don't do that very often. Facebook has
re-implemented PHP at least twice now to improve runtime performance without
changing their language. Just one entry in a long list of heroic anecdotes
about those who will do any number of backflips and somersaults to avoid
replacing their chosen language.

[1]
[https://news.ycombinator.com/item?id=12043271](https://news.ycombinator.com/item?id=12043271)

------
bitwize
Fun fact: Lisp originally had no GC. It just allocated and allocated memory
till there was none left, and then it died, after which the user dumped their
working heap to tape, restarted Lisp, and loaded the heap back from tape.
Since only the "live" objects were actually written, the heap took up less
memory than before and the user could keep going.

------
Animats
_Instagram’s web server runs on Django in a multi-process mode with a master
process that forks itself to create dozens of worker processes that take
incoming user requests._

So this is all a workaround for Python's inability to use threads effectively.
Instead of one process with lots of threads, they have many processes with
shared memory.

~~~
snissn
Just a small nit pick, it's not shared memory it's Copy on Write memory that
comes from forking. It would be interesting to see an explanation of the
tradeoffs between threading and multiprocessing in their app servers too!

~~~
Animats
Right, they're not using shared memory for interprocess communication. The
article used the term "shared memory", but they just mean read-only sharing.

------
eugenekolo2
Noting that some other library might call `gc.enable()` is correct. But then
ignoring the fact that another library can simply call `gc.set_threshold(n)`
with `n > 0` seems like an obvious bug waiting to happen, and the same issue
as something calling `gc.enable()`.
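
A quick illustration of that exposure (the library call is hypothetical, not something the article reports happening):

    import gc

    gc.set_threshold(0)            # the threshold-based disable from the article
    print(gc.get_threshold())      # (0, 10, 10): automatic collection is off

    # ...somewhere deep inside an imported third-party library...
    gc.set_threshold(700, 10, 10)  # quietly turns automatic collection back on
    print(gc.get_threshold())      # (700, 10, 10): the "disabled" GC is live again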

------
jondot
This is called out-of-band GC. We've been doing it for years in Ruby with
Unicorn: [https://blog.newrelic.com/2013/05/28/unicorn-rawk-kick-gc-out-of-the-band/](https://blog.newrelic.com/2013/05/28/unicorn-rawk-kick-gc-out-of-the-band/)

However, when the Ruby community moved to Puma, which is based on both
processes and threads, it was needed less. Not that this is rocket science
(it's still far behind the JVM and .NET); I assume a hybrid process/thread
model is something that hasn't reached critical mass in the
Python/Django/Flask/Bottle community?

~~~
icebraining
uWSGI can definitely run in a hybrid process/thread model.

------
kilink
They mentioned msgpack was calling gc.enable(), but it looks like that issue
was fixed quite a while ago in version 0.2.2:

[https://github.com/msgpack/msgpack-python/blob/2481c64cf162d765bfb84bf8e85f0e9861059cbc/ChangeLog.rst#bugs-fixed-10](https://github.com/msgpack/msgpack-python/blob/2481c64cf162d765bfb84bf8e85f0e9861059cbc/ChangeLog.rst#bugs-fixed-10)

------
placeybordeaux
This writing feels a little sloppy

> At Instagram, we do the simple thing first. [...] Lesson learned: prove your
> theory before going for it.

So do they no longer do the simple thing first?

More on topic: this seems like they optimized something in a way that might
really constrain them down the road. Now, if anyone creates an object that
isn't covered by ref-counting, they will get OOMs.

------
jsmeaton
Carl Meyer, a Django core dev, presented at Django Under the Hood on using
Django at Instagram. It was a really good talk that goes through how they
scaled and what metrics they use for measuring performance.
[https://youtu.be/lx5WQjXLlq8](https://youtu.be/lx5WQjXLlq8)

------
rcthompson
I actually didn't know that CPython had a way of breaking reference cycles. I
seem to remember reading that reference counting was the only form of garbage
collection that CPython did. Maybe this was the case in the past?

~~~
jzl
It was introduced in Python 2.0. You're not alone ... I still think of 2.0 as
kinda new myself. :)

Look in this doc for "Optional Collection of Cyclical Garbage":
[https://www.python.org/download/releases/2.0/](https://www.python.org/download/releases/2.0/)

~~~
rcthompson
Released in October 2000... yeah, that's probably around the last time I paid
any attention to the implementation details of GC in Python.

------
theossuary
> Each CoW triggers a page fault in the process.

Maybe I misunderstood how page faults work, but I thought this process was
reversed, i.e. each page fault triggers a CoW, not the other way around?

------
fulafel
It could help to colocate all the refcounts in a contiguous block of memory,
column-store style. You would only get one page fault per 1024 objects.
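
The arithmetic behind that figure, assuming 4 KiB pages and 32-bit refcounts (the commenter's implicit assumption; CPython's actual refcount field is wider on 64-bit builds):

    PAGE_SIZE = 4096                   # bytes per page, typical on x86 Linux
    REFCOUNT_SIZE = 4                  # bytes per refcount in the proposed layout
    print(PAGE_SIZE // REFCOUNT_SIZE)  # 1024 refcounts per page, so roughly one
                                       # CoW fault per 1024 objects touched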

------
zaptheimpaler
I'm confused - doesn't a worker run out of memory if GC is disabled?

~~~
danielvf
Python is also reference counted, and that does the bulk of the work; the GC
is just for things that were missed. Instagram has the process that spins up
the Python workers kill and replace any that eventually exceed the allowed
memory threshold.

~~~
noobermin
A question, as someone who doesn't do this sort of work... is this typical?
That things balloon to the point that you have to periodically kill them
sounds like fuzzy logic somewhere to me.

I get that software is complex and people have deadlines...

~~~
toast0
This is pretty common for a forked server model. You already need to handle
the case where the worker process dies, so it's simple to also occasionally
kill it on purpose. Memory thresholding is nice, but you also have things like
MaxRequestsPerChild from Apache. Restarting the worker after, say, 100,000
requests is cheaper than profiling / tracking down slow memory leaks. OTOH,
when you get down to MaxRequestsPerChild 10, you have clear problems that
should be easy to track down; you can also do CPU usage thresholding to limit
damage from infinite loops that are hard to reproduce.
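
A rough sketch of that recycling pattern (the `accept_request` and `handle` callables are placeholders, not Apache's or Instagram's code):

    import os

    MAX_REQUESTS_PER_CHILD = 100000  # in the spirit of Apache's MaxRequestsPerChild

    def worker_loop(accept_request, handle):
        served = 0
        while served < MAX_REQUESTS_PER_CHILD:
            handle(accept_request())  # block for the next request, then serve it
            served += 1
        # Exit on purpose; the master's waitpid() loop notices the death and
        # forks a replacement, so slow leaks never get a chance to accumulate.
        os._exit(0)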

------
Bino
This is actually very clever, and it really solves their problem pretty neatly!

------
BerislavLopac
Using threading to handle user requests with Python seems very wrong to me.
They might see solid improvement by ditching WSGI and employing a non-blocking
solution (like Tornado, aiohttp or Sanic), running on PyPy as multiple
instances behind a load balancer.

~~~
bpicolo
It's not multi-threaded, it's multi-process. Non-blocking IO is also not a
magical fix for any of their relevant problems.

------
gigatexal
I didn't know about atop or perf profiling. Cool write up.

------
nrjdhsbsid
Instead of a bunch of hacks that are obviously going to blow up in someone's
face one day, why not just use a more suitable platform?

Forking processes for web pages is so old school... And Python is a terrible
choice for something at their scale.

Just redo the hosting bit in Java or golang and call it a day. If their UI
code is sufficiently isolated from the back end, it's not a huge deal.

Instagram is a pretty small application feature-wise; a few devs could
probably do it in a couple of months.

~~~
therealmarv
Instagram is a living example that when you have big scaling problems, you
just scale. You are probably right with your arguments... the question is: is
it worth choosing a good scaling architecture when it slows your initial
development down? Instagram is Python, many YouTube parts are Python, FB is
even worse and is PHP in many parts.

So what? Sure, they would scale better using Erlang, Java or Go... but
sometimes it is wiser to finish building something than to build the best
ultra-scalable system. If you are really successful you will find ways to
scale.

~~~
nrjdhsbsid
I've heard this so many times, but I always wonder how the unicorn effect
skews this. Do a majority of companies with such scaling issues fail? I've
seen products go down in flames firsthand because performance was so bad.
Never at a start-up, but I assume it happens there too.

Facebook is a good example because of all the money they've thrown at a
problem they shouldn't have had. First came a custom PHP interpreter, then a
compiler, and now Hack. If they didn't have nearly unlimited money to throw at
it, would things have ended differently?

Language choice is one of the easiest choices to make. Pick a fast one out of
the box if you plan to get big. It's not like the faster languages take orders
of magnitude longer to write code in; the effect is minimal at best.

------
discodave
If you think about it, this approach is actually very similar to the
FaaS/Lambda/Serverless model. Each request lives in its own little container
which gets thrown away after every execution. This approach means you reduce
the amount of shared state and lots of problems like garbage collection either
get easier or go away.

~~~
jsmeaton
Not really. Where they recovered the 10% memory was by keeping shared state
shared, and reducing the amount of memory that was copied into each process.

With something like Django, there's quite a big startup tax you have to pay,
so at the scale of Instagram they would need quite a lot more servers to
handle the same load if every request was served lambda style.

~~~
discodave
I think you're both right and wrong. You're right in that if you did run
Django inside Lambda it would be inefficient. But you're wrong because that's
not how you should adopt Lambda. This is a similar situation to everybody
re-architecting all their single-point-of-failure on-premise apps for the
cloud in the first place.

The point of Lambda is that you don't start up Django with each request. You
can think of the fixed/shared part of Django as being replaced by API Gateway.
Lambda then replaces the non-shared threads that get started for each request.

~~~
jsmeaton
I see where you're coming from, but not all state that _could_ have been
shared can be pushed into the API Gateway, like code objects and settings. To
adopt Lambda as you suggest would require a re-architecture (not Django), and
at that point you're probably better off going for something more efficient.
They even mention that the memory cache hit ratio improved, giving more CPU
throughput, which totally disappears if you're starting a new process for
every request.

