
Performance Impact of Removing an Out-of-Band Garbage Collector - jbredeche
https://githubengineering.com/removing-oobgc/
======
bluesnowmonkey
Of course removing it reduced your CPU consumption. What did it do to p99
request latency? That was the trade-off being made by OOBGC.

Does Github use SOA? The more backend services that are involved in handling a
user facing request, the more the latency of frontend requests will be
dominated by the _tail_ latency of backend services. So in distributed systems
it makes much more sense to focus on those tail latencies. Who cares about
average latency?

~~~
hinkley
> Who cares about average latency?

People who are bad at statistics. So most of us programmers.

Seriously though, I don’t think many of us get around to acknowledging that
microservices are distributed computing and then considering all of the
limitations that go with it.

Latency is dominated by the slowest thing you can’t start earlier. So in real
distributed computing you see a degree of duplicate effort in order to improve
latency. It reduces throughput but increases the utility of the system.
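
A minimal sketch of that duplicate-effort idea (often called hedged
requests), in Python's asyncio. The replica names, the 50 ms hedge delay, and
fetch_replica itself are all made up for illustration:

    import asyncio

    async def fetch_replica(name: str) -> str:
        # hypothetical backend call; replica "a" is the slow one here
        await asyncio.sleep(0.5 if name == "a" else 0.01)
        return f"result from {name}"

    async def hedged_fetch() -> str:
        # fire the request at one replica; if no answer arrives within
        # the hedge delay, duplicate it to a second replica and take
        # whichever response lands first
        first = asyncio.create_task(fetch_replica("a"))
        try:
            return await asyncio.wait_for(asyncio.shield(first), timeout=0.05)
        except asyncio.TimeoutError:
            second = asyncio.create_task(fetch_replica("b"))
            done, pending = await asyncio.wait(
                {first, second}, return_when=asyncio.FIRST_COMPLETED
            )
            for task in pending:
                task.cancel()
            return done.pop().result()

    print(asyncio.run(hedged_fetch()))  # -> "result from b"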

When I took CS in college there was a proper class on distributed computing.
When I got out I discovered not everybody had one of these (and I think in my
program it was an elective; IIRC I chose it instead of kernel design). I had
hoped that situation had improved, but I got tired of asking people
disappointing questions about their education, so I stopped asking.

~~~
srean
> People who are bad at statistics.

Heh! Sadly, very true.

Many of those who do care about the p95s of an internal service often do it
with a chivalrous sense that they have the backs of that slowest 5% of users.
But the end-user latency percentiles can be quite different from the
service-side ones, more so if the result is composed by a "Last_of" or a Max
operator.

At times I feel tempted to write a library that offers primitives to glue
microservice calls together, such as fan_out, retry, timeout, etc., but one
that can reason about the percentiles of the output from the percentiles of
the input.
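
To make the Max-operator point concrete, a back-of-the-envelope sketch in
Python, assuming independent services (which real systems rarely are):

    # A page that fans out to N backend calls and waits for all of them
    # takes the Max over their latencies, so the chance that every call
    # beats its own p99 shrinks as 0.99**N.
    for n in (1, 10, 100):
        print(f"{n:3d} calls: {0.99 ** n:.1%} of user requests avoid a p99-slow call")
    #   1 calls: 99.0% ...
    #  10 calls: 90.4% ...
    # 100 calls: 36.6% ...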

BTW, are the statistical properties of latencies covered in distributed
systems courses? I thought they were mostly about different kinds of
guarantees: consensus, at-least-once, at-most-once, and so on.

~~~
hinkley
Heck no. I got a little bit in a multivariate stats class, but the rest I had
to pick up in the field or from books. I had a particularly illuminating
moment with my ops guy, trying to work out how often we could expect a drive
to fail in a ten-drive array, and the answer was pretty shocking.
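
For the curious, the arithmetic is the same independence trick; a sketch
assuming a made-up 5% annual failure rate per drive and independent failures:

    # P(at least one of n drives fails in a year) = 1 - (1 - p)**n
    p = 0.05  # assumed annual failure rate of a single drive
    for n in (1, 10):
        print(f"{n:2d} drive(s): {1 - (1 - p) ** n:.0%} chance of a failure per year")
    #  1 drive(s):  5% ...
    # 10 drive(s): 40% ...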

My knowledge is pretty thin, but the fact that I even know what questions to
ask puts me in a better spot to contribute than 4 out of 5 people in the room.
That shouldn’t be how things are, but that’s where we are. Every new (old)
technique is a bunch of naive people crossing their fingers and hoping things
go better than last time.

~~~
srean
I thought as much. Does your handle have anything to do with
[https://books.google.co.in/books/about/Theoretical_Statistic...](https://books.google.co.in/books/about/Theoretical_Statistics.html?id=ppoujo-BInsC)?
Very nice book, BTW.

~~~
hinkley
Nope just random.

~~~
srean
Spoken like a true statistician :)

------
twic
I find this an interesting contrast to Go's move towards a "request-oriented
collector":

[https://docs.google.com/document/d/1gCsFxXamW8RRvOe5hECz98Ft...](https://docs.google.com/document/d/1gCsFxXamW8RRvOe5hECz98Ftk-tcRRJcDFANj2VwCB0/edit)

Perhaps a key difference is this point from the Github document:

"Since the OOBGC runs the GC after the response is finished, it can cause the
web worker to take longer in order to be ready to process the next incoming
request. This means that clients can suffer from latency due to queuing wait
times."

It seems like a little bit more intelligence in the load balancer that sits in
front of the application servers would go a long way here. Could it only route
requests to servers which have finished garbage collecting and are ready to
serve again?

~~~
puzzle
Now you have another problem: the load balancer needs to know accurately when
the backend is ready. This can be prone to race conditions, too, especially
for services that have very short response times, unless you manage to keep
the LB/RPC protocol in lockstep with the garbage collector's behaviour. E.g.
you might have to extend the LB/RPC protocol so that a response carries an
in-band bit that says "those were the results for the request and, by the
way, don't send me new work until I signal you again", which also assumes a
permanent connection between the two ends, etc.
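
A hand-wavy sketch of that in-band bit in Python; the header name, the
threshold, and the gc_pending() heuristic are all hypothetical:

    import gc

    GC_THRESHOLD = 100_000  # hypothetical allocation budget between collections

    def gc_pending() -> bool:
        # hypothetical heuristic: tracked-object count as a stand-in for
        # "this worker will stop for a collection after replying"
        return len(gc.get_objects()) > GC_THRESHOLD

    def handle(request, app):
        # piggyback readiness on the response itself, so the LB learns
        # "don't send me new work" without an extra round trip
        body = app(request)
        headers = {"X-Accepting-Work": "0" if gc_pending() else "1"}
        return body, headers

    print(handle("GET /", lambda req: "ok"))

The LB would then only dispatch to workers whose last response said "1", and
wait for an explicit ready signal from the rest.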

It is doable, but it benefits greatly if you have complete control over the
whole stack. It gets harder and harder as more languages are thrown into the
mix.

~~~
ASalazarMX
Come on, we can solve that by adding a message queue manager in between.
When in doubt, just throw more layers at the problem.

~~~
adrianratnapala
Yes! Especially when fixing latency problems.

------
ksec
I am actually surprised GitHub still had OOBGC, which I thought was made
redundant in 2014/2015 when Ruby 2.2 gained incremental marking, with further
GC improvements in 2.3, 2.4, and now 2.5.

In 2.6 they are introducing something similar, called Sleepy GC:

[https://bugs.ruby-lang.org/issues/14723](https://bugs.ruby-lang.org/issues/14723)

There are even better memory defaults for 2.6, using less memory (or, better,
switch to jemalloc):

[https://bugs.ruby-lang.org/issues/14718](https://bugs.ruby-lang.org/issues/14718)

I can't find the thread, but there was also a test showing that memory use
and performance were much better for Ruby apps on Alpine Linux.

Meanwhile, we are waiting for TruffleRuby as a true drop-in replacement,
which should take a lot of Ruby's pain points away (native image deployment,
better memory handling and GC, a much faster JIT).

------
nneonneo
Interesting work, thanks for sharing.

Nitpick:

> The blue line is CPU utilization for the day the patch went out. You can see
> a great drop around 15:20.

I see a dip at 18:20, not at 15:20. If there is a dip at 15:20, it might be
worth highlighting it.

~~~
irishsultan
I'd assume it's a timezone issue. The author probably didn't check whether
the labels on the x-axis matched his own timezone, and just pointed out where
he knew the drop happened rather than where it actually appears on the
published graph.

------
chris_wot
That's a remarkable author profile on that article.

------
titzer
TLDR: they saved 400 to 1000 cores by switching off the switching off of the
GC during requests.

The fact that they are running Ruby and are spending 1000 cores on GC is o_O.

~~~
k__
Isn't Ruby known for these kinds of issues?

~~~
vidarh
In this case, the issue was GitHub's hack to work around no-longer-existing
problems with Ruby's GC. Note that the speedup came from removing their hack
in favour of relying on the default behaviour of Ruby 2.4.
~~~
jashmatthews
It was actually Ruby 2.2 from 2014 that made this redundant. 2.2 introduced
incremental marking, removing the last long GC phase in MRI Ruby.

