
Open-sourcing a 10x reduction in Apache Cassandra tail latency - mikeyk
https://engineering.instagram.com/open-sourcing-a-10x-reduction-in-apache-cassandra-tail-latency-d64f86b43589
======
the8472
_> The graph shows that a Cassandra server instance could spend 2.5% of
runtime on garbage collections instead of serving client requests. The GC
overhead obviously had a big impact on our P99 latency_

No, this is not obvious. If you have a fully concurrent GC then spending 25
out of 1000 CPU cycles on memory management does not "obviously" have an
impact on your 99th percentile latency. It would primarily impact your
throughput (by 2.5%), just like any other thing consuming CPU cycles.

 _> We defined a metric called GC stall percentage to measure the percentage
of time a Cassandra server was doing stop-the-world GC (Young Gen GC) and
could not serve client requests._

Again, this metric doesn't tell you anything if you don't know how long each
of the pauses is. If they are, in the limit, infinitesimally small, then you
are again only measuring the impact on throughput, not latency.

Certainly, GCs with long STW pauses do impact latency, but then you need to
measure histograms of absolute pause times, not averages of ratios relative to
application time. That's just a silly metric.
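To make that concrete, here is a toy sketch (all numbers invented): two GC "logs" with the identical 2.5% stall ratio over a 100-second window can have wildly different tail pause times, which is exactly what the ratio metric hides.

```java
import java.util.Arrays;

// Toy illustration (all numbers invented): two GC "logs" with the same
// total stall time over a 100 s window, but very different pause histograms.
public class PausePercentiles {

    // Nearest-rank 99th percentile of pause durations, in milliseconds.
    static double p99(double[] pausesMs) {
        double[] sorted = pausesMs.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(0.99 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // 2500 ms of total stall as 2500 pauses of 1 ms each...
        double[] manyTiny = new double[2500];
        Arrays.fill(manyTiny, 1.0);

        // ...versus the same 2500 ms as 5 pauses of 500 ms each.
        double[] fewHuge = new double[5];
        Arrays.fill(fewHuge, 500.0);

        // Identical "GC stall percentage" (2.5%), wildly different tails.
        System.out.printf("p99 pause: %.0f ms vs %.0f ms%n",
                p99(manyTiny), p99(fewHuge)); // 1 ms vs 500 ms
    }
}
```

Same ratio, 500x difference in worst-case pause: that's why histograms of absolute pause times are the right thing to measure.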

And neither does the article mention which JVM or GC they're using. Absent
further information they might have gotten their 10x improvement relative to
some especially poor choice of JVM and GC.

~~~
teacpde
> If you have a fully concurrent GC then spending 25 out of 1000 CPU cycles on
> memory management does not "obviously" have an impact on your 99th
> percentile latency.

I'm trying to understand the meaning. Is it saying the latency caused by GC
is applied to all requests, not just the ones that observe 99th percentile
latency?

~~~
the8472
No, that would be an incremental GC working in very small time slices.

A concurrent GC spends CPU cycles on different cores to do its work, which
means it will not cause latency outliers in the threads processing the
requests. Those are still CPU cycles you no longer have available to serve
other requests, hence they still affect throughput.

That is a simplified explanation of course, there are a lot of caveats.

In my original post I was mostly speaking about the measurement, though:
they are measuring throughput when they are concerned about latency. Those
are somewhat related, but depending on circumstances only weakly so.

~~~
ninkendo
What about the requests that are processed by a core that's doing a GC?
Wouldn't that cause a higher P99 latency exactly as you'd expect?

Spending 2.5% of cycles on GC doesn't mean those cycles are perfectly
distributed. The distribution of GC work onto cores is bound to have some
cores doing more work than others, which would (I would think) manifest itself
as some requests that land on an unlucky core getting more latency. Isn't it
totally expected that this would cause a P99 latency spike?

~~~
the8472
Maybe if you operated your system at the saturation point, which you really
don't want to do in practice. Instead you want your queues to be mostly empty.
Bursts are inevitable, but bursts coinciding with GC work are hopefully a
beyond-99th-percentile thing. Of course, if we're speaking purely
theoretically, we could also assume spherical cows in a vacuum, say that
requests don't burst and simply arrive in a metronome-like trickle, and then
the spikes evaporate too. This is basically queueing theory.

And you could also use more CPU cores than request workers, that way you will
always have spare core capacity and thus your latency will not be directly
impacted. That is if you _really really_ value latency more than throughput.

Again, my main point is that throughput and latency are not the same thing.
There is some relation in so far that you cannot fulfill latency promises if
your throughput is insufficient and your queues start filling up. But below
the saturation point it's a lot more complicated, especially in parallel
systems with bursty arrivals.
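A back-of-the-envelope illustration of that last point, using the textbook M/M/1 result for mean response time, T = 1/(μ − λ). This is a deliberately oversimplified model (real request streams are burstier), but it shows how latency explodes near saturation while throughput only degrades linearly:

```java
// Textbook M/M/1 queue: mean response time T = 1 / (mu - lambda),
// where mu is the service rate and lambda the arrival rate (req/s).
// Deliberately oversimplified, just to show how latency blows up near
// the saturation point while throughput loss stays linear.
public class Saturation {

    static double meanResponseSec(double lambda, double mu) {
        if (lambda >= mu) throw new IllegalArgumentException("unstable queue");
        return 1.0 / (mu - lambda);
    }

    public static void main(String[] args) {
        double mu = 1000.0; // server can do 1000 req/s
        for (double util : new double[]{0.5, 0.9, 0.99}) {
            double t = meanResponseSec(util * mu, mu);
            System.out.printf("utilization %.0f%% -> mean response %.1f ms%n",
                    util * 100, t * 1000); // 2.0 ms, 10.0 ms, 100.0 ms
        }
        // Losing 2.5% of mu to GC barely matters at 50% utilization,
        // but matters a great deal when you run near saturation.
    }
}
```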

------
dikanggu
We do want to contribute our work back to Cassandra upstream instead of
keeping it as a fork, so that more users from the C* community can benefit
from the improvements. The pluggable storage engine is an ambitious project
([https://issues.apache.org/jira/browse/CASSANDRA-13474](https://issues.apache.org/jira/browse/CASSANDRA-13474)).
Any help will be appreciated!

~~~
russellspitzer
Saw you talking about this on the Distributed Data Show

[https://academy.datastax.com/content/distributed-data-show-e...](https://academy.datastax.com/content/distributed-data-show-episode-37-cassandra-instagram-dikang-gu)

------
gfosco
RocksDB is used all over Facebook, powers the entire social graph. Great
storage engine that pairs well with multiple DBMS: MySQL, Mongo, Cassandra...
We'll be at Percona Live 2018 in April, giving several talks, and are looking
forward to hanging out and talking with users in our lounge area. We're
working hard to support our open source community as well!
[https://github.com/facebook/rocksdb](https://github.com/facebook/rocksdb)

------
openasocket
I'm not an expert on these things, but it seems to me that if you're
implementing a database in Java, you wouldn't want to keep your data on the
JVM heap, as this seems to indicate. My understanding is that in most
applications (like servers) the average object lives for a very short time,
and most GC implementations are built around that idea. But in a database,
especially an in-memory database, the majority of objects are going to live
for a very long time. That makes the mark phase of the GC a lot more
expensive, puts more pressure on the generations, etc.

Is my guess here correct, or are there things I'm missing or mistaken on?

~~~
jakewins
This is correct; the standard approach here is to use regular C-style memory
management for the data the system is managing, and the JVM heap only for the
database "infrastructure".

This hybrid approach gives the benefit of a managed runtime and safety of GC
for most of your code, but allows the performance of raw pointers/malloc for
key code paths.

Some examples of this pattern on the JVM:

\- The Neo4j Page Cache, Muninn,
[https://github.com/neo4j/neo4j/blob/3.4/community/io/src/mai...](https://github.com/neo4j/neo4j/blob/3.4/community/io/src/main/java/org/neo4j/io/pagecache/impl/muninn/MuninnPageCache.java#L58)

\- The Netty project's implementation of jemalloc for the JVM:
[https://github.com/netty/netty/blob/4.1/buffer/src/main/java...](https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java)
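A minimal sketch of the off-heap side of that hybrid pattern, using `java.nio.ByteBuffer.allocateDirect` (class and method names here are invented for illustration; the real implementations linked above are far more sophisticated):

```java
import java.nio.ByteBuffer;

// Minimal sketch of off-heap storage on the JVM: a fixed slab of direct
// memory holding long slots. The GC never traces inside this memory, so
// growing the dataset does not grow mark/sweep work.
public class OffHeapLongArray {

    private final ByteBuffer slab; // direct buffer lives outside the Java heap

    OffHeapLongArray(int slots) {
        slab = ByteBuffer.allocateDirect(slots * Long.BYTES);
    }

    void set(int slot, long value) {
        slab.putLong(slot * Long.BYTES, value);
    }

    long get(int slot) {
        return slab.getLong(slot * Long.BYTES);
    }

    public static void main(String[] args) {
        OffHeapLongArray counters = new OffHeapLongArray(1024);
        counters.set(42, 7L);
        System.out.println(counters.get(42)); // prints 7
    }
}
```

The trade-off is the usual one: you give up GC safety for these bytes and take on manual layout and lifetime management, which is why the pattern is confined to a few key code paths.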

~~~
tkahnoski
If so, any JVM-based datastore could probably benefit.

I wonder how long before we see a similar result from Elasticsearch. (It's
the only other huge JVM-based store I can think of.)

~~~
ddorian43
HBase is going off-heap as much as possible. VoltDB uses Java for management
and C++ for the low-level parts.

They will end up writing C++ in Java eventually, depending on how much
performance you REALLY need.

The same goes for Elasticsearch: if you want performance, you need to do the
same thing ScyllaDB did to Cassandra (per-core sharding, skipping the
filesystem across cores, etc.)

~~~
ddorian43
In Elasticsearch terms: vespa.ai, which claims better
performance/scalability/maintainability, uses C++ for the Lucene layer and
Java for the Solr/Elasticsearch layer.

There are blog posts about speeding up Lucene by 2x+ by porting some parts to
C/C++. There are libraries (Trinity) claiming 2x+ performance.

I've also read a Google engineer saying "Bigtable is 3x faster than HBase".

------
haglin
"To reduce the GC impact from the storage engine, we considered different
approaches and ultimately decided to develop a C++ storage engine to replace
existing ones."

I wonder how the numbers would have looked with the new low latency GC for
Hotspot (ZGC).
[https://wiki.openjdk.java.net/display/zgc/Main](https://wiki.openjdk.java.net/display/zgc/Main)

Early results from SPECjbb2015 are impressive.
[https://youtu.be/tShc0dyFtgw?t=5m1s](https://youtu.be/tShc0dyFtgw?t=5m1s)

~~~
tibbetts
Yes, also Azul Zing. Really, anytime someone says they have a problem with GC
and suggests spending a million dollars of engineering time building a new
system, they should consider Zing first. It works, and it's a far more
efficient way of spending money to fix GC latency problems.

~~~
ADefenestrator
For a small to medium sized shop, sure. For someplace with thousands or tens
of thousands of nodes, the new system ends up cheaper in the long run.

------
Thaxll
Weird, did they try to use
[https://www.scylladb.com/](https://www.scylladb.com/)?

~~~
en4bz
I was going to say the same thing. It seems pretty clear at this point that
Java is not a good programming language to build a database in if you care
about strong P99 latency guarantees. The engineers in the article came to
this conclusion, and so did the Scylla people years ago.

Scylla is AGPL for the OSS version, though, so testing it out would not be an
option without getting a commercial license first.

~~~
HippoBaro
Doesn't AGPL allow commercial use?

~~~
halestock
Per the AGPL preamble[0]:

"The GNU General Public License permits making a modified version and letting
the public access it on a server without ever releasing its source code to the
public.

The GNU Affero General Public License is designed specifically to ensure that,
in such cases, the modified source code becomes available to the community. It
requires the operator of a network server to provide the source code of the
modified version running there to the users of that server. Therefore, public
use of a modified version, on a publicly accessible server, gives the public
access to the source code of the modified version."

[0]
[https://www.gnu.org/licenses/agpl-3.0#preamble](https://www.gnu.org/licenses/agpl-3.0#preamble)

------
tschellenbach
For Stream's feed tech we also moved from Cassandra to an in-house solution on
top of RocksDB. It's been a massive performance and maintenance improvement.
This StackShare explains how Stream's stack works. It's based on Go, RocksDB
and Raft: [https://stackshare.io/stream/stream-and-go-news-feeds-for-ov...](https://stackshare.io/stream/stream-and-go-news-feeds-for-over-300-million-end-users)

------
3uclid
Unrelated: as a CS undergrad, I read this article and was immediately
inspired. This is definitely the type of work I want to be doing when I
graduate (infrastructure engineering). But my next thought was: where do I
start?!

Any advice?

~~~
ddorian43
Still in school? (I don't understand the different <type>grads.) See the
GSoC Seastar Framework:
[https://summerofcode.withgoogle.com/organizations/6190282903...](https://summerofcode.withgoogle.com/organizations/6190282903650304/)

~~~
3uclid
Yeah, still in school (3rd year). I have intern experience, but it seems like
these types of positions are way too advanced for me at the moment. Just
unsure how to progress...

~~~
ddorian43
1\. See the other comment for CMU.

2\. Learn C++ and data structures.

3\. Try Seastar (or any other lib?) and follow the tutorial.

4\. Go to the mailing list and ask for help when you're stuck.

5\. Look into the issue queue for small tasks and grow little by little.

Makes sense? Another project you may look at is
[https://github.com/phaistos-networks/Trinity](https://github.com/phaistos-networks/Trinity), which is a
library and should be simpler than a whole framework/db (Trinity is like
Lucene/RocksDB compared to a full DB like Cassandra/ScyllaDB).

------
StreamBright
In a similar situation we just adjusted the GC settings and started to use
G1GC, which resulted in similar numbers.

~~~
coryfoo
I bet that didn't take N engineers 12 months to build out, either

~~~
threeseed
Cassandra uses G1GC by default.

If it were as simple as tweaking a few GC settings to get a 10x improvement,
pretty sure Datastax would've done it by now.

~~~
jjirsa
The problem with making a general-purpose DB is that you have to have
general-purpose defaults. The Cassandra defaults are "don't crash anywhere",
not "be super fast and low latency". You can definitely do 5x better than the
defaults with some basic JVM tuning.

That said: the IG folks certainly know how to tune JVMs. There are IG (and
former IG, I saw rbranson post) folks in this thread who know how to tune the
collectors, so I assume that the 10x they see is beyond what you'd get from
simple tuning.

------
fdeliege
Join our meetup to chat with some of the developers:
[https://www.meetup.com/Apache-Cassandra-Bay-Area/events/2483...](https://www.meetup.com/Apache-Cassandra-Bay-Area/events/248376266/)

~~~
jjirsa
So sad I’m not in town that week

------
en4bz
Has anyone tried running Cassandra on Azul Zing[1]? The slowdown here is,
unsurprisingly, related to GC pauses, which Azul has eliminated in Zing.

[1] [https://www.azul.com/products/zing/](https://www.azul.com/products/zing/)

~~~
rbranson
The licensing cost of Zing generally makes this a bad trade-off. It's much
cheaper to purchase more hardware. Zing is targeted at vertically scaling very
large JVM heaps, where it's valuable to have massive amounts of data on a
single, big machine.

~~~
nitsanw
As an ex-Azul employee I can say there's a good number of Azul clients using
a Zing+Cassandra setup, so the price point is right for some people at the
very least. The Zing license cost has also changed in recent years (3.5k per
server last I looked, and that's before you haggle some bulk deal), so I'm
not sure if your impression is calibrated to that new price point.

------
adrianratnapala
As a Java scoffer trying to be fair-minded, I resisted the urge to joke that
"it was the GC, stupid" and assumed that a big project like Cassandra had
somehow worked around the GC latency problems.

But, what? It turns out the article is really about replacing Java with C++.

~~~
cestith
It's about using something in one language for its features and only porting
the critical sections to C++ via a clean API. This is the sort of advice we've
been giving people for decades. Choose the language for what you want to
build, measure and profile performance if necessary, find the bottleneck on
the hot path, decouple that from the bulk of the code, and reach to a lower
level for performance only in that clearly defined section.

They managed to generalize one application that meets their feature needs to
be a front end to another existing application with fewer features but better
performance as a back end. They're optimizing their hot path by decoupling it
from the rest of the application and handing off to C++ code they didn't even
have to write. Adding pluggable storage engines to Cassandra means that if
they make the API smooth enough they can have engines in C, C++, Erlang, Go,
Rust, ML, or whatever in the future without changing their front end. That's a
big win even beyond this tail latency issue.

~~~
majidazimi
Well, other than the storage engine, the next big part of a database is the
query planner/optimizer, which Cassandra doesn't have (due to its simple KV
nature). So there isn't much remaining. As a long-term plan, rewrite it all
and you have a single code base, and you'll benefit from mighty C++ in the
other components of the database. And there is still room for more
optimizations: SIMD, ...

The GC problem is not limited to C*. This shit (the virtual machine) is
hitting the whole Hadoop stack: HDFS, Hive, Spark, Flink, Pig...

An immense number of tickets in any fairly large cluster is related in some
way to GC and JVM behavior.

------
cmrdporcupine
I remember using quite early versions of Cassandra at an ad-tech startup I
was at back in 2009 or 2010, spending unfortunate amounts of time fighting
the JVM GC and trying to tune things so it behaved responsibly. It was a real
problem then, and I know a lot of work went into fixing GC behaviour. Then I
stopped using Cassandra for work, but it's unfortunate that this is still an
issue.

What I took out of that is that I really feel like something like Cassandra is
better suited to implementation in a language like C++ or Rust. And I believe
others have since come along and done this.

I really liked the gossip-based federation in Cassandra though.

~~~
estebank
It sounds like you might be interested in TiKV.

[https://github.com/pingcap/tikv](https://github.com/pingcap/tikv)

~~~
cmrdporcupine
Thanks.

Since coming to Google I don't get the opportunity to compare/evaluate/deploy
tools like this anymore. Smarter people than me make choices like that :-)

------
bfrog
Meanwhile, ScyllaDB looks like a better option, for numerous reasons.

------
yazr
Or just try and benchmark the Azul VM with its pauseless GC?!

(I have used Azul in low-latency production environments. It has pros and
cons, but it certainly beats rewriting the storage layer...)

~~~
truth_seeker
Curious to know the cons of using it, other than it being commercial.

~~~
manigandham
> except being commercial

That's the biggest one, especially when it's a Facebook company. Otherwise it
works well but can be pricey.

The JVM is getting a new fully concurrent collector, though, called
Shenandoah:
[https://www.google.com/search?q=shenandoah+gc](https://www.google.com/search?q=shenandoah+gc)

~~~
domsj
Not just Shenandoah[1]; it's also getting ZGC[2], so the low-latency future
looks bright for the JVM.

1\.
[https://wiki.openjdk.java.net/display/shenandoah/Main](https://wiki.openjdk.java.net/display/shenandoah/Main)
2\.
[https://wiki.openjdk.java.net/display/zgc/Main](https://wiki.openjdk.java.net/display/zgc/Main)

------
jjirsa
Nicely done! Looking forward to the pluggable storage engine.

~~~
pas
The JIRA tickets don't really shine with much hope :/

[https://issues.apache.org/jira/browse/CASSANDRA-13474](https://issues.apache.org/jira/browse/CASSANDRA-13474)
[2 comments from 2017 Apr]
[https://issues.apache.org/jira/browse/CASSANDRA-13475](https://issues.apache.org/jira/browse/CASSANDRA-13475)
[~100 comments, but the last one is from 2017 Nov, by the InstaG engineer]

And the Rocksandra fork is already ~3500 commits behind master, so upstreaming
this will be interesting.

Oh, and the Rocksandra fork is already kind of abandoned - no commits since
2017 Dec. (which probably means this is not actually the code that runs under
Instagram.)

~~~
dikanggu
This is the Rocksandra branch,
[https://github.com/Instagram/cassandra/tree/rocks_3.0](https://github.com/Instagram/cassandra/tree/rocks_3.0);
we developed it on top of Cassandra 3.0. It's the code we are running on our
production servers.

~~~
pas
Thanks for the git push and the reply! A few minutes ago it was still pointing
to the older commit.

And upstream Cassandra is already 11 minor releases away. Won't that become a
problem with something as fundamental/low-level as pluggable storage engines?

------
rbranson
Did you all find that there were changes to the Java heap/GC configuration
that would make tuning this setup different? I imagine if most everything that
"sticks" is moved off heap, the GC could be tuned more heavily for young gen
throughput vs trying to balance it with long-lived objects.

~~~
dikanggu
Yeah, for Rocksandra we are able to use a much smaller heap size, and most of
the objects are recycled during young-gen GC.
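As a hypothetical sketch of that tuning direction (all sizes and values invented for illustration; this is not Instagram's actual configuration), a jvm.options fragment for a mostly-off-heap setup might look like:

```shell
# Hypothetical jvm.options fragment -- sizes invented for illustration.
# With long-lived data moved off-heap into the C++ engine, the Java heap
# can shrink and the collector can be biased toward short young-gen work.
-Xms8G
-Xmx8G                     # small fixed heap: mostly short-lived request objects
-XX:+UseG1GC               # G1, as discussed elsewhere in this thread
-XX:MaxGCPauseMillis=50    # ask G1 to target short pauses
```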

------
agnivade
> We also observed that the GC stalls on that cluster dropped from 2.5% to
> 0.3%, which was a 10X reduction!

Umm .. shouldn't the stalls go to 0, because now you have moved to C++ ? Or is
this the time it takes for the manual garbage collection to occur ?

------
steeve
Why not use ScyllaDB? (Serious)

~~~
cnlwsu
Answered a bit before, but they have a team that knows C* well. Cassandra is
proven to handle petabytes at scale in production systems.

------
xuanyue
Is there any trade-off in replacing the LSM-tree-based storage engine with
the RocksDB storage engine?

~~~
irfansharif
RocksDB is also an LSM structured KV store.

------
welder
Great, now can you fix the Python Cassandra Driver to work in a multi-threaded
application environment without the connection pooling bugs and default
synchronous app-blocking (vs lazy-init) connection setup?

[https://github.com/datastax/python-driver](https://github.com/datastax/python-driver)

------
ismail
So, a question:

Any thoughts on replacing HDFS + YARN + Hive + HBase with GlusterFS +
Kubernetes + Cassandra?

~~~
ddorian43
HBase is sync + globally sorted, while Cassandra is not, so probably not.

------
alsadi
Can we add LZ4 to the blend to reduce disk IO?

