Open-sourcing a 10x reduction in Apache Cassandra tail latency (engineering.instagram.com)
408 points by mikeyk on Mar 5, 2018 | 164 comments

> The graph shows that a Cassandra server instance could spend 2.5% of runtime on garbage collections instead of serving client requests. The GC overhead obviously had a big impact on our P99 latency

No, this is not obvious. If you have a fully concurrent GC then spending 25 out of 1000 CPU cycles on memory management does not "obviously" have an impact on your 99th percentile latency. It would primarily impact your throughput (by 2.5%), just like any other thing consuming CPU cycles.

> We defined a metric called GC stall percentage to measure the percentage of time a Cassandra server was doing stop-the-world GC (Young Gen GC) and could not serve client requests.

Again, this metric doesn't tell you anything if you don't know how long each of the pauses is. If they are, in the limit, infinitesimally small, then you are again only measuring the impact on throughput, not latency.

Certainly, GCs with long STW pauses do impact latency, but then you need to measure histograms of absolute pause times, not averages of ratios relative to application time. That's just a silly metric.

Nor does the article mention which JVM or GC they're using. Absent further information, they might have gotten their 10x improvement relative to an especially poor choice of JVM and GC.

you clearly didn't read the post very closely. They said 2.5% of CPU cycles were spent on stop-the-world young-generation collections, not on the sum total of all memory management. That means that 2.5% of the time the app is entirely stalled on just these collections. Given that stop-the-world pauses are never evenly distributed throughout time, it should be very much expected that this much GC stalling would affect P99 latencies.

It's pretty much accepted everywhere that GCs perform terribly for databases. Modern GCs are great at handling small, very short-lived memory allocations, and that's about it. Just about any other workload and manual memory management ends up being a much better use of your time than GC tuning.

> Given that stop-the-world pauses are never evenly distributed throughout time

That is not a given. And even distribution is only part of the equation. If they are sufficiently short, then even being somewhat unevenly distributed should not have much of an impact on latency. For example, if the max length of a pause were 1ms and P99 latency were 15ms, you'd have to be fairly unlucky to see a 33% increase in P99 latency due to GC. That would entail 5 of 25 pauses happening during a 20ms period in a 1s window.

(This idea is not purely hypothetical. For example, Go's GC has very low STW periods.)

> It's pretty much accepted everywhere

Eh. Apparently everyone thinks C is the best language for cryptography and other secure but not particularly perf sensitive code. Go figure. Sometimes the wisdom of the masses is not wisdom. Best not to appeal to it during argumentation.

You're right on your first point. I responded a little harshly to OP because I believe they responded too harshly to the blog post. The P99 GC latency data they give is not sufficient to explain their perf gains. More likely, the metric they were tracking was well correlated with their perf gains, but not the only or even primary cause. The other perf gains could have been caused by a reduction in old-generation collection times, for instance.

I'd love to see how Go's GC performs when running an application similar to Cassandra, on multiple cores, with gigabytes of memory allocated.

Go's GC is non-moving, incremental, and single-generation. HotSpot's GCs are all moving collectors, so they make different tradeoffs. IBM's Metronome collector would be the closest to Go's properties.

This is a thing others have already done, although I don't have citations to hand.

So why do people keep building latency sensitive things in the JVM? And then they manage to get hugely popular?

Cassandra is a constant struggle with the GC. I’d guess the cost of running it is at least an order of magnitude greater compared to if it had been implemented in c++ or something more sensible.

To be fair, a lot of big companies know how to tune the JVM. A TON of HUGE companies write a LOT of java. What you consider a constant struggle, a lot of very large companies consider trivial.

I'm not sure it's trivial. Tuning the JVM is an entire cottage industry. JVM performance experts can make $1000+/day tuning the JVM and are in high demand. Companies spend huge amounts of engineering effort to keep the JVM running smoothly. I used to be involved in this side of things pretty heavily at an HFT firm, which almost exclusively used Java.

In my opinion it's a colossal waste of resources. Classic example of using the wrong tool for the job.

And it's still cheaper to hire a guy with a skill like that for a few weeks, or even keep him permanently, and keep a larger development team of cheaper C# or Java devs, than it is to replace them all with higher-paid C/C++ devs who would probably take longer to get the same functionality up and running.

Good Q. You might like to check ScyllaDB written in C++, which is supposed to have considerably better performance than Cassandra (also low tail-latency) and a level of compatibility with it: https://www.scylladb.com/

Many of these open-source databases started as internal projects inside big companies, where Java/JVM allowed for more productivity and cross-platform deployment with more skill reuse of the team. Then they grew from there and now it's too late to rewrite the whole thing.

If you were starting a database-focused company from the beginning, then choosing C++ is a better decision, which is exactly what ScyllaDB has done with their Cassandra clone. Along with general algorithm and design improvements, it can provide 10-100x the performance at lower latency on the same server.

We're also starting to see more projects written in Go now, which is still a managed runtime but usually better at handling these kinds of low-level systems.

Go is a much better choice for systems work. Largely because the GC has a low pause (sub ms) target. I'd still be hesitant to use it for very latency sensitive things, or memory intensive applications. Prometheus, for example, has struggled with golang's memory management (bad memory fragmentation, wasteful memory usage). But I think it's a great compromise if you don't want to deal with memory management.

Sure, I would also add .NET/C# to the list now.

.NET Core on Linux is very fast, and there are some great developments around fast low-level (yet managed) memory manipulation that can lead to some very fast software.

OpenJDK/Hotspot is not the only game in town. Those who really need low latencies can opt to use other JVMs (some commercial) with GCs that provide very low pause times, usually at the expense of some percentage points of throughput. In large corporate environments that might not be a problem.

Apparently these people enjoy GC/JVM languages more than C++.

GC languages like Java are much easier to write, and can be made performant when required.

Yes, this is the reason why Java is chosen. But I feel pretty strongly that databases are system engineering problems, and should be written using a proper systems language.

Something like Java makes implementation easier, but operation more difficult and costly.

classic hacker news comment.

this thing you built and open sourced, has gotten you real measurable results? allow me to list the many ways you’re probably wrong and doing it incorrectly

Measurable results are all well and good, but it can be helpful to know how the baseline was established. Measurable results aren't "portable" without a well-established baseline.

Both code and benchmark are open sourced. We'd love to hear how it performs for you.

This is a valid criticism of the methodology / explanation. It's not about the results. You can agree with the positive results (and they're great! - you've done awesome work and clearly show an improvement) and still say the explanation how/why they were achieved is not great.

Exactly. Immediately from the premise of the paper, I was looking forward to a discussion on how they tried various strategies to tune their JVM/GC parameters and found nothing. The "well I guess we gotta replace this with a C++ solution" sentiment smacks of poor software engineering practice, despite the results.

That you didn't find it doesn't mean they didn't do any. Maybe it wasn't worth mentioning. Maybe they wanted to keep the post short. Some team successfully did a storage engine transplant and you're saying they're doing poor engineering?

That's definitely not what I was suggesting in my comment.

> The "well I guess we gotta replace this with a C++ solution" sentiment smacks of poor software engineering practice, despite the results.

On the flip side, recognizing when you are not efficiently using your time to solve a problem with the current approach and deciding to switch to another one is what I would consider a vital part of good software engineering, and also one of the hardest things to become good at without a lot of experience.

I'm not sure that's the case here, but I don't doubt it's possible they could easily have wasted more time testing and tweaking Java than it took to implement this solution. Whether that's how it would have (or possibly even did, to a degree) played out is an open question.

Good memory management is important for a high performance, low latency DBMS. I don't see how getting rid of GC is bad engineering practice. If a tool is not well suited for a problem, use another tool. Could you please explain your opinion?

//Edit: removed 2nd part, which wasn't really that important anyway...

That might be an argument for not using Cassandra. It's a pretty big leap to reimplementing half of Cassandra in C++.

> If you have a fully concurrent GC then spending 25 out of 1000 CPU cycles on memory management does not "obviously" have an impact on your 99th percentile latency.

I'm trying to understand the meaning. Is it saying the latency caused by GC is applied to all requests, not just the ones that observe 99th percentile latency?

It's saying that whether it affects latency or just throughput depends on how those pauses are distributed in absolute terms, not just the ratio. There's a big difference in 99th percentile latency between a 1ms pause every 40ms and a 10 second pause every ~6.7 minutes, but they both work out to 2.5% by the ratio metric.
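A quick arithmetic sketch of that point (schedules are hypothetical, chosen so both come out to exactly 2.5%): the stall-ratio metric cannot distinguish the two cases, even though the worst-case delay a single request can observe differs by four orders of magnitude.

```java
// Two very different pause schedules produce the identical "GC stall
// percentage", but wildly different worst-case added latency per request.
public class StallRatioDemo {
    // Fraction of wall-clock time spent paused.
    static double stallRatio(double pauseMs, double periodMs) {
        return pauseMs / periodMs;
    }

    public static void main(String[] args) {
        double a = stallRatio(1, 40);           // 1ms pause every 40ms
        double b = stallRatio(10_000, 400_000); // 10s pause every ~6.7 minutes
        System.out.printf("a = %.4f, b = %.4f%n", a, b); // both 0.0250
        // Yet the most a single request can be delayed by one pause is
        // 1ms in the first case and 10,000ms in the second.
    }
}
```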

So yes, at the `infinitesimally small` end, time would be 'stolen' evenly from all request threads and would not be a contributing factor to the 99th percentile.

No, that would be an incremental GC working in very small time slices.

A concurrent GC spends CPU cycles on different cores to do its work, which means it will not cause latency outliers in the threads processing the requests. They are still CPU cycles you don't have to serve other requests, hence they still affect throughput.

That is a simplified explanation of course, there are a lot of caveats.

In my original post I was mostly speaking about the measurement, though: they are measuring a throughput-style metric when what they're concerned about is latency. The two are related, but depending on circumstances only weakly so.

What about the requests that are processed by a core that's doing a GC? Wouldn't that cause a higher P99 latency exactly as you'd expect?

Spending 2.5% of cycles on GC doesn't mean those cycles are perfectly distributed. The distribution of GC work onto cores is bound to have some cores doing more work than others, which would (I would think) manifest itself as some requests that land on an unlucky core getting more latency. Isn't it totally expected that this would cause a P99 latency spike?

Maybe if you operated your system at the saturation point, which you really don't want to do in practice. Instead you want your queues to be mostly empty. Bursts are inevitable but bursts coinciding with GC doing work hopefully is a beyond 99th percentile thing. Of course if we're speaking purely theoretical we could also assume spherical cows in a vacuum and say that requests don't burst and simply arrive in a metronome-like trickle and then the spikes evaporate too. This is basically queuing theory.
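The queueing argument can be sketched with a toy deterministic single-server queue (all numbers hypothetical): well below saturation, only the requests that collide with a stop-the-world pause see inflated latency, and the backlog drains within a few arrival intervals.

```java
import java.util.ArrayList;
import java.util.List;

// Toy single-server queue: requests arrive every 5ms and take 1ms of service,
// so the server is far below saturation and the queue is normally empty.
// A single 10ms stop-the-world pause inflates only the requests near it.
public class PauseQueueDemo {
    public static List<Double> latencies(double pauseStartMs, double pauseMs) {
        List<Double> out = new ArrayList<>();
        double serverFreeAt = 0; // when the server can start the next request
        for (int i = 0; i < 200; i++) {
            double arrival = i * 5.0;
            double start = Math.max(arrival, serverFreeAt);
            // Simplification: service that would overlap the pause is
            // delayed until the pause ends (rather than being split).
            if (start < pauseStartMs + pauseMs && start + 1.0 > pauseStartMs) {
                start = Math.max(start, pauseStartMs + pauseMs);
            }
            double finish = start + 1.0;
            serverFreeAt = finish;
            out.add(finish - arrival);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Double> l = latencies(500.0, 10.0); // pause from t=500ms to t=510ms
        double max = l.stream().mapToDouble(Double::doubleValue).max().orElse(0);
        System.out.println("max latency ms = " + max);
        // Without the pause, every request finishes in 1ms; with it, the
        // request arriving at t=500ms waits out the pause and sees 11ms.
    }
}
```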

And you could also use more CPU cores than request workers, that way you will always have spare core capacity and thus your latency will not be directly impacted. That is if you really really value latency more than throughput.

Again, my main point is that throughput and latency are not the same thing. There is some relation in so far that you cannot fulfill latency promises if your throughput is insufficient and your queues start filling up. But below the saturation point it's a lot more complicated, especially in parallel systems with bursty arrivals.

> A concurrent GC spends CPU cycles on different cores to do its work, which means it will not cause latency outliers in the threads processing the requests.

That makes sense. How about GC for single-threaded languages, e.g. Node.js?

Just because a language is single threaded in the code you write doesn't mean it doesn't use threads behind the scenes, like for GC or other things.

We do want to contribute our work back to Cassandra upstream instead of keeping it as a fork, so that more users from the C* community can benefit from the improvements. The pluggable storage engine is an ambitious project (https://issues.apache.org/jira/browse/CASSANDRA-13474). Any help will be appreciated!

Saw you talking about this on the Distributed Data Show


RocksDB is used all over Facebook, powers the entire social graph. Great storage engine that pairs well with multiple DBMS: MySQL, Mongo, Cassandra... We'll be at Percona Live 2018 in April, giving several talks, and are looking forward to hanging out and talking with users in our lounge area. We're working hard to support our open source community as well! https://github.com/facebook/rocksdb

I'm not an expert on these things, but it seems to me if you're implementing a database in Java you wouldn't want to keep your data on the JVM Heap, as this seems to indicate. My understanding is that in most applications (like servers) the average object lives for a very short period of time, and most GC implementations are built from that idea. But, in a database, especially an in-memory database, the majority of the objects are going to live for a very long time. That makes the mark phase of GC a lot more expensive, puts more pressure on the generations, etc.

Is my guess here correct, or are there things I'm missing or mistaken on?

This is correct; the standard approach here is to use regular C-style memory management for the data the system is managing, and the JVM heap only for the database "infrastructure".

This hybrid approach gives the benefit of a managed runtime and safety of GC for most of your code, but allows the performance of raw pointers/malloc for key code paths.

Some examples of this pattern on the JVM:

- The Neo4j Page Cache, Muninn, https://github.com/neo4j/neo4j/blob/3.4/community/io/src/mai...

- The Netty project's implementation of jemalloc for the JVM: https://github.com/netty/netty/blob/4.1/buffer/src/main/java...

> but allows the performance of ... malloc for key code paths.

Everything is relative, I guess.

Hah :) I don't mean that malloc itself is fast, I mean that having non-jvm heap memory is fast.

A permanent memory block on the JVM heap can't use pointers to refer to it, since GC moves objects around. And even though those blocks will never be collected, they make up additional work for the GC to track.

If so, any JVM-based datastore could probably benefit.

I wonder how long before we see a similar result from ElasticSearch. (Only other huge JVM based store I can think of).

HBase is going off-heap as much as possible. VoltDB uses Java for management and C++ for the low-level parts.

They will end up writing "C++ in Java" eventually, depending on how much performance you REALLY need.

The same goes for Elasticsearch: if you want performance you need to do the same thing ScyllaDB did to Cassandra (per-core sharding, skipping the filesystem across cores, etc.)

In Elasticsearch terms, vespa.ai, which claims better performance/scalability/maintainability, uses C++ for the Lucene layer and Java for the Solr/Elasticsearch layer.

There are blog posts about speeding up Lucene by 2x+ by porting some pieces to C/C++. There are libraries (Trinity) claiming 2x+ performance.

And there's a Google engineer saying "Bigtable is 3x faster than HBase" that I've read.

People with an interest may want to check out Azul's Zing JVM and the following blog post testing it out with Lucene... http://blog.mikemccandless.com/2012/07/lucene-index-in-ram-w...

Cassandra does a bunch of stuff off-heap - we keep things like Bloom filters, our compression offsets (to seek into compressed data files), and even some of the memtable (the in-memory buffer before flushing) in direct memory, primarily for the reasons you describe.

We still have "other" things on-heap. The biggest contributor to GC pain tends to be the number of objects allocated on the read path, so this patch works around that by pushing much of that logic to RocksDB.

There are certainly other things you can do in the code itself that would also help - one of the biggest contributors to garbage is the column index. CASSANDRA-9754 fixes much of that (jira is inactive, but the development work on it is ongoing).

The purpose of separating into young and old generation is that it's easier to find dead objects in the young generation (as you said, average object lives for a short period of time). You only have to scan this subset for a minor GC. It doesn't really matter how many long-lived objects you have as long as you can avoid needing to do a major GC.

Don't you still need to scan the old generation during minor GC, in case a field in one of the older objects was modified to point to an object in the young generation? Or are there optimizations you can use to quickly and efficiently find references from the older generation to the younger?

> in case a field in one of the older objects was modified to point to an object in the young generation?

This is typically handled by https://en.wikipedia.org/wiki/Write_barrier#In_Garbage_colle...
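As a rough illustration of the card-table flavor of that mechanism (a toy model, not HotSpot's actual native implementation; card size and slot layout here are made up): the old generation is divided into fixed-size "cards", the write barrier dirties a card on every reference store, and the minor GC scans only dirty cards for old-to-young pointers instead of the whole old generation.

```java
import java.util.BitSet;

// Toy card-table model: scan only cards dirtied by the write barrier.
public class CardTableDemo {
    static final int CARD_SIZE = 512;  // slots per card (hypothetical)
    final Object[] oldGen;
    final BitSet dirtyCards;

    CardTableDemo(int oldGenSlots) {
        oldGen = new Object[oldGenSlots];
        dirtyCards = new BitSet(oldGenSlots / CARD_SIZE + 1);
    }

    // The "write barrier": every reference store into the old generation
    // also marks the containing card as dirty.
    void storeReference(int slot, Object youngObj) {
        oldGen[slot] = youngObj;
        dirtyCards.set(slot / CARD_SIZE);
    }

    // Minor GC root scan: visit only the slots of dirty cards.
    int scanDirtySlots() {
        int scanned = 0;
        for (int card = dirtyCards.nextSetBit(0); card >= 0;
             card = dirtyCards.nextSetBit(card + 1)) {
            scanned += CARD_SIZE; // scan just this card's slots
        }
        return scanned;
    }

    public static void main(String[] args) {
        CardTableDemo heap = new CardTableDemo(1 << 20); // ~1M old-gen slots
        heap.storeReference(12345, new Object());
        heap.storeReference(12346, new Object()); // lands on the same card
        System.out.println("slots scanned: " + heap.scanDirtySlots()); // 512, not ~1M
    }
}
```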

For a long time, the guidance was to install jemalloc and then use off-heap objects. I can’t recall what it was, but that broke in the 3.0.x series and is unlikely to return. The feature stream (3.1.x) allegedly can use jemalloc again, but we’ve been slow to adopt it so I can’t provide proof.

The good news is, at least for my workloads, garbage collection is significantly better in 3.0 than 2.1 even without off-heap objects. Not sure if it's generating less or just generating it in a way that's easier to collect cheaply, but I saw pause times and total collection work drop significantly with the same settings (G1 collector)

If you want to keep data off heap you need to use sun.misc.Unsafe and allocate / free by yourself. I guess it is called unsafe for a reason. With G1GC you can do magical things to reduce the GC overhead which I always recommend as the first step before trying off heap.

ByteBuffer.allocateDirect is another off-JVM-heap solution that's not marked unsafe.

You don't even have to go through Unsafe. ByteBuffer.allocateDirect() gives you a chunk of off-heap memory.
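For illustration, a minimal use of it (the absolute `put`/`get` overloads shown are standard `java.nio` API): the backing memory lives outside the JVM heap, so the GC never scans or moves those bytes; only the small `ByteBuffer` wrapper object itself is on-heap.

```java
import java.nio.ByteBuffer;

// Off-heap allocation without sun.misc.Unsafe.
public class DirectBufferDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(1024); // 1 KiB off-heap
        buf.putLong(0, 42L); // absolute put: no position bookkeeping
        buf.putInt(8, 7);
        System.out.println(buf.isDirect()); // true
        System.out.println(buf.getLong(0)); // 42
        System.out.println(buf.getInt(8));  // 7
    }
}
```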

"To reduce the GC impact from the storage engine, we considered different approaches and ultimately decided to develop a C++ storage engine to replace existing ones."

I wonder how the numbers would have looked with the new low latency GC for Hotspot (ZGC). https://wiki.openjdk.java.net/display/zgc/Main

Early results from SPECjbb2015 are impressive. https://youtu.be/tShc0dyFtgw?t=5m1s

Yes, also Azul Zing. Really anytime someone says they have a problem with GC and suggests spending a million dollars of engineer time building a new system, they should consider Zing first. It works and is a way more efficient way of spending money to fix GC latency problems.

For a small to medium sized shop, sure. For someplace with thousands or tens of thousands of nodes, the new system ends up cheaper in the long run.

Because GC-related issues don't go from "problem" state to "solved" state. It's just a never-ending stream of issues that the team needs to resolve, especially in the database realm, where metrics are hugely workload-dependent.

yes, a comparison across multiple JVMs would be nice

Weird, did they try to use https://www.scylladb.com/?

Why throw away something proven to run at massive scale, that you understand and trust for something that's new, has never been run at that scale, and you have no experience running? If you have a team of software engineers, and the latency problem is a software problem, fix the software problem.

When you already know Cassandra, and you already know RocksDB, and you already have an engineering team, it makes far more sense to combine the two things you know how to use at scale than to try to use some new thing NOBODY has run at scale.

> some new thing NOBODY has run at scale

Outbrain uses ScyllaDB in production at scale across multiple data centers. Not sure if it's Instagram scale, but still enough to prove its reliability and performance.


7 hosts in that PoC; that is not "at scale".

Scylla can handle 10-100x the load of Cassandra on the same servers. Scale is more than just the number of hosts.

Data density is a thing. If you're putting 10TB on a C* host, switching to Scylla doesn't fix the issues that putting 1PB of data on a host would involve (i.e. backing that up). Throughput numbers on 100MB of data in marketing benchmarks are rarely relevant.

Ok, but that's a different issue and nobody is suggesting 1PB of data on a single node as a good idea. The comment was that "scale" is more than just a simple count of nodes. Even if you keep the data the same size, Scylla can handle it with much better performance which is a good enough reason for many to use it.

Since that poc they're using Scylla on hundreds of machines... that's just an old post.

I was going to say the same thing. It seems pretty clear at this point that Java is not a good programming language to build a database on if you care about strong 99% latency guarantees. The engineers in the article came to this conclusion and so did the Scylla people years ago.

Scylla is AGPL for the OSS version though so testing it out would not be an option without getting a commercial license first.

> Scylla is AGPL for the OSS version though so testing it out would not be an option without getting a commercial license first.

Huh? The AGPL is not a non-commercial-use-only license.

If you have proprietary software that you would like to combine with AGPL code (i.e., not interact with as a service) and that is available to the general public over the Internet, and you want to keep your code proprietary, sure, you may not want to use the AGPL. But you could say the same thing about proprietary software you want to combine with GPL code and sell to the general public.

If you're either using the software through its existing defined public interfaces, or you're okay releasing anything you modify or link into the software, the AGPL (and the GPL) are fine. Lots of people distribute proprietary products that include GPL code: Chromebooks, Android phones, routers, GitHub Enterprise, etc. We figured out years ago that the Linux kernel is not just a non-commercial product. Why are we having the same misconceptions about the AGPL?

The exact interpretation varies from company to company. Some companies take the strict stance of "if you use this library in any way in your application, you must open source your entire application." I've found that some libraries explicitly state that requirement within their FAQs for their community/free edition as opposed to their commercially (and paid) licensed equivalent.

At the end of the day, it's not worth risking yourself (or your company) when the owners of the library claims a software license works a certain way and you disagree. Sure you might be right and you might even prevail in court, but the potential legal fees usually aren't worth the trouble.

I ran into this issue when I was selecting a library to generate PDFs for my internship over the summer: https://itextpdf.com/AGPL

FWIW, I'm pretty sure Facebook also uses MongoDB heavily (or at least through its acquisition of Parse). I don't know if they have a commercial license or not, but they aren't strangers to the AGPL.

MongoDB also makes their interpretation of the AGPL pretty well known - unfortunately it’s at odds with established interpretation of the normal GPL.

Requiring GPL/AGPL software as a dependency, even if you don't link to it but instead talk to it over the network, does not mean you haven't developed a derivative work in terms of the letter and spirit of the license. This is in part why I steer 100% clear of MongoDB; there's nothing stopping them from changing their view on the license and deciding to pursue legal action against people who use it in non-AGPL-compatible manners down the road.

I think you answered the question yourself [1].

The wording is ambiguous and, as far as I know, there have been no court cases yet that define what constitutes a connection between the end user and the software, or whether transitive connections count. If it's ambiguous to a software developer, then corporate lawyers are definitely going to say no.

[1] https://news.ycombinator.com/item?id=16523858

EDIT: I realize that AGPL is valid for commercial use but since its terms are so onerous, especially once the lawyers get involved, it effectively makes the AGPL unusable in a larger corporation.

There are those who've deployed on Java with tight latency requirements: https://martinfowler.com/articles/lmax.html?t=1319912579 - Benchmarked at around 6 million transactions/second.

The issue isn't so much Java the language, as it is being aware of the GC, and developing with it in mind.

That removes the value proposition of Java though which is that you don't need to worry about memory management. If you need to mentally track every implicit allocation and deallocation in Java then you are essentially writing code in a kneecapped version of C++.

Even in a database, most of the code isn't performance sensitive. Making your life a bit harder in the fast path so it's easier in the slow path is at least a tradeoff worth considering.

Your fast path is still crippled by your slow paths' mess. It doesn't matter if you've isolated and optimized those paths in isolation, to the GC there's just Your Process and it's going to suspend Your Process whenever it wants for however long it needs regardless of what's currently happening.

So if you're latency sensitive then all of your code needs to be aggressive at avoiding object creation. All of your code becomes part of "the fast path", even if it's in a different thread.

Or you isolate your fast path in a different process or a non-GC'd runtime, the latter being the approach taken here by Instagram.

Based on previous research [1], almost every part of a DB is in the hot path.

[1] http://15721.courses.cs.cmu.edu/spring2018/papers/02-inmemor...

I'd definitely believe that every part of query execution is in the hot path, as this paper describes. But most code in any reasonable DB system isn't part of query execution.

> If you need to mentally track every implicit allocation and deallocation in Java then you are essentially writing code in a kneecapped version of C++.

Well, Java has the advantage of being platform (and to a certain degree, runtime) independent, plus a robust set of best practices and ecosystem when it comes to modules and library handling, which is pretty hard to get done right for C/C++ projects.

Node.js is also platform independent and has a package manager but I wouldn't use it for a High Performance / Low Latency application like a database.

> Well, Java has the advantage of being platform (and to a certain degree, runtime) independent

What is the benefit of that? Who on earth runs a DB written in Java on windows? Any useful server software will end up using platform native features, be it SQL server, MySQL, HBase, ...

> Who on earth runs a DB written in Java on windows?

There still are lots of Windows-only shops.

developing in Java with awareness of the GC doesn't mean tracking allocation and deallocation of memory, it means developers should avoid allocating lots of new Objects when possible. In practice this means creating view, cursor, or offset type Objects that map to arrays of more primitive data types.
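A sketch of what such a cursor object over packed primitive arrays might look like (the field layout and names here are made up for illustration): one reusable cursor is re-seeked across rows, instead of allocating a fresh record object per row on the read path.

```java
// Flyweight/"view" pattern: records live packed in a primitive long[]
// and a single reusable cursor object is pointed at a row index,
// producing near-zero garbage on the read path.
public class RowCursorDemo {
    static final int FIELDS = 2; // hypothetical layout: [timestamp, value]
    final long[] data;

    RowCursorDemo(int rows) { data = new long[rows * FIELDS]; }

    void put(int row, long timestamp, long value) {
        data[row * FIELDS] = timestamp;
        data[row * FIELDS + 1] = value;
    }

    // Allocated once, then re-seeked; no per-row record objects.
    final class Cursor {
        int row;
        Cursor seek(int r) { row = r; return this; }
        long timestamp() { return data[row * FIELDS]; }
        long value()     { return data[row * FIELDS + 1]; }
    }

    public static void main(String[] args) {
        RowCursorDemo table = new RowCursorDemo(1000);
        table.put(0, 1520000000L, 99L);
        table.put(1, 1520000001L, 100L);
        RowCursorDemo.Cursor c = table.new Cursor(); // single allocation
        long sum = 0;
        for (int r = 0; r < 2; r++) sum += c.seek(r).value();
        System.out.println(sum); // 199
    }
}
```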

Exactly. Developing GC aware code is still easier and safer than c++ memory management. Particularly because when you screw up in c++ you get crashes or data loss, and when you screw up in Java you mostly just get GC pauses.

Mandatory mention of D and its @nogc feature (as it sounds, a compile-time guarantee that a function doesn't allocate on the garbage-collected heap)


(ScyllaDB employee here)

I don't believe one would need a commercial license just to test a product in any way? They are not making that part of any product at that point, so no concerns here.

They can't test on production servers.

Fake data, non-userfacing servers, sure.

Even if that is a problem, we provide anyone that is interested with a 30-day evaluation license of Scylla Enterprise.

That puts you about 2/3rds of the way down the list of things to try. You're before all the products that have no evaluation license at all, but after all the open source options that developers can test at their leisure.

If I'm trying a product's evaluation license, you can be sure I've tried literally every other option under the sun first, including investigating the possibility of rolling my own if situationally appropriate. No form of development is slower than the kind where I have to wait for a company in another timezone to give me permission to use their software, so it's always last on my options list unless the company has frankly amazing reviews that pique my curiosity.

Instagram doesn't operate any user-facing Cassandra servers, though. They run user-facing web servers that talk to Cassandra internally.

I don't like the AGPL because it's unclear on this exact sort of thing, but it does seem to me like the obvious reading of "all users interacting with it remotely through a computer network" does not encompass the connection between Instagram end users and their internal Cassandra.

And, in any case, they released sources for the thing they came up with - which is all that the AGPL requires. If they're okay with doing that, they can definitely use the AGPL for production commercial software.

Couldn't sticking a proxy in front of any AGPL software defeat its purpose then? If you don't consider transitive connections, it seems pointless to me.

That's probably a question for a lawyer, but I would not be surprised at an interpretation that a proxy that just mirrors the API of the thing it proxies doesn't insulate you from license compliance, in the same way that a library that just wraps a GPL library doesn't insulate you from license compliance. The question is whether the user is interacting with the AGPL product - if you're talking to software via a proxy, you'd likely say you're interacting with it, but if you're talking to some other software that happens to use that software, are you really interacting with it?

I guess the weird case is that when I'm using the Instagram app, I wouldn't say I'm personally interacting with even the Instagram front-end servers (the way I am in a browser), I'm just interacting with the app which happens to use the servers. And that does sound like not what the license authors would like.

The server is AGPL. The client is Apache licensed. So I don't see a problem with using the AGPL version in a commercial product.

No one claims that a product using the MySQL driver is a derivative work of the MySQL server?

Edit: Of course, IANAL...

That's specifically what the AGPL does (as opposed to the GPL.) The copyleft "infection" is deliberately transmitted via network clients, not just static linking.

So you actually can't release a permissively licensed client for an AGPL server. I mean, they did, clearly, but the AGPL itself would seem to make that inconsistent.

But then none of this has ever been litigated and both the AGPL and GPL themselves are very confusingly worded so shrug.

As I understand it, with the GPL, you must offer source code under the GPL to everyone you distribute the software to. With the AGPL, the same goes for those that use the software over the network.

So you must offer the source of the database to everyone who connects to the database over the network, under the AGPL. But if you deliver a web app, not a database-as-a-service, your users don't connect to the database. And since this database uses the Cassandra protocol, I'd say your web app isn't a derived work of the database in any way.

Of course, that last part is the sticky bit. But if applications using database servers via a well defined protocol are judged to be derived works, we might have other problems - hence the reference to MySQL in my first post.

Requiring GPL’d software to function means your product is a derivative work, full stop as far as the spirit and letter of the license is concerned. Using it over a network instead of linking against it doesn’t change this, if you depend on MySQL or any of the forks and distribute your software it must be made available under a GPL-compatible license. The requirements of the AGPL become clear in this regard as well; network use is distribution with the AGPL - incorporating AGPL’d software into your application means you must consider your entire application as licensed under the AGPL or compatible license.

MongoDB muddied the waters here by deciding to interpret the AGPL differently, but I wouldn’t risk your business on it.

Doesn't AGPL allow commercial use?

Per the AGPL preamble[0]:

"The GNU General Public License permits making a modified version and letting the public access it on a server without ever releasing its source code to the public.

The GNU Affero General Public License is designed specifically to ensure that, in such cases, the modified source code becomes available to the community. It requires the operator of a network server to provide the source code of the modified version running there to the users of that server. Therefore, public use of a modified version, on a publicly accessible server, gives the public access to the source code of the modified version."

[0] https://www.gnu.org/licenses/agpl-3.0#preamble

Yes it does, but AGPL licenced software is super banned at all major companies because it is very viral. You have to make derivative software available under AGPL even if the end user accesses it only over a network.

It does, but there are conditions that apply. Some companies don't like going that route - which in my opinion is not a real issue, and I can't understand the concern.

But for testing, I don't see any impediment.

- has anyone run it at FB scale? for how long?

- how many experienced scylladb devops are there globally that we can hire?

Those are the questions asked at BigTechCo before it adopts somebody else's tech.

FB already operates RocksDB and Cassandra, so there's way less technical, career, and financial risk in just hacking the two together with some aggressive refactoring.

Does FB still use Cassandra? I thought they abandoned it ages ago and then databricks picked it up?

FB abandoned Cassandra (which was really only used for message inbox indexing) when they redid how messages work years ago, but they re-adopted a large C* infra when they bought Instagram.

The article is literally written by Instagram, which is FB.

Well, it's a separate product that Facebook acquired. True or not, it's a common perception that Facebook abandoned Cassandra.


> then databricks picked it up

I think it's DataStax. Databricks is the company behind Spark.

My thought exactly. Would be interesting to know if they did and if yes, why they chose to develop something in-house anyway.

For Stream's feed tech we also moved from Cassandra to an in-house solution on top of RocksDB. It's been a massive performance and maintenance improvement. This StackShare explains how Stream's stack works. It's based on Go, RocksDB and Raft: https://stackshare.io/stream/stream-and-go-news-feeds-for-ov...

Unrelated: as a CS undergrad, I read this article and was immediately inspired. This is definitely the type of work I want to be doing when I graduate (infrastructure engineering). But my next thought was: where do I start?!

Any advice?

I'd say no matter what kind of job you get, you can put 10% of your time into similar problems. Even simple CRUD apps can have interesting problems like this. In my experience every project has instances of engineers shooting themselves in the foot, or unforeseen problems cropping up. If you have a bit of self-motivation you can dig into them and learn a lot and improve things. I do this and find it very satisfying.

Happy to help you get started working on Cassandra. http://cassandra.apache.org/doc/latest/development/patches.h... Has some basic entry pointers. There’s also a dev mailing list that’s reasonable active.

Still in school? (I don't understand the different <type>grad terms.) See: GSOC Seastar Framework https://summerofcode.withgoogle.com/organizations/6190282903...

Yeah, still in school (3rd year). I have intern experience, but it seems like these type of positions are way too advanced for me at the moment. Just unsure how to progress...

1. See the other comment for CMU
2. Learn C++ and data structures
3. Try Seastar (or any other lib?) and try to follow the tutorial
4. Go to the mailing list and ask for help when you're stuck
5. Look into the issue queue for small tasks and grow little by little

Make sense? Another project you may look at is https://github.com/phaistos-networks/Trinity which is a library and should be simpler than a whole framework/db (Trinity is like lucene/rocksdb compared to a full db like cassandra/scylladb).

In a similar situation we just adjusted the GC and started using G1GC, which resulted in similar numbers.

I bet that didn't take N engineers 12 months to build out, either

Cassandra uses G1GC by default.

If it was as simple as tweaking a few GC settings to get 10x improvement pretty sure Datastax would've done it by now.

The problem with making a general purpose DB is you have to have general purpose defaults. The Cassandra defaults are "dont crash anywhere", not "be super fast and low latency". You can definitely do 5x better than the default with some basic jvm tuning.
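For illustration, "basic jvm tuning" for G1 usually means something like the following in cassandra-env.sh. This is a hedged sketch only - every value here is an assumption to show the shape of it, not a recommendation; the right numbers depend entirely on your hardware and workload:

```shell
# Illustrative G1 starting points only - tune against your own workload.
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=200"              # pause *target*, not a guarantee
JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"  # cap remembered-set work done in pauses
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=70" # when concurrent marking kicks in
JVM_OPTS="$JVM_OPTS -Xms16G -Xmx16G"                       # fixed-size heap avoids resize stalls
```

The general idea is to trade a little throughput (earlier marking, smaller per-pause work budgets) for shorter and more predictable pauses.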

That said: the IG folks certainly know how to tune JVMs. There are IG (and former IG; I saw rbranson post) folks in this thread who know how to tune the collectors, so I assume the 10x they see is beyond what you'd get from simple tuning.

Not sure you understand that there are different versions of Cassandra; I never mentioned that we used the DataStax version. Using G1GC with default settings does not give you anything, btw.

2 engineers, 2 weeks because we had to evaluate every change we made with production traffic.

Join our meetup to chat with some of the developers: https://www.meetup.com/Apache-Cassandra-Bay-Area/events/2483...

So sad I’m not in town that week

Has anyone tried running Cassandra on Azul Zing[1]? The slowdown here is, unsurprisingly, related to GC pauses, which Azul has eliminated in Zing.

[1] https://www.azul.com/products/zing/

The licensing cost of Zing generally makes this a bad trade-off. It's much cheaper to purchase more hardware. Zing is targeted at vertically scaling very large JVM heaps, where it's valuable to have massive amounts of data on a single, big machine.

As an ex-Azul employee I can say there's a good number of Azul clients using a Zing+Cassandra setup, so the price point is right for some people at the very least. Zing licence cost has also changed in recent years (3.5k per server last I looked, and that is before you haggle some bulk deal) so not sure if your impression is calibrated to that new price point.
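At that price point the license-vs-hardware question is easy to sanity-check. A back-of-the-envelope sketch - every number except the $3.5k figure quoted above is an assumption for illustration:

```java
// Back-of-the-envelope: Zing licensing vs. over-provisioning hardware.
// All inputs besides the $3.5k/server figure are illustrative assumptions.
class ZingBreakEven {
    static final double ZING_PER_SERVER_YEAR = 3_500;  // $/server/year, figure quoted above
    static final double SERVER_COST_YEAR = 10_000;     // $/server/year amortized, assumed
    static final double GC_CAPACITY_LOSS = 0.05;       // fraction of capacity lost to GC, assumed

    // Cost of over-provisioning the cluster to absorb the GC overhead.
    static double extraHardwareCost(int servers) {
        return servers * GC_CAPACITY_LOSS * SERVER_COST_YEAR;
    }

    static double zingCost(int servers) {
        return servers * ZING_PER_SERVER_YEAR;
    }

    public static void main(String[] args) {
        // For a 100-node cluster under these assumptions:
        System.out.println(extraHardwareCost(100)); // hardware side of the ledger
        System.out.println(zingCost(100));          // licensing side of the ledger
    }
}
```

Of course this only prices throughput; what Zing is actually selling is pause elimination, which extra hardware alone doesn't buy you - which is exactly why it can make sense at the tail even when it loses on this arithmetic.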

Have friends who have used it, they report that it works reasonably well. Especially in p99.

By what factor/magnitude was p99 improved? Any idea?

Actually, it appears that is one of the premises[1] on which they sell Zing.

[1]: https://www.azul.com/solutions/cassandra/

As a Java scoffer trying to be fair-minded, I resisted the urge to joke that "it was the GC, stupid" and assumed that a big project like Cassandra had somehow worked around the GC latency problems.

But, what? It turns out the article is really about replacing Java with C++.

It's about using something in one language for its features and only porting the critical sections to C++ via a clean API. This is the sort of advice we've been giving people for decades. Choose the language for what you want to build, measure and profile performance if necessary, find the bottleneck on the hot path, decouple that from the bulk of the code, and reach to a lower level for performance only in that clearly defined section.

They managed to generalize one application that meets their feature needs to be a front end to another existing application with fewer features but better performance as a back end. They're optimizing their hot path by decoupling it from the rest of the application and handing off to C++ code they didn't even have to write. Adding pluggable storage engines to Cassandra means that if they make the API smooth enough they can have engines in C, C++, Erlang, Go, Rust, ML, or whatever in the future without changing their front end. That's a big win even beyond this tail latency issue.
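To make that concrete, here's a hypothetical sketch of such an engine boundary. All names are invented for illustration - the real interface is still being designed in the Cassandra pluggable-storage tickets and is far richer (iterators, flushing, compaction, streaming, etc.):

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal storage-engine boundary, for illustration only.
interface StorageEngine {
    void put(byte[] key, byte[] value);
    byte[] get(byte[] key); // null if absent
}

// Trivial in-memory engine. A RocksDB-backed engine would implement the same
// interface by delegating to the RocksJava JNI bindings instead, keeping the
// front end completely unaware of which engine sits behind it.
class InMemoryEngine implements StorageEngine {
    // ByteBuffer keys give content-based equals/hashCode, unlike raw byte[].
    private final Map<ByteBuffer, byte[]> map = new HashMap<>();

    public void put(byte[] key, byte[] value) {
        map.put(ByteBuffer.wrap(key.clone()), value.clone());
    }

    public byte[] get(byte[] key) {
        return map.get(ByteBuffer.wrap(key));
    }
}
```

Once the front end codes only against the interface, swapping engines is a configuration decision rather than a rewrite.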

Well, other than the storage engine, the next big part of a database is the query planner/optimizer, which Cassandra doesn't have (due to its simple KV nature). So there isn't much remaining. As a long-term plan, rewrite it all and you have a single code base, and you'll benefit from mighty C++ in the other components of the database. And there is still room for more optimizations: SIMD, ...

The GC problem is not limited to C*. This shit (the virtual machine) is hitting the whole Hadoop stack: HDFS, Hive, Spark, Flink, Pig...

An immense number of tickets in any fairly large cluster are related in some way to GC and JVM behavior.

I remember using quite early versions of Cassandra at an ad-tech startup back in 2009 or 2010, spending unfortunate amounts of time fighting the JVM GC and trying to tune things so it behaved responsibly. It was a real problem then, and I know a lot of work went into fixing GC behaviour. Then I stopped using Cassandra for work, but it's unfortunate that this is still an issue.

What I took out of that is that I really feel like something like Cassandra is better suited to implementation in a language like C++ or Rust. And I believe others have since come along and done this.

I really liked the gossip-based federation in Cassandra though.

It's still an issue, but a lot less of one. 3.0 is a big improvement in terms of GC behavior. Haven't tested 3.11.x yet, but it looks, in theory, like a decent improvement in terms of rounding off corner cases and adding instrumentation.

It sounds like you might be interested in TiKV.



Since coming to Google I don't get the opportunity to compare/evaluate/deploy tools like this anymore. Smarter people than me make choices like that :-)

Meanwhile scylladb looks like a better option for numerous reasons

Or just try and benchmark the Azul VM with its pause-less GC?!

(I have used Azul in low-latency production environments. It has pros and cons but it certainly beats re-writing the storage layer... )

Curious to know the cons of using it, except being commercial.

> except being commercial

That's the biggest, especially when it's a Facebook company. Otherwise it works well, but it can be pricey.

The JVM is getting a new fully concurrent collector called Shenandoah, though: https://www.google.com/search?q=shenandoah+gc

Not just shenandoah[1], it's also getting zgc[2], so the low latency future looks bright for the jvm.

1. https://wiki.openjdk.java.net/display/shenandoah/Main 2. https://wiki.openjdk.java.net/display/zgc/Main
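For reference, on a JDK build that ships them, both are opt-in behind experimental flags - an assumption worth re-checking per JDK version, since experimental flags move around:

```shell
# Assumes a JDK build that includes the collector in question.
java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC -jar app.jar
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -jar app.jar
```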

Needs a stronger machine to be effective (more cores & more memory)

Minor configuration issues (we had a very complex environment, custom kernel, weird network stuff, JNIs)

Nicely done! Looking forward to the pluggable storage engine.

The JIRA tickets don't really inspire much hope :/

https://issues.apache.org/jira/browse/CASSANDRA-13474 [2 comments from Apr 2017]
https://issues.apache.org/jira/browse/CASSANDRA-13475 [~100 comments, but the last one is from Nov 2017, by the Instagram engineer]

And the Rocksandra fork is already ~3500 commits behind master, so upstreaming this will be interesting.

Oh, and the Rocksandra fork is already kind of abandoned - no commits since 2017 Dec. (which probably means this is not actually the code that runs under Instagram.)

This is the rocksandra branch: https://github.com/Instagram/cassandra/tree/rocks_3.0. We develop it on top of Cassandra 3.0, and it's the code we are running on our production servers.

Thanks for the git push and the reply! A few minutes ago it was still pointing to the older commit.

And upstream Cassandra is already 11 minor releases away. Won't that become a problem with something as fundamental/low-level as pluggable storage engines?

I'm a committer, I'm familiar with the JIRA ticket.

Could you share your thoughts on how likely and how soon will the RocksDB engine be available as part of normal Cassandra? Also, how big is the gap between 3.0.x and 3.x? Any improvements between 3.0.x and 3.x regarding tail latency/performance? Thanks!

I think it's likely. It's decidedly nontrivial, and the hardest part will be the (very slow) design phase where we actually make sure the interfaces are defined properly, but I think there are enough interested people to make sure it happens.

There are some meaningful changes between 3.0 and 3.11 (notably a compressed chunk cache for storing some intermediate data blocks and a significant change to the way the column index is deserialized) that do help tail latencies, and there's certainly quite a bit more low-hanging fruit. But the biggest contributor to p99 latencies is GC collections, and the read path still contributes the most JVM garbage, so this is still probably a meaningful improvement over 3.11.

It would be great. But I don't think it could happen. The pluggable storage engine would greatly increase the cognitive complexity of the code.

Of course it could happen. Pluggable engine has a lot of positives - not only does it enable features like this, it also helps modularize the codebase making it more testable, and there are other people who will develop storage engines for their own use case over time (look at the evolution of - for example - MySQL storage backends for examples of this).

Did you all find that there were changes to the Java heap/GC configuration that would make tuning this setup different? I imagine if most everything that "sticks" is moved off heap, the GC could be tuned more heavily for young gen throughput vs trying to balance it with long-lived objects.

Yeah, for Rocksandra, we are able to use much smaller heap size, and most of the objects are recycled during the young gen GC.

> We also observed that the GC stalls on that cluster dropped from 2.5% to 0.3%, which was a 10X reduction!

Umm... shouldn't the stalls go to 0, because now you have moved to C++? Or is this the time it takes for the manual garbage collection to occur?
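For what it's worth, the "GC stall percentage" metric from the article can be approximated from the JVM's standard GarbageCollectorMXBeans. A rough sketch - the class and helper names here are mine, not Instagram's:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcStall {
    // Percentage of wall-clock time spent in GC between two samples.
    static double stallPercent(long gcMillisBefore, long gcMillisAfter,
                               long wallMillisElapsed) {
        if (wallMillisElapsed <= 0) return 0.0;
        return 100.0 * (gcMillisAfter - gcMillisBefore) / wallMillisElapsed;
    }

    // Sum of reported collection time across all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector doesn't report it
            if (t > 0) total += t;
        }
        return total;
    }

    public static void main(String[] args) {
        long wallStart = System.currentTimeMillis();
        long gcStart = totalGcMillis();

        // Stand-in for the real workload: churn some short-lived allocations.
        byte[] junk = null;
        for (int i = 0; i < 10_000; i++) junk = new byte[64 * 1024];

        double pct = stallPercent(gcStart, totalGcMillis(),
                Math.max(1, System.currentTimeMillis() - wallStart));
        System.out.printf("GC stall: %.2f%% (last block: %d bytes)%n", pct, junk.length);
    }
}
```

Rocksandra still runs Cassandra's Java front end on the JVM, which is presumably why a metric like this drops to 0.3% rather than to zero: RocksDB itself has no stop-the-world collections, but the Java layer above it still does.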

Why not use ScyllaDB ? (Serious)

Answered a bit earlier, but: they have a team that knows C* well, and Cassandra is proven to handle petabytes at scale in production systems.

Is there any trade-off in replacing the LSM-tree-based storage engine with the RocksDB storage engine?

RocksDB is also an LSM-structured KV store.

Great, now can you fix the Python Cassandra Driver to work in a multi-threaded application environment without the connection pooling bugs and default synchronous app-blocking (vs lazy-init) connection setup?


So question:

Any thoughts on replacing HDFS + Yarn + Hive + HBase with GlusterFS + Kubernetes + Cassandra?


HBase is sync + globally sorted, while Cassandra is not, so probably not.

Can we add lz4 to the blend to reduce disk IO?

