I like Cassandra and I think it's something you should look at hard if what you want is an eventually consistent database because multi-DC and availability are important to the problem you are solving.
The blog post fails to make that case anywhere. I am guessing that is because it is an ad for GCE and not Cassandra.
I know these kinds of benchmarks are necessary to get and keep your widget on the map, but they always annoy me because I know N other ways to arrive at the same or better numbers with a different set of tradeoffs. Demonstrating scale out is not great for differentiation because it says nothing about the total cost or complexity of use.
Why is it that only GCE + Cassandra could have done a million of these particular kind of writes in a cost effective way?
I also feel like they are fudging how elasticity works. It's not instant. Bringing up a cluster of a given size is not the same as adding a node at a time to a running cluster to arrive at that size and actually getting the full benefit of the additional capacity. From the wording it sounds like the entire cluster was brought up at once.
Overall my beef is that the blog post fails to inform the reader and is mediocre as a benchmark of anything other than GCE. And let's not even start with the complete lack of random reads, does GCE offer SSD instances yet?
I ran those tests. One of the challenges in benchmarking is picking the workload, so I chose one our customers run often. For IO testing I prefer to run FIO tests, and for networking iperf, but unless you do this for a living it is hard to relate to microbenchmarks. It would also leave out the Java settings.
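If you want to run that kind of microbenchmark yourself, something along these lines is the general shape of it (illustrative commands, not the exact ones I ran; the mount path and peer IP are placeholders):

    # random-read IO against a Persistent Disk mount (placeholder path)
    fio --name=randread --directory=/mnt/pd0 --size=10G --rw=randread \
        --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
        --runtime=60 --time_based
    # network throughput to a peer VM running 'iperf -s' (placeholder IP)
    iperf -c 10.240.0.2 -t 60 -P 4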
The reason I left the tests running for an hour is that at some point Cassandra gets into compactions, which are CPU and read intensive. The engine memory-maps the files, so the IO subsystem backing the storage sees a lot of random IOs as page faults kick in. Streaming IOs come in during fsyncs. Again, easier to see using FIO.
The cluster was brought up at once, and I added the ramp up time on the chart so folks could see it.
I had three goals for this test:
- Show low latency is possible with remote storage and proper capacity planning. This is why I used quorum commit, which forces the client to wait for at least two nodes to commit (see the sketch after this list).
- Share the settings I used. If you open the GIST and download the tarball you'll find all changes I did to the cassandra.yaml and cassandra-env.sh. This benefits our customers directly because it gives them a starting point.
- Recommend that customers look at all samples, not only the middle 80% the stress tool reports. Otherwise the cluster looks much better than it is.
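To make the quorum point concrete: with a replication factor of 3, QUORUM means floor(3/2)+1 = 2 replicas must acknowledge each write before the client gets its ack. With the old stress tool that ships in tools/bin, the consistency level is set on the command line, roughly like this (illustrative invocation, not the exact one from the tarball; node IPs and thread counts are placeholders):

    # -l sets the replication factor, -e the consistency level
    # QUORUM with RF=3 waits on 2 replicas per write
    cassandra-stress -d 10.240.0.10,10.240.0.11 -o insert \
        -n 100000000 -l 3 -e QUORUM -t 300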
Thank you for the comments! I can assure you I read them, and I'll incorporate suggestions as possible.
Thanks, I missed the link to the tarball in the GIST. Now that I can see the workload: is it 100 million keys for the entire cluster of 330 nodes?
That is 8 terabytes of RAM (n1-standard-8) for 18.6 * RF gigabytes of data, or am I wrong? The entire thing should fit in the page cache assuming compaction keeps up. There are no deletes, so no tombstones. On an overwrite workload I don't know what the space amplification will be and whether you will actually run out of RAM for caching.
90% of people who read a benchmark aren't going to look at it the way I do. They don't have a cost model that says how fast things should be and how much they should cost. I do, and when things fit in memory I have a different set of expectations. If I am looking at the wrong instance type please let me know.
For the workload you described Cassandra shouldn't be doing random IO. I would expect there to be three + N streams of sequential IO. The write ahead log, memtable flushing, compaction output, and N streams reading tables for compaction.
All the write IO can be deferred and rescheduled heavily because fsyncs are infrequent. Reads for compaction are done by background tasks and shouldn't affect foreground latency.
If read ahead is not working for compaction (and killing disk throughput) that may be something that needs to be addressed. Compaction should be requesting IOs large enough to amortize the cost of seeking. Page faults in memory mapped files don't stop read ahead and I think that the kernel will even detect sequential access and double read ahead. For a workload like this with no random reads you could configure the kernel to read ahead 2-4 megabytes and the page cache would probably absorb it fine.
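On Linux that is a one-line change (assuming the data volume shows up as /dev/sdb, which is a guess on my part, not something from the post):

    # read-ahead is specified in 512-byte sectors: 4096 sectors = 2 MB
    sudo blockdev --setra 4096 /dev/sdb
    # or via sysfs, in kilobytes
    echo 2048 | sudo tee /sys/block/sdb/queue/read_ahead_kb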
For the workload you described the only fsyncs would be the log (every 10 seconds) and memtable flushes/compaction finishing and that would normally be so infrequent as to not move the needle on overall IO capacity (although obviously it consumes sequential IO). You set trickle fsync so we are still talking about 10 megabyte writes.
Granted, there are many things I don't know about Cassandra. I've plumbed a lot of it, but it isn't my day job. Using a 24 gigabyte heap and a 600 megabyte young generation is questionable to me. I think Cassandra can do better, and I also think there are several tools that would do the same job with 10-20x fewer nodes by exploiting the fact that they never have to do random reads from disk.
Hi - I figured you know what you're doing based on the details you picked up on. I suspect you'd have way more fun reading the raw logs, but that would make for a post that is incredibly hard to parse. I will try to figure out a venue for them.
It was 100M keys per loader, so 30x100M for the low latency cluster and 45x100M for the high latency cluster. I sized it to keep running long enough to go through a half dozen major compactions, not necessarily to eat storage space.
Most of the IO is sequential, which really makes no difference for our backend (sorry, I have to leave it at that). Read performance ends up affecting things when the system reaches its limit of pending compactions, which I capped in the config. I could probably get a higher number without those limits, but it looked more realistic to set them, and I'd never deploy a server without them.
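For readers who don't want to dig through the tarball, the pending-compaction limits I'm talking about are the usual cassandra.yaml knobs; the values below are illustrative, the actual ones are in the GIST:

    # illustrative values, see the tarball for the actual config
    concurrent_compactors: 4
    compaction_throughput_mb_per_sec: 16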
I had to make the fsyncs more frequent, not less, to reduce the odds of an operation sitting behind a long flush. This was a counter-intuitive finding for me, but it makes sense considering Persistent Disks throttle both IO sizes and IOPS. We do that to make sure a noisy neighbouring VM won't affect yours. So I turned on trickle_fsync and tuned the maximum sync size so that there is never one slow flush, just a lot of smaller ones.
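In cassandra.yaml terms that combination looks roughly like this (values illustrative; the actual ones are in the tarball):

    commitlog_sync: periodic
    commitlog_sync_period_in_ms: 10000    # periodic commit log fsync
    trickle_fsync: true                   # fsync dirty data in small increments
    trickle_fsync_interval_in_kb: 10240   # ~10 MB per sync, many small flushes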
Tuning the Java GC was for the same reason - I did not want long GC cycles. The settings I used were based on the guidance from Java memory ergonomics. The DataStax distribution has tips and hints in the cassandra-env.sh file itself, which I read and followed. I also read more about Java memory from various vendors than I ever want to read again. As the joke goes, "I even did the things that contradicted the other things" before getting to the settings you saw.
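For context, the cassandra-env.sh knobs being discussed are roughly these (the 24G/600M figures are the ones quoted above; the GC flags are the stock CMS/ParNew ones the file ships with, nothing exotic):

    MAX_HEAP_SIZE="24G"      # heap size discussed above
    HEAP_NEWSIZE="600M"      # young generation size discussed above
    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"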
I hope you appreciate the fact that I set limits on some of the most dangerous knobs, for instance capping the RPC server threads and using HSHA as opposed to the unbounded sync server.
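Concretely, in cassandra.yaml (thread counts illustrative, not necessarily the ones in the tarball):

    rpc_server_type: hsha       # bounded thread pool, unlike sync
    rpc_min_threads: 16
    rpc_max_threads: 2048       # illustrative cap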
I understand the limitations of the benchmark I used, and the limitations of the post. I tried to stay within recommended settings for all subsystems, have safe limits for everything I found dangerous, and I picked the workload because that is what our customers do. I don't advocate one storage solution over another, we love everyone that buys our solutions :)
> The cluster was brought up at once, and I added the ramp up time on the chart so folks could see it.
Can any person really bring up a cluster of this size? On AWS, you need special permission to launch more than ~20 instances of one type, and it's granted only after you make a business case for it, which they won't grant to regular peons. With my regular Google account that doesn't have any kind of special approval, can I really launch 330 instances within a few minutes?
There is a CPU quota limit (and a disk limit) that you have to get approved to raise (and I think there's a limit to how many requests/second you can make to the GCE API), but if you get those approvals, you're good.
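You can see where you stand per project and region with something like this (current gcloud syntax, which may postdate this thread):

    # regional quotas: CPUS, DISKS_TOTAL_GB, IN_USE_ADDRESSES, ...
    gcloud compute regions describe us-central1
    # project-wide quotas (networks, images, snapshots, ...)
    gcloud compute project-info describe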