
Cassandra Hits One Million Writes Per Second on Google Compute Engine - bgoldy
http://googlecloudplatform.blogspot.com/2014/03/cassandra-hits-one-million-writes-per-second-on-google-compute-engine.html
======
arielweisberg
I like Cassandra, and I think it's something you should look at hard if what
you want is an eventually consistent database, because multi-DC replication and
availability are important to the problem you are solving.

The blog post fails to make that case anywhere. I am guessing that is because
it is an ad for GCE and not Cassandra.

I know these kinds of benchmarks are necessary to get and keep your widget on
the map, but they always annoy me because I know N other ways to arrive at the
same or better numbers with a different set of tradeoffs. Demonstrating scale
out is not great for differentiation because it says nothing about the total
cost or complexity of use.

Why is it that only GCE + Cassandra could have done a million of these
particular kind of writes in a cost effective way?

I also feel like they are fudging how elasticity works. It's not instant.
Bringing up a cluster of a given size is not the same as adding nodes one at a
time to a running cluster to arrive at that size and actually receiving the full
benefit of the additional capacity. From the wording it sounds like the
entire cluster was brought up at once.

Overall my beef is that the blog post fails to inform the reader and is
mediocre as a benchmark of anything other than GCE. And let's not even start
on the complete lack of random reads. Does GCE offer SSD instances yet?

~~~
ivansmf
I ran those tests. One of the challenges in benchmarking is picking the
workload, so I chose one of the workloads our customers run most often. For IO
testing I prefer to run FIO tests, and for networking iperf, but unless you do
this for a living it is hard to relate to microbenchmarks. It would also leave
out the Java settings.

The reason I left the tests running for an hour is that at some point
Cassandra gets into compactions, which are CPU and read intensive. The engine
memory-maps the files, so the IO subsystem backing the storage sees a lot of
random IOs as page faults kick in. Streaming IOs come in during fsyncs. Again,
this is easier to see using FIO.

The cluster was brought up all at once, and I included the ramp-up time on the
chart so folks could see it.

I had three goals for this test:

- Show that low latency is possible with remote storage and proper capacity
planning. This is why I used quorum commit, which forces the client to wait for
at least two nodes to commit (see the sketch below).
- Share the settings I used. If you open the GIST and download the tarball
you'll find all the changes I made to cassandra.yaml and cassandra-env.sh. This
benefits our customers directly because it gives them a starting point.
- Recommend that customers look at all samples, not only the middle 80% the
stress tool reports. Otherwise the cluster looks much better than it is.
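
For readers who haven't used quorum commit: here is a minimal sketch of what it
means on the client side, using the DataStax Java driver. The keyspace, table,
and address are made up for illustration, and I'm assuming a replication factor
of 3; the stress tool does the equivalent internally.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class QuorumWrite {
    public static void main(String[] args) {
        // Contact any node; the driver discovers the rest of the ring.
        Cluster cluster = Cluster.builder().addContactPoint("10.240.0.1").build();
        Session session = cluster.connect("demo");   // hypothetical keyspace

        // QUORUM: with RF=3, the coordinator waits for 2 replicas to ack
        // the write before answering the client.
        PreparedStatement insert = session
                .prepare("INSERT INTO events (id, payload) VALUES (?, ?)")
                .setConsistencyLevel(ConsistencyLevel.QUORUM);

        session.execute(insert.bind("key-1", "value-1"));
        cluster.close();
    }
}
```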

Thank you for the comments! I can assure you I read them, and I'll incorporate
suggestions where possible.

~~~
arielweisberg
Thanks, I had missed the link to the tarball in the GIST. Now that I can see
the workload, am I right that it is 100 million keys for the entire cluster of
330 nodes?

That is 8 terabytes (n1-standard-8) of RAM for 18.6 * RF gigabytes of data, or
am I wrong? The entire thing should fit in the page cache, assuming compaction
keeps up. There are no deletes, so no tombstones. On an overwrite workload I
don't know what the space amplification will be and whether you will actually
run out of RAM for caching.
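
To spell out the back-of-envelope behind my question (the per-row size here is
inferred from the 18.6 figure, and RF=3 is an assumption):

```java
public class CacheMath {
    public static void main(String[] args) {
        long rows = 100_000_000L;    // keys in the workload
        long bytesPerRow = 186L;     // assumed: 18.6 GB / 100M rows
        int rf = 3;                  // assumed replication factor
        long onDisk = rows * bytesPerRow * rf;  // ~55.8 GB cluster-wide
        long perNode = onDisk / 330;            // ~169 MB per node: trivially cacheable
        System.out.printf("%.1f GB total, %.0f MB per node%n",
                onDisk / 1e9, perNode / 1e6);
    }
}
```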

90% of people who read a benchmark aren't going to look at it the way I do.
They don't have a cost model that says how fast things should be and how much
they should cost. I do, and when things fit in memory I have a different set of
expectations. If I am looking at the wrong instance type, please let me know.

For the workload you described, Cassandra shouldn't be doing random IO. I would
expect three + N streams of sequential IO: the write-ahead log, memtable
flushing, compaction output, and N streams reading tables for compaction.

All the write IO can be deferred and rescheduled heavily because fsyncs are
infrequent. Reads for compaction are done by background tasks and shouldn't
affect foreground latency.

If read ahead is not working for compaction (and killing disk throughput), that
may be something that needs to be addressed. Compaction should be requesting
IOs large enough to amortize the cost of seeking. Page faults in memory-mapped
files don't stop read ahead, and I think the kernel will even detect
sequential access and double the read ahead. For a workload like this with no
random reads you could configure the kernel to read ahead 2-4 megabytes, and
the page cache would probably absorb it fine.
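
For reference, bumping readahead is a one-liner at the block-device level (the
device name here is made up):

```
# Readahead is counted in 512-byte sectors, so 8192 sectors = 4 MB.
blockdev --setra 8192 /dev/sdb
```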

For the workload you described, the only fsyncs would be the log (every 10
seconds) and memtable flushes/compaction finishing, and those would normally be
so infrequent as to not move the needle on overall IO capacity (although
obviously they consume sequential IO). You set trickle fsync, so we are still
talking about 10-megabyte writes.

Granted, there are many things I don't know about Cassandra. I've plumbed a lot
of it, but it isn't my day job. Using a 24-gigabyte heap and a 600-megabyte
young generation is questionable to me. I think Cassandra can do better, and I
also think there are several tools that would do the same job with 10-20x fewer
nodes by exploiting the fact that they never have to do random reads from
disk.

~~~
ivansmf
Hi - I figured you know what you're doing based on the details you picked up
on. I suspect you'd have way more fun reading the raw logs, but those would
make for a post that is incredibly hard to parse. I will try to figure out a
venue for them.

It was 100M keys per loader, so 30x100M for the low-latency cluster and
45x100M for the high-latency cluster. I sized it to keep running long enough to
go through half a dozen major compactions, not necessarily to eat storage space.

Most of the IO is sequential, which really makes no difference for our backend
(sorry, I have to leave it at that). Read performance ends up mattering when
the system reaches its limit of pending compactions, which I capped in the
config. I could probably get a higher number without the cap, but it seemed
more realistic to set limits, and I would not deploy a server without them.

I had to make the fsyncs more frequent, not less, to reduce the odds of an
operation sitting behind a long flush. This was a counter-intuitive finding
for me, but it makes sense considering Persistent Disks throttle both IO sizes
and IOPS. We do that to make sure a noisy neighbouring VM won't affect yours.
So I turned on trickle_fsync and tuned the maximum sync size so there is never
one slow flush, just many smaller ones (see the snippet below).
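
For the curious, the knobs in question live in cassandra.yaml. The values below
are illustrative (roughly the stock defaults), not necessarily the exact ones
in the GIST:

```
# cassandra.yaml -- illustrative values, not necessarily those in the GIST
commitlog_sync: periodic             # fsync the commit log on a timer...
commitlog_sync_period_in_ms: 10000   # ...every 10 seconds
trickle_fsync: true                  # fsync SSTable writes incrementally
trickle_fsync_interval_in_kb: 10240  # ...every 10 MB, so no single giant flush
```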

Tuning the Java GC was for the same reason: I did not want long GC cycles.
The settings I used were based on the guidance from Java memory ergonomics.
The DataStax distribution has tips and hints in the cassandra-env.sh file
itself, which I read and followed. I also read more about Java memory than I
ever want to read again, from various vendors. As the joke goes, "I even did
the things that contradicted the other things" before arriving at the settings
you saw.
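
For reference, the shape of those settings in cassandra-env.sh, using the heap
and young-generation sizes discussed in this thread (the CMS flags are the
stock ones that ship in the file, not necessarily my full set):

```
# cassandra-env.sh -- sizes from this thread; flags are the stock CMS set
MAX_HEAP_SIZE="24G"    # total JVM heap
HEAP_NEWSIZE="600M"    # young generation, kept small to keep pauses short
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
```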

I hope you appreciate the fact that I set limits on some of the most dangerous
knobs, for instance capping the RPC server and using HSHA as opposed to the
unbounded sync server.
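
In cassandra.yaml terms (the thread cap is an illustrative number, not the one
from the GIST):

```
rpc_server_type: hsha   # half-sync/half-async server with a bounded thread pool
rpc_max_threads: 2048   # illustrative cap; the sync server is unbounded by default
```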

I understand the limitations of the benchmark I used, and the limitations of
the post. I tried to stay within recommended settings for all subsystems, set
safe limits for everything I found dangerous, and picked the workload because
that is what our customers do. I don't advocate one storage solution over
another; we love everyone who buys our solutions :)

BTW, while we do not sell Cassandra as a service, we do sell Cloud SQL
([https://developers.google.com/cloud-sql/](https://developers.google.com/cloud-sql/)).
Maybe the sales guys will give me a cut :)

Sharing the config was more work, but I suspected I was going to learn
something by doing it. Thank you for the feedback!

PS: sorry for the delay replying - Y Combinator says I am posting too much. I
am not quite sure at what point I will hit my daily quota of replies.

------
primitivesuave
In the end these are just eye-popping benchmarks. What really drives people to
your platform is developer friendliness and the speed with which you can get
started. Google App Engine has this, but there isn't as much information
floating around the media and blogosphere as there is for Heroku and
DigitalOcean.

I use Google App Engine for auto-scaling an API and it works brilliantly.
Super easy to set up and develop on. But recently I started preferring
DigitalOcean, simply because there is a community constantly posting
tutorials and answering questions. To me, that's more valuable than the
distant prospect of handling 1M writes/second.

~~~
jeremydw
While this doesn't directly pertain to your comment, perhaps it's somewhat
related to onboarding with GAE...

I've been developing on GAE since it was released in 2008 and recently became
"Google Cloud Platform Certified".

I'm considering offering a "Google App Engine for Startups" class/workshop (at
something like General Assembly). The class would primarily include details
about the platform's architecture and best practices for building high-
scalability apps (like, say, Snapchat) so your app can "scale without thinking
twice." Do you think there'd be some interest in a class like this?

~~~
jfim
From organizing past events: it's much easier to just do it and have it flop
(nobody's going to notice if it does) than to try gauging interest beforehand.
Ask your target venue if you can do it, make sure it has some visibility (i.e.
make sure it's in the program), set up a slide deck or two, and do it. Good
luck!

------
t0mas88
The "one million" number grabs some attention, but this isn't that special in
my opinion. Doing 3333 writes/sec per node is not that hard with Cassandra,
actually it can be much faster if you use a setup with fast local storage and
split for example the data disk from the commit log. The Google article reads
as if they used network storage and 1 volume per node, both are bad ideas for
Cassandra as documented by Datastax.
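
For anyone unfamiliar, that split is two settings in cassandra.yaml (the paths
here are illustrative):

```
# cassandra.yaml -- keep the sequential commit log off the data disk
data_file_directories:
    - /mnt/data1/cassandra/data
commitlog_directory: /mnt/commitlog/cassandra/commitlog
```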

~~~
jasonlotito
Keep in mind, it's not "One Million Writes Per Second," it's "One Million
Writes Per Second on Google Compute Engine" with "Google Compute Engine" being
the key point to the article.

The "one million writes per second" for Cassandra has been written about
before (in this case, on AWS):
[http://techblog.netflix.com/2011/11/benchmarking-cassandra-s...](http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html)

~~~
wodzu
It is worth noting GCE is more expensive now than AWS was back in 2011.

According to the Netflix article, the AWS experiment ran at a cost of $561 per
2 hours, i.e. ~$280 per hour. Perhaps they did not utilize the cluster fully in
those 2 hours, in which case we should instead scale up the 1-hour test that
performed 500k inserts per second; in that case the cost would be $182 * 2 =
~$365 per hour.

The GCE test ran at a cost of $330 per hour. Give or take a few dollars, if
anything it's surprising that GCE does at roughly the same cost what AWS was
capable of 2+ years ago.

All that said, the GCE folks made a great effort. I wonder, though, how much
speed you can squeeze from AWS now that it is sporting SSD disks, and at what
cost.

~~~
ivansmf
Hi, the cost we published includes the time to set up the whole cluster, warm
up the data nodes, and run for 5 minutes at 1M writes per second.

Our run rate is $281 per hour, which is the same as AWS a couple of years
back. What changed is that we are using quorum commit, the data is encrypted
at rest, we have very low tail latency, and we look at all samples when
computing that.

Computing our price is easier because we do not charge per access.

Here is the formula for our run rate:

30 loaders (n1-highcpu-8) at $0.522 per hour: $15.66

300 nodes (n1-standard-8) at $0.829 per hour: $248.70

300 1 TB PDs at $0.055555556 per hour: $16.67

Total: $281.03

But keep an eye on us. These are today's prices.

------
nightshowerer
Similar but more detailed post from Netflix about 2 and a half years ago:
[http://techblog.netflix.com/2011/11/benchmarking-cassandra-s...](http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html)

------
fooyc
That's only ~3,000 writes per second per VM.

Now I would like to see how many reads it can do at the same time. Cassandra
does mostly sequential writes, but most reads take a couple of disk seeks.

------
capkutay
Can anyone comment on Cassandra's performance versus HBase? HBase adds the
complexity of dealing with the whole HStack, and clustering can be a pain if
you have no need for HDFS/ZooKeeper in the first place. Cassandra seems nice
because it's a single platform and each node is an equal member of the
cluster: no need for designating an HMaster, data nodes, etc. I was just
wondering if Cassandra is widely regarded as less performant.

I know the true answer lies in my exact use cases and weeks of initial
testing, but it would be nice to hear someone's opinion first.

~~~
coolsunglasses
Honestly, just google "HBase vs. Cassandra" and go from there.

Just keep in mind that between the two, Cassandra has improved a lot more than
HBase has.

HBase can be an easy choice if you already have a Hadoop cluster and want to
roll the results of Map-Reduce jobs into HBase keys.

The DataStax crew has a decent stack for turning batch/OLAP jobs into
queryable keys.

If you have no need of that, want tunable consistency, and favor write
availability over read performance, then Cassandra might be a fit.

Just, uh, don't pretend Cassandra clusters are necessarily trivial to manage
just because they're homogeneous.

IMHO: put off moving to any of these technologies as long as possible.

------
outside1234
Is anyone actually using Google Compute Engine? I haven't bumped into anyone
using it and would love to hear what the real world experience is with it.

~~~
t0mas88
I've tested it; network latency is nowhere near as good as, for example,
Rackspace Cloud's. And tool support is very far behind Amazon and OpenStack.
Basically, Google Compute Engine is something that might have had a chance
years ago, but at this point in time it's quite pointless: it has no upsides
compared to existing, far better established solutions, and it is behind on
features, documentation, tools, and user support.

And I expect that not to get better soon if nobody is using it...

~~~
ithkuil
> And I expect that not to get better soon if nobody is using it...

If the people running GCE see that people are not using it because it sucks at
X, they will probably do their best to address X. So complaining is a good
thing, especially if it contains enough information to understand the problem
(e.g. latency from/to where? internal network or external network?)

------
oamoruwa
The 1M writes/second capability is a valuable benchmark for analyzing
continuously variable, real-time information, similar to CFD analysis or other
high-throughput computational analyses where you end up dealing with
multivariate data continuously in flux.

------
roncohen
Did anyone actually care to see if that data can be read back? :)

[https://dev.mysql.com/doc/refman/5.0/en/blackhole-storage-en...](https://dev.mysql.com/doc/refman/5.0/en/blackhole-storage-engine.html)

~~~
hibikir
Unix admins have been using a far better technology for decades: store in
/dev/null, extract from /dev/random. The write speeds are just as fast, but you
will eventually get your results back. It is said that it would be quicker to
extract the next Game of Thrones novel from this system than to wait for
George R. R. Martin to finish writing it.

------
vithlani
These cloud-y benchmarks are a joke.

1. What is the point of benchmarking so many writes? Instead, give us the
cost of a write-read cycle and a few use cases: analysis, data mining, etc. And
what about the bandwidth costs?

2. Say you are running a moderately intensive application that needs to be up
10 hours a day. $330/hour x 10 = $3,300 per day. Multiply that by a few days
and where does that leave you?

Suddenly having your own infrastructure with kit like this in it is not such a
bad idea: [http://www.fusionio.com/products](http://www.fusionio.com/products).
It also means you don't have to use any funny NoSQL products, and can use
standard programming, non-cloudy infrastructure, and a decent RDBMS setup that
non-Silicon-Valley, non-hipster humans have a chance of understanding and
reliably maintaining.

