
Revisiting 1M Writes per second - shifte
http://techblog.netflix.com/2014/07/revisiting-1-million-writes-per-second.html
======
rdtsc
My other favorite "scalability" study is from WhatsApp:

[http://www.youtube.com/watch?v=c12cYAUTXXs](http://www.youtube.com/watch?v=c12cYAUTXXs)

That's 'Billion' with a 'B': Scaling to the Next Level at WhatsApp

(that talk title was created before the acquisition and was meant to imply
message count; after the acquisition it gained a secondary meaning).

The one thing that is fascinating about it, is how small their team was
compared to the volume and complexity of the operation.

~~~
dav-
The thought of even touching systems at such scales terrifies me.

~~~
rdtsc
Agreed. Someone once asked how you simulate load like that for testing, and he
answered "we can't"; they rely on gradual deployment and rolling upgrades
instead.

------
MoOmer
I love using Cassandra; it's been a dream for analytics. Thank you Netflix et
al. for not only moving the project forward, but providing research and
commentary like this and others available at
[http://wiki.apache.org/cassandra/ArticlesAndPresentations](http://wiki.apache.org/cassandra/ArticlesAndPresentations)

~~~
jxf
Can you talk a little about how you're using it for analytics and why you like
it over other similar choices?

------
arielweisberg
Would be great to know the exact data set size (not size in the database). I
can't get an order of magnitude sense of what I am looking at without that.

I know I can divine it from the parameters to stress, but I have no idea
whether the row keys generated by different clients overlap, and I don't know
the default number of columns nor their size.

I think it's also important in this kind of benchmark to describe the
distribution of access, especially for a read-intensive benchmark. Without that
you really don't know what you are looking at. I am a fan of scrambled
Zipfian.
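For concreteness, here is a rough sketch of a scrambled-Zipfian key chooser in the YCSB style (my own toy version, not the actual stress tool's generator): draw a rank from a Zipf distribution, then hash it so the hot ranks are spread across the keyspace instead of clustering at the low ids:

```python
import bisect
import hashlib
import random

def make_scrambled_zipf(n, theta=0.99, seed=42):
    """Sample keys in [0, n) with Zipf-skewed popularity, with the
    popular keys scattered across the keyspace by a hash."""
    rng = random.Random(seed)
    # Cumulative distribution over ranks 1..n (heaviest first).
    weights = [1.0 / (i ** theta) for i in range(1, n + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)

    def sample():
        rank = bisect.bisect(cdf, rng.random())
        # Scramble: a fixed hash of the rank, so rank 0 (the hottest
        # key) maps to some arbitrary id instead of id 0.
        digest = hashlib.md5(str(rank).encode()).digest()
        return int.from_bytes(digest[:8], "big") % n

    return sample

# 100k keys here; the Netflix test drew from 27 million ids.
sample = make_scrambled_zipf(100_000)
print([sample() for _ in range(5)])
```

Without the scramble step, a plain Zipfian generator concentrates all the heat on the first few ids, which can accidentally flatter caches and range scans.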

~~~
jdf
It seems like this info should be front and center in the test. 1M/s of 100 KB
writes is much more impressive than 1M/s of 16-byte writes.

That said, there's a previous benchmark linked to at the top of the post:

[http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html](http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html)

 _The client is writing 10 columns per row key, row key randomly chosen from
27 million ids, each column has a key and 10 bytes of data. The total on disk
size for each write including all overhead is about 400 bytes._

There are 3 replicas, so figure that in as well.
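Taking those figures at face value, the back-of-the-envelope works out like this (my arithmetic, not from the post):

```python
writes_per_sec = 1_000_000
bytes_per_write = 400      # on-disk size incl. overhead, per the 2011 post
replicas = 3

client_mb_s = writes_per_sec * bytes_per_write / 1e6   # 400.0 MB/s offered load
cluster_mb_s = client_mb_s * replicas                  # 1200.0 MB/s after replication
print(client_mb_s, cluster_mb_s)
```

So at ~400 bytes per write, 1M writes/s is roughly 400 MB/s of client traffic and about 1.2 GB/s of replicated write volume across the cluster.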

------
dnBldGVy
So the test was run on a 285 node cluster with 60 clients. It would be nice to
know how they arrived at those numbers. Was there some sort of formula used to
calculate how large each group should be? How much trial and error was
involved?

------
JonoBB
Couldn't help but notice: $398.70 per hour = $9568.80 per day = ~$3.5m per
annum. They obviously get a discount...but still. What kind of discount do
guys like this get?
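The arithmetic behind those numbers (straight 24/365 on-demand pricing, no reserved instances):

```python
hourly = 398.70
daily = hourly * 24     # about $9,568.80
annual = daily * 365    # about $3,492,612, i.e. roughly $3.5M
print(f"${daily:,.2f}/day  ${annual:,.2f}/year")
```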

~~~
alex_sf
I'd be surprised if anyone got discounts as deep as Netflix considering their
usage.

For what it's worth, $3.5M/yr is about 0.08% of their revenue.

~~~
JonoBB
Well, that puts things in perspective.

------
jbellis
This is running Cassandra 1.2.x which is over 18 months old.

Here are some performance notes on the latest (2.1rc4):
[http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster](http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)

~~~
walls
Will 2.1 actually be usable? It seems like 2.0 has missed that mark in every
single release.

~~~
threeseed
Are you saying 2.0 isn't usable?

Because there are many companies (including ours) who would strongly disagree
with you about that.

------
turnip1979
I wonder where the term sidecar originated and what its precise definition is.
Is this something invented by the Netflix OSS people, or does it predate that?

~~~
jedberg
It's from here:
[https://www.google.com/search?q=motorcycle+sidecar&source=ln...](https://www.google.com/search?q=motorcycle+sidecar&source=lnms&tbm=isch)

~~~
turnip1979
I was hoping for a more precise software concept. Maybe it is loosey goosey.

~~~
jedberg
The definition I use is a piece of software that runs independently to
encapsulate infrastructure libraries. In other words, it gives you a way to
access the libraries without having to import them into your code.
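A toy sketch of that idea (hypothetical names, Python stdlib only): the "library" runs in its own process, and the application talks to it over localhost instead of linking it in:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# --- the sidecar process: wraps an "infrastructure library" ---
def lookup_config(key):
    """Stand-in for the wrapped library call."""
    return {"timeout_ms": 250}.get(key)

class SidecarHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = self.path.lstrip("/")
        body = json.dumps({"value": lookup_config(key)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), SidecarHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# --- the application: no import of the library, just local HTTP ---
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/timeout_ms") as r:
    resp = json.load(r)
print(resp)
server.shutdown()
```

The point is that the application's only dependency is a local network call, so the library can be written, upgraded, and operated independently of the app's language and deploy cycle.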

------
arrryarr
How many companies can possibly afford the cost and management pain of running
a 285-node database? Few would have 285 servers of any type. So if this is
what it takes to get 1M writes on Cassandra, that is some poor ROI. 10K
writes/sec is not impressive.
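One plausible reading of that 10K figure is cluster throughput divided by node count; once you include the 3x replication, the per-node arithmetic (mine, not from the post) looks like this:

```python
cluster_writes_per_s = 1_000_000
replication_factor = 3     # each write is applied on 3 replica nodes
nodes = 285

per_node = cluster_writes_per_s * replication_factor / nodes
print(round(per_node))     # roughly 10,500 replica writes/s per node
```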

~~~
threeseed
I don't understand your point.

The type of companies who could afford this are the types of companies who
need 1M writes/second, which are few and far between. And yes, 10K is not that
impressive, but 1M is. And with Cassandra you can continue to improve that
number just by rolling out more nodes.

~~~
jandrewrogers
It is quite cheap to average a million writes per second. I've done it with
five servers on AWS, and that was spatially indexing billions of GeoJSON
polygons through storage while running queries against the index.

Many companies need far in excess of a million writes per second. Basically,
most machine-generated data sources, whether it is personal location data or
any other kind of telemetry. Many companies that do not generate that data
themselves buy and consume it. I know of companies doing over a billion writes
per second.

Cassandra is pretty good for this type of thing among open source software but
it is not nearly as efficient as it could be in terms of write throughput. If
the storage engine is correctly designed, you should be able to drive 10GbE
all the way through storage -- call it 1 GB/sec per node. However, that does
mean you can't do things like mmap()-ing files; those interfaces are slow due
to poor scheduling by the OS when the throughput is very high.
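A quick sanity check on those bandwidth numbers (my arithmetic; the 400 bytes/write figure is from the Netflix posts):

```python
link_gbit = 10
raw_gb_s = link_gbit / 8        # 1.25 GB/s on the wire
usable_gb_s = 1.0               # the "call it 1 GB/s" figure after overhead
bytes_per_write = 400

writes_per_s = usable_gb_s * 1e9 / bytes_per_write   # 2.5M writes/s per node
print(raw_gb_s, int(writes_per_s))
```

In other words, a storage engine that can truly saturate 10GbE through to disk would sustain a few million 400-byte writes per second on a single node.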

~~~
rsynnott
> It is quite cheap to average a million writes per second. I've done it with
> five servers on AWS, and that was spatially indexing billions of GeoJSON
> polygons through storage while running queries against the index.

How big was your total dataset? It's cheap to average a million writes per
second if the dataset fits in RAM, or at least if the index does, with the
right database. It can be much less cheap for a dataset far larger than RAM,
as write amplification becomes a significant problem for most databases.

~~~
jandrewrogers
For that case, several terabytes, IIRC. It was not in memory; that database
engine was pushing records through disk storage. Obviously there were no
R-trees or any other kind of slow secondary indexing; the database itself is
deeply and fundamentally spatially organized, even for text and numeric data.

While there is some write amplification, it is less than in most databases. It
only takes a few disks before the scheduler can get significantly more
bandwidth out of the disks than a 10GbE network can drive, so there is spare
capacity. The bottleneck on most server hardware is the silicon between
storage and memory if you are doing it right.

