Hacker News
Revisiting 1M Writes per second (netflix.com)
119 points by shifte on July 25, 2014 | 38 comments



My other favorite "scalability" study is from WhatsApp:

http://www.youtube.com/watch?v=c12cYAUTXXs

That's 'Billion' with a 'B': Scaling to the Next Level at WhatsApp

(that talk title was created before the acquisition and was meant to imply message count; after the acquisition it took on a secondary meaning).

The one thing that is fascinating about it is how small their team was compared to the volume and complexity of the operation.


The thought of even touching systems at such scales terrifies me.


Agreed. Someone once asked how they simulate load like that for testing, and he answered "we can't" — they do gradual deployment and rolling upgrades instead.


I love using Cassandra; it's been a dream for analytics. Thank you Netflix et al. for not only moving the project forward, but providing research and commentary like this and others available at http://wiki.apache.org/cassandra/ArticlesAndPresentations


Can you talk a little about how you're using it for analytics and why you like it over other similar choices?


Would be great to know the exact data set size (not the size in the database). I can't get an order-of-magnitude sense of what I'm looking at without that.

I know I can divine it from the parameters to stress, but I have no idea whether the row keys generated by different clients overlap, and I don't know the default number of columns or their size.

I think it's also important in this kind of benchmark to describe the distribution of access, especially for a read-intensive benchmark. Without that you really don't know what you are looking at. I'm a fan of scrambled Zipfian.


It seems like this info should be front and center in the test. 1M/s of 100 KB writes is much more impressive than 1M/s of 16-byte writes.

That said, there's a previous benchmark linked to at the top of the post:

http://techblog.netflix.com/2011/11/benchmarking-cassandra-s...

The client is writing 10 columns per row key, with the row key randomly chosen from 27 million ids; each column has a key and 10 bytes of data. The total on-disk size for each write, including all overhead, is about 400 bytes.

There are 3 replicas, so figure that in as well.


So the test was run on a 285-node cluster with 60 clients. It would be nice to know how they arrived at those numbers. Was there some sort of formula used to calculate how large each group should be? How much trial and error was involved?


Couldn't help but notice: $398.70 per hour = $9,568.80 per day = ~$3.5m per annum. They obviously get a discount... but still. What kind of discount do guys like this get?


I'd be surprised if anyone got discounts as deep as Netflix considering their usage.

For what it's worth, $3.5m/yr is about 0.08% of their revenue.


Well, that puts things in perspective.


If they were doing this for real, they'd be using reservations; with one-year reservations, the i2.xlarge bit (the servers) costs $905k/year.


For this type of "report" the vendor will usually chuck in the hosting for free. It would have conditions like "can only be used for this test (not anything else)" and "results must be published on your blog," etc.

Microsoft does this all the time - it's great publicity. It would be interesting to see if they (Microsoft) have something showing similar results...


This is running Cassandra 1.2.x, which is over 18 months old.

Here's some performance notes on the latest (2.1rc4): http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-f...


Will 2.1 actually be usable? It seems like 2.0 has missed that mark in every single release.


Are you saying 2.0 isn't usable?

Because there are many companies (including ours) who would strongly disagree with you about that.


I wonder where the term sidecar originated and what its precise definition is. Is this something invented by the Netflix OSS people, or does it predate that?



I was hoping for a more precise software concept. Maybe it is loosey goosey.


The definition I use is: a piece of software that runs independently to encapsulate infrastructure libraries. In other words, it gives you a way to access the libraries without having to import them into your code.


How many companies can possibly afford the cost and management pain of running a 285-node database? Few would have 285 servers of any type. So if this is what it takes to get 1M writes on Cassandra, that's some poor ROI. 10K writes/sec is not impressive.


I don't understand your point.

The types of companies who could afford this are the types of companies who need 1M writes/second, which are few and far between. And yes, 10K is not that impressive, but 1M is. And with Cassandra you can continue to improve that number just by rolling out more nodes.


It is quite cheap to average a million writes per second. I've done it with five servers on AWS, and that was spatially indexing billions of GeoJSON polygons through storage while running queries against the index.

Many companies need far in excess of a million writes per second. Basically, most machine-generated data sources, whether it is personal location data or any other kind of telemetry. Many companies that do not generate that data themselves buy and consume it. I know of companies doing over a billion writes per second.

Cassandra is pretty good for this type of thing among open source software but it is not nearly as efficient as it could be in terms of write throughput. If the storage engine is correctly designed, you should be able to drive 10GbE all the way through storage -- call it 1 GB/sec per node. However, that does mean you can't do things like mmap()-ing files; those interfaces are slow due to poor scheduling by the OS when the throughput is very high.


> It is quite cheap to average a million writes per second. I've done it with five servers on AWS, and that was spatially indexing billions of GeoJSON polygons through storage while running queries against the index.

How big was your total dataset? It's cheap to average a million writes per second if the dataset fits in RAM, or at least if the index set does, with the right database. It can be less cheap for a data set far larger than RAM, as for most databases write amplification becomes a significant problem.


For that case, several terabytes IIRC. It was not in-memory, that database engine was pushing records through disk storage. Obviously there were no R-trees or any other kind of slow secondary indexing; the database itself is deeply and fundamentally spatially organized, even for text and numbers data.

While there is some write amplification, it is less than in most databases. It only takes a few disks before the scheduler can get significantly more bandwidth out of the disks than a 10GbE network can drive, so there is extra capacity. The bottleneck on most server hardware is the silicon between the storage and memory if you are doing it right.


Out of curiosity, what would a correctly designed storage engine do to get better throughput than mmap()ing files?


On Linux, you would use io_submit + O_DIRECT on a small number of large files allocated in large chunks. In short, you become the I/O scheduler and buffer manager instead of the operating system, while removing most of the implicit context switches. It does require much more code than mmap()-ing and fairly sophisticated code at that.

If you are doing it well, 3-5x throughput improvement seems to be average upside in my experience, which is huge. The scheduler behind mmap() simply does not have enough context about the workload to make good paging decisions leading to a lot of suboptimal or wasted I/O, and this is magnified when the storage I/O is under pressure. In principle, if you write your own I/O scheduler you can always make sure that the optimal I/O operation is executed at the optimal time.


They probably mean 10K writes/sec/node, which would be correct assuming a replication factor of three. It doesn't sound huge, but if it were sustained, with a key set vastly larger than memory, it's not too bad.


How would you build out a metrics cluster that needs 1M writes/sec?

285 nodes that are easily automated to create/destroy/monitor doesn't seem like a management pain to me, personally. It just depends on whether you have the right internal tools built.


I have seen single servers do 250K writes/second. So it really depends on the data and access patterns more than some arbitrary one-size-fits-all solution.


I suspect systems where the dataset, or at least the indexes, largely fit in memory?


Most of the active data set fit in RAM, but considering you can get 100+GB of RAM, that's generally not much of an issue. Also, a single mid-range SSD can break 120K writes per second, so 0+1 RAID arrays can get crazy fast for less money than you might think.


Well, I mean, you _can_ get 100+GB of RAM, but you'll certainly pay for it. The machines used in the example have 30GB of RAM and an 800GB SSD. For a fairly random access pattern with most of the storage used, the RAM's not going to help that much. Say they're using 600GB of the 800GB with a replication factor of three; that's 57TB of unique data across the cluster. It's pretty big.


Uhh, you can get 128GB on a basic Dell server that costs less than $4k. It's 512+GB of RAM that starts getting pricey nowadays, and 1TB is actually an option.

Though I agree: if your dataset is large enough and you need random access, it's not going to help much.


Ok, so you're a super-admin. And what about the ROI aspect? What's the cost of achieving the 1M writes/sec metric in that mammoth cluster?


I asked you how you would architect something that needs 1M writes/sec if you weren't using Cassandra. Ignore the fact that Netflix is making this into a single cluster.

I'm not saying I'd do it by myself. I'm saying with the right tools it's doable and not unreasonable.


The replication factor is set to 3, meaning all data is stored on three separate nodes, in different availability zones. So in practice the writes per second are 3x.


Well, it's not unreasonable (the maintenance part) with Cassandra.



