Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: Cassandra, HBase, Hypertable, Voldemort. Lowest read latency?
29 points by ghotli on June 29, 2010 | hide | past | web | favorite | 31 comments
I've got a great deal of information that I need to store in a key value format. I need access to that data as quickly as possible. Writes are only going to occur quarterly. Any thoughts?

you need to think in terms of reads/sec vs latency, not just latency

in other words 1) what is the throughput (load) you expect your data store to sustain in reads/sec

2) what is the acceptable latency in msec for your application under expected load (see 1)

and 3) how many simultaneous connections your db has to support

to figure out which data store suits you best in terms of capacity/performance you have two choices:

1) run a benchmark on your hardware, ideally with your own client

2) run a benchmark on your hardware, ideally with your own client

i'd recommend starting with this: http://wiki.github.com/brianfrankcooper/YCSB/

dont forget that the tools you mentioned are intended for storing terabytes of data on hundreds/thousands of machines and their key benefit is scaling the bulk throughput rather than optimizing performance

check out MongoDB, Redis, Tokyo Tyrant if you have up to 10-20GB, these are superfast tools but limited to storage/IO/memory capacity of a single box

also look at some of these benchmarks (but running your own test is always better)






I've benchmarked Tokyo Tyrant, MongoDB, and Redis for a smaller dataset (~20gb). We ended up going with Tyrant because it stored our serialized data more efficiently than the others. That index lives in memory on a big ec2 instance. The data I'm talking about will easily be around half a terabyte or more and I need efficient random access within it. At least initially. I assume that in the medium term this amount of data will grow quite a bit larger. I really do just want to use something like Tokyo Tyrant, but the data management overhead of sharding the datasets across enough high memory ec2 machines to get acceptable performance is prohibitive to going that route. I've kind of been driven to the dynamo / bigtable approach if only because it can store amounts of data in these sizes and can easily be ramped up by adding more servers.

That said, I'm still in the analysis stage of where our architecture needs to go next. It very well may be cost prohibitive to go with one of the technologies I mentioned.

Thank you for the benchmark tools and guidelines. They're immediately applicable to the analysis I'm performing this week.

If "key value" is important, have you considered Redis? If things like Cassandra and HBase are the Lisp or Haskell of the database world, Redis is the assembler - it's crazy fast but you need to take care of your own underlying data structures (beyond strings, lists, and two types of set).

[Disclaimer: I am a committer on Project Voldemort]

Redis is very cool (I really want to play with it - congrats to VMWare on hiring its author), but it is neither a disk-oriented system [EDIT: fixed per comment, was disk-persistent] nor (most importantly, IMO) a fully distributed system.

See a comment in this blog post for great visual guide to these systems from these two axes (disk-orientation and distribution):


Redis supports full disk persistence by adding the following values to your redis.conf:

    appendonly yes
    appendfsync always
Yes, fsyncing after every operation will be slower than fsyncing every second or so.

Thanks, fixed it.

Sure, but the OP specified:

I need access to that data as quickly as possible. Writes are only going to occur quarterly.

The emphasis is on speed and writes are rare, so being disk-oriented wouldn't provide any advantages.

I guess the main question when considering Redis is: Does the entire database fit into memory?

Redis supports "virtual memory".

My initial thoughts were for a Redis style database. It certainly won't live in memory. That's why I've been pushed towards a dynamo / bigtable approach.

Though, as has been raised a few times, that is not a restriction in Redis anymore.

that would be "does the entire keyspace fit in memory?"

For most of these systems, the answer will depend on how much hardware (disks and RAM) you're throwing at the problem. The RAM-to-disk ratio is a big factor. If your entire data set fits in memory, the latency will be measured in microseconds. If none of your data fits in memory and each request requires a disk seek, the average latency will be closer 10 milliseconds (assuming there are enough disks to keep up with the requests). Take a look at average latency section in the following benchmark comparing Hypertable with HBase: http://www.hypertable.com/pub/perfeval/test1/ Hypertable average latency goes from about 1.5 milliseconds down to about .5 milliseconds. These numbers are average, however, some requests hit disk and were close to 10 milliseconds, but many requests were served out of the cache. The read request pattern also makes a big difference. Read patterns with locality (e.g. Zipfian distribution) will have much a much better cache hit percentage. -Doug

Thank you for the numbers about hypertable. In my exploration I've been leaning towards hypertable but I haven't done any benchmarking of my own yet.

My read patters will probably have locality but I have not yet figured out how to model the data in a key / value style to preserve the locality of the data. Can you describe what you mean by Zipfian distribution?

An easy way to think about Zipfian distributions is something like "the most popular 30% of your data set is queried 70% of the time". Its basically the temporal locality property that a lot caching systems exploit. For more look at http://en.wikipedia.org/wiki/Zipf%27s_law and www.icis.ntu.edu.sg/scs-ijit/1204/1204_6.pdf

That paper was great. Thanks.

Disclaimer: I'm a committer on Project Voldemort and while the default read/write storage engine used by Voldemort is BerkeleyDB-JE -- which uses a log-structured B+ tree -- it's very easy to get sub 1 ms read latency and 10,000-20,000 reads/sec on a single node, depending on the amount of memory/hardware. To put it in another way, if you're using Voldemort you probably don't need memcached. While BDB-JE read performance does begin to decline when your data:memory ratio goes in excess of ~5:1, but in one case we've able to support a ratio as high as 10:1 while meeting all SLAs with quorum reads/writes on a 6 node cluster. Other storage engines e.g., MySQL are available and it's trivial to write more.

I'll make my bias fully known: I believe that for smaller clusters (and by smaller, I mean <500-1000 nodes) the Dynamo model (used by Voldemort, Cassandra and Riak) is more suited: no special nodes (you can use the same hardware everywhere, same recovery procedures, same capacity planning approach, same expansion procedures), vector clocks and quorums for consistency, gossip for cluster state, much simpler multi-datacenter operation. It's an easier model to implemented (thus less prone to bugs) and is easier on operations (no need to maintain separate classes of machines as spares, very easy to restore a failed machines or add a new one).

Nonetheless, I am not a distributed systems researcher and plenty of real ones would have very serious reasons to disagree with me, depending on the specific use cases: in theory, for read-heavy traffic the BigTable model, single agreed-upon master for a specific token range (thus no need for quorum reads) should yield very low latency and scale extremely well; data is partitioned amongst tablet servers and the metadata master is used only for assigning tablets to tablet servers; as there's one true copy of the metadata, there's also no need for gossip and programmers don't have to deal with vector clocks (they can use high level abstractions like GQL on the AppEngine or the transactional MegaStore). In practice, however, BigTable took multiple years to build (as did GFS and Chubby); very little is also known about Spanner which is also a multi-year effort. Keep in mind those are multiple years with literally the world's best engineers and world's best data: they can mine logs from >1,000,000 machines to support any hypotheses they have.

With the Dynamo model, in addition to Cassandra and Voldemort, you should also consider Riak, which has the excellent Bitcask storage engine available which is append-only (writes are strictly sequential) and requires only O(1) random disk seeks for read queries; it also uses the OS page cache to be friendly to Erlang's garbage collector. An engine like this has been (independently) in development for Voldemort, but it's not yet ready for prime time.

Here are some things for you to consider:

* How much data do you have in relation to memory? To make the long story short if all your data fits into memory, at least with Voldemort, Cassandra, Hypertable (and probably with HBase as well) you're going to be able to easily saturate commodity ethernet with a modest sized cluster. Pick one on the basis of the distribution model: do you care about "read your writes" consistency (i.e., if using Cassandra/Voldemort/Riak are you going to need to do quorum reads?), do you need high availability, or are you okay with a single point of failure?

* Keep in mind that while many systems are very extensible as far as storage (and thus performance) and query models go (storage engines are pluggable and view support is coming soon to the mainline Voldemort branch) they are not pluggable as far as the distribution model goes: it would be difficult to centralize all metadata in Voldemort and make it a master-slave system with Paxos leader elections, it would be even more difficult to make HBase/Hypertable asymmetric and based on eventual consistency/quorums.

* When you're dealing with volumes of data larger than what can fit into memory, you need to perform a benchmark based on a realistic simulation. Important thing you need is the traffic distribution that your app actually receives i.e., how many requests can be served out of the cache? Generally Internet traffic (and my apologies in advance if you're not building a web app, that is a huge assumption on my part) follows the Zipfian (rather than uniform) probability distribution which is supported by the YCSB (Yahoo Cloud Serving Benchmark) tool.

Voldemort includes a slightly modified version of YCSB and inside (see ./bin/performance-tool.sh for the YCSB-based tool; see ./bin/remote-test.sh for the legacy home-brew tool, the two will be merged together into one) so you can do this. YCSB (the mainline version) should support Cassandra and HBase out of the box (someone else pointed to it in another comment).

If you've got any questions in regards to Voldemort, please feel free to email me (my contact information is in my profile), drop by #Voldemort on irc.freenode.org or email the mailing list: http://groups.google.com/group/project-voldemort


What do you mean by quarterly? As in the fiscal quarter as in quarter of the time? If you mean the former (almost certainly not the case, but thought I'd ask), you may want to consider building your indexes in Hadoop and pushing them to Voldemort (this also works well on a daily or multiple-times-a-day basis):


Thanks for the extensive answer. We buy a lot of data that comes in four times a year for our interactive mapping engine. Initially I would only want to update that data every quarter. That said, if I get a system like this up and running, there are other datasets that would be ideal in a system like this that would require quite a few writes every a day. Logs, cached images, etc.

If you don't need sophisticated distribution, you can consider Redis or TokyoCabinet as alternatives.

You can also give Keyspace a try. In your use-case, you'd be running it in single mode, in which case it's basically a high-speed network wrapper around BerkeleyDB with a nice client library and docs.


How many records? How much uncompressed data in total?

Are lookups always by exact key, or by nearest key to a probe? Are keys consulted essentially at random, or in batches of consecutive ascending/descending accesses?

Billions of records. I haven't yet determined which way to slice the dataset to be most efficient. A fair amount of it is denormalized and my assumption is that the initial dataset will be on the order of half a terabyte or more.

Right now the system I've conceived of has lookups by exact key, but the subsequent queries would be for data that is "near" to the previous query. I may mitigate that by increasing the size of data within the initial query to ensure that the "next query" is already loaded into memory.

The keys will always be consulted at random. That's kind of the problem.

Only half a terabyte? All these distributed options are overkill.

Fit one machine with 5 160GB SSDs. Use anything. Classic BerkeleyDB or BerkeleyDB-JE should do.

To begin with, yes. It also depends on how heavily I denormalize the data for recall. It could be many times that down the road. I'm not looking to implement something heavily distributed right now, but I know that day is coming with my datasets and I want to ensure I know what needs to be built when the time comes.

Cassandra's out. It's whole design is for work loads with lots of writes mixed in.

Not really the case. There all sorts of optimizations in place (row caching, Bloom filters) that give adequate read performance.

Is the read performance suitable for the OP, given his workload, his hardware (i.e., the amount of RAM he can give to Cassandra, the speed of his disks)? I don't know, but the only way to find out is experimentally.

being really good at writes does not mean it's really bad at reads, it is actually pretty good (in our cassandra cluster we have ~1ms latency which may be tolerable or not but it's not a lot) EDIT: and we don't have row caching enabled


Define 'great deal'? How many records are we talking about?

and on how many/what sort of hosts would be you spreading those records over?

I've answered the size question in a few other posts, but I'm thinking billions of records, half a terabyte or more of denormalized data. Stored in whatever ec2 instances I need to accomplish the job. If the required computing power is more than my business can afford then it's really up to the management to make the determination as to whether or not they really want to go down this path.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact