Disclaimer: I'm a committer on Project Voldemort and while the default read/write storage engine used by Voldemort is BerkeleyDB-JE -- which uses a log-structured B+ tree -- it's very easy to get sub 1 ms read latency and 10,000-20,000 reads/sec on a single node, depending on the amount of memory/hardware. To put it in another way, if you're using Voldemort you probably don't need memcached. While BDB-JE read performance does begin to decline when your data:memory ratio goes in excess of ~5:1, but in one case we've able to support a ratio as high as 10:1 while meeting all SLAs with quorum reads/writes on a 6 node cluster. Other storage engines e.g., MySQL are available and it's trivial to write more.
I'll make my bias fully known: I believe that for smaller clusters (and by smaller, I mean <500-1000 nodes) the Dynamo model (used by Voldemort, Cassandra and Riak) is more suited: no special nodes (you can use the same hardware everywhere, same recovery procedures, same capacity planning approach, same expansion procedures), vector clocks and quorums for consistency, gossip for cluster state, much simpler multi-datacenter operation. It's an easier model to implemented (thus less prone to bugs) and is easier on operations (no need to maintain separate classes of machines as spares, very easy to restore a failed machines or add a new one).
Nonetheless, I am not a distributed systems researcher and plenty of real ones would have very serious reasons to disagree with me, depending on the specific use cases: in theory, for read-heavy traffic the BigTable model, single agreed-upon master for a specific token range (thus no need for quorum reads) should yield very low latency and scale extremely well; data is partitioned amongst tablet servers and the metadata master is used only for assigning tablets to tablet servers; as there's one true copy of the metadata, there's also no need for gossip and programmers don't have to deal with vector clocks (they can use high level abstractions like GQL on the AppEngine or the transactional MegaStore). In practice, however, BigTable took multiple years to build (as did GFS and Chubby); very little is also known about Spanner which is also a multi-year effort. Keep in mind those are multiple years with literally the world's best engineers and world's best data: they can mine logs from >1,000,000 machines to support any hypotheses they have.
With the Dynamo model, in addition to Cassandra and Voldemort, you should also consider Riak, which has the excellent Bitcask storage engine available which is append-only (writes are strictly sequential) and requires only O(1) random disk seeks for read queries; it also uses the OS page cache to be friendly to Erlang's garbage collector. An engine like this has been (independently) in development for Voldemort, but it's not yet ready for prime time.
Here are some things for you to consider:
* How much data do you have in relation to memory? To make the long story short if all your data fits into memory, at least with Voldemort, Cassandra, Hypertable (and probably with HBase as well) you're going to be able to easily saturate commodity ethernet with a modest sized cluster.
Pick one on the basis of the distribution model: do you care about "read your writes" consistency (i.e., if using Cassandra/Voldemort/Riak are you going to need to do quorum reads?), do you need high availability, or are you okay with a single point of failure?
* Keep in mind that while many systems are very extensible as far as storage (and thus performance) and query models go (storage engines are pluggable and view support is coming soon to the mainline Voldemort branch) they are not pluggable as far as the distribution model goes: it would be difficult to centralize all metadata in Voldemort and make it a master-slave system with Paxos leader elections, it would be even more difficult to make HBase/Hypertable asymmetric and based on eventual consistency/quorums.
* When you're dealing with volumes of data larger than what can fit into memory, you need to perform a benchmark based on a realistic simulation. Important thing you need is the traffic distribution that your app actually receives i.e., how many requests can be served out of the cache? Generally Internet traffic (and my apologies in advance if you're not building a web app, that is a huge assumption on my part) follows the Zipfian (rather than uniform) probability distribution which is supported by the YCSB (Yahoo Cloud Serving Benchmark) tool.
Voldemort includes a slightly modified version of YCSB and inside (see ./bin/performance-tool.sh for the YCSB-based tool; see ./bin/remote-test.sh for the legacy home-brew tool, the two will be merged together into one) so you can do this. YCSB (the mainline version) should support Cassandra and HBase out of the box (someone else pointed to it in another comment).
What do you mean by quarterly? As in the fiscal quarter as in quarter of the time? If you mean the former (almost certainly not the case, but thought I'd ask), you may want to consider building your indexes in Hadoop and pushing them to Voldemort (this also works well on a daily or multiple-times-a-day basis):
Thanks for the extensive answer. We buy a lot of data that comes in four times a year for our interactive mapping engine. Initially I would only want to update that data every quarter. That said, if I get a system like this up and running, there are other datasets that would be ideal in a system like this that would require quite a few writes every a day. Logs, cached images, etc.
dont forget that the tools you mentioned are intended for storing terabytes of data on hundreds/thousands of machines and their key benefit is scaling the bulk throughput rather than optimizing performance
check out MongoDB, Redis, Tokyo Tyrant if you have up to 10-20GB, these are superfast tools but limited to storage/IO/memory capacity of a single box
also look at some of these benchmarks (but running your own test is always better)
I've benchmarked Tokyo Tyrant, MongoDB, and Redis for a smaller dataset (~20gb). We ended up going with Tyrant because it stored our serialized data more efficiently than the others. That index lives in memory on a big ec2 instance. The data I'm talking about will easily be around half a terabyte or more and I need efficient random access within it. At least initially. I assume that in the medium term this amount of data will grow quite a bit larger. I really do just want to use something like Tokyo Tyrant, but the data management overhead of sharding the datasets across enough high memory ec2 machines to get acceptable performance is prohibitive to going that route. I've kind of been driven to the dynamo / bigtable approach if only because it can store amounts of data in these sizes and can easily be ramped up by adding more servers.
That said, I'm still in the analysis stage of where our architecture needs to go next. It very well may be cost prohibitive to go with one of the technologies I mentioned.
Thank you for the benchmark tools and guidelines. They're immediately applicable to the analysis I'm performing this week.
If "key value" is important, have you considered Redis? If things like Cassandra and HBase are the Lisp or Haskell of the database world, Redis is the assembler - it's crazy fast but you need to take care of your own underlying data structures (beyond strings, lists, and two types of set).
[Disclaimer: I am a committer on Project Voldemort]
Redis is very cool (I really want to play with it - congrats to VMWare on hiring its author), but it is neither a disk-oriented system [EDIT: fixed per comment, was disk-persistent] nor (most importantly, IMO) a fully distributed system.
See a comment in this blog post for great visual guide to these systems from these two axes (disk-orientation and distribution):
For most of these systems, the answer will depend on how much hardware (disks and RAM) you're throwing at the problem. The RAM-to-disk ratio is a big factor. If your entire data set fits in memory, the latency will be measured in microseconds. If none of your data fits in memory and each request requires a disk seek, the average latency will be closer 10 milliseconds (assuming there are enough disks to keep up with the requests). Take a look at average latency section in the following benchmark comparing Hypertable with HBase: http://www.hypertable.com/pub/perfeval/test1/
Hypertable average latency goes from about 1.5 milliseconds down to about .5 milliseconds. These numbers are average, however, some requests hit disk and were close to 10 milliseconds, but many requests were served out of the cache. The read request pattern also makes a big difference. Read patterns with locality (e.g. Zipfian distribution) will have much a much better cache hit percentage. -Doug
Thank you for the numbers about hypertable. In my exploration I've been leaning towards hypertable but I haven't done any benchmarking of my own yet.
My read patters will probably have locality but I have not yet figured out how to model the data in a key / value style to preserve the locality of the data. Can you describe what you mean by Zipfian distribution?
An easy way to think about Zipfian distributions is something like "the most popular 30% of your data set is queried 70% of the time". Its basically the temporal locality property that a lot caching systems exploit.
For more look at http://en.wikipedia.org/wiki/Zipf%27s_law and www.icis.ntu.edu.sg/scs-ijit/1204/1204_6.pdf
Not really the case. There all sorts of optimizations in place (row caching, Bloom filters) that give adequate read performance.
Is the read performance suitable for the OP, given his workload, his hardware (i.e., the amount of RAM he can give to Cassandra, the speed of his disks)? I don't know, but the only way to find out is experimentally.
being really good at writes does not mean it's really bad at reads, it is actually pretty good (in our cassandra cluster we have ~1ms latency which may be tolerable or not but it's not a lot) EDIT: and we don't have row caching enabled
Billions of records. I haven't yet determined which way to slice the dataset to be most efficient. A fair amount of it is denormalized and my assumption is that the initial dataset will be on the order of half a terabyte or more.
Right now the system I've conceived of has lookups by exact key, but the subsequent queries would be for data that is "near" to the previous query. I may mitigate that by increasing the size of data within the initial query to ensure that the "next query" is already loaded into memory.
The keys will always be consulted at random. That's kind of the problem.
To begin with, yes. It also depends on how heavily I denormalize the data for recall. It could be many times that down the road. I'm not looking to implement something heavily distributed right now, but I know that day is coming with my datasets and I want to ensure I know what needs to be built when the time comes.
I've answered the size question in a few other posts, but I'm thinking billions of records, half a terabyte or more of denormalized data. Stored in whatever ec2 instances I need to accomplish the job. If the required computing power is more than my business can afford then it's really up to the management to make the determination as to whether or not they really want to go down this path.