In other words: 1) what is the throughput (load) you expect your data store to sustain, in reads/sec;
2) what is the acceptable latency in msec for your application under that expected load (see 1);
and 3) how many simultaneous connections does your DB have to support?
To figure out which data store suits you best in terms of capacity/performance, you have two choices:
1) run a standard benchmark on your hardware
2) run a benchmark on your hardware with your own client and workload
I'd recommend starting with this: http://wiki.github.com/brianfrankcooper/YCSB/
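For reference, a YCSB workload is just a properties file; this is roughly what the stock workloads/workloada looks like (a 50/50 read/update mix over a Zipfian key distribution; double-check the wiki for the exact knobs and client flags):

    workload=com.yahoo.ycsb.workloads.CoreWorkload
    recordcount=1000
    operationcount=1000
    readallfields=true
    readproportion=0.5
    updateproportion=0.5
    scanproportion=0
    insertproportion=0
    requestdistribution=zipfian

Crank recordcount/operationcount way up for anything resembling a real test.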
Don't forget that the tools you mentioned are intended for storing terabytes of data on hundreds or thousands of machines; their key benefit is scaling bulk throughput rather than optimizing single-node performance.
Check out MongoDB, Redis, or Tokyo Tyrant if you have up to 10-20GB of data; these are superfast tools, but they are limited to the storage/IO/memory capacity of a single box.
Also look at some of these benchmarks (but running your own test is always better).
That said, I'm still in the analysis stage of where our architecture needs to go next. It very well may be cost-prohibitive to go with one of the technologies I mentioned.
Thank you for the benchmark tools and guidelines. They're immediately applicable to the analysis I'm performing this week.
Redis is very cool (I really want to play with it - congrats to VMware on hiring its author), but it is neither a disk-oriented system [EDIT: fixed per comment, was disk-persistent] nor (most importantly, IMO) a fully distributed system.
See a comment in this blog post for a great visual guide to these systems along these two axes (disk-orientation and distribution):
I need access to that data as quickly as possible. Writes are only going to occur quarterly.
The emphasis is on speed and writes are rare, so being disk-oriented wouldn't provide any advantages.
My read patterns will probably have locality, but I have not yet figured out how to model the data in a key/value style so as to preserve that locality. Can you describe what you mean by a Zipfian distribution?
I'll make my bias fully known: I believe that for smaller clusters (and by smaller, I mean <500-1000 nodes) the Dynamo model (used by Voldemort, Cassandra and Riak) is better suited: no special nodes (you can use the same hardware everywhere, same recovery procedures, same capacity planning approach, same expansion procedures), vector clocks and quorums for consistency, gossip for cluster state, much simpler multi-datacenter operation. It's an easier model to implement (thus less prone to bugs) and is easier on operations (no need to maintain separate classes of machines as spares; very easy to restore a failed machine or add a new one).
Nonetheless, I am not a distributed systems researcher, and plenty of real ones would have very serious reasons to disagree with me, depending on the specific use cases: in theory, for read-heavy traffic the BigTable model, with a single agreed-upon master for a specific token range (thus no need for quorum reads), should yield very low latency and scale extremely well; data is partitioned amongst tablet servers and the metadata master is used only for assigning tablets to tablet servers; as there's one true copy of the metadata, there's also no need for gossip, and programmers don't have to deal with vector clocks (they can use high-level abstractions like GQL on AppEngine or the transactional MegaStore). In practice, however, BigTable took multiple years to build (as did GFS and Chubby); very little is known about Spanner, which is also a multi-year effort. Keep in mind those are multiple years with literally the world's best engineers and the world's best data: they can mine logs from >1,000,000 machines to support any hypotheses they have.
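Since I keep mentioning vector clocks, here's a toy sketch of the concept (concept only - the real implementations in Voldemort/Riak add timestamps, pruning and compact serialization):

    import java.util.HashMap;
    import java.util.Map;

    // Toy vector clock: one counter per node that has coordinated a write.
    public class VectorClock {
        final Map<String, Long> counters = new HashMap<>();

        // Record a write coordinated by the given node.
        public void increment(String nodeId) {
            counters.merge(nodeId, 1L, Long::sum);
        }

        // True if every counter in a is <= the matching counter in b,
        // i.e. version b has seen everything version a has.
        static boolean beforeOrEqual(VectorClock a, VectorClock b) {
            for (Map.Entry<String, Long> e : a.counters.entrySet())
                if (e.getValue() > b.counters.getOrDefault(e.getKey(), 0L))
                    return false;
            return true;
        }

        // Two versions are concurrent when neither descends from the other;
        // that's the conflict the client (or read repair) must reconcile.
        public static boolean concurrent(VectorClock a, VectorClock b) {
            return !beforeOrEqual(a, b) && !beforeOrEqual(b, a);
        }
    }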
With the Dynamo model, in addition to Cassandra and Voldemort, you should also consider Riak, which has the excellent Bitcask storage engine available: it is append-only (writes are strictly sequential) and requires only O(1) random disk seeks per read query; it also uses the OS page cache to be friendly to Erlang's garbage collector. An engine like this has been (independently) in development for Voldemort, but it's not yet ready for prime time.
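To illustrate the Bitcask model, here's a toy sketch of the idea (not Bitcask itself, which also records the key, a CRC and a timestamp per entry and compacts old segments): every write appends to a log, and an in-memory "keydir" maps each key to the offset of its latest value, so a read costs at most one disk seek.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;

    public class AppendOnlyStore {
        private final RandomAccessFile log;                          // single active log file
        private final Map<String, long[]> keydir = new HashMap<>();  // key -> {offset, length}

        public AppendOnlyStore(File path) throws IOException {
            this.log = new RandomAccessFile(path, "rw");
        }

        public synchronized void put(String key, byte[] value) throws IOException {
            log.seek(log.length());                    // writes are strictly sequential
            long offset = log.getFilePointer();
            log.write(value);                          // append; never update in place
            keydir.put(key, new long[] { offset, value.length });
        }

        public synchronized byte[] get(String key) throws IOException {
            long[] entry = keydir.get(key);
            if (entry == null) return null;
            log.seek(entry[0]);                        // the one random seek per read
            byte[] buf = new byte[(int) entry[1]];
            log.readFully(buf);
            return buf;
        }
    }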
Here are some things for you to consider:
* How much data do you have in relation to memory? To make a long story short: if all your data fits into memory then, at least with Voldemort, Cassandra and Hypertable (and probably HBase as well), you're going to be able to easily saturate commodity Ethernet with a modest-sized cluster.
* Pick one on the basis of the distribution model: do you care about "read your writes" consistency (i.e., if using Cassandra/Voldemort/Riak, are you going to need quorum reads, where R + W > N guarantees that any read quorum overlaps the latest write quorum?), do you need high availability, or are you okay with a single point of failure?
* Keep in mind that while many systems are very extensible as far as storage (and thus performance) and query models go (storage engines are pluggable and view support is coming soon to the mainline Voldemort branch), they are not pluggable as far as the distribution model goes: it would be difficult to centralize all metadata in Voldemort and make it a master-slave system with Paxos leader elections; it would be even more difficult to make HBase/Hypertable asymmetric and based on eventual consistency/quorums.
* When you're dealing with volumes of data larger than what fits into memory, you need to perform a benchmark based on a realistic simulation. The important thing you need is the traffic distribution your app actually receives, i.e., how many requests can be served out of the cache. Generally, Internet traffic (my apologies in advance if you're not building a web app; that is a big assumption on my part) follows a Zipfian (rather than uniform) probability distribution, which is supported by the YCSB (Yahoo Cloud Serving Benchmark) tool.
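Since you asked what I mean by Zipfian: the i-th most popular key is requested with probability proportional to 1/i, so a tiny set of hot keys absorbs most of the traffic, which is exactly what makes caching effective. A quick sketch to see the skew (exponent 1 matches YCSB's default; the key-space size here is arbitrary):

    import java.util.Random;

    public class ZipfSketch {
        public static void main(String[] args) {
            int n = 1_000_000;                              // number of keys
            double[] cdf = new double[n];
            double norm = 0;
            for (int i = 1; i <= n; i++) norm += 1.0 / i;   // harmonic normalizer
            double acc = 0;
            for (int i = 1; i <= n; i++) {
                acc += 1.0 / (i * norm);
                cdf[i - 1] = acc;
            }

            Random rnd = new Random(42);
            int samples = 1_000_000, hot = 0;
            for (int s = 0; s < samples; s++)
                if (sample(cdf, rnd.nextDouble()) < n / 100) // hottest 1% of keys
                    hot++;
            System.out.printf("%.1f%% of requests hit the hottest 1%% of keys%n",
                    100.0 * hot / samples);
        }

        // Invert the CDF by binary search: smallest index with cdf[index] >= u.
        static int sample(double[] cdf, double u) {
            int lo = 0, hi = cdf.length - 1;
            while (lo < hi) {
                int mid = (lo + hi) >>> 1;
                if (cdf[mid] < u) lo = mid + 1; else hi = mid;
            }
            return lo;
        }
    }

On my napkin this prints roughly two thirds of all requests going to the hottest 1% of keys, which is why the cacheable fraction of your traffic matters far more than raw data volume.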
Voldemort includes a slightly modified version of YCSB inside (see ./bin/performance-tool.sh for the YCSB-based tool and ./bin/remote-test.sh for the legacy home-brew tool; the two will eventually be merged into one), so you can do this. The mainline version of YCSB should support Cassandra and HBase out of the box (someone else pointed to it in another comment).
If you've got any questions regarding Voldemort, please feel free to email me (my contact information is in my profile), drop by #Voldemort on irc.freenode.org, or email the mailing list: http://groups.google.com/group/project-voldemort
What do you mean by quarterly? As in the fiscal quarter, or as in a quarter of the time? If you mean the former (almost certainly not the case, but I thought I'd ask), you may want to consider building your indexes in Hadoop and pushing them to Voldemort (this also works well on a daily or multiple-times-a-day basis):
Are lookups always by exact key, or by nearest key to a probe? Are keys consulted essentially at random, or in batches of consecutive ascending/descending accesses?
Right now the system I've conceived of has lookups by exact key, but the subsequent queries would be for data that is "near" the previous query. I may mitigate that by increasing the size of the data returned by the initial query, to ensure that the "next query" is already loaded into memory (roughly the bucketing idea sketched below).
The keys will always be consulted at random. That's kind of the problem.
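To make that "fetch a bigger chunk up front" idea concrete, it might look something like this wrapper around a K/V client; Store, getBucket and BUCKET_SIZE are hypothetical names for illustration, not any particular product's API:

    import java.util.Map;

    interface Store {                                    // stand-in for any K/V client
        Map<Long, byte[]> getBucket(long bucketKey);     // one remote lookup
    }

    class BucketedReader {
        static final long BUCKET_SIZE = 1024;            // records per bucket, tunable
        private final Store store;
        private Map<Long, byte[]> cached;                // last bucket fetched
        private long cachedBucket = -1;

        BucketedReader(Store store) { this.store = store; }

        byte[] get(long recordKey) {
            long bucket = recordKey / BUCKET_SIZE;       // nearby keys share a bucket
            if (bucket != cachedBucket) {                // one remote lookup per bucket
                cached = store.getBucket(bucket);
                cachedBucket = bucket;
            }
            return cached.get(recordKey);                // follow-up queries hit memory
        }
    }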
Fit one machine with five 160GB SSDs. Use anything; classic BerkeleyDB or BerkeleyDB-JE should do.
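For scale, the entire BerkeleyDB-JE read path is a handful of lines (a rough sketch; the environment directory and key below are placeholders):

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.LockMode;
    import com.sleepycat.je.OperationStatus;
    import java.io.File;
    import java.nio.charset.StandardCharsets;

    public class BdbExample {
        public static void main(String[] args) throws Exception {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(new File("/data/bdb"), envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            Database db = env.openDatabase(null, "store", dbConfig);

            DatabaseEntry key = new DatabaseEntry("some-key".getBytes(StandardCharsets.UTF_8));
            DatabaseEntry value = new DatabaseEntry();
            if (db.get(null, key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS)
                System.out.println(new String(value.getData(), StandardCharsets.UTF_8));

            db.close();
            env.close();
        }
    }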
Is the read performance suitable for the OP, given his workload and his hardware (i.e., the amount of RAM he can give to Cassandra, the speed of his disks)? I don't know, but the only way to find out is experimentally.