
Ask HN: Cassandra, HBase, Hypertable, Voldemort. Lowest read latency? - ghotli
I've got a great deal of information that I need to store in a key value format. I need access to that data as quickly as possible. Writes are only going to occur quarterly. Any thoughts?
======
helwr
you need to think in terms of reads/sec vs latency, not just latency

in other words 1) what is the throughput (load) you expect your data store to
sustain in reads/sec

2) what is the acceptable latency in msec for your application under expected
load (see 1)

and 3) how many simultaneous connections your db has to support

to figure out which data store suits you best in terms of capacity/performance
you have two choices:

1) run a benchmark on your hardware, ideally with your own client

2) run a benchmark on your hardware, ideally with your own client

i'd recommend starting with this:
<http://wiki.github.com/brianfrankcooper/YCSB/>

dont forget that the tools you mentioned are intended for storing terabytes of
data on hundreds/thousands of machines and their key benefit is scaling the
bulk throughput rather than optimizing performance

check out MongoDB, Redis, Tokyo Tyrant if you have up to 10-20GB, these are
superfast tools but limited to storage/IO/memory capacity of a single box

also look at some of these benchmarks (but running your own test is always
better)

<http://research.yahoo.com/files/ycsb.pdf>

[http://www.slideshare.net/schubertzhang/hbase-0200-performan...](http://www.slideshare.net/schubertzhang/hbase-0200-performance-
evaluation)

<http://code.google.com/p/redis/wiki/Benchmarks>

<http://capttofu.livejournal.com/23381.html>

<http://colinhowe.wordpress.com/2009/04/27/redis-vs-mysql/>

~~~
ghotli
I've benchmarked Tokyo Tyrant, MongoDB, and Redis for a smaller dataset
(~20gb). We ended up going with Tyrant because it stored our serialized data
more efficiently than the others. That index lives in memory on a big ec2
instance. The data I'm talking about will easily be around half a terabyte or
more and I need efficient random access within it. At least initially. I
assume that in the medium term this amount of data will grow quite a bit
larger. I really do just want to use something like Tokyo Tyrant, but the data
management overhead of sharding the datasets across enough high memory ec2
machines to get acceptable performance is prohibitive to going that route.
I've kind of been driven to the dynamo / bigtable approach if only because it
can store amounts of data in these sizes and can easily be ramped up by adding
more servers.

That said, I'm still in the analysis stage of where our architecture needs to
go next. It very well may be cost prohibitive to go with one of the
technologies I mentioned.

Thank you for the benchmark tools and guidelines. They're immediately
applicable to the analysis I'm performing this week.

------
petercooper
If "key value" is important, have you considered Redis? If things like
Cassandra and HBase are the Lisp or Haskell of the database world, Redis is
the assembler - it's crazy fast but you need to take care of your own
underlying data structures (beyond strings, lists, and two types of set).

~~~
bravura
I guess the main question when considering Redis is: Does the entire database
fit into memory?

~~~
ghotli
My initial thoughts were for a Redis style database. It certainly won't live
in memory. That's why I've been pushed towards a dynamo / bigtable approach.

~~~
petercooper
Though, as has been raised a few times, that is not a restriction in Redis
anymore.

------
nuggetwheat
For most of these systems, the answer will depend on how much hardware (disks
and RAM) you're throwing at the problem. The RAM-to-disk ratio is a big
factor. If your entire data set fits in memory, the latency will be measured
in microseconds. If none of your data fits in memory and each request requires
a disk seek, the average latency will be closer 10 milliseconds (assuming
there are enough disks to keep up with the requests). Take a look at average
latency section in the following benchmark comparing Hypertable with HBase:
<http://www.hypertable.com/pub/perfeval/test1/> Hypertable average latency
goes from about 1.5 milliseconds down to about .5 milliseconds. These numbers
are average, however, some requests hit disk and were close to 10
milliseconds, but many requests were served out of the cache. The read request
pattern also makes a big difference. Read patterns with locality (e.g. Zipfian
distribution) will have much a much better cache hit percentage. -Doug

~~~
ghotli
Thank you for the numbers about hypertable. In my exploration I've been
leaning towards hypertable but I haven't done any benchmarking of my own yet.

My read patters will probably have locality but I have not yet figured out how
to model the data in a key / value style to preserve the locality of the data.
Can you describe what you mean by Zipfian distribution?

~~~
smallpotatoes
An easy way to think about Zipfian distributions is something like "the most
popular 30% of your data set is queried 70% of the time". Its basically the
temporal locality property that a lot caching systems exploit. For more look
at <http://en.wikipedia.org/wiki/Zipf%27s_law> and www.icis.ntu.edu.sg/scs-
ijit/1204/1204_6.pdf

~~~
ghotli
That paper was great. Thanks.

------
strlen
Disclaimer: I'm a committer on Project Voldemort and while the default
read/write storage engine used by Voldemort is BerkeleyDB-JE -- which uses a
log-structured B+ tree -- it's very easy to get sub 1 ms read latency and
10,000-20,000 reads/sec on a single node, depending on the amount of
memory/hardware. To put it in another way, if you're using Voldemort you
probably don't need memcached. While BDB-JE read performance _does_ begin to
decline when your data:memory ratio goes in excess of ~5:1, but in one case
we've able to support a ratio as high as 10:1 while meeting all SLAs with
quorum reads/writes on a 6 node cluster. Other storage engines e.g., MySQL are
available and it's trivial to write more.

I'll make my bias fully known: I believe that for smaller clusters (and by
smaller, I mean <500-1000 nodes) the Dynamo model (used by Voldemort,
Cassandra and Riak) is more suited: no special nodes (you can use the same
hardware everywhere, same recovery procedures, same capacity planning
approach, same expansion procedures), vector clocks and quorums for
consistency, gossip for cluster state, much simpler multi-datacenter
operation. It's an easier model to implemented (thus less prone to bugs) and
is easier on operations (no need to maintain separate classes of machines as
spares, very easy to restore a failed machines or add a new one).

Nonetheless, I am not a distributed systems researcher and plenty of real ones
would have very serious reasons to disagree with me, depending on the specific
use cases: in theory, for read-heavy traffic the BigTable model, single
agreed-upon master for a specific token range (thus no need for quorum reads)
should yield very low latency and scale extremely well; data is partitioned
amongst tablet servers and the metadata master is used _only_ for assigning
tablets to tablet servers; as there's one true copy of the metadata, there's
also no need for gossip and programmers don't have to deal with vector clocks
(they can use high level abstractions like GQL on the AppEngine or the
transactional MegaStore). In practice, however, BigTable took multiple years
to build (as did GFS and Chubby); very little is also known about Spanner
which is also a multi-year effort. Keep in mind those are multiple years with
_literally_ the world's best engineers and world's best data: they can mine
logs from >1,000,000 machines to support any hypotheses they have.

With the Dynamo model, in addition to Cassandra and Voldemort, you should also
consider Riak, which has the excellent Bitcask storage engine available which
is append-only (writes are strictly sequential) _and_ requires only O(1)
random disk seeks for read queries; it also uses the OS page cache to be
friendly to Erlang's garbage collector. An engine like this has been
(independently) in development for Voldemort, but it's not yet ready for prime
time.

Here are some things for you to consider:

* How much data do you have in relation to memory? To make the long story short if all your data fits into memory, at least with Voldemort, Cassandra, Hypertable (and probably with HBase as well) you're going to be able to easily saturate commodity ethernet with a modest sized cluster. Pick one on the basis of the distribution model: do you care about "read your writes" consistency (i.e., if using Cassandra/Voldemort/Riak are you going to need to do quorum reads?), do you need high availability, or are you okay with a single point of failure?

* Keep in mind that while many systems are very extensible as far as storage (and thus performance) and query models go (storage engines are pluggable and view support is coming soon to the mainline Voldemort branch) they are _not_ pluggable as far as the distribution model goes: it would be difficult to centralize all metadata in Voldemort and make it a master-slave system with Paxos leader elections, it would be even more difficult to make HBase/Hypertable asymmetric and based on eventual consistency/quorums.

* When you're dealing with volumes of data larger than what can fit into memory, you need to perform a benchmark based on a realistic simulation. Important thing you need is the traffic distribution that your app actually receives i.e., how many requests can be served out of the cache? Generally Internet traffic (and my apologies in advance if you're not building a web app, that is a _huge_ assumption on my part) follows the Zipfian (rather than uniform) probability distribution which is supported by the YCSB (Yahoo Cloud Serving Benchmark) tool.

Voldemort includes a slightly modified version of YCSB and inside (see
./bin/performance-tool.sh for the YCSB-based tool; see ./bin/remote-test.sh
for the legacy home-brew tool, the two will be merged together into one) so
you can do this. YCSB (the mainline version) should support Cassandra and
HBase out of the box (someone else pointed to it in another comment).

If you've got any questions in regards to Voldemort, please feel free to email
me (my contact information is in my profile), drop by #Voldemort on
irc.freenode.org or email the mailing list:
<http://groups.google.com/group/project-voldemort>

P.S.,

What do you mean by quarterly? As in the fiscal quarter as in quarter of the
time? If you mean the former (almost certainly not the case, but thought I'd
ask), you may want to consider building your indexes in Hadoop and pushing
them to Voldemort (this also works well on a daily or multiple-times-a-day
basis):

[http://project-voldemort.com/blog/2009/06/building-a-1-tb-
da...](http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-
at-linkedin-with-hadoop-and-project-voldemort/)

~~~
ghotli
Thanks for the extensive answer. We buy a lot of data that comes in four times
a year for our interactive mapping engine. Initially I would only want to
update that data every quarter. That said, if I get a system like this up and
running, there are other datasets that would be ideal in a system like this
that would require quite a few writes every a day. Logs, cached images, etc.

------
kunley
If you don't need sophisticated distribution, you can consider Redis or
TokyoCabinet as alternatives.

------
Maro
You can also give Keyspace a try. In your use-case, you'd be running it in
single mode, in which case it's basically a high-speed network wrapper around
BerkeleyDB with a nice client library and docs.

<http://scalien.com/keyspace>

------
gojomo
How many records? How much uncompressed data in total?

Are lookups always by exact key, or by nearest key to a probe? Are keys
consulted essentially at random, or in batches of consecutive
ascending/descending accesses?

~~~
ghotli
Billions of records. I haven't yet determined which way to slice the dataset
to be most efficient. A fair amount of it is denormalized and my assumption is
that the initial dataset will be on the order of half a terabyte or more.

Right now the system I've conceived of has lookups by exact key, but the
subsequent queries would be for data that is "near" to the previous query. I
may mitigate that by increasing the size of data within the initial query to
ensure that the "next query" is already loaded into memory.

The keys will always be consulted at random. That's kind of the problem.

~~~
gojomo
Only half a terabyte? All these distributed options are overkill.

Fit one machine with 5 160GB SSDs. Use anything. Classic BerkeleyDB or
BerkeleyDB-JE should do.

~~~
ghotli
To begin with, yes. It also depends on how heavily I denormalize the data for
recall. It could be many times that down the road. I'm not looking to
implement something heavily distributed right now, but I know that day is
coming with my datasets and I want to ensure I know what needs to be built
when the time comes.

------
eclark
Cassandra's out. It's whole design is for work loads with lots of writes mixed
in.

~~~
strlen
Not really the case. There all sorts of optimizations in place (row caching,
Bloom filters) that give adequate read performance.

Is the read performance suitable for the OP, given his workload, his hardware
(i.e., the amount of RAM he can give to Cassandra, the speed of his disks)? I
don't know, but the only way to find out is experimentally.

------
manish
Voldemort?

------
Devilboy
Define 'great deal'? How many records are we talking about?

~~~
abeppu
and on how many/what sort of hosts would be you spreading those records over?

~~~
ghotli
I've answered the size question in a few other posts, but I'm thinking
billions of records, half a terabyte or more of denormalized data. Stored in
whatever ec2 instances I need to accomplish the job. If the required computing
power is more than my business can afford then it's really up to the
management to make the determination as to whether or not they really want to
go down this path.

