
Riak's Bitcask - A Log-Structured Hash Table For Fast Key/Value Data - yarapavan
http://highscalability.com/blog/2011/1/10/riaks-bitcask-a-log-structured-hash-table-for-fast-keyvalue.html
======
rozim
To me this is all nice and straightforward except for the merge, i.e. when
does it happen? Magically in the background, or only at restart?

In either case, the question is then how this affects the responsiveness of
the system.

This paper <http://downloads.basho.com/papers/bitcask-intro.pdf> is a nice
easy read with another level of detail over the linked-to article.

I suspect they could consider:

- varint encoding of the key and value sizes on disk
- aligning some writes to disk block boundaries to avoid spanning 2 blocks

though both probably only have minor gains.
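
To make the varint suggestion concrete, here is a minimal sketch of a LEB128-style varint (7 payload bits per byte, high bit meaning "more bytes follow"). The function names are made up for illustration, and this is not how Bitcask encodes its size fields today; it only shows what the proposed change would look like for non-negative sizes.

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


small = encode_varint(10)     # a 10-byte key needs 1 byte for its size
medium = encode_varint(300)   # a 300-byte value needs 2 bytes
```

Sizes below 128 take a single byte, so the saving per record is a few bytes at most, which fits the "minor gains" guess above.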

For more inspiration see Jeff Dean's comments on the SSTable data structure at
Google:
<http://osdi2006.blogspot.com/2006/10/paper-bigtable-distributed-storage.html>

~~~
metabrew
Riak has "windowed merges", i.e. you can schedule compaction at your off-peak
times. Or not at all, I guess, if you never delete or modify existing data.

~~~
siculars
the window merge is a new thing in riak 0.14, i believe. and you are correct
the merge is only a consideration based on your use case. if you are using
riak as a pure dump of immutable data you will not need to merge. if, however,
your use case consists of a number of edits you should consider a more
aggressive merge strategy. As all updates/deletes are appended to bitcask and
are actually writes, your dead bytes will start growing rapidly.
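
A toy model of why that happens, as a rough sketch: the log is a plain Python list, a `None` value stands in for a tombstone, and record headers are ignored. All names here are hypothetical and none of this is Riak or Bitcask code.

```python
from typing import Dict, List, Optional, Tuple

# One log record: (key, value). A value of None stands in for a tombstone.
Record = Tuple[str, Optional[bytes]]


def record_size(rec: Record) -> int:
    """Payload bytes of a record (headers are ignored in this sketch)."""
    key, value = rec
    return len(key.encode()) + (len(value) if value is not None else 0)


def live_index(log: List[Record]) -> Dict[str, int]:
    """Map each live key to the position of its newest record in the log."""
    index: Dict[str, int] = {}
    for pos, (key, value) in enumerate(log):
        if value is None:
            index.pop(key, None)  # delete: the key is no longer live
        else:
            index[key] = pos      # a newer write shadows any older one
    return index


def dead_bytes(log: List[Record]) -> int:
    """Bytes held by records that no live key points to."""
    live = set(live_index(log).values())
    return sum(record_size(r) for i, r in enumerate(log) if i not in live)


def merge(log: List[Record]) -> List[Record]:
    """Rewrite the log keeping only the newest record of each live key."""
    return [log[pos] for pos in sorted(live_index(log).values())]


log: List[Record] = []
log.append(("user:1", b"alice"))
log.append(("user:2", b"bob"))
log.append(("user:1", b"alicia"))  # update: appended, the old record is now dead
log.append(("user:2", None))       # delete: appended as a tombstone

before = dead_bytes(log)           # 26 of the 38 payload bytes are dead
compacted = merge(log)
after = dead_bytes(compacted)      # 0 after the merge
```

With an edit-heavy workload the dead share keeps growing until a merge runs, which is the case for a more aggressive merge schedule.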

------
snissn
"Under heavy access load we’ve already seen Bitcask do well. So far it has
only seen double-digit gigabyte volumes, but we’ll be testing it with more
soon."

~~~
bobf
So they've only tested heavy access load on a double-digit gigabyte volume?
That concerns me a bit, as I'm considering migrating from HBase to Riak, which
would be several TB of data.

~~~
seiji
Isn't it one storage space per vnode? If you have 256 vnodes in your cluster
and each vnode is at double digit gigabytes, you end up with between two and
25 terabytes available in the cluster.
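
The arithmetic behind that range, as a quick sketch. It assumes decimal units and treats "double digit" as 10 to 99 GB per vnode.

```python
VNODES = 256        # ring size used in the example above
GB_PER_TB = 1000    # decimal units; an assumption for this estimate

low_tb = VNODES * 10 / GB_PER_TB    # 10 GB per vnode, the smallest double-digit size
high_tb = VNODES * 99 / GB_PER_TB   # 99 GB per vnode, the largest double-digit size
print(f"{low_tb:.2f} TB to {high_tb:.2f} TB")  # prints: 2.56 TB to 25.34 TB
```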

~~~
siculars
yes, afaik, that is correct. atm, there is one bitcask "cask" that is opened
for each vnode. the number of vnodes (virtual nodes) each physical machine is
responsible for is a function of the ring size, aka total number of vnodes in
a cluster, and the number of physical machines in the cluster.
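
As a rough sketch of that relationship, assuming vnodes are spread as evenly as possible (Riak's actual claim algorithm is not modeled here, and `vnodes_per_machine` is a made-up helper):

```python
from typing import List


def vnodes_per_machine(ring_size: int, machines: int) -> List[int]:
    """Spread ring_size vnodes across machines as evenly as possible."""
    if machines <= 0:
        raise ValueError("need at least one machine")
    base, extra = divmod(ring_size, machines)
    # The first `extra` machines take one additional vnode each.
    return [base + 1 if i < extra else base for i in range(machines)]


layout = vnodes_per_machine(256, 5)  # [52, 51, 51, 51, 51]
```

Each of those vnodes would open its own cask, so a machine in this example holds 51 or 52 of them.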

------
xtacy
Related project: RAMCloud at Stanford.

<http://fiz.stanford.edu:8081/display/ramcloud/Home>

It's a _pure_ in-memory key-value store that aims to give the lowest-latency
access to data possible (~1 to 10 µs for small amounts of data).

~~~
roder
I wouldn't call it "related" or even "similar"… it's similar in that they're
both key/value stores, but there's no durability and RAMCloud stores both keys
& values in memory; whereas bitcask only stores keys in memory as an "index"
to the value on disk.
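
A minimal sketch of that split, assuming a single append-only data file. The `TinyCask` class and its record layout (two 4-byte sizes, then key, then value) are hypothetical and are not Bitcask's real on-disk format.

```python
import os
import struct
import tempfile

HEADER = struct.Struct(">II")  # key size, value size (hypothetical layout)


class TinyCask:
    """Append-only data file plus an in-memory index of key -> value location."""

    def __init__(self, path: str) -> None:
        self.f = open(path, "a+b")
        self.keydir = {}  # key -> (value offset, value size); values stay on disk

    def put(self, key: bytes, value: bytes) -> None:
        self.f.seek(0, os.SEEK_END)
        start = self.f.tell()
        self.f.write(HEADER.pack(len(key), len(value)) + key + value)
        self.f.flush()
        # Only the location of the value is kept in memory, never the value.
        self.keydir[key] = (start + HEADER.size + len(key), len(value))

    def get(self, key: bytes) -> bytes:
        offset, size = self.keydir[key]  # memory lookup, no disk access
        self.f.seek(offset)              # one seek
        return self.f.read(size)         # one read

    def close(self) -> None:
        self.f.close()


path = os.path.join(tempfile.mkdtemp(), "data.cask")
cask = TinyCask(path)
cask.put(b"k1", b"hello")
cask.put(b"k2", b"world")
cask.put(b"k1", b"hello again")  # the newer record shadows the old one
value = cask.get(b"k1")
other = cask.get(b"k2")
cask.close()
```

Memory use here grows with the number of keys, not with the size of the values, which is the contrast with a store that keeps both in RAM.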

~~~
strlen
Your operating system includes a page cache (in some cases it may have even
more, e.g. ZFS on Solaris with ARC). It can very effectively load these indexed
values into memory. Of course your own cache and direct I/O may be more
efficient if you're building a search index or a relational database, but for
a key/value store the page cache should be very effective.

