

Riffle: a high-performance write-once key/value storage engine for Clojure - prospero
http://blog.factual.com/how-factual-uses-persistent-storage-for-its-real-time-services

======
jdp
This is pretty similar to Sparkey[0] and bam[1]. Sparkey also grew out of
cdb's limitations. It supports block-level compression like Riffle does, and
is optimized for accepting bulk writes. Riffle's linear-time merge behavior,
lifted from Sorted String Tables, is a nice alternative to accepting writes at
runtime. bam is cool in that it takes a plain separated-values file as input
and builds an index file from a minimal perfect hash function over the input
file.

[0]: [https://github.com/spotify/sparkey](https://github.com/spotify/sparkey)
[1]:
[https://github.com/StefanKarpinski/bam](https://github.com/StefanKarpinski/bam)
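
To make the "data file plus separate index file" idea concrete, here's a toy sketch in C. It is deliberately much simpler than bam: bam builds a minimal perfect hash over the keys, whereas this just uses FNV-1a with linear probing, so only the general shape (hash a key, get a byte offset into the flat record file) carries over. All names here are made up for illustration.

```c
/* Toy "index over a flat record file": each slot maps a hashed key to
 * the byte offset of its record. Records are "key\tvalue\n" lines.
 * bam uses a minimal perfect hash instead of probing; this is only a
 * sketch of the general idea, not bam's actual construction. */
#include <stdint.h>
#include <string.h>

#define SLOTS 16                 /* power of two, > number of keys */
#define EMPTY UINT64_MAX         /* sentinel: slot holds no offset  */

static uint64_t fnv1a(const char *s) {          /* FNV-1a string hash */
    uint64_t h = 14695981039346656037ULL;
    for (; *s; s++) { h ^= (unsigned char)*s; h *= 1099511628211ULL; }
    return h;
}

typedef struct { uint64_t slot[SLOTS]; } Index; /* slot -> record offset */

static void index_put(Index *ix, const char *key, uint64_t off) {
    uint64_t i = fnv1a(key) & (SLOTS - 1);
    while (ix->slot[i] != EMPTY)                /* linear probing */
        i = (i + 1) & (SLOTS - 1);
    ix->slot[i] = off;
}

/* data: the flat file contents; returns a pointer at the value, or NULL */
static const char *index_get(const Index *ix, const char *data,
                             const char *key) {
    uint64_t i = fnv1a(key) & (SLOTS - 1);
    size_t klen = strlen(key);
    while (ix->slot[i] != EMPTY) {
        const char *rec = data + ix->slot[i];
        if (strncmp(rec, key, klen) == 0 && rec[klen] == '\t')
            return rec + klen + 1;
        i = (i + 1) & (SLOTS - 1);
    }
    return NULL;
}
```

Because the index is built once over an immutable file, there's no deletion or resizing to worry about, which is what makes the minimal-perfect-hash trick viable in the first place.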

~~~
prospero
There are a lot of variants on this design out there; I had seen 'bam' but
not the Spotify implementation. An additional constraint we had, which I
didn't allude to in the post, was avoiding JNI, which adds some nasty failure
modes for remote installations that can be very hard to debug. That meant any
C implementation was off-limits for us.

It's unfortunate that using the JVM means that some wheels need to be
reinvented, but those are the breaks, I guess.

------
fiatmoney
"While memory-mapping is used for the hashtable, values are read directly from
disk, decoupling our I/O throughput from how much memory is available."

Whether you're mmap'ing or using read(), you're hitting the page cache before
you hit disk, and potentially evicting the LRU page in the process. Glancing
through the source, it doesn't look like they're using actual "direct I/O"
(which, in order to be performant, would need its own caching layer).

That being the case, for lots of tiny reads & writes I'd expect mmap to be
superior to read() and write().
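
The point that both paths observe the same cached pages can be shown with a short sketch: read a range once with pread() and once through an mmap'd view of the same file, and compare. The helper below is hypothetical and trims error handling; the difference between the two paths is the syscall and the copy into a user buffer, not what data comes back or where it's cached.

```c
/* Both read paths land in the OS page cache: pread() copies cached
 * pages into a user buffer via a syscall, while a load through an
 * mmap'd region faults the same page in and reads it in place.
 * Sketch only; error handling is trimmed for brevity. */
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Read n bytes at offset off from fd via pread() and via mmap();
 * copies the bytes to out and returns 0 if both paths saw the same data. */
int compare_paths(int fd, off_t off, size_t n, char *out) {
    char buf[4096];
    if (n > sizeof buf || pread(fd, buf, n, off) != (ssize_t)n)
        return -1;                              /* syscall path */

    struct stat st;
    fstat(fd, &st);
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;
    int same = memcmp(buf, map + off, n) == 0;  /* in-place path */
    memcpy(out, buf, n);
    munmap(map, st.st_size);
    return same ? 0 : -1;
}
```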

~~~
prospero
A caching layer for random reads where the dataset is 10x larger than memory
isn't hugely useful. If you get a hit, great, but you can't count on it.

For memory mapping to make sense you need to be fetching big chunks of data,
whereas read() pulls in a page's worth. For the data size and read pattern
described in the post, the latter is much more desirable.

~~~
fiatmoney
It's the opposite. As I said, read() and a read from an mmap'd array both hit
the OS page cache first, and will bring in ~4K of data on a miss; read() also
has the overhead of a system call. For tiny reads/writes the usual advice is
to use mmap. It would be different if you were doing "direct I/O" and
bypassing the OS page cache because you had your own caching layer, but I
don't think they do.

~~~
prospero
Whose advice? Check out RocksDB's front page:
[http://rocksdb.org](http://rocksdb.org). Empirically, what you're saying
isn't true in my experience; rather, mmap should be used when there's decent
coherency w.r.t. the available memory. Without knowing what you're basing
your belief on, I can't really address it.

~~~
fiatmoney
It looks like they're making some claims about OS-level bottlenecks,
specifically in the virtual memory subsystem. That's something I'd like to
look into; all I can find is that particular quote, with no explanation of
where they think the bottleneck actually lies. The experience of, e.g., the
SQLite folks seems to be different.

[https://www.sqlite.org/mmap.html](https://www.sqlite.org/mmap.html)

