
Hammerspace: Persistent, Concurrent, Off-heap Storage - lennysan
http://nerds.airbnb.com/hammerspace-persistent-concurrent-off-heap-storage/
======
noelwelsh
Talk about an inaccurate title! The improvement is a combination of off-heap
storage and sharing storage amongst processes. I'm surprised they didn't look
at Redis for this problem.

These tricks have been used for a while in the JVM world. Here's a JVM
equivalent of Hammerspace: [http://www.mapdb.org/](http://www.mapdb.org/) And
here are some slides on off-heap optimisations in Cassandra:
[http://www.slideshare.net/jbellis/dealing-with-jvm-limitations-in-apache-cassandra-fosdem-2012](http://www.slideshare.net/jbellis/dealing-with-jvm-limitations-in-apache-cassandra-fosdem-2012)

On the JVM, GC time is usually only an issue when the heap gets over 2GB or so.
MRI's GC is not in the same league as the JVM's, but even so, 80MB should be
easily handled. As such I'm guessing the memory consumption of multiple
processes is causing the main issue, which would be solved if Ruby had real
threads. JRuby has real threads, and many other language runtimes do as well.
It seems like a lot of engineering effort is going into working around the
deficiencies of MRI, a problem that can be easily solved by switching to
something better.

~~~
jtai
We did benchmark local memcache, but accessing the strings over the network
(even locally) was much, much slower than sparkey. Redis would likely have been
similar to memcache.

The savings came not just from avoiding redundant GC over the same 80MB, but
also from not going through the churn involved with loading that data over the
network on process startup.

We have a large codebase that's written against MRI, so it's non-trivial to
just switch to something else.

~~~
awda
Maybe just add a shared-memory front-end to memcache?

------
joevandyk
I wonder if they would need this if they used a single ruby process with many
threads (instead of many ruby processes).

Their problems are mainly a result of needing to access 80 megabytes of slowly
changing translation data. Since they run many ruby processes and have memory
growth issues, this translation data was taking a while to load.

If they had a single stable ruby process running on each box, possibly they
wouldn't have had these issues.

~~~
MrBuddyCasino
Yeah, the whole time I had a little devil on my shoulder shouting "my jvm laughs
at your mri". That said, I like the approach of using what the OS offers -
essentially atomic file system constructs and the cache. When you control your
platform, things can go faster by cutting away those abstraction layers.

------
jblow
These guys never heard of shared memory, apparently?

Does Ruby not provide a facility to use shared memory? I guess you don't get
it by default in a GC'd language because the GC thinks it owns the world.

~~~
awda
Durable shared memory -- like, a (local) database?! I'm kind of amazed at the
degree of wheel-reinventing going on here.

~~~
ars
You don't even need a database. Just a file that you mmap. You can even use
fixed length strings in the file to make it really fast.

Basically just take your hash structure and write it to disk.
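
A minimal sketch in Ruby, assuming you can map keys to integer indices (plain
seek rather than mmap, same idea; the page cache keeps hot records in memory):

    RECORD_SIZE = 256  # fixed length per value, padded with NULs
                       # (values assumed <= RECORD_SIZE bytes)

    def write_records(path, values)
      File.open(path, "wb") do |f|
        values.each { |v| f.write(v.ljust(RECORD_SIZE, "\0")) }
      end
    end

    def read_record(path, index)
      File.open(path, "rb") do |f|
        f.seek(index * RECORD_SIZE)   # offset is pure arithmetic, no index scan
        f.read(RECORD_SIZE).sub(/\0+\z/, "")  # strip the padding
      end
    end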

~~~
krenoten
This is an excellent use case for djb's cdb, as updates are infrequent and
stale data is tolerable. It is an extremely fast constant database that, when
hot, will sit snugly in the page cache.

~~~
jbooth
Even if your data doesn't all fit in page cache with cdb, the 8 bytes per
entry (hashcode & offset) of overhead in the main hashtable absolutely will,
so even if you're too big for memory, you get single-seek-per-lookup.

The only problem with it is that the 32-bit offsets mean a 4GB max. Not too
hard to fork it for 64-bit offsets, though.
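
If you're curious, the hash/slot math is easy to sketch in Ruby (not a full
reader; table_length comes from the file header):

    # djb's cdb hash: h starts at 5381, then h = (h * 33) XOR byte,
    # truncated to 32 bits
    def cdb_hash(key)
      key.each_byte.inject(5381) { |h, b| (((h << 5) + h) ^ b) & 0xffffffff }
    end

    h = cdb_hash("some.translation.key")  # hypothetical key
    table = h % 256   # picks one of the 256 subtables
    # slot within that subtable, given its length from the header:
    # slot = (h / 256) % table_length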

~~~
jtai
We ended up using sparkey as the first backend for hammerspace, but
hammerspace was written to support multiple backends. We benchmarked both cdb
and sparkey, and their performance was very similar for our use case. At under
100MB of data, we weren't concerned about the 4GB limitation or about the data
not fitting in cache. I don't think sparkey has the 4GB limitation, and someone
has already forked cdb to support 64-bit offsets:
[https://github.com/pcarrier/cdb64](https://github.com/pcarrier/cdb64)

------
babs474
Man, there is a lot of negative snark at the top of this thread.

I'm not sure if this system is a good idea or not but I wish some commenters
would spend more time comparing their proposed solutions (shared mem, local
db, memmap...) to Hammerspace rather than contentless dismissal.

~~~
jtai
Comparison to shared memory or mmap'd files -- by using files we let the
filesystem manage the cache, so we don't have to be concerned with the data
growing and causing memory pressure. We also don't have to worry about how to
store the data in a format friendly to fast retrieval, since this is provided
by sparkey.

Comparison to local dbs like cdb, sparkey, mapdb, gettext files etc. --
hammerspace is a gem that uses sparkey under the hood, so these solutions are
more or less one and the same. The difference is that hammerspace exposes a
ruby hash-like API to make integration with existing applications easier. It
also provides concurrent writer support, which many local dbs don't do.
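
To give a feel for the hash-like API, usage looks something like this (a
sketch from memory; see the README for the real details):

    require 'hammerspace'

    h = Hammerspace.new("/tmp/hammerspace")   # directory for the backing files
    h["greeting.en"] = "Hello"                # hypothetical key
    h["greeting.en"]                          #=> "Hello"
    h.close                                   # flush so other processes see it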

Whether it's a good idea or not is for you to judge -- we've open sourced the
gem in hopes that it will be useful to someone, just as sparkey and gnista
were useful to us.

~~~
babs474
Sounds reasonable to me, especially the memory pressure thing.

It seems like some of the criticism in this thread comes from people who
haven't bothered to understand what you've done and summarily dismiss it as
stupid. That attitude has been bothering me lately on HN, or maybe it's
programmers in general.

Anyway, thanks for the interesting contribution.

------
georgemcbay
Original HN thread topic ("How Airbnb Improved Response Time by 17% By Moving
Objects From Memory To Disk") is misleading compared to actual article
contents, but speaking to the title rather than the article itself, I do find
it pretty common for many developers to blanket-assume that memory-based
caching is always the way to go, because, well, memory is fast and disks are
slow.

This sort of thinking ignores the fact that filesystems already have their own
(often very well-tuned) caching systems, and in some cases (e.g. sendfile(2) in
Linux) the kernel can do zero-copy writes from files to the network that
(along with decent fs caching) will easily outperform app-level memory-
caching. Of course, this only applies for data that will remain relatively
static, but often your best option is to mostly get out of the way and let the
OS do the heavy lifting unless you've measured actual loads and are sure your
solution is better.
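
In Ruby, for instance, IO.copy_stream will use sendfile(2) where the platform
supports it, so serving a file can bypass userspace entirely (sketch,
hypothetical path):

    require 'socket'

    server = TCPServer.new(8080)
    loop do
      client = server.accept
      # file -> socket: the kernel can move pages straight from the
      # page cache onto the wire, with no copy into the Ruby process
      File.open("static/index.html", "rb") { |f| IO.copy_stream(f, client) }
      client.close
    end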

------
joshwa
Armchair quarterbacking:

* Dedicate ruby processes to a particular subset of locales

* Parallelize (or batch) your memcache queries (see the sketch after this list)

* Break up locale files into MRU/LRU strings to reduce size

* Denormalize locales (in memory, cache, whatever) into single values for most common pages. (use with MRU/LRU above)
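
On the memcache point: dalli can at least batch the lookups into a single
round trip with get_multi (sketch; made-up keys):

    require 'dalli'

    cache = Dalli::Client.new('localhost:11211')
    keys = ["t.en.home.title", "t.en.home.body"]
    translations = cache.get_multi(*keys)  # => { key => value }, one round trip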

As an aside, I still don't understand how process->kernelspace driver->platter
is faster than process->kernelspace socket->process->RAM? Especially for
random access patterns. I suspect a memcache misconfiguration?

~~~
jtai
Good points!

We did initially try dedicating ruby processes to particular locales -- when
we saw a big improvement we knew we were on the right track. Doing so
permanently would be more difficult. To "follow the sun" we would need to
shift capacity to Europe and Asia during some hours and back to the US in
others.

Parallelizing memcache queries is difficult because we don't know ahead of
time what translations will be required to render a page.

We /are/ only working with the most recently used strings. Strings not
accessed in the last 4 days are not loaded.

I'm not sure what "denormalize locales" means exactly.

Sparkey is fast because the files end up in the filesystem cache and most of
the work is done in C. Going through the dalli gem to grab the translations
out of memcache causes a lot of temporary ruby objects to be created.

~~~
msandford
I read "denormalize locales" as:

1. Have a database of rows which are, say: translation_id, locale_id,
translation_text

2. That is really a 2-D array: translation[translation_id][locale_id] =
translation_text

3. Reshuffle to translation[locale_id][translation_id] = translation_text
(note the swapping of the indexes)

4. Generate a map of page_url => [array of translation IDs]. You can do this
because the set of translation_ids for a given page doesn't change any
faster than the translation_ids themselves.

5. Create an in-memory object of all the specific translation_ids for a
specific page in a specific locale

For an implementation of 5, I think it'd be preferable to have one giant
string containing all the translation text, plus a list/array of indices that
tell you where in that one big string to pull your substring from. That might
well be more memory-efficient than many individual strings (this is what they
did to make HBase about 5x faster), and it'll definitely give you some
cache-locality goodness.
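
A sketch of 5 in Ruby (made-up data):

    # One big string plus an index of [offset, length] pairs per translation id.
    translations = { "home.title" => "Welcome", "home.cta" => "Book now" }

    blob  = String.new
    index = {}  # translation_id => [byte offset, byte length]
    translations.each do |id, text|
      index[id] = [blob.bytesize, text.bytesize]
      blob << text
    end

    def lookup(blob, index, id)
      offset, length = index[id]
      blob.byteslice(offset, length)
    end

    lookup(blob, index, "home.cta")  #=> "Book now"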

------
toddh
You can dynamically load/unload shared libraries so the data is only shared
once between all processes. A win is you can also optimize the memory layout
of the translation tables (can be in C), for which a hash is probably not
optimal. This can all be automated in the build process using the database as
a source. During software upgrades, processes must be aware enough to know when
to reload. And since all shared memory schemes use virtual memory you still
have potential latency issues because of paging. Not sure if a .so can be
pinned. Another win is it is read only so you don't have to worry about
corruption.
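
From the Ruby side you could dlopen the rebuilt library with Fiddle;
hypothetically, if translations.so exported const char *lookup(const char *key):

    require 'fiddle'

    lib = Fiddle.dlopen("./translations.so")   # hypothetical library
    lookup = Fiddle::Function.new(
      lib["lookup"],
      [Fiddle::TYPE_VOIDP],   # key
      Fiddle::TYPE_VOIDP      # returned C string
    )

    ptr = lookup.call("greeting.en")
    puts ptr.to_s   # Fiddle::Pointer#to_s reads the NUL-terminated string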

~~~
jtai
Our community of translators updates translations all the time. Updates are
immediately reflected so that our translators have instant feedback, and we'd
like our users to see updated translations sooner than the next deploy. So
that rules out a purely build-time solution for us.

~~~
toddh
I guess build time is a bit of a misnomer. Any time changes occur, perhaps
aggregated over an hour, the files are rebuilt and redistributed. Since
the API doesn't change it doesn't require changes in the process code, just
the shared lib. But I understand, it was just a thought that fit the given
criteria.

------
jcampbell1
Sounds like they re-invented .mo files from gettext.

[https://www.gnu.org/software/gettext/manual/html_node/MO-Files.html](https://www.gnu.org/software/gettext/manual/html_node/MO-Files.html)

------
northisup
The article does not address why this outsourced heap is better than other
outsourced heaps.

------
ashayh
I don't get it.

Did they actually benchmark all the possible options, like shared memory,
SQLite, or MySQL's MEMORY engine (periodically backed)?

They say memcache (or redis) would have been slower because of network latency,
even over localhost. But did they benchmark it?

------
rnbrady
Pretty graphs! Drawn using?

~~~
Mindless2112
Kibana [1] if the image file name is accurate.

[1]
[http://www.elasticsearch.org/overview/kibana/](http://www.elasticsearch.org/overview/kibana/)

~~~
rnbrady
Doh. Thanks!

------
gustaf
Awesome work!

