Hammerspace: Persistent, Concurrent, Off-heap Storage (airbnb.com)
73 points by lennysan on Jan 7, 2014 | 36 comments

Talk about an inaccurate title! The improvement is a combination of off-heap storage and sharing storage amongst processes. I'm surprised they didn't look at Redis for this problem.

These tricks have been used for a while in the JVM world. Here's a JVM equivalent of Hammerspace: http://www.mapdb.org/ And here's some slides concerning off-heap optimisations in Cassandra: http://www.slideshare.net/jbellis/dealing-with-jvm-limitatio...

On the JVM, GC time is usually only an issue once the heap gets over 2GB or so. MRI's GC is not in the same league as the JVM's, but even so, 80MB should be easily handled. As such I'm guessing the memory consumption of multiple processes is causing the main issue, which would be solved if Ruby had real threads. JRuby has real threads, and so do many other language runtimes. It seems like a lot of engineering effort is going into working around the deficiencies of MRI, a problem that can be easily solved by switching to something better.

We did benchmark local memcache, but accessing the strings over the network (even locally) was much much slower than sparkey. Redis would likely have been similar to memcache.

The savings came not just from avoiding redundant GC over the same 80MB, but also from not going through the churn involved with loading that data over the network on process startup.

We have a large codebase that's written against MRI, so it's non-trivial to just switch to something else.

Maybe just add a shared-memory front-end to memcache?

I wonder if they would need this if they used a single ruby process with many threads (instead of many ruby processes).

Their problems are mainly a result of needing to access 80 megabytes of slowly changing translation data. Since they run many ruby processes and have memory growth issues, this translation data was taking a while to load.

If they had a single stable ruby process running on each box, possibly they wouldn't have had these issues.

Yeah, the whole time I had a little devil on my shoulder shouting "my jvm laughs at your mri". That said, I like the approach of using what the OS offers -- essentially atomic file system constructs and the cache. When you control your platform, things can go faster by cutting away those abstraction layers.

These guys never heard of shared memory, apparently?

Does Ruby not provide a facility to use shared memory? I guess you don't get it by default in a GC'd language because the GC thinks it owns the world.

Durable shared memory -- like, a (local) database?! I'm kind of amazed at the degree of wheel-reinventing going on here.

The scary thing is how proud they are of their system; it has a cute name and everything.

If the person who CTO'd that worked for me, they would not be CTO any more.

You don't even need a database. Just a file that you memmap. You can even use fixed length strings in the file to make it really fast.

Basically just take your hash structure and write it to disk.
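A minimal Ruby sketch of that idea (record sizes and keys here are invented for illustration; real constant databases like cdb and sparkey use a hash index rather than fixed slots):

```ruby
require "tempfile"

# Illustrative sketch (not Airbnb's code): serialize a hash to a file as
# fixed-length records so any entry can be read back with a single seek.
KEY_LEN = 32
VAL_LEN = 64
RECORD_LEN = KEY_LEN + VAL_LEN

def write_records(path, hash)
  File.open(path, "wb") do |f|
    hash.sort.each do |key, value|            # sorted so records have stable positions
      f.write(key.ljust(KEY_LEN, "\0")[0, KEY_LEN])
      f.write(value.ljust(VAL_LEN, "\0")[0, VAL_LEN])
    end
  end
end

def read_record(path, index)
  File.open(path, "rb") do |f|
    f.seek(index * RECORD_LEN)                # O(1) jump to the record
    key = f.read(KEY_LEN).delete("\0")
    value = f.read(VAL_LEN).delete("\0")
    [key, value]
  end
end

file = Tempfile.new("strings")
write_records(file.path, "en.greeting" => "Hello", "fr.greeting" => "Bonjour")
puts read_record(file.path, 1).inspect   # ["fr.greeting", "Bonjour"]
```

Repeated reads hit the page cache, so after warm-up this behaves much like reading from memory.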

This is what we did, except leaning on file system cache instead of explicit mmapping.

djb's cdb is an excellent fit for this use case, as updates are infrequent and stale data is tolerable. It's an extremely fast constant database that, when hot, will sit snugly in the page cache.

We used sparkey for hammerspace, which is very similar to cdb. cdb was actually the first thing we evaluated, and it validated parts of our approach.

Even if your data doesn't all fit in page cache with cdb, the 8 bytes per entry (hashcode & offset) of overhead in the main hashtable absolutely will, so even if you're too big for memory, you get single-seek-per-lookup.
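That overhead arithmetic is easy to check with assumed numbers (the average entry size below is a guess, not a figure from the article):

```ruby
# Back-of-envelope check of cdb's index overhead for ~80MB of data
data_bytes     = 80 * 1024 * 1024        # total translation data
avg_entry_size = 100                     # assumed bytes per key+value
entries        = data_bytes / avg_entry_size
index_bytes    = entries * 8             # 8 bytes (hashcode + offset) per entry
puts "index: %.1f MB" % (index_bytes / (1024.0 * 1024))   # ~6.4 MB
```

Even with much smaller entries the index stays tiny relative to the data, which is why it comfortably fits in page cache.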

The only problem with it is the 32bit offsets mean a 4GB max. Not too hard to fork it for 64bit offsets, though.

We ended up using sparkey as the first backend for hammerspace, but hammerspace was written to support multiple backends. We benchmarked both cdb and sparkey, and their performance was very similar for our use case. At under 100MB of data, we weren't concerned about the 4GB limitation or about the data not fitting in cache. I don't think sparkey has the 4GB limitation, and someone has already forked cdb to support 64-bit offsets: https://github.com/pcarrier/cdb64

They're Rubyists, what did you expect?

(other than misogynistic and crude library names) (shared memory -> require 'gangbang' or somesuch)

I came here to say the same thing. This is EXACTLY what shared memory is for!

If they are going to write a whole library for ruby, why not just write one to talk to shared memory?

Man there is a lot of negative snark at the top of this thread.

I'm not sure if this system is a good idea or not but I wish some commenters would spend more time comparing their proposed solutions (shared mem, local db, memmap...) to Hammerspace rather than contentless dismissal.

Comparison to shared memory or mmap'd files -- by using files we let the filesystem manage the cache, so we don't have to be concerned with the data growing and causing memory pressure. We also don't have to worry about how to store the data in a format friendly to fast retrieval, since this is provided by sparkey.

Comparison to local dbs like cdb, sparkey, mapdb, gettext files etc. -- hammerspace is a gem that uses sparkey under the hood, so these solutions are more or less one and the same. The difference is that hammerspace exposes a ruby hash-like API to make integration with existing applications easier. It also provides concurrent writer support, which many local dbs don't do.

Whether it's a good idea or not is for you to judge -- we've open sourced the gem in hopes that it will be useful to someone, just as sparkey and gnista were useful to us.

Sounds reasonable to me, especially the memory pressure thing.

It seems like some of the criticism in this thread comes from people who haven't bothered to understand what you've done and summarily dismiss it as stupid. That attitude has been bothering me lately on HN, or maybe it's programmers in general.

Anyway, thanks for the interesting contribution.

The original HN thread topic ("How Airbnb Improved Response Time by 17% By Moving Objects From Memory To Disk") is misleading compared to the actual article contents. But speaking to the topic rather than the article, I find it pretty common for developers to blanket-assume that memory-based caching is always the way to go, because, well, memory is fast and disks are slow.

This sort of thinking ignores the fact that filesystems already have their own (often very well-tuned) caching systems and in some cases (eg. sendfile(2) in Linux) the kernel can do zero-copy writes from files to the network that (along with decent fs caching) will easily outperform app-level memory-caching. Of course, this only applies for data that will remain relatively static, but often your best option is to mostly get out of the way and let the OS do the heavy lifting unless you've measured actual loads and are sure your solution is better.

Armchair quarterbacking:

* Dedicate ruby processes to a particular subset of locales

* Parallelize your memcache queries

* Break up locale files into MRU/LRU strings to reduce size

* Denormalize locales (in memory, cache, whatever) into single values for most common pages. (use with MRU/LRU above)

As an aside, still don't understand how process->kernelspace driver->platter is faster than process->kernelspace socket->process->RAM? Especially for random access patterns. I suspect a memcache misconfiguration?

Good points!

We did initially try dedicating ruby processes to particular locales -- when we saw a big improvement we knew we were on the right track. Doing so permanently would be more difficult. To "follow the sun" we would need to shift capacity to Europe and Asia during some hours and back to the US in others.

Parallelizing memcache queries is difficult because we don't know ahead of time what translations will be required to render a page.

We /are/ only working with the most recently used strings. Strings not accessed in the last 4 days are not loaded.

I'm not sure what "denormalize locales" means exactly.

Sparkey is fast because the files end up in the filesystem cache and most of the work is done in C. Going through the dalli gem to grab the translations out of memcache causes a lot of temporary ruby objects to be created.

How I read "denormalize locales" as:

1. Have a database of rows which are say: translation_id, locale_id, translation_text

2. That is really a 2-D array, translation[translation_id][locale_id] = translation_text

3. Reshuffle to translation[locale_id][translation_id] = translation_text (note the swapping of the indexes)

4. Generate a map of page_url => [array of translation IDs]. You can do this because the set of translation_ids on a given page doesn't change any faster than the translation_ids themselves.

5. Create an in-memory object of all the specific translation_ids for a specific page in a specific locale
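Steps 1-3 above can be sketched in Ruby (the sample rows are invented):

```ruby
# Illustrative sketch: reshuffle (translation_id, locale_id, text) rows into a
# locale-major structure so rendering a page in one locale touches one sub-hash.
rows = [
  [1, "en", "Hello"], [1, "fr", "Bonjour"],
  [2, "en", "Bye"],   [2, "fr", "Au revoir"],
]

by_locale = Hash.new { |h, k| h[k] = {} }
rows.each { |tid, loc, text| by_locale[loc][tid] = text }

puts by_locale["fr"][1]   # "Bonjour"
```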

For an implementation of 5 I think it'd be preferable to have one giant string that contains all the translation text in it and then a list/array of indices that you use to decide where in that one big string you pull your substring from. That might well be more memory efficient than many individual substrings (this is what they did to make HBase about 5x faster) and it'll definitely give you some cache locality goodness.
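The one-giant-string layout might look like this in Ruby (the translations are invented):

```ruby
# Illustrative sketch: pack all translations for one page+locale into a single
# string, keeping only (offset, length) pairs per ID. One big string is cheaper
# for the GC than thousands of small ones, and lookups get cache locality.
texts = ["Welcome", "Sign up", "Book now"]   # hypothetical strings for a page

blob = +""
index = texts.map do |t|
  entry = [blob.bytesize, t.bytesize]        # [offset, length] into the blob
  blob << t
  entry
end

def lookup(blob, index, id)
  offset, length = index[id]
  blob.byteslice(offset, length)
end

puts lookup(blob, index, 2)   # "Book now"
```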

As an aside, still don't understand how process->kernelspace driver->platter is faster than process->kernelspace socket->process->RAM? Especially for random access patterns. I suspect a memcache misconfiguration?

The point is to use the file system cache, so in the common case you're not hitting platter. I'm assuming all these machines do is serve tons of web pages, so the common case will be VERY common and you will almost never hit platter (e.g. only on a deployment of new strings).

My intuition says that memcache is slower because of the extra process switching (and potential context switching). With their solution, you go from Ruby -> kernel in the common case. With the memcache solution, you have to go from Ruby -> kernel -> back to user space for memcache, EVERY time.

Also as far as I remember, using a local TCP socket isn't as fast as using a Unix socket or pipe. The network stack actually does a lot of stuff.

I like their solution a lot better. Fewer moving parts. Let the kernel manage the file system cache. The kernel knows how much physical memory you have and will aggressively use it. Memcache has no idea (AFAIK), and if you are too aggressive you could end up with page faults in the memcache process anyway.

The problem with using the kernel-managed fs cache is that you don't control it. If the kernel decides it wants to reclaim that memory for some reason, well, guess what, you're going to the spinny metal bits.

Two memcached points: 1. You can run memcached over a unix socket. 2. You can have it use locked memory so it will never page fault.

As others have noted, mmap is another option.

I don't really see your point, because the kernel does plenty of other things on your behalf.

You don't control the process switching to memcached either. It's possible memcached will get descheduled, its data removed from L* caches, and your web server process will have to wait for it to be run again. With the FS cache solution you don't have that issue.

You could tune it (e.g. try to pin memcache to a dedicated core and pin web servers to other cores), but there are settings to tune FS cache behavior as well.

Same with mmap. The kernel can swap pages back to disk in that case too. It seems to boil down to the same thing, so not sure why people are so negative about using the file system. I bet you can write a test comparing serving data via mmap() vs via the file system and they will behave nearly identically.

You can dynamically load/unload shared libraries so the data is only shared once between all processes. A win is you can also optimize the memory layout of the translation tables (can be in C), for which a hash is probably not optimal. This can all be automated in the build process using the database as a source. During software upgrades processes must be aware enough to know when to reload. And since all shared memory schemes use virtual memory you still have potential latency issues because of paging. Not sure if a .so can be pinned. Another win is it is read only so you don't have to worry about corruption.

Our community of translators update translations all the time. Updates are immediately reflected so that our translators have instant feedback, and we'd like our users to see the updated translations sooner than every time we deploy too. So that rules out a purely build-time solution for us.

I guess build time is a bit of a misnomer. Any time changes occur, perhaps aggregated for an hour, the files are rebuilt and redistributed. Since the API doesn't change it doesn't require changes in the process code, just the shared lib. But I understand, it was just a thought that fit the given criteria.

Sounds like they re-invented .mo files from gettext.


The article does not address why this outsourced heap is better than other outsourced heaps.

I don't get it.

Did they actually benchmark all possible options, like shared memory, SQLite, or the MySQL memory engine (periodically backed)?

They say memcache (or redis) would have been slower because of network latency even over localhost. But did they benchmark it?

Pretty graphs! Drawn using?

Kibana [1] if the image file name is accurate.

[1] http://www.elasticsearch.org/overview/kibana/

Doh. Thanks!

Awesome work!
