These tricks have been used for a while in the JVM world. Here's a JVM equivalent of Hammerspace: http://www.mapdb.org/ And here's some slides concerning off-heap optimisations in Cassandra: http://www.slideshare.net/jbellis/dealing-with-jvm-limitatio...
On the JVM, GC time is usually only an issue once the heap gets over 2GB or so. MRI's GC is not in the same league as the JVM's, but even so, 80MB should be easy to handle. So I'm guessing the memory consumption of multiple processes is the main issue, which would be solved if Ruby had real threads. JRuby has real threads, as do many other language runtimes. It seems like a lot of engineering effort is going into working around the deficiencies of MRI, a problem that could be solved simply by switching to something better.
The savings came not just from avoiding redundant GC over the same 80MB, but also from not going through the churn involved with loading that data over the network on process startup.
We have a large codebase that's written against MRI, so it's non-trivial to just switch to something else.
Their problems mainly stem from needing to access 80 megabytes of slowly changing translation data. Since they run many Ruby processes and have memory-growth issues, each process was taking a while to load this translation data.
If they had a single stable ruby process running on each box, possibly they wouldn't have had these issues.
Does Ruby not provide a facility to use shared memory? I guess you don't get it by default in a GC'd language because the GC thinks it owns the world.
If the person who CTO'd that worked for me, they would not be CTO any more.
Basically just take your hash structure and write it to disk.
The only problem with it is that the 32-bit offsets mean a 4GB max. Not too hard to fork it for 64-bit offsets, though.
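To make the offset limit concrete, here's a toy on-disk hash format (not the library's actual layout, just a sketch of the idea): an index of key/offset/length entries followed by a data region, with offsets packed as 32-bit little-endian (`'V'`). That packing is exactly what caps the file at 4GB; swapping in `'Q<'` (64-bit) lifts the limit.

```ruby
# Toy on-disk hash: a count, then [keylen, key, offset, length] index
# entries, then the concatenated value data. Offsets are 32-bit ('V'),
# which is why the data region can't exceed 4GB.
def write_table(path, hash)
  File.open(path, 'wb') do |f|
    index = []
    offset = 0
    data = hash.map do |k, v|
      index << [k, offset, v.bytesize]
      offset += v.bytesize
      v
    end.join
    f.write([hash.size].pack('V'))
    index.each do |k, off, len|
      raise 'offset exceeds 32 bits' if off > 0xFFFFFFFF
      f.write([k.bytesize].pack('V') + k + [off, len].pack('VV'))
    end
    f.write(data)
  end
end

def read_table(path)
  File.open(path, 'rb') do |f|
    count = f.read(4).unpack1('V')
    index = count.times.map do
      klen = f.read(4).unpack1('V')
      key  = f.read(klen)
      off, len = f.read(8).unpack('VV')
      [key, off, len]
    end
    base = f.pos
    index.to_h { |k, off, len| f.seek(base + off); [k, f.read(len)] }
  end
end
```

A 64-bit fork would change `pack('VV')`/`unpack('VV')` to `pack('Q<Q<')`/`unpack('Q<Q<')` and drop the overflow check.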
(other than misogynistic and crude library names)
(shared memory -> require 'gangbang' or somesuch)
If they are going to write a whole library for ruby, why not just write one to talk to shared memory?
I'm not sure if this system is a good idea or not, but I wish some commenters would spend more time comparing their proposed solutions (shared mem, local db, memmap...) to Hammerspace rather than offering contentless dismissals.
Comparison to local dbs like cdb, sparkey, mapdb, gettext files etc. -- hammerspace is a gem that uses sparkey under the hood, so these solutions are more or less one and the same. The difference is that hammerspace exposes a ruby hash-like API to make integration with existing applications easier. It also provides concurrent writer support, which many local dbs don't do.
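The hash-like interface is the point: call sites written against a plain Hash don't have to change. Here's a toy illustration of that idea backed by Ruby's stdlib PStore (hammerspace itself sits on sparkey, not PStore; this only shows the shape of the API):

```ruby
require 'pstore'

# A Hash-like facade over an on-disk store. Code that used a plain Hash
# (h["key"], h["key"] = v, h.fetch) keeps working unchanged.
class DiskHash
  def initialize(path)
    @store = PStore.new(path)
  end

  def [](key)
    @store.transaction(true) { @store[key] }
  end

  def []=(key, value)
    @store.transaction { @store[key] = value }
  end

  def fetch(key, default = nil)
    self[key] || default
  end
end
```

Concurrent-writer support is the part a toy like this doesn't capture; that's where sparkey's log/hash file split does the real work.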
Whether it's a good idea or not is for you to judge -- we've open sourced the gem in hopes that it will be useful to someone, just as sparkey and gnista were useful to us.
It seems like some of the criticism in this thread comes from people who haven't bothered to understand what you've done and summarily dismiss it as stupid. That attitude has been bothering me lately on HN, or maybe it's programmers in general.
Anyway, thanks for the interesting contribution.
This sort of thinking ignores the fact that filesystems already have their own (often very well-tuned) caching. In some cases (e.g. sendfile(2) on Linux) the kernel can even do zero-copy writes from files to the network, which, combined with decent fs caching, will easily outperform app-level memory caching. Of course, this only applies to data that stays relatively static, but often your best option is to mostly get out of the way and let the OS do the heavy lifting, unless you've measured actual loads and are sure your solution is better.
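In Ruby, `IO.copy_stream` is the usual way to get at this: per its documentation it uses sendfile(2) or similar copy-offload syscalls when the platform supports them, so the file bytes never have to pass through Ruby strings at all. A minimal sketch (the function name is mine, not from any library):

```ruby
# Serve a static file to a client without copying it through the Ruby
# heap. When src is a File and dst is a socket, IO.copy_stream can
# delegate to sendfile(2) and the kernel moves the bytes directly.
def serve_file(dst_io, path)
  File.open(path, 'rb') do |f|
    IO.copy_stream(f, dst_io)
  end
end
```

The same call works file-to-file or file-to-pipe; the zero-copy fast path just depends on what the kernel supports for that pair of descriptors.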
* Dedicate ruby processes to a particular subset of locales
* Parallelize your memcache queries
* Break up locale files into MRU/LRU strings to reduce size
* Denormalize locales (in memory, cache, whatever) into single values for most common pages. (use with MRU/LRU above)
As an aside, I still don't understand how process -> kernelspace driver -> platter can be faster than process -> kernelspace socket -> process -> RAM, especially for random access patterns. I suspect a memcache misconfiguration?
We did initially try dedicating ruby processes to particular locales -- when we saw a big improvement we knew we were on the right track. Doing so permanently would be more difficult. To "follow the sun" we would need to shift capacity to Europe and Asia during some hours and back to the US in others.
Parallelizing memcache queries is difficult because we don't know ahead of time what translations will be required to render a page.
We /are/ only working with the most recently used strings. Strings not accessed in the last 4 days are not loaded.
I'm not sure what "denormalize locales" means exactly.
Sparkey is fast because the files end up in the filesystem cache and most of the work is done in C. Going through the dalli gem to grab the translations out of memcache causes a lot of temporary ruby objects to be created.
1. Have a database of rows which are say: translation_id, locale_id, translation_text
2. That is really a 2-D array, translation[translation_id][locale_id] = translation_text
3. Reshuffle to translation[locale_id][translation_id] = translation_text (note the swapping of the indexes)
4. Generate a map of page_url => [array of translation IDs]. You can do this because the set of translation_ids on a given page doesn't change any faster than the translations themselves.
5. Create an in-memory object of all the specific translation_ids for a specific page in a specific locale
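Steps 2–5 above can be sketched in a few lines of Ruby (the row shape and names are illustrative, not from any real schema):

```ruby
# Step 2: rows of [translation_id, locale_id, text].
rows = [
  [:greeting, :en, 'Hello'],
  [:greeting, :fr, 'Bonjour'],
  [:farewell, :en, 'Goodbye'],
  [:farewell, :fr, 'Au revoir'],
]

# Step 3: reshuffle to translation[locale_id][translation_id] = text,
# so everything a locale needs sits under one key.
by_locale = Hash.new { |h, k| h[k] = {} }
rows.each { |tid, loc, text| by_locale[loc][tid] = text }

# Step 4: map each page to the translation ids it uses.
page_ids = { '/home' => [:greeting, :farewell] }

# Step 5: denormalize into one in-memory object per (page, locale).
def page_bundle(by_locale, page_ids, page, locale)
  page_ids[page].to_h { |tid| [tid, by_locale[locale][tid]] }
end
```

The step-5 bundle is what you'd cache: one lookup per page render instead of one per string.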
For an implementation of 5 I think it'd be preferable to have one giant string that contains all the translation text in it and then a list/array of indices that you use to decide where in that one big string you pull your substring from. That might well be more memory efficient than many individual substrings (this is what they did to make HBase about 5x faster) and it'll definitely give you some cache locality goodness.
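The one-giant-string layout might look like this (a sketch, assuming per-string [start, length] offsets):

```ruby
# Pack many small strings into one big blob plus an offsets table.
# One large heap object instead of thousands of small ones, which is
# kinder to the GC, and sequential reads get cache locality.
class PackedStrings
  def initialize(texts)
    @blob = +''
    @offsets = texts.map do |t|
      start = @blob.bytesize
      @blob << t
      [start, t.bytesize]
    end
  end

  def [](i)
    start, len = @offsets[i]
    @blob.byteslice(start, len)
  end
end
```

`byteslice` still allocates a substring on read, but the thousands of long-lived translation strings collapse into one object the GC never has to trace individually.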
The point is to use the file system cache, so in the common case you're not hitting platter. I'm assuming all these machines do is serve tons of web pages, so the common case will be VERY common and you will almost never hit platter (e.g. only on a deployment of new strings).
My intuition says that memcache is slower because of the extra process switching (and potential context switching). With their solution, you go from Ruby -> kernel in the common case. With the memcache solution, you have to go from Ruby -> kernel -> back to user space for memcache, EVERY time.
Also, as far as I remember, using a local TCP socket isn't as fast as using a Unix socket or pipe. The network stack actually does a lot of work.
I like their solution a lot better. Fewer moving parts. Let the kernel manage the file system cache. The kernel knows how much physical memory you have and will aggressively use it. Memcache has no idea (AFAIK), and if you are too aggressive you could end up with page faults in the memcache process anyway.
Two memcached points:
1. You can run memcached over a unix socket.
2. You can have it use locked memory so it will never page fault.
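For reference, both of those are single flags on the memcached command line (from memcached's man page; check your version):

```shell
# -s: listen on a unix domain socket instead of TCP
# -k: lock down all paged memory (mlockall), so the cache never page-faults
# -m: memory cap in megabytes
memcached -s /tmp/memcached.sock -k -m 256
```

A Ruby client would then connect to the socket path instead of host:port; with dalli that's something like `Dalli::Client.new('/tmp/memcached.sock')`, though the exact client syntax is worth checking against the gem's docs.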
As others have noted, mmap is another option.
You don't control the process switching to memcached either. It's possible memcached will get descheduled, its data removed from L* caches, and your web server process will have to wait for it to be run again. With the FS cache solution you don't have that issue.
You could tune it (e.g. try to pin memcache to a dedicated core and pin web servers to other cores), but there are settings to tune FS cache behavior as well.
Same with mmap. The kernel can swap pages back to disk in that case too. It seems to boil down to the same thing, so I'm not sure why people are so negative about using the file system. I bet you could write a test comparing serving data via mmap() vs. via the file system and they would behave nearly identically.
Did they actually benchmark all the possible options, like shared memory, SQLite, or the MySQL memory engine (periodically backed)?
They say memcache (or redis) would have been slower because of network latency, even over localhost. But did they benchmark it?