Hacker News new | comments | show | ask | jobs | submit login
Edis: a Redis for larger-than-RAM datasets (inaka.github.io)
147 points by nmcfarl 1463 days ago | hide | past | web | 58 comments | favorite

At my consulting company, Inaka, we built this originally as a side project when we were using Redis and thought we needed multi-master larger-than-RAM datasets, but then we ended up moving to CouchDB for that project.

It's currently a research project for us and we have an intern from Uppsala University working on the HA features including paxos leader election and multi-master conflict resolution. See the mutinode branch. Some people have added some additional database backends, but I don't know of any serious production users yet.

I gave a talk on it at the 2012 Erlang Factory as it's entirely Erlang-based. Here's the talk - http://vimeo.com/42630498

Benefits over Redis: Faster startup time over big datasets, pluggable backends, (eventually) Multi-master and HA (with a different approach than Redis is currently taking with Sentinel).

Drawbacks: 3x to 10x slower on all operations. We've done no work to optimize.

Have you seen Raft? It tries to be a much simpler and much more understandable alternative to Paxos.


looks interesting, thanks!

You moved from Redis to CouchDB? That sounds odd.

Why? His data set is larger than ram, and couchdb is a really great disk based key/value store.

Last time I checked CouchDB was a document store with MapReduce queries. A very different beast from Redis.

It's really a much longer story. We use Redis and Couch together. Redis for caching, Couch for long-term storage.

Whether Couch has or doesn't have Map/Reduce is irrelevant, really. Couch Map/Reduce is a toy (just like Riak) and is not to be trusted. We avoid it as much as possible.

> Couch Map/Reduce is a toy (just like Riak) and is not to be trusted.

Really? How so?

I guess I don't see the point of a Redis clone in Erlang when it would be easier to tack use cases onto something that already exists, namely Riak. It's a key-value store, it's in Erlang, it's distributed, it's getting a full-text search and it has adjustable parameters so you can pick how you want to deal with the tyranny of the CAP theorem.

The only think Riak doesn't have is a high-speed RAM-based data structure capability. But hey, Redis does that. ;-)

> I guess I don't see the point of a Redis clone in Erlang

The only reason to do it if you already have something written that uses Redis and just want to build on that.

Or perhaps you're already committed to Erlang and want to benefit from a Redis-like system?

Honestly, Redis is useful for basically 2 reasons: it's a stonkingly fast quasi-persistent store for simple data, and if you're already using it, it's a tolerable message queue and pub/sub.

It's not a convenient document store. It's not relational and it's only ACID in that the single thread serializes all access. It doesn't scale worth a damn. But it's fast enough that mostly doesn't matter.

So the question arises: why create something with none of Redis's advantages, and all of its flaws?

I'm going to assume you're not just trolling here.

There are whole companies who make elaborate servers that speak simple protocols. Membase is essentially a much more elaborate, multi-server replicated database that speaks the Memcache protocol.

We started edis as an experiment to teach some developers Erlang. We had a use case that was quite interesting at the time, so it was a great way to teach Erlang while working through some interesting database scenarios.

Imagine you have a bunch of users come into a site and they'll hit the server for an hour and then go away for at least a week, maybe more, like the demographics around a popular TV show. Imagine you need to handle a HUGE number of users for that hour, but after that hour, you have less than 1/10th of 1% of those users during the week. Also, imagine that you are pretty happy with Redis's speed, but that you've outgrown what you can comfortably handle on one server and so you are happy to tolerate less speed for more space.

This is a really nice use case for a disk-backed Redis server.

It's not the only option, of course, but we didn't need to store documents and we didn't need ACIDity.

Yeah, not trolling.

I haven't tried to design for exactly that, but my first shot at it would be independent pairs of app servers and redis instances, with redis used as an expiring cache, fronted by a load balancer with affinity, backed by centralized regular SQL. Something like MySQL has no trouble hanging onto a scadload of records it mostly doesn't touch, and you can scale the expiry time on redis down as it gets more heavily loaded, so only the hottest tip of the hot set is kept in RAM.

Since it's leveldb underneath the usual problems with write rate apply: its quite easy to get leveldb into a position where writes completely lock up for minutes at a time, and the machine is tearing itself to shreds trying to complete merges to keep up with writes.

Maybe a nice backend for an IndexedDB implementation, but for 'serious' storage, no way.

I was also surprised this used leveldb underneath, especially since the goal is for larger than memory datasets.

For every level you add an additional disk read to every query. For optimal performance you should have RAM greater than 10% the size of your dataset, so the OS disk cache can operate effectively.

Do you have any links for more information on writes locking for minutes at a time? I haven't seen this with LevelDB and I couldn't find anything after searching around.

Try searching their mailing list for "hang" and suchlike (of course they don't openly advertise such an easily triggered flaw).

LevelDB write rate initially seems amazing, since it's simply writing unsorted keys to an append-only file until the file hits 2MB or so. For bursty loads it feels great.

But the moment writes are sustained for longer than it can merge segments (say while doing a bulk load), per-write latency spikes appear (average op time jumps from <1ms to >30000ms for a single record), and eventually it'll get so far behind that all attempts to progress will hang entirely, waiting for the background compactor to free up room in the youngest generation. The effect seems to worsen exponentially with database size. To attempt to mitigate this, when LevelDB notices it's falling behind it begins sleeping for 0.1s every write.

It's especially easy to trigger on slow CPUs with spinning rust drives.


> But the moment writes are sustained for longer than it can merge segments (say while doing a bulk load), per-write latency spikes appear

Isn't that a problem common to all SSTable-based databases?

Is LevelDB any worse than HBase or Cassandra in this area?

Yes, it seems like Cassandra is suffering from compacting too, although maybe not so ugly as LevelDB.

The only solution I've found is Castle backend from Acunu [1], but there were no updates from 2011 [2] and it looks really heavy (with kernel module and all that)

[1] http://www.slideshare.net/acunu/cassandra-on-castle

[2] https://bitbucket.org/acunu/fs.hg

Any hint at how http://symas.com/mdb/ behaves ?

Been using Tokyo Cabinet for a long time now, and have repeatedly hit similar hangs. Shopping for a future datastore of this sort!

MDB had no issue on the same hardware and workload I discovered the LevelDB behaviour. It does not defer work (unless running asynchronously, in which case the OS is technically deferring work), so performance is a predictable function of database size, and unaffected by prior load.

Tokyo Cabinet should behave similarly.. can you tell us a bit more about your setup?

Sure, using TC hash datastore over millions (tenth of, not hundreds or billions) of entries, each being a compressed protobuffer. The cost of each write grows exponentially after some time. We've played with parameters, buckets size and numbers, cache, we tried over SSDs vs regular HD, no big improvement. We've considered writing a sharded version of TC (there are a couple implementations already IIRC). Typically, the problem seems to be related to the size of the file on disk and the number of buckets. Somewhere, the reads and writes become prohibitive (at least for our usage).

We like the speed of these datastores as some of our algorithms proceed with millions of calls every few seconds or so, and we like that it is not remote.

If it is a bulk load, surely slow writes aren't a serious issue as long as throughput is good? Or are you saying the average write takes 30s?

Thanks! I appreciate the heads up.

I'd like to read antirez's opinion about this.

This doesn't make much sense to me, as in redis you have to design how you are gonna store your data with it's limitation and performance in mind. So why a drop in replacement instead of redesigning your storage?

>I'd like to read antirez's opinion about this.

It looks like he already made his own (3yrs ago): https://github.com/antirez/Bigdis

> in redis you have to design how you are gonna store your data with it's limitation and performance in mind.

Hmm, that's interesting. This is the very reason I've always loved Redis and I'd love a Redis-for-disk.

> Edis is a protocol-compatable Server replacement for Redis, written in Erlang. Edis's goal is to be a drop-in replacement for Redis when persistence is more important than holding the dataset in-memory.

Isn't the entire point of Redis that it's meticulously written in C by some Italian guy, specifically designed for in-memory datasets? Protocol compatible is one thing but maybe name it something with a greater edit distance?

This came up in the discussion of "Redis as the primary data store?" -> https://news.ycombinator.com/item?id=5620960

And I was wondering if anyone was using it - and what they thought…


Also the github repo link: https://github.com/inaka/edis

The more info link to the pdf for Erlang Factory 2012 talk appears to 404 but I found a direct source http://www.erlang-factory.com/upload/presentations/503/Edis-...

>Edis (currently) uses Google's leveldb as a backend

That's nice. That means it would be (relatively) easy to change it to lmdb[1] to make it fast.

[1] http://symas.com/mdb/microbench/

I actually made a fork of redis (as opposed to a reimplementation), with the design challenge "smallest possible patch" (to make it easier to merge forward), that used LMDB as a graveyard of "old keys" for this very purpose.


I do not quite remember if there were any outstanding "didn't quite finish this corner case" items, but I don't think there were. (If I looked at it I could tell, but I'm on my phone, and should be looking at other things right now anyway.)

(edit: To be clear, I mean in "normal usage". I almost certainly didn't think through issues involving, say, replication, to make certain the semantics were always sensible; my goal was "doesn't crash or break or corrupt".)

(I think the problem was mostly that the higher-level architecture I was putting this piece into underwent a slight change that made me think I might need something slightly different, and the result became "since I literally only use the APPEND command, I can probably come up with something more directly solves my problem.)

If you have any questions, I'm on IRC (various networks, including freenode and irc.saurik.com as "saurik").

That's great, thank you!

But I remember some time ago antirez was trying to implement the same idea (it was called "virtual memory"[1]), but failed [2] mainly due the complexity of implementation.

[1] http://redis.io/topics/virtual-memory

[2] http://oldblog.antirez.com/post/redis-virtual-memory-story.h...

Yeah. He, however, also was developing, himself, the storage backend, so that's already one major source of complexity you don't have if you use an off-the-shelf embedded key-value library. He also is simply working with a wide set of use cases, both at the time and on his roadmap, where VM causes things to "kind of suck". I certainly did not have the goal of making everything "work well", I just wanted things to "not break horribly" (I added a sentence to my previous post making this more clear). If I had actually built something that would be epic for some wide range of use cases, I'd have done something more with it than just leaving it in a public git repository for months ;P. In the end this didn't even satisfy my use case...

Agree. I think the main mistake was (widely spreaded) belief that key performance factor in k-v databases is complicated multi-level caching, that is obviously hard to implement, not to say to implement it right.

By the way, what was the storage that satisfied your 'append only' use case at the end of the day? AFAIK plain LMDB could get performance on par with raw disks I/O in this case.

What I was working on was a side project, and my time got diverted to things that were more important and I haven't been able to get back to it; the key thing I was needing was a very fast "append" primitive (as in, amortized constant-time append performance) that would be very tightly encoded.

In essence, I wanted some kind of vector array implementation (std::vector, java.util.ArrayList, etc.) on a server, but in a situation where a very large number of the values would end up becoming "dead weight" at some point and, while I needed them to not be deleted, could be flushed to disk.

(At the higher level of my system I am able to shard the strings into which I'm appending so they don't themselves get so large that the eviction/recovery process becomes insane, but within windows of time I want near constant-time performance doing random indexing into the appended data.)

Redis, with the caveat that it has a needlessly and seemingly painfully wasteful format for its append-only durability solution, pretty much does this, but was RAM-only, so I set things up such that old keys could be evicted to LMDB and later could be seamlessly recovered when/if required.

I do not believe "plain LMDB" (without Redis) is a good solution to this problem, but I am willing to believe that I just don't understand how to use it very well (I never found any real "documentation", so I was reading all the comments in the code, looking at the API, checking the mailing list, and falling back on some guess/check; not many people use it).

(edit: Just realized this maybe didn't answer the question while re-reading our thread. Will continue with more focus.)

One thing that I was dissatisfied with here is that the durability of the eviction to LMDB is still only going to be on a single computer; for my purpose, once a key is evicted, its latency requirements drop dramatically, so I could potentially store it to something much slower (like S3).

Honestly, though, I don't quite remember what I decided was sub-optimal about this, because my brain is right now filled with other things that I don't quite want to fully context switch out for the purpose of trying to bring this project back to the front of my mind ;P. It worked, though.

LMDB is only meant to be an embedded engine. Distribution/replication are features to add in higher level layers. E.g., memcacheDB built on LMDB exists, you can add repcached or something else to do replication over that. We're also adding LMDB to HyperDex and Riak. We may end up writing our own cluster manager at some point, if we don't like the ones we can work with today. But the key is to stick to Unix philosophy - write small modules that focus on one particular task. LMDB addresses storage in a single node. Some other component above it can do distribution/sharding/whatever.

That's exactly what I was accomplishing by having redis do key eviction to LMDB ;P. To be clear: it wasn't that LMDB wasn't awesome for the purpose of adding disk persistence and key eviction to redis, it was that having unclustered copies of redis with disk persistence turned out to not nail what I wanted to accomplish. I figured I'd reevaluate redis's cluster support at some later time, or build something more custom now that I've figured out what primitives I needed with more certainty.

(The other comment I made about "plain LMDB" was to address a question posed by snaky, where I was assuming either embedding it into my app server or having a thin wrapper; I was assuming the suggestion was that LMDB could handle "append to key" in a durable and fast manner without relying on redis to first buffer the data to be written en masse as larger chunks, which is pretty much what it is doing in this architecture I was designing.)

(BTW, I've never talked to you, but I knew one of your former students. Hello! ;P)

Hello! ;)

OK, that makes sense. Yah, it sounded like that to me too, but LMDB's Append mode is only usable when all the input is already in sorted order. Great for bulk loads of prepared data, not much help with randomly arriving input.

Oh, you are so right about documentation. I found the presentation slides on LMDB page the only viable source of 'why and how' information that in my opinion is (ok, almost) always a way more important than API reference (especially autogenerated one). That was funny to find out Python bindings docs more useful than original, huh.

Feel free to submit doc patches. I'm pretty sure I've made all of the relevant information/description/motivation available, it sounds like you just have to collect it into a single place now.

You can also catch us on irc #openldap if you want to ask questions about it and aren't posting to the openldap-technical mailing list. The py-lmdb guy has been on irc with us a lot lately, which is probably why his docs have so much info - he keeps asking questions and getting answers.

I'm afraid that for me as a non-native English speaker (to say the least) it would be a little suboptimal to write the perfect documentation, unfortunately.

And I agree with you in that all the relevant pieces of information are there, and for me personally, as well as for saurik, source code is more than enough actually, especially considering they are really clear and compact.

I just wish LMDB were more popular and widely adopted because it's really brilliant piece of software. And I think documentation is a one of main factors in that, that's it.

Anyway, thanks for pointing, I'm thinking about a potential complexity of adding to LMDB the ability to use secondary indexes (separetely stored, due the carefully done tuning of main storage scheme) so I'm sure it'll come to discuss with you on IRC and mailing list.

Yeah, I've focused my attention more on the code than the docs, which is pretty typical for me. But frankly I think the doxygen stuff is already very illuminating.

For secondary indexing, you should take a look at what slapd back-mdb does. I know that BDB supported 2ndary as a built in feature but we never used it because it was too slow. What we're doing now is already quite fast.

LMDB looks awesome. I lost a lot of performance (~5x) moving from a memory mapped architecture to LevelDB in a database I wrote (http://skydb.io/). I'll definitely be checking out lmdb to see if I can get that performance back. Thanks!

SkyDB is definitely most interesting thing I seen this year in DB field, mostly because it's a rare example of out-of-the-box thinking and considering higher level use cases than usual (data semantics level I'd say) may give you an orders of magnitude speedup.

But should I suggest, considering SkyDB highly experimental, maybe port it to more flexible backend like Mapkeeper[1] that gives for free not only easy switchable storage backends, but (not switchable, but changeable with a little effort) network/API backends (now it's Thrift), event engine (now it's libevent), etc. It would be nice to focus on core SkyDB things (which I think are data organization, API, stored procs etc.) while "outsource" the non-core low-level jobs to things like Mapkeeper.

And by the way, lmdb backend have being merged into Mapkeeper not so long ago :-)

[1] https://github.com/m1ch1/mapkeeper/

Mapkeeper looks cool. I'll definitely read up on it some more.

Sky originally started out as a project to learn how to write a database and programming language. The storage was originally a block-based mmap format and the query engine was built on LLVM. It was a great learning experience but I am trying to take a higher level approach by delegating storage to LevelDB (or possibly LMDB instead) and the query engine uses LuaJIT. Focusing on a higher level use case (i.e. behavioral analytics) has given some crazy performance gains and great scalability.

I think once Sky matures more then it would be more interesting to be able to swap out the backends. I worry about having too many moving pieces. I'd also be concerned about copying data with the extra layers. It looks like LDMB does a zero copy which would be awesome.

Yeah, zero-copying is a great LMDB feature, even using LuaJIT stored procs you can pass pointer to mmap window and process the data in that window using ffi (did I say LuaJIT is just damn great?) and (optionally) DynASM.

Sounds interesting, but there's a typo in the first sentence: compatable -> compatible.

I spotted that too. Maybe I'm a terrible nitpicker / extremely harsh, but it kinda turned me off. Made it look instantly less of a serious project, when otherwise the look & feel of the page were pretty solid.

I was looking into https://github.com/KDr2/redis-leveldb a just a couple days ago, same concept.

I'm confused about Edis current version being 1.0. Is it production ready or just ignoring semantical versioning?

From this thread, sounds like the second point: "It's currently a research project for us and we have an intern from Uppsala University working on the HA features including paxos leader election and multi-master conflict resolution."

In this context, I think one can see this as a variant of Riak...

Except with none of the distributedness?

I started on a similar project written in Go and built on LevelDB: https://github.com/cupcake/setdb

I feel like Redisc would be a more clever name.

Setdb is a similar project https://github.com/cupcake/setdb

If you are using Java, you may want to try MapDB. It uses memory mapped files and has interesting performance. Redis server protocol implementation is on roadmap.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact