
Edis: a Redis for larger-than-RAM datasets - nmcfarl
http://inaka.github.io/edis/
======
rubyrescue
At my consulting company, Inaka, we built this originally as a side project
when we were using Redis and thought we needed multi-master larger-than-RAM
datasets, but then we ended up moving to CouchDB for that project.

It's currently a research project for us, and we have an intern from Uppsala
University working on the HA features, including Paxos leader election and
multi-master conflict resolution. See the mutinode branch. Some people have
added additional database backends, but I don't know of any serious
production users yet.

I gave a talk on it at the 2012 Erlang Factory as it's entirely Erlang-based.
Here's the talk - <http://vimeo.com/42630498>

Benefits over Redis: faster startup time on big datasets, pluggable
backends, and (eventually) multi-master and HA (with a different approach
than the one Redis is currently taking with Sentinel).

Drawbacks: 3x to 10x slower on all operations. We've done no work to optimize.
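A note on what a "drop-in replacement" means in practice: any stock Redis client should work against Edis unchanged, because both sides frame commands the same way on the wire. A minimal sketch of that framing (RESP, the Redis serialization protocol), in Python for illustration:

```python
def encode_command(*args: str) -> bytes:
    """Frame a command as a RESP array of bulk strings,
    exactly as any Redis client library does on the wire."""
    out = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)

# A server that parses this framing -- whether backed by RAM (Redis)
# or by a disk store (Edis) -- sees an identical request.
print(encode_command("SET", "key", "value"))
```

From the client's point of view, only the connection target changes; the payload bytes are the same either way.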

~~~
lsb
Have you seen Raft? It tries to be a much simpler and much more understandable
alternative to Paxos.

[https://ramcloud.stanford.edu/wiki/download/attachments/1137...](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf)

~~~
rubyrescue
looks interesting, thanks!

------
JulianMorrison
Honestly, Redis is useful for basically 2 reasons: it's a stonkingly fast
quasi-persistent store for simple data, and if you're already using it, it's a
tolerable message queue and pub/sub.

It's not a convenient document store. It's not relational, and it's only ACID
in the sense that the single thread serializes all access. It doesn't scale
worth a damn. But it's fast enough that this mostly doesn't matter.

So the question arises: why create something with none of Redis's advantages,
and all of its flaws?

~~~
rubyrescue
I'm going to assume you're not just trolling here.

There are whole companies that make elaborate servers speaking simple
protocols. Membase is essentially a much more elaborate, multi-server
replicated database that speaks the Memcache protocol.

We started edis as an experiment to teach some developers Erlang. We had a use
case that was quite interesting at the time, so it was a great way to teach
Erlang while working through some interesting database scenarios.

Imagine you have a bunch of users come into a site and they'll hit the server
for an hour and then go away for at least a week, maybe more, like the
demographics around a popular TV show. Imagine you need to handle a HUGE
number of users for that hour, but after that hour, you have less than 1/10th
of 1% of those users during the week. Also, imagine that you are pretty happy
with Redis's speed, but that you've outgrown what you can comfortably handle
on one server and so you are happy to tolerate less speed for more space.

This is a really nice use case for a disk-backed Redis server.

It's not the only option, of course, but we didn't need to store documents and
we didn't need ACIDity.

~~~
JulianMorrison
Yeah, not trolling.

I haven't tried to design for exactly that, but my first shot at it would be
independent pairs of app servers and redis instances, with redis used as an
expiring cache, fronted by a load balancer with affinity, backed by
centralized regular SQL. Something like MySQL has no trouble hanging onto a
scadload of records it mostly doesn't touch, and you can scale the expiry time
on redis down as it gets more heavily loaded, so only the hottest tip of the
hot set is kept in RAM.
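The arrangement described above can be sketched in miniature. This is a hedged toy, not a recommendation: a plain dict with per-key expiry stands in for Redis, stdlib sqlite3 stands in for the central SQL store, and all names are made up for illustration:

```python
import sqlite3
import time

class TTLCache:
    """Toy stand-in for a Redis instance used as an expiring cache
    in front of a central SQL store (sqlite3 here for illustration).
    Shrinking `ttl` under load keeps only the hottest keys in RAM."""
    def __init__(self, db: sqlite3.Connection, ttl: float):
        self.db, self.ttl, self.cache = db, ttl, {}

    def get(self, key: str):
        hit = self.cache.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]                        # hot: served from RAM
        row = self.db.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            return None
        self.cache[key] = (row[0], time.monotonic() + self.ttl)
        return row[0]                            # cold: refilled from SQL

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
db.execute("INSERT INTO kv VALUES ('user:1', 'alice')")
cache = TTLCache(db, ttl=3600)
print(cache.get("user:1"))   # first hit refills from SQL, then caches
```

SQL happily holds the inactive 99.9% of records; the cache only ever carries the viewers who showed up this hour.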

------
hosay123
Since it's LevelDB underneath, the usual problems with write rate apply: it's
quite easy to get LevelDB into a position where writes completely lock up for
minutes at a time while the machine tears itself to shreds trying to
complete merges to keep up with the writes.

Maybe a nice backend for an IndexedDB implementation, but for 'serious'
storage, no way.

~~~
benbjohnson
Do you have any links for more information on writes locking for minutes at a
time? I haven't seen this with LevelDB and I couldn't find anything after
searching around.

~~~
hosay123
Try searching their mailing list for "hang" and suchlike (of course they don't
openly advertise such an easily triggered flaw).

LevelDB's write rate initially seems amazing, since it's simply writing unsorted
keys to an append-only file until the file hits 2MB or so. For bursty loads it
feels great.

But the moment writes are sustained for longer than it can merge segments (say,
while doing a bulk load), per-write latency spikes appear (average op time
jumps from <1ms to >30000ms for a single record), and eventually it gets so
far behind that all attempts to progress hang entirely, waiting for the
background compactor to free up room in the youngest generation. The effect
seems to worsen exponentially with database size. To mitigate this,
when LevelDB notices it's falling behind it begins sleeping for 0.1s on every
write.

It's especially easy to trigger on slow CPUs with spinning rust drives.

[https://groups.google.com/forum/?fromgroups=#!topic/leveldb/...](https://groups.google.com/forum/?fromgroups=#!topic/leveldb/ylEXQAs6XVA)
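The dynamic described above can be caricatured with a toy model. This is an illustration only: real LevelDB throttles on its level-0 file count with its own constants, and the trigger values and costs below are arbitrary stand-ins, not measured figures:

```python
# Toy model of the write-stall dynamic: writes land in a memtable, full
# memtables become level-0 files, and a slow background compaction drains
# them. Constants are illustrative, not LevelDB's real ones.
SLOWDOWN_TRIGGER = 8    # L0 files before each write is throttled
STOP_TRIGGER = 12       # L0 files before writes stall entirely

def simulate(n_writes: int, memtable_size: int, compact_rate: float):
    l0_files, in_memtable, latency = 0, 0, []
    compact_credit = 0.0
    for _ in range(n_writes):
        cost = 1                               # baseline write cost
        if l0_files >= STOP_TRIGGER:
            cost += 1000                       # stall: wait on compaction
            l0_files -= 1
        elif l0_files >= SLOWDOWN_TRIGGER:
            cost += 100                        # throttle: sleep per write
        latency.append(cost)
        in_memtable += 1
        if in_memtable == memtable_size:       # flush memtable to L0
            in_memtable, l0_files = 0, l0_files + 1
        compact_credit += compact_rate         # background merge progress
        while compact_credit >= 1 and l0_files > 0:
            compact_credit -= 1
            l0_files -= 1
    return latency

lat = simulate(n_writes=5000, memtable_size=10, compact_rate=0.05)
# Early writes are cheap; once compaction falls behind, spikes appear.
print(max(lat[:100]), max(lat))
```

With flushes arriving faster than compaction drains them, latency is flat at first and then spikes by orders of magnitude, which is the shape of the behaviour reported on the mailing list.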

~~~
continuations
> But the moment writes are sustained for longer than it can merge segments
> (say while doing a bulk load), per-write latency spikes appear

Isn't that a problem common to all SSTable-based databases?

Is LevelDB any worse than HBase or Cassandra in this area?

~~~
snaky
Yes, it seems Cassandra suffers from compaction pauses too, although maybe
not as badly as LevelDB.

The only solution I've found is the Castle backend from Acunu [1], but there
have been no updates since 2011 [2] and it looks really heavyweight (with a
kernel module and all that).

[1] <http://www.slideshare.net/acunu/cassandra-on-castle>

[2] <https://bitbucket.org/acunu/fs.hg>

------
alcuadrado
I'd like to read antirez's opinion about this.

This doesn't make much sense to me, as in Redis you have to design how you are
going to store your data with its limitations and performance in mind. So why
a drop-in replacement instead of redesigning your storage?

~~~
firefoxman1
> _I'd like to read antirez's opinion about this._

It looks like he already made his own (3 years ago):
<https://github.com/antirez/Bigdis>

> _in Redis you have to design how you are going to store your data with its
> limitations and performance in mind._

Hmm, that's interesting. This is the very reason I've always loved Redis and
I'd love a Redis-for-disk.

------
glurgh
> _Edis is a protocol-compatable Server replacement for Redis, written in
> Erlang. Edis's goal is to be a drop-in replacement for Redis when
> persistence is more important than holding the dataset in-memory._

Isn't the entire point of Redis that it's meticulously written in C by some
Italian guy, specifically designed for in-memory datasets? Protocol-compatible
is one thing, but maybe name it something with a greater edit distance?

------
nmcfarl
This came up in the discussion of "Redis as the primary data store?" ->
<https://news.ycombinator.com/item?id=5620960>

And I was wondering if anyone was using it - and what they thought…

\--

Also the github repo link: <https://github.com/inaka/edis>

~~~
chuhnk
The more info link to the PDF of the Erlang Factory 2012 talk appears to 404,
but I found a direct source: [http://www.erlang-factory.com/upload/presentations/503/Edis-...](http://www.erlang-factory.com/upload/presentations/503/Edis-ImplementingRedisinErlang.pdf)

------
snaky
>Edis (currently) uses Google's leveldb as a backend

That's nice. That means it would be (relatively) easy to change it to lmdb[1]
to make it fast.

[1] <http://symas.com/mdb/microbench/>

~~~
saurik
I actually made a fork of redis (as opposed to a reimplementation), with the
design challenge "smallest possible patch" (to make it easier to merge
forward), that used LMDB as a graveyard of "old keys" for this very purpose.

git://git.saurik.com/redis.git

I do not quite remember if there were any outstanding "didn't quite finish
this corner case" items, but I don't think there were. (If I looked at it I
could tell, but I'm on my phone, and should be looking at other things right
now anyway.)

(edit: To be clear, I mean in "normal usage". I almost certainly didn't think
through issues involving, say, replication, to make certain the semantics were
always sensible; my goal was "doesn't crash or break or corrupt".)

(I think the problem was mostly that the higher-level architecture I was
putting this piece into underwent a slight change that made me think I might
need something slightly different, and the result became "since I literally
only use the APPEND command, I can probably come up with something that more
directly solves my problem".)

If you have any questions, I'm on IRC (various networks, including freenode
and irc.saurik.com as "saurik").

~~~
snaky
That's great, thank you!

But I remember that some time ago antirez tried to implement the same idea (it
was called "virtual memory" [1]), but failed [2], mainly due to the complexity
of the implementation.

[1] <http://redis.io/topics/virtual-memory>

[2] [http://oldblog.antirez.com/post/redis-virtual-memory-story.h...](http://oldblog.antirez.com/post/redis-virtual-memory-story.html)

~~~
saurik
Yeah. He, however, was also developing the storage backend himself, so
that's already one major source of complexity you don't have if you use an
off-the-shelf embedded key-value library. He was also working with a
wide set of use cases, both at the time and on his roadmap, where VM causes
things to "kind of suck". I certainly did not have the goal of making
everything "work well", I just wanted things to "not break horribly" (I added
a sentence to my previous post making this more clear). If I had actually
built something that would be epic for some wide range of use cases, I'd have
done something more with it than just leaving it in a public git repository
for months ;P. In the end this didn't even satisfy my use case...

~~~
snaky
Agreed. I think the main mistake was the (widely spread) belief that the key
performance factor in k-v databases is complicated multi-level caching, which
is obviously hard to implement, let alone implement right.

By the way, what was the storage that satisfied your append-only use case at
the end of the day? AFAIK plain LMDB can get performance on par with raw
disk I/O in this case.

~~~
saurik
What I was working on was a side project, and my time got diverted to things
that were more important and I haven't been able to get back to it; the key
thing I was needing was a very fast "append" primitive (as in, amortized
constant-time append performance) that would be very tightly encoded.

In essence, I wanted some kind of vector array implementation (std::vector,
java.util.ArrayList, etc.) on a server, but in a situation where a very large
number of the values would end up becoming "dead weight" at some point and,
while I needed them to not be deleted, could be flushed to disk.

(At the higher level of my system I am able to shard the strings into which
I'm appending so they don't themselves get so large that the eviction/recovery
process becomes insane, but within windows of time I want near constant-time
performance doing random indexing into the appended data.)
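The primitive described above (amortized constant-time append, sharded into windows, with constant-time random indexing inside a window) could look something like this toy in-memory version; all names are hypothetical:

```python
class ShardedAppendLog:
    """Sketch of the primitive described above: values are appended into
    per-window shards; appends are amortized O(1) (list append) and random
    indexing within a window stays O(1). A sealed (cold) shard could be
    evicted to disk wholesale without touching the hot window."""
    def __init__(self, window_size: int):
        self.window_size = window_size
        self.shards = [[]]                 # shards[-1] is the hot window

    def append(self, value) -> tuple[int, int]:
        hot = self.shards[-1]
        if len(hot) == self.window_size:   # seal the window, open a new one
            hot = []
            self.shards.append(hot)
        hot.append(value)
        # (shard, offset) is a stable handle for later random access
        return len(self.shards) - 1, len(hot) - 1

    def get(self, shard: int, offset: int):
        # A cold shard would be recovered from disk here before indexing.
        return self.shards[shard][offset]

log = ShardedAppendLog(window_size=3)
handles = [log.append(f"event-{i}") for i in range(7)]
print(handles[4], log.get(*handles[4]))
```

The windowing is what keeps any single shard from growing so large that evicting or recovering it becomes insane, which matches the sharding-by-time described above.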

Redis, with the caveat that it has a needlessly and seemingly painfully
wasteful format for its append-only durability solution, pretty much does
this, but was RAM-only, so I set things up such that old keys could be evicted
to LMDB and later could be seamlessly recovered when/if required.
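The eviction arrangement described above can be sketched as a toy, with stdlib sqlite3 standing in for LMDB as the "graveyard of old keys"; this illustrates the shape of the design, not the actual redis patch:

```python
import sqlite3

class EvictingStore:
    """Toy of the arrangement above: hot keys live in RAM; cold keys are
    evicted to an embedded disk store (sqlite3 standing in for LMDB) and
    recovered transparently on the next access."""
    def __init__(self):
        self.hot = {}
        self.disk = sqlite3.connect(":memory:")
        self.disk.execute("CREATE TABLE graveyard (k TEXT PRIMARY KEY, v TEXT)")

    def set(self, key, value):
        self.hot[key] = value

    def evict(self, key):
        # Move a cold key out of RAM into the graveyard.
        self.disk.execute("REPLACE INTO graveyard VALUES (?, ?)",
                          (key, self.hot.pop(key)))

    def get(self, key):
        if key not in self.hot:            # miss: recover from disk
            row = self.disk.execute(
                "SELECT v FROM graveyard WHERE k = ?", (key,)).fetchone()
            if row is None:
                return None
            self.hot[key] = row[0]
            self.disk.execute("DELETE FROM graveyard WHERE k = ?", (key,))
        return self.hot[key]

s = EvictingStore()
s.set("old", "data")
s.evict("old")                 # RAM copy gone, disk copy remains
print("old" in s.hot, s.get("old"))
```

The point is that callers never see the eviction: a read of a cold key just pays a one-time recovery cost.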

I do not believe "plain LMDB" (without Redis) is a good solution to this
problem, but I am willing to believe that I just don't understand how to use
it very well (I never found any real "documentation", so I was reading all the
comments in the code, looking at the API, checking the mailing list, and
falling back on some guess/check; not many people use it).

(edit: Just realized this maybe didn't answer the question while re-reading
our thread. Will continue with more focus.)

One thing that I was dissatisfied with here is that the durability of the
eviction to LMDB is still only going to be on a single computer; for my
purpose, once a key is evicted, its latency requirements drop dramatically, so
I could potentially store it to something much slower (like S3).

Honestly, though, I don't quite remember what I decided was sub-optimal about
this, because my brain is right now filled with other things that I don't
quite want to fully context switch out for the purpose of trying to bring this
project back to the front of my mind ;P. It worked, though.

~~~
hyc_symas
LMDB is only meant to be an embedded engine. Distribution/replication are
features to add in higher level layers. E.g., memcacheDB built on LMDB exists,
you can add repcached or something else to do replication over that. We're
also adding LMDB to HyperDex and Riak. We may end up writing our own cluster
manager at some point, if we don't like the ones we can work with today. But
the key is to stick to Unix philosophy - write small modules that focus on one
particular task. LMDB addresses storage in a single node. Some other component
above it can do distribution/sharding/whatever.

~~~
saurik
That's exactly what I was accomplishing by having redis do key eviction to
LMDB ;P. To be clear: it wasn't that LMDB wasn't awesome for the purpose of
adding disk persistence and key eviction to redis, it was that having
unclustered copies of redis with disk persistence turned out to not nail what
I wanted to accomplish. I figured I'd reevaluate redis's cluster support at
some later time, or build something more custom now that I've figured out what
primitives I needed with more certainty.

(The other comment I made about "plain LMDB" was to address a question posed
by snaky, where I was assuming either embedding it into my app server or
having a thin wrapper; I was assuming the suggestion was that LMDB could
handle "append to key" in a durable and fast manner without relying on redis
to first buffer the data to be written en masse as larger chunks, which is
pretty much what it is doing in this architecture I was designing.)

(BTW, I've never talked to you, but I knew one of your former students. Hello!
;P)

~~~
hyc_symas
Hello! ;)

OK, that makes sense. Yah, it sounded like that to me too, but LMDB's Append
mode is only usable when all the input is already in sorted order. Great for
bulk loads of prepared data, not much help with randomly arriving input.
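A toy illustration of that constraint (plain Python mimicking the behaviour, not the LMDB API): an append-mode bulk loader only accepts keys in ascending order, so randomly arriving input has to be sorted up front:

```python
class AppendModeWriter:
    """Toy illustration of the constraint above: an append-mode bulk
    loader accepts keys only in strictly ascending order (as LMDB's
    MDB_APPEND flag does), so unsorted input must be sorted first."""
    def __init__(self):
        self.pairs = []

    def put_append(self, key: bytes, value: bytes):
        if self.pairs and key <= self.pairs[-1][0]:
            raise ValueError("key out of order: append mode rejected it")
        self.pairs.append((key, value))   # O(1): always goes at the end

def bulk_load(writer, items):
    # Randomly arriving input must be sorted before the append-mode path
    # pays off; otherwise every out-of-order key is rejected.
    for k, v in sorted(items):
        writer.put_append(k, v)

w = AppendModeWriter()
bulk_load(w, [(b"b", b"2"), (b"a", b"1"), (b"c", b"3")])
print([k for k, _ in w.pairs])
```

Sorted input lets the store skip the usual search-for-insert-position work, which is why the mode is great for prepared bulk loads and useless for random arrivals.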

------
rogerbraun
Sounds interesting, but there's a typo in the first sentence: compatable ->
compatible.

~~~
gingerlime
I spotted that too. Maybe I'm a terrible nitpicker / extremely harsh, but it
kinda turned me off. It instantly made the project look less serious, when
otherwise the look & feel of the page was pretty solid.

------
ricardobeat
I was looking into <https://github.com/KDr2/redis-leveldb> just a couple of
days ago, same concept.

I'm confused about Edis's current version being 1.0. Is it production-ready,
or is it just ignoring semantic versioning?

~~~
Uchikoma
From this thread, it sounds like the latter: "It's currently a research
project for us and we have an intern from Uppsala University working on the HA
features including paxos leader election and multi-master conflict
resolution."

------
jondot
In this context, I think one can see this as a variant of Riak...

~~~
nbevans
Except with none of the distributedness?

------
Titanous
I started on a similar project written in Go and built on LevelDB:
<https://github.com/cupcake/setdb>

------
wtetzner
I feel like Redisc would be a more clever name.

~~~
firefoxman1
Me too, then I Googled it:

<https://github.com/Moodstocks/redisk> <https://github.com/quirkey/redisk>

------
iand
Setdb is a similar project <https://github.com/cupcake/setdb>

------
qwerta
If you are using Java, you may want to try MapDB. It uses memory-mapped files
and has interesting performance. A Redis server protocol implementation is on
the roadmap.

