
Redis persistence demystified - antirez
http://antirez.com/post/redis-persistence-demystified.html
======
sbarre
I love your blog posts so much. In 20 minutes I just learned so much about
databases and internals.

Thanks for writing these!

~~~
antirez
Thank you for reading it. It's a long post; it takes some patience and interest
to read.

~~~
giulianob
Are there any good posts on transactions? I've found articles around but
nothing very concise that explains the exact behavior of MULTI/EXEC when
things go bad.

~~~
antirez
Basically MULTI/EXEC is always handled correctly: either everything or nothing
is committed to the database memory, RDB file, AOF file, slave, ...

~~~
giulianob
I had read that MULTI/EXEC would not "rollback" in case a command fails (e.g.
you do a first SET, a second SET, and the third SET fails; the two initial SETs
would still be applied). I guess this doesn't have much to do with persistence
problems specifically (system crash, process crash, etc.) but more with how
transactions work in general.

Is this not true or at least no longer true?

~~~
nbpoole
<http://redis.io/commands/set>

> _Status code reply: always OK since SET can't fail._

So, what kind of situation are you envisioning?

~~~
giulianob
Maybe SET wasn't a good example... but let's say you are writing to the wrong
type or something like that. I know it's unlikely something like that will fail
in production, since you should catch it during dev, but bugs can happen.
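
To make the failure mode concrete, here's a toy Python model of it (made-up
helper names, no real Redis client): EXEC runs every queued command in order,
and the writes made before the failing command are not rolled back.

```python
# Toy model of EXEC running queued commands: a runtime error in one
# command does not undo the writes made by the commands before it.
db = {}

def cmd_set(key, value):
    db[key] = value
    return "OK"

def cmd_incr(key):
    # Fails at runtime if the value is not an integer (wrong type),
    # mirroring e.g. an increment against a plain string.
    db[key] = int(db.get(key, 0)) + 1
    return db[key]

def exec_transaction(queued):
    """Run every queued command; collect per-command results or errors."""
    results = []
    for func, args in queued:
        try:
            results.append(func(*args))
        except ValueError as err:
            # The error is reported in the reply; nothing is rolled back.
            results.append(f"ERR {err}")
    return results

queued = [
    (cmd_set, ("a", "1")),
    (cmd_set, ("b", "not-a-number")),
    (cmd_incr, ("b",)),   # runtime error: wrong type
]
print(exec_transaction(queued))
print(db)  # "a" and "b" keep the values written before the failing command
```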

------
ypcx
For high-value production data, if you think your data is safe after it's
(finally) written to disk, you're doing it wrong. Your data is not safe until
a) it has left the physical database machine, and b) it has left the physical
datacenter to be stored in multiple datacenters around the world (e.g. S3).

Thus, ideally, the (horribly slow) disk doesn't even come into play, especially
for in-memory DBs. You buffer the data in memory before it is sent out of the
machine/datacenter, but you make sure to mirror this buffer on multiple
separate physical machines (which your database cluster _should_ support), in
case one goes down. Once the data is committed into a replicated store, you can
clear that buffer. Fast and reliable.

This is not to say that there aren't zillions of cases where a hard drive is
still the ideal persistence device. After all, it's very hard to destroy a hard
drive in a way that makes the data unrecoverable (of course, I'm talking about
cases where RAID failed or wasn't present). In reality, however, recovery of
data from broken hard drives is seldom attempted, mainly, I guess, due to the
price and relatively long service waiting times.

~~~
antirez
If you read the article carefully there are multiple mentions of this;
specifically, I wrote that RDB persistence is just perfect for this: a single-
file compact representation of the data to send _far away_ :)

~~~
ypcx
Hmm, I was thinking more of sending chunks of the AOF out to a separate,
distributed storage (and the RDB snapshot file occasionally). Loading data from
the network could then be slower or faster than from a disk, depending on
network speed and the number of machines to read from.

------
quink
And because I just know the whole VM deprecated thing will come up, here's a
pretty awesomely informative recent status:

<https://github.com/antirez/redis/issues/254>

~~~
Smerity
Antirez mentions that a limited set of on-disk data structures could work well
but that it will take one or two years to even reach the drawing board. Fair
enough -- Redis is first and foremost an in-memory database.

If there were time, though, I'd love to see LevelDB[1] bridge the gap between
in-memory and on-disk. Inspired by Google's BigTable, all keys are kept sorted
on disk (see: SSTable[2]). Keys, lists, sets, sorted sets and hashtables could
be encoded in sorted key-value form and would be reasonably (though not
tremendously) efficient to retrieve, especially if the query can be converted
to a range query (i.e. list retrieval or set intersection). Keep hot data in
memory; the rest ends up securely on disk.
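
As a toy sketch of that encoding idea (plain Python, with a sorted key list
standing in for LevelDB's on-disk SSTable; the key format and names are made
up): a sorted set can be flattened into keys that sort in score order, so a
score-range query becomes a range scan over adjacent keys.

```python
import bisect

def zset_key(name, score, member):
    # Zero-padded score makes lexicographic order match numeric order
    # (assumes non-negative integer scores and no ":" in member names).
    return f"z:{name}:{score:010d}:{member}"

store = {}  # stand-in for the key-value store
for member, score in [("alice", 300), ("bob", 25), ("carol", 120)]:
    store[zset_key("highscores", score, member)] = b""

sorted_keys = sorted(store)  # LevelDB keeps keys sorted on disk for us

def zrangebyscore(name, lo, hi):
    """Members with score in [lo, hi], found via a single range scan."""
    start = bisect.bisect_left(sorted_keys, f"z:{name}:{lo:010d}:")
    end = bisect.bisect_right(sorted_keys, f"z:{name}:{hi:010d}:\xff")
    return [k.split(":", 3)[3] for k in sorted_keys[start:end]]

print(zrangebyscore("highscores", 0, 150))  # prints ['bob', 'carol']
```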

Yes, Cassandra is based on the same lineage, but it's not simple or clean to
operate. Redis is pain-free to set up and features a simple API, but has no
transparent way to overflow cold data to disk. LevelDB is an optional backend
for Riak, but I must admit I've not explored Riak heavily... Have I missed a
contender from another database crowd?

[1]: <http://code.google.com/p/leveldb/>

[2]: <http://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/>

~~~
Seldaek
Speaking of LevelDB and Redis, did you know about Edis[1]? It is a protocol-
compatible implementation of Redis that uses LevelDB as its data store. I
haven't had a chance (nor the need) to try it, but it sounds interesting.

[1]: <http://inaka.github.com/edis/>

~~~
Smerity
Indeed I hadn't -- Edis doesn't rank highly in searches for "LevelDB + Redis",
which could be how it avoided me for so long. This sounds like what I
envisioned, so I'll be looking at it with keen interest =] The real question is
how they implemented the encoding of the Redis data structures into the LevelDB
SSTable format, and the implications that will have on performance. If the
Github issues are any indication, it's an interesting proof of concept but
hasn't been used or tested widely yet[1]. Along those lines, leveldb-server[2]
also looks interesting as a simple (API-wise) and easy-to-install LevelDB-
backed DB.

[1]: <https://github.com/inaka/edis/issues/2> [2]:
<https://github.com/srinikom/leveldb-server>

------
ot
For anybody interested in this topic, this SQLite doc page is excellent:

<http://sqlite.org/atomiccommit.html>

Check out in particular the sections "Hardware Assumptions" and "Things That
Can Go Wrong".

------
obtu
Durability through replication should probably be mentioned as well; either to
address performance requirements, or to provide stronger durability against
hardware failure.

~~~
_Lemon_
Redis can _only_ perform asynchronous replication because it uses a single
thread. It cannot block the main thread waiting for the network and still have
acceptable performance. This makes replication only as good as "appendfsync
no": you have no guarantees about what happened to the network write.

The upside of the design is that it makes things simple (e.g., transactions,
append only file).

(This is my understanding, please correct me if I'm wrong!)

~~~
antirez
Yes, it is correct that asynchronous replication is the only way Redis handles
replication. However, replication and durability are still on topic: if the
master burns in a fire, the slave will contain your data ;)

However, there are people who turn Redis async replication into synchronous
replication with a trick. They perform:

    MULTI
    SET foo bar
    PUBLISH foo:ack 1
    EXEC

Because PUBLISH is propagated to the slave, clients listening on another
connection to the right channel will get the ACK once the write has reached the
slave. Not always practical, but it's an interesting trick.

------
wildmXranat
As a long time Redis user, source code admirer and spectator of its evolution,
I have to say that I learned quite a lot about open source project management.

------
Androsynth
_One of the additional benefits of RDB is the fact for a given database size,
the number of I/Os on the system is bound, whatever the activity on the
database is. This is a property that most traditional database systems (and
the Redis other persistence, the AOF) do not have._

Can you expand on this? Specifically:

-Do you mean 'bound' as in 'limited by' or 'known'?

-Why are RDB snapshots I/O bound when other systems are not?

-Why is this an advantage?

~~~
antirez
If you write the DB to disk sequentially, every 5 minutes, the I/O you perform
is a fixed amount _regardless of the amount of writes you have against the
dataset_. For instance, using pipelining Redis can easily peak at 400k
operations per second, and you can have a few instances on the same box. In
this setup 5 minutes of data loss may be acceptable, even if you are writing 2
million records per second, and RDB makes this possible. The I/O performed will
always be proportional to the number of keys, not to the operations the
instances are receiving per second. With the Redis AOF, and generally with most
other databases, it is unlikely that you have an operational mode where the I/O
is simply proportional to the _size_ of the data set rather than to the amount
of reads/writes.
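
To put rough, made-up numbers on it (hypothetical dataset size, write rate, and
per-entry cost):

```python
# RDB-style I/O depends only on dataset size and snapshot interval,
# while AOF-style I/O grows with the write rate.
dataset_bytes = 2 * 1024**3    # assume a 2 GiB dataset
snapshot_interval_s = 300      # RDB snapshot every 5 minutes

writes_per_s = 400_000         # pipelined write peak
bytes_per_aof_entry = 50       # rough size of one logged command

rdb_io = dataset_bytes / snapshot_interval_s  # bytes/sec, fixed
aof_io = writes_per_s * bytes_per_aof_entry   # bytes/sec, grows with load

print(f"RDB: {rdb_io / 1024**2:.1f} MiB/s regardless of write rate")
print(f"AOF: {aof_io / 1024**2:.1f} MiB/s at this write rate")
```

Doubling the write rate doubles the AOF figure but leaves the RDB figure
untouched; only growing the dataset (or shortening the interval) changes it.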

~~~
jorangreef
What about a hybrid RDB/AOF option, where AOF is not written immediately but
every N seconds, using the latest delta?

------
johnkchow
antirez, thanks for the objective look at Redis's internals. As a young
engineer two years out of college, I feel these articles serve us young'uns
with lots of knowledge gaps most of all.

