

Redis at Bump: many roles, best practices, and lessons learned - jmintz
http://devblog.bu.mp/how-we-use-redis-at-bump

======
antirez
Thank you for writing this article. As a way to show my appreciation I want to
focus on the bad side of the matter.

The article mentions that with AOF persistence there is a problem with fsync.
I'll try to go into further detail here.

Basically when using Redis AOF you can select among three levels of fsync:
'fsync always', which calls fsync after every command, before returning the OK
message to the client. Bank-grade assurance that data was written to disk, but
very, very slow. Not what most users want.

Then there is 'fsync everysec', which just calls fsync every second. This is
what most users want. And finally 'fsync never', which lets the OS decide
when to flush things to disk. With the default Linux configuration, writing
buffers to disk can be delayed up to 30 seconds.

So with 'fsync never' there are no problems, everything will be super fast,
but durability is not great.

With 'fsync everysec' there is the problem that from time to time we need to
fsync. Guess what? Even if we fsync in a different thread, write(2) will block
anyway.

Usually this does not happen, as the disk is idle. But once you start
compacting the log with the BGREWRITEAOF command, the disk I/O increases as
there is a Redis child trying to perform the compaction, so the fsync() will
start to be slow.

How to fix that? For now, Redis 2.2 introduced an option
(no-appendfsync-on-rewrite) that skips fsync on the AOF file while a
compaction is in progress.
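In redis.conf terms, the fsync policies and the 2.2 rewrite option discussed here look like this (standard option names, shown with the everysec policy most users want):

```
appendonly yes

# one of: always | everysec | no
appendfsync everysec

# Redis 2.2+: skip fsync on the AOF while a BGREWRITEAOF
# compaction is saturating the disk
no-appendfsync-on-rewrite yes
```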

In the future we'll try to find even Linux-specific ways to fsync without
blocking. We just want to tell the kernel: please flush the current buffers,
but even while you are doing so, new writes should go into the write buffer,
so don't delay new writes just because an fsync is still in progress. This
way we can simply fsync every second in another thread.

Another option is to write+fsync the AOF log in a different process, talking
with the main process via a pipe. Guess what? The current setup at Bump is
somewhat doing this already with the master->slave setup. But there should be
no need to do this.

So surely things will improve.

About diskstore: this is, I think, better suited for a different use case,
that is, big data, much bigger than RAM, but mostly reads, and the need to
restart the server without loading everything into memory. So I think Bump is
already using Redis in the best way; we just need to improve the fsync()
process.

~~~
jamwt
Hi Antirez, thanks again for Redis. Despite our few problems with it, it
rocks. A few comments:

> With fsync everysec, there is the problem that from time to time we need to
> fsync. Guess what? Even if we fsync in a different thread, write(2) will
> block anyway

Yep, but this could be avoided if a thread were devoted to all I/O, including
write() (and then line-level buffering really would be possible as well).
Communication with this thread would be on a thread-safe queue--the main
thread would never block on disk I/O, and only two threads would mean mutex
contention for the queue lock would be low. This would be one solution,
correct? This is a variation of your "two processes + pipe" suggestion.
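A minimal sketch of that dedicated-I/O-thread idea (names like AppendLogWriter and the fsync-when-idle heuristic are illustrative, not Redis internals): the main thread only touches a thread-safe queue, while a single writer thread performs every write() and fsync().

```python
import os
import queue
import tempfile
import threading

class AppendLogWriter:
    """All disk I/O (write + fsync) happens on one thread; producers
    only touch a thread-safe queue and never block on the disk."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.q = queue.Queue()
        self.t = threading.Thread(target=self._run, daemon=True)
        self.t.start()

    def append(self, record: bytes):
        # Never blocks on disk; only on the (lightly contended) queue lock.
        self.q.put(record)

    def close(self):
        self.q.put(None)  # sentinel: flush and stop
        self.t.join()

    def _run(self):
        while True:
            item = self.q.get()
            if item is None:
                os.fsync(self.fd)
                os.close(self.fd)
                return
            os.write(self.fd, item)
            if self.q.empty():  # cheap stand-in for "fsync every second"
                os.fsync(self.fd)

path = os.path.join(tempfile.mkdtemp(), "aof.log")
log = AppendLogWriter(path)
for i in range(3):
    log.append(f"SET key{i} val{i}\n".encode())
log.close()
print(open(path).read())
```

With only one producer and one consumer of the queue, mutex contention stays low, which is the property the comment above is counting on.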

> How to fix that? For now we introduced in Redis 2.2 an option that will not
> fsync the AOF file while writing IF there is a compaction in progress.

Well, we enabled that, but we found that it's still a problem in a couple of
circumstances:

1. Something _other than the AOF recompaction_ makes the disk busy. Like,
say, even a moderate amount of disk activity by another process.

2. Redis's own logging to stdout, if redirected to a file, can itself cause
the redis main thread to block if stdout is being flushed onto a busy disk.

Basically, if any I/O which may hit a disk (AOF record/flush or even logging)
is being done on the single epoll-driven thread redis uses to process incoming
requests, the system must make _very_ good guarantees that those I/O calls
will not block. We have found these guarantees practically impossible to make
on a very busy master, so we've given up on having the master do AOF work
altogether.

~~~
antirez
Thanks for the in-depth reply,

Exactly: the logging could well be done in a thread for better performance.
Thanks for the hint!

About the other scenarios where fsync will perform poorly: indeed, every
other I/O is going to be a problem.

I guess the "all the AOF business in a different thread" approach is probably
the most sensible one to follow, unless there is a (possibly Linux-specific)
syscall that is able to avoid blocking and just force a commit of old data.

------
rbranson
I don't get what is so difficult about AMQP. These are clearly talented
programmers, so what gives? Even if you're not a bank, simple features like
message timeouts can make your infrastructure tremendously more resilient.

~~~
jamwt
I'd chalk it up to the general benefits of eschewing needless complexity.

I can say, empirically, that none of the many, many challenges we've had
building and scaling Bump have been related to Redis's capabilities as a
messaging bus. So "good enough" wins again.

~~~
rbranson
FWIW, I'd encourage you to still take a deeper look at AMQP, just because it
includes features you may not know you need. While I can't pretend to know
anything about your scaling challenges or the intimate details of how
messaging is utilized, I can say that the lessons of deep experience with
messaging are baked into AMQP. You may have to implement (or perhaps already
have implemented) some of these features yourselves. I know that I gave AMQP
the cold shoulder for a while, only to finally come around and find that it
solved many of the frustrations we were facing.

~~~
antirez
Probably the lower complexity also means a much larger number of messages
per core.

With Redis in an entry level Linux box you can process 100k messages per
second per core. I'm not sure if current AMQP systems can handle this amount
of messages with commodity hardware.

Another reason Redis can be a good approach, I think, is that it is a simple
system: simple to understand, simple to deploy, very stable. There is support
for Pub/Sub. It also supports many other data types, so, for instance, want a
priority queue? Do it with a sorted set. Want some special semantics for your
message passing? You can probably do it with BRPOPLPUSH. And so forth.
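The sorted-set-as-priority-queue idea can be sketched in plain Python (a toy stand-in for ZADD plus ZRANGE/ZREM; the class and method names here are invented for illustration, not a Redis client):

```python
import bisect

class ZsetPriorityQueue:
    """Toy model of a Redis sorted set used as a priority queue:
    zadd = ZADD score member, pop_lowest = ZRANGE key 0 0 + ZREM."""

    def __init__(self):
        self._items = []  # kept sorted as (score, member) pairs

    def zadd(self, score: float, member: str):
        bisect.insort(self._items, (score, member))

    def pop_lowest(self) -> str:
        # In real Redis this ZRANGE+ZREM pair would be wrapped in
        # MULTI/EXEC (or a Lua script) to stay atomic.
        score, member = self._items.pop(0)
        return member

q = ZsetPriorityQueue()
q.zadd(5, "low-priority-job")
q.zadd(1, "urgent-job")
q.zadd(3, "normal-job")
order = [q.pop_lowest() for _ in range(3)]
print(order)  # ['urgent-job', 'normal-job', 'low-priority-job']
```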

A common case of this flexibility is shown by RPOPLPUSH (without the "B", so
the non-blocking version). Using this command with the same source and
destination list will "rotate" the list, providing a different element to
every consumer, but the data will remain on the list. At the same time,
producers can push onto the list. This has interesting real-world applications
when things must be processed again and again (fetching RSS feeds, for
instance).
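The rotation semantics of RPOPLPUSH with the same source and destination can be simulated with a plain Python deque (this models the command's behavior; it is not a Redis client):

```python
from collections import deque

def rpoplpush_same_key(lst: deque) -> str:
    """Model RPOPLPUSH key key: pop the tail element, push it onto
    the head, and return it. Nothing ever leaves the list."""
    item = lst.pop()
    lst.appendleft(item)
    return item

# RPUSH feeds a b c  ->  head-to-tail order: a, b, c
feeds = deque(["feed:a", "feed:b", "feed:c"])

# Each call hands a different feed to a consumer, while producers
# could keep RPUSHing new feeds; after 3 calls we cycle back to c.
served = [rpoplpush_same_key(feeds) for _ in range(4)]
print(served)        # ['feed:c', 'feed:b', 'feed:a', 'feed:c']
print(list(feeds))   # ['feed:c', 'feed:a', 'feed:b']
```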

So Redis is a pretty flexible tool for messaging, and I think there are for
sure great use cases for AMQP but with big overlaps with Redis, and also with
use cases where Redis is definitely a better alternative.

~~~
rbranson
Given the same semantics, RabbitMQ pushes 100k+ per core as well. It utilizes
multi-core machines quite effectively too. Obviously, as you tune knobs for
more durability or reliability, you're going to take a performance hit. IMHO,
Redis is a good tool for messaging if you already use Redis in a big way and
don't need reliable delivery or any of the advanced AMQP routing. In sort of
the same way a table in MySQL can be used as a queue if you already have a
MySQL server and you're only going to be pushing a few hundred messages per
minute.

~~~
ezmobius
I call BS on rabbit pushing 100k messages per core, and I really like rabbit.
But it is nowhere near 100k messages/sec per core.

AMQP does have extra capabilities and is a good messaging system and has
advanced routing features, but you need to learn what exchanges, queues and
bindings are and how they relate before it is useful.

I've used rabbit in production on multiple systems and it is still running on
some of those. But I have switched to redis for most of my messaging needs
because of the built-in data structures and persistence. That makes it a much
more versatile server, and it is much easier to admin and much more stable.
Rabbit is too easy to push over when you run it out of memory.

But I had to chime in and refute your 100k messages per core on rabbit. 20k
maybe with the java client, more like 5-7k with a ruby or python client.

I can still get 80k/sec with a ruby client on redis.

The two servers are very different, redis is a swiss army knife of data
structures over the network, that is why it is so useful. AMQP and rabbit are
targeted more at enterprise messaging and integration where raw speed doesn't
matter as much as complex hierarchies of brokers and middleware.

~~~
foobarbazetc
We do 120K messages per core per second with QPid (RHM), which includes an AIO
transactional journal and multi-master clustering. Woo. :)

------
LiveTheDream
In January, Bump reported allocating a whopping 700GB of RAM for redis[1].

[1] <http://devblog.bu.mp/haskell-at-bump>

~~~
ahuibers
We (Bump) have 12 redis machines now with 72 or 96GB each. 6 masters and 6
slaves. The slaves are hot spares and persist to disk, per the blog.

~~~
rkalla
Given the inherently small footprint of Redis, your data sets are HUGE.
Looking forward to reading the Mongo article when it is ready, and how it is
performing.

------
simonw
I really like the idea of pushing log messages in to a redis list and then
flushing them out to disk with another process.

I've often thought it would be useful to have a redis equivalent of MongoDB's
capped collections, specifically to make things like recent activity logs
easier to implement. At the moment you can simulate it with an rpush followed
by an ltrim, but it would be nice if using two commands wasn't necessary.
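The RPUSH-then-LTRIM capped-log pattern can be modeled in plain Python (the function below simulates the two Redis commands; with a real client it would be an rpush followed by an ltrim with negative indexes):

```python
def rpush_ltrim(log: list, entry: str, cap: int = 5) -> list:
    """Model RPUSH key entry; LTRIM key -cap -1: append, then keep
    only the last `cap` elements (Redis negative indexes count from
    the tail)."""
    log.append(entry)   # RPUSH
    del log[:-cap]      # LTRIM key -cap -1
    return log

events = []
for i in range(8):
    rpush_ltrim(events, f"event-{i}")
print(events)  # ['event-3', 'event-4', 'event-5', 'event-6', 'event-7']
```

The downside the comment points out remains in this model too: the append and the trim are two separate steps rather than one capped-push operation.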

~~~
timr
I don't know...that part sounded like a hacked-up, half-implementation of
scribe: <https://github.com/facebook/scribe/wiki>

I'd be interested in hearing if they tried to use Scribe for the same task and
found it wanting, or if there's some other story.

~~~
jamwt
Could you say more about why Bump's implementation of network-based queued
logging is "hacked-up" while facebook's (by implication) isn't?

To answer your question, simply put, no one here had heard about Scribe.

~~~
timr
_"Could you say more about why Bump's implementation of network-based queued
logging is "hacked-up" while facebook's (by implication) isn't?"_

Well, mainly because Scribe was purpose-built to do log aggregation on a large
scale, and has nice features to prevent data loss in the event of network and
node failure. It's also pretty well-tested at this point, given its origins
and community. Check the wiki to which I linked.

I didn't mean my comment to imply anything negative. I was just trying to
point out to the parent comment that there's now a better option than rolling
a custom log aggregator on top of Redis. That may not have been true when you
started your system. Mea culpa.

------
ladon86
OK, so I'm running mongodb on the same machine as redis.

I do have mongodb replicated across two other machines, but could you briefly
shed light on what the problems between redis and mongo on a single box were?

~~~
wmoss
The quick answer is that MongoDB mmaps its entire data set, so if you've got
more data than RAM (likely), the OS is going to constantly have all excess
RAM allocated to Mongo. This becomes an issue because Redis (very reasonably)
assumes that malloc will return quickly; however, if the OS decides it's
going to give Redis a dirty page, that malloc call just became disk-bound.

------
sfphotoarts
I was curious about the Redis sets used for social graph storage and using
intersection of sets to find nodes in common. Would anyone have an idea how
the complexity of this scales as both the number of nodes in the graph and
the number of edges each node has grow?

~~~
delano
You can find out: write a script to populate sets in Redis and run some
commands. You'll likely find it's fast enough for your needs.
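A back-of-the-envelope version of that script using plain Python sets (the graph here is random, invented data; Redis documents SINTER as O(N*M) worst case, with N the cardinality of the smallest set and M the number of sets):

```python
import random

random.seed(7)

# Hypothetical social graph: one set of friend ids per user, the way
# one Redis set per user would store them.
graph = {u: set(random.sample(range(10_000), 500))
         for u in ("alice", "bob")}

# SINTER alice bob: cost grows with the smaller set's cardinality,
# so the heaviest nodes (most edges) dominate the bill.
common = graph["alice"] & graph["bob"]
print(len(common))
```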

------
moe
Just a note about logging: it seems you're making it harder than it needs to
be. Syslog supports remote logging.

