

Scaling Redis at Twitter [video] - caniszczyk
https://www.youtube.com/watch?v=rP9EKvWt0zo

======
mrmondo
I would be most interested in how you manage failover; Sentinels are probably
the worst failover mechanism I've used in a long time.

~~~
antirez
I'm very interested in understanding what your main pain points with Sentinel
are; that would be very useful feedback. Since the redesign, the well-specified
behavior, and the incremental fixes for the inevitable bugs in the
implementation, many Redis users are happy with it, but it is always a good
idea to understand the other side of the user base.

~~~
mrmondo
Sorry for the very late reply here. We feel that Sentinels are clearly an
afterthought, hacked onto the side of Redis to tick the failover box. My main
pain point is that they actively rewrite their config files kept in /etc,
which is appalling behavior - especially if you want to manage your
configuration files with some sort of automation such as Puppet. The actual
failover process is slow no matter how you tune it, and because of this the
window for error during failover is large. Data loss with Redis failover seems
not just possible but probable.

~~~
antirez
I'll try to reply to the different parts of your comment:

1) "Clearly an afterthought": sorry, that's not an actual argument.

2) Actively rewriting the configuration: because of this (and the fact that
the configuration is fsynced), the basic Sentinel guarantees hold even if the
machine running the Sentinel process crashes. Sentinel (and Redis) provide a
full API for runtime reconfiguration, so I believe there are many ways to
create automated deployments. For people using Puppet, it is possible to use
include files; it is not perfect, but it works most of the time. This is a
design decision, not a shortcoming, IMHO.
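To make the point concrete, here is a sketch of how a Sentinel configuration
file typically evolves (the master name, ports, and quorum below are
illustrative, not taken from the thread). An operator-managed section can be
templated by Puppet, while Sentinel appends and rewrites its own state lines
to persist discovered slaves and the current epoch across process crashes:

    # Operator-managed section (e.g. templated by Puppet)
    port 26379
    sentinel monitor mymaster 127.0.0.1 9000 2
    sentinel down-after-milliseconds mymaster 5000

    # Lines like the following are added and rewritten by Sentinel itself,
    # persisting runtime state so guarantees survive a Sentinel crash:
    sentinel known-slave mymaster 127.0.0.1 9001
    sentinel current-epoch 1

Runtime reconfiguration without touching the file is also possible via the
SENTINEL SET command, e.g.:

    $ redis-cli -p 26379 SENTINEL SET mymaster down-after-milliseconds 2000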

3) Redis Sentinel has one of the fastest failovers you can find; I believe
you tried a _very_ old version. For example, setting down-after-milliseconds
to 2000 (a 2-second timeout before considering the master as failing - note
that this value does not change the failover speed itself; you could set it
to 60 seconds, and the point is how fast the failover is _after_ those 60
seconds):

    $ date; redis-cli -p 9000 debug segfault
    Tue Sep 23 18:16:43 CEST 2014
    Error: Server closed the connection

    Slave log: 2359:M 23 Sep 18:16:46.235 * MASTER MODE enabled (user request)

The slave was elected after 3 seconds, so the failover happened in 1 second.

    2719:X 23 Sep 18:16:45.993 # +odown master mymaster 127.0.0.1 9000
    ... snip more logs here ...
    2719:X 23 Sep 18:16:47.014 # +promoted-slave slave 127.0.0.1:9001

As you can see, at ~:46 the odown was reached, and by ~:47 the slave had
already been promoted to master (acknowledged via INFO output processing, not
just sent).
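For reference, a timing experiment like the one above would correspond to a
Sentinel configuration along these lines (the master name and ports match the
logs; the quorum and failover-timeout values here are assumptions for the
sketch, not given in the thread):

    # Monitor the master shown in the logs; quorum of 2 is illustrative.
    sentinel monitor mymaster 127.0.0.1 9000 2
    # The 2-second detection timeout discussed above.
    sentinel down-after-milliseconds mymaster 2000
    # Upper bound on how long a single failover attempt may take (illustrative).
    sentinel failover-timeout mymaster 10000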

4) Sentinel data loss has nothing to do with the speed of the failover (which
is very fast), but mostly with partitions. The distributed system you obtain
by combining Sentinel + Redis data stores is an eventually consistent system
where the merge function uses the data set of the master with the greatest
configEpoch (the latest one promoted by a majority of Sentinels). This, plus
the fact that Redis uses asynchronous replication, means that isolated
masters and their clients can process writes that will later disappear.
However, Redis has configuration options (well documented in the official
Sentinel doc) to bound the window of lost writes in a minority partition. The
options I'm talking about allow an isolated master to stop accepting writes
after no acknowledgment has been received from its slaves for some time.
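A minimal sketch of the options being referred to, in the master's
redis.conf (the specific values are illustrative, not recommendations):

    # Stop accepting writes unless at least 1 slave is connected and has
    # acknowledged replication traffic within the last 10 seconds. An
    # isolated master in a minority partition therefore refuses writes
    # after ~10 seconds, bounding the window of lost writes.
    min-slaves-to-write 1
    min-slaves-max-lag 10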

------
devanti
Just wondering: is there a reason why Twitter doesn't use one of the many
distributed in-memory database solutions? It seems like they had to write a
lot of custom layering on top of Redis just to scale.

~~~
NathanKP
At a certain point of complexity and scale, an in-house, custom distribution
layer is almost always going to outperform a general-purpose distribution
system built into the database.

General-purpose distributed database clusters are progressing, and if Riak,
or one of the other in-memory-cluster-focused systems, had been stable and
production-ready when Twitter was developing its cache layer, it might have
been a strong contender.

However, Riak is still much, much slower than Redis, especially when it comes
to accepting writes. Overall, when you have the money and the team that
Twitter does, you can come up with something in-house that is more efficient
for your use case. And that's what they've done here by building on top of
Redis.

~~~
rch
One could conceivably provide a Redis backend for Riak, if one were so
inclined.

