

Redis Sentinel beta released - j4mie
http://antirez.com/post/redis-sentinel-beta-released.html

======
Uchikoma
Redis has been running here on a high-traffic site without any trouble for
more than a year. Excellent software. It's the only software I can think of
that might be bug-free, thanks to the attitude of @antirez.

~~~
xal
This thread will likely get a million "us too" style posts, but Redis is core
infrastructure at Shopify. It's been so solid that we recently waived our
defensive requirement that the app keep working if Redis goes down. This
allowed us to port our inventory reservation system (a huge point of potential
lock contention) completely to the new server-side Lua scripts. We have seen
a full order of magnitude speed increase from this. A reservation for a
complicated order is now measured in µs instead of ms.

~~~
primitur
I love it that you're doing it in Lua. Can you tell us more about it?

~~~
wahnfrieden
Redis now lets you write extensions in Lua, so it's not unusual. It would be
interesting to hear more about it, though.

------
spenrose
Hey Antirez, the link to hiredis in the announcement is broken. Replace
"anirez" with "antirez".

~~~
antirez
thanks, fixed now.

------
jorangreef
Thanks Antirez, could you share more insight on TILT mode? Any other
alternative approaches you considered? Why use a value of 30 seconds to leave
TILT mode? If the time has shifted is it likely ever to be correct thereafter?

~~~
antirez
Hello, basically 30 seconds is the exit time _only_ if no other time shifts
are detected; otherwise we reset the exit time to now+30sec, and so forth.

30 seconds is set as 3 times the longest period we have in info collection
(INFO itself is sampled every 10 seconds, while PING is sent every 1 second),
so that if there was a problem with the timer, within 30 seconds we are sure
the state will get fresh readings for every kind of request and information we
collect. When TILT mode is exited and the function that evaluates the state is
called again, it should see clean values.

Note that from the point of view of Sentinel it is ok if the new time is
wrong compared to the real time, because we never use the absolute time. All
we need is a computer clock that advances more or less regularly.
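
The exit rule described above can be sketched roughly as follows. This is an illustration, not Sentinel's actual code: the class name `TiltTracker`, the method `on_timer`, and the 2-second threshold for deciding that a timer delta "looks wrong" are assumptions made for the example. Only the 30-second exit period and the "reset the exit on every new shift" behaviour come from the explanation above.

```python
TILT_TRIGGER_MS = 2000   # assumption: a delta above this (or below zero) counts as a time shift
TILT_PERIOD_MS = 30000   # 30 seconds, i.e. 3 times the 10-second INFO period


class TiltTracker:
    """Hypothetical helper that tracks whether we are in TILT mode."""

    def __init__(self, now_ms):
        self.previous_ms = now_ms
        self.tilt = False
        self.tilt_exit_ms = 0

    def on_timer(self, now_ms):
        """Call on every timer tick; returns True while in TILT mode."""
        delta = now_ms - self.previous_ms
        self.previous_ms = now_ms
        if delta < 0 or delta > TILT_TRIGGER_MS:
            # Time shift detected: enter TILT, or push the exit forward again.
            self.tilt = True
            self.tilt_exit_ms = now_ms + TILT_PERIOD_MS
        elif self.tilt and now_ms >= self.tilt_exit_ms:
            # No further shifts for the whole period: readings should be clean.
            self.tilt = False
        return self.tilt
```

Note that only the difference between two consecutive readings matters here, so the sketch never depends on the absolute time being right.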

------
zerop
I think this could also be used for monitoring and notification of other
services in a cluster, not just Redis instances!

~~~
mryan
I imagine there is some overlap in parts of the logic for Sentinel and repmgr
(a similar tool for PostgreSQL). For example, checking to see if members of
the cluster are in-service, and choosing a new master in the event of a
failover.

I would love to see a generic tool for handling the clustering/failover
problem.

~~~
antirez
It's true there is some overlap, but Sentinel also uses things that are
specific to Redis. For instance, two things are crucial for us:

1) The ability to use the master as a message bus to auto-discover things.
This is possible because every Redis instance is also a Pub/Sub server.

2) The idea that after every restart of every Redis instance we have a "runid"
that changes.

And in general the logic of the failover itself, and the fact that the failure
detection is precise (some specific reply codes are treated one way, some
others another way), make a generic solution much harder to implement: the
"methods" that perform the service-specific tasks may end up being complex, or
a missing feature (for example, lack of Pub/Sub) may force you to completely
change the logic of the system.
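
As a rough illustration of point 2, a monitor could notice a restart by comparing the `run_id` field of successive INFO replies. The sketch below only parses INFO-style text (`key:value` lines, with `#` section headers) and does not talk to a real server; the names `parse_info` and `RestartDetector` are made up for the example and are not how Sentinel itself is written.

```python
def parse_info(text):
    """Parse INFO-style 'key:value' lines into a dict, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


class RestartDetector:
    """Hypothetical helper: remembers the last run_id seen for one instance."""

    def __init__(self):
        self.last_run_id = None

    def observe(self, info_text):
        """Return True if this reply carries a different run_id than the last one."""
        run_id = parse_info(info_text).get("run_id")
        if run_id is None:
            return False
        restarted = self.last_run_id is not None and run_id != self.last_run_id
        self.last_run_id = run_id
        return restarted
```

A caller would feed each INFO reply to `observe` and treat a `True` result as "this instance was restarted since we last looked".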

------
swah
Can anyone comment if this criticism is valid?

"In which monitoring agents rely upon correct, truthful, answers about cluster
state from the system they're monitoring"

Original: <https://twitter.com/cscotta/status/227515068030537728>

~~~
antirez
It is not :) Just replied:

"@cscotta Hi, you misunderstood how it works: Pub/Sub is only used for
discovery on startup. Sentinel-to-sentinel p2p for critical stuff."

Pub/Sub is used to make the configuration simpler when you start a Sentinel
cluster from cold, at a time when everything is working and your master is ok.

This allows us to auto-discover the other Sentinels, to check the slaves, and
so forth.

Instead, to determine whether a system is down, which Sentinel performs the
failover, and for all the critical stuff, Sentinel-to-Sentinel messages are
used, regardless of whether the master's Pub/Sub works.
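
To make the distinction concrete, here is a minimal sketch of a failure decision that depends only on what the Sentinels tell each other, never on the monitored master. The agreement rule (a simple quorum count) and the function name `master_is_down` are assumptions for illustration; this comment does not describe the actual message format or rule.

```python
def master_is_down(own_view_down, peer_views_down, quorum):
    """Return True when enough Sentinels, including this one, see the master as down.

    own_view_down:   this Sentinel's own opinion (bool)
    peer_views_down: opinions reported directly by other Sentinels (iterable of bool)
    quorum:          assumed number of agreeing Sentinels required
    """
    if not own_view_down:
        # We only ask for agreement about a failure we have observed ourselves.
        return False
    votes = 1 + sum(1 for down in peer_views_down if down)
    return votes >= quorum
```

Nothing in this decision reads from the master, so a broken or lying master cannot influence it.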

------
DRMacIver
> a known bug in the hiredis library that can make Sentinel crash from time to
> time, but it's not a problem with Sentinel itself

Surely if a bug in a client can crash your server that's a bug in the server
by definition even if the client is also buggy?

~~~
antirez
Sorry, I was not clear. Sentinel itself uses the hiredis C library in order to
talk with other Redis instances. A bug in the C library crashes the library
and the process it is running in.

~~~
DRMacIver
Ah, I see. That makes sense. Thanks.

------
dvirsky
I've written a little experimental Python client that connects to a Sentinel
and keeps an image of the state of the monitored cluster as it changes.
<http://bit.ly/NNrQdI>

------
jwuphysics
The boldface isn't showing up correctly for me. Viewing on Chromium.

------
arrowgunz
That sounds sweet. Congrats. Redis has been of great use to me.

