Redis Sentinel beta released (antirez.com)
191 points by j4mie on July 23, 2012 | 23 comments

Redis has been working here on a high-traffic site without any trouble for more than a year. Excellent software. The only software I might believe is bug-free, thanks to the attitude of @antirez.

This thread will likely get a million "us too" style posts, but Redis is core infrastructure at Shopify. It's been so solid that we recently waived our defensive requirement that the app remain working if Redis goes down. This allowed us to port our inventory reservation system (a huge point of potential lock contention) completely to the new server-side Lua scripts. We have seen a full order of magnitude speed increase from this. A reservation for a complicated order is now measured in µs instead of ms.

I'll put "us too" here, as we too now assume Redis to be up like our relational store(s) -- Percona MySQL. ~16GB at ~500 ops/s averaged over the last 3 months. We've been running Redis for core features for more than 21 months, and as a store, it has been the most stable and the easiest to reason about in terms of the performance we can expect and actually receive. Compared with the "non-SQL" setups we deploy, if we were to start from scratch, we'd look to replace ActiveMQ, Solr, and a number of our jobs that we jam through MySQL.

Bravo to antirez and the Redis team.

I love it that you're doing it in Lua. Can you tell us more about it?

Redis now lets you write extensions in Lua, so it's not unusual. It would be interesting to hear more about it, though.

Which site is that?

brands4friends, No 1 Shopping Club in Germany, >4M members.

Hey Antirez, the link to hiredis in the announcement is broken. Replace "anirez" with "antirez".

thanks, fixed now.

Thanks Antirez, could you share more insight on TILT mode? Any other alternative approaches you considered? Why use a value of 30 seconds to leave TILT mode? If the time has shifted is it likely ever to be correct thereafter?

Hello, basically 30 seconds is the exit time only if no other time shifts are detected; otherwise we reset the exit time to now + 30 seconds, and so forth.

30 seconds is set as 3 times the biggest period we have in info collection (INFO itself is sampled every 10 seconds, while PING is sent every 1 second). So if there was a problem with the timer, within 30 seconds we are sure the new state will get fresh readings for every kind of request and information we collect. When TILT mode is exited and the function that evaluates the state is called again, it should see clean values.

Note that from the point of view of Sentinel it is ok if the new time is wrong compared to the real time, since we never use the absolute time. All we need is a computer clock that advances more or less regularly.
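
To make the timer logic above concrete, here is a minimal Python sketch. It is not Sentinel's actual code. The 30 second TILT period and the 10 second / 1 second sampling periods come from the comment above; the `TiltGuard` class, its method names, and the 2 second threshold used to decide that "a time shift happened" are assumptions made for illustration, since the thread does not say how a shift is detected.

```python
INFO_PERIOD_S = 10.0  # from the thread: INFO is sampled every 10 seconds
PING_PERIOD_S = 1.0   # from the thread: PING is sent every second
# 3 times the biggest collection period, i.e. 30 seconds.
TILT_PERIOD_S = 3 * max(INFO_PERIOD_S, PING_PERIOD_S)

# Assumption (not stated in the thread): a gap between two timer ticks that
# is negative or larger than this counts as a time shift.
MAX_TICK_DELTA_S = 2.0


class TiltGuard:
    """Hypothetical helper that tracks whether we are in TILT mode."""

    def __init__(self, now):
        self.last_tick = now
        self.tilt_exit_at = None  # None means "not in TILT"

    def tick(self, now):
        """Call on every timer tick.

        Returns True when it is safe to evaluate the monitored state,
        False while in TILT mode.
        """
        delta = now - self.last_tick
        self.last_tick = now
        if delta < 0 or delta > MAX_TICK_DELTA_S:
            # Time shift detected: enter TILT, or push the exit forward
            # to now + 30 seconds if we were already in TILT.
            self.tilt_exit_at = now + TILT_PERIOD_S
        elif self.tilt_exit_at is not None and now >= self.tilt_exit_at:
            # A full quiet period has passed: leave TILT.
            self.tilt_exit_at = None
        return self.tilt_exit_at is None
```

Only differences between successive readings are used, which matches the point that the absolute time never matters. The caller is expected to invoke `tick()` at a regular interval shorter than the threshold.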

I think this could also be used for monitoring and notification of other services in a cluster, not just Redis instances!

I imagine there is some overlap in parts of the logic for Sentinel and repmgr (a similar tool for PostgreSQL). For example, checking to see if members of the cluster are in-service, and choosing a new master in the event of a failover.

I would love to see a generic tool for handling the clustering/failover problem.

It's true there is some overlap, but Sentinel also uses things that are specific to Redis. For instance, two things are crucial for us:

1) The ability to use the master as a message bus to auto-discover things. This is possible because every Redis instance is also a Pub/Sub server.

2) The idea that after every restart of every Redis instance we have a "runid" that changes.

And in general, the logic of the failover itself, and the fact that failure detection is precise (some specific reply codes are treated one way, others another way), make a generic solution much harder to implement: the "methods" that perform the service-specific tasks may end up being complex, or a missing feature (such as the lack of Pub/Sub) may force a complete change in the logic of the system.
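
As a rough illustration of point 2, the sketch below shows how a monitor could notice a restart by watching the run ID. It assumes the run ID is read from the `run_id` field of the INFO reply; `parse_info`, `RestartDetector`, and the addresses are hypothetical names for this example, and nothing here is meant to describe what Sentinel does internally with the value.

```python
def parse_info(text):
    """Parse an INFO-style reply ("key:value" lines, "# Section" headers)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


class RestartDetector:
    """Hypothetical helper: remembers the last run ID seen per instance."""

    def __init__(self):
        self.known = {}  # address -> last seen run_id

    def observe(self, addr, info_text):
        """Return True if the instance at addr restarted since the last call."""
        run_id = parse_info(info_text).get("run_id")
        if run_id is None:
            return False  # no run ID in this reply, nothing to compare
        previous = self.known.get(addr)
        self.known[addr] = run_id
        return previous is not None and previous != run_id
```

A service without an equivalent per-process identifier would need some other way to tell "same process, still up" from "new process at the same address", which is part of why a generic tool is harder.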

Can anyone comment if this criticism is valid?

"In which monitoring agents rely upon correct, truthful, answers about cluster state from the system they're monitoring"

Original: https://twitter.com/cscotta/status/227515068030537728

It is not :) Just replied:

"@cscotta Hi, you misunderstood how it works: Pub/Sub is only used for discovery on startup. Sentinel-to-sentinel p2p for critical stuff."

Pub/Sub is used to make the configuration simpler when you start a Sentinel cluster from cold, at a time when everything is working and your master is ok.

This allows us to auto-discover the other Sentinels, to check the slaves, and so forth.

Instead, to determine whether a system is down, which Sentinel performs the failover, and for all the other critical stuff, Sentinel-to-Sentinel messages are used, regardless of whether the master's Pub/Sub works.

> a known bug in the hiredis library that can make Sentinel crash from time to time, but it's not a problem with Sentinel itself

Surely if a bug in a client can crash your server that's a bug in the server by definition even if the client is also buggy?

Sorry, I was not clear. Sentinel itself uses the hiredis C library to talk with other Redis instances. A bug in the C library crashes the library and the process it is running in.

Ah, I see. That makes sense. Thanks.

I've written a little experimental Python client that connects to a Sentinel and keeps an image of the state of the monitored cluster as it changes. http://bit.ly/NNrQdI

The boldface isn't showing up correctly for me. Viewing on Chromium.

That sounds sweet. Congrats. Redis has been of great use to me.
