Redis PSYNC2 bug post mortem (antirez.com)
122 points by antirez 8 months ago | 30 comments



Thanks for all you do antirez!


> the tests never try to mix PSYNC2 with replication of Lua scripts

This sort of testing gap is so hard to avoid. Any complex system has an inexhaustible number of potential feature interactions. It's very difficult at best to guess which combinations need to be tested, and often difficult to even write a test which exercises a particular combination.

What techniques do people use to ensure test coverage of complex cross-feature interactions like this? Two things I emphasize:

* Emphasize integration tests, and try to use multiple features in each test. This can help drive out interaction problems; the downside is that these tests are harder to write and (especially) maintain. It can also be difficult to exercise specific code paths.

* Write randomized tests, that exercise as many features as possible. The challenge here is identifying failures. "The program didn't assert or crash" is often a good start. A gold standard is to run a simple reference implementation alongside your real implementation, and compare the output... but that's not always possible.
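As an illustration of that gold-standard approach, here is a minimal differential-testing sketch. The `SortedList` class is a made-up system under test (not anything from Redis); the reference is a plain Python list kept sorted, and both are driven by the same seeded random stimulus:

```python
# Randomized differential testing: feed the same random operations to the
# implementation under test and to a trivially-correct reference, and
# compare results after every step.
import bisect
import random

class SortedList:
    """Hypothetical implementation under test."""
    def __init__(self):
        self._items = []
    def add(self, x):
        bisect.insort(self._items, x)
    def items(self):
        return list(self._items)

def reference_add(ref, x):
    # Trivially-correct reference: append, then re-sort.
    ref.append(x)
    ref.sort()

seed = 12345                # log this so any failure is reproducible
rng = random.Random(seed)

real, ref = SortedList(), []
for _ in range(1000):
    x = rng.randint(0, 100)
    real.add(x)
    reference_add(ref, x)
    assert real.items() == ref, f"divergence with seed {seed}"
print("ok")
```

The reference is deliberately slow and obvious; its only job is to be easy to trust.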


If you mean randomized tests as in adding random choices to your tests, I think that is a very bad idea (that I have been bitten by before).

You NEED tests to behave consistently. If you run the tests twice against the same code, they need to return the same result. Otherwise, how will you know you have fixed the issue the tests caught?

Speaking from experience, soon after you introduce randomness to tests, you will start to get intermittent test failures. When that happens, people will start to ignore failing tests because "hey, this is probably just a random failure".


It's possible to have both randomness and reproducibility: generate a random seed, log it, then explicitly seed the RNG. If you encounter a failure you suspect is caused by a particular random seed, temporarily hard-code that seed while you debug it.

The "that's an intermittent failure, just re-run it and it usually passes" attitude is in my experience more likely due to just plain old poorly-written tests, usually with a time.sleep() or something similar that makes them unreliable.


Seeded RNG.


Exactly, you use a fixed RNG seed so that the behavior is reproducible.

For complex multithreaded code, it can become difficult to achieve consistent execution, but a fixed seed is at least a good start.


So not really random, then? What is the point of adding an RNG if it is going to output the same values every time?


Because then the testing stimulus is not hard coded anymore, but generated. It's then trivial to run longer sequences, or tweak the test.

In hardware design, where reliability is key (making a chip costs multiple millions and six months of time), there is only random testing (with a seed for reproducibility), not just the bunch of cases that the person testing thought of that day.


If you generate 10k test cases that find edge cases you wouldn't have normally found, that's value. If the framework is deterministic and always runs the same 10k test cases, that's generally valuable as well.

I'm not aware of anyone currently looking to ensure their test cases are generated using cryptographically random data.


What techniques do people use to ensure test coverage of complex cross-feature interactions like this?

The Feynman Algorithm: http://wiki.c2.com/?FeynmanAlgorithm

It's the uncomfortable truth we like to dance around. Just accept you can't plan for everything. Think carefully before making design decisions. Build in contingency plans in case things go wrong.


This. Fuzzing and integration tests are what usually signal a regression.


We're all human – appreciate the write-up.


Is Redis a one-man show? (plus contributors)

I am still using memcached, but consider Redis in future.

I am looking for a Redis-based high-performance message-queue that can be filled from Nodejs and consumed with PHP. Basically a high-performance message-queue that doesn't need dozens of servers to start with.
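One common answer to this is Redis's own list-as-queue pattern: producers RPUSH jobs onto a list, consumers BLPOP them off, and the same two commands work from Node.js and PHP clients alike. A minimal sketch, using a tiny in-memory stand-in for the Redis client so it runs without a server (with redis-py you would use `redis.Redis()` and the same two calls):

```python
# Sketch of the Redis list-as-queue pattern: RPUSH to enqueue, BLPOP to
# dequeue. FakeRedis below is a hypothetical in-memory stand-in, not a
# real client library.
import collections

class FakeRedis:
    """Minimal stand-in implementing just RPUSH/BLPOP semantics."""
    def __init__(self):
        self.lists = collections.defaultdict(collections.deque)
    def rpush(self, name, value):
        self.lists[name].append(value)
        return len(self.lists[name])
    def blpop(self, name, timeout=0):
        # The real BLPOP blocks until an item arrives or timeout expires;
        # this stand-in just pops immediately, or returns None when empty.
        if self.lists[name]:
            return (name, self.lists[name].popleft())
        return None

r = FakeRedis()

# Producer side (e.g. the Node.js process): enqueue a job.
r.rpush("jobs", '{"task": "send_email", "to": "user@example.com"}')

# Consumer side (e.g. the PHP worker): block until a job is available.
key, job = r.blpop("jobs", timeout=5)
print(job)
```

Because BLPOP blocks, the PHP workers can just loop on it with no polling, and a single Redis server handles this pattern comfortably.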


Hello frik, Redis at this point is still handled by me directly for the most part, but we recently (a few days ago) got a new paid OSS contributor, https://github.com/artix75. Fabio works part time but 100% on Redis OSS. Additionally I receive help from Redis Labs core developers, and from an impressive Chinese community; see for instance the work made recently by http://github.com/soloestoy in pull requests about bug fixing and improvements. Plus other spontaneous contributions from other companies and individuals. I continue to be the person that writes most of the new code, basically, but without the contributions that bring stability, testing and investigations to Redis, it would not be practically possible to continue the development alone. Not now that we have so many subsystems: data structures, scripting, modules, replication, cluster, sentinel, persistence, and so forth.

About message queues, I recently developed one called "Disque", but this is going to be totally ported to Redis, as a module, during the first two quarters of 2018. Otherwise there are many other solutions, many of them based on Redis itself.


For your use case, RabbitMQ will probably suffice. It can work very well with a single server, and, if you set it up in memory-only mode, is at least competitive with (if not superior to) Redis in terms of performance. If you use Rabbit's push-based subscriptions fully, it will far exceed Redis (edit: is Redis doing push-based pubsub these days? I'm told by a coworker that it is). I only mention this since you said "high-performance message-queue" in your comment; the out of the box performance of Redis and RMQ with zero tuning is more than enough for most people.

While Redis's protocol is simpler, it is a full in-memory database (with optional persistence) plus optional queue/pubsub extensions. RabbitMQ aims to be a full queue/pubsub system only, with in-memory and persistent options.

RabbitMQ's replication and durability features, if you want to add more servers, are also much longer-standing and more battle-tested (though far from perfect; I'm looking at you, "pause-minority" failures). Redis's persistence is quite good these days, though, so that's less of a competitive point.

RabbitMQ's setup is on par with Redis for simplicity. Client libs exist for PHP and Nodejs, and, while the protocol is more complex than Redis's, that usually just means "copy lines from $how_to_guide for startup/shutdown and then just publish/consume like you'd expect".

If I wanted a cache server that I occasionally needed to subscribe to, I'd use Redis no question. For a performance-oriented queue that needed either durability or throughput scalability, I'd start with Rabbit.


I would recommend checking out Faktory (https://github.com/contribsys/faktory), by the author of Sidekiq. It's fairly new but looks interesting.

For something production-tested already, I've found RabbitMQ to be very easy to operate as a single server. You obviously don't get HA with a single server, but it's been a breeze to manage.


Try https://github.com/antirez/disque. Written by the same guy behind Redis.

> Is Redis a one-man show? (plus contributors)

Well, yes, started by one man, and he continues to lead it, with many contributions from the community.

Update: I fixed the wrong URL.


> Try https://github.com/resque/resque. Written by the same people behind Redis.

Huh? I don't think it is, Redis is by antirez .. https://github.com/resque/resque/graphs/contributors


Right, not the same devs. Resque originally was from GitHub folks. Now probably maintained by other people.


Oh damn, I meant https://github.com/antirez/disque, not resque. Apologies.


FWIW, it appears the current plan is to not continue it as a standalone app but turn it into a redis module: https://gist.github.com/antirez/a3787d538eec3db381a41654e214...


Yeah, kind of. High quality software though, recommended for many things :)


[flagged]



The reason for it, as well as the reason for it not being changed, is already explained in Redis' documentation:

https://redis.io/commands/slaveof

> A note about slavery: it's unfortunate that originally the master-slave terminology was picked for databases. When Redis was designed the existing terminology was used without much analysis of alternatives, however a SLAVEOF NO ONE command was added as a freedom message. Instead of changing the terminology, which would require breaking backward compatibility in the API and INFO output, we want to use this page to remind you that slavery is both a crime against humanity today and something that has been perpetuated throughout all human history.


Master/slave describes the relationship between the services and has been standard terminology for decades. It's pretty "lame" to push established projects to make confusing terminology changes to address whatever words the "peanut gallery" is upset about this week. I have yet to hear of any IRL victims of slavery that are miffed by the use of this terminology.

Do not be offended on someone else's behalf.


Why? This is precisely what it's been called for decades; to call it something else would introduce unnecessary confusion.


<sigh> this is the third time I've seen this come up on HN. Because for some people the word "slave" has bad connotations. Therefore it is a cultural norm in the USA not to use the word slave, even though as you say it has been in common usage in engineering (e.g. Master/Slave flip-flop going back to the 1950s). To use the word shows cultural insensitivity in some cultures.

Like you, I had not come across this issue when I landed as a fresh immigrant in California in 1996 to work on replication, but it only took a few minutes for someone to explain it to me, and we carried on saying "supplier/consumer" or "origin/destination" for the subsequent two decades without trouble.


This is a timely reminder that the USA is not the center of the universe, Antirez is Italian.


Those people should get over it. Master/Slave denotes a relationship. It can be bad if it's a relationship between humans, but it's fine when it's between servers.


A slave is a type of replica, but not all replicas are slaves.



