*you're kind of missing the problem here - who should be the master?* The choice...

notacoward · on Oct 29, 2014

"People are making it sound like this issue has been ignored for years."

Hasn't it been? How does leaving that latent in the system for years make things better? I rather think it reflects on an inability to reason about failure modes (including user failure modes), and deal with them proactively instead of after data was lost.

seiji · on Oct 29, 2014

Arguing about developers not being omnipotent isn't very stable.

The users intentionally configured their options and the system responded exactly as it should have, given what it was asked to do.

notacoward · on Oct 29, 2014

This isn't about about omniscience (not omnipotence BTW). This is about a far lower standard of basic diligence, expected and met by most people who work on data-storage systems. If you're given some data to store, and there's an obvious way to retain/recover that data despite and intervening failure, then failing to do that is a betrayal of the most basic trust people put in data-storage systems. Congratulations, you've implemented the distributed-system equivalent of linking fsck to mkfs. Well done. Go pat yourself on the back for conforming to your specification.

felixgallo · on Oct 29, 2014

I don't think you're understanding redis or this problem correctly.

Redis lets you have slaves which mirror the master. Hundreds of thousands of redis installations use this pattern to provide read scaling and offline master-loss persistence, and in the normal case, this works great. I myself have implemented systems with hundreds of redis instances which have gracefully survived the loss of the primary.

In this particular instance, the user turned off persistence, didn't understand the ramifications, and then brought the master back up with an empty database after a hard kill without thinking things through.

Fortunately, the user was savvy enough to have kept backups off the slaves, as is the usual pattern, and so was able to continue service.

This is not a normal pattern and goes against the general practice.

Does that help?

notacoward · on Oct 29, 2014

I understand what you're saying, but I don't think it's a sufficient reason to throw away data. I've seen hundreds of cases where a GlusterFS user went against our advice and did something that ended up making things worse. Sometimes they even lost data. Of course, they always blame us. I'm pretty sure people who have worked on every single data-storage system ever have had similar experiences. Sometimes the user is just wrong and it's their own fault. Sometimes they're right because we made it far easier for them to make things worse than to make things better. In those cases we have to stop making excuses like "user error" or "RTFM" or "against general practice" or whatever. We need to help the user by not handing them bags of explosives. Which do you think is a better choice here?

* Default to preserving already-replicated data, provide "clean start" as an option.

* Default to throwing away data, maybe-someday implement an option to use data that's already present in the system.

Blaming the user won't prevent another user from making the same mistake with the same result. Saner defaults, and an implementation to support them, will. Who's going to complain that you saved too much of their data?

felixgallo · on Oct 29, 2014

The defaults are sane, and in fact the user here had to explicitly turn them off in order to do the thing they wanted to do. Once you reach into a configuration file and change a setting, I can't think of a software system in the world that protects you from your choice. Could you maybe name a few?

notacoward · on Oct 29, 2014

The user turned off persistence. There's no reason for a normal person to suppose that also means ignoring data that's in the system when the master comes up. The fact that the two are inextricably tied to one another in the Redis implementation is not the user's mistake.

felixgallo · on Oct 29, 2014

What do you think a slave should do if it is told to replace its state with empty state? How about half-empty state? There's really no answer that's satisfying for every possible use case (certainly I don't want my slaves to refuse if I tell them to clear the database completely on purpose). And indeed you haven't given any examples of databases that try to do 'better'. I think that's because there aren't any.