The general take-away from most of them is -- do not believe the vendors of distributed systems. Even systems that were meant to solve consensus (etcd, consul) failed:
It's also interesting to see the different responses of teams when problems are raised.
Some are very thankful for the analysis, others don't seem to appreciate how important correctness is in this type of distributed systems. I find it's a good indicator of the maturity of the teams working on them, and how much I'm willing to trust each system.
> I find it's a good indicator of the maturity of the teams working on them, and how much I'm willing to trust each system.
I hate to nitpick but I'm genuinely curious. How do you know it's a good indicator? It can definitely be _an_ indicator, but how do you know it was good? Did you have a follow-up experience that validated what the indicator indicated?
How a project performs today tells you the zeroth derivative of its location. Looking at the commit log tells you about the first derivative. How people react to Call Me Maybe when it's about their product gives you a lot of information about the second derivative.
One way is to note whether or not the response to the report dismisses it or hand-waves it away. Another response is to neglect the report for 18 months with no reply.
The tests that aphyr conducts are brutal, which you can see by casual inspection of his Call Me Maybe site or his talks. If a software team's response tries to divert his attention, or yours, away from the issue, it does not leave a good impression.
To be fair, he said it was a good indicator of the maturity of the team and how much he was willing to trust the systems. He didn't say it was a good indicator of the reliability of those systems.
IIRC, Riak only did well with the CRDT data types. That result, and the rather well-done presentation by Peter Bourgon about CRDTs and the time-series database they built with them at SoundCloud, made me look into it more, and it seems a quite compelling concept. And obviously one that has proven robust.
Not true, Riak was completely, 100% correct with siblings enabled, which is Totally Expected behavior for a standard Dynamo-style system.
The criticism is that last-write-wins is the default, and that's a fair criticism--that's not an honest default for a robust eventually consistent system like Riak. My guess is they default to LWW to ease onboarding of new users. Everyone is worried about rivaling MongoDB's out-of-the-box experience so you don't scare people away...
TBH, though, if you're using a distributed, eventually consistent storage system without understanding the consequences, you're probably using the wrong tool.
Postgres was a bit of a funny one. I don't know what to think of it. The partition case study was with a single server and clients. The failure found was that the server accepted the transaction but the ack never reached the client (due to the partition), so the client thinks the transaction hasn't completed.
Well, PostgreSQL is not a distributed database. Sure, the study may have involved one of the HA solutions available for PostgreSQL, but it wouldn't have been a PostgreSQL study then.
As some comments below note, the reason for this observed behavior is that the server, together with the client, forms a kind of distributed system. Performing a write and failing to ACK it to the client is not classified as a failure. The reverse (not performing an ACKed write) is a failure. However, this suggests that every non-ACKed write operation should be retried with care, because it may have gone through. It's better if you can use idempotency semantics, like 9.5's upcoming http://www.postgresql.org/docs/9.5/static/sql-insert.html#SQ... (ON CONFLICT, aka UPSERT).
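To make the "retry with care" point concrete, here is a minimal sketch of an idempotent retry. It uses SQLite's `INSERT OR IGNORE` in place of Postgres 9.5's `ON CONFLICT` purely so it's self-contained; the table name, key, and `record_payment` helper are all hypothetical. The idea is the same: a client-chosen key makes replaying a possibly-lost write safe.

```python
import sqlite3

def record_payment(conn, payment_id, amount):
    """Idempotent write: a client-chosen primary key makes retries safe.

    If the first attempt's ACK was lost in a partition, replaying the
    same statement is a no-op rather than a duplicate row."""
    conn.execute(
        "INSERT OR IGNORE INTO payments (payment_id, amount) VALUES (?, ?)",
        (payment_id, amount),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT PRIMARY KEY, amount INTEGER)")

record_payment(conn, "pay-42", 100)  # original attempt: suppose the ACK was lost
record_payment(conn, "pay-42", 100)  # cautious retry: no duplicate is created

rows = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(rows)  # 1
```

Without the client-chosen key, the cautious retry would be the dangerous part: a write that "failed" only in the sense that its ACK was dropped would be applied twice.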
> It is not classified as a failure to perform a write and fail to ACK that to the client.
It's worthwhile to note that this is a fairly general problem in many, many kinds of systems.
If that's actually a problem, you can use 2PC. If you didn't get back an ack for the PREPARE TRANSACTION, it failed (and you'd better abort it explicitly until you get an error that it doesn't exist!). Once the PREPARE succeeded, you send COMMIT PREPARED until you succeed or get an error that the prepared transaction doesn't exist anymore (in which case you succeeded but didn't get the memo about it).
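The commit-phase retry loop described above can be sketched as a tiny state machine. This is a toy simulation, not real Postgres: `FlakyServer` and `commit_until_resolved` are hypothetical names, and the server stands in for a database that applies `COMMIT PREPARED` but drops the first ACK, as a partition would.

```python
class FlakyServer:
    """Toy stand-in for the database: commands take effect, but the
    first COMMIT ACK is dropped, as in a network partition."""
    def __init__(self):
        self.prepared = set()
        self.committed = set()
        self._drop_one_ack = True

    def prepare(self, gid):
        self.prepared.add(gid)

    def commit_prepared(self, gid):
        if gid not in self.prepared:
            # Never prepared, or already committed by an earlier attempt.
            raise LookupError("prepared transaction %r does not exist" % gid)
        self.prepared.discard(gid)
        self.committed.add(gid)
        if self._drop_one_ack:
            self._drop_one_ack = False
            raise TimeoutError("ACK lost")  # commit happened; client never hears

def commit_until_resolved(server, gid):
    """Retry COMMIT PREPARED until we either see success or learn the
    prepared transaction is gone -- meaning an earlier attempt won."""
    while True:
        try:
            server.commit_prepared(gid)
            return "committed"
        except TimeoutError:
            continue  # outcome unknown: retry
        except LookupError:
            return "committed"  # a prior attempt succeeded; we missed the memo

server = FlakyServer()
server.prepare("txn-1")
result = commit_until_resolved(server, "txn-1")
print(result)  # committed
```

The key property: because a prepared transaction can only be committed once, "doesn't exist anymore" after a lost ACK is itself proof of success, so the loop always terminates with the right answer.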
Anyone who has run a number of postgres replicas can tell you that their ability to recover from a network partition is directly proportionate to their wal_keep_segments value and there isn't otherwise much to it.
If the network recovers while the last WAL log your replica read still exists on the server, your replica probably gets to catch up, as long as it can read the logs as fast as they're being rotated out.
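For concreteness, the relevant knob is a single line in the primary's postgresql.conf. The value below is purely illustrative -- how much WAL you need to retain depends on your write rate and how long a partition you want to survive:

```
# postgresql.conf on the primary -- value is illustrative, not a recommendation.
# Each WAL segment is 16 MB; a replica can only catch up after a partition
# if the last segment it read is still among those retained.
wal_keep_segments = 256    # retain ~4 GB of WAL for lagging replicas
```

If a replica falls further behind than the retained window, it cannot stream its way back and has to be rebuilt from a base backup (or fed from a WAL archive).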
Postgres doesn't really try much else in the realm of distributed systems, so there's not much else you can test.
Skytools would be interesting, though. I've never felt particularly comfortable with replication schemes that trust an external process to ship rows and other changes, compared to the binary replication that will absolutely fail if you don't have every bit.
>Anyone who has run a number of postgres replicas can tell you that their ability to recover from a network partition is directly proportionate to their wal_keep_segments value and there isn't otherwise much to it.
Rather than wal_keep_segments, you may (should) use replication slots (http://www.postgresql.org/docs/9.4/static/warm-standby.html#...), by setting the primary_slot_name parameter in recovery.conf. With replication slots, WAL files will be kept for every replica, regardless of how far the replica lags behind. So you can very directly control the ability to recover from network partitions (although in exchange you will need to monitor the disk usage for pg_xlog at the master and, if necessary, forcibly drop the slot).
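A minimal sketch of the slot workflow described above, assuming PostgreSQL 9.4+ and a hypothetical slot name `replica_1` (the functions and view are the standard ones; the monitoring query is just one way to watch for unbounded WAL retention):

```sql
-- On the master: create one physical slot per replica.
SELECT pg_create_physical_replication_slot('replica_1');

-- In the replica's recovery.conf:
--   primary_slot_name = 'replica_1'

-- Monitor slots so a long-dead replica can't fill pg_xlog:
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;

-- If a replica is gone for good, drop its slot to release the retained WAL:
SELECT pg_drop_replication_slot('replica_1');
```

The trade-off is exactly as stated: the slot guarantees the replica can always catch up, but an abandoned slot pins WAL forever, so the drop step needs to be part of your decommissioning process.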
Neat, I haven't had the pleasure of running postgres 9.4 in production, and I don't run postgres at all at my current gig, but that's pretty dope.
In practice, the last time I ran pg, I was fond of wal-e which, esp if you're on EC2, is handy because you can store basically all of the WAL segments ever, with snapshots, in S3, and you can bring new replicas online without a read load on any of the existing nodes. It will also bring a replica online that's been shut off for days or weeks for service.
This is really just an S3-backed implementation of the WAL archiver, which was originally designed, afaict, for storing infinite history on NFS. Come to think of it, the new place has NFS. rubs chin
Anyway, thanks for the pointer! Def something I would look into in the future. Maybe it would be fun to write a jepsen test for recovering pg replicas.
I spoke to him privately and in terms of how well message brokers handle fault tolerance, his response was that Kafka appears to be the best at the moment.
No kidding. Someone raises reasonable points, and the engineer responds with:
"Like I suggested before, you should stop by our offices @ 88 Stevenson for lunch some day, and you can chat with our best and brightest about this (and other things) until the cows come home. Then you can save yourself a lot of back and forth, and nerdraging about computers."
Aphyr spending his time finding bugs and unspecified behavior in their software is dismissed as "nerdraging". And instead of discussing it openly, he should spend a day and go to "88 Stevenson", wherever that is...
I'm going to assume he means Stevenson Avenue, Polmont, Falkirk, Stirlingshire, United Kingdom -- since that's the first Google result I got for "Stevenson" as an address.
> I'm going to assume he means Stevenson Avenue, Polmont, Falkirk, Stirlingshire, United Kingdom -- since that's the first Google result I got for "Stevenson" as an address.
Perfect point!
Of course, some people can't imagine that anyone exists who isn't in the same city as they are.
And it's often forgotten that Google is not the same for everybody; it customizes results based on its own heuristics about who is asking and from where.
But maybe the author of that post actually knows that the person he's communicating with is in the same city? I don't know.
The point was that it's not the kind of reply you should make in a public issue tracker. It comes off as: "We're only interested in talking to people who are in San Francisco and are cool enough to hang out with us."
Hmm...no, it's not like that. I have a lot of meetings with a lot of people, it's part of my day job. He's just one of our many users, with whom I like to engage directly.
I had written a rant in reply to one of your other comments, which I deleted as probably being non-productive, but this is the second comment in a similar vein that I've seen from you, so I'll actually reply this time.
Aphyr tests distributed systems to destruction. He seems to be really good at it. You've had an account here for nearly three years, and you write distributed software; it's fairly shocking to me that you're not familiar with his work. You should browse https://aphyr.com/tags/jepsen for a bit, to get a feel for what he does.
As an anecdotal aside, I tend to use aphyr's interactions with a development team as a bit of an indicator about the team's quality. A team that acknowledges the problems that he's finding and hurries to address them is probably a good one. A team that outright denies that the problems exist is a pretty poor one. Your responses to #513 (especially the nerd raging bit) aren't encouraging.
By excluding all of the other users (most of whom aren't in SF, or even in USA) from the conversation? And you can't see why people believe you were trying to bury the item and/or get him to sign an NDA because he was "visiting your offices"?
You may want to step back, take a deep breath, and read this from the point of view of a potential user of your software, and see how it might look like an attempt to bury the discussion and keep the report of a bug quiet. It certainly doesn't give me the impression of transparency.
Only for somebody living there. As I point out in the parallel post (which is possibly too gray to be seen), Google is almost certainly giving Stevenson in the UK as the answer for people who search from there.
He works in SF, as do we. It makes sense, given the context.
I like face to face discussions. It would have been nice to talk to him about what he's building, and what he's trying to accomplish. Google hangouts would have worked, too.
I think what he's trying to accomplish is running the same kinds of tests on Chronos that he's famously run on almost every other important open source distributed system, and that his writeup gives you probably the most important info you need on how to make your stuff more amenable to that kind of adversarial automated testing.
Written communication can be a lot more work, but it is also a lot healthier for your open-source community. It turns into something that is googleable for new people trying to understand the rationale behind decisions. It is more accessible for people in your community who are not native English speakers, or who do not feel comfortable speaking the language while still being able to read and write it.
It seems the context is "if it isn't tested, it is broken". One might surmise that the kind of testing he did on your software is a kind that you didn't do.
I would disagree with face-to-face being a useful way to confront a painful bug report.
> Transient resource or network failures can completely disable Chronos. Most systems tested with Jepsen return to some sort of normal operation within a few seconds to minutes after a failure is resolved. In no Jepsen test has Chronos ever recovered completely from a network failure. As an operator, this fragility does not inspire confidence.
Honestly, that is the money shot. It's not worth running an off-the-shelf cluster system that isn't self-healing after a network partition and/or a node loss -> node restored cycle.
"This isn’t the end of the world–it does illustrate the fragility of a system with three distinct quorums, all of which must be available and connected to one another, but there will always be certain classes of network failure that can break a distributed scheduler."
As we build increasingly complex distributed systems from what are essentially modular distributed systems themselves, this is something to be aware of. The failure characteristics of all the underlying systems will bubble up in surprising ways.
Yes but the number of times I've had to manually heal a cluster in production in my professional life can be counted on one hand.
A network partition like Jepsen tests for is a once-every-other-month problem due to network maintenance, hardware failures, etc. So yeah, it is the end of the world for me. I like not having to wake up in the middle of the night every few months.
One thing to note here is that upstream Mesos is working on removing Zookeeper as a dependency. That would, at a minimum, remove one of the three pieces.
Note that I briefly evaluated Chronos, but tossed it aside as a toy when realizing it didn't support any form of constraints. Aurora is a very nice scheduler for HA / distributed cron (with constraints!) and long running services.
It would be fantastic if Kyle would test out Aurora, if not only to break it so that upstream fixes it. He is generally successful in breaking ALL THE THINGS.
In MESOS-1806, it mentions replacing ZK with the ReplicatedLog or etcd. The replicated log is a Mesos-native construct that has no external dependencies.
If you can replace ZK with a native Mesos construct, seems like it allows you to remove ZK entirely. I meant to say "optionally allow removing ZK as a dependency" in the original post. You're totally correct in that regard.
Oh I love Zookeeper and have no issues whatsoever with it. However, it is almost universally the first thing I hear people moan about when someone suggests they use Mesos. People don't like Zookeeper.
The latest version of Chronos added support for constraints, at least for simple EQUALS constraints.
The annoying problem that I also encountered with Chronos was the lack of error reporting from the REST API. Invalid jobs would just return "400 Bad Request" with no error message. The error sometimes wouldn't even be reported in the logs.
It all depends on the claims vendors make. I think that's the great thing about Kyle's series -- he often looks at how the systems behave vs. how they are documented and promoted.
If vendors fix things after the test, there are usually 2 types of fixes -- docs or code. Docs means telling people how the system really behaves and how data might be corrupted; code means actually fixing the problem, if possible.
So if Chronos just says in bold red letters on the front -- "You'll lose data in a partition" or "Our system is neither C nor A if you use these options" -- that's ok too. The users can then at least make an informed choice.
> Several engineers suggest that Aurora is more robust, though more difficult to set up, than Chronos. I haven’t evaluated Aurora yet, but it’s likely worth looking in to.
I've tried out Aurora (and helped improve the setup docs a small bit) and found it quite nice. If you're going to try it out, use Vagrant; and to deploy it in production, I suggest working off the Vagrant setup: https://github.com/apache/aurora/blob/master/examples/vagran...
> Instrument your jobs to identify whether they ran or not
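One cheap way to act on that advice is to record every run under the slot it was scheduled for, so a monitor can detect gaps. A minimal sketch, with hypothetical names (`run_instrumented`, `missed_slots`) and an in-memory dict standing in for whatever durable store you'd actually use:

```python
def run_instrumented(job, run_log, scheduled_for):
    """Record that a run happened (and whether it succeeded), keyed by the
    slot it was scheduled for, so a monitor can later spot missing slots."""
    try:
        job()
        run_log[scheduled_for] = "ok"
    except Exception as exc:
        run_log[scheduled_for] = "failed: %s" % exc

def missed_slots(run_log, expected_slots):
    """Slots the scheduler should have fired for but left no record of."""
    return [s for s in expected_slots if s not in run_log]

run_log = {}
slots = ["2015-09-01T00:00", "2015-09-01T01:00", "2015-09-01T02:00"]
run_instrumented(lambda: None, run_log, slots[0])
run_instrumented(lambda: None, run_log, slots[2])  # scheduler skipped 01:00
print(missed_slots(run_log, slots))  # ['2015-09-01T01:00']
```

The point of keying by scheduled slot rather than wall-clock time is that a scheduler which silently never fires (as Chronos did in these tests) leaves a visible hole, instead of just a quiet absence of logs.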
if i worked on a distributed system, i'd be thrilled and horrified at the same time if i saw a bug report from aphyr. the whole series is a gem, well worth an afternoon (or five).
Are there many of you actually using Chronos in production? I'm not sold that this project will still be with us in a couple of years...
I'm not saying the idea is bad, but comparing the developer community of Marathon with Chronos you see some obvious differences in commits, developers, issues (and responses), releases etc.
We've been using it in conjunction with Marathon and Mesos. My impression is that it's now a half-dead project riddled with bugs (especially in the web UI), and I'm unsure whether we should invest more resources and infrastructure in it.
Aurora does look interesting. But Apache doesn't have a much better reputation, and I'm not particularly keen on going away from Marathon. Any DevOps engineers with container/mesos infrastructures wanting to chime in?
Hi, Chronos Product Manager here ('air' on github).
Status: Having fun creating issues in the wake of the Jepsen post : )
I can say that Mesosphere is 100% committed to both Marathon and Chronos. You're right that Marathon gets more attention for the moment, but we have grand plans for both.
A strength of Mesos is that it's an open platform. If Chronos isn't right for you, there are solid alternatives. I'd appreciate if you raise issues against Chronos though so we can improve it : )
Excuse the plug, but if you'd like to help us improve these widely-used frameworks we'd love to hear from you: https://mesosphere.com/careers/
Looks like a very interesting project which I'll keep a keen eye on. We're probably not ready to switch our entire stack from Mesos to Lattice though, especially considering you're missing some important bits such as external service discovery etc. (by the looks of it)
We're currently doing Marathon on Mesos with gliderlabs/registrator, which syncs Docker and Consul - and then we're using consul-template + HAProxy for exposing applications externally. Works quite nicely.
The differences (and advantages) between Lattice and Mesos could've been made clearer too I guess, although your docs and FAQs are quite nice.
If you want the full stack, you go from Lattice to Cloud Foundry. It gives you service injection, logging, routing and a bunch of other stuff I forget right now.
And it doesn't require you to build your own snowflake PaaS, which you'll be married to forever at your own expense.
I've worked on CF so I'm biased in its favour. But my actual paying job is helping clients to deliver user value. Tinkering around with various tools is fun, but it's not delivering user value.
Typing
cf push app-name
Is.
And it works. It just plain old works. That makes my life 100x easier.
Awesome post -- it'd be cool to see Kyle test more Mesos frameworks. Singularity (http://github.com/HubSpot/Singularity) is another good option for running scheduled jobs in Mesos (along with web services, background workers, and one-off tasks). It fails fast like Chronos, but we made it clear to use monit, supervisor, or systemd in the docs. ;) We run our entire product on it (~1,600 deployable items) with no issues. Our PROD cluster has launched 8,000,000 tasks so far in 2015!
Ha, that's a good and funny point that hadn't occurred to me. More info in answer to the parent's question: the song is "Call Me Maybe" by Carly Rae Jepsen.
"Ordinarily in Jepsen we’d refer to these as “primary” and “secondary” or “leader” and “follower” to avoid connotations of, well, slavery, but the master nodes themselves form a cluster with leaders and followers, and terms like “executor” have other meanings in Mesos, so I’m going to use the Mesos terms here."
hahaha, well naming things is one of the hardest problems in computer science after all.
It's a total bikeshed issue, in any event. No one seems to mind the issue of Unix routinely forcing you to commit filicide and general killing of children, for instance.
Who gives a fuck about slavery in the 18th-century US, when there are 21 to 36 million slaves in the world today? That means there are more people enslaved to...fucking...day than at any other point in history.
If you want to get the word changed, do it out of respect for those living in slavery today, and not because some of the most privileged people on earth (first-world software developers, regardless of their minority status) could get their feelings hurt.
I got the impression that the people most vocal about these terms don't really care about the issues. They just want the thrill of changing the world, without actually doing or changing anything.
If we ban all the bad words from our vocabulary, it only serves to pretend the bad things they name don't exist.
Also, slavery isn't a very good metaphor in many situations.
If you have a DB "slave", the "master" really doesn't control it. The master doesn't enforce its functionality. The master won't typically supervise it, and the master doesn't own it. The master didn't buy it, and there's really no economics in the situation at all.
When you view systems in terms of autonomy and commitments (ie, promise theory), the metaphor of slavery becomes grossly inaccurate when trying to describe what a system is really doing.
You're only proving my point, really. The only things you can think of are Confederate flags and the trans-Atlantic slave trade, when slavery as an institution is as old as human civilization itself (and continues on well to this day).
Incidentally, filicide is also ancient and ongoing.
I can think of a lot of things other than Confederate flags, I was making a symbolic argument against your misuse of the word bikeshed.
Slavery is a bad practice and I would definitely not work with someone who felt otherwise, but other than that, your argument is just worthless escapism. Have fun talking to HR in the future.
My point was that trying to paint slavery as being particularly objectionable in the face of just as seemingly reprehensible (but not at all in context) metaphors of child murder, is disingenuous and pointlessly selective. Child murder and slavery are horrible and I can't believe I actually have to specify this. They are both ongoing, so your argument from ancestry doesn't work. There's millions of slaves as we speak, many more ancestors of slaves and many who have had their children murdered. That doesn't somehow make those metaphors irredeemable when put in purely technical context, e.g. slavery as a relationship of total control and ownership by one party over another (actually just as applicable to S/M -- a consensual sexual practice, as it is to real-life slavery).
https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul
Elasticsearch didn't do well; neither did MongoDB, Kafka, etc.
The only ones that did ok that I remember were: Zookeeper, Riak (with siblings), and maybe Cassandra.