Call Me Maybe: Chronos (aphyr.com)
243 points by sbilstein on Aug 18, 2015 | 87 comments


Kyle's "Call Me Maybe" series is great.

The general take-away from most of them is -- do not believe the vendors of distributed systems. Even systems that were meant to solve consensus (etcd, consul) failed:

https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul

Elasticsearch didn't do well, and neither did MongoDB, Kafka, etc.

The only ones that did ok that I remember were: Zookeeper, Riak (with siblings), and maybe Cassandra.


It's also interesting to see the different responses from teams when problems are raised.

Some are very thankful for the analysis; others don't seem to appreciate how important correctness is in this type of distributed system. I find it's a good indicator of the maturity of the teams working on them, and how much I'm willing to trust each system.


Exactly. Notably, the etcd and Consul teams quickly responded with fixes.


Wasn't their response to turn up a couple of timeouts?


No, the etcd team responded by fixing the relevant bug and introducing a quorum read option to provide a linearizable read.


> I find it's a good indicator of the maturity of the teams working on them, and how much I'm willing to trust each system.

I hate to nitpick, but I'm genuinely curious. How do you know it's a good indicator? It can definitely be _an_ indicator, but how do you know it was a good one? Did you have a follow-up experience that validated what the indicator indicated?


How a project performs today tells you the zeroth derivative of its location. Looking at the commit log tells you about the first derivative. How people react to Call Me Maybe when it's about their product gives you a lot of information about the second derivative.


I'm surprised to find that this is a pretty good way to look at it.


> How do you know it's a good indicator?

One way is to note whether the response to the report dismisses it or hand-waves it away. Another telling response is to ignore the report for 18 months with no reply.

The tests that aphyr conducts are brutal, which you can see by casual inspection of his Call Me Maybe site or his talks. If a software team's response tries to divert his attention, or yours, away from the issue, it does not leave a good impression.


To be fair, he said it was a good indicator of the maturity of the team and how much he was willing to trust the systems. He didn't say it was a good indicator of the reliability of those systems.


IIRC, Riak only did well with the CRDT data types. That result, and the rather well-done presentation by Peter Bourgon about CRDTs and the time-series database they built with them at SoundCloud, made me look into it more, and it seems a quite compelling concept. And it has obviously proven robust.


Not true, Riak was completely, 100% correct with siblings enabled, which is Totally Expected behavior for a standard Dynamo-style system.

The criticism is that last-write-wins is the default, and that's a fair criticism -- it's not an honest default for a robust eventually consistent system like Riak. My guess is they default to LWW to ease onboarding of new users: everyone is worried about rivaling MongoDB's out-of-the-box experience, so you don't want to scare people away...

TBH, though, if you're using a distributed, eventually consistent storage system without understanding the consequences, you're probably using the wrong tool.
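
To make the siblings-vs-LWW distinction concrete, here's a toy sketch in plain Python (not the Riak API; the data shapes are invented purely for illustration):

    # Two clients write to the same key concurrently during a partition.
    write_a = {"value": {"alice"}, "timestamp": 100}
    write_b = {"value": {"bob"},   "timestamp": 101}

    # Last-write-wins: the later timestamp silently wins; Alice's write is lost.
    lww = max([write_a, write_b], key=lambda w: w["timestamp"])["value"]
    print(lww)     # {'bob'}

    # Siblings: the store keeps both conflicting values and hands them back,
    # and the application merges them (here, a simple set union).
    siblings = [write_a["value"], write_b["value"]]
    merged = set().union(*siblings)
    print(merged)  # {'alice', 'bob'}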


He's also pretty favorable about Postgresql:

https://aphyr.com/posts/282-call-me-maybe-postgres


Postgres was a bit of a funny one; I don't know what to think of it. The partition case study was with a single server and its clients. The failure found was related to the server accepting a transaction but the ack never reaching the client (due to the partition), so the client thinks the transaction hasn't completed.


Well, PostgreSQL is not a distributed database. Sure, the study may have involved one of the HA solutions available for PostgreSQL, but it wouldn't have been a PostgreSQL study then.

As someone comments below, the reason for this observed behavior is that the server, together with the client, forms a kind of distributed system. It is not classified as a failure to perform a write and fail to ACK it to the client. The reverse (not performing an ACKed write) is a failure. However, this suggests that every non-ACKed write operation should be retried with care, because it may have gone through. It's better if you can use idempotency semantics, like 9.5's upcoming http://www.postgresql.org/docs/9.5/static/sql-insert.html#SQ... (ON CONFLICT, aka UPSERT).
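
As a rough sketch of that retry pattern (assuming a hypothetical events table with a client-generated unique id, and psycopg2 as the driver), an unacknowledged write becomes safe to retry because a duplicate attempt is a no-op:

    import psycopg2

    def record_event(conn, event_id, payload):
        # Safe to retry if the ACK was lost: if the first attempt actually
        # committed before the partition, ON CONFLICT DO NOTHING (9.5+)
        # turns the retry into a no-op instead of a duplicate or an error.
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO events (id, payload) VALUES (%s, %s) "
                "ON CONFLICT (id) DO NOTHING",
                (event_id, payload),
            )
        conn.commit()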


> It is not classified as a failure to perform a write and fail to ACK that to the client.

It's worthwhile to note that this is a fairly general problem in many, many kinds of systems.

If that's actually a problem, you can use 2PC. If you didn't get back an ack for the PREPARE TRANSACTION, it failed (and you had better abort it explicitly until you get an error that it doesn't exist!). Once the PREPARE succeeded, you send COMMIT PREPARED until you succeed or get an error that the prepared transaction doesn't exist anymore (in which case you succeeded but didn't get the memo about it).
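
A rough sketch of that discipline using psycopg2's two-phase-commit API (the connection string, transaction id, and SQL are illustrative, not from the comment above):

    import psycopg2

    conn = psycopg2.connect("dbname=example")
    xid = conn.xid(0, "transfer-42", "demo")

    conn.tpc_begin(xid)
    cur = conn.cursor()
    cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
    conn.tpc_prepare()   # no ack seen? tpc_rollback(xid) until it's reported gone

    try:
        conn.tpc_commit()              # sends COMMIT PREPARED
    except psycopg2.OperationalError:
        # Connection died before we saw the result: reconnect and check
        # whether the prepared transaction is still pending.
        recovery = psycopg2.connect("dbname=example")
        pending = [x.gtrid for x in recovery.tpc_recover()]
        if xid.gtrid in pending:
            recovery.tpc_commit(xid)   # still there: finish the commit
        # else: it already committed; we just never got the memo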


Yes, that surprised me too; I expected a cluster of several Pg servers instead.


Anyone who has run a number of postgres replicas can tell you that their ability to recover from a network partition is directly proportional to their wal_keep_segments value, and there isn't otherwise much to it.

If the network recovers while the last WAL log your replica read still exists on the server, your replica will probably get to catch up, as long as it can read the logs as fast as they're being rotated out.

Postgres doesn't really try much else in the realm of distributed systems, so there's not much else you can test.

Skytools would be interesting, though. I've never felt particularly comfortable with replication schemes that trust an external process to ship rows and other changes, compared to the binary replication that will absolutely fail if you don't have every bit.


>Anyone who has run a number of postgres replicas can tell you that their ability to recover from a network partition is directly proportional to their wal_keep_segments value, and there isn't otherwise much to it.

Rather than wal_keep_segments, you may (and probably should) use replication slots (http://www.postgresql.org/docs/9.4/static/warm-standby.html#...), by setting the primary_slot_name parameter in recovery.conf. With replication slots, WAL files will be kept for every replica, regardless of how far the replica lags behind. So you can very directly control the ability to recover from network partitions (although in exchange you will need to monitor the disk usage for pg_xlog on the master and, if necessary, forcibly drop the slot).
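
A minimal sketch of that setup (the slot name and connection strings are illustrative; the recovery.conf lines belong on the standby):

    import psycopg2

    # On the master: create a physical replication slot (PostgreSQL 9.4+).
    master = psycopg2.connect("dbname=postgres")
    master.autocommit = True
    cur = master.cursor()
    cur.execute("SELECT pg_create_physical_replication_slot(%s)", ("replica_1",))

    # On the standby, recovery.conf then points at the slot:
    #   standby_mode = 'on'
    #   primary_conninfo = 'host=master-host user=replicator'
    #   primary_slot_name = 'replica_1'

    # Watch retained WAL so pg_xlog on the master doesn't fill the disk.
    cur.execute("SELECT slot_name, active, restart_lsn FROM pg_replication_slots")
    print(cur.fetchall())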


Neat, I haven't had the pleasure of running postgres 9.4 in production, and I don't run postgres at all at my current gig, but that's pretty dope.

In practice, the last time I ran pg, I was fond of wal-e, which, especially if you're on EC2, is handy because you can store basically all of the WAL segments ever, with snapshots, in S3, and you can bring new replicas online without putting a read load on any of the existing nodes. It will also bring a replica online that's been shut off for days or weeks for service.

This is really just an S3 implementation of the WAL archiver, which was originally designed, AFAICT, for storing infinite history on NFS. Come to think of it, the new place has NFS. *rubs chin*

Anyway, thanks for the pointer! Def something I would look into in the future. Maybe it would be fun to write a jepsen test for recovering pg replicas.


Well, technically clients are also part of distributed systems, but you're right that Postgres is different from all the other systems he analyzed.


HBase also did well. They need to blog about it. There was a related jira ticket I read through: https://issues.apache.org/jira/browse/HBASE-9719

Also, an old entry from the Yammer engineering blog: http://old.eng.yammer.com/call-me-maybe-hbase/


I spoke to him privately and in terms of how well message brokers handle fault tolerance, his response was that Kafka appears to be the best at the moment.


I remember reading that FoundationDB did very well. OH WELL...


They ran it themselves if I recall correctly.


Apache Solr did well.


Awesome read. The issue he filed on GitHub[1] is also worth reading; brndnmtthws could definitely have handled that better.

1. https://github.com/mesos/chronos/issues/513


No kidding. Someone raises reasonable points, and the engineer responds with:

"Like I suggested before, you should stop by our offices @ 88 Stevenson for lunch some day, and you can chat with our best and brightest about this (and other things) until the cows come home. Then you can save yourself a lot of back and forth, and nerdraging about computers."

Aphyr spending his time finding bugs and unspecified behavior in their software is dismissed as "nerdraging". And instead of discussing it openly, he should spend a day and go to "88 Stevenson", wherever that is...

I'm going to assume he means Stevenson Avenue, Polmont, Falkirk, Stirlingshire, United Kingdom -- since that's the first Google result I got for "Stevenson" as an address.


> I'm going to assume he means Stevenson Avenue, Polmont, Falkirk, Stirlingshire, United Kingdom -- since that's the first Google result I got for "Stevenson" as an address.

Perfect point!

Of course, somebody can't imagine that people exist who aren't in the same city as he is.

https://mesosphere.com/contact/

"88 Stevenson St, San Francisco, CA 94105"

And it's often forgotten that Google is not the same for everybody; it customizes its results based on its own heuristics about who is asking and from where.

But maybe the author of that post actually knows that the person he's communicating with is in the same city? I don't know.


It means San Francisco, they're both there.


Sorry, I forgot the sarcasm tags.

The point was that it's not the kind of reply you should make in a public issue tracker. It comes off as: "We're only interested in talking to people who are in San Francisco and are cool enough to hang out with us."


> "We're only interested in talking to people who are in San Francisco and are cool enough to hang out with us."

I read it less as that and more as "you are embarrassing us by discussing our product's flaws in public; please bring this behind closed doors".


Which is precisely how I read it. Unsurprising, really, given the way many people act right now in SV.


>Unsurprising really given the way many people act right now in SV.

Except he's in SF.


Hmm...no, it's not like that. I have a lot of meetings with a lot of people, it's part of my day job. He's just one of our many users, with whom I like to engage directly.


I wrote a rant in reply to one of your other comments that I deleted as probably being non-productive, but this is the second comment in a similar vein that I've seen from you, so I'll actually reply this time.

Aphyr tests distributed systems to destruction. He seems to be really good at it. You've had an account here for nearly three years, and you write distributed software; it's fairly shocking to me that you're not familiar with his work. You should browse https://aphyr.com/tags/jepsen for a bit, to get a feel for what he does.

As an anecdotal aside, I tend to use aphyr's interactions with a development team as a bit of an indicator about the team's quality. A team that acknowledges the problems that he's finding and hurries to address them is probably a good one. A team that outright denies that the problems exist is a pretty poor one. Your responses to #513 (especially the nerd raging bit) aren't encouraging.


By excluding all of the other users (most of whom aren't in SF, or even in the USA) from the conversation? And you can't see why people believe you were trying to bury the item and/or get him to sign an NDA because he was "visiting your offices"?


You may want to step back, take a deep breath, and read this from the point of view of a potential user of your software, and see how it might look like an attempt to bury the discussion and keep the report of a bug quiet. It certainly doesn't give me the impression of transparency.


Really don't think you handled this one well m8.


Just to play devil's advocate-

There's an 88 Stevenson in SoMa in San Francisco, so that's pretty canonical.

Also, you have to give Air credit for being civil, facilitating discussion, and taking responsibility for the issues.


> that's pretty canonical

Only for somebody living there. As I point out in the parallel post (which is possibly too gray to be seen), Google is almost certainly giving Stevenson in the UK as the answer to people who search from there.


You mean someone like Aphyr, the person he was talking to? :)


He works in SF, as do we. It makes sense, given the context.

I like face to face discussions. It would have been nice to talk to him about what he's building, and what he's trying to accomplish. Google hangouts would have worked, too.


I think what he's trying to accomplish is running the same kinds of tests on Chronos that he's famously run on almost every other important open-source distributed system. His writeup gives you probably the most important information you need on how to make your stuff more amenable to that kind of adversarial automated testing.


> It would have been nice to talk to him about what he's building, and what he's trying to accomplish.

I'm pretty sure he is merely (!) building tests that demonstrate defects in distributed systems software.

I wonder if you think that makes the issues he files against your software somehow less valid.


Written communication can be a lot more work, but it is also a lot healthier for your open-source community. It turns into something that is googleable for new people trying to understand the rationale behind decisions. It is also more accessible for people in your community who are not native English speakers, or who do not feel comfortable speaking the language while still being able to read and write it.


> It makes sense, given the context

It seems the context is "if it isn't tested, it is broken". One might surmise that the kind of testing he did on your software is a kind that you didn't do.

I would disagree with face-to-face being a useful way to confront a painful bug report.


> Transient resource or network failures can completely disable Chronos. Most systems tested with Jepsen return to some sort of normal operation within a few seconds to minutes after a failure is resolved. In no Jepsen test has Chronos ever recovered completely from a network failure. As an operator, this fragility does not inspire confidence.

Honestly, that is the money shot. It's not worth running an off-the-shelf cluster system that isn't self-healing after a network partition and/or a node-loss/node-restored cycle.


Another big takeaway for me:

"This isn’t the end of the world–it does illustrate the fragility of a system with three distinct quorums, all of which must be available and connected to one another, but there will always be certain classes of network failure that can break a distributed scheduler."

As we build increasingly complex distributed systems from what are essentially modular distributed systems themselves, this is something to be aware of. The failure characteristics of all the underlying systems will bubble up in surprising ways.


Yes, but the number of times I've had to manually heal a cluster in production in my professional life can be counted on one hand.

A network partition like Jepsen tests for is a once-every-other-month problem due to network maintenance, hardware failures, etc. So yeah, it is the end of the world for me. I like not having to wake up in the middle of the night every few months.


One thing to note here is that upstream Mesos is working on removing Zookeeper as a dependency. That would, at a minimum, remove one of the three pieces.

Note that I briefly evaluated Chronos, but tossed it aside as a toy when I realized it didn't support any form of constraints. Aurora is a very nice scheduler for HA / distributed cron (with constraints!) and long-running services.

It would be fantastic if Kyle would test out Aurora, if only to break it so that upstream fixes it. He is generally successful in breaking ALL THE THINGS.


Mesos is looking at allowing things other than Zookeeper to be the dependency; I'm not aware of them removing the dependency altogether.

https://issues.apache.org/jira/browse/MESOS-1806


In MESOS-1806, it mentions replacing ZK with ReplicatedLog or etcd. The replicated log is a Mesos-native construct that has no external dependencies.

If you can replace ZK with a native Mesos construct, it seems like that allows you to remove ZK entirely. I meant to say "optionally allow removing ZK as a dependency" in the original post. You're totally correct in that regard.


I do wonder why the GP would want to, given how ridiculously reliable zookeeper is.


Oh, I love Zookeeper and have no issues whatsoever with it. However, it is almost universally the first thing I hear people moan about when someone suggests they use Mesos. People don't like Zookeeper.


The latest version of Chronos added support for constraints, at least for simple EQUALS constraints.

The annoying problem that I also encountered with Chronos was the lack of error reporting from the REST API. Invalid jobs would just return "400 Bad Request" with no error message. The error sometimes wouldn't even be reported in the logs.


Yeah, I was playing with it more than 6 months ago, and Aurora was solid then as well.


It all depends on the claims vendors make. I think that is the great thing about Kyle's series -- he often looks at how the systems behave vs. how they are documented and promoted.

If vendors fix things after the test, there are usually two types of fixes: docs or code. A docs fix means telling people how the system really behaves and how data might get corrupted; a code fix means actually fixing the problem, if possible.

So if Chronos just says in bold red letters on the front page -- "You'll lose data in a partition" or "Our system is neither C nor A if you use these options" -- that's OK too. The users can then at least make an informed choice.


> Several engineers suggest that Aurora is more robust, though more difficult to set up, than Chronos. I haven’t evaluated Aurora yet, but it’s likely worth looking in to.

I've tried out Aurora (and helped improve the setup docs a small bit) and found it quite nice. If you're going to try it out, use Vagrant; to deploy it in production, I suggest working off the Vagrant setup: https://github.com/apache/aurora/blob/master/examples/vagran...

> Instrument your jobs to identify whether they ran or not

Prometheus.io lets you alert on when the job last succeeded, which is generally what you want. http://prometheus.io/docs/instrumenting/pushing/ has a full example for Java.
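
A rough Python equivalent of the same pattern (using the prometheus_client package and a hypothetical Pushgateway at localhost:9091) would look like this:

    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    def run_my_batch_job():
        pass  # placeholder for the actual cron work

    registry = CollectorRegistry()
    last_success = Gauge(
        "job_last_success_unixtime",
        "Last time the batch job successfully finished",
        registry=registry,
    )

    run_my_batch_job()
    last_success.set_to_current_time()

    # Push to the Pushgateway; alert when this timestamp gets too old.
    push_to_gateway("localhost:9091", job="my_batch_job", registry=registry)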


if i worked on a distributed system, i'd be thrilled and horrified at the same time if i saw a bug report from aphyr. the whole series is a gem, well worth an afternoon (or five).


As always people should check out Aphyr's twitter page for more detail and local color.

"really wish I had more time to dig into chronos because it has so many fascinating failure modes but I've got 2 other systems before talks!"


Are there many of you actually using Chronos in production? I'm not sold that this project will still be with us in a couple of years.

I'm not saying the idea is bad, but comparing the developer communities of Marathon and Chronos, you see some obvious differences in commits, developers, issues (and responses), releases, etc.

We've been using it in conjunction with Marathon and Mesos, and my impression is that it's now a half-dead project riddled with bugs (especially in the web UI), and I'm unsure whether we should invest more resources and infrastructure around it.

Aurora does look interesting. But Apache doesn't have a much better reputation, and I'm not particularly keen on going away from Marathon. Any DevOps engineers with container/mesos infrastructures wanting to chime in?


Hi, Chronos Product Manager here ('air' on github). Status: Having fun creating issues in the wake of the Jepsen post : )

I can say that Mesosphere is 100% committed to both Marathon and Chronos. You're right that Marathon gets more attention for the moment, but we have grand plans for both.

A strength of Mesos is that it's an open platform. If Chronos isn't right for you, there are solid alternatives. I'd appreciate it if you raise issues against Chronos, though, so we can improve it : )

Excuse the plug, but if you'd like to help us improve these widely-used frameworks we'd love to hear from you: https://mesosphere.com/careers/


Have a look at Lattice (http://lattice.cf), which is extracted from Cloud Foundry.

Disclaimer: I work for Pivotal (in Labs), which donates the largest chunk of engineering effort to Cloud Foundry, including the Lattice team.


Looks like a very interesting project which I'll keep a keen eye on. We're probably not ready to switch our entire stack from Mesos to Lattice though, especially considering you're missing some important bits such as external service discovery (by the looks of it). We're currently doing Marathon on Mesos with gliderlabs/registrator, which syncs Docker and Consul, and then we're using consul-template + HAProxy for exposing applications externally. Works quite nicely.

The differences (and advantages) between Lattice and Mesos could've been made clearer too I guess, although your docs and FAQs are quite nice.


If you want the full stack, you go from Lattice to Cloud Foundry. It gives you service injection, logging, routing and a bunch of other stuff I forget right now.

And it doesn't require you to build your own snowflake PaaS, which you'll be married to forever at your own expense.

I've worked on CF so I'm biased in its favour. But my actual paying job is helping clients to deliver user value. Tinkering around with various tools is fun, but it's not delivering user value.

Typing

    cf push app-name
Is.

And it works. It just plain old works. That makes my life 100x easier.


Would love to get your feedback on the Singularity framework: https://github.com/HubSpot/Singularity

Currently in use at HubSpot, OpenTable, Groupon (last I heard) and EverTrue.


Awesome post -- it'd be cool to see Kyle test more Mesos frameworks. Singularity (http://github.com/HubSpot/Singularity) is another good option for running scheduled jobs in Mesos (along with web services, background workers, and one-off tasks). It fails fast like Chronos, but we made it clear in the docs to use monit, supervisor, or systemd. ;) We run our entire product on it (~1,600 deployable items) with no issues. Our PROD cluster has launched 8,000,000 tasks so far in 2015!


Can someone explain the "Call Me Maybe" reference? Was it a movie or a song?


It's a song, and probably the tool and aphyr's blog posts will have more staying power than the song and the singer they are named after.


Ha, that's a good and funny point that hadn't occurred to me. More info in answer to the parent's question: the song is "Call Me Maybe" by Carly Rae Jepsen.


You're probably right but I think her new stuff ain't bad at all!


It's explained more thoroughly (and with extended jokes) in the first post from the series: https://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-a...


> for instance, it omits that Chronos nodes are fragile by design and must be supervised by a daemon to restart them.

Wat? Why is that necessary to mention? Every process is fragile and needs to be supervised!


Awesome analysis! :-)


"Ordinarily in Jepsen we’d refer to these as “primary” and “secondary” or “leader” and “follower” to avoid connotations of, well, slavery, but the master nodes themselves form a cluster with leaders and followers, and terms like “executor” have other meanings in Mesos, so I’m going to use the Mesos terms here."

hahaha, well naming things is one of the hardest problems in computer science after all.


It's worth noting that the Mesos project will actually be renaming "slave" to "agent." cf. https://issues.apache.org/jira/browse/MESOS-1478


It's a total bikeshed issue, in any event. No one seems to mind the issue of Unix routinely forcing you to commit filicide and general killing of children, for instance.


There's a difference between that and trying to work on a diverse team with people whose ancestors were slaves and avoiding common phrases like:

  "We need more slaves!"
It's a bit of nomenclature that would never be used if people weren't mimicking choices from decades ago.

What name to use other than slave might be a bikeshed issue, but the bikeshed is currently painted with a confederate flag. ;)


Who gives a fuck about slavery in 18th century US, when there are 21 to 36 million slaves in the world today. That means that there are more people enslaved to...fucking...day, than at any other point in history.

If you want to get the word changed, do it out of respect for those living in slavery today, and not because some of the most privileged people on earth (first-world software developers, regardless of their minority status) could get their feelings hurt.

I got the impression that the people most vocal about these terms don't really care about the issues. They just want the thrill of changing the world, without actually doing or changing anything.

If we ban all the bad words from our vocabulary, it only serves to pretend the bad things they name don't exist.


Also, slavery isn't a very good metaphor in many situations.

If you have a DB "slave", the "master" really doesn't control it. The master doesn't enforce its functionality. The master won't typically supervise it, and the master doesn't own it. The master didn't buy it, and there's really no economics in the situation at all.

When you view systems in terms of autonomy and commitments (i.e., promise theory), the metaphor of slavery becomes grossly inaccurate when trying to describe what a system is really doing.

I'd call it a replica instead.


A replica suggests that all nodes are somewhat equal copies.

On a master/slave bus, nothing works if the master is shut down. The master literally supervises everything by arbitrating communication.

In a replica system like Elasticsearch, failure of one node will usually just cause some other node to take over its task.


You're only proving my point, really. The only things you can think of are Confederate flags and the trans-Atlantic slave trade, when slavery as an institution is as old as human civilization itself (and continues on well to this day).

Incidentally, filicide is also ancient and ongoing.


I can think of a lot of things other than Confederate flags, I was making a symbolic argument against your misuse of the word bikeshed.

Slavery is a bad practice and I would definitely not work with someone who felt otherwise, but other than that, your argument is just worthless escapism. Have fun talking to HR in the future.


My point was that trying to paint slavery as being particularly objectionable in the face of just as seemingly reprehensible (but not at all in context) metaphors of child murder, is disingenuous and pointlessly selective. Child murder and slavery are horrible and I can't believe I actually have to specify this. They are both ongoing, so your argument from ancestry doesn't work. There's millions of slaves as we speak, many more ancestors of slaves and many who have had their children murdered. That doesn't somehow make those metaphors irredeemable when put in purely technical context, e.g. slavery as a relationship of total control and ownership by one party over another (actually just as applicable to S/M -- a consensual sexual practice, as it is to real-life slavery).

I will have fun talking to HR, thank you.


Or Akka's lovely akka.actor.dungeon.ChildrenContainer...

http://doc.akka.io/japi/akka/2.3.9/akka/actor/dungeon/Childr...


With a method called "shallDie", no less!



