Hacker News new | past | comments | ask | show | jobs | submit login
Leader Election with S3 Conditional Writes (morling.dev)
122 points by gunnarmorling 21 days ago | hide | past | favorite | 70 comments



Leader election and distributed locking reduce to the same problem… which is proven to be impossible. It means in some edge case it will fail on you, is your system handling those cases?

I didn’t read past this:

> Systems like Apache ZooKeeper or Postgres (via Advisory Locks) provide the required building blocks for this

Zookeeper is the original sin. Convincing a whole generation of programmers that distributed lock are a feasible solution.

This is my biggest pet peeve in distributed systems.

——

And if you don’t believe me, maybe you’ll trust Kyle K. of Jepsen fame:

> However, perfect failure detectors are impossible in asynchronous networks.

Links to: https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-...

https://jepsen.io/analyses/datomic-pro-1.0.7075


> which is proven to be impossible.

'Technically' intractable problems are solvable just fine in a way that is almost as useful as solving them completely if you can achieve one of two things:

* Reliably identify when you've encountered an unsolvable case (usefulness of this approach depends on the exact problem you're solving).

or

* Reduce the probability of unsolvable cases/incorrect solutions to a level low enough to not actually happen in practice.

'Technically' GUIDs are impossible, reliable network communication (TCP) is impossible, O^2 time complexity functions will grow to unusably large running times - but in practice all of these things are used constantly to solve real problems.


"Distributed locks" are at best a contention-reduction mechanism. They cannot be used to implement mutual exclusion that is _guaranteed_ to work.

I've seen way too many systems where people assume TCP == perfectly reliable and distributed locks == mutual exclusion. Which of course it's not the case.


> Convincing a whole generation of programmers that distributed lock are a feasible solution.

I too hate this. Not just because the edge cases exist, but also because of the related property: it makes the system very hard to reason about.

Questions that should be simple become complicated. What happens when the distributed locking system is down? What happens when we reboot all the nodes at once? What if they don't come down at exactly the same time and there's leader churn for like 2 minutes? Etc, etc.

Those questions should be fairly simple, but become something where a senior dev is having to trace codepaths and draw on a whiteboard to figure it out. It's not even enough to understand how a single node works in-depth, they have to figure out how this node works but also how this node's state might impact another node's.

All of this is much simpler in leaderless systems (where the leader system is replaced with idempotency or a scheduler or something else).

I very strongly prefer avoiding leader systems; it's a method of last resort when literally nothing else will work. I would much rather scale a SQL database to support the queries for idempotency than deal with a leader system.

I've never seen an idempotent system switch to a leader system, but I've sure seen the reverse a few times.


>> Convincing a whole generation of programmers that distributed lock are a feasible solution.

> I too hate this. Not just because the edge cases exist, but also because of the related property: it makes the system very hard to reason about.

I think this is a huge problem with the way we’re developing software now. Distributed systems are extremely difficult for a lot of reasons, yet it’s often or first choice when developing even small systems!

At $COMPANY we have hundreds of lambdas, DocumentDB (btw, that is hell in case you’re considering it) and other cloud storage and queuing components. On call and bugs basically are quests in finding some corner case race condition/timing problem, read after write assumption etc.

I’m ashamed to say, we have reads wrapped in retry loops everywhere.

The whole thing could have been a Rails app with a fraction of the team size and a massive increase in reliability and easier to reason about/better time delivering features.

You could say we’re doing it wrong, and you’d probably be partly right for sure, but I’ve done consulting for a decade at dozens of other places and it always seems like this.


> You could say we’re doing it wrong, and you’d probably be partly right for sure, but I’ve done consulting for a decade at dozens of other places and it always seems like this.

The older I get, the more I think this is a result of Conway's law and that a lot of this architectural cruft stems from designing systems around communication boundaries rather than things that make technical sense.

Monolithic apps like Rails only happen under a single team or teams that are so tightly coupled people wonder whether they should just merge.

Distributed apps are very loosely coupled, so it's what you would expect to get from two teams that are far apart on the org chart.

Anecdotally, it mirrors what I've seen in practice. Closely related teams trust each other and are willing to make a monolith under an assumption that their partner team won't make it a mess. Distantly related teams play games around ensuring that their portion is loosely coupled enough that it can have its own due dates, reliability, etc.

Queues are the king of distantly coupled systems. A team's part of a queue-based app can be declared "done" before the rest of it is even stood up. "We're dumping stuff into the queue, they just need to consume it" or the inverse "we're consuming, they just need to produce". Both sides of the queue are basically blind to each other. That's not to say that all queues are bad, but I have seen a fair few queues that existed basically just to create an ownership boundary.

I once saw an app that did bidirectional RPC over message queues because one team didn't believe the other could/would do retries, on an app that handled single digit QPS. It still boggles my mind that they thought it was easier to invent a paradigm to match responses to requests than it was to remind the other team to do retries, or write them a library with retries built in, or just participate in bleeping code reviews.


> once saw an app that did bidirectional RPC over message queues

Haha I've seen this anti-pattern too (although I think it's in the enterprise patterns book??). It would bring production to a grinding halt every night. Another engineer and I stayed up all night and replaced it with simple REST API.


I once saw a REST API built with bidirectional queues. There was a “REST” server that converted HTTP to some weird custom format and an “app” server with “business logic”, with tons of queues in between. It was massively over complicated and never made it to production. I won’t even describe what the database looked like.


I see the same. All this complexity to handle a few requests/second... but at least we can say it's "cloud native."


Same thing where I work now. Many experienced developers waste a huge chunk of their time trying to wrap their heads around their Django micro services communication patterns and edge cases. Much more complex than an equivalent Rails monolith, even though Ruby and Rails both have their issues and could be replaced by more modern tech in 2024.


This is rather misleading, the FLP theorem talks about fully asynchronous networks with unbounded delay. Partial synchrony is a perfectly reasonable assumption and allows atomic broadcast and locking to work perfectly well even if there is an unknown but finite bound on network delay.


Notice I did not mention FLP.

Atomic Broadcast (via Paxos or RAFT) does not depend on partial synchrony assumptions to maintain its safety properties.

Your internet or intranet networks are definitely asynchronous and assuming delays are bound is a recipe for building crappy systems that will inevitably fail on you in hard to debug ways.


Had you read on, you'd have seen that I am discussing this very point:

> leader election will only ever be eventually correct... So you’ll always need to be prepared to detect and fence off work done by a previous leader.


I'm sorry, I didn't mean to be bashful. I am not familiar with S3 and maybe what you describe is a perfectly safe solution for S3 and certain classes of usage.

I could not get past the point where you promulgate the idea that ZK can be used to implement locks.

Traditionally a 'lock' guarantees mutual exclusion between threads or processes.

"Distributed locks" are not locks at all. They look the same from API perspective, but they have much weaker properties. They cannot be used to guarantee mutual exclusion.

I think any mention of distributed locks / leader election should come with a giant warning: THESE LOCKS ARE NOT AS STRONG AS THE ONES YOU ARE USED TO. Skipping this warning is doing a disservice to your readers.


Reliable network communication is also proven to be impossible [1], yet it happens all the time. Yes, sometime it fails but it still “works”.

[1] https://en.wikipedia.org/wiki/Two_Generals%27_Problem


There's some serious flaws in your reasoning.

TCP guarantees order [as long as the connection is active] but it is far from being 'perfectly reliable'.

Example: sender sends, connection drops, sender has no idea whether the receiver received.

In other words, it works until it doesn't. The fact that sometimes it doesn't means it's not perfect.

TCP is a great tool but it doesn't violate the laws of physics. The 2 generals problems is and will always be impossible.


Yes, but the vast majority of network traffic these days is TCP and very, very rarely does that cause a problem because applications already need to have logic to handle failures which cannot be solved at the transport level. There is a meaningful difference between theoretically perfect and close enough to build even enormous systems with high availability.


Rounding up 'usually works' to 'is reliable' is a recipe for building crappy systems.

Rounding down 'usually works' to 'it's not perfect and we need to handle edge cases' is how you build dependable systems.

Your first comment seemed very much in the first camp to me.


Your second camp is the latter half of my first sentence. As a simple example, the transport layer cannot prevent a successfully-received message from being dropped by an overloaded or malfunctioning server, duplicate transmissions due to client errors, etc. so most applications have mechanisms to indicate status beyond simple receipt, timeouts to handle a wide range of errors only some of which involve the transport layer, and so forth. Once you have that, most applications can tolerate the slight increase in TCP failures which a different protocol would prevent.


> sometime it fails

That failures are a possibility does not concern me. That the failures are not fully characterized does.


I literally had this argument with David Schwartz of Ripple about our lack of a consensus protocol.


They both reduce to a paxos style atomic broadcast, which is in fact possible although the legend is that Leslie Lamport was trying to prove it impossible and accidentally found a way.


> They both reduce to a paxos style atomic broadcast

Atomic Broadcast guarantees order of delivery. It does not (cannot) guarantee timing of delivery. Which is what people want and expect when using distributed lock / leader election.


Ordering gives you a leader or lock holder, first claimant in the agreed ordering wins.

If you're saying "what if everything's down and we never get responses to our leadership bids", then yeah, the data center could burn down or we could lose electricity, too.


Ordering gives you ... ordering. And nothing more.

Process 1 receives: 3:00 [1] "P1 is leader" 3:01 [2] "P2 is leader"

Process 2 receives: 3:00 [1] "P1 is leader" 4:00 [2] "P2 is leader"

This is perfectly valid Atomic Broadcast. Order is maintained.

However from 3:01 to 4:00PM you have 2 leaders (or 2 processes holding the lock).

Don't use ABCast to do locking / leader election for your "user-space" application!


Good point, thanks


I remember at Oracle they built systems to shut down the previous presumed leader to definitively know it wasn't ghosting.


Yep, the "STONITH" technique [1]. But programmatically resetting one node over a network/RPC call might not work, if internode-network comms are down for that node, but it can still access shared storage via other networks... The Oracle's HA fencing doc mentions other methods too, like IPMI LAN fencing and SCSI persistent reservations [2].

[1] https://en.wikipedia.org/wiki/STONITH

[2] https://docs.oracle.com/en/operating-systems/oracle-linux/8/...


They had access to the ILOM and had some much more durable way to STONITH. Of course every link can "technically" fail but it brought it to some unreasonable amount of 9s that it felt unwarranted to consider.


Yep and ILOM access probably happens over the management network and can hardware-reset the machine, so the dataplane internode network issues and any OS level brownouts won't get in the way.


Indeed a slow node looks like a dead node for a while until it isn’t.

At some point distributed systems that work well vs others that do not is an “art of tuning timeouts and retries”.

Also nothing in production is perfect - so we should consider failures always when writing code in distributed systems and the impacts.

And we will still make mistakes…


> …which is proven to be impossible

For some definition of impossible, given that many systems utilise them effectively. Not all corner cases or theoretical failure modes are relevant to everyone.


Many systems are buggy, periodically running into corner cases that are borderline impossible to debug.


Sometimes it's enough to detect these cases and reset the system to its stable state.

For example bicycle wheels are bistable systems, but they usually stay in their useful state so it does not matter in practice.


Yes, and yet those systems still work and deliver real value all day every day. If every company Rollbar I've ever seen is the measure good software can have millions of faults and still work for users.


Can you point me in the right direction to understand Zookeepers fatal flaw?

I know you linked the first paper, but yeah candidly I don’t wish to read the full paper

Don’t mean this sarcastically whatsoever. Genuine interest.


The fatal flaw is calling them "locks".

When programmers think of locks, they think of something that can be used to guarantee mutual exclusion.

Distributed locks have edge cases where mutual exclusion is violated.

Implementation does not matter.

e.g. imagine someone shows you a design for a perpetual motion machine. You don't need to know the details to know it doesn't work! It would violate the laws of physics!

Similarly, anyone telling you they created an implementation of a distributed lock that is safe, is claiming their system breaks the laws of information theory.

"Distributed locks" are at best contention-reduction mechanisms. i.e. they can keep multiple processes from piling up and slowing each other down.

[Some] Paxos for example use leader election to streamline the protocol and achieve high throughput. But Paxos safety does NOT depend on it. If there are multiple leaders active (which will inevitably happen), the protocol still guarantees its safety properties.


For me, I almost stopped reading at the assertion that clock drift doesn't matter. They clearly didn't think through the constant fight that would occur over who the leader actually was and just hand-wave it away as 'not an issue.' They need to remove time from their equation completely if they want clock drift to not matter.


part of me thinks that clock drift would be reliably biased toward a particular node, not leapfrogging between two nodes.


It’s deeper than that. See the paper on time in distributed systems by Lamport: https://lamport.azurewebsites.net/pubs/time-clocks.pdf

There’s a relativity-like issue where it’s impossible to have a globally consistent view of time.

See IR2 for how to synchronize physical time in a distributed system.

“(a) If Pg sends a message m at physical time t, then m contains a timestamp Tm= C/(t). (b) Upon receiving a message m at time t', process P/ sets C/(t') equal to maximum (Cj(t' - 0), Tm + /Zm).”

I think the formula is saying that the clocks will only ever increase (i.e. drift upwards). If so, then you could imagine two processes leapfrogging if one of them sends a message that bumps the other’s clock, then that one sends a message back that bumps the first.

But I’m curious how it behaves if one of the clocks is running faster, e.g. a satellite has a physically different sense of time than an observer on the ground.

Also note the paper claims you can’t get rid of clocks if you want to totally order the events.


It's a truly fantastic paper, but like many things the idea that it's impossible to have a perfectly consistent global view of time doesn't mean it's not possible to have a near-perfect one. The blog mentioned AWS's Timesync which lets you be synchronized into the microseconds (https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-ti...), and Google's TrueTime is used to give ranges of times that you're guaranteed to be within (https://cloud.google.com/spanner/docs/true-time-external-con...).


It's near-perfect ... until it isn't and everything goes to shit in a cascading wtafc (what-the-actual-fuck-catastrophe).

1ms is an extremely long time in computer land where decisions are made in nanoseconds. There are 1000 nanoseconds per millisecond.


1000000 nanoseconds per millisecond.


Think you mean micro rather than milli


Yes thank you!


> Also note the paper claims you can’t get rid of clocks if you want to totally order the events.

The order of events isn't important here. We only care about the 'last event' which everyone can agree on; they just can't agree on when it happened.

In other words, they can all agree that there is a leader; we just don't know if that leader is still alive or not. The best thing to do is simply make the leader election deterministically random:

1. 'seed' the file with a 'no leader' file.

2. at randomly short intervals (max latency you want to deal with, so like once every few seconds or so), try to claim the file by using the hash of the current file as your etag. The winner is the leader.

3. Once there is a leader and you are the leader, every N seconds, update the file with a new number, using the hash of the current file as the etag.

4. If you are not the leader, every N*(random jitter)*C (where C>N*2, adjust for latency), attempt to take the file using the same method above. If you fail, retry again in N*(random jitter)*C.

5. If the leader is taken, you are not the leader until you've held leadership for at least N seconds.

This doesn't necessarily remove the clock. However, most -- if not all -- modern clocks will agree that the length of a second is within a few nanoseconds of any other clock, regardless of how accurate its total count of seconds is since the epoch. It also probably doesn't work, but it probably isn't that far away from one that does.


Why are people going for this guys throat— distributed locking might be impossible but what is being described, distributed leasing, is totally possible and useful. There's no sins being committed here.

I might choose DynamoDB over S3 to implement this but both are fine.


HN being HN, I guess. Nerds like to nitpick.

That said, if you have a choice between implementing idempotency and using a distributed lock, you should always opt for the former. It’s far less complex and less error prone.


Telling people "here's a recipe to do locks" should come with a giant flashing sign that say: "this is not an actual lock (as in in-process locks) -- locks in distributed systems are impossible, this cannot be used as a recipe for mutual exclusion"


I did the same thing 5+ years ago on GCS, using generation id.

And I thought S3 also supported that. I was always under impression that S3 at some point reached feature parity with GCS.

Are there any other fundamental features they are not equal on?


Is chubby a filesystem?


Google's locking service


it's an ongoing joke inside google: is chubby a filesystem? It's a locking service which exposes an API that resembles files and paths, and even has a File implementation, but the service owners really never wanted people to treat it like a filesystem.


Very cool case of using s3 conditional writes for distributed leasing


This is a very expensive way to do leader election, at least from an infrastructure perspective.

This is because you're essentially pushing the problem down to S3, which does its own leader election in a way that is waaaay overbuilt for what we're trying to accomplish here.

But... that doesn't mean it isn't cool. :)


One advantage of this approach is that if you're already using S3 (as in the SlateDB case mentioned in the article), it's essentially “free”. And it means that a package like SlateDB just needs to be pointed at an S3 bucket, instead of making you point it to an S3 bucket _and_ a DynamoDB instance.


Provisioning a DynamoDB table is scarcely more effort than provisioning a bucket. And you get way nicer constructs (still not as nice as RDBMS) for locking like ConditionalUpdate and TransactWrite.


Totally. I'm not advocating that people don't ever use it for this. I'm just saying that from a pure resource perspective, it might be one of the least resource-efficient mechanisms for doing this.

As others have pointed out, it's probably not a noticeable cost and, in fact, the fixed costs associated with setting something up yourself would far outweigh what you're paying to use S3 for this purpose.

Part of me just dies inside when I think of all the stuff needlessly happening behind the scenes, given it's not actually being used for storage. I mean, it's called Simple Storage Service.


I'll add that if a lot of people actually did start using it for this purpose, they could probably just productionize the thing they actually use for this, which is essentially their own version of etcd, but with Paxos instead of Raft.


Since the user is only paying an infinitesimal fraction of the infrastructure cost, does it really matter? From the user’s perspective it’s extremely inexpensive.


If everyone started using S3 for this purpose, they would shut it down pretty quickly.

Small objects (especially small hot objects) are actually problematic in S3. They cache them in the keymap instead of retrieving from block storage. But many small objects can quickly blow out a keymap shard. The keymap is designed for high-performance lookups. Because of this, it's also more expensive to scale out this class of 'storage' at the keymap layer than to scale out block storage. And you still have to do quorum writes to block storage and then cache eviction at the keymap.

If you're doing this for slow-moving leader election in a small cluster, fine. But if, for example, you started using this for leader-election among end-user-defined workloads that could scale up/down, you might find yourself on the other side of a call from AWS.


> If everyone started using S3 for this purpose, they would shut it down pretty quickly

I work at AWS but not on the S3 service team. (Opinions are entirely my own.)

I have little doubt that the team already considered this possibility before releasing the feature. The choice to make new functionality available in a service is considered a “one-way door” that receives a tremendous amount of scrutiny. Once a feature is made available, they will do everything they can to support it as long as possible, even at the cost of convenience to themselves. And even de-emphasized services (keep the lights on) are rarely deprecated completely - existing accounts and customers already using them can frequently continue to do so.


They introduce economic incentives. :)

By way of example, a little over a decade ago a famous online streaming company used to upload all of their video masters for transcoding. This process involved a huge upload and a huge workload, followed by a significant drop in activity.

The problem was that AWS had to provision for peak usage instead of average usage. This resulted in a situation where the peak-to-average ratio was very high for just one or two customers.

To address this issue, the solution was to incentivize these customers to spread out their workload more evenly over time, at least until they were no longer the largest driver of peak/avg.

This is also why things like Reserved Instances and Spot Instances exist.


I’m sure we agree that other means such as changing pricing and reaching out to exceptional customers are not the same thing as killing a feature.


I never said they would kill the feature. I said they would shut down the behavior. They can do that through economic incentive.

(Source: I was on the S3 team. Opinions my own, etc.)


You said they would “shut it down,” which can reasonably be interpreted as killing the feature. If you meant something more specific, you should have said that. Clarity is the duty of the speaker.


You must be fun at stand-ups.


>> “If everyone started using S3 for this purpose, they would shut it down pretty quickly.”

> “I never said they would kill the feature. I said they would shut down the behavior.”

Seems "it" is standing in for a lot of nuance here.

With these three sentences side by side, still took a while to see how you might be referring to the "everyone using" behavior instead of the "S3 purpose" feature! Usually given ambiguity the nearest plausible reference wins.

Since in tech "shut it down" is more often systems than humans and that was the nearest reference "it" could refer to, took some doing to see how your assertion "I said the behavior" could be accurate!


I'll concede that "it" was ambiguous. Though if you take the most literal version of what I said, it would be they'd shut down S3, which clearly isn't true.

We can continue to debate whether or not I (who worked on that team and generally understands what goes into building scaled cloud services, having built several myself) understand how a cloud provider responds to customers using your service in a way you clearly didn't intend, or we can move on with our day.


I wasn't debating what you know from the vantage of someone who's founded and built global cloud SaaS services at scale for a while. I was noting how it was ambiguous but going further to say "ah, I see how you can assert you said it," giving you credit.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: