This is rubbish. We've run with guaranteed webhook ordering for years, so the idea that you can't is laughable.
Timestamps don't solve the issue, and neither do "thin payloads" since the receiver has no idea how long to wait before assuming that the order is certain, and if you have a problem on the sender side it could cause logic errors for all of your clients.
Most of these problems are solved if the receiver doesn't process the webhook immediately, but instead queues it internally. You don't have issues with the queue being stalled due to one bad webhook, because there is no event-specific processing happening on the receiver (other than perhaps ignoring some events). The queue can still be stalled if there is a wider problem, but as soon as the problem is resolved, the system can catch up on those queued webhooks, and synchronization integrity is maintained.
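The queue-first receiver described above can be sketched in a few lines. This is a minimal illustration (class and method names are made up, and `paused` stands in for a wider downstream problem), not a production design:

```python
from collections import deque

class WebhookReceiver:
    """Acknowledges webhooks immediately; applies them from an internal queue."""

    def __init__(self):
        self.queue = deque()     # raw webhooks, in arrival order
        self.processed = []      # events actually applied
        self.paused = False      # stands in for a wider downstream outage

    def handle_webhook(self, payload):
        # No event-specific logic here: just persist the event and acknowledge.
        self.queue.append(payload)
        return 200

    def drain(self):
        # Worker loop: applies queued events. During an outage the queue
        # simply grows; once resolved, the next drain() catches up.
        while self.queue and not self.paused:
            self.processed.append(self.queue.popleft())
```

Because `handle_webhook` does no event-specific work, one bad event can't stall ingestion; it can only stall the drain loop, which catches up once the underlying problem is fixed.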
Having said all that, if I were to design a new system I would go with a pull-based system instead. In this system, the client would request a range (start time, max count) of events via an HTTP request, and the response would include the "end time" that can be used in the next query. A "webhook" would contain an empty payload, and would simply indicate that the queue had become non-empty - this could be omitted entirely if realtime updates are not required, instead having the client poll.
The advantages of this approach are that it's easy for consumers to "replay" a set of events if they accidentally lose them, and it's also a lot more efficient, since many events can be sent per request (we gain some of this benefit at the moment by supporting "batch" webhooks containing multiple events, but it requires opt-in from the client.) Additionally, it allows webhooks to be versioned more easily, since you can have versioned endpoints for fetching events, and it also allows you to have an arbitrary number of consumers of the same set of events with no additional complexity.
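A rough sketch of that pull loop, with `fetch_events` standing in for the HTTP endpoint (the field names `ts`, `events`, and `end` are assumptions, not a real API):

```python
def fetch_events(log, start, max_count):
    """Stand-in for the HTTP endpoint: returns up to max_count events
    after `start`, plus the "end time" cursor for the next query."""
    page = [e for e in log if e["ts"] > start][:max_count]
    end = page[-1]["ts"] if page else start
    return {"events": page, "end": end}

def sync(log, start=0, max_count=2):
    """Client loop: page through the event log until it is drained."""
    seen, cursor = [], start
    while True:
        resp = fetch_events(log, cursor, max_count)
        if not resp["events"]:
            return seen, cursor
        seen.extend(resp["events"])
        cursor = resp["end"]
```

Replay falls out for free: re-running `sync` with an older cursor re-fetches everything after that point.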
You obviously CAN guarantee ordering, it's just that you can't guarantee it as the sender, you need cooperation from the receiver. Additionally, putting them in a receive queue on the receiver doesn't solve the issue unless the receiver takes extra care to also read from the queue in strict (non-overlapping) order which is rarely the case, and even then has significant throughput implications. So it really is all on the receiver. This piece was written from the context of the sender.
Timestamps definitely don't solve the issue, I explicitly said to use a centralized sequence number if you must (not a great idea in most cases). Thin payloads: the idea behind that is essentially to use the webhooks as a "please update" kind of notification and then you get the most recent data from the server. Essentially what you called a "pull system", it's a combination of both a push (webhook) to know when to pull, and the pull to get the data. This also doesn't work as nicely in many scenarios (because oftentimes, receivers want the data immediately without having to fetch), but it's good in others.
Please take a look at the content of the article (rather than just the title), I've addressed most of it there too.
I agree with the parent on the weirdness of how the problem is stated in the first place.
On your customer and card example, the issue is not message delivery order but processing order, or more precisely prerequisite satisfaction.
My first thought looking at it was to just store the data of any of the hooks coming in, check the prerequisites each time, and only process the whole when everything needed has arrived.
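Something like this, roughly (the event names and the dependency map are hypothetical, just to show the shape of the idea):

```python
# card.created can't be processed until its customer.created has arrived.
PREREQS = {"card.created": {"customer.created"}}

class PrereqProcessor:
    def __init__(self):
        self.stored = set()      # event types received and stored so far
        self.pending = []        # events waiting on a prerequisite
        self.processed = []

    def receive(self, event):
        # Store the data of any hook that comes in, in any order.
        self.stored.add(event)
        self.pending.append(event)
        self._retry()

    def _retry(self):
        # Re-check prerequisites each time; process whatever is now ready.
        still_pending = []
        for event in self.pending:
            if PREREQS.get(event, set()) <= self.stored:
                self.processed.append(event)
            else:
                still_pending.append(event)
        self.pending = still_pending
```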
Trying to dictate order from the sender without any cooperation from the receiver seems like a fool's errand, as in any real world scenario where it really matters, the receiver will also want a way to check it actually received everything in order.
I responded in much the same way below - your job as a sender is not to guarantee that the receiver will do its job in processing but to provide a reliable set of webhook messages so that if the receiver does fail, at least they can discover they've missed or skipped messages or are processing them out of order. As a sender, you certainly can provide guaranteed ordering or a way to identify the order of those messages. What you can't guarantee is that the receiver will process them in any given order if they choose to ignore the ordering you provide.
I understand it may not have been very clear. Though the point is that no one cares about delivery order, what they really care about is processing order. So it doesn't matter if you ensure delivery order if they process it out of order.
As for relying on the customers to get ordering correctly: it's actually more involved and easier to get wrong than people realize, and it's better to avoid it altogether in how you design your API if possible.
Thanks for this, your post resonated with me. It’s good to know that most of what you did is what I ended up doing for a customer implementation (we’re using Odoo and Queue Job to bring in sales from Shopify), and Shopify doesn’t always guarantee the ordering of their order webhook payloads.
> In this system, the client would request a range (start time, max count) of events via an HTTP request, and the response would include the "end time" that can be used in the next query.
What happens if two transactions commit out of order? tx1 with a lower timestamp commits after tx2 with a higher timestamp has committed - and your client just saw tx2's timestamp.
Or if you have ≥ $maxCount events at the same exact timestamp?
The timestamp in this case would be when the message was added to the queue, not the timestamp of the transaction which triggered the event.
If two transactions are non-causal, it doesn't matter which order the events arrive in the queue, but once the message is in the queue, the order is fixed.
> Or if you have ≥$maxCount number of events changed the same exact timestamp?
Use a sufficiently precise timestamp that this doesn't happen, or add a counter in the low bits. The only reason to use a timestamp rather than a simple incrementing counter is to make it more convenient for recipients to re-request historical events (eg. I want to replay all events since yesterday) and to make debugging easier, since with a counter it's a bit meaningless.
The timestamp is not meaningful for the actual event, its only purpose is to specify where this event sits in the total order.
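The "counter in the low bits" trick can be sketched like this (the 16-bit counter width is an arbitrary assumption):

```python
COUNTER_BITS = 16  # assumed width; enough for 65536 events per microsecond

def make_cursor(ts_micros, counter):
    """Pack a microsecond timestamp and a per-timestamp counter into one
    integer that sorts in total order."""
    assert counter < (1 << COUNTER_BITS)
    return (ts_micros << COUNTER_BITS) | counter

def split_cursor(cursor):
    """Recover the timestamp (for human-friendly replay/debugging)."""
    return cursor >> COUNTER_BITS, cursor & ((1 << COUNTER_BITS) - 1)
```

Events at the same timestamp still sort deterministically by the counter, while the high bits keep the cursor roughly meaningful as a point in time for replay requests.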
I think those are some excellent points, and after plenty of experience with webhook-based integrations, I agree that they are definitely a pain.
While I largely agree with you, I'm hesitant to say that it is always preferable to use an /events endpoint, for a few reasons:
1. This requires the client to essentially implement an event-sourced architecture. There are many advantages to such architectures, but they are more complicated and can be tricky to implement.
2. It's important to consider the direction of coupling in systems, and how that affects your ability to evolve the architecture of the whole system.
3. Polling is generally going to involve a higher amount of network traffic, and will have to be weighed against the latency requirements for processing an event.
This is a big pain point with Stripe's webhooks, and I think there's ample room for improvement.
Senders could guarantee ordering by only sending webhook n+1 after the HTTP request for webhook n completes, rather than sending them concurrently or in arbitrary order. For efficiency, perhaps only guarantee ordering for hooks related to each resource rather than all of a customer's hooks.
Or, include a monotonic counter in the webhook so the recipient can tell when it would apply an old state on top of a new one.
What the recipient does when they receive the webhook is up to them (delays, parallelism, etc.), but at least they'd know the correct event order.
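The monotonic-counter check is small enough to sketch; the recipient just tracks the highest counter applied per resource and drops anything older (names here are illustrative):

```python
class CounterGuard:
    def __init__(self):
        self.latest = {}  # resource id -> highest counter applied so far

    def should_apply(self, resource_id, counter):
        """Return True (and record the counter) only if this webhook is
        newer than anything already applied for the resource."""
        if counter <= self.latest.get(resource_id, -1):
            return False  # stale delivery: would apply old state over new
        self.latest[resource_id] = counter
        return True
```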
The author raises a good point about what to do in the face of errors, but I'd vastly prefer to handle special behavior upon recipient error (stall, dead letter queue) to the current Stripe reality of "things come in out of order, and we don't give you the info needed to reassemble the order on your end".
The problem with "n+1 after the HTTP request for webhook n completes" is that your throughput is very adversely impacted. Assume a webhook takes 1s to process (usually much slower once you include network latency and endpoint processing time); you're then effectively limited to one request per second.
Counter makes it slightly better because then you can reconstruct the order without the above artificial limit, though it's also not great (though indeed much better!).
The solution to receiving webhooks in unknown order is to ignore the payload and refetch the resource. Yet naively implemented, this still leaves race conditions on the recipient end: if two webhooks can come in at once, you have to make sure you process them serially, since your refetch or database write could complete in arbitrary order.
That's non-trivial engineering to foist upon every recipient of your webhooks.
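One common shape for that serialization is a per-resource lock around the refetch-and-write, so two deliveries for the same object can't interleave. A sketch (the lock registry and `fetch_latest` callable are assumptions):

```python
import threading
from collections import defaultdict

class SerialHandler:
    def __init__(self, fetch_latest):
        self._locks = defaultdict(threading.Lock)  # one lock per resource
        self._fetch = fetch_latest                 # resource_id -> current state
        self.store = {}                            # local copy of each resource

    def on_webhook(self, resource_id):
        # Ignore the payload entirely; refetch and write under the lock so
        # concurrent deliveries for one resource run strictly one at a time.
        with self._locks[resource_id]:
            self.store[resource_id] = self._fetch(resource_id)
```

Note this only serializes per process; with multiple workers you'd need a distributed lock or a per-resource queue, which is exactly the non-trivial part.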
I mentioned it in the post, but yeah, also not trivial.
Polling /events solves some problems but introduces others. A mix of push (webhooks) and pull (/events) can also work, which is what I was referring to with the "thin payloads", though it's not a great experience for many use-cases and it requires state (many webhook recipients are stateless - e.g. Zapier or Slack).
> At first glance it seems like a simple, and easy to implement idea — just send the webhooks in order.
Not webhook-specific, but I spent a couple of hours today figuring out that some of our service calls to internal services look like they open, are sent, and are processing, but the target server doesn't even see the request for a full 8s sometimes. The call itself was not the problem; the service just hadn't started until long after the data was all sent.
Would have thought this is self-evident. Intermediaries exist, and they can do essentially anything. Without absolute control over every aspect of the systems involved, you have no guarantees about ordering.
There's no way to guarantee that the receiver will process the webhooks in order, since you have no control over the receiver, but you CAN guarantee the order in which you send them, or at least identify the order of those webhooks and provide a way to discover it, if the receiver chooses to pay attention to ordering.
> but you CAN guarantee you can send them in a given order
That's not sufficient. Intermediate proxies can reorder your requests however they wish for whatever reason they want, and then change behavior with no notice at any time. In the real world of HTTP you'll get duplicates, false positives and every other conceivable failure mode.
> or at least identify the order of those webhooks and provide ways to identify or discover that order
Sure, you might invent some protocol that incorporates a sequence number or uses some chaining mechanism.
The thing is this: if you find yourself engaging in such gymnastics, and you're any good as an engineer, it needs to occur to you that you're using the wrong medium - hopefully long before you obligate yourself to the task. "Webhooks" are a pretty fragile thing to use when your requirements involve stuff like "order." And if it didn't occur to you, then you're unlikely to get whatever sequencing mechanism you invent working properly either, because that's actually a hard problem that doesn't yield to the sort of muddlers unaware that "You Can't Guarantee Webhook Ordering."
Those who fear or do not know Kafka are doomed to work around its absence. Like I just cannot understand this mindset of "just educate people" when the system doesn't meet the requirements of its users. If your users want event ordering just give them event ordering.
You can even keep your existing webhook code by providing a synchronous bridge to Kafka so "just send them in order but wait for the 200 before sending the next one." Boom, now you are guaranteed the events are recorded and processed in order.
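One reading of that "synchronous bridge" idea, sketched without the Kafka client itself (the list stands in for an ordered partition, and `deliver` for the HTTP POST; both are placeholders):

```python
def bridge(partition, deliver, start_offset=0):
    """Deliver events one at a time; on failure, stop and return the
    offset to resume from, so nothing is skipped or reordered."""
    offset = start_offset
    while offset < len(partition):
        if deliver(partition[offset]) != 200:
            break             # retry later from this same offset
        offset += 1           # "commit" only after a successful 200
    return offset
```

The key property is that the offset only advances after a 200, so a restart picks up exactly where delivery last succeeded, at the cost of the throughput limits discussed elsewhere in this thread.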