
Reliable Webhooks Using Serverless Architecture - prostoalex
https://medium.com/square-corner-blog/reliable-webhooks-using-serverless-architecture-e009a2096732
======
koblas
Would have hoped for a more in-depth article. It looks like lots of startups
have adopted the "attach marketing to your engineering department" approach as
a way to drum up interest for recruiting. This is a great strategy, but you
should still provide more depth.

------
Diggsey
The article doesn't mention anything about in-order delivery, which adds an
extra bit of challenge because most message queues don't have any support for
strict ordering. I implemented a simple solution for the company I work for,
which uses a Rust microservice backed by a Postgres database. It could easily
be made "serverless" by uploading the same service with minor changes to AWS
Lambda.

We use separate queues for each customer so we can send webhooks in parallel
whilst guaranteeing in-order delivery for each customer (we can actually
support multiple queues per customer for super high volume if required).
Messages are removed from the queue only once successfully delivered, so we
never "lose" messages, no matter how long a customer's system fails to receive
a message. Failing messages stall the queue and are retried with exponential
back-off to ensure the order is preserved.
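
A minimal in-memory sketch of the stall-and-retry behaviour described above, not the actual service (which is Rust backed by Postgres). The names `CustomerQueue`, `deliver_due` and `backoff_delay`, and the 1 second base and 1 hour cap, are assumptions made for illustration.

```python
import collections


def backoff_delay(attempt, base=1.0, cap=3600.0):
    """Seconds to wait before retry number `attempt` (1-based), doubling each time."""
    # base and cap are illustrative values, not taken from the comment.
    return min(cap, base * 2 ** (attempt - 1))


class CustomerQueue:
    """FIFO of pending webhooks for one customer; a failing head blocks the rest."""

    def __init__(self):
        self.pending = collections.deque()
        self.attempts = 0        # failed attempts for the current head message
        self.next_try_at = 0.0   # earliest time the head may be tried again

    def enqueue(self, message):
        self.pending.append(message)

    def deliver_due(self, send, now):
        """Send messages in order until the queue is empty or the head fails."""
        delivered = []
        while self.pending and now >= self.next_try_at:
            head = self.pending[0]
            if send(head):
                # Removed only after a successful delivery, so nothing is lost.
                self.pending.popleft()
                self.attempts = 0
                self.next_try_at = 0.0
                delivered.append(head)
            else:
                # Stall the whole queue; later messages wait behind the head.
                self.attempts += 1
                self.next_try_at = now + backoff_delay(self.attempts)
        return delivered
```

Here `send` is any callable that returns True on successful delivery. Running one `CustomerQueue` per customer keeps a failing customer from delaying anyone else, while ordering within each queue is preserved because nothing behind the head is ever attempted first.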

We also have a web UI for developers built into our product, so that you can
see recent webhook delivery failures (we store error messages returned by our
customers' systems), disable webhooks (clearing the queue) or retry a failed
webhook immediately (very useful for debugging!).

The whole service is a few thousand lines of code, works faster than we could
ever need using only basic SQL queries and polling, and delivery is within a
few seconds compared to the 30 seconds listed in the article. It's not clear
where that delay comes from; I can't imagine it's intrinsic to SQS?

~~~
tirumaraiselvan
Even with separate queues per customer, how are you ensuring ordering? Are you
using something like Kafka, which ensures ordering within partitions?

~~~
Diggsey
No. Kafka is one of the few message queues which support ordering, so for a
more advanced solution it could be the right choice, but we don't use anything
more than Postgres.

We have workers "take locks" on the queues they are processing. I put that in
quotes because they are purely software locks: workers store the time when
they started working on the queue, and other workers avoid queues whose
timestamp is less than X seconds old. When a worker finishes with a queue it
resets the timestamp to NULL.

It's still possible for a worker to take too long to send a webhook and for
another worker to pick it up (although very unlikely because we can set X
quite large with no side effects): in that case the worst that can happen is
that the webhook is sent twice, so we can still guarantee that the _first
successful_ send of each webhook occurs in the right order.

Our webhooks all have UUIDs, so it's pretty easy for consumers to process them
idempotently.
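
On the consumer side, deduplicating by UUID could look roughly like this. It is only a sketch: the `"id"` field name and the `IdempotentConsumer` class are assumptions, and a real consumer would keep the seen IDs in durable storage rather than in memory.

```python
import uuid


class IdempotentConsumer:
    """Runs the handler at most once per webhook UUID."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # in-memory for illustration; use a database in practice

    def receive(self, webhook):
        """Process a webhook once; return False for a duplicate delivery."""
        webhook_id = webhook["id"]  # assumed field name for the UUID
        if webhook_id in self.seen:
            return False
        self.handler(webhook)
        # Recorded only after the handler succeeds, so a failed attempt
        # can still be retried by the sender.
        self.seen.add(webhook_id)
        return True


if __name__ == "__main__":
    consumer = IdempotentConsumer(print)
    hook = {"id": str(uuid.uuid4()), "event": "example"}
    consumer.receive(hook)  # handled
    consumer.receive(hook)  # ignored as a duplicate
```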

------
revskill
How do you handle "missed events" in a webhook architecture?

------
tirumaraiselvan
I am the author of Hasura Event Triggers [1]. We guarantee reliable webhooks
by persisting each event in Postgres, where it is then processed in parallel
by multiple workers.

[1] https://hasura.io/event-triggers

