
Resiliency with Queues: Building a System That Never Skips a Beat in a Billion - brazeepd
https://www.braze.com/perspectives/article/building-braze-job-queues-resiliency
======
lkrubner
I know that most of us don't want to read someone's thesis because typically a
thesis is boring, but I want to strongly encourage you to read Joe Armstrong's
thesis, about the origins of Erlang. I did not plan to read it, I just glanced
at it, and I got pulled in, and I ended up reading the whole thing because it
was so interesting:

Making reliable distributed systems in the presence of software errors

[http://erlang.org/download/armstrong_thesis_2003.pdf](http://erlang.org/download/armstrong_thesis_2003.pdf)

If the subject of "resiliency with queues" is interesting to you, then what
Joe Armstrong wrote will definitely be interesting to you.

------
randaouser
From my experience, the Erlang actor model has some of the strongest queue
mechanics, and the systems built on it are the most resilient I have deployed.
As some background, I worked as a contractor for a telco and built a 6 9's
system for their service monitoring using Erlang and its supervisor model. In
3 years the system has always been able to recover from errors with near-zero
downtime (the 6 9's figure comes from production metrics).

~~~
etaioinshrdlu
Was there anything in particular in Erlang that you wish other frameworks or
languages had, in terms of building reliable systems?

~~~
coldtea
The "supervisor model".

------
chillaxtian
> Going back to our message-sending example, how might we use these concepts
> to ensure consistency? In this case, we might break the job into two pieces,
> with the first one sending the message and enqueuing the second one, and the
> second one writing to the database. In that scenario, we can retry either
> job as many times as we want—if the message-sending provider is down, or the
> internal accounting database is down, we’ll appropriately retry until we
> succeed!

This still isn't any better than the initial example. It could still crash
between sending the message and enqueueing the second job. So it may still
send out the same message twice.

~~~
tybit
The writing wasn't particularly clear; this wasn't a proposed solution but a
use case where at-least-once delivery + idempotency ensures the message is
never duplicated in either downstream system.

~~~
peatfreak
I agree. I find the writing in this article to be unclear and it's making me
very confused.

------
pwaivers
>"Failing to send one of those messages has consequences, whether that’s a
missed receipt or—even worse—a missed notification letting a user know that
their food is ready."

This may be tongue-in-cheek, but it is not true. That is a great example of a
scenario where it is _okay_ to lose messages. The consequences are extremely
low if one in a billion people is not notified that their food is ready.

~~~
nyfresh
>"This is a great example where it is okay to lose a message"

It's not, if a client is in a trial with your service and they miss a message
you risk losing that client. It's only "okay" to not deliver on non-client-
facing services. Anything else is an unmeasurable risk.

~~~
FakeComments
It’s not an unmeasurable risk to not tell someone their food is ready.

And that kind of absolutism in technology is the source of a common failure to
meaningfully deal with failure modes of your technology.

Losing one in a billion messages telling someone their food is ready can be
offset by $100 in marketing budget to buy that person a very nice meal in
compensation. We know how to deal with hospitality failures like that, it’s
not actually complicated.

Spending the effort to reduce the failure rate below that is not worth the
cost, which is certainly more than $100. There are almost certainly better
uses for those developer resources.

------
CorvusCrypto
I have a few questions after reading. Mostly I'd like to know how they built
their dynamic queueing system.

How do they signal the workers to refresh their dynamic queue lists?

Are they using a homebuilt queueing system or piggybacking on top of something
like celery?

What is the underlying message bus and how is it deployed?

Would love to know more. Also kudos to them for stressing idempotency in a
message system. It's usually much easier to ensure idempotency than to ensure
exactly once delivery.

~~~
amz3
> How do they signal the workers to refresh their dynamic queue lists?

> Are they using a homebuilt queueing system or piggybacking on top of
> something like celery?

Probably homebuilt. Other comments mention Erlang; I don't know if Braze uses
Erlang. After reading the dissertation about it
[http://erlang.org/download/armstrong_thesis_2003.pdf](http://erlang.org/download/armstrong_thesis_2003.pdf)
and given my experience with Celery, it is clear that they have their own
queueing system, maybe based on Erlang. My understanding is that in Erlang
systems workers have much more knowledge than Celery workers about what
happens in the whole system.

Celery workers are very simple: they poll for tasks in a single statically
defined queue. There is no sense of priority; it's a FIFO queue. You emulate
priorities with several queues, each of which has a number of workers
proportional to its priority. That's not as simple as it sounds, because it
depends on the kind of tasks they must execute (e.g. how long each task takes
to execute, and what kind of resources it requires (RAM, CPU, IO, GPU)).
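
A toy sketch of that emulation, with plain threads standing in for Celery
workers (all the names here, like `WORKERS_PER_PRIORITY`, are made up for
illustration):

```python
import queue
import threading

# Emulating priorities with multiple FIFO queues: the "high" queue gets
# three workers, the "low" queue gets one. Purely illustrative names.
WORKERS_PER_PRIORITY = {"high": 3, "low": 1}
queues = {name: queue.Queue() for name in WORKERS_PER_PRIORITY}
results = []
results_lock = threading.Lock()

def worker(q):
    while True:
        task = q.get()
        if task is None:  # sentinel: shut this worker down
            break
        with results_lock:
            results.append(task)  # stand-in for real task execution
        q.task_done()

threads = []
for name, count in WORKERS_PER_PRIORITY.items():
    for _ in range(count):
        t = threading.Thread(target=worker, args=(queues[name],))
        t.start()
        threads.append((name, t))

# High-priority tasks get proportionally more workers draining their queue.
for i in range(6):
    queues["high"].put(f"high-{i}")
queues["low"].put("low-0")

for q in queues.values():
    q.join()  # wait until every task has been processed
for name, _ in threads:
    queues[name].put(None)  # one sentinel per worker
for _, t in threads:
    t.join()
```

The awkward part the comment alludes to is picking those worker counts: they
have to be tuned to what the tasks actually cost, not just their priority.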

At the end of the day, I am not happy with Celery + RabbitMQ. I am also
looking for dynamic queueing systems [0].

[0]
[https://github.com/celery/celery/issues/4901](https://github.com/celery/celery/issues/4901)

------
xchaotic
how is writing reliably to a queue any different from writing to a database?
I'd say same principles should apply - journals and two stage commits as a way
to verify writes?

~~~
scarface74
In theory none. But in practice, a lot more can go wrong when writing to a
database.

~~~
tatersolid
Actually, with 20+ years working in financial services, I’d say the
_transactional RDBMS table as queue_ pattern is by far the simplest, most
reliable, and almost always the best choice of queueing system for business
applications.

Every other queue system I’ve encountered has terrible failure modes, buggy
clients, weak semantics, and huge serialization overhead.

~~~
scarface74
Using a database as a queue is a well-known anti-pattern.

[http://mikehadlow.blogspot.com/2012/04/database-as-queue-anti-pattern.html](http://mikehadlow.blogspot.com/2012/04/database-as-queue-anti-pattern.html)

Transactions make the problem worse with locking.

~~~
tatersolid
That’s one opinion, and ignores the reality that almost all message passing
systems need to store state in a database.

Read all the comments on that post.

Transactions make the problem _easier_ , not “worse”.

~~~
scarface74
If you have twenty producers and twenty consumers and transactions, you are
going to get all types of deadlock scenarios. I have one queue process that at
its maximum scaling has 16 processes consuming a queue with 10 messages each,
running 10 threads - 160 messages at once. They are only doing inserts and
updates - at most a row lock.

But if you had the queueing logic in the database, you would have to hold
locks on the table containing the messages and their statuses. Not to mention
even more database load to delete the messages. And on top of that, you are
constantly polling the database.

Now compare that to a purpose-built queueing system where you just read from
the queue, it automatically marks the messages as unavailable - "in flight" -
and it will automatically requeue a message after a certain amount of time if
it isn't deleted.

As opposed to a database system where either messages get stuck in
"processing" status if a job fails, or you have another process reading the
queue and "fixing" the issue after a certain amount of time has passed.

And finally, what happens if you need a fan out type of queue? Where you have
one producer and multiple types of consumers?

~~~
tatersolid
What, exactly, do you think a purpose-built queuing system does under the
hood?

If it’s sane, it’s doing _exactly_ the same transactional row locking an RDBMS
is highly optimized to do.

I have apps with RDBMS queues doing thousands of work items per second with 16
workers. Very few applications need more than this.

Introducing a whole new subsystem and API for queueing is a bad engineering
decision in most cases.

Fan-out is generally an anti-pattern, but if you need it, then explore other
options.

Note that I am talking about Kafka-style linear queues in an RDBMS, not pub-
sub.

Triggers or app code do very fast transactional inserts into queue as part of
the initial write; consumers fast-poll with “update top 1 ... where status =
<unprocessed>” and back off polling exponentially when the queue is empty.
This is just a few lines of code trivial to do correctly and cannot deadlock
in READ COMMITTED or SERIALIZABLE modes.
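
A sketch of that claim-by-update pattern (using SQLite for portability rather
than SQL Server's `update top 1`; here a worker claims the oldest row with an
`UPDATE ... WHERE status = 'unprocessed'` whose rowcount says whether it won):

```python
import sqlite3

# Illustrative table-as-queue; schema and names are made up for this sketch.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE queue (
    id INTEGER PRIMARY KEY,
    payload TEXT,
    status TEXT DEFAULT 'unprocessed')""")
db.executemany("INSERT INTO queue (payload) VALUES (?)",
               [("job-a",), ("job-b",)])
db.commit()

def claim_next(db):
    """Claim the oldest unprocessed row; return its payload or None."""
    row = db.execute("""SELECT id, payload FROM queue
                        WHERE status = 'unprocessed'
                        ORDER BY id LIMIT 1""").fetchone()
    if row is None:
        return None  # queue empty: caller should back off exponentially
    cur = db.execute("""UPDATE queue SET status = 'processing'
                        WHERE id = ? AND status = 'unprocessed'""", (row[0],))
    db.commit()
    # rowcount == 0 means another worker claimed the row first; just retry.
    return row[1] if cur.rowcount == 1 else claim_next(db)

first = claim_next(db)
second = claim_next(db)
third = claim_next(db)  # queue is now empty
```

The status filter in the UPDATE is what keeps two workers from processing the
same row, without any table-level lock.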

~~~
scarface74
How is “fan out” an anti pattern? You have an event or message that needs to
be processed by multiple systems. This is completely normal.

So introducing a purpose built queueing system - that people have been doing
for decades instead of using a database is technically bad?

A queueing system at most has to respond to a request, set a flag for
"processing", and it's done; and since by definition a queue only has to read
the topmost item, it is more efficient.

Besides, how do you handle a message that is sitting in a “processing” state
because the process that originally read it crashed and didn’t change the
status?

 _Triggers or app code do very fast transactional inserts into queue as part
of the initial write; consumers fast-poll with “update top 1 ... where status
= <unprocessed>” and back off polling exponentially when the queue is empty.
This is just a few lines of code trivial to do correctly and cannot deadlock
in READ COMMITTED or SERIALIZABLE modes._

Or instead of reinventing the wheel with your own bespoke database as a queue
that has all of the maintenance overhead you could just use a queueing system
that is already optimized for that use case.

 _Note that I am talking about Kafka-style linear queues in an RDBMS, not pub-
sub._

In general parlance, most people would call Kafka style processing stream
processing.

Even then, why write and maintain a pseudo streaming process that you have to
maintain instead of just using Kafka where you can let it handle all of the
fault tolerance, partitioning etc?

It’s about like a past company that I worked for where the “architect” had his
own homegrown ORM, encryption scheme, and configuration system instead of just
using an off the shelf solution because he thought his system was its own
special snowflake.

~~~
tatersolid
Fan-out is an anti-pattern because it introduces the complexity of concurrency
to an asynchronous process where timing isn’t critical. And if you think a
“stream” and a “queue” are somehow different, I don’t know what to say.

95% of applications only need a queue for “do this thing asynchronously so the
user doesn’t have to wait.” This is where a DB table queue is the best
solution. Introducing RabbitMQ or any other service in such a common case is a
terrible idea.

Every real-world message passing implementation I’ve encountered ends up with
the complexity of a bespoke state database for each consumer to handle re-
ordering and crashes, as well as poorly written code and more state in a
database to handle message replay. Not all messages can be idempotent, and
consumers never end up stateless in the real world.

All of these solutions lost events in production under various conditions.

My point is: think about if you really need the complexity of managing another
service in production, when all you really need is “do this thing as soon as
you can”.

~~~
scarface74
_Fan-out is an anti-pattern because it introduces the complexity of
concurrency to an asynchronous process where timing isn’t critical. And If you
think a “stream” and a “queue” are somehow different I don’t know what to
say._

So now the common definitions of things are wrong. So how do you propose that
multiple systems that all care about a single event get notified? You realize
that processing queues with a fan-out pattern has been done for decades?

 _95% of applications only need a queue for “do this thing asynchronously so
the user doesn’t have to wait.” This is where a DB table queue is the best
solution. Introducing RabbitMQ or any other service in such a common case is a
terrible idea._

Because based on your anecdotal experience you can say with confidence that
“95%” of people are doing it wrong....

 _Every real-world message passing implementation I’ve encountered ends up
with the complexity of a bespoke state database for each consumer to handle
re-ordering and crashes, as well as poorly written code and more state in a
database to handle message replay. Not all messages can be idempotent, and
consumers never end up stateless in the real world._

Well, maybe “in your real world”, but people have been managing queues with
idempotency, statelessness, and out of order execution for decades.

As far as handling crashes, there is nothing to do. Once a message has been in
process for a certain amount of time (“in flight”) and the consumer hasn’t
acknowledged successful processing, the queueing system automatically puts the
message back in the queue. After a certain number of retries it goes into a
dead-letter queue.
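
A toy in-memory model of those semantics (not any particular broker's
implementation; `VisibilityQueue` and its knobs are made-up names):

```python
import time

class VisibilityQueue:
    """Toy model of SQS-style delivery: received messages go 'in flight';
    unacknowledged ones are requeued after a visibility timeout, and after
    max_retries they land in a dead-letter list instead."""

    def __init__(self, visibility_timeout=0.05, max_retries=2):
        self.ready, self.in_flight, self.dead = [], {}, []
        self.timeout, self.max_retries = visibility_timeout, max_retries

    def send(self, body):
        self.ready.append({"body": body, "receives": 0})

    def receive(self):
        self._requeue_expired()
        if not self.ready:
            return None
        msg = self.ready.pop(0)
        msg["receives"] += 1
        self.in_flight[id(msg)] = (msg, time.monotonic())
        return id(msg), msg["body"]

    def delete(self, handle):
        # Consumer acks success by deleting; crashed consumers never do.
        self.in_flight.pop(handle, None)

    def _requeue_expired(self):
        now = time.monotonic()
        for handle, (msg, t) in list(self.in_flight.items()):
            if now - t >= self.timeout:
                del self.in_flight[handle]
                if msg["receives"] > self.max_retries:
                    self.dead.append(msg["body"])  # dead-letter queue
                else:
                    self.ready.append(msg)

q = VisibilityQueue()
q.send("charge-card")
# Simulate a consumer that crashes repeatedly: receives but never deletes.
for _ in range(3):
    q.receive()
    time.sleep(0.06)  # let the visibility timeout lapse
q.receive()  # triggers the final requeue check; message is dead-lettered
```

The consumer never has to "fix" stuck messages; the timeout and retry counter
do it.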

 _All of these solutions lost events in production under various conditions._

Don’t blame a poor implementation on the technology. I preach to people all
the time: unless you are working at Google or even Twitter scale, you’re not a
special snowflake that needs to reinvent the wheel and re-solve solved
problems.

 _My point is: think about if you really need the complexity of managing
another service in production, when all you really need is “do this thing as
soon as you can”._

“Managing” RabbitMQ is not rocket science. But these days, I don’t deal with
managing infrastructure. That’s what cloud providers are for.

~~~
amz3
> “Managing” RabbitMQ is not rocket science.

We have many issues with RabbitMQ where most of our workload is background
tasks with dozens of queues.

> That’s what cloud providers are for.

What cloud provider queueing system do you recommend?

~~~
scarface74
We use AWS, I’m not saying it’s the “best” it’s just what I’m familiar with.
Also, you don’t have to move your infrastructure to AWS at all to use any of
these services. They all use publicly accessible https APIs managed by
Identity and Access Management and access keys.

Also since all queueing systems basically serve the same purpose, it’s easy to
layer the AWS SDK calls under your own facade classes to reduce the dependency
on AWS’s services.
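
A hypothetical sketch of that facade idea: the application codes against a
tiny interface, and the SQS-backed implementation (using real boto3 method
names like `send_message` and `receive_message`, though the class names here
are invented) can be swapped for an in-memory one in tests:

```python
from abc import ABC, abstractmethod

class MessageQueue(ABC):
    """Application-facing facade; nothing AWS-specific leaks past it."""
    @abstractmethod
    def publish(self, body: str) -> None: ...
    @abstractmethod
    def poll(self):
        """Return the next message body, or None if the queue is empty."""

class InMemoryQueue(MessageQueue):
    """Stand-in used in tests, or before committing to a provider."""
    def __init__(self):
        self._items = []
    def publish(self, body):
        self._items.append(body)
    def poll(self):
        return self._items.pop(0) if self._items else None

class SqsQueue(MessageQueue):
    """Sketch of the AWS-backed variant; takes a boto3 SQS client.
    (Simplified: a real consumer would also delete_message after success.)"""
    def __init__(self, client, queue_url):
        self._client, self._url = client, queue_url
    def publish(self, body):
        self._client.send_message(QueueUrl=self._url, MessageBody=body)
    def poll(self):
        resp = self._client.receive_message(QueueUrl=self._url,
                                            MaxNumberOfMessages=1)
        msgs = resp.get("Messages", [])
        return msgs[0]["Body"] if msgs else None

q: MessageQueue = InMemoryQueue()  # swap in SqsQueue(...) in production
q.publish("hello")
```

Since all queueing systems basically serve the same purpose, the facade stays
small and the AWS dependency stays in one file.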

All that being said:

Simple one consumer/one or multiple producers system: SQS

Multiple consumers/one or multiple producers: SNS/SQS

Kafka equivalent: AWS Kinesis or AWS MSK (Managed Kafka). I haven’t used
Kafka, but if you don’t want to use an AWS-specific service and want easy
portability, it couldn’t hurt to do a proof of concept.

With AWS SQS/SNS there are no servers to manage. You just create your queues
from the web console (not recommended) or with the CLI, CloudFormation, or
Terraform.

~~~
amz3
Thanks for the (fast) reply.

The problem I have with SQS and SNS (and Celery) is that you cannot just throw
tasks into the system and have it scale the workers up and down based on some
hints. Of course you can rely on Lambdas, but then you are locked in with
Amazon (not to mention that you cannot control how much the Lambdas will cost
you).

Also, I disagree with your point that questioning the tech/tool status quo is
NIH and hence bad. I, for instance, would like to be able to avoid vendor
lock-in. Also, reinventing the wheel lets you stay in control. Using RabbitMQ,
and to some extent Celery or Kafka, locks you in without much control, since
it's a foreign code base in an alien language.

~~~
Izkata
Celery does have autoscaling, though perhaps more limited than what you're
thinking of:
[https://celery.readthedocs.io/en/latest/userguide/workers.ht...](https://celery.readthedocs.io/en/latest/userguide/workers.html#autoscaling)

------
notyourday
So they built a database?

