
On SQS - mpweiher
https://www.tbray.org/ongoing/When/201x/2019/05/26/SQS
======
dantillberg
I've worked with SQS at volumes in thousands of messages per second with
varied (non-tiny) payload sizes.

SQS is a very simple service, which makes it fairly reliable, though part of
the reason for the reliability is that the API's guarantees are weak. And it
can be economical, but I've had to build a lot of non-trivial logic in order
to interact with SQS robustly, performantly, and efficiently, especially
around using the SendMessageBatch, ReceiveMessage, and DeleteMessageBatch
operations to reduce costs.

With the caveat that I think my use case has been quite different from what's
discussed in this article, here are some of the problems I've encountered:

\- Message sizes are limited, but in a convoluted way: SendMessageBatch has a
256KiB limit on the _request_ size, and message bodies allow only a limited
character set, so you need to base64-encode any binary data. This also means
there's no fixed per-message maximum; you can batch up to 10 messages per
SendMessageBatch, but not in excess of 256KiB for the whole request (see the
sketch at the end of this comment).

\- If you want to send more than 256KiB × 3/4 minus some padding (around
180KiB of data, after base64) in any single message, you need to put that
data somewhere else and pass a pointer to it in the actual SQS message.

\- SQS does routinely have temporary (edit: _partial_ ) failures that
generally last for a few hours at a time. ReceiveMessage may return no
messages (or fewer than the max of 10) even if the queue has millions of
messages waiting to be delivered; SQS knows it has them somewhere but can't
find them when you ask. And DeleteMessageBatch may fail for _some_ of the
messages passed while succeeding for others; it will sometimes fail
repeatedly to delete _those_ messages for an extended period.

\- The SDKs provided by AWS (for either Java or Go) don't help you handle any
of these things well; they just provide a window onto the SQS API and leave
it to the user to figure out all the details.
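
For the first point, size-aware batching ends up looking something like the
sketch below (Python/boto3; the padding allowance for per-entry request
overhead is a guess, not a vetted number):

    import boto3

    sqs = boto3.client("sqs")
    REQUEST_LIMIT = 256 * 1024  # SendMessageBatch caps the whole request
    PADDING = 4 * 1024          # rough allowance for batch-entry overhead

    def send_all(queue_url, bodies):
        batch, batch_size = [], 0
        for i, body in enumerate(bodies):
            size = len(body.encode("utf-8"))
            # Flush when adding this message would exceed 10 entries or the
            # request cap. (A single oversized body still needs the
            # pointer-to-S3 trick from the second bullet above.)
            if batch and (len(batch) == 10
                          or batch_size + size > REQUEST_LIMIT - PADDING):
                sqs.send_message_batch(QueueUrl=queue_url, Entries=batch)
                batch, batch_size = [], 0
            batch.append({"Id": str(i), "MessageBody": body})
            batch_size += size
        if batch:
            sqs.send_message_batch(QueueUrl=queue_url, Entries=batch)

A robust version would also retry any entries reported back in the
response's "Failed" list, which is exactly the kind of non-trivial logic the
SDKs leave to you.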

~~~
timbray
Um, I’ve been at AWS since late 2014, and AFAIK the only extended SQS hiccup
correlated with the DynamoDB issue in 2015. SQS isn’t perfect but I’m pretty
sure “does routinely have temporary failures that generally last for a few
hours at a time” is just wrong.

~~~
noelherrick
I believe GP was talking about particular messages failing, not a total system
outage. In my use of AWS, the status page almost never reports an outage even
when an AWS service is down for me; the most I've ever seen is some hand-wavey
message about elevated error rates. So you could be right that SQS hasn't
failed entirely, but that probably means there's a good number of failed
requests below the threshold where AWS would consider it down.

~~~
dantillberg
Yes, this is correct, thank you. I updated my comment to indicate that I meant
partial failure, though the failure conditions persist for anywhere from 20
minutes to a few hours. Those partial failures have happened about once every
two months in my experience.

Technically, it's not even really a failure of SQS because the guarantees SQS
makes are so weak that those partial failures are really "operating normally."

------
staticassertion
I think this is a decent response - they really nail what @rbranson misses:
the failures he mentions are actually features we're after.

An example:

> Convert something to an async operation and your system will always return a
> success response. But there's no guarantee that the request will actually
> ever be processed successfully.

Great! I don't want service A to be coupled to service B's ability to work. I
want A to send off a message and leave it to B to succeed or fail. This
separation of state (service A and B can't even talk to each other directly)
is part of what makes queues so powerful - it's also the foundation of the
actor model, which is known for its powerful resiliency and scalability
properties.

The author's suggestion of using synchronous communication with backpressure
and sync failures is my last-ditch approach. I have to set up circuit breakers
just to keep something like that from turning into a total disaster, with full
system failure due to a single service outage.

Like the author, I find the share of "good use cases for queues" to be very
nearly 100%. I believe you should reach for queues first, and it's worth
remodeling a system to be queue-based if you can.

Sometimes modeling as synchronous control is easiest, but I'm happy that I can
avoid that in almost every case.

~~~
dmix
> Convert something to an async operation and your system will always return a
> success response.

It's funny reading this after using Erlang/Elixir over the last few years. The
default is always async with the assumption it will fail - async processes
failing is a core part of the OTP application architecture.

It's not something to be feared but a key part of how your application's data
flow works.

~~~
haolez
I've been planning on giving Erlang/Elixir a try, but we are very reliant on
serverless and managed cloud services (like SQS), and I get the impression
that managing a cluster of worker nodes for Erlang/Elixir would be too much
work for us, since we would have to manage the servers, apply security
patches, plan scaling, etc.

Maybe I'm wrong and it's not so much work in the end. Hoping for some
feedback.

~~~
awinder
I’d love to know how this isn’t true as well, but I was in an environment
where cross-AZ network costs were something we were continuously mitigating.
Using stuff like SQS let us build cross-AZ availability with zero metered
network cost, and serverless comes into play because its network connections
usually go through zero-cost AWS services as well. It seems to me that, on a
cost basis, getting into something like clustered Erlang would kill you in
many of these cloud environments (or at least you’d be on the hook for
engineering workarounds to keep traffic within an AZ, with failover to other
AZs).

~~~
dmix
That’s a good question. I’m sure there are people who could answer that.
WhatsApp famously scaled an erlang app to a billion users around the world.
Quite a few people have done it on a large scale. RabbitMQ is also built in
Erlang and used in large deployments.

The zero cost stuff is always going to be a big draw with cloud deployments
and the various demands from the company. Although a lot of this stuff like
messaging and clustering/failover is within the application/Beam VM itself
rather than something scaled or managed externally to the software. But that
level of server and infrastructure stuff is out of my league of understanding.

------
andrewstuart
I use Postgres SKIP LOCKED as a queue.

I used to use SQS but Postgres gives me everything I want. I can also do
priority queueing and sorting.

I gave up on SQS when it couldn't be accessed from a VPC. AWS might have fixed
that now.

All the other queueing mechanisms I investigated were dramatically more
complex and heavyweight than Postgres SKIP LOCKED.

~~~
alexandercrohde
I _LOVE_ this idea. I usually hear other Sr. engineers denigrate it as
"hacky," but I think they aren't really looking at the big picture.

1\. By combining services, there's one less service to manage in your stack
(e.g. do your demo/local/QA envs all connect to SQS?)

2\. Postgres preserves your data if it goes down

3\. You already have the tools on each machine, and everybody knows the query
language needed to examine the queue

4\. All your existing DB tools (e.g. backup solutions) automatically now cover
your queue too, for free.

5\. Performance is a non-issue for any company doing < 10M queue items a day.

~~~
andrewstuart
I don't think it's hacky - it's using documented Postgres functionality in the
way it's intended. Engineers tend to react that way to anything unfamiliar,
until they decide it's a good idea; then they evangelise it.

What does "hacky" even mean? If it means using side effects for a primary
purpose, then no, SKIP LOCKED is not a side effect.

I researched a lot of alternative queues to SQS and tried several of them, but
all of them were complex, heavyweight, with questionable library support, and
more trouble than they were worth.

The good thing about using a database as a queue is that you can easily
customise queue behaviour to implement things like ordering and priority and
whatever other stuff you want, and it's all as easy as SQL.

As you say, using Postgres as a queue cuts out a lot of the complexity
associated with a standalone queueing system.

I think MySQL/Oracle/SQL Server might also support SKIP LOCKED.
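
A minimal sketch of the pattern (Python with psycopg2; the table and column
names are illustrative):

    import psycopg2

    conn = psycopg2.connect("dbname=app")

    def claim_job():
        # Atomically claim the next job. SKIP LOCKED means concurrent
        # workers never block on, or double-claim, the same row.
        with conn, conn.cursor() as cur:
            cur.execute("""
                DELETE FROM jobs
                WHERE id = (
                    SELECT id FROM jobs
                    ORDER BY priority DESC, id
                    FOR UPDATE SKIP LOCKED
                    LIMIT 1
                )
                RETURNING id, payload
            """)
            return cur.fetchone()  # None when the queue is empty

Ordering, priorities, and retry columns are all just ordinary SQL from here.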

~~~
kwindla
Thanks for posting that code!

Definitely similar experience here. We handle ~10 million messages a day in a
pubsub system quite similar in spirit to the above, running on AWS Aurora
MySQL.

Our system isn't a queue. We track a little bit of short-lived state for
groups of clients, and do low-latency, in-order message delivery between
clients. But a lot of the architecture concerns are the same as with your
queue implementation.

We switched over to our own pubsub code, implemented the simplest way we could
think of, on top of vanilla SQL, after running for several months on a well-
regarded SaaS NoSQL provider. After it became clear that both reliability and
scaling were issues, we built several prototypes on top of other
infrastructure offerings that looked promising.

We didn't _want_ to run any infrastructure ourselves, and didn't want to write
this "low-level" message delivery code. But, in the end, we felt that we could
achieve better system observability, benchmarking, and modeling, with much
less work, using SQL to solve our problems.

For us, the arguments are pretty much Dan McKinley's from the Choose Boring
Technology paper.[0]

It's definitely been the right decision. We've had very few issues with this
part of our codebase. Far, far fewer than we had before, when we were trying
to trace down failures in code we didn't write ourselves on hardware that we
had no visibility into at all. This has turned out to be a counter-data point
to my learned aversion to writing any code if somebody else has already
written and debugged code that I can use.

One caveat is that I've built three or four pubsub-ish systems over the course
of my career, and built lots and lots of stuff on top of SQL databases. If I
had 20 years of experience using specific NoSQL systems to solve similar
problems, those would probably qualify as "boring" technology, to me, and SQL
would probably seem exotic and full of weird corner cases. :-)

[0] - [https://mcfunley.com/choose-boring-technology](https://mcfunley.com/choose-boring-technology)

------
mabbo
A long time ago, as a new-ish developer, I was building a system that needed
to take inputs, then run "pass/fail/wait and try again later" until timeout or
completion. This wasn't mission-critical stuff, mind you, so a lost message
would annoy someone but not cause any actual harm.

As I was figuring out how to set up a datastore, query it for running
workflows, and all that jazz, I happened upon an interesting SQS feature: Post
with Delay.

And so, the system has no database. Instead, when new work arrives, it posts
the details of the work to SQS. All hosts in the fleet are polling SQS for
messages. When one receives a message, it does the checks, and if the process
isn't complete it reposts the message with a 5-minute delay. In 5 minutes,
another host in the fleet will receive the message and try again. The process
continues as long as it needs to.

Looking back, part of me now is horrified at this design. But: that system now
has thousands of users and continues to scale really well. Data loss is very
rare. Costs are low. No datastore to manage. SQS is just really darned neat
because it can do things like that.
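
In boto3 terms the loop looks roughly like the following (a sketch;
check_complete and the attempt counter are application-specific
placeholders):

    import boto3, json

    sqs = boto3.client("sqs")
    MAX_ATTEMPTS = 100  # effectively the timeout, in 5-minute steps

    def poll(queue_url):
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            work = json.loads(msg["Body"])
            if not check_complete(work) and work["attempts"] < MAX_ATTEMPTS:
                work["attempts"] += 1
                # Re-post the same work item, invisible for 5 minutes.
                sqs.send_message(QueueUrl=queue_url,
                                 MessageBody=json.dumps(work),
                                 DelaySeconds=300)
            # Delete the received copy either way; the re-post carries on.
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])

Note that the re-post and the delete aren't atomic, which is exactly the
gotcha raised in the reply below.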

~~~
Rapzid
The biggest gotcha in a design like this, IMHO, is that you can't post and
delete atomically. You may post the new work into the queue, and then a
failure to delete could occur, and the work will stack up.

Depending on the workload this could be no big deal or very expensive.
Treating a queue as a database, particularly queues that can't participate in
XA transactions, can get you in trouble quickly.

~~~
nine_k
With a realistic - that is, not 100% reliable - queue you can have either "at
most once" or "at least once" delivery anyway. "Exactly once" can't be
guaranteed.

So a duplicate message should be processed as normal anyway, e.g. by
deduplicating within a reasonable window, and/or by making operations
idempotent.
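
A dedup window is a few lines with any shared store, e.g. (sketch using
Redis; the key scheme and TTL are arbitrary):

    import redis

    r = redis.Redis()
    WINDOW = 6 * 3600  # seconds; larger than your worst redelivery lag

    def handle(message_id, body):
        # SET NX returns None when the key already exists, i.e. this
        # message was seen before within the window.
        if r.set(f"seen:{message_id}", 1, nx=True, ex=WINDOW) is None:
            return  # duplicate delivery; drop it
        process(body)  # process() should still tolerate rare re-runs

Marking before processing risks dropping a message whose first processing
crashed; marking after risks duplicates - one of the two has to be chosen.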

~~~
Rapzid
Yes, it depends on the workload. Idempotency is almost always a good idea, but
sometimes the operation itself is very expensive in terms of time, resources,
and/or money. I have also seen people try to update the message when writing
it back (with checkpoint information, etc.) for long-running processes. A slew
of issues, including at-least-once delivery, can cause workflow bifurcation.
Deduplication via FIFO queues _can_ help mitigate this, but it has a time
window that needs to be accounted for. Once you start managing your own
deduplication, I'd say you've moved past trying to go databaseless.

------
yilugurlu
Has anyone ever measured the latency of sending a message to SQS? I was using
it behind an ELB on t2.medium instances, and my API (handle => send message to
queue => return {status: true}) response times were around 150-300 ms. I
replaced SQS with RabbitMQ, and they went down to around 75-100 ms.

Does anyone else find sending messages to SQS slow?

Edit: With this change, I was able to process almost 3x the requests with the
same resources, and it lowered my bills quite a lot.

For example, my SQS bill for last month:

Amazon Simple Queue Service, EUC1-Requests-Tier1 ($0.40 per 1,000,000 Amazon
SQS requests per month thereafter): 290,659,096 requests = $116.26

That went to 0, and the EC2 cost went down as well, because the ELB spun up
fewer instances; I could handle more requests with the same resources.

This was my experience with SQS. I just wanted to share it.

~~~
Feeble
Were you running RabbitMQ clustered with persistent queues?

I don't think SQS is primarily for low-latency messaging; it's more a managed,
highly available MQ with very little hassle.

~~~
yilugurlu
I wasn't, single instance in the same subnet with persistent work queues.

------
sessy
We are heavy users of AWS, and SQS is the only service where we have had zero
downtime. The only downside for us is that you can pull only 10 messages at a
time (even with a batched receive). You can have parallel readers, but they
result in some duplicates. There is SQS FIFO, but it is throttled.

~~~
actuator
It went down on us once for an extended period in 2015. It was chaotic, as you
don't expect it to fail. If memory serves, even S3 suffered that day.

~~~
Twirrim
A whole slew of AWS services went down that day. Tim's not wrong when he
indicates that almost every service in AWS has a dependency on it (Amazon has
services split into tiers based on how much they can rely on other services
for critical components; SQS is pretty high up in the tiering).

I was on-call that day for an AWS service. There wasn't much I could do but
sit muted on the conference call and watch some TV, waiting for the outage to
be over.

------
hexene
One downside of SQS is that it doesn't support fan-out, e.g.
S3 -> SQS -> multiple consumers. The recommendation instead seems to be to
first push to SNS, and then hook up SQS or other consumers to it.
Kinesis/Kafka would appear to be better suited for this (since they support
fan-out like SNS and are pull-based like SQS), but they aren't as well
supported as SNS/SQS (you can't push S3 events directly to Kinesis, for
example). Can someone from AWS comment on why that is? Also, related: when can
we expect GA for Kafka (MSK)?

~~~
staticassertion
I do S3 -> SNS -> SQS. I don't see why I would use Kinesis instead. The SNS
bit is totally invisible to the consumers (you can even tell SNS not to wrap
the inner message with the SNS boilerplate), downstream consumers just know
they have to listen to a queue.

I don't see a downside to this approach. Perhaps some increased latency?
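
For reference, the "don't wrap the inner message" bit is one subscription
attribute (boto3 sketch; the ARNs are placeholders, and the queue also needs
a policy allowing the topic to send to it):

    import boto3

    sns = boto3.client("sns")

    sub = sns.subscribe(
        TopicArn="arn:aws:sns:us-east-1:123456789012:s3-events",
        Protocol="sqs",
        Endpoint="arn:aws:sqs:us-east-1:123456789012:my-consumer",
        ReturnSubscriptionArn=True,
    )
    # RawMessageDelivery strips the SNS envelope, so consumers see the
    # original event body as if it were sent straight to their queue.
    sns.set_subscription_attributes(
        SubscriptionArn=sub["SubscriptionArn"],
        AttributeName="RawMessageDelivery",
        AttributeValue="true",
    )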

~~~
hexene
If you wanted multiple pull-based consumers for the stream, wouldn't you need
a separate SQS queue per consumer, with each queue hooked up to SNS? Perhaps
I'm mistaken, but that seems brittle to me. With Kinesis/Kafka, you only need
to register a new appName/consumer group on the single stream for fan-out.
Plus, both are FIFO by default, at least within a partition.

~~~
staticassertion
That's exactly how you do it. To me, it's the opposite of brittle - every
consumer owns a queue, and is isolated from all other consumers. Clients are
totally unaware of other systems, and there's no shared resource under
contention.

~~~
stellar678
I feel like the create/delete queue semantics hint that a queue should be a
long-lived thing that consumers are configured to connect to. When I saw
suggestions to have one queue per consumer and have that consumer
create/delete the queue during its execution lifecycle, the idea of one-queue-
per-consumer started making more sense to me.

~~~
staticassertion
I think the word "Consumer" here is "Consumer Group".

For example, an AWS Lambda triggered from SQS will lead to thousands of
executions, each lambda pulling a new message from SQS.

But another consumer group, maybe a group of load balanced EC2 instances, will
have a separate queue.

In general, I don't know of cases where you want a single message duplicated
across a variable number of consumer groups - services are not ephemeral
things, even if their underlying processes are. You don't build a service,
deploy it, and then tear it down the next day and throw away the code.

------
etaioinshrdlu
I really wish SQS had reliably lower latency, like Redis, and also supported
priority levels. (Also like Redis these days, with sorted sets and the
[https://redis.io/commands/bzpopmax](https://redis.io/commands/bzpopmax)
command.)

Has anyone measured the performance of Redis on large sorted sets, say
millions of items? Hoping that it's still single-digit milliseconds at that
size... and can sustain, say, 1000 QPS...
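
For reference, the sorted-set pattern is tiny (redis-py sketch; the key name
is arbitrary):

    import redis

    r = redis.Redis()

    def push(job_id, priority):
        # Higher score = higher priority.
        r.zadd("jobs", {job_id: priority})

    def pop_blocking():
        # BZPOPMAX blocks until a member exists, then pops the
        # highest-scored one.
        key, member, score = r.bzpopmax("jobs", timeout=0)
        return member, score

ZADD and BZPOPMAX are O(log N), so millions of members shouldn't by
themselves push you out of single-digit milliseconds, but that's worth
benchmarking on your own hardware rather than taking on faith.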

~~~
plasma
We use Redis as a job queue and it's great; the only limitation is being
sometimes concerned about job-queue size due to the memory limits of the Redis
server itself.

~~~
espadrine
Also, when you don't want to lose a message, the Redis persistence story
requires careful thought. It requires setting up RDB + AOF +
appendfsync=always + backups.

~~~
manigandham
You don't need to fsync on every write; that's what the replica is for. On
lower/mid-tier hardware, the network is faster than storage, and your message
is on multiple machines before it's even written to disk, so an fsync interval
of 1 second is usually fine.

------
redact207
Author of [https://node-ts.github.io/bus/](https://node-ts.github.io/bus/)
here. SQS is definitely one of my favourite message queues. The ability to
have an HA managed solution without having to worry about persistence,
scaling or connections is huge.

Most of the complaints apply to message-based systems in general.
At-least-once receives, out-of-order receives - pretty standard fare that can
be handled by applying well-established patterns.

My only request would be to please increase the limits on message visibility
timeouts! Often I want to delay-send a message for receipt in 30 days. SQS
forces me to cook up some weird delete-and-resend recipe, or make this the
responsibility of a data store. It'd be really nice to do away with batch/cron
jobs and deal more with delayed queue events.

~~~
plasma
RE: visibility timeout beyond 30 days, you may be more after a “saga” that has
state and is long running (hours/days/months/years).

You can imagine building a saga system on top of a queue system.

~~~
redact207
You're absolutely right; in fact, I have a whole package that is just that:
[https://node-ts.github.io/bus/packages/bus-workflow/](https://node-ts.github.io/bus/packages/bus-workflow/).

The problem is this. Let's say I want to trigger a step in a "free trial"
saga that sends an email to the customer 10 days after they sign up, nudging
them to get a paid account. If I can delay-send that message for 10 days, then
it's easy.

However, because SQS has a much shorter visibility timeout, I have to find a
much more roundabout way of triggering that action (see the sketch below).
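
The roundabout way ends up being a re-enqueue loop, roughly like this
(sketch; DelaySeconds caps out at 15 minutes, so the message carries its own
due time, and deliver() stands in for the real saga step):

    import boto3, json, time

    sqs = boto3.client("sqs")
    MAX_DELAY = 900  # SQS DelaySeconds is capped at 15 minutes

    def schedule(queue_url, payload, due_at):
        body = json.dumps({"due_at": due_at, "payload": payload})
        delay = min(max(due_at - time.time(), 0), MAX_DELAY)
        sqs.send_message(QueueUrl=queue_url, MessageBody=body,
                         DelaySeconds=int(delay))

    def handle(queue_url, msg):
        work = json.loads(msg["Body"])
        remaining = work["due_at"] - time.time()
        if remaining > 0:
            # Not due yet: bounce it back with another (capped) delay.
            sqs.send_message(QueueUrl=queue_url, MessageBody=msg["Body"],
                             DelaySeconds=int(min(remaining, MAX_DELAY)))
        else:
            deliver(work["payload"])
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])

For a 10-day delay that's nearly a thousand hops per message, which is
exactly why this usually ends up being a data store's responsibility instead.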

~~~
plasma
Yeah, that makes total sense. For some of our sagas (we don't use SQS -- we
use a custom Redis queue), we have the saga potentially wake up and
immediately sleep again ("nothing to do right now, defer again in a few
days").

But yes, a quirk.

------
Jemaclus
We love SQS, but one of the problems we've been running into lately is the
256 KiB per-message limit. We do tens of millions of messages per day, with a
small percentage of them reaching that limit, and we're approaching the point
where most of our messages will hit it.

What are our options for keeping SQS but somehow sending large payloads? The
only thing I can think of is throwing them into another datastore and using
the SQS message just as a pointer to, say, a key in a Redis instance.

(Kafka is probably off the table for this, but I could be convinced. I'd like
to hear other solutions first, though.)

~~~
somedev55
We point to an S3 object for any large payloads.
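
Roughly like this claim-check sketch (the bucket, threshold, and key scheme
are invented; AWS also ships an "extended client" library for Java that
implements the same pattern):

    import boto3, json, uuid

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    THRESHOLD = 200 * 1024  # stay safely under the 256 KiB limit

    def send(queue_url, payload):
        if len(payload.encode()) > THRESHOLD:
            # Too big for SQS: park the payload in S3 and send a pointer.
            key = f"overflow/{uuid.uuid4()}"
            s3.put_object(Bucket="my-queue-overflow", Key=key, Body=payload)
            body = json.dumps({"s3_key": key})
        else:
            body = json.dumps({"inline": payload})
        sqs.send_message(QueueUrl=queue_url, MessageBody=body)

    def load(body):
        msg = json.loads(body)
        if "s3_key" in msg:
            obj = s3.get_object(Bucket="my-queue-overflow",
                                Key=msg["s3_key"])
            return obj["Body"].read().decode()
        return msg["inline"]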

~~~
Jemaclus
Wouldn't sending and later retrieving millions of S3 objects be expensive?

------
polskibus
Does anyone know a good, low-overhead, out-of-process message queue that's
lightweight enough to be useful for communicating between processes on the
same machine, but can scale beyond it if necessary? In the case of a
single-machine product that comprises several services, a message queue can
sometimes be useful for a pull model, but adding RabbitMQ to the stack makes
installation and ops much more complex than customers deem acceptable.

I know some people use Akka with the Persistence module, but I would welcome
other alternatives.

~~~
gizzlon
Guess it depends on the definition of "queue". Potentials:

      - https://nsq.io/
      - https://nats.io/

~~~
polskibus
Seems like NATS Streaming would fit my case - have you heard of any real-world
deployments that use it? Are there any larger issues that would make it a poor
choice?

~~~
manigandham
NATS Streaming is not as well tested and has some design issues that make
scaling hard. NATS itself has a new version 2 with a protocol update, and NATS
Streaming should follow with a new design as well, but I would recommend other
options if you want persistence.

~~~
polskibus
What are the design flaws you have in mind? Is it OK for a couple of nodes, or
would it struggle to keep up with even a medium load? Or do the design flaws
have to do with providing durability and other guarantees?

What other options would you recommend that provide at-least-once delivery and
are lightweight enough not to require ZooKeeper etc.?

~~~
manigandham
NATS Streaming isn't just a persistence layer for NATS. It's an entirely
different system that basically acts as a client to NATS and records the
messages it sees. Think of how you would design a persistent queue on top of
the ephemeral NATS pub/sub, and that's what NATS Streaming is.

Here's a good post (and series) about distributed logs and NATS design issues:
[https://bravenewgeek.com/building-a-distributed-log-from-scratch-part-5-sketching-a-new-system/](https://bravenewgeek.com/building-a-distributed-log-from-scratch-part-5-sketching-a-new-system/)

------
reallydude
Nowhere is cost mentioned. Using S3 as an ad-hoc queue is a cheaper solution,
which should raise some red flags. You can do what SQS does so much more
cheaply (including horizontal scaling and failure planning) that I'm
consistently surprised anyone uses it. Either you are running at a volume
where you need high throughput and it's pricey, or you're at such a low
throughput that you could use any MQ (even Redis).

> Oh but what about ORDERED queues? The only way to get ordered application of
> writes is to perform them one after the other.

This is another WTF. Talking about ordered queues is like talking about
databases, because it's data that's structured. If you can feed data from
concurrent sources of unordered data into a system where access can be
ordered, you have access to sorted data. You deal with out-of-order data
either at insertion, or in a window during processing, or in the consumers.
"Write in order" is not a requirement, but an option. Talking about technical
subjects on Twitter always results in some mind-numbingly idiotic statements
for the sake of 140 characters.

~~~
archgoon
> Using S3 as an ad-hoc queue is a cheaper solution, which should throw some
> red flags.

Interesting. Can you expand on this? How do you ensure that only one worker
takes a message from s3? Or do you only use this setup when you have only one
worker?

~~~
reallydude
You encode messages with timestamp and origin (e.g. 1558945545-1), and you
write directly to S3 into a create-if-not-exists folder for a specific window
(let's say a minute). With every agent writing, you end up with a new folder
each minute. You then have, per window, an ordered set of messages, sorted
optimally by the key encoding (see the sketch below).
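
The write side, as a sketch (boto3; the bucket name is invented, and a real
version would add a per-agent sequence number to break ties within a second):

    import boto3, time

    s3 = boto3.client("s3")
    ORIGIN = "1"  # per-agent identifier

    def enqueue(payload):
        ts = int(time.time())
        window = ts - (ts % 60)  # one key prefix ("folder") per minute
        # Keys sort lexicographically, so listing the prefix yields the
        # window's messages in timestamp order.
        s3.put_object(Bucket="adhoc-queue",
                      Key=f"{window}/{ts}-{ORIGIN}",
                      Body=payload)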

~~~
HatchedLake721
You reminded me of a comment on the Dropbox announcement in 2007, saying you
could do it “yourself quite trivially by getting an FTP account, mounting it
locally with curlftpfs, and then using SVN or CVS on the mounted filesystem”.

Just because you can, doesn’t mean you should.

~~~
reallydude
Cost is the motivating factor here.

~~~
earenndil
Dropbox is more expensive than an FTP server, so the two scenarios are
comparable.

------
curryst
From the operations side, I feel like SQS is a very sharp pair of scissors. In
the hands of a good tailor, they'll make amazing things. In most people's
hands, they cut things they probably shouldn't, or just flat-out sprint around
with them on a wet pool deck with their shoes untied.

A non-comprehensive list of ways I've seen my developers shoot themselves in
the foot:

* Giant try-catch block around the message-handling code to requeue messages that threw an exception. They neglected to add any accounting, so some messages would just never process. No one noticed until, during debugging, they saw that the queue size never dropped below a certain amount.

* Queue behavior is highly dependent on configuration, and bad queue configurations result in dropped messages. Queueing systems provide few features to detect and alert on these failures (it's really not their job), but building a system to track the integrity of the business process across queues is deemed too onerous.
* The built-in observability is generally not enough to be complete. I haven't seen a lot of great instrumentation libraries for SQS like there are for HTTP, meaning that observability is pushed onto the developer. They typically ignore that requirement, because PMs rarely care until they realize we're unable to respond to incidents effectively.

* Most people vastly overestimate their scale. The number of applications I've seen built on SQS "because scale" that end up taking less than 100 QPS globally is significant. Anecdotally, I would say the majority of queue-based apps I have seen could have solved their scaling issues with HTTP.

* Many people want to treat queued messages like time-delayed HTTP requests. They are not; the semantics and design are totally different. I have seen people marshal requests to Protobuf, use that as the body of a message, have another service read and process the request, and write another message to a queue that the first app reads back. It's basically gRPC over queues, except that it solves none of the problems gRPC does and creates a lot of new ones. Just one example: how do you canary when you can't guarantee that the version of the app that sent the request will get the response?

I think SQS is an amazing tool in the hands of people that know when to use
it, and how to use it. But my experience has been that most people don't, and
the ecosystem to make it available to people who aren't experts just doesn't
exist yet.

~~~
cle
I agree with most of this. If you have non-trivial message-handling logic in a
production system, you probably shouldn't use SQS directly to drive your work;
your SQS handling logic should be simple and reliable. In most cases, if the
handling logic is complex, long-running, or needs operational visibility
(logging, monitoring, etc.), I'd write the message handler itself to just kick
off a workflow via Step Functions or some other workflow system. You'll pay
for that in initial development costs, because it _is_ more complicated (you
need to write Lambda handlers, wire them up with CloudFormation, etc.), but
the tradeoff is that it gives you a central place to look at each unit of
work, instead of having your artifacts scattered around various logs (if they
exist at all).
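
The handler then shrinks to a few reliable lines, something like this sketch
(the state machine ARN is a placeholder, and the message body is assumed to
be JSON):

    import boto3, hashlib

    sfn = boto3.client("stepfunctions")

    def handle(msg):
        # Derive the execution name from the message: Step Functions
        # requires unique execution names for 90 days, so a redelivered
        # message won't silently start a second workflow.
        name = hashlib.sha256(msg["Body"].encode()).hexdigest()[:64]
        sfn.start_execution(
            stateMachineArn="arn:aws:states:us-east-1:123456789012"
                            ":stateMachine:work",
            name=name,
            input=msg["Body"],
        )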

The takeaway for me is: distributed systems are hard. If you have distributed
workers, you have entered into a vastly more complex realm. SQS gives you
_some_ tools to work successfully in that environment, but it doesn't (and
can't) get rid of that complexity. Most of the problems I've seen relate to
engineers not understanding the fundamental complexity of coordinating
distributed work. Your choice of tech stack for your queues isn't going to
make a big difference if you don't understand what you're fundamentally
dealing with.

------
varelaz
One important drawback of SQS is that it's eventually consistent: you can
receive the same message twice on different workers. Nevertheless, we keep
using it, with additional checks where it's critical; it's still the cheapest
solution in maintenance terms.

~~~
lclarkmichalek
Making the processing of a message idempotent is the ideal way to handle that
limitation.

~~~
cies
That's not always possible, and thus a big no-no for SQS if that's what you
need. Conclusion: SQS cannot replace RabbitMQ in all use cases.

~~~
tybit
If your message consumer isn’t idempotent then no MQ can help you. Exactly
once delivery is impossible other than with at least once delivery and an
idempotent consumer.

[https://bravenewgeek.com/tag/amazon-sqs/](https://bravenewgeek.com/tag/amazon-sqs/)

~~~
nkozyra
> Exactly once delivery is impossible other than with at least once delivery

Can you explain this? Don't many applications deliver once and only once via
locking? It's obviously easier as an application developer to say "I will only
get this once" and accept losing messages than to deal with idempotence,
particularly in distributed services.

~~~
larzang
Locking is no longer 100% reliable as soon as you have horizontal distribution
of the same data over multiple nodes (for redundancy, so you can guarantee
delivery) instead of sourcing from, e.g., a monolithic RDBMS. Eventual
consistency is the model for a whole lot of distributed systems, e.g. S3 or
Mongo. The CAP theorem applies to more than just databases, so MQs tend to use
eventual consistency as well, which looks like at-least-once rather than
guaranteed exactly-once delivery.

While dealing with at-most-once delivery is easier in isolation as an
application developer on the consumer side, dealing with lost messages is in
practice MUCH MUCH harder on the producer side than idempotent handling where
required on the consumer side. You end up building elaborate mechanisms for
locking and retries and receipt validation which can all fail.

Just think about email: there are a lot of situations where you can't be 100%
certain whether the other side received the message or not. For something
low-priority it may be better to ignore partial failures, but if it's
important it may be better to send a second message to guarantee delivery. If
you're modeled on never sending the same message twice AND the messages
matter, you're in trouble.

------
pojzon
After reading some comments in this thread, I'm concerned about how many
misconceptions people have about AWS services. Most of the stuff in the
"correcting comments" is plainly available to anyone.

------
plasma
Anyone run a multi-tenant SaaS and handle job scheduling “fairly”?

Occasionally we used to have all workers tied up on a single customer’s
long-running tasks. We mitigated it with a throttler we wrote that can defer a
job if too many resources are in use by one customer, but it’s not ideal.

I’d love a priority-based, customer-throttled (e.g. max concurrent tasks)
queue.

We can prioritize low/medium/high using separate queues, and could make a set
of queues per customer, but that is starting to explode how many queues we
have and feels unmanageable.

~~~
rtpg
Using a database lets you make much better decisions in this space (see the
sketch below).

Tbh, purpose-built queues are adopted way too eagerly by programmers who later
end up needing the flexibility offered by a more general data store.
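
E.g. plasma's per-tenant concurrency cap upthread falls out of one query
(Postgres sketch; the schema is invented):

    import psycopg2

    conn = psycopg2.connect("dbname=app")

    def claim_fair_job():
        # Claim the oldest, highest-priority queued job from any tenant
        # still under its concurrency cap, without blocking other workers.
        with conn, conn.cursor() as cur:
            cur.execute("""
                UPDATE jobs SET state = 'running'
                WHERE id = (
                    SELECT j.id FROM jobs j
                    JOIN tenants t ON t.id = j.tenant_id
                    WHERE j.state = 'queued'
                      AND (SELECT count(*) FROM jobs r
                           WHERE r.tenant_id = j.tenant_id
                             AND r.state = 'running') < t.max_concurrent
                    ORDER BY j.priority DESC, j.enqueued_at
                    FOR UPDATE OF j SKIP LOCKED
                    LIMIT 1
                )
                RETURNING id, tenant_id, payload
            """)
            return cur.fetchone()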

~~~
cookiecaper
Yes, I've seen it in all kinds of teams. Anything that allows a developer to
retrieve some data after a local server restart essentially gets treated as a
system of record, regardless of the intricacies or guarantees involved.

My personal experience is that abuse of queuing/messaging systems along this
axis is rampant. Engineering leaders must keep a close eye on how these types
of mechanisms are utilized to ensure things don't go off the rails.

I've seen far too many serious data loss events that boil down to "we lost our
AMQP queue". It's critical that developers understand the limitations of the
systems that run their code rather than just jumping aboard that "SQL is for
old people" hype train.

------
cyberferret
We've used SQS with great results (and reliability) for many years now, but I
am interested to hear the author talk about 'replaying queues' to replicate
faults. I never realised you could do this with SQS. Or can you? I thought
once a queue item was processed and deleted, it was gone forever - but
perhaps you can see historical queue data somewhere? (Without having to store
it yourself.)

~~~
hexene
Possibly this:
[https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html)

~~~
cyberferret
We actually already use dead-letter queues in our service, but those only
capture _failed_ SQS deliveries, which get moved to the DLQ.

I am more interested in diagnosing _successful_ SQS deliveries after the fact,
to see what the payloads were in case there was a downstream problem.

It seems that SQS deliveries that don't get a 200 response from our service go
to the DLQ, but those that get a successful 200 disappear into the ether.

------
kayoone
> Those messages will age out and vanish after a little while (14 days is
> currently the max); but before they go, they’re stored carefully and are
> very unlikely to go missing

can somebody expand on this? I know about the 14 days limitation but this
makes it sound like you can store messages for a long time and still recover
them somehow?

~~~
actuator
I think what he meant was: once you successfully put a message on SQS, it is
kept until the 'message retention period' expires, unless you explicitly
delete it. Right now you can configure the period to be up to 14 days. AFAIK,
there is no way to recover messages past their retention period.

~~~
kayoone
You are right, thank you! I misinterpreted that.

------
Kiro
> So what to do? In almost all cases it's better to propagate failure and
> backpressure. This gives a chance for the upstream to decide if it wants to
> retry or fail itself. It keeps things simple and easy-to-understand. It
> doesn't build up huge backlogs that become disasters.

What does this actually mean in practical terms?

~~~
owenmarshall
[https://ferd.ca/queues-don-t-fix-overload.html](https://ferd.ca/queues-don-t-fix-overload.html)
explains it better than pretty much anything I’ve read on the topic.

------
KirinDave
I see lots of engineers who argue that queues are bad and cause more problems
than they solve. "Queues don't have backpressure" is a common complaint. It
wouldn't be difficult to add to a queue interface, but people don't do it
despite the relative ease of the task: it's pretty easy to get a cardinality
estimate on the size of any given queue, even as it's rapidly growing and
shrinking.

The reason why is that it's probably a bad idea in the first place to mix your
control and data interfaces into an inextricable knot. Since you're not going
to be able to institute rate limiting at sub-millisecond latency in a
distributed system anyway, why isn't your control plane separate and
instrumented?

Once you have a separate control plane, you can introduce backpressure in many
different ways, and do so with a better understanding of how those throttling
values disseminate throughout your system.

So what I see are engineers ignoring an architectural decision with at least
as many implications as "queue vs. diffuse interface," namely "separate
control and data planes, or one monolithic system."

------
edoo
I recreated an SQS-style service on AWS for the sole purpose of avoiding
lock-in to Amazon. It used their autoscaling system and was about the dumbest,
simplest API you can imagine: just depositing messages into a geographically
replicated DB table for later processing. We also used it as a sort of backup
system, where the messages were never truly deleted and everything could be
reprocessed as needed, since the front-end DB that held processed records was
an absolute amateur disaster just waiting for a malicious SQL-injection wipe.
It was solid as a rock and extremely low maintenance. I think things have
become relatively standardized since then (i.e. lots of SQS-compatible
APIs/services), so unless there was a serious requirement to stay off SQS, I'm
not sure I'd do it again.

------
gregw2
With 4-5 concurrent processes reading from my (FIFO?) SQS queue, I have a hard
time seeing the contents of the queue from the console. SQS really doesn't
handle concurrent readers well, in my experience.

------
oceanbreeze83
Is SNS + SQS a reasonable solution for realtime, IRC-style topic chatrooms?

~~~
cyberferret
There is also a size limit on the messages sent via SQS - 256Kb IIRC. Might be
OK for small text based messaging, but if you want to talk attachments and
other payloads then you have to hook it up to other services like S3 etc. and
then it becomes complicated.

Also, as another respondent replied - there is no real deliverability
guarantee, although there are certain ways to handle that within SQS.

------
danielmg
I clicked the link. Laptop set to loud by mistake. The cat asleep on the
keyboard nearly died.

------
ejcx
Way to go, Rick! You know you've made it when people write blog posts about
your hot takes on Twitter.

