Hacker News new | past | comments | ask | show | jobs | submit login
On SQS (tbray.org)
343 points by mpweiher on May 27, 2019 | hide | past | favorite | 225 comments

I've worked with SQS at volumes in thousands of messages per second with varied (non-tiny) payload sizes.

SQS is a very simple service, which makes it fairly reliable, though part of the reason for the reliability is that the API's guarantees are weak. And it can be economical, but I've had to build a lot of non-trivial logic in order to interact with SQS robustly, performantly, and efficiently, especially around using the {Send,Receive,Delete}MessageBatch operations to reduce costs.

With the caveat that I think my use case has been quite different from what's discussed in this article, here are some of the problems I've encountered:

- Message sizes are limited, but in a convoluted way: SendMessageBatch has a 256KiB limit on the request size. Message values have a limited character set allowed, so you need to base64-encode any binary data. This also means that there's not exactly a max message size; you can batch up to 10 messages per SendMessageBatch but not in excess of 256KiB for the whole request.

- If you want to send more than 256KiBx3/4-(some padding) or around 180KiB of data for any single message, you need to put that data somewhere else and pass a pointer to it in the actual SQS message.

- SQS does routinely have temporary (edit: partial) failures that generally last for a few hours at a time. ReceiveMessageBatch may return no messages (or less than the max of 10) even if the queue has millions of messages waiting to be delivered; SQS knows it has them somewhere but it can't find them when you ask. And DeleteMessageBatch may fail for some of the messages passed while succeeding for others; it will sometimes fail repeatedly to delete those messages for an extended period.

- The SDKs provided by AWS (for either Java or Go) don't help you handle any of these things well; they just provide a window into the SQS API, and leave it to the user to figure all the details out.

Um, I’ve been at AWS since late 2014, and AFAIK the only extended SQS hiccup correlated with the DynamoDB issue in 2016. SQS isn’t perfect but I’m pretty sure “does routinely have temporary failures that generally last for a few ours at a time” is just wrong.

I believe GP was talking about particular messages failing, not a total system outage. In my use of AWS, the status page almost never reports an outage even though that AWS service is down for me-as in the most I've ever seen is some hand wavey message that there's elevated error rates. So you could be right, SQS hasn't failed entirely, but that probably means there's a good number of failed requests that are below the margin where AWS would consider it down.

Yes, this is correct, thank you. I updated my comment to indicate that I meant partial failure, though the failure conditions persist from 20 minutes to a few hours. Those partial failures have happened once every two months or so in my experience.

Technically, it's not even really a failure of SQS because the guarantees SQS makes are so weak that those partial failures are really "operating normally."

I've seen this behavior occasionally, too.

Producer system puts 500 messages on the queue. Consumer system can't see anything for 90 minutes. Then mysteriously the messages show up.

The status page stays green, not even a note about elevated error rates.

for your last point, that the api is simple, explains all the errors, but forces you to handle them; that’s the aws philosophy. they give you full insight into how each of their products fail and it’s up to the integrator to figure out how to use them properly in a partial outage. it’s difficult to do this because it implies you’re code has to be multi region ready. contrast this with google that mostly designs the api as one that will try to mask as much of this as possible. they are trying to offer global solutions that solve it for you

>you need to put that data somewhere else and pass a pointer to it in the actual SQS message

I was under the impression that's the industry standard - you drop the payload in some redis-like storage and pass keys in messages.

On AWS you would typically use S3 but yes this is definitely the accepted standard: nobody puts big blobs in message queues because they solve a different problem and are (typically) not designed to handle that.

You will definitely not use S3 in a pipeline if latency is an issue.

"In cases where latency is of primary concern, don't use SQS"

I think this is a decent response - they really nail what @rbranson misses, that the failures he mentions are actually features we're after.

An example,

> Convert something to an async operation and your system will always return a success response. But there's no guarantee that the request will actually ever be processed successfully.

Great! I don't want service A to be coupled to service B's ability to work. I want A to send off a message and leave it to B to succeed or fail. This separation of state (service A and B can't even talk to each other directly) is part of what makes queues so powerful - it's also the foundation of the actor model, which is known for its powerful resiliency and scalability properties.

The author's suggestion of using synchronous communication with backpressure and sync failures is my last ditch approach. I have to set up circuit breakers just to make something like this anything less than a total disaster with full system failure due to a single service outage.

Like the author, the "good use cases for queues" is very nearly 100% for me. I believe you should reach for queues first, and it's worth remodeling a system to be queue based if you can help it.

Sometimes modeling as synchronous control is easiest, but I'm happy that I can avoid that in almost every case.

> Convert something to an async operation and your system will always return a success response.

It's funny reading this after using Erlang/Elixir over the last few years. The default is always async with the assumption it will fail - as async processes failing is a core part of the OTP application architecture.

It's not something to be feared but a key part of how your application data-flow works.

I've been planning on giving Erlang/Elixir a try, but we are very reliant on serverless and managed cloud services (like SQS) and I get the impression that managing a cluster of worker nodes for Erlang/Elixir would be too much work for us, since we would have to manage the servers, security patches, plan its scaling, etc.

Maybe I'm wrong and it's not so much work in the end. Hoping for some feedback.

The learning curve I've experienced with Elixir, after working previously with managed services, in handling the above-mentioned tasks while managing state in the BEAM cluster. Patches are scaling are straightforward if you can restart instances and assume they can pick up what was interrupted before, but hot-reloading or managing state between nodes in a rolling update with give you overhead as you get set up.

What it works really great for if you don't want to do the up front investment in managing a stateful cluster, is doing multi-step or fan-out processing. BEAM/OTP really shines when it's helpful to have individual processing steps coordinated but isolated, but where if a job needs to cancel and rerun (interrupted by a node restart or OOM), it's not an issue.

This is great resource https://www.erlang-in-anger.com/

I’d love to know how this isn’t true as well, but I was in an environment where cross-az network costs were something we were continuously mitigating against. Using stuff like sqs let us build cross-az availability with 0-metered network costs, serverless can come into play because it’s network connections usually come through 0-cost aws services as well. It seems to me like from a cost basis, getting into something with clustered erlang would kill you in many of these cloud environments (or at least you would be on the hook for engineering workarounds to keep traffic within an az w/ failover to other azs)

That’s a good question. I’m sure there are people who could answer that. WhatsApp famously scaled an erlang app to a billion users around the world. Quite a few people have done it on a large scale. RabbitMQ is also built in Erlang and used in large deployments.

The zero cost stuff is always going to be a big draw with cloud deployments and the various demands from the company. Although a lot of this stuff like messaging and clustering/failover is within the application/Beam VM itself rather than something scaled or managed externally to the software. But that level of server and infrastructure stuff is out of my league of understanding.

My opinion is that the actor model, today, is best expressed through queue based microservices, especially on something like AWS lambda. If you're already follow microservice best practices, using async communication, there's a good chance that you're already benefiting from an actor oriented system.

> The author's suggestion of using synchronous communication with backpressure and sync failures is my last ditch approach

Also back pressure isn’t difficult to implement. Simply read the estimated size of the queue every N minutes and pass sending until it goes down to a more manageable level. Obvious downside is that it’s client side.

Or subscribe to an SQS queue behind an SNS topic that receives events from CloudWatch when it detects your queue is full (or empty).

I use Postgres SKIP LOCKED as a queue.

I used to use SQS but Postgres gives me everything I want. I can also do priority queueing and sorting.

I gave up on SQS when it couldn't be accessed from a VPC. AWS might have fixed that now.

All the other queueing mechanisms I investigated were dramatically more complex and heavyweight than Postgres SKIP LOCKED.

Here is a complete implementation:

    import psycopg2
    import psycopg2.extras
    import random
    db_params = {
        'database': 'jobs',
        'user': 'jobsuser',
        'password': 'superSecret',
        'host': '',
        'port': '5432',
    conn = psycopg2.connect(**db_params)
    cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    def do_some_work(job_data):
        if random.choice([True, False]):
            print('do_some_work FAILED')
            raise Exception
            print('do_some_work SUCCESS')
    def process_job():
        sql = """DELETE FROM message_queue 
    WHERE id = (
      SELECT id
      FROM message_queue
      WHERE status = 'new'
      ORDER BY created ASC 
      LIMIT 1
        queue_item = cur.fetchone()
        print('message_queue says to process job id: ', queue_item['target_id'])
        sql = """SELECT * FROM jobs WHERE id =%s AND status='new_waiting' AND attempts <= 3 FOR UPDATE;"""
        cur.execute(sql, (queue_item['target_id'],))
        job_data = cur.fetchone()
        if job_data:
                sql = """UPDATE jobs SET status = 'complete' WHERE id =%s;"""
                cur.execute(sql, (queue_item['target_id'],))
            except Exception as e:
                sql = """UPDATE jobs SET status = 'failed', attempts = attempts + 1 WHERE id =%s;"""
                # if we want the job to run again, insert a new item to the message queue with this job id
                cur.execute(sql, (queue_item['target_id'],))
            print('no job found, did not get job id: ', queue_item['target_id'])

Interesting... How do you schedule this? If the queue is empty, do you back off and retry later, or spin the query until it returns a queue item, or some other way? It's a nice approach.

Thank you for posting this. Will definitely be using it going forward. Beautifully simple.

I LOVE this idea. I usually hear other Sr. engineers denigrate it as "hacky," but I think they aren't really looking at the big picture.

1. By combining services, 1 less service to manage in your stack (e.g. do your demo/local/qa envs all connect to Sqs?)

2. Postgres preserves your data if it goes down

3. You already have the tools on each machine and everybody knows the querying language to examine the stack

4. All your existing DB tools (e.g. backup solutions) automatically now cover your queue too, for free.

5. Performance is a non-issue for any company doing < 10m queue items a day.

I don't think it's hacky - it's using documented Postgres functionality in the way it's intended. Engineers tend to react that way to anything unfamiliar, until they decide it's a good idea then they evangelise.

What does "hacky" even mean? If it means using side effects for a primary purpose then no, SKIP LOCKED is not a side effect.

I researched alot of alternative queues to SQS and tried several of them but all of them were complex, heavyweight, with questionable library support and more trouble than they were worth.

The good thing about using a database as a queue is that you get to easily customise queue behaviour to implement things like ordering and priority and whatever other stuff you want and its all as easy as SQL.

As you say, using Postgres as a queue cut out alot of complexity associated with using a standalone queueing system.

I think MySQL/Oracle/SQL server might also support SKIP LOCKED.

Thanks for posting that code!

Definitely similar experience here. We handle ~10 million messages a day in a pubsub system quite similar in spirit to the above, running on AWS Aurora MySQL.

Our system isn't a queue. We track a little bit of short-lived state for groups of clients, and do low-latency, in-order message delivery between clients. But a lot of the architecture concerns are the same as with your queue implementation.

We switched over to our own pubsub code, implemented the simplest way we could think of, on top of vanilla SQL, after running for several months on a well-regarded SaaS NoSQL provider. After it became clear that both reliability and scaling were issues, we built several prototypes on top of other infrastructure offerings that looked promising.

We didn't want to run any infrastructure ourselves, and didn't want to write this "low-level" message delivery code. But, in the end, we felt that we could achieve better system observability, benchmarking, and modeling, with much less work, using SQL to solve our problems.

For us, the arguments are pretty much Dan McKinley's from the Choose Boring Technology paper.[0]

It's definitely been the right decision. We've had very few issues with this part of our codebase. Far, far fewer than we had before, when we were trying to trace down failures in code we didn't write ourselves on hardware that we had no visibility into at all. This has turned out to be a counter-data point to my learned aversion to writing any code if somebody else has already written and debugged code that I can use.

One caveat is that I've built three or four pubsub-ish systems over the course of my career, and built lots and lots of stuff on top of SQL databases. If I had 20 years of experience using specific NoSQL systems to solve similar problems, those would probably qualify as "boring" technology, to me, and SQL would probably seem exotic and full of weird corner cases. :-)

[0] - https://mcfunley.com/choose-boring-technology

I'm not opposed to a separate table or DB used exclusively for messaging separate from business level data, but I think I've never heard of anyone doing that in practice. It's almost always a bunch of hacky joins someone wants to do with the messages and the messaging gets directly coupled with the schema (foreign key constraints on transaction or event IDs and business entities is the usual case). Letting one or two things be used for purposes it wasn't necessarily designed for is alright but coupling the same place that has your OLTP together as your messaging system in itself is usually a culture where the same DB will be used in another month as a time series DB, then as an OLAP system, and then there's no time to properly decouple anything anymore / actually understand your needs and nobody wants to try to touch the very, very brittle DB that runs the entire business on its back.

It is a very, very slippery slope and it's seductive to put everything into Postgres at low scale but I've seen more places sunk by technical debt (because they can't move fast enough to catch up to parity with market demands, can't scale operationally with the budget, or address competitors' features) than ones that don't get enough traction where technical debt becomes a problem. But perhaps my experience is perversely biased and I'm just driven to over-engineer stuff because I've been routinely handed the reins of quite literally business non-viable software systems that were over-sold repeatedly while Silicon Valley sounds like it has the inverse problem of over-engineering for non-viable businesses.

I think you're saying it's not a good approach because developers will do it wrong?

That's hard to argue with.

The example code shows how it should be done - a simple table dedicated to messages which supplies the id of the work/job to be carried out.

Not much can be done architecturally to address developers messing that up.

While the pattern is simple programmatically, it is insufficient operationally and sets a precedent of coupling your data plane and your messaging plane.

Looking solely from a code perspective, how would this be done prior to Postgres 9.5 (before SKIP LOCKED)? Of the four examples I saw (either MySQL or Postgres), all of them choked constantly and had a pathological case of a job failing to execute expediently during peak usage which made on-call a nightmare. So what do we do about the places that can't upgrade their Postgres databases because it's too complicated to decouple messaging and business data now? Based upon my observations the answer is "it's never done" or the company goes under from lack of ability to enact changes caused by the paralysis.

The reality is that simple jobs are almost never enough outside toy examples. Soon, someone wants to be notified about job status changes - then comes an event table with a foreign key constraint on the job table's primary key. Also add in job digraphs with sub-tasks and workflows (AKA weak transactions) - these are non-trivial to do well. Depending upon your SQL variant and version combined with access patterns, row level locking and indexing can be an O(1) lookup or an O(n^2) nightmare that causes tons of contention depending (once again) upon the manner in which jobs are modified.

Instead of thinking about "what are my transaction boundaries and what do my asynchronous jobs look like?" which are much more invariant and important to the business trying to map their existing solution onto as many problems as possible. Then it should be more clear whether you would be fine with a RDBMS table, Airflow, SQS, Redis, RabbitMQ, JMS, etc. Operations-wise more components is certainly a headache, but I've had more headaches in production due to inappropriate technology for the problem domain than "this is just too many parts!"

> Looking solely from a code perspective, how would this be done prior to Postgres 9.5 (before SKIP LOCKED)?

It's possible to implement SKIP LOCKED in userland in PostgreSQL (using NOWAIT, which has been in PG), although it's obviously a bit slower.

As always, it depends on your use cases. The data management and API footprint story is way simpler with SQS than with a relational database, and you don't have to worry about all the scaling complexity that comes with relational databases. Plus you get things like AZ resiliency for free. Staging SQS queues for demo/local/qa/dev envs is trivial if you are using CloudFormation.

Personally, I would never use a relational database as a queue, but I'm already very familiar with SQS's semantics.

Especially for temporary batch processing jobs, where I really don't want to be modifying my production tables, SQS queues rock.

There is another advantage.

If you are using the queue as a log of events (i.e. user actions), you get an atomic guarantee that the db data is updated and the event describing this update has been recorded.

Also being able to run a query over the contents of the entire queue and the messages bodies (for example a JSON blob).

> I LOVE this idea. I usually hear other Sr. engineers denigrate it as "hacky," but I think they aren't really looking at the big picture.

Because it is hacky from the perspective of a distributed system architecture. It's coupling 2 components that probably ought not be coupled because it's perceived as "convenient" to do so. The idea that your system's control and data planes are tightly coupled is a dangerous one if your system grows quickly. It forces a lot of complexity into areas where you'd rather not deal with it, later manifesting as technical debt as folks who need to solve the problem have to go hunting through the codebase to find where all the control and throttling decisions are being made and what they touch.

> 1. By combining services, 1 less service to manage in your stack (e.g. do your demo/local/qa envs all connect to Sqs?)

"Oh this SQS is such a pain to manage", is not something anyone who has used SQS is going to say. It's much less work than a database. It has other benefits too, like being easier for devs by being able to trivially replicate staging and production environments is a major selling point of cloud services in the first place. And unlike Postgres, you can do it from anywhere.

The decision to use the same postgres you use for data as a queue also concentrates failure in your app and increases the blast radius of performance issues concentrated around one set of hardware interfaces. A bad day for your disk can now manifest in many more ways.

> 2. Postgres preserves your data if it goes down

Not if you're using a SKIP LOCKED queue (in the sense that it has exactly the same failure semantics as SQS, and you're more likely to lose your postgres instance than SQS with a whole big SRE team is to go down). Can you name a data loss scenario for SQS in the past 8 years? I can think of one that didn't hit every customer, and I'm not at AWS. Personally, I haven't lost data to it despite a fair amount of use since the time us-east-1 flooded. Certainly "reliability" isn't SQS's primary problem.

> 3. You already have the tools on each machine and everybody knows the querying language to examine the stack

SQS has a web console. You don't need to know a querying language. If you're querying off a queue, you have ceased to have an unqualified queue.

> 4. All your existing DB tools (e.g. backup solutions) automatically now cover your queue too, for free.

Given most folks who are using AWS are strongly encouraged and empoewred to use RDS, this seems like a wash. But also: SHOULD you back up your queues? This seems to me like an important architectural question. If your system goes down hard, should/can it resume where it left off when restarting?

It'd argue the answer is almost always "no." But YMMV.

> 5. Performance is a non-issue for any company doing < 10m queue items a day.

You say, and then your marketing team says, "We're gonna be on Ellen next week, get ready!"

>>>> 2. Postgres preserves your data if it goes down

>>>Not if you're using a SKIP LOCKED queue,

Can you provide a reference for this? I'm not aware that using SKIP LOCKED data is treated any differently to other Postgres data in terms of durability.

It's exactly the same case as SQS in that if your disk interface dies in mid write of a message you lose it the same as if your interface to SQS or SQS itself dies mid-write, you lose it.

The rest of the sentence is a challenge, "show me a data loss bug in SQS that extends beyond the failure cases you've already implicitly agreed to using postgres itself."

SKIP LOCKED is a read-side feature. PG is a problem for u/andrewstuart because they have to manage it themselves (unless they're getting a PG service from a cloudy provider, say) and that's harder than it is for Amazon to manage SQS.

What possibilities are there for a write to fail that the database server cannot react to? I am under the impression that a write error like this would be reported as a failure to run the statement and the application is responsible for handling that.

The box turning off, the disk interface failing, and/or the link to the database failing in mid instruction.

Same as SQS.

All of those would result in a write failure from the application's perspective, which is fine, and must be accounted for regardless (e.g. retry, two phase commit, log an error, whatever).

But you have to explicitly delete the message from SQS, right? You'd only delete after confirming you processed the message, right? So if you die mid-instruction in processing a message, the message just re-appears in the SQS queue after the visibility timeout.

That's a feature, and one your have to replicate in your postgres queue.

Also the FS might report the data as written but its actually in a write cache and will be lost if the plug is pulled.

Not an issue if you follow th e recommendations in the PostgeSQL documentation. The way to comfigure write caches is described there.

Using SKIP LOCKED - do you commit the change to the dequeued item (ack it) at the point where you exit the DB call. If so what happens if the instance that dequeued the messages crashes?

Not GP but I think this wouldn't be a problem. The consumer dequeues with SELECT FOR UPDATE within a transaction. If it crashes the database would rollback the transaction, and then another consumer would be able to select the work unit.

As for acking, I see two common methods: using an additional boolean column, something like is_processed. Consumers skip truthy ones. Or, after the work is done, simply delete the entry or move elsewhere (e.g. For archival / auditing).

My question assumed a scenario where a consumer dequeues a batch, commits the deqieued change, and then crashes while processing the batch.

Offcourse one could delay the commit until all processing is completed but then reasoning about the queue throughput becomes tricky.

That's the challenge of distributed systems :) it really boils down to how you want failures to be handled.

If you ack before processing, and then you crash, those messages are lost (assuming you can't recover from the crash and you are not using something like a two-phase commit).

If you ack after processing, you may fail after the messages have been processed but before you've been able to ack them. This leads to duplicates, in which case you better hope your work units are idempotent. If they are not, you can always keep a separate table of message IDs that have been processed, and check against it.

Either way, it's hard, complex and there are thousands of intermediate failure cases you have to think about. And for each possible solution (2pc, separate table of message IDs for idempotency, etc) you bring more complexity and problems to the table.

Well, sqs has machinery that deals with this (in flight messages, visibility timeouts) "out of the box". Similar functionality needs to be handcrafted when using dB as a queue.

To be clear, it is not that the SKIP LOCKED solution is invalid, it is just that there are scenarios where it is not sufficient.

Refer to the example I posted as a reply, above.

You'd have the same problem with SQS, wouldn't you. The act of dequeueing does not guarantee that the process that received a message will not fail to perform it.

If you want a reliable system along those lines than you need to use SKIP LOCKED to SELECT one row to lock, then process it, and then DELETE the row. If your process dies then the lock will be release. You still have a new flavor of the same problem: you might process a message twice because the process might die in between completing processing and deleting the row. You could add complexity: first use SKIP LOCKED to SELECT one row to UPDATE to mark in-progress and LOCK the row, then later if the process dies another can go check if the job was performed (then clean the garbage) or not (pick and perform the job) -- a two-phase commit, essentially.

Factor out PG, and you'll see that the problem similar no matter the implementation.

> you might process a message twice because the process might die in between completing processing and deleting the row

The very handy thing about the setup described, is that your data tables are part of the same MVCC world-state as your message queue. So you do all the work for the job, in the context of the same MVCC transaction that is holding the job locked; and anything that causes the job to fail, will fail the entire transaction, and thus rollback any changes that the job's operation made to the data.

With SQS, the act of dequeueing makes the mesage invisible to other consumers for a predefined time period. The consumer can ack the mesage once the procesing is completed resulting in the message being deleted. If the consumers fails to do so - the mesage will eventually become elligible to be processed by another consumer,

I described, in the message you're responding to, how to do the same thing with SKIP LOCKED.

Essentially postgres SKIP LOCKED worker queues DELETES an item from a worker queue table, does the relevant work, and if the work completes ok, commits the deletion.

The SKIP LOCKED bit means that once the queue item has been grabbed FOR UPDATE, it cannot then be grabbed FOR UPDATE by any other queries, so it has an exclusive lock on the worker queue item.

It's pretty robust and works fine for servicing multiple workers.

Delaying the commit is the standard approach with this. SKIP LOCKED was created specifically to avoid the throughput issues of locked rows (and has similar implementations in other RDBMS).

If you don't want to keep the transaction open than you can just go back to updating a column containing the message status, which avoids keeping a transaction open but might need a background process to check for stalled out consumers.

Do you have an example of such a queue somewhere?

I think I have a rough idea about how it works because I implemented something similar about four years ago in PostgreSQL but kept getting locking issues I couldn't get out of:

- https://stackoverflow.com/questions/33467813/concurrent-craw...

- https://stackoverflow.com/questions/29807033/postgresql-conc...

Also, what kind of queue size / concurrency on the polling side are you able to sustain for your current hardware?

There's a few libraries implementing this:

https://github.com/mbuhot/ecto_job for Elixir

https://github.com/timgit/pg-boss for Node.js

Refer example code above. I don't think Postgres included SKIP LOCKED 4 years ago.

There are VPC API endpoints for this. As far as I recall it was all the time there from VPC begging, but I might be wrong.

Only since December 2018 and you need to pay extra...


A long time ago, as new-ish developer, I was building a system that needed to take inputs, then run "pass/fail/wait and try again later" until timeout or completion. This wasn't mission-critical stuff, mind you, so a lost message would annoy someone but not cause any actual harm.

As I was figuring out how to setup a datastore, query it for running workflows and all that jazz, I happened upon an interesting SQS feature: Post with Delay.

And so, the system has no database. Instead, when new work arrives it posts the details of the work to be done to SQS. All hosts in the fleet are polling SQS for messages. When they receive one, they do the checks and if the process isn't complete they repost the message again with a 5-minute delay. In 5 minutes, a host in the fleet will receive the message and try again. The process continues as long as it needs to.

Looking back, part of me now is horrified at this design. But: that system now has thousands of users and continues to scale really well. Data loss is very rare. Costs are low. No datastore to manage. SQS is just really darned neat because it can do things like that.

The biggest gotcha in a design like this IMHO is that you can't post and delete atomically. You may post the new work into the queue and then a failure to delete could occur and the work will stack.

Depending on the workload this could be not a big deal or very expensive. Treating a queue as a database, particularly queues that can't participate in XA transactions, can get you in trouble quick.

With a realistic, that is, not 100% reliable, queue you can have either "at most once" or "at least once" delivery anyway. "Exactly once" can't be guaranteed.

So a duplicate message should be processed as normal anyway, e.g. by deduplication within a reasonable window, and/or by having idempotent operations.

Yes, it depends on the workload. Idempotency is typically always a good idea, but sometimes the operation itself is very expensive in terms of time, resources, and/or money. I have also seen people try to update the message when writing it back(with checkpoint information and etc) for long running processes. A slew of issues, including at least once delivery, can cause workflow bifurcation. Deduplication via FIFO _can_ help mitigate this, but it has a time window that needs to be accounted for. Once you start managing your own deduplication I'd say it has moved past trying to go databaseless.

But you could adjust the message visibility timeout of the message you received so that it appears back later in the queue itself.

For this specific use case, as described, this sounds like a possible very good approach.

One thing to be aware of for this approach though is that you can only have 120k in-flight messages per queue.

Why are you horrified at a design that works well, scales well, is resilient enough for its use case, and is low cost? The whole point of an engineering design process is to find designs that meet these types of requirements. Honestly, this sounds like the perfect solution for what you're trying to accomplish.

Because looking at it now, something feels deeply wrong about it, haha. Honestly, if I'd used a database it probably would have opened up a few more options for future work.

I can't do any analytics about how long things typically take, who my biggest users are, etc. I mean, I could, but I'd have to add a datastore for that.

Adding new details to the parameters of the system requires very careful work to make all changes backwards and forwards compatible so that mid-deployment we don't have messages being pushed that old hosts can't process or new hosts seeing old messages they don't understand. That's good practice generally, but it's super mission critical to get right this way.

Also, a dropped message is invisible. SQS has redrive, sure, and that helps but if there were a bug, an edge case, where the system stopped processing something and quietly failed, that processing would just stop and we'd never know. If the entries were in a datastore, we'd see "Hey, this one didn't finish and I havne't worked on it lately, what gives?".

Has anyone ever measured the latency of the sending message to SQS? I was using with ELB in t2.medium instances, and my API (handle => send message to queue => return {status: true}) response times were around 150 - 300 ms and replaced SQS with RabbitMQ, and it went down to around 75-100 ms.

Does anyone think that sending message to SQS is slow?

Edit: With this update, I was able to process almost 3 x requests with the same resources, and it lowered my bills quite a lot.

For example my SQS bill for last month

Amazon Simple Queue Service EUC1-Requests-Tier1 $0.40 per 1,000,000 Amazon SQS Requests per month thereafter 290,659,096 Requests $116.26

it went to 0, and ec2 cost went down as well because ELB spun up fewer instances that I could handle more quest with the same resources.

This was my experience with SQS. I just wanted to share it.

In cases where latency is of primary concern, don't use SQS. Consider also not using Rabbit as its tail latency isn't hugely more reliable when loaded.

Were you running RabbitMQ clustered with persistent queues?

I don't think SQS is primarily for low-latency messaging, but rather a provided high available MQ with very little hassle.

I wasn't, single instance in the same subnet with persistent work queues.

I’m not quite sure why latency affected total throughout.

Although 100-300ms seems pretty good for total round-trip latency to most message queues. Another thing to make sure of is that whatever HTTP client you’re using to interact with AWS is using pipelining. It’s off by default for the JS libraries for example.

Umm.. is this in the same DC?

From what I remember from numbers from New Relic.

- The average SQS Put latency is: 10-20 ms.

- RabbitMQ Publish Message in the same DC should not be more than the round trip time of a DC assuming a persistent connection. So, about 0.5 ms

Benchmarks online show this to be true, depends on your use case if it's acceptable I guess.

Sounds more like AZ issue than SQS/rabbit?

WE are heavy users of AWS. SQS is the only service where we have had zero downtime. The only downside we have about SQS is you can pull out only 10 messages at a time (without batching). You can have parallel readers but they result in some duplicates. There is SQS FIFO but it is throttled.

Plus FIFO isn’t integrated with SNS nor is it integrated with Lambda leaving it in an island of its own within AWS.

It went down on us once for extended period in 2015. It was chaotic as you don't expect it to fail. If memory serves me right, even S3 suffered that day.

A whole slew of AWS services went down that day. Tim's not wrong when he indicates that almost every service in AWS has a dependency on it (Amazon has services split up in to tiers based on how much they can rely on other services for critical components, SQS is pretty high up in the tiering.)

I was on-call that day for an AWS service. There wasn't much I could do but sit muted on the conference call and watch some TV, waiting for the outage to be over.

One downside of SQS is that it doesn't support fan-out, for eg. S3->SQS->multiple consumers. The recommendation instead seems to be to first push to SNS, and then hookup SQS/other consumers to it. Kinesis/Kafka would appear to be better suited for this (since they support fan-out like SNS and are pull-based like SQS), but aren't as well supported as SNS/SQS (you can't push S3 events directly to Kinesis for eg.) Can someone from AWS comment on why that is? Also, related: when can we expect GA for Kafka (MSK)?

Kinesis is not necessarily well-suited fan-out. It is very well suited for fan-in (single consumer, multiple producers).

Each shard allows at most 5 GetRecords operations per second. If you want to fan out to many consumers, you will reach those limits quickly and have to implement a significant latency/throughput tradeoff to make it work.

For API limits, see: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_...

I do S3 -> SNS -> SQS. I don't see why I would use Kinesis instead. The SNS bit is totally invisible to the consumers (you can even tell SNS not to wrap the inner message with the SNS boilerplate), downstream consumers just know they have to listen to a queue.

I don't see a downside to this approach. Perhaps some increased latency?

If you wanted multiple pull-based consumers for the stream, wouldn't you need a separate SQS queue per consumer, with each queue hooked up to SNS? Perhaps I'm mistaken, but that seems brittle to me. With Kinesis/Kafka, you only need to register a new appName/consumer group on the single queue for fan-out. Plus, both are FIFO by default, at least within a partition.

That's exactly how you do it. To me, it's the opposite of brittle - every consumer owns a queue, and is isolated from all other consumers. Clients are totally unaware of other systems, and there's no shared resource under contention.

I feel like the create/delete queue semantics hint that a queue should be a long-lived thing that consumers are configured to connect to. When I saw suggestions to have one queue per consumer and have that consumer create/delete the queue during its execution lifecycle, the idea of one-queue-per-consumer started making more sense to me.

I think the word "Consumer" here is "Consumer Group".

For example, an AWS Lambda triggered from SQS will lead to thousands of executions, each lambda pulling a new message from SQS.

But another consumer group, maybe a group of load balanced EC2 instances, will have a separate queue.

In general, I don't know of cases where you want a single message duplicated across a variable number of consumer groups - services are not ephemeral things, even if their underlying processes are. You don't build a service, deploy it, and then tear it down the next day and throw away the code.

Hmm. It seems a bit awkward if you have a variable number of consumers?

I haven't run into that myself, when would you want a variable number of consumers? Usually the way I have it is that a service, which is itself a cluster of processes, owns one queue. For example, an AWS Lambda triggered by that queue.

Then any new lambdas or other services that want to subscribe to messages will have another queue, and another, etc.

I haven't had a case where I had service groups coming up and down, I'm struggling to think of a use case.

Yea, I find this setup really convoluted and unnecessarily complex. Now I have to learn the particulars of two aws services to do a job which ought to be handled by one.

Google Cloud really outshines AWS here with its serverless PubSub - its trivial to fan out, its low latency, and has similar delivery semantics (I think), and IMHO better, easier api's. Its a really impressive service, IMHO.

I have been working with Google pubsub and was excited about their Push service that can post messages to subscribed endpoints/webhooks.

But their only method of throttling is to scale up and down base on failures. And it has been very unpredictable for me.

Even though my webhook started failing and timing out on requests, pubsub just kept hammering my servers until it brought it completely to it's knees. Logs on Google's end showed 1,500 failed attempts per second and 0.2 successes per second. It hammered at this rate for half an hour.

Seems like their Push option really needs some work.

Generally they do GA in the next ReInvent from when as service is announced, so probably by end of year. But I won't be sure on MSK. It is extremely limited right now, last time I checked their API had no way for even changing the number of nodes.

I really wish SQS had reliably lower latency, like Redis, and also supported priority levels. (Also like redis, now, with sorted sets and the https://redis.io/commands/bzpopmax command.)

Has anyone measured the performance of Redis on large sorted sets, say millions of items? Hoping that it's still in single digit milliseconds at that size... And can sustain say 1000QPS...

I have worked with sorted sets with millions of items. The latency really depends on what you execute. Commands that involve only few elements are fine. Like the ZPOPMAX you mentioned should be way under a ms if you pop say 10 items, and you should be able to get way more than 1k QPS. The thing with sorted sets like most data structures in Redis is to just read the time complexity well in their documentation. For an operation like ZPOPMAX it is linearly proportional to the number of items you pop, so if you pop too many it will take time.

But actually ZPOPMAX is proportional to log(number of elements in set). So when the set grows to millions you may have a performance problem. I have no idea how fast growing this log function is so I really have no idea of the performance on sets of millions...

Well, that represents the asymptotic time complexity. So, roughly with an increasing N it increases by log of it. Here is an example of the log(base 2) function growth.

Log(4) = 2, Log(16) = 4, Log(256) = 8, Log(65536) = 16, Log(4,294,967,296) = 32

Assuming other things remaining unaffected, theoretically that means in a set of 4 million the operation should be about 4 times slower than in a set of 250 elements.

We use Redis as a job queue and its great; the only limitation is being sometimes concerned about job queue size due to memory limits of the Redis server itself.

Also, when you don't want to lose a message, the Redis persistence story requires careful thought. It requires setting up RDB + AOF + appendfsync=always + backups.

You don't need to fsync on every write, that's what the replica is for. At lower/mid-tier hardware, network is faster than storage and your message is on multiple machines before it's even written to disk so fsync of 1 second is usually fine.

This would just kill the performance on virtualized hardware like EBS. You would lose a lot of benefits of Redis at that point.

Always hear using redis etc for job queues, what do these "job queues" entail? I've been an amateur and have used postgres tables as queues .. am I not being efficient?

Well, they are just a mechanism to push jobs between different components where you treat Redis as a message queue. In Redis you can implement one using Lists, Sorted Sets, Stream or Pub/Sub; though they are commonly implemented through the first two. Though if you end up using first two, to guarantee at least once semantics you need to take care of acknowledgement in a not so pleasant way.

You can replace Redis with pretty much any message queue though like RabbitMQ which has a better consumption story. The main advantage of using any one of these would be the throughout you can achieve and it decouples your database at that point which a lot of people prefer.

Tables as queues is perfectly legit, one less component to worry about.

Does anyone have any rough performance numbers on each approach? I'll only use DB tables when I don't want to deploy Redis (which I usually do, as a cache or whatever), and I'll use RabbitMQ for more "serious" projects. On small projects, I'll even forgo Redis and use the DB as a cache too, which works pretty well.

What do y'all do?

My job queue requirements have been very modest (less than ten jobs per hour) and no more than 2 or 3 workers. I also generally have redis deployed but it never occurred to me to use it given it's "ephemeral" nature - always remained paranoid that services on the cloud should always be assumed to go down at any time

I generally agree, but I have Redis as part of my main stack now, and it's been pretty solid. I'm fairly distrustful of "cloud" services, though, so I run my stuff on Hetzner and manage them myself. For my revenues, where a day's downtime might cost me $5/mo in lost revenue, I think it makes more sense.

One more advantage it gives is you can have pretty easy job guarantees through the database transactions.

Good luck with durability (of which Redis has no decent guarantees) and availability (of which depend entirely on how good you are at configuring and maintaining Redis servers and worse, the way you access them as a client).

I'm gonna give SQS a shot because authentication in redis seems to be quite a challenge, or was so the last time we looked about six months back.

Author of https://node-ts.github.io/bus/ here. SQS is definitely one of my most favourite message queues. The ability to have a HA managed solution without having to worry about persistence, scaling or connections is huge.

Most of the complaints are native to message based systems in general. At least once message receives, out of order receives, pretty standard faire that can be handled by applying well established patterns.

My only request would be to please increase the limits of message visibility timeouts! Often I want to delay send a message for receipt in 30 days. SQS forces me to cook some weird delete and resend recipe, or make this a responsibility of a data store. It's be really nice to do away with batch/Cron jobs and deal more with delayed queue events.

RE: visibility timeout beyond 30 days, you may be more after a “saga” that has state and is long running (hours/days/months/years).

You can imagine building a saga system on top of a queue system.

You're absolutely right, in fact I have a whole package that is just that https://node-ts.github.io/bus/packages/bus-workflow/.

The problem is this. Let's say that I want to trigger a step in a "free trial" saga that sends an email to the customer 10 days after they sign up nudging them to get a paid account. If I can delay send this message for 10 days then it's easy.

However because SQS has a much shorter visibility timeout, I have to find a much more roundabout way of triggering that action.

Yeah, that makes total sense. For some of our saga's (we don't use SQS -- we use a custom redis queue), we have the saga potentially wake up and immediately sleep again ("Nothing to do right now, defer again in a few days").

But yes, a quirk.

How about Step Functions? Jobs can run for up to 12 months with wait steps. And can now send action tokens to services like SQS for completion later.

isn't it dangerous to relay on persistence of MQ for such a long time? shouldn't your system have a way to re-create the queue at any time? if so, you should be able to call such command to fill in the queue within the maximum timeout time allowed by MQ provider.

> isn't it dangerous to relay on persistence of MQ for such a long time

Most of them are designed and intended as a persistent data store. Data loss for a queue system is typically something you do not want

We love SQS, but one of the problems we're running into lately is the 256kb per message limitation. We do tens of millions of messages per day, with a small percentage of those reaching the 256kb limit. We're approaching the point where most of our messages will hit that limit.

What are our options for keeping SQS but somehow sending large payloads? Only thing I can think of is throwing them into another datastore and using the SQS just as a pointer to, say, a key in a Redis instance.

(Kafka is probably off the table for this, but I could be convinced. I'd like to hear other solutions first, though.

Hard to suggest smth without details, but imo 256kb per message is a lot and enough. One thing I would consider is storing such "messages" payload on s3, and sending notification to sqs allowing you to fetch given file and process it (this however might be expensive if you happen to have such high traffic)

We have a library that puts the payload in s3 bucket under random key, the bucket has expiration policy of few days. Then we generate http link to the object and send an sqs message with this url in metadata. The reader library gets data from s3, it doesn't even have to remove it. It will disappear automatically later.

We do it "by ourselves", not using the provided lib, because that way it works both for SQS and SNS. The provided lib only supports SQS.

Also our messages aren't typically very big, so we do this only if the payload size demands it.

Wouldn't sending and later retrieving millions of S3 objects be expensive?

Messaging in our system is not high volume so not in this case. Also as I said - these big messages are pretty rare.

Gotcha. That makes sense.

The Java SDK for SQS automatically drops >256k messages into an S3 bucket and stores a pointer to the message in SQS itself. You set the bucket and the client transparently handles retrieving a message from SQS/S3 when necessary.

Based on the structure of the message (UUIDv4) you could probably roll your own implementation in any language.

That's a good idea, but wouldn't sending and later retrieving millions of S3 objects be expensive?

Yes, and slow, if you're actually doing it for millions. The comment was saying the library only does it for messages that are over the size limit though.


I've successfully been using Kafka in production with a tiny percentage of packets ~1Mb. We utilize gzip compression with chunking on some of those messages (system is legacy, no fancy compressed message format in use). IIRC, only had to modify settings at the broker level and it works perfectly fine.

We've discussed switching to Kafka. There are some pros/cons to doing that. With respect to my problem above, our messages _could _conceivably approach 1MB (or even surpass it), so we're really just delaying the inevitable. That said, we're a long, long way from hitting that limit, so it's definitely something we're looking at.

We just recently started gzipping our payloads, which buys us even more time.

Why are your messages large? Messages should just be signalling events, like follows:




I guess you are sending the whole user object over the event bus? Isnt that an anti-pattern?

We point to an s3 object for any large payloads

Wouldn't sending and later retrieving millions of S3 objects be expensive?

and there's a library around that can do it for you, at least in java

Does anyone know a good, low overhead out-of-process message queue, that's lightweight enough that it can be useful for communicating between processes on the same machine, but if necessary it can scale beyond it? In case of a single-machine product that comprises of several services, a message queue can sometimes be useful for pull model, but adding RabbitMQ to the stack makes installation and ops much more complex than customers deem acceptable.

I know some people use Akka with Persistence module, but I would welcome other alternatives.

Depending on the project a database (Postgres) can make a nice database, queue and cache. You get transactions among all three which which makes life much simpler in the beginning and it will happily march along processing thousands or tens of thousands of TPS. You can scale by adding read replicas and eventually moving to separate databases for each purpose.

If you're on the JVM then ActiveMQ (and other Java queues) will usually run embedded and have options to use a database for persistence.

Guess it depends on the definition of "queue". Potentials:

  - https://nsq.io/
  - https://nats.io/

Seems like NATS streaming would fit my case - have you heard of any real world deployments that use it ? Are there any larger issues that don't make it a good choice ?

NATS Streaming is not as well tested and has some design issues that make scaling hard. NATS itself has a new version 2 that has a protocol update and NATS Streaming should follow with a new design as well, but I would recommend other options if you want persistence.

What are the design flaws that you have in mind? Is it ok for a couple of nodes or even then it would have trouble to keep up with a medium load? Or maybe the design flaws are to do with providing durability and other guarantees?

What other options would you recommend, that can provide at least once delivery and are lighweight enough not to require zookeeper etc?

Nats streaming isn't just a persistence layer to NATS. It's an entirely different system that basically acts as a client to NATS and then records messages it sees. Basically think of how you would design a persistent queue on top of the ephemeral NATS pub/sub and that's what NATS streaming is.

Here's a good post (and series) about distributed logs and NATS design issues: https://bravenewgeek.com/building-a-distributed-log-from-scr...

Can you share some details or point to articles describing the design issues with NATS streaming?

See the other comment but this is a good post/series: https://bravenewgeek.com/building-a-distributed-log-from-scr...

I am curious what definition of "queue" fits NSQ. It's a distributed messaging platform.

The normal one. It puts stuff in a queue.

"nsqd is the daemon that receives, queues, and delivers messages to clients."

Do they offer at-least-once or better delivery guarantees?

"Low overhead"? Redis.

Reliable and durable? Emphatically not Redis, and at that point you probably want to just start looking at SQS.

(I've had good luck shipping products as docker-compose files, these days. Even to fairly backwards clients.)

Can you explain why Redis is not durable? Looking into it for a project but this comment worries me.

Intentionally so. It's not a deficiency or a footgun, it's a design decision to be aware of. Redis is an in-memory database first.

You can configure Redis for durability. The docs[1] page for persistence has a good examination of the pros and cons.

[1]: https://redis.io/topics/persistence

antirez put a lot of work in around Redis 3.0 (iirc) to make the persistence reliable and strong. As long as your server is configured and used correctly (obviously a large caveat, but you can only hold someone's hand so much), I don't think there is any reason to doubt Redis's persistence anymore.

It's important to make this distinction because there are commonly-used systems that offer a best-effort style persistence that usually works fine, but explicitly warn developers not to trust it, and developers rarely understand that.

We badly need to get better at distinguishing between true, production-level data integrity and "probably fine".

In my experience Redis is plenty durable (especially in Elasticache with replicas + multi-region + backups). Redis _is_ an in-memory data store, so if the server crashes, you lose the data, but if you have replicas it'll fail over, if you have backups you can restore, and if you have multi-region you can failover to the other region. IMO, the idea that Redis is not durable enough is outdated.

It's something to be aware of and to have backup plans for, but we've been using Redis as our primary datastores for over a year with only one or two instance failures which were quickly resolved within minutes by failing over to replicas, with no data loss.

Redis won't lose data from just a server crash. It has persistence with streaming AOF mode and snapshot RDB mode, and they can be combined. fsync is also configurable. It can be set to write every operation to disk but most set it to every 1 second which is safe enough with replicas.

I guess I didn't make it clear. Solution has to work in-house / in private cloud.

Redis fits this niche for me most of the time, but I always start in-process for simplicity. If durability is all you need, SQLite is a great reliable option.

Nowhere is cost mentioned. Using S3 as an ad-hoc queue is a cheaper solution, which should throw some red flags. You can easily do what SQS does for so much cheaper (this includes horizontal scaling and failure planning), that I'm consistently surprised anyone uses it. Either you are running at a volume where you need high throughput and it's pricey, or you're at such a low throughput you could use any MQ (even redis).

> Oh but what about ORDERED queues? The only way to get ordered application of writes is to perform them one after the other.

This is another WTF. Talking about ordered queues is like talking about databases, because it's data that's structured. If you can feed data from concurrent sources of unordered data to a system where access can be ordered, you have access to a sorted data. You deal with out-of-order data either in the insertions or a window in the processing or in the consumers. "Write in order" is not a requirement, but an option. Talking about technical subjects on twitter always results in some mind-numbingly idiotic statements for the sake of 144 characters.

In the first subheading, towards the bottom, the author mentions it’s $0.40 per million api calls.

S3's eventual consistency aside, how is it cheaper?

It would seem to me that naively, S3 charges $5 per million POST requests, so it's 10x worse than SQS's $0.40 per million.

Writing directly from EC2 is free (within same region).

Only the data transfer is free. The API requests are not free. S3 does come out to be more expensive...

Only if you are pushing 1 message per write. 1 write per agent per minute, ends up at just over 2$ per month fox 10 agents, regardless of the number of messages - it's cheaper for any setup I've encountered. AWS services are always middling solutions, which you can often optimize for better cost efficiency.

Interesting point. Should have mentioned this difference though, it's a quite different architecture with different tradeoffs.

This is a message queue. But wouldn’t be fit for purpose for the way most people need to use message queues.

What constitutes "the queue" is a matter of data structuring, not a singular brand (eg RabbitMQ, Redis, SQS, etc each have their own internals).

Any middle tiering of the data before it reaches the consumer, is still "the queue". You don't need to know the internals of SQS, anymore than a consumer need know the black box elements of how you collate the messages within your ad-hoc queue.

> Using S3 as an ad-hoc queue is a cheaper solution, which should throw some red flags.

Interesting. Can you expand on this? How do you ensure that only one worker takes a message from s3? Or do you only use this setup when you have only one worker?

You encode messages with timestamp and origin (eg 1558945545-1), you write directly to S3 into a (create if not exists) folder for a specific windowing (let's say minute). Every agent writing, you end up with a new folder in the next minute. You have a window with an ordered set of messages by window by sort algorithm...optimally determined by the naming encoding.

You reminded me of a post on Dropbox announcement in 2007, that you can do it “yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem”.

Just because you can, doesn’t mean you should.

Cost is the motivating factor here.

Dropbox is more expensive than an FTP server, so the two scenarios are comparable.

Which gets you one of the basic features of SQS, but not the entire rest of the implementation. It's also significantly more work than just setting up an SQS queue.

I guess if you're at the point where your engineering time to implement this + all of the features on top of it that you might need from SQS and future maintenance of this custom solution is cheaper than the cost of using SQS, and you have no other outstanding work that your engineering team should be doing instead, this is a viable cost optimization strategy.

But that's a whole lot of ifs, and with customers I've mostly worked with, they're far better served just using SQS.

What happens when you need to retry the messages from a few minutes in the past because there was a transient failure in a downstream dependency?

S3 went down twice in 5 years. Since we're transferring files, you just push everything in the next window. The retry is trivial from the agent and accounted for in the consumer.

I wasn’t talking about the reliability of S3, but of your own systems.

Say the outage results in a few million messages that need to be retried. Some subset of those few million will never succeed (aka they are “poisoned pills”). At the same time, new messages are arriving.

In your system, how do you maintain QoS for incoming messages as well as allow for the resolution of the few million retries while also preventing the poisoned pills from blocking the queue? How do you implement exponential backoff, which is the standard approach for this?

SQS gives you some simple yet powerful primitives such as the visibility timeout setting to address this scenario in a straightforward manner.

Sounds really painful since I want things like retry logic, retry queues, and dead letter queues.

FWIW there is a queue based on maildir which has implementations in Perl, Python, C and Java and probably more.

The Perl implementation was the original AFAIK.


Interesting; looks like DirectoryQueue uses directories, rather than file locks (man 2 flock), to lock the queue messages. This might actually work, since mkdir returns an error if you attempt to create a directory that already exists. The implementation seems to be handling most of the obvious failure cases, or at least tries to.


So how does one lock a message in s3? Does s3 have a "createIfDoesNotExistOrError"? I'm still having difficulty understanding how the proposed system avoids race conditions.

maildir is specified at https://cr.yp.to/proto/maildir.html and billions of message uses it each year. So that's pretty safe.

I can't vouch for the queueing code but I believe it's quite robust too.

Back in 2000, I worked with a guy that built an entire message queue product using the SMTP protocol with an implementation that was in turn built on top of lex and yacc.

From the operations side, I feel like SQS is a very sharp pair of scissors. In the hands of a good tailor, they'll make amazing things. On the other hand, most people use them to cut things they probably shouldn't, or just flat out Sprint around with them on a wet pool deck with their shoes untied.

A non-comprehensive list of ways I've seen my developers shoot themselves in the foot:

* Giant try-catch block around the message handling code to requeue messages that threw an exception. They neglected to add any accounting, so some messages would just never process. No one noticed until they saw the queue size never dropped below a certain amount during debugging.

* Queue behavior is highly dependant on configuration. Bad queue configurations result in dropped messages. Queueing systems provide few features to detect and alert on these failures (it's rally not their job), but building a system to track the integrity of the business process across queues is deemed to onerous.

* The built-in observability is generally not enough to be complete. I haven't seen a lot of great instrumentation libraries for SQS like there are for HTTP, meaning that observability is pushed on to the developer. They typically ignore that requirement because PMs rarely care until they realize we're unable to respond to incidents effectively.

* Most people vastly overestimate their scale. The number of applications I've seen built on SQS "because scale" that end up taking less than 100 QPS globally is significant. Anecdotally, I would say the majority of queue-based apps I have seen could have solved their scaling issues within HTTP.

* Many people want to treat queued messages like time-delayed HTTP requests. They are not, the semantics and design are totally different. I have seen people marshal requests to Protobuf, use it as the body of a message, and had another service read and process the request, and write another message to a queue that the first app reads back. It's basically gRPC over queues. Except that it solves none of the problems gRPC does, and creates a lot of problems. Just an example, how do you canary when you can't guarantee that the version of the app that sends the request will get the response to that request?

I think SQS is an amazing tool in the hands of people that know when to use it, and how to use it. But my experience has been that most people don't, and the ecosystem to make it available to people who aren't experts just doesn't exist yet.

I agree with most of this. If you have non-trivial message handling logic in a production system, you probably shouldn't use SQS directly to drive your work. Your SQS handling logic should be simple and reliable. In most cases, if the handling logic is complex, long-running, or needs operational visibility (logging, monitoring, etc.), I'd write the message handler itself to just kick off a workflow via Step Functions or some other workflow system. You'll pay for that in initial development costs, because it is more complicated (you need to write lambda handlers, wire them up with CloudFormation, etc.), but the tradeoff is that it gives you a central place to look at each unit of work, instead of having your artifacts scattered around various logs (if at all).

The takeaway for me is: distributed systems are hard. If you have distributed workers, you have entered into a vastly more complex realm. SQS gives you some tools to work successfully in that environment, but it doesn't (and can't) get rid of that complexity. Most of the problems I've seen relate to engineers not understanding the fundamental complexity of coordinating distributed work. Your choice of tech stack for your queues isn't going to make a big difference if you don't understand what you're fundamentally dealing with.

One important drawback of SQS is that it's eventually consistent, you can read the same message twice from different workers. Nevertheless we keep using it with additional checks when it's critical, it's still the cheapest solution by maintenance.

Making the processing of a message idempotent is the ideal way to handle that limitation

That's not always possible. Thus a big no-no for SQS if that's what you need. Concluding: SQS cannot replace RabbitMQ in all usecases.

If you’re message consumer isn’t idempotent then no MQ can help you. Exactly once delivery is impossible other than with at least once delivery and an idempotent consumer.


> Exactly once delivery is impossible other than with at least once delivery

Can you explain this? Don't many applications deliver once and only once via locking? It's obviously easier as an application developer to say "I will only get this once" and accept losing messages than dealing with idempotence particularly in distributed services.

Locking is no longer 100% reliable as soon as you have horizontal distribution of the same data over multiple nodes (for redundancy, so you can guarantee delivery) instead of sourcing from e.g. a monolithic rdbms. Eventual consistency is the model for a whole lot of distributed systems, e.g. S3 or Mongo. The CAP theorem applies to more than just databases, so MQs tend to use eventual consistency as well, which looks like at-least-once and not guaranteed exactly-once delivery.

While dealing with at-most-once delivery is easier in isolation as an application developer on the consumer side, dealing with lost messages is in practice MUCH MUCH harder on the producer side than idempotent handling where required on the consumer side. You end up building elaborate mechanisms for locking and retries and receipt validation which can all fail.

Just think about emails, there are a lot of situations where you can't be 100% certain whether the other side received the message or not. For something low-priority it may be better to ignore partial failures, but if it's important it may be better to send a second message to guarantee delivery. If you're modeled on never sending the same message twice AND the messages matter, you're in trouble.

What if your server dies right as it's about to release that lock and mark the job done? How will you ever know if it was completed or not?

This is what I mean.

Exactly once delivery =/= exactly once processing.

> you can read the same message twice from different workers

If you have a short visibility timeout that it appears back on the queue before you delete it... Sure.

No. By default SQS uses `at-least-once delivery` system. It's distributed and when you connect to API one SQS node that you working with could have different state comparing to another, so you may fetch the same message twice by 2 separate workers even with big visibility timeout.

Amazon's hand an answer to that with FIFO queues for the better part of years, but even without a queue based system you probably still need some notion of idempotence in the actions your system takes.

Systems "without queues" have just eschewed one queue for N queues, where N is the number of clients buffering actions for retries at any given moment.

FIFO queues should fix that issue...

FIFO (First-In-First-Out) queues are designed to enhance messaging between applications when the order of operations and events is critical, or where duplicates can't be tolerated

And it comes with the trade off of 3000 (batches) messages per second (and I bet some reduced potential availability). Sure, it’s a lot of messages for most people but non-FIFO SQS is one of the truly limitless services offered by AWS. No limits to raise, no soft maximums AWS staff ask you not to exceed. Just limitless scaling.

Implementing a queue that does exactly once delivery is a very hard problem and comes with its own trade-offs. Mostly around scalability and reliability.

After reading some comments in this thread im concerned about how many misconceptions people have about AWS services. Most of the stuff from "correcting comments" is plain available for anyone.

Anyone run a multi-tenant SaaS and handle fairness with jobs “fairly”?

Occasionally we use to have all workers tied up on a single customers long running tasks, we mitigated by using a throttler we wrote that can defer a job if too many resources are in use by the customer, but it’s not ideal.

I’d love a priority based, customer throttled (eg max concurrent tasks) queue.

We can prioritize by low/medium/high using separate queues, and could make a set of queues per customer; but that is starting to explode how many queues we have and feels unmanageable.

Using a database lets you make much better decisions on this space

Tbh purpose-built queues are taken way too eagerly by programmers who later end up needing the flexibility offered by a more general data store.

Yes. I've seen it in all kinds of teams. Anything that allows a developer to retrieve some data after a local server restart essentially gets treated by devs as a system of record, regardless of the intricacies or guarantees involved.

My personal experience is that abuse of queuing/messaging systems along this axis is rampant. Engineering leaders must keep a close eye on how these types of mechanisms are utilized to ensure things don't go off the rails.

I've seen far too many serious data loss events that boil down to "we lost our AMQP queue". It's critical that developers understand the limitations of the systems that run their code rather than just jumping aboard that "SQL is for old people" hype train.

We implemented this with additional DB checks. For example: we put only one job to queue at a time per customer, the rest are in database until one that in progress is not processed.

Actually most of the prioritization could be implemented through additional DB. With SQS in most cases you need persistent reflection of job to keep its status, process times, results. You can put to queue only few items that are highest priority to guarantee that workers are busy next 10-30 minutes.

Thanks for the suggestion, ill reflect on that; we were also considering an "overflow" queue that would receive jobs if there was a recent high insert rate of a customer's jobs to the main queue, but that didn't solve the "cost" problem of a job potentially being large.

Was hoping to avoid having "side car" infrastructure for this, but I don't think I can escape it :)

We've used SQS with great results (and reliability) for many years now, but I am interested to hear the author talking about 'replaying queues' to replicate faults. I never realised you could do this with SQS. Or can you? I thought once a queue item was processed and deleted, that was it, it was gone forever, but perhaps you can see historical queue data somewhere? (without having to store it yourself)

We actually already use the DeadLetterQueues in our service at the moment, but these are when failed SQS deliveries happen, then they get replicated to the DLQ.

I am more interested in diagnosing successful SQS deliveries after the fact, to see what the payloads were in case there was a downstream problem.

It seems that SQS deliveries that don't get a 200 response from our service go to the DLQ, but those that get a successful 200 disappear into the ether.

I think he’s saying that you can publish to a queue without any subscribers, and only play the messages for some recovery operation. It is limited as there’s a maximum life time of 14 days for SQS messages among other limits.

maybe once you've processed a message, you resend it to a different queue, just in case?

That's actually a neat idea. Since we don't have to pay for number of items in a queue, I am thinking we could create a dummy queue and just handball all incoming messages to it for safekeeping (at least for 14 days)?!?

Or send everything to SNS and have two SQS queues subscribed to the topic.

> Those messages will age out and vanish after a little while (14 days is currently the max); but before they go, they’re stored carefully and are very unlikely to go missing

can somebody expand on this? I know about the 14 days limitation but this makes it sound like you can store messages for a long time and still recover them somehow?

I think what he meant was, Once you successfully put a message to SQS, it is kept till the 'message retention period' unless you explicitly call delete on it. Right now you can configure the period to be upto 14 days. AFAIK, there is no way to recover messages older than their retention period.

You are right, thank you! I misinterpreted that.

They're saying you've very unlikely to experience data loss after a message has been saved by SQS. Presumably they have multiple layers of redundancy at varying levels in order to provide this guarantee.

> So what to do? In almost all cases it's better to propagate failure and backpressure. This gives a chance for the upstream to decide if it wants to retry or fail itself. It keeps things simple and easy-to-understand. It doesn't build up huge backlogs that become disasters.

What does this actually mean in practical terms?

https://ferd.ca/queues-don-t-fix-overload.html explains it better than pretty much anything I’ve read on the topic.

I see lots of engineers who argue that queues are bad and cause more problems than they solve. "Queues don't have backpressure" is a common complaint. It wouldn't be a difficult thing to add to a queue interface, but people don't do it despite the relative ease of the task. It is pretty easy to get a cardinality estimate on the queue size of any given queue even as it's rapidly increasing and decreasing in value.

The reason why is that it's probably a bad idea in the first place to mix your control and data interfaces into an inextricable knot. Since you're not going to be able to institute rate limiting at a sub-millisecond latency in a distributed system anyways, why isn't your control plane separate and instrumented?

Once you have a separate control plane, you can introduce back-pressure in many different ways and do so with a better understanding of the dissemination of those throttling values throughout your system.

So what I see are engineers who are actually ignoring the the architectural decision with at least as many implications as "queue vs diffuse interface," namely "control and data or just one monolithic system."

I recreated an SQS style service on AWS for sole purpose of avoiding being locked into Amazon. It used their autoscale system and was about the dumbest simplest api you can imagine, just depositing messages into a geographically replicated db table for later processing. We also used it as sort of a backup system, where the messages were never truly deleted and everything could be reprocessed as needed since the front end db that held processed records was an absolute amateur disaster just waiting for some malicious sql injection wipes. It was solid as a rock and extremely low maintenance. I think things have become relatively standardized since then (aka lots of duplicate SQS compatible api/services), so unless there was a serious requirement to stay off SQS I'm not sure I'd do it again.

With 4-5 concurrent processes reading from my (FIFO?) SQS queue, I have a hard time seeing the contents of the queue from the console. SQS really doesn't handle the concurrent readers well in my experience.

is SNS + SQS a reasonable solution for realtime irc style topic chatrooms?

There is also a size limit on the messages sent via SQS - 256Kb IIRC. Might be OK for small text based messaging, but if you want to talk attachments and other payloads then you have to hook it up to other services like S3 etc. and then it becomes complicated.

Also, as another respondent replied - there is no real deliverability guarantee, although there are certain ways to handle that within SQS.

AWS AppSync (https://aws.amazon.com/appsync/) is a better fit for the chat room use case because of server pushed events over a persistent connection (WebSockets). Launch the Chat sample in the console to try it out.

It depends, but you can safely assume neither service is intended for real-time use cases. They'll be fast, but there won't be any speed / delivery guarantees.

Message fanout isn’t really solved there. Also probably quite expensive for the task

I clicked the link. Laptop set to loud by mistake. The cat asleep on the keyboard nearly died.

Way to go Rick! You know you've made it when people write blog posts about your hot takes on Twitter

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact