> I LOVE this idea. I usually hear other Sr. engineers denigrate it as "hacky," but I think they aren't really looking at the big picture.
Because it is hacky from the perspective of a distributed system architecture. It's coupling 2 components that probably ought not be coupled because it's perceived as "convenient" to do so. The idea that your system's control and data planes are tightly coupled is a dangerous one if your system grows quickly. It forces a lot of complexity into areas where you'd rather not deal with it, later manifesting as technical debt as folks who need to solve the problem have to go hunting through the codebase to find where all the control and throttling decisions are being made and what they touch.
> 1. By combining services, 1 less service to manage in your stack (e.g. do your demo/local/qa envs all connect to Sqs?)
"Oh this SQS is such a pain to manage", is not something anyone who has used SQS is going to say. It's much less work than a database. It has other benefits too, like being easier for devs by being able to trivially replicate staging and production environments is a major selling point of cloud services in the first place. And unlike Postgres, you can do it from anywhere.
The decision to use the same Postgres you use for data as a queue also concentrates failure in your app and increases the blast radius of performance issues, since everything now funnels through one set of hardware interfaces. A bad day for your disk can now manifest in many more ways.
> 2. Postgres preserves your data if it goes down
Not if you're using a SKIP LOCKED queue (in the sense that it has exactly the same failure semantics as SQS, and you're more likely to lose your Postgres instance than SQS, with a whole big SRE team behind it, is to go down). Can you name a data loss scenario for SQS in the past 8 years? I can think of one that didn't hit every customer, and I'm not at AWS. Personally, I haven't lost data to it despite a fair amount of use since the time us-east-1 flooded. Certainly "reliability" isn't SQS's primary problem.
> 3. You already have the tools on each machine and everybody knows the querying language to examine the stack
SQS has a web console. You don't need to know a querying language. If you're querying off a queue, you have ceased to have an unqualified queue.
> 4. All your existing DB tools (e.g. backup solutions) automatically now cover your queue too, for free.
Given most folks who are using AWS are strongly encouraged and empowered to use RDS, this seems like a wash. But also: SHOULD you back up your queues? This seems to me like an important architectural question. If your system goes down hard, should/can it resume where it left off when restarting?
I'd argue the answer is almost always "no." But YMMV.
> 5. Performance is a non-issue for any company doing < 10m queue items a day.
You say, and then your marketing team says, "We're gonna be on Ellen next week, get ready!"
>>>> 2. Postgres preserves your data if it goes down
>>>Not if you're using a SKIP LOCKED queue,
Can you provide a reference for this? I'm not aware that using SKIP LOCKED data is treated any differently to other Postgres data in terms of durability.
It's exactly the same case as SQS: if your disk interface dies mid-write of a message, you lose it, just as you would if your connection to SQS (or SQS itself) died mid-write.
The rest of the sentence is a challenge, "show me a data loss bug in SQS that extends beyond the failure cases you've already implicitly agreed to using postgres itself."
SKIP LOCKED is a read-side feature. PG is a problem for u/andrewstuart because they have to manage it themselves (unless they're getting a PG service from a cloudy provider, say) and that's harder than it is for Amazon to manage SQS.
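For concreteness, the pattern under discussion looks something like this. A minimal sketch assuming psycopg2 and a bare `jobs` table; the table name, DSN, and `process` handler are illustrative, not from this thread. The claimed rows are ordinary WAL-logged rows, which is why durability is no different from any other Postgres data:

```python
import psycopg2

def process(job_id, payload):
    # Placeholder for real work; raising here rolls back the DELETE,
    # so the job becomes claimable by other workers again.
    print("processing", job_id, payload)

# Assumed schema: CREATE TABLE jobs (id bigserial PRIMARY KEY, payload text);
conn = psycopg2.connect("dbname=app")  # hypothetical DSN

with conn:  # commits on success, rolls back on exception
    with conn.cursor() as cur:
        # The read-side pattern: claim one row, skipping rows
        # another worker has already locked.
        cur.execute("""
            DELETE FROM jobs
            WHERE id = (
                SELECT id FROM jobs
                ORDER BY id
                FOR UPDATE SKIP LOCKED
                LIMIT 1
            )
            RETURNING id, payload
        """)
        row = cur.fetchone()
        if row:
            process(*row)
```

If the worker dies before the transaction commits, the DELETE rolls back and the row becomes claimable again, which is the same at-least-once behavior SQS gives you via visibility timeouts.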
What possibilities are there for a write to fail that the database server cannot react to? I am under the impression that a write error like this would be reported as a failure to run the statement and the application is responsible for handling that.
All of those would result in a write failure from the application's perspective, which is fine, and must be accounted for regardless (e.g. retry, two phase commit, log an error, whatever).
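As a sketch of that "account for it in the application" point, assuming psycopg2 and the same hypothetical `jobs` table: a failed write just surfaces as an exception you catch and handle, here with a bounded retry.

```python
import time
import psycopg2

def enqueue(payload, attempts=3):
    # A failed write surfaces as an exception on the statement; the caller
    # decides what to do. Here: bounded retry with backoff, reconnecting
    # each attempt since the old connection may be gone.
    for attempt in range(attempts):
        conn = None
        try:
            conn = psycopg2.connect("dbname=app")  # hypothetical DSN
            with conn.cursor() as cur:
                cur.execute("INSERT INTO jobs (payload) VALUES (%s)", (payload,))
            conn.commit()  # the message is only durable once this returns
            return
        except psycopg2.OperationalError:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)
        finally:
            if conn is not None:
                conn.close()
```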
But you have to explicitly delete the message from SQS, right? You'd only delete after confirming you processed the message, right? So if you die mid-instruction in processing a message, the message just re-appears in the SQS queue after the visibility timeout.
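The consume side makes that ordering explicit. A boto3 sketch; the queue URL and `handle` function are placeholders:

```python
import boto3

def handle(body):
    # Placeholder for real work; if this raises, delete_message never runs.
    print("handling", body)

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=1,
    WaitTimeSeconds=20,    # long polling
    VisibilityTimeout=30,  # hides (doesn't delete) the message while we work
)

for msg in resp.get("Messages", []):
    handle(msg["Body"])
    # Only after successful processing is the message removed for good;
    # crash anywhere above and it reappears once the timeout lapses.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```

That's the same at-least-once contract the SKIP LOCKED transaction gives you, just enforced by a timeout instead of a rollback.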