The only people worried about vendor lock-in are people who have practically nothing of value to be locked-in. "Lambda is horrible for lock-in, once you're hitting billions of invocations a month you're going to wish you could cost optimize by shopping around with other providers! I've got my personal website hosted there, it gets a dozen hits a month, sure glad I'm inside the free tier."
I've heard CTOs/CIOs of large companies express concern over it, but at the end of the day they'll sign that $100M contract with Amazon or Azure. It doesn't actually matter. It's a concern, but it's not going to stop the sale or development.
Today, if you're not locked in, you're leaving business value on the table. I hope that changes in the future, and maybe Kube will be the standard platform we've needed to push it forward (for example, I wish I could tell kube "give me a queue with FIFO and exactly once delivery", it knows what cloud provider you're on, if you're on AWS it provisions an SQS queue, if you're on GCloud it errors because they don't have one of those yet, and in either case I communicate with it from the app using a standard Kube API, not the aws-sdk).
But for now, lean in. Don't fight the lock-in; every minute you spend fighting it is a minute that you should be spending fighting your competition.
Vendor lock-in isn't something anyone, especially at C-level, should be ignoring. Yes, there are things to be gained and money to be made, but one should always have some sort of alternative, just in case. I happen to be in a business which 100% relies on one vendor, which keeps hiking up prices every year way above inflation.
The flip side of vendor lock-in — at the megacorp level, anyway — is the “one throat to choke” principle; sure, it might be hard to migrate off AWS/Azure/IBM, but (a) your technology investments have a timespan measured in decade-long iterations; and (b) what you really want is to ensure reliable operations for your enterprise. Sure, you may be stuck with cloud vendor x for the next ten years, but at least you know who to scream at if the infra goes south — preferably someone who can mobilize enough engineers to get you back up and running right now.
Having said that, if you’re a small company where cloud budgets are a significant portion of your overall spend, or a tech company that has a defensive need to be able to migrate providers, YM will definitely V.
Is "migrating from Oracle" the Enterprise equivalent of nuclear fusion? We're always 20 years away from production nuclear fusion, Enterprises are always a few quarters away from migrating out of Oracle %list_your_appliance_here%?
At the end of the day they need to increase revenue by adding customers or by increasing revenue per customer. Once the market is saturated they either buy competitors or they raise prices or try and sell you additional services...
Well, not quite. What SQS actually offers is: exactly once delivery, except when the message is delivered multiple times.
Quoting from the SQS docs [1]:
> The visibility timeout begins when Amazon SQS returns a message. During this time, the consumer processes and deletes the message. However, if the consumer fails before deleting the message and your system doesn't call the DeleteMessage action for that message before the visibility timeout expires, the message becomes visible to other consumers and the message is received again. If a message must be received only once, your consumer should delete it within the duration of the visibility timeout.
If a consumer receives a message and then loses its network connection to SQS, the visibility timeout will expire before it can delete the message and it will be delivered a second time to another consumer.
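To make that failure window concrete, here's roughly what the receive/process/delete cycle looks like with boto3 (the queue URL and process() are placeholders, error handling is elided):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

def process(body: str) -> None:  # placeholder for real work
    print(body)

resp = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=1,
    VisibilityTimeout=30,   # message is hidden from other consumers for 30s
    WaitTimeSeconds=10,     # long polling
)

for msg in resp.get("Messages", []):
    # If processing outlasts the visibility timeout, or the connection to SQS
    # drops before delete_message succeeds, the message becomes visible again
    # and will be received by another consumer.
    process(msg["Body"])
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```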
I would very much suggest reading through the linked article by GP and maybe more about the Byzantine Generals Problem, which establishes why exactly once delivery isn't just hard, but impossible. And Amazon's marketing department can't change that.
Or you could post the quote from the link I posted.
You can now use Amazon Simple Queue Service (SQS) for applications that require messages to be processed in a strict sequence and exactly once using First-in, First-out (FIFO) queues. FIFO queues are designed to ensure that the order in which messages are sent and received is strictly preserved and that each message is processed exactly once
> However, if the consumer fails before deleting the message and your system doesn't call the DeleteMessage action for that message before the visibility timeout expires, the message becomes visible to other consumers and the message is received again.
That's the part you fail to understand.
The phrase "received again" means that the message is delivered more than once. More than once is more than exactly once.
The quote you are referencing is marketing nonsense.
If you want the message to be received only once regardless of whether it was successfully processed, you set maxReceiveCount to 1, set up a dead letter queue as part of the redrive policy, and use FIFO queues instead of the standard queue.
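Roughly, that setup looks like this with boto3 (queue names are made up): a FIFO queue whose redrive policy moves a message to the DLQ after a single receive.

```python
import json
import boto3

sqs = boto3.client("sqs")

# The DLQ for a FIFO queue must itself be FIFO.
dlq_url = sqs.create_queue(
    QueueName="orders-dlq.fifo",
    Attributes={"FifoQueue": "true"},
)["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

sqs.create_queue(
    QueueName="orders.fifo",
    Attributes={
        "FifoQueue": "true",
        "ContentBasedDeduplication": "true",
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": "1",  # a second receive attempt redrives to the DLQ
        }),
    },
)
```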
It isn’t “marketing nonsense”. It’s more just quoting from one site on the Internet, instead of understanding the system as a whole.
It does? How? Can you think of any conditions that might make that not true?
Like you said several times, knowing what’s going on takes more than just reading several AWS documentation pages.
And to boot, that stretches the definition of delivery. You’d hardly consider a letter delivered if the postman left it outside the postbox and it blew away.
2. Message failed to process successfully -> process caught error -> visibility timeout manually set to 0 by process -> message ends up in DLQ -> exactly once delivery (see the sketch after this list)
3. Message processed successfully -> client failed to delete message -> message times out automatically -> message delivered to DLQ -> exactly once delivery
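A sketch of what scenario 2 looks like on the consumer side (process() is a placeholder): catch the error and zero the visibility timeout so SQS can redrive the message to the DLQ instead of retrying it.

```python
import boto3

sqs = boto3.client("sqs")

def process(body: str) -> None:  # placeholder for real work
    ...

def handle(queue_url: str, msg: dict) -> None:
    try:
        process(msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    except Exception:
        # Make the message visible again immediately; with maxReceiveCount=1 in the
        # redrive policy, the next receive attempt sends it to the DLQ.
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=msg["ReceiptHandle"],
            VisibilityTimeout=0,
        )
```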
> And to boot, that stretches the definition of delivery. You’d hardly consider a letter delivered if the postman left it outside the postbox and it blew away.
The message in the failure case didn’t “blow away”. AWS guaranteed exactly once delivery and it put the messages that failed in a dead letter queue. It’s up to the business to decide remediation steps in failure cases. You can even set up an alert when the DLQ is not empty.
4. getMessage called; message sent; consumer experiences critical hardware failure before it gets through the message; visibility timeout expires; message goes to DLQ. In this case, the message is never delivered.
If the API was called by the consumer and the connection was alive long enough for the request to be sent without the connection closing, the message was sent to the client. If the client failed halfway through receiving the message, the server would have received a connection closed event. That’s basic TCP/IP.
The point that a fair few people here have been trying to make is that things are not as simple as you are quoting from the documentation. There are a variety of networking, hardware and software issues that can conspire to mean that it’s impossible to guarantee exactly once delivery in general. Every networking protocol is built precisely around this integral fact. Just because it’s at a higher level doesn’t mean the underlying issues are not still there.
Heck, we cannot even guarantee it in real life.
Granted, AWS likely has a pretty specialized networking stack and I doubt network partitions are the norm, but that doesn’t really matter.
For some degree of once, and some degree of exactly, it’s possible. Which is what is happening here.
But you have failed to come up with a scenario outside of the three that I named, and the fourth one which you posted is a type of one of those - the message was received and not processed - i.e. the consumer crashed after, from SQS’s perspective, the message was received. AWS never guaranteed that your process would work correctly.
It guarantees that if you as a producer resend a message with the same deduplication ID or identical content within five minutes, it won’t enqueue the message more than once.
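Concretely, that's the MessageDeduplicationId (or content-based dedup) on a FIFO queue; a rough boto3 sketch with a made-up queue URL:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"  # placeholder

for _ in range(2):  # the second send is deduplicated, not enqueued again
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody='{"order_id": 42}',
        MessageGroupId="orders",                    # required on FIFO queues
        MessageDeduplicationId="order-42-created",  # same ID within 5 minutes -> one message
    )
```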
It also guarantees that if your process acknowledged receipt of a message and you set up a DLQ, it won’t send the same message more than once.
No, AWS can’t promise that your consumer is up and running and polling for a message. The best they can do is let you set up an alert to tell you that a message has been waiting for you to process it for more than “approximately” x seconds.
> But you have failed to come up with a scenario outside of the three that I named, and the fourth one which you posted is a type of one of those - the message was received and not processed - i.e. the consumer crashed after, from SQS’s perspective, the message was received. AWS never guaranteed that your process would work correctly.
What it sounds like you are saying is that as soon as SQS sends the message to its consumer, then SQS can consider that a successful delivery since it isn't responsible for anything that happens next. The problem is, while such a metric would support the AWS marketing, it's not really useful in the real world, nor is it the traditional definition of what "exactly once delivery" means. In order to be "exactly once delivery", it needs to guarantee that a message actually makes it to a consumer, given an unreliable communication channel. The consumer is an important part of the system.
I think my example of a server failing may have confused things. So, let's limit this to just network failures. Also, let's look at TCP from the packet level. Once you get down to this level, TCP really starts to look a lot like UDP. Reading https://en.wikipedia.org/wiki/Two_Generals%27_Problem is handy - it's a thought experiment, but what it proves is that it's not possible to perfectly coordinate state across an unreliable network. Why that is important is because it means that once you send a packet of information, it's not possible to always know if it was received by the other end - 1) it might have been lost in transit; 2) it might have made it, but the receiver sent an acknowledgment message that got lost; 3) it made it, but the acknowledgment message has been delayed and hasn't been received yet. What's interesting about the Two Generals Problem is that it shows that the best we can do is to reduce the chance of lost messages interfering in communication, but we can't eliminate it.
So, we have a few scenarios we can work through, packet-by-packet:
1. If everything goes right:
a) consumer sends a getMessage packet
b) SQS receives a getMessage packet
c) SQS sends a Message packet
d) consumer receives a Message packet
e) consumer processes the message
f) consumer sends a deleteMessage packet
g) SQS receives a deleteMessage packet and deletes the message
However, the network is unreliable. So, at any point, SQS and the consumer can stop being able to communicate for arbitrarily long chunks of time. So, we can end up with something like:
2. With a DLQ setup, if the network drops at the right time, we lose a message:
a) consumer sends a getMessage packet
b) SQS receives a getMessage packet
c) SQS sends a Message packet
d) network drops for the duration of the visibility period
e1) SQS sends the message to the DLQ
e2) consumer retries getMessage call - gets a different message
3. Without a DLQ setup, if the network drops at the right time, we double-deliver a message:
a) consumer sends a getMessage packet
b) SQS receives a getMessage packet
c) SQS sends a Message packet
d) consumer receives a Message packet
e) consumer processes the message
f) network drops for the duration of the visibility period
g) SQS sees that the message hasn't been deleted within the visibility period; message becomes visible again
h) another consumer sends a getMessage packet
i) SQS sends a Message packet with the same message as before
You might say something along the lines of: Let's add some extra acknowledgment packets in. But this is where the Two Generals problem rears its ugly head - it doesn't matter how many acknowledgments we add in, we're still faced with fundamental limitations of agreeing on state over an unreliable channel. And if you extend that, it implies that it's not possible for both ends of a TCP connection to agree on whether a connection ended cleanly or ended with a timeout - and the distinction is important if, as you've suggested, you want to use TCP connection failures as part of determining whether a message was delivered.
I've never used SQS. It's probably fine. Maybe it's great, I dunno. Honestly, the features it has sound pretty useful for guaranteeing "usually once delivery". But they can't guarantee "exactly once delivery" - doing that isn't just hard, it's impossible, it's been proved impossible, and marketing can't change that.
Ah yes, that age old showdown: the fundamentals of distributed systems vs a shiny marketing page.
It doesn’t make what they say any more true. SQS is at least once delivery, and the fact you think otherwise is down to great marketing and maybe a bit of Dunning–Kruger effect.
SQS is only meant for multiple publishers and a single subscriber or set of subscribers that do the same thing. A message in an SQS queue is either delivered or in the queue. SQS operates on a strictly polling model.
Meaning you’re not going to put a message directly in an SQS queue and have it processed by multiple types of subscribers.
Are you expecting some generic system to make sure your subscribers are up and running and polling?
Better question is, do you know what SQS is and have you ever done anything with it?
Exactly once semantics: even if a producer retries sending a message, it leads to the message being delivered exactly once to the end consumer.
This is done with AWS FIFO queues by:
Unlike standard queues, FIFO queues don't introduce duplicate messages. FIFO queues help you avoid sending duplicates to a queue. If you retry the SendMessage action within the 5-minute deduplication interval, Amazon SQS doesn't introduce any duplicates into the queue.
So exactly once is handled by the producer, the queueing service, and the subscriber.
> If you retry the SendMessage action within the 5-minute deduplication interval, Amazon SQS doesn't introduce any duplicates into the queue.
But what if you try to send and at just about the same time the network fails for 6 minutes? The producer then has the option to try to send again, which may produce a duplicate, or to assume the first send worked, which may mean the message is lost.
In a system that supported true "exactly once" guarantees, the producer wouldn't have to pick between a duplicate message or a lost message, regardless of the length of network outage.
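In code form, the producer's dilemma looks something like this (the helper names are illustrative, and real retry logic would be more careful about which errors it catches):

```python
import boto3
import botocore.exceptions

sqs = boto3.client("sqs")

def send_order(queue_url: str, body: str, dedup_id: str) -> None:
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        MessageGroupId="orders",
        MessageDeduplicationId=dedup_id,
    )

def send_with_retry(queue_url: str, body: str, dedup_id: str) -> None:
    try:
        send_order(queue_url, body, dedup_id)
    except botocore.exceptions.EndpointConnectionError:
        # Did the first request make it before the network died? We can't know.
        # If the outage outlasts the 5-minute dedup window, retrying may enqueue
        # a duplicate; not retrying may lose the message. There's no third option.
        send_order(queue_url, body, dedup_id)
```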
I think that's the crux of your misunderstanding here. There are networking issues where one side thinks a request has been sent or received successfully while the other side does not.
In this specific case, if a networking issue prevents the final ACK from being received by the server then the client assumes the connection is closed and the message has been delivered, while the broker will wait for the connection to time out, which I assume is a failure condition.
There are other cases where the client can receive a complete message but the server is unsure that it has, and will time out the connection while the client continues processing, assuming everything is fine.
You cannot build something on top of a networking stack that does not guarantee exactly-once semantics at any level (TCP, IP, HTTP, physical) and expect to get exactly once. You can mitigate it, sure, so maybe Amazon is super-dooper sure that these networking conditions won't ever happen.
Apologies, I misspoke. SQS does have a FIFO+Exactly Once Processing mode, so it will never introduce duplicate messages into the queue. It of course doesn't have exactly once delivery without additional state tracking on the consumer outside of the queue.
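That consumer-side state tracking is usually just "remember which message IDs you've already handled and make processing idempotent"; a toy sketch (a real system would use a shared, durable store rather than an in-memory set, and process() is a placeholder):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"  # placeholder
seen: set[str] = set()  # stand-in for a durable dedup store

def process(body: str) -> None:  # placeholder, ideally idempotent anyway
    print(body)

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        if msg["MessageId"] not in seen:  # drop redeliveries on the consumer side
            process(msg["Body"])
            seen.add(msg["MessageId"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```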
I actually am a fan of the AWS CDK for this reason, and its concept of "constructs".
My big application has a dozen packages, one of which defines generic cloud constructs that host and deploy the other packages as a distributed service.
If we wanted to switch clouds, we'd just have to re-implement the constructs we use for a new cloud; generic things like "ingress data pipeline", which contain dozens of AWS resources each, but have a semantic grouping and defined permissions boundaries.
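A toy version of one of those constructs, CDK v2 Python flavour (the names are invented; a real "ingress data pipeline" construct wraps many more resources):

```python
from constructs import Construct
from aws_cdk import Duration, aws_lambda as lambda_, aws_sqs as sqs

class IngressDataPipeline(Construct):
    """Semantic grouping: a FIFO queue plus the function that drains it."""

    def __init__(self, scope: Construct, construct_id: str, *, handler_asset: str) -> None:
        super().__init__(scope, construct_id)

        self.queue = sqs.Queue(
            self, "Queue",
            fifo=True,
            content_based_deduplication=True,
            visibility_timeout=Duration.seconds(60),
        )
        self.handler = lambda_.Function(
            self, "Handler",
            runtime=lambda_.Runtime.PYTHON_3_11,
            handler="index.handler",
            code=lambda_.Code.from_asset(handler_asset),
        )
        # The permissions boundary lives with the construct, not scattered per stack.
        self.queue.grant_consume_messages(self.handler)
```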