
Having spent 7 years of my life working with Pat Helland implementing Exactly Once In Order (EOIO) messaging with SQL Server Service Broker[0], I can assure you that practical EOIO messaging is possible, exists, and works as advertised. Delivering data EOIO is not rocket science; TCP has been doing it for decades. Extending the TCP paradigms (basically retries and acks) to messaging is not hard if you buy into transacted, persisted storage (= a database) for keeping undelivered messages (the transmission queue) and storing received messages before application consumption (the destination queue). Just ack after you commit locally.
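A minimal sketch of the "ack after you commit locally" idea, assuming a SQLite table standing in for the destination queue and a placeholder send_ack() for the transport (the names are illustrative, not Service Broker's actual API):

    import sqlite3

    conn = sqlite3.connect("broker.db", isolation_level=None)  # autocommit mode; we issue BEGIN/COMMIT ourselves
    conn.execute("CREATE TABLE IF NOT EXISTS dest_queue (seq INTEGER PRIMARY KEY, body BLOB)")

    def send_ack(seq):
        print("ACK", seq)  # placeholder for the real transport-level ack

    def on_message(seq, body):
        # Store the message durably first; re-delivery of the same seq is harmless.
        conn.execute("BEGIN IMMEDIATE")
        try:
            conn.execute("INSERT OR IGNORE INTO dest_queue (seq, body) VALUES (?, ?)", (seq, body))
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")
            raise
        send_ack(seq)  # ack only after the local commit succeeded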

We were doing this in 2005 at 10k+ msgs/sec (1 KB payload), durable, transacted, fully encrypted, with no two-phase commit, supporting long disconnects (I know of documented cases of conversations that resumed and continued after 40+ days of partner network disconnect).

Running into resource limits (basically running out of disk space) is something the database community has known how to monitor, detect, and prevent for decades now.

I really don't get why so many articles, blogs, and comments claim this is not working, or is impossible, or is even hard. My team shipped this 12+ years ago; it is used in major deployments, the technology is proven, and little has changed in the original protocol.

[0] https://docs.microsoft.com/en-us/sql/database-engine/configu...




TCP does deliver data more than once, though. Sure, within a TCP session you are guaranteed not to get the same byte twice out of your socket: bytes are delivered exactly once, in order. But if your application that uses TCP pulls data out of the socket and then dies, the data will need to be delivered again, and the TCP protocol is unable to help us there; it's application-level logic at that point.

So everyone is talking about the same thing here: data can get delivered twice, and the application must handle it in an idempotent way. For TCP this happens to be looking at a counter and throwing away anything that's already been seen. With a single client/server connection, maintaining sequence numbers like TCP does solves the problem, but it's harder to use this same technique in the multi-client, multi-server, "stateless" requests that are common in the web world.

EDIT: To clarify, what I mean by TCP "does deliver data more than once" is that one side of the connection can send the same data twice. It's true that it's then discarded by the other end, but this is what people are talking about when they talk about the theoretical impossibility of never sending anything twice. The rest is basically the idempotency thing of ensuring that data received twice somehow doesn't cause anything abnormal.
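For a single sender/receiver pair, the TCP-style fix really is just a counter. A toy sketch of that dedup-by-sequence-number idea (illustrative only, not a real TCP stack):

    class InOrderReceiver:
        """Deliver each payload exactly once, in order, from a sender that may retransmit."""

        def __init__(self):
            self.next_expected = 0

        def on_packet(self, seq, payload, deliver):
            if seq < self.next_expected:
                return            # duplicate of something already delivered: drop it
            if seq > self.next_expected:
                return            # gap: wait for retransmission (a real stack would buffer)
            deliver(payload)      # exactly once, in order
            self.next_expected += 1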


Not when your 'socket' is a persisted, durable, transacted medium (i.e. a database). Sure, applications can 'pull data out of the socket and then die', but this is a common scenario in databases, and it is handled with transactions and post-crash recovery. The application comes back after the crash and finds the same state as before the crash (the 'socket' still has the data ready to pull off); it pulls again, processes, and then commits. This is not duplicate delivery, since we're talking about an aborted and rolled-back attempt, followed later by a successful processing. Again, databases and database apps have been dealing with this kind of problem for decades and know how to handle it.

I've been living in this problem space for many years now and have seen the wheel reinvented many times. Whenever the plumbing does not guarantee EOIO but the business demands it, it gets pushed into the app layer, where TCP (retries and acks) is reimplemented, with varying levels of success.


> The application comes back after the crash and finds the same state as before the crash (the 'socket' still has the data ready to pull off); it pulls again, processes, and then commits.

Isn't this the same app layer stuff that has to get reimplemented? I can see how this is often pushed back to a human (oops, my money transfer didn't go through, I'll have to redo it) but it's still something that has to be dealt with somewhere.


> it's still something that has to be dealt with somewhere

Database programmers have the means to deal with it off the shelf: BEGIN TRANSACTION ... COMMIT. When your queues are in the database, this becomes trivial. Even without the system I'm talking about (Service Broker), which has the queues stored in the database, most regular messaging systems do support enlisting in a distributed transaction to achieve an atomic dequeue/process sequence; it's just that many apps/deployments don't bother to do it because of the ops overhead (XA coordinator), reduced throughput, and/or simply not understanding the consequences.
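When the queue lives in the same database as the business data, the atomic dequeue/process sequence is literally one transaction. A rough sketch, with hypothetical table names (dest_queue, orders):

    import sqlite3

    conn = sqlite3.connect("app.db", isolation_level=None)  # we manage BEGIN/COMMIT explicitly
    conn.execute("CREATE TABLE IF NOT EXISTS dest_queue (seq INTEGER PRIMARY KEY, body BLOB)")
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, payload BLOB)")

    def process_one():
        """Dequeue one message and apply its business effect in a single local transaction."""
        conn.execute("BEGIN IMMEDIATE")
        try:
            row = conn.execute("SELECT seq, body FROM dest_queue ORDER BY seq LIMIT 1").fetchone()
            if row is None:
                conn.execute("ROLLBACK")
                return False
            seq, body = row
            conn.execute("INSERT INTO orders (payload) VALUES (?)", (body,))  # the business effect
            conn.execute("DELETE FROM dest_queue WHERE seq = ?", (seq,))      # the dequeue
            conn.execute("COMMIT")  # effect and dequeue become visible together, or not at all
            return True
        except Exception:
            conn.execute("ROLLBACK")
            raise

If the process dies anywhere before the COMMIT, the row is still sitting in dest_queue after recovery; reprocessing it is a rolled-back attempt, not a duplicate delivery.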

The point is that durable, persisted, transacted 'sockets' behave very differently from a TCP socket. It's a whole lot harder to simply lose a message in the app layer when interacting with a database.


But the application still has to deal with processing the same message twice if it dies before ACK.


He did say XA.

The transaction boundary defined in your consumer covers the interactions of that consumer with other XA-aware nodes that all participate in a "distributed transaction". So you can process this message N times without committing, and thus possibly tell other systems N times to perform that message's side effects, but until you commit, the world has not changed.

https://en.wikipedia.org/wiki/X/Open_XA


The ACK is also a transaction. Something like this:

1. Receive from server, commit state = NEW, UNACK

2. Send ACK to server, get confirmation from server, commit state = NEW, ACK

3. Start processing, commit state = PROC

4. Got results, commit state = FIN, UNACK

5. Send FIN to server, commit state = FIN, ACK

Each commit is a database transaction where you write the results of that step along with the new state. If anything fails along the way, the work-in-progress is discarded along with the state change. The server has an equivalent state machine, so if it gets a duplicate ACK for the same (or an earlier) state it can ignore it.

In this example, if the client crashes between steps 1 and 2, never gets confirmation in step 2, or crashes trying to commit the "NEW, ACK" state, then it will retry. The server has already committed the fact that it sent the value to the client and is awaiting an ACK. If it saw the ACK and gets a duplicate, it ignores it. If it never saw the first ACK, then it will see the second(+) attempt and commit that it saw the ACK before sending confirmation to the client.
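A sketch of that server-side behavior, assuming a hypothetical outbox table that tracks per-message state (again illustrative names, not any particular product's API):

    import sqlite3

    conn = sqlite3.connect("server.db", isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, state TEXT)")

    def send_confirmation(msg_id):
        print("CONFIRM", msg_id)  # placeholder for the real transport call

    def on_ack(msg_id):
        # Recording the ACK is idempotent: the UPDATE only changes anything the first time.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("UPDATE outbox SET state = 'ACKED' WHERE id = ? AND state = 'SENT'", (msg_id,))
        conn.execute("COMMIT")
        send_confirmation(msg_id)  # safe to resend if the client retries the ACK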


Right. The idea is that by having your database, message queue, and application all share the same transactional context, "reprocessing" the message twice doesn't matter, because the effects are only ever committed exactly once.

It's true that this doesn't work if your processing touches external systems or otherwise escapes the transaction context, but in those cases you do still get at-least-once delivery (or at-most-once, if you choose to commit the receipt before processing the message).
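The difference comes down to when you commit the receipt relative to doing the external work. A schematic sketch, with stand-in functions for the commit and the side effect:

    def commit_receipt(msg):
        print("receipt committed for", msg)     # stand-in for a database commit

    def do_side_effect(msg):
        print("external side effect for", msg)  # stand-in for calling another system

    def at_least_once(msg):
        do_side_effect(msg)   # crash before the next line => redelivery => effect may repeat
        commit_receipt(msg)

    def at_most_once(msg):
        commit_receipt(msg)   # crash after this line => message marked done => effect may be lost
        do_side_effect(msg)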

It really is a powerful technology, and when leveraged it can absolutely reduce the level of effort and cognitive burden of building correct asynchronous systems.


> Whenever the plumbing does not guarantee EOIO but the business demands it, it gets pushed into the app layer, where TCP (retries and acks) is reimplemented, with varying levels of success.

Interesting! You're saying the exact opposite of what Tyler Treat is saying:

> Even with smart middleware, problems still leak out and you have to handle them at the edge—you’re now being taxed twice. This is essentially the end-to-end argument. Push responsibility to the edges, smart endpoints, dumb pipes, etc. It’s the idea that if you need business-level guarantees, build them into the business layer because the infrastructure doesn’t care about them.

(source: http://bravenewgeek.com/smart-endpoints-dumb-pipes/)

What do you think of his argument?


Right. If your definition of "delivered" is "showed a message to the user", then exactly once is a fairly trivial goal to attain. If it's defined as "data over the wire", then it's nearly impossible to avoid some form of ACK/RETRY mechanism on real networks, which means that messages will sometimes have to be 'delivered' many times to ensure a single display to the user.


Are you talking about a distributed environment, where network partitions can occur? If so, then there's the Two Generals Problem and the "FLP result", which prove it impossible. So I guess you're talking about a non-distributed environment.

In other words, to reliably agree on system state (whether a message ID was delivered) you need the system to be Consistent. And per the CAP theorem, it cannot then be Available in the presence of Partitions.

So the other people you're referring to are probably talking about distributed systems.


Yes, I'm talking about distributed systems and I am aware of the CAP theorem. Hence my choice of the word 'practical'.

As I said, users had cases where the plumbing (the messaging system) recovered and delivered messages after 40+ days of network partitioning. Correctly written apps completed the business process associated with those messages as normal, with no special casing. Humans can identify and fix outages, and databases can easily outlast network outages (everything is durable, transacted, with HA/DR). And for many business processes it makes perfect sense to resume/continue after the outage, even if it lasted for days.


I'm not really versed in this topic, but it seems like using a database for a socket makes the system entirely centralized around that database. Is there something I'm missing?


Service Broker, at least, had the capability of (transactionally) sending messages between databases. So, if you drank the kool-aid (I did; it wasn't so bad), there needn't be "the centralized database". You can separate your databases and decompose your services, and indeed it's easier to do so correctly and with confidence because the technology eliminates a lot of hairy edge cases.


Pat had some opinions about CAP and SOA and distributed systems; see [0]. I also remember a talk given by Pat and Eric Brewer together that went deeper into the CAP ideas vis-à-vis the model Pat had been advocating (see Fiefdoms and Emissaries [1]), but I can't remember when it was or find a link for it.

[0] https://blogs.msdn.microsoft.com/pathelland/2007/05/20/soa-a...

[1] http://download.microsoft.com/documents/uk/msdn/architecture...


Exactly once delivery is theoretically impossible.

Approaching exactly-once delivery asymptotically is possible. Your parent poster's point is that this is a case where you can get so close to exactly once, in order, that in practice you never violate it for years and years.


My point is that I've seen people making decisions to go with 'best effort delivery' and live with the (costly) consequences because they read here and there that EOIO is impossible, so why bother trying.


Because idempotency can be cheap at the application layer, so why try to solve something we know can never truly be solved?


> Extending the TCP paradigms (basically retries and acks) to messaging is not hard

> Just ack after you commit locally.

wait ... what is this "retry" and "ack"? is this so that the sender knows not to send the message again? that sounds suspiciously familiar ...

I mean, come on, you're just describing a framework you wrote that handles "at least once delivery" plus "idempotent operations" for the programmer. That's fine. It's marketing to say it's "exactly once". You're just arguing over what to call it.


"At least once", "at most once" and "exactly once" are fairly well established terms in this area.

Idempotency is a restriction/requirement that you may put in place in a distributed system to make the above types of guarantees. Deduping of messages, like the article mentions, means that it is using "at least once" message delivery, with some notion of unique message IDs used to then dedup the messages.

TCP is a great analogy, where the windows of the protocol are effectively maintained by Kafka. Anyway, "exactly once" is very hard and requires putting limits into the system such that deduping or something similar is possible. But I'd agree that in all cases anything that claims "exactly once" behavior is in fact implementing that on top of protocols with "at least once" primitives.
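A sketch of that layering: deduping on top of at-least-once delivery, keyed on a unique message ID (illustrative only, not Kafka's API; the processed table and handle() are made-up names):

    import sqlite3

    conn = sqlite3.connect("consumer.db", isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS processed (msg_id TEXT PRIMARY KEY)")

    def handle(msg_id, body, process):
        """Process each unique msg_id at most once, no matter how often it is delivered."""
        conn.execute("BEGIN IMMEDIATE")
        try:
            try:
                conn.execute("INSERT INTO processed (msg_id) VALUES (?)", (msg_id,))
            except sqlite3.IntegrityError:
                conn.execute("ROLLBACK")  # duplicate delivery: already processed, drop it
                return
            process(body)                 # ideally its effects hit the same database/transaction
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")      # failed attempt: msg_id is freed up for a retry
            raise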


The point of throwing a message broker at a problem is not to get away from those things; the point is to package non-idempotent operations up into idempotent containers "on the wire", so that the at-least-once semantics of the message being brokered translate to exactly-once semantics for the message-body delivered to the consumer.




