There's no such thing as finite-time arbitration for multiprocessor requests to memory, but it works anyway. This is similar. You can't guarantee time-bounded exactly once delivery, but after enough interchanges the probability of being wrong approaches zero.
This shows up explicitly in blockchain systems, where confidence improves as the number of confirmation cycles increases.
This is the first time I'm hearing about the topic of arbitrating multiprocessor access to memory. Isn't there a trivial counterexample where the bandwidth is divided by the number of processors and each processor gets a timeslot?
Processors, despite being a distributed system (well, really every ASIC is), don't typically suffer from your classical distributed-systems-style issues because they have no failures, or rather they only have certain catastrophic failure modes that are low-probability and well controlled for. Your typical distributed system, by contrast, has constant failures.
It's more to do with the underlying electronics: digital is an abstraction, everything is analog underneath, and logic gates and registers are actually just very non-linear analog circuits. What this means is that it's always in principle possible for an asynchronous edge (i.e. one that is not aligned to the other edges in the system) to be arbitrarily non-determinate: never actually being decided as a 1 or a 0 as it propagates through the digital circuit. This isn't an entirely theoretical issue, either: it can and will cause very strange behaviour in a digital system because a marginal or metastable value can be interpreted differently by different parts of the circuit that should be seeing the same value. It's possible in practice to push the chance of this happening extremely low, but it comes at the cost of latency: you need to spend more time to allow the circuit to actually settle into one state or the other.
Within a fully synchronous system, it's possible to avoid this, though. But most multi-processor systems are not synchronous.
> But most multi-processor systems are not synchronous.
Is this actually true? A modern AMD or Intel CPU doesn't have one quartz crystal per core, so therefore all per-core clocks have to be derived from one clock source, meaning the relationship between all the clock frequencies and phases is deterministic, even if not specced in the datasheet.
Multi-core isn't the same as multi-processor, and in practice while they are derived from the same reference, dynamic per-core clock scaling makes it pretty asynchronous, to say nothing of the I/O paths between the many independently clocked parts of a modern computer.
Are there many (or even any) multi-drop buses in modern computers? I get the impression that these days almost everything is point-to-point links. But I know less about what’s happening on silicon. I have seen talk of crossbar interconnects on microcontrollers, which (aiui) are more active, unlike passive buses?
I2C buses still tend to turn up in modern computers for running various low level components. Stuff like allowing the CPU to talk to power management systems, controlling fan speeds, RGB lights etc
I don’t think there are any high-bandwidth multi-drop buses anymore. Electrically, multi-drop buses make it hard to maintain the signal integrity needed for high-speed links.
Memory controllers with more than one port have an internal arbiter. The ports are connected to various CPUs, GPUs, and I/O devices. Something has to sequence requests to the memory interface. Memory controllers used to be separate chips, but since about 2008 they're usually on the CPU die.
Memory arbiters are where atomic operations on memory are actually implemented.
Modern processors have packet-based Network-on-Chip interconnects with packet routing (the packets are nowhere near as complex as Ethernet/IP/TCP).
Processors suffer from the classical distributed-system-style issues just the same. The difference is that, as you said, they tend to be more reliable and the configuration doesn't change at runtime; another key aspect is that a lot of latency tradeoffs aren't as lethal, because of the short distances within a chip.
Accessing data that is kept in the cache of another processor only costs you 100ns at most and a lot of that overhead comes from time spent on the cache coherency protocol. Modern directory based cache coherency protocols are complicated beasts. Each processor core needs to be aware which processors have a copy of the cache line it is about to access, so that it can request exclusive write access to that cache line and mark every single one of those copies dirty upon modification.
The way the distributed system problems manifest themselves on the higher levels is that cache coherency by itself doesn't get you correct software automatically. You still have to use synchronization primitives or atomics on the software level, because your registers are essentially stale read replicas. Compare and swap on the processor level is no different than compare and swap on mongodb.
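To make the analogy concrete, here's a minimal sketch of the compare-and-swap retry loop that shows up identically at both levels. The TinyStore class is a made-up in-memory stand-in for a cache line or a document store, not any real API:

import threading

class TinyStore:
    # Hypothetical key/value store exposing a CAS primitive, standing in
    # for a CPU cache line or a MongoDB document with optimistic updates.
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def compare_and_swap(self, key, expected, new):
        # Write `new` only if the current value is still `expected`; report success.
        with self._lock:
            if self._data.get(key) == expected:
                self._data[key] = new
                return True
            return False

def increment(store, key):
    # The classic CAS loop: read, compute, attempt, retry if someone raced us.
    while True:
        current = store.get(key)
        new = (current or 0) + 1
        if store.compare_and_swap(key, current, new):
            return new

The point is that the structure of the loop, and the reason the retry is needed (your local read may already be stale), is the same whether the contended state lives in another core's cache or in a remote database.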
Easy exactly-once delivery: Send a transaction ID with the transaction using an at-least-once system. The receiver ignores duplicate ID transactions. The sender can guarantee never sending the same ID twice.
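As a rough sketch of that scheme (names invented; in a real system the set of seen IDs has to be durable and eventually pruned):

import uuid

class Receiver:
    def __init__(self):
        self.seen_ids = set()   # would need to be durable storage in practice

    def deliver(self, tx_id, payload):
        # At-least-once transport on the outside, deduplication on the inside.
        if tx_id in self.seen_ids:
            return "duplicate ignored"
        self.seen_ids.add(tx_id)
        # ... act on payload ...
        return "processed"

receiver = Receiver()
tx_id = str(uuid.uuid4())
print(receiver.deliver(tx_id, "hello"))   # processed
print(receiver.deliver(tx_id, "hello"))   # duplicate ignored (a resend)

Note the crack the rest of the thread is about: if the process dies between recording the ID and acting on the payload, the recording and the action had better happen in one transaction, or you're back to at-most-once for the action itself.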
That's what the article calls exactly once processing.
An analogy would be pizza delivery. You wouldn't call someone who delivered 100 identical pizzas to you even though you only ate one of them "exactly once delivery".
Agreed, if I had to be the one who dealt with the extra pizzas. But if there was someone standing outside in my yard in order to discard the 99 duplicate pizzas before they got to my door (a travesty in and of itself, but that's another discussion), I would think of the system as a whole as "exactly once delivery".
And, analogously, that does seem to be what Sequin's system provides.
If someone named Bob was standing outside your yard discarding the 99 pizzas, then Bob would not be able to guarantee that you actually get exactly one pizza. It's possible that the one pizza Bob tries to give you gets rained on before Bob can actually hand it to you.
Only you yourself can actually implement a means of discarding extraneous pizzas; any middleman introduced into the chain just adds another layer of indirection that is itself susceptible to failure.
No, Bob standing outside your yard is in principle closer to you - as in, holding your hand - and can refuse to release the pizza until you've actually got hold of it.
Of course the rain may flip bits in any system, but that would amount to destruction of you. You won't worry about pizza then, but your clone will spring up and get the pizza from Bob. And Bob himself is similarly protected from single-point-of-failure problems.
Of course the whole Bob subsystem may be destroyed, but that's already beyond the threshold of the problem we choose. Otherwise we'd have the external system to check if we'd actually got the pizza, and resend it again.
EDIT: that's probably the definition of "only you" - the Bob being that close. Agree then.
Wrap that up and abstract it as a transport system. The users don't need to know the inner workings, they just need a transport API that guarantees exactly-once.
The point is you can't wrap that up, because the faults can happen at the edges, ie where the systems interact.
The users still need to interact with this wrapped system, and it's those interactions that can fail, thus preventing the overall exactly-once delivery.
To flip it around, if this wasn't an issue then what you're suggesting already exists in the form of ACID databases.
> That's why we don't say Sequin offers exactly-once delivery anywhere. We say it offers at-least-once delivery and exactly-once processing.
The author should note that the homepage at https://sequinstream.com/ does in fact claim that Sequin supports "Exactly-once delivery" under the "Key features" heading.
Out of order does not cause a reset. Some networks with unstable topologies like wireless meshes frequently cause reordering and it doesn’t destroy TCP streams. It usually just results in inefficient retransmits when the receiver registers it as a drop.
Part of the reason I find distributed and concurrent computing so fascinating is that we lose so many assumptions about how computers are supposed to work; ordering guarantees, data actually arriving to destinations, fairness, etc.
While it is of course possible for data to get lost or mangled in a single-threaded, non-distributed application, it's rare enough that you often don't need to plan much around it. Once you add concurrency, and especially if you make it distributed, you suddenly have to plan around the eventuality of messages not arriving, or arriving in the wrong order, or arriving multiple times.
Personally, I've gotten into the habit of PlusCal and TLA+ for everything, and modeling in the assumption that stuff might never arrive or arrive multiple times, which can work as a very good sanity check (though it's of course not a silver bullet).
The "exactly once delivery is impossible" narrative got so strong that people feel that they have to redefine the word 'delivery' to make it fit the premise.
If a message can be guaranteed to be received one or more times but results in a single entry being inserted into a database (e.g. in an idempotent way using UUIDs to enforce uniqueness), then clearly, a consumer reading the records from that database would only receive the message exactly once. If you consider the intended recipient to be the process, not the host, then the definition of 'delivery' allows you to claim that exactly-once delivery is possible.
Of course this is a pragmatic, not mathematical argument. In theory you can't mathematically guarantee even at least once delivery because you can't guarantee that there's not going to be a Carrington event which will bring down the internet and drive humanity to extinction; thus preventing the consumer server from being rebooted.
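A minimal sketch of the UUID-keyed idempotent insert described above, using SQLite purely for illustration (table and column names invented):

import sqlite3, uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id TEXT PRIMARY KEY, body TEXT)")

def record(msg_id, body):
    # The transport is at-least-once, but the unique primary key collapses
    # any redelivery into a single row.
    conn.execute("INSERT OR IGNORE INTO messages (id, body) VALUES (?, ?)",
                 (msg_id, body))
    conn.commit()

msg_id = str(uuid.uuid4())
record(msg_id, "hello")
record(msg_id, "hello")   # duplicate delivery, silently ignored
assert conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0] == 1

A consumer reading rows out of that table then sees each message once, which is the "delivery to the process, not the host" framing above.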
In a sense it's just kicking the can down the road: how do you ensure the consumer reads the message only once? It may fail at some point after it has read the message but before it has updated its state to reflect that.
So it's not just a pointless argument about the semantics of the term "delivery": the fact that no communication channel can have exactly once delivery means that these systems are much more difficult to implement. For example, you can't just chain together 2 "exactly once" delivery systems into some longer one with a stateless middle node: instead you need some concept of idempotency that spans both systems or the middle node itself needs to de-duplicate (statefully) the first link and forward messages on to the second link.
Yes, idempotency is critical to achieve this, but my point is that a message which can be recorded in an idempotent way can be delivered exactly once. Yes, it is tricky, but it is possible.
Same with two-phase commit, it can be implemented perfectly given certain reasonable constraints and assumptions. The consumer process could crash, then on restart, it would use timestamps to check which record it processed last and resume processing starting with the next unprocessed one. If the consumer uses input messages to produce output messages somewhere else, the consumer could also itself perform a two-phase commit on its own outputs to account for the possibility of itself crashing part-way through processing an input message. It can coordinate both inputs and outputs perfectly.
If we define 'processed' as fully committed, then messages can be processed exactly once. If a message can be processed once, it can be delivered exactly once if you accept the simplest definition of that word.
What is often suggested when people say "exactly once delivery is impossible" is that duplicates cannot be avoided and this is usually a cop out.
> If we define 'processed' as fully committed, then messages can be processed exactly once. If a message can be processed once, it can be delivered exactly once if you accept the simplest definition of that word.
I don't see how. Let's take a simple example: the receiving end of a messaging system, where the messaging system can do whatever it wants to enforce idempotency or whatever semantics it wants, and calls, in-process, some `handler()` method of the application.
The useful processing happens somewhere in `handler()`. No matter what point you identify as the point where processing happens inside handler() it won't have delivered/processed-once semantics: it will potentially be called multiple times. The fact that the messaging system will internally have a commit step after handler() completes which is idempotent is irrelevant: as a user of the system you care how many times handler() is called.
If the sender is writing a message with a UUID and/or a unique incrementing sequence order (to maintain the order) to some kind of remote database using the UUID as the ID; they would be written to the DB in-order and unique. Then the receiver would receive the messages by reading them from the database sorted by their sequence number. The receiver would mark off messages in the DB as processed whenever it receives them. If it fails, it resumes from the last unprocessed one based on sequence number.
If we have a chain of senders and receivers and absolutely want to avoid double processing, instead of just maintaining idempotence down the line as described above, it is also technically possible (though I guess not worth the complexity): the receiver would add an 'about_to_process' flag to each record just before it starts processing it, then change the flag to 'committed' once it finishes. If the receiver crashes before finishing a message, on restart it would see the 'about_to_process' flag, ask the next process/receiver in the chain whether it already received a packet with that UUID, and only send the output for that message if it has not.
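For the first half of that scheme, here's a rough sketch (SQLite standing in for the shared database; names invented):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE messages (
    id   TEXT PRIMARY KEY,        -- sender-assigned UUID
    seq  INTEGER UNIQUE,          -- sender-assigned sequence number
    body TEXT,
    processed INTEGER DEFAULT 0)""")

def consume_pending(handle):
    # Resume from the lowest unprocessed sequence number; safe to rerun after a crash.
    rows = db.execute(
        "SELECT id, body FROM messages WHERE processed = 0 ORDER BY seq").fetchall()
    for msg_id, body in rows:
        handle(body)
        db.execute("UPDATE messages SET processed = 1 WHERE id = ?", (msg_id,))
        db.commit()

The gap is the one discussed above: a crash between handle() and the UPDATE means the message is handled again on restart, which is exactly what the 'about_to_process' refinement (or idempotence further down the line) is trying to paper over.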
What about other messages though? The question isn't usually limited to one message in isolation, it's related to the state of the system as a whole. So sure, trivially, setting one bit where that bit is never cleared can only be done once, so any process setting that bit is also trivially (and not very usefully?) "exactly once delivery".
On the contrary, that is an extremely useful process, and is the basis for vast swathes of multiprocessing algorithms (such as "has this thing finished initializing", or "has it been freed"). However, it is "at most once", not "exactly once". The bit may never be set.
This feels a bit like those relativity puzzles, for example the barn-pole paradox[1].
That is, delivered exactly-once seems to be a relative statement where one party can state they did deliver exactly-once while another party can disagree, and they're both correct relative to their local frames of reference.
So, like how GR has no global conservation of energy, just local, could one say that distributed systems have no global conservation of messages, only local?
It's admirable when vendors don't try to abuse theory to make their product look better.
There was some blog post last week that supposedly proved the impossibility of exactly once delivery wrong, but that was quite expectedly another attempt to rename the standard terms.
if you employ persistence to keep sequence numbers, and/or retransmissions and redundancy, you can drive the likelihood of distributed consensus over time arbitrarily close to probabilities where it doesn't matter anyways (like the earth suddenly being coincident with the sun).
there is a group of people that say that there is no such thing as exactly-once-delivery and they are correct. There is also a group of people who cannot hear nor read nor process the word "delivery" in that term and instead insist that exactly-once-processing is easily achievable via [deduplication, idempotent actions, counters, ...]. They are correct as long as they acknowledge "processing" vs "delivery" -- and there are several who refuse to acknowledge that and don't believe the distinction matters and they are wrong.
Not that I disagree with getting terms exactly right, but if the "processing" that happens exactly once in your messaging system is firing a "delivery" event to a higher level of the stack (over a local channel), that's observationally indistinguishable from exactly-once delivery to that higher level, isn't it? What am I missing?
Over a local channel? Like same machine or even same process? I mean, yeah, you might have a bug that messes up the local "processing", but that's not a deep truth about distributed systems anymore, it's just a normal bug that you can and should fix.
In the systems I'm used to, that "passing up" is a function call within the same codebase. There isn't a second delivery system for a message to be lost/duplicated, though sometimes they obscure it by making it look more like such a system.
Power gets cut. Now what? A process can fail between two function calls. When your power is restored, what happens with that mid-processing request? Lost or retried?
The problems always stem from the side effects. You can achieve exactly once like properties in a contained system.
But if the processing contains for example sending an email then you need to apply said actions for all downstream actions, which is where exactly once as seen by the outside viewer gets hard.
Out of curiosity, how would you describe TCP in these terms? Does the TCP stack's handling of sequence numbers constitute processing (on the client and server both I assume)? Which part(s) of a TCP connection could be described as delivery?
From TCP's perspective it has delivered the data when read() returns successfully. This is also the point at which TCP frees the packet buffers. From the app's perspective, TCP is an at-most-once system because data can be lost in corner cases caused by failures.
(Of course plenty of people on the Internet use different definitions to arrive at the opposite conclusion but they are all wrong and I am right.)
I agree this is what you'd consider delivery. Also agree everyone else is wrong and I am right ;)
The similar question in TCP is what happens when the sender writes to their socket and then loses the connection. At this point the sender doesn't know whether the receiver has the data or not. Both are possible. For the sender to recover it needs to re-connect and re-send the data. Thus the data is potentially delivered more than once and the receiver can use various strategies to deal with that.
But sure, within the boundary of a single TCP connection data (by definition) is never delivered twice.
is there a real difference between delivery and processing?
In a streaming app framework (like Flink or Beam, or Kafka Streams), I can get an experience of exactly once "processing" by reverting to checkpoints on failure and re-processing.
But what's the difference between doing this within the internal stores of a streaming app or coordinating a similar checkpointing/recovery/"exactly-once" mechanism with an external "delivery destination"?
A typical delivery target is a data store or cache. Writing to such a destination can be construed as a “side effect”. But if you accept eventual consistency, achieving exactly-once semantics is possible.
Eg a streaming framework will let you tally an exactly once counter. You can flush the value of that tally to a data store. External observers will see an eventually consistent exactly-once-delivery result.
> You can flush the value of that tally to a data store
What if you can't flush? You can't guarantee a flush: the sender or receiver can get powered off or otherwise get stuck for an indeterminate amount of time. That flush is the delivery. The data, is it stored in memory? Then it was lost when the process was killed.
In streaming apps, the data is not only stored in memory, and failures are inevitable.
A common streaming app will consume from a distributed log (Kafka, Kinesis) and only commit offsets if/when the data from the last offset to the next committed offset is fully processed in exactly-once semantics.
When delivering to the destination, the same story can hold. If the delivery destination returns a failure, it may or may not have committed the write, but the streaming app will either ensure that the correct (exactly-once) value is eventually flushed, or retry computing the tally.
This will give you eventually consistent exactly-once semantics unless you allow the destination system (or any system) to remain in outage forever. But then of course if that's true, then you won't get any semantics, exactly-once, at-least-once, or otherwise.
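A sketch of the commit-only-after-processing loop that underlies this, written against the kafka-python client (broker address, topic, and the process() body are placeholders):

from kafka import KafkaConsumer   # assumes the kafka-python package

running_tally = 0

def process(value):
    # Stand-in for the real work: update the tally / flush to the data store.
    global running_tally
    running_tally += len(value)

consumer = KafkaConsumer(
    "events",                        # placeholder topic
    bootstrap_servers="localhost:9092",
    group_id="tally-app",
    enable_auto_commit=False,        # offsets advance only when we say so
)

for record in consumer:
    process(record.value)
    consumer.commit()                # commit the offset only after processing succeeded

A crash between process() and commit() means the records since the last committed offset are redelivered, so the processing itself still has to be idempotent or transactional - which is the point the next comment makes about offsets.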
and the only reason why that all works (Kafka, Kinesis, etc.) is _because_ they take care of things like "what if the power goes out before we can flush?" And the reason why they have to think of those things is because exactly-once-delivery doesn't exist. If it did, they wouldn't have to do anything. There would be no offsets needed. Offsets are needed because exactly-once-delivery doesn't exist, and to achieve exactly-once-processing the extra computation is required.
Thus the very start of this thread: exactly-once-delivery and exactly-once-processing are different and it matters. If it didn't matter, there would be no offsets that kafka is tracking because everything would just work.
What you can't do, and what naive consumers of distributed systems often expect, is drive the probability arbitrarily low solely from the producer side. The consumer has to be involved, even if they consider themselves to be a single-node process and don't want to bother with transactionality or idempotency keys.
No, as a vendor you're either deliberately lying or completely incompetent. Let's insert the words they don't want to say.
"We do exactly once good enough for your application!"
How could you possibly know that?
"We do exactly once good enough for all practical purposes!"
So are my purposes practical? Am I a true scotsman?
"We do exactly once good enough for applications we're good enough for!"
Finally, a truthful claim. I'm not holding my breath.
Edit: To address the parent more directly:
It's not possible to even guarantee at-least-once in a finite amount of time, let alone exactly-once. For example, a high-frequency trading system wants to deliver exactly one copy of that 10-million-share order. And do it in under a microsecond. If you've got a 1.5-microsecond round trip, it's not possible to even get confirmation that an order was received in that time limit, much less attempt another delivery. Your options are a) send once and hope, or b) send lots of copies, with some kind of unique identifier so the receiver can do process-at-most-once, and hope you don't wind up sending to two different receiving servers.
> With that definition, SQS, Kafka, and Sequin are all systems that guarantee exactly-once processing. The term processing captures both the delivery of the message and the successful acknowledgment of the message.
This terminology confuses my face. Maybe we need a better name for "exactly-once processing"? When I say "processing" I'm thinking of the actions that my program is going to take on said message - the very processing that they're talking about can fail. Can we not just say 'the message is held until acknowledgement'?
The simple way to visualize this is that all the processing can happen within a serializable transaction. If anything fails, the transaction rolls back. Marking the message as delivered is part of the same transaction.
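A minimal sketch of that idea with SQLite (table names and the balance-update "processing" are invented for illustration): the application's state change and the "mark delivered" update land in one transaction, so a redelivery of an already-handled message is a no-op.

import sqlite3

db = sqlite3.connect("inbox.db", isolation_level=None)   # manage transactions explicitly
db.execute("CREATE TABLE IF NOT EXISTS inbox (id TEXT PRIMARY KEY, account TEXT, done INTEGER DEFAULT 0)")
db.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INTEGER DEFAULT 0)")

def handle(msg_id):
    db.execute("BEGIN IMMEDIATE")                         # one transaction around everything
    try:
        row = db.execute("SELECT account FROM inbox WHERE id = ? AND done = 0", (msg_id,)).fetchone()
        if row is None:                                   # unknown or already processed: redelivery is a no-op
            db.execute("ROLLBACK")
            return
        db.execute("INSERT INTO accounts (name, balance) VALUES (?, 1) "
                   "ON CONFLICT(name) DO UPDATE SET balance = balance + 1", (row[0],))
        db.execute("UPDATE inbox SET done = 1 WHERE id = ?", (msg_id,))
        db.execute("COMMIT")                              # state change and 'delivered' mark commit together
    except Exception:
        db.execute("ROLLBACK")                            # any failure undoes both
        raise

This only works when the side effects live inside the same transactional store, which is exactly where the objection below about sending an email comes in.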
Which means that the 'processing' in my application has to have the semantics of a serializable transaction. And it might not - as in the very example given in the post, where a processing application sends an email.
Hence my complaint that making statements like "our system offers exactly once processing" can be completely correct - from the point of view of the messaging system. But it can be misleading to the casual reader in understanding how it fits into the broader solution, because the messaging system cannot guarantee that my application processes it exactly once. I'm saying the terminology doesn't feel right.
It's worth remembering that the same applies for person-to-person interactions too -- there's nothing magical about software that makes it somehow more susceptible to this kind of problem, it's just often easier to resolve with people. Especially if you're close enough for so-called "real time" communication.
Exactly-once delivery means solving the Two Generals problem, and since that's not possible, neither is exactly-once delivery.
Which isn't to say you can't get a good enough approximation, but you'll still only have an approximation.
- Within a transactional system you can have exactly once processing (as they call it here).
- Crossing the boundaries between two transactions requires idempotency IDs to deduplicate redeliveries and restore exactly-once inside the new transaction.
Email is an oft-cited example but is a transactional system that supports idempotency IDs (the Message-ID). Deliver the same email twice with the same message IDs and email systems are supposed to deduplicate. Gmail at least would do that, iirc.
A lot of developers assume exactly-once semantics, often implicitly, and have no metrics or consistency checks in place to detect issues. Or the issues are silently hidden, e.g. the same database row updated with the same value.
I remember working on a system in AWS that hooked an SNS topic into a Data Firehose, which then delivered the data to S3. I was expecting rare duplicates, and the system accounted for that.
After a week I audited the data and found that the duplicate rate varied between 5% and 8% per hour - much much higher than anticipated.
And this is purely within a managed integration between two managed components.
Messages don't have to be delivered more than once to achieve exactly-once processing; the receiver can offer a query interface for delivery status. If every attempt at delivery checks for a unique message ID and only sends the message if it has not been received then you get exactly-once delivery of the message, if not the metadata.
Optimistically assuming the message has not been delivered yet will sometimes reduce average latency (most databases), but for very expensive messages it is better to check first (rsync)
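A rough sketch of the check-first variant; was_delivered() and send() stand in for whatever query interface and transport the receiver actually exposes:

import time

def deliver(message_id, payload, receiver, attempts=5):
    # Check-then-send: worth it when the payload is expensive to transfer
    # (the rsync-style case); wasteful when it's cheap.
    for _ in range(attempts):
        try:
            if receiver.was_delivered(message_id):   # receiver's delivery-status query
                return True                          # already there, don't resend
            receiver.send(message_id, payload)       # only send if it isn't there yet
        except ConnectionError:
            time.sleep(1)                            # transient failure: check again, then maybe resend
    return False                                     # give up; caller decides what to do

Note that check-then-send is not atomic: two senders can both see "not delivered" and both send, which is where sender-side ownership (mentioned a few comments below) comes in.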
Right, and as you solve for this on a different layer, you end up having to choose between at-most-once and at-least-once at some point anyway. Proof is in the pudding if GP won't take my word for it ;)
I think it's a useful exercise to attempt to disprove notions such as this. We shouldn't take all these "truths" for granted. But it does take some diving into it properly to appreciate the nuances and eventually grasp the underlying principles (as opposed to reasoning about it at a higher level, as we all are here).
I've worked on a couple of systems that achieve exactly-once processing. One took an at-most-once writing approach that backfilled missing events after the fact (6 nines of completeness was sufficient for ~hours), and another took the approach of file-chunk ownership, with retries by another worker if the destination files (on a distributed file system in another region) were too short once ownership expired, before chunks were marked fully complete and the next chunk was added to the work queue (transactionally).
There is multiple delivery if there are multiple senders. That can be avoided on the sending side by, e.g., individual send-workers owning messages with a TTL and cancelling if they haven't finished by the time the TTL expires and they can't extend their ownership.
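A sketch of that sender-side lease idea (the LeaseStore here is a made-up in-memory stand-in; in practice it would be a shared store with atomic conditional writes):

import time

class LeaseStore:
    # Hypothetical shared store handing out per-message leases to send-workers.
    def __init__(self):
        self._leases = {}    # message_id -> (owner, expires_at)

    def try_acquire(self, message_id, owner, ttl):
        now = time.time()
        holder = self._leases.get(message_id)
        if holder is None or holder[1] < now or holder[0] == owner:
            self._leases[message_id] = (owner, now + ttl)   # take over or extend
            return True
        return False

def send_if_owner(store, message_id, owner, send_fn, ttl=30):
    # Only the current lease holder sends; a worker that can't (re)acquire
    # the lease backs off so it doesn't race another sender into a duplicate.
    if not store.try_acquire(message_id, owner, ttl):
        return False
    send_fn(message_id)
    return True

A worker that dies mid-send simply lets the lease lapse and another worker picks the message up after expiry, so the receiver still has to tolerate the occasional retry.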
Tangent, but: this reminds me of many philosophical debates, e.g. "do we have free will?".
What I've found is that usually what's happening is that the "debaters" assume different definitions of the thing they're debating about, which means they're really debating about two entirely different things, with the same name, and that's the real reason why they've come to different conclusions.
(We detached this comment from https://news.ycombinator.com/item?id=41701678. There's nothing wrong with a good tangent now and then, but it's probably better as a top-level subthread, plus lower down on the page.)
One recommendation I've read is that in such cases you should treat the single label you and your opponent have attached to two different ideas as forbidden. Describe it in more detail to reveal the difference (don't just swap words in the starting combination with their synonyms).
That's fine for discussions between experts but when a hundred noobs enter the discussion it's just unpaid tutoring. These threads are bringing out a lot of "I've discovered something all the distributed systems experts missed" energy.
For me "do we have free will" debate is useless because if the answer is "no" then that implies the result of the debate is predetermined and all involved lack any agency to change the outcome.
Therefore any adult arguing the "no free will" side is either
1. Disingenuous
2. Purposefully playing Candyland as an adult.
Either of which is so damaging to the Ethos that I lack any desire to continue the debate.
I feel similarly, but for a different reason: even if I don't have free will, I'm not going to change anything about how I live my life. My life is still important to me, either way. Free will vs. determinism might be a fascinating academic question to some, but I just don't find it all that interesting.
Same when it comes to arguments about whether or not our reality is a simulation, because, again, I'm not going to live any differently either way. My life is still important to me, even if it's just the result of an algorithm run on some computer built by an advanced civilization.
> that implies the result of the debate is predetermined and all involved lack any agency to change the outcome.
In a superdeterministic world, the presence of an illusion of free will is baked into everything humans do anyways. If the world is superdeterministic, but you can't actually predict what's about to happen with any additional certainty, then it doesn't really change anything anyway. So, there's no reason to change the way you argue based on the answer to the question in my opinion.
Of course, now I'm arguing about whether arguing about free will makes any sense, which is perhaps even more silly, but alas.
Not sure why you’re downvoted. I think this is completely true. To add more examples, I’ve seen a lot of debates about async IO and “non-blocking” where people talk over each other because they clearly mean different things. I tend to avoid ambiguous terms, but it’s an uphill battle.
> we don't say Sequin offers exactly-once delivery anywhere. We say it offers at-least-once delivery and exactly-once processing.
Ugh, this is just a matter of semantics. Boring. If the receiver only acts on messages once, and will always have the opportunity to act on every single message, then for all practical purposes, it's exactly-once delivery.
If there is no such thing as exactly-once delivery because the recipient might not feel like deduplicating, then there can be no such thing as at-least-once delivery either, because the recipient might not feel like being endlessly available for delivery either. Or the mailman might not feel like (re-)attempting deliveries endlessly either. Or the sender might not feel like re-sending things endlessly either.
Is this insightful? Surprising? What's the point of the debate?
I mean… ActiveMQ delivers messages once and reliably… we push 12k msgs/s with a gig of RAM and 2 cores at about 25% CPU. Msg size is about 500 bytes to 500 KB.
It’d be faster but we don’t care. It also participates in an XA transaction…
In 10 years of running it practically non-stop, up to volumes of 100k msgs/s, with hundreds of network and disk issues, I’ve never ever seen even once a dupe message. Sure, as the article says, it could happen, but in practice it doesn’t.
You wouldn’t expect to see issues in a simple system in a steady-state. The problems arise when the system responds to other transient or unexpected issues.
Try `kill -9`ing your consumers or unplugging a network cable and see what happens.
Also, AMQP is most certainly not exactly once by any measure at all. Do you have any consistency checks or anything to validate duplicates?
I mean, you can do exactly-once delivery, it's just a more complex state machine with a specific design requirement. Here it is using mail:
# Receive new mail message, incoming id '1234'.
# No other writer knows about this message.
$ echo "hello world" > incoming/1.1234.txt
# If files stay in incoming/ too long, it means they probably failed writing,
# and are thus stale and can be removed.
# When the message is done being written, the writer moves it to a new state.
$ mv incoming/1.1234.txt new/1.1234.txt
# The writer can then report to the sender that the message was accepted.
# In mail, this is seen as optional, as the whole point is to deliver the message,
# not to make sure you inform the sender that you will deliver the message.
# Here's the exactly-once bit:
# If you need assurance of exactly-once delivery, split out the delivery portion
# from the confirmation portion.
#
# Have one actor talk to the sender, and another actor queue the message in another
# temporary queue, like 'accepted/'. Once the message is queued there, you tell the
# client it is accepted, wait for successful termination of the connection, and then
# move the message to the 'new/' queue.
#
# This way, if there is an error between accepting the message and confirming with
# the sender, you can detect that and leave the message in 'accepted/' where you can
# later decide what to do with it. (But it won't get picked up for processing)
# The message is now ready to be picked up for processing.
# To process the message, a processor picks up the new message and moves it to a new state.
#
# The processor adds an identifier '5678', so the processor can identify which file it's working on.
# It can have extra logic to re-attempt processing on this ID if it fails.
$ mv new/1.1234.txt process/1.1234.5678.txt
# If the processor dies or takes too long, you can identify that (stale file, processor tracking its work, etc)
# and this can be moved back to 'new/' for another attempt. Moving it back to 'new/' is still atomic so
# there is no danger of another processor continuing to work on it.
# Once the processor is done processing, it will move to the complete state.
$ mv process/1.1234.5678.txt done/1.txt
# If the processing file no longer exists, it means it was either already done,
# or previously died and was moved back to 'new/'.
This process all depends on a database [filesystem] using synchronous atomic operations [mv]. If your distributed system can't handle that, yeah, you're gonna have a hard time.
You have made a false assumption: directory operations across directories are not atomic. The typical way around that is to hard-link the file into the target directory, sync the target directory, and then delete the old name, but now you no longer know whether the file is new or has already been processed when the system is interrupted with the intermediate state on disk, i.e. with the file in both directories. Which is exactly the problem messaging systems have. There is always a point in the state machines across multiple systems where a crash at the right point in time can lead to replay of a message. I worked on persistent messaging for a few years, and it is impossible to resolve this in real-world systems. The best you can do is to narrow the window by reducing the latency of the network and of commits to persistent storage.
It is of course "from the perspective of the host issuing the call", but that can be resolved by a network server by blocking operations on a given file when such an operation has begun. And of course the filesystem has to actually do the right thing. I'm no filesystem expert, but I would assume one way to do it is to write the block(s) that have the updated inode maps in one operation. A journal should help prevent corruption from making a mess of this. And of course disk / filesystem / OS tuning to further ensure data durability.
It appears to be atomic to the application as viewed from userland (and it is should no crash occur), but it is not guaranteed to be an atomic operation on disk should the system crash at the right point in time. Whether it is atomic on disk is filesystem dependent under Linux. Moreover, there is no way for many filesystems without a journal, like vfat and the historic ext2, to provide that atomicity.
Furthermore, there is no API to force 2 different directories to write to disk simultaneously in the same transaction. The best option is to fsync() both directories to disk. However, until both directories are confirmed to have been synced to disk, you run the risk of the file being in both directories during a well timed crash. Just to make this 100% clear: there are 0 guarantees that rename() has hit the disk once rename() has returned. rename() is just like every other filesystem operation that gets written back to disk at some later point in time after the syscall has returned (ignoring things like sync mount options).
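For what it's worth, a sketch of that "fsync both directories" option on Linux (this narrows the window, it doesn't close it):

import os

def durable_move(src, dst):
    # rename() is atomic as seen by other processes on the same filesystem,
    # but nothing about it is guaranteed to be on disk until the parent
    # directories themselves have been synced.
    os.rename(src, dst)
    for d in {os.path.dirname(os.path.abspath(src)),
              os.path.dirname(os.path.abspath(dst))}:
        fd = os.open(d, os.O_DIRECTORY)   # Linux-specific flag
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

A crash after the rename but before both fsyncs can still surface either the old or the new state (or, with the hard-link variant, both), which is the window being described.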
FYI, I have worked on filesystems in Linux, and one of the applications I worked on had tests where we intentionally rebooted the system in the middle of a write heavy workload. Your view of the filesystem world is insufficiently nuanced.
I'm not a filesystem expert either, but I've done my time building tools that support those who are. Some of the issues with distributed file systems are:
* Deliver-at-least-once: Ever had something hang for a very long time while trying to access an NFS mount that wasn't available? Say, while booting. That's because NFS has a mode (hard mount) that tries really, really hard to guarantee delivery. In the bad old days you could literally wait forever. Nowadays things usually give up after a while, sacrificing delivery for some kind of functionality. You can never guarantee delivery within a finite time period. "But!" you say, "you can add as many redundant network cards, paths, and servers as you need to get as many 9's as you want in your delivery guarantee." That helps. But a) you still can't guarantee delivery (what if you've swapped network cards, and the new one isn't configuring properly?), and b) you now have the problem of
* Network partitioning. It is entirely possible, and happens in real life, that part of the network can't talk to the rest. So now you've got, say, 4 servers and 20 clients in one partition, and 3 servers and 100 clients in the other. What do you do? It's provably impossible [1] to guarantee all three of "consistency" (no read gets an incorrect answer), "availability" (non-failing nodes continue to function), and "partition tolerance" (messages between nodes may be delayed for an arbitrary amount of time).
* Plus other stuff such as load balancing, consistent security contexts, backups, restores, adding and removing hardware on the fly, rolling updates, yada yada yada.
In many cases you can engineer a solution that's good enough. Not always; sometimes you just have to fork out the $$$ for monster monolithic servers. (Which, technically, are internally distributed redundant systems, but the SLA 9's can go way up if its all glued together in one box.)
So that everyone's on the same page, perhaps you could clarify what "delivery" means to you? For example:
* The human to whom the message is addressed reads it on their screen.
* The email is inserted into that person's inbox. It's possible something will happen to the message after that point, before the addressee actually reads it, but that isn't what you're talking about.
* The email has been accepted by a server operating as an endpoint for the addressee (their company's Exchange server), _and_ acknowledgement has been received by the sending server.
* The above, but no guarantees about the sending server getting the acknowledgement.
etc.
[Edit: also, what "guarantee" means to you. 100%, 99.99999% is good enough, and so on.]
The delivery contract in this case is between a message-sending client (SMTP client) and a message-receiving server (SMTP server). "Guarantee" meaning within the SLA for data durability of the storage provider for the message-receiving server.
And that, of course, is perfectly doable, and is done all the time. Not perfectly, but well enough. However, that's not generalizable to any other particular case. You always need to consider time, processing, storage, network constraints, what "good enough" means, and gobs of other stuff. That's the whole point of saying "you can't guarantee deliver-once": you always have to say "it depends, and it depends on what you can settle for."