
Delivering Billions of Messages Exactly Once - fouadmatin
https://segment.com/blog/exactly-once-delivery/
======
newobj
I don't want to ever see the phrase "Exactly Once" without several asterisks
behind it. It might be exactly once from an "overall" point of view, but the
client effectively needs infinitely durable infinite memory to perform the
"distributed transaction" of acting on the message and responding to the
server.

Imagine:

- Server delivers message M

- Client processes event E entailed by message M

- Client tries to ack (A) the message on the server, but "packet loss"

- To make matters worse, let's say the client also immediately dies after
this

How do you handle this situation? The client must
transactionally/simultaneously commit both E and A/intent-to-A. Since the
server never received an acknowledgment of M, it will either redeliver the
message, in which case some record of E must be kept to deduplicate on, or it
will wait for the client to resend A, or some mixture of both. Note: if you say
"just make E idempotent", then you don't need exactly-once delivery in the
first place...
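
The commit-both-E-and-A requirement can be sketched with a local durable store (hypothetical schema; SQLite stands in for the client's transactional storage):

```python
import sqlite3

# Sketch, not anyone's real code: the client records processed event E and
# its intent-to-ack A in ONE local transaction, keyed by message id, so a
# redelivery of M is detected and the ack can be retried after a crash.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (msg_id TEXT PRIMARY KEY, acked INTEGER)")

def handle(msg_id, process_event):
    """Apply E at most once locally; duplicates of M are dropped."""
    try:
        with conn:  # atomic: E and intent-to-A commit together
            conn.execute("INSERT INTO processed VALUES (?, 0)", (msg_id,))
            process_event()
        return True
    except sqlite3.IntegrityError:
        return False  # redelivery of M: E already committed

def pending_acks():
    """Acks still to (re)send; survives the crash-after-E, before-A case."""
    return [r[0] for r in conn.execute(
        "SELECT msg_id FROM processed WHERE acked = 0")]
```

On restart the client drains `pending_acks()` until the server confirms, which is exactly the record-keeping the comment says you cannot avoid.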

I suppose you could go back to some kind of lock-step processing of messages
to avoid needing to record all (E,A) that are in flight, but that would
obviously kill throughput of the message queue.

Exactly Once can only ever be At Least Once with some out-of-the-box
idempotency that may not be as cheap as the natural idempotency of your
system.

EDIT: Recommended reading: "Life Beyond Distributed Transactions", Pat Helland
-
[http://queue.acm.org/detail.cfm?id=3025012](http://queue.acm.org/detail.cfm?id=3025012)

~~~
rusanu
Having spent 7 years of my life working _with_ Pat Helland on implementing
Exactly Once In Order messaging with SQL Server Service Broker[0], I can assure
you that practical EOIO messaging is possible, exists, and works as
advertised. Delivering data EOIO is not rocket science, TCP has been doing it
for decades. Extending the TCP paradigms (basically retries and acks) to
messaging is not hard if you buy into transacted persisted storage (= a
database) for keeping undelivered messages (transmission queue) and storing
received messages before application consumption (destination queue). Just ack
_after_ you commit locally.
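
A minimal sketch of the "ack after you commit locally" pattern (hypothetical schema; not Service Broker's actual implementation):

```python
import sqlite3

# The receiver commits the message into a durable destination queue,
# deduping on message id, and only then acknowledges to the sender.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dest_queue (msg_id TEXT PRIMARY KEY, payload TEXT)")

def receive(msg_id, payload, send_ack):
    with db:  # commit locally first...
        db.execute("INSERT OR IGNORE INTO dest_queue VALUES (?, ?)",
                   (msg_id, payload))
    send_ack(msg_id)  # ...then ack; a lost ack only causes a harmless resend
```

Resends are absorbed by the primary key, so the application dequeues each message once, which is the whole trick.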

We were doing this in 2005 at 10k+ msgs/sec (1k payload): durable,
transacted, fully encrypted, with no two-phase commit, supporting long
disconnects (I know of documented cases of conversations that resumed and
continued after 40+ days of partner network disconnect).

Running into resource limits (basically out of disk space) is something the
database community has known how to monitor, detect and prevent for decades.

I really don't get why so many articles, blogs and comments claim this is not
working, or impossible, or even _hard_. My team shipped this 12+ years ago; it
is used by major deployments, the technology is proven, and little has changed
in the original protocol.

[0] [https://docs.microsoft.com/en-us/sql/database-
engine/configu...](https://docs.microsoft.com/en-us/sql/database-
engine/configure-windows/sql-server-service-broker)

~~~
deepsun
Are you talking about a distributed environment, where network partitions can
occur? If yes, then there's the Two Generals Problem and the FLP result, which
prove it impossible. So I guess you're talking about a non-distributed
environment.

In other words, to reliably agree on a system state (whether message id was
delivered) you need the system to be Consistent. And per CAP theorem, it
cannot be Available in presence of Partitions.

So the other people you're referring to are probably talking about distributed
systems.

~~~
rusanu
Yes, I'm talking about distributed systems and I am aware of the CAP theorem.
Hence my choice of the word 'practical'.

As I said, users had cases where the plumbing (the messaging system) recovered
and delivered messages after 40+ days of network partitioning. Correctly written
apps completed the business process associated with those messages as normal,
no special case. Humans can identify and fix outages and databases can easily
outlast network outages (everything is durable, transacted, with HA/DR). And
many business processes make perfect sense to resume/continue after the
outage, even if it lasted for days.

~~~
alexbeloi
I'm not really versed in this topic, but it seems like using a database for a
socket makes the system entirely centralized around that database. Is there
something I'm missing?

~~~
ztorkelson
ServiceBroker, at least, had the capability of (transactionally) sending
messages between databases. So, if you drank the kool-aid (I did; it wasn't so
bad), there needn't be "the centralized database". You can separate your
databases and decompose your services, and indeed it's easier to do so
correctly and with confidence because the technology eliminates a lot of hairy
edge cases.

------
mamon
"Exactly once" model of message is theoretically impossible to do in
distributed environment with nonzero possibility of failure. If you haven't
received acknowledgement from the other side of communication in the specified
amount of time you can only do one of two things:

1) do nothing, risking message loss

2) retransmit, risking duplication

But of course that's only from the messaging system's point of view.
Deduplication at the receiver end can help reduce the problem, but it can
itself fail (there is no foolproof way of implementing that pseudocode's
"has_seen(message.id)" method).
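
For example, any bounded implementation of `has_seen` (a sketch; the names are made up) eventually forgets old ids and lets a late duplicate through:

```python
from collections import OrderedDict

# A foolproof has_seen() needs the full id history; any bounded version
# (here: keep only the last N ids) can evict an old id and later report
# a duplicate as new.
class SeenWindow:
    def __init__(self, capacity):
        self.capacity = capacity
        self.ids = OrderedDict()

    def has_seen(self, msg_id):
        if msg_id in self.ids:
            self.ids.move_to_end(msg_id)  # refresh recency on a hit
            return True
        self.ids[msg_id] = True
        if len(self.ids) > self.capacity:
            self.ids.popitem(last=False)  # evict oldest: future false negative
        return False
```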

~~~
kmicklas
> there is no foolproof way of implementing that pseudocode's
> "has_seen(message.id)" method

Wait why? Just because you'd have to store the list of seen messages
theoretically indefinitely?

~~~
sethev
There's also a race condition in there when you receive the duplicate before
publish_and_commit is done doing its thing - assuming they're not actually
serializing all messages through a single thread like the pseudocode implies.

What they've done is shift the point of failure from something less reliable
(client's network) to something more reliable (their rocksdb approach) -
reducing duplicates but not guaranteeing exactly once processing.
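
The check-then-act gap can be sketched as follows (hypothetical names; a lock stands in for routing all copies of an id to one thread):

```python
import threading

seen = set()
lock = threading.Lock()

def racy_handle(msg_id, publish):
    # The bug described above: between the membership check and the add,
    # a concurrently delivered duplicate can pass the same check, so both
    # copies get published.
    if msg_id not in seen:
        publish(msg_id)
        seen.add(msg_id)

def serialized_handle(msg_id, publish):
    # Serializing all copies of one id (via a lock here, via partition
    # routing in their design) makes check-and-record atomic.
    with lock:
        if msg_id in seen:
            return
        seen.add(msg_id)
    publish(msg_id)
```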

~~~
dastbe
It's not so much that they are serializing all messages through a single
thread, but that they are consistently routing messages (duplicates and all)
into separate shards, each of which is processed by a single thread.

------
alexandercrohde
Here's a radical solution. Instead of becoming a scala pro akka stream 200k
engineer with a cluster of kafka nodes that costs your company over $100,000
of engineering time, technical debt, opportunity cost, and server costs, just
put it all in bigtable, with deduping by id....

Enough of resume-driven-engineering. Why does everyone need to reinvent the
wheel?

~~~
azernik
Yup. Databases, whether relational or not, have been designed to solve all
these problems in a much more "bulletproof" way than your piddly [1] several-
dozen-engineer team could ever manage, no matter how genius they are.

[1] No disrespect meant - just a description of size. Source: running a piddly
2-person engineering team.

------
bmsatierf
In terms of connectivity, we deal with a similar problem here at CloudWalk to
process payment transactions from POS terminals, where most of them rely on
GPRS connections.

Our network issues are nearly 6 times higher (~3.5%) due to GPRS, and we
solved the duplication problem with an approach involving both client and
server side.

Clients would always ensure that all the information sent by the server was
successfully received. If something goes wrong, instead of retrying (sending
the payment again), the client sends just the transaction UUID to the server,
and the server might either respond with: A. the corresponding response for
the transaction or B. not found.

In the scenario A, the POS terminal managed to properly send all the
information to the server but failed to receive the response.

In the scenario B, the POS terminal didn't even manage to properly send the
information to the server, so the POS can safely retry.
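
A sketch of the server side of this recovery protocol, under assumed names (the real system presumably persists this in a database, not a dict):

```python
# The server keeps the response it computed, keyed by transaction UUID,
# so a client that lost the response can ask for it instead of resubmitting
# the payment itself.
responses = {}  # tx_uuid -> stored response

def process_payment(tx_uuid, amount):
    if tx_uuid in responses:  # duplicate submit: replay, don't charge twice
        return responses[tx_uuid]
    result = {"status": "approved", "amount": amount}  # the charge happens here
    responses[tx_uuid] = result
    return result

def query(tx_uuid):
    # Scenario A: the stored response is replayed to the POS.
    # Scenario B: "not found" tells the POS it can safely retry.
    return responses.get(tx_uuid, "not found")
```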

~~~
v1shnu
Why not just send the data too along with the UUID? It'll save another
roundtrip in case of scenario B right? Or do you have data to prove that
scenario B is a lot less likely to occur, making it sensible to save bandwidth
by not re-transmitting the data?

------
falcolas
So, a combination of a best effort "at least once" messaging with
deduplication near the receiving edge. Fairly standard, honestly.

There is still a potential for problems in the message delivery to the
endpoints (malformed messages, Kafka errors, messages not being consumed fast
enough and lost), or duplication at that level (restart a listener on the
Kafka stream with the wrong message ID) as well.

This is based on my own pains with Kinesis and Lambda (which, I know, isn't
Kafka).

In my experience, it's better to just allow raw "at least once" messaging and
perform idempotent actions based off the messages. It's not always possible
(and harder when it is possible), but its tradeoffs mean you're less likely to
lose messages.

~~~
caust1c
This is generally better, but we're delivering these messages to integrations
which don't necessarily take idempotent actions.

------
travisjeffery
Kafka 0.11 (recently released) has exactly once semantics and transactional
messages built-in.

- Talk from Kafka Summit: [https://www.confluent.io/kafka-summit-
nyc17/resource/#exactl...](https://www.confluent.io/kafka-summit-
nyc17/resource/#exactly-once-semantics_slide)

- Proposal:
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+E...](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging)

~~~
mshenfield
My understanding (noob here) is that it allows producers to retry without fear
of duplication. You still have to consider the system feeding the producer
though. In Segment's example, clients might deliver their messages more than
once to the API. Kafka's mechanism wouldn't detect duplicate messages sent to
the producer, just that any given message a producer wants to append to Kafka
won't be duplicated.

------
ju-st
52 requests, 5.4 MB and 8.63 seconds to load a simple blog post. With a bonus
XHR request every 5 seconds.

~~~
heipei
That's not everything: This website contacted 56 IPs in 6 countries across 47
domains to perform 100 HTTP transactions. In total, 3 MB of data was
transfered, which is 5 MB uncompressed. It took 4.103 seconds to load this
page. 37 cookies were set, and 8 messages to the console were logged.

[https://urlscan.io/result/b2e27a08-1298-491a-863f-8cadc45e73...](https://urlscan.io/result/b2e27a08-1298-491a-863f-8cadc45e73eb)

------
StreamBright
"The single requirement of all data pipelines is that they cannot lose data."

If the business value of the data is only derived after applying some summary
statistics, then even sampling the data works, and you can lose events in an
event stream without changing the insight gained. Originally Kafka was
designed to be a high-throughput data bus for an analytical pipeline where
losing messages was OK. More recently they have been experimenting with
exactly-once delivery.

~~~
vannevar
Yeah, this was a major overstatement. There are lots of data pipelines where
it's OK to lose some data. Consider a sensor that sends measurements hundreds
of times a second to an app that operates on a 1-second timeframe. And UDP is
used all the time on the internet, yet carries no delivery guarantee.

------
skMed
Having built something similar with RabbitMQ in a high-volume industry, there
are a lot of benefits people in this thread seem to be glossing over and are
instead debating semantics. Yes, this is not "exactly once" -- there really
is no such thing in a distributed system. The best you can hope for is that
your edge consumers are idempotent.

There is a lot of value derived from de-duping near ingress of a heavy stream
such as this. You're saving downstream consumers time (money) and potential
headaches. You may be in an industry where duplicates _can_ be handled by a
legacy system, but it takes 5-10 minutes of manual checks and corrections by
support staff. That was my exact use case and I can't count the number of
times we were thankful our de-duping handled "most" cases.
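
One common way to make an edge consumer idempotent, sketched with a hypothetical schema: apply each message as an upsert keyed by its id, so reprocessing a duplicate converges to the same state instead of double-applying.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")

def consume(msg):
    # INSERT OR REPLACE makes redelivery harmless: same row, same end state.
    with db:
        db.execute("INSERT OR REPLACE INTO orders VALUES (?, ?)",
                   (msg["order_id"], msg["status"]))
```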

------
sethammons
"Exactly Once _"

_ Over a window of time that changes depending on the amount of ingested
events.

Basically, they read from a Kafka stream and have a deduplication layer in
RocksDB that produces to another Kafka stream. They process about 2.2 billion
events through it per day.
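
That pipeline can be sketched with a dict standing in for RocksDB and lists standing in for the Kafka topics (all names hypothetical; the real dedupe state is also expired by a size/time window):

```python
dedupe_store = {}   # message_id -> True: the "seen within the window" state
output_topic = []   # the downstream Kafka topic

def dedupe_worker(input_topic):
    for msg in input_topic:
        if msg["id"] in dedupe_store:
            continue                  # duplicate within the window: drop it
        dedupe_store[msg["id"]] = True
        output_topic.append(msg)      # forward exactly one copy downstream
```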

While this will reduce duplicates and get closer to Exactly Once (helping
reduce the two generals problem on incoming requests and potentially work
inside their data center), they still have to face the same problem again when
they push data out to their partners. Some packet loss, and they will be
sending out duplicates to the partner.

Not to downplay what they have done as we are doing a similar thing near our
exit nodes to do our best to prevent duplicate events making it out of our
system.

------
incan1275
To be fair, they are upfront in the beginning about not being able to adhere
to an exactly-once model.

"In the past three months we’ve built an entirely new de-duplication system to
get as close as possible to exactly-once delivery"

What's annoying is that they do not get precise and formal about what they
want out of their new model. Also, their numbers only speak to performance,
not correctness.

On the plus side, I think it's awesome to see bloom filters successfully used
in production. That sort of thing is easy to implement, but not easy to get
right for every use case.

------
openasocket
So there's a lot of talk on here about the Two Generals Problem, so I thought
I'd chime in with some misconceptions about how the Two Generals Problem
relates to Exactly Once Messaging (EOM). WARNING: I'm going mostly on memory
with this, I could be completely wrong.

EOM is NOT strictly speaking equivalent to the Two Generals Problem, or
Distributed Consensus, in an unreliable network. In distributed consensus, at
some given point in time, A has to know X, A has to know B knows X, A has to
know B knows A knows X, ... It has to do with the fact that the message broker
is in some sense the arbitrator of truth, so the consumer(s) don't need full
consensus. In an unreliable network, you can have EOM.
[http://ilpubs.stanford.edu:8090/483/1/2000-7.pdf](http://ilpubs.stanford.edu:8090/483/1/2000-7.pdf)
gives some examples of how that works.

HOWEVER, you can't have EOM when the consumers can fail. If a consumer fails
there's no way, in general, to tell if the last message it was working on was
completed.

There are a couple of edge cases where you can still have EOM. For instance, a
system where you have a message broker A, and a bunch of consumers that read
messages x from that queue, compute f(x), and insert f(x) onto message broker
B, where f(x) may be computed multiple times for the same x (i.e. if f is a
pure function or you don't care about the side effects). This system can
implement EOM in the presence of an unreliable network and consumer failures
(I think it can handle one or both of the message brokers failing too, not
100% sure) in the sense that x will never be in broker A at the same time as
f(x) is in broker B, f(x) will never be in broker B more than once for the
same x, and any y in B had some x that was in A such that y = f(x).
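
A sketch of that edge case, with in-memory stand-ins for the brokers (names hypothetical): because f is pure and broker B does insert-if-absent on x's id, recomputing f(x) after a redelivery is harmless.

```python
broker_a = [("x1", 3), ("x2", 4), ("x1", 3)]  # "x1" redelivered after a crash
broker_b = {}  # at most one f(x) per x: B is the arbiter of truth

def f(x):
    return x * x  # pure function: recomputation has no side effects

def worker():
    for msg_id, x in broker_a:
        if msg_id not in broker_b:   # insert-if-absent at broker B
            broker_b[msg_id] = f(x)
```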

------
siliconc0w
Was thinking a 'reverse bloom filter' could be cool to possibly avoid the
RocksDB for situations like this- turns out it already exists:
[https://github.com/jmhodges/opposite_of_a_bloom_filter](https://github.com/jmhodges/opposite_of_a_bloom_filter)

I love it when that happens.

~~~
openasocket
It should be noted it's impossible to have a 'reverse bloom filter' with the
same properties as a regular bloom filter. That is, a constant predefined size
gives you a particular false negative rate. The thing you linked to is really
just a cache. It has to store an entire entry (not just set some bits based on
the hash) and doesn't have a predictable false negative rate based on its
size. For more info:
[https://cstheory.stackexchange.com/questions/6596/a-probabil...](https://cstheory.stackexchange.com/questions/6596/a-probabilistic-
set-with-no-false-positives)
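
For contrast, a sketch of the linked design: a fixed-size array of slots indexed by hash, each holding a full id, so a hit is always genuine (no false positives) but a colliding write evicts an older id (false negatives, at an unpredictable rate).

```python
import hashlib

class OppositeOfABloomFilter:
    def __init__(self, size):
        self.slots = [None] * size

    def _index(self, item):
        # Deterministic hash into the slot array.
        return int(hashlib.sha256(item.encode()).hexdigest(), 16) % len(self.slots)

    def contains_and_add(self, item):
        i = self._index(item)
        seen = self.slots[i] == item  # full comparison: never a false positive
        self.slots[i] = item          # may overwrite a colliding id
        return seen
```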

------
pfarnsworth
Sounds very cool. A couple of questions I had:

1) What happens if they lose their rocksdb with all of the messageIds?

2) Is their Kafka at-least-once delivery? How do they guarantee that Kafka
doesn't reject their write? Also, assuming they have set up their Kafka for
at-least-once delivery, doesn't that make the output topic susceptible to
duplicates due to retries, etc.?

3) >Instead of searching a central database for whether we’ve seen a key
amongst hundreds of billions of messages, we’re able to narrow our search
space by orders of magnitude simply by routing to the right partition.

Is "orders of magnitude" really correct? Aren't you really just narrowing the
search space by the number of partitions in kafka? I suppose if you have a
hundred partitions, that would be 2 orders of magnitude, but it makes it sound
like it's much more than that.

~~~
ngrilly
> What happens if they lose their rocksdb with all of the messageIds?

I'm wondering the same.

------
squiguy7
I wonder how they partition by the "messageID" to ensure that the de-
duplication happens on the same worker. I would imagine that this affects
their ability to add more brokers in the future.

Perhaps they expect a 1:1 mapping of RocksDB, partition, and de-duplication
worker.

~~~
bytecodes
Kafka does this as part of its design. A topic has a declared number of
partitions (which can't really be changed on the fly; you choose a
theoretically high number and hope it's enough), and an agreed-upon hash
algorithm chooses between those partitions (probably in Java, so hashCode is
readily available for primitives as well as objects). Each partition is really
like its own topic, so you lose in-order messaging for anything not included
in your partition key.
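
The routing can be sketched as follows (Kafka's default partitioner actually uses murmur2 on the key bytes; md5 here just keeps the sketch deterministic across runs, unlike Python's built-in `hash()`):

```python
import hashlib

NUM_PARTITIONS = 100  # fixed up front, because changing it later is painful

def partition_for(message_id: str) -> int:
    # Stable hash of the key, so every copy of one id lands on the same
    # partition and is seen by the same dedupe worker.
    digest = hashlib.md5(message_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```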

------
philovivero
tl;dr: Clickbait headline. Exactly-once delivery not even close to
implemented. Typical de-duping, as you've seen and read about hundreds of
times already, is what they did.

------
ratherbefuddled
"Almost Exactly Once" doesn't have quite the same ring to it, but it is
actually accurate. We've already discovered better trade-offs haven't we?

------
iampims
If the OP doesn't mind expanding a little on this bit, I'd be grateful.

> If the dedupe worker crashes for any reason or encounters an error from
> Kafka, when it re-starts it will first consult the “source of truth” for
> whether an event was published: the output topic.

Does this mean that "on worker crash" the worker replays the entire output
topic and compares it to the RocksDB dataset?

Also, how do you handle scaling up or down the number of workers/partitions?

~~~
ryanworl
I'm not the OP, but changing the number of Kafka partitions isn't a super
graceful operation. You would be wise to add as many as you could reasonably
need assuming one consumer thread per partition. But not too many because each
one is at least two files on disk!

~~~
tedmiston
I haven't used Kafka yet, but a Kafka partition is roughly the same as a
Kinesis shard, right?

~~~
ryanworl
Yes, but you have to set them up upfront and there is no API to split and
merge partitions. If you add more partitions, it doesn't automatically re-
shard for you either. I don't think it should automatically re-shard as it
would cause a ton of disk and network IO, but just something to be aware of.

------
qsymmachus
It's funny, at my company we implemented deduplication almost exactly the same
way for our push notification sender.

The scale is smaller (about 10k rpm), but the basic idea is the same (store a
message ID in a key-value store after each successful send).

I like the idea of invalidating records by overall size, we hadn't thought of
that. We just use a fixed 24-hour TTL.
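
The fixed-TTL variant can be sketched like this (with Redis it is roughly `SET <id> 1 NX EX 86400`; the in-memory version below uses an injectable clock so it can be tested without waiting 24 hours):

```python
import time

class TtlDedupe:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.sent_at = {}  # message_id -> time of first successful send

    def should_send(self, msg_id):
        now = self.clock()
        first = self.sent_at.get(msg_id)
        if first is not None and now - first < self.ttl:
            return False  # seen within the TTL window: drop the duplicate
        self.sent_at[msg_id] = now  # record (or refresh an expired entry)
        return True
```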

------
wonderwonder
Would something like AWS SQS not scale for something like this? We currently
push about 25k daily transactions over SQS, obviously nowhere near the scale
of this; just wondering about what limitations we will bump into potentially.

~~~
timdorr
The limitations are most likely on price. For the 200B messages they've
already processed in the last 3 months, that would be $100,000 total on just
the SQS FIFO queue, or $33,333 per month. And that's not counting data
transfer.

~~~
hilbertseries
As long as everything is in EC2, data transfer will be free. Your cost
calculations are also off base. You'll need to send, receive, and delete every
message that you process via SQS. These can all be done in batches of 10. So
it's 200B * 3 / 10 requests at $0.50 per million, which comes out to about
$30k over 3 months. Still not cheap; Kinesis is probably the better option in
this case if you want an AWS-managed service.

------
linkmotif
It's worth noting that the next major Kafka release (0.11, out soon) will
include exactly once semantics! With basically no configuration and no code
changes for the user. Perhaps even more noteworthy is this feature is built on
top of a new transactions feature [0]. With this release, you'll be able to
atomically write to multiple topics.

[0]
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+E...](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging)

------
ggcampinho
Isn't the new feature of Kafka about this?

[https://issues.apache.org/jira/browse/KAFKA-4815](https://issues.apache.org/jira/browse/KAFKA-4815)

~~~
caust1c
It would help, but messages are sent to the API first. We aren't sending
messages directly to Kafka from the Internet, of course.

So we can get duplicate API submissions regardless of whether or not we
enable transactional production into Kafka from a producer.

------
robotresearcher
> [I]t’s pretty much impossible to have messages only ever be delivered once.

IIRC, it's provably impossible in a distributed system where processes might
fail, i.e. all real systems.

------
jkestelyn
Relevant to this topic: Description of exactly-once implementation in Google
Cloud Dataflow + what "exactly once" means in context of streaming:

[https://cloud.google.com/blog/big-data/2017/05/after-
lambda-...](https://cloud.google.com/blog/big-data/2017/05/after-lambda-
exactly-once-processing-in-google-cloud-dataflow-part-1)

(Google Cloud emp speaking)

------
kortox
With deduplication state on the worker nodes, how does scaling up, or
provisioning new machines, or moving a partition between machines work?

------
vgt
Qubit's strategy to do this via streaming, leveraging Google Cloud Dataflow:

[https://cloud.google.com/blog/big-data/2017/06/how-qubit-
ded...](https://cloud.google.com/blog/big-data/2017/06/how-qubit-deduplicates-
streaming-data-at-scale-with-google-cloud-platform)

------
majidazimi
What is so exciting about this? There is still a possibility of duplicates.
You still have to put in the engineering effort to deal with duplicates
end-to-end. And if the code is there to deal with duplicates end-to-end, does
it really matter whether you have 5 duplicates or 35? Or maybe they just did
it to add some cool tech to their CVs?

~~~
ngrilly
> There is still possibility of duplicates.

Where?

------
redmalang
Another possible approach: [https://cloud.google.com/blog/big-
data/2017/06/how-qubit-ded...](https://cloud.google.com/blog/big-
data/2017/06/how-qubit-deduplicates-streaming-data-at-scale-with-google-cloud-
platform)

------
gsmethells
Why do I get the feeling this is repeating TCP features at the message level?
There must be a protocol that can hide this exactly-once need away. TCP
doesn't generally produce downloads that are bad and fail their checksum test,
hence the packets that make up the file are not duplicated.

~~~
jkarneges
Yes there is some duplication of TCP capability here.

The problem with relying on TCP for reliability is that its state is in
memory, associated with a particular peer IP address, and acknowledgements
passed back to the sender only indicate that the receiver has the data in
local memory, not that the data has been processed.

A file download over TCP can fail, for example due to a network problem.
Ensuring reliable delivery requires additional measures outside of TCP, such
as retrying the download using a new connection.

In practice, this means that TCP is primarily useful for providing flow
control and offering a streaming interface (no worry about packet sizes). Less
so as a complete solution for transmission reliability.

------
luord
This is interesting work. But I think I'll continue relying on at least once
and idempotency. Exactly once is impossible anyway.

> In Python (aka pseudo-pseudocode)

This annoyed me probably more than it should have.

------
spullara
This isn't the solution I would architect. It is much easier to de-duplicate
later, when processing your analytics workload, and then you don't need to do
so much work.

~~~
tejasmanohar
That's not true for all the hundreds of real-time integrations that Segment
sends data to. Many are write-only.

~~~
spullara
Fair. For that case this solution makes more sense than the pure analytics
case.

------
PinguTS
That reminds me of the safety-related protocols we have used for years in
embedded electronics, like railroad signaling, medical devices, and other
areas.

------
stratosgear
Site seems to be down. Any ideas how big these HN hugs of death usually are?
How big of a traffic spike brings these servers down?

~~~
Exuma
It's not down

~~~
stratosgear
Hmm weird. I get: This site can’t be reached

segment.com refused to connect.

Try:

Checking the connection

ERR_CONNECTION_REFUSED

Maybe something local to me only?

~~~
disconnected
Check your ad blocker/hosts file.

In here, uMatrix just blocks the site with the message:

> uMatrix has prevented the following page from loading:

> [https://segment.com/blog/exactly-once-
> delivery/](https://segment.com/blog/exactly-once-delivery/)

I checked, and one of my uMatrix hosts files includes 'www.segment.com'.

~~~
stratosgear
Yep. Disabling AdAway seems to do the trick. Thanks for the heads up

~~~
goodplay
You can also add an exception for a blocked site by marking it green and
clicking on the opened padlock in the umatrix panel.

No need to nuke a city to get a fly.

------
mooneater
Awesome story. What I would like to hear more about, is the people side. The
teams and personalities involved with coming up with this new system and the
transition.

------
throwaway67
... or they could have used BigQuery with a primary key on message ID.

~~~
redmalang
BQ doesn't have primary keys. Perhaps you are thinking of the id that can be
supplied with a streaming insert? That has very loose guarantees on what is
de-duplicated (~5m IIRC).

~~~
vgt
Yeah, I think within the context of BigQuery the most sensible thing would be
to aggregate by the column that would be considered the primary key. For
example [0]. That said, the Streaming API de-dupe window is very nice in
practice.

I mentioned elsewhere on Google Cloud the most elegant way of doing this is
with Google Cloud Dataflow [1]

(work at G)

[0][https://stackoverflow.com/questions/38446499/bigquery-
dedupl...](https://stackoverflow.com/questions/38446499/bigquery-
deduplication-on-two-columns-as-unique-key)

[1][https://cloud.google.com/blog/big-data/2017/06/how-qubit-
ded...](https://cloud.google.com/blog/big-data/2017/06/how-qubit-deduplicates-
streaming-data-at-scale-with-google-cloud-platform)

