
Data Deduplication at Scale - i0exception
https://engineering.mixpanel.com/2019/07/18/petabyte-scale-data-deduplication?referrer=hn
======
jandrewrogers
Keeping duplication metadata around doesn't scale well, though it may be
sufficient for their typical case. A strategy I've seen many times that scales
better is to (1) only check for duplicates when some part of the system has a
reason to believe duplicates could occur, with the assumption that duplicates
are relatively infrequent, and (2) design the storage engine such that you can
directly search for the record in previously ingested data inexpensively
without any extra deduplication metadata. Note that this is not as easy to
achieve if your data infrastructure is a loosely coupled collection of
arbitrary storage engines, databases, and processing pipelines -- which may be
a practical limitation for the case in the article.

If the storage engine is well-designed for the data model, a duplicate check
against existing data should only touch a handful of pages in the worst case;
it is an inexpensive query that rarely or never touches the network
(depending on the details). For ingestion of records where there is no risk
of duplication, presumably the bulk of the time, this is a zero-overhead
model since there is no deduplication state to be maintained or checked. For
most scenarios that create potential duplication, this model is also quite
cache-friendly as a practical matter.
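
A minimal sketch of that shape, with a plain map standing in for the storage
engine's point lookup (illustrative only, not code from any of the systems
discussed): the common ingest path keeps no dedup state, and the keyed probe
only runs when the caller has a reason to suspect a retry.

```go
// Sketch of "dedupe only on suspicion": no dedup metadata is maintained;
// a suspected duplicate triggers a direct point lookup against the store.
package main

import "fmt"

type Record struct {
	Key     string // e.g. a (project, user, event-id) composite key
	Payload string
}

// Store stands in for the real storage engine; Exists would normally be a
// cheap keyed lookup touching only a handful of pages.
type Store struct{ rows map[string]Record }

func NewStore() *Store { return &Store{rows: map[string]Record{}} }

func (s *Store) Exists(key string) bool { _, ok := s.rows[key]; return ok }
func (s *Store) Insert(r Record)        { s.rows[r.Key] = r }

// Ingest skips the duplicate check unless the caller suspects a duplicate
// (e.g. the batch is a client retry), so the common path has zero overhead.
func Ingest(s *Store, r Record, suspectDuplicate bool) bool {
	if suspectDuplicate && s.Exists(r.Key) {
		return false // dropped as a duplicate
	}
	s.Insert(r)
	return true
}

func main() {
	s := NewStore()
	fmt.Println(Ingest(s, Record{"p1/u1/e1", "signup"}, false)) // true
	fmt.Println(Ingest(s, Record{"p1/u1/e1", "signup"}, true))  // false: duplicate
}
```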

The pathological case for this design is when you need to check every single
record for duplication (e.g. cleaning up a giant offline mess of agglomerated
data that may contain an arbitrary number of duplicates), but those scenarios
usually don't involve real-time stream ingestion.

~~~
notknifescience
We essentially do this--duplicates are only looked for within the (project,
user, time) shard. It's certainly more than a handful of pages, but nothing
extremely cost-prohibitive at all. The indexer dedupes a whole shard at a
time, but only if at least one duplicate exists--in the normal case, we just
do nothing. Unfortunately, we also can't control the number of duplicates
we get (since we ingest data from our customers, we can't control how good it
is), and a lot of the time duplicates are fairly frequent due to suboptimal
implementations of Mixpanel, which makes deduping the whole shard a fair
tradeoff. We definitely never touch the network :slightly_smiling_face:
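
For readers skimming, here is a rough sketch of how I read that (my
interpretation, not Mixpanel's indexer; the shard layout and InsertID field
are assumptions): the shard is scanned once, and a per-event duplicate bit
column is produced only when at least one duplicate is found.

```go
// Shard-scoped dedup sketch: scan a (project, user, time) shard and, only if
// a repeated event ID is found, emit a per-event "duplicate" bit column that
// query time can use to skip the later copies.
package main

import "fmt"

type Event struct {
	InsertID string // client-supplied identifier used to detect duplicates
	Name     string
}

// MarkDuplicates returns nil when the shard has no duplicates (the common
// case, so no extra column is written); otherwise it returns one bit per
// event, set for every later copy of an already-seen InsertID.
func MarkDuplicates(shard []Event) []bool {
	seen := make(map[string]struct{}, len(shard))
	var bits []bool
	for i, e := range shard {
		if _, ok := seen[e.InsertID]; ok {
			if bits == nil {
				bits = make([]bool, len(shard))
			}
			bits[i] = true
			continue
		}
		seen[e.InsertID] = struct{}{}
	}
	return bits // nil when nothing needs deduping
}

func main() {
	shard := []Event{{"a", "signup"}, {"b", "click"}, {"a", "signup"}}
	fmt.Println(MarkDuplicates(shard)) // [false false true]
}
```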

------
rbranson
A counterpoint is our ingest-time deduplication system at Segment:
https://segment.com/blog/exactly-once-delivery/

It's done at ingest time because Segment has a completely different use case.
Message data fans out over multiple downstream systems, some of which
distribute this data to systems outside of our control. However, if I were in
Mixpanel's shoes, I'd probably do it how they're describing it here.

~~~
jeffail
"What’s more, we want to ensure the information about which events we’ve seen
is written durably so we can recover from a crash, and that we never produce
duplicate messages in our output."

Your processor is described as writing from Kafka to Kafka and using a
persisted RocksDB instance to check message identifiers. How then do you
ensure messages aren't dropped if your processor crashes or gets killed after
checking against RocksDB but before the message is flushed to the Kafka
broker?

Also, isn't your producer writing to Kafka at-least-once? If so, then even if
it removes all duplicates in its processing stage, the feed written to your
output topic could still contain duplicates.

By contrast, deduplicating on consumption avoids that problem entirely by
attempting to build an idempotent consumer, which is what gives you
exactly-once semantics. Although in this case they have identified edge cases
of duplicates they're comfortable with.
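
To make the concern concrete, a toy ordering (not Segment's actual code; a
map stands in for the persisted RocksDB key set and a function call for the
Kafka publish) where a crash in the marked window silently drops the message:

```go
// If the ID is recorded before the publish is confirmed, a crash between the
// two steps makes the redelivered message look like a duplicate: it is
// skipped on retry and never reaches the output topic.
package main

import "fmt"

var seen = map[string]bool{} // stand-in for the persisted RocksDB key set

func publish(msg string) { fmt.Println("published:", msg) } // stand-in for Kafka

func process(id, msg string) {
	if seen[id] {
		return // treated as a duplicate
	}
	seen[id] = true // (1) record the ID
	// <-- a crash here loses msg: on restart the ID is already "seen",
	//     so the redelivered message is skipped and never published
	publish(msg) // (2) publish to the output topic
}

func main() {
	process("e1", "signup")
	process("e1", "signup") // correctly skipped as a duplicate
}
```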

~~~
collinvandyck76
The processor will read the last message from the output topic on startup.
That catches the case where the processor crashes in between writing to
Kafka and recording the write in RocksDB.

~~~
jeffail
Yeah, but the key was stored in RocksDB the first time the message was read,
so when it gets consumed a second time after the crash:

"If the message already exists in RocksDB, the worker simply will not publish
it to the output topic and update the offset of the input partition,
acknowledging that it has processed the message."

It gets dropped as if it were a duplicate, oops!

Edit: I reread the relevant section...

"If a message was found in the output topic, but not RocksDB (or vice-versa)
the dedupe worker will make the necessary repairs to keep the database and
RocksDB in-sync. In essence, we’re using the output topic as both our write-
ahead-log, and our end source of truth, with RocksDB checkpointing and
verifying it."

Not sure entirely what they mean by "(or vice-versa)": if the message exists
in RocksDB but not in the output topic, how do you distinguish between a real
duplicate and a crash artifact?

If the last message of a crash happens to be a real duplicate and this
recovery mechanism reintroduces it into the pipeline then you have a
duplicate.

Either way, it's not an exactly-once feed. At best (assuming message loss
isn't possible) it's an at-least-once feed that usually appears to be exactly-
once.

~~~
rbranson
The order-of-operations is that it checks RocksDB first, writes to Kafka, and
then writes to RocksDB.

If the write to Kafka fails, it re-positions itself in the input topic stream
based on the offset annotation in the output topic's last message. The write
never went to RocksDB, so it won't be considered a duplicate.

Recovering from a failed RocksDB write is more complicated. The output topic's
last message will carry an offset annotation that is effectively beyond the
state accumulated in RocksDB. The last input-topic offset for each committed
message is written to RocksDB transactionally alongside it. The recovery
process uses this stored offset as a starting point when consuming the input
topic. During the replay, messages aren't published into the output topic
until the offset read from the output topic is reached.
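
A condensed sketch of that order-of-operations and recovery rule, with
in-memory stand-ins for Kafka and RocksDB (the names and structure are mine,
not Segment's): during replay, publishing is suppressed until the worker has
passed the input offset annotated on the output topic's last message, which
is how already-published messages avoid being produced twice while the
RocksDB state gets repaired.

```go
// Order-of-operations sketch: (1) check RocksDB, (2) publish to the output
// topic with an input-offset annotation, (3) commit the ID and input offset
// to RocksDB together. recoverUntil is the offset annotation read from the
// output topic's last message at startup (-1 when there is nothing to replay).
package main

import "fmt"

type Msg struct {
	ID          string
	Body        string
	InputOffset int64 // offset annotation carried on every output message
}

type State struct {
	seen       map[string]bool // RocksDB stand-in: IDs already committed
	lastOffset int64           // input offset committed alongside each ID
	output     []Msg           // output topic stand-in
}

func (s *State) Handle(m Msg, recoverUntil int64) {
	if s.seen[m.ID] {
		return // (1) already committed: drop as a duplicate
	}
	if m.InputOffset > recoverUntil {
		s.output = append(s.output, m) // (2) publish with offset annotation
	}
	// During replay (InputOffset <= recoverUntil) the message was already
	// published but never committed, so step (3) repairs RocksDB silently.
	s.seen[m.ID] = true // (3) commit ID + input offset together
	s.lastOffset = m.InputOffset
}

func main() {
	s := &State{seen: map[string]bool{}}
	s.Handle(Msg{"a", "signup", 0}, -1)
	s.Handle(Msg{"a", "signup", 1}, -1) // duplicate, dropped
	s.Handle(Msg{"b", "click", 2}, -1)
	fmt.Println(len(s.output), s.lastOffset) // 2 2
}
```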

~~~
jeffail
That makes more sense, thanks for clarifying.

Still, assuming there are no other edge cases there, it doesn't address the
other problem: a hypothetical consumer of the output topic is reading an
at-least-once feed of your exactly-once topic. For that not to be the case,
the consumer must also be idempotent, in which case what value was gained
from the original deduplication?

------
ryanworl
"For query-time, we found that reading an extra bit for every event adds
around 10ns to the reading of data. This is close to a 2% increase in the
query time because of the additional column."

This seems somewhat more expensive than I would've expected. Given your
estimates of duplicate probability, the bitset should compress to essentially
nothing, so IO is probably not the issue unless you're not compressing it.

Are you doing a virtual function call or something here?
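
For reference, the hypothetical shape of that per-event check (not Mixpanel's
code): if the bit column compresses to almost nothing, the ~10ns per event
looks more like this extra read-plus-branch (or an indirect call into a
column reader) than I/O.

```go
// One extra per-event branch on a duplicates bit column during a scan.
package main

import "fmt"

func sumEvents(values []int64, isDup []bool) int64 {
	var total int64
	for i, v := range values {
		if isDup[i] { // the extra read + branch per event
			continue
		}
		total += v
	}
	return total
}

func main() {
	fmt.Println(sumEvents([]int64{1, 2, 3}, []bool{false, true, false})) // 4
}
```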

------
sethammons
Our deduplication needs to happen at our outgoing edge and is latency-sensitive
(have we already sent this message to its recipient?). It needs to do this a
couple billion times a day and be highly available. Interesting problem space.
We are redesigning our current solution soon to make it more robust.

