
Déjà vu: Fast efficient probabilistic deduplication - f483
https://github.com/F483/dejavu
======
blattimwind
Bloom filters don't seem useful as the primary means of deduplication _for an
actual data storage system_ to me.

\- False positives means marking data as duplicate when it's not.

\- Bloom filters are not associative. So unless the application is very
special and can find/ _identify_ /retrieve the data later in some other
(likely inefficient?) way, a separate index is required anyway. But if you
have a separate index of the data you already have, you can just use that to
deduplicate. This index is the main memory cost of deduplicating storage.

Bloom filters are of course interesting in certain scenarios e.g. to
statistically reduce network traffic.

~~~
f483
> Bloom filters are of course interesting in certain scenarios e.g. to
> statistically reduce network traffic.

Yes I originally wrote it for exactly this such a scenario.

This is not for you if you cannot live with any chance of a false positive,
even an extremely small one. Note that even in most cases a false positive is
acceptable if the chance of a false positive is less then the chance of a
hardware error.

Not that just because it is probabilistic does not mean its not useful in many
cases. The default setup with 1mil entrie memory and 1/1mil chance false
positive chance will run at ~8M mem usage. This is a quite an acceptable trade
off in many cases.

------
maxdemarzi
May want to take a look at using a Cuckoo Filter instead of a Bloom Filter.
See [https://maxdemarzi.com/2017/07/13/using-a-cuckoo-filter-
for-...](https://maxdemarzi.com/2017/07/13/using-a-cuckoo-filter-for-unique-
relationships/)

~~~
f483
Thanks for the feedback, will look into it.

------
adrianratnapala
Perhaps like the other ZFS fanboys on this thread, I was hoping that
"efficient" would by some magic mean "sublinear in memory", but it doesn't
look like that. :(

~~~
f483
Sorry I cant do magic just yet :(

------
jwilk
Poor name choice.

[https://en.wikipedia.org/wiki/D%C3%A9j%C3%A0_Vu_(software)](https://en.wikipedia.org/wiki/D%C3%A9j%C3%A0_Vu_\(software\))

[https://github.com/worldveil/dejavu](https://github.com/worldveil/dejavu)

[https://github.com/appbaseio/dejavu](https://github.com/appbaseio/dejavu)

[https://github.com/IndigoUnited/js-
dejavu](https://github.com/IndigoUnited/js-dejavu)

~~~
f483
Yeah, should have put more thought into it then I did. But I guess its to late
now.

------
bradknowles
Could you use this to deduplicate files in a filesystem?

Could you use this as the deduplication method in ZFS?

~~~
f483
> Could you use this to deduplicate files in a filesystem?

Depends on what you mean and your constraints. Deduplicate file entries if you
can live with a rare false positives, sure. A setup of 1million entrie limit
with 1/1billion false positive chance will get ~80M mem usage.

If you want to check for duplicate files, probably not.

> Could you use this as the deduplication method in ZFS?

I would say no.

