
Data Deduplication with Linux - chintanp
http://www.linuxjournal.com/content/data-deduplication-linux
======
wazoox
For those interested, I published a lot of lessfs testing on my professional
blog a while ago:

* first post: <http://blogs.intellique.com/tech/2010/12/22#dedupe>

* detailed setup and benchmark results: <http://blogs.intellique.com/tech/2011/01/03#dedupe-config>

After more than 9 months running lessfs, I recommend it.

------
chintanp
Required reading from my course on Advanced Storage Systems at CMU:
<http://www.cs.cmu.edu/~15-610/READINGS/optional/zhu2008.pdf>

Really good paper that describes in detail how deduplication works.

------
ak217
So, from what I understand, this is great but more of a proof of concept,
since FUSE performance kills it. As far as putting it into production goes,
there are a few unresolved questions I haven't seen picked apart:

\- Can dedup be integrated into the VFS layer, like unionfs is shooting for,
or does it have to be integrated with the underlying filesystem?

\- Is online dedup possible, and does the answer change when running on SSDs?

\- What's the best granularity (block-level? inode-level? block extent-level?)
and how badly can it randomize the I/O? I imagine one would have to do a lot
of real-world benchmarking to find this out. (A toy block-level sketch follows
after this list.)

\- Are there possible privacy issues (i.e. finding out through I/O patterns
whether someone else has a given block or file stored), and how should they be
dealt with?
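
As a rough illustration of the granularity question, here's a toy in-memory
sketch of what block-level dedup boils down to (hypothetical code, not how
lessfs or ZFS actually work; a real implementation persists the index,
reference-counts blocks, and runs into exactly the I/O randomization concern
mentioned above):

    import hashlib

    class BlockStore:
        """Toy block-level dedup: identical blocks are stored once,
        and files become lists of block hashes."""
        def __init__(self):
            self.blocks = {}   # block hash -> block data (stored once)
            self.files = {}    # file name  -> list of block hashes

        def write(self, name, data, block_size=4096):
            hashes = []
            for i in range(0, len(data), block_size):
                block = data[i:i + block_size]
                h = hashlib.sha256(block).digest()
                self.blocks.setdefault(h, block)   # dedup: keep first copy only
                hashes.append(h)
            self.files[name] = hashes

        def read(self, name):
            return b''.join(self.blocks[h] for h in self.files[name])

File-level dedup would hash whole files instead (a much smaller index, but far
fewer hits); extent-level sits somewhere in between.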

------
res0nat0r
Bup is also a pretty cool git-based dedup backup utility:

<https://github.com/apenwarr/bup#readme>

------
viraptor
I was wondering - with the current amount of abstraction and similar
(sometimes redundant) metadata on almost everything - what percent of
duplicate blocks could be found on a standard desktop system?

I don't think it would be useful; I'm just interested in the level of
"standard" data duplication.

~~~
viraptor
Actually the btrfs email thread contained the answer
(<http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg07726.html>):

"I was just toying around with a simple userspace app to see exactly how much
I would save if I did dedup on my normal system, and with 107 gigabytes in
use, I'd save 300 megabytes."

It's a relatively small amount. Then again - you're storing 300MB of exactly
the same blocks of data... Unless they're manual backup files, this looks like
a big waste to me.
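
For anyone curious about their own machine, a minimal sketch of that kind of
userspace measurement might look like this (hypothetical script, fixed 4 KiB
blocks and SHA-256; not necessarily what the app quoted above did, and your
filesystem's actual block size will change the numbers):

    import hashlib, os, sys

    BLOCK_SIZE = 4096  # fixed-size blocks; pick whatever your filesystem uses

    def dedupable_bytes(root):
        """Walk a tree, hash every block, and count bytes that were
        already seen elsewhere, i.e. what block-level dedup could save."""
        seen, saved = set(), 0
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                try:
                    with open(os.path.join(dirpath, name), 'rb') as f:
                        while True:
                            block = f.read(BLOCK_SIZE)
                            if not block:
                                break
                            digest = hashlib.sha256(block).digest()
                            if digest in seen:
                                saved += len(block)
                            else:
                                seen.add(digest)
                except OSError:
                    continue  # skip unreadable files
        return saved

    if __name__ == '__main__':
        print("dedupable bytes:", dedupable_bytes(sys.argv[1]))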

~~~
radiowave
Yup, that's about the same proportion I found when I recently tried copying my
data across to a ZFS system with the dedup switched on.

I then decided to disable dedup, because it comes at a cost: the checksum data
(which would mostly live on the SSD read cache I had attached) was occupying
SSD space worth about three times as much, in money terms, as the conventional
disk space the duplicate data was taking up.

I noticed that the opendedup site (linked from the article) claims a much
lower volume of checksum data relative to the number of files, perhaps an
order of magnitude less than I observed with ZFS, but they seem to achieve
that by using a fixed 128KB block size, which brings along its own waste. (ZFS
uses a variable block size.) I haven't actually done the numbers here, but I
wouldn't be at all surprised to find that for my data the 128KB block size
would cost as much disk space as dedup was saving me. (YMMV, of course.)

~~~
dedward
Just curious: were you using the verify option? (Not related to your point, I
realize.)

I'm puzzled that people in general aren't more worried about data corruption
due to hash collisions...

~~~
radiowave
As it happens, in between first reading about ZFS dedup and finally trying it
out, I seem to have forgotten that the verify option existed. I just did "set
dedup=on"; beyond that, everything was whatever defaults you get on
OpenIndiana build 148.

Were I to ponder that matter to any great depth, I suspect I'd find it rather
difficult to get a handle on how concerned I ought to be about hash collision.
Perhaps that's part of the answer to your puzzlement.
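
That said, a back-of-envelope birthday bound gives at least an
order-of-magnitude feel (a sketch assuming a 256-bit hash such as SHA-256,
which ZFS uses for dedup by default):

    # Probability of at least one collision among n random 256-bit hashes
    # is roughly n*(n-1)/2 / 2**256, i.e. about n**2 / 2**257 (birthday bound).
    n = 2**40                 # ~10^12 unique blocks (~4 PiB at 4 KiB per block)
    p = n**2 / 2.0**257       # approximate collision probability
    print(p)                  # ~5e-54, far below undetected disk error rates

The verify option trades a read-and-compare on every dedup hit for not having
to trust that estimate at all.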

------
makmanalp
btrfs also has a deduplication feature in the works:
<http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg07720.html>

------
tobias3
I tested it and I don't recommend it (though that was about a year ago). It
was really slow, and some blog posts about the reliability of the data storage
backend were a little bit scary.

I would recommend using zfs-fuse instead. You don't have the FUSE -> file on a
filesystem -> hard disk indirection (thus more speed), and additionally you
get all the cool ZFS features! If you need even more speed there is a ZFS
kernel module for Linux and a dedup patch for btrfs. I don't think those are
production-ready though.

~~~
Dylan16807
I tried ZFS dedup, but there was something like a 20x slowdown writing files
compared to ZFS without dedup, and this was on under ten gigabytes of files. I
don't know if I somehow had the cache settings wrong or what the problem was,
but I didn't manage to fix it, even trying both the FUSE and kernel versions.
(On Ubuntu 11.04.)

~~~
tobias3
Yeah, random access on hard disks is awfully slow, and with dedup you can
cause lots of random access... With more than a little data, the hash table
used for dedup can also be too big to fit into memory; ZFS then puts it on
disk and it gets even slower. Luckily there is a feature to use SSDs as a
cache device in that case.

~~~
Dylan16807
The tricky part seems to be 'too big to fit into memory'. From what I
understood and calculated, the dedup tables on my system should have been well
under 100MB, and the amount of memory designated for metadata was over 350MB,
yet the performance was terrible.
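
For reference, the usual rule of thumb works out roughly like this (a sketch
assuming the commonly cited figure of ~320 bytes per unique block for the ZFS
dedup table; actual entry sizes vary with pool version and record size):

    # Rough ZFS dedup-table (DDT) size estimate.
    data_bytes  = 10 * 2**30       # ~10 GiB of files, as in the test above
    record_size = 128 * 2**10      # ZFS default recordsize (up to 128 KiB)
    entry_bytes = 320              # commonly cited rule of thumb per unique block

    blocks   = data_bytes // record_size
    ddt_size = blocks * entry_bytes
    print(ddt_size / 2**20, "MiB") # ~25 MiB

So a few tens of megabytes at most, consistent with "well under 100MB"; lots
of small files (at least one entry each) would push it up considerably.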

~~~
devicenull
Based on my testing (not published anywhere, sorry), ZFS dedup works best when
you enable compression. With compression, it's only slightly slower than
without dedup.

~~~
Dylan16807
I did have compression on. Good to know that in some cases dedup will perform
quite well. Was that with an SSD?

My best guess is that I either ruined the configuration in some way or dedup
and only dedup reacted horribly to being in a virtual machine.

~~~
dedward
ZFS is designed to have lots of horsepower and memory thrown at it: big
servers, spare CPU power, lots of ECC RAM. If there's going to be an SSD
allocated as a cache disk, it's probably expected to be huge and enterprisey
too...

ZFS is awesome, but some features will be disappointing unless you are dealing
with adequate resources.

------
alecco
I don't understand the complication of using a database. The sensible approach
would be something like BMDiff with [page] indexing on top for random access.

~~~
billswift
I remember a spate of academic articles a few (3-7?) years ago talking about
how all filesystems were going to be replaced by single huge databases to hold
all our "files"; maybe this is partly a continuation of that research.

~~~
alecco
I remember over a decade ago Microsoft was working on a FAT based filesystem
backed by one of their database engines.

------
wcoenen
lessfs appears to do block-level deduplication (like ZFS). This means that if
I copy a huge file but add a few bytes at the start, I won't get any benefit
from deduplication, because the data no longer aligns with the original block
boundaries.

I wonder if there is a way to improve on that?
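
The usual answer is content-defined chunking: instead of cutting at fixed
offsets, a rolling hash over a sliding window decides where chunks end, so
inserting a few bytes at the start only changes the first chunk or two and the
rest still dedups against the original. A toy sketch of the idea (hypothetical
parameters, simple polynomial rolling hash; real implementations use Rabin
fingerprints plus minimum/maximum chunk sizes):

    # Toy content-defined chunking (CDC): boundaries depend on content,
    # not byte offsets.
    WINDOW = 64            # bytes of context in the rolling hash
    PRIME, MOD = 31, 1 << 32
    TARGET_BITS = 13       # cut when low 13 bits are zero -> ~8 KiB average chunks

    def chunks(data):
        h, start, out = 0, 0, []
        pw = pow(PRIME, WINDOW, MOD)           # for sliding old bytes out
        for i, b in enumerate(data):
            h = (h * PRIME + b) % MOD
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * pw) % MOD
            if h & ((1 << TARGET_BITS) - 1) == 0:
                out.append(data[start:i + 1])  # boundary found here
                start = i + 1
        out.append(data[start:])
        return out

bup's hashsplit (linked upthread) does essentially this with an rsync-style
rolling checksum, and the Data Domain paper mentioned above uses
content-defined segments (on the order of 8 KB) for the same reason.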

