
Software Deduplication: Quick comparison of save ratings - linuxready
https://trae.sk/view/26/
======
vardump
> Method: I used 22.3GiB worth of Windows XP installation ISOs, 52 ISOs in
> total. No file was exactly the same, but some contained much duplicate data,
> like the Swedish XP Home Edition vs the Swedish N-version of XP Home
> Edition. I deduplicated these files and noted how much space I saved
> compared to the 22.3GiB.

So let me get this straight: he stores a bunch of CD ISOs, presumably with a block
size of 2048 bytes, on different dedup file systems without caring about the dedup
block size?

ZFS has a 128 kB recordsize by default, so it's little wonder it does so badly _in
this particular test_ without any tuning!

Windows has 4 kB blocks, so that's why it does so well. Doh.

He could have configured the other systems to use a different block size. A 2 kB
block would obviously be optimal; one should get the highest deduplication
savings with that size.

From the ZFS documentation: http://open-zfs.org/wiki/Performance_tuning#Dataset_recordsize

"ZFS datasets use an internal recordsize of 128KB by default. The dataset
recordsize is the basic unit of data used for internal copy-on-write on files.
Partial record writes require that data be read from either ARC (cheap) or
disk (expensive). recordsize can be set to any power of 2 from 512 bytes to
128 kilobytes. Software that writes in fixed record sizes (e.g. databases)
will benefit from the use of a matching recordsize."

So what happens if he sets the ZFS recordsize to 2 kB (assuming it can be done)?
OK, the dedup table will probably be huge, but... the savings ratio is what we
need to know.
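A back-of-the-envelope estimate of that dedup table, assuming roughly 320 bytes of in-core dedup table (DDT) entry per block. That per-entry figure is a commonly quoted ballpark, not an exact ZFS number:

```python
# Rough in-core dedup table (DDT) size for the 22.3 GiB test corpus.
# The 320 bytes per entry is an assumed ballpark, not an exact ZFS figure.
GiB = 1024 ** 3
DDT_ENTRY = 320  # assumed bytes per dedup table entry

data = 22.3 * GiB
for block in (2 * 1024, 128 * 1024):
    entries = data / block
    print(f"{block // 1024:3d} kB recordsize: {entries / 1e6:6.2f}M entries, "
          f"~{entries * DDT_ENTRY / GiB:.2f} GiB of DDT")
```

Under those assumptions, a 2 kB recordsize needs roughly 3.5 GiB of dedup table for this corpus, versus only about 55 MiB at 128 kB. Huge, but not impossible.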

> ZFS is another filesystem capable of deduplication, but this one does it in-
> line and no additional software is required.

Yup, ZFS is probably the best choice for online deduplication.

~~~
linuxready
In this kind of scenario, I expect the block size to have only a marginal
impact. Indeed, if all the CD ISOs are very similar, I would expect the size of
a duplicated chunk to be quite big on average. The difference between using
128k and 64k for BTRFS is, for instance, not very big.

But apart from the block size, I don't see any other explanation for the
differences.

Dedup is dedup, so I fail to understand why different implementations should
lead to such different results in the end (barring a very incorrect
implementation!).

~~~
vardump
The files on CDs are aligned to 2 kB boundaries. Dedup looks for contiguous
n kB blocks. If the block size of the material you want to dedup does not
match the block size of the dedup system, you'll get suboptimal results. The
bigger the mismatch, the worse the results.

Say you have this data:

    
    
      ABCABCBACCBBABCC
    

A dedup system with a block size of 1 can see that you really have just three
unique blocks: A, B and C.

The same data, but deduped with a block size of 2:

    
    
      AB CA BC BA CC BB AB CC
    

With a block size of 2, dedup thinks you have 6 unique blocks: AB, CA, BC, BA,
CC and BB.

Etc.
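The effect can be sketched in a few lines of Python. `unique_blocks` is just an illustration of fixed-block chunking, not any real dedup implementation:

```python
def unique_blocks(data: str, block_size: int) -> set:
    """Chop data into fixed-size chunks and return the set of unique chunks."""
    return {data[i:i + block_size] for i in range(0, len(data), block_size)}

data = "ABCABCBACCBBABCC"
print(sorted(unique_blocks(data, 1)))  # ['A', 'B', 'C']  -> 3 unique blocks
print(sorted(unique_blocks(data, 2)))  # ['AB', 'BA', 'BB', 'BC', 'CA', 'CC']  -> 6
```

The same 16 characters dedup down to 3 blocks or 6 depending only on the chunk size.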

~~~
linuxready
I'm sorry, I am not sure I get it. Let's say you have a 1000 kB file which is
duplicated and which is located on contiguous blocks (so if the CDs use 2 kB
boundaries, we'll have 500 contiguous blocks). If ZFS uses a 128 kB block size,
it will detect 7 blocks (896 kB) that it can deduplicate. So we only lose
about 10%.

Perhaps there is a high degree of fragmentation then, and files are not on
contiguous blocks?

(this example would be the same if instead of 2 exactly duplicated files, we
have a big common chunk between 2 files)

~~~
vardump
Wrong. If the alignment is wrong, you'll likely lose 100%. A 2 kB block can sit
at 64 different positions within 128 kB.

That 1000 kB run of contiguous 2 kB blocks must start at exactly the same
offset mod 128 kB in both files. There are 64 possible alignments.
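A small sketch of the alignment problem (matching per-block hashes is an assumption here, but it is how fixed-block dedup finds duplicates): the same 1000 kB payload is stored once starting on a 128 kB boundary and once shifted by only 2 kB.

```python
import hashlib
import os

BLOCK = 128 * 1024  # 128 kB, the ZFS default recordsize

def block_hashes(data: bytes) -> set:
    """Hash every fixed-size block; two blocks dedup only if their hashes match."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

payload = os.urandom(1000 * 1024)         # the 1000 kB duplicated chunk
aligned = os.urandom(BLOCK) + payload     # duplicate starts on a 128 kB boundary
shifted = os.urandom(2 * 1024) + payload  # duplicate shifted by just 2 kB

base = block_hashes(payload)
print(len(base & block_hashes(aligned)))  # 8: 7 full 128 kB blocks plus the tail
print(len(base & block_hashes(shifted)))  # 0: every 128 kB block now differs
```

Shifting the duplicate by just 2 kB (any one of the 64 possible misalignments) destroys every 128 kB match, which is why block alignment matters so much here.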

~~~
linuxready
Oh, that's it then! Thanks for the clarification.

