
A look at VDO, the new Linux compression layer - lima
https://www.redhat.com/de/blog/look-vdo-new-linux-compression-layer
======
d33
At first I was confused because I thought that someone miraculously
implemented compression with no context switching via VDSO. Funny to see how
wrong I was ;)

Could someone explain how VDO works to me? Based on the example, it looks like
another DM backend, like LUKS and such. It exposes a virtual device backed by
a real one, adding a layer of compression, am I reading it right? I can also
see that we're specifying a "logical size" that is exposed to the user. How
much space is really used and how is it allocated? Can I grow the logical size
later?

Apart from this - what is the status of the patch? Is it redhat-only, or is it
on its way upstream?

~~~
sweettea
> It exposes a virtual device backed by a real one, adding a layer of
> compression, am I reading it right?

And deduplication, yes.

> Can I grow the logical size later?

Yes; the man page for the program discusses the growLogical subcommand
briefly: https://github.com/dm-vdo/vdo/blob/6.1.1.24/vdo-manager/man/vdo.8#L281
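
For example, growing an existing volume looks roughly like this (a sketch; the
name and size are placeholders, flags as in the linked man page):

    # grow the logical (user-visible) size of an existing VDO volume
    vdo growLogical --name=vdo0 --vdoLogicalSize=20T

The filesystem on top then has to be grown separately (xfs_growfs, resize2fs,
etc.).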

> I can also see that we're specifying a "logical size" that is exposed to the
> user. How much space is really used and how is it allocated?

There's a vdostats command mentioned in the article that exposes how much
actual space is available and used.
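
For example, something like this shows the physical usage at a glance (a
sketch; check the vdostats man page of your version for the exact output
format):

    # physical space used vs. available on the backing store, plus space savings
    vdostats --human-readable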

> Apart from this - what is the status of the patch? Is it redhat-only, or is
> it on its way upstream?

It has not yet been submitted as a kernel patch.

~~~
lotyrin
If I specify a logical size, and that's larger than the physical backing size,
then what happens if someone attempts to write unique high-entropy data to
fill their logical space?

~~~
sweettea
VDO will return ENOSPC when it gets a write and has nowhere to store it. The
filesystem or other writer is responsible for passing ENOSPC up to the user,
just like if dm-thin were to run out of space.
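
You can see this directly by writing incompressible, non-duplicate data until
the backing store fills up (a sketch, assuming a VDO-backed filesystem mounted
at /mnt/vdo):

    # unique, high-entropy data: neither dedup nor compression can help, so the
    # physical space runs out long before the logical size is reached
    dd if=/dev/urandom of=/mnt/vdo/fill bs=1M oflag=direct
    # ...eventually fails with "No space left on device" (ENOSPC)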

~~~
d33
OK, that sounds like a convincing reason not to use this feature - it really
looks like the wrong abstraction. Consider the reasons why you could get
ENOSPC before:

1. Ran out of disk space,

2. Ran out of inodes,

3. Media reported ENOSPC and it gets propagated,

4. (something else I don't know about?)

So now, #3 gets more common. Is it actually something that FS devs think
about? "What if I get this weird error message?" Or is it something that could
trigger some destructive edge case in the filesystem and lead to serious data
loss? You mention dm-thin doing the same, so I guess that at least the popular
filesystems should handle it well.

Anyway, when you think about it, how do you debug #3? You checked the file
size, you checked the number of inodes, and other than that the FS gives you
no feedback. There's no central API for signalling this kind of problem and
suggesting solutions, and I would say that this makes systems much less
flexible. You could of course run vdostats (and whatever dm-thin uses to
report its resources), but just imagine the amount of delicate code you would
need to automatically solve this kind of issue. It's insane, it really looks
like crappy engineering to me.

~~~
sweettea
#3 is certainly not a historically common error from a disk, but thin
provisioning has been around for a while -- dm-thin was introduced in Linux
3.2 in 2012, and filesystems have, in my opinion, worked extensively since
then to handle ENOSPC correctly. Consider that storage returning ENOSPC at a
particular write and at every write thereafter is roughly equivalent to
storage suddenly refusing to write, and storage refusing to write is
equivalent to a system crash and reboot. Filesystems do work hard to recover
from a crash without losing anything beyond unflushed data, assuming the
storage was properly processing flushes, and this case should be very similar.

Both filesystems and VDO log extensively to the kernel log in an out-of-space
situation, so inspection via dmesg or journalctl hopefully leads swiftly to
identification of the problem. The 'dmeventd' daemon provides automatic
monitoring of various devices (thin, snapshot, and raid) and emits warnings
when the thin-provisioned devices it is aware of are low on space; there's a
bug filed against VDO to emit similar warnings [1]. Careful monitoring is
definitely important with thin-provisioned devices, though.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1519307
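
Until that lands, a trivial cron-style check against vdostats does the job (a
sketch; the column position and threshold are illustrative, so check the
output format of your vdostats):

    # warn when a VDO volume's backing store is more than 80% used
    # (vdostats default output mimics df: Device, 1K-blocks, Used, Available, Use%, ...)
    vdostats | awk 'NR > 1 && $5 + 0 > 80 { print $1 " is " $5 " full" }'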

------
faragon
For data compression it uses the LZ4 format [1] (a real-time LZ77-like data
compressor with no entropy coding, just string references and literals, using
small blocks so its LUT-based O(1) match search always stays in the data
cache).

[1] https://rhelblog.redhat.com/2018/02/05/understanding-the-concepts-behind-virtual-data-optimizer-vdo-in-rhel-7-5-beta/
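
It's easy to get a feel for the speed difference on the command line (assuming
the lz4 and gzip CLIs are installed; the numbers obviously depend on the
machine and the data):

    # LZ4's fast mode vs. gzip's fastest level on the same input
    time lz4 -1 -c somefile > /dev/null
    time gzip -1 -c somefile > /dev/null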

------
lima
This is great news!

At my employer, we're currently using btrfs in production for deduplication.
It's relatively stable nowadays, but fragmentation and maintenance are massive
issues for our use case.

~~~
lunchables
On Red Hat? Are you aware that Red Hat has deprecated btrfs? I believe btrfs
is still used heavily in SUSE.

~~~
ioayman
Exactly!

_" The Btrfs file system has been in Technology Preview state since the
initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving
Btrfs to a fully supported feature and it will be removed in a future major
release of Red Hat Enterprise Linux. The Btrfs file system did receive
numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will
remain available in the Red Hat Enterprise Linux 7 series. However, this is
the last planned update to this feature."_

Source: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.4_release_notes/chap-red_hat_enterprise_linux-7.4_release_notes-deprecated_functionality

------
jimmies
If you plan to use it for your personal traditional storage HDD, you have to
keep in mind that it will render most (if not all) data recovery tools
useless. Tools such as PhotoRec rely on file headers to determine the file
type and recover the data.

I have a long-term storage HDD that I thought I would never need data recovery
on, until I fucked it up good with a wrong partitioning command. Normally
those types of mistakes can be recovered from with TestDisk, but not that
time. I realized my only hope was to use PhotoRec to recover what I could from
the garbage that I had overwritten on the disk. Thanks to PhotoRec I recovered
most of the files, albeit with messed-up names. I was so thankful I hadn't
used any fancy compression algorithm.

Tools such as LVM, SSD backing storage, btrfs, compression and such are all
nice when you understand their limitations. For now, I don't use all of my
storage, so I just create ext3, ext4, and exFAT partitions to store my data
for the maximal chance of recovery, whether the failure is due to hardware or
software.

~~~
vardump
Using something like VDO for personal storage would be pretty pointless.

Unless, of course, you have a ton of similar virtual machines or other data
with a high amount of 4 kB block-aligned redundancy.

~~~
globalc
If your data directories have a lot of redundancy (so deduplication can help),
or are nicely compressible, then VDO can also make sense for home setups. As
with companies, you would consider things like "does it make sense for my kind
of data", "do I have enough CPU cycles/RAM to spend on getting storage usage
down to some degree" and "am I OK with the additional layer, i.e. a more
complicated setup".

------
zokier
Considering that it is based on blazingly fast lz4, taking twice as much time
as non-VDO in this simple test seems... bad? I suppose the dedup stage is
really expensive here. On a setup like this with simple compression, I would
have expected the results to be inverted.

~~~
rurban
I would have thought they would also consider zstd, because that would be even
better and more easily tunable. But I guess there are still some license
problems.

~~~
0xcde4c3db
Reading some blog posts and such about ZFS compression, I got the impression
that simpler algorithms such as LZ4 and LZO are typically preferred for
transparent storage compression. Presumably a balance must be struck between
the benefits of writing less data and the cost of taking CPU time away from
other code.

------
vardump
I wonder if you can run VDO on top of ZFS.

Or is there a better way to handle snapshots and data integrity (block level
checksums)?

Without block level checksums, a single corrupted data block could corrupt
half of your virtual machine images...

~~~
dmm
LVM provides CoW snapshots. But why would you use this instead of ZFS's native
compression or dedup?

~~~
vardump
How about block level checksums? Can LVM do that as well? Or is there some
other way?

ZFS dedup takes 320 bytes of RAM per unique block "record". So _1TB of RAM_ is
enough to dedup only a bit over 6 TB worth of unique blocks, when using a
block size that works well with virtual machines — 4 kB.

One can of course use a larger ZFS record size than 4 kB. But virtual machine
dedup savings drop very sharply as a result _if_ the record size does not
match the virtual machine's filesystem block size and alignment. This happens
because there are exponentially more combinations in which 4 kB blocks can be
arranged inside a bigger record.

It's painful to format all virtual machine images to use, say, a 64/128 kB
filesystem block size just to be able to use dedup with a larger ZFS record
size efficiently.

I understand that VDO dedup requires significantly less memory and uses only
4 kB blocks for dedup. This is ideal for VM storage.

~~~
AstralStorm
How about taking block-level checksums a few levels higher and finally
building a filesystem that has a Merkle-tree checksum?

~~~
anonuser123456
You mean like ZFS?

------
shmerl
So it's a layer under the filesystem and should in theory work with any
filesystem? So would XFS + VDO be better than, let's say, using compression in
BTRFS?

~~~
globalc
Yes, VDO is transparent to the layers above, like filesystems. XFS+VDO might
be preferable to BTRFS, if it has the features you need:

- LVM could be used for snapshots, but BTRFS snapshots might be preferable in
corner cases where many snapshots are taken (snapshots in XFS are being worked
on AFAIK; those could then be comparable)

- the XFS+VDO combination might be more reliable; it's a fully supported
combination on RHEL, whereas BTRFS is just a tech preview

- the only missing feature that comes to mind is checksums over data blocks;
XFS only provides checksums over metadata

~~~
LeoPanthera
How do filesystems cope with the apparent capacity of the disk changing? If
you put on a lot of data that compresses really well, suddenly your disk will
appear to have become larger.

~~~
globalc
The capacity is not changing 'rapidly'; instead, the whole VDO device is thin
provisioned. Let's assume a 1 TB hard disk: when creating a VDO device on top,
you can have it appear directly as 3 TB, i.e. 'thin provisioned'.

The filesystem on top always sees 3 TB, unless you explicitly modify the VDO
device. Of course, you have to monitor the VDO status tightly: if you happen
to store data on the filesystem which is absolutely unique, has no zeros and
is incompressible, then dedup/compression cannot do anything, and the VDO
device can only hold up to ~1 TB of such data. Your monitoring should detect
the low space on the VDO backend before that happens, and you should then
either stop writing or extend the backend device.
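
That 1 TB-disk-as-3 TB setup would look roughly like this (a sketch; device
names and sizes are placeholders):

    # create a thin-provisioned VDO volume: 1 TB backing disk, 3 TB logical size
    vdo create --name=vdo0 --device=/dev/sdb --vdoLogicalSize=3T

    # the filesystem on top sees 3 TB (-K skips the initial discard pass,
    # which is slow over a large thin device)
    mkfs.xfs -K /dev/mapper/vdo0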

~~~
LeoPanthera
What happens if you go over the limit? Does VDO reject writes? Or is your
filesystem lost?

~~~
globalc
With the current code, your filesystem becomes unmountable in that case. There
is a workaround for getting the data accessible again: leave a bit of the
underlying block device unused, i.e. when creating VDO on top, don't give it
the full block device. Then, in case the filesystem on top completely fills
up, the last part of the block device can be used to grow the VDO device, and
the filesystem can be mounted again.

This should be tried out before relying on it. The current behaviour seems to
be considered a bug; https://bugzilla.redhat.com/show_bug.cgi?id=1519377 has
details.
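
In practice that means putting VDO on something you can grow later, e.g. an
LVM volume that doesn't use the whole volume group (a sketch; the names and
sizes are made up):

    # keep some slack in the volume group when creating the backing LV
    lvcreate -n vdo-backing -L 900G myvg
    vdo create --name=vdo0 --device=/dev/myvg/vdo-backing --vdoLogicalSize=3T

    # if the volume ever fills up completely: grow the backing LV, then the VDO device
    lvextend -L +100G /dev/myvg/vdo-backing
    vdo growPhysical --name=vdo0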

------
jlarocco
The opening question made me laugh because I actually _do_ feel I have too
much storage.

I haven't bought a new hard drive in years, and I'm only using maybe 1.5 TB of
~3 TB, and most of that space is used up by RAW images from my DSLR. At the
rate I'm going, it'll be 2 or 3 years before I need a new drive.

~~~
sweettea
Personally, I do agree about hard drives. They're so enormous these days that
I'll never fill even a moderately sized one. I'm spoiled, though, and love my
SSDs; they're fast, smaller, and more expensive, and just a couple of VMs
quickly eat up a bunch of expensive space. With VDO, I personally run 20 VMs
instead of 8.

------
mindslight
Remember when the N in O(N) was the size of the problem, rather than the
quantity of O(lg N') indirection layers? Pepperidge Ph4rm remembers.

