
ZFS Is the Best Filesystem For Now - ingve
http://blog.fosketts.net/2017/07/10/zfs-best-filesystem-now/
======
floatboth
> ZFS never really adapted to today’s world of widely-available flash storage:
> Although flash can be used to support the ZIL and L2ARC caches, these are of
> dubious value in a system with sufficient RAM, and ZFS has no true hybrid
> storage capability.

How is L2ARC not "true hybrid"?

> And no one is talking about NVMe even though it’s everywhere in performance
> PC’s.

Why should a filesystem care about NVMe? It's a different layer. ZFS generally
doesn't care if it's IDE, SATA, NVMe or a microSD card.

> can be a pain to use (except in FreeBSD, Solaris, and purpose-built
> appliances)

I think it's just a package install away on many Linux distros? Also
installable on macOS — I had a ZFS USB disk I shared between Mac and FreeBSD.
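
On Ubuntu, for instance, it's literally one command (package name from the
16.04-era archives; the kernel module ships with Ubuntu's kernel):

    sudo apt install zfsutils-linux   # Ubuntu 16.04+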

Also it's interesting that these two sentences appear in the same article:

> best level of data protection in a small office/home office (SOHO)
> environment.

> It’s laughable that the ZFS documentation obsesses over a few GB of SLC
> flash when multi-TB 3D NAND drives are on the market

Who has enough money to get a multi-TB SSD for SOHO?!

~~~
vbezhenar
> I think it's just a package install away on many Linux distros? Also
> installable on macOS — I had a ZFS USB disk I shared between Mac and
> FreeBSD.

It's easy to install if you want to use it as an additional filesystem. But if
you want to install e.g. RHEL on root ZFS, it's quite an adventure. Even
Ubuntu, with first-class support for ZFS, does not support ZFS on root out of
the box. I honestly don't understand it. Of all the features, snapshots look
like the killer feature for Linux distributions: make a snapshot before an
upgrade, allow easy rollback if the upgrade goes wrong. Something like Windows
restore points, but much more reliable.
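
A minimal sketch of what I mean, assuming a root dataset named rpool/ROOT
(names vary by install):

    zfs snapshot rpool/ROOT@pre-upgrade   # instant, (near) zero cost
    apt-get dist-upgrade
    # if the upgrade goes wrong, from a rescue shell:
    zfs rollback rpool/ROOT@pre-upgrade   # rolls back to the most recent snapshot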

~~~
lscotte
ZFS root absolutely is supported on Ubuntu and Debian. Every Debian system
I've built in the past year or so has been 100% ZFS (around 6 physical
systems, but I've also built AWS AMIs this way). You have to install via
debootstrap, but it's definitely a working and supported configuration.
Installer support would sure be welcome though; it's not trivial!
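
Heavily abridged, the debootstrap route looks something like this (disk
names, pool layout, and release are placeholders; follow the ZoL HOWTO for
the real steps):

    zpool create -o ashift=12 -O mountpoint=none -R /mnt rpool mirror disk1 disk2
    zfs create -o mountpoint=none rpool/ROOT
    zfs create -o mountpoint=/ rpool/ROOT/debian
    debootstrap stretch /mnt
    zpool set bootfs=rpool/ROOT/debian rpool
    # ...then chroot in, install a ZFS-aware initramfs and GRUB, etc.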

~~~
kuschku
And that’s exactly the issue. You can’t just use Ubiquity to set it up, or
even the terminal; you have to follow a long, complicated, badly documented
path.

~~~
weitzj
If you use Proxmox (Debian-based), you do not need a terminal. And you get all
the nice Proxmox features. But I agree, standalone setup is no fun.

------
mixmastamyk
I've been disappointed in Linux filesystems and Intel hardware lately. There's
little integrity checking in ext4, and btrfs is still having growing pains. A
recent search for a svelte laptop with ECC memory yielded nothing. Sheesh,
wasn't this stuff invented like 30+ years ago?

I understand Intel is segmenting reliability into higher-priced business gear,
but as a developer who depends on this stuff for their livelihood, the current
status quo is not acceptable.

Linux should have better options, since profit margins are not an impediment.

~~~
dijit
Dell Precision 5520 with a Xeon supports ECC memory (up to 64G DDR4) and it's
the same size as an XPS.

~~~
thijsvandien
The CPU supports it but I still don't see the option to actually configure it
with ECC memory (which has been keeping me from buying one for quite a while
now).

~~~
dijit
I have one with ECC memory that's running my Arch Linux install.

I had to buy the memory myself but I was looking for "better" memory anyway.

~~~
Roritharr
So you bought it with 8GB non-ECC, plugged in 2x 32GB ECC DIMMs and it works?

I can't find those DIMMs; 16GB DIMMs are the biggest I can find.

~~~
dijit
Yep, I think I'm using memory that is not commercially available although I
didn't know it before.

I can't find them online.

For reference, they're Samsung branded; I can take a photo if you like. You
can see the "width" of the channel in Linux, which tells you whether you're
using ECC or not.

~~~
Roritharr
Wow, where did you get them? Is it an engineering sample?

------
peapicker
ZFS, at least on Solaris, has an issue with many simultaneous readers of the
same file, blocking after ~31 concurrent readers (even when there are NO
writers). Ran into this with a third-party library which reads a large TTF to
produce business PDF documents. The hundreds of reporting processes all slowed
to a crawl when accessing the 20MB Chinese TTF for reporting because ZFS was
blocking.

I can't change the code since it is third party. The only way I saw to easily
fix it was, on system startup, to copy the fonts under a new subdir in /tmp
(so in tmpfs, i.e. RAM, no ZFS involved at all) and then softlink the dir the
product was expecting to the new dir off of /tmp, eliminating the ZFS
high-volume multiple-reader bottleneck.
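
Roughly this, at boot (paths are made up; ours were product-specific):

    mkdir -p /tmp/fonts
    cp /opt/app/fonts/*.ttf /tmp/fonts/   # the 20MB font now lives in RAM
    mv /opt/app/fonts /opt/app/fonts.orig
    ln -s /tmp/fonts /opt/app/fonts       # product follows the symlink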

Never had this problem with the latest EXT filesystems on my volume groups on
my Linux VMs with the same 3rd party library and same volume of throughput.

~~~
nisa
If you can recreate the bug on Linux and have something like a simple
reproducer, I'd guess the ZoL devs are more than happy to fix it, or at least
understand it:

[https://github.com/zfsonlinux/zfs](https://github.com/zfsonlinux/zfs)

From reading the pull requests and issues on that repo, I've got the
impression that the next release, 0.7.0, will be quite a step forward, and
there seems to be quite sophisticated work going on to tackle performance
issues.

------
conductor
DragonFlyBSD's HAMMER [0] is another viable alternative.

Unfortunately, the next-generation HAMMER2 [1] filesystem's development is
moving forward very slowly [2].

Nevertheless, kudos to Matt for his great work.

[0]
[https://www.dragonflybsd.org/hammer/](https://www.dragonflybsd.org/hammer/)

[1]
[https://gitweb.dragonflybsd.org/dragonfly.git/blob_plain/HEA...](https://gitweb.dragonflybsd.org/dragonfly.git/blob_plain/HEAD:/sys/vfs/hammer2/DESIGN)

[2]
[https://gitweb.dragonflybsd.org/dragonfly.git/history/HEAD:/...](https://gitweb.dragonflybsd.org/dragonfly.git/history/HEAD:/sys/vfs/hammer2)

~~~
Veratyr
I had a look at HAMMER as it seemed it might meet my requirements [0] but I
couldn't figure out whether it supports replication or erasure coding. Don't
suppose anyone here knows?

[0]:
[https://news.ycombinator.com/item?id=14756787](https://news.ycombinator.com/item?id=14756787)

~~~
conductor
HAMMER does support replication, but it doesn't have erasure coding.

If you are using a recent Linux kernel, I can suggest using dm-integrity [0]
(optionally with dm-crypt) with your favorite filesystem. It's not erasure
coding, but it can help detect silent data corruption on the disk.

[0]
[https://old.lwn.net/Articles/721738/](https://old.lwn.net/Articles/721738/)

~~~
Veratyr
Hmm, it does seem like it should be possible to put mdadm on top of
dm-integrity devices. I might try that out.
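
Something like this, if I understand the tooling right (integritysetup ships
with cryptsetup 2.x; device names are placeholders):

    # give each member disk its own checksummed layer
    integritysetup format /dev/sdb
    integritysetup open /dev/sdb int-sdb
    # ...repeat for sdc and sdd, then build the array on top:
    mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        /dev/mapper/int-sdb /dev/mapper/int-sdc /dev/mapper/int-sdd
    mkfs.ext4 /dev/md0

The idea being that a checksum failure surfaces as a read error, which mdadm
can then repair from parity.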

------
Mic92
The article does not mention bcachefs as a future alternative:
[http://bcachefs.org/](http://bcachefs.org/)

~~~
reacharavindh
I’m certainly an excited audience for modern alternative filesystem options.
A quick look at the homepage of bcachefs reads:

“Not quite finished - it's safe to enable, but there's some work left related
to copy GC before we can enable free space accounting based on compressed
size: right now, enabling compression won't actually let you store any more
data in your filesystem than if the data was uncompressed.”

What is the point of having compression enabled if you can’t store more data
than you could if it was uncompressed? Shouldn’t they just say "the
compression mechanism works but isn't useful yet; as of now it is just extra
overhead"?

~~~
wtallis
Compression can give you a net performance improvement on slow disks if the
CPU time required to compress is less than the time saved by writing less data
to the disk.
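
A rough back-of-the-envelope with made-up but plausible numbers:

    # 1 GiB to a 100 MB/s disk, uncompressed:
    echo '1024/100' | bc -l               # ~10.2 s
    # same data at 2:1 compression, compressor running at 400 MB/s:
    echo '1024/2/100 + 1024/400' | bc -l  # ~7.7 s, even with no overlap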

------
Perseids
(Near) zero-cost snapshots and filesystem-based incremental backups are
amazing. Just today I was saved by my auto snapshots [1]. Apparently I didn't
`git add` a file to my feature branch and without the snapshot I wouldn't have
been able to recover it after some extensive resetting and cleaning before I
switched back to the feature branch. It's really comforting to have this easy
to access [2] safety net available at all times.

Now that Ubuntu has ZFS built in by default, I'm seriously considering
switching back, and since I too have been burned by Btrfs, I guess I'll stay
with ZFS for quite some time. Still, the criticism in the blog post is fair;
e.g. I was only able to get the RAM usage under control after I set hard lower
and upper limits for the ARC as kernel boot parameters
(`zfs.zfs_arc_max=1073741824 zfs.zfs_arc_min=536870912`).
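
On ZFS on Linux the same limits can also live in a modprobe config instead of
on the kernel command line:

    # /etc/modprobe.d/zfs.conf
    options zfs zfs_arc_min=536870912 zfs_arc_max=1073741824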

[1] [https://github.com/zfsonlinux/zfs-auto-snapshot](https://github.com/zfsonlinux/zfs-auto-snapshot)

[2] The coolest feature is the virtual auto mount where you can access the
snapshots via the magical `.zfs` directory at the root of your filesystem.
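
E.g., assuming /home is a ZFS dataset (the snapshot name is whatever your
snapshot tool generated):

    ls /home/.zfs/snapshot/
    cp /home/.zfs/snapshot/zfs-auto-snap_hourly-2017-07-10-1200/me/lost.c ~/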

~~~
rsync
"(Near) zero-cost snapshots and filesystem-based incremental backups are
amazing."

This.

We[1] offer ZFS filesystems in the cloud[2] and one of the nicest things to
explain to customers is that they don't have to think about "incrementals" or
"versions" or retention in any way. They can just do a "dumb rsync" to us
(mirror) and our ZFS snapshots, on their schedule, will do the rest.

In the event of a restore, the customer just browses right into "5 days
ago"[3] and sees their entire offsite filesystem as it existed 5 days ago.

[1] rsync.net

[2]
[http://www.rsync.net/products/platform.html](http://www.rsync.net/products/platform.html)

[3] rsync.net accounts have a .zfs directory

------
alyandon

      "Once you build a ZFS volume, it’s pretty much fixed for life."
    

The ease of growing/shrinking existing volumes and adding/removing storage is
why I made the decision to go with btrfs when I rebuilt my home file server.

~~~
phil21
This is only difficult with ZFS if you care about performance. If you are a
typical home file server user, you can add vdevs in a rather ad-hoc fashion as
they fill and it works pretty well.

There is a huge performance penalty of course, as the majority of new data
will reside on the newest vdevs - but it generally works.

I do agree this is one of the largest drawbacks of ZFS, but very few
filesystems get it right.
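
The ad-hoc expansion itself is a one-liner, for what it's worth (pool and
device names as examples):

    zpool add tank mirror /dev/sdc /dev/sdd   # bolt another mirror onto the pool

Existing data stays where it is; ZFS favors the emptier vdev for new writes,
which is where the skew comes from.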

~~~
wmf
_you can add vdevs in a rather ad-hoc fashion as they fill and it works pretty
well_

As long as they're expensive mirror vdevs, right? My impression is that home
users want the efficiency of RAID-6 and they want incremental expansion
(regardless of whether this combination is "good for them"). ZFS can't do
that.

~~~
floatboth
You should always use mirrors!
[http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/](http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/)
"don’t be greedy. 50% storage efficiency is plenty". Mirrors perform better,
perform MUCH better when degraded, and rebuild MUCH faster.

~~~
tscs37
Mirrors are also less safe. [0]

For a 6x8TB array, assuming an (optimistic) 10^-16 URE rate, you get a 3.5%
failure rate for a RAID5 array, 0.7% for a RAID10 array, and a 1.06e-08%
failure rate for RAID6.

Why be greedy for all that performance? Most home-grade NAS, or even some
business-grade NAS, isn't used for performance-sensitive operations: more
like Word documents and family pictures, stuff you _don't_ want to lose.

I'd rather take safety over performance here.

[0]: [https://redd.it/6i4n4f](https://redd.it/6i4n4f)

~~~
hvidgaard
RAID is for convenience and performance. It is not a backup and cannot
replace one. In any case, if you want safety, RAID1 is the way forward, not
RAID6. An 8-drive RAID6 with 8TB WD Red NAS drives (URE <1 in 10^14) is
virtually guaranteed to have at least one read error during a rebuild (if the
URE rate is true, which I believe it is not).

Regardless, the determining factor here is how much data you need to read to
rebuild the array after a failure. RAID1 wins every single time, because you
cannot read less than the single drive you need to replace.

~~~
tscs37
The URE rate is most likely much lower.

However, the chance of failure for a RAID6 of 100x10TB disks is less than
0.482% after 1'000'000 rebuilds.

RAID1 is space-inefficient: a 100x10TB RAID1 array might never fail, but it
has only 10TB of storage space.

A RAID10 array has a 14% failure chance with just 4x2TB disks, using 1 in
10^14 failure rates.

RAID1 and RAID10 are definitely not the way forward; they are less safe,
something that should be immediately apparent if you read the link in my
previous comment.

A 10-disk RAID6 with 10TB disks is more reliable than a RAID10 by multiple
orders of _magnitude_, and more space-efficient than a simple RAID1.

~~~
hvidgaard
Your math is completely off. A 100x10TB RAID6 with a failed disk needs to
read 990TB of data to rebuild. With a URE rate of 1 in 10^14, you will see
79.2 URE events on average during a single rebuild, if the rate is correct
(again, I don't believe it is) - this is the reason no serious engineer
recommends RAID6 for large arrays.

In the case of RAID1, no one uses 100 mirrored drives. You use RAID10, and in
the case of a failed disk, you must read 10TB to recover. With the same URE
rate, we'd see on average 8 UREs for every 10 rebuilds - around 2 orders of
magnitude lower failure rate compared to the RAID6 example.
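
For the record, the arithmetic, taking the 1-in-10^14 rate at face value:

    # expected UREs reading 990 TB at 1 error per 1e14 bits
    echo '990*10^12*8/10^14' | bc -l   # 79.2
    # vs. re-reading one 10 TB mirror partner
    echo '10*10^12*8/10^14' | bc -l    # 0.8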

~~~
tscs37
Your logic is sadly incorrect.

During a RAID6 rebuild, a URE is non-critical, as the array can recover the
data with one lost disk and a URE on any other disk during the stripe rebuild.

The only critical error would be UREs on two disks in the same stripe, and 80
UREs during a 990TB rebuild have an amazingly low chance of landing two UREs
on the same stripe on two separate disks.

In the case of the RAID10, you get 8 UREs over 10 rebuilds, which aren't
recoverable unless your mirrors have 3 disks. So you'll corrupt data.

edit: a URE rate of 1 in 10^14 is what most vendors specify for consumer hard
drives; 1 in 10^16 is closer to what people encounter in the real world, but
10^14 is considered the worst-case rate.

~~~
hvidgaard
Good point about the URE on a RAID6, but that still doesn't make it superior.
The strain of a rebuild has been known to kill many arrays, both RAID5 and
RAID6.

A URE does not have to corrupt data if you use a proper filesystem with
checksumming, such as ZFS.

When a disk fails, a RAID10 is simply in a far better position, as it only
has to read a single disk, and it doesn't have any complicated striping to
worry about. Just clone a disk.

~~~
tscs37
>A URE does not have to corrupt data if you use a proper filesystem with
checksumming, such as ZFS.

No, but AFAIK there is no way to recover data once ZFS has declared it
corrupted (i.e., without parity).

>The strain of a rebuild has been known to kill many arrays, both RAID5 and
RAID6.

I haven't actually encountered that yet. Despite that, a RAID6 _can_ lose a
disk, so as long as you don't encounter further UREs after losing another
disk, it's fine.

If you're worried about that, go for RAIDZ3 or equivalent. With something like
SnapRAID you can even have six parity disks, losing 6 disks without losing
data. The chances of that happening are relatively low.

>When a disk fails, a RAID10 is simply in a far better position, as it only
has to read a single disk

A RAID10 is in no position to recover from UREs once a disk has failed,
unless you reduce your space efficiency to 33%.

I personally favor not corrupting data over rebuild speeds.

Striping might be complicated, but that doesn't make it worse.

It might be acceptable to lose a music file, but once the family photo
collection gets corrupted or even lost because a disk in a RAID1 encountered
a URE, it's personal.

I'd rather live with the thought that even if a disk has a URE, the others
can cover for it. Even during a rebuild.

------
Veratyr
This might be somewhat off topic but I'm desperate. I've been looking for a
way to store files:

- Using parity rather than mirroring. I'm happy to deal with some loss of
IOPS in exchange for extra usable storage.

- That deals with bitrot.

- That I can migrate to without somehow moving all of my files somewhere
first (i.e. supports addition/removal of disks).

- Is stable (doesn't frequently crash or lose data).

- Is free or has transparent pricing (not "Contact Sales").

- Ideally, supports arbitrary stripe width (e.g. 2 blocks data + 1 block
parity on a 6-disk array).

Unfortunately it doesn't appear that a solution for this exists:

- ZFS doesn't support addition of disks unless you're happy to put a RAID0 on
top of your RAID5/6, and it doesn't support removal of disks at all when
parity is involved. It is possible to migrate by putting giant sparse files
on the existing storage, filling the filesystem, removing a sparse file,
removing a disk from the original FS, and "replacing" the sparse file with
the actual disk, but this is somewhat risky (a sketch of this follows the
list).

- BTRFS has critical bugs and has been unstable even with my RAID1
filesystem.

- Ceph _mostly_ works, but I always seem to run into bugs that nobody else
sees.

- I couldn't even figure out how to get GlusterFS to create a volume.

- MDADM/hardware RAID don't deal with bitrot.

- Minio has hard-coded N/2 data, N/2 parity erasure coding, which destroys
IOPS and drastically reduces capacity in exchange for an obscene level of
resiliency I don't need.

- FlexRAID either isn't realtime or doesn't deal with bitrot, depending which
version you choose.

- Windows Storage Spaces are slow as a dog (4 disks = 25MB/s write).

- QuoByte, the successor to XtreemFS, has erasure coding but has "Contact Us"
pricing _and_ trial.

- OpenStack Swift is complex as hell.

- BcacheFS seems extremely promising, but it's still in development and EC
isn't available yet.
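
The sparse-file migration trick mentioned above, sketched out (sizes and
device names invented; it's risky precisely because the new pool runs
degraded the whole time):

    truncate -s 8T /oldpool/fake.img          # sparse file stands in for a disk
    zpool create newpool raidz2 sdc sdd sde /oldpool/fake.img
    zpool offline newpool /oldpool/fake.img   # don't actually fill the old pool
    # ...copy everything over, destroy the old pool, then:
    zpool replace newpool /oldpool/fake.img /dev/sdb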

I'm currently down to fixing bugs in Ceph, modifying Minio, evaluating
Tahoe-LAFS and EMC ScaleIO, or building my own solution.

~~~
555h
You can probably achieve what you're looking for by stacking a few
filesystems. For example, you could create a separate ZFS pool/vdev with a
single full-disk zvol on each disk. Then use mdadm to create a RAID array of
the zvols. Then put ext4 (or whatever) on the mdadm array.

I've done something similar for the purpose of getting FDE with ZFS in linux.
It can be a little finicky, but it's definitely workable.

One ZFS-specific caveat (which may conflict with your desire to get high
storage efficiency): you may need to prevent your ZFS pools from filling up
too much [1]. You can either enable discard/TRIM on the whole stack, so the
top-level FS (e.g. ext4) can let ZFS know when a block is actually free, or
alternatively just limit your zvols to 85% (for example) of their respective
pools. The latter is my preference, because there was originally a bug with
discard in ZFS and it's not immediately clear whether it's totally fixed
(although my fstrim tests seemed to work out fine).

[1]
[https://www.reddit.com/r/zfs/comments/3vtur4/what_exactly_ha...](https://www.reddit.com/r/zfs/comments/3vtur4/what_exactly_happens_when_you_go_over_80_on/cxqu7uh/)
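
Concretely, the stack would look something like this (disk count, sizes, and
names all illustrative):

    # one single-disk pool per drive, with one big zvol on each
    zpool create p0 /dev/sda
    zfs create -V 3500G -o volblocksize=128K p0/vol   # leave ~15% of the pool free
    # ...same for p1 and p2, then:
    mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        /dev/zvol/p0/vol /dev/zvol/p1/vol /dev/zvol/p2/vol
    mkfs.ext4 /dev/md0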

~~~
Veratyr
Hmm, that actually sounds workable. I could even format the mdadm device as
ZFS too if I really wanted. I am somewhat worried about performance, though -
have you had any issues with that?

~~~
555h
I haven't had any issues with performance, but then again my requirement was
just "reasonable performance".

I ran some quick benchmarks (data below). Obviously this is far from rigorous,
but maybe it'll be useful. In previous tests I found that volblocksize=128K
was optimal for my stack -- which is why the last benchmarks use that setting.

Every additional ZFS filesystem in the stack may reduce storage efficiency
(minimum free space requirements [1]; metadata & checksum overhead [2][3]) --
that's why I used ext4 as the top layer instead of another ZFS.

[1] (as mentioned before)
[https://www.reddit.com/r/zfs/comments/3vtur4/what_exactly_ha...](https://www.reddit.com/r/zfs/comments/3vtur4/what_exactly_happens_when_you_go_over_80_on/cxqu7uh/)

[2]
[https://news.ycombinator.com/item?id=14756360](https://news.ycombinator.com/item?id=14756360)

[3] [https://forums.freenas.org/index.php?threads/what-is-the-exact-checksum-size-overhead.28187/#post-183802](https://forums.freenas.org/index.php?threads/what-is-the-exact-checksum-size-overhead.28187/#post-183802)

    
    
      Test setup:
       debian stable
       kernel 4.9.0-3-amd64
       zfs 0.6.5.9-5
       ZFS "pool": mirror with 2x 7200rpm drives
    
      Benchmark command:
       for i in `seq 1 10`; do sync; dd if=/dev/zero of=DEST bs=1M count=1024 conv=fdatasync; done
    
      zfs mirror -> dataset
       Data (MB/s): 125,115,104,135,148,170,135,151,118,119
       Mean (MB/s): 132.0
       Std.dev.: 19.9
    
      zfs mirror -> zvol (volblocksize=8K [default])
       Data (MB/s): 150,115,127,125,122,118,105,118,124,128
       Mean (MB/s): 123.2
       Std.dev.: 11.6
    
      zfs mirror -> zvol (volblocksize=128K)
       Data (MB/s): 68.5,112,115,114,94.3,85.1,83.1,98.4,120,108
       Mean (MB/s): 99.8
       Std.dev.: 16.9
    
      zfs mirror -> zvol (volblocksize=128K) -> luks -> ext4  (my stack)
       Data (MB/s): 130,94.4,109,139,138,125,94.9,124,134,133
       Mean (MB/s): 122.1
       Std.dev.: 16.8
    

edit: formatting

------
cryptonector
Illumos has a way to expand pools, FYI. IDK if that's in OpenZFS yet.

It works thusly: ZFS creates a vdev inside the new larger vdev, then moves all
the data from the old vdev to the new vdev, then when all these moves are done
the nested vdevs are enlarged.

What should originally have happened is this: ZFS should have been closer to a
pure CAS FS. I.e., physical block addresses should never have been part of the
ZFS Merkle hash tree, thus allowing physical addresses to change without
having to rewrite every block from the root down.

Now, the question then becomes "how do you get the physical address of a block
given just its hash?". And the answer is simple: you store the physical
addresses near the logical (CAS) block pointers, and you scribble over those
if you move a block. To move a block you'd first write a new copy at the new
location, then overwrite the previous "cached" address. This would require
some machinery to recover from failures to overwrite cached addresses: a table
of in-progress moves, and even a forwarding entry format to write into the
moved block's old location. A forwarding entry format would have a checksum,
naturally, and would link back into the in-progress-move / move-history table.

During a move (e.g., after a crash mid-move) one can recover in several ways:
you can use the in-progress-moves table as a journal to replay, or you can
simply deref block addresses as usual and, on checksum mismatch, check
whether you read a forwarding entry, or else check the in-progress-moves
table.

For example, an indirect block should not be an array of zfs_blkptr_t but
_two_ arrays: one of logical block pointers (just a checksum and misc
metadata), and one of physical locations corresponding to the blocks
referenced by the first array's entries. When computing the checksum of an
indirect block, only the array of logical block pointers would be checksummed;
thus the Merkle hash tree would never bind physical addresses. The same would
apply to znodes, since they contain some block pointers, which would then have
three parts: non-blockpointer metadata, an array of logical block pointers,
and an array of physical block pointers.

The main issue with such a design now is that it's much too hard to retrofit
it into ZFS. It would have to be a new filesystem.

~~~
raattgift
> Illumos has a way to expand pools ... ZFS creates a vdev inside the new
> larger vdev

Huh?

> IDK if that's in OpenZFS yet.

The openzfs tree (on github) is virtually identical to illumos-gate (on
github).

> physical block addresses should never have been part of the ZFS Merkle hash
> tree, thus allowing physical addresses to change without having to rewrite
> every block from the root down.

mahrens deals with this (and block pointer rewriting) here:

[https://www.youtube.com/watch?v=G2vIdPmsnTI#t=44m53s](https://www.youtube.com/watch?v=G2vIdPmsnTI#t=44m53s)

Even with SSDs IOPS are precious. On rotating media, burning track-to-track
seeks in reading and updating a large hash table is a bad plan (cf. the
deduplication table).

------
cmurf
_Btrfs might just become “the ZFS of Linux” but development has faltered
lately, with a scary data loss bug derailing RAID 5 and 6 last year and not
much heard since._

It was not per se a data loss bug. It was Btrfs corrupting parity during a
scrub when encountering already (non-Btrfs-caused) corrupt data: a data strip
is corrupt somehow, a scrub is started, Btrfs detects the corrupt data and
fixes it through reconstruction from good parity, but then sometimes computes
a new, wrong parity strip and writes it to disk. It's a bad bug, but you're
still definitely better off than you were with the corrupt data. Also, this
bug is fixed in kernel 4.12.

[https://lkml.org/lkml/2017/5/9/510](https://lkml.org/lkml/2017/5/9/510)

Update, minor quibbles:

 _lacking in Btrfs is support for flash_ - Btrfs has such support and
optimizations for flash. The gotcha, though, if you keep up with Btrfs
development, is that there have been changes in FTL behavior, and it's an
open question whether these optimizations are effective for today's flash,
including NVMe. As for hybrid storage, that's the realm of bcache and
dm-cache (managed by LVM), which should work with Btrfs as with any other
Linux filesystem.

 _ReFS uses B+ trees (similar to Btrfs)_ - XFS uses B+ trees; Btrfs uses
B-trees.

------
gulikoza
The thing I'm struggling with is 4K sector support. It's horribly inefficient
with ZFS: RAIDZ2 wastes a ton of space when the pool is made with ashift=12,
and everybody knows 512e on AF disks is horribly slow... so ZFS is either
very slow or wastes 10% of total space. Or both (ZVOL :D).

According to some bug reports, nobody has touched this since 2011...

~~~
iooi
Can you elaborate on how ZFS wastes 10% of total space?

I recently set up a ZFS volume using 12x4TB drives in RAID-Z2, so I expected
40TB of usable space, or ~36.3TiB. However, I only see 32TiB of usable space
on the volume. I always wondered why that was; never figured it out.

~~~
gulikoza
There are a ton of sites when you google ashift=12 -
[http://louwrentius.com/zfs-performance-and-capacity-impact-of-ashift9-on-4k-sector-drives.html](http://louwrentius.com/zfs-performance-and-capacity-impact-of-ashift9-on-4k-sector-drives.html)
or
[https://github.com/zfsonlinux/zfs/issues/548](https://github.com/zfsonlinux/zfs/issues/548)
for instance.

Basically, ashift=12 increases the ZFS block size to 4K. Metadata uses full
blocks that would be 512B on ashift=9 but are now 4K (due to ashift=12), so
it wastes at least 3.5KB more than a normal 512-byte block for each block
that is not filled entirely.
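
You can check which way a pool went after the fact (zdb output format varies
a bit by platform):

    zpool create -o ashift=12 tank raidz2 sdb sdc sdd sde sdf sdg
    zdb -C tank | grep ashift   # 9 = 512-byte sectors, 12 = 4K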

~~~
sfoskett
I didn't even consider this. Thanks for the explanation! Makes tons of sense!

Incidentally, many modern filesystems (including NTFS) store very small files
in the FAT rather than taking up a whole block for this very reason!

~~~
chungy
_cringes at "in the FAT" instead of "MFT"_....

ZFS has this same feature, however, as long as feature@embedded_data=enabled
;)
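
Enabled per pool, e.g.:

    zpool set feature@embedded_data=enabled tank
    zpool get feature@embedded_data tank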

~~~
sfoskett
I guess I'm an old storage guy. I call it the FAT on everything! :-)

~~~
takeda
Yeah, different filesystems call things differently: FAT, MFT, inodes.

Best is to just call it metadata :)

------
fulafel
He is talking about the "best level of data protection in a small office/home
office (SOHO) environment".

Trying to do this with FS features is misguided.

You need to have backups, and have regular practice in restoring from backups.

Some organizations need fancy filesystems in addition to backups, because
they want high availability that will bridge storage failures. But that has a
high cost in complexity; you should only consider it if you have IT/sysadmin
staff and the risk management says it's worth the investment in cognitive
opportunity cost, IT infrastructure complexity, and time spent.

~~~
kev009
Negative. You won't know you _need_ to use the backups without these FS
features, and by the time you finally do, you could have rotated through
them.
~~~
fulafel
The article didn't mention backups at all. If a SOHO environment can afford
only either backups or a ZFS storage system, choosing backups leaves much less
residual risk on the table.

Yes, there is still a risk that corrupted data may end up in backups, but
that's true even with ZFS. Ideally you want end-to-end integrity checking and
verification; that means the application layer, and it should also be done
for backups. But like with all risk management, there are diminishing
returns...

~~~
kev009
That is the most contrived nonsense I've heard in a long time. You can't
afford _not_ to use ZFS: it works fine on a single disk, and at least you'd
know your data had mutated.

~~~
tscs37
Knowing your data has rotted doesn't bring it back.

~~~
kev009
But it does allow you to treat it with suspicion and human judgement. If it's
an album, you go re-rip or download it. If it's medical data, you don't use
it.

~~~
tscs37
And if it's that one holiday album of images, it's down the toilet forever.

------
throw2016
The filesystem, as basic infrastructure, has to be robust and fuss-free. The
complex stuff is going to be built on top of that.

After years of btrfs, I realized that while all the features around
snapshotting, send/receive etc. are great, the cost in performance and other
issues is too high.

And using plain old ext4 is more often than not the best compromise, so you
can just about forget the fs and focus on higher layers.

~~~
c3833174
The nice thing about btrfs vs. ZFS is that you can just use it like a normal
filesystem, ignoring all the advanced features, and still get the benefit of
checksumming (plus duplicated metadata by default on spinning disks) and
compression.

~~~
throw2016
The problem with btrfs and cow in general is poor performance for databases
and overall, in some cases significantly, slower performance than ext4. ZFS
has high memory requirements.

If your use case mainly revolves around the benefits of snapshots then it
definitely makes sense.

------
Cieplak
On my current laptop, I'm seeing a 20% reduction in disk usage relative to the
filesystem size because of ZFS's built-in compression.

~~~
floatboth
20%? That's weak :P I have 3.32x refcompressratio on my /home partition in my
dev VM (using gzip-7 here).

~~~
loeg
You might look into using xz (LZMA) or zstd in place of gzip. Gzip offers
pretty poor compression per unit of CPU time compared to these newer options.

[https://clearlinux.org/blogs/linux-os-data-compression-options-comparing-behavior](https://clearlinux.org/blogs/linux-os-data-compression-options-comparing-behavior)

~~~
simcop2387
ZFS doesn't support either of those yet, unfortunately. I'd love for zstd to
be available, given its benefits in speed and compression ratio.

~~~
loeg
[https://reviews.freebsd.org/D11124](https://reviews.freebsd.org/D11124) :)

Or use lz4 while you wait.
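
It's a per-dataset switch, e.g. (dataset name as an example):

    zfs set compression=lz4 tank/home
    zfs get compressratio tank/home   # only newly written data is compressed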

------
thibran
Another future alternative, TFS: [https://github.com/redox-os/tfs](https://github.com/redox-os/tfs)

------
carlob
> Many remain skeptical of deduplication, which hogs expensive RAM in the
> best-case scenario. And I do mean expensive: Pretty much every ZFS FAQ
> flatly declares that ECC RAM is a must-have and 8 GB is the bare minimum. In
> my own experience with FreeNAS, 32 GB is a nice amount for an active small
> ZFS server, and this costs $200-$300 even at today’s prices.

I use nas4free with much less ram…

~~~
eriknstr
The massive-amounts-of-RAM recommendation applies if you need to do
deduplication. Are you doing that? If not, then you don't need a lot of RAM.

Between the low cost of storage and alternative solutions for deduplicating
data, I personally don't use the built-in deduplication functionality of ZFS
for my zpools. Might come down to what sorts of data you are storing, though.

~~~
sfoskett
Yup. This. Now that I have 32 GB of RAM in my FreeNAS box I decided I really
didn't need dedupe after all. I just don't have that much duplicated data, and
I've got 60 TB of HDDs in the box. So I use the RAM for VMs instead!

~~~
agumonkey
Yeah, dedup is mostly useful for massively multiuser setups - say mail, file
sharing, etc. For small setups you can do a `fdupes` pass once a day and fix
it yourself, I suppose.
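
E.g.:

    fdupes -r /tank/shared     # report duplicate files
    fdupes -rdN /tank/shared   # keep the first copy of each set, delete the rest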

------
jerry40
Does anybody use ZFS as a replacement for database backup/restore in a test
environment? I'm not sure, but it seems that it should be possible to use ZFS
snapshots to quickly restore a previous database state. Note: it's just a
question, I'm not advising anyone to try that.

~~~
orf
File system snapshots of databases are not necessarily consistent, and can not
always be restored like that.

~~~
floatboth
Atomic snapshots like ZFS's are always consistent for Postgres. I guess other
databases with a similar write-ahead log can be snapshotted as well?

~~~
anarazel
> Atomic snapshots like ZFS's are always consistent for Postgres.

As long as you make sure to only use one filesystem, i.e. you don't place
pg_xlog or some tablespaces on a different filesystem. You can get very weird
corruption in such cases :)

~~~
olavgg
With ZFS you can also do atomic snapshots of multiple filesystems.
[https://serverfault.com/questions/608223/is-zfs-snapshot-r-of-several-pools-atomic](https://serverfault.com/questions/608223/is-zfs-snapshot-r-of-several-pools-atomic)
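
So the test-environment workflow asked about upthread could look like this
(dataset names invented; assumes the data and WAL directories live under one
parent dataset):

    zfs snapshot -r tank/db@clean     # -r: atomic across tank/db and its children
    # ...run the destructive tests, then:
    pg_ctl stop -m fast
    zfs rollback tank/db/data@clean   # roll back each child dataset
    zfs rollback tank/db/wal@clean
    pg_ctl start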

------
Koshkin
A logical issue that I have with the existence of such filesystems as ZFS and
BTRFS is that the problem of "bit rot" should be addressed at a lower
abstraction level - hardware or the driver - rather than at the level that
should be primarily responsible for user-visible organization of files,
directories, etc.

~~~
dragontamer
How?

Bitrot occurs because the lower-abstraction level hardware fails. When you put
a Hard-Drive into storage for say 5 years, the bits may change. Even if a
Hard-drive remains in constant use for 5 years... if said files or directories
aren't checked and double-checked constantly, the error-correction codes may
fail over time.

Its a fundamentally different problem from Hard Drives that are being used
constantly as say Swap.

Hard Drives typically include Hamming codes or ECC bits to address typical
corruption issues.

\-------------

The fundamental principle at hand here is as follows: to ensure the integrity
of files, you need to _regularly_ check file data. Only the filesystem knows
which files were recently checked.

~~~
Koshkin
Couldn't the drive's firmware or the driver do the same just as well (except
on physical records instead of files)?

~~~
dragontamer
First off, real hard drives have "SMART" data that detects (and automatically
corrects) simple errors. So remember, hard drives ALREADY have a large degree
of error correction built in. Its just not enough for serious data-storage
purposes.

The "Bit-Rot" scenario is particularly harmful to RAID5 (Minimum 3-hard
drives. Two contain data, one contains "parity" that can fix any errors on the
other hard drives. Then the parity is structured to be striped equally across
the three drives). Modern RAID drivers can do this rather easily.

The problem with "bit rot" is that a RAID5 array will not rebuild itself
until it detects an error. If you're reading files along and, all of a
sudden, the hard drive detects an error - no problem (in the typical case),
just rebuild the data from the parity.

However, "bit rot" means that the parity bits (on the 3rd, backup hard drive)
have ALSO rotted away.

\----------

The only way to fix this bit-rot error is to constantly read through your
data and CONSTANTLY check for bit rot. No hard drive is going to silently
spin and hamper the performance of the system for self-verification
purposes... but a filesystem / operating system can schedule these "scrubs"
to occur during periods of low I/O.

Which is how ZFS and Windows' ReFS work. When your computer is idle, the OS
checks for bitrot. When the computer starts to work again, it pauses the
"low priority" bitrot checks and serves the data.

\-------

ZFS doesn't quite work like Windows' ReFS, though. ZFS simply checks for bit
rot whenever a file is accessed - every time. There are "zfs scrub" commands
(which you can put into a cron job) to read every file (and therefore check
for bitrot).
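
E.g. a monthly entry in root's crontab, plus a status check:

    0 3 1 * * /sbin/zpool scrub tank
    # later: shows repaired bytes and per-device checksum error counts
    zpool status -v tank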

------
Quequau
I have to wonder what's going to happen once those storage-level
random-access non-volatile memory technologies finally make it out of R&D and
into the market.

I mean, as it is now, it seems like we have a hard enough time dealing with
comparatively simple hybrid memory systems.
------
snakeanus
I am really excited for bcachefs. It is also the only fs that has support for
ChaCha20-Poly1305 encryption.

~~~
foepys
I wonder if a non-hardware-accelerated encryption algorithm is good enough
for a FS that also has checksums. The CPU is already busy with checksumming,
so doesn't this considerably slow down writes?

~~~
conductor
Both ChaCha20 and Poly1305 are optimized (by design) for running on
general-purpose CPUs. AES-GCM using AES-NI instructions is still faster, but
not by that much [0].

[0]
[https://community.qualys.com/thread/16005](https://community.qualys.com/thread/16005)

------
moonbug22
I'll stick with GPFS, thanks.

~~~
eriknstr
OP is not suggesting ZFS above _all else_.

> And every enterprise system has already moved way past what ZFS can do,
> including enterprise-class offerings based on ZFS from Sun, Nexenta, and
> iXsystems.

However, OP is naturally suggesting ZFS over NTFS, HFS+, ext3/4 and even ReFS
and APFS.

> Still, ZFS is way better than legacy storage SOHO filesystems. The lack of
> integrity checking, redundancy, and error recovery makes NTFS (Windows),
> HFS+ (macOS), and ext3/4 (Linux) wholly inappropriate for use as a long-term
> storage platform. And even ReFS and APFS, lacking data integrity checking,
> aren’t appropriate where data loss cannot be tolerated.

~~~
MikusR
ReFS has data integrity checking and repair.

~~~
sfoskett
...but ReFS File Integrity is disabled by default! Integrity streams are
enabled for metadata only. Here's how to turn it on:
[https://technet.microsoft.com/en-us/library/jj218351%28v=wps.630%29.aspx?f=255&MSPPError=-2147217396](https://technet.microsoft.com/en-us/library/jj218351%28v=wps.630%29.aspx?f=255&MSPPError=-2147217396)

------
zzzcpan
Ceph, Gluster, object stores - all would do a better job serving SOHO. ZFS is
a 90s way of thinking about storage, "a box" way. I don't think it deserves
any of that HN hype.

~~~
KaiserPro
_ahem_ gluster? ceph? really?

ceph is horrifically slow, and only has a rudimentary posix interface.

gluster is just, well, terrible.

First things first: a SOHO office cannot support a clustered filesystem,
unless one of the people happens to be a storage specialist.

Yeah, I hear lots of noise about how self-healing they both are. That is
mostly fancy talk for "I don't have backups".

Supposedly they both do HA. But yeah, it's not something I'd want to support.

If you look at gitlab's setup, and their proposed setup
([https://about.gitlab.com/2016/12/11/proposed-server-purchase-for-gitlab-com/](https://about.gitlab.com/2016/12/11/proposed-server-purchase-for-gitlab-com/)):
32 file servers to serve ~480TB of disk. Seriously? 4 file servers, GPFS and
4 MD3060e enclosures - that's about a petabyte of usable storage with a
streaming throughput of about 40 gigabits a second.

"Modern" clustered filesystems are mainly just toys. If you want speed, use
Lustre; if you want sexy software-defined awesomeness with next-level RAID,
use GPFS.

