
About ZFS Performance - okket
https://www.percona.com/blog/2018/05/15/about-zfs-performance/
======
linsomniac
ZFS is an industrial-scale technology. It's not a scooter you just hop on and
ride. It's like a 747, with a cockpit full of levers and buttons and dials. It
can do amazing things, but you have to know how to fly it.

I've run ZFS for a decade or more, with little or no tuning. If I dial my
expectations back, it works great, much better than ext4 for my use case. But
once I start trying to use deduplication, I need to spend thousands of dollars
on RAM or the filesystem buckles under its weight.

My use case is storing backups of other systems, with rolling history. I tried
the "hardlink trick" with ext4, but the space consumption was out of control:
small changes to large files (log files, ZODB) caused duplication of the whole
file. And managing the hard links took amazing amounts of time and disk I/O.
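
For reference, the usual form of the hardlink trick is rsync's --link-dest,
which hardlinks unchanged files against the previous backup; a minimal sketch
with hypothetical host and paths:

# rsync -a --link-dest=/backups/host1/2018-05-14 host1:/ /backups/host1/2018-05-15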

ZFS solved that problem for me. I just wish I could do deduplication without
needing 64GB of RAM. But I take what I can get.
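
If you want to gauge the RAM cost before enabling dedup, zdb can simulate the
dedup table against an existing pool. A minimal sketch, assuming a pool named
tank with a backups dataset (names hypothetical); the commonly cited figure is
roughly 320 bytes of RAM per unique block:

# zdb -S tank                    # simulate dedup and print a DDT histogram
# zfs set dedup=on tank/backups  # enable only if the estimated DDT fits in RAM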

~~~
zymhan
I totally agree with your comment, except for the last bit about 64GB of RAM
being unreasonably high.

Is it unreasonable to spend thousands of dollars on memory for enterprise-
grade, production-level servers? In my experience, you're almost certainly
better off using over a hundred GB of RAM in a server if you want to maximize
overall compute density.

To be clear, plenty of desktops and all workstation-class laptops support
64GB+ of RAM.

~~~
linsomniac
I think that agrees with my "industrial grade" comment. :-) In production we
are currently deploying boxes with 256GB of RAM. But for our backup server,
it's hard to spend more on RAM than on the disks in it. The box is capable of
it; it's just the cost that makes it unappealing for this use.

For various reasons, the backup server doesn't get much priority.

------
fgonzag
The comparison isn't apples to apples, though. You'd have to set up bcache or
dm-cache with the NVMe drive in front of XFS to compare against ZFS with L2ARC
on an NVMe drive. The article describes bcache as an exotic technology, but it
is generally considered stable.
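
A minimal bcache sketch for that comparison, assuming /dev/sdb is the backing
disk and /dev/nvme0n1 the cache device (hypothetical names; this reformats
both):

# make-bcache -B /dev/sdb -C /dev/nvme0n1    # create and attach backing + cache
# mkfs.xfs /dev/bcache0                      # put XFS on the combined device
# echo writeback > /sys/block/bcache0/bcache/cache_mode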

The point still stands: ZFS is fast enough 99.99% of the time (when tuned
correctly), and it simplifies a lot of administrative tasks.

~~~
ofrzeta
> bcache is generally considered stable

Is it? How about the recent bug in Linux 4.14 that could lead to data loss
with bcache (fixed in 4.14.2)? On the other hand ZFS for Linux also had a
regression recently.

~~~
RX14
To be fair to bcache, that bug wasn't in the bcache code. It was in core
kernel bio code and could easily have affected other subsystems.

------
skywhopper
The point to take away from this is the entirely unsurprising conclusion that
locally-attached unthrottled ephemeral disk has higher throughput than
network-attached IOPS-throttled EBS. So yeah, sure, if you take the time to
pre-cache all your EBS-stored data onto the local ephemeral drive and only do
reads from the cache, then you will get more query throughput on the local
ephemeral drive than on the remote EBS drive. But I'm not sure what that is
supposed to tell us about ZFS or XFS.

~~~
boomboomsubban
The post explains how ZFS functions and how you can tune it for the best
performance. The author notes that the benchmarks are not fair; they are there
to show how much improvement you could potentially gain from tuning ZFS.

The title is "About ZFS performance," not "An in depth comparison of ZFS and
XFS."

------
radiowave
For a transaction processing database workload, it's generally good practice
to run with "zfs set primarycache=metadata", which means that ZFS won't
attempt to cache anything except metadata. This might have reduced the 15%
cache overhead the author observed.
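
Assuming the data lives in a dataset named tank/mysql (a hypothetical name),
that looks like:

# zfs set primarycache=metadata tank/mysql
# zfs get primarycache tank/mysql      # verify the setting took effect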

I'll be interested to read the future work on using larger page sizes.
Conventional wisdom holds that it's a bad idea (at least for a write-heavy
workload) because of the write amplification it produces, but it sets me
wondering to what extent ZIL offloading could mitigate that. Then again,
there's really no point in putting the ZIL on ephemeral storage.
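
For reference, ZIL offloading here means attaching a separate log device
(SLOG) to the pool; a minimal sketch, assuming a pool named tank and
hypothetical NVMe devices (mirrored, since losing the SLOG can lose in-flight
synchronous writes):

# zpool add tank log mirror nvme0n1 nvme1n1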

------
gigatexal
ZFS is such a delight to work with. I learned something new in this article.
Thanks for linking it.

~~~
rsync
It may interest you to know that in addition to running on ZFS, the rsync.net
platform also has the option to 'zfs send' and 'receive' (over SSH):

https://arstechnica.com/information-technology/2015/12/rsync-net-zfs-replication-to-the-cloud-is-finally-here-and-its-fast/
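
The basic shape of that workflow, with hypothetical dataset, snapshot, and
account names:

# zfs snapshot tank/data@2018-05-15
# zfs send tank/data@2018-05-15 | ssh user@rsync.net zfs recv data/backup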

~~~
jlgaddis
I normally wouldn't care that you guys slip a thinly veiled advertisement into
any HN discussion that concerns ZFS but, geez, c'mon guys, this is really
pushing it.

At the very least, throw in a mention of the HN discount.

~~~
gigatexal
Agreed.

------
viraptor
> ZFS is much less affected by the file level fragmentation, especially for
> point access type.

I'm disappointed they didn't show that comparison. How much does the original
result change in the case of fragmented db files?

~~~
ivan78
Actually, it's quite the opposite. ZFS is a copy-on-write filesystem: if you
write in the middle of a file, the data block gets moved to a new place on
your disk. Under a typical database load, your db files get more and more
fragmented over time.

------
z3t4
"RAIDZ" seems to be most popular, but if you want better performance go with
mirrors. Then let yours VM's have their own FS on top of zvol. For example
NTFS on top of a zvol you can get GB rw/s even with spinning HDD's.
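
A minimal sketch of that layout with hypothetical device names: a pool of
striped mirrors, plus a zvol handed to a VM as a raw block device:

# zpool create tank mirror sda sdb mirror sdc sdd
# zfs create -V 200G tank/vm1     # exposed as /dev/zvol/tank/vm1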

~~~
equalunique
If you have 24 disks for a pool, then striping across four 6-disk RAIDZ2 vdevs
works well too.
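
In ZFS terms that is one pool striped across four RAIDZ2 vdevs; the first two
vdevs of a sketch with hypothetical device names (repeat the pattern for the
remaining twelve disks):

# zpool create tank raidz2 sda sdb sdc sdd sde sdf raidz2 sdg sdh sdi sdj sdk sdl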

------
ivan78
Another interesting next-gen file system project is Red Hat's Stratis:
[https://stratis-storage.github.io/](https://stratis-storage.github.io/)

After dropping btrfs support some time ago, they started developing their own
next-gen file system based on LVM and XFS. It is now available as a technology
preview in Fedora 28.

------
dekhn
I read this article and took away the conclusion that to get acceptable
performance from ZFS compared to XFS, I have to do extensive tuning and throw
in a half terabyte of NVMe storage as a cache.

Not very impressive.

~~~
viraptor
I agree they didn't present it very well. But in reality, if you want to get
as much as possible out of your database, you'll need to look at tuning at
that level anyway, regardless of the database or the filesystem you use.

I expect that XFS with bcache would need much less tuning up front though.
Adding bcache in front of the original configuration should give a similar
improvement to L2ARC.
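
For comparison, the ZFS side is a one-liner: attaching an L2ARC device to an
existing pool (pool and device names hypothetical):

# zpool add tank cache nvme0n1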

------
acd
You can also enable LZ4 compression on the ZFS dataset itself for even faster
performance.

# zfs set compression=lz4 mysqldatavol
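
You can then check how much space the compression is actually saving (using
the dataset name from the example above):

# zfs get compression,compressratio mysqldatavol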

~~~
nixgeek
At the cost of CPU time to do the work.

~~~
BrainInAJar
Your CPU spends most of its time idle, and your disks are tens of thousands of
times slower. For systemic performance, if you can trade almost anything for
clock cycles, you should do that.

------
damm
The trend of crappy posts from Percona continues; they must suddenly want
attention.

Percona was best at consulting; their DBAs were worth it. The software may be
close to irrelevant now, but not their DBAs.

------
jsgo
This is the worst GDPR (or whatever) implementation I've seen yet on mobile:
the button to accept (I assume it is to accept) renders underneath the fixed
social buttons at the bottom. And the modal is fixed, too, so no amount of
scrolling exposes it.

------
frozenport
Been doing hardware RAID for years. Went for the ZFS meme, got half the
performance, and even that modicum required unnaturally deep queues. "Lost"
10k+ of my employer's money.

How many drives are these guys using, and how does it scale compared to
theoretical performance?

