
ZFS from a MySQL perspective - rbanffy
https://www.percona.com/blog/2017/11/15/zfs-from-a-mysql-perspective/
======
chungy
Clarification on the deduplication section: ZFS doesn't automatically hardlink
files (at least not in the sense you'll ever see a link count increase on your
files), instead it checksums blocks and if two blocks are identical, it'll
point both references to the same on-disk block.

It might sound like a small difference, but it does keep the files independent
at the VFS layer. You can modify a file without fear of unintentionally
modifying other files at the same time, which would happen if they were hard
linked.

------
sethhochberg
We use MySQL on ZFS pretty extensively on my current team for both "hot"
application database servers and warm storage / reporting / archival systems.
Easy expansion, free snapshotting, great IO performance, self-healing, and
first-class monitoring from within the OS made it a logical choice for us
when we were designing new database servers. We also use ZFS-on-Linux
extensively on a large storage cluster for multimedia, so it fit into our
existing ecosystem and we wanted to standardize as much as possible.

Through our own research and experimentation we settled on a few critical
tweaks to make InnoDB and ZFS play nicely together... the highlights are:

\- disable the innodb doublewrite buffer; ZFS protects against partial page
writes

\- tune the ARC aggressively: we size it waaay down on our db servers and let
innodb's buffer pool handle in-memory caching, and disable the arc prefetch.
On our application databases, our core tables fit entirely into the innodb
buffer pool

\- particularly important on spinning-platter systems, give the zpool scrub
jobs higher IO priority... the default scrub delay and scan min time settings
are extremely conservative and safe, but on high-IO slaves we found that there
was essentially never enough idle time for the scrub to complete. We accept
some IO overhead to allow scrubs to finish regularly.
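
Roughly, those tweaks look like this on a ZFS-on-Linux box (a sketch with
illustrative values, not our exact config; it assumes ZoL's module parameter
interface and a dataset dedicated to MySQL):

    # my.cnf: ZFS already protects against torn pages and does the caching
    #   innodb_doublewrite = 0
    #   innodb_buffer_pool_size = <sized to the working set>

    # cap the ARC well below total RAM so the InnoDB buffer pool wins
    # (example: 4 GiB), and turn off ARC prefetch
    echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
    echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable

    # make scrubs more aggressive so they can actually finish on busy slaves
    echo 0 > /sys/module/zfs/parameters/zfs_scrub_delay
    echo 3000 > /sys/module/zfs/parameters/zfs_scan_min_time_ms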

I'm really curious to see some more detailed writing on MySQL/ZFS from the
Percona team... their benchmarks are usually pretty well thought out, and they
certainly understand InnoDB better than we do. What we've got works well, but
I'd be surprised if it were a truly ideal setup down to the level of detail
that they dig into.

------
krob
The biggest hurdle in using ZFS is that it consumes a huge amount of RAM and
considerable CPU cycles compared with XFS or ext4. It's like running a massive
MySQL server on top of another MySQL server, if that were possible. Follow the
Reddit DataHoarder subreddit: people talk all the time about how powerful a
FreeNAS machine must be just to get decent transfer speeds over Samba. Yes,
FreeNAS will run without a lot of CPU speed and RAM, but performance will
suffer greatly.

~~~
vetinari
ZFS uses a lot of RAM only if you enable deduplication (and once data has been
written deduplicated, turning the property off doesn't undo it; the dedup table
stays until that data is rewritten). Without it, the RAM and CPU time
requirements are very modest.
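
A rough back-of-the-envelope for the dedup table cost, using the commonly
quoted figure of ~320 bytes of in-core DDT entry per unique block (a rule of
thumb, not an exact number):

    1 TiB of unique data at 128 KiB recordsize ~= 8.4M blocks
    8.4M blocks x ~320 bytes ~= 2.5 GiB of dedup table that wants to stay in RAM
    (at a 16 KiB recordsize, typical for a database, that's closer to 20 GiB per TiB)

    zpool status -D tank    # actual DDT entry counts/sizes for a pool named "tank"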

~~~
mrmondo
That's not at all correct: even without deduplication enabled I have observed
massive memory usage across BSD, Linux, and ZFS on macOS.

~~~
AndrewDavis
That's the ARC. The ARC max size will default to (at least on FreeBSD) total
RAM minus 1 GB [1]. Because the ARC is a read cache, it will shrink if the
system needs more memory. Unused RAM is wasted RAM.

[1] [https://www.freebsd.org/doc/handbook/zfs-advanced.html](https://www.freebsd.org/doc/handbook/zfs-advanced.html)
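
If you do want to cap it rather than rely on it shrinking under memory
pressure, it's a one-line tunable (FreeBSD shown; on ZFS-on-Linux the
equivalent is the zfs_arc_max module parameter):

    # check the current limit, in bytes
    sysctl vfs.zfs.arc_max

    # cap it at 8 GiB across reboots by adding this line to /boot/loader.conf
    vfs.zfs.arc_max="8589934592"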

------
jmnicolas
Frankly I don't think ZFS code complexity should be touted as an advantage.

~~~
pizza234
The complexity comparison itself is not correct - new generation "file
systems" are actually both volume managers and file systems (which also
implies other features that live at the intersection).

~~~
masklinn
The article properly points this out too.

------
bogomipz
I had a question about the mention of this under the RAID section:

>"ZFS is not affected by the RAID-5 write hole."

The "write hole" can exists in other RAID configurations such as RAID 10, RAID
1, RAID 6 etc. Is ZFS only insulated from these "write holes" in RAID 5 or all
ZFS supported RAID configurations?

~~~
Dylan16807
ZFS doesn't use actual RAID. It uses independent writes to each disk, so there
are no write holes possible.

~~~
bogomipz
But that's exactly what RAID is - independent writes to each disk.

ZFS's stripe-mirror vdev layout is a set of mirrored pairs with a stripe
written across those mirrors. How is that not actual RAID? Is that not RAID 10
with the ZFS addition of checksumming?

~~~
Dylan16807
True RAID depends on not-at-all-independent writes to each disk. The whole
thing needs to be atomic to be safe, which is impossible without a battery
backup. With ZFS each write to each disk is independently atomic, which is
what avoids the write hole. You always know if the write succeeded on each
disk.

But I was to some extent conflating the different redundancy methods ZFS has.
Some of them use the same sectors on each disk (I think) but some definitely
don't. The writes are _completely_ independent of each other under some
settings.

------
e40
I thought databases needed O_DIRECT, which isn't implemented yet in ZoL.

~~~
jandrewrogers
O_DIRECT is not required, as evidenced by many open source databases not using
it, but it does enable substantial gains in performance that are not otherwise
possible. The caveat is that taking advantage of the O_DIRECT flag, or the
equivalent functionality on non-Linux operating systems, requires a couple
thousand lines of additional non-trivial code, a subset of which is not
particularly portable.

The return on investment is definitely worth it from a performance
perspective. However, it makes portability a hassle and implementation
requires a high degree of specialized skill, so open source generally avoids
it for the sake of easy maintainability.
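
To give a flavor of why it's extra code: with O_DIRECT the application takes
over buffer management, and every buffer address, file offset, and transfer
size has to be aligned to the device/filesystem block size. A minimal sketch
(4096 is an assumed alignment; real code should query it):

    /* minimal O_DIRECT read on Linux: the caller owns alignment and caching */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("datafile", O_RDONLY | O_DIRECT); /* bypass the page cache */
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1; /* aligned buffer */

        /* the offset and length must also be multiples of the alignment */
        if (pread(fd, buf, 4096, 0) < 0) perror("pread");

        free(buf);
        close(fd);
        return 0;
    }

Multiply that by a user-space cache with a replacement policy and asynchronous
submission (io_submit and friends) and the couple-thousand-lines figure arrives
quickly.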

~~~
esaym
>The return on investment is definitely worth it from a performance
perspective.

Do you have hard evidence for that? I know in "theory" it sounds good and
Oracle and DB2 docs continually act like it is a requirement.

My personal experience is only with DB2. Its storage was mounted on NFS (yeah,
I know...) and, due to many other app issues, we were having DB performance
issues (or that is what everyone thought). Since it was mounted on NFS I could
just sniff the wire and see what was going on. To my horror it was reading and
writing in individual 4k blocks (which was the current page size). The overall
network throughput was only about 2-4mbs at most.

I managed to convince the DBA to turn off O_DIRECT in the DB2 config. I didn't
get much of a chance to benchmark (since so much other stuff was going on at
the time and the DBA insisted that my idea was a complete waste of time), but
what I did see was that without O_DIRECT, the OS took over and batched up the
reads and writes into larger blocks on the wire (sometimes as high as 64k), and
overall throughput at max on the wire was now 20-30mbs. I only had a few
minutes with this config before it was reverted and never revisited.
Nonetheless, using postgresql or mysql with the same schema and data would
yield reads and writes of much larger blocks, and it would allow network
throughput peaks of up to 200mbs.

~~~
jandrewrogers
I've measured it in the past. The challenge is isolating the effect for the
purpose of measurement, since so many other ancillary algorithm and
architecture choices influence performance when using O_DIRECT e.g. the
quality of cache replacement implementation and the design of the operation
scheduler. O_DIRECT by itself is a pessimization.

Six or seven years ago I built a testbed to measure this directly, using a
full O_DIRECT implementation (io_submit, sophisticated user space cache, etc)
and a fully hinted mmap() implementation (so no double buffering). The rest of
the storage and database implementation was otherwise identical, a fairly
conventional B-tree style design. As I recall, the difference in throughput
was 2-3x depending on the workload, mostly due to the caching behavior being
dramatically more effective and predictable; the kernel cache makes poor
choices from the perspective of a database execution schedule, and the knobs
it provides for controlling behavior and inspecting cache state are much too
limited. You get some of this control back if you double buffer like
PostgreSQL does, but that introduces a different set of inefficiencies that
kind of leave you where you started.

The secret is that to really take advantage of what O_DIRECT offers, you have
to continuously intertwine and adaptively coordinate the query execution
schedule and I/O schedule at a very fine-grained level. Most database
implementations do not take this level of control of query execution, instead
letting the OS control which query execution context gets to run when. In
clever architectures, you can dynamically reschedule query execution to be
nearly optimal for the I/O schedule and vice versa, while balancing for other
tradeoffs (like query latency). Nonetheless, even if you don't do this, you
can see substantial performance lift from a competent O_DIRECT based
architecture. That said, you can't introduce O_DIRECT incrementally and expect
great results, it is a bit of an all or nothing proposition.

~~~
esaym
Sounds like you were really trying to optimize for the top 0.1% :) I think you
have hinted that you can't simply "turn on" O_DIRECT and have magic happen, and
that if O_DIRECT is not used with proper tuning, you will be in more trouble.
The irony with my story is that even with the horrible DB2 config our "guru"
DBAs left us with, all the issues were from the software team. There were many
stupid things going on in the ORM that were causing the issues (like grabbing
an entire table and then calling a method on each row that in turn got an
aggregate sum of a column on the same table... doing that for each row).

------
zaphar
I wonder what the performance impact of running MySQL on ZFS would be?

~~~
empthought
Correctly tuned, it could be faster on some workloads and hardware (block-
level compression really saves on disk I/O).

~~~
willvarfar
My understanding is that ZFS compresses to multiples of the sector size, so
it's the same ratio as innodb page compression gives, but it ought to be much
faster and transparent.
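
It's also cheap to try on a dataset and measure what you actually get (lz4 is
the usual choice; the dataset name here is just an example):

    zfs set compression=lz4 tank/mysql
    zfs get compressratio tank/mysql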

My own experiments put tokudb far far ahead:

[https://williamedwardscoder.tumblr.com/post/160628887798/compressing-mysql-databases](https://williamedwardscoder.tumblr.com/post/160628887798/compressing-mysql-databases)

Unfortunately tokudb has taken a massive performance hit recently - it was so
fast when Tokutek integrated it into 5.6, but the Percona and MariaDB packaging
in 5.7 is much slower than innodb. Not sure what they've changed, but I wish
they'd profile and fix it - you'd think that Percona, having bought Tokutek,
would have noticed, benchmarked, and cared :(

~~~
kbenson
> not sure what they haven't done but wish they'd profile and fix it

Sometimes a slowdown comes because somebody noticed a problem and fixed it,
such as the default compile options including a nasty bug that makes it unsafe.
I don't have any specific knowledge of this case, but it's the kind of thing
I've seen happen multiple times in the past, so I wouldn't be surprised.

------
Timothycquinn
Does anyone know of any commercial Appliances that ship with ZoL as an active
part of the software stack?

~~~
MrMorden
Why would an appliance vendor prefer Linux over FreeBSD or Illumos?

~~~
MrMorden
But there are a few — [http://open-zfs.org/wiki/Companies](http://open-zfs.org/wiki/Companies) lists some that mention Linux.

------
frik
I wonder if it would be possible to put InnoDB into a kernel-mode filesystem
driver to create a new database filesystem (BeFS/WinFS-style).

~~~
hodgesrm
Amazon Aurora does something similar in the sense that it decouples InnoDB
functions including buffer cache and storage access from the overlying MySQL
engine. The decoupled InnoDB lives on separate VM(s) and includes optimized
data replication to support DBMS high availability. For example, with Aurora
you can failover a large DBMS in seconds to a new site and have it come up
with a heated buffer cache since that replicates too.

This approach makes sense for Aurora because it leverages cloud storage to
provide data high availability across availability zones. Just putting InnoDB
in a file system seems less interesting unless you add features that are
strikingly different from what it does now.

See
[http://www.allthingsdistributed.com/files/p1041-verbitski.pdf](http://www.allthingsdistributed.com/files/p1041-verbitski.pdf)
for a full description.

~~~
late2part
Sounds overly complicated and obscure.

~~~
openasocket
The results are a massive gain in performance, availability, and durability,
which seems worth the added complexity.

