
ZFS Root Filesystem on AWS - lscotte
https://www.scotte.org/2016/12/ZFS-root-filesystem-on-AWS
======
malisper
I'm currently managing a Postgres cluster with a petabyte of data running on
ZFS on Linux on AWS. Most of the issues we've come across are around us not
knowing ZFS.

The first main issue was the arc_shrink_shift default being poor for machines
with a large ARC. Our machines have ARCs of several hundred GB, so the default
arc_shrink_shift was evicting several GB at a time. This was causing our
machines to become unresponsive for several seconds at a time, pretty
frequently.
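
For anyone hitting the same thing: the knob in question is a ZFS on Linux
module parameter. A minimal sketch of how it is typically adjusted (the value
shown is purely illustrative, not a recommendation):

    # /etc/modprobe.d/zfs.conf -- applied at module load time.
    # arc_shrink_shift controls what fraction of the ARC is freed per
    # shrink pass (roughly arc_size >> shift), so a larger shift means
    # smaller, less disruptive evictions.
    options zfs zfs_arc_shrink_shift=11

    # Or change it on a running system:
    echo 11 | sudo tee /sys/module/zfs/parameters/zfs_arc_shrink_shift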

The other main issue we encountered was when we tried to delete lots of data
at once. We aren't sure why, but when we deleted a lot of data (~200GB from
each machine, each of which holds several TB), our databases became
unresponsive for an hour.

Other than these issues, ZFS has worked incredibly well. The built-in
compression has saved us lots of $$$. It's just the unknown unknowns that have
been getting us.

~~~
lomnakkus
Agreed, ZFS has its caveats, but feature-wise and stability-wise ZFS is -- to
a large degree -- what BTRFS _should_ have been.

The licensing is _incredibly_ unfortunate, though. (I don't care about the
reasoning for the license; it's just _bad_ that it isn't GPL-compatible, and
so isn't license-compatible with the most prolific kernel in the world.)

Anyway, back to BTRFS vs. ZFS. It seems abundantly clear that a filesystem is
no longer a thing where you can just "throw an early idea out there" and hope
that others will pick up the slack and fix all the bugs. There's just so much
_design_ (not code) that goes into these things that it's not just about code
any more.

My (small) bet right now as to the "next gen" FS on Linux is on bcachefs [1,
2]. It sounds _much_ sounder from a design perspective than BTRFS, plus it's
built on the already-proven bcache, etc. (Read the page for details.)

[1] [https://www.patreon.com/bcachefs](https://www.patreon.com/bcachefs)

[2] [https://bcache.evilpiepirate.org/Bcachefs/](https://bcache.evilpiepirate.org/Bcachefs/)

~~~
jen20
According to Canonical, it _is_ GPL compatible. Either way, that shouldn't get
in the way of the best file system in existence being used with the kernel of
last resort.

~~~
georgyo
The CDDL is incompatible with the GPL. The GPL, however, is not incompatible
with the CDDL.

This means Linux copyright holders could sue ZoL binary distributors, but
Oracle could not.

However, no one is shipping ZoL binaries, only the source code. The code
itself is 100% conflict-free.

~~~
aidenn0
Canonical ships ZoL binaries as of April 2016. They claim doing so doesn't
violate the GPL since they are shipping it as a module rather than built into
the kernel.

~~~
mistat
So you're saying it's only going to be available in user space and never in
the kernel? Why doesn't Oracle just relicense it?

~~~
_joel
No, they're supplied as kernel modules, packaged separately from the kernel.
Before Ubuntu 15.10 you could still install it as a DKMS module (such that it
was compiled on the system it was being installed on). Now they just ship the
pre-built .ko's, saving the user compilation time. There are still userland
tools to interact with it: zpool, zfs, etc.
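
Roughly, on Ubuntu 16.04 the pieces look like this (a sketch; the prebuilt
module ships with the kernel packages, the userland tools come from
zfsutils-linux):

    # Userland tools (zpool, zfs, ...):
    sudo apt install zfsutils-linux

    # Load the prebuilt module and check that it is the shipped .ko,
    # not something compiled locally via DKMS:
    sudo modprobe zfs
    modinfo zfs | grep -E '^(filename|version)'

    # Basic sanity check:
    zpool status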

------
cmurf
This is quite a bit easier to do with Btrfs, since there are installers that
support it; Btrfs also has two neat features lacking in ZFS.

1\. Reflink copies. Basically this is a file-level snapshot. The metadata is
unique to the new file, but it (initially) shares extents with the original.

2\. Seed device. The volume is set read-only and mounted; you add a 2nd
device, remount rw, and delete the seed. This causes the seed to be replicated
onto the 2nd device, but with a new volume UUID. A use case might be to do a
minimally configured, generic installation and then use it as a seed for
quickly creating multiple unique instances.

Another use case: don't delete the 1st volume. Each VM gets two devices: the
read-only seed (shared) and a read-write 2nd device (unique). The rw device
(the sprout) internally references the read-only seed, so you need only mount
by the UUID of the sprout.

Seed-sprout is something like an overlay, or a volume-wide snapshot.
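
A rough sketch of both features with btrfs-progs, for anyone who hasn't seen
them (device names and paths here are hypothetical):

    # 1. Reflink copy: new metadata, extents shared with the original
    #    until either file is modified.
    cp --reflink=always base-image.raw clone-image.raw

    # 2. Seed device: mark the filesystem as a seed, then "sprout" it.
    btrfstune -S 1 /dev/vdb             # set the seed flag (fs must be unmounted)
    mount /dev/vdb /mnt                 # a seed mounts read-only
    btrfs device add /dev/vdc /mnt      # add the writable sprout device
    mount -o remount,rw /mnt
    btrfs device delete /dev/vdb /mnt   # replicate onto the sprout, drop the seed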

------
voncopec
I have an Ubuntu image for a thumb drive that I made that does both MBR and
EFI boot with root on ZoL. I've used it to install ~10 computers now: boot,
partition, attach the internal disk to rpool, detach the thumb drive, and
install the bootloader -- all without ever rebooting -- ending up with a fully
working install. It is pretty slick.
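
For anyone curious, the attach/detach dance described above maps to something
like this (a sketch only; device names are hypothetical and partitioning/EFI
details are skipped):

    # Mirror the thumb drive's ZFS partition onto the internal disk.
    zpool attach rpool /dev/sdb2 /dev/sda2

    # Wait for the resilver to complete, then drop the thumb drive.
    zpool status rpool
    zpool detach rpool /dev/sdb2

    # Put the bootloader on the internal disk -- no reboot needed so far.
    grub-install /dev/sda
    update-grub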

------
briankwest
I thought you shouldn't be running ZFS inside a virtualized environment?

~~~
zorked
Why?

~~~
jlgaddis
Ideally, ZFS has exclusive control over the storage. When it's virtualized, it
doesn't, and there may be various HBAs, RAID controllers, etc., in between ZFS
and the actual disks. These can (and do) "get in the way", and you can lose
one of the biggest features of ZFS: data integrity and error correction.

~~~
lscotte
The keyword is definitely "ideally" :-) With a cloud provider like AWS,
storage is always virtualized - so we've always got that working against us. I
see ZFS in AWS as being more about flexibility than data integrity, although,
having said that, ZFS should do just as well or better (certainly no worse)
than EXT4, XFS, or BTRFS for reliability. The ability to add storage
dynamically without having to move bits around is powerful.

~~~
drvdevd
And I've actually had (a couple of times now) random _hardware_ failures on
EBS storage on AWS, with random notification emails from Amazon and
accompanying data loss.

Might as well treat your zpool like it's on real hardware and configure raidz
accordingly. The cloud does have real, problematic hardware behind it, and
it's important to remember that.

[edit] Especially if you can configure your block devices such that you know
they're sitting on different _physical hardware_ at the cloud data center, you
_will_ gain that benefit of ZFS.
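
Concretely, that means attaching several independent EBS volumes and pooling
them with parity, e.g. (a sketch; xvdf/xvdg/xvdh are just typical EBS device
names):

    # Three EBS volumes in a single-parity raidz vdev: any one volume
    # can fail (or silently corrupt) without losing data.
    zpool create -o ashift=12 tank raidz /dev/xvdf /dev/xvdg /dev/xvdh
    zpool status tank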

------
dekhn
I've been waiting for a while now for Ubuntu to finish the native installer
support for simple ZFS-root installs. This document is basically the same
process: bootstrap Linux on a supported FS, then use user-space tools to make
a ZFS filesystem on a new block device, copy Linux there, adjust the boot
system's pointers, and reboot.
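
Condensed, that manual process looks roughly like the following (a simplified
sketch; the real guides split the pool into datasets and handle /boot and GRUB
properly):

    # From the running (ext4/XFS) system, with the new block device attached:
    zpool create -o ashift=12 -O mountpoint=/ -R /mnt rpool /dev/xvdf
    rsync -aX --one-file-system / /mnt/    # copy the running root onto ZFS
    # ...then chroot into /mnt, point fstab/GRUB at the pool, and reboot.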

~~~
josteink
I've used this guide and it works fine:

[https://github.com/zfsonlinux/zfs/wiki/Ubuntu-16.04-Root-on-ZFS](https://github.com/zfsonlinux/zfs/wiki/Ubuntu-16.04-Root-on-ZFS)

It does, however, advise using blockid, not device-id, for mounts.

Any idea if this doesn't apply to AMIs?

~~~
dekhn
My point is that I want the installer itself to lay down the original disk FS
as ZFS.

~~~
josteink
Considering how the current installer supports btrfs, it really doesn't seem
like it should take too much effort. Someone, however, will have to put in
that effort.

And maybe people using ZFS want proper volume/subvolume management, as good as
the support for traditional partitions or LVM volumes? If so, it will probably
take a while longer to land.

~~~
dekhn
Right; Canonical is expected to do this work, as they already added ZFS as a
supported filesystem to Ubuntu.

~~~
josteink
In that case: Prepare to wait. Canonical has delayed lots of things lately in
their eagerness to launch Unity 8 ;)

~~~
dekhn
I've been waiting years now, so it's not really a problem. When it's available
I will install it on a test machine, make a copy of my live server's data
there, and then test it for a year.

------
kim0
Can someone explain why the author is aligning to a 4096-sector (2M) boundary,
while most tools (gdisk et al.) default to 2048 sectors?

~~~
lscotte
I can (I'm the author). The partitions are aligned to 2048 sectors (4096 is
evenly divisible by 2048). But also note the first usable sector is 2048, not
0 - so the first partition, although we tell sgdisk to start at 0, actually
runs from sectors 2048-4095. I don't know the exact reason why the first
usable sector is 2048 - I believe it has to do with legacy support and MBR
compatibility - but I'm not sure of the details.
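
To make the "start at 0" behaviour concrete, here's roughly what that looks
like with sgdisk (the partition sizes and type codes are just illustrative,
not the article's exact layout):

    # "0" as the start sector means "first usable, aligned sector",
    # which with 512-byte sectors is 2048 (i.e. the 1 MiB mark).
    sgdisk -n1:0:+2M -t1:EF02 /dev/xvdf   # small BIOS-boot partition
    sgdisk -n2:0:0   -t2:BF01 /dev/xvdf   # rest of the disk for the zpool
    sgdisk -p /dev/xvdf                   # print the table to verify alignment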

~~~
dordoka
I was intrigued by the exact reason for the first usable sector and found this
link, which might be an interesting read:

[http://jdebp.eu./FGA/disc-partition-alignment.html](http://jdebp.eu./FGA/disc-partition-alignment.html)

------
caseyf7
Is it ok to run ZFS without ECC RAM?

~~~
aidenn0
Yes, it always has been. The "ZFS needs ECC RAM" meme comes from the fact that
on many systems the (non-ECC) RAM is the weakest point in the data integrity
path if you are running ZFS.

For an analogy, consider a world where car engines have a non-trivial chance
of instantly exploding when involved in a crash. Then someone comes out with
an engine that doesn't explode. People say "you should wear seatbelts if you
use this non-exploding engine," but your car has no seatbelts; clearly you are
still safer with the non-exploding engine, but all of a sudden, seatbelts are
more likely to save your life than before.

~~~
anoother
I think the concern is more that if you encounter a bit-flip while
checksumming a block on write, then ZFS will mark the data corrupt on read,
and thus make good data unreadable.

A non-checksumming FS would not be vulnerable to this particular issue. On the
other hand, would undetected corruption through bit rot be a worse problem?
Almost certainly.

And considering how vanishingly unlikely such a scenario is, I do agree with
your sentiment.

That said, I'm not sure if my understanding of the issue is complete and would
welcome an explanation of the failure scenarios that [very] occasional RAM
bit-flips expose ZFS to.

~~~
paulmd
It all depends on the relative probability of those two scenarios occurring,
though. And the problem with ZFS is that each time you are scrubbing you are
essentially rolling the dice, so you are rolling them a lot more times.

Let's say that you have a 99.9% chance of the scrub running correctly on a big
pool with non-ECC memory (a 0.1% chance of a bit-flip during the scrub). Any
single scrub is extraordinarily likely to succeed, but if you run a scrub
every day, then over the course of a year your chance of your pool surviving
falls to 0.999^365 = 69.4%.

Pick your favorite numbers here; a 0.1% failure chance per scrub is probably
way too high. With five nines your yearly survival rate is 99.6%. But do
remember that soft errors are fairly rare in modern servers _mostly because
they use ECC RAM_; you can't look at data from ECC systems and assume you'll
get comparable results using non-ECC RAM.

In general, if you scrub infrequently you are probably going to be OK. (But
then why are you using ZFS instead of LVM?) If you live at high altitude,
however - let's say in Denver - you are also facing significantly increased
soft error rates. The extra atmosphere at sea level does make a strong
difference in shielding, something around a 5x reduction in strike events.

On the plus side - the SRAM and some parts of the processor do use ECC
internally, which is good because fault rates increase with reduced feature
size and increased number of transistors. The CPU is potentially the most
sensitive part of the system per unit area, so it's very important to protect
against errors there.

And on the other hand - disk corruption or failure probably outweigh those
kinds of concerns in practice. But it's not like it's expensive to get a
system with ECC. An Avoton C2550 runs like $250. So why take the risk anyway?
Your data's worth an extra $100 in hardware.

Heck, you can run ECC RAM on the Athlon 5350 and the Asus AM1M-A motherboard.
Boom, ECC mobo/CPU combo for under $100. It's just a little thin on SATA
channels. It's a shame there's no "server" version of this board with dual
NICs, IPMI, and an extra SATA controller tossed on there.

------
daurnimator
> We also allow overlay mount on /var. This is an obscure but important bit -
> when the system initially boots, it will log to /var/log before the /var ZFS
> filesystem is mounted.

You shouldn't need to on a systemd-based distro: journald logs to /run until
/var comes up (and then flushes across). See
[https://www.freedesktop.org/software/systemd/man/journald.conf.html#Storage=](https://www.freedesktop.org/software/systemd/man/journald.conf.html#Storage=)
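
For reference, the relevant setting (a sketch; this is just the stock journald
behaviour described above):

    # /etc/systemd/journald.conf
    [Journal]
    # "auto" (the default): log to /run/log/journal early in boot, then
    # flush to /var/log/journal once /var is up -- but persistent storage
    # is only used if the /var/log/journal directory already exists.
    Storage=auto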

~~~
lscotte
Shouldn't, perhaps, but we definitely have to (and are running with systemd)!
I have not made any effort to understand what is logging early - there's not
much that should happen prior to ZFS mounting everything.

~~~
daurnimator
Did you (accidentally) create the directory /var/log/journal in the rootfs?
Make sure your /var is actually empty in the rootfs image.
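
A quick check against the built image (the mount point is hypothetical):

    # With the rootfs image mounted at /mnt/rootfs, /var should not already
    # contain a journal directory (or anything else the separate /var ZFS
    # dataset is supposed to provide).
    ls -lA /mnt/rootfs/var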

~~~
lscotte
Yep, it was empty when the target was created/installed. It's something
happening early at boot time, but I haven't diagnosed exactly what's going on.
And now that I think about it, I've seen the same thing on local ZFS Debian
installs as well - worth digging into when I have some time; likely a bug
somewhere.

------
machbio
For less production-oriented instructions on Ubuntu 16.04, here is a simple
guide:
[http://www.howtogeek.com/272220/how-to-install-and-use-zfs-on-ubuntu-and-why-youd-want-to/](http://www.howtogeek.com/272220/how-to-install-and-use-zfs-on-ubuntu-and-why-youd-want-to/)

------
ShakataGaNai
Anyone have thoughts on how to turn this into something more reproducible,
like Packer? Love the idea, but would hate to rebuild AMIs by hand every X
time for Y regions.

~~~
lscotte
I definitely have thoughts about automating the process :-) - we certainly
plan on getting the whole process into an automated, repeatable build.

~~~
jen20
(Disclaimer - I work at HashiCorp)

I'm working on a Packer builder type that will support this use case as we
speak. The existing AWS `chroot` builder would likely be sufficient, but
requires running from within AWS.

~~~
ShakataGaNai
Seems like a perfectly logical pre-req. Easy enough to automate with say a
Lambda function to spin up a pre-configured host instance to build the AMI's.

So... very exciting, looking forward to hearing more about when it "hits the
streets"!

------
jsjohnst
Anybody tried ZFS on CentOS or other RHEL-like distros?

~~~
jlgaddis
cf.
[https://github.com/zfsonlinux/zfs/wiki/RHEL-%26-CentOS](https://github.com/zfsonlinux/zfs/wiki/RHEL-%26-CentOS)

~~~
jsjohnst
Thanks! I'd read that before, was more looking for anecdata from someone who'd
tried it, especially in production.

------
devoply
I have messed with ZFS on Linux on Ubuntu, and I have to say that I would not
yet trust it in production. It's not as bulletproof as it needs to be, and
it's still under heavy development. It's not even at version 1.0 yet.

~~~
dmm
Do you have any specific reasons not to trust ZoL?

ZFS-on-Linux devs say it's ready for production[1].

Lawrence Livermore laboratory stores petabytes of data using ZoL[2].

If we're sharing anecdotes, ZoL has served me fantastically for several years.

[1] [https://clusterhq.com/2014/09/11/state-zfs-on-linux/](https://clusterhq.com/2014/09/11/state-zfs-on-linux/)

[2] [http://computation.llnl.gov/newsroom/livermores-zfs-linux-port-hit-it-industry](http://computation.llnl.gov/newsroom/livermores-zfs-linux-port-hit-it-industry)

~~~
otterley
We have encountered a reproducible panic and deadlocks when a containerized
process gets terminated by the kernel for exceeding its memory limit:

[https://github.com/zfsonlinux/zfs/issues/5535](https://github.com/zfsonlinux/zfs/issues/5535)

We're strongly considering using something else until this gets addressed. The
problem is, we don't know what, because every other CoW implementation also
has issues.

* dm-thinp: Slow, wastes disk space

* OverlayFS: No SELinux support

* aufs: Not in mainline or CentOS kernel; rename(2) not implemented correctly; slow writes

~~~
brendangregg
The issue you link to was opened a day ago.

If that were me, I'd see how quickly it was fixed before strongly considering
something else.

~~~
otterley
Have you had any issues to report? If so, how quickly were they fixed? Knowing
what the typical time is to address these issues would help us make a more
educated decision.

~~~
lscotte
Yes, we've run into 2 or 3 ZFS bugs that I can think of that were resolved in
a timely fashion (released within a few weeks, if I recall) by Canonical
working with the Debian and zfsonlinux maintainers (and subsequently fixed in
both Ubuntu and Debian - and in upstream zfsonlinux for the ones that were not
Debian-packaging related). Of course your mileage may vary, and it depends on
the severity of the issue. Being prepared to provide detailed reproduction and
debug information, and to test proposed fixes, will greatly help - but that
can be a serious time commitment on your side (for us, it's worth it). Hope
that helps!

