
ZFS High-Availability NAS - louwrentius
https://github.com/ewwhite/zfs-ha/wiki
======
insaneirish
It's very nice to see someone putting in the work to document this and show
others how they were able to get it to work.

But I unfortunately believe, quite sincerely, that redundancy at this level is
misguided at best and dangerous at worst.

Instead of building storage that can fail over between nodes, consider building
applications that can survive node loss and that don't rely on persisting
their state in a black box assumed to be 100% available. These assumptions
about the reliability of systems and their consistent behavior under network
partitions have bitten countless people before and will continue to claim
victims unless application paradigms are fundamentally rethought.

Many people already know this. Even people who build these systems know this,
but too many applications already exist that can't be changed. Nonetheless,
it's important for people building such systems for the first time to be aware
that they can make a choice early in the design process and that the status
quo is not a good choice. In other words, "Every day, somebody's born who's
never seen The Flintstones."
([https://mobile.twitter.com/hotdogsladies/status/760580954532...](https://mobile.twitter.com/hotdogsladies/status/760580954532433920))

~~~
qaq
There are legitimate reasons to have a CP system rather than an AP system;
it's obviously application-specific. Also, over-engineered systems come with
their own risks and can experience as much downtime as a well-set-up CP system
due to human error stemming from the system's complexity (AWS is a good
example).

~~~
zzzcpan
NASes are not CP systems, they are noCAP systems. They can guarantee neither
consistency nor availability in the event of a network partition.

~~~
rdtsc
Very good point. Quite often when talking about distributed systems, everyone
goes to the CAP theorem and automatically assumes that if it is not AP, it
must be CP, while in reality (and maybe for good reasons) it might be neither.

~~~
qaq
You sure can, but it seems there are not that many practical applications for
distributed systems that are not tolerant of partitions.

------
olavgg
FreeBSD has a cool HA ZFS solution you can use.

HAST [https://wiki.freebsd.org/HAST](https://wiki.freebsd.org/HAST)
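
For a sense of what it looks like: a minimal /etc/hast.conf sketch along the
lines of the handbook example (the hostnames, addresses and device paths here
are made-up placeholders):

    resource shared0 {
            on nodeA {
                    local /dev/ada1
                    remote 172.16.0.2
            }
            on nodeB {
                    local /dev/ada1
                    remote 172.16.0.1
            }
    }

After running hastctl create shared0 on both nodes and starting hastd, the
primary exposes /dev/hast/shared0, and that is the device you would build the
zpool on.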

There are also other solutions which you can hack together, like iSCSI or GEOM
Gate Network: [https://www.freebsd.org/doc/handbook/geom-ggate.html](https://www.freebsd.org/doc/handbook/geom-ggate.html)

I would however strongly recommend a proper distributed storage solution like
Ceph or GlusterFS.

~~~
X86BSD
People will do whatever they want to do. And my comment is not a knock on
HAST.

But myself, and I know I am not alone, firmly believe especially regarding HA
and ZFS that following anything other than the KISS principle is pain upon
pain upon pain.

Whether it's iSCSI, HAST, NFS, or any combination thereof.

The simplest and least painful? Simply build two servers with dual NICs and
dual HBAs.

Assign half the disks in each mirror to HBA-1 and the other half to HBA-2.

Use one NIC for public networking, and the other for a direct point-to-point
Gig-E link between server A and server B.

Use zfs replication over the private network link using the granularity you
feel most comfortable with. 30 minutes? 15 minutes? 5? It's up to you.
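
To make that concrete, here is a rough sketch of what such a replication job
could look like when run from cron on the active server (the dataset name,
peer address and snapshot naming are all just placeholders):

    #!/bin/sh
    # Incremental ZFS replication sketch: snapshot the dataset, then
    # send the delta since the previous snapshot to the standby box
    # over the private link.
    DS=tank/data
    PEER=192.168.100.2
    NEW=repl-$(date +%Y%m%d%H%M%S)
    # The most recent existing snapshot is the incremental base.
    PREV=$(zfs list -H -t snapshot -o name -s creation -d 1 "$DS" | tail -1)
    zfs snapshot "$DS@$NEW"
    if [ -n "$PREV" ]; then
            zfs send -i "$PREV" "$DS@$NEW" | ssh "$PEER" zfs receive -F "$DS"
    else
            # First run: full send.
            zfs send "$DS@$NEW" | ssh "$PEER" zfs receive -F "$DS"
    fi

Failover stays manual, as described: promote the dataset on server B and
repoint the clients.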

This is the simplest, and IMO most reliable, way to go about this. When you
start adding in iSCSI, NFS, HAST etc. you are simply making things too
complex, introducing multiple points of failure, and the pain when something
does break? I'll pass. I'd rather have to manually intervene and fail over to
server B if server A fails (and vice versa) than deal with all the pain of
additional things that could break or introduce split-brain issues.

But it's a free world, use whatever solution fits your comfort zone.

~~~
zzzcpan
The biggest issue with your approach is consistency. You just cannot guarantee
it unless you are using proper distributed algorithms on every level.

------
EvanAnderson
Ed White, the guy behind the Github repo, is a very experienced sysadmin (and
the guy who knocked me off the top scoring slot on Server Fault). His
experience supporting both COTS and custom applications in a variety of
environments is top-notch.

------
Veratyr
To add to the downsides, you can't expand RAIDZ vdevs.

If you start with a 6-disk RAIDZ2 and want to add a couple more drives, you
can't. The only way to add capacity to the pool is to add an entire new vdev.
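
For example (pool and device names invented), the only growth path looks like:

    # You can't attach extra disks to an existing raidz vdev; the only
    # way to grow the pool is to add a whole new vdev alongside it:
    zpool add tank raidz2 da8 da9 da10 da11 da12 da13

In other words, you're buying a full vdev's worth of drives every time you
expand.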

Unfortunately, I don't think there's an expandable file system usable right
now that handles both parity/erasure coding and bitrot. BTRFS might be usable
after the RAID5/6 rewrite, and Ceph might be more usable once Bluestore comes
along (at the moment erasure coding performance seems to be pretty bad, even
with an SSD cache), but in the present there's nothing.

I really hope
[https://www.patreon.com/bcachefs](https://www.patreon.com/bcachefs) manages
to fix this.

~~~
ymse
> Ceph might be more usable after Bluestore comes along (at the moment erasure
> coding performance seems to be pretty bad, even with an SSD cache)

I just saturated a 10Gb link in a k=4,m=2 EC configuration on Ceph (Haswell).
Too lazy to set up concurrent clients. What do you mean by pretty bad? This is
a Hammer cluster without SSD journals or cache.

Bluestore was recently released with Jewel, but shouldn't affect EC
performance drastically, if at all.

~~~
Veratyr
Maybe I'm doing something horribly wrong. My setup was:

- 6x 4TB HDD
- 2x 120GB SSD
- one OSD per disk
- k=4,m=2 erasure-coded HDD pool
- 1-copy (i.e. not really replicated) SSD pool
- SSD pool as writeback cache for the HDD pool

I was consistently seeing the SSD writeback cache fill up, and then the disks
thrashed at 100% IO utilization, limiting incoming writes to ~40MB/s (going
from memory).
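
For reference, the rough shape of that setup in Ceph commands (pool names and
PG counts here are illustrative, not exactly what I ran):

    # k=4,m=2 erasure-coded pool on the HDDs
    ceph osd erasure-code-profile set ec42 k=4 m=2
    ceph osd pool create ecpool 128 128 erasure ec42
    # single-copy SSD pool as a writeback cache tier in front of it
    ceph osd pool create cachepool 128
    ceph osd pool set cachepool size 1
    ceph osd tier add ecpool cachepool
    ceph osd tier cache-mode cachepool writeback
    ceph osd tier set-overlay ecpool cachepool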

~~~
ymse
Oops, didn't see this reply until now. It's no longer recommended to use an
SSD cache pool, largely for the reasons you describe.

You will be a lot better off using bcache or similar, on top of the OSD
device.
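
i.e. something along these lines per OSD, before the OSD is created (device
names are examples):

    # Build a bcache device from an SSD cache partition and an HDD
    # backing disk; the OSD filesystem then goes on /dev/bcache0.
    make-bcache -C /dev/sdb1 -B /dev/sdc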

------
Annatar
Always on the prowl for cheap just a bunch of disks enclosures for my Solaris
/ SmartOS systems, I learned a whole bunch of stuff from the links in the
github readme.md; for example, that Sun made the J4400 / J4100 just a bunch of
disks external serial attached small computer system interface arrays, and
that these are dirt cheap on ebay:

[https://docs.oracle.com/cd/E19928-01/820-3223-14/J4200_J4400...](https://docs.oracle.com/cd/E19928-01/820-3223-14/J4200_J4400_Overview.html)

That in turn led me to search for the actual host bus adapter supporting
these, and behold:

[https://docs.oracle.com/cd/E19928-01/820-3222-19/J4200_J4400...](https://docs.oracle.com/cd/E19928-01/820-3222-19/J4200_J4400_ReleaseNote_chapter.html#50401294_59854)

further "digging" on ebay uncovered that these low profile host bus adapters
can be had for as cheap as $18.95 USD plus shipping.

Since this stuff is professional grade equipment, the repossession companies
are having a really, really hard time reselling this equipment: enterprises
will not buy this because it is so specialized that they do not know about it
(and they would much rather buy ultra expensive EMC with a support contract),
and private persons will not buy this because most people do not even know
what it is, or that it exists, never mind that it requires one to have a rack
with professionally installed power and possibly cooling.

As a consequence, this equipment goes for peanuts on ebay, which makes it a
way for someone who knows what they are doing to implement highly reliable,
enterprise-grade storage solutions for a tiny fraction of the cost.

Not bad for five minutes worth of research, that was time well invested.

~~~
ewwhite
This is key. I mean, I focus on the HP side and used ProLiant servers and
storage enclosures in my examples... But the Sun/Oracle enclosures and
components are very inexpensive off-lease and come at a price-point where it's
totally possible to self-support.

------
louwrentius
I think this is quite amazing, building high-availability storage is a complex
endeavour because the entire storage path needs to be 100% redundant.

It's quite an achievement to get this functional using open-source software.

------
jpgvm
One day we will hopefully have first-class ZFS mirroring on shared-nothing
hardware. Until then, dual-pathed JBODs will have to do.

------
zokier
Am I reading the output right that he is creating a three-way stripe over
raidz1 vdevs? Seems like a bit of an odd configuration, especially for an HA
solution. I mean, doesn't that mean the array can't handle two disk failures
unless it gets lucky?

~~~
Annatar
It means the pool striped across the three RAIDZ1 vdevs can lose up to three
disks simultaneously, one per RAIDZ1 vdev.

Any further failure would then result in complete data loss. Likewise, if any
one of the RAIDZ1 vdevs loses more than one disk simultaneously, that also
means complete data loss.

~~~
ewwhite
This is a simple example setup in a 12-drive enclosure: 9 disks, a global
hot-spare disk and two cache drives.
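
In zpool terms the layout is roughly this (device names are placeholders):

    zpool create vol1 \
            raidz1 da0 da1 da2 \
            raidz1 da3 da4 da5 \
            raidz1 da6 da7 da8 \
            spare da9 \
            cache da10 da11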

Previously, I would have used straight ZFS mirrors for performance and
expandability (raidz VDEVs can't be expanded), but at this scale, the 9-disk
3x3 raidz1 setup consistently outperformed a 10-disk RAID mirror setup in
random and sequential I/O.

See: [http://i.imgur.com/SeSlmH8.png](http://i.imgur.com/SeSlmH8.png)

~~~
eatstoomuchjam
A set of 5 stripes of mirrors has the same problem that the comment brought
up: losing the _wrong_ two disks loses all of your data.

In this case, going to raidz2 across all 12 disks would give an extra disk's
worth of capacity/performance with the ability to lose any 2 disks without
data loss, and going to raidz3 should give the same capacity/performance with
the ability to lose any 3 disks without data loss.

~~~
ewwhite
Hot spares and monitoring. RaidZ2 across that number of drives is a
performance nightmare and against ZFS best practices.

------
gsmethells
ZFS forces you to choose a drive size a priori and live with it for the life
of the system. Thanks but no thanks. I'll stick with OpenStack Swift where I
can leverage new larger disks and have my uncoupled highly available storage
to boot.

~~~
chad_c
This is not true. Read up on mirrored vdevs: [https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-...](http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/)
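
The short version of the argument: with mirror vdevs you can grow a pool two
drives at a time by swapping in bigger disks, e.g. (hypothetical device
names):

    # Replace both halves of one mirror vdev with larger drives; once
    # both have resilvered, the vdev grows to the new size.
    zpool set autoexpand=on tank
    zpool replace tank da0 da12
    zpool replace tank da1 da13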

~~~
gsmethells
50% storage efficiency and 87.5% survival is not an option for us. We store
medical images by the petabyte and drives fail much too fast at scale.

~~~
notmyname
It's great to hear how you are using Swift. I work on that project; feel free
to reach out if you have any questions. Contact info is in my profile.

