
832 TB – ZFS on Linux - beagle3
http://www.jonkensy.com/832-tb-zfs-on-linux-project-cheap-and-deep-part-1/
======
rsync
"I ended up between the Supermicro SSG-6048R-E1CR60L or the SSG-6048R-E1CR90L
– the E1CR60L is a 60-bay 4U chassis while the E1CR90L is a 90-bay 4U chassis.
The nice part is that no matter which platform you choose, Supermicro sells
this only as a pre-configured machine – this means that their engineers are
going to make sure that the hardware you choose to put in this is all from a
known compatibility list. Basically, you cannot buy this chassis empty and jam
your own parts in"

This is a _major_ departure from the Supermicro business model and practices
and basically broke all of our next generation expansion roadmaps.

This was not a technical decision - it is the same old economic decision that
every large VAR/integrator/supplier has succumbed to for the last 30 years.
They aren't the first ones to try this trick and they won't be the last.

We (rsync.net) are not playing ball, however. After 16 years of deploying
_solely_ on Supermicro hardware (server chassis and JBODs), we bought our
first non-Supermicro JBOD last month.

~~~
wmf
What specifically is the problem? Overpriced drives?

~~~
rsync
Yes. $30-$50 each. Multiply that by 60 or 90 and multiply that by (however
many you put in a rack).

That adds up to five-figure premiums on drives _we're going to burn in
anyway_.

We know how to read an HCL - it's not rocket science.
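
For the curious, burn-in here is nothing exotic. A minimal sketch of the sort
of thing we mean, with a hypothetical device name; note that badblocks -w is
destructive, so fresh drives only:

    # Destructive write-mode surface test; wipes the drive, so new disks only
    badblocks -wsv /dev/sdX

    # Kick off a long SMART self-test, then read the results once it finishes
    smartctl -t long /dev/sdX
    smartctl -a /dev/sdX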

~~~
qaq
OK, 60 x $50 is $3K; even with an extra $3K, the Supermicro is still a lot cheaper.

~~~
codyps
When drives are never replaced, yes. In the case where the chassis lasts
longer than the drives (which I'd imagine is often), extra cost in drives adds
up.

------
kev009
There are a couple of needful tweaks to this BOM for anyone wanting to follow
this:

Only populate one CPU socket. Zone allocation between two NUMA nodes is kind
of hard, especially since Ubuntu 16.04's ZFS is pre-OpenZFS ABD, where memory
fragmentation is a reality.

I would recommend better NICs like a Chelsio T5 or T6. Aside from better
drivers and a responsive vendor, you can experiment with some of the iSCSI
offloads or zero-copy TCP.
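
Even before touching the vendor-specific offloads, the generic knobs are
visible through ethtool. A minimal sketch; the interface name is hypothetical,
and Chelsio's TOE/iSCSI offloads need their own tools on top of this:

    # Show which offloads the NIC/driver supports and their current state
    ethtool -k eth0

    # Toggle, e.g., TCP segmentation offload while benchmarking
    ethtool -K eth0 tso on

    # Driver and firmware versions, handy when talking to the vendor
    ethtool -i eth0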

Supermicro seriously under-provisioned I/O on that chassis. I'd add LSI/Avago
(now Broadcom) cards so you can get native ports to every drive. Even if it's
just a cold-storage box, it will help with rebuild and scrub times, and with
peace of mind. The cost of this is not bad compared to the frustration of SAS
expander firmwares. Go 2x24 or 3x16 plus 4 drives on the onboard ports if you
can skip the backplane expander. Supermicro will usually do things like this
if you insist, or an integrator like ixSystems can handle it.
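
Whichever layout you end up with, it's worth verifying how the drives actually
attach. A minimal sketch with standard tools; device names are examples:

    # Map each disk to its controller and SAS address; expanders show up too
    lsscsi -t

    # Identity info for one drive, including the negotiated interface speed
    smartctl -i /dev/sda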

More subjectively, I would also recommend FreeBSD. It seems their main
justification for Ubuntu was paid support, which can be had from ixSystems,
who sell and support an entire stack (Supermicro servers; FreeBSD, FreeNAS, or
TrueNAS) and who grok ZFS and storage drivers to the point that they have done
quite a bit of development.

~~~
matt_wulfeck
Why FreeBSD?

~~~
b34r
stability.

~~~
pbarnes_1
Compared to?

I agree to some extent that ZFS on FreeBSD might have been more stable than
ZoL, but in general? Meh.

------
twiss
> I purchased these units through a vendor we like to use and they hooked us
> up, so I won’t be able to share my specific pricing. (...) If you build the
> systems out on there you’ll find that they come in around $35,000 (USD)
> each.

That, divided by 52 x 8 = 416TB, is $0.084/GB. For comparison, the Backblaze
Storage Pod 6.0 [1] claims $0.06/GB for the version with the same hard drives,
although the Supermicro build has a bunch of extra features like 2 x 800GB
SSDs for ZFS SLOG, 8x more RAM for a total of 256GB, etc.

[1]: https://www.backblaze.com/blog/open-source-data-storage-server/

------
4ad

> you can run ZFS on Ubuntu [...] You could also build this on Solaris with
> necessary licensing if you wanted to go that route but it'd be more
> expensive.

I find it bewildering the author didn't even consider illumos or FreeBSD,
where ZFS is a first class citizen.

~~~
notpeter
It's likely a human resources problem. For every competent FreeBSD or Illumos
sysadmin there are ten equally experienced Linux admins, and those numbers are
much worse outside of major cities. The commercial support from Ubuntu tips
the scales.

I made this same decision at my last job. I ran Solaris and Illumos on our
file servers and loved it, but a year before I left I ported all the pools to
Ubuntu so my successor only needed to be Linux competent and ZFS trainable.

Sometimes when choosing tech it's not about the technologically superior
solution, nor what you personally could run well; it's what's best for your
coworkers, your successor, and the organization in the long term.

~~~
atmosx
> For every competent FreeBSD or Illumos sysadmin there are 10x equally
> experienced with Linux

That's the Nth time I've read this quote on HN; it has become a classic... You
_can't_ find a FreeBSD sysadmin, but you can find a Linux admin.

Where I work I have to deal with AIX, Solaris, Open/FreeBSD (had NetBSD
before), Linux (all major flavours) and (God forbid) Windows Server (2008,
12R2, 2016 and Nano). I've built packages for most of these systems. I don't
know all of them inside out, but I've NEVER had problems implementing/setting
up/testing features on any of them.

Can you tell me in what way an _intermediate_ Linux sysadmin (say, 5 years of
experience) would have _problems_ managing a Free/Open/NetBSD box?

~~~
tedunangst
Guess it depends on the flavor of admin. With some regularity a linux admin
"with decades of experience" will show up and announce that openbsd is
terribly broken and nothing works, not even the most basic pkg-add command.
Uh, did you mean pkg_add? See! Openbsd is so broken they called the command
pkg_add while I typed pkg-add. I've never had this trouble with linux!

I'd be worried about letting such a person admin linux servers, but I guess
you can limp by as long as you keep your infra to what they already know.
Ideally you'd weed such people out before hiring, but maybe if you need to
hire an admin you don't know enough to do that?

~~~
atmosx
Totally agree. BTW, thanks for signify; it's an awesome system for pkg
signatures. It took me a while to get it working correctly for automatic pkg
signing, but once it all clicked together, the system was as simple as any
I've seen!
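
For anyone curious, the basic workflow from that manpage is short. A minimal
sketch; the key and file names are hypothetical:

    # Generate a keypair (drop -n to protect the secret key with a passphrase)
    signify -G -n -p pkg.pub -s pkg.sec

    # Sign a file; the signature lands in pkg.tgz.sig by default
    signify -S -s pkg.sec -m pkg.tgz

    # Verify the file against the public key
    signify -V -p pkg.pub -m pkg.tgz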

ps. The best documentation I've found was the manpage[1].

[1] [https://man.openbsd.org/signify](https://man.openbsd.org/signify)

------
sandGorgon
The most important line for me was _" Today, you can run ZFS on Ubuntu
16.04.2 LTS with standard repositories and Canonical’s Ubuntu Advantage
Advanced Support. That makes the decision easy."_

It's highly interesting that Canonical does this with ZFS. I'm not sure why
they don't market this more.
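
For reference, the whole thing on stock 16.04 is about three commands. A
minimal sketch; the pool name and devices are hypothetical, and in production
you'd use /dev/disk/by-id paths instead:

    # ZFS lives in the standard repositories
    sudo apt install zfsutils-linux

    # Create a raidz2 pool across six example disks
    sudo zpool create tank raidz2 /dev/sd[b-g]

    # Datasets mount automatically; check pool health
    zpool status tank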

~~~
tylerjl
I'm only a casual user of ZFS on Linux for personal storage projects, but I've
spoken with people who rely very heavily on ZFS on Linux for their small
businesses, and it's interesting to hear their perspectives on this.
Essentially, because btrfs has failed to deliver on the next-gen filesystem
front, ZFS on Linux is such a critical piece of technology that unless Red Hat
has an answer soon for out-of-the-box ZFS on Linux, Canonical has a pretty
staggering advantage. One theory is that Canonical has essentially forced
Oracle to decide whether it wants to crack down on CDDL-licensed code shipping
with Ubuntu's stock kernel, so Red Hat may be waiting to see whether Oracle
swoops in before following suit.

I'd be very surprised if RHEL (by observing the progression of Fedora
development) continues to bet on btrfs, as I have yet to encounter anyone
(including myself) who would ever trust btrfs over ZFS on Linux with anything
of importance, based on their experiences with the two. My experience is
anecdotal, but ZFS has been just as reliable on *BSD as on any Linux
distribution.

~~~
georgyo
Oracle is not the one that could sue. There is nothing in the CDDL that
prevents it being used elsewhere.

The GPL, on the other hand, is a strong copyleft. If you link against GPL
code, your code must also be licensed as GPL.

This means the Linux copyright owners could sue the distributors of ZoL
binaries, but Oracle could not.

Oracle has the power to allow their ZFS code to be relicensed as GPL, removing
this roadblock, but they have no incentive to do so.

~~~
lottin
According to the Software Freedom Conservancy

...redistributing a binary work incorporating CDDLv1'd and GPLv2'd copyrighted
portions constitutes copyright infringement in both directions...
[https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/]

so it seems that Oracle could in fact sue.

~~~
fnj
> so it seems that Oracle could in fact sue.

Says the SFC. But Oracle has had plenty of time and they have _not_ sued. In
fact, they have not criticized Canonical for integrating ZFS.

~~~
wila
The past is not always a predictor of the future, especially as you don't know
Oracle's agenda.

They might just wait until there is more money to be gained from a lawsuit, or
until there are more people already invested in ZFS who won't accept anything
less.

~~~
sandGorgon
The courts will not allow that. You may claim that you recently discovered the
infringement, or that you are a new owner of a patent and are enforcing it.

But I don't think courts will take too kindly to someone who knowingly sat on
a claim and waited for it to become big.

I would assume that Canonical has already sent information about this to
Oracle. If they haven't done anything now, they can't do anything later.

In fact: https://insights.ubuntu.com/2016/02/18/zfs-licensing-and-linux/

 _We at Canonical have conducted a legal review, including discussion with the
industry’s leading software freedom legal counsel, of the licenses that apply
to the Linux kernel and to ZFS.

And in doing so, we have concluded that we are acting within the rights
granted and in compliance with their terms of both of those licenses_

------
vc00000
We bought our initial 2 TrueNAS servers from IX Systems (SuperMicro) back in
2011, have been upgrading over the years, and they have been very reliable
servers.

Currently each server has 63 drives (4TB HGST NL-SAS) with 1 hot spare,
configured as RAIDZ.

Right now there is 200TB of usable storage. We initially started with 29TB and
have been expanding as needed whenever usage hits about 79%: I buy 18 drives
roughly every 6-8 months, 9 drives per server, and expand the pool, as
sketched below.
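
A minimal sketch of that kind of expansion, assuming a pool named tank and
hypothetical FreeBSD device names; note that zpool add is effectively
permanent, so double-check the vdev layout first:

    # Grow the pool by adding another 9-disk raidz vdev
    zpool add tank raidz da33 da34 da35 da36 da37 da38 da39 da40 da41

    # The new capacity is available immediately
    zpool list tank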

To say that we never had issues would be lying; we did have some major issues
when upgrading between versions, but that was early on. Now it is a rock-solid
storage system.

Although there are fewer than 300 active users connecting to the primary
server, it holds a lot of very important pre- and post-production high-def
video.

A reboot with 63 drives takes around 10 minutes or less.

Resilvering can take 24-48 hours, depending on load and on how much data the
failed drive contained.

Performance has been great, reliability has been great, support has been
great.

Sadly, IX Systems can no longer provide support after the end of this year;
they've already extended support beyond the expected lifetime of the hardware.

------
jjirsa
ZFS on Linux and huge single servers, what could go wrong?

It's like a blog written by a 22-year-old straight out of college who's never
dealt with a real production deployment or failure.

ZFS on Linux has data-loss bugs. There's at least one unpatched, and there are
bound to be more.

Single huge servers eventually fail. Maybe it'll be a drive controller. Maybe
it'll be CPU or RAM with bit flips as a side effect. Downtime would be the
least painful part of the eventual failure.

~~~
patrickg_zill
Actually, the real issue is: when the system is 65% full and you reboot, how
long will it take ZFS to mount it?

Perhaps he has split ZFS into a number of different pools that can be mounted
in parallel (depends on the init script and whether ZFS can do this). But I do
recall that larger ZFS pools can take a while to mount; maybe the updated ZFS
on Linux is faster...
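
If anyone wants to actually measure it, a minimal sketch; the pool name is
hypothetical:

    # Time the import and the mounts separately
    time zpool import tank
    time zfs mount -a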

~~~
exikyut
> _how long will it take for ZFS to mount it?_

What? How long _would_ it take, roughly? Genuinely curious.

~~~
XorNot
Depends on the number of snapshots in my experience. 84,000 was slightly too
many.
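
For anyone wondering where they stand, a quick check with stock tooling:

    # Count every snapshot on the system
    zfs list -H -t snapshot -o name | wc -l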

------
guroot
Can I just ask: why not use FreeBSD?

~~~
Timothycquinn
Agreed. With jails, DTrace, and Linux binary support, it would be a no-brainer
for me.

------
mikekij
Almost big enough to archive SoundCloud!

------
jaytaylor
What about cooling? Will the lifespan of the high-capacity platter-dense hard
drives be drastically reduced by clumping them together like that with what
looks like little airflow?

~~~
comboy
AFAIR from the Backblaze blog, a bigger issue that shortens drive lifespan is
vibration.

~~~
deelowe
That's probably because their design doesn't do much to mitigate it.

------
cmurf
Do any of these projects spec hardware that would work for this use case?

opencompute.org

The Backblaze Storage Pod (they're up to v6.0 now)

Netflix Open Connect (which specs Supermicro hardware)

Others?

~~~
electrum
The next generation Facebook design for storage is Bryce Canyon, available
through the Open Compute Project:
https://code.facebook.com/posts/1869788206569924/introducing-bryce-canyon-our-next-generation-storage-platform/

------
andreiw
I wish Supermicro had a similar chassis around the Cavium ThunderX. That would
make a lot of sense for network-attached storage, regardless of whether one
goes with SATA or drops in a SAS adapter or two. Does anyone know if any of
the Cavium accelerators (crypto or compression) can improve ZFS perf?

~~~
ironMann
Since ZoL 0.7, RAID-Z parity and checksumming (Fletcher-4) operations have
been accelerated using SIMD instruction sets (in the case of ThunderX, aarch64
NEON).
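
You can see what ZoL picked at module load time. A minimal sketch; the exact
sysfs paths can vary between versions:

    # Available and selected Fletcher-4 implementations (fastest wins a benchmark at load)
    cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl

    # Same idea for raidz parity math
    cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl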

------
rurban
I'm not a HW guy, but those drives seem to be far too close together. A few
more millimeters of space would keep the temperature down much better, I
assume.

------
z3t4
Anyone else addicted to acquiring servers and high-bandwidth connections? Any
ideas on what to do with the overcapacity?

~~~
sp332
ArchiveTeam is working on backing up the Internet Archive.
http://iabak.archiveteam.org/
http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation

------
notyourday
Very good experience with 45drives.com storinator XLs.

------
yest
>It’s hard – if not impossible – to beat the $/GB and durability that Amazon
is able to provide with their object storage offering.

What the actual fuck?? AWS S3 is an abominable rip-off. Since I moved to
renting my own dedicated server, I've been paying several times less.

~~~
acdha
You're running three geographically separated servers with 24x7 monitoring &
security, automatic rebuilds, and active bit-rot scrubbing? If not, you're
doing a lot less than S3.

It's possible to beat S3 pricing but you either need to be buying a lot of
storage or cutting corners to do it. The most common mistake I've seen when
people make those comparisons is excluding staff time, followed by presenting
a system with no or manual bit-rot protection as equivalent.

~~~
yest
Just buy/rent a dedicated box and set up RAID1; that's all most startups need.
Your points would be valid if S3 let you disable `geographically separated
servers` and `bit-rot scrubbing` (no idea wtf that is). But those who say S3
is a cheap solution for more than a few GB are fools or shills.

~~~
acdha
So … that box is run by a volunteer sysadmin who doesn't charge you? … and
doesn't mind getting up at 3am to replace a drive?

That server has perfectly reliable power and environmental setup so you never
have prolonged downtime or a double disk failure?

You're okay losing everything if someone makes a mistake running that server
since backups cost too much?

You have higher-level software which tells you when data on that RAID array is
corrupted? Your free sysadmin periodically runs an audit to make sure that the
data stored on disk is what you originally stored?

That's what I was referring to with scrubbing: even with RAID, corruption
happens, and most storage admins have stories about the time they found out
it'd happened only after the last good disk had failed, or after the corrupted
data had been written to tape, etc. The best solution is to actively scan
every copy and verify it against the stored hashes for what you originally
stored, which also protects against cases where a bug or human error meant
that e.g. your RAID array faithfully protected a truncated file because the
original write failed and nobody noticed in time. S3 provides a strong
guarantee that you will get back the original data you stored, or an error,
but never a corrupt copy, and that you can prevent storing a partial or
corrupted upload. If you roll your own, you need to provide those same
protections for the full stack or accept a higher level of risk and perhaps
mitigate it in other ways (e.g. Git-style distributed full copies with
integrity checks).
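
To make the scrubbing point concrete, here's a minimal sketch of the
do-it-yourself version, assuming a /data directory and a manifest file; S3
does the equivalent for you continuously:

    # Record a hash for every file at ingest time
    find /data -type f -exec sha256sum {} + > manifest.sha256

    # Periodically re-verify every copy against the original hashes
    sha256sum -c --quiet manifest.sha256

    # On ZFS you would also scrub, which re-checks block checksums on every disk
    zpool scrub tank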

Again, I'm not saying that it's impossible to pay less than S3 but your
response is a bingo card for the corners people cut until something breaks and
they learn the hard way why raw storage costs less than a supported storage
service. Doing this for real adds support cost for the OS, your software,
security, monitoring, backups & other DR planning, etc. If you use S3, Google,
etc. you get all of that built into a price which is known in advance, which
is a significant draw for anyone who wants to spend their mental capacity on
other issues.

Many places don't have enough storage demand for that overhead to pay off in
less than years, and startups in particular should be extremely careful about
spending their limited staff time on commodity functions rather than something
which furthers their actual business. If you're Dropbox, sure, invest in a
capable storage team because that's a core function, but if your business is
different it's time to look long and hard at whether it makes any sense to
devote staff time to saving a few grand a year.

------
BigIQ
Biggest question: why?

At that scale something like Ceph would be more reasonable. Just because ZFS
can handle those filesystem sizes doesn't necessarily mean that it's the best
tool for the job. There's a reason why all big players like Google, Amazon and
Facebook go for the horizontal scaling approach.

------
SoMisanthrope
Very impressive! It's amazing what people are doing with OTS technology.

