832 TB – ZFS on Linux (jonkensy.com)
320 points by beagle3 on Aug 5, 2017 | 157 comments

"I ended up between the Supermicro SSG-6048R-E1CR60L or the SSG-6048R-E1CR90L – the E1CR60L is a 60-bay 4U chassis while the E1CR90L is a 90-bay 4U chassis. The nice part is that no matter which platform you choose, Supermicro sells this only as a pre-configured machine – this means that their engineers are going to make sure that the hardware you choose to put in this is all from a known compatibility list. Basically, you cannot buy this chassis empty and jam your own parts in"

This is a major departure from the Supermicro business model and practices and basically broke all of our next generation expansion roadmaps.

This was not a technical decision - it is the same old economic decision that every large VAR/integrator/supplier has succumbed to for the last 30 years. They aren't the first ones to try this trick and they won't be the last.

We (rsync.net) are not playing ball, however. After 16 years of deploying solely on supermicro hardware (server chassis and JBODs) we bought our first non-supermicro JBOD last month.

The HGST JBOD is built in a much more robust way than anything from Supermicro anyway. The second generation is almost perfect with its integrated ethernet console and standard micro-SAS ports.

What specifically is the problem? Overpriced drives?

Yes. $30-$50 each. Multiply that by 60 or 90 and multiply that by (however many you put in a rack).

That adds up to five figure premiums on drives we're going to burn in anyway.
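To put rough numbers on that, here is a minimal Python sketch of the arithmetic. The per-drive premium ($30 to $50) and the bay counts (60 or 90) come from the comment above; the number of chassis per rack is not stated in the thread, so the value of 8 below is purely an illustrative assumption, as is the function name.

```python
def drive_premium(per_drive_usd, bays, chassis_per_rack):
    """Extra cost paid for vendor-supplied drives across one rack."""
    return per_drive_usd * bays * chassis_per_rack

# ASSUMPTION: 8 chassis per rack is a hypothetical figure, not from the thread.
CHASSIS_PER_RACK = 8

for bays in (60, 90):
    low = drive_premium(30, bays, CHASSIS_PER_RACK)
    high = drive_premium(50, bays, CHASSIS_PER_RACK)
    print(f"{bays}-bay chassis x {CHASSIS_PER_RACK}: ${low:,} to ${high:,} per rack")
```

Under that assumption the premium lands in five figures per rack for either chassis, which is consistent with the claim above.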

We know how to read an HCL - it's not rocket science.

Even if reading an HCL were rocket science, a more validating vendor would preserve buyer self-determination.

OK, 60 x $50 is $3K. Even with an extra $3K, Supermicro is still a lot cheaper.

When drives are never replaced, yes. In the case where the chassis lasts longer than the drives (which I'd imagine is often), extra cost in drives adds up.

Can you say what vendor you are using now?

Well, our vendor continues to be IX Systems (for the most part).

However, instead of buying exclusively supermicro from them, we are now buying these Hitachi 60bay JBODs:


It really bugs me that Supermicro went this way. They used to be a very boring company that did nothing but build great chassis. I used to jokingly say that "supermicro is the rsync.net of hardware".

But now they're getting cute and that's never a good sign.

Those are HGST parts.

Hitachi is NOT HGST.

/Pet peeve

//Used to work for HGST alongside the EMEA 4U60 team, so perhaps a bit sensitive!

///Was often asked to help move the damn things on/off trolleys when the fully-loaded demo units came back for refresh before going out again - it was a 4-person lift with under-unit webbed straps - I'm sure my spine was compressed and I left the job a couple of cm shorter!

We once wanted to buy 200+ JBODs from them and they seemed very happy to put us in touch with their engineering team so we could ask for any modifications we wanted. It felt like they were very flexible and knew how to close deals. We ended up buying that quantity but without any modifications.

Maybe they have gone too large and now have to explore the same tactics from other big companies, like you mentioned. That seems the most plausible explanation.

They're starting to see enterprise growth. Enterprises spend lots of money but are usually pretty dumb.

I'm wondering who signed off on your website. Consider nixing the forced scrolling and/or preventing the header from covering your copy (http://rsync.net/#secondpage).


a) your comment doesn't add anything to the discussion. b) you've obviously got some issues that you need to work out and the HN community shouldn't need to be exposed to them.

What competitors are there? Their hardware is getting more and more specific, looking at e. g. their four-socket systems or their 2U quad-systems. Nothing that doesn't handle like any other computer, but hard to cross-shop for.

There are a couple needful tweaks to this BOM for anyone wanting to follow this..

Only populate one CPU socket. Zone allocation between two NUMA nodes is kind of hard, especially since the ZFS in Ubuntu 16.04 predates OpenZFS ABD, so memory fragmentation is a reality.

I would recommend better NICs like a Chelsio T5 or T6. Aside from better drivers and a responsive vendor, you can experiment with some of the iscsi offloads or zero copy TCP.

Supermicro seriously under-provisioned I/O on that chassis. I'd add LSI/Avago/now Broadcom cards so you can get native ports to every drive. Even if it's just a cold storage box, it will help with rebuild and scrub times and peace of mind. The cost of this is not bad compared to the frustration of SAS expander firmwares. 2x24 or 3x16 and 4 drives on the onboard if you can skip the backplane expander. Supermicro will usually do things like this if you insist, or an integrator like ixSystems can handle it.
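As a quick sanity check on the two HBA layouts suggested above, here is a minimal sketch that counts native drive ports. The port counts per card (24 or 16) and the 4 onboard ports come from the comment; the function name is illustrative, and the 52-drive target is an assumption taken from the "52 x 8" drive count quoted elsewhere in this thread.

```python
def native_ports(hba_count, ports_per_hba, onboard_ports=4):
    """Total direct-attached drive ports with no SAS expander in the path."""
    return hba_count * ports_per_hba + onboard_ports

# ASSUMPTION: 52 drives, as in the "52 x 8" figure quoted elsewhere in the thread.
DRIVES = 52

for hba_count, ports_per_hba in ((2, 24), (3, 16)):
    total = native_ports(hba_count, ports_per_hba)
    print(f"{hba_count} x {ports_per_hba}-port HBA + 4 onboard = {total} ports "
          f"(enough for {DRIVES} drives: {total >= DRIVES})")
```

Either layout gives one native port per drive for a 52-drive build; a fully populated 60- or 90-bay chassis would need more cards or an expander.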

More subjectively, I would also recommend FreeBSD. It seems their main justification for Ubuntu was paid support, which can be had from ixSystems, who sell and support an entire stack (Supermicro servers, FreeBSD or FreeNAS or TrueNAS) and grok ZFS and storage drivers to the point that they have done quite a bit of development.

Why FreeBSD?

The reason is that the Linux and FreeBSD kernels are vastly different. The FreeBSD kernel is more similar to Solaris than Linux and this is what ZFS was developed for.

So in the end you have way more kernel "workarounds" for ZFS on Linux than the FreeBSD implementation. Especially the virtual memory implementation on Linux behaves very differently and has caused many issues over time, but I think it is very good now.

Goldilocks zone for this kind of stuff.. it's a bit simpler, most of what you need is in the base system, and a lot of people are doing the same type of builds.

And for self-support, the freebsd-scsi and freebsd-net mailing lists will get you in touch right away with the people who understand the firmware, protocols, drivers, and systems engineering. You will probably get a response within the day from kernel- and operations-aware developers at Dell Isilon, SpectraLogic, or Netflix who commit the code you are using. You probably won't get past customer support at other organizations in that timeframe.


Compared to?

I agree to some extent that ZFS on FreeBSD might have been more stable than ZoL, but in general? Meh.

> I purchased these units through a vendor we like to use and they hooked us up, so I won’t be able to share my specific pricing. (...) If you build the systems out on there you’ll find that they come in around $35,000 (USD) each.

Divided by 52 x 8 TB = 416 TB, that is $0.084/GB. For comparison, the Backblaze Storage Pod 6.0 [1] claims $0.06/GB for the version with the same hard drives. This build, though, has a bunch of extra features like 2 x 800GB SSDs for the ZFS SLOG, 8x more RAM for a total of 256GB, etc.

[1]: https://www.backblaze.com/blog/open-source-data-storage-serv...
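The per-gigabyte figure can be reproduced with a quick calculation. This is a minimal sketch, assuming decimal units (1 TB = 1000 GB) and the approximate $35,000 system price quoted from the article; the function name is illustrative and the result is raw capacity only, before any RAID-Z parity or filesystem overhead.

```python
def usd_per_gb(system_price_usd, drive_count, drive_tb):
    """Raw cost per gigabyte, ignoring redundancy and filesystem overhead."""
    raw_gb = drive_count * drive_tb * 1000  # decimal TB -> GB
    return system_price_usd / raw_gb

cost = usd_per_gb(35_000, drive_count=52, drive_tb=8)
print(f"${cost:.3f}/GB")  # prints $0.084/GB
```

Usable cost per gigabyte will be higher than this once parity and spares are subtracted, so the comparison with the Backblaze number only holds if both are measured the same way.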

    you can run ZFS on Ubuntu [...] You could also build this on Solaris with necessary licensing if you wanted to that route but it’d be more expensive.
I find it bewildering the author didn't even consider illumos or FreeBSD, where ZFS is a first class citizen.

It's likely a human resources problem. For every competent FreeBSD or Illumos sysadmin there are 10x equally experienced with Linux. And those numbers are much worse outside of major cities. The commercial support from Ubuntu tips the scales.

I made this same decision at my last job. I ran Solaris and Illumos on our file servers and loved it, but a year before I left I ported all the pools to Ubuntu so my successor only needed to be Linux competent and ZFS trainable.

Sometimes when choosing tech it's not about which solution is technologically superior, nor what you personally could run well; it's about what's best for your coworkers, your successor and the organization in the long term.

> For every competent FreeBSD or Illumos sysadmin there are 10x equally experienced with Linux

That's the Nth time I've read this quote on HN; it has become a classic... You can't find a FreeBSD sysadmin but you can find a Linux admin.

Where I work I have to deal with AIX, Solaris, Open/FreeBSD (had Net before), Linux (all major flavours) and (God forbid) Windows Server (2008, 2012 R2, 2016 and Nano). I've built packages for most of these systems. I don't know all of them inside out, but I have NEVER had a problem implementing, setting up, or testing features in any of them.

Can you tell me in what way an intermediate Linux sysadmin (say, 5 years of experience) would have problems managing a Free/Open/NetBSD system?

Guess it depends on the flavor of admin. With some regularity a linux admin "with decades of experience" will show up and announce that openbsd is terribly broken and nothing works, not even the most basic pkg-add command. Uh, did you mean pkg_add? See! Openbsd is so broken they called the command pkg_add while I typed pkg-add. I've never had this trouble with linux!

I'd be worried about letting such a person admin linux servers, but I guess you can limp by as long as you keep your infra to what they already know. Ideally you'd weed such people out before hiring, but maybe if you need to hire an admin you don't know enough to do that?

Totally agree. BTW, thanks for signify; it is an awesome system for pkg signatures. It took me a while to get it working correctly for automatic pkg signing, but once it all clicked together, the system was as simple as any I've seen!

ps. The best documentation I've found, was the manpage[1].

[1] https://man.openbsd.org/signify

It's hard enough to find Linux admins, even in major cities. There are people who will put Linux experience on their resumes, but almost nobody who can boast more than casual home use. We ended up giving up on finding a Linux admin after a few months of the job posting collecting dust and decided to just hire someone fresh out of college and train them from scratch.

Then you run into the problem that BSD admins are even rarer, and AIX/Solaris even rarer than that.

Someone with Linux experience could likely find their way around any UNIX-like system, but given the choice we would rather deal with some awkwardness in a familiar environment where admins have far more experience and we can leverage our existing infrastructure, scripts, and config management.

I disagree.

FreeBSD is just as easy to administer as Linux. Solaris I find to be a bit more challenging. I can appreciate your commentary, but I challenge you to prove me wrong. Just for some color, I've run the gamut of Unix and Linux systems in my career, with FreeBSD being one of the easiest and most consistent out-of-the-box technologies. Solaris has always been quite the opposite. I guess you could say AIX is a close second to Solaris, but I think that's a bit of a grey area considering how opinionated that operating system is.

To be clear I'm not a FreeBSD zealot. I'm for what works. Right now what works is the emerging container based solutions albeit not relevant to this discussion.

> For every competent FreeBSD or Illumos sysadmin there are 10x equally experienced with Linux.

Was parent wrong about that point?

Not exactly, but to any competent administrator, learning FreeBSD based on Linux experience is not that difficult. It's sort of like tasking someone to start working on a program written in Go even if they have only C++ experience. You can cope.

I get a little annoyed with this line of reasoning though.

"Anyone can learn anything" doesn't help me if I need an expert now. And it doesn't magically jump the gap between "functional" (I can make a thing work in an ugly and naive way) and "good" (I can weigh the trade-offs behind the scenes and choose the optimal from multiple alternatives).

Unless the assertion is that FreeBSD / Go is easy, logical, and/or obvious enough that a master Linux / C++ programmer will be productive and community standard-compliant without any effective lag time.

And I'm not trying to be obtuse. I honestly see it a lot and think it's a blind spot: reverse Pareto principle if you will. "Getting up to 80% proficiency is easy, so let's ignore the last 20% because it must also be easy."

Experts have to get made somehow. It's not as if Linux is a frozen target where you can count on being "productive and community standard-compliant without any effective lag time" without going back to the docs sometimes.

A great example in my opinion: Red Hat RHEL7 introduced systemd. A lot changed versus RHEL6. RHEL6 "experts" turned into clumsy RHEL7 "80%-ers". We figured it out.

Not to even mention that SuSE, RHEL, and Ubuntu are about as similar as "Linux" and FreeBSD, if you are worried about the finer points of best practice. We figure it out.

Absolutely. It happens. But to harken back to the original post, there are definite advantages to "technology with X experts available in the market" than "technology with Y experts available in the market". Where X > Y.

And those advantages don't disappear even if Y is easy to learn.

On that basic point, I agree -- although there are advantages to swimming upstream sometimes. Otherwise, given the landscape of 10-15 years ago, we'd be having this discussion about Windows servers instead of Linux!

Besides, probably the best way to find out if it's a "big deal" is to ask your sysadmins. Or, generally, the people who are going to be stuck running it.

Granted on the upstream point! Especially with how quick transformative technology goes through its various phases, it may be essential (/Strangelove emphasis) to make the harder choice now so that you're not behind your competitors in the near future.

I think your argument lies more in the fact that linux is non standard to Unix. They have gone their own way and made it difficult to transfer knowledge. Ask anyone who has ported a Linux application to any other Unix. It's at best a PITA. At worst a nightmare.

Can you do it? Sure. But it isn't pleasant. The Linux community is off in the weeds imo. Doing their own poor re-implementations of tech others have already done. See: Dtrace, Filesystems, Jails/zones, Networking, VM, Init systems, ...

So I think trying to argue it's easier to install Linux because Linux folks can't transfer their knowledge to other OS's speaks volumes to it being a poor choice to invest skills in if you can't transfer them to other OS's

That's not really a fair statement. First of all, Linux started as Linus just reimplementing the Unix semantics, so edge cases and subtle semantics should be expected to be different. Secondly, most Unix-based OSes are barely compatible in their facilities.

Sure, you have DTrace (which only macOS, illumos and FreeBSD have) and ZFS (which only illumos and FreeBSD have) but the rest is similarly incompatible. Solaris/illumos even has a complete NIH-reimplementation of FreeBSD's kqueue (event ports). They have different views on containerisation (Zones/Jails). They've historically had very different opinions on /proc and ioctls, not to mention that they were developed separately for such a long time that their shared history is not very recent.

As a result, porting from Solaris to FreeBSD is also difficult. Maybe it's harder or easier than porting to GNU/Linux, but I wouldn't just flat-out claim that GNU/Linux is the only member doing things that are incompatible.

Sun's event ports are so similar to kqueue that Cantrill has said they should have just adopted kqueue.

And there are not different views on zones and jails. They are the same thing. Sun just took the idea of jails and fleshed it out further, adding a separate network stack for each zone, which jails now also have. But they operate on the same principle and ideas. Jails were bare bones at inception. Jails shared their IP stack with the host; this was before cloud computing and the need for separate network stacks. They both started with being secure and then added features, whereas the Linux container mess started with features and then continues to try to address the fact that they are insecure by design. So jails and zones are similar.

Porting from FreeBSD to Solaris and vice versa is easier than you think. If it were insanely hard, FreeBSD wouldn't have ZFS or DTrace from Sun. DTrace was almost single-handedly ported by one engineer. An amazing engineer, but he was the only one. Same with ZFS. And illumos ported the FreeBSD installer. Also done by one individual. All very good engineers, but still just one.

I also have watched Cantrill's talks. But I think you're intentionally ignoring my point, in two ways:

* You were discussing porting a Linux application, not a kernel feature like DTrace, ZFS, Zones, kqueue, etc. Obviously porting a kernel feature between two OSes that share a kernel history is going to be easier than porting to an entirely different kernel. It's almost tautological. Porting an application has more to do with whether the syscall/libc interfaces are compatible and if you take kqueue/eventports as an example you still need to do standard porting work. glibc provides BSD-like interfaces so it's not like you have to switch away from bzero or whatever -- that's not the hard part of porting code.

* You specifically stated that Linux is "off in the weeds", "doing their own poor re-implementations of tech others have already done". Ignoring how disrespectful that is, your response to me saying "the whole Unix family re-implements each others ideas all the time -- DTrace and ZFS are the exception and only three members of the family use them" isn't helping your original point.

Also this whole section is just a non-sequitur:

> And there are not different views on zones and jails. They are the same thing. [Long description of how they are different and were developed separately.] [Random aside about Linux containers and how they're a mess.] So jail and zones are similar.

I am aware of the similarities and differences between Jails and Zones, and I'm also very painfully familiar with Linux containers. Not sure why you're bringing them up in a discussion about porting applications between different Unix-like operating systems. Sounds like you just have an axe to grind.

event ports is not an NIH reimplementation. It was a framework intentionally developed to meet specific business and technical needs within the Solaris threading model at that time.

It's far closer to Windows IOCP than FreeBSD kqueues IMHO.

It's also been one of the most successful features added to Solaris and is used throughout the system.

As long as you can pay. You can find anything[1].

[1] https://www.freebsd.org/commercial/consult_bycat.html

> Was parent wrong about that point?

I believe it was. See comment above. FreeBSD has excellent documentation on nearly every topic. Nearly every program has a manpage and there's always google to help you.

For advanced topics (CARP, DTrace, ZFS, Jails, Accounting API) the docs are excellent, and you'll have to do some reading to properly implement any of those anyway.

That's part of the job actually, reading/learning.

I thought the same thing; they sort of answer it in the comments.

Something about FreeNAS offering no support, and that the pre-built systems from iXsystems have lower drive bay counts.

My guess is they just prefer Linux. FreeBSD or illumos would definitely work.

I run napp-it on omni-os at home and work with good results. It's not too hard to adapt your linux knowledge if you're not used to the solaris-ish userspace.

For someone not very up to date with all the various ZFS flavors, can you please explain this a bit?

ZFS is a native filesystem on FreeBSD, whereas on Linux it is not.

Sure but that alone doesn't mean it's any less suitable on Ubuntu. I'm just trying to understand what the differences are in implementation, reliability, etc.

ZFS on FreeBSD and on Illumos has been used, abused, tested and stressed all the way to the Moon and beyond. Loads of people with loads of data over loads of time.

This makes bugs appear, and get addressed, and eventually gives you confidence that nothing nasty remains uncorrected, and it won't eat your data.

ZFS on Linux, due to the unfortunate licensing situation, is considerably less tested and thus scary, data-eating bugs are at least a bit more likely to exist.

On the other hand Linux is far more popular so I wouldn't be surprised if ZFS on Linux already clocks more hours of usage than on BSD.

> On the other hand Linux is far more popular so I wouldn't be surprised if ZFS on Linux already clocks more hours of usage than on BSD.

Yes and no. Few people, even among the Linux crowd, are aware it exists or feel like trusting it with important data sets/jobs. Those few are the most likely to be equally at ease running FreeBSD or Illumos (or Solaris), where the damn thing is known to work really well.

ZFS-on-Linux has not been mainstream (or existed) for as long as ZFS has been integrated inside FreeBSD. illumos is the free software fork of OpenSolaris -- so it's the repo of record for modern ZFS development.

OpenZFS is the parent project from which ZFS is applied to Illumos, FreeBSD, and ZFSoL


I'm aware of that, I just didn't want to additionally confuse someone who is barely familiar with ZFS with the whole OpenZFS fork and so on.

ZoL uses an emulation layer to make Linux look more like something with Unix ancestry, ancestry that BSD and Solaris share. The biggest difference I believe is in memory management / fs cache integration.

In practice this means that ZoL may require more memory to be stable, or may be less stable depending on configuration if low in memory.

I've run ZFS on Solaris (by way of Nexenta) and Linux since 2008 or so. I haven't seen much reliability difference in practice. I've had fewer hiccups streaming video from Linux though.

It's also had more baking time on FreeBSD.

He explained on Reddit that he has some support from Canonical as well.

Yeah, I was like, uhhhh, what? Illumos, illumos-based OSes, and FreeBSD are the only games in town where ZFS is a first-class citizen. I'll chalk it up to extreme bias.

Yeah, many people don't consider FreeBSD because of silly reasons… :(

but mentioning Solaris that "requires licensing" and not illumos?? illumos is the new Solaris, the old Solaris devs are working on illumos! Oracle Solaris is irrelevant.

> Oracle Solaris is irrelevant.

Amen to that.

The most important line for me was "Today, you can run ZFS on Ubuntu LTS with standard repositories and Canonical’s Ubuntu Advantage Advanced Support. That makes the decision easy."

It's highly interesting that Canonical does this with ZFS. I'm not sure why they don't market this more.

I'm only a casual user of ZFS on Linux for personal storage projects, but I've spoken with people who rely very heavily on ZFS on Linux for their small businesses, and it's interesting to hear their perspectives on this. Essentially, because btrfs has failed to deliver on the next-gen filesystem front, ZFS on Linux is such a critical piece of technology that unless Red Hat has an answer soon for out-of-the-box ZFS on Linux, Canonical has a pretty staggering advantage. One theory is that Canonical has essentially forced Oracle to decide whether it wants to crack down on inclusion of CDDL-licensed code being shipped with Ubuntu's stock kernel, so Red Hat may be waiting to see if Oracle sweeps in or not before following suit.

I'd be very surprised if RHEL (by observing the progression of Fedora development) continues to bet on btrfs, as I have yet to encounter anyone (including myself) who would ever trust btrfs over ZFS on Linux with anything of importance based on their experiences with the two - my experience is anecdotal, but ZFS has been just as reliable on *BSD as with any Linux distribution.

SUSE has succeeded in shipping atomic OS updates and rollbacks by default both for enterprise and community versions of Linux several years ago. Facebook is using it, and the usage is growing. [1] And there are any number of examples of production usage of Btrfs at scale if you go look for them. [2] And the good and bad are fairly well understood, it is in fact getting better and will continue to get better.[3] And Red Hat isn't waiting. [4]

[1] https://www.spinics.net/lists/linux-btrfs/msg67885.html [2] https://www.spinics.net/lists/linux-btrfs/msg67308.html [3] https://www.spinics.net/lists/linux-btrfs/msg67940.html

[4] https://stratis-storage.github.io/StratisSoftwareDesign.pdf

What you're going to find in storage is that there are multiple valid approaches, no clear one size fits all winner, and people will choose based on the tools they're familiar with, the company they trust, and their use case. I pick Btrfs over ZFS pretty much based on equivalent trust and better flexibility for my use case, but then I'm much more familiar with Btrfs tools and where the bodies are buried than I am with ZFS. I don't need to go around impugning other projects to justify what I use.

Aren't Synology NAS devices also using Btrfs?

Yep, although they don't use Btrfs RAID but instead run it on top of LVM which provides the necessary RAID functionality.

Does using LVM still allow bitrot self-healing on the Btrfs side?

Red Hat is apparently building something new called Stratis, based on XFS; they're shooting for delivery within the next 1-2 years. Supposedly it'll be feature-equivalent to ZFS/Btrfs/LVM.


The ZFS comment in the document [1] has no specific date attached. Version 1.0 is slated for the first half of 2018, and the "rough ZFS feature parity" comment does not apply until version 3.0, which has no listed time frame.

This is going to take a lot of work, and not just for the Stratis developers but for projects that need to manipulate it. It's asking for a lot of work for bootloader projects to support it, and

[1] https://stratis-storage.github.io/StratisSoftwareDesign.pdf

I'm friends with some of the original ZFS guys, Jeff Bonwick was a student of mine at Stanford and I got him to come to Sun, Bill Moore worked for me on BitKeeper. Those guys are seriously studly engineers. I've done file system work at Sun, I don't compare with Bill & Jeff, they are way better.

So yeah, trying to do better than those guys is going to be interesting. I'd like to know who is on the RedHat team working on this new file system, just the fact that they are trying is interesting.

Edit: and I was at SGI when XFS was pretty new, I know some of the XFS folks as well, Adam Sweeney and Mike Nishimoto.

I was the guy that plugged XFS into NFS over HIPPI, so I have more than a passing knowledge of it:


XFS was pretty cool but a lot of the technology that made it fast was XLV, the logical volume manager. XFS just made sure it handed very large, aligned I/O requests to the volume manager; the volume manager was the layer that split them up and got all the DMA engines going. That's how we did 500 MB/sec in the early 1990s on 200 MHz MIPS chips: the MIPS chips weren't touching the data, the DMA engines were (the networking stack did page flipping to avoid bcopies).

I know XFS did other stuff for scaling but it most certainly did not do all the safety stuff (and I don't think it did transparent compression, those 200mhz MIPS cpus weren't fast enough to put that in there) that ZFS does.

So I'm wondering how much XFS has evolved from the SGI days. If it hasn't, I don't get why RedHat started there. Be really interested to know the back story.

XFS is the de facto standard file system for RHEL deployments, they employ a lot of engineers familiar with the code and it's the default since RHEL7 came out.

Do you know if they have any of the SGI team? Like Adam and/or Mike?

    $ git log --since="2016-01-01" --pretty=format:"%an" --no-merges -- fs/xfs | sort -u

XFS history https://lwn.net/Articles/638546/

Linux file systems, where did they come from? (Presented by Dave Chinner who has been XFS maintainer for a few years.) https://www.youtube.com/watch?v=SMcVdZk7wV8

Eric Sandeen works for Red Hat on XFS, at the least. Mike and Adam are elsewhere according to their LinkedIn profiles - dunno who else was working on it at SGI.

Filesystems are HARD, see the plethora of FS's on Linux that are dead, dying, or horribly engineered. This seems like such a monumental waste of resources, time, and energy to chase ZFS, which itself is not a stationary target. ZFS is constantly developing new features. I think a lack of ZFS in Linux is hurting it. And that wound will only deepen over time.

Well obviously Linux devs would love having ZFS in the kernel, Linus even half joked that he'd consider trying to relicense Linux to GPLv3 if it meant he could use ZFS and DTrace (this was back when Sun was considering GPLv3 as the license to use when going open source).

But the problem remains, CDDL and GPL are likely incompatible, and Linus has said that no CDDL licensed code will be merged, most likely per advice from lawyers.

So unless something drastically changes, ZFS is off the table, and thus work will continue on with alternatives, Stratis is one such alternative, bcachefs is another, and of course btrfs will not die just because Red Hat isn't supporting it anymore, as they barely did to begin with.

Oracle is not the one that could sue. There is nothing in the CDDL that prevents it being used elsewhere.

The GPL on the other hand is a strong copy left. If you link against GPL code, your code must also be licensed as GPL.

This means the Linux copyright owners could sue the distributors of ZoL binaries, but Oracle could not.

Oracle has the power to allow their ZFS code to be relicensed as GPL, removing this road block, but they have no incentive to do so.

According to the Software Freedom Conservancy

...redistributing a binary work incorporating CDDLv1'd and GPLv2'd copyrighted portions constitutes copyright infringement in both directions... [https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/]

so it seems that Oracle could in fact sue.

> so it seems that Oracle could in fact sue.

Says the SFC. But Oracle has had plenty of time and they have not sued. In fact, they have not criticized Canonical for integrating ZFS.

The past is not always a predictable measure for the future, especially as you don't know Oracle's agenda.

They might just wait until there is more money to be gained from a lawsuit or until there's more people already on ZFS who don't want to have anything less.

The courts will not allow that. You may claim that you recently discovered that there is infringement, or you can claim that you are a new owner of a patent and are enforcing it.

But I don't think courts will take too kindly to someone who knowingly sat without doing something and waited for it to become big.

I would assume that Canonical has already sent information about this to Oracle. If they haven't done anything now, they can't do anything later.

in fact - https://insights.ubuntu.com/2016/02/18/zfs-licensing-and-lin...

We at Canonical have conducted a legal review, including discussion with the industry’s leading software freedom legal counsel, of the licenses that apply to the Linux kernel and to ZFS.

And in doing so, we have concluded that we are acting within the rights granted and in compliance with their terms of both of those licenses

As I understand it, the license incompatibility between GPL and CDDL is that CDDL adds a restriction regarding patents, and GPL does not allow for additional restrictions to be added.

So, as I see it, distributing CDDL-licensed code this way is a breach of BOTH licenses. Ignoring the 'patent peace' requirement of the CDDL, which would be the case if distributing it as GPL, does not sound legal to me (IANAL).

I can certainly understand why Linus has stated that no CDDL code will be mainlined (brought into the Linux kernel tree), it all seems very ambiguous.

IANAL, but it (probably) depends on which license Canonical is using for the ZFS plugin.

If they use CDDL, then I can't see how Oracle would have a case.

They can't legally use anything except CDDL. If anyone could just pick and choose to alter the licensing, there wouldn't be any point in having licences.

RHEL is not continuing to bet on btrfs; they are deprecating it: https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...

As was explained in depth in the HN comments, Red Hat's decision to not provide enterprise support is based on the fact that they don't have the necessary engineers to backport and maintain one of the fastest moving pieces of code in Linux today. All of their btrfs engineers moved to Facebook, and they're still working on btrfs improvements.

There are many decisions involved in deciding what to provide enterprise support for, and whether you like it or not, technical merit is only one of many factors. So Red Hat's decision was likely not entirely based on technical merit.

SUSE still provides enterprise support for btrfs (like we've always done), and there's still plenty of work being done from various large contributors.

[I work at SUSE.]

It sounds like you think you're disagreeing with me, but I'm not sure how that's possible given the near total lack of opinion in my post. In any case, the aspect of the parent comment that I was thinking of when I posted that was the part where the parent points out that Ubuntu has ZFS and how that could be a problem for RHEL. Yes RHEL offers a number of other FS options, but to my knowledge only ZFS and btrfs currently offer checksumming, which IMO is a big differentiator vs. more conventional filesystems.

I disagree with the use of the word "deprecating", and everyone has been parroting around the news as though Red Hat announced that btrfs causes machines to catch on fire.

Red Hat never provided enterprise support for btrfs, it was a technical preview that didn't pan out to become fully supported as part of their distribution. It's barely a story (there are plenty of other filesystems Red Hat doesn't support), but it's a good opportunity to spread misinformation.

> I disagree with the use of the word "deprecating"

Then you disagree (literally) with Red Hat's official statement as a matter of fact (not opinion). From Chapter 53, RHEL 7.4 release notes: "Btrfs has been deprecated ... Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux."[0]

Pretty definitive.

[0] https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...

The context in which "deprecated" is used is "as a technical preview", which is different than saying that "in our opinion, btrfs is a deprecated technology and people shouldn't use it". So, while Red Hat did use that word, if you don't follow it up by explaining in what way support was deprecated then you're misleading people who aren't aware of the whole situation.

    $ git log --since="2016-01-01" --pretty=format:"%an" --no-merges -- fs/btrfs | sort -u
And then iterate for ext4 and xfs, and that comes to: 100 btrfs, 71 ext4, and 63 XFS contributors over those 18 months. They are all healthy ecofilesystems.
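For anyone who wants to repeat that count for all three filesystems in one go, here is a minimal Python sketch of the same idea. It assumes a local checkout of the kernel tree; the function names and the `repo` argument are made up for illustration, and the exact numbers will depend on the date range and the state of your checkout.

```python
import subprocess


def unique_authors(log_output: str) -> set[str]:
    """Distinct, non-empty author names from `git log --pretty=format:%an` output."""
    return {line.strip() for line in log_output.splitlines() if line.strip()}


def count_contributors(repo: str, path: str, since: str = "2016-01-01") -> int:
    """Run the same git log query as above for one subdirectory and count authors."""
    result = subprocess.run(
        ["git", "-C", repo, "log", f"--since={since}", "--pretty=format:%an",
         "--no-merges", "--", path],
        check=True, capture_output=True, text=True,
    )
    return len(unique_authors(result.stdout))


def report(repo: str) -> dict[str, int]:
    """Contributor counts for the three filesystems discussed in this thread."""
    return {p: count_contributors(repo, p) for p in ("fs/btrfs", "fs/ext4", "fs/xfs")}
```

Calling `report("/path/to/linux")` (a hypothetical path) returns a dict keyed by directory. Note that this counts distinct author name strings, so one person committing under two spellings is counted twice, exactly as with `sort -u`.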

I have bet on ZFS for years in both business and my most important personal information and it has never let me down, on Illumos or Linux. Snapshots, upgrades, sync, and data transfer are just too easy to even consider switching. And support and features have seen steady improvements over the last decade. No other filesystem has all of these properties.

> One theory is that Canonical has essentially forced Oracle to decide whether it wants to crack down on inclusion of CDDL-licensed code being shipped with Ubuntu's stock kernel, so Red Hat may be waiting to see if Oracle sweeps in or not before following suit.

That is not a "theory" that anyone who understands the basics of the licenses adheres to. The possible threat is that linux' GPL could attack ZFS; not the other way round. There is nothing in the CDDL in the way of using ZFS in linux.

btrfs is getting dropped from future versions of RHEL so you can see what Red Hat thinks of the future of btrfs.

RedHat is betting on xfs.

Red Hat has never spent any real resources on btrfs, they've been all about XFS when it comes to enterprise needs.

They employed one of the original Btrfs developers for the first half of its development, but not since 2012. I can't speak of their recruitment efforts, but it does seem all the Btrfs developers are happy at their respective companies; and the Red Hat file systems folks are happy working on what they're working on which is not Btrfs.

>One theory is that Canonical has essentially forced Oracle to decide whether it wants to crack down on inclusion of CDDL-licensed code being shipped with Ubuntu's stock kernel

Huh? Is Canonical shipping ZFS in their kernel now? I thought they just distributed it as a separate module.

There was an article a bit ago on HN where the Software Freedom Conservancy wrote that legally, shipping a precompiled binary kernel module is the same as having it in the kernel.

But obviously Canonical's lawyers don't think shipping it separately is the same as shipping it in the kernel, because if they did they would not ship it separately to begin with.

The argument really is based around what constitutes a derivative work. ZFS does not depend on Linux, OpenZFS which ZFS on Linux derives from is an Illumos project first and foremost - does modifying it to plug into the Linux VFS layer make it a derived work of the Linux kernel?

If this ever goes to court it will be an interesting case.

Granted. But Canonical's lawyers are not canonical.

There's legal disagreement, and saying "Ubuntu did it" doesn't make those problems disappear. (Sadly)

I haven't seen any legal opinion since Canonical's press release, which states that they consulted with counsel who believed that they were in compliance with both licenses.

I -have- seen a lot of argument online, opinions and otherwise that disagree with this, but not anything from a lawyer - mainly "philosophical" or opining on the "intent" of the GPL, rather than "based on clauses x, y, and z, this is impermissible".

Which is not to say, if something of an opinion exists in response to Canonical's release of ZFS, I would love to see it - a legal opinion, not armchair lawyering or opinionizing.

>>But Canonical's lawyers are not canonical.

Hmm... not sure what you meant by this statement ?

"authorized; recognized; accepted"

I.e. their opinion is not the only opinion on the matter.

I'm not sure. Maybe it's because they're not really supporting it all that well. The hole_birth data loss bug is still not fixed in 16.04 LTS.


This is a major data loss issue with some workflows and it's pretty significant that Canonical hasn't backported it.

We bought our initial 2 TrueNAS servers from IX Systems (SuperMicro) back in 2011, have been upgrading over the years and they have been very reliable servers.

Currently each server has 63 drives (4TB HGST NL-SAS) with 1 hot spare, configured as RAIDZ.

Right now there is 200TB of usable storage. We initially started with 29TB and have been expanding as needed: when the pool hits about 79%, I buy 18 drives (9 per server), roughly every 6-8 months, and expand the pool.

To say that we never had issues would be lying. We did have some major issues when upgrading between versions, but that was early on; now it is a rock-solid storage system.

Although there are fewer than 300 active users connecting to the primary server, there is a lot of very important pre & post production high def video.

Reboot with 63 drives is around 10 minutes or less.

Resilvering could take 24-48 hours, depending on load, depending on how much data the failed drive contained.

Performance has been great, reliability has been great, support has been great.

Sadly IX Systems can no longer provide support after the end of this year, they've extended support beyond the expected lifetime of the hardware.

Zfs on linux and huge single servers, what could go wrong?

It's like a blog written by a 22 year old straight out of college that's never dealt with a real production deployment/failure

Zfs on Linux has data loss bugs. There's at least one unpatched and there are bound to be more.

Single huge servers eventually fail. Maybe it'll be a drive controller. Maybe it'll be CPU or ram with bit flips as a side effect. Downtime would be the least painful part of the eventual failure.

> Zfs on Linux has data loss bugs.

Please don't spread untruths. Somebody who doesn't know better might actually believe you.


Unpatched on 16.04, referenced as supported in the article

That was strictly speaking an openzfs bug that hit all the ports.

There's another one, somewhat related:


which (so far) seems tied to using recordsize > 128k without either of the -L or -c flags on the zfs send side, with the result that the sendstream is corrupted in such a way that the receiver cannot detect the corruption. As with the filled-holes problem, the problem is real but rare. Unlike the filled-holes problem, it is unlikely to affect many people since it is (probably) very rare that anyone uses large records and does not use -L (or -c, or both), although there are certainly automatic snapshot-send systems (e.g. znapzend) that use a common minimal set of options to zfs send.

This is especially unfortunate because of the rarity of the corruptions, the apparent rarity of people using POSIX-layer checksumming (e.g. rsync -c, or sha256deep or the like) on large datasets (with large files that had holes made and refilled, for example) to validate that a received dataset really is the same as the original, and the apparent rarity of people doing this sort of validation specifically targeting backwards compatibility mechanisms (e.g. zfs recv into a version 28 or earlier pool from a source dataset that uses all the most recent bells and whistles).

Finally, it is extra-especially unfortunate because recovering from this sort of corruption is awkward and time-consuming; at the minimum the source and destination have to be entirely read at least once or alternatively the destination needs to be destroyed and sent again from scratch once the fix or workaround for send|recv corruptions is known.
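To make the POSIX-layer validation mentioned above concrete, here is a rough Python sketch of the idea behind `rsync -c` or sha256deep: hash every regular file under the mounted source and destination datasets and report what differs. The function names are invented for this example, and it only checks file contents (not metadata, xattrs or ACLs), so treat it as an illustration rather than a substitute for those tools.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each regular file's path (relative to root) to its SHA-256."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # skip symlinks and special files in this sketch
            digests[os.path.relpath(full, root)] = file_sha256(full)
    return digests


def compare_trees(src_root, dst_root):
    """Return (missing_on_dst, extra_on_dst, content_differs) as sorted lists."""
    src, dst = tree_digests(src_root), tree_digests(dst_root)
    missing = sorted(set(src) - set(dst))
    extra = sorted(set(dst) - set(src))
    differing = sorted(p for p in set(src) & set(dst) if src[p] != dst[p])
    return missing, extra, differing
```

Because it reads every byte on both sides, a full pass costs one complete read of source and destination, which is the same cost the comment above describes for recovery.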

ZoL or OpenZFS is irrelevant, to be honest. The point is it's an experimental filesystem (at least on Linux), and there's NO REASON you should have 800TB on one single server/filesystem, because it opens you up to bugs like these.

If, instead, they had sharded+replicated it across 40 x 20TB x (replication factor) systems, they'd pay a lot more in power, but they'd be able to tolerate a single FS bug unless it somehow hit all of the replicas.
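As a back-of-the-envelope check, assuming the comment above means roughly 40 shards of 20 TB each with every shard copied replication-factor times, the node count and raw capacity work out as in this small Python sketch. The function name and the choice of 3 as a replication factor are illustrative assumptions, not figures from the thread.

```python
def sharded_layout(usable_tb, shard_tb, replication_factor):
    """Return (shards, nodes, raw_tb) for a one-shard-per-node replicated layout."""
    shards = -(-usable_tb // shard_tb)   # ceiling division
    nodes = shards * replication_factor  # every shard lives on this many nodes
    raw_tb = nodes * shard_tb            # total disk capacity that must be bought
    return shards, nodes, raw_tb


shards, nodes, raw_tb = sharded_layout(800, 20, 3)
print(f"{shards} shards, {nodes} nodes, {raw_tb} TB raw")
```

With a replication factor of 3 that is 120 small systems instead of one large one, which is where the extra power cost comes from.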

Oh, Sean and Jeff. But for back ups, OK, right?

I read the patch and linked issues. It looks like an OpenZFS bug to me, which wasn't ported to ZoL until recently but it definitely doesn't look like a ZoL-exclusive bug.

Not to mention that the issue description doesn't say it's a data loss bug. It's a bug that means that ZFS send would not include holes in very specific scenarios. On-disk data would still be safe as far as I can see (though I'm not an FS expert by any stretch).

> There's at least one unpatched and there are bound to be more.

References, please.

It might be worthwhile mentioning that at some point in the past, HN ran on 1 server (and might still do so).


Cute but irrelevant?

Actually the real issue is, "when the system is 65% full and you reboot, how long will it take for ZFS to mount it"?

Perhaps he has split up ZFS into a number of different pools and they can be mounted in parallel (depends on the init script and whether ZFS can do this). But I do recall that larger ZFS pools can take a bit of time to mount; maybe the updated ZFS for Linux is faster....

> how long will it take for ZFS to mount it?

What? How long would it take, roughly? Genuinely curious.

Depends on the number of snapshots in my experience. 84,000 was slightly too many.

I'm planning on doing a tiny ZFS pool (once I've finally saved up for it - financials can be fun sometimes!), and was thinking of doing FreeBSD for ZFS and Linux for everything else on top of Xen.

I'm currently unsure how to make Linux see the ZFS pool though. I.... don't really like NFS. It's too glitchy in my experience. I use it to listen to music stored on a different machine from my laptop, which uses a long-range USB Wi-Fi adapter. If I unplug the adapter without cleanly unmounting /nfs, I get a kworker in an infinite loop. I googled around one afternoon and discovered that RHEL apparently found and fixed this in kernel 3.x. Interesting - I'm on Slackware, with kernel 4.1.x. >.<

I wish you could do cross-VM virtio. That would be awesome. Then I could export the device node corresponding to the whole pool from FreeBSD and just mount it as a gigantic ext4 filesystem on Linux. (Can you do that?!)

You could in theory have a zvol exported from a ZFS pool on FreeBSD to Linux via some remote block dev protocol. iSCSI comes to mind, as it would have to be cross platform. Then slap ext4 on top of that in Linux.

The idea feels "janky" though. Lots of overhead compared to just running ZFS on Linux.

Not just in theory. I've done this for a large dev/test Xen cluster (pool was actually on Solaris). Worked well. We made snapshots to match major release tags.

That's something I was thinking about. This thing has to go down occasionally for patches, service, maintenance. 500TB is a lot to trust to a single box or two.

I'm building a Ceph cluster right now, about the same size as the one in the article, except with 9x36 drive chassis.

Clustered filesystems just mean that one machine can wreak havoc across many domains of fun. (I used early lustre, I know from fun experience)

ceph seems nice, but appears to be very CPU hungry and not very fast.

Seems easier to have sharded vanilla linux file servers. It requires a decent asset manager, but once you have that, backups and rebalancing become trivial.

My (admittedly limited) experience with ZFS has been bad. I know it's supposed to be great but that is not what I have experienced. Multi-hour outages. Having to have Oracle help resolve problems. This was a few years ago now, maybe it's better these days. I was not the administrator, just a user, but we had ZFS issues at least quarterly on a ~20TB shared filesystem on a Linux cluster.

Was ZFS determined to be the problem in the root cause analysis, or was it another part of the storage system that failed? It's easy to pass the blame down to the lowest level of the system because no one wants to take the blame.

If it was actually determined to be the fault of ZFS, what was the problem? I'd like to avoid it in my own deployment.

Been running an 8 drive array for nearly 5 years now myself and never once had an issue through 3 PCs and 3 different operating systems. What was your issue?

With utmost respect: this isn't a super valuable data point. A "production deployment" with lots of users and/or workloads will see issues you will never encounter in a moderate setting.

FWIW, I use FreeNAS in a similar small but diverse setting with nary a problem, but when I set it up for a 30+ organization, issues arose that I hadn't expected (not data loss, but usability and performance issues).

ZFS has definitely been used in very large production deployments in a variety of scenarios for well over a decade. The question seems reasonable given that history. ZFS isn't some toy filesystem hacked together over a weekend.

Does anyone have any proof of this? Seriously I'd love to see some. Bug reports or anything?

> It's like a blog written by a 22 year old straight out of college

Also no mention of FreeBSD. Instead he picked the less battle tested ZoL and only mentioned Solaris and licensing, so the author is not even aware of Illumos.

You care to back up those data loss claims with any proof?

I agree that this smells a bit funny, but the author seems to dance around the important topics like vpools, raid levels, and use case. Instead they just focus on the hardware, which is arguably the least interesting bit (until you know the former).

Can I just ask. Why not use FreeBSD?

Agreed. With Jails, dtrace and Linux binary support, it would be a no-brainer for me.

I'm surprised how often people ask this.

Almost big enough to archive SoundCloud!

What about cooling? Will the lifespan of the high-capacity platter-dense hard drives be drastically reduced by clumping them together like that with what looks like little airflow?

AFAIR from the backblaze blog, a bigger issue that shortens the lifespan is vibration.

That's probably because their design doesn't do much to mitigate it.

Could that be mitigated by redesigning the chassis for slightly less density in order to fit massive rubber standoffs/grommets?

Does either of these projects spec hardware that would work for this use case?

opencompute.org Backblaze storage pod, they're up to v 6.0 now

(Netflix open connect specs supermicro hardware)


The next generation Facebook design for storage is Bryce Canyon, available through the Open Compute Project: https://code.facebook.com/posts/1869788206569924/introducing...

I wish Supermicro had a similar chassis around the Cavium ThunderX. That would make a lot of sense for network-attached storage, regardless of whether one goes with SATA or drops in a SAS adapter or two. Does anyone know if any of the Cavium accelerators (crypto or compression) can improve ZFS perf?

Since ZoL 0.7, Raid-Z parity and checksumming (Fletcher4) operations are accelerated using SIMD instruction sets (in case of ThunderX, using aarch64 NEON)
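For readers unfamiliar with Fletcher4, here is a plain scalar reference in Python of the checksum as I understand it: four 64-bit running sums over the buffer read as 32-bit words. It assumes little-endian words and a length that is a multiple of 4 bytes, and it is only a sketch for illustration, not the ZoL code. The SIMD versions mentioned above compute the same four sums, just several words at a time.

```python
import struct

MASK64 = (1 << 64) - 1  # the accumulators wrap like unsigned 64-bit integers


def fletcher4(data: bytes):
    """Return the four Fletcher4 accumulators (a, b, c, d) for `data`."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):  # little-endian 32-bit words
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d
```

Each accumulator depends on the one before it, which is why a vectorized version needs some care to recombine per-lane partial sums into the same result.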

I'm not a HW guy but those drives seem to be far too close together. A few more millimeters of space would keep the temperature down much better, I assume.

Anyone else addicted to acquiring servers and high bandwidth connections ? Any ideas on what to do with the over capacity ?

ArchiveTeam is working on backing up the Internet Archive. http://iabak.archiveteam.org/ http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/g...

Software mirrors, http://yacy.net/, i2p, https://filecoin.io/

run a Tor middle relay

Very good experience with 45drives.com storinator XLs.

>It’s hard – if not impossible – to beat the $/GB and durability that Amazon is able to provide with their object storage offering.

what the actual fuck?? AWS S3 is an abominable rip-off. Since I started renting my own dedicated server, I have been paying several times less.

You're running three geographically separated servers with 24x7 monitoring & security, automatic rebuilds, and active bit-rot scrubbing? If not, you're doing a lot less than S3.

It's possible to beat S3 pricing but you either need to be buying a lot of storage or cutting corners to do it. The most common mistake I've seen when people make those comparisons is excluding staff time, followed by presenting a system with no or manual bit-rot protection as equivalent.

Just buy/rent a dedicated box and put RAID1 on it; that's all most startups need. Your points would be valid if S3 allowed disabling `geographically separated servers` and `bit-rot scrubbing` (no idea wtf this is). But those who say S3 is a cheap solution for more than a few GB are fools or shills.

So … that box is run by a volunteer sysadmin who doesn't charge you? … and doesn't mind getting up at 3am to replace a drive?

That server has perfectly reliable power and environmental setup so you never have prolonged downtime or a double disk failure?

You're okay losing everything if someone makes a mistake running that server since backups cost too much?

You have higher-level software which tells you when data on that RAID array is corrupted? Your free sysadmin periodically runs an audit to make sure that the data stored on disk is what you originally stored? That's what I was referring to with scrubbing: even with RAID corruption happens and most storage admins have stories about the time they found out it'd happened after the only good disk failed, been written to tape, etc. The best solution is to actively scan every copy and verify it against the stored hashes for what you originally stored, which also protects against cases where a bug or human error meant that e.g. your RAID array faithfully protected a truncated file because the original write failed and nobody noticed in time. S3 provides a strong guarantee that you will get back the original data you stored or an error but never a corrupt copy and that you can prevent storing a partial or corrupted upload. If you roll your own, you need to provide those same protections for the full stack or accept a higher level of risk and perhaps mitigate it in other ways (e.g. Git-style distributed full copies with integrity checks).
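A minimal sketch of that kind of scrub, in Python, under the assumption that you record a hash for every file at the time it is stored and keep that manifest somewhere safe. The function names and the in-memory dict manifest are illustrative only; a real setup would persist the manifest separately from the data and run the scan on a schedule against every copy.

```python
import hashlib
import os


def sha256_file(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Record the hash of every file under root at the time it is stored."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_file(full)
    return manifest


def scrub(root, manifest):
    """Re-read every file and return (path, problem) pairs for anything wrong."""
    problems = []
    for rel_path, expected in sorted(manifest.items()):
        full = os.path.join(root, rel_path)
        if not os.path.isfile(full):
            problems.append((rel_path, "missing"))
        elif sha256_file(full) != expected:
            problems.append((rel_path, "corrupt"))
    return problems
```

This catches the truncated-file case as well: RAID will happily keep two identical copies of a bad write, but the hash recorded from the original data will not match.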

Again, I'm not saying that it's impossible to pay less than S3 but your response is a bingo card for the corners people cut until something breaks and they learn the hard way why raw storage costs less than a supported storage service. Doing this for real adds support cost for the OS, your software, security, monitoring, backups & other DR planning, etc. If you use S3, Google, etc. you get all of that built into a price which is known in advance, which is a significant draw for anyone who wants to spend their mental capacity on other issues.

Many places don't have enough storage demand for that overhead to pay off in less than years and startups in particular should be extremely careful about spending their limited staff time on commodity functions rather than something which furthers their actual business. If you're Dropbox, sure, invest in a capable storage team because that's a core function but if your business is different it's time to look long and close at whether it makes any sense to devote staff time to saving a few grand a year.

Biggest question: why?

At that scale something like Ceph would be more reasonable. Just because ZFS can handle those filesystem sizes doesn't necessarily mean that it's the best tool for the job. There's a reason why all big players like Google, Amazon and Facebook go for the horizontal scaling approach.

Very impressive! It's amazing what people are doing with OTS technology.
