This is a major departure from the Supermicro business model and practices and basically broke all of our next generation expansion roadmaps.
This was not a technical decision - it was the same old economic decision that every large VAR/integrator/supplier has succumbed to over the last 30 years. They aren't the first ones to try this trick and they won't be the last.
We (rsync.net) are not playing ball, however. After 16 years of deploying solely on supermicro hardware (server chassis and JBODs) we bought our first non-supermicro JBOD last month.
That adds up to five-figure premiums on drives we're going to burn in anyway.
We know how to read an HCL - it's not rocket science.
However, instead of buying exclusively supermicro from them, we are now buying these Hitachi 60-bay JBODs:
It really bugs me that supermicro went this way. They used to be a very boring company that did nothing but build great chassis. I used to jokingly say that "supermicro is the rsync.net of hardware".
But now they're getting cute and that's never a good sign.
Hitachi is NOT HGST.
//Used to work for HGST alongside the EMEA 4U60 team, so perhaps a bit sensitive!
///Was often asked to help move the damn things on/off trolleys when the fully-loaded demo units came back for refresh before going out again - it was a 4-person lift with under-unit webbed straps - I'm sure my spine was compressed and I left the job a couple of cm shorter!
Maybe they have gone too large and now have to explore the same tactics from other big companies, like you mentioned. That seems the most plausible explanation.
Only populate one CPU socket. Zone allocation between two NUMA nodes is kind of hard, especially since Ubuntu 16.04's ZFS predates OpenZFS ABD, so memory fragmentation is a reality.
I would recommend better NICs like a Chelsio T5 or T6. Aside from better drivers and a responsive vendor, you can experiment with some of the iscsi offloads or zero copy TCP.
Supermicro seriously under-provisioned I/O on that chassis. I'd add LSI/Avago/now Broadcom cards so you can get native ports to every drive. Even if it's just a cold storage box, it will help with rebuild and scrub times and peace of mind. The cost of this is not bad compared to the frustration of SAS expander firmwares. 2x24 or 3x16 and 4 drives on the onboard if you can skip the backplane expander. Supermicro will usually do things like this if you insist, or an integrator like ixSystems can handle it.
More subjectively, I would also recommend FreeBSD. It seems their main justification for Ubuntu was paid support, which can be had from ixSystems, who sell and support an entire stack (Supermicro servers, FreeBSD or FreeNAS or TrueNAS) and grok ZFS and storage drivers to the tune that they have done quite a bit of development.
So in the end you have way more kernel "workarounds" for ZFS on Linux than the FreeBSD implementation. Especially the virtual memory implementation on Linux behaves very differently and has caused many issues over time, but I think it is very good now.
And for self-support, the freebsd-scsi and freebsd-net mailing lists will get you in touch right away with the people who understand the firmware, protocols, drivers, and systems engineering. You will probably get a response within the day from kernel- and operations-aware developers at Dell Isilon, SpectraLogic, or Netflix who commit the code you are using. You probably won't transcend customer support at other organizations in that timeframe.
I agree to some extent that ZFS on FreeBSD might have been more stable than ZoL, but in general? Meh.
That divided by 52 x 8 = 416 TB is $0.084/GB. For comparison, the Backblaze Storage Pod 6.0 claims $0.06/GB for the version with the same hard drives, although this version has a bunch of extra features like 2 x 800 GB SSDs for ZFS SLOG, 8x more RAM for a total of 256 GB, etc.
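The per-GB arithmetic is easy to sanity-check. Note the total system cost below is an assumption, back-computed from the quoted $0.084/GB figure; only the drive count and per-drive capacity come from the thread.

```shell
# Back-of-envelope $/GB check. The ~$34,944 total cost is an assumption,
# reverse-engineered from the quoted $0.084/GB figure.
raw_tb=$((52 * 8))                                   # 416 TB raw
awk -v cost=34944 -v tb="$raw_tb" \
    'BEGIN { printf "%d TB raw at $%.3f/GB\n", tb, cost / (tb * 1000) }'
```

Note this is raw capacity; usable capacity after RAIDZ parity and ZFS overhead would push the real $/GB higher for both systems.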
you can run ZFS on Ubuntu [...] You could also build this on Solaris with necessary licensing if you wanted to that route but it’d be more expensive.
I made this same decision at my last job. I ran Solaris and Illumos on our file servers and loved it, but a year before I left I ported all the pools to Ubuntu so my successor only needed to be Linux competent and ZFS trainable.
Sometimes when choosing tech it's not what's technologically superior solution nor what you personally could run well, it's what's best for your coworkers, your successor and the organization in the long term.
That's the Nth time I've read this quote on HN; it's become a classic... You can't find a FreeBSD sysadmin, but you can find a Linux admin.
Where I work I have to deal with AIX, Solaris, Open/FreeBSD (had Net before), Linux (all major flavours) and (God forbid) Windows Server (2008, 12R2, 2016 and Nano). I've built packages for most of these systems. I don't know all of them inside out, but I have NEVER had a problem implementing/setting up/testing features in any of them.
Can you tell me in what way an intermediate (say, 5 years of experience) Linux sysadmin would have problems managing Free/Open/NetBSD?
I'd be worried about letting such a person admin linux servers, but I guess you can limp by as long as you keep your infra to what they already know. Ideally you'd weed such people out before hiring, but maybe if you need to hire an admin you don't know enough to do that?
ps. The best documentation I've found, was the manpage.
Then you run into the problem that BSD admins are even rarer, and AIX/Solaris even rarer than that.
Someone with experience with Linux could likely find their way around any UNIX-like system, but given the choice we would rather deal with some awkwardness in a familiar environment where admins have far more experience and we can leverage our existing infrastructure, scripts, and config management.
FreeBSD is just as easy to administer as Linux. Solaris I find to be a bit more challenging. I can appreciate your commentary, but I challenge you to prove me wrong. Just for some color: I've run the gamut of Unix and Linux systems in my career, with FreeBSD being one of the easiest and most consistent out-of-the-box technologies. Solaris has been quite the opposite. I guess you could say AIX is a close second to Solaris, but I think that's a bit of a grey area considering how opinionated that operating system is.
To be clear I'm not a FreeBSD zealot. I'm for what works. Right now what works is the emerging container based solutions albeit not relevant to this discussion.
Was parent wrong about that point?
"Anyone can learn anything" doesn't help me if I need an expert now. And it doesn't magically jump the gap between "functional" (I can make a thing work in an ugly and naive way) and "good" (I can weigh the trade-offs behind the scenes and choose the optimal from multiple alternatives).
Unless the assertion is that FreeBSD / Go is easy, logical, and/or obvious enough that a master Linux / C++ programmer will be productive and community standard-compliant without any effective lag time.
And I'm not trying to be obtuse. I honestly see it a lot and think it's a blind spot: reverse Pareto principle if you will. "Getting up to 80% proficiency is easy, so let's ignore the last 20% because it must also be easy."
A great example in my opinion: Red Hat RHEL7 introduced systemd. A lot changed versus RHEL6. RHEL6 "experts" turned into clumsy RHEL7 "80%-ers". We figured it out.
Not to even mention that SuSE, RHEL, and Ubuntu are about as similar as "Linux" and FreeBSD, if you are worried about the finer points of best practice. We figure it out.
And those advantages don't disappear even if Y is easy to learn.
Besides, probably the best way to find out if it's a "big deal" is to ask your sysadmins. Or, generally, the people who are going to be stuck running it.
Can you do it? Sure. But it isn't pleasant. The Linux community is off in the weeds imo. Doing their own poor re-implementations of tech others have already done. See: Dtrace, Filesystems, Jails/zones, Networking, VM, Init systems, ...
So I think trying to argue it's easier to install Linux because Linux folks can't transfer their knowledge to other OS's speaks volumes to it being a poor choice to invest skills in if you can't transfer them to other OS's
Sure, you have DTrace (which only macOS, illumos and FreeBSD have) and ZFS (which only illumos and FreeBSD have) but the rest is similarly incompatible. Solaris/illumos even has a complete NIH-reimplementation of FreeBSD's kqueue (event ports). They have different views on containerisation (Zones/Jails). They've historically had very different opinions on /proc and ioctls, not to mention that they were developed separately for such a long time that their shared history is not very recent.
As a result, porting from Solaris to FreeBSD is also difficult. Maybe it's harder or easier than porting to GNU/Linux, but I wouldn't just flat-out claim that GNU/Linux is the only member doing things that are incompatible.
And there are not different views on zones and jails. They are the same thing. Sun just took the idea of jails and fleshed it out further, adding a separate network stack for each zone, which jails now also have. But they operate on the same principles and ideas. Jails were bare-bones at inception: they shared their IP stack with the host, this being before cloud computing and the need for separate network stacks. Both started with being secure and then added features, whereas the Linux container mess started with features and continues to try to address the fact that it is insecure by design.
So jail and zones are similar.
Porting from FreeBSD to Solaris and vice versa is easier than you think. If it were insanely hard, FreeBSD wouldn't have ZFS or DTrace from Sun. DTrace was almost single-handedly ported by one engineer. An amazing engineer, but he was the only one. Same with ZFS. And illumos ported the FreeBSD installer, also done by one individual. All very good engineers, but still just one each.
* You were discussing porting a Linux application, not a kernel feature like DTrace, ZFS, Zones, kqueue, etc. Obviously porting a kernel feature between two OSes that share a kernel history is going to be easier than porting to an entirely different kernel. It's almost tautological. Porting an application has more to do with whether the syscall/libc interfaces are compatible and if you take kqueue/eventports as an example you still need to do standard porting work. glibc provides BSD-like interfaces so it's not like you have to switch away from bzero or whatever -- that's not the hard part of porting code.
* You specifically stated that Linux is "off in the weeds", "doing their own poor re-implementations of tech others have already done". Ignoring how disrespectful that is, your response to me saying "the whole Unix family re-implements each others ideas all the time -- DTrace and ZFS are the exception and only three members of the family use them" isn't helping your original point.
Also this whole section is just a non-sequitur:
> And there are not different views on zones and jails. They are the same thing. [Long description of how they are different and were developed separately.] [Random aside about Linux containers and how they're a mess.] So jail and zones are similar.
I am aware of the similarities and differences between Jails and Zones, and I'm also very painfully familiar with Linux containers. Not sure why you're bringing them up in a discussion about porting applications between different Unix-like operating systems. Sounds like you just have an axe to grind.
It's far closer to Windows IOCP than FreeBSD kqueues IMHO.
It's also been one of the most successful features added to Solaris and is used throughout the system.
I believe it was. See comment above. FreeBSD has excellent documentation on nearly every topic. Nearly every program has a manpage and there's always google to help you.
For advanced topics (CARP, DTrace, ZFS, Jails, Accounting API) the docs are excellent and you'll have to do some reading to properly implement any of those anyway.
That's part of the job actually, reading/learning.
Something about FreeNAS not offering support, and the pre-built systems from iXsystems having lower drive-bay counts.
My guess is they just prefer Linux. FreeBSD or illumos would definitely work.
This makes bugs appear, and get addressed, and eventually gives you confidence that nothing nasty remains uncorrected, and it won't eat your data.
ZFS on Linux, due to the unfortunate licensing situation, is considerably less tested and thus scary, data-eating bugs are at least a bit more likely to exist.
Yes and no. Few people, even among the Linux crowd, are aware it exists or feel like trusting it with important data sets/jobs. Those few are the most likely to be equally at ease running FreeBSD or Illumos (or Solaris), where the damn thing is known to work really well.
In practice this means that ZoL may require more memory to be stable, or, depending on configuration, may be less stable when memory is low.
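A common mitigation (a sketch, not something from the thread) is capping the ARC so ZoL leaves headroom for the kernel and applications. The 32 GiB value here is purely illustrative; the right cap depends on the workload and total RAM.

```shell
# Illustrative only: limit the ZFS ARC via the zfs_arc_max module parameter.
# Persist the setting so it applies at module load:
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_arc_max=34359738368
EOF

# On a live system the same tunable can be changed at runtime:
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max   # 32 GiB
```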
I've run ZFS on Solaris (by way of Nexenta) and Linux since 2008 or so. I haven't seen much reliability difference in practice. I've had fewer hiccups streaming video from Linux though.
but mentioning Solaris as "requiring licensing" and not illumos?? illumos is the new Solaris; the old Solaris devs are working on illumos! Oracle Solaris is irrelevant.
Amen to that.
It's highly interesting that Canonical does this with ZFS. I'm not sure why they don't market this more.
I'd be very surprised if RHEL (judging by the progression of Fedora development) continues to bet on btrfs. I have yet to encounter anyone (including myself) who would trust btrfs over ZFS on Linux with anything of importance, based on their experiences with the two. My experience is anecdotal, but ZFS has been just as reliable on *BSD as on any Linux distribution.
What you're going to find in storage is that there are multiple valid approaches, no clear one size fits all winner, and people will choose based on the tools they're familiar with, the company they trust, and their use case. I pick Btrfs over ZFS pretty much based on equivalent trust and better flexibility for my use case, but then I'm much more familiar with Btrfs tools and where the bodies are buried than I am with ZFS. I don't need to go around impugning other projects to justify what I use.
This is going to take a lot of work, and not just for the Stratis developers but for projects that need to manipulate it. It's asking a lot of bootloader projects to support it.
So yeah, trying to do better than those guys is going to be interesting. I'd like to know who is on the RedHat team working on this new file system, just the fact that they are trying is interesting.
Edit: and I was at SGI when XFS was pretty new, I know some of the XFS folks as well, Adam Sweeney and Mike Nishimoto.
I was the guy that plugged XFS into NFS over HIPPI, so I have more than a passing knowledge of it:
XFS was pretty cool but a lot of the technology that made it fast was XLV, the logical volume manager. XFS just made sure it handed very large, aligned, I/O requests to the volume manager, the volume manager was the layer that split them up and got all the DMA engines going. That's how we did 500MB/sec in the early 1990's on 200mhz MIPS chips, the MIPS chips weren't touching the data, the DMA engines were (the networking stack did page flipping to avoid bcopies).
I know XFS did other stuff for scaling but it most certainly did not do all the safety stuff (and I don't think it did transparent compression, those 200mhz MIPS cpus weren't fast enough to put that in there) that ZFS does.
So I'm wondering how much XFS has evolved from the SGI days. If it hasn't, I don't get why RedHat started there. Be really interested to know the back story.
$ git log --since="2016-01-01" --pretty=format:"%an" --no-merges -- fs/xfs | sort -u
Linux file systems, where did they come from? (Presented by Dave Chinner who has been XFS maintainer for a few years.)
But the problem remains, CDDL and GPL are likely incompatible, and Linus has said that no CDDL licensed code will be merged, most likely per advice from lawyers.
So unless something drastically changes, ZFS is off the table, and thus work will continue on with alternatives, Stratis is one such alternative, bcachefs is another, and of course btrfs will not die just because Red Hat isn't supporting it anymore, as they barely did to begin with.
The GPL on the other hand is a strong copy left. If you link against GPL code, your code must also be licensed as GPL.
This means the Linux copyright owners could sue the distributors of ZoL binaries, but Oracle could not.
Oracle has the power to allow their ZFS code to be relicensed as GPL, removing this road block, but they have no incentive to do so.
...redistributing a binary work incorporating CDDLv1'd and GPLv2'd copyrighted portions constitutes copyright infringement in both directions... [https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/]
so it seems that Oracle could in fact sue.
Says the SFC. But Oracle has had plenty of time and they have not sued. In fact, they have not criticized Canonical for integrating ZFS.
They might just wait until there is more money to be gained from a lawsuit or until there's more people already on ZFS who don't want to have anything less.
But I don't think courts will take too kindly to someone who knowingly sat on their rights and waited for it to become big.
I would assume that Canonical has already sent information about this to Oracle. If they haven't done anything now, they can't do anything later.
in fact - https://insights.ubuntu.com/2016/02/18/zfs-licensing-and-lin...
We at Canonical have conducted a legal review, including discussion with the industry’s leading software freedom legal counsel, of the licenses that apply to the Linux kernel and to ZFS.
And in doing so, we have concluded that we are acting within the rights granted and in compliance with their terms of both of those licenses
So to distribute CDDL-licensed code this way is a breach of BOTH licenses as I see it, ignoring the "patent peace" requirement of the CDDL, which would also be breached if it were distributed as GPL; none of this sounds legal to me (IANAL).
I can certainly understand why Linus has stated that no CDDL code will be mainlined (brought into the Linux kernel tree), it all seems very ambiguous.
If they use CDDL, then I can't see how Oracle would have a case.
There are many decisions involved in deciding what to provide enterprise support for, and whether you like it or not, technical merit is only one of many factors. So Red Hat's decision was likely not entirely based on technical merit.
SUSE still provides enterprise support for btrfs (like we've always done), and there's still plenty of work being done from various large contributors.
[I work at SUSE.]
Red Hat never provided enterprise support for btrfs, it was a technical preview that didn't pan out to become fully supported as part of their distribution. It's barely a story (there are plenty of other filesystems Red Hat doesn't support), but it's a good opportunity to spread misinformation.
Then you disagree (literally) with Red Hat's official statement as a matter of fact (not opinion). From Chapter 53, RHEL 7.4 release notes: "Btrfs has been deprecated ... Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux."
$ git log --since="2016-01-01" --pretty=format:"%an" --no-merges -- fs/btrfs | sort -u
That is not a "theory" that anyone who understands the basics of the licenses adheres to. The possible threat is that Linux's GPL could attack ZFS, not the other way round. There is nothing in the CDDL in the way of using ZFS in Linux.
Huh? Is Canonical shipping ZFS in their kernel now? I thought they just distributed it as a separate module.
If this ever goes to court it will be an interesting case.
There's legal disagreement, and saying "Ubuntu did it" doesn't make those problems disappear. (Sadly)
I -have- seen a lot of argument online, opinions and otherwise that disagree with this, but not anything from a lawyer - mainly "philosophical" or opining on the "intent" of the GPL, rather than "based on clauses x, y, and z, this is impermissible".
Which is not to say, if something of an opinion exists in response to Canonical's release of ZFS, I would love to see it - a legal opinion, not armchair lawyering or opinionizing.
Hmm... not sure what you meant by this statement?
I.e. their opinion is not the only opinion on the matter.
This is a major data loss issue with some workflows and it's pretty significant that Canonical hasn't backported it.
Currently each server has 63 drives (4TB HGST NL-SAS) with 1 hot spare, configured as RAIDZ.
Right now there is 200TB of usable storage, we initially started with 29TB and have been expanding as needed when it hits about 79%, I buy 18 drives roughly every 6-8 months, 9 drives per server and expand the pool.
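Expanding the pool nine drives at a time would look roughly like this. The pool name and device names below are placeholders for illustration, not taken from the post; the RAIDZ level matches the "configured as RAIDZ" description above.

```shell
# Hypothetical sketch: grow an existing pool by adding a new 9-drive
# raidz vdev. "tank" and the /dev names are made-up placeholders.
zpool add tank raidz sdq sdr sds sdt sdu sdv sdw sdx sdy

# Confirm the new vdev appears and is ONLINE:
zpool status tank
```

Note ZFS stripes new writes across the new vdev but does not rebalance existing data, which is one reason incremental expansion like this works well for append-heavy storage.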
To say that we never had issues would be lying; we did have some major issues when upgrading between versions, but that was early on. Now it is a rock-solid storage system.
Although there are fewer than 300 active users connecting to the primary server, there is a lot of very important pre- & post-production high-def video.
Reboot with 63 drives is around 10 minutes or less.
Resilvering could take 24-48 hours, depending on load, depending on how much data the failed drive contained.
Performance has been great, reliability has been great, support has been great.
Sadly IX Systems can no longer provide support after the end of this year, they've extended support beyond the expected lifetime of the hardware.
It's like a blog written by a 22 year old straight out of college that's never dealt with a real production deployment/failure
ZFS on Linux has data loss bugs. There's at least one unpatched, and there are bound to be more.
Single huge servers eventually fail. Maybe it'll be a drive controller. Maybe it'll be CPU or ram with bit flips as a side effect. Downtime would be the least painful part of the eventual failure.
Please don't spread untruths. Somebody who doesn't know better might actually believe you.
Unpatched on 16.04, referenced as supported in the article
There's another one, somewhat related:
which (so far) seems tied to using recordsize > 128k without either of the -L or -c flags on the zfs send side, with the result that the sendstream is corrupted in such a way that the receiver cannot detect the corruption. As with the filled-holes problem, the problem is real but rare. Unlike the filled-holes problem, it is unlikely to affect many people since it is (probably) very rare that anyone uses large records and does not use -L (or -c, or both), although there are certainly automatic snapshot-send systems (e.g. znapzend) that use a common minimal set of options to zfs send.
This is especially unfortunate because of the rarity of the corruptions, the apparent rarity of people using POSIX-layer checksumming (e.g. rsync -c, or sha256deep or the like) on large datasets (with large files that had holes made and refilled, for example) to validate that a received dataset really is the same as the original, and the apparent rarity of people doing this sort of validation specifically targetting backwards compatibility mechanisms (e.g. zfs recv into a version 28 or earlier pool from a source dataset that uses all the most recent bells and whistles).
Finally, it is extra-especially unfortunate because recovering from this sort of corruption is awkward and time-consuming; at the minimum the source and destination have to be entirely read at least once or alternatively the destination needs to be destroyed and sent again from scratch once the fix or workaround for send|recv corruptions is known.
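The mitigation described above boils down to passing the large-block and/or compressed flags on the send side so that large records are not re-split into a stream the receiver can't validate. Dataset, snapshot, and host names below are hypothetical.

```shell
# Sketch of the workaround for datasets using recordsize > 128k:
# send with -L (large blocks) and/or -c (compressed records).
# "tank/data@snap", "backuphost", and "backup/data" are placeholder names.
zfs send -L -c tank/data@snap | ssh backuphost zfs receive -F backup/data
```

Automatic snapshot-send tools should be checked for whether they expose these flags, since (per the comment above) a tool hardcoding a minimal option set is exactly the case most at risk.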
If, instead, they had sharded and replicated it across 40 x 20 TB x (replication factor) systems, they'd pay a lot more in power, but they'd be able to tolerate a single FS bug unless it somehow hit all of the replicas.
Not to mention that the issue description doesn't say it's a data loss bug. It's a bug that means that ZFS send would not include holes in very specific scenarios. On-disk data would still be safe as far as I can see (though I'm not an FS expert by any stretch).
Perhaps he has split up ZFS into a number of different pools and they can be mounted in parallel (depends on the init script and whether ZFS can do this). But I do recall that larger ZFS pools can take a bit of time to mount; maybe the updated ZFS for Linux is faster....
What? How long would it take, roughly? Genuinely curious.
I'm currently unsure how to make Linux see the ZFS pool though. I.... don't really like NFS. It's too glitchy in my experience. I use it to listen to music stored on a different machine from my laptop, which uses a long-range USB Wi-Fi adapter. If I unplug the adapter without cleanly unmounting /nfs, I get a kworker in an infinite loop. I googled around one afternoon and discovered that RHEL apparently found and fixed this in kernel 3.x. Interesting - I'm on Slackware, with kernel 4.1.x. >.<
I wish you could do cross-VM virtio. That would be awesome. Then I could export the device node corresponding to the whole pool from FreeBSD and just mount it as a gigantic ext4 filesystem on Linux. (Can you do that?!)
The idea feels "janky" though. Lots of overhead compared to just running ZFS on Linux.
I'm building a Ceph cluster right now, about the same size as the one in the article, except with 9x36 drive chassis.
Ceph seems nice, but appears to be very CPU-hungry and not very fast.
Seems easier to have sharded vanilla linux file servers. It requires a decent asset manager, but once you have that, backups and rebalancing become trivial.
If it was actually determined to be the fault of ZFS, what was the problem? I'd like to avoid it in my own deployment.
FWIW, I use FreeNAS in a similar small but diverse setting with nary a problem, but when I set it up for a 30+ user organization, issues arose that I hadn't expected (not data loss, but usability and performance issues).
Also no mention of FreeBSD. Instead he picked the less battle tested ZoL and only mentioned Solaris and licensing, so the author is not even aware of Illumos.
I agree that this smells a bit funny, but the author seems to dance around the important topics like vpools, raid levels, and use case. Instead they just focus on the hardware, which is arguably the least interesting bit (until you know the former).
Backblaze storage pod, they're up to v 6.0 now
(Netflix open connect specs supermicro hardware)
What the actual fuck?? AWS S3 is an abominable rip-off. After I moved to my own rented dedicated server, I am paying several times less.
It's possible to beat S3 pricing but you either need to be buying a lot of storage or cutting corners to do it. The most common mistake I've seen when people make those comparisons is excluding staff time, followed by presenting a system with no or manual bit-rot protection as equivalent.
That server has perfectly reliable power and environmental setup so you never have prolonged downtime or a double disk failure?
You're okay losing everything if someone makes a mistake running that server since backups cost too much?
You have higher-level software which tells you when data on that RAID array is corrupted? Your free sysadmin periodically runs an audit to make sure that the data stored on disk is what you originally stored?

That's what I was referring to with scrubbing: even with RAID, corruption happens, and most storage admins have stories about the time they found out it'd happened after the only good disk had failed, been written to tape, etc. The best solution is to actively scan every copy and verify it against the stored hashes for what you originally stored, which also protects against cases where a bug or human error meant that e.g. your RAID array faithfully protected a truncated file because the original write failed and nobody noticed in time.

S3 provides a strong guarantee that you will get back the original data you stored, or an error, but never a corrupt copy, and that you can prevent storing a partial or corrupted upload. If you roll your own, you need to provide those same protections for the full stack, or accept a higher level of risk and perhaps mitigate it in other ways (e.g. Git-style distributed full copies with integrity checks).
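A minimal userspace version of that end-to-end check can be built with coreutils alone: record checksums while the data is known-good, then re-verify on a schedule. Paths here are examples.

```shell
# Record hashes at write time, when the data is known to be good.
mkdir -p demo && printf 'important payload\n' > demo/asset.bin
sha256sum demo/asset.bin > manifest.sha256

# Later, the periodic "scrub": re-read everything and compare.
if sha256sum --check --quiet manifest.sha256; then
    echo "scrub clean"
else
    echo "CORRUPTION DETECTED" >&2
fi
```

This catches bit-rot and truncated writes that RAID alone would faithfully preserve, which is the gap being described above; filesystems like ZFS do the equivalent internally per block.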
Again, I'm not saying that it's impossible to pay less than S3 but your response is a bingo card for the corners people cut until something breaks and they learn the hard way why raw storage costs less than a supported storage service. Doing this for real adds support cost for the OS, your software, security, monitoring, backups & other DR planning, etc. If you use S3, Google, etc. you get all of that built into a price which is known in advance, which is a significant draw for anyone who wants to spend their mental capacity on other issues.
Many places don't have enough storage demand for that overhead to pay off in anything less than years, and startups in particular should be extremely careful about spending their limited staff time on commodity functions rather than something which furthers their actual business. If you're Dropbox, sure, invest in a capable storage team because that's a core function, but if your business is different it's time to look long and close at whether it makes any sense to devote staff time to saving a few grand a year.
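The staff-time point is easy to see with made-up numbers. Every figure below is an assumption for illustration (S3 rate, hardware cost, salary, time fraction), not a real quote; the shape of the result is the point, not the exact dollars.

```shell
# Illustrative only: 100 TB in S3 Standard vs DIY, all inputs assumed.
awk 'BEGIN {
  tb    = 100
  s3    = tb * 1000 * 0.023      # assume ~$0.023/GB-month S3 Standard
  hw    = 60000 / 36             # assume $60k of hardware over 3 years
  admin = 0.25 * 120000 / 12     # assume 25% of a $120k/yr sysadmin
  printf "S3: $%.0f/mo  DIY: $%.0f/mo\n", s3, hw + admin
}'
```

Shift the assumptions (more terabytes, a smaller admin fraction) and DIY wins; that crossover point is exactly the "enough storage demand" threshold described above.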
At that scale something like Ceph would be more reasonable. Just because ZFS can handle those filesystem sizes doesn't necessarily mean that it's the best tool for the job. There's a reason why all big players like Google, Amazon and Facebook go for the horizontal scaling approach.