1. Author dislikes ZFS because you can't grow and shrink a pool however you want.
2. Author likes BTRFS because it implements more flexible grow/shrink, but finds that in other ways it's unstable, and is unhappy that BTRFS doesn't let you have arbitrary numbers of parity drives.
> 1. Author dislikes ZFS because you can't grow and shrink a pool however you want.
I believe this was on the future timeline for ZFS. It required something like the ability to rewrite metadata in place.
The problem is that nobody really cares about this outside of a very few individual users. Anybody enterprise just buys more disks or systems. Anybody actually living in the cloud has to deal with entire systems/disks/etc. falling over so ZFS isn't sufficiently distributed/fault tolerant.
So, you have to be an individual user, using ZFS, in a multiple-drive configuration to care. That's a really narrow subset of people, and the developers probably give that feature the time they think it deserves (i.e. none).
ZFS device removal is on its way but not quite in illumos yet. From Alex's blog post: "We’ll publish the code when we ship our next release, probably in March [2015], [and we will] integrate into Illumos once we address all the future work issues."
Isn't it funny then that BTRFS supports growing and shrinking? Sounds like it's not such an edge case at all. BTRFS is also focused on the enterprise, yet the feature is implemented there.
People have been working around the limitations of lvm and Linux raid for years... it's not such a big deal. The issue with BTRFS is the stability and consistent performance, especially when the disk gets close to being full.
Would definitely prefer BTRFS to be stable over having additional features...
If you include that "using ZFS" clause, it is a small subset and "a few individual users", but that may be because people who care about this don't choose ZFS because of it.
If you remove that clause, I would think there are way more people in that situation. That still wouldn't imply that those who write ZFS have to care, though. There's nothing wrong with providing a great solution for a selected subset that isn't suitable for the majority. In ZFS's case, that certainly applies. For its target audience, the ability to shrink pools is nice to have, and nowhere near minimum viable product.
That argument applies to btrfs, too, by the way. There are way more users for whom something like Google Drive is a better option than running their own btrfs system. That doesn't mean there is no reason to develop btrfs.
The author wants zfs to support growing and shrinking pools in a specific way that it doesn't support.
However it's clear from the docs that you can grow zfs pools by adding new vdevs. Shrinking a pool is not possible.
In practice I haven't seen the inability to shrink pools as an issue. I simply do not create pools from entire disks and use partitions instead. For example each drive is split into 5 partitions. And if I need to grow a pool I add another partition across all my disks to the pool as a new vdev.
This also allows me to have multiple pools on each system with different RAID-Z levels based on different application requirements.
It is clear from the article that I acknowledge this fact but it is a more convoluted, less flexible solution that wastes space. BTRFS shows how it can and should be done.
Shrinking the filesystem is something I have never needed to do since becoming interested in computers in 1992. The only times I have really read about it is when people want to make room for a second OS.
I've seen people shrink filesystems to clone whiteboxes, and I can see some use cases in the embedded world, though these probably aren't the places you want to use zfs or btrfs.
Correct TL;DR. But with BTRFS it's not about 6 parity drives. It's about not supporting triple-parity or even RAID60, features ZFS has supported for eons.
One thing to note here is that BTRFS use (or lack of use) is heavily based on the way RAID1 is implemented. A BTRFS RAID1 doesn't do dual-disk striping; it writes the file twice, on different drives. You can do a RAID1 with 3 drives, or 4, or 5, giving the same redundancy as RAID5 under a different name. If you want to double the parity, I believe you would put dual RAID1s into another RAID1.
I think the functionality of BTRFS is there, though the ideas about how we build redundant data sets will need to shift a bit.
Btrfs raid1 currently means "2 copies", even if you have 3+ devices. Btrfs is also capable of using different-sized drives efficiently. Try creating a single-volume raid5 with 3x 1TB and 3x 2TB. This is ~7TB usable (not including some space for fs metadata, but it does account for parity). Rebalance is not required when adding those 3x 2TB drives; they'll fairly immediately be put to use (the next time data chunks are allocated). And there are also no initial syncs when creating or adding drives.
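To see where the ~7TB comes from, here is a rough Python sketch of that allocation behavior. It is a simplification (fixed chunk size, stripes always span every device that still has free space, metadata ignored, minimum 3 devices per stripe assumed), not the actual btrfs allocator:

    # Simplified model: raid5 stripes span every device that still has free
    # space, and each device contributes one fixed-size chunk per stripe.
    def raid5_usable_gb(drive_sizes_gb, chunk_gb=1):
        free = list(drive_sizes_gb)
        usable = 0
        while True:
            avail = [i for i, f in enumerate(free) if f >= chunk_gb]
            if len(avail) < 3:          # assumed minimum stripe width
                break
            for i in avail:
                free[i] -= chunk_gb
            usable += chunk_gb * (len(avail) - 1)  # one chunk per stripe is parity
        return usable

    print(raid5_usable_gb([1000, 1000, 1000, 2000, 2000, 2000]))  # 7000, i.e. ~7TB

While all six drives have room, five sixths of each stripe is data; once the 1TB drives are full, the remaining space on the 2TB drives is striped three wide, giving two thirds data, which adds up to the ~7TB figure.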
I'm sorry but I don't think I follow you. Mirroring can never achieve the space / redundancy efficiency of parity raid, I don't see how the above works.
It looks like he's trying to say: in an array of N >= 2 drives, "BTRFS RAID 1" will still only choose 2 disks to replicate a file to. It's just 2 different ones for every file.
Well, not every file, but you see the point.
If that's the case, this has the advantage of having a subset of files recoverable even after >1 disk failure. Unlike RAID 5.
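To put rough numbers on that, here is a toy model (an assumption for illustration only; real btrfs prefers the devices with the most free space rather than picking uniformly at random): each chunk gets 2 copies on 2 of the N drives, and exactly 2 drives die.

    from math import comb

    # Each chunk lives on 2 of N drives; it is lost only if both of those
    # drives are among the 2 that failed.
    def expected_fraction_lost(n_drives):
        return 1 / comb(n_drives, 2)

    for n in (2, 4, 6, 10):
        print(n, "drives:", f"{expected_fraction_lost(n):.1%}", "of chunks lost")
    # 2 -> 100.0%, 4 -> 16.7%, 6 -> 6.7%, 10 -> 2.2%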
On the flip side: read speed is limited to at most 2 concurrent disks, unlike RAID 5, where it's N-1. But that's theoretical; what's the reality of read speeds in RAID 5?
Anyway, if that's true, that would be a really weird name. Why didn't they call that "BTRFS RAID 5"? A look at the docs doesn't make this clear to me at all. I'm not sure what to believe...
I was not even aware of this possibility, thanks for explaining it to me.
The concept fills me with absolute horror though. Losing part of my files is as bad as losing everything.
The performance benefit of RAID1 vs RAID5/6 with larger files can be very 'substantial'. Even my old 18 TB NAS could achieve 1+ GB/s on sequential reads.
RAID5/6+ is only as fast as your slowest disk, and seek times are no better. In most respects, mirroring gets you better performance, and for most drive configurations, is more reliable than RAID5. Massive throughput on single files one at a time is not interesting to most people with big storage requirements.
I currently run RAID6 (raidz2) across 10 drives. But I'll be moving to RAID10 before long. Whether using zfs or not, depends on btrfs stability.
Regarding ZFS: yes, RAID5/6 is only as fast as the slowest disk if you're talking about random IOPS. BTRFS performance I haven't measured yet.
RAID5/6 random IOPS can be quite good, or at least as good as you'd expect, if you use MDADM.
The best solution seems to be tons of RAM for caching, and ZFS with an SSD SLOG if you need reliable and fast sync writes. But with ZFS, performance comes second to reliability.
RAID isn't backup. Losing some of your files instead of all your files means you have access to some of the files during the restore window. And your restore might be quicker?
But how do you even fire up an Oracle database with some of the dbf files missing? And I can't think of any other application (other than maybe a document storage system) that would be useful with some of the files missing.
As you said yourself, volume sizes aren't exactly the problem anymore, i.e. there is tons of space to store bits these days.
Raid rebuilds take way too long for modern volume sizes.
Raid rebuilds take so long that the likelihood of losing a second drive before completion is very high.
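Some back-of-the-envelope numbers (assumed figures, not measurements: ~150 MB/s sustained rebuild speed, and the commonly quoted one-error-per-1e14-bits unrecoverable read error rate for consumer drives):

    import math

    def rebuild_hours(drive_tb, mb_per_s=150):
        return drive_tb * 1e12 / (mb_per_s * 1e6) / 3600

    def p_ure_during_rebuild(read_tb, ber=1e-14):
        bits = read_tb * 1e12 * 8
        return -math.expm1(bits * math.log1p(-ber))  # P(at least one URE)

    print(f"{rebuild_hours(8):.0f} h to resilver one 8 TB drive")        # ~15 h
    # rebuilding an 8-drive RAID5 of 8 TB disks means re-reading 7 x 8 TB:
    print(f"{p_ure_during_rebuild(7 * 8):.0%} chance of hitting a URE")  # ~99%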
The safety you generally want is redundant copies of checksummed data and metadata blocks.
Btrfs allows you to choose the number of copies of data or metadata separately.
Talking about RAID in the context of btrfs and zfs generally seems to confuse people as it brings in the old expectations and understandings they worked so hard at figuring out for the RAID era.
Rebuilds on ZFS and BTRFS are quite reasonable because most of the time only the data itself is rebuilt, not the entire drive as with old-fashioned solutions.
Rebuilds depend on drive size, not array size.
Triple-parity as part of ZFS allows you to create even larger vdevs while keeping the risks manageable. Interesting for low-performance archiving solutions.
When it comes to file systems, I like boring. What I don't like is systems that have catastrophic unrecoverable wrecks because hypothetically a bit might get flipped at the hardware level. Or that just stop moving entirely when you run out of disk space, or whatever.
High-end storage systems support multiple hard drive controllers because expensive hard drive controllers burn up like matches. Funny, I've never had a cheap hard drive controller fail, but I never throw out the box an expensive hard drive controller came in, because the R.M.A. is just a matter of time.
> What I don't like is systems that have catastrophic unrecoverable wrecks because hypothetically a bit might get flipped at the hardware level.
Perhaps they're hypothetical for you; that's nice. Having seen files (not filesystems, just files) wrecked by things like "power supplies that are just faulty enough to trash data but not faulty enough to stop booting the system", not to mention dodgy drives and controllers, I'm quite happy to have checksumming filesystems available.
(1) From the perception of most people, mainstream filesystems such as ext4 and NTFS are pretty reliable. I've certainly had mainstream filesystems get damaged but I've been able to repair them or copy data off without a lot of trouble.
One reason mainstream filesystems are reliable is that they privilege reliability over performance.
(2) In my experience, new file systems are dangerous. I've pretty frequently experienced data corruption within a few days of trying a new file system. I've frequently been on projects that tried the latest thing, like the Linux filesystem that was written by a murderer, and after experiencing problems we've gone back to mainstream file systems.
(3) New file system advocates believe filesystems and disks are unreliable, so they're willing to tolerate a higher level of failures.
(4) Most terrifying, look at all the discussion on this thread about options you can choose that might mitigate this problem or that problem. Every configuration choice is a decision you can make wrong, is a reason why your system can wreck in the middle of the night. If you know your stuff or hire somebody who knows his stuff, maybe you'll get good choices, otherwise you are playing Russian Roulette with Vladimir Putin.
That's a very good article on that topic indeed. But in some way it's sad that you even have to think about this with ZFS. It seems that this is not an issue, or much less of one, with BTRFS (or the magic unicorn file system that is still not there).
As the author of that article, thank you. And, I agree it would be sad if you had to think about exactly how many drives are in a RAID-Z group. The entire point of my blog post is that you do not have to think about it.
"TL;DR: Choose a RAID-Z stripe width based on your IOPS needs and the amount of space you are willing to devote to parity information. Trying to optimize your RAID-Z stripe width based on exact numbers is irrelevant in nearly all cases."
- start with 2 disks (not a mirror, but 2 independent vdevs)
- set copies=2 to enable ditto blocks, causing data blocks to be automatically stored on 2 different disks (ZFS always tries to store ditto blocks on different vdevs when at least 2 are available)
- when adding a 3rd disk, each data block will continue to be stored on 2 different disks, and you have the option to add a 4th disk, 5th disk, etc, any time
The overhead is 50% (like RAID1/mirroring), so I presume this can be a downside for the hobbyist who usually cares about dollar per TB. But nonetheless this is an option.
I can see why hobbyists may want to expand a pool one disk at a time, but personally I have been running ZFS for 10 years, I have a 20TB fileserver at home (I grew it from 1TB over the years), and I have never needed to add just one disk at a time. Usually when I run out of space, I replace all the disks at once with larger ones (and/or replace the server if it is more than 4-6 years old).
Another point I wanted to comment on: the author makes the typical mistake of assuming ZFS on a single disk is not very interesting ("A VDEV is either a single disk (not so interesting)") but it is. To name a few features why it is still great to run on a single-disk system: end-to-end checksumming, self-healing, scrubbing, snapshots, clones, compression, ditto blocks, deduplication, CoW, zfs send/recv, simple CLI tools, etc.
For the love of god don't spread that nonsense. Ditto blocks do NOT protect against a drive failure. If you lose a drive you will lose all of your data. It is intended as a belt-and-suspenders feature and was put in place knowing that, as we get larger and larger drives with higher and higher error rates, RAID alone will likely not guarantee your data is protected.
Btrfs raid7 (3 parities) could be useful. But more parities really doesn't scale very well. Raid10 does scale. So does GlusterFS.
A bigger scalability problem with Btrfs is its raid10 implementation. Conventional raid1+0 says you can lose more than one device at a time so long as you don't lose both disks of the same mirror pair. Btrfs raid10 doesn't have consistent mirror pairings, so the mirrored chunks for device A1 aren't all on B1; they will be distributed across multiple other drives, thereby increasing the chance that some chunks are lost in a 2+ device failure.
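A toy comparison of the two layouts under a simultaneous 2-drive failure (illustrative assumptions only: 10 drives, 10,000 chunks, btrfs placement modeled as uniformly random):

    from math import comb

    def p_loss_fixed_pairs(n):
        # conventional raid1+0: data is lost only if the two failed drives
        # happen to be the same mirror pair
        return (n // 2) / comb(n, 2)

    def p_loss_distributed(n, n_chunks=10_000):
        # btrfs-style raid10: each chunk mirrored on 2 drives chosen at random,
        # so with many chunks some chunk almost surely had both copies on the
        # two failed drives
        p_chunk_safe = 1 - 1 / comb(n, 2)
        return 1 - p_chunk_safe ** n_chunks

    print(f"fixed pairs: {p_loss_fixed_pairs(10):.1%} chance of data loss")   # 11.1%
    print(f"distributed: {p_loss_distributed(10):.1%} chance of data loss")   # ~100%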
Developers plan n-way raid1 Very Soon Now™, and hopefully will get around to better guarantees with raid10 to make it scale like raid10 should.
I am still waiting for a database-based file system.
Some filesystems (like NTFS/ReFS, BeFS, and afaik ReiserFS4(?) and Btrfs(?)) could be extended. Microsoft extended the NTFS fs driver with the Cairo project in the NT4/5 era: http://en.wikipedia.org/wiki/Cairo_(operating_system)
> All of the other functions NTFS exports for the EFS driver begin with the prefix NtOfs, which presumably stands for NT Object File System. One of the original goals of the NT 5.0 project (code-named Cairo) was to develop an object-oriented file system. Although NTFS in Win2K probably hasn't reached the level of object orientation that the Cairo planners had in mind, Microsoft has extended NTFS in several significant ways from its NT 4.0 implementation. One of those ways is NTFS's support for encrypted files via the NtOfs interfaces.
Later Microsoft moved the database to user mode with the WinFS project; the MS SQL database was stored in a hidden directory on the NTFS filesystem: http://en.wikipedia.org/wiki/WinFS . WinFS failed because of its overcomplicated metadata ontology, and it was too slow for a filesystem (a .NET-based shell extension). Microsoft moved the ideas to SharePoint (2007-2016), which now offers some of the proposed features: http://en.wikipedia.org/wiki/SharePoint .
What is the difference between a file system and a database? From my point of view a file system is just a specific kind of database (a hierarchical key-value store).
Then we put other kinds of databases on top of the file system database. Funny isn't it?
If the file system were a complete enough database, we might not need things like SQLite, which purports itself as a replacement for fopen. Hmm, full circle.
Of course we have barely figured out how to encode text, maybe one day we will know how to store and retrieve data flexibly and consistently, one day long from now.
I personally like the relational model the best; it would be interesting to see an OS and file system based on the relational model rather than the hierarchical one, but what do I know.
The nice thing about computers (Turing Machines) is that they let us simulate anything, including having better[1] computers (that may just run a little slower). The big idea behind turning the file system into a "database" is that you would be able to simulate any kind of file system you wanted, including better ones. This is basically what RDBMSs like SQL Server and SQLite do.
I also like the relational model the best, but it's just one of the many available simulations of "betterness".
[1] Where "better" means: easier to program, or more reliable, or has infinite memory (garbage collection), or is easier to use, or...
> What is the difference between a file system and a database?
An additional index, support for separate/extended file streams and a query language - all implemented in the (kernel mode) filesystem driver (like NTFS-Cairo and BeFS).
In Mac OS X, HFS+ has also been extended with arbitrary extended file attributes[0], as well as a layer on top of the filesystem called Spotlight[1] that indexes attributes as well as file contents for instant boolean searches.
It is amazing once you start working with video how much space you can use with RAW formats and all the files you create. A Red EPIC in HDRx at 8:1 at 5K is almost 6GB/minute.
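Quick arithmetic on that figure (the 6GB/minute comes from the comment above; the 100TB array is just a hypothetical example):

    gb_per_minute = 6                     # RED EPIC, HDRx, 8:1, 5K
    tb_per_hour = gb_per_minute * 60 / 1000
    array_tb = 100                        # hypothetical array size
    print(f"{tb_per_hour:.2f} TB per hour of footage")                       # 0.36 TB/h
    print(f"{array_tb / tb_per_hour:.0f} hours of footage fills {array_tb} TB")  # ~278 h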
The amount of space isn't so impressive, but the physical size and number of spindles.
I worked at one place that had 3x 100TB disk arrays; each one was spread over 15 racks, giving 3 gig a second of throughput (in 2007).
Now, 170 TB and a server fit in 5U. 60 disks give 2 gigs (bytes) a second of throughput. 2.5 racks gives you 5.5 PB and ~60 gigabytes/second peak (25-8 sustained).
RED is Hollywood-class, though towards the lower end. A lot of big-name movies that you've heard of include scenes shot on RED cameras: http://www.red.com/shot-on-red. The TV show Leverage was shot entirely on a pair of RED Xs.
RED's stuff is on the order of $25k. The main player for digital cinema is Arri ALEXA, on the order of $50k. It's not that far off.
Yeah, I've read about big names switching to RED (that's what made their business thrive). I was also referring to the disk space. A single person can assemble a 100TB setup. That was Pixar territory not long ago.
The saddest thing about both file systems is their licences.
The CDDL prevents ZFS from being included/heavily integrated into Linux, the BSDs, OS X and Windows. Btrfs will very likely stay Linux only because of GPL.
The interesting and hard parts of a file system are pretty much operating system independent, so I wish for a common file system initiative.
"The CDDL prevents ZFS from being included/heavily integrated into Linux, the BSDs, OS X and Windows" - no, it is included and integrated into FreeBSD with no issues, NetBSD has an old version that needs updating, OpenBSD thinks it is too complex. OSX could use it (they could get a commercial license anyway), but seemed to decide not to, as could Windows.
I think the actual difficulty is patents. From what I remember, once Oracle bought Sun, they decided to go after NetApp for infringing on ZFS. As it turned out, Oracle was infringing on NetApp's patent, but NetApp had let it slide because, until Oracle started being a dick, it was mutually beneficial. That is why Oracle also started BTRFS and is putting its development behind that instead.
I'm not really sure where that leaves other commercial projects (especially OSX which is badly in need of a functional file system).
It was really the other way around. NetApp had Sun in court over ZFS infringing on WAFL patents[1] well before the buyout.
I certainly don't know all the ways people use their Macs, but it seems like ZFS would not get a lot of use in OS X unless it was the only option:
- It doesn't offer a whole lot over traditional filesystems in a single-device context; since Apple has basically abandoned server/enterprise, I would wager the vast majority of new OS X systems are single-drive.
- At the time Apple was supposedly making a decision about this, I am not sure if ZFS handled 4k alignment (ashift); this is very important now that a lot of new Macs are shipping with SSDs.
- You're still, even w/ 10.10, discouraged from using case-sensitive HFS+; at least one application[2] won't work on a case-sensitive filesystem on OS X.
The above reasons probably apply to Windows as well. Also, in general, it seems like both vendors want to be the sole source for all of their core features.
ZFS has supported autodetecting 4k drives since day one, with the caveat that certain drives that were programmed to try to fool OSes that only supported 512 byte sectors would occasionally fool ZFS as well [1]. Additionally, ZFS can be set as case-insensitive, to match HFS+ behavior. Finally, I'd argue that ZFS is useful even to the one-drive user. The snapshot feature would make Time Machine backups even more transparent, compression would make things a bit smaller, and filesystem sending/receiving would make backing up to other devices a lot simpler. Modern filesystems just make sense, even for the regular user.
From link [1] above it looks like Sun went after NetApp; after NetApp poked around, it turned out that Sun was the one in violation. I remembered that was the case, but I was thinking that Oracle had already bought them at that point.
FTA:
>Sun approached NetApp about 18 months ago with claims the storage maker was violating its patents and seeking a licensing agreement, NetApp Chief Executive Dan Warmenhoven said in a statement.
>Several months into those discussions and following a review of the matter, NetApp made a discovery of its own, Warmenhoven said, concluding NetApp did not infringe the patents but that Sun infringed on NetApp's.
That's not at all what happened. (Disclaimer: I was at Sun at the time and was deposed in the case.) Sun didn't "go after NetApp" -- NetApp tried to buy some StorageTek patents via a third-party intermediary, and when they were rebuffed, they came after ZFS.[1] And, it should be said, NetApp didn't particularly care about Sun -- they cared about the fact that ZFS was open source. NetApp wanted Sun to "close" ZFS or otherwise "restrict its use"[2]. As for the case itself, it was moved back to California (NetApp had initiated it in East Texas, the patent troll capital of the universe) where it became a massive case, and was then slimmed down by order of the magistrate to three patents on the NetApp side and four patents on the Sun countersuit side. At the same time, thanks to a community outpouring of prior art, Sun was able to pursue invalidating the claims of the NetApp patents with the US Patent Office.[3] These efforts were wildly successful, and all three NetApp patents were rejected on all claims. Amazingly, the case wasn't thrown out at that point (though any damages would obviously be very limited), but every turn in the case had gone Sun's way.
Then, Oracle acquired Sun, and for reasons that haven't been disclosed, Oracle and NetApp dismissed their respective cases.[4] While I can't disclose the reasons behind this, I can say that both Oracle and NetApp would have jumped at the chance to cross-license ZFS and WAFL patents in a way that extended only to Oracle and not to CDDL licensees. (That is, prohibited open source ZFS.) Because the CDDL is airtight with respect to patents, such cross-licensing was impossible, and by dismissing their suits (instead of settling), the findings of fact from the trial essentially disappear -- which is enormously to NetApp's advantage. Point is: ZFS actually has about as much patent security as one can find in an open source system, as it has withstood a direct, full-frontal assault by attorneys seeking to find a way around its patent grants.
Apple's Core Storage is a logical volume manager, and it has COW features, in particular when using its dmcache-like function to marry an SSD and HDD into a single logical volume. They could conceivably tack on checksums at this level (instead of parity, for example). Core Storage volumes are the default since 10.10; in fact, as long as you don't already have a Boot Camp'd dual-boot setup, the Yosemite install process converts the main partition to a Core Storage LV in-place.
Thanks for the links, it had been a while so I was a little foggy on that one. In the end, I don't think you'd want to back the project for that reason.
>It doesn't offer a whole lot over traditional filesystems in a single-device context; since Apple has basically abandoned server/enterprise, I would wager the vast majority of new OS X systems are single-drive.
Not really, but of all the major vendor filesystems, HFS+ is probably the weakest link. An update to a filesystem that is well maintained for a FreeBSD base would probably be an upgrade. Also, even with a single disk, zfs would have the benefits of good filesystem-level compression, which can boost read/write speeds by 2 or 3 times, especially on a non-server box where you are likely to have idle cores.
There are some features at an OS level from using COW, if you want to give people restore points or built-in OS VCS for files. Also checksums (i.e. "this file looks to be corrupted, would you like to go back to yesterday?").
So there's some cool usage for these file systems even on a single disk but again a corporation probably doesn't want to touch it with a 10 foot stick.
Pretty much the only license that CDDL conflicts with is GPL, and that's because of differing approaches of both licenses rather than one being non-free (IMO).
However, that doesn't really prevent it from being used with Linux in practice…