Are you talking about snapshots that have been broken by extended attributes bugs, or "disappearing" snapshots ... and then what happens when you send/recv them ?
Actually, either way, would you email email@example.com so we can chat over email ?
However, we were scared enough about "unknown unknowns" that we set up this aspect of our service such that every ZFS send/recv customer gets their own zpool inside their own bhyve.
So it might be a DOS attack, but I don't think there is a data attack against rsync.net here. Comments ?
Unrelated: I'm curious - when I speak of snapshots corrupted by extended attributes or "disappearing snapshots", do you know what I mean ? Maybe we should talk :)
Funnily enough, OpenZFS is now the more advanced filesystem, minus the clustering features Oracle integrated (shared-disk clustering, that is, not shared-nothing).
I think the actual expression Bryan uses is "lawn mower" :)
Disclaimer: I work for Oracle on Solaris (but not ZFS).
But basically tons of improvements to the L2ARC and volume management (async destroy is a godsend, for example).
Lack of encryption and clustering support is a bit of a bummer, but both are easily worked around. The performance improvements in the OpenZFS branch, however, are very dramatic and make it well worth running over the Oracle branch where possible.
There is also all the inflight stuff here on the main page: http://open-zfs.org/wiki/Main_Page
I searched that page (and this HN discussion) for the word "defrag" and got nothing. That's a problem.
Does the Oracle version of ZFS have defrag, or have it in the pipeline ?
It is not reasonable to expect folks to just never, ever, exceed 85% pool utilization. Further, most of those folks don't realize that there's no coming back from it - exceed 85% even for a day and you have a performance penalty forever.
"Oh it's no problem, just export the pool and recreate it"
Sure ... let's buy another $20k of equipment and duplicate a 250 TB pool just because we had a usage overrun (which wasn't really a usage overrun at all) one weekend.
Finally: think of the economics of this unofficial limit on your pools ... you probably aren't running a pure stripe, right ? Maybe you're running raidz3 ? So you already gave up three drives for data protection ... did your cost accounting also subtract another 15-20% from available storage space ?
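To make that accounting concrete, here is a rough back-of-the-envelope sketch. The 12-drive vdev width is an assumption picked purely for illustration (no width is given above), and the calculation ignores raidz padding and metadata overhead.

```python
def usable_fraction(drives_per_vdev, parity_drives, max_fill):
    """Fraction of raw capacity left for data after parity and a fill ceiling."""
    if not 0 <= parity_drives < drives_per_vdev:
        raise ValueError("parity_drives must be in [0, drives_per_vdev)")
    # Share of each vdev that holds data rather than parity.
    data_fraction = (drives_per_vdev - parity_drives) / drives_per_vdev
    # Only max_fill of that share is usable if the pool must stay below the ceiling.
    return data_fraction * max_fill


# Hypothetical 12-drive raidz3 vdev, kept under the 85% utilization ceiling.
fraction = usable_fraction(drives_per_vdev=12, parity_drives=3, max_fill=0.85)
print(f"usable share of raw capacity: {fraction:.2%}")
```

Under those assumptions, a bit under two thirds of the raw capacity you paid for is actually available for data.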
ZFS needs defrag.
A lot of what makes ZFS good is its CoW implementation, and that implementation is simplified by the concept that a block will never change. However, the main underlying feature that would allow defragmentation is referred to as "block pointer rewrite". Effectively it would allow you to copy a block, apply any other transformation to it, and then transactionally update all pointers to that block. This is very hard when you factor in all the things that could possibly point to a block, including many snapshots, clones, etc.
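A toy sketch of why that is hard. This is not how ZFS lays anything out on disk: the `blocks` and `trees` dictionaries and the `rewrite_block` function are all made up for illustration. The point is only that relocating one block means finding and updating every pointer to it, in the live filesystem and in every snapshot and clone.

```python
def rewrite_block(blocks, trees, old_id, new_id, transform=lambda data: data):
    """Copy a block under a new id and repoint every referrer to it.

    blocks: dict mapping block id -> bytes (stand-in for on-disk blocks)
    trees:  dict mapping dataset/snapshot name -> list of block ids it points to
    Returns the number of pointers that had to be updated.
    """
    if new_id in blocks:
        raise ValueError("new_id already in use")
    blocks[new_id] = transform(blocks[old_id])
    updated = 0
    # Every dataset, snapshot and clone has to be searched for the old pointer.
    for pointers in trees.values():
        for i, block_id in enumerate(pointers):
            if block_id == old_id:
                pointers[i] = new_id
                updated += 1
    del blocks[old_id]
    return updated


blocks = {1: b"aaaa", 2: b"bbbb"}
trees = {"live": [1, 2], "snap-monday": [1], "clone-test": [1, 2]}
print(rewrite_block(blocks, trees, old_id=1, new_id=3))
```

In this toy the update is a loop over in-memory lists. A real implementation would have to do the equivalent on disk, transactionally, which is where the difficulty lies.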
So the long and the short of it is the situation isn't great.
Will we ever see BPR? Maybe.
Is it still a really good filesystem even with this limitation? Definitely.
I've seen pools that have climbed above 90%, and maybe they've suffered permanent speed degradation, but they still run fast enough to mostly saturate a 10GbE connection, so... not a problem for me?
However, the effects scared us enough that we will never let it happen again ... and that is a fairly severe economic and provisioning penalty (i.e., chop off 15% of your inventory on top of the three drives per vdev you already lost for parity).
The "more advanced" claim is certainly disputable but OpenZFS has a larger and rapidly growing user base. The ZFSonLinux and OpenZFSonOSX ports in particular are bringing loads of new users to the table, and that means more testing, more contributors, and in the long run more features. (I've also become an occasional ZoL contributor that way.)
Given you could just run LUKS on top of (open)ZFS, it's probably the better security position to run an audited, established encryption product, than to consider a layer thrown on top of ZFS a feature.
GELI creates a virtual block device that works great with ZFS, you get it all, self-healing, checksumming etc.
The whole point is to run ZFS on top of virtual block devices which are encrypted with GELI and "unlocked".
Whether or not it's cheaper depends on your use. Comparing to something like Glacier, FreeNAS is definitely cheaper if you need to store more than a couple hundred GB and/or access it frequently.
The thing about corrupting the entire zpool looks to be a bit of an urban legend (http://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-yo...). Flipping bits can of course cause problems, but doing a scrub after a bit is flipped in memory doesn't hose your whole pool. There's been endless discussion and flame wars on the topic, but I've never once heard someone who actually had it happen (because it can't really happen). You can lose some data due to bit flips, you won't however lose everything in the whole volume like some people claim (unless you have something like an entire chip on the memory fail, then you're definitely screwed but neither ECC nor a different FS will help you there).
ECC is a good idea and it's not hugely expensive, but there is one very big difference. That is, most people who might want to build a NAS have a decent bit of spare hardware lying around that they might be able to repurpose. That hardware is probably an old desktop machine and doesn't have ECC, so if you insist on using ECC then you can't use that old machine and have to buy a new mobo+RAM at least. I've been running ZFS for years without ECC doing weekly scrubs and never once destroyed anything.
Edit: the counter-argument/source of the theory that you absolutely need ECC is primarily this post https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-... if you want to read the other side for yourself.
Bit flip has been shown to cause both data loss/corruption as well as filesystem failure in ZFS. The scenario they presented in the paper is probably less relevant for a home lab where most data are accessed infrequently and the real chance of catastrophic failure is probably below 0.1% per bit flip event, but the risk is real and will only get higher as storage densities go up.
People on FreeNAS forum tend to play on the extreme side of caution - after all if one is willing to cut corners on hardware other unsafe practices will probably follow, hence it is better to make it clear that certain acts are not worth the risk.
:DRAM Errors in the Wild: A Large-Scale Field Study
: End-to-end Data Integrity for File Systems: A ZFS Case Study
> As much as it affects ZFS, that's my point.
That's not true. RAID protects you against disk failure but bitrot is caused by unrecoverable read failures. Check out this ars article:
As usual with a technical subject, one has to be a lot more specific to get to the truth, both in describing the problem and identifying the exact behavior of software in response. "Disk failure" covers a lot of ground, and the article you linked doesn't narrow it down all that much.
User data is stored on disk (or other media) along with a large number of ECC bits; the vendors know that bits rot on disk or in flash, and this allows transparent correction of most errors. So bits rot all the time and you normally never notice it; the rot is corrected by the disk firmware before any data is ever returned. Of greater interest are sectors that are repeatedly read in a way that they have so many errors as to be uncorrectable, which is probably what the author of that article intended to address. A disk that reliably fails a request for such a sector produces an error visible to the OS (or off-CPU RAID controller if in use), so that it can be tried against a mirror or computed from parity as appropriate. This case produces much the same result no matter what RAID or filesystem technology you use: a very slow read that eventually returns correct data. Other things may happen, too, like incrementing fault counters, sparing out the disk, rewriting the bad block (onto the same or a different location on the same or another disk), etc. These are dependent on the disk firmware and RAID or volume management implementation, and it's worth noting them because they greatly affect what happens the next time you try to read the same block and ultimately how likely you are to lose data to these defects.
However, the more interesting cases (which the ZFS authors understand very well but were not specifically described in this article because its author does not appear to understand how disks work) are those in which the wrong data is returned. Contrary to the author's idea, this is usually not caused by media errors as those are almost always detected by ECC firmware. Instead, they are caused by bugs in the disk firmware, or even more commonly the OS, such as enabling a disk's write cache and then not flushing it to close the filesystem transaction. As another example, some software RAID implementations (and/or filesystems) by design require a nonvolatile cache for correctness, especially if parity RAID is used, but many machines do not have one installed or configured for use by software. Such problems cannot be corrected at read time; it's too late. There are of course many other causes of such "phantom writes" and all manner of other firmware bugs, and it's important to characterize them more fully. For example, sometimes a re-read of the same block will return correct data, assuming it was written in the first place, and sometimes it won't (even better, sometimes whether it does or not depends on transient and unobservable state!). Regardless of the underlying cause, ZFS can protect the user application from being given incorrect data, assuming the corruption is not happening within the computer itself, such as may be caused by an OS bug, non-ECC DRAM, or a bad CPU. If you use RAID (mirrors or parity) with ZFS, it can often return the correct data instead, and fix up the problem in the background as well as keeping track of these problems so that persistently or severely faulty disks can be condemned. Even if you do not use RAID, ZFS will (possibly after various retries) turn this into something that looks like an unrecoverable media error; the application gets EIO. 
Of course, in that case your data is still lost (unless you elected to enable ditto blocks for user data...).
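The read path described above can be sketched roughly as follows. This is a simplified illustration, not ZFS code: `checksummed_read` is a hypothetical helper, SHA-256 stands in for whatever checksum the pool is configured with, and a plain Python list stands in for the sides of a mirror.

```python
import errno
import hashlib


def checksummed_read(copies, expected_digest):
    """Return a copy whose SHA-256 matches, repairing bad copies in place.

    copies: list of bytes objects, one per mirror side (or ditto copy)
    Raises OSError(EIO) if no copy matches the stored checksum.
    """
    good = next(
        (data for data in copies
         if hashlib.sha256(data).digest() == expected_digest),
        None,
    )
    if good is None:
        # No redundancy left to fall back on: surface it like a media error.
        raise OSError(errno.EIO, "checksum mismatch on every copy")
    for i, data in enumerate(copies):
        if data != good:
            copies[i] = good  # rewrite the copy that failed verification
    return good


payload = b"user data"
digest = hashlib.sha256(payload).digest()
mirror = [b"user dat\x00", payload]  # first side silently returned wrong data
print(checksummed_read(mirror, digest) == payload, mirror[0] == payload)
```

With a single copy and no redundancy, the same function can only raise the error, which corresponds to the application getting EIO.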
There's plenty more complexity here if you delve down into all the specific failure modes. Disks have an awful lot of them, and doing the right thing, or even agreeing on what the right thing is, for every one is difficult. Do you fail and spare out the disk? Do you count the errors and fail it after N of them? Do you rewrite the block, and if so at what address? Do you retry? How many times, and for how long? What if the disk returns correct data but only after a long time (usually caused by internal firmware retries)? And so on.
"Bit rot", i.e., silent bit flips within underlying media, may be real, but it's not directly observable by software and it's only the least interesting tip of the disk failure iceberg.
btrfs is a shitshow - sorry for having to resort to such words, but I will refrain from using it ever again if I have a choice.
If you look at the btrfs wiki there are a lot of features, but most of them are horribly broken or implemented in a way that's surprising.
It's full of bugs and not-yet-implemented stuff - some stuff that comes to mind:
* RAID is utterly broken:
* Lots of ENOSPC bugs (just look at the mailing list)
* You need to run the latest kernel (don't try anything below 4.2)
- if not: Data corruption bugs, deadlocks
* Quota is utterly broken and continues to break (just look at the mailing list)
* It's useless for databases or VM images (disabling datacow is an ugly hack IMHO)
- e.g. see here: http://blog.pgaddict.com/posts/friends-dont-let-friends-use-...
ZFS is not perfect either. Especially on Linux, it lacks tight integration into the VM subsystem and, at least a few versions ago, was prone to deadlocking, too. However, it's not eating data, and compared to btrfs it's rock solid.
sigh From the message that you linked:
"FWIW... Older btrfs userspace such as your v3.17 is "OK" for normal runtime use, assuming you don't need any newer features, as in normal runtime, it's the kernel code doing the real work and userspace for the most part simply makes the appropriate kernel calls to do that work.
But, once you get into a recovery situation like the one you're in now, current userspace becomes much more important, as the various things you'll do to attempt recovery rely far more on userspace code directly accessing the filesystem, and it's only the newest userspace code that has the latest fixes."
If you look, you see that the guy reporting the problem was using a very new kernel, but a one-year-old btrfs-tools package. It makes no sense to use tools that are substantially older than the kernel you're running. He might have had much better luck with recovery (that is, he might have been able to get a rw volume with all his data, rather than an ro volume -with all his data- out of the process) with a recent tools package.
Maybe I'm reading this one wrong, but it looks like the summary of the issues is:
* For multi-disk arrays with disks of varying sizes, each smaller disk does not receive data until all larger disks have as much or less free space than the smaller disk in question. 
* The read scheduling algo is currently naive, using PID evenness to determine the disk in a pair to read, rather than current IO load on the disk.
Did I miss something, or misunderstand something? If I didn't miss anything, I don't see how you can get "RAID is utterly broken" from "RAID has less-than-optimal write and read scheduling".
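A minimal sketch of what "PID evenness" selection amounts to, based only on the description in the second bullet above. The function and device names are hypothetical and this is not the kernel code.

```python
def pick_mirror_by_pid(pid, mirrors):
    """Choose which side of a mirror serves a read, based only on the PID."""
    return mirrors[pid % len(mirrors)]


mirrors = ["devA", "devB"]  # hypothetical device names
# One process issuing many reads always lands on the same device...
print({pick_mirror_by_pid(4242, mirrors) for _ in range(1000)})
# ...while two processes with PIDs of different parity are spread out.
print(pick_mirror_by_pid(4242, mirrors), pick_mirror_by_pid(4243, mirrors))
```

So reads are balanced only across processes, never within one: a single reader gets the throughput of a single disk, regardless of how idle the other one is.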
> * It's useless for databases or VM images (disabling datacow is an ugly hack IMHO)
I can't agree. I use btrfs on spinning rust to store my multi-terabyte Postgres 9.4 DB. It works well enough for my low-to-medium-volume  workload.
> * Lots of ENOSPC bugs (just look at the mailing list)
I don't run into those, and haven't for literal years.
> * Quota is utterly broken...
Yes, quota is a difficult problem that continues to be unsolved.
> * You need to run the latest kernel...
> ...if not: Data corruption bugs ...
I haven't run into any, ever on the handful of systems on which I use BTRFS.
> ...if not: ... deadlocks
True. There was that unlink/move deadlock in 4.1. I hit it from time to time when updating my copy of the metadata from the Gentoo Portage tree. Happily, it only blocked operation on the affected file. The remainder of the system kept running along, and a -ugh- reboot unblocked the operation. 
Look. I'm not saying that btrfs is feature-complete, suitable for every workload, or that any given person should use it for their workload. I'm just saying that the situation is far less dire than you're making it out to be. I've been using it as the rootfs for at least one of my daily use systems for the past 5.5 years, and haven't had any trouble out of it  in the past 2+ years.
 That phrasing was abysmal. Here's an example: given an array of one each of 3GB, 2GB, and 1.5GB devices, the 2GB device will get data after 1GB has been written to the 3GB device. The 1.5GB device will get data after 0.5GB of additional data has been written to each of the 3GB and 2GB devices (making a total of 1GB additional data written to the array before the 1.5GB device starts getting data). AIUI, if you're running in RAID1 mode, then the same logic applies, but with paired devices.
 Most definitely not web-scale workload.
 I don't know if an unmount/remount cycle would have also worked. The wedged operation happened on my rootfs, so I had no way to test this.
 Except for the unlink/move deadlock in 4.1.
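The 3GB/2GB/1.5GB example in the first note above can be checked with a few lines of arithmetic. This is a sketch of the rule as described there (a device receives data only once every larger device has no more free space than it does), assuming single-copy data. It is not a model of the actual btrfs allocator, and the function name is made up.

```python
def written_before_first_use(sizes_gb):
    """For each device, GB written to the array before it receives any data.

    Each larger device must first be filled down to this device's size,
    so the answer is the sum of the size differences to all larger devices.
    """
    return [
        sum(other - size for other in sizes_gb if other > size)
        for size in sizes_gb
    ]


print(written_before_first_use([3.0, 2.0, 1.5]))
```

For those three devices this gives 1GB written before the 2GB device is first used and 2GB in total before the 1.5GB device is first used, i.e. 1GB more after the 2GB device joined, which matches the figures in the note.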
>sigh From the message that you linked:
The part below your quote is interesting.
> General note about btrfs and btrfs raid. Given that btrfs itself remains a "stabilizing, but not yet fully mature and stable filesystem", while btrfs raid will often let you recover from a bad device, sometimes that recovery is in the form of letting you mount ro, so you can access the data and copy it elsewhere, before blowing away the filesystem and starting over.
I have hotplug and online-replace with ZFS and mdraid. btrfs wants a special degraded remount that might or might not be able to recover to rw someday... replacing a disk online is not really possible. Handling a failing drive is not really implemented.
Don't call it stable or don't call it RAID. It's nice that I'm able to pull the data off my ro-mounted degraded array when a disk fails... however, all of this stuff is not really documented and was very surprising to me. If you have a production machine, e.g. on btrfs RAID10, it sucks to not be able to replace a broken drive on the fly, online.
ZFS/mdraid just do it..
> * The read scheduling algo is currently naive, using PID evenness to determine the disk in a pair to read, rather than current IO load on the disk.
> Did I miss something, or misunderstand something? If I didn't miss anything, I don't see how you can get "RAID is utterly broken" from "RAID has less-than-optimal write and read scheduling".
Walking is a less-than-optimal way to move from Moscow to Berlin. If I take a train I expect a train - btrfs is a train that occasionally stops and says: Well, just walk... okay, you see I'm kind of emotionally invested in this shit. However, it's not what's usually expected of RAID and it's not documented.
We have here a lot of applications that do exactly that: a single PID reads lots of stuff very often sequentially... so RAID1 doubles throughput. Well not on btrfs.
> I don't run into those, and haven't for literal years.
4x4TB RAID10 (90% free) -> ENOSPC - this was somewhere between 3.16 and 3.19 - how can this even happen?
> I haven't run into any, ever on the handful of systems on which I use BTRFS.
Good for you.
Was hit by this nasty undeletable directory bug: http://www.spinics.net/lists/linux-btrfs/msg35789.html
And then this: http://www.spinics.net/lists/linux-btrfs/msg45108.html
Unmounting and running btrfs check --repair is not really cool if you want some uptime...
Currently btrfs deadlocks reliably in an OOM situation...
With 4.2 things got better; before that, running some medium-heavy Hadoop job or Elasticsearch indexing killed 20% to 50% of the machines due to deadlocks (with 3.19 IIRC).
> Look. I'm not saying that btrfs is feature-complete, suitable for every workload, or that any given person should use it for their workload. I'm just saying that the situation is far less dire than you're making it out to be. I've been using it as the rootfs for at least one of my daily use systems for the past 5.5 years, and haven't had any trouble out of it  in the past 2+ years.
It's probably fine for personal use. But don't think about using it for something that is demanding. You can google the presentations that say: Use it! It's stable! Enterprise! It's not.
Maybe in 5 years. I just saw it breaking in too many strange ways to bother anymore with it.
No worries. I don't give a shit about tone, just about content. Tone is next to impossible to accurately judge through text anyway, so I don't presume to understand someone's intent unless I have a pile of evidence. :-)
> ...btrfs wants a special degraded remount that might or might not be able to recover to rw someday... replacing a disk online is not really possible.
But, from the email you quoted:
"...while btrfs raid will often let you recover from a bad device, sometimes that recovery is in the form of letting you mount ro, so you can access the data and copy it elsewhere, before blowing away the filesystem and starting over."
And a Stack Overflow post indicates that you can live-replace a failed drive in a btrfs RAID array. (The man page makes it seem that you might want to also pass the -r flag to the replace operation.)
I've never had the money around to amass the disks required to run my btrfs array in RAID1 data mode, so I can't test the claim of that Stack Overflow post, but I do use it in single data, RAID1 metadata and system mode. I've had a drive in that array suddenly drop out due to a firewire controller hiccup, and I suffered no data loss. (The volume did get forced into ro mode, so future writes were lost, but all applications knew that those writes were lost.)
Additionally, I've been able to live-add and live-remove drives from my array with no hassle (other than the expected speed degradation due to shuffling the data around) whatsoever.
> ...so RAID1 doubles [read] throughput. Well not on btrfs.
Maybe I'm out of the loop, but I've never heard anyone say "use RAID1 to double your read throughput". That sounds like a very nice thing to have, but far from essential. Its absence certainly doesn't make btrfs's RAID1 implementation broken. :)
> Unmounting and running btrfs check --repair is not really cool if you want some uptime...
Agreed, very much so.
One thing bothers me about that guy who was reporting the bug on the 4.1 kernel. He claimed that his mount options were:
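/dev/sdb /media/storage2 btrfs ro,noatime,compress=lzo,space_cache 0 0
but the issue he was reporting smelled like it could only happen on a rw volume. Wonder what was up with that.
It's also interesting that no one replied to his messages. Both linux-btrfs and #btrfs have always been quite responsive when I've sent messages.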
> 4x4TB RAID10 (90% free) -> ENOSPC - this was somewhere between 3.16 and 3.19 - how can this even happen?
Dunno. It looks like the Global Reserve got introduced in ~3.16, so it's not likely related to that. Maybe you hit a hard-link limit or had really, really large xattrs attached to files or something? (I do know that btrfs has (had?) a surprisingly low limit on the number of hard links supported in a volume, but I have no idea how hitting that limit presents itself.)
Do you remember if you let the system idle for a few minutes and tried your operation again? (Not that that's a reliable [or even reasonable] solution, mind, but does provide a useful piece of diagnostic info.)
In the distant past, running a balance on empty chunks (in order to free them and free up otherwise unused metadata space) was one recommended way to work around mysterious ENOSPC issues. This isn't supposed to be required anymore, but it's a last-ditch cargo-cult thing that can't do anything worse than waste one's time.
> Good for you. Was hit by this nasty undeletable directory bug...
That seriously sucks. However, I don't call that data corruption. Inconvenient, bordering on unacceptable? Yes. Data corruption? Nah. Your data is still present and correct, but you can't get rid of some directories. 
> Currently btrfs deadlocks reliably in an OOM situation... With 4.2 things got better...
Is it any OOM situation, or just particular ones? I know that I've had google-chrome + Hangouts run away and nom all my memory on multiple occasions, and had my btrfs volumes function just fine after the OOM killer had its way with Chrome.
Also, do you have a handle on roughly how much better it has become in 4.2, and is this problem something that the btrfs folks are aware of?
> [btrfs is] probably fine for personal use. But don't think about using it for something that is demanding.
That's a far softer statement than "btrfs is a shitshow ... I will refrain from using it ever again if I have a choice.". :)
Anyway, I do REALLY hope I soon get the budget to set up a btrfs testbed. I'd really like to see (and help diagnose) all these failures that people keep talking about... in a carefully controlled environment, with data that I don't care about -naturally-. :)
 And -for the record- I saw an eerily similar thing happen every other month at a Windows-only programming shop I used to work for. One or more files and/or directories on the NTFS-backed Windows 2003 file server would inevitably become undeletable.  The sysadmin would have to take the whole thing down for a half hour to smear chicken guts in the right places to get rid of the offending files/directories. Despite these regular issues, we never decided to stop using NTFS. ;)
 And no, no one who was using the file server had admin access. Unprivved user access was all that was needed to create these undeletable files and directories.
Good catch! I have to test this on a rented server I'm running anyway in the next few days. Adding an additional drive, running replace, and removing the old drive seems to work. I'm about to find out whether deleting a device, rebooting, and trying to replace a non-existent device also works. I guess it's likely that I'll have to fumble with the provider's rescue system (so no control over the kernel version :/).
I thought that the -o degraded dance was still necessary; looks like I was wrong. Let's see how the disk replacement on that server turns out...
> One thing bothers me about that guy who was reporting the bug on the 4.1 kernel. He claimed that his mount options were:
/dev/sdb /media/storage2 btrfs ro,noatime,compress=lzo,space_cache 0 0
but the issue he was reporting smelled like it could only happen on a rw volume. Wonder what was up with that.
That guy was me :)
Basically the disk got remounted ro after the failure and I just did a grep on /proc/mounts for the fs before posting (not very wise). This was a bug that was fixed with 4.2 (I didn't see it again, at least), and btrfs check --repair fixed it.
> It's also interesting that no one replied to his messages. Both linux-btrfs and #btrfs have always been quite responsive when I've sent messages.
I had no time to hang around IRC at this time. Yes. Usually it's very nice there. I also want to say I'm not ranting against any of the people involved. I'm just unhappy with the current state of the technology.
> Dunno. It looks like the Global Reserve got introduced in ~3.16, so it's not likely related to that. Maybe you hit a hard-link limit or had really, really large xattrs attached to files or something? (I do know that btrfs has (had?) a surprisingly low limit on the number of hard links supported in a volume, but I have no idea how hitting that limit presents itself.)
Actually I don't know - the filesystem in question ran a while with the 3.13 kernel and got later upgraded to something less old. Maybe that's part of the problem. ENOSPC happened while a misconfigured Hadoop NameNode wrote tons of data 4x without checkpointing on the disk and when restarting the checkpointing server it all crashed. It's now running 4.2 and fine so far...
> In the distant past, running a balance on empty chunks (in order to free them and free up otherwise unused metadata space) was one recommended way to work around mysterious ENOSPC issues. This isn't supposed to be required anymore, but it's a last-ditch cargo-cult thing that can't do anything worse than waste one's time.
Yeah did that balance dance and it somewhat worked.
> Is it any OOM situation, or just particular ones? I know that I've had google-chrome + Hangouts run away and nom all my memory on multiple occasions, and had my btrfs volumes function just fine after the OOM killer had its way with Chrome.
Never got around to nailing this one down (would love to be able to file a bug on this), but at the time I was not really able to do much on the machines. It's a NUMA system, and basically Elasticsearch and Hadoop allocated far more than 64GB of memory. OOM kicked in and _after a while_ the logs were full of hanging tasks related to btrfs. At that time I found something related on the mailing list regarding allocations and OOM; however, I lack the intimate knowledge to be sure it's related.
I've screamed at everyone to avoid OOM situations so I'm not sure if it's still a problem :)
> Also, do you have a handle on roughly how much better it has become in 4.2, and is this problem something that the btrfs folks are aware of?
So far I did not see it again. I had not enough data to file a bug or investigate further. It was something lock related IIRC.
> Anyway, I do REALLY hope I soon get the budget to set up a btrfs testbed. I'd really like to see (and help diagnose) all these failures that people keep talking about... in a carefully controlled environment, with data that I don't care about -naturally-. :)
Good luck! You can try running Hadoop and some heavy jobs on it. Lots of threads that read and write lots of data. There are some benchmarks like https://github.com/intel-hadoop/HiBench to stress the system. Should also work on a single system.
>  And -for the record- I saw an eerily similar thing happen every other month at a Windows-only programming shop I used to work for. One or more files and/or directories on the NTFS-backed Windows 2003 file server would inevitably become undeletable.  The sysadmin would have to take the whole thing down for a half hour to smear chicken guts in the right places to get rid of the offending files/directories. Despite these regular issues, we never decided to stop using NTFS. ;)
Yes. There is worse stuff out there. I was using ZFS before with lots of really fucked up disks in a kind of "don't do this, you are stupid!" setup and ZFS just did its job and never complained: same Hadoop workload + disks with thousands of bad sectors. Maybe it's unfair to compare btrfs to this; however, it was a ride full of disillusionment to think btrfs compares to ZFS.
Maybe it's getting there and I was in at a bad time... I'm not so sure though.
There was a talk "Why OpenBSD sucks" by Henning Brauer, where he complains that ZFS is a "kitchen sink" approach to a filesystem. He doesn't really explain further, but it would be interesting to hear someone elaborate that criticism.
Keep in mind that their criticisms are mostly about it not being a good choice for OpenBSD, although some of them apply across the board.
Thanks for info about Henning Brauer talk, will check it out.
It is very easy to crank out claims of features while not having really implemented said feature in a correct way. This leads to bugs which are hard to find and almost impossible to eradicate.
It's really hard to skim the material and decide if I want to read more in depth.