The Birth of ZFS [video] (youtube.com)
152 points by bcantrill on Nov 15, 2015 | 58 comments



rsync.net supports ZFS send and receive to their cloud storage platform:

http://www.rsync.net/products/zfsintro.html


How have you dealt with the issue of byzantine streams? This is something that we have wanted to solve for a while, but it's a really nasty issue -- and at the recent OpenZFS Developers Summit we concluded that it wasn't practical. Instead, we're looking to tack more towards signed ZFS streams[1] -- though that work is only in the design stage. How has rsync.net solved this?

[1] https://github.com/joyent/rfd/blob/master/rfd/0014/README.md
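The signed-streams RFD linked above is only a design note; as a rough sketch of the idea (authenticate the stream before `zfs recv` ever parses it), one could imagine an HMAC wrapper like the following. All names here are hypothetical and not the RFD's actual format:

```python
import hmac
import hashlib

TAG_LEN = 32  # SHA-256 output size

def sign_stream(stream: bytes, key: bytes) -> bytes:
    # Prepend a MAC over the whole stream so the receiver can verify
    # authenticity before handing bytes to the (panic-prone) parser.
    tag = hmac.new(key, stream, hashlib.sha256).digest()
    return tag + stream

def verify_stream(signed: bytes, key: bytes) -> bytes:
    tag, stream = signed[:TAG_LEN], signed[TAG_LEN:]
    expected = hmac.new(key, stream, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("stream failed authentication; refusing to recv")
    return stream
```

A tampered byte anywhere in the stream makes verification fail, so a byzantine stream is rejected before the kernel ever sees it. Note this only authenticates the sender; it does nothing for a trusted sender who sends a malformed stream.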


I think I know what you're referring to, but we've never called them "byzantine streams".

Are you talking about snapshots that have been broken by extended attributes bugs, or "disappearing" snapshots ... and then what happens when you send/recv them ?

Actually, either way, would you email info@rsync.net so we can chat over email ?


I'm talking about a ZFS stream that has been deliberately corrupted and then sent to you to be received as a filesystem -- when it will in fact cause ZFS to panic (or worse). So I mean "byzantine" in the "characterized by deviousness" sense: someone actively trying to do harm to you. Or perhaps I have misunderstood the functionality that you offer?


Ok, that is not what I was thinking of. We have made no specific provisions for this.

However, we were scared enough about "unknown unknowns" that we set up this aspect of our service such that every ZFS send/recv customer gets their own zpool inside their own bhyve.

So it might be a DOS attack, but I don't think there is a data attack against rsync.net here. Comments ?

Unrelated: I'm curious - when I speak of snapshots corrupted by extended attributes or "disappearing snapshots", do you know what I mean ? Maybe we should talk :)


I don't know enough about bhyve or your implementation to be able to comment, but if you assume that any system that has ever received a ZFS stream from a user is entirely compromised, then perhaps you can work around it that way. I'm not aware of a disappearing snapshot problem; has there been something sent to the ZFS list about this? Would have been a good topic of discussion for the OpenZFS Developers Summit a few weeks ago... ;)


Yes, that's exactly the model - these users have their own zpool inside their own VM.


So, is Oracle still developing its ZFS separately, or has it switched to OpenZFS?


Separately. However their changes and OpenZFS changes are now effectively incompatible.

Funnily enough, OpenZFS is now the more advanced filesystem, minus the clustering features Oracle integrated (shared-disk clustering, that is, not shared-nothing).


I guess Oracle still hasn't learned anything. Too bad for them.


Their bank account shows how badly Oracle learns its lessons.


That just gives them a really long runway until the consequences of said decisions become apparent.


I doubt they'll ever learn though. As Bryan Cantrill put it[1], Oracle is like a landlord. They simply can't comprehend such things.

[1]: https://www.youtube.com/watch?v=-zRN7XLCRhc


> Oracle is like a landlord

I think the actual expression Bryan uses is "lawn mower" :)


That's a great talk, thanks!


I've not followed closely but got any references for that ("the more advanced filesystem")? From what I can see Oracle's ZFS still has some features that OpenZFS lacks (on disk encryption support being one of the most obvious) and there's been lots of development in the last few years. https://blogs.oracle.com/zfs/entry/welcome_to_oracle_solaris...

Disclaimer: I work for Oracle on Solaris (but not ZFS).


You can see the full list of (larger) improvements since the Oracle ZFS/OpenZFS split here: http://open-zfs.org/wiki/Features

But basically: tons of improvements to the L2ARC and volume management (async destroy is a godsend, for example).

Lack of encryption and no clustering support is a bit of a bummer, but both are easily worked around. The performance improvements in the OpenZFS branch, however, are very dramatic and well worth running it over the Oracle branch where possible.

There is also all the in-flight stuff here on the main page: http://open-zfs.org/wiki/Main_Page


"You can see the full list of (larger) improvements since the Oracle ZFS/OpenZFS split here: http://open-zfs.org/wiki/Features"

I searched that page (and this HN discussion) for the word "defrag" and got nothing. That's a problem.

Does the Oracle version of ZFS have defrag, or have it in the pipeline ?

It is not reasonable to expect folks to just never, ever, exceed 85% pool utilization. Further, most of those folks don't realize that there's no coming back from it - exceed 85% even for a day and you have a performance penalty forever.

"Oh it's no problem, just export the pool and recreate it"

Sure ... let's buy another $20k of equipment and duplicate a 250 TB pool just because we had a usage overrun (which wasn't really a usage overrun at all) one weekend.

Finally: think of the economics of this unofficial limit on your pools ... you probably aren't running a pure stripe, right ? Maybe you're running raidz3 ? So you already gave up three drives for data protection ... did your cost accounting also subtract another 15-20% from available storage space ?

ZFS needs defrag.
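To illustrate the complaint above: past the unofficial utilization limit, the problem is not the amount of free space but its shape. A hypothetical bitmap toy (not ZFS's actual metaslab allocator) shows a pool with ~14% free where no contiguous run is longer than one block:

```python
def longest_free_run(bitmap):
    # Length of the longest contiguous run of free (False) blocks.
    best = run = 0
    for used in bitmap:
        run = 0 if used else run + 1
        best = max(best, run)
    return best

NBLOCKS = 1000
bitmap = [True] * NBLOCKS          # a fully allocated pool...
for i in range(0, NBLOCKS, 7):     # ...then every 7th block freed (~14% free)
    bitmap[i] = False

free_blocks = bitmap.count(False)   # plenty of free blocks in total
max_run = longest_free_run(bitmap)  # but no contiguous run longer than 1
```

Any allocation larger than one block now forces the allocator to search for scraps or gang blocks together, which is roughly why performance falls off a cliff at high utilization and, without defragmentation, stays there.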


ZFS defragmentation is a very difficult topic, mainly because it's very hard to do transactionally.

A lot of what makes ZFS good is its CoW implementation, and that implementation is simplified by the invariant that a block never changes. However, the main underlying feature that would allow defragmentation is referred to as "block pointer rewrite" (BPR). Effectively, it would let you copy a block, apply any other transformation to it, and then transactionally update all pointers to that block. This is very hard when you factor in all the things that can point to a block, including many snapshots/clones etc.

So the long and the short of it is the situation isn't great. Will we ever see BPR? Maybe. Is it still a really good filesystem even with this limitation? Definitely.
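A toy model of why block pointer rewrite is hard: there are no back-pointers from a block to everything referencing it, so relocating one block means finding and atomically updating every tree that points at it. (Names and structures here are illustrative, not ZFS internals.)

```python
# One physical block, pointed at by the live tree and two snapshots.
blocks = {100: b"file contents"}               # address -> data
live = {"file.txt": 100}
snapshots = [{"file.txt": 100}, {"file.txt": 100}]

def rewrite_block(old_addr, new_addr):
    # Defragmentation would relocate the block and then, in a single
    # transaction, fix up *every* referrer: live tree, all snapshots,
    # clones, the dedup table... In this toy we can simply enumerate
    # them; on disk ZFS cannot, which is the crux of the problem.
    blocks[new_addr] = blocks.pop(old_addr)
    for tree in [live, *snapshots]:
        for name, addr in list(tree.items()):
            if addr == old_addr:
                tree[name] = new_addr

rewrite_block(100, 200)
```

In the toy the referrers are all in one place; on a real pool they are scattered across independently written trees, which is why doing this transactionally has never shipped.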


Just how bad is this "permanent" performance hit? Do you have numbers?

I've seen pools that have climbed above 90%, and maybe they've suffered permanent speed degradation, but they still run fast enough to mostly saturate a 10GbE connection, so... not a problem for me?


We (rsync.net) do not have any numbers, but we know what happened when we broke 85% on a zpool that contained a single vdev ... things went to shit. Luckily, we expanded that zpool with two other vdevs and it sort of balanced things out and rescued it ... meaning, enough IO happens on the other two vdevs to make the zpool viable.

However, the effects scared us enough that we will never let it happen again ... and that is a fairly severe economic and provisioning penalty (i.e., chop off 15% of your inventory on top of the three drives per vdev you already lost to parity).


Oracle's ZFS encryption is susceptible to watermarking attacks: http://lists.freebsd.org/pipermail/freebsd-hackers/2013-Sept...
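The watermarking issue can be sketched with a deliberately weakened toy cipher (this illustrates the attack class, not Oracle's actual construction): if a sector's ciphertext depends only on the key and the sector's contents, identical plaintext sectors produce identical ciphertext, so an attacker can recognize a crafted "watermark" file on the encrypted disk without knowing the key:

```python
import hashlib

SECTOR = 16

def toy_sector_encrypt(plaintext: bytes, key: bytes):
    # ECB-like toy: each sector's "ciphertext" depends only on the key
    # and the sector's contents -- no per-sector tweak/IV. That is the flaw.
    return [hashlib.sha256(key + plaintext[i:i + SECTOR]).hexdigest()
            for i in range(0, len(plaintext), SECTOR)]

key = b"secret key"
# A watermark file crafted so sectors 0 and 2 are identical:
marked = b"A" * SECTOR + b"B" * SECTOR + b"A" * SECTOR
ciphertext = toy_sector_encrypt(marked, key)
# An observer sees ciphertext[0] == ciphertext[2] and learns the
# watermark is present, with no knowledge of the key.
```

Real disk encryption avoids this by mixing the sector position into the cipher (e.g. XTS-style per-sector tweaks), so equal plaintexts at different positions encrypt differently.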

The "more advanced" claim is certainly disputable but OpenZFS has a larger and rapidly growing user base. The ZFSonLinux and OpenZFSonOSX ports in particular are bringing loads of new users to the table, and that means more testing, more contributors, and in the long run more features. (I've also become an occasional ZoL contributor that way.)


>on disk encryption

Given you could just run (Open)ZFS on top of LUKS, it's probably the better security position to run an audited, established encryption product than to consider a layer bolted onto ZFS a feature.


There are major benefits to be had from moving the encryption layer on top of the volume management/storage EDAC layers which ZFS provides: in particular, it'd be nice to be able to scrub a locked dataset. I think (but haven't seen this firsthand) that Oracle's implementation offers that benefit.


You can still scrub a dataset with GELI (https://www.freebsd.org/cgi/man.cgi?geli%288%29)

GELI creates a virtual block device that works great with ZFS; you get it all: self-healing, checksumming, etc.


It looks like GELI goes below zfs just like dm-crypt. That doesn't allow you to scrub a locked dataset (one where the key is not in memory, possibly unknown). You could use zvols, of course, but that loses some (but not all) of the full-stack zfs benefits.


No, you create virtual block devices that need to be "unlocked" before you can start using ZFS. If you need to scrub the dataset, you unlock all the virtual block devices, start ZFS, and then run the scrub command.

The whole point is to run ZFS on top of virtual block devices which are encrypted with GELI and "unlocked".


This is kind of a poor defence. ZFS built-in encryption would be a great feature, but since the source got leaked, implementing it is a bit of a legal minefield.


I thought FreeNAS uses OpenZFS, and that definitely gives me encryption. Am I missing something?


Using ZFS in production virtualization: it's fast and provides good integrity. You can basically get some of the hyped features of Docker (the copy-on-write filesystem part) with ordinary Linux virtualization.


I've been using it at home via FreeNAS after losing a bunch of old files to bitrot.


FreeNAS looks interesting, but I found the hardware requirements a bit prohibitive for home use. Renting storage in the "cloud" feels more cost-effective.


The hardware requirements are sometimes exaggerated. 8GB of (non-ECC) memory on a cheap processor is fine for a home user with FreeNAS.

Whether or not it's cheaper depends on your use. Compared to something like Glacier, FreeNAS is definitely cheaper if you need to store more than a couple hundred GB and/or access it frequently.


You are quite right that 8GB of memory and a cheap processor are good enough for home use. However, I thought the consensus was that one cannot skip ECC with ZFS, because a silent bit flip can corrupt an entire zpool. In any case, the extra cost is minimal, so why not.


Running without ECC is no worse than doing so with any other filesystem [as far as I understand].

The thing about corrupting the entire zpool looks to be a bit of an urban legend (http://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-yo...). Flipping bits can of course cause problems, but doing a scrub after a bit is flipped in memory doesn't hose your whole pool. There's been endless discussion and flame wars on the topic, but I've never once heard from someone who actually had it happen (because it can't really happen). You can lose some data due to bit flips; you won't, however, lose everything in the whole volume like some people claim (unless something like an entire memory chip fails, in which case you're definitely screwed, but neither ECC nor a different FS will help you there).
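The on-disk case that scrub is built for can be sketched like this (a toy mirror, not ZFS code): a flipped bit in one copy fails its checksum, is healed from the other copy, and nothing else in the pool is touched:

```python
import hashlib

def cksum(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

# A mirrored "pool": every block stored twice, checksums kept in metadata.
original = [b"block-%d" % i for i in range(4)]
mirror_a = list(original)
mirror_b = list(original)
checksums = [cksum(b) for b in original]

# Bit rot: flip one bit in one copy of block 2.
rotted = bytearray(mirror_a[2])
rotted[0] ^= 0x01
mirror_a[2] = bytes(rotted)

def scrub():
    # Verify every copy against its checksum; heal bad copies from good ones.
    repaired = []
    for i in range(len(checksums)):
        a_ok = cksum(mirror_a[i]) == checksums[i]
        b_ok = cksum(mirror_b[i]) == checksums[i]
        if not a_ok and b_ok:
            mirror_a[i] = mirror_b[i]
            repaired.append(i)
        elif not b_ok and a_ok:
            mirror_b[i] = mirror_a[i]
            repaired.append(i)
    return repaired
```

The debated non-ECC scenario is different: if RAM flips a bit before the checksum is computed, the bad data is checksummed as "good". But even then the damage is confined to the affected block(s), not the whole pool.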

ECC is a good idea and it's not hugely expensive, but there is one very big difference. That is, most people who might want to build a NAS have a decent bit of spare hardware lying around that they might be able to repurpose. That hardware is probably an old desktop machine and doesn't have ECC, so if you insist on using ECC then you can't use that old machine and have to buy a new mobo+RAM at least. I've been running ZFS for years without ECC doing weekly scrubs and never once destroyed anything.

Edit: the counter-argument/source of the theory that you absolutely need ECC is primarily this post, https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-... if you want to read the other side for yourself.


Thanks for the response. I think JRS's account has some serious issues, with the writer conflating hardware failure (stuck bits, unreadable blocks, etc.) with the random error that is more common, inevitable, and undetectable in principle without ECC[0].

Bit flips have been shown to cause both data loss/corruption and filesystem failure in ZFS[1]. The scenario presented in the paper is probably less relevant for a home lab, where most data are accessed infrequently and the real chance of catastrophic failure is probably below 0.1% per bit-flip event, but the risk is real and will only grow as storage densities go up.

People on the FreeNAS forum tend to err on the extreme side of caution: after all, if one is willing to cut corners on hardware, other unsafe practices will probably follow, hence it is better to make it clear that certain acts are not worth the risk.

[0]:DRAM Errors in the Wild: A Large-Scale Field Study http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

[1]: End-to-end Data Integrity for File Systems: A ZFS Case Study http://research.cs.wisc.edu/adsl/Publications/zfs-corruption...


How would ZFS help there? If you don't scrub, you lose files whatever fs you use.


He never said he didn't scrub ... assuming he did (and he held 2+ copies), the files are restored. Even with only a single copy, ZFS would at least tell you which files are corrupt, allowing you to restore them from backup.


I guess that's true, I was thinking about the case of the redundant array (otherwise the files wouldn't be "lost").


Bitrot still affects traditional RAID.

> As much as it affects ZFS, that's my point.

That's not true. RAID protects you against disk failure, but bitrot is caused by unrecoverable read failures. Check out this Ars article:

http://arstechnica.com/information-technology/2014/01/bitrot...


RAID can handle unrecoverable read failures. What it cannot handle is silent corruption where bad data is returned as if it were the correct data.
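A toy XOR-parity array (illustrative, not any real RAID implementation) makes the distinction concrete: a reported read failure is reconstructible from parity, while silently wrong data is served as-is, because nothing on the normal read path consults parity or a checksum:

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]   # three data "disks"
parity = reduce(xor, data)                       # one parity "disk"

# Case 1: a reported failure. Disk 1 returns an error, so we rebuild
# its contents from the surviving disks plus parity.
rebuilt = reduce(xor, [data[0], data[2], parity])
assert rebuilt == data[1]

# Case 2: silent corruption. Disk 1 returns wrong bytes with no error;
# a normal RAID read just serves them. Only a per-block checksum
# (as in ZFS) would notice the mismatch.
data[1] = b"\xff\xff"
served = data[1]   # the bad bytes go straight to the application
```

Parity could in principle detect the mismatch, but an ordinary read never recomputes it, and even a full parity check cannot tell which disk lied without an independent checksum.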


That makes sense, thanks


TLDR: This issue is way more complicated than you think, and the linked article is not very helpful in understanding it. It's fair to say that ZFS is as good as or better than anything else out there for any given storage topology and error case, but using ZFS in a redundant pool configuration is still necessary for universally good results.

As usual with a technical subject, one has to be a lot more specific to get to the truth, both in describing the problem and identifying the exact behavior of software in response. "Disk failure" covers a lot of ground, and the article you linked doesn't narrow it down all that much.

User data is stored on disk (or other media) along with a large number of ECC bits; the vendors know that bits rot on disk or in flash, and this allows transparent correction of most errors. So bits rot all the time and you normally never notice it; the rot is corrected by the disk firmware before any data is ever returned.

Of greater interest are sectors that are repeatedly read in a way that they have so many errors as to be uncorrectable, which is probably what the author of that article intended to address. A disk that reliably fails a request for such a sector produces an error visible to the OS (or off-CPU RAID controller if in use), so that it can be tried against a mirror or computed from parity as appropriate. This case produces much the same result no matter what RAID or filesystem technology you use: a very slow read that eventually returns correct data.

Other things may happen, too, like incrementing fault counters, sparing out the disk, rewriting the bad block (onto the same or a different location on the same or another disk), etc. These are dependent on the disk firmware and RAID or volume management implementation, and it's worth noting them because they greatly affect what happens the next time you try to read the same block and ultimately how likely you are to lose data to these defects.

However, the more interesting cases (which the ZFS authors understand very well but were not specifically described in this article because its author does not appear to understand how disks work) are those in which the wrong data is returned. Contrary to the author's idea, this is usually not caused by media errors, as those are almost always detected by ECC firmware. Instead, they are caused by bugs in the disk firmware, or even more commonly the OS, such as enabling a disk's write cache and then not flushing it to close the filesystem transaction. As another example, some software RAID implementations (and/or filesystems) by design require a nonvolatile cache for correctness, especially if parity RAID is used, but many machines do not have one installed or configured for use by software. Such problems cannot be corrected at read time; it's too late.

There are of course many other causes of such "phantom writes" and all manner of other firmware bugs, and it's important to characterize them more fully. For example, sometimes a re-read of the same block will return correct data, assuming it was written in the first place, and sometimes it won't (even better, sometimes whether it does or not depends on transient and unobservable state!).

Regardless of the underlying cause, ZFS can protect the user application from being given incorrect data, assuming the corruption is not happening within the computer itself, such as may be caused by an OS bug, non-ECC DRAM, or a bad CPU. If you use RAID (mirrors or parity) with ZFS, it can often return the correct data instead, and fix up the problem in the background, as well as keeping track of these problems so that persistently or severely faulty disks can be condemned. Even if you do not use RAID, ZFS will (possibly after various retries) turn this into something that looks like an unrecoverable media error; the application gets EIO. Of course, in that case your data is still lost (unless you elected to enable ditto blocks for user data...).

There's plenty more complexity here if you delve down into all the specific failure modes. Disks have an awful lot of them, and doing the right thing, or even agreeing on what the right thing is, for every one is difficult. Do you fail and spare out the disk? Do you count the errors and fail it after N of them? Do you rewrite the block, and if so at what address? Do you retry? How many times, and for how long? What if the disk returns correct data but only after a long time (usually caused by internal firmware retries)? And so on.

"Bit rot", i.e., silent bit flips within underlying media, may be real, but it's not directly observable by software and it's only the least interesting tip of the disk failure iceberg.


As much as it affects ZFS, that's my point.


It would be interesting to see a comparison between ZFS, Btrfs, and ReiserFS in terms of features and development effort. Any ideas?


ReiserFS is dead as far as I've read.

btrfs is a shitshow - sorry to resort to such words, but I will refrain from using it ever again if I have a choice.

If you look at the btrfs wiki there are a lot of features, but most of them are horribly broken or implemented in surprising ways.

It's full of bugs and not-yet-implemented stuff - some things that come to mind:

* RAID is utterly broken: - http://www.spinics.net/lists/linux-btrfs/msg48561.html - http://www.spinics.net/lists/linux-btrfs/msg47845.html

* Lots of ENOSPC bugs (just look at the mailing list)

* You need to run the latest kernel (don't try anything below 4.2) - if not: Data corruption bugs, deadlocks

* Quota is utterly broken and continues to break (just look at the mailing list)

* It's useless for databases or VM images (disabling datacow is an ugly hack IMHO) - e.g. see here: http://blog.pgaddict.com/posts/friends-dont-let-friends-use-...

ZFS is not perfect either; especially on Linux it lacks tight integration into the VM subsystem, and at least a few versions ago it was prone to deadlocking, too. However, it's not eating data, and compared to btrfs it's rock solid.



> * RAID is utterly broken: - http://www.spinics.net/lists/linux-btrfs/msg48561.html

sigh From the message that you linked:

"FWIW... Older btrfs userspace such as your v3.17 is "OK" for normal runtime use, assuming you don't need any newer features, as in normal runtime, it's the kernel code doing the real work and userspace for the most part simply makes the appropriate kernel calls to do that work.

But, once you get into a recovery situation like the one you're in now, current userspace becomes much more important, as the various things you'll do to attempt recovery rely far more on userspace code directly accessing the filesystem, and it's only the newest userspace code that has the latest fixes."

If you look, you see that the guy reporting the problem was using a very new kernel, but a one-year-old btrfs-tools package. It makes no sense to use tools that are substantially older than the kernel you're running. He might have had much better luck with recovery (that is, he might have been able to get a rw volume with all his data, rather than an ro volume -with all his data- out of the process) with a recent tools package.

> http://www.spinics.net/lists/linux-btrfs/msg47845.html

Maybe I'm reading this one wrong, but it looks like the summary of the issues is:

* For multi-disk arrays with disks of varying sizes, each smaller disk does not receive data until no larger disk has more free space than the smaller disk in question. [0]

* The read scheduling algo is currently naive, using PID evenness to determine the disk in a pair to read, rather than current IO load on the disk.

Did I miss something, or misunderstand something? If I didn't miss anything, I don't see how you can get "RAID is utterly broken" from "RAID has less-than-optimal write and read scheduling".
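The PID-evenness point can be modeled in a few lines (a sketch of the described policy, not btrfs source): the mirror is chosen by PID parity, so a single sequential reader hammers one disk while its twin idles:

```python
def pick_mirror(pid: int) -> int:
    # Toy version of the described btrfs RAID1 read policy:
    # choose the copy by PID parity, ignoring per-disk load.
    return pid % 2

# One busy reader process issues 1000 reads...
reads = [pick_mirror(4242) for _ in range(1000)]
# ...and every single one lands on mirror 0, so a lone sequential
# reader sees single-disk rather than doubled throughput.
```

A load-aware policy would instead dispatch each read to whichever mirror currently has the shorter queue, which is what mdraid-style implementations approximate.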

> * It's useless for databases or VM images (disabling datacow is an ugly hack IMHO)

I can't agree. I use btrfs on spinning rust to store my multi-terabyte Postgres 9.4 DB. It works well enough for my low-to-medium-volume [1] workload.

> * Lots of ENOSPC bugs (just look at the mailing list)

I don't run into those, and haven't for literal years.

> * Quota is utterly broken...

Yes, quota is a difficult problem that continues to be unsolved.

> * You need to run the latest kernel...

Yes. Agreed.

> ...if not: Data corruption bugs ...

I haven't run into any, ever on the handful of systems on which I use BTRFS.

> ...if not: ... deadlocks

True. There was that unlink/move deadlock in 4.1. I hit it from time to time when updating my copy of the metadata from the Gentoo Portage tree. Happily, it only blocked operation on the affected file. The remainder of the system kept running along, and a -ugh- reboot unblocked the operation. [2]

Look. I'm not saying that btrfs is feature-complete, suitable for every workload, or that any given person should use it for their workload. I'm just saying that the situation is far less dire than you're making it out to be. I've been using it as the rootfs for at least one of my daily use systems for the past 5.5 years, and haven't had any trouble out of it [3] in the past 2+ years.

[0] That phrasing was abysmal. Here's an example: given an array of one each of 3GB, 2GB, and 1.5GB devices, the 2GB device will get data after 1GB has been written to the 3GB device. The 1.5GB device will get data after 0.5GB of additional data has been written to each of the 3GB and 2GB devices (making a total of 1GB of additional data written to the array before the 1.5GB device starts getting data). AIUI, if you're running in RAID1 mode, the same logic applies, but with paired devices.
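The footnote's ordering falls out of a simple "most free space wins" rule. A toy allocator (illustrative, with ties broken toward the first device; not btrfs's actual chunk allocator) working in half-GB units reproduces it:

```python
def allocate(nchunks: int, free: list) -> list:
    # Place each 1-unit chunk on the device with the most free space;
    # ties go to the lowest-numbered device.
    placements = []
    for _ in range(nchunks):
        dev = max(range(len(free)), key=lambda i: (free[i], -i))
        free[dev] -= 1
        placements.append(dev)
    return placements

# Devices of 3GB, 2GB and 1.5GB, expressed in half-GB units:
free = [6, 4, 3]
order = allocate(7, free)
# The 2GB device (index 1) receives nothing until the 3GB device has
# drained to its level, and the 1.5GB device (index 2) goes last.
```

After these seven chunks all three devices sit at the same free space, which is the equilibrium the real allocator also converges to.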

[1] Most definitely not web-scale workload.

[2] I don't know if a unmount/remount cycle would have also worked. The wedged operation happened on my rootfs, so I had no way to test this.

[3] Except for the unlink/move deadlock in 4.1.


Hi, sorry for my tone - it went beyond rational discussion - however, I still think it's not totally coming from nowhere:

[RAID]

>sigh From the message that you linked:

The part below your quote is interesting.

> General note about btrfs and btrfs raid. Given that btrfs itself remains a "stabilizing, but not yet fully mature and stable filesystem", while btrfs raid will often let you recover from a bad device, sometimes that recovery is in the form of letting you mount ro, so you can access the data and copy it elsewhere, before blowing away the filesystem and starting over.

I have hotplug and online-replace with ZFS and mdraid. btrfs wants a special degraded remount that might or might not be able to recover to rw someday... replacing a disk online is not really possible. Handling a failing drive is not really implemented.

Don't call it stable, or don't call it RAID. It's nice that I'm able to pull the data off my ro-mounted degraded array when a disk fails... however, all of this stuff is not really documented and was very surprising to me. If you have a production machine on, e.g., btrfs RAID10, it sucks not to be able to replace a broken drive on the fly, online.

ZFS/mdraid just do it..

[RAID II]

> * The read scheduling algo is currently naive, using PID evenness to determine the disk in a pair to read, rather than current IO load on the disk.

> Did I miss something, or misunderstand something? If I didn't miss anything, I don't see how you can get "RAID is utterly broken" from "RAID has less-than-optimal write and read scheduling".

Walking is a less than optimal way to move from Moscow to Berlin. If I take a train I expect a train - btrfs is a train that occasionally stops and says: well, just walk... okay, you see I'm kind of emotionally invested in this shit. However, it's not what's usually expected of RAID, and it's not documented.

We have a lot of applications here that do exactly that: a single PID reads lots of stuff, very often sequentially... so RAID1 doubles throughput. Well, not on btrfs.

[ENOSPC]

> I don't run into those, and haven't for literal years.

4x4TB RAID10 (90% free) -> ENOSPC - this was somewhere between 3.16 and 3.19 - how can this even happen?

[Corruption]

> I haven't run into any, ever on the handful of systems on which I use BTRFS.

Good for you. Was hit by this nasty undeletable directory bug: http://www.spinics.net/lists/linux-btrfs/msg35789.html

And then this: http://www.spinics.net/lists/linux-btrfs/msg45108.html

Unmounting and running btrfs check --repair is not really cool if you want some uptime...

[Deadlocks]

Currently btrfs deadlocks reliably in an OOM situation... With 4.2 things got better; before that, running some medium-heavy Hadoop job or Elasticsearch indexing killed 20% to 50% of the machines due to deadlocks (with 3.19, IIRC).

> Look. I'm not saying that btrfs is feature-complete, suitable for every workload, or that any given person should use it for their workload. I'm just saying that the situation is far less dire than you're making it out to be. I've been using it as the rootfs for at least one of my daily use systems for the past 5.5 years, and haven't had any trouble out of it [3] in the past 2+ years.

It's probably fine for personal use. But don't think about using it for something that is demanding. You can google the presentations that say: Use it! It's stable! Enterprise! It's not.

Maybe in 5 years. I just saw it breaking in too many strange ways to bother anymore with it.


> Hi, sorry for my tone - it's beyond rational discussion...

No worries. I don't give a shit about tone, just about content. Tone is next to impossible to accurately judge through text anyway, so I don't presume to understand someone's intent unless I have a pile of evidence. :-)

> ...btrfs wants a special degraded remount that might or might not be able to recover to rw someday... replacing a disk online is not really possible.

But, from the email you quoted:

"...while btrfs raid will often let you recover from a bad device, sometimes that recovery is in the form of letting you mount ro, so you can access the data and copy it elsewhere, before blowing away the filesystem and starting over."

And [0] indicates that you can live-replace a failed drive in a btrfs RAID array. (The man page makes it seem that you might want to also pass the -r flag to the replace operation.)

I've never had the money around to amass the disks required to run my btrfs array in RAID1 data mode, so I can't test the claim of that Stack Overflow post, but I do use it in single data, RAID1 metadata and system mode. I've had a drive in that array suddenly drop out due to a firewire controller hiccup, and I suffered no data loss. (The volume did get forced into ro mode, so future writes were lost, but all applications knew that those writes were lost.)

Additionally, I've been able to live-add and live-remove drives from my array with no hassle (other than the expected speed degradation due to shuffling the data around) whatsoever.

> ...so RAID1 doubles [read] throughput. Well not on btrfs.

Maybe I'm out of the loop, but I've never heard anyone say "use RAID1 to double your read throughput". That sounds like a very nice thing to have, but far from essential. Its absence certainly doesn't make btrfs's RAID1 implementation broken. :)

> Umounting and running btrfs --check repair is not really cool if you want some uptime...

Agreed, very much so.

One thing bothers me about that guy who was reporting the bug on the 4.1 kernel. He claimed that his mount options were:

    /dev/sdb /media/storage2 btrfs ro,noatime,compress=lzo,space_cache 0 0
but the issue he was reporting smelled like it could only happen on a rw volume. Wonder what was up with that.

It's also interesting that no one replied to his messages. Both linux-btrfs and #btrfs have always been quite responsive when I've sent messages.

> 4x4TB RAID10 (90% free) -> ENOSPC - this was somehwere between 3.16 or 3.19 - how can this even happen?

Dunno. It looks like the Global Reserve got introduced in ~3.16, so it's not likely related to that. Maybe you hit a hard-link limit or had really, really large xattrs attached to files or something? (I do know that btrfs has (had?) a surprisingly low limit on the number of hard links supported in a volume, but I have no idea how hitting that limit presents itself.)

Do you remember if you let the system idle for a few minutes and tried your operation again? (Not that that's a reliable [or even reasonable] solution, mind, but does provide a useful piece of diagnostic info.)

In the distant past, running a balance on empty chunks (in order to free them and free up otherwise unused metadata space) was one recommended way to work around mysterious ENOSPC issues. This isn't supposed to be required anymore, but it's a last-ditch cargo-cult thing that can't do anything worse than waste one's time.

> Good for you. Was hit by this nasty undeletable directory bug...

That seriously sucks. However, I don't call that data corruption. Inconvenient, bordering on unacceptable? Yes. Data corruption? Nah. Your data is still present and correct, but you can't get rid of some directories. [1]

> Currently btrfs deadlocks reliable in a OOM situation... With 4.2 things got better...

Is it any OOM situation, or just particular ones? I know that I've had google-chrome + Hangouts run away and nom all my memory on multiple occasions, and had my btrfs volumes function just fine after the OOM killer had its way with Chrome.

Also, do you have a handle on roughly how much better it has become in 4.2, and is this problem something that the btrfs folks are aware of?

> [btrfs is] probably fine for personal use. But don't think about using it for something that is demanding.

That's a far softer statement than "btrfs is a shitshow ... I will refrain from using it ever again if I have a choice.". :)

Anyway, I do REALLY hope I soon get the budget to set up a btrfs testbed. I'd really like to see (and help diagnose) all these failures that people keep talking about... in a carefully controlled environment, with data that I don't care about -naturally-. :)

[0] http://superuser.com/questions/685364/can-a-failed-btrfs-dri...

[1] And -for the record- I saw an eerily similar thing happen every other month at a Windows-only programming shop I used to work for. One or more files and/or directories on the NTFS-backed Windows 2003 file server would inevitably become undeletable. [2] The sysadmin would have to take the whole thing down for a half hour to smear chicken guts in the right places to get rid of the offending files/directories. Despite these regular issues, we never decided to stop using NTFS. ;)

[2] And no, no one who was using the file server had admin access. Unprivileged user access was all that was needed to create these undeletable files and directories.


> And [0] indicates that you can live-replace a failed drive in a btrfs RAID array. (The man page makes it seem that you might want to also pass the -r flag to the replace operation.)

Good catch! I have to test on a rented server I'm running anyway in the next few days. Adding an additional drive, running replace, and removing the old drive seems to work; I'm about to find out whether deleting a device, rebooting, and then trying to replace the now-missing device also works. I guess it's likely I'll have to fumble with the provider's rescue system (so no control over the kernel version :/)

I thought that the -o degraded dance was still necessary; looks like I was wrong. Let's see how the disk replacement on that server turns out...
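For reference, the sequence I'm planning to test looks roughly like this (device names and mountpoints are examples; per the man page, -r only reads from the source device when no other good mirror exists):

```shell
# Live-replace a failing disk while the filesystem stays mounted:
btrfs replace start -r /dev/sdb /dev/sdc /mnt/storage
btrfs replace status /mnt/storage

# If the old disk has already vanished, mount degraded and point
# replace at the missing device's numeric devid instead of a node:
mount -o degraded /dev/sdc /mnt/storage
btrfs filesystem show /mnt/storage   # find the devid of the missing disk
btrfs replace start 2 /dev/sdd /mnt/storage
```

The devid form is the part I most want to verify, since that's exactly the "disk died and is gone" case a RAID array exists for.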

> One thing bothers me about that guy who was reporting the bug on the 4.1 kernel. He claimed that his mount options were: /dev/sdb /media/storage2 btrfs ro,noatime,compress=lzo,space_cache 0 0 but the issue he was reporting smelled like it could only happen on a rw volume. Wonder what was up with that.

That guy was me :)

Basically the disk got remounted read-only after the failure, and I just grepped /proc/mounts for the fs before posting (not very wise). This was a bug that was fixed in 4.2 (I haven't seen it again, at least), and btrfs check --repair fixed it.
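A sketch of how to catch that kind of silent read-only flip next time (the mountpoint and device are examples; the grep pattern matches the usual kernel message, though the exact wording can vary by version):

```shell
# /proc/mounts and findmnt show the *current* flags, which may have
# silently flipped to ro after an error -- dmesg says why:
findmnt -no OPTIONS /media/storage2
dmesg | grep -i 'forced readonly'

# Repair only with the filesystem unmounted, and run a read-only
# check first -- btrfs check --repair can make things worse:
umount /media/storage2
btrfs check /dev/sdb
btrfs check --repair /dev/sdb
```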

> It's also interesting that noone replied to his messages. Both linux-btrfs and #btrfs have always been quite responsive when I've sent messages.

I had no time to hang around on IRC at the time. Yes, usually it's very nice there. I also want to say I'm not ranting against any of the people involved; I'm just unhappy with the current state of the technology.

[ENOSPC] > Dunno. It looks like the Global Reserve got introduced in ~3.16, so it's not likely related to that. Maybe you hit a hard-link limit or had really, really large xattrs attached to files or something? (I do know that btrfs has (had?) a surprisingly low limit on the number of hard links supported in a volume, but I have no idea how hitting that limit presents itself.)

Actually I don't know. The filesystem in question ran for a while on the 3.13 kernel and was later upgraded to something less old; maybe that's part of the problem. The ENOSPC happened while a misconfigured Hadoop NameNode wrote tons of data to the disk, 4x over, without checkpointing, and when the checkpointing server was restarted it all crashed. It's now running 4.2 and fine so far...

> In the distant past, running a balance on empty chunks (in order to free them and free up otherwise unused metadata space) was one recommended way to work around mysterious ENOSPC issues. This isn't supposed to be required anymore, but it's a last-ditch cargo-cult thing that can't do anything worse than waste one's time.

Yeah, I did that balance dance and it somewhat worked.

> Is it any OOM situation, or just particular ones? I know that I've had google-chrome + Hangouts run away and nom all my memory on multiple occasions, and had my btrfs volumes function just fine after the OOM killer had its way with Chrome.

I never got around to nailing this one down (I'd love to be able to file a bug on it), but at the time I wasn't really able to do much on the machines. It's a NUMA system, and basically ElasticSearch and Hadoop allocated far more than the 64GB of memory; OOM kicked in, and _after a while_ the logs were full of hung tasks related to btrfs. At the time I found something on the mailing list regarding allocations and OOM, but I lack the intimate knowledge to be sure it's related.

I've screamed at everyone to avoid OOM situations, so I'm not sure if it's still a problem :)

> Also, do you have a handle on roughly how much better it has become in 4.2, and is this problem something that the btrfs folks are aware of?

So far I haven't seen it again, and I didn't have enough data to file a bug or investigate further. It was something lock-related, IIRC.

> Anyway, I do REALLY hope I soon get the budget to set up a btrfs testbed. I'd really like to see (and help diagnose) all these failures that people keep talking about... in a carefully controlled environment, with data that I don't care about -naturally-. :)

Good luck! You can try running Hadoop and some heavy jobs on it: lots of threads reading and writing lots of data. There are benchmarks like https://github.com/intel-hadoop/HiBench to stress the system. Should also work on a single machine.
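If setting up a full Hadoop stack is too heavy for a first pass, something like fio can approximate the same "many threads, mixed read/write, lots of data" pattern against a btrfs test volume (the path and parameters below are just a starting point, not a tuned workload):

```shell
# Hammer a btrfs mount with 16 concurrent mixed random read/write
# jobs for 10 minutes; adjust --size/--numjobs to taste:
fio --name=btrfs-stress --directory=/mnt/btrfs-test \
    --rw=randrw --bs=64k --size=4G --numjobs=16 \
    --time_based --runtime=600 --group_reporting
```

It won't reproduce Hadoop's metadata patterns (huge append-only files, frequent renames), but it's a quick way to smoke-test a kernel/btrfs combination under concurrent I/O pressure.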

> [1] And -for the record- I saw an eerily similar thing happen every other month at a Windows-only programming shop I used to work for. One or more files and/or directories on the NTFS-backed Windows 2003 file server would inevitably become undeletable. [2] The sysadmin would have to take the whole thing down for a half hour to smear chicken guts in the right places to get rid of the offending files/directories. Despite these regular issues, we never decided to stop using NTFS. ;)

Yes, there is worse stuff out there. I was using ZFS before with lots of really fucked up disks, in a kind of "don't do this, you are stupid!" setup, and ZFS just did its job and never complained: same Hadoop workload, plus disks with thousands of bad sectors. Maybe it's unfair to compare btrfs to that, but it was a ride full of disillusionment to find out how little btrfs compares to ZFS.

Maybe it's getting there and I just came in at a bad time... I'm not so sure, though.


Why ReiserFS? It's pretty much a defunct filesystem at this point. It's not really in the same class as ZFS and Btrfs, if you want a third option maybe look at HammerFS? Although I think pretty much everyone is waiting for Hammer2.

There was a talk "Why OpenBSD sucks" by Henning Brauer, where he complains that ZFS is a "kitchen sink" approach to a filesystem. He doesn't really explain further, but it would be interesting to hear someone elaborate that criticism.


Also see Ted Unangst's post about ZFS, where he lists some criticisms[0] that somewhat expand on Brauer's, and the episode of BSD Now where he was interviewed[1].

Keep in mind that their criticisms are mostly about it not being a good choice for OpenBSD, although some of them apply across the board.

[0] http://www.tedunangst.com/flak/post/ZFS-on-OpenBSD

[1] http://www.bsdnow.tv/episodes/2014_02_05-time_signatures


Though ReiserFS is not actively developed, if I'm not wrong some Btrfs concepts (like tail-packing?) derive from ReiserFS.

Thanks for the info about the Henning Brauer talk, I'll check it out.


Far more interesting is the question: what error in the filesystem led to a situation where you lost data in production?

It is very easy to crank out claims of features while not having really implemented said feature in a correct way. This leads to bugs which are hard to find and almost impossible to eradicate.


Can we not post videos?

It's really hard to skim the material and decide if I want to read more in depth.


I actually came here to say that it was nice to have links to interesting stuff on YouTube. I agree it can be jarring when you're trying to read stuff and get hit with a video; maybe an HN YouTube channel would be cool or something.


It's only 20 minutes, and the playback can be sped up.



