
The Birth of ZFS [video] - bcantrill
https://www.youtube.com/watch?v=dcV2PaMTAJ4
======
rsync
rsync.net supports ZFS send and receive to their cloud storage platform:

[http://www.rsync.net/products/zfsintro.html](http://www.rsync.net/products/zfsintro.html)
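For reference, the basic send/recv workflow looks roughly like this (dataset, snapshot, and host names are hypothetical; rsync.net's docs have the exact details for your account):

```shell
# Take a snapshot and send the full stream to the remote pool over ssh
# (names below are made up for illustration)
zfs snapshot tank/data@2016-01-01
zfs send tank/data@2016-01-01 | ssh user@usw-s001.rsync.net zfs recv data/backup

# Later, send only the delta between two snapshots (incremental send)
zfs snapshot tank/data@2016-01-08
zfs send -i tank/data@2016-01-01 tank/data@2016-01-08 | \
    ssh user@usw-s001.rsync.net zfs recv data/backup
```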

~~~
bcantrill
How have you dealt with the issue of byzantine streams? This is something that
we have wanted to solve for a while, but it's a really nasty issue -- and at
the recent OpenZFS Developers Summit we concluded that it wasn't practical.
Instead, we're looking to tack more towards signed ZFS streams[1] -- though
that work is only in the design stage. How has rsync.net solved this?

[1]
[https://github.com/joyent/rfd/blob/master/rfd/0014/README.md](https://github.com/joyent/rfd/blob/master/rfd/0014/README.md)

~~~
rsync
I _think_ I know what you're referring to, but we've never called them
"byzantine streams".

Are you talking about snapshots that have been broken by extended-attribute
bugs, or "disappearing" snapshots... and then what happens when you send/recv
them?

Actually, either way, would you email info@rsync.net so we can chat over
email?

~~~
bcantrill
I'm talking about a ZFS stream that has been deliberately corrupted and then
sent to you to be received as a filesystem -- when it will in fact cause ZFS
to panic (or worse). So I mean "byzantine" in the "characterized by
deviousness" sense: someone actively trying to do harm to you. Or perhaps I
have misunderstood the functionality that you offer?

~~~
rsync
Ok, that is not what I was thinking of. We have made no specific provisions
for this.

 _However_, we were scared enough about "unknown unknowns" that we set up
this aspect of our service such that every ZFS send/recv customer gets their
own zpool inside their own bhyve.

So it might be a DoS attack, but I don't think there is a data attack against
rsync.net here. Comments?

Unrelated: I'm curious - when I speak of snapshots corrupted by extended
attributes or "disappearing snapshots", do you know what I mean? Maybe we
should talk :)

~~~
bcantrill
I don't know enough about bhyve or your implementation to be able to comment,
but if you assume that any system that has ever received a ZFS stream from a
user is entirely compromised, then perhaps you can work around it that way.
I'm not aware of a disappearing snapshot problem; has there been something
sent to the ZFS list about this? Would have been a good topic of discussion
for the OpenZFS Developers Summit a few weeks ago... ;)

~~~
rsync
Yes, that's exactly the model - these users have their own zpool inside their
own VM.

------
shmerl
So, are Oracle still developing their ZFS separately, or have they switched to
OpenZFS?

~~~
jpgvm
Separately. However, their changes and OpenZFS's changes are now effectively
incompatible.

Funnily enough, OpenZFS is now the more advanced filesystem, minus the
clustering features Oracle integrated (shared-disk clustering, that is, not
shared-nothing).

~~~
mphalan
I've not followed closely, but do you have any references for that ("the more
advanced filesystem")? From what I can see, Oracle's ZFS still has some
features that OpenZFS lacks (on-disk encryption support being one of the most
obvious) and there's been lots of development in the last few years.
[https://blogs.oracle.com/zfs/entry/welcome_to_oracle_solaris...](https://blogs.oracle.com/zfs/entry/welcome_to_oracle_solaris_11)

Disclaimer: I work for Oracle on Solaris (but not ZFS).

~~~
technion
>on disk encryption

Given that you could just run LUKS on top of (open)ZFS, it's probably the
better security position to run an audited, established encryption product
than to count a layer bolted onto ZFS as a feature.
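For the curious, "LUKS on top of ZFS" would mean carving a zvol out of the pool and running dm-crypt/LUKS inside it - roughly like this sketch (pool, volume, and mapping names are hypothetical):

```shell
# Create a 100G zvol to act as the backing block device (names hypothetical)
zfs create -V 100G tank/cryptvol

# Format the zvol as a LUKS container, then open (decrypt) it
cryptsetup luksFormat /dev/zvol/tank/cryptvol
cryptsetup open /dev/zvol/tank/cryptvol securedata

# Put an ordinary filesystem on the decrypted mapping and mount it
mkfs.ext4 /dev/mapper/securedata
mount /dev/mapper/securedata /mnt/secure
```

One side effect worth noting: the zvol's blocks (i.e. the ciphertext) are checksummed by ZFS like any other data, so a pool scrub should still work even while the LUKS container is locked.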

~~~
belovedeagle
There are major benefits to be had from moving the encryption layer on top of
the volume management/storage EDAC layers which ZFS provides: in particular,
it'd be nice to be able to scrub a locked dataset. I think (but haven't seen
this firsthand) that Oracle's implementation offers that benefit.

~~~
olavgg
You can still scrub a dataset with GELI
([https://www.freebsd.org/cgi/man.cgi?geli%288%29](https://www.freebsd.org/cgi/man.cgi?geli%288%29))

GELI creates a virtual block device that works great with ZFS, you get it all,
self-healing, checksumming etc.

~~~
belovedeagle
It looks like GELI goes below zfs just like dm-crypt. That doesn't allow you
to scrub a locked dataset (one where the key is not in memory, possibly
unknown). You could use zvols, of course, but that loses some (but not all) of
the full-stack zfs benefits.

~~~
olavgg
No, you create virtual block devices that need to be "unlocked" before you
can start using ZFS. If you need to scrub the dataset, you unlock all the
virtual block devices, start ZFS, and then run the scrub command.

The whole point is to run ZFS on top of virtual block devices which are
encrypted with GELI and "unlocked".
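For concreteness, the GELI-under-ZFS setup being described looks roughly like this on FreeBSD (device and pool names are hypothetical):

```shell
# Initialize and attach GELI on each underlying disk (device names hypothetical)
geli init -s 4096 /dev/ada1      # prompts for a passphrase
geli attach /dev/ada1            # creates the decrypted /dev/ada1.eli

geli init -s 4096 /dev/ada2
geli attach /dev/ada2

# Build the pool on the unlocked .eli providers; ZFS just sees block devices
zpool create tank mirror /dev/ada1.eli /dev/ada2.eli

# Scrub works as usual - but only while the GELI devices are attached
zpool scrub tank
zpool status tank
```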

------
acd
Using ZFS in production virtualization is fast and provides good integrity.
You can basically get some of the hyped features of Docker (the copy-on-write
filesystem part) with ordinary Linux virtualization.

~~~
zf00002
I've been using it at home via FreeNAS after losing a bunch of old files to
bitrot.

~~~
cpach
FreeNAS looks interesting, but I found the hardware requirements a bit
prohibitive for home use. Renting storage in the "cloud" feels more cost-
effective.

~~~
gh02t
The hardware requirements are sometimes exaggerated. 8GB of (non-ECC) memory
on a cheap processor is fine for a home user with FreeNAS.

Whether or not it's cheaper depends on your use. Comparing to something like
Glacier, FreeNAS is definitely cheaper if you need to store more than a couple
hundred GB and/or access it frequently.

~~~
Laforet
You are quite right that 8GB of memory and a cheap processor are good enough
for home use. However, I thought the consensus was that one cannot skip ECC
with ZFS because a silent bit flip can corrupt an entire zpool. In any case,
the extra cost is minimal, so why not?

~~~
gh02t
Running without ECC is no worse than doing so with any other filesystem [as
far as I understand].

The thing about corrupting the entire zpool looks to be a bit of an urban
legend ([http://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/](http://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-your-data/)).
Flipping bits can of course cause problems, but doing a scrub after a bit is
flipped in memory doesn't hose your whole pool. There's been endless
discussion and flame wars on the topic, but I've never once heard from someone
who actually had it happen (because it can't really happen). You _can_ lose
some data due to bit flips; you _won't_, however, lose everything in the whole
volume like some people claim (unless you have something like an entire chip
on the memory fail, and then you're definitely screwed, but neither ECC nor a
different FS will help you there).

ECC is a _good idea_ and it's not hugely expensive, but there is one big
practical consideration: most people who might want to build a NAS have a
decent bit of spare hardware lying around that they can repurpose. That
hardware is probably an old desktop machine and doesn't have ECC, so if you
insist on ECC then you can't use that old machine and have to buy a new
mobo+RAM at least. I've been running ZFS for years without ECC, doing weekly
scrubs, and never once destroyed anything.

Edit: the counter-argument/source of the theory that you absolutely need ECC
is primarily this post: [https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/](https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/),
if you want to read the other side for yourself.

~~~
Laforet
Thanks for the response. I think JRS's account has some serious issues, with
the writer conflating hardware failure (stuck bits, unreadable blocks, etc.)
with the random errors that are more common and inevitable, not to mention
undetectable in principle without ECC[0].

Bit flips have been shown to cause both data loss/corruption and filesystem
failure in ZFS[1]. The scenario presented in the paper is probably less
relevant for a home lab, where most data is accessed infrequently and the real
chance of catastrophic failure is probably below 0.1% per bit-flip event, but
the risk is real and will only get higher as storage densities go up.

People on the FreeNAS forum tend to err on the extreme side of caution: after
all, if one is willing to cut corners on hardware, other unsafe practices will
probably follow, hence it is better to make it clear that certain acts are not
worth the risk.

[0]: DRAM Errors in the Wild: A Large-Scale Field Study
[http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf](http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf)

[1]: End-to-end Data Integrity for File Systems: A ZFS Case Study
[http://research.cs.wisc.edu/adsl/Publications/zfs-
corruption...](http://research.cs.wisc.edu/adsl/Publications/zfs-corruption-
fast10.pdf)

------
giis
It would be interesting to see a comparison between ZFS, Btrfs, and ReiserFS
in terms of features and development effort. Any ideas?

~~~
nisa
ReiserFS is dead as far as I've read.

btrfs is a shitshow - sorry to have to resort to such words, but I will
refrain from using it ever again if I have a choice.

If you look at the btrfs wiki there are a lot of features, but most of them
are horribly broken or implemented in a way that's surprising.

It's full of bugs and not-yet-implemented stuff - some stuff that comes to
mind:

* RAID is utterly broken: [http://www.spinics.net/lists/linux-btrfs/msg48561.html](http://www.spinics.net/lists/linux-btrfs/msg48561.html), [http://www.spinics.net/lists/linux-btrfs/msg47845.html](http://www.spinics.net/lists/linux-btrfs/msg47845.html)

* Lots of ENOSPC bugs (just look at the mailing list)

* You need to run the latest kernel (don't try anything below 4.2) - if not: data corruption bugs, deadlocks

* Quota is utterly broken and continues to break (just look at the mailing list)

* It's useless for databases or VM images (disabling datacow is an ugly hack IMHO) - e.g. see here: [http://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp](http://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp)

ZFS is not perfect either - especially on Linux it's lacking tight integration
into the VM subsystem, and at least a few versions ago it was prone to
deadlocking, too. However, it doesn't eat data, and compared to btrfs it's
rock solid.

~~~
simoncion
> * RAID is utterly broken: - [http://www.spinics.net/lists/linux-
> btrfs/msg48561.html](http://www.spinics.net/lists/linux-btrfs/msg48561.html)

 _sigh_ From the message that you linked:

"FWIW... Older btrfs userspace such as your v3.17 is "OK" for normal runtime
use, assuming you don't need any newer features, as in normal runtime, it's
the kernel code doing the real work and userspace for the most part simply
makes the appropriate kernel calls to do that work.

But, once you get into a recovery situation like the one you're in now,
current userspace becomes much more important, as the various things you'll do
to attempt recovery rely far more on userspace code directly accessing the
filesystem, and it's only the newest userspace code that has the latest
fixes."

If you look, you see that the guy reporting the problem was using a _very_ new
kernel, but a _one-year-old_ btrfs-tools package. It makes no sense to use
tools that are _substantially_ older than the kernel you're running. He
_might_ have had much better luck with recovery (that is, he might have been
able to get a rw volume with all his data, rather than an ro volume - _with
all his data_ - out of the process) with a recent tools package.

> [http://www.spinics.net/lists/linux-
> btrfs/msg47845.html](http://www.spinics.net/lists/linux-btrfs/msg47845.html)

Maybe I'm reading _this_ one wrong, but it looks like the summary of the
issues is:

* For multi-disk arrays with disks of varying sizes, each smaller disk does not receive data until all larger disks have as much or less free space than the smaller disk in question. [0]

* The read scheduling algo is currently naive, using PID evenness to determine the disk in a pair to read, rather than current IO load on the disk.

Did I miss something, or misunderstand something? If I didn't miss anything, I
don't see how you can get "RAID is utterly broken" from "RAID has less-than-
optimal write and read scheduling".

> * It's useless for databases or VM images (disabling datacow is an ugly hack
> IMHO)

I can't agree. I use btrfs on spinning rust to store my multi-terabyte
Postgres 9.4 DB. It works well enough for my low-to-medium-volume [1]
workload.

> * Lots of ENOSPC bugs (just look at the mailing list)

I don't run into those, and haven't for literal _years_.

> * Quota is utterly broken...

Yes, quota is a difficult problem that continues to be unsolved.

> * You need to run the latest kernel...

Yes. Agreed.

> ...if not: Data corruption bugs ...

I haven't run into any, _ever_ on the handful of systems on which I use BTRFS.

> ...if not: ... deadlocks

True. There _was_ that unlink/move deadlock in 4.1. I hit it from time to time
when updating my copy of the metadata from the Gentoo Portage tree. Happily,
it only blocked operation on the affected file. The remainder of the system
kept running along, and a -ugh- reboot unblocked the operation. [2]

Look. I'm not saying that btrfs is feature-complete, suitable for _every_
workload, or that any given person _should_ use it for _their_ workload. I'm
just saying that the situation is _far_ less dire than you're making it out to
be. I've been using it as the rootfs for at _least_ one of my daily use
systems for the past 5.5 years, and haven't had _any_ trouble out of it [3] in
the past 2+ years.

[0] That phrasing was abysmal. Here's an example: given an array with one each
of 3GB, 2GB, and 1.5GB devices, the 2GB device will get data after 1GB has
been written to the 3GB device. The 1.5GB device will get data after 0.5GB of
additional data has been written to each of the 3GB and 2GB devices (making a
total of 1GB of additional data written to the array before the 1.5GB device
starts getting data). AIUI, if you're running in RAID1 mode, the same logic
applies, but with paired devices.

[1] Most definitely _not_ web-scale workload.

[2] I don't know if an unmount/remount cycle would have _also_ worked. The
wedged operation happened on my rootfs, so I had no way to test this.

[3] Except for the unlink/move deadlock in 4.1.

~~~
nisa
Hi, sorry for my tone - it was beyond rational discussion - however I still
think it's not coming totally out of nowhere:

[RAID]

>sigh From the message that you linked:

The part below your quote is interesting.

> General note about btrfs and btrfs raid. Given that btrfs itself remains a
> "stabilizing, but not yet fully mature and stable filesystem", while btrfs
> raid will often let you recover from a bad device, sometimes that recovery
> is in the form of letting you mount ro, so you can access the data and copy
> it elsewhere, before blowing away the filesystem and starting over.

I have hotplug and online replace with ZFS and mdraid. btrfs wants a special
degraded remount that might or might not be able to recover to rw someday...
replacing a disk online is not really possible. Handling a failing drive is
not really implemented.

Don't call it stable, or don't call it RAID. It's nice that I'm able to pull
the data off my ro-mounted degraded array when a disk fails... however, all of
this stuff is not really documented and was very surprising to me. If you have
a production machine, e.g. on btrfs RAID10, it sucks not to be able to replace
a broken drive on the fly, online.

ZFS/mdraid just do it..

[RAID II]

> * The read scheduling algo is currently naive, using PID evenness to
> determine the disk in a pair to read, rather than current IO load on the
> disk.

> Did I miss something, or misunderstand something? If I didn't miss anything,
> I don't see how you can get "RAID is utterly broken" from "RAID has less-
> than-optimal write and read scheduling".

Walking is a less-than-optimal way to move from Moscow to Berlin. If I take a
train, I expect a train - btrfs is a train that occasionally stops and says:
well, just walk... okay, you see I'm kind of emotionally invested in this
shit. However, it's not what's usually expected of RAID, and it's not
documented.

We have here a lot of applications that do exactly that: a single PID reads
lots of stuff, very often sequentially... so RAID1 doubles throughput. Well,
not on btrfs.

[ENOSPC]

> I don't run into those, and haven't for literal years.

4x4TB RAID10 (90% free) -> ENOSPC - this was somewhere between 3.16 and 3.19 -
how can this even happen?

[Corruption]

> I haven't run into any, ever on the handful of systems on which I use BTRFS.

Good for you. Was hit by this nasty undeletable directory bug:
[http://www.spinics.net/lists/linux-
btrfs/msg35789.html](http://www.spinics.net/lists/linux-btrfs/msg35789.html)

And then this: [http://www.spinics.net/lists/linux-
btrfs/msg45108.html](http://www.spinics.net/lists/linux-btrfs/msg45108.html)

Unmounting and running btrfs check --repair is not really cool if you want
some uptime...

[Deadlocks]

Currently btrfs deadlocks reliably in an OOM situation... With 4.2 things got
better; before that, running some medium-heavy Hadoop job or ElasticSearch
indexing killed 20% to 50% of the machines due to deadlocks (with 3.19, IIRC).

> Look. I'm not saying that btrfs is feature-complete, suitable for every
> workload, or that any given person should use it for their workload. I'm
> just saying that the situation is far less dire than you're making it out to
> be. I've been using it as the rootfs for at least one of my daily use
> systems for the past 5.5 years, and haven't had any trouble out of it [3] in
> the past 2+ years.

It's probably fine for personal use. But don't think about using it for
something that is demanding. You can google the presentations that say: Use
it! It's stable! Enterprise! It's not.

Maybe in 5 years. I just saw it breaking in too many strange ways to bother
anymore with it.

~~~
simoncion
> Hi, sorry for my tone - it's beyond rational discussion...

No worries. I don't give a shit about tone, just about content. Tone is next
to impossible to accurately judge through text anyway, so I don't presume to
understand someone's intent unless I have a pile of evidence. :-)

> ...btrfs wants a special degraded remount that might or might not be able to
> recover to rw someday... replacing a disk online is not really possible.

But, from the email you quoted:

"...while btrfs raid will often let you recover from a bad device, _sometimes_
that recovery is in the form of letting you mount ro, so you can access the
data and copy it elsewhere, before blowing away the filesystem and starting
over."

And [0] indicates that you _can_ live-replace a failed drive in a btrfs RAID
array. (The man page makes it seem that you might want to also pass the -r
flag to the replace operation.)
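For reference, the live-replace invocation looks roughly like this (device paths and mount point are hypothetical):

```shell
# Replace a failing device in a mounted btrfs array, live.
# With -r, data is read from the other good copies where possible,
# rather than from the (possibly failing) source device.
btrfs replace start -r /dev/sdb /dev/sdd /mnt/array

# Check on the progress of the rebuild
btrfs replace status /mnt/array
```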

I've never had the money around to amass the disks required to run my btrfs
array in RAID1 data mode, so I can't test the claim of that Stack Overflow
post, but I _do_ use it in _single_ data, _RAID1_ metadata and system mode.
I've had a drive in that array suddenly drop out due to a firewire controller
hiccup, and I suffered _no_ data loss. (The volume did get forced into ro
mode, so future writes _were_ lost, but all applications _knew_ that those
writes were lost.)

Additionally, I've been able to live-add and live-remove drives from my array
with _no_ hassle (other than the expected speed degradation due to shuffling
the data around) whatsoever.

> ...so RAID1 doubles [read] throughput. Well not on btrfs.

Maybe I'm out of the loop, but I've _never_ heard anyone say "use RAID1 to
double your read throughput". That sounds like a _very_ nice thing to have,
but far from essential. Its absence _certainly_ doesn't make btrfs's RAID1
implementation _broken_. :)

> Umounting and running btrfs --check repair is not really cool if you want
> some uptime...

Agreed, _very_ much so.

One thing bothers me about that guy who was reporting the bug on the 4.1
kernel. He claimed that his mount options were:

    /dev/sdb /media/storage2 btrfs ro,noatime,compress=lzo,space_cache 0 0

but the issue he was reporting _smelled_ like it could only happen on a rw
volume. Wonder what was up with that.

It's _also_ interesting that no one replied to his messages. Both linux-btrfs
and #btrfs have always been quite responsive when I've sent messages.

> 4x4TB RAID10 (90% free) -> ENOSPC - this was somewhere between 3.16 and 3.19
> - how can this even happen?

Dunno. It looks like the Global Reserve got introduced in ~3.16, so it's not
likely related to that. Maybe you hit a hard-link limit or had _really_,
_really_ large xattrs attached to files or something? (I _do_ know that btrfs
has (had?) a _surprisingly_ low limit on the number of hard links supported in
a volume, but I have _no_ idea how hitting that limit presents itself.)

Do you remember if you let the system idle for a few minutes and tried your
operation again? (Not that that's a reliable [or even reasonable] solution,
mind, but does provide a useful piece of diagnostic info.)

In the _distant_ past, running a balance on empty chunks (in order to free
them and free up otherwise unused metadata space) was one recommended way to
work around mysterious ENOSPC issues. This isn't _supposed_ to be required
anymore, but it's a last-ditch cargo-cult thing that can't do anything worse
than waste one's time.
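For anyone hitting this, that balance workaround is usually run with usage filters so only empty or nearly-empty chunks get rewritten (mount point hypothetical):

```shell
# Reclaim completely empty data and metadata chunks - cheap, often enough
btrfs balance start -dusage=0 -musage=0 /mnt/array

# If ENOSPC persists, rewrite mostly-empty chunks too
btrfs balance start -dusage=10 -musage=10 /mnt/array

# Check whether a balance is still running
btrfs balance status /mnt/array
```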

> Good for you. Was hit by this nasty undeletable directory bug...

That _seriously_ sucks. However, I don't call that data corruption.
Inconvenient, bordering on unacceptable? Yes. Data corruption? Nah. Your
_data_ is still present and _correct_, but you can't get rid of some
directories. [1]

> Currently btrfs deadlocks reliable in a OOM situation... With 4.2 things got
> better...

Is it _any_ OOM situation, or just _particular_ ones? I know that I've had
google-chrome + Hangouts run away and nom all my memory on _multiple_
occasions, and had my btrfs volumes function just fine after the OOM killer
had its way with Chrome.

Also, do you have a handle on _roughly_ how much better it has become in 4.2,
and is this problem something that the btrfs folks are aware of?

> [btrfs is] probably fine for personal use. But don't think about using it
> for something that is demanding.

That's a _far_ softer statement than "btrfs is a shitshow ... I will refrain
from using it ever again if I have a choice.". :)

Anyway, I _do_ _REALLY_ hope I soon get the budget to set up a btrfs testbed.
I'd _really_ like to see (and help diagnose) all these failures that people
keep talking about... in a carefully controlled environment, with data that I
don't care about -naturally-. :)

[0] [http://superuser.com/questions/685364/can-a-failed-btrfs-
dri...](http://superuser.com/questions/685364/can-a-failed-btrfs-drive-in-
raid-1-be-replaced-live)

[1] And -for the record- I saw an _eerily_ similar thing happen _every other
month_ at a Windows-only programming shop I used to work for. One or more
files and/or directories on the NTFS-backed Windows 2003 file server would
_inevitably_ become undeletable. [2] The sysadmin would have to take the whole
thing down for a half hour to smear chicken guts in the right places to get
rid of the offending files/directories. Despite these _regular_ issues, we
never decided to stop using NTFS. ;)

[2] And no, _no one_ who was using the file server had admin access.
Unprivileged user access was all that was needed to create these undeletable
files and directories.

~~~
nisa
> And [0] indicates that you can live-replace a failed drive in a btrfs RAID
> array. (The man page makes it seem that you might want to also pass the -r
> flag to the replace operation.)

Good catch! I have to test on a rented server I'm running anyway in the next
few days - adding an additional drive, running replace, and removing the old
drive seems to work. I'm about to find out whether deleting a device,
rebooting, and trying to replace a non-existent device also works - I guess
it's likely that I'll have to fumble with the provider's rescue system (so no
control over the kernel version :/).

I thought that the -o degraded dance was still necessary; looks like I was
wrong. Let's see how the disk replacement on that server turns out...

> One thing bothers me about that guy who was reporting the bug on the 4.1
> kernel. He claimed that his mount options were: /dev/sdb /media/storage2
> btrfs ro,noatime,compress=lzo,space_cache 0 0 but the issue he was reporting
> smelled like it could only happen on a rw volume. Wonder what was up with
> that.

That guy was me :)

Basically the disk got remounted ro after the failure, and I just did a grep
on /proc/mounts for the fs before posting (not very wise). This was a bug that
was fixed with 4.2 (didn't see it again, at least) and btrfs check --repair
fixed it.

> It's also interesting that noone replied to his messages. Both linux-btrfs
> and #btrfs have always been quite responsive when I've sent messages.

I had no time to hang around IRC at the time. Yes, usually it's very nice
there. I also want to say I'm not ranting against any of the people involved;
I'm just unhappy with the current state of the technology.

[ENOSPC]

> Dunno. It looks like the Global Reserve got introduced in ~3.16, so it's not
> likely related to that. Maybe you hit a hard-link limit or had really,
> really large xattrs attached to files or something? (I do know that btrfs
> has (had?) a surprisingly low limit on the number of hard links supported in
> a volume, but I have no idea how hitting that limit presents itself.)

Actually I don't know - the filesystem in question ran for a while with the
3.13 kernel and was later upgraded to something less old. Maybe that's part of
the problem. The ENOSPC happened while a misconfigured Hadoop NameNode wrote
tons of data, 4x, without checkpointing to disk, and when the checkpointing
server was restarted it all crashed. It's now running 4.2 and fine so far...

> In the distant past, running a balance on empty chunks (in order to free
> them and free up otherwise unused metadata space) was one recommended way to
> work around mysterious ENOSPC issues. This isn't supposed to be required
> anymore, but it's a last-ditch cargo-cult thing that can't do anything worse
> than waste one's time.

Yeah, I did that balance dance and it somewhat worked.

> Is it any OOM situation, or just particular ones? I know that I've had
> google-chrome + Hangouts run away and nom all my memory on multiple
> occasions, and had my btrfs volumes function just fine after the OOM killer
> had its way with Chrome.

Never got around to nailing down this one (would love to be able to file a bug
on it) - but at the time I was not really able to do much on the machines.
It's a NUMA system, and basically ElasticSearch and Hadoop allocated far more
than 64GB of memory - OOM kicked in and _after a while_ the logs were full of
hanging tasks related to btrfs. At the time I found something related on the
mailing list in regards to allocations and OOM, however I lack the intimate
knowledge to be sure it's related.

I've screamed at everyone to avoid OOM situations so I'm not sure if it's
still a problem :)

> Also, do you have a handle on roughly how much better it has become in 4.2,
> and is this problem something that the btrfs folks are aware of?

So far I have not seen it again. I didn't have enough data to file a bug or
investigate further. It was something lock-related, IIRC.

> Anyway, I do REALLY hope I soon get the budget to set up a btrfs testbed.
> I'd really like to see (and help diagnose) all these failures that people
> keep talking about... in a carefully controlled environment, with data that
> I don't care about -naturally-. :)

Good luck! You can try running Hadoop and some heavy jobs on it. Lots of
threads that read and write lots of data. There are some benchmarks like
[https://github.com/intel-hadoop/HiBench](https://github.com/intel-hadoop/HiBench)
to stress the system. Should also work on a single system.

> [1] And -for the record- I saw an eerily similar thing happen every other
> month at a Windows-only programming shop I used to work for. One or more
> files and/or directories on the NTFS-backed Windows 2003 file server would
> inevitably become undeletable. [2] The sysadmin would have to take the whole
> thing down for a half hour to smear chicken guts in the right places to get
> rid of the offending files/directories. Despite these regular issues, we
> never decided to stop using NTFS. ;)

Yes, there is worse stuff out there. I was using ZFS before with lots of
really fucked-up disks in a kind of "don't do this, you are stupid!" setup,
and ZFS just did its job and never complained - same Hadoop workload, plus
disks with thousands of bad sectors. Maybe it's unfair to compare btrfs to
this, however it was a ride full of disillusionment to think btrfs compares
to ZFS.

Maybe it's getting there and I was just in at a bad time... I'm not so sure,
though.

------
geggam
Can we not post videos?

It's really hard to skim the material and decide if I want to read more in
depth.

~~~
matt2000
I was actually coming here to say that it was nice to have links to
interesting stuff on YouTube. I agree it can be jarring when you're trying to
read stuff and get hit with a video - maybe an HN YouTube channel would be
cool or something.

