
Early Linux filesystem reliability - jsnell
http://minnie.tuhs.org/pipermail/tuhs/2017-May/009935.html
======
Florin_Andrei
I've actually used Ext2 in production, back in the '90s. That's all we had
back then. A bit later I've also played a lot with XFS, both on Linux and Irix
(I've worked at SGI for a while in the '00s).

Back in ye olden days, yanking the power cord basically guaranteed a corrupt
Ext2 FS and a visit from fsck. However, in the vast majority of cases, fsck
would actually do the job. You had to peruse lost+found and recover the lost
souls out of it, but that was usually the extent of it. I did 'fsck -y' many,
many times - in most cases with good results.
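
The drill, for anyone who never had to do it, was roughly this (the device name is an example; only ever fsck an unmounted filesystem):

```shell
# Example recovery session after an unclean shutdown. /dev/sdb1 is a
# placeholder device; never run fsck on a mounted filesystem.
fsck -y /dev/sdb1        # answer "yes" to every repair prompt
mount /dev/sdb1 /mnt
ls -l /mnt/lost+found    # orphaned files land here, named by inode number
```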

XFS on PC was worse. It tended to be stupendously fast when dealing with lots
of I/O to/from very large files (video editing, running VMs), but a power
failure on PC was a lot worse than with Ext2. There was a good chance you
would lose the whole volume.

On Irix / SGI hardware it was a very different story. Those suckers were quite
reliable. Heavy as hell, too.

---

One habit from back then that's very hard to shake off is to run sync before
reboot, often as "sync; reboot" - you know, just in case something gets stuck
on the way down and you have to hit the reset button. A more extreme example
would be to manually stop all services except sshd, then do "sync; reboot".

It's completely unjustified today, and yet I still do "sync; reboot"
occasionally. It's baked into the muscle memory of my fingers after sleepless
nights caused by losing stuff due to a flaky driver that froze the system on
reboot.

~~~
throwaway76543
If you're concerned about flushing writes consider `mount -oremount,ro` --
this guarantees writes are flushed and is the only truly important step of a
system shutdown anyway WRT filesystem integrity. Once filesystems are mounted
read-only you can safely power off the machine.
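
Something like this, assuming / and /home are your writable mounts (needs root):

```shell
# Flush and freeze each writable filesystem before power-off.
mount -o remount,ro /home
mount -o remount,ro /
# If a remount fails, something still has files open for writing;
# `fuser -vm /home` will show the offending processes.
```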

Filesystems can be remounted read-only via the serial console as well, with a
break-u. Useful even when userland is otherwise inaccessible, such as in the
event of a fork bomb.

~~~
mschuster91
Nice to learn this - but how does one send a "break-u" e.g. from inside
minicom, or in KVM virtual machines, via virsh console?

~~~
throwaway76543
I usually use `cu`, but the docs for minicom tell me it's control-A followed
by F to send a break signal. Then just hit "u". There are a number of single
characters that can follow a break, each with its own action; the full doc on
this kernel feature is here:
[https://www.kernel.org/doc/Documentation/admin-guide/sysrq.rst](https://www.kernel.org/doc/Documentation/admin-guide/sysrq.rst)

I'm not terribly familiar with KVM but the VM console tools probably have some
way to generate a break signal. Check the docs for how to "send a break."

Oh, and don't forget to enable this feature via sysctl -- per the above linux
doc.
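
Concretely, something like this (as root). Note that /proc/sysrq-trigger gives
you the same "u" action without any console break:

```shell
# Enable the magic SysRq interface, then fire the emergency
# read-only remount ("u") by hand:
sysctl kernel.sysrq=1
echo u > /proc/sysrq-trigger   # remounts all filesystems read-only
```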

------
tetha
Hell, the attitude in the article evokes strong memories of my last workplace.

Almost all development teams tried so hard to build smart software doing the
right thing, recovering from everything, tried to fix so many things. They
spent insane amounts of work trying to be smart in error situations. And all
of that was useless or harmful - it either didn't work, or it made things
worse.

My team operated under simple principles. Crash early, crash hard, log well,
trust the operator. But you know, Ops at that place loved our applications. It
took time to get them going, but it was easy to understand what to do whenever
they crashed. It was easy to understand why it stopped. It was easy to
understand when to file a bug.
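
The whole principle fits in a few lines of shell - a minimal sketch, with
illustrative names:

```shell
#!/bin/sh
# "Crash early, crash hard, log well": abort on the first failed command,
# leave one clear log line, and hand the problem to the operator.
set -eu    # any failed command or unset variable stops the run

log()  { printf '%s %s\n' "$(date -u +%FT%TZ)" "$*" >&2; }
fail() { log "FATAL: $*"; exit 1; }

# Each step either succeeds or stops the whole run with a stated reason;
# no clever retries, no silent fallbacks.
command -v tar >/dev/null 2>&1 || fail "tar not installed"
log "preflight ok"
```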

~~~
kpil
Getting good at error handling, queue sizing and writing relevant logs
requires some repeated exposure to real world pear-shaped events.

Failing early and visibly is always good, especially before a program has been
taken into production. Once things have stabilized for a while, recoverable
errors can be ignored and just logged.

It's also important to take the opportunity to improve the error handling and
logging first, before fixing the actual problem. Errors are hard to fake
"right" so getting a good reproducible error is an opportunity.

------
m45t3r
Quite a coincidence. I recently went back from XFS to EXT4 in my work laptop
using Arch Linux, and before that I went from BTRFS to XFS.

I went from BTRFS to XFS because I hit a really bad performance corner case in
BTRFS. My work consists of mostly backend development using
Ruby web stack and PostgreSQL/Redis, and often my laptop would freeze
completely and only come back after a dozen seconds. This happening on an SSD
was unacceptable. So I decided to go to XFS, and for some time everything
went well, no more random freezes.

This weekend I updated my system by running pacman, as usual. Chrome seemed to
consume all memory and the system froze during the update. OK, not too bad, I
thought; I rebooted and tried to run pacman again, hoping that only some files
were corrupted. However, enough files were corrupted that I needed to
reinstall all packages on the system. Another freeze during the update and my
system was essentially dead after reboot. I tried to recover using chroot, but
multiple files were broken beyond repair.

So I decided to go back to EXT4, and reading this article does make me more
confident that this shouldn't happen again.

~~~
kev009
Hard not to come off smug but as a ZFS user it is always jarring to read
anecdotes like this in {{ current_year }}.

~~~
m45t3r
I am quite curious to test ZFS, however Arch Linux does not maintain official
ZFS packages in repositories, so I would need to maintain my own kernel
updates just to use ZFS. Not interesting to me, especially considering that
this is my work laptop.

When (if?) Arch supports ZFS in main repos I will probably test it. That is,
unless bcachefs comes first.

~~~
codys
> maintain my own kernel updates

Not really: zfs is available in aur as a dkms package. Pacman automatically
rebuilds dkms modules when it updates the kernel. All you'd need to do is
modify mkinitcpio.conf to include "zfs" in the HOOKS at the appropriate stage.

So: more complicated than just using ext4, but not "maintain my own kernel"
level of difficulty.
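
Roughly (assuming an AUR helper such as yay; package names here are what the
AUR uses, so verify them before relying on this sketch):

```shell
# Install ZFS as a dkms module; dkms rebuilds it on every kernel update.
yay -S zfs-dkms zfs-utils

# Then add the "zfs" hook to the initramfs, in /etc/mkinitcpio.conf:
#   HOOKS="base udev autodetect modconf block keyboard zfs filesystems"
mkinitcpio -P    # regenerate all initramfs presets
```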

------
saosebastiao
So last I checked, btrfs was the way of the future according to Ted, but every
time I see it discussed, it's Here Be Dragons galore. Is there some timeframe
where btrfs will take over? Or at least be stable enough for a Debian or Red
Hat to switch to it as a default?

~~~
Mister_Snuggles
SuSE/OpenSuSE has been using BTRFS by default for a while and it seems to work
well enough. There's a default schedule of running 'btrfs balance' every week,
seemingly based on when the OS was installed (it relies on the timestamp of a
file that gets updated with every run), that makes the system(1) virtually
unusable for about 15 minutes.

(1) I've only seen this on one machine, so maybe it's a quirk of that
machine's workload. But it sure does suck when it's in the middle of a
workday.

~~~
codys
On btrfs causing latency: I've got a few systems with btrfs as the rootfs (on
top of lvm, on top of dm-crypt, on a SSD).

I recently started using `snapper` to create snapshots on a schedule on them.
I enabled quota support in btrfs so I could see how much space snapshots were
using.

I noticed that filesystem wide latency tended to spike when removing snapshots
(several minutes of all fs access stalling).

Balancing with quotas enabled is even worse: my systems were hung for multiple
days, until I forcibly restarted them and disabled quotas. Then the fs hangs
were much smaller (a few seconds) and not too noticeable. Balancing finished in
something on the order of an hour.

While I had quotas enabled, I was constantly having btrfs tell me the data was
bad and needed rescanning (rescanning quotas would also induce fs wide
latency).

The thing is, ZFS has snapshot space usage info, and doesn't have awful
latency (it also doesn't have a "balance" operation, but I'm not sure how
relevant that is).

Given my experience with both btrfs & ZFS, I'll likely consider using ZFS as
my rootfs in the future.
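
For reference, the setup described above boils down to roughly this (as root;
"/" stands in for the actual mount point):

```shell
# Scheduled snapshots via snapper, plus per-subvolume space accounting.
snapper -c root create-config /   # sets up timeline snapshots for /
btrfs quota enable /              # qgroups: track space pinned by snapshots
btrfs qgroup show /               # see usage per snapshot/subvolume
# Disabling quotas is what ended the multi-day balance hangs for me:
btrfs quota disable /
```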

~~~
Mister_Snuggles
I have no idea what the state of ZFS on Linux is, but I've been using it on
FreeBSD for a while now and it's fantastic. Comparing FreeBSD to Linux is a
bit of an apples/oranges thing though.

------
simula67
> There is yet another example of "Worse is Better" in how Linux had PCMCIA
> support several years before FreeBSD/NetBSD. However, if you ejected a
> PCMCIA card in a Linux system, there was a chance (in practice it worked out
> to be about 1 in 5 times for a WiFi card, in my experience) that the
> system would crash. The *BSD's took a good 2-3 years longer to get PCMCIA
> support, but when they did, it was rock solid. Of course, if you are a
> laptop user, and are happy to keep your 802.11 PCMCIA card permanently
> installed, guess which OS you were likely to prefer --- "sloppy but works,
> mostly", or "it'll get there eventually, and will be rock solid when it
> does, but zip, nada, right now"?

Time to market matters. Probably explains how codebases of low reputed quality
still seem to win in the real world: MySQL vs Postgres? MongoDB vs RethinkDB?

------
kev009
Ted's post is quite balanced and accurate vs the revisionist fawning in the
quoted message from McVoy. But Warner's message elsewhere in the thread is
also good
[http://minnie.tuhs.org/pipermail/tuhs/2017-May/009880.html](http://minnie.tuhs.org/pipermail/tuhs/2017-May/009880.html).

The biggest Linux defect was not ext{2,3,4} but LVM and MD, which would throw
away write barriers until something like kernel 2.6.31 (which was especially
painful on Ubuntu LTS). Many distros used LVM by default, and many servers
used mdraid somewhere in the stack. I saw many corrupt Linux systems in the
2000s through the first part of this decade; it was especially egregious for
DBs and hypervisors with file based disk images.
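
For what it's worth, on kernels of that era you could at least request
barriers explicitly and check what you actually got (the device path is an
example):

```shell
# Ask ext3/ext4 for write barriers explicitly and verify the mount options.
mount -o barrier=1 /dev/vg0/root /mnt
grep barrier /proc/mounts   # shows the barrier option where it is in effect
dmesg | grep -i barrier     # stacks that dropped barriers may warn here
```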

------
jdblair
It took me a few passes to figure out that "Things Just Worse" should be
"Things Just Work." Autocorrect much?

------
codewiz
_PC class hardware tends not to have power fail interrupts, and when power
drops, and the voltage levels on the power rails start drooping, DRAM tends to
go insane and starts returning garbage long before the DMA engine and the hard
drive stops functioning._

This is insane, I refuse to believe it. Even a junior EE knows how to design a
PCB so that the RESET signal is asserted on all ICs as soon as the voltage
drops below a safe operating level.

~~~
DannyB2
The PCB may be well designed, but the power supply may not be. I have
witnessed what a cheap PC can do. Purchased from a Walmart some years ago.
Windows 7 immediately replaced with Ubuntu. After a brownout, that machine was
unbootable. Also un-fsck'able.

My thought was that the power supply should guarantee adequate power, or none
at all. Not something in between. Also, not rapidly alternating states of
adequate power / no power.

Short term solution: rebuild that system and get it a UPS.

Longer term solution: get a much better box.

~~~
tetha
> My thought was that the power supply should guarantee adequate power, or
> none at all. Not something in between. Also, not rapidly alternating states
> of adequate power / no power.

That is very hard to do, actually. Voltage does fluctuate to a certain degree,
because physics. And you don't want to cut power to a system just due to a
small power dip. And you can't distinguish a single small power dip from 10
small power dips spread over a window of time. And your flapping detection
only works if the power drops are close enough to each other.

Proper control theory is amazingly hard. As my prof on that said - no one
fully understands PID controllers, but some blokes have a really lucky thumb.

~~~
codewiz
The case Ted Ts'o describes is much simpler: DRAM returning garbage while the
CPU is still executing instructions and sending commands on the SATA bus.

This can't possibly happen even if the power supply is crazy bad, because the
reset logic on the main board will halt everything before the DRAM starts
malfunctioning.

~~~
DSMan195276
> RAM returning garbage while the CPU is still executing instructions and
> sending commands on the SATA bus.

When I read it, I got the impression he was saying the DRAM was going crazy
while a DMA transfer to the hard-drive was still going on. That doesn't
require the CPU to be functional when the DRAM is corrupt, only the DMA
controller. I can't personally say if that makes it any more likely though.

~~~
codewiz
That's right, but all stateful ICs on the motherboard have a reset pin,
including the north and south bridges when they were still separate packages.
Even PCI cards will receive the reset signal simultaneously.

Not sure what the hard drive would do with a truncated ATA command though.

------
GalacticDomin8r
A more in-depth explanation of journaling vs soft-updates:

[https://www.freebsd.org/doc/handbook/configtuning-disk.html](https://www.freebsd.org/doc/handbook/configtuning-disk.html)

------
seltzered_
This post brought up memories of trying Linux in the late 90s and giving up
(or reinstalling the os) because ext2 crashed to a point unrecoverable by
fsck. Anyone else have a similar experience back then?

------
ianai
I stand corrected, sorry everybody.

~~~
laumars
If you can afford ECC then that is obviously preferable. However the horror
stories regarding running ZFS on non-ECC are _greatly_ exaggerated.

~~~
michaelmrose
Not merely exaggerated - nonsense. All filesystems using non-ECC memory are
more likely to experience damage to the data stored therein.

ZFS is no more likely to be damaged than anything else.

This whole confusion stems from the ZFS devs' suggestion that ECC RAM be used,
and from people who don't know any better thinking: oh, that must be some
special requirement for ZFS.

These folks then spread this misinformation all over the Internet, where it's
passed on by people who know nothing about the topic.

Pro tip: don't talk about things on the Internet that you only know about 3rd
or 37th hand. If you know nothing about the topic you aren't improving the
store of human knowledge by passing on noise.

~~~
laumars
> _Not merely exaggerated, nonsense. All filesystems using non ecc memory are
> more likely to experience damage to the data stored therein._

If it were "nonsense" then one could flip the argument and say "ECC memory has
no beneficial effect whatsoever" - which we both know is untrue. However, I
can see why people say "nonsense", given the ECC myth seems only to be
discussed in relation to ZFS and the probability of data errors due to non-ECC
memory is small.

For what it's worth, I did originally include a comment about the risk being
the same as on any other file system - but then deleted it, fearing I might
have overlooked something on another fs I'm less familiar with.

> _Pro tip: don't talk about things on the Internet that you only know about
> 3rd or 37th hand. If you know nothing about the topic you aren't improving
> the store of human knowledge by passing on noise._

I don't think that's fair. The problem with 1st hand experience is that it's
often just based on anecdote, which can be worse than 3rd hand advice.
And in the case of failures: often some products are too widely used to be
worth constantly badgering the development team for help; which means you end
up having to rely on the advice of others. The problem is really more that
some people are terrible at researching so take 3rd hand advice without
bothering to fact-check it.

In any case, I don't know if your comment was aimed at me or not, but I do
have nearly 10 years of experience (wow that's gone quick!) running ZFS across
a variety of systems, some running ECC memory, others not. I've also been a
keen study of the Sun/Oracle docs. So I do consider myself reasonably well
informed from both credible sources and personal experience. Though I'm not
arrogant enough to assume I'm an expert either - communities like HN can be
deeply humbling places.

~~~
dom0
> > Not merely exaggerated, nonsense. All filesystems using non ecc memory are
> more likely to experience damage to the data stored therein.

> If it were "nonsense" then one could flip the argument and say "ECC memory
> has no beneficial effect whatsoever" - which we both know is untrue.

The original argument is like an implication ("no ECC => don't use ZFS"),
which he refutes by saying (correctly), that file systems are generally
affected the same way by memory corruption. Your argument seems to invert the
original implication and draw conclusions from that (fallacy of the converse,
I believe). ∎

~~~
laumars
He was refuting my "_greatly_ exaggerated" remark by going on to make the same
points I was alluding to, albeit more directly and in more detail. So I was in
turn elaborating on my choice of language.

I feel, between this post and your previous one, that you are basically just
agreeing with me by nitpicking the language I used.

------
regularfry
Complex Things Fail In Complex Ways: An Object Lesson

------
Dimi9909
Slightly off topic: how Facebook uses btrfs:

[https://www.linux.com/news/learn/intro-to-linux/how-facebook-uses-linux-and-btrfs-interview-chris-mason](https://www.linux.com/news/learn/intro-to-linux/how-facebook-uses-linux-and-btrfs-interview-chris-mason)

~~~
Dimi9909
[http://masoncoding.com/presentation/ks-14/btrfs.html#/1](http://masoncoding.com/presentation/ks-14/btrfs.html#/1)

------
Dimi9909
Ted is brilliant

