
XFS: There and back ... and there again? - jakobdabo
https://lwn.net/Articles/638546/
======
DiabloD3
I use XFS everywhere I can: out of the box, it requires no tuning to get near
maximum performance, yet I generally have to tune ext3/4 to get what I want
(and, largely, I still consider that a bit of voodoo).

Problem is, I run a dedicated server hosting company, and the majority of my
customers want either CentOS 6.x or Debian Stable. Neither can install XFS as
the root filesystem (though, strangely enough, both can use it for other mount
points during install; a separate ext2/3/4 /boot doesn't fix the issue).

At home I have a mini-server that is 2x M550 128GB SSD, 2x es.2 2TB HDD, with
the SSDs partitioned as 16GB md raid1 XFS for /, 256MB for ZFS ZIL, rest for
ZFS L2ARC, and the 2x2TB as ZFS mirror; /tank and /home are on ZFS, and / is
pretty much empty.

The only thing that would improve XFS at this point is if it supported
optional checksumming and LZ4 compression on root filesystems; otherwise it's
basically perfect.

By the way, said mini-server? Dual core Haswell 3.x GHz, 16GB of DDR3-1600,
from pressing the power button, going through BIOS, hammering enter to get
past grub menu as fast as possible, it takes about 7 seconds to get to the
login prompt; less than 3 of that is between leaving grub and getting to the
prompt.

~~~
chrisbolt
> Problem is, I run a dedicated server hosting company, and the majority of my
> customers either want CentOS 6.x or Debian Stable. Neither can install XFS
> as the root filesystem

FYI, I've got many machines running Debian Stable (wheezy) with xfs root
filesystems, no /boot partition. No problems.

~~~
DiabloD3
Debian can boot from an XFS root just fine; it's the Debian installer that
still refuses to set one up. It's infuriating, and the D-I devs are aware of
it.

My home server I described? Runs Debian, has XFS root. It clearly works.

~~~
chrisbolt
Do you have a Debian bug ID? D-I has been doing XFS root installs for me for
years now, whether it's interactively with the netinst ISO in VMware or
remotely with the debian-installer-netboot package, a DHCP server, and an
unattended installer preseed file.

~~~
DiabloD3
I just tested it again, and XFS now works. It did not work the most recent
time I tried (within the past 12 months). The only change I made to my preseed
file was in the expert_recipe stanza for /: from ext4 filesystem { ext4 } to
xfs filesystem { xfs }.
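
For reference, the stanza has this general shape (the recipe name and sizes
here are placeholders, not my actual values):

    d-i partman-auto/expert_recipe string \
        root :: \
            8192 10000 16384 xfs \
                $primary{ } $bootable{ } \
                method{ format } format{ } \
                use_filesystem{ } filesystem{ xfs } \
                mountpoint{ / } \
            .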

Thanks for getting me to test again. One less OS I have to deal with that
can't do XFS properly.

------
pipeep
_The progression in solid-state drives (SSDs) shows slow, unreliable, and
"damn expensive" 30GB drives in 2005. Those drives were roughly $10/GB, but
today's rack-mounted (3U) 512TB SSDs are less than $1/GB and can achieve
7GB/second performance. That suggests to him that by 2025 we will have 3U SSDs
with 8EB (exabyte, 1000 petabytes) capacity at $0.1/GB._

Ignoring the limitations of physics, 8EB at $0.1/GB comes out to $858,993,459
(treating the prefixes as binary, so 8 x 2^30 GB). I don't think there will
ever be enough of a market to support the mass production of billion-dollar
disks.
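
The multiplication, for anyone who wants to check it (a runnable sketch; it
assumes binary prefixes, as above):

    /* 8 EiB priced at $0.10 per GiB. */
    #include <stdio.h>

    int main(void) {
        const double gib = 8.0 * (1ULL << 30); /* 8 EiB expressed in GiB */
        printf("$%.0f\n", gib * 0.10);         /* prints $858993459 */
        return 0;
    }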

~~~
zanny
On the other side of the spectrum, I am totes hyped for the day soonish when a
3TB SSD costs $80. In the shorter term, I'm super hyped to see $200 1TB SSDs
in the next year or so.

~~~
cdr
$200 1TB SSDs are not what you should be excited about... sub-$400 400GB
2400MB/s consumer NVMe SSDs (the Intel 750) are what's exciting.

------
wyldfire
Every few years, my team compares other filesystems against XFS for sequential
write throughput to disk arrays. We consistently get near-block-device
performance with XFS, and other filesystems are only just starting to approach
it.

Consistent throughput is far and away our top-ranked requirement, so XFS has
ruled for over a decade.

------
colanderman
> _zeroing of all open files when there is an unclean shutdown. None of those
> were true_

That one most definitely _is_ at least partially true. I experienced it
several times using XFS on my home Linux machine in the early 2000s. I use JFS
now, but supposedly this bug/misfeature was fixed some time ago, thankfully.

Edit: indeed, the linked slides themselves back me up on this; just three
slides after the one quoted by the article, we have "Null files on crash
problem fixed!"

~~~
cbsmith
No, you misunderstand. This is related to the O_PONIES problem
([http://lwn.net/Articles/351422/](http://lwn.net/Articles/351422/)). The bug
was in software that failed to call fdatasync when it should have. The files
weren't getting zeroed out on shutdown: the data had yet to be flushed to
disk, and it wasn't required to be, but the metadata (including the new file
size) was. For security reasons, unallocated space within a file has to read
back as zeroes. So... data not flushed to disk? You get zeroes after an
unclean shutdown.
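
The fix belongs in the application. A minimal sketch of the standard pattern
(write to a temp file, fsync, then rename; the file names here are made up):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        const char *data = "new contents\n";
        int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, data, strlen(data)) < 0) { perror("write"); return 1; }

        /* Without this fsync, a crash shortly after the rename can leave a
         * zero-length or zero-filled file -- exactly the behavior above. */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }
        close(fd);

        /* The rename makes the new contents visible atomically. */
        if (rename("config.tmp", "config") < 0) { perror("rename"); return 1; }
        return 0;
    }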

~~~
colanderman
Ah gotcha. So a _different_ files-zeroed-on-crash "bug" ;) At least that one
makes sense.

~~~
cbsmith
Also, the bug wasn't in the filesystem.

------
angersock
So, why didn't XFS "win" out over ReiserFS, ext3/4, btrfs, or whatever else is
being pushed nowadays?

~~~
Maakuth
Btrfs is still very new as a filesystem, and there's little evidence that it
can hold up in production use. It is similar to ZFS, which is the filesystem
of choice for many large deployments. Unfortunately, ZFS can't be distributed
with Linux because of a license mismatch.

Ext3/4 are from the previous generation of filesystems; their performance on
very large volumes is not acceptable. And you don't want your production
system to be down for hours because of a boot-time fsck.

ReiserFS had some promise years ago, but the project leader is in prison and
nobody has picked up the effort. I don't think it's possible for it to close
the gap with other filesystems even if Mr. Reiser returns to develop it after
his sentence.

~~~
ryao
I am not a lawyer, but I have yet to speak to one who claimed you could not
distribute a Linux kernel module under an "incompatible" license when said
module is not considered a derived work of GPL software. Any Linux port of ZFS
is a derived work of OpenSolaris (in the legal sense), not of Linux, such that
you can distribute ZFS with Linux as long as it is a separate kernel module,
and there are plenty of companies doing exactly that. None of the lawyers I
have asked disagreed that the GPL does not reach a separate ZFS kernel module.
To be honest, one attorney did think that certain forms of advertising might
be able to trigger the derived-works clause, but he considered that an
avoidable problem, and his opinion was an outlier.

~~~
Maakuth
To my understanding this is quite an open issue, and it's hard to find a
company that would want to risk it in court. The ZFS on Linux project
currently distributes the modules as source, and they are built against the
particular kernel locally on any machine that needs them. This is the same
approach GPU vendors take with their closed-source graphics drivers.

------
jbuzbee
Around the turn of the century, I was working for a company that was
continuously recording live video streams to a hard drive. Since you can't do
that indefinitely, we'd only keep around the last two hours or so. If I recall
correctly, XFS was chosen both because of performance and because it was the
only filesystem that supported file truncation at the start of the file. It
was just a single call to truncate a few hundred meg at the start of the file
when you ran out of space. I don't know if any other filesystems support that
these days, but it was a great feature for us at the time.
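
These days the same trick is available through fallocate(2) with
FALLOC_FL_COLLAPSE_RANGE, which XFS and ext4 both gained in Linux 3.15. A
rough sketch (not the interface we used back then):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <stdio.h>

    /* Drop the first `len` bytes of the file in place; `len` must be a
     * multiple of the filesystem block size. The remaining contents are
     * shifted down and the file shrinks by `len` bytes. */
    int drop_head(int fd, off_t len) {
        if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, len) < 0) {
            perror("fallocate");
            return -1;
        }
        return 0;
    }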

------
thrownaway2424
The only trouble I ever had with XFS was blowing out the kernel stack in
highly layered storage setups. XFS uses (or used) tons of stack space, and if
you put, say, LVM, md, and NFS in the mix, there just wasn't enough.

Something along these lines, although I never used RHEL:
[https://access.redhat.com/solutions/54544](https://access.redhat.com/solutions/54544)

------
vondur
Isn't XFS the default filesystem for RHEL now? I remember a video of the Red
Hat devs discussing how they were adding new features to it.

~~~
jfreax
Yes, it is [1]. It works even for /boot. It is also the default filesystem for
/home in SLES12 [2].

[1] [http://www.redhat.com/en/about/press-releases/red-hat-unveils-rhel-7](http://www.redhat.com/en/about/press-releases/red-hat-unveils-rhel-7)
[2] [https://www.suse.com/communities/conversations/xfs-the-file-system-of-choice/](https://www.suse.com/communities/conversations/xfs-the-file-system-of-choice/)

------
stox
In 1994, we used to joke that XFS was a write-only filesystem. The early
version was less than stellar. SGI pretty quickly solved the worst issues. It
sure was nice not having to wait for 6 hours for a Challenge XL to fsck.

~~~
lloydde
Wow, that's uncanny: in 2012, at a former employer, that was our joke about
btrfs, right before we switched to XFS with Ceph for our OpenStack product. We
still had problems that appeared to be locking between the XFS layers above
and below Ceph, but nothing catastrophic like the customer problems with
btrfs, where we knew the data had been written but recovery was not
reasonable.

------
jfindley
XFS is great, now. It had some problems with metadata performance at scale a
few years ago (affected early versions of RHEL/CentOS 6, and may be one of the
reasons it was not the default in those versions), but it's now the best of
the "traditional" filesystems.

> From Btrfs, GlusterFS, Ceph, and others, we know that it takes 5-10 years
> for a new filesystem to mature.

Those are bad examples - they are all significantly more complex than XFS/ext.
Two of the three are distributed filesystems that aren't solving any of the
same problems.

However, their inclusion in the article is worth noting, even if the author
put them in the wrong paragraph. Increasingly, large volumes are becoming
distributed over lots of individual servers, with technologies like GlusterFS
and Ceph. Both of these, and some of their competitors too (XtreemFS is also
really good, despite the silly name), use traditional filesystems on the
underlying server volumes that are presented as one large distributed FS. XFS
is generally used for this task, and unless an individual node gets larger
than ~8EB, there's currently no reason to change this.

The real question, then, becomes - will a single server need a local FS larger
than 8EB by 2025-2030? Possibly not. It's very dangerous to say "X is all
anyone will ever need", but I think that we're going to increasingly see bulk
storage go the same way that CPUs did - instead of a single huge local FS
(analogous to increasing single-core clockspeed), we'll see an increasing
number of storage nodes combined into one, via a distributed filesystem
(analogous to higher core count).

Part of the reason for this is that there are quite a few disadvantages to
having very large volumes in one place. If you use RAID, rebuilds become
unreasonably long, and rebuild speed is not keeping pace with storage growth.
If you don't, and solve redundancy with multiple nodes instead, then the
bigger the individual node is, the larger the impact when it fails. At some
point you're also going to need to shift data off that box, and while we're
likely to have 100GbE server ports around then[1], even 100GbE is going to
take an unreasonably long time to move 8EB anywhere.

1:
[https://www.nanog.org/meetings/nanog56/presentations/Tuesday/tues.general.kipp.23.pdf](https://www.nanog.org/meetings/nanog56/presentations/Tuesday/tues.general.kipp.23.pdf)

EDIT: Just did some quick calculations. Assuming a 100 Gigabit server port,
that's 12.5 Gigabytes/s. At 12.5 GB/s, it would take over 20 _years_ to
transfer 8EB anywhere. The idea that we're going to have 8EB of data on an
individual server in that timeframe is starting to look a bit silly, with that
in mind - how's it going to get there? What on earth would you do with it once
it is there?

Even if by some magic we invent 1TbE and make it cheap enough to use on
servers (and invent 10+TbE for the network core) by 2025, that would still
take over 2 years to fill the disk. Yes, sorry, but this is just silly. 8EB
across lots of individual servers? Sure. But all on one server in a regular
filesystem? Not going to happen any time soon.
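
For anyone who wants to rerun the numbers, a quick sketch (decimal units, one
fully saturated port, no protocol overhead):

    #include <stdio.h>

    int main(void) {
        const double bytes  = 8e18;        /* 8 EB */
        const double gbe100 = 100e9 / 8;   /* 100GbE in bytes/s = 12.5 GB/s */
        const double tbe1   = 1e12 / 8;    /* hypothetical 1TbE */
        const double year   = 365.25 * 24 * 3600;

        printf("100GbE: %.1f years\n", bytes / gbe100 / year); /* ~20.3 */
        printf("1TbE:   %.1f years\n", bytes / tbe1 / year);   /* ~2.0  */
        return 0;
    }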

