XFS: There and back and there again? (lwn.net)
110 points by jakobdabo on Apr 9, 2015 | 40 comments

I use XFS everywhere I can: out of the box, it requires no tuning to get near maximum performance, yet I generally have to tune ext3/4 to get what I want (and, largely, I still consider that a bit of voodoo).

Problem is, I run a dedicated server hosting company, and the majority of my customers either want CentOS 6.x or Debian Stable. Neither can install XFS as the root filesystem (but can use it for other filesystems during install, strangely enough; separate ext2/3/4 /boot doesn't fix the issue).

At home I have a mini-server that is 2x M550 128GB SSD, 2x es.2 2TB HDD, with the SSDs partitioned as 16GB md raid1 XFS for /, 256MB for ZFS ZIL, rest for ZFS L2ARC, and the 2x2TB as ZFS mirror; /tank and /home are on ZFS, and / is pretty much empty.

The only thing that would improve XFS at this point is if it supported optional checksumming and LZ4 compression on root filesystems; otherwise it's basically perfect.

By the way, said mini-server? Dual core Haswell 3.x GHz, 16GB of DDR3-1600, from pressing the power button, going through BIOS, hammering enter to get past grub menu as fast as possible, it takes about 7 seconds to get to the login prompt; less than 3 of that is between leaving grub and getting to the prompt.

> Problem is, I run a dedicated server hosting company, and the majority of my customers either want CentOS 6.x or Debian Stable. Neither can install XFS as the root filesystem

FYI, I've got many machines running Debian Stable (wheezy) with xfs root filesystems, no /boot partition. No problems.

Debian can physically boot it fine, but Debian-Installer still refuses to actually install it with XFS. It's infuriating, and the D-I devs are aware of it.

My home server I described? Runs Debian, has XFS root. It clearly works.

I just installed on to a brand new VM from debian-7.8.0-amd64-CD-1.iso. Selected manual partitioning, set a bootable XFS partition, installed and ran fine.

Do you have a Debian bug ID? D-I has been doing XFS root installs for me for years now, whether it's interactively with the netinst ISO in VMware or remotely with the debian-installer-netboot package, a DHCP server, and an unattended installer preseed file.

I just tested it again. XFS now works. It did not work the most recent time I tried (within the past 12 months). The only change in my preseed file I made to test the change was the expert_recipe stanza for /, from ext4 filesystem { ext4 } to xfs filesystem { xfs }.

Thanks for getting me to test again. One less OS I have to deal with that can't do XFS properly.

Does XFS support being /boot now? I remember being tripped up by this many years ago.

It has since at least squeeze.

> and the majority of my customers either want CentOS 6.x

I'd have expected them to be beating down the CentOS 7 door by now.

A few have asked when I'm going to offer CentOS 7. My reply? When it actually works with my deployment system properly and isn't a colossal buggy mess, and actually works with the software you (my customers) use (a lot of things have not announced CentOS 7 support for the same reason I haven't: it's a buggy mess).

I wish I could move everything to 7. I had to modify our kickstart, which is not surprising, but I haven't really had any other issues.

What sort of problems have you encountered?

Do you have any examples of the CentOS 7 problems? Bug reports? I'm curious.

I'm not a CentOS admin, but I know a lot of CentOS admins. The concept of Linux distro problems is kind of foreign to me: I'm a Debian guy, and most of the problems we've had in the past 5 years have been pretty much just systemd (which I file away as "systemd sucks" issues, not Debian issues).

Basically, everything they've said boils down to two things: package deps are broken and there's no easy way to fix them, and there are random segfaults and kernel panics (I have no idea why any distro would ever have this issue).

CentOS 7.x comes bundled with systemd, while 6.x still uses init.

The progression in solid-state drives (SSDs) shows slow, unreliable, and "damn expensive" 30GB drives in 2005. Those drives were roughly $10/GB, but today's rack-mounted (3U) 512TB SSDs are less than $1/GB and can achieve 7GB/second performance. That suggests to him that by 2025 we will have 3U SSDs with 8EB (exabyte, 1000 petabytes) capacity at $0.1/GB.

Ignoring the limitations of physics, 8EB at $0.1/GB comes out to $858,993,459. I don't think there will ever be enough of a market to support the mass production of billion-dollar disks.
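
For anyone who wants to check that figure, here is a rough sketch of the arithmetic. It assumes binary units (1 EB taken as 2^60 bytes, 1 GB as 2^30 bytes), which is what reproduces the number quoted above; the function name is made up for illustration.

```python
# Price of a drive at a flat per-GB rate.
# Assumption: the quoted figure uses binary units (1 EB = 2**60 bytes, 1 GB = 2**30 bytes).

def drive_cost_usd(capacity_bytes, usd_per_gb, bytes_per_gb=2**30):
    """Return the total price of a drive sold at a flat per-GB rate."""
    return capacity_bytes / bytes_per_gb * usd_per_gb

# 8 EB at $0.1/GB, first in binary units, then in decimal units for comparison.
binary_cost = drive_cost_usd(8 * 2**60, 0.1)
decimal_cost = drive_cost_usd(8 * 10**18, 0.1, bytes_per_gb=10**9)

print(f"binary units:  ${binary_cost:,.0f}")
print(f"decimal units: ${decimal_cost:,.0f}")
```

With decimal units the total comes out at roughly $800 million instead, so the conclusion is the same either way.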

If the price per GB scales down to more "typical" sized drives, it still has interesting implications for filesystem devs.

And given XFS being the strongest of the Linux filesystems for large systems they will undoubtedly try and handle the high capacity drives and arrays too (relative to today).

On the other side of the spectrum, I am totes hyped for the day soonish when a 3TB SSD costs $80. Sooner rather than later. In the shorter term, I'm super hyped to see $200 1TB SSDs in the next year or so.

$200 1TB SSDs are not what you should be excited about... sub-$400 400GB 2400MB/s consumer NVMe SSDs (Intel 750) are what's exciting.

Really hard to see trends from two observations :) I'd be curious to see more serious predictions.

You can guess at trends from the fact that most improvement curves tend to be roughly exponential. Moore's law is the most spectacular, but similar (if less extreme) ones have been found in everything from battery capacity to the maximum size of scoops in a hydraulic backhoe.

You can find verification and many more examples in The Innovator's Dilemma.

Every few years, my team compares other filesystems against XFS for sequential write throughput to disk arrays. We consistently get near-block-device performance with XFS, and other filesystems are only starting to approach it.

Consistent throughput is far and away our top ranked requirement, so XFS has ruled for over a decade.

> zeroing of all open files when there is an unclean shutdown. None of those were true

That one most definitely is at least partially true. I had experienced this several times using XFS on my home Linux machine in the early 2000s. I use JFS now, but supposedly this bug/misfeature was fixed some time ago, thankfully.

Edit: indeed, the linked slides themselves back me up on this; just three slides after the one quoted by the article, we have "Null files on crash problem fixed!"

No, you misunderstand. This is related to the O_PONIES problem (http://lwn.net/Articles/351422/). The bug was software that failed to use datasync when it should have. The files weren't getting zeroed out on shutdown. The data had yet to be flushed to disk, and it wasn't required to be flushed to disk. The metadata was required to be flushed to disk. For security reasons, any unallocated inode data space has to be filled with 0's. So... data not flushed to disk? You get 0's when there is an unclean shutdown.
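
To make the application-side fix concrete, here is a minimal sketch of the usual pattern: flush the data to disk before relying on the metadata change. The helper name and the ".tmp" suffix are hypothetical; os.fsync, os.replace and os.open are standard Python calls. It assumes a POSIX system where a directory can be opened and fsync'ed.

```python
import os

def atomic_write(path, data):
    """Write data to path, syncing the data before the rename makes it visible."""
    tmp_path = path + ".tmp"  # hypothetical temp-file naming scheme
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to write the file data to stable storage
    os.replace(tmp_path, path)  # rename over the destination
    # Sync the containing directory so the rename itself is persisted (POSIX only).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Software that skipped the fsync step and only did the write and rename was relying on behaviour the filesystem never promised.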

Ah gotcha. So a different files-zeroed-on-close "bug" ;) At least that one makes sense.

Also, the bug wasn't in the filesystem.

What made you choose JFS over ext4?

ext4 didn't exist when I made the choice ;) In fact I think ext3 was still not considered stable.

So, why didn't XFS "win" out over ReiserFS, ext3/4, btrfs, or whatever else is being pushed nowadays?

Btrfs is still very new as a filesystem. There's little evidence that it can hold up in production use. It is similar to ZFS, which is the filesystem of choice for many large deployments. Unfortunately ZFS can't be distributed with Linux because of license mismatch.

Ext3/4 belong to a previous generation of filesystems; their performance on very large volumes is not acceptable. And you don't want your production system to be down for hours because of a boot-time fsck.

ReiserFS had some promise years ago, but the project leader is in prison and nobody has picked up the effort. I don't think it's possible for it to close the gap to other filesystems even if Mr. Reiser returns to develop it after his sentence.

I am not a lawyer, but I have yet to speak to a lawyer who claimed you could not distribute a Linux kernel module under an "incompatible" license when the said module is not considered a derived work of GPL software. Any Linux port of ZFS is a derived work of OpenSolaris (in the legal sense), such that you can distribute ZFS with Linux as long as it is a separate kernel module. There are plenty of companies doing this. I have spoken to lawyers regarding this and I have yet to meet a lawyer who disagrees when asked if the GPL applies to a Linux ZFS kernel module. To be honest, I did encounter one attorney who thought that certain forms of advertising might be able to trigger the derived works clause, but he considered it to be an avoidable problem and his opinion was an outlier.

To my understanding this is quite an open issue, and it's hard to find a company that would want to risk it in court. The ZFS on Linux project currently distributes the modules as source, and they are built locally for the particular kernel on any machine that needs them. This is the same approach GPU vendors take with their closed-source graphics drivers.

Dave Chinner works for Red Hat, and XFS is the default FS on the most popular commercial Linux distro. Seems like winning.

If you mean, "why didn't it win earlier?" it's probably a combination of ext3 being good enough (particularly when multi-spindle, large partition environments were less common for Linux users), Reiser3 doing a great PR job (pity about the crappy filesystem...), and ext3 having better data integrity because Ted Ts'o didn't understand how his filesystem implementation worked (he fixed that with ext4).

I would think it did? ext4 is the most deployed when you don't need extra features, because it's very stable, upgrades seamlessly from ext3, has universal GRUB support, has few surprises, and is actually pretty fast. ReiserFS is mostly dead; btrfs isn't even really stable yet and is slower.

I wouldn't be surprised if XFS was the second most deployed filesystem after ext4.

Where it's really losing is to the "block layering violations" of btrfs. Being able to manage storage pools at the filesystem level like ZFS is a major feature advantage.

Around the turn of the century, I was working for a company that was continuously recording live video streams to a hard drive. Since you can't do that indefinitely, we'd only keep around the last two hours or so. If I recall correctly, XFS was chosen both because of performance and because it was the only filesystem that supported file truncation at the start of the file. It was just a single call to truncate a few hundred meg at the start of the file when you ran out of space. I don't know if any other filesystems support that these days, but it was a great feature for us at the time.
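
On current Linux kernels, removing a range from the front of a file is exposed through fallocate(2) with the FALLOC_FL_COLLAPSE_RANGE flag, which XFS and ext4 have supported since Linux 3.15. This is not the call the setup described above used; it is only a hedged sketch of the modern equivalent. It calls fallocate through ctypes, assumes 64-bit Linux with glibc, and the helper names are hypothetical. The offset and length have to be multiples of the filesystem block size, hence the rounding helper.

```python
import ctypes
import os

FALLOC_FL_COLLAPSE_RANGE = 0x08  # flag value from <linux/falloc.h>

def round_down_to_block(nbytes, block_size):
    """Round nbytes down to a whole number of filesystem blocks."""
    return (nbytes // block_size) * block_size

def drop_file_head(path, nbytes):
    """Remove up to nbytes from the start of path; return the bytes actually removed."""
    libc = ctypes.CDLL(None, use_errno=True)  # assumes glibc on Linux
    # Assumes a 64-bit off_t, as on 64-bit Linux.
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    fd = os.open(path, os.O_RDWR)
    try:
        # Assumes f_bsize matches the granularity the filesystem requires.
        block_size = os.fstatvfs(fd).f_bsize
        length = round_down_to_block(nbytes, block_size)
        if length == 0:
            return 0
        # The collapsed range must end before EOF, otherwise the kernel returns EINVAL.
        if libc.fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, length) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
        return length
    finally:
        os.close(fd)
```

On a filesystem without collapse-range support the call is expected to fail with EOPNOTSUPP, which surfaces here as an OSError.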

The only trouble I ever had was blowing out the stack in highly layered storage setups when using XFS. XFS uses (or used) tons of stack space and if you put, say, LVM, md, and NFS in the mix there just wasn't enough.

Something along these lines, although I never used RHEL: https://access.redhat.com/solutions/54544

Isn't XFS the default file system for RHEL now? I remember a video showing the RedHat Devs discussing how they were adding on new features to it.

Yes, it is [1]. It works even for /boot. It is also the default filesystem for /home in SLES12 [2]

[1] http://www.redhat.com/en/about/press-releases/red-hat-unveil...
[2] https://www.suse.com/communities/conversations/xfs-the-file-...

In 1994, we used to joke that XFS was a write-only filesystem. The early version was less than stellar. SGI pretty quickly solved the worst issues. It sure was nice not having to wait for 6 hours for a Challenge XL to fsck.

Wow, that is uncanny. In 2012, at a former employer, that was our joke for btrfs right before we switched to XFS with Ceph for our OpenStack product. We still had problems that appeared to be locks between XFS above and below Ceph, but nothing catastrophic like the customer problems with btrfs, where we knew the data had been written but recovery was not reasonable.

XFS is great, now. It had some problems with metadata performance at scale a few years ago (affected early versions of RHEL/CentOS 6, and may be one of the reasons it was not the default in those versions), but it's now the best of the "traditional" filesystems.

> From Btrfs, GlusterFS, Ceph, and others, we know that it takes 5-10 years for a new filesystem to mature.

Those are bad examples - they are all significantly more complex than XFS/ext. Two of the three are distributed filesystems that aren't solving any of the same problems.

However, their inclusion in the article is worth noting, even if the author put them in the wrong paragraph. Increasingly, large volumes are becoming distributed over lots of individual servers, with technologies like glusterfs and ceph. Both of these, and some of their competitors too (xtreemfs is also really good, despite the silly name), use traditional filesystems on the underlying server volumes that are presented as a large distributed FS. XFS is generally used for this task - and unless an individual node got larger than ~ 8EB, there's currently no reason to change this.

The real question, then, becomes - will a single server need a local FS larger than 8EB by 2025-2030? Possibly not. It's very dangerous to say "X is all anyone will ever need", but I think that we're going to increasingly see bulk storage go the same way that CPUs did - instead of a single huge local FS (analogous to increasing single-core clockspeed), we'll see an increasing number of storage nodes combined into one, via a distributed filesystem (analogous to higher core count).

Part of the reason for this is that there are quite a few disadvantages with having very large volumes in one place. RAID rebuilds become unreasonably long, if you use RAID, and RAID rebuild speed is not currently keeping pace with storage growth. If you don't, and solve redundancy with multiple nodes, then the bigger the individual node is, the larger the impact if it fails. At some point, you're also going to need to shift data off that box, and while we're likely to have 100GbE server ports around then[1], even 100GbE is going to take an unreasonably long time to move 8EB anywhere.

1: https://www.nanog.org/meetings/nanog56/presentations/Tuesday...

EDIT: Just did some quick calculations. Assuming a 100 Gigabit server port, that's 12.5 Gigabytes/s. At 12.5 GB/s, it would take over 20 years to transfer 8EB anywhere. The idea that we're going to have 8EB of data on an individual server in that timeframe is starting to look a bit silly, with that in mind - how's it going to get there? What on earth would you do with it once it is there?

Even if by some magic we invent 1TbE and make it cheap enough to use on servers (and invent 10+TbE for the network core) by 2025, that would still take over 2 years to fill the disk. Yes, sorry, but this is just silly. 8EB across lots of individual servers? Sure. But all on one server in a regular filesystem? Not going to happen any time soon.
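
The back-of-the-envelope numbers above can be reproduced with a short sketch. It assumes decimal units (8 EB = 8 * 10^18 bytes), a link running at full line rate with no protocol overhead, and a 365.25-day year; the names are made up for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def transfer_years(capacity_bytes, link_bits_per_second):
    """Years needed to move capacity_bytes over a link running at full line rate."""
    bytes_per_second = link_bits_per_second / 8
    return capacity_bytes / bytes_per_second / SECONDS_PER_YEAR

EIGHT_EB = 8 * 10**18  # assumption: decimal exabytes

for name, bits_per_second in [("100GbE", 100e9), ("1TbE", 1e12)]:
    print(f"{name}: {transfer_years(EIGHT_EB, bits_per_second):.1f} years")
```

That gives a little over 20 years for 100GbE and a little over 2 years for 1TbE, and real-world throughput would only make it worse.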
