Linux Storage, Filesystem, and Memory-Management Summit (lwn.net)
148 points by l2dy 6 days ago | 56 comments


Favorite quote from the articles so far about BPF:

Gregg started with a demonstration tool that he had just written: its immediate manifestation was in the creation of a high-pitched tone that varied in frequency as he walked around the lectern. It was, it turns out, a BPF-based tool that extracts the signal strength of the laptop's WiFi connection from the kernel and creates a noise in response. As he interfered with that signal with his body, the strength (and thus the pitch of the tone) varied. By tethering the laptop to his phone, he used the tool to measure how close he was to the laptop. It may not be the most practical tool, but it did demonstrate how BPF can be used to do unexpected things.

Brendan Gregg is also the guy who shouted at a rack in a datacenter to prove that some hard drives were sensitive performance-wise to vibrations. The video is on youtube and it's hilarious.

I was surprised to see discussion of NFS. NFS certainly was a big deal "back in the day" but it had its own quirks and headaches. I haven't seen NFS in 20 years, but that could simply be because of the particular worlds I live in.

Is it still widely used and I just happen never to see it because of the environments in which I work?

Or is it only used for a small number of sites (or certain applications) but they happen to be extremely important ones?

AWS EFS[1] uses NFS; our team used it last year for some internal infrastructure. NFS is still around. I've also been looking into more durable storage this year, and it looks like the Rados gateway[2] in Ceph supports an NFS front end as well.

1: https://aws.amazon.com/efs/

2: http://docs.ceph.com/docs/master/radosgw/nfs/
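For reference, mounting EFS is just a standard NFSv4.1 mount from the client's point of view. A sketch of an /etc/fstab entry, using the mount options AWS documents for EFS (the filesystem ID `fs-12345678` and the mount point are placeholders, not from the post):

```
# /etc/fstab entry for an EFS filesystem mounted over NFSv4.1.
# fs-12345678.efs.us-east-1.amazonaws.com and /mnt/efs are hypothetical.
fs-12345678.efs.us-east-1.amazonaws.com:/  /mnt/efs  nfs4  nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport  0  0
```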

I would be shocked if there's a single Fortune 500 company that doesn't have a significant NFS footprint. Everything from user shares to Oracle or VMware gets deployed via NFS.

NFS is increasingly becoming an access protocol, to speak to $whatever through some sort of proxy/gateway. $whatever might be Gluster (for which I'm still a maintainer), Ceph, EFS, etc. It's nice because clients are everywhere, they're maintained by someone else, and you can update your own stuff as much as you like without landing in the swamp of updating client software. The people who do maintain those clients - I know a few - deserve our thanks.

I'm curious what your domain is and what tech you've been using?

Every Linux shop I've worked in over the past 10+ years has used NFS without any mention of an alternative. The few places I've worked that mixed NFS with SMB, or used SMB exclusively, had performance issues when jumping between the two (perhaps due to my experience and configuration).

My work has mostly been in Linux shops with 10s-100s of workstations+servers. NFS was used for homedirs and shared data. Sometimes servers were hundreds of miles away.

> I'm curious what your domain is and what tech you've been using?

It’s good to see these replies as everybody’s experience is different and I learn by asking.

Perhaps NFS has improved a lot; I started using it in the 1980s after using client/server filesystems like IFS and Alpine at PARC and the per-file protocols we had at MIT for the PDP-10s and Lisp machines. With NFS, locking and such were painful because it tried hard to look no different from a local filesystem, but couldn't be. Network-mounted homedirs were common to me from the early 80s, but under NFS they were too painful.

I imagine things have improved over the last couple of decades!

The environments I’ve used more recently have used a combination of replication (e.g. Dropbox), moved the computation to the data (“cloud” though for me typically this has meant in-house someplace rather than a third party or shared machine) or hybrid (e.g. IMAP).

Datasets tend not to be super huge — < 50 TB — so other approaches are used at the back end.

HPC sysadmin here. Our cluster is built around NFS for storage. User home directories are mounted on all compute nodes via NFS. Research datasets are mounted on compute nodes via NFS. It is simple and stable and works very well within our internal network.
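The setup described above is typically only a couple of lines on each side. A sketch, with made-up hostnames, paths, and subnet (not from the post):

```
# Server side, /etc/exports: export home directories to the compute nodes.
/export/home  10.0.0.0/24(rw,sync,no_subtree_check)

# Client side (each compute node), /etc/fstab: mount them over NFS.
nfs-server:/export/home  /home  nfs  rw,hard,vers=4.2  0  0
```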

My bias shows here, but what's your alternative to NFS from *nix? We used AFS at school. I suppose you could keep your data locally and be responsible for your own backups / use version control as backups.

CIFS / SMB. The client is in most kernels these days, and the server (Samba) is easy to install and use. In my experience performance is acceptable (that is, I'm getting >110MB/s on a 1Gb connection), locking semantics, if you need them, work much better than in NFS, and most importantly, it doesn't hang as badly when a server goes away. NFS behaves like it's 1980 in that respect: the recommended and only consistent way to unhang processes when the server goes away is, I kid you not, to start a local NFS server on the client and add the IP address of the dead server locally to the client. Then one of the retries will get an error from the local server; at that point you can deal with it, shut down the local NFS server, and remove the extra IP address.
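That "decoy server" trick, as a command sketch (the address 192.0.2.10 stands in for the dead server's IP; requires root, and the service name assumes a systemd distro):

```shell
# 1. Claim the dead NFS server's IP address on the loopback interface.
ip addr add 192.0.2.10/32 dev lo

# 2. Start a local NFS server; the client's retries now reach it and get
#    an error instead of retrying forever, unblocking the stuck processes.
systemctl start nfs-server

# 3. Clean up once the stuck processes have errored out.
systemctl stop nfs-server
ip addr del 192.0.2.10/32 dev lo
```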

I'm actually curious to try that next time it comes up, but with automount + NFSv3 if a server goes down and isn't expected to come back up I can 'umount -l' and kill the hung process.

With CIFS/SMB throughput wasn't the issue, but dealing with small files seemed to be. With NFS, most places served software packages off of it, and whenever this was tried with SMB it was unreasonably slow. I'm ignorant enough of the implementation that I could believe this was a configuration thing.

> I can 'umount -l' and kill the hung process.

I don't use automount, but unless automount does some crazy magic ... this doesn't really help; umount -l (or umount -lf) indeed removes the mountpoint from the filesystem (so no new processes can access it and get stuck), but the kernel thread is still stuck waiting for an answer that will never come, and many processes cannot be killed even with "kill -9".

You are right. In practice it generally works for our needs--I haven't looked deep enough at the problem to defend every reason why. With the mount in place, things like 'df' hang indefinitely, whereas once they're unmounted people can work again. I think automount is doing a lot of the heavy-lifting here because I guess next time you navigate there it notices the NFS server is down and returns an error instead of hanging the new process.
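One way to avoid tools like 'df' hanging on a dead mount is to bound the wait yourself. A small runnable sketch (the helper name `probe_mount` and the paths are my own, not from the post):

```shell
# probe_mount: report whether a path answers a stat within 2 seconds.
# On a hung NFS mount, a plain `stat` (like `df`) can block indefinitely;
# `timeout` from coreutils bounds the wait so the shell stays usable.
probe_mount() {
    if timeout 2 stat -- "$1" >/dev/null 2>&1; then
        echo "responsive: $1"
    else
        echo "stale or missing: $1"
    fi
}

probe_mount /             # a healthy local filesystem
probe_mount /no/such/fs   # reported instead of hanging the shell
```

Note that this only protects the probing process; as discussed above, processes already blocked inside the kernel on the dead mount stay stuck.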

Usually, whatever old process that is hanging onto the mount gives up or dies. If not, often a "kill -9" takes care of it. I have had these processes get stuck indefinitely. I was usually doing something stupid. For example, mounting a USB drive over NFS for a user. They got trigger-happy and pulled the USB drive without unmounting it or informing anyone. Effectively, I just considered that mount point "burned" on the client until I could reboot.

I'm not sure how other people use network mounts. In general, they're expected to be up, and any changes or removals will be part of a maintenance window. Changes would go through a series of steps to avoid this scenario (stop new mounts, kill existing processes/mounts, update). Sure, it's a bit of a pain, but it doesn't seem too unreasonable in practice. My experience has mostly been with NFSv3; for all I know newer versions address this better.

VMware supports NFS as a storage backend so it ends up being used in a lot of storage arrays in Enterprise shops.

Gitlab.com was also using it as of last year, but I’m not sure if that’s still the case.

Yes, it has now been replaced by Gitaly: https://about.gitlab.com/2018/09/12/the-road-to-gitaly-1-0/

From the drawing, it looks like gitaly decouples git storage from the workers, but still uses NFS on the backend.

VMware over NFS may be used in SMBs (small/medium businesses); enterprises use iSCSI or SANs.

When the NFS server crashes, bad things happen, and if you are building a redundant multi-master NFS setup, it's a lot simpler to use iSCSI and an FC SAN.

At my workplace we run VMware over NFS served by Netapp clusters. It's very stable, performs well, is easy to manage and has functioning HA. We used to have a SAN but Netapp worked well enough that it was decommissioned.

My favourite moment operating one cluster was when I intentionally caused a kernel panic and core dump on a live production system to gather debug info for an issue we were having. It had zero observable impact. :) Netapp is not cheap, but they seem to know what they're doing.

I used to work for a hyper converged infrastructure provider which had an option to aggregate all the disks in a computing cluster and expose it as an NFS filesystem. You can then enable fault tolerance, compression, encryption etc on a file system level.

I was surprised more because I thought it was pretty well settled. I use it in my home LAN to access file shares (Linux desktops and a Linux server). Would it be used between *nix servers and clients? Is it something that would be useful on the home LAN?

NFS is absolutely still in use at my workplace - there's a lot of older infrastructure still in place that makes use of it. Lots of AFS as well.

I believe my alma mater's CS department is still using NFS for home directories last I checked.

Yep, it is still widely used as smb/cifs is a horrible protocol and it should feel bad.

There are some issues with nfs(v4) on the bleeding edge by default, but it has been really cool to see some of the nfs workarounds to deal with big data.

I use NFS for a media/backup server since I don't want to bother with Samba.

Btrfs isn't even on the agenda.

Because it's a summit to talk about the whole ecosystem, not one filesystem. We btrfs developers talk enough to each other; we don't need to bore the rest of the attendees with our own topics. That's what the hallway tracks are for. I put together the schedule for the filesystem side of this year's conference, and I specifically don't put single-fs issues on the general schedule.

Is btrfs as a project even healthy? A startup I worked at looked hard at btrfs, since the COW model and better integrity verification were both very, very useful from our POV. But btrfs in practice was simply unsuitable: not only were there basic reliability issues (corrupted filesystems) and really bad corner-case behavior (full filesystems in particular), we noted that it was not actually fully endian-neutral, at least at the time, which caused filesystems shared between x86 and a BE ISA to appear to work, and then to be horribly corrupted if taken from a BE embedded system and mounted on x86.

Given the volume of btrfs negative experiences that are going to need to be overcome (for example, many of the posts in [1]), maybe people have just given up? If ZoL licensing wasn't a problem, would anyone even be interested?

[1] https://news.ycombinator.com/item?id=15087754

I routinely hear about how btrfs is unreliable, unstable, and corrupts your data.

But it's also been the default filesystem for SUSE Linux and Synology NAS products for a long time now, and they don't seem to be having any problems.

I don't know what to believe.

I think it's good to get more data out there, positive and negative, so I'll add: I'm using btrfs on a home server (and sometimes desktop) for half a year now and haven't had any issues. It's fast (on nvme ssd), I can play with snapshots. It is admittedly a very lightly-loaded system.

I filled up the root filesystem by accident once, which is supposed to be a very bad situation for btrfs. I was able to ssh in, diagnose stuff, delete the offending files, and then I did a minor rebalance to clean up some of the unnecessary metadata space that got allocated (not essential but I wanted to try it). No big deal and still works fine.
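For anyone hitting the same filesystem-full situation, the recovery described above is roughly the following (requires root; the `-dusage=10` threshold is a judgment call, not a fixed recipe):

```shell
# Show how space is split between data and metadata chunks.
btrfs filesystem usage /

# Rewrite only data chunks that are less than 10% used, returning the
# freed space to the unallocated pool; much cheaper than a full rebalance.
btrfs balance start -dusage=10 /
```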

I haven't used any btrfs raid features. I mirror it (via daily rsync) to an ext4 filesystem on another device to guard against filesystem-related bugs and device failure.

I've been using btrfs for a year on my laptop and there have been no problems. I did some research beforehand, as I considered ZFS too, but it seems most btrfs criticism is about old versions that did have bugs, or about configurations with known flaws (raid5/6, see [0]).

ZFS, as much as I like it, will unfortunately never be part of the kernel and users will constantly be put into this kind of situations: https://www.phoronix.com/scan.php?page=news_item&px=ZFS-On-L...

[0]: https://btrfs.wiki.kernel.org/index.php/Gotchas

ReadyNAS products have also been using it for even longer.

They don't use the RAID support from BTRFS (they do it themselves in software a level up, using mdadm); only the snapshots, COW, and general storage.

There very occasionally (based on forum posts) seem to be issues with metadata fragmentation for some people when there's a lot of metadata, but otherwise it seems very reliable (I've got two myself).

Btrfs is reliable and stable - when you're running it on a single disk. That tends to be the default mode for operating-system installs that use Btrfs. I use it on all my single-disk installs to get checksumming, COW, and transparent compression.

It also seems to be pretty solid in enterprise-friendly RAID10 software mode. Anything outside of that should be treated with extreme caution and avoided for anything you can't afford to lose. For RAID5/6 in particular, as far as I know, little has changed since the discovery of a major corruption-creating bug (which supposedly was going to require a complete rethink of major portions of the Btrfs architecture). It's an embarrassing situation, but no one seems to want to work on it because it would require a lot of work to fix, and anyone who's paying people to work on Btrfs isn't going to run RAID6 in production anyway.

It's unfortunate, but the real best advice to anyone who wants software RAID and doesn't know exactly what they're getting into is "just use ZFS".

Facebook runs many thousands of btrfs filesystems every day, and it seems to work fine for them. They have helped iron out a lot of bugs (and developed several features for it).

I personally have been using it for maybe 7 years and it works just fine, and snapshots are awesome. To me, btrfs has been more stable than ext3/4 were back when I used them.

Anecdotally, I've found btrfs unreliable. Here is a comment I made a week ago:


Personally, BTRFS feels like it has a ways to go before it's ready for prime-time.

I've had two major and one minor BTRFS-related issues that have scared me away from it.

1) One of my computers got its BTRFS filesystem into a state where it would hang when trying to mount read/write. What I suspect is that there was some filesystem thing happening in the background when I rebooted the machine. I rebooted via the GUI and there was no sign that something was happening in the background, so this was really a normal thing that a user would do. No amount of fixing was able to get it back, but I was able to boot from the installation media, mount it read-only, and copy the data elsewhere.

2) Virtually all of the Linux servers at work will randomly hang for between minutes and hours. This was eventually traced to a BTRFS-scrub process that the OS vendor schedules to run weekly. The length and impact of the hang seemed to be based on how much write activity happens - servers where all the heavy activity happens on NFS mounts saw no impact, but servers that write a lot of logs to the local filesystem would get severely crippled on a weekly basis. We've moved a bunch of our write-heavy filesystems to non-BTRFS options as a result of this.

3) This is a more minor issue, but still speaks to my experience. I had a VM that was basically a remote desktop for me to use. Generally speaking it would hang hard after a few days of uptime with no actual usage. When I reinstalled it on a non-BTRFS (sorry, can't remember which filesystem I used) filesystem it was rock solid. I have no proof that this had anything to do with BTRFS.

All of these were things that happened around a year ago, they may not be a true representation of the current state of BTRFS. But they've burned me badly, so now any use of BTRFS will be evaluated very carefully.

In contrast, I've been running ZFS on a couple of FreeBSD servers, with fairly write-heavy loads, and have had no issues that were filesystem-related. Even under severe space and memory constraints ZFS has been rock solid.


The first problem is directly attributable to BTRFS. There is no way a filesystem should get corrupted by a simple user-initiated reboot, regardless of what the system is doing in the background.

The second problem is a combination of BTRFS and the distribution. The distribution added a weekly job which did a BTRFS scrub (IIRC), under certain workloads that would completely hang machines for minutes to hours. The time this ran seems to be based on when the OS was installed, so as luck would have it these brought production systems down during business hours.

The third problem is something I have no idea about. It could be BTRFS, it could be something completely different, I honestly have no idea.

>One of my computers got its BTRFS filesystem into a state where it would hang when trying to mount read/write. What I suspect is that there was some filesystem thing happening in the background when I rebooted the machine.

>The first problem is directly attributable to BTRFS. There is no way a filesystem should get corrupted by a simple user-initiated reboot, regardless of what the system is doing in the background.

It doesn't seem like you have any direct evidence that this has something to do with rebooting. In fact, Btrfs is much less susceptible to these sorts of problems than previous file systems. That's because writes in Btrfs are atomic, so a file is never "partially" written to disk. Either it's written, or it isn't. You can't get disk corruption from write failures.

What I suppose might have happened is that you rebooted the system while in the middle of installing a bootloader or kernel update. I don't know if you tried mounting the partition r/w from another system or not, but assuming you didn't, it's probably more likely that something broke on your system that prevented it from remounting the partition during the boot process.

I know that updates were not installing when I rebooted. It may have been checking for updates, but it definitely wasn't installing them.

To recover, I booted from an installer image on a USB drive and tried to mount the partition R/W and it hung. Since the installer is a known-good environment, this rules out breaking my system the way you describe. This was a corrupt BTRFS filesystem.

One possibility is that because I was using a rolling-release distribution, I may have gotten a version of BTRFS with a bug in it or minor changes in the BTRFS code as the system got updated eventually got the filesystem into a state that rendered it unable to mount R/W. Either scenario doesn't inspire confidence.

"can't" is a very strong statement.

Wow, did we work at the same startup? We ended up with a ZoL cluster instead, with better but not great stability, and less than half the performance for latency-critical loads. But at least it was correct.

Not sure what you are talking about, it’s definitely endian neutral. You can’t take different page sized fs and put it on another page size arch, but that’s a separate issue.

2013 MIPS on Cavium as big endian -vs- Intel x86-64; appeared to work, did not. Page size not an issue, appeared to work, occasionally crazy with dumps containing values that are recognizably scrambled.

It's been a while since then, but that was our experience. I don't have access to the records or email from the time. Maybe it's been fixed.

The filesystem track at LSF/MM tends to focus on topics which impact more of the kernel than one specific filesystem, and we didn't have any of those topics for Btrfs this year. Plus, only a couple of the Btrfs developers were at LSF/MM this year, and we typically sync up via other means.

Your point being?

Neither btrfs nor the newer bcachefs being on that page is kind of sad. Linux is supposed to be the #1 operating system / kernel, and Apple has deployed a next-gen filesystem by default, but in the Linux world nothing seems to be happening beyond some people who still believe btrfs has a future, and one guy implementing bcachefs who has to beg for money on Patreon because apparently no company wants to hire him to work on it.

In many ways ext4 is "good enough", but there are sweet, juicy advantages to copy-on-write, checksummed filesystems. I guess all of the enthusiasts are using ZFS, which is a work of art, but it has licensing problems AND its IP is owned by a law firm with some engineers attached that was kickstarted by the CIA.

I felt like I had too much going on at the time when conference proposals were going out to attend, so that's why bcachefs isn't mentioned. But that's not for lack of progress; upstreaming is actually getting really close.

Wow, that's amazing. Looking forward to it!

> IP is owned by a law firm with some engineers attached that was kickstarted by the CIA.

I use ZFS and know a bit of the history, but it took me a while to realize that meant "Oracle". I didn't realize the CIA was their first customer (something I found while googling just now), but I suppose that's not too surprising. Although to be fair, I am not sure having the CIA as a client in 1977 means much for ZFS.

A lot is happening around XFS, LVM, device mapper, and also the Stratis project.

Also, a thin LV with XFS will give you many if not most of the benefits of Btrfs/ZFS, but with stability, and it's available on any reasonably old Linux distro.
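A sketch of that setup with LVM thin provisioning (requires root; the volume group `vg0`, names, and sizes are placeholders):

```shell
# Create a thin pool inside an existing volume group "vg0", then a
# thin (overcommittable) volume inside it, and put XFS on top.
lvcreate --type thin-pool -L 100G -n pool0 vg0
lvcreate --type thin -V 500G --thinpool pool0 -n data vg0
mkfs.xfs /dev/vg0/data

# Snapshots of a thin volume are cheap, Btrfs/ZFS-style.
lvcreate --snapshot --name data-snap vg0/data
```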

I don't think ZoL[1], a ground-up re-implementation of ZFS, has those licensing/IP issues. We've been using it in production problem-free for a year now (on Ubuntu 18).

[1] https://zfsonlinux.org/

> ZoL[1], a based on a ground-up re-implementation of ZFS

ZFS on Linux is not a re-implementation of ZFS.

ZoL is a port, not a reimplementation at all, and retains all of the licensing problems. Ubuntu ships it because they have a different reading of the licensing situation than many involved parties.

It would have been nice if btrfs would be discussed, since it's supposed to be the next generation filesystem for Linux.

since... 10 years soon?

> System observability with BPF

Initially I thought this was about recent Spectre vulnerability variants related to BPF. Then I found it is actually discussed in BPF: what's good, what's coming, and what's needed (https://lwn.net/Articles/787856/)

Are videos of this conference available, or are they going to be available?

No video recording was done, so no.
