
How we spent two weeks hunting an NFS bug in the Linux kernel - fanf2
https://about.gitlab.com/2018/11/14/how-we-spent-two-weeks-hunting-an-nfs-bug/
======
drewg123
I spend my days chasing bugs like this in the FreeBSD kernel, and make heavy
use of dtrace. I expect that using something like bpftrace(1) might have
accelerated their debugging as compared to inserting stack traces and
prints...

(1): [http://www.brendangregg.com/blog/2018-10-08/dtrace-for-
linux...](http://www.brendangregg.com/blog/2018-10-08/dtrace-for-
linux-2018.html)
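For example, a bpftrace one-liner in the spirit of what I mean (the probe
point here is just illustrative; it needs root and a reasonably recent
kernel with bpftrace installed):

```shell
# Print the process name and kernel stack every time something fsyncs a
# file, instead of patching printk/dump_stack calls into the kernel and
# rebuilding per iteration:
bpftrace -e 'kprobe:vfs_fsync { printf("%s %s\n", comm, kstack); }'
```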

~~~
lbotos
We're getting there! :) I've asked my team to balance learning K8s with
learning lower-level debugging tools like bcc and its cohort.

~~~
sofaofthedamned
That's awesome. I see too many younger staff who know the new hotness but not
the fundamentals, and not enough employers care. I've had to strace a K8s
cluster before, where it turned out the problem was in kube-dns.

We are relearning the same lessons we got wrong in the 90s once we had nice
libraries and middleware: we forget the platform we're building on.

------
atq2119
It's amazing how much we take the power of free and open source software for
granted these days.

Imagine this same scenario if GitLab was using a closed source operating
system. Would they have been able to track this down? Quite unlikely, but
maybe if they were even more persistent and got lucky. Would they have been
able to fix it? Absolutely not. They'd be at the mercy of the vendor.

~~~
erikb
Just to give a picture of how similar situations play out in closed-source
environments: they would have a key account manager whose life they would
turn into hell. And depending on the size of GitLab's business relative to
the software provider's other customers, that would have more or less effect
on the software's development. Reactions might range from getting a fix in 2
days, plus on-site engineer/consultant visits, plus the corresponding GitLab
manager being invited to a $100+/person dinner, all the way to not getting a
response at all, or even getting an angry call from the provider's manager
saying the account will be closed if GitLab doesn't behave.

So GitHub, owned by Microsoft, might not even notice such a bug for long,
since an engineer could get it fixed with an email or two, while the
pre-acquisition GitLab might have gotten no fix at all.

In that regard one might argue that closed source doesn't completely destroy
the ability to solve problems, but open source certainly helps balance out
the odds for different competitors.

~~~
pjc50
This very much depends on how big you are and how much you're willing to
spend. At a medium-size all-Microsoft shop with 100+ MSDN licenses, I found a
bug in WinCE 7's handling of "structured exceptions", and eventually got them
to _acknowledge_ it, but they never fixed it.

(I got as far as disassembling their DLLs to point to the exact problem.)

~~~
uep
Yeah, I don't really buy the parent comment's insinuation that Microsoft
GitHub would be more likely to get it fixed than GitHub previously. My
company has shipped hundreds of thousands of devices with a Microsoft OS on
them, and they virtually never fixed anything reported. At least one bug was
very serious, and we _thought_ they would have to fix it, but it just didn't
happen.

What's hundreds of thousands to the hundreds of millions they ship every year?

~~~
timerol
I think you misread the comment. Assuming that NFS was sold by NFS-Co, GitHub,
as part of Microsoft, would be a big enough customer to get NFS-Co to fix the
bug quickly. However, GitLab would be a tiny NFS-Co customer, and so the bug
would have gone unfixed. The difference is in the size of who reports the bug.

------
floatingatoll
Representing such an example of work in a job application / resume / interview
would be more valuable to me than a college degree. Due diligence and
persistence — in the face of real-world, difficult, hundreds-of-moving-parts
technical issues — are worth every penny.

EDIT: Yes, college degrees require due diligence and persistence, but they
offer no indication of the willingness to exercise those skills _after_
college. This work does.

~~~
idclip
i recently graduated and started work.

academic achievements, relative to actual work, are a drop of piss in the
ocean. my experience at university actually made me lose respect for
academics.

edit: whoever is downvoting is romanticizing the achievements of scientists
of yore, or thinks that MIT is the norm.

no, for the most part it's publish or die. i've heard professors refer to
students as "harvest" and laugh while copying slides off of google. i've seen
professors lie their way into grants.

all that contrasts with how the industry actually works, what it needs, and
what it actually requires the universities to produce.

yeah, i've become an achievement-oriented cynic. titles truly only make me
think less of a person if that's all they have to impress with.

~~~
Angostura
You should name your university. So others can avoid it. Because not all are
like that.

~~~
idclip
i don't believe there is sense in naming it so people can scapegoat it and
solve things by avoiding one bad university, because i think the BA/MA system
is a road paved with good intentions. but it's leading us to hell.

naming any singular entity would just make us think it's them to blame, and i
believe the problem is endemic.

i once sat at a table with PhD students and complained about the quality of
introductory courses, where i was then sternly put back in my place with "a
university does not prepare for work! it prepares for research!"

i told him someone should tell that to all the students enrolling in CS in
hopes of careers.

maybe not all are like that, but it's more likely most. or maybe it's this
cynicism.

solving this isn't easy. i would just like to open a school myself, and offer
guidance / support for people struggling as i did myself back then.

and my advice to most people who want to pursue CS is to do it via
apprenticeship and later approach a technical university.

the drama/pity is that this is a process that starts at 17, when we are most
clueless.

~~~
pjc50
> "a university does not prepare for work! it prepares for research!"

This is what universities have always believed they are for, PhD courses in
particular. There used to be a separate category of school that was both
technical and employment focused; in the UK these were called "polytechnics",
in the US they would be things like the Agricultural and Mechanical College
of Texas. For complex reasons, due to both the class system and the accidents
of history that caused a lot of startup founders to come from places like
Stanford, they have become unfashionable.

~~~
idclip
that's actually interesting!

i have a feeling they will make a comeback with a vengeance, but maybe not in
our time.

today i just do what i can when a youth comes to me for advice.

------
mrunkel
Nice write up. It reminded me of the time I spent several weeks once trying to
get OpenBGPD to work on OpenBSD.

First I tried getting some test VMs up and talking to each other. When I
couldn't get that to work, I set up a few physical boxes to test it out. When
that didn't work either, I started debugging the code. A few strace runs and
some routine C debugging work later, I found a bug that would prevent _any_
BGP connection from ever establishing.

A quick post on the OpenBSD listserv and the problem was fixed within a day.
(Wow, that was almost 10 years ago?! How time flies.)

[https://github.com/openbsd/src/commit/13fba73cec6be16d64c86e...](https://github.com/openbsd/src/commit/13fba73cec6be16d64c86e03a04dc4a8a26d46f6)

We ultimately went with VyOS (back then called Vyatta) and Quagga but it felt
good to find a bug like this.

Most of the work went into confirming that there was an actual bug. Finding
where the bug was and fixing it was relatively trivial.

------
saketuec
Great post! In my own experience working with NFS version 4 servers, we
discovered several bugs that have actually been fixed in the latest kernel
versions. The unfortunate thing is that most enterprises still run old CentOS
/ Red Hat kernels that, although stable, lack several of these fixes.

~~~
asveikau
I don't have a lot of experience with NFS aside from a few machines that
don't see insane use, but it's surprising to me how v4 implementations seem
to introduce such instability. I had an experience a few years ago with a Mac
client where quitting vim would cause a kernel panic. NFS v3 did fine.

------
bostonvaulter2
Excellent write-up, I like how they briefly summarized each section so you
knew what to expect (I find it helps with understanding). Including the false
path is very nice as well since those are very common when debugging.

~~~
danso
Completely agree, finding and fixing the bug is commendable enough. But being
so thorough as to provide a free educational lesson to everyone else about
professional debugging is the kind of self-promotional material I definitely
want to read.

------
stuxnet79
GitLab has a strong engineering team. I appreciate this article. For those
with experience, what's the best approach to introducing a documentation /
"writing up a post-mortem culture" into a company that traditionally doesn't
value these things?

~~~
dsumenkovic
At GitLab we really care about our culture and core values [1]. As freddie
said below: "Start doing it, celebrate it, reward it". If you are not sure
how to make the first move, I would say that transparency is what really
pushes everyone forward.

Start iterating on transparency; it may be hard, but you will see great
results and it will make everyone around you collaborate much more.

You've probably heard of the event [2] which occurred almost 2 years ago -
people are still talking about it, and we are really happy and impressed to
see everyone, including us, learning from that experience.

I hope this non-technical suggestion helps you think about a solution to your
question. If your team is not used to this kind of openness, eventually they
will like the positive feedback from the community (we see that as a small
iteration :-)). The comment section at [2] may be an extra source of
motivation.

Have a nice day,

Djordje - Community Advocate at GitLab

[1]
[https://about.gitlab.com/handbook/values/#transparency](https://about.gitlab.com/handbook/values/#transparency)

[2] [https://about.gitlab.com/2017/02/01/gitlab-dot-com-
database-...](https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-
incident/)

------
batbomb
I had been experiencing a similar bug that we reported to Red Hat after being
stumped. It started occurring out of nowhere, but it would happen in only
.01% of the jobs we launched into a batch farm. We launch about 15k batch
jobs a day, and that was enough to be a problem.

Before a job was launched, a daemon pre-staged some job contents (logfiles,
env, etc.) and started writing out a job summary file. Then the job would
start and continue writing to one of the files, which would become corrupted.

It ended up being this bug: [https://www.spinics.net/lists/linux-
nfs/msg41335.html](https://www.spinics.net/lists/linux-nfs/msg41335.html)

~~~
tinus_hn
Too bad that kind of thing is difficult to catch using static analysis.

------
alexeiz
Git on NFS has proved to be full of annoyances. Though I haven't seen any Git
repo integrity issues on NFS, normal Git operations can be so slow on NFS
it's infuriating (especially when your repository is sufficiently large). Do
you like that fancy Bash prompt showing 'git status' for the repository
you're currently working with? Forget about it if you're on NFS. Or get used
to waiting a couple of seconds after each Bash command while it's blocked on
'git status'. The solution is just to avoid NFS altogether and work with Git
repos on a local filesystem.
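One mitigation for the prompt problem is to have the helper bail out before
touching git at all when the directory is on NFS. A sketch (the function
name is made up, and the `stat -f` flags assume GNU coreutils on Linux):

```shell
# Hypothetical prompt helper: only ask git for the branch when the
# current directory is on a local filesystem.
__prompt_git() {
  case "$(stat -f -c %T . 2>/dev/null)" in
    nfs*) return 0 ;;   # on NFS: print nothing rather than block the prompt
  esac
  git rev-parse --abbrev-ref HEAD 2>/dev/null
  return 0
}

# Used from the prompt, e.g.:
#   PS1='\w $(__prompt_git)\$ '
```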

To add insult to injury, this is an example of how people like to work with
Git repositories at our company:

* Clone a Git repo into $HOME to work with it on different Linux hosts. $HOME is an NFS automount so that you have the same home environment on any host you log in to.

* $HOME is also exposed to Windows desktop machines via SMB. So convenient, right? You can edit source code in your favorite Windows IDE now!

Imagine their surprise when they make yet another Git commit with garbage in
it: CR/LF and file mode bits are all messed up. Sometimes a file change on
Windows takes a long time to propagate to NFS, or, worse yet, there can be
garbage at the end of the file. Combine this with the common practice of
committing with 'git commit -am' without even looking at the diff and you get
a recipe for disaster.

~~~
tracker1
I wish that Git for Windows would just change the default line endings to \n.
Most modern editors should also make this shift as a default, or, if there's
no \r\n combination in an existing doc, just use \n.
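Until that happens, the standard git knobs cover it (nothing exotic here,
just the usual settings):

```shell
# Commit LF, leave the working tree untouched, regardless of platform:
git config --global core.autocrlf input

# Or pin it per repository so every clone and editor agrees, by adding
# a .gitattributes file containing:
#
#   * text=auto eol=lf
printf '* text=auto eol=lf\n' > .gitattributes
```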

I generally set this as the default, and sometimes forget on a new machine...
I tend to prefer tools/programs that work on Windows, Mac, and Linux even if
they're not quite as good, so it matters less where I am. I use a Windows
keyboard on the Mac and change the mapping... the only gotcha is when I need
^C in a terminal on the Mac; the muscle memory screws me up sometimes when
switching between working at home (Mac or Linux) and working at work
(Windows).

Some quirkiness with Git's Bash on Windows (my default shell) gets me
sometimes too.

------
rawoke083600
I love these posts! Once or twice I was lucky enough to have had somewhat
similar intense debug-puzzles to solve myself. Wonderful times :) strace is a
godsend!

------
virtualized
I always wonder why Linux doesn't seem to have any kind of tests. How can they
afford not to have regression tests for bugs they fixed? How do they know that
this bug fix didn't break anything? What does "never break userspace" even
mean if there is no way to check whether userspace has been broken?

~~~
jibal
[https://stackoverflow.com/questions/3177338/how-is-the-
linux...](https://stackoverflow.com/questions/3177338/how-is-the-linux-kernel-
tested)

------
raverbashing
To be honest, NFS is usually more pain than it's worth. (But hey, at least
it's not iSCSI.)

Yes, please, do let the default be that I can't unmount a filesystem from a
server that died.
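For anyone stuck there, the usual escape hatches on Linux look like this
(paths are made up; note that `soft` trades hangs for possible I/O errors,
so use it with care):

```shell
# Detach a mount whose server is gone:
umount -f /mnt/nfs   # force; can still block on a default "hard" mount
umount -l /mnt/nfs   # lazy: detach from the tree now, clean up later

# Or opt out of hang-forever semantics up front at mount time:
mount -t nfs -o soft,timeo=50,retrans=3 server:/export /mnt/nfs
```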

~~~
steve19
Does anyone know why NFS is such a pain? In the past (10 years ago) I just
assumed I was doing it wrong and stopped using it, and have not used it since.

~~~
jclulow
In my experience, the quality of NFS client implementations varies
significantly between different operating systems. We made _heavy_ use of NFS
for home directories and application backing stores at the University where I
used to work, and it was a very good experience -- but this was on Solaris 10
(and later, OpenSolaris) machines. We had heavy NFS client use on many multi-
user machines (shell servers, Sun Ray servers, etc) and didn't see reliability
problems. On the odd occasion that we needed to reboot the file server for
updates, clients would pause and then resume promptly after the server
rebooted.

Towards the end of my tenure there, I gave a Linux desktop a try. The NFS
experience was amazingly bad by comparison; lots of issues with locking, with
becoming disconnected (often until a reboot) from NFS servers, odd performance
issues, reliability issues with the automounter, etc.

In the last few months I have tried the NFS client on my current Linux desktop
again, thinking things might have improved -- they have, I guess, but not by
much. It's still pretty easy for the client to get into a hung state if
there's too much packet loss, or if the file server reboots, or whatever. I
have to imagine that not enough people are really using Linux NFS clients in
anger to drive fixing the issues with it. There is often no escape from the
Quality Death Spiral.

~~~
romeisendcoming
NFS requires long admin experience and tuning for each use case. The only
point I can agree with you on is that the Linux automounter is lackluster. We
used BSD amd for many years with good success.

~~~
jabl
> NFS requires long admin experience and tuning for each use case.

Depends on what you're going to do with it. For something like sharing home
directories, it works well enough.

The defaults are usually pretty decent. There's unfortunately a lot of
obsolete NFS tuning advice hanging around on the internet that seems to get
cargo culted over and over again.

Like the advice to set some specific rsize/wsize settings because the default
is too small, oblivious to the fact that the NFS protocol allows the client
and server to negotiate maximum sizes, and at least the Linux client and
server have taken advantage of this negotiation mechanism for the past 2
decades or so.
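You can see what was actually negotiated on the client rather than
copy-pasting old advice, e.g.:

```shell
# The sizes the client and server agreed on are visible per mount;
# no mount-option archaeology needed:
nfsstat -m            # per-mount options, including rsize/wsize
grep nfs /proc/mounts # the same information, raw
```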

~~~
romeisendcoming
Well enough isn't good enough for production. Having worked in high-volume,
highly available environments for years, I can say SOHO scenarios are not
good examples. Real-world issues with complex NFS environments (mixed
NFSv3/v4 + krb5p and multiple OSes + automounters) or pNFS and Gluster
require more than tuning mount options. Tuning NFS for a latency-averse and
throughput-intensive application operating on large NetCDF and HDF5 file
hierarchies is a worthy example.

~~~
jabl
> Well enough isn't good enough for production.

How informative.

> Having worked in high volume, highly available environments for years soho
> scenarios are not good examples.

FWIW, I wasn't talking about SOHO. At least in my experience, defaults work
well for home & shared work dirs for O(10k) users (not all simultaneously
active, though). HA is a pain, though, if you want to DIY, I'll grant you
that.

> Real world issues with complex NFS environments (mixed nfs3/4 + krb5p and
> multiple OS'es + automounters)

Complex? Sounds like a pretty standard NFS environment.

> or pNFS and gluster require more than tuning mount options.

Yeah, no personal experience there. What did you have to do there?

We did have a clustered NFS appliance for HPC use a decade or so ago. People
like to complain about how Lustre is a beast to run, but IME Lustre has been
smooth sailing compared to the grief that POS gave us. But that wasn't really
the fault of the NFS protocol per se; the architecture as well as the
implementation of that appliance were crap, particularly so for HPC.

~~~
romeisendcoming
* routine buffer tweak is soho speak.

* defaults don't work, esp. in mixed NFS 3/4 on Linux across mixed 1/10 Gb segment subnet boundaries. Try it and get back to me. You will DoS your file service.

* krb5p standard? First I've heard of it. FreeBSD won't do krb5p at NFSv4 vanilla against a Linux NFS server.

* Would never do it again. Gluster is a shit storm of problems, but nice when it works.

------
Ya9yoeGh
NFS open file handle semantics are quite an annoyance.

The recommended way to perform atomic writes on POSIX is the create-write-
fsync-rename-fsyncdir[0] dance. But that replaces the original file, which
causes ESTALE for all readers on NFS servers that don't support "delete on
last close"[1] semantics.

This breaks the common pattern where you can continue reading slightly stale
data from unlinked files while writers update the data atomically. In other
words, it makes it much harder to do filesystem concurrency correctly, which
is already hard enough.
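For concreteness, the dance looks roughly like this (a sketch in Python; on
NFS the rename-over is exactly what triggers the ESTALE problem above):

```python
import os
import tempfile

def atomic_write(path, data):
    """Create-write-fsync-rename-fsyncdir: readers see either the old
    content or the new, never a torn write (on POSIX local filesystems)."""
    d = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory: rename() must not
    # cross filesystems.
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        os.write(fd, data)
        os.fsync(fd)           # flush file contents before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)       # atomically replace the target name
    dfd = os.open(d, os.O_RDONLY)
    try:
        os.fsync(dfd)          # make the rename itself durable
    finally:
        os.close(dfd)
```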

A practical case where I'm seeing it is on Amazon's EFS. Updating thumbnails
occasionally results in torn images because the server tries to send a stale
file.

[0] [https://danluu.com/file-consistency/](https://danluu.com/file-
consistency/) [1]
[http://nfs.sourceforge.net/#faq_d2](http://nfs.sourceforge.net/#faq_d2)

------
avar
A tangential question, the post links to an earlier post[1] saying that GitLab
itself doesn't use NFS anymore, pointing out that they migrated to Gitaly.

But ultimately Gitaly will need to do a local FS operation, so there's still
the problem of ensuring HA for a given repository. GitHub solved this by
writing their own replication layer on top of Git[2], but what's GitLab doing?
Manually sharding repos on local FS's that are RAID-ed with frequent backups?

1\. [https://about.gitlab.com/2018/09/12/the-road-to-
gitaly-1-0/](https://about.gitlab.com/2018/09/12/the-road-to-gitaly-1-0/)

2\. [https://githubengineering.com/introducing-
dgit/](https://githubengineering.com/introducing-dgit/)

~~~
lbotos
We are working on Gitaly HA. You can check out the Epic here:

[https://gitlab.com/groups/gitlab-
org/-/epics/289](https://gitlab.com/groups/gitlab-org/-/epics/289)

~~~
avar
So, since redundancy & horizontal scaling are goals of Gitaly HA, am I to
understand that right now GitLab.com runs on some ad-hoc setup like what I
described, and you can lose data if you're unlucky enough to have a machine
or two disks go down at the same time?

------
justinclift
The Red Hat BZ is "Access Denied", so I can't see if it's fixed in RHEL &
CentOS yet:

[https://bugzilla.redhat.com/show_bug.cgi?id=1648482](https://bugzilla.redhat.com/show_bug.cgi?id=1648482)

:(

~~~
stanhu
You can sign up for an account to see the bug report status. The patch has not
yet been backported.

~~~
justinclift
I have an account. Why would that make any difference?

RH BZs are (or used to be) public by default, unless they're manually
changed, e.g. for security-related things.

------
perlpimp
Great writing effortlessly makes you feel smarter. Great story.

------
jlokier
Really nice. I have more respect for GitLab now. That's a great write-up, and
it led me to read some of their other nice reports too.

It's not exactly new for NFS to have cache coherency "surprises". But it
should have "close-to-open" coherency at least, and the bug found by GitLab
fails even that.

Here's an anecdote.

A Mac client talking to Samba on Linux. The client deletes random files that
the client isn't even looking at, but which happen to be changed on the server
around the time the client looks at the directory containing those files.

I am not joking. Randomly deleting files it's not even reading.

It delayed a product rollout by about 8 months. I was sure there must be a
flaw in some file-updating code somewhere in application code running on
Linux. What else would make files updated by rename-over disappear once every
few weeks? Surely the usual tmpfile-fsync-rename dance was durable on Linux,
on ext4? It must have been a silly, embarrassing error in the application
code, right? Calling unlink() with the wrong string or something.

But no, application was fine. Libraries were fine. And the awful bugs in
VMware Fusion's file sharing were not to blame this time. (Ahem, another
anecdote...)

It only happened every few weeks. A random file would disappear and be
noticed. A web application would be told to update a file, and it'd
spontaneously complain that the file was gone. It wasn't reproducible until we
went all-out on trying to make it happen more often. But they kept
disappearing.

Things like invoice data files and edited documents. Once every few weeks,
for no obvious reason. Not happy. And not safe to deploy.

Eventually, we found a very old bug in Emacs which deletes the file being
saved in rare circumstances that only manifest when file attributes change at
the wrong moment, which does happen with the weird and wonderful Mac SMB
client's way of caching attributes. We thought, with great relief, that we'd
found the cause and could proceed to rollout. Until, after a few weeks,
another file disappeared. No!

It took weeks of tracing, reproducing, and learning new debugging tools (like
auditd running permanently) to rule out faults in (1) the application code and
libraries, (2) Linux itself, (3) Samba, (4) tools used on the Mac when viewing
a directory, and viewing and editing files.

Nope, it wasn't a bug in the application code after all. There weren't any
faulty calls or wrong strings; logging would have caught them. Linux rename()
was fine, not to blame. It wasn't a durability problem on power loss (the
reason you need fsync with rename). Nor VMware disk image snapshots, even
though other bugs were spotted with those. Nor was it the Emacs bug, although
that was a surprise to find.

The reproducer turned out to be "run cat a lot on the Mac, on a file which
isn't being changed at all, while repeatedly updating another file on Linux in
the same directory, using rename to update. Watch the updated file disappear
eventually".
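Spelled out as shell, it amounted to something like this (paths made up; the
first loop runs on the Mac against the SMB mount, the second on the Linux
server):

```shell
# On the Mac, hammer an unrelated, unchanging file on the share:
while true; do cat /Volumes/share/stable.txt > /dev/null; done

# Meanwhile on the Linux side, atomically update a sibling file:
while true; do
  date > /srv/share/data.txt.new
  mv /srv/share/data.txt.new /srv/share/data.txt   # rename-over update
done

# ...then wait for data.txt to vanish.
```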

auditd showed Samba was doing the deletes, so I suspected a crazy bug in Samba
and had to work quite hard to convince myself Samba was only doing what it was
told by the client. I hoped it was Samba, because that's open source and I can
fix that.

No, it was an astonishingly crappy bug called "delete random files once in a
blue moon, hahaha!" in the Mac SMB client, which happened to occasionally be
used to look in the same directory, which happened to be shared over Samba
for convenience.

The confirmation of cause was from watching the SMB protocol, looking at Samba
logs set to maximum verbosity, and lots of reading.

atq2119 says: "Imagine this same scenario if GitLab was using a closed source
operating system. Would they have been able to track this down?"

I think I've had an experience like that - the above bug in the Mac SMB
client. (Seriously, deleting random files.)

Googling reveals similar-sounding bugs at least two versions of OSX later.
Yuck. I have no idea how to meaningfully get these things fixed or usefully
reported. And I've had enough to stop caring anyway. The workaround is "force
it to use SMB v1" (ye olde anciente). I can imagine the cause is something
trivial in directory caching; it's probably just a few lines to fix.

I'm certain if the Linux client had a bug like that, it would be fixed very
quickly, and probably backported by the big distros. I'm certain a Linux SMBFS
developer would have been very helpful. And, there's a fairly good chance I
could have fixed it myself and submitted the patch - probably less work than
finding the cause, in this instance.

As it is, I don't think I could have found the culprit if I couldn't look at
the Samba source to understand in detail what was going on in the SMB network
protocol, or if I didn't have excellent tracing tools in Linux to find which
process was responsible for stray deletions (i.e. not my application code, but
Samba, which was doing as requested).

~~~
specialist
Great war story. Agree with need for access to source.

\--

During the early Java WORA culture wars, Bill Joy's wisdom about NFS has
always stuck with me:

Interoperability is hard.

Despite having access to source code, a stable spec, working reference
implementations, testing suites, and aggressive evangelism, getting everyone's
NFS implementations to interoperate was a major challenge.

[https://en.wikipedia.org/wiki/Network_File_System](https://en.wikipedia.org/wiki/Network_File_System)

\--

I continue to think an authoritative history of NFS would make a seminal
textbook: a useful guide for the younguns about to embark on grand new
world-changing adventures. Many, many other protocols (DNS, TCP, HL7,
CORBA...) have faced the same challenges, but my hunch is NFS is a superset,
hitting every pain point.

