
Bus errors, core dumps, and binaries on NFS (2018) - signa11
https://rachelbythebay.com/w/2018/03/15/core/
======
dveeden2
Reminds me of the differences between `cp` and `install` for putting .so files
in place.

`install` basically results in a new inode allowing for processes to keep a
handle on the old version of the .so

`cp` copies the content over the old content, keeping the same inode. Now
applications that had that .so open won't be happy.
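
A minimal sketch of that difference in Python, under some assumptions: the helper names (`overwrite_in_place`, `replace_via_rename`, `demo`) and the file name `libdemo.so` are made up for illustration, and using a temp file plus `os.replace` is just one way to end up with a new inode, not necessarily what `install` does internally.

```python
import os
import tempfile


def overwrite_in_place(path, data):
    """cp-like behaviour: rewrite the existing file, keeping its inode."""
    with open(path, "wb") as f:  # truncates and rewrites the same inode
        f.write(data)


def replace_via_rename(path, data):
    """New-inode behaviour: write a fresh file, then rename it over the old name."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)  # same directory, so same filesystem
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)  # the name now points at the new inode
    except BaseException:
        os.unlink(tmp)
        raise


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "libdemo.so")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"version 1")
        inode_v1 = os.stat(path).st_ino

        reader = open(path, "rb")  # stands in for a process holding the old file
        replace_via_rename(path, b"version 2")
        inode_v2 = os.stat(path).st_ino
        old_view = reader.read()  # read through the handle opened before the replace
        reader.close()

        overwrite_in_place(path, b"version 3")
        inode_v3 = os.stat(path).st_ino
        return inode_v1, inode_v2, inode_v3, old_view
```

Calling `demo()` returns the three inode numbers plus what the pre-existing reader saw. The rename step changes the inode while the old handle keeps reading the old content; the in-place overwrite leaves the inode unchanged.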

~~~
temac
Binaries are probably not mapped with MAP_SHARED, but I checked the mmap
manual for MAP_PRIVATE and read: "It is unspecified whether changes made to
the file after the mmap() call are visible in the mapped region."

I learn something every day...

------
burntsushi
This can also occur when memory mapping a file on Linux, which is similarish
to the NFS issue. If you truncate a file that you've already memory mapped and
the OS tries to read past the truncated part, you get SIGBUS.
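
A rough way to see this, assuming Linux and CPython: the sketch below runs the faulting access in a child interpreter so that the parent survives. `CHILD` and `run_truncation_demo` are names invented for this example.

```python
import mmap
import os
import signal
import subprocess
import sys
import tempfile

# Script run in a child interpreter, because the access kills the process.
CHILD = r"""
import mmap, os, sys
fd = os.open(sys.argv[1], os.O_RDWR)
m = mmap.mmap(fd, mmap.PAGESIZE)  # map the first page while the data is still there
os.ftruncate(fd, 0)               # shrink the file underneath the mapping
m[0]                              # touch the page that no longer has backing data
"""


def run_truncation_demo():
    """Return the child's exit status; a negative value means 'killed by that signal'."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * mmap.PAGESIZE)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, "-c", CHILD, path])
        return proc.returncode
    finally:
        os.unlink(path)


def killed_by_sigbus(returncode):
    """True if the exit status says the child died from SIGBUS."""
    return returncode == -signal.SIGBUS
```

On a typical Linux box I'd expect `killed_by_sigbus(run_truncation_demo())` to be true; other platforms may behave differently.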

~~~
iforgotpassword
Wanted to comment the same. When I first learned about memory mapping I went
"mmap() all the things!!" since it's so much easier than reads and writes all
the time, checking for short reads, aligning the pointer and calling read
again, handling EINTR, you name it.

But at least you do get proper error codes that you can handle in a somewhat
sane way.

A read or write error for an mmapped file? SIGBUS, game over. Want to handle
it? Use a signal handler for SIGBUS, use setjmp before every access to your
mmapped region and longjmp back from your signal handler. And you thought
handling all the failure modes of read/write was ugly.

Use mmap if you absolutely need the performance. Otherwise just don't.
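
For comparison, a sketch of the plain-read side in Python, where a file that turns out to be shorter than expected shows up as an ordinary exception instead of a signal. `read_exact` is a hypothetical helper, not a standard library function. Since Python 3.5, `os.read` retries on EINTR by itself, so only the short-read loop remains.

```python
import os


def read_exact(fd, n):
    """Read exactly n bytes from fd, looping over short reads.

    Raises EOFError if the data ends early, and lets OSError from the
    underlying read() propagate, so every failure is an ordinary exception.
    """
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = os.read(fd, remaining)  # may return fewer bytes than requested
        if not chunk:                   # empty result means end of file
            raise EOFError(f"wanted {n} bytes, got {n - remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

A caller can wrap `read_exact(fd, size)` in a normal `try`/`except (EOFError, OSError)` and decide what to do, which is the "somewhat sane" handling that mmap does not give you.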

~~~
icedchai
Back almost 20 years ago, I worked on a medium-sized system - 1000's of
simultaneous users, millions in $USD transactions daily - that was based on an
mmap'ed flat file "database." It worked amazingly well. (Note that we did none
of that sort of error handling!)

~~~
todd8
Yes, the first time I saw this described was in 1987 in a paper by A. Birrell
et al. See [1]. It was also available as a DEC SRC report, number 24.

[1] A simple and efficient implementation of a small database,
https://dl.acm.org/doi/abs/10.1145/37499.37517?download=true

------
the8472
The lack of POSIX semantics when unlinking on NFS rears its ugly head in many
more places. For example, the common atomic write pattern that allows readers
to keep reading a stale copy doesn't work anymore (you get ESTALE on IO or
SIGBUS if it's mmapped), which means anything involving a frequently replaced
file will require more workarounds than on any other filesystem.

~~~
jabl
Isn't that what "silly rename" (on nfsv3, v4 doesn't need it?) is supposed to
fix?

The problem the article mentions is overwriting a binary instead of renaming.

~~~
the8472
The article is about the atomic write pattern: create tempfile, move tempfile
over original which effectively is an unlink of the original.

And yes, this should be solved, but some NFS servers don't support it, e.g.
AWS's EFS.

------
qiqitori
In my experience, you're better off avoiding NFS as much as possible. (Perhaps
except when you're sharing a filesystem between VMs on the same machine.) Try
something else, perhaps rsync, unless you know what you're doing. NFS over a
VPN -- probably in for a rough ride.

In NFS, you can set mounts as 'hard' or 'soft'. If hard, errors will get you
stuck until the share is back. You probably don't want that. If soft, you're
slightly better off, but remember that the retry settings are all per-mount,
and perhaps one size doesn't fit all.

As far as I know, when NFS goes awry, you get the same or similar behavior to
a hypothetical HDD/SSD that just inexplicably decided to no longer do anything
for a while. Your processes will be in a D state and won't be killable for a
potentially long time.

~~~
macintux
When I worked in BBN R&D back in the day, we used lots of NFS on a very large
fragile LAN built from 10-base-2, plus some sketchy AppleTalk hardware in a
closet somewhere nearby.

Every now and then I’d know someone was in the closet because my transceiver’s
light would peg and NFS was locked up. Someone had once again bumped the
AppleTalk router.

