We are relearning the same lesson we got wrong in the 90s once we had nice libraries and middleware, i.e. forgetting the platform we're building on.
Imagine this same scenario if GitLab was using a closed source operating system. Would they have been able to track this down? Quite unlikely, but maybe if they were even more persistent and got lucky. Would they have been able to fix it? Absolutely not. They'd be at the mercy of the vendor.
So, a Microsoft-owned GitHub might not even notice such a bug for long, since an engineer would be able to get it fixed with an email or two, while GitLab before the GitHub purchase might have gotten no fix at all.
In that regard one might argue that closed source doesn't completely destroy the ability to solve problems, but open source certainly helps balance out the odds between competitors of different sizes.
(I got as far as disassembling their DLLs to point to the exact problem)
What's hundreds of thousands to the hundreds of millions they ship every year?
If Linux were a proprietary product, then post-acquisition GitHub would get a fix from the Linux developers because Microsoft is big. Not that GitHub itself would provide anything, but they would get something due to the size of the Microsoft imperial stamp.
Assuming all competitors have engineers who are equally competent in the various technologies involved in debugging this (at a glance: filesystem operations, strace, wireshark, Linux kernel compilation and modules, Google Cloud Platform...) and who know how to contact and approach the open-source maintainers.
... and most importantly: have the time to dedicate to such a debugging task.
We had the source code for an API that hooked into a proprietary library. We found a bug in the library. I don't think we had a support contract, and the issue was affecting production. Fixing it could have entailed some decompiling of the library, identifying the bad function, writing a workaround, and shoving it all into a new library. But I didn't have the expertise for all that, so instead I hacked up the API with a different workaround, essentially killing off some functionality, which avoided the bug. The application worked again, and we went on with life.
Another example: a proprietary extension to a tool did data replication. Under certain circumstances, data replication would fail, and the loss of data meant we would have to full-sync all data, taking up to four days. We reported the bug to the company. They determined it was a "minor error" and said the fix would arrive in the next release, in six months. So we identified a workaround (add caching, monitor for potential service disruption, restart services to reconnect networks before the cache emptied) and implemented it until the fix could be delivered.
Regardless of who fixes the bug or how, the amount of time and money you invest in the fix matters. If a workaround saves you time and money by deferring the cost of the fix, that's often an acceptable solution. In this case, if the issue was affecting customers in production, blocking 'git gc' just for affected customers may have been a perfectly good workaround while whoever owned the NFS client code figured out and implemented a fix.
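For illustration only (GitLab's actual mechanism would live in their application, and the repository path here is made up), the simplest way to block automatic gc for a single repository is a couple of standard Git config knobs:

    # hypothetical: disable automatic repacking for one affected bare repository
    git -C /var/opt/repos/affected.git config gc.auto 0
    git -C /var/opt/repos/affected.git config gc.autoPackLimit 0
    # re-enable later by removing the overrides
    git -C /var/opt/repos/affected.git config --unset gc.auto
    git -C /var/opt/repos/affected.git config --unset gc.autoPackLimit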
EDIT: Yes, college degrees require due diligence and persistence, but they offer no indication of the willingness to exercise those skills _after_ college. This work does.
Academic achievements, relative to actual work, are a drop of piss in an ocean. My experience at university actually made me lose respect for academics.
edit: whoever is downvoting is romanticizing the achievements of scientists of yore, or thinks that MIT is the norm.
No, for the most part it's publish or die. I've heard professors refer to students as "harvest" and laugh while copying slides off of Google. I've seen professors lie their way into grants.
All that contrasted with how the industry actually works, what it needs, and what it actually requires universities to produce.
Yeah, I've become an achievement-oriented cynic. Titles truly only make me think less of a person if that's all they have to impress with.
Naming any single entity would just make us think they're the ones to blame, and I believe the problem is endemic.
I once sat at a table with PhD students and complained about the quality of introductory courses, where I was sternly put back in my place with "a university does not prepare you for work! it prepares you for research!"
I told him someone should tell that to all the students enrolling in CS in hopes of a career.
"Not all" may be true, but it's more likely most. Maybe it's this cynicism talking.
Solving this isn't easy. I would just like to open a school myself and offer guidance/support for people struggling as I did back then.
And my advice to most people who want to pursue CS is to do it via an apprenticeship and later approach a technical university.
The drama/pity is that this is a process that starts at 17, when we are most clueless.
This is what universities have always believed they are for, PhD courses in particular. There used to be a separate category of school that was both technical and employment-focused; in the UK these were called "polytechnics", in the US they would be things like the Texas Agricultural and Mechanical College. For complex reasons, due to both the class system and the accidents of history that caused a lot of startup founders to come from places like Stanford, they have become "unfashionable".
I have a feeling they will be making a comeback with a vengeance, but maybe not in our time.
Today I just do what I can when a youth comes to me for advice.
First I tried getting some test VMs up and talking to each other. When I couldn't get that to work, I set up a few physical boxes to test it out. When that didn't work either, I started debugging the code. A few straces and some routine C debugging work later, I found a bug that would prevent any BGP connection from ever establishing.
A quick post on the OpenBSD listserv and the problem was fixed within a day. (Wow, that was almost 10 years ago?! How time flies.)
We ultimately went with VyOS (back then called Vyatta) and Quagga but it felt good to find a bug like this.
Most of the work went in to confirming that there was an actual bug. Finding out where the bug was and fixing it was relatively trivial.
Start iterating on transparency; it may be hard, but you will see great results and it will make everyone around you collaborate much more.
You probably heard of the event  which occurred almost 2 years ago - people are still talking about it and we are really happy and impressed to see everyone, including us, learning from that experience.
I hope this non-technical suggestion will help you to think about a solution to your question. If your team is not used to this kind of openness, eventually they will like the positive feedback from the community (we see that as a small iteration :-)). A comment section at  may be the extra source of motivation.
Have a nice day,
Djordje - Community Advocate at GitLab
Like all culture changes it isn't complicated, just hard.
Before a job was launched, a daemon pre-staged some job contents (logfiles, env, etc.) and started writing out to a job summary file. Then the job would start and continue writing to one of the files, which would become corrupted.
It ended up being this bug:
To add insult to injury, this is an example of how people like to work with Git repositories at our company:
* Clone a Git repo into $HOME to work with it on different Linux hosts. $HOME is an NFS automount so that you have the same home environment on any host you log in to.
* $HOME is also exposed to Windows desktop machines via SMB. So convenient, right? You can edit source code in your favorite Windows IDE now!
Imagine their surprise when they make yet another Git commit with garbage in it. CR/LF and file mode bits are all messed up. Sometimes a file change on Windows takes a long time to propagate to NFS, or worse yet there can be some garbage at the end of the file. Combine this with the common practice of committing with 'git commit -am' without even looking at the diff and you get a recipe for disaster.
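For what it's worth, a few standard Git settings blunt some of that damage; this is only a sketch, not a fix for sharing one working tree over both NFS and SMB:

    # run inside the clone
    git config --global core.autocrlf input   # commit LF even if the Windows editor writes CRLF
    git config core.fileMode false            # ignore executable-bit churn coming from SMB
    git add -A
    git diff --cached                         # actually read the diff before committing
    git commit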
I generally set this as the default, and sometimes forget on a new machine... I tend to prefer tools/programs that work on Windows, Mac and Linux even if they're not quite as good, so I worry less about which platform I'm on. I use a Windows keyboard on the Mac and change the mapping... the only gotcha is when I need ^C in a terminal on the Mac; the muscle memory sometimes screws me up switching from working at home (Mac or Linux) to working at work (Windows).
Some quirkiness with Git's bash on Windows (my default shell) gets me sometimes too.
Yes, please, let the default be that I can't unmount a filesystem from a server that has died.
Well, it is possible to handle this correctly; see e.g. Lustre. Lustre, however, is very complex compared to NFS, so there's absolutely a price to be paid.
NFS implements close-to-open consistency, which is much weaker than full cache coherency (again, e.g. Lustre).
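As a concrete illustration of what close-to-open does and does not promise (hypothetical paths, two clients mounting the same export):

    # client A: write and close the file
    echo "new contents" > /mnt/nfs/shared.txt
    # client B: an open() issued after A's close() is guaranteed to see the update...
    cat /mnt/nfs/shared.txt
    # ...but a process that already had the file open may keep serving cached data,
    # and nothing coordinates concurrent writers - that's where full coherency differs.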
rxd01 -fstype=nfs4,ro,soft,noatime,nodiratime,intr,rsize=65536,wsize=65536,nosuid,tcp,allow_other 192.168.8.3:/mnt/rxd01
Towards the end of my tenure there, I gave a Linux desktop a try. The NFS experience was amazingly bad by comparison; lots of issues with locking, with becoming disconnected (often until a reboot) from NFS servers, odd performance issues, reliability issues with the automounter, etc.
In the last few months I have tried the NFS client on my current Linux desktop again, thinking things might have improved -- they have, I guess, but not by much. It's still pretty easy for the client to get into a hung state if there's too much packet loss, or if the file server reboots, or whatever. I have to imagine that not enough people are really using Linux NFS clients in anger to drive fixing the issues with it. There is often no escape from the Quality Death Spiral.
Depends on what you're going to do with it. For something like sharing home directories, it works well enough.
The defaults are usually pretty decent. There's unfortunately a lot of obsolete NFS tuning advice hanging around on the internet that seems to get cargo culted over and over again.
Like the advice to set some specific rsize/wsize settings because the default is too small, oblivious to the fact that the NFS protocol allows the client and server to negotiate maximum sizes, and at least the Linux client and server have taken advantage of this negotiation mechanism for the past 2 decades or so.
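It's easy to check what actually got negotiated instead of copying old tuning advice (hypothetical mount; /proc/mounts and nfsstat are standard on Linux):

    # mount without forcing rsize/wsize and let the client and server negotiate
    mount -t nfs4 server:/export /mnt/test
    # then look at what was actually agreed on
    grep /mnt/test /proc/mounts    # shows rsize=...,wsize=... among the options
    nfsstat -m                     # per-mount view including the negotiated sizes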
> Having worked in high volume, highly available environments for years soho scenarios are not good examples.
FWIW, I wasn't talking about SOHO. At least in my experience, defaults work well for home & shared work dirs for O(10k) users (not all simultaneously active, though). HA is a pain, though, if you want to DIY, I'll grant you that.
> Real world issues with complex NFS environments (mixed nfs3/4 + krb5p and multiple OS'es + automounters)
Complex? Sounds like a pretty standard NFS environment.
> or pNFS and gluster require more than tuning mount options.
Yeah, no personal experience there. What did you have to do there?
We did have a clustered NFS appliance for HPC use a decade or so ago. People like to complain about how Lustre is a beast to run, but IME Lustre has been smooth sailing compared to the grief that POS gave us. But that wasn't really the fault of the NFS protocol per se; it was just that both the architecture and the implementation of that appliance were crap, particularly so for HPC.
Trying to do anything like a database (and the 'git gc' process described is exactly that, a tiny database) over NFS requires the use of very specific techniques to get right.
Sibling commenter has it right - for unreliable WAN networks S3 offers far better semantics, because it's not quite a filesystem.
Firstly, NFS only really works reliably when your network has harmonised UID/GIDs. That's the first pain point. This normally means LDAP/AD or shipping /etc/passwd (_shudders_). Also you need to squash root, otherwise people who are local root can do lots of naughty things.
Then you have to make sure that your mountpoint doesn't go away, because stale file handles are a pain in the arse.
Then you have file locking, which causes loads of other pain as well. Most people turn that off.
After that it's mostly alright.
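For the root-squash point, a minimal server-side sketch with standard exports(5) options (path and network are made up):

    # append a root-squashing export on the server: client root gets mapped to nobody
    echo '/srv/home 192.168.8.0/24(rw,sync,root_squash,no_subtree_check)' >> /etc/exports
    exportfs -ra    # re-export
    exportfs -v     # verify the effective options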
NFSv4 has certain things that are good (pNFS, Kerberos, etc.) but support was not that great.
The recommended way to perform atomic writes on POSIX is the create-write-fsync-rename-fsyncdir dance. But that replaces the original file, which causes ESTALE for all readers on NFS servers that don't support "delete on last close" semantics.
This breaks the common pattern where you can continue reading slightly stale data from unlinked files while writers update the data atomically. In other words, it makes it much harder to do filesystem concurrency correctly, which is already hard enough.
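For reference, a minimal shell sketch of that dance (filenames and payload are illustrative; assumes a GNU coreutils sync(1) recent enough to accept file operands):

    new_contents='{"example": true}'        # illustrative payload
    tmp="data.json.tmp.$$"
    printf '%s\n' "$new_contents" > "$tmp"  # create and write the temporary file
    sync "$tmp"                             # fsync the temp file's data
    mv -f "$tmp" data.json                  # atomically rename over the old version
    sync .                                  # fsync the containing directory
    # On an NFS server without delete-on-last-close, readers still holding the old
    # data.json open can now get ESTALE instead of quietly reading the unlinked copy.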
A practical case where I'm seeing it is on Amazon's EFS. Updating thumbnails occasionally results in torn images because the server tries to send a stale file.
 https://danluu.com/file-consistency/  http://nfs.sourceforge.net/#faq_d2
But ultimately Gitaly will need to do a local FS operation, so there's still the problem of ensuring HA for a given repository. GitHub solved this by writing their own replication layer on top of Git, but what's GitLab doing? Manually sharding repos on local FS's that are RAID-ed with frequent backups?
RH BZs are (or used to be) public by default, unless they're manually changed, e.g. for security-related things.
It's not exactly new for NFS to have cache coherency "surprises". But it should have "close-to-open" coherency at least, and the bug found by GitLab fails even that.
Here's an anecdote.
A Mac client talking to Samba on Linux. The client deletes random files that it isn't even looking at, but which happen to be changed on the server around the time the client looks at the directory containing them.
I am not joking. Randomly deleting files it's not even reading.
It delayed a product rollout for about 8 months. I was sure there must be a flaw in some file-updating code, somewhere in application code running on Linux. What else would make files that are updated by rename-over disappear once every few weeks? Surely the usual tmpfile-fsync-rename dance was durable on Linux, on ext4? It must have been a silly, embarrassing error in the application code, right? Calling unlink() with the wrong string or something.
But no, application was fine. Libraries were fine. And the awful bugs in VMware Fusion's file sharing were not to blame this time. (Ahem, another anecdote...)
It only happened every few weeks. A random file would disappear and be noticed. A web application would be told to update a file, and it'd spontaneously complain that the file was gone. It wasn't reproducible until we went all-out on trying to make it happen more often. But they kept disappearing.
Things like invoice data files and edited documents. Once every few weeks for no obvious reason. Not happy. And not safe to deploy.
Eventually, we found a very old bug in Emacs which deletes the file that's being saved in rare circumstances that only manifest when file attributes change at the wrong moment, which does happen with the weird and wonderful Mac SMB client's way of caching attributes. We thought we'd found the cause with great relief, and could proceed to rollout. Until after a few weeks, another file disappeared. No!
It took weeks of tracing, reproducing, and learning new debugging tools (like auditd running permanently) to rule out faults in (1) the application code and libraries, (2) Linux itself, (3) Samba, (4) tools used on the Mac when viewing a directory, and viewing and editing files.
Nope, it wasn't a bug in application code after all. There weren't any faulty calls or wrong strings. Logging would have caught them. Linux rename() was fine, not to blame. It wasn't a durability problem on power loss (the reason you need fsync with rename). Nor VMware disk image snapshots, even though other bugs were spotted with those. Nor was it the Emacs bug although that was a surprise to find.
The reproducer turned out to be "run cat a lot on the Mac, on a file which isn't being changed at all, while repeatedly updating another file on Linux in the same directory, using rename to update. Watch the updated file disappear eventually".
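A rough sketch of that reproducer, with made-up paths (the Linux side updates by rename-over, the Mac side just reads an unrelated file over SMB):

    # on the Mac, over the SMB mount: hammer a file that never changes
    while true; do cat /Volumes/share/unchanged.txt > /dev/null; done

    # on the Linux server, in the same shared directory: update another file by rename
    while true; do
        date > /srv/share/updated.txt.tmp
        mv /srv/share/updated.txt.tmp /srv/share/updated.txt
    done
    # with the buggy client, updated.txt eventually disappears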
auditd showed Samba was doing the deletes, so I suspected a crazy bug in Samba and had to work quite hard to convince myself Samba was only doing what it was told by the client. I hoped it was Samba, because that's open source and I can fix that.
No, it was an astonishingly crappy bug called "delete random files once in a blue moon, hahaha!" in the Mac SMB client, which occasionally happened to be used to look in the same directory, which in turn happened to be shared over Samba for convenience.
The confirmation of cause was from watching the SMB protocol, looking at Samba logs set to maximum verbosity, and lots of reading.
atq2119 says: "Imagine this same scenario if GitLab was using a closed source operating system. Would they have been able to track this down?"
I think I've had an experience like that - the above bug in the Mac SMB client. (Seriously, deleting random files.)
Googling reveals similar-sounding bugs at least two versions of OSX later. Yuck. I have no idea how to meaningfully get these things fixed or usefully reported. And I've had enough to stop caring anyway. The workaround is "force it to use SMB v1" (ye olde anciente). I can imagine the cause is something trivial in directory caching; it's probably just a few lines to fix.
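If memory serves, the way to pin the macOS client to the old protocol is /etc/nsmb.conf; the exact key should be checked against nsmb.conf(5) on your release, but something like:

    # force SMB 1 for all shares on the Mac (assumption: protocol_vers_map=1 means SMB 1 only)
    printf '[default]\nprotocol_vers_map=1\n' | sudo tee /etc/nsmb.conf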
I'm certain if the Linux client had a bug like that, it would be fixed very quickly, and probably backported by the big distros. I'm certain a Linux SMBFS developer would have been very helpful. And, there's a fairly good chance I could have fixed it myself and submitted the patch - probably less work than finding the cause, in this instance.
As it is, I don't think I could have found the culprit if I couldn't look at the Samba source to understand in detail what was going on in the SMB network protocol, or if I didn't have excellent tracing tools in Linux to find which process was responsible for stray deletions (i.e. not my application code, but Samba, which was doing as requested).
From the early Java WORA culture wars, Bill Joy's wisdom about NFS has always stuck with me:
Interoperability is hard.
Despite having access to source code, a stable spec, working reference implementations, testing suites, and aggressive evangelism, getting everyone's NFS implementations to interoperate was a major challenge.
I continue to think an authoritative history of NFS would make a seminal textbook: a useful guide for the younguns about to embark on grand new world-changing adventures. Many, many other protocols (DNS, TCP, HL7, CORBA...) have faced the same challenges. But my hunch is NFS is a superset, hitting every pain point.