The climax of this story involves the belief that “Linux servers don’t just crash”.
The point being: Windows would just BSOD in that kind of situation. Not that continuing to run with corrupt kernel data structures is a good idea, but there is something grandiose about the OS stubbornly refusing to die when it's raining kernel crashes.
Anyway, none of this is particularly important for the story, but I don't think the guy telling it is lying for dramatic effect. I think he's probably being honest that the boss saying "the server crashed" made him suspicious, because that server never crashed. I too found it amusing that this was included in the story (as a sort of by-the-by advertisement for Linux). (Also, though: it turned out the server really did "crash" in some way.)
BTW, I've seen an error where an app server couldn't write files to a directory, but only for specific filenames (ones that hadn't already been created). It turns out that if you have tens of thousands of files in one directory, the directory's hash table gets collisions, so some names you can create while others you cannot. It was a lot of fun to discover that :)
And customers described it as "server doesn't work", but when we connected, the randomly generated names it was trying to write were different, and it worked.
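For anyone who hits the same thing, here is a minimal sketch of a workaround in Python: when creating a randomly named file fails, try another random name instead of giving up. It assumes the failure shows up as ENOSPC, which is a guess, since the comment above doesn't say which error the app saw. The function name and the `attempts` and `suffix` parameters are made up for illustration.

```python
import errno
import os
import uuid


def create_with_random_name(directory, attempts=5, suffix=".tmp"):
    """Create an empty file under a random name, retrying with a new name.

    Assumption: a name that cannot be created fails with ENOSPC.
    Any other OSError is treated as a real problem and re-raised.
    """
    last_error = None
    for _ in range(attempts):
        path = os.path.join(directory, uuid.uuid4().hex + suffix)
        try:
            # O_EXCL makes the call fail instead of reusing an existing file.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        except FileExistsError:
            continue  # name already taken, pick another one
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise
            last_error = exc  # this particular name failed, try another
            continue
        os.close(fd)
        return path
    raise last_error or OSError(errno.EEXIST, "no free name found", directory)
```

This only hides the symptom. A directory with that many entries probably wants to be split into subdirectories.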
The backup was what killed it: it ran out of disk space and the box keeled over. I could not believe the backup program was stupid enough to back up twice as much stuff as it had space for, and then to kill off the important processes to keep the backup running until zero bytes were left.
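The check that backup program apparently skipped can be sketched in a few lines of Python: add up the source tree and compare it with the free space on the destination before copying anything. This is only a sketch under assumptions. The names `tree_size` and `backup_fits` and the 10% `margin` are invented here, and it ignores compression, sparse files and hard links.

```python
import os
import shutil


def tree_size(root):
    """Total size in bytes of regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def backup_fits(source, dest, margin=0.10):
    """True if source fits on dest's filesystem with `margin` of capacity left over."""
    usage = shutil.disk_usage(dest)
    reserve = int(usage.total * margin)  # headroom kept free for everything else
    return tree_size(source) + reserve <= usage.free
```

A backup script would call `backup_fits(source, dest)` first and refuse to start when it returns False, rather than finding out at zero bytes left.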
I have also had a close one with MySQL replication: it took the disk filling up before I configured it to purge the logs. My own stupidity is to blame for that one.
Log files are going to be the killer. Run a Linux box for long enough without any log file rotation and the disk is ultimately going to fill up. I can't imagine that there was a rack of terabyte SSDs in there a decade ago when this server was built.
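If you want to see this coming, a rough Python sketch is enough: report how full the filesystem is and which files under the log directory are the biggest. The function names and the `limit` parameter are made up for illustration, and `/var/log` in the usage note is just the conventional location, not something known about this particular server.

```python
import os
import shutil


def disk_used_fraction(path):
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # some pseudo-filesystems report zero capacity
    return usage.used / usage.total


def largest_files(root, limit=5):
    """Return up to `limit` (size_in_bytes, path) pairs under root, biggest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # rotated away or unreadable while we were walking
    sizes.sort(reverse=True)
    return sizes[:limit]
```

Run from cron, something like `if disk_used_fraction("/var/log") > 0.9: print(largest_files("/var/log"))` would at least give a warning before the disk fills up. It is no substitute for actual log rotation.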
Email is also an area that just grows and grows. The email doesn't even have to be used, just your system message stuff.
You need to run stuff you care about on them for that to happen. If you don't, they run flawlessly for decades. I know of a switch that was up for 11 years (not a server, I know; still a Unix/Linux-based OS though), which is as much a testament to UPSs and backup generators as it is to the OS.
OTOH, we also have an NGINX box that is automatically restarted once a week, because otherwise our API gets really slow. I understand that this isn't good practice, but we've spent too many hours debugging already and at the end of the day, this works.
Most unbelievable part of the story. Cracked me up.