Anyway, none of this is particularly important for the story, but I don't think the guy telling it is lying for dramatic effect. I think he's probably being honest that the boss saying "the server crashed" made him suspicious, because that server never crashed, and I too found it amusing for this to be included in the story (as a sort of by-the-by advertisement for Linux). (Also, though: it turned out the server really did "crash" in some way.)
BTW, I've seen an error where an app server couldn't write files to a directory, but only for specific filenames (ones that didn't already exist). Turns out if you have tens of thousands of files in one directory, the directory hash table gets collisions, and some filenames you can create while others you cannot. It was a lot of fun to discover that :)
And customers described it as "the server doesn't work", but by the time we connected, the randomly generated names it was trying to write were different, so it worked.
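To illustrate the mechanism (this is a toy model, not how any particular filesystem actually lays out its directory index): with a fixed-size hash table and limited bucket capacity, whether a create succeeds depends on what the new name happens to hash to, not on how much free space the disk has.

```python
import zlib

# Toy model of a directory whose name lookup uses a fixed-size hash table
# with limited-capacity buckets. Real filesystems (e.g. ext3/ext4 htree
# directories) are more sophisticated, but the failure mode is similar:
# once the index is crowded, creating SOME names fails while others work.

BUCKETS = 8          # deliberately tiny so collisions show up fast
BUCKET_CAPACITY = 2  # pretend each bucket can only hold 2 entries

class ToyDirectory:
    def __init__(self):
        self.buckets = [[] for _ in range(BUCKETS)]

    def create(self, name: str) -> bool:
        # crc32 as a stand-in for the filesystem's name hash
        bucket = self.buckets[zlib.crc32(name.encode()) % BUCKETS]
        if name in bucket:
            return False  # name already exists
        if len(bucket) >= BUCKET_CAPACITY:
            return False  # bucket full: hash collision, create fails
        bucket.append(name)
        return True

d = ToyDirectory()
results = {f"file{i}": d.create(f"file{i}") for i in range(100)}
created = [n for n, ok in results.items() if ok]
failed = [n for n, ok in results.items() if not ok]
print(f"created {len(created)}, failed {len(failed)}")
```

Note that the failures look completely arbitrary from the outside: retrying with a different random name may succeed, which is exactly why it presented as "sometimes the server works".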
The backup was what killed it: it ran out of disk space and the box keeled over. I couldn't believe the backup program was stupid enough to back up twice as much stuff as it had space for, and then to kill off the important processes to keep the backup running until zero bytes were left.
I've also had a close one with MySQL replication: the disk had to fill up before I configured it to purge the binary logs. My own stupidity is to blame for that one.
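In case it saves someone else the same scare, MySQL can be told to expire binary logs on its own. A minimal my.cnf sketch (the 7-day retention is just an example; `binlog_expire_logs_seconds` is the MySQL 8.0 name, older versions used `expire_logs_days`):

```
[mysqld]
# MySQL 8.0+: automatically purge binary logs older than 7 days
binlog_expire_logs_seconds = 604800
# Pre-8.0 equivalent:
# expire_logs_days = 7
```

And `PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;` works as a one-off cleanup when the disk is already filling.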
Log files are going to be the killer: run a Linux box for long enough without any log rotation and the disk is ultimately going to fill up. I can't imagine that a decade ago, when this server was built, there was a rack of terabyte SSDs in it.
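For what it's worth, a minimal logrotate sketch (the path and schedule here are just an illustration; on most distros a file like this goes in /etc/logrotate.d/):

```
/var/log/myapp/*.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}
```

That keeps eight compressed weekly rotations and drops anything older, which puts a hard ceiling on how much space the logs can ever eat.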
Email is also an area that just grows and grows. The mailbox doesn't even have to be actively used; the system messages alone pile up.