Treating servers as cattle works up to a threshold: when the same problem keeps coming up, it's time to call the vet. http://rachelbythebay.com/w/ has many good examples of this, some submitted and voted up here.
Cattle is a good philosophy but it takes a huge amount of work to approach perfection.
Everybody was embarrassed because no monitoring caught it, but the VP of IT did by walking past the cage.
Aside: was it even legal for him to do that?
Also, Rickover was Congress's favorite admiral. They forced the Navy to promote him. I'm pretty sure they saw to it that the laws were to his liking.
Also, if you're not confident enough in your nuclear reactor to apply chaos monkey techniques, you shouldn't be engineering nuclear reactors.
He’s either an MD in finance, which is the equivalent of a VP (roughly) in other kinds of companies, or he’s the CEO.
My dad had a book of almost all the cow names in Norway, as of 1988. As a kid I found it rather fun to just flip through it and read some of the names, often wondering how they came up with them.
However, since then the tradition of naming cattle seems to have dropped to less than 30%.
: "Gullhorn og dei andre : kunamn i Noreg" https://urn.nb.no/URN:NBN:no-nb_digibok_2010111708049
These days the number has risen; IIRC it's around 35 cows per farm, though that's still quite low compared to larger countries, I imagine. I'm pretty sure the variance is quite high, however, with a fair number of farms with just a few cows dragging down the mean.
I really hope the video is edited to strongly resemble this scene: https://www.youtube.com/watch?v=N9wsjroVlu8
We had a customer with regularly failing tape backups. CRC errors, verify pass failures, even failed writes, and so forth.
We replaced the tapes with new ones. Same issues.
We replaced the tape drive with a new one. Still the same problems.
We replaced the internal ribbon cable and the SCSI controller. No luck.
Firmware-flashed everything. Didn't help.
New server chassis, wiped the OS and reinstalled everything from scratch. Changed the backup software just in case. The backups still failed!
Literally no part was the same. I went on site to start looking into things like the power cables, the UPS, or vibration issues. Basically we were getting desperate and grasping at straws.
I was sitting down in an office, casually chatting with the IT guy while we were waiting for 5pm so we could reboot the server. He's leaning back in his office chair, and he picks up one of the tape cartridges, throws it up in the air, and catches it before it hits the ground. Just playing. Over and over.
I asked him if he does that a lot.
"Yes, it's fun!" he answered.
What was your company's role? Backup services/devices?
Finally I took all the parts out of the original computer and put them in a different chassis and it worked! Put them back in the old chassis and back to the old problem.
Eventually I noticed that there was an extra stand-off in the first computer case and it was shorting out the motherboard.
It was literally the chassis causing the problem.
Motherboard is bricked. Ring DELL for support. After going through the rigmarole of explaining what had happened and that we had a bricked motherboard, the person on the phone said "Have you tried taking out the CPU and rebooting?"
To avoid further delay in getting a replacement sent (we had 4 hour on-site at the time), we went through the motions. Not surprisingly, the motherboard was substantially bricked without a CPU.
The DELL engineer that came on-site was suitably amused.
Nothing worked. Finally, I removed the processor from the motherboard, looked at it, and reinstalled it. The computer booted right up and never had another problem. Weird.
What do you do to write software accordingly? Make it detect when it's running on a dud? Have it run as best as it can anyways?
Are you perhaps thinking of the printer execution scene from "Office Space"?
are you able to comment a bit further on why this machine was well known?
Could be a different incident and a different machine, though. I'm sure this story happened more than once.
The infamous machine did go through repairs and part swaps many times, as you could see from its long and troubled hwops history.
The worst machines were the zombies with NICs bad enough to break Stubby RPCs, but still passing heartbeat checks. Or breaking connections only when (re)using specific ports. Fun times!
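To make that zombie failure mode concrete, here's a rough sketch in plain Python sockets (the host, port, and PING/PONG protocol are all made up, and a real fleet would use its RPC framework's health checks rather than this): a bare heartbeat only proves a TCP handshake completes, while a probe that pushes a payload through and validates the reply is what catches a NIC that accepts connections but corrupts or drops the data.

    import socket

    HOST, PORT = "worker-17.example.internal", 8080  # hypothetical target

    def heartbeat_ok(timeout: float = 1.0) -> bool:
        # Passes as long as a TCP handshake completes; a "zombie" NIC can
        # keep passing this indefinitely.
        try:
            with socket.create_connection((HOST, PORT), timeout=timeout):
                return True
        except OSError:
            return False

    def end_to_end_ok(timeout: float = 1.0) -> bool:
        # Pushes real bytes through the NIC and checks the reply, so
        # data-path failures actually show up.
        try:
            with socket.create_connection((HOST, PORT), timeout=timeout) as conn:
                conn.sendall(b"PING\n")
                return conn.recv(16).strip() == b"PONG"
        except OSError:
            return False

A machine like the ones described above could keep passing the first check while failing the second.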
Regarding this system: the motherboard was never swapped?
> engineers habitually run batch jobs with more replicas than there are machines
Idly curious, how do I parse this? It sounds like the same jobs are replicated to multiple machines as a sort of asynchronous, eventually-consistent lockstep arrangement?
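Just guessing at the intended meaning here, but one way to picture "more replicas than machines" is the sketch below (run_with_replicas and the numbers are my own, not from the article): each piece of work gets submitted several times, whichever copy finishes first wins, and a copy stuck on a flaky machine simply loses the race instead of blocking the job.

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def run_with_replicas(task, replicas=3):
        # Submit the same task several times; take whichever replica
        # finishes first and don't wait around for stragglers.
        pool = ThreadPoolExecutor(max_workers=replicas)
        futures = [pool.submit(task) for _ in range(replicas)]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        pool.shutdown(wait=False, cancel_futures=True)  # abandon the rest
        return next(iter(done)).result()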
Eyy what would I search on moma to find this video?
I can confirm I've read the same thing though, years back.
If any Googlers are reading this: just go to go/legends and search for officespace. The first link that pops up has context as to why the video exists.
And I wonder how well the signal in that ratio might scale down to hundreds or tens of disks.
"Hewlett Packard Enterprise (HPE) has once again issued a warning to its customers that some of its Serial-Attached SCSI solid-state drives will fail after 40,000 hours of operation unless a critical patch is applied.
Back in November of last year, the company sent out a similar message to its customers after a firmware defect in its SSDs caused them to fail after running for 32,768 hours."
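Pure speculation on my part, and not something the article claims, but 32,768 hours is exactly 2^15, which is what you'd get if power-on hours were kept in a signed 16-bit counter that wraps after roughly 3.7 years:

    def as_int16(value):
        # Interpret the low 16 bits of `value` as a signed 16-bit integer,
        # the way a fixed-width firmware counter would store it.
        value &= 0xFFFF
        return value - 0x10000 if value >= 0x8000 else value

    for hours in (32_766, 32_767, 32_768):
        print(f"{hours} power-on hours -> counter reads {as_int16(hours)}")
    # 32766 -> 32766, 32767 -> 32767, 32768 -> -32768 (wraps negative)

The 40,000-hour figure in the newer warning isn't a power of two, so presumably that one has a different cause.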
Can you imagine provisioning and deploying a rack or 3 full of shiny new identical drives, all in RAID6 or RAID10, so you couldn't possibly lose any data without multiple drives all failing at once...
(Evidence that the universe can and does invent better idiots...)
Someone suggested we just nuke it and bring it back up on a fresh instance. The problem was gone! Everything was running smoothly again.
If the CPU was bad, that means you kept running the instance on the same node. A quick way to test whether it was "cattle" would have been to try it on a different node.
Additionally, if the CPU was bad, how was it not affecting other services?