
The fleet of machines I owned at Facebook was around 10,000. I still remember the odd JVM crash that prompted me to reimage a machine. I wouldn't have remembered it except there were a few that month, and it was on the 3rd reimage of the same machine that I thought, "That's odd... I think I know that machine name." Checked history, saw the 3 repair jobs I had submitted... RAM was reset, CPU was eventually guessed at as bad.

Cattle is a threshold, but when the same problem keeps coming up it's time to call the vet. http://rachelbythebay.com/w/ has many good examples of this, some submitted and voted up here.




It is indeed folly to assume that cattle have no identity. There's an internally famous video inside Google in which a particular well-known machine was unracked, dragged out into a field, and ceremonially smashed to pieces by some hardware techs. Sometimes a machine just takes an arrow to the knee and it's never the same again. Then there are all the uncontrolled or unrecorded differences between machines: the ones at the tops of the racks, or the ends of the rows, are hotter (or colder); there's some difference between the same model of hard disk made in Hungary compared to the ones made in Mexico; at some date the BIOS vendor made an undocumented firmware revision that changes an obscure energy/performance register in your CPU; you have a machine with a dead CMOS battery that worked normally until it was rebooted.

Cattle is a good philosophy but it takes a huge amount of work to approach perfection.


Once, the head of IT of a company I used to work for was touring the datacenter, passing some new racks filled with blade servers. He stopped and asked, "Why are all the fans running full blast on this rack?" The admins checked: the servers were running a test workload at scale that somebody had forgotten about a few weeks before.

Everybody was embarrassed because no monitoring caught it, but the VP of IT did by walking past the cage.


Admiral Rickover was known for walking into the engineering spaces of nuclear ships and just throwing a valve handle that would force a reactor scram. Not infrequently on a submerged submarine. Just to make sure the team was on their toes.


I'm not a sailor or a nuclear engineer, but that doesn't sound like a great idea. Should the Chaos Monkey approach really be used on nuclear systems?

Aside: was it even legal for him to do that?


If you are responsible for building the industry that designs and builds nuclear submarines that carry nuclear missiles, you had better make sure that those submarines and their crews can handle chaos monkeys.

Also, Rickover was Congress's favorite admiral. They forced the Navy to promote him. I'm pretty sure they made sure that the laws were to his liking.


Rickover was the sort of person who had the technical expertise, gumption, and charisma to get away with this. It's worth reading up on this amazing individual, who made the Navy's nuclear reactors so safe and led the creation of nuclear reactor expertise within the Navy.

Also, if you're not confident enough in your nuclear reactor to apply chaos monkey techniques, you shouldn't be engineering nuclear reactors.


In general, one would hope that a nuclear system is designed such that the problem can easily be corrected if a single button or lever is accidentally pressed. It would be quite a terrible system if you could e.g. trigger a meltdown with just one action.


A reactor scram is basically just an emergency shutdown. If I were a nuclear engineer, it might give me a heart attack to hear the scram alarms, but I would be plenty happy knowing the scram works.


I walk the server room daily, every morning. I've tended our monitoring system for 15 years now and I don't trust myself to be infallible. I'm also the MD ...


MD = VP for those not in finance


In the UK MD = CEO (when not finance), so without more context you couldn’t really say.

He’s either an MD in finance, which is the equivalent of a VP (roughly) in other kinds of companies, or he’s the CEO.


> It is indeed folly to assume that cattle have no identity.

My dad had a book of almost all the cow names in Norway[1], as of 1988. As a kid I found it rather fun to just flip through it and read some names, often wondering how they came up with them.

However, since then it seems the tradition of naming cattle has dropped[2] to less than 30%.

[1]: "Gullhorn og dei andre : kunamn i Noreg" https://urn.nb.no/URN:NBN:no-nb_digibok_2010111708049

[2]: https://www.nrk.no/nordland/kyrne-far-ikke-lenger-navn-1.835...


I only know of small farmers who name their livestock, but that was back in the States.


Compared to US-scale farming, I'd guess most Norwegian farmers are "small".


Indeed. When the 1988 study was done, which resulted in the book of cow names among other things, there were about 360k cows in Norway total, and the average number of cows per farm was ~5.6.

These days the number has risen, IIRC to around 35, though that's still quite a low number compared to larger countries, I imagine. I'm pretty sure the variance is quite high, however, with a fair number of farms with just a few cows dragging down the mean.


There was a time when I managed a few clusters of MySQL servers provisioned on OVH bare metal. [Yes, we know "they're the worst", or at least used to be.] Everything was running well until we upgraded to their largest offerings. It turned out they had NUMA memory, so I had to retune the MySQL parameters. After sorting that out they all worked fine except one machine. Note that we already had a pretty good relationship, so I'd already requested the exact same motherboards and BIOS on these machines. Somehow one machine always came up different. I couldn't trust OVH to set the CMOS the same, so I carefully checked against one of the 'normal' machines. The cores reported differently and had different affinity characteristics. I never did find out why; perhaps a CPU microcode/stepping difference that I never investigated. Anyway, I just worked around the issue (with a different tuning for this one machine) because asking for a replacement resulted in a similar situation.
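
For illustration, a rough sketch of the kind of comparison that surfaces the odd machine out (the hostnames are hypothetical, and it assumes Linux hosts reachable over SSH so the kernel's view of the CPU/NUMA topology can be dumped side by side):

    import subprocess

    # Hypothetical hostnames; in practice they'd come from your inventory.
    HOSTS = ["db1", "db2", "db3"]

    # Ask the kernel for the core count and each NUMA node's CPU list.
    PROBE = ("grep -c ^processor /proc/cpuinfo; "
             "cat /sys/devices/system/node/node*/cpulist")

    for host in HOSTS:
        out = subprocess.run(["ssh", host, PROBE],
                             capture_output=True, text=True).stdout
        print(host, "->", " | ".join(out.split()))

The machine whose line doesn't match its siblings is the one to retune (or return).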


> There's an internally famous video inside Google in which a particular well-known machine was unracked, dragged out into a field

I really hope the video is edited to strongly resemble this scene: https://www.youtube.com/watch?v=N9wsjroVlu8


It was made to reference exactly that, down to the attire.


One of my computers has been aptly named 'THESEUS' due to what was replaced on it. By the time it was repaired to an acceptable level, the only original component remaining was the chassis.


Heh... that reminds me of my favourite troubleshooting story!

We had a customer with regularly failing tape backups. CRC errors, verify pass failures, even failed writes, and so forth.

We replaced the tapes with new ones. Same issues.

We replaced the tape drive with a new one. Still the same problems.

We replaced the internal ribbon cable and the SCSI controller. No luck.

Firmware flashed everything. Didn't help.

New server chassis, wiped the OS and reinstalled everything from scratch. Changed the backup software just in case. The backups still failed!

Literally no part was the same. I went on site to start looking into things like the power cables, the UPS, or vibration issues. Basically we were getting desperate and grasping at straws.

I was sitting down in an office, casually chatting with the IT guy while we were waiting for 5pm so we could reboot the server. He's leaning back in his office chair, and he casually picks up one of the tape cartridges, throws it up in the air, and catches it before it hits the ground. Just playing. Over and over.

I asked him if he does that a lot.

"Yes, it's fun!" he answered.

ಠ_ಠ


That is quite the story! Sounds like a very large amount of resources spent on that case.

What was your company's role? Backup services/devices?


This was general IT consulting back in the early 2000s. The customer was small, they only had three tower servers and only one had a tape drive.


Many would have hit the ground too! I'm twitching...


I was troubleshooting a computer once that would randomly shut off during boot and, one component at a time, I replaced everything on it including the motherboard to no avail.

Finally I took all the parts out of the original computer and put them in a different chassis and it worked! Put them back in the old chassis and back to the old problem.

Eventually I noticed that there was an extra stand-off in the first computer case and it was shorting out the motherboard.

It was literally the chassis causing the problem.


Back in the 90s we had a faulty DELL server that someone decided needed to have its BIOS upgraded. They didn't read the specs and upgraded to a BIOS not supported by the CPU.

Motherboard is bricked. Ring DELL for support. After going through the rigmarole of explaining what had happened and that we had a bricked motherboard, the person on the phone said "Have you tried taking out the CPU and rebooting?"

To avoid further delay in getting a replacement sent (we had 4 hour on-site at the time), we went through the motions. Not surprisingly, the motherboard was substantially bricked without a CPU.

The DELL engineer that came on-site was suitably amused.


My neighbour's computer stopped working after a lightning storm, and he asked me to take a look at it. It wouldn't boot, so I started taking things out of it (hard drive, video card, modem, etc.) and trying again.

Nothing worked. Finally, I removed the processor from the motherboard, looked at it, and reinstalled it. The computer booted right up and never had another problem. Weird.


Put in an extra standoff on my very first PC build, and it wouldn't boot until I found and removed it.


Reminds me of the saying, "I have used this broom for 20 years. I only needed to change the broom head 20 times and the broom stick 10 times.".


Trigger's broom scene from Only Fools And Horses. Classic British comedy


There needs to be some level of conformity between instances; they stop being a herd and become more of a zoo if the skew is too large. The workloads running on the instances shouldn't be able to tell which instance type they are running on, or your workloads should be written such that it doesn't matter (but at some point it will). Things that grow and move together wear together, so you will end up with a system that is designed against the empirical contract, not the stated one.


My approach has always been that somewhere in my fleet there is a heat sink that fell off and a CPU running at 400MHz. My last two jobs have started with me sitting down at my desk on day 1 and demonstrating this fact. After concluding that the zoo is unavoidable, the only thing left to do is write the software accordingly.
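
For what it's worth, a minimal sketch of the kind of day-1 sweep I mean (the hostnames and the 1000 MHz threshold are made up, and it assumes Linux hosts reachable over SSH):

    import subprocess

    HOSTS = ["host%03d" % i for i in range(200)]   # hypothetical inventory
    THRESHOLD_MHZ = 1000.0                         # arbitrary "something is wrong" line

    for host in HOSTS:
        # Current per-core clock speeds as reported by the kernel.
        out = subprocess.run(["ssh", host, "grep MHz /proc/cpuinfo"],
                             capture_output=True, text=True).stdout
        mhz = [float(line.split(":")[1]) for line in out.splitlines() if ":" in line]
        if mhz and min(mhz) < THRESHOLD_MHZ:
            print("%s: slowest core at %.0f MHz -- go check the heat sink"
                  % (host, min(mhz)))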


> After concluding that the zoo is unavoidable, the only thing left to do is write the software accordingly.

What do you do to write software accordingly? Make it detect when it's running on a dud? Have it run as best as it can anyways?


Suicide is a good solution, if some higher-level thing will notice and move the task to another machine. Batch frameworks can kill slow shards, or re-assign their work to faster shards. Clients of online services can direct more traffic to working shards and less or none to slow or broken ones.
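
As a minimal sketch of the suicide pattern (the throughput baseline, check window, and work source are all made up; it only assumes a scheduler that restarts a dead task on another machine):

    import sys, time

    EXPECTED_ITEMS_PER_SEC = 100.0   # hypothetical baseline for a healthy machine
    CHECK_INTERVAL_SEC = 60.0

    def work_items():
        # Stand-in for whatever the batch framework feeds this shard.
        while True:
            yield object()

    def handle(item):
        pass                         # real work goes here

    processed, window_start = 0, time.monotonic()
    for item in work_items():
        handle(item)
        processed += 1
        elapsed = time.monotonic() - window_start
        if elapsed >= CHECK_INTERVAL_SEC:
            if processed / elapsed < 0.1 * EXPECTED_ITEMS_PER_SEC:
                # An order of magnitude too slow: assume sick hardware and die,
                # so the scheduler moves this shard to a healthier machine.
                sys.exit("throughput collapsed; exiting so the shard gets rescheduled")
            processed, window_start = 0, time.monotonic()

The point is that the process doesn't try to diagnose the hardware; it just removes itself and lets the scheduler do the rest.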


"There's an internally famous video inside Google in which a particular well-known machine was unracked, dragged out into a field, and ceremonially smashed to pieces by some hardware techs..."

Are you perhaps thinking of the printer execution scene from "Office Space"?

https://www.youtube.com/watch?v=N9wsjroVlu8


> particular well-known machine was unracked

are you able to comment a bit further on why this machine was well known?


Because of the way engineers habitually run batch jobs with more replicas than there are machines, this one broken computer had crapped up every map/reduce job in that facility for a long time, and it had been sent to repairs many times without benefit. Many people knew instinctively that if their job was stuck it was probably because of the shard on xyz42 (or whatever the node name was).
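
A toy model of why that happens (the shard counts and timings are illustrative, not Google's): when work is split into many more shards than machines, the one sick machine almost always ends up on the critical path of the whole job.

    # 1000 shards spread across 100 machines: each machine gets ~10 shards,
    # and the job only finishes when the slowest machine finishes.
    NUM_SHARDS, NUM_MACHINES = 1000, 100
    SHARD_SECONDS = 60          # healthy machine: one shard per minute
    SLOWDOWN = 20               # the sick machine runs 20x slower

    per_machine = [0.0] * NUM_MACHINES
    for shard in range(NUM_SHARDS):
        m = shard % NUM_MACHINES                     # round-robin assignment
        per_machine[m] += SHARD_SECONDS * (SLOWDOWN if m == 0 else 1)

    print("healthy machines finish in", max(per_machine[1:]) / 60, "minutes")
    print("the whole job finishes in", max(per_machine) / 60, "minutes")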


I still remember the machine name. It starts with an l and ends with a 6. Over the course of a couple of years, pretty much all of its components (CPUs, RAM, drives) were replaced at least once. You could look up its maintenance history and it went on and on. I'm not sure if it was well known across all of engineering; from what I recall, it was in a cluster in Oregon reserved for a specific team. Because it was company property, no matter how doomed, they had to get signoff from upper management, close to Eric Schmidt's level, before they could destroy it.


My recollection, assuming it's the same machine I'm thinking of, is that it wasn't reserved for our team; rather, we left a do-nothing job permanently allocated to it, in order to prevent some poor other sucker from getting their job scheduled on it. (Because we, through painful experience, were well aware the machine had hardware problems; but we had long since given up on convincing the responsible parties to take it out of the pool, since it passed all their internal tests every time we complained. I don't remember how long this situation existed before someone finally took it out back and shot it.)

Could be a different incident and a different machine, though. I'm sure this story happened more than once.


Maybe a different machine? I meant that it was not in one of the general-purpose clusters: the entire pool was dedicated and a random team couldn't request Borg quota in it. For years, though, half of the Oregon datacenter was special for one reason or another.

The infamous machine did go through repairs and part swaps many times, as you could see from its long and troubled hwops history.

The worst machines were the zombies with NICs bad enough to break Stubby RPCs, but still passing heartbeat checks. Or breaking connections only when (re)using specific ports. Fun times!


I wonder if MR could integrate with a fuzzing engine that jumbles random combinations of real inputs into garbage but runnable jobs that cause reproducible crashes above some threshold (e.g. at least once per day, or if things are bad enough, once per month or something).

Regarding this system: the motherboard was never swapped?


In what way had the jobs failed? Very open-ended question :) but just coming from a hardware-diagnosis standpoint. (I guess the canonical answer is "here's the repair history," but yeah, duh.)

> engineers habitually run batch jobs with more replicas than there are machines

Idly curious, how do I parse this? It sounds like the same jobs are replicated to multiple machines as a sort of asynchronous, eventually-consistent lockstep arrangement?


> There's an internally famous video inside Google in which a particular well-known machine was unracked, dragged out into a field, and ceremonially smashed to pieces by some hardware techs.

Eyy what would I search on moma to find this video?


Sounds like they were re-enacting Office Space. If you haven't seen that movie... It's culturally relevant even today. Has a bit of profanity and such though.


Huh, I can't find it.

I can confirm I've read the same thing though, years back.


There used to be a go link for it (same as the machine name), but knowing Google, it might be stale. There's a good chance you can find more on the internal folklore site. If that one is still around, too.


Confirmed: the go link with the machine name works but I don't want to post it on HN to be safe :)

If any Googlers are reading this: just go to go/legends and search for officespace. The first link that pops up has context as to why the video exists.


Oh yeah there it is, hah


ms/die+mothafuckaz


Back when we were buying hardware for a big (at the time) intranet, all the servers were bought from the same batch off Sun's production line. I recall our sysadmin saying he really wanted to do the same for the disks, i.e. a case of identical drives.


A super-bad idea, because drives made in the same week in the same facility will all fail at the same moment.


Unlikely; that's not how statistics works in production engineering.


With respect, I just went through a "code red" at a large, well-known cloud storage company caused by synchronized late-life death of hard disks all manufactured in the same batch. That's the second time in my career that I've been through the same phenomenon. Hard disks that are made together wear out together.


I can confirm this. I learned the hard way to buy hard disk drives of the same model but from different batches.


I'm curious how wide the failure window was (timespan, ramp-up/down, etc), relative to how many devices were involved.

And I wonder how well the signal in that ratio might scale down to hundreds or tens of disks.


Shockingly bad production engineering then.


Not once, but twice:

https://www.techradar.com/news/new-bug-destroys-hpe-ssds-aft...

"Hewlett Packard Enterprise (HPE) has once again issued a warning to its customers that some of its Serial-Attached SCSI solid-state drives will fail after 40,000 hours of operation unless a critical patch is applied.

Back in November of last year, the company sent out a similar message to its customers after a firmware defect in its SSDs caused them to fail after running for 32,768 hours."

Can you imagine provisioning and deploying a rack or 3 full of shiny new identical drives, all in RAID6 or RAID10, so you couldn't possibly lose any data without multiple drives all failing at once...

(Evidence that the universe can and does invent better idiots...)


Your default assumption only works if every disk has an independent probability of failing from each other. Which is definitely not true if you buy all the disks from the same batch.
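
A small worked example with made-up numbers shows how big the difference is. Assume a 12-disk RAID6 group (it survives two failures), a hypothetical 0.2% per-disk chance of dying in a given week, and a hypothetical 0.1% chance per week that a shared-batch defect hits and takes out half the drives in that window:

    from math import comb

    def p_at_least(k, n, p):
        # Probability of at least k failures among n disks, each failing with prob p.
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    DISKS = 12              # one RAID6 group; survives two failures, dies at three
    P_WEEK = 0.002          # hypothetical per-disk weekly failure probability
    P_BATCH_BUG = 0.001     # hypothetical weekly chance a shared-batch defect bites
    P_FAIL_IF_BUG = 0.5     # if it bites, each drive in the batch has a coin-flip week

    independent = p_at_least(3, DISKS, P_WEEK)
    correlated = ((1 - P_BATCH_BUG) * independent
                  + P_BATCH_BUG * p_at_least(3, DISKS, P_FAIL_IF_BUG))

    print("independent disks:", independent)    # ~1.7e-06
    print("same-batch disks: ", correlated)     # ~1.0e-03, hundreds of times worse

The exact numbers don't matter; the point is that a rare-but-shared failure mode dominates the independent math.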


Others have mentioned the problems with this strategy, but getting drives with the same firmware is done routinely to avoid having slightly different behavior in the RAID set.


I don’t think they make hard disks in Hungary.


The infamous IBM Deskstars aka Deathstars were made there.


Not anymore.


Also at FB: one day we got a huge spike in measured site-wide CPU usage. After the terror subsided, we found that a single request on a single machine had reported an improbably huge number of cycles (like, a billion years of CPU time). We figured it was a hardware problem and sent the machine to repair. A month later the same thing happened on the same machine; it had just been reimaged and sent back into the fleet. There was some problem with the hardware performance counter where it randomly returned zero, but after that we made sure the machine was removed permanently.
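
As an aside, a minimal sketch of the kind of sanity check that keeps one glitched counter out of a site-wide aggregate (the cap and sample values are arbitrary):

    # Assumed per-request samples of CPU cycles; the cap below is arbitrary
    # (roughly an hour of CPU at a few GHz, per request).
    MAX_PLAUSIBLE_CYCLES = 10**13

    def aggregate(samples):
        total, dropped = 0, 0
        for cycles in samples:
            if 0 < cycles <= MAX_PLAUSIBLE_CYCLES:
                total += cycles
            else:
                dropped += 1    # counter glitch: flag the host, don't pollute the sum
        return total, dropped

    total, dropped = aggregate([1_200_000, 3_400_000, 2**62])   # last one is the glitch
    print(total, "cycles summed,", dropped, "implausible sample(s) dropped")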


Alternatively, you found a machine from the future :-)


Out of curiosity, do these reproducibly-broken components ever make it into an upstream testing environment?


Yes, in a few cases the hardware would go back to its manufacturer for more investigation. The ones I'm aware of were more subtly bad, and more reproducible than this one though.


Had a similar issue. Spent weeks debugging a process, adding logging and metrics, making graphs to figure out where the performance cliff was in an email-sending service. The results were so weird they were inexplicable.

Someone suggested we just nuke it and bring it back up on a fresh instance. Problem was gone! Everything was running smoothly again.


> Cattle is a threshold, but when the same problem keeps coming up it's time to call the vet.

If the CPU was bad, then that means you kept running the instance on the same node. A quick way to test whether it was "cattle" would have been to try it on a different node.

Additionally, if the CPU was bad, how was it not affecting other services?


I think the issue with bad CPUs is that they error unreliably.


I’m confused as to why the solution wasn’t to just replace the machine (preferably automatically) the first time it failed.


I'm curious, how many machines does FB have?



