Hacker News
I fixed a server with a precisely placed piece of tape (2023) (treehouse.systems)
90 points by Wowfunhappy 3 days ago | 52 comments





I came into a company once that had built their own server room. It had a couple hundred pizza boxes in it and was always hot as hell even with the air conditioning at full blast. When I retired almost all of them in favor of virtual servers, it turned out that 1) a lot of the chipsets internally still had plastic film on them (to prevent scratches? Why would you put film on them? There were even some CPUs that didn't have any coolers on them, just film) and 2) a lot of the inflow and outflow air ports also had plastic shipping film that was never removed. I was completely shocked. So I fixed dozens if not a hundred computers simply by removing a piece of tape.

I had a Dell laptop that was always overheating. My son pulled it apart and found that the CPU had been put in place with the plastic cover on ... and thermal paste helpfully squeezed on top of that!

This wouldn't make the room any cooler, though, as the total energy used by the computers is the same. It would just dissipate the heat in the air, rather than keep it on the chips.

> This wouldn't make the room any cooler, though

I feel like that might be embedding some questionable assumptions, such as:

1. That CPU computing work completed is directly proportional to joules of heat emitted.

2. That stymied active cooling measures don't themselves generate heat, such as all the fan motors spinning at 100% and the AC/DC conversion to power them.

3. That even if performance was improved by removing thermal-throttling, demand would still push everything to run at full-tilt rather than allowing some machines to idle.


I don't think GP was suggesting it would. The bit about the server room being hot as hell was just to illustrate that these machines were running in a very hot environment without proper cooling.

Ah, maybe, that makes sense.

Presumably it would make it hotter, as the CPUs wouldn't have to throttle themselves so much, generating more heat.

As I understand it, hot CPUs are less efficient CPUs. Could have an effect on temperature.

Unless the work to be done is unlimited, all else being equal, hopefully a well functioning system puts out less heat to do X amount of work than a poorly running system.


I would guess the opposite. Less energy pumped into the fans running at 100%.

The fan is 1W, the CPU hundreds, though. The fan's power usage is negligible.

Small desktop-class fans, sure. But when you're talking 1U or 2U rack-mounted systems with high speed (>10k RPM) counter-rotating fans, they often suck tens of watts each. And usually in those systems there are at least four, sometimes six, often eight. So it's reasonable to think that bringing down overall compute load over hundreds of systems would cut out a significant, measurable chunk of energy usage in just moving air.
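For a rough sense of scale, here is a back-of-the-envelope sketch of the fan arithmetic above. The specific numbers (200 servers, 6 fans each, 20 W per fan at full speed) are illustrative assumptions picked from the ranges mentioned in the comment, not measurements. The cube-law scaling is the usual fan affinity approximation, and real fans will deviate from it.

```python
def fleet_fan_power_w(servers, fans_per_server, watts_per_fan_full, speed_fraction=1.0):
    """Estimate total fan power for a fleet of servers, in watts.

    Uses the fan affinity approximation: power scales roughly with the
    cube of fan speed. All inputs are assumptions supplied by the caller.
    """
    per_fan_w = watts_per_fan_full * speed_fraction ** 3
    return servers * fans_per_server * per_fan_w


# Hypothetical fleet: 200 servers, 6 fans each, 20 W per fan at full speed.
full_speed_w = fleet_fan_power_w(200, 6, 20.0)
half_speed_w = fleet_fan_power_w(200, 6, 20.0, speed_fraction=0.5)

print(f"fans at 100%: {full_speed_w / 1000:.1f} kW")
print(f"fans at  50%: {half_speed_w / 1000:.1f} kW")
```

Under those assumptions the fans alone are a multi-kilowatt load at full speed, which is why letting them spin down across a whole room could plausibly matter.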

Do server fans even throttle? All of the few I've seen are just on 100% always.

One would think the fans wouldn't work as hard, but then again server fans might blast on full regardless. It's amazing that none of the machines / components overheated and failed.

> There were even some CPUs that didn't have any coolers on them, just film

...I'm pretty surprised these systems booted at all without CPU coolers?


Now's the time for the infamous Tom's Hardware video comparing a P4 to an AMD Athlon to see what happens when you remove the cooler...

CPUs are very good about protecting themselves, but I'm shocked no one noticed that they took many hours to boot up or to do anything at all.

The thermal mass may have been enough to get through a lot of the boot, then by the time it's running maybe they're so lightly loaded nobody noticed.

Some people really like dedicated single function hardware, they might have been putting a micro service on each one


Sure, but in this case, I would have expected "protecting itself" to mean "shutting down." Can a server CPU—or any x86 cpu, for that matter—actually downclock itself enough to run without any sort of cooler?

486s were sometimes passively cooled. A room full of hardware replaced by a very early virtualization setup lines up reasonably well.

I did one of these in the early naughts. Something like half a rack of old pentium 2-3 era boxes running non-prod — back office, network share storage, dev, staging — to a single beefy server running some flavor of vmware. We had all the older gen hardware for isolation and convenience, not because we needed the compute so I specced a “big” box that could hold a lot of storage (for the time) and performance was a non-issue. I think I also saved a lot on Microsoft licensing which made budget for the hardware.


In the early 2000s we had a mail server running on an old recycled Cyrix 5x86. When we had to move offices someone knocked down the wall next to the server but as it turned out the Ethernet cable was stapled to the wall and the machine went flying. Whatever, pack it up and move to the new office, plug it back in and it works fine.

About a year later we were rearranging some stuff and someone powered it off and picked it up to move it… and heard a rattle inside. Pop the cover off and the CPU cooler is sitting in the bottom of the case and the CPU is just sitting there naked. Had absolutely no problem running without the fan at all. We clipped it back on anyway just to be safe :)


There are a few videos on YouTube showing CPUs running without a heatsink. I haven't watched them to see how long they can run for, or if any BIOS changes were needed (I noticed the first mentions a lower voltage needs to be set to boot into Windows), but it seems possible at least.

There are also comments that claim to have done this (the oldest I saw was a Q8200, and someone said a 3000G could be run indefinitely without a heatsink).

https://www.youtube.com/watch?v=tU9yjwMlbRI (Ryzen 3000)

https://www.youtube.com/watch?v=ycIF1NDkW6M (Pentium G5400)


I was playing Cities: Skylines on my old desktop PC a couple of years ago and the frame rate was really low. Windows was running sluggishly but mostly fine. I downloaded CPU-Z to figure out what was going on with Skylines, and noticed that the core frequency was way lower than it should have been. I poked around in software for a while but couldn't figure out what was going on ... until I opened my case and realised that my CPU heatsink was dangling off the CPU, the top mounting pins having come loose.

So, yep, CPUs - at least 2014-vintage Intel Core i5s - can run surprisingly well without cooling!


I've heard that Intel CPUs can (often?) function (sloooowly?) without a cooler, while AMD CPUs will burn up. Perhaps the latter's temperature sensors aren't necessarily close to the major sources of heat, so the chip burns up before it notices that it's burning up. (And a cooler spreads the heat out so that the temp sensors see the real temperature.)

I'm writing this on a passively-cooled mini PC with Intel N100 CPU. (It has only E cores, no P cores.)

But passively cooled doesn't mean no cooler! Or does the CPU literally not have a heatsink?

It has a heat sink.

Right, so that's very different than covering the CPU with tape and nothing else!

I frankly don't believe GP's story, I think GP must be misremembering this detail.


Yeah, I didn't read the OP before commenting, wasting everyone's time...

I've seen one downclock to 300MHz after the heatsink fan failed.

I believe they can, although you’d probably violate the warranty.

> CPUs are very good about protecting themselves, but I'm shocked no one noticed that they took many hours to boot up or to do anything at all.

Why would they take hours? Something something about energy consumption (and heat) growing with the square of the clock speed. You take a server that can go to, say, 3.2 GHz and limit it to, say, 0.8 GHz, it's going to be cool to the touch. And at 25% of the speed, something that takes milliseconds or seconds won't suddenly take hours.

That's why nobody noticed: because they didn't take hours either to boot or to do anything.
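To put rough numbers on the 3.2 GHz vs 0.8 GHz argument, here is a small sketch. It assumes the work is purely CPU-bound, so run time scales inversely with clock speed. It also uses the common first-order approximation for dynamic power, P ~ C * V^2 * f. How much the voltage actually drops at the lower clock is an assumption (the 0.7 below is made up), and static/leakage power is ignored.

```python
def scaled_duration_s(duration_s, full_ghz, limited_ghz):
    """Duration of a purely CPU-bound task after the clock is limited."""
    return duration_s * full_ghz / limited_ghz


def relative_dynamic_power(freq_ratio, voltage_ratio=1.0):
    """Dynamic power relative to full speed, using P ~ C * V^2 * f."""
    return voltage_ratio ** 2 * freq_ratio


# A hypothetical 60-second boot at 3.2 GHz, limited to 0.8 GHz.
slow_boot_s = scaled_duration_s(60.0, full_ghz=3.2, limited_ghz=0.8)
print(f"boot time at 0.8 GHz: about {slow_boot_s:.0f} s")

# Power at 25% clock, with and without an assumed voltage drop to 70%.
print(f"same voltage:  {relative_dynamic_power(0.25):.2f} of full power")
print(f"70% voltage:   {relative_dynamic_power(0.25, 0.7):.2f} of full power")
```

Either way, a 4x slowdown turns minutes into a few more minutes, not hours, which is the point being made above.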


> Why would they take hours?

Thermal throttling. I have personal experience with that: once I received a laptop which was missing the four screws holding the heatsink. It took over half an hour trying to boot Windows before powering down due to overheating. Since I'm not used to Windows, I thought it being very slow (on the first boot, which sets things up) might be normal, but suddenly powering down certainly wasn't, and the BIOS event log pointed to the culprit.

The issue is that AFAIK it does not reduce the clock rate; it runs at the normal full clock rate, then when it detects the temperature went over the limit, it pauses the CPU for a while to let it cool down. With a missing CPU fan, that might be enough, but it wasn't enough with the heatsink detached.
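The "run at full clock, pause when too hot" behaviour described above can be illustrated with a toy thermal model. This is only a sketch: the lumped heat capacity, power draw, temperature limits and cooling coefficients are all invented numbers, and real CPUs combine several throttling mechanisms. It just shows how the fraction of time spent running collapses once cooling gets bad enough.

```python
def simulate_duty_cycle(cooling_w_per_c, power_w=65.0, ambient_c=25.0,
                        limit_c=100.0, resume_c=90.0,
                        heat_capacity_j_per_c=50.0, steps=100_000, dt=0.01):
    """Return the fraction of simulated time the CPU spends running.

    Toy model: the CPU dissipates power_w while running and nothing while
    paused; it pauses at limit_c and resumes at resume_c. Heat leaves at a
    rate proportional to the temperature above ambient. All parameters
    are hypothetical.
    """
    temp_c = ambient_c
    running = True
    running_steps = 0
    for _ in range(steps):
        if running and temp_c >= limit_c:
            running = False   # too hot: pause
        elif not running and temp_c <= resume_c:
            running = True    # cooled down: resume
        heat_in_w = power_w if running else 0.0
        heat_out_w = cooling_w_per_c * (temp_c - ambient_c)
        temp_c += (heat_in_w - heat_out_w) * dt / heat_capacity_j_per_c
        running_steps += running
    return running_steps / steps


# Assumed cooling coefficients: heatsink attached vs. detached.
attached = simulate_duty_cycle(cooling_w_per_c=2.0)
detached = simulate_duty_cycle(cooling_w_per_c=0.2)
print(f"heatsink attached: running {attached:.0%} of the time")
print(f"heatsink detached: running {detached:.0%} of the time")
```

With the attached-heatsink numbers the chip never reaches its limit, while the detached case spends most of its time paused.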

> That's why nobody noticed: because they didn't take hours either to boot or to do anything.

It's normal for servers to take a while to boot (before even getting to the operating system).


> The issue is that AFAIK it does not reduce the clock rate; it runs at the normal full clock rate, then when it detects the temperature went over the limit, it pauses the CPU for a while to let it cool down.

I'm 99% confident that that is not a thing, pausing the cpu. The closest thing that exists is sleep/suspend/hibernate, but those don't work like that (on temperature triggers). Other than those, if the machine is on: the cpu is always doing _something_.


> Other than those, if the machine is on: the cpu is always doing _something_.

That used to be the case back in the 1990s when running MS-DOS. Nowadays, every operating system "pauses" the CPU when nothing is going on. The traditional way on x86 was to use the HLT instruction, which stops the processor until an interrupt happens; other architectures have their own equivalents (for instance, ARM has the WFI and WFE instructions), and x86 has more modern ways to do the same thing (like the MONITOR/MWAIT instructions). These instructions allow the processor to dynamically enter lower power modes, for instance by blocking the clock and/or power going into parts of the processor core (that is, "gating" the clock or power).


Isn't that just parts of the cpu and at the OS level though? I don't think that happens for thermal reasons, more for "welp, nothing worth doing" reasons.

Maybe I'm wrong, looking back at what I said I'm having a hard time explaining what my objection is, but I just don't know of a mechanism that's like "welp, I'm too hot, pausing for a few seconds, see you then!". Aren't there buses and caches and etc. that need constant attention whenever the machine is on? The cpu can't just tell everything to wait for a bit.


Probably depends on the CPU. When my 2014-era Core i5's heatsink fell off (see comment above) I could see the CPU clocks actually reducing.

Guessing they got a 100x speedup when they moved to the cloud.

My 1070 is currently running with tape covering half of the PCIe pins so it works on x8 instead of x16.

One day it stopped working. During troubleshooting it would work only on a 10 year old PC I had lying around. The difference? Despite being full-size, the GPU slot on the old mobo was wired only for x8.

I reckon that the GPU die lost connection on some pins. Not worth fixing, x8 works fine.


Currently on a mission to tape a bunch of gear too. Though to cover LEDs. Why does consumer gear meant for houses look like a disco at night?

When I first moved into my studio apartment, I had a mission—no little lights! When I go to sleep, it should be completely dark.

Then I discovered that putting tape on my mouse's USB receiver interfered with the wireless connection.

Then I got a USB hard drive which has an LED along the top left corner, where tape won't stick.

Then I got a microwave which always displays the time when it's not in use, and if I cover up that display, I can't see the timer for my food.

And somewhere around there I gave up. It was impossible to tell before buying these products that they would come with lights, no one talks about this!


That's exactly the advantage of sneakernet shopping, you can ask to see the product in operation.

I don’t know that I’ve ever seen a microwave plugged in on display at any store. Ditto for any consumer electronics other than TVs and computers (and I guess stereos).

Maybe at Fry’s they had the microwaves plugged in? I don’t remember.


Yeah I don't think I ever saw one either, but at least over here you can bring it over to a counter, or request an employee to turn it on and test.

Obviously not viable for evaluating continued operation for something like a fridge, but just turning it on for the panels is usually possible.


Let me plug LightDims: They're translucent stickers that keep LEDs from shining into the room, but still let you see the LED status.

Be sure to buy from the original family business, not the knockoffs.

https://www.amazon.com/gp/product/B00CLVEQCO/



I used those to dim the "Passenger" seatbelt light in my old car. Maybe I put my backpack on the seat.

I learned how to solder small connections just to bypass the LEDs that are on every single piece of electronics anymore.

I don't need my house to look like a fucking carnival. Who does this appeal to?


A fix without finding the cause cannot be called a fix

As the comments say:

> Now waiting for the war story from your successor "you'll never guess what I found when I took apart this one server that wouldn't wake-on-lan"


You can "root cause" to any depth. The trick is to stop at some point that is fixable.

If the thing is broken because X then you can fix it by removing X. That's enough explanation. You don't need to know "why" for everything.


Sounds like there was something wrong (at a hardware level) with the network card, so the real fix would be to buy a replacement.

But if disabling a faulty connection you don't need in the first place stops the problem happening, that's a decent workaround... I'd just be a little worried that something else could be faulty on the card if the WoL pin is glitching, so potentially you could have other intermittent issues.


What about using a relay as a temporary solution to remotely turn it off instead of physically yanking the cable?

Maybe that's good enough for them, people never touch this problem again, and we wouldn't have this story (which of course would be sad). But it occurs to me that the author wasn't in a rush to fix this and had plenty of time to poke around. I wish I could do that.


Speaking as a "maker," you can find on AliExpress a fairly generic relay board that's controlled by a logic signal. That, and a cheap microcontroller board, are a super useful problem solver.


