Heat pipes - the large copper tubes between hot chips and heatsinks - are actually hollow and usually filled with some liquid/gas mixture. That's often butane or something with a lowish boiling temperature at lowish pressures.
They work by having the liquid boil at the hot end, the gas flow along the pipe, condense back to a liquid at the cold end, and then the liquid flow back.
That last step, the liquid flowing back, is the critical one. With gravity, the liquid can gush back really fast, and loads of heat can be transferred.
But if the card is the wrong way up, then it relies on the liquid wicking back along some clothlike material coating the inside of the pipe. That's a far slower process. Therefore the heat that can be dissipated this way is much lower.
That's why heat pipe performance is dramatically affected by orientation.
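To make the orientation effect concrete, here's a toy back-of-envelope sketch. All the rates and fluid properties below are assumed numbers for illustration, not from any real heat pipe: the point is just that the heat a pipe can move is capped by how fast condensed liquid returns to the hot end.

```python
# Toy illustration: heat transported = liquid return rate * latent heat.
# Numbers are assumptions for illustration, not from any datasheet.

LATENT_HEAT_WATER = 2.26e6  # J/kg, latent heat of vaporization of water

def max_heat_transport_w(liquid_return_rate_kg_s):
    """Heat a pipe can move is capped by how fast condensed liquid
    gets back to the evaporator: q = m_dot * h_fg."""
    return liquid_return_rate_kg_s * LATENT_HEAT_WATER

gravity_assisted = max_heat_transport_w(1e-4)  # liquid gushes back (assumed rate)
wick_only = max_heat_transport_w(1e-5)         # capillary return only (assumed rate)
print(gravity_assisted, wick_only)  # roughly 226 W vs. 22.6 W
```

With an order-of-magnitude slower return rate, the transportable heat drops by the same order of magnitude, which is why the wick-only orientation performs so much worse.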
The designers of this card will have been aware of that, but likely decided a small performance hit for sustained workloads was probably worth it.
It's not a small performance hit. The card is throttling a lot and ramping the fans up to 1000 RPM higher than usual because cooling fails completely. There were reports of cards failing while testing this (e.g. in the der8auer video), but of course it's not possible to be sure that a card broke because of this. It's possible, though.
This is not an expected failure mode the engineers would have been okay with even if it did not kill cards. No one would be okay with the heatpipe drying up like this.
That's not very reassuring when the cooler is not working as expected. VRMs get killed by heat (I don't actually think that's what is happening here, but I do think AMD is making a mistake by letting customers run into this issue and by giving me reason to speculate like this).
I mean no VRM or vmem is meant to run at 110C sustained.
The 3090 design had an issue with VRAM cooking, leading to 105C+ temps, at which point the GPU would throttle. Many people who didn't mitigate that problem (with extra cooling) experienced memory failure. The main problem was that VRAM temps weren't reported, so it looked like the GPU temps were fine.
> That’s pretty terrible efficiency. I can get 120 with only 290W. For me the extra 4MH/s does not pay for 60W.
Some individual enthusiasts were trying to get every last "megahash" for benchmarking bragging rights, but anyone who had a bunch of 3090's would minimize power consumption because the marginal profit on the last 50+ watts was negative, which had the nice side effect of lowering temperatures.
A 3090 on Ethereum (even after undervolting/underclocking the GPU and bumping the VRAM frequency) was hitting 100C+ on VRAM. You went from 95MH/s to 120MH/s by doing this. Throttling happened at 110C, so you wouldn't notice it in the hashrate unless you had really poor cooling. These temperatures were causing premature VRAM failure even in the first months of use. If you put some extra heatsinks on the card and got the VRAM down to ~90C, you had no issues; almost all miners were doing this. If you didn't cut back the GPU voltage/clock, VRAM temps would climb to 100C+ even at stock VRAM frequency.
GPU core temps dropped by 25-30C when underclocking/undervolting and power would also go down by 60-80W.
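Plugging the numbers quoted earlier in the thread (120 MH/s at 290 W, an extra 4 MH/s for 60 W more) into a rough profit check shows why the marginal watts had negative value. The electricity price and revenue-per-MH figures below are assumptions for illustration, not real market data.

```python
# Rough daily profit check for the hashrates/wattages quoted above.
# usd_per_mh_day and usd_per_kwh are assumed figures for illustration.

def daily_profit(hashrate_mh, watts, usd_per_mh_day=0.02, usd_per_kwh=0.30):
    revenue = hashrate_mh * usd_per_mh_day          # $/day from hashing
    power_cost = watts / 1000 * 24 * usd_per_kwh    # $/day for electricity
    return revenue - power_cost

base = daily_profit(120, 290)    # efficient operating point
pushed = daily_profit(124, 350)  # chasing the last 4 MH/s
print(base, pushed)
# The extra 60 W earns 4 * 0.02 = $0.08/day but costs 0.06 * 24 * 0.30 = $0.432/day
```

With these assumed prices the pushed configuration actually loses money overall, which matches the observation that miners backed off voltage and clocks.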
+1. I mined on a few 3060s (and heated my house) last winter. All with cores underclocked, memory overclocked, all at ~60C temperature wise, each consuming about 120W.
Anecdotally, my friend got one he had to return because it maxed out the fans, throttled, and crashed, regardless of orientation.
These are defective cards, and there are a lot of them.
Don't explain away broken cards by calling it some design compromise. A high percentage of the AMD coolers are completely unusable due to defect, and should be replaced. The ones that work in one orientation but not the other are also defective. No one should have to put up with any dry-out whatsoever, in either orientation.
Well, the card actually performs better in the orientation where half the vapor is going down and wicking back up, and worse in the orientation where it's flat.
But the bigger issue here is not the way orientation affects performance, but that defective cards seem to be hitting dryout. At that point a combination of not-enough-liquid and hot wick means that the liquid isn't making it back to the heat source before it boils again, and performance plummets.
The thing is that the vapor chamber is working perfectly fine upside down in most of these cards. It sounds like a manufacturing problem not a design flaw.
"We have the fix and we are ready to send it to you. Just call our tech support line if you bought it from amd.com, or if you bought it from one of our AIB partners, call them; they have [replacement] units."
Geerling says:
AMD said "customers experiencing this unexpected limitation should contact AMD support", but if you head to the support page, and call the US support phone number, it directs you to the warranty claims page on the website. On that page, it guides you through a wizard and determines if you didn't buy the AMD card from AMD.com itself, you have to contact the partner manufacturer (even though in this case it's the reference design, just packaged by Sapphire).
As per der8auer and GamersNexus, AMD claims to know which batches of cards have underfilled vapor chambers yet at the same time refuses to recall them and wants customers to RMA instead.
Is this really that unreasonable? From the point of view of a customer who has one of these cards, a recall or RMA is basically the same: they have to mail in their card to get fixed or replaced.
Customers could also return them to the retailers they bought them from.
The hyperventilating about this on tech YouTube is getting a bit ridiculous. The same thing happened with the controversy over the power connector on the 4090. Yes, it's bad -- but it's not like AMD ran over your dog.
When a new product has bugs (and many do), it takes some time to figure out who is impacted and what the fix is -- and when an issue becomes high profile, a lot of people who are not impacted manage to convince themselves that they are. Given that, and the general hyperventilating about this, I am not surprised AMD wants to be careful.
> From the point of view of a customer who has one of these cards, a recall or RMA is basically the same
> a lot of people who are not impacted manage to convince themselves that they are
Well, there's one difference in the two: when the company doesn't announce a recall, customers are left guessing whether or not they're affected, assuming they even find out about this known issue to begin with. But if there was a recall, customers can verify whether or not they're affected and they can be notified.
They KNOW which cards are affected... which means they could at LEAST publish the serial numbers so people who do know about the issue can check, without even getting into the proactive-contact part.
I don't have one, but I don't game much, and mostly run in a Linux desktop without overclocking... I do sell my older cards when I update/upgrade sometimes. This could literally saddle someone else down the road with a card that breaks/throttles/ramps fans to an annoying level, because AMD didn't at least release serial numbers for the cards in question.
I'm guessing it's a MASSIVE amount and they just don't want to do the right thing.
This is not at all the same kind of "controversy" as having a component fail catastrophically and catch fire. Comparing it to such is minimising how reckless and callous nvidia was.
GamersNexus paid a bunch of money and had a failure analysis lab go over the Nvidia issue with a fine-tooth comb. I think their conclusion was that the connectors weren't fundamentally problematic - the melting issue overwhelmingly occurred when people didn't actually seat the connector properly. Further, it had to be off by a lot, and the cable had to simultaneously be at a significant angle. [1]
I don't disagree that there could have been a better design - the design is fundamentally a risk when misused, even if there shouldn't be any risk when seated properly. We should ask for better than that from hardware makers.
This feels like a job for a warning sticker this generation and something a little different next generation.
The Nvidia issue is not like the NZXT H1 riser fire issue, for instance.
To my knowledge it doesn't disconnect. People simply did not connect it.
GN is extremely critical and holds manufacturers to a very high standard. GN is the first to lambaste a manufacturer over a design that's even slightly sub-par. If Steve says it's fine, it's almost certainly fine IMO.
I'd recommend watching the video. [1]
[edit] johnnyguru apparently said the same thing. These are no spring chickens, and Johnny now works at Corsair AFAIK making power supplies.
The fix is shorter sense pins that don't signal to the card that the cable is plugged in until it's fully seated. Don't get me wrong, it should have been that way from the start, and a warning is needed. That said, it's not an issue when used correctly.
It's not the same, but also, the 12-pin connector melting issue affected well under one percent of owners, closer to 1 in thousands. AMD's defective coolers are, by a forgiving estimate based on polls and anecdotal info, affecting well above 1% of the first wave of those cards, maybe reaching into the double digits.
I own a 3090. I'm skipping the 4090 for now as I am disturbed by melting-connector-gate; the YouTube and Reddit threads scared me off as a customer. I don't recall anything like this for previous generations.
I remember my old Voodoo graphics card drew like 10 to 15 watts. Then I was surprised by how much power the 6800 Ultra consumed: a whopping 70 watts. Those numbers are peanuts today: 350W for the AMD card and 450W for the Nvidia card.
AMD is predicting 0.7kW in 2025. I imagine that in 2030 people will have some fancy CPU and RAM, but the GPU will be sitting on top of a 2kg block of copper to keep up with the heat generation. 1.2kW cards, here we go!
Next up: that power cable coming out of that wall is going straight into the GPU.
>> Next up: that power cable coming out of that wall is going straight into the GPU.
This is already a thing in some servers, and of course, crypto miners that use server power supplies + adapter boards + 8-pin plugs to power the GPUs separately. You chain the server PSU with your ATX PSU using a jumper and they start simultaneously. No doubt that some form of that is coming to gaming and machine learning PCs soon. (It's already here for ML builds actually, but not that widespread I guess)
>Next up: that power cable coming out of that wall is going straight into the GPU.
It kind of seems like we're evolving towards the GPU being the "main computer".
We need to keep an extra close eye on things like open standards and free drivers. nVidia's closed-source stack only feels like a minor hassle right now, but if you think of the GPU as the heart of the system it puts us right back to the bad old days of computing where cross-platform was impossible, compilers were vendored, and operating systems were closed source. We got into the current cushy situation of PC compatibility by sheer luck, and the "mobile revolution" is a warning sign that we can't take any of it for granted when the computing substrate evolves.
The way that is worded makes it sound like a certain German retailer is frequented by miners, and this is what the failure mode of RDNA2 cards look like when overclocked and overvolted until dead.
The article says the users claimed they weren't mining, but, obviously, a miner wouldn't say they were mining with it, even though the strict EU consumer protection laws would probably protect them anyways.
Miners almost never overclocked but tended to underclock and undervolt. Especially in a place like Germany where power is very expensive miners would be even more incentivized to find the optimum power/performance spot.
Usually miners undervolt and also clock down the gpu. The memory is what gets clocked up. Majority of those deaths happen due to heat-up and cool-down cycle of the gpu between gaming sessions.
Not implying that mining is light on the cards vs occasional gaming but if you game daily on it you'll have higher gpu temps during gaming.
Source: had an rx6800 that mined for over a year and no such issue.
I'm going to be charitable and make a wild guess. There's probably some software section that checks a memory mapped region for a value from thermometers and throttles as necessary. It doesn't sound too difficult to make a mistake related to testing that region.
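A minimal sketch of the kind of throttling loop being guessed at here. Everything in it - the thresholds, the step size, the hysteresis band - is invented for illustration; it just shows the shape of logic where a testing mistake (e.g. never exercising the throttle branch) could slip through.

```python
# Hypothetical throttle policy: step clocks down above a temperature
# limit, step back up below a lower release point, hold in between.
# All thresholds and step sizes are invented for illustration.

THROTTLE_TEMP_C = 110
RELEASE_TEMP_C = 100  # hysteresis so clocks don't oscillate at the limit

def next_clock_mhz(temp_c, clock_mhz, max_mhz=2500, min_mhz=500):
    if temp_c >= THROTTLE_TEMP_C:
        return max(min_mhz, clock_mhz - 100)  # too hot: step clocks down
    if temp_c <= RELEASE_TEMP_C:
        return min(max_mhz, clock_mhz + 100)  # cool enough: recover headroom
    return clock_mhz  # inside the hysteresis band: hold

print(next_clock_mhz(112, 2500))  # 2400 - throttling
print(next_clock_mhz(95, 2400))   # 2500 - recovering
```

If a test harness only ever fed this loop temperatures below the limit (or read the wrong sensor region entirely), the throttle branch would ship untested, which is the kind of mistake the comment above is speculating about.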
If I were an evil overlord / dictator trying to follow one or more of the lists of things to do / not do... a dedicated hardware circuit for thermal overprotection would be among the must have items for any non-trivial device. Trivial devices being things that could fail with no risks, even from their worst failure mode.
It's more about how AMD is responding to this. Defects happen, but AMD downplayed it by saying the temperatures and throttling are normal, refusing to recall, continuing to sell affected cards, closing the support website just as the news started spreading, and making customers jump from manufacturer to seller to AMD to get a replacement/refund. And they don't have replacement cards in stock either. All this while their marketing department spends time making fun of the competition.
I think you're misunderstanding the article. It's not that he wouldn't like a working AMD card it's that the cards are nearly impossible to find "since no comparable replacement cards can be had without paying scalpers on eBay".
It takes a good bit of work to get a flagship GPU at MSRP right now. If you've managed to get one from AMD and one from Nvidia at the same time and one turns out to be defective + they won't offer a direct replacement, only refund, then you're unlikely to be interested in putting a bunch more time in trying to find a non-defective model in stock at MSRP. Particularly when the manufacturer is making 0 effort to make this easier for you.
While this may be the first for you, it may just be another example on a list over time.
Apple walked away from Nvidia to solely use AMD after enough instances, but I was only personally affected by one model. It was apparently the one (2011 MBP) that broke the camel's back though, so there's that.
The bump crack issue with MacBooks was in the open in 2008.
Apple used AMD GPUs for the MacBook Pro 2011.
4 years after bump crack, they switched back to using Nvidia GPUs for the trusty old mid-2012 MacBook Pro that's still my daily driver, with excellent keyboard, and for the late-2013 MacBook Pro too.
If there's anything that, I don't know what it is.
Are you telling me the laptop that I have isn't what it is? Really? That's what you want to do?
It was repaired twice by Apple, but the third time it happened, they wanted me to pay for it which was going to be a good chunk of the cost of a new MBP.
I don't know what you're trying to say, but I never said that the 2011 model I had was the last to ever have an Nvidia GPU, or that AMD was never used before then. You're really trying to start something just because... what, you're bored? I specifically chose the Nvidia GPU over the AMD option because I wanted to use CUDA. Most of us are aware that Apple's hardware designs and manufacturing pipelines are planned so that if they decide something today, it won't reach consumers for at least a year, if not two.
So I guess, I'm really asking WTF is the point of you arguing?
You made a conjecture that Apple abandoned Nvidia GPUs after issues with their earlier GPUs.
This is not a new argument. It’s been made by many people before. All of these people forget that Apple switched back for a while, long after those issues.
I’m simply pointing that your conjecture is at least on shaky ground.
You also claimed that the 2011 MBP broke the camel’s back. I point out that there was no 2011 MBP model with an Nvidia GPU, which angers you for some reason. I can only assume that you are confusing “2011 MBP model” with “2010 MBP model that I bought in 2011”, but who knows?
What does price gouging mean to Jeff? What does it mean to you?
Gaming cards are a market with competition (AMD, Nvidia). They are also a luxury. So how do you price gouge? Surely at almost all moments in time both of these public companies have sought to maximize profit, as they are doing now.
You don't maximize profits long term by killing (parts of) the market you rely on. Sales are at an all-time low because of the high prices, and if they continue like that the whole PC gaming market will evaporate.
Companies are under no obligation to chase short-term profits; that's a US myth.
Also, if you are in a market with very few participants, there is no healthy mechanism of supply and demand. These companies are able to dictate prices at will, and that can be price gouging (which is why Intel entering the GPU market with a competitive product would be so important; three vendors are better than two).
The gaming market won't evaporate. If the market can't bear the new prices, Nvidia and AMD will just lower the prices.
A real example of price gouging would be a store charging a hundred dollars for bottled water after a hurricane, because people are desperate for it and can't get it elsewhere.
The situation in the gaming market is quite different. Gaming PCs in general, and high-end GPUs in particular, are luxury items, not necessities. There are no longer any meaningful GPU shortages, and there is an enormous secondary market that people can use if they don't want to pay the new prices.
Plus, these cards are the GPU equivalent of supercars -- looking at the Steam hardware survey, the number one GPU is the affordable RTX 1650, which only recently supplanted the 1060. Is Ferrari price-gouging because their cars cost more than Toyotas?
Side note to that: The GTX 1650 is actually weaker than the GTX 1060 (benchmarks: https://www.pc-kombo.com/us/benchmark/games/gpu/compare?ids[... ). That it was able to overtake the 1060 is actually a really bad sign for the health of the market. It's also not really affordable, in that it is completely overpriced for the performance it offers.
But the "new version" of the GTX 1060 is a lot more expensive. That's why it's the GTX 1060 that remained so popular and not the RTX 2060 or 3060. The 1060 came out almost 7 years ago. (And the 1650 is even lower performance!)
Now you hear that GPU sales are hitting lows they haven't been at for a while. The fear is that making a PC game would be targeting an old market, meanwhile consoles are affordable. But if fewer games are made for PCs then there's less reason for people to get gaming PCs too.
> ... the number one GPU is the affordable RTX 1650,...
I guess you mean GTX 1650.
"Affordable" is a relative and vague term; it's 200 EUR in my area for my target spec of HDMI 2 and DisplayPort (UHD 60fps HDR for video watching and desktop work).
I doubt Jensen Huang or Lisa Su are going anywhere any time soon. They're very entrenched and respected in their companies.
Also, they have the perfect excuse now: "you see dear members of the board, we're in a global economic recession, that's why GPU sales are low, not that we priced them way too high compared to previous generations".
>the whole PC gaming market will evaporate.
I'm not happy about the current GPU situation, but there's no way this is true. Will PC gaming be impacted? I'm sure. Honestly, it might be for the better. My Steam Deck uses 15W, and the graphics it can put out are good enough for me. Improvement over time is nice; it doesn't have to be so fast. Gaming won't lose anything by slowing down its graphics cycles.
>Gaming won't lose anything by slowing down its graphics cycles.
It will miss that next generation of would be pc gamers. If the new gen is priced out, then consoles become more attractive, which means less games are made for PC, which then creates a chicken/egg problem.
And actually there’s significant lock-in effect - people don’t want to give up the games and in-game items they’ve bought on consoles, or stop playing with their friends they’ve made, just like pc gamers don’t want to give up their steam library or friends.
Yeah, the definition of price gouging is sharply increasing the price of an essential product in times of disaster. Say, supermarkets sharply increasing the price of water or food despite their supply being more than capable of handling the demand, because they know people don't have the time or ability to shop around.
The term gets used as a buzzword now for anything people want that costs a lot.
"Price gouging is a pejorative term used to describe the situation when a seller increases the prices of goods, services, or commodities to a level much higher than is considered reasonable or fair. Usually, this event occurs after a demand or supply shock. This term is commonly used to describe price increases of basic necessities after natural disasters."
Plenty of people use the term to talk about non-essential products.
A demand shock is exactly what increased prices here, and now a failure of competition is letting the duopoly keep profit margins extra high.
This isn't a problem for non essentials. No one needs to buy a 7900 XTX. It's really just a consumerist impulse for the majority. The last 4 years of cards still work perfectly fine.
While consumers still demand the latest cards at these prices, they will stay at those prices.
They recently changed their agreements with card makers and forced very high pricing. They are charging more because they can, it’s the definition of price gouging.
Every company charges as much as it thinks it can for a product. If that counts as price gouging, almost 100% of all the companies that exist must be guilty of it!
I don't think this is always true. Sometimes companies think that it is in their long term interest to charge less to keep different set of customers than the ones that are right now willing to pay more.
This is very true for graphics cards, BTW, and is why Nvidia tried to lock cards out of being used for crypto. They believed, I would think, that it was in their long-term interest to cater to PC gamers rather than crypto miners.
That's the logic of the free market, yes. But do two companies which probably swap employees and executives constitute a market? If I went to a classical market, and two people were selling, is that a healthy market?
If those two people in your example tried as hard to compete with each other as AMD and Nvidia do, then yes, it's a market and it's more healthy than many others.
An unhealthy, uncompetitive market would not have such a notable rate of improvement of the products as we see in the GPU market.
Plus, if it's competition you are concerned about, the situation is improving! Intel is getting into the discrete GPU market too.
> such a notable rate of improvement of the products
To be fair, frames per dollar have been roughly stagnant for a couple of years. It's just that now you have the additional option of paying more dollars to get more frames. This is an unusual situation, but it is what it is. The AMD 7000 series is (so far) not better value than the AMD 6000 series. Same with Nvidia. That's odd.
If this is anything like that power connector debacle, it's actually a pretty small % of all users, they're just obviously very loud about it. (Which they should be, obv)
Manufacturing mistakes happen. The important thing is AMD is doing the right thing on the repair side for them, paying for cards to be sent back and fixed. This whole situation is a bit of a nothing burger.
In my case they aren't paying for cards to be sent back and fixed, and so far there's no concrete evidence it's a small/limited batch of cards affected.
Honestly, outside of those who actually benchmark their cards, most people would probably just live with slightly reduced performance and very noisy fans (none the wiser), so it's hard to tell how widespread the problem actually is.
there are system integrators (places like Puget Systems) who build boutique gaming rigs and workstations and those kinds of places perform additional testing to make sure everything is working as expected, and they've measured around 10% of MBA cards as having faulty coolers.
the thing with the connector problem was that nobody could reproduce it in a lab, so the rational assumption was that it was a very very rare failure case at best. NVIDIA has a lot of marketshare and the law of large numbers says that if even 0.04% of people experience a failure mode, that works out to at least a couple dozen people or whatever.
in contrast this one is easily measurable by SIs, and it very much should have been caught by QC at multiple levels - 10% is not exactly a rare failure, and if you can't catch a failure rate of 1 in 10 units then you are probably missing a bunch of other QC issues too. As Igor notes in his analysis - there actually are a lot of people who fucked up on this one.
Further, we are at the very earliest stage when MBA cards are most prominent in the market. Custom card designs (PCB and cooling) generally launch later in the lifecycle, this is a relatively large amount of the entire market of cards that have a problem. Once again, should have been obvious - if 200k cards launched and half of them are MBA (perhaps an underestimate) that means there's around 10k defective cards. Not everybody would (or will) notice the defect, and that's understandable to some degree because it's not emitting smoke and flames, but contrast maybe a couple dozen people plugging in their NVIDIA cards wrong with 10k AMD users with defective coolers here.
or perhaps a more dramatic comparison would be the recent spike in 6000-series chip failures reported by a well-known electronics repairman in germany... a relatively uncommon failure has now made up around two-thirds of his repairs in the past month, and the common factor seems to be that they were all using the same driver version. the New World thing was a huge scandal (despite basically tracing back to EVGA fucking up their soldering yet again) and people have perpetually murmured about "bad NVIDIA drivers causing chip failures back in the day" despite that really not being supported in any significant way, and here's actually the same thing seemingly occurring on AMD 6000 series cards. again, if this were NVIDIA I can only imagine the social-media shitstorm and the murmur campaign that would be in progress right now, despite it only being one repairman with a sample size of under 100 cards.
there is a major major attitude difference in the enthusiast community, people are ready to pounce at the first sign of a weakness or a problem from NVIDIA, and constantly assume the most pessimistic and nefarious possible scenarios, while downplaying AMD problems and giving them the most optimistic and altruistic benefit of the doubt at all times.
like, if this were NVIDIA, does anybody think the community's reaction to a bunch of cards seemingly failing with an unusual and specific failure mode would be "now calm down and be rational, let's wait and let NVIDIA investigate and see what they say"? After the connector "scandal", after New World, after POSCAP-gate, after the EVGA 1080 fires, etc?
pretty sure everyone whose card died in the last month for any reason would be coming out of the woodwork and posting on reddit, and tech media would be running breaking coverage about it, demanding statements from the manufacturer, etc.
or consider the 5700XT - if NVIDIA had shipped (what at this point appears to be) defective silicon, ignored widespread reports of crashes and blackscreening for months, and then after being called out by tech media finally released a couple halfhearted patches that didn't really fix the issue for most people, would the public have just shrugged and moved on like they did with AMD? There would be lawsuits and angry videos and widespread demands for refunds etc etc.
And they did the same thing with early Zen2 silicon which was far below advertised clocks and never really improved after the patches/etc. Later silicon was fine, just a batch of poorly-binned silicon at launch that they decided to ship anyway as higher-tier SKUs, but they never recalled it or made anyone whole. The social media community just buried it, did a bunch of "well it comes close if you just run loops of NOPs" damage control bullshit to argue that it wasn't technically false advertising. Once again everyone shrugged and moved on, later silicon got better and the people stuck with defective chips got screwed.
Everyone says AMD doesn't have mindshare but they clearly do have a massive amount of mindshare with loyalists that help them move past these sorts of issues very fluidly and without any accusations or recriminations or even controversy really. Everything is just "they're doing the best they can, practically a small family-owned business, you can't expect them to ship working products every time". The tyranny of low expectations. And that's what leads to these larger problems. The recall costs, lawsuits, and popular blowback are supposed to be the corrective mechanism for companies being sketchy as fuck or taking needless technical risks, and that has been short-circuited for AMD by popular assent.
a decently known PC hardware overclocking channel, der8auer, did a few hands on investigative videos on these kind of cards that I found interesting. maybe you will too. they do content in both english and german.
Pulse and Red Devil are the two major series that have hit the tops of the charts among miners: they overclock more easily, they survive longer while overclocked, they run cooler while overclocked, and their fans survive high RPMs longer.
They are also the two launch day non-refs for the 7900 XTX, and due to the vapor chamber issue, vanish from stock as soon as they come in.
Now that GPU mining is in decline, it's not hindering the rest of GPU users at least. But new Pulse models are still not readily available. Newegg and Amazon didn't even start selling them yet.
What is up with the first image? At first I thought it was a picture of a screen inception, but maybe it's a picture of a screen, with a slightly larger screen behind it, and a yet slightly larger screen behind that?
Perhaps. Or perhaps it's 1% of reference models. But it is, according to AMD, a knowable, countable, and therefore recallable number that AMD is electing not to issue a recall for.
If they did, both the consumer market and the investor market would know how big a problem this is. Which AMD is very clearly shying away from.
As a cynic, it makes me suspect that the number is on the larger end of the spectrum.
Several youtubers have bought multiple cards to test and nearly all of them experienced this issue when installed horizontally (vast majority of gpu installs are this way - motherboard vertical, card horizontal). AMD claiming that it's due to some vapor chambers not being filled fully is pretty bunk since it is across all vendors and there is supposedly a revised cooler design going out to OEM partners for recall/refitting.
According to AMD, it's not a design issue but a manufacturing issue: some of the coolers were not filled with enough water. Many cards with reference coolers do not suffer from this issue.
It has already been confirmed by major people in the community that designs, reference or not, that use a conventional cooler do not exhibit the issue, and will reach about 85C (i.e., normal) at 100% load.
The dead giveaway isn't that the chip reaches 110C; it's that the entire card gets so hot (due to a manufacturing error in the vapor chamber, independently verified by both the community and AMD) that the RAM starts throttling too (i.e., identical to the 3090 issue with the special Micron GDDR6; glad to see AMD also adopt RAM temp throttling). That has not been observed on any non-vapor-chamber card, or on any vapor chamber that doesn't have the issue.
I think part of the issue is that silicon is intentionally being run increasingly hotter over the last few years.
The temperature sensors have gotten good enough that the hardware can auto-boost right up to the physical limits. The card will essentially "overclock" itself until it either 1) hits a temperature limit, or 2) hits a power draw limit.
A high temperature is indeed normal, and customer support / marketing is aware of this. You have to look at the power drawn to reach that temperature to determine if there is a cooling issue, but such technical details are probably lost on most low-level CS folks.
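To make the "look at power, not just temperature" point concrete, a quick sanity check is to compare effective junction-to-ambient thermal resistance (temperature delta divided by power). A healthy cooler keeps that ratio roughly constant; a defective one shows far more degrees per watt. The numbers below are purely illustrative assumptions, not measurements from any real card:

```python
# Rough sanity check for cooler health: the same power draw should
# produce a similar temperature delta on a working cooler, so the
# effective thermal resistance (degrees C per watt) is the tell.
# All numbers below are illustrative, not measured values.

def thermal_resistance(t_junction_c, t_ambient_c, power_w):
    """Effective junction-to-ambient thermal resistance in degrees C/W."""
    return (t_junction_c - t_ambient_c) / power_w

# Hypothetical healthy card: 85 C junction at 350 W in a 25 C room.
healthy = thermal_resistance(85, 25, 350)

# Hypothetical suspect card: hits the 110 C throttle point even though
# power has already been pulled back to 250 W -- far worse C per watt.
suspect = thermal_resistance(110, 25, 250)

print(f"healthy: {healthy:.3f} C/W, suspect: {suspect:.3f} C/W")
```

A card that reaches 110°C only while sustaining full rated power is behaving as designed; one that reaches it at reduced power is the cooling failure being discussed.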
Yes, at 7nm and below you pretty much have to have runtime monitoring of the voltage/temperature microclimate across each core (CU or SMX), because it simply is not possible to validate that a given core will always be stable under a given operating regime. You cannot guarantee that the worst-case scenario (a very hot transistor that needs more power, sitting in an area where voltage is drooping low, with all the nearby transistors firing at once under a worst-case neighbor load) will never result in switching times that are too slow and an incorrect output. The operating margin at 7nm and below is extremely slim, and you just cannot get competitive clocks while guaranteeing that transistors will never be pushed out of their stability regime. And in computing, a one-in-a-million chance happens 1000 times a second.
Instead you actually design the chips to detect this and slow themselves down. That was the whole thing with "clock stretching" showing up on Zen 2. Everyone has had to implement something similar: the chip watches itself and will delay instructions or slow down clocks (or something along those lines) if the voltage/temperature is not going to allow the timing hotpath to converge inside the clock window.
Side effect is that now everyone is looking at "hotspots", but hotspots are generally significantly higher than the average; that's why you have to watch them. There is a much stronger "microclimate" effect where voltage can droop and temperatures can rise significantly between different regions of the chip.
Also, very simply, density has been increasing faster than power consumption has been decreasing (no more Dennard scaling), so overall power (and thermal) density gets higher with every shrink.
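The end-of-Dennard-scaling point above is easy to see with back-of-the-envelope arithmetic: if a shrink doubles transistor count in the same area but only cuts per-transistor power by 30%, power density still climbs. These figures are invented for illustration, not real process numbers:

```python
# Illustrative arithmetic for post-Dennard scaling: transistor density
# rises faster than per-transistor power falls, so W/mm^2 increases.
# All numbers are made-up assumptions for the sake of the example.

area_mm2 = 100.0            # same die area before and after the shrink
old_power_w = 200.0         # total power of the old chip

density_gain = 2.0          # 2x transistors per mm^2 after the shrink
power_per_transistor = 0.7  # only a 30% per-transistor power reduction

new_power_w = old_power_w * density_gain * power_per_transistor
print(f"{old_power_w / area_mm2} W/mm^2 -> {new_power_w / area_mm2} W/mm^2")
```

Under these assumptions the thermal density rises 40% per generation, which is exactly why hotspot monitoring keeps getting more important.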
110-degree junction temperatures are indeed nothing to worry about; it's just that the card shouldn't be throttling and reaching these temperatures at these frequencies and power draws.
When a new product is released, it is common for there to be a learning curve for support in terms of what is normal and what is not. They are not engineers.
This is just an unfortunate coincidence - batches were tested in vertical orientation, which is normal for test benches, but most users use the card in horizontal orientation.
Orientation affecting temperatures is unexpected and unintuitive, so this was missed. Cause is rumoured to be insufficient coolant in vapor chamber for affected batches.
Please don't read too much into this low-effort rant. I don't blame the author, it's okay to be frustrated when a big purchase doesn't go their way, but if you have a large following you should be more mindful of your words.
>This is just an unfortunate coincidence - batches were tested in vertical orientation, which is normal for test benches, but most users use the card in horizontal orientation.
Why would you choose to test your product in a manner that is completely opposite from how most people would use it? Granted, I'm asking from a place of ignorance, but that just seems... dumb. I'd be curious to learn why my perspective might be wrong.
Practically all the reviewers test these cards on test benches too, which masks the problem with the horizontal card orientation (thus nobody uncovered the issue in the initial reviews).
Test benches are way more convenient than minitowers when you need to regularly swap components.
It is physically easier to push down than to push sideways. Pushing down means the card is in a vertical position. It would be 2 extra steps to rotate the bench to a vertical (and therefore horizontal card) position.
So a proper "closer-to-real-world-conditions" test environment is too much effort for all of these tech enthusiasts?
"Vapor chambers also rarely show up on GPUs, until recently."
Anything with a heat pipe going through it is a vapor chamber cooler design. My still-used 9800GTX+ has one. That card will be 15 years old in three months. These are not new things in GPUs, by any means. That's how you get so many prior gens of top-end cards in a two-slot design despite hitting nearly 300W+ in power draw.
Heat pipes from the previous decade don't have a wicking matrix; they're hollow, and often even smooth-walled. Vapor chambers have sponge-like structures inside them to re-condense water closer to the heat source and assist its movement back to the hotspot via capillary action.
"Heat pipes from the previous decade don't have a wicking matrix"
Yes, they do. If they're not flat pipes, they've likely got a wicking material inside them. This isn't really new tech, it's been around for cooling LASER and LED arrays for quite some time.
Thank you for correcting me. I thought I remembered photos of heat pipes from popular CPU coolers c. 2010 with smooth inner faces, but on reviewing some old photos from across the internet I found I was mistaken.
The flat pipes tend not to need wicks, as they're thin enough to get the added benefit of capillary action. Any tube design, except ones using fairly thin tubing (2 mm OD, 1.5 mm ID), will likely use a wick for its own capillary action.
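How much lift bare capillary action in a narrow tube can actually provide can be estimated with Jurin's law, h = 2γ·cos(θ)/(ρ·g·r). A sketch for a water-filled tube at the 1.5 mm inner diameter mentioned above, assuming room-temperature water and an idealized perfectly wetting wall:

```python
import math

# Jurin's law for capillary rise: h = 2*gamma*cos(theta) / (rho*g*r).
# Estimates how far a narrow tube can lift water by capillary action
# alone. Water properties at ~20 C and a perfectly wetting wall are
# assumed; real heat-pipe fluids and surfaces will differ.

gamma = 0.0728   # surface tension of water, N/m
theta = 0.0      # contact angle, radians (idealized: fully wetting)
rho = 1000.0     # density of water, kg/m^3
g = 9.81         # gravitational acceleration, m/s^2
r = 0.00075      # tube radius, m (1.5 mm inner diameter)

h = 2 * gamma * math.cos(theta) / (rho * g * r)
print(f"capillary rise: {h * 100:.1f} cm")   # roughly 2 cm against gravity
```

A couple of centimetres of lift is enough for a short, thin tube but not for a long pipe working against gravity, which is why wicks (and orientation) matter so much.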
It's not just for cosmetics. Many video cards block two expansion slots. Some block three, and I think a few block four.
If you can use a cable to make it possible to mount that card away from your motherboard, in a place where it blocks exactly one slot, then you've freed up several expansion slots.
Source: I did this in my PC. I actually bought a large case so that I could do things like this.
I'm more inclined to believe AMD are fully aware of the effect that gravity can have on a cooler and thoroughly tested it during the design phase in all orientations. The ball was probably dropped later when testing samples were returned by the manufacturer where they either only tested them in the vertical orientation or they didn't sample enough of them to catch the issue. After all, this is apparently not a design bug but a manufacturing one.
What are you on about? Some of the cards don't have enough coolant in their vapor chambers. The ones that do are perfectly fine. It has nothing to do with "pushing the limits of design". Simple manufacturing defect.
The standard position for a graphics card has been horizontal for many, many years. And I don't think the standard position for a cigarette lighter is upside down, to begin with.
Next, for your "solution" to work you need a special case that allows for the card to be installed vertically, which is not common.
It severely impacts performance. Also, once the junction reaches 110°C, the fans start spinning at max speed with no hope of actually cooling the card down enough. So yes, if you don't mind the noise from fans constantly running at 100%, your card being perpetually unable to cool itself, and a 20-30% performance hit on your $1200 card... sure, I guess?