
They've actually been making positive moves with GPUs lately along with a success story for the B580.



B580 being a "success" is purely a business decision as a loss leader to get their name into the market. A larger die on a newer node than either Nvidia or AMD means their per-unit costs are higher, yet they are selling it at a lower price.

That's not a long-term success strategy. Maybe good for getting your name in the conversation, but not sustainable.


It’s a long term strategy to release a hardware platform with minimal margins in the beginning to attract software support needed for long term viability.

One of the benefits of being Intel.


Well yes, it's in the name "loss leader". It's not meant to be sustainable. It's meant to get their name out there as a good alternative to Radeon cards for the lower-end GPU market.

Profit can come after positive brand recognition for the product.


I was reading this whole thread as being about technical accomplishment and non-Nvidia GPU capabilities, not business. So I think you're talking about different definitions of "success". Definitely counts, but not what I was reading.


I don’t know if this matters but while the B580 has a die comparable in size to a 4070 (~280mm^2), it has about half the transistors (~17-18 billion), iirc.


Tom Petersen said in a Hardware Unboxed video that they only reported “active” transistors, such that there are more transistors in the B580 than what they reported. I do not think this is the correct way to report them: first, TSMC counts all transistors when reporting the density of their process, and second, Intel is unlikely to reduce the reported transistor count for the B570, which will certainly have fewer active transistors.

That said, the 4070 die is 294mm^2 while the B580 die is 272mm^2.


Is it a loss leader? I looked up the price of 16Gbit GDDR6 ICs the other day at DRAMeXchange and the cost of 12GB is $48. Using the Gamers Nexus die measurements, we can calculate that they get at least 214 dies per wafer. At $12095 per wafer, which is reportedly the price at TSMC for 5nm wafers in 2025, that is $57 per die.

While defects ordinarily reduce yields, Intel put plenty of redundant transistors into the silicon. This is ordinarily not possible to estimate, but Tom Petersen reported in his interview with Hardware Unboxed that they did not count those when reporting the transistor count. Given that the density based on reported transistors is about 40% less than the density others get from the same process, and the silicon in GPUs is already fairly redundant, they likely have a backup component for just about everything on the die. The consequence is that they should be able to use at least 99% of those dies even after tossing unusable ones, such that the $57 per die figure is likely correct.
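
For anyone who wants to check the die cost figure, here is a quick sketch. It assumes a 300mm wafer, uses ~272 mm^2 as a stand-in for the exact Gamers Nexus die measurements, and uses the standard dies-per-wafer approximation, so the exact count depends on the die's aspect ratio and scribe lines:

    import math

    WAFER_DIAMETER_MM = 300.0  # standard wafer size
    DIE_AREA_MM2 = 272.0       # approximate B580 die area
    WAFER_COST_USD = 12095.0   # reported 2025 TSMC 5nm wafer price

    # Usable wafer area divided by die area, minus an edge loss term.
    dies_per_wafer = (math.pi * (WAFER_DIAMETER_MM / 2) ** 2 / DIE_AREA_MM2
                      - math.pi * WAFER_DIAMETER_MM / math.sqrt(2 * DIE_AREA_MM2))

    cost_per_die = WAFER_COST_USD / dies_per_wafer
    print(f"{dies_per_wafer:.0f} dies per wafer, ~${cost_per_die:.0f} per die")
    # ~219 dies per wafer and ~$55 per die, in the same ballpark as the
    # "at least 214 dies" and "$57 per die" figures above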

As for the rest of the card, there is not much in it that would not be part of the price of an $80 Asrock motherboard. The main thing would be the bundled game, which they likely can get in bulk at around $5 per copy. This seems reasonable given how much Epic Games pays for their giveaways:

https://x.com/simoncarless/status/1389297530341519362

That brings the total cost to $190. If we assume Asrock and the retailer both have a 10% margin on the $80 motherboard used as a stand-in for the cost of everything else, then it is $174. Then we need to add margins for board partners and retailers. If we assume they both get 10% of the $250, that leaves a $26 profit for Intel, provided that they have economies of scale such that the $80 motherboard approximation for the rest of the cost of the graphics card is accurate.

That is about a 10% margin for Intel. That is not a huge margin, but provided enough sales volume (to match the sales volume Asrock gets on their $80 motherboards), Intel should turn a profit on these versus not selling them at all. Interestingly, their board partners are not able or willing to hit the $250 MSRP; the closest they come is $260, so Intel is likely not sharing very much with them.
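
As a sanity check, here is the same cost stack as a minimal sketch; every input is an estimate from this comment, not a number from Intel:

    die = 57      # estimated cost per die (see above)
    gddr6 = 48    # 12GB of GDDR6 at spot prices
    board = 80    # $80 motherboard as a proxy for the PCB, VRM, cooler, etc.
    game = 5      # assumed bulk price of the bundled game
    msrp = 250

    # Strip the assumed 10% maker and 10% retailer margins from the $80 proxy.
    board_cost = board * 0.9 * 0.9                # ~$64.80
    intel_cost = die + gddr6 + board_cost + game  # ~$174.80

    # Board partner and retailer each assumed to take 10% of the MSRP.
    channel = 2 * 0.10 * msrp                     # $50
    profit = msrp - channel - intel_cost          # ~$25
    print(f"~${profit:.0f} profit, about {profit / msrp:.0%} margin for Intel")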

It should be noted that Tom Petersen claimed during his Hardware Unboxed interview that they were not making money on these. However, that predated the B580 being a hit and likely relied on low expected production volumes due to low sales projections. Since the B580 is a hit and napkin math says it is profitable as long as they build enough of them, I imagine that they are ramping production to meet demand and reach profitability.


That's just BOM. When you factor in R&D they are clearly still losing money on B580. There's no way they can recoup R&D this generation with a 10% gross margin.

Still, that's to be expected considering this is still only the second generation of Arc. If they can break even on the next gen, that would be an accomplishment.


To be fair, the R&D is shared with Intel’s integrated graphics as they use the same IP blocks, so they really only need to recoup the R&D that was needed to turn that into a discrete GPU. If that were $50 million and they sell 2 million of these, they would probably recoup it. Even if they fail to recoup their R&D funds, they would be losing more money by not selling these at all, since no sales means zero dollars of R&D recouped.
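
A quick sketch of that break-even point, using the hypothetical $50 million discrete-only R&D figure and the roughly $26 per-card margin from the napkin math upthread:

    rd_cost = 50_000_000  # hypothetical R&D spend specific to the discrete GPU
    margin_per_card = 26  # per-card profit estimate from the napkin math
    units_sold = 2_000_000

    recouped = units_sold * margin_per_card  # $52 million
    print(recouped >= rd_cost)               # True: 2 million cards would just cover it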

While this is not an ideal situation, it is a decent foundation on which to build the next generation, which should be able to improve profitability.


Yeah but MLID says they are losing money on every one and have been winding down the internal development resources. That doesn't bode well for the future.

I want to believe he's wrong, but on the parts of his show where I am in a position to verify, he generally checks out. Whatever the opposite of Gell-Mann Amnesia is, he's got it going for him.


MLID on Intel is starting to become the same as UserBenchmark on AMD (except that his sources are generally reputable)... he's beginning to sound like he simply wants Intel to fail, to my insider-info-lacking ears. For competition's sake I really hope that MLID has it wrong (at least the opining about the imminent failure of Intel's GPU division), and that the B series will encourage Intel to push farther to spark more competition in the GPU space.


My analysis is that the B580 is profitable if they build enough of them:

https://news.ycombinator.com/item?id=42505496

The margins might be describable as razor thin, but they are there. Whether it can recoup the R&D that they spent designing it is hard to say definitively since I do not have numbers for their R&D costs. However, their iGPUs share the same IP blocks, so the iGPUs should be able to recoup the R&D costs that they have in common with the discrete version. Presumably, Intel can recoup the costs specific to the discrete version if they sell enough discrete cards.

While this is not a great picture, it is not terrible either. As long as Intel keeps improving its graphics technology with each generation, profitability should gradually improve. Although I have no insider knowledge, I noticed a few things that they could change to improve their profitability in the next generation:

  * Tom Petersen made a big deal about 16-lane SIMD in Battlemage being what games want rather than the 8-lane SIMD in Alchemist. However, that is not quite true since both Nvidia and AMD graphics use 32-lane SIMD. If the number of lanes really matters, and I certainly can see how it would if game shaders have horizontal operations, then a switch to 32-lane SIMD should yield further improvements.
  * Tom Petersen said in his interview with Hardware Unboxed that Intel reported the active transistor count for the B580 rather than the total transistor count. This is contrary to others, who report the total transistor count (as evidenced by their density figures being close to what TSMC claims the process can do). Tom Petersen also stated that they would not necessarily be forced by defects to turn dies into B570 cards. This suggests to me that they have substantial redundant logic in the GPU to prevent defects from rendering chips unusable, and that logic is intended to be disabled in production. GPUs are already highly redundant. They could drop much of the planned dark silicon and let defects force a larger percentage of the dies to be usable only as cut-down models.
I could have read too much into things that Tom Petersen said. Then again, he did say that their design team is conservative and the doubling rather than quadrupling of the SIMD lane count and the sheer amount of dark silicon (>40% of the die by my calculation) spent on what should be redundant components strike me as conservative design choices. Hopefully the next generation addresses these things.

Also, they really do have >40% dark silicon when doing density comparisons:

  * ARC B580: 72.1M / mm²
  * Nvidia 4070 Ti: 121.8M / mm²
  * TSMC claim for 5nm: 138.2M / mm²
They have 41% less density than Nvidia and 48% less density than TSMC claims the process can obtain. We also know from Tom Petersen’s comments that they have additional transistors on the die that are not active. Presumably, they are for redundancy. Otherwise, there really is no sane explanation that I can see for so much dark silicon. If they are using transistors that are twice the size, as the density figure might be interpreted to suggest, they might as well have used TSMC’s 7nm process; a smaller process can etch larger features, but doing so is a waste of money.
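
For reference, the percentages quoted above fall straight out of those density numbers:

    b580, rtx_4070_ti, tsmc_n5 = 72.1, 121.8, 138.2  # million transistors per mm^2

    print(f"{1 - b580 / rtx_4070_ti:.0%} less dense than the 4070 Ti")  # 41%
    print(f"{1 - b580 / tsmc_n5:.0%} less dense than TSMC's claim")     # 48%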

Note that we can rule out the cache lowering the density. The L1 + L2 cache on the 4070 Ti is 79872 KB while it is 59392 KB on the B580. We can also rule out IO logic as lowering the density, as the 4070 Ti has a 256-bit memory bus while the B580 has a 192-bit memory bus.

https://www.techpowerup.com/gpu-specs/arc-b580.c4244

https://www.techpowerup.com/gpu-specs/geforce-rtx-4070-ti.c3...

https://en.wikipedia.org/wiki/5_nm_process#Nodes

The hardware unboxed interview of Tom Petersen is here:

https://youtu.be/XYZyai-xjNM


> Tom Petersen made a big deal about 16-lane SIMD in Battlemage [...]

Where? The only mention I see in that interview is him briefly saying they have native 16 with "simple emulation" for 32 because some games want 32. I see no mention of or comparison to 8.

And it doesn't make sense to me that switching to actual 32 would be an improvement. Wider means less flexible here. I'd say a more accurate framing is whether the control circuitry is 1/8 or 1/16 or 1/32. Faking extra width is the part that is useful and also pretty easy.


For context, Alchemist was SIMD8. They made a big deal out of this at the Alchemist launch, if I recall correctly, since they thought it would be more efficient. Unfortunately, it turned out to be less efficient.

Tom Petersen did a bunch of interviews right before the Intel B580 launch. In the Hardware Unboxed interview, he mentioned it, but accidentally misspoke. I must have interpreted his slip as meaning games want SIMD16 and noted it that way in my mind, since what he says elsewhere also seems to suggest that games want SIMD16. It was only after thinking about what I heard that I realized otherwise. Here is an interview where he talks about native SIMD16 being better:

https://www.youtube.com/live/z7mjKeck7k0?t=35m38s

In specific, he says:

> But we also have native SIMD support—SIMD16 native support, which is going to say that you don’t have to like recode your computer shader to match a particular topology. You can use the one that you use for everyone else, and it’ll just run well on ARC. So I’m pretty excited about that.

In an interview with Gamers Nexus, he has a nice slide where he attributes a performance gain directly to SIMD16:

https://youtu.be/ACOlBthEFUw?t=16m35s

At the start of the Gamers Nexus video, Steve mentions that Tom’s slides are from a presentation. I vaguely remember seeing a video of it where he talked more about SIMD16 being an improvement, but I am having trouble finding it.

Having to schedule fewer things is a definite benefit of 32 lanes over a smaller lane count. Interestingly, AMD switched from a 16-lane count to a 32-lane count with RDNA, and RDNA turned out to be a huge improvement in efficiency. The switch is actually somewhat weird since they had been emulating SIMD64 using their SIMD16 hardware, so the hardware became wider while the wavefront became narrower at the same time. Their emulation of SIMD64 in SIMD16 is mentioned in this old GCN documentation describing cross-lane operations:

https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operat...

That documentation talks about writing to a temporary location and reading from a temporary location in order to do cross-lane operations. Contrast this with section 12.5.1 of the RDNA 3 ISA documentation, where the native SIMD32 units just fetch the values from each other’s registers with no mention of a temporary location:

https://www.amd.com/content/dam/amd/en/documents/radeon-tech...

That strikes me as much more efficient. While I do not write shaders, I have written CUDA kernels, and in CUDA kernels you sometimes need to do what Nvidia calls a parallel reduction across lanes, which is a cross-lane operation (Intel’s CPU division calls these horizontal operations). For example, you might need to sum across all lanes (e.g. for an average, matrix-vector multiplication or dot product). When your thread count matches the SIMD lane count, you can do this without going to shared memory, which is fast. If you need to emulate a higher lane width, you need to use a temporary storage location (like what AMD described), which is not as fast.

If games’ shaders are written with an assumption that SIMD32 is used, then native SIMD32 is going to be more performant than native SIMD16 because of faster cross-lane operations. Intel’s slide attributes a 0.3ms reduction in render time to their switch from SIMD8 to SIMD16. I suspect that they would see a further reduction with SIMD32, since that would eliminate the need to emulate SIMD32 for games that expect SIMD32 due to Nvidia (since at least Turing) and AMD (since RDNA 1) both using SIMD32.

To illustrate this, here are some CUDA kernels that I wrote:

https://github.com/ryao/llama3.c/blob/master/rung.cu#L15

The softmax kernel, for example, has the hardware emulate SIMD1024, although you would need to look at the kernel invocations in the corresponding rung.c file to know that. The purpose of using 1024 threads is to ensure that the kernel is memory bandwidth bound, since the hardware bottleneck for this operation should be memory bandwidth. In order to efficiently do the parallel reductions to calculate the max and sum values in different parts of softmax, I use the fast SIMD32 reduction in every SIMD32 unit. I then write the results to shared memory from each of the 32 SIMD32 units that performed this (since 32 * 32 = 1024). I then have all 32 SIMD32 units read from shared memory and simultaneously do the same reduction to calculate the final value. Afterward, the leader in each unit tells all others the value and everything continues. Now imagine having a compiler compile this for native SIMD16.

A naive approach would introduce a trip to shared memory for both reductions, giving us 3 trips to shared memory and 4 reductions. A more clever approach would do 2 trips to shared memory and 3 reductions. Either way, SIMD16 is less efficient. The smart thing to do would be to recognize that 256 threads is likely okay too and just do the exact same thing with a smaller number of threads, but a compiler is not expected to be able to make such a high-level optimization, especially since the high-level API says “use 1024 threads”. Thus you need the developer to rewrite this for SIMD16 hardware to get it to run at full speed, and with Intel’s low market share, that is not very likely to happen. Of course, this is CUDA code and not a shader, but a shader is likely in a similar situation.
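
To make the bookkeeping concrete, here is a rough model of that argument. It is my own sketch and only counts reduction stages and shared memory round trips for a 1024-thread block; it says nothing about how Intel or Nvidia actually schedule this:

    import math

    def tree_reduction(block_size, simd_width):
        # Each stage reduces within one SIMD unit, then partial results pass
        # through shared memory to the next stage.
        stages = math.ceil(math.log2(block_size) / math.log2(simd_width))
        shared_memory_trips = stages - 1
        return stages, shared_memory_trips

    def naive_simd16_emulation(block_size):
        # Keep the SIMD32-shaped algorithm, but emulate every 32-wide
        # reduction as two 16-wide reductions joined through shared memory.
        stages32, trips32 = tree_reduction(block_size, 32)
        return 2 * stages32, trips32 + stages32

    print(tree_reduction(1024, 32))      # (2, 1): native SIMD32, as in the kernel
    print(naive_simd16_emulation(1024))  # (4, 3): the naive approach above
    print(tree_reduction(1024, 16))      # (3, 2): the more clever rewrite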


> Having to schedule fewer things is a definite benefit of 32 lanes over a smaller lane count.

From a hardware design perspective, it saves you some die size in the scheduler.

From a performance perspective, as long as the hardware designer kept 32 in mind, it can schedule 32 lanes and duplicate the signals to the 16- or 8-wide lanes with no loss of performance.

> That documentation talks about writing to a temporary location and reading form a temporary location in order to do cross lane operations.

> If games’ shaders are written with an assumption that SIMD32 is used, then native SIMD32 is going to be more performant than native SIMD16 because of faster cross lane operations.

So this is a situation where wider lanes actually need more hardware to run at full speed and not having it causes a penalty. I see your point here, but I will note that you can add that criss-cross hardware for 32-wide operations while still having 16-wide be your default.


> From a performance perspective, as long as the hardware designer kept 32 in mind, it can schedule 32 lanes and duplicate the signals to the 16 or 8 wide lanes with no loss of performance.

I was looking at what was said about Xe2 in Lunar Lake, and the slides suggest that they had special handling to emulate SIMD32 using SIMD16 in hardware, so you might be right.

> So this is a situation where wider lanes actually need more hardware to run at full speed and not having it causes a penalty. I see your point here, but I will note that you can add that criss-cross hardware for 32-wide operations while still having 16-wide be your default.

To go from SIMD8 to SIMD16, Intel halved the number of units while making them double the width. They could have done that again to avoid the need for additional hardware.

I have not seen anything in the Xe2 instruction set that hints at how they are doing these operations in their hardware. I am going to leave it at that since I have spent far too much time analyzing the technical marketing for a GPU architecture that I am not likely to use. No matter how well they made it, it just was not scaled up enough to make it interesting to me as a developer who owns an RTX 3090 Ti. I only looked into it as much as I did because I am excited to see Intel moving forward here. That said, if they launched a 48GB variant, I would buy it in a heartbeat and start writing code to run on it.


There is a typo in the Tom Petersen quote. He said “compute shader”, not “computer shader”. Autocorrect changed it when I had transcribed it and I did not catch this during the edit window.


The die size of the B580 is 272 mm^2, which is a lot of silicon for $249. The performance of the GPU is good for its price but bad for its die size. Manufacturing cost is closely tied to die size.

272 mm^2 puts the B580 in the same league as the Radeon 7700XT, a $449 card, and the GeForce 4070 Super, which is $599. The idea that Intel is selling these cards at a loss sounds reasonable to me.


Though you assume the prices of the competition are reasonable. There are plenty of reasons for them not to be: availability issues, lack of competition, other more lucrative avenues, etc.

Intel has none of those, or at least not to the same degree.


At a loss seems a bit overly dramatic. I'd guess Nvidia sells SKUs for three times their marginal cost. Intel is probably operating at cost without any hopes of recouping R&D with the current SKUs, but that's reasonable for an aspiring competitor.


It kinda seems they are covering the cost of throwing massive amounts of resources at getting Arc’s drivers into shape.


I really hope they stick with it and become a viable competitor in every market segment a few more years down the line.


The drivers are shared by their iGPUs, so the cost of improving the drivers is likely shared by those.


The idea that Intel is selling these at a loss does not sound reasonable to me:

https://news.ycombinator.com/item?id=42505496

The only way this would be at a loss is if they refuse to raise production to meet demand. That said, I believe their margins on these are unusually low for the industry. They might even fall into razor thin territory.


Wait, are they losing money on every one in the sense that they haven't broken even on research and development yet? Or in the sense that they cost more to manufacture than they're sold at? Because one is much worse than the other.


The former is likely true, but the latter is not:

https://news.ycombinator.com/item?id=42505496

That being said, the IP blocks are shared by their iGPUs, so the discrete GPUs do not need to recoup the costs of most of the R&D, as it would have been done anyway for the iGPUs.


They're trying to unseat Radeon as the budget card. That means making a more enticing offer than AMD for a temporary period of time.


That guy’s reasoning is faulty. To start, he has made math mistakes in every recent video of his that involves math. To give three recent examples:

At 10m3s in the following video, he claims to add a 60% margin by multiplying by 1.6, but in reality he is adding a 37.5% margin; he needed to multiply by 2.5 to add a 60% margin. This follows from the formula Cost Scaling Factor = 1 / (1 - Normalized Profit Margin):

2.5 = 1 / (1 - 0.6)

1.6 = 1 / (1 - 0.375)

https://youtu.be/pq5G4mPOOPQ

At 48m13s in the following video, he claims that Intel’s B580 is 80% worse than Nvidia’s hardware. He took the 4070 Ti as being 82% better than the 2080 SUPER, assumed based on leaks from his reviewer friends that the B580 was about at the performance of the 2080 SUPER, and then claimed that the B580 would be around 80% worse than the 4070 Ti. Unfortunately for him, that is 45% worse, not 80% worse. His chart is from TechPowerUp, and if he had taken the time to do some math (1 - 1/(1 + 0.82) ~ 0.45), or clicked through to the 2080 SUPER page, he would have seen that it has 55% of the performance of the 4070 Ti, which is 45% worse:

https://youtu.be/-lv52n078dw

At 1m2s in the following video, he makes a similar math mistake by saying that the B580 has 8% better price/performance than the RTX 3060 when in fact it is 9% better. He mistakenly equated the RTX 3060 being 8% worse than the B580 with the B580 being 8% better than the RTX 3060, but math does not work that way. Luckily for him, the math error is small here, but he still failed to do the math correctly, and his reasoning grows more faulty as his math errors grow larger. What he should have done, which gives the correct normalized factor, is:

1.09 ~ 1 / (1 - 0.08)

A factor of 1.09 better is 9% better.

https://youtu.be/3jy6GDGzgbg
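
Putting the three corrections above into code, since they are all the same 1 / (1 - x) relationship:

    def factor_from_deficit(deficit):
        # If B is `deficit` worse than A, then A is this factor times B.
        return 1 / (1 - deficit)

    def deficit_from_factor(factor):
        # If A is `factor` times B, then B is this fraction worse than A.
        return 1 - 1 / factor

    print(factor_from_deficit(0.60))  # 2.5: the multiplier a 60% margin requires
    print(deficit_from_factor(1.6))   # 0.375: multiplying by 1.6 is a 37.5% margin
    print(deficit_from_factor(1.82))  # ~0.45: "82% better" inverts to ~45% worse
    print(factor_from_deficit(0.08))  # ~1.09: "8% worse" inverts to ~9% better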

He not only fails at mathematical reasoning, but also lacks a basic understanding of how hardware manufacturing works. He said that if Intel loses $20 per card at low production volumes, then making 10 million cards will result in a $200 million loss. In reality, things become cheaper due to economies of scale, and simple napkin math shows that they can turn a profit on these cards:

https://news.ycombinator.com/item?id=42505496

His $20 loss per card remark is at 11m40s:

https://youtu.be/3jy6GDGzgbg

His behavior is consistent with being on a vendetta rather than being a technology journalist. For example, at 55m13s in the following video, he puts words in Tom Petersen’s mouth and then, with a malicious smile, cheers while claiming that Tom Petersen declared discrete ARC cards to be dead when Tom Petersen said nothing of the kind. Earlier in the same video, at around 44m14s, he calls Tom Petersen a professional liar. However, he sees no problem expecting people to believe words he shoved into the “liar’s” mouth:

https://youtu.be/xVKcmGKQyXU

If you scrutinize his replies to criticism in his comments section, you will see that he dodges criticism of the actual issues with his coverage while saying “I was right about <insert thing completely unrelated to the complaint here>” or “facts don’t care about your feelings”. You will also notice that he copies and pastes the same statements rather than writing replies addressing the details of the complaints. To be clear, I am paraphrasing in those two quotes.

He also shows contempt for his viewers that object to his behavior in the following video around 18m53s where he calls them “corporate cheerleaders”:

https://youtu.be/pq5G4mPOOPQ

In short, Tom at MLID is unable to do mathematical reasoning, does not understand how hardware manufacturing works, has a clear vendetta against Intel’s discrete graphics, is unable to take constructive criticism and lashes out at those who try to tell him when he is wrong. I suggest being skeptical of anything he says about Intel’s graphics division.



