True. But another piece of knowledge about Nvidia is that its competitors are waaaaaaaaay behind. They stand peerless.

Another piece of information is that CUDA software was, I think, provided free or cheaply to universities doing LLM research. And the software is easy to use.




> its competitors are waaaaaaaaay behind. They stand peerless.

Its competitors are only way behind when it comes to software support. The hardware coming out of Intel and AMD is, especially for its price, very capable. Given how much money is being invested in AI right now, I don't see Nvidia's moat lasting more than a few more years.


Hardware has never been AMD's problem. Their chips are great, on paper. But their software has always been a few generations behind.


But every big tech company is handing Nvidia billions a year now. It's not just AMD working on AMD's software anymore. Now they've got armies of open source developers from around the industry ready to save their companies a boatload if they can get the software working.


They are moving a lot more quickly now on resolving those issues. It won't happen overnight, but the ship is definitely changing course on that.


The list of companies that wish they were a SW company rather than a HW company is very long, and despite all their efforts, once they reach a certain size the die is cast.


This. Going to GTC and seeing Jensen present demos at the keynote they were coding the night before was... interesting.

Either you're the type of company that does that, or you aren't.


If you look at all the open "AI" job reqs at AMD right now, they want to be a SW company too.


Yes, but will they pay as well?

Getting good AI talent now is very costly. HW engineers are cheaper.

Nvidia has more SW than HW engineers for a reason. That transformation started slowly almost two decades ago and accelerated in 2012 with AlexNet, the first high-profile showcase of a neural network trained on GPUs. Jensen saw what that meant and from that moment refocused the company on deep learning.

Nvidia doesn't wait for a market to develop; it prefers to create markets by tackling hard, complex problems. It may look like Nvidia got lucky with AI, but for Jensen it was the payoff of long preparation.


I agree, it is an uphill battle for AMD.

Tell me though, what Fortune 500 company do you know that is willing to put all its eggs in one basket? It is MBA 101 not to do that.

There needs to be alternatives in the space. Why not let them try?


We've been hearing that line for years, but HN comments are a valuable source of info on where NVDA might go in the future. You're the only poster here I could name who is strongly pro-AMD in the AI space. Everyone else seems to dip their toes in the water in the hope of getting an edge by avoiding NVIDIA's monopoly prices, immediately gets burned by trash software quality, and runs away screaming "never again".

I only dabble in AI stuff but have decades of experience doing quick surface-level quality checks of open source projects. I looked at some of AMD's ROCm repos late last year. Even basic stuff like the documentation for their RNG libraries didn't inspire confidence. READMEs had blatant typos in them; everything gave off a feeling of an immense lack of effort or care. Looking again today, the rocRAND docs do seem improved, at least on the surface, though I haven't tried them out for real.

But if we cast the net a little wider again, the same problems rear their ugly head. Flash Attention is a pretty important kernel to have if working with LLMs, maybe I'd like one of those for AMD hardware?

https://github.com/ROCm/flash-attention

We're in luck! An official AMD repo with flash attention in it, great! Except.... the README says at the top:

Requirements: CUDA 11.4 and above. We recommend the Pytorch container from Nvidia, which has all the required tools to install FlashAttention.

Really? Ah, if we scroll down all the way to the bottom we can find a new section that says "AMD/ROCm: Prerequisite: MI200 & MI300 GPUs". Guys, why not just rewrite the README, literally the first thing you see, to put the most important information up front? Why not ensure it makes sense? It takes 10 seconds and is the kind of attention to detail that makes me think the rest of your work will be high quality too.

Checking the issue tracker we see people reporting that the fork is very out of date, and that some models just mysteriously don't work with it due to bugs. These issue reports go unanswered for months. And let's not even go there on the hardware compatibility front; everyone already knows what "AMD support" really means (not the AMD cards you might actually own) vs what "NVIDIA support" means (any device that supports the needed CUDA version, of any size).
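
(For reference, this is roughly what calling it looks like from Python; a minimal sketch assuming the ROCm fork keeps the upstream flash_attn 2.x interface, i.e. half-precision tensors of shape (batch, seqlen, nheads, headdim) on the GPU.)

    import torch
    from flash_attn import flash_attn_func  # upstream import; assumed unchanged in the ROCm fork

    # Half precision and (batch, seqlen, nheads, headdim) layout are upstream requirements.
    # ROCm builds of PyTorch also use the "cuda" device name.
    batch, seqlen, nheads, headdim = 2, 1024, 16, 64
    q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # Fused attention kernel; causal=True applies the usual decoder-style mask.
    out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
    print(out.shape)  # (batch, seqlen, nheads, headdim)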


Mike, thanks for the long thoughtful response.

I would never try to defend AMD with regards to them needing to catch up. Even talking with executives at AMD, neither would they. Nobody is trying to pull a fast one on this.

What has changed for certain, is their attitude and attention. I just got back from Dell Tech World. Dell was caught off-guard with this AI thing too. It is obvious the only thing that anyone is talking about now is "ai ai ai ai ai ai".

Give them a bit of time and I think they will start to become competitive over the next few years. It won't happen overnight. You won't see READMEs fixed right away. But one thing that is for certain: they are all at least trying now, instead of pretending it doesn't exist.

Whether they will be successful or not is yet to be seen. I wouldn't even know how to define successful. I don't think anyone is kidding themselves about Nvidia being dominant. But I'm personally willing to bet on them selling a lot of hardware and working on their software story.

You might not, and that is fine too.


It's great that you're pushing AMD forward, no doubt about it! And I'm sure they'll make good progress now. Like I said, the rocRAND repository seems to be in much better shape now.


What is telling for me is that ROCm itself is now on a regular release cadence. Not just small updates, but actual meaningful fixes and improvements.

Not only that, but it is all being done in the open, unlike their competition. Hotz demanded some documentation; they provided it, and he still complained. Some people just can't find happiness.

Now, whether or not I am pushing them forward is yet to be seen, but at least I'm trying. Positioning myself as a new startup that's trying to help should easily garner their support as well. As I said in another comment, why not let them try too?


No.

First off, it’s a HW/SW solution and things like CUDA/NCCL/etc make a HUGE difference.

Second, the token/watt ratio of every other option is nearly an order of magnitude worse in real-world tests. When you add in custom silicon like the moronic Grok/Dojo chips, you see that there aren't really any close competitors when using custom spins. That is money down the drain IMO. The best bet for most enterprises is to buy 25% AMD and 75% H100 if they can get it.

I think Blackwell is potentially a long term generational problem due to power limitations in most data centers for now.


Isn’t that like saying Apple’s iPads will only maintain a moat for a few more years because MS Surface tablets match in hardware and “only” lag in software? My general point is that the core software (and the software ecosystem built around it) is half of the product. NVIDIA isn’t going to stand still either.


B2C and B2B are completely different when it comes to moats.

If I can save 20% of my data center costs and cut a price-gouging vendor while bringing the solution in-house at a big tech org I am a hero.

Consumers won’t buy a Surface because Microsoft isn’t cool.


You ignore a very important factor that outweighs any cost: security and stability. A fast but unstable or unsafe system belongs in the trash can at any enterprise.

B2B buyers will first ask about security and stability.

Do you think AWS, Azure and GCP are the cheapest cloud offerings? Of course not, so why do they dominate B2B cloud computing while price-gouging everyone?

Because they offer something beyond price: security and stability, and a reliable partner. They also offer support and capacity at a level a startup CSP will never be able to match.

This is also the reason none of the AI accelerator competitors will be real competition for Nvidia.

Beating Nvidia isn't only about beating CUDA; it's about beating the Nvidia AI Enterprise suite with its security offerings and support options. Enterprise-level business SW is a level AMD and the others will never go to; they will have to rely on Big Tech like MS, Amazon and so on to do that for them. But why should Big Tech bother if they have in-house solutions? The big CSPs developing their own AI accelerators shows that they understand Nvidia's business model and are trying to compete head-on, because they understand that Nvidia is attacking them at the enterprise level with AI enterprise solutions. And of course any enterprise using Nvidia enterprise SW will automatically use Nvidia HW.

Once SW is more widespread than HW, it dictates the direction. If MS released Windows 12 only for ARM, Intel and AMD would immediately be screwed and couldn't do anything about it. No enterprise in the world cares whether its CAD system runs on x86 or ARM as long as it can be used for its intended purpose.


You make a lot of assumptions about what I understand.

If I am in charge of a data center, I had better understand the impact of security, stability, and the quality of vendor relationships on my costs, or I probably won't be in that role very long.

You, on the other hand, apparently have never managed an enterprise ISA transition, or even cross-compiled software. The idea that Microsoft would just do that and that it would work is naive in the extreme. CAD software is compiled first for an architecture, and then generally within an operating system. It is all interconnected and interdependent.


The thing is, I don't really see the software ecosystem catching up that quickly. Sure, we have some support for hardware like Google's TPUs and Coral, but the field's practitioners and researchers are often so far behind the curve on general systems work (the nuts and bolts of libraries and package management) that anyone trying to compete against NVIDIA will have to invest heavily there, not just hire yet another group of ML engineers. Those engineers shudder at the thought of packaging and distributing their software to the public and supporting frameworks for years through partnerships and continued investment; that work is basically toil and extremely undesirable for them.


Isn't CUDA basically hidden behind PyTorch and others?

I know that it's touted as the key competitive advantage, but it seems to stem from the fact it actually works, unlike others.

Still a great advantage, but not a lock-in. If competitors get their act together, couldn't they just replace CUDA with another API, all hidden somewhere in the SW stack?
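
For instance, typical framework-level code never mentions CUDA directly. A minimal PyTorch sketch (assuming a ROCm build of PyTorch, which reuses the torch.cuda API for AMD GPUs):

    import torch

    # A ROCm build of PyTorch exposes AMD GPUs through the same torch.cuda
    # namespace, so this code is identical on Nvidia and AMD hardware.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    x = torch.randn(4096, 4096, device=device)
    w = torch.randn(4096, 4096, device=device)

    # The matmul is dispatched to cuBLAS on Nvidia or rocBLAS/hipBLAS on AMD;
    # the calling code never sees which backend did the work.
    y = x @ w
    print(y.shape, y.device)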


For training workloads? Yes. For inference? Nope. Their competition is very technically strong there. The lack of market attention (pun intended) is utterly baffling. And before someone inevitably chimes in about CUDA: for inference you don't need it. Ask Google, for example.


I am still absolutely baffled that Intel won't YOLO with a 32GB A770.


That’s not where the money is. Why they aren’t selling Gaudi2 by the pallet I can’t quite understand.



