If you had worked with ML, you'd know that this is not true. It's actually closer to the opposite. It also has nothing to do with the chips themselves. Things don't magically work "because GPU"; they work because manufacturers spend the time getting their drivers and ecosystems right. That's why, for example, no one is using AMD GPUs for ML, despite them offering more compute per dollar on paper. Getting the software stack to the point of Nvidia/CUDA, where things really do "just work", is an enormous undertaking. And as someone who has been researching ML for more than a decade now, I can tell you Nvidia didn't get these things right in the beginning either. That's the reason they have no real competition today (and still won't for quite some time).
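To make the software-stack point concrete, here's a minimal sketch (my illustration, not a benchmark). The framework-level code is vendor-agnostic: on Nvidia it goes through CUDA, while on AMD the ROCm build of PyTorch exposes the GPU through the same torch.cuda API via HIP. Whether that one .to(device) call works reliably is exactly the driver/compiler/kernel-library problem I'm describing.

    # Minimal sketch: the user-facing code is the same on both vendors;
    # the stack underneath (CUDA on Nvidia, ROCm/HIP on AMD) is what
    # determines whether it "just works".
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = torch.nn.Linear(1024, 1024).to(device)
    x = torch.randn(8, 1024, device=device)
    y = model(x)  # this one line leans on the whole driver/kernel-library stack below it

    print(y.shape, device)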
> That's why, for example, no one is using AMD GPUs for ML
You're right that they are behind, but to say that nobody is using them isn't accurate. AMD HPC clusters are being used for AI/ML [0][1].
The larger issue is that, until now, AMD hardware has mostly gone into HPC clusters. With the release of the MI300x, Azure and Oracle are now coming online with them. Disclosure: my business is also building an MI300x supercomputer, with the express goal of giving more developers access.
> AMD HPC clusters are being used for AI/ML [0][1].
Funny how you can immediately tell when these decisions were made by the business people and not the tech people. This is exactly what I would have expected from an organization like the Navy. On paper it sounds great, and the Navy bean counters probably loved it. But they are in for a rude awakening.
The best I can say is that my thoughts and prayers go to the ML engineers who will actually have to deal with this. Those companies couldn't pay me enough to put up with it. They will likely only attract people who care about the salary and the position rather than getting things done. I've seen it happen with colleagues before. Those numbers of yours are worthless without someone willing to put in 5x the work for the same or worse results.
People choose jobs and tools for a variety of reasons. I don't feel the need to cast judgement on them over it.
The numbers I gave aren't worthless, nor does it take 5x the work. I also don't think single-sourcing hardware for all of AI is very smart, especially given the serious supply shortages from that single vendor. No Fortune 100 would put all its eggs in one basket, and even if it were 5x the work, it would be worth it.