
if you're talking about building anything, that is already too hard for ML researchers.

you have to be able to pip install something and just have it work, reasonably fast, without crashing, and also it has to not interfere with 100 other weird poorly maintained ML library dependencies.




If your point is that HIP is not a zero-effort porting solution, that is correct. HIP is a low-effort solution, not a zero-effort one. It targets users who already use and know CUDA, and it minimizes the changes required to pre-existing CUDA code.

In the case of these abstraction layers, it would be the responsibility of the abstraction maintainers (or AMD) to port them. Obviously, someone who does not even use CUDA would not use HIP either.

To be honest, I have a hard time believing that a truly zero-effort solution exists, especially one that gets high performance. Once you start talking about the full stack, there are too many potholes and sharp edges to believe that it will really work. So I am highly skeptical of the original article. Not that I wouldn't want to be proved wrong, but what they're claiming to do is a big lift, even taking HIP as a starting point.

The easiest, fastest (for end users), highest-performance solution for ML will come when the ecosystem integrates it natively. HIP would be a way to get there faster, but it will take nonzero effort from CUDA-proficient engineers to get there.
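
To give a sense of what "low effort" means in practice, here is a toy sketch (my own, not from the article) of the usual pattern: the runtime APIs map almost one-to-one, so a thin shim or a pass of a hipify tool covers most straightforward code.

    // Toy kernel + host code that builds as CUDA (nvcc) or HIP (hipcc),
    // assuming no exotic CUDA features are in use.
    #ifdef __HIPCC__
      #include <hip/hip_runtime.h>
      #define gpuMalloc hipMalloc
      #define gpuMemcpy hipMemcpy
      #define gpuFree   hipFree
      #define gpuH2D    hipMemcpyHostToDevice
      #define gpuD2H    hipMemcpyDeviceToHost
    #else
      #include <cuda_runtime.h>
      #define gpuMalloc cudaMalloc
      #define gpuMemcpy cudaMemcpy
      #define gpuFree   cudaFree
      #define gpuH2D    cudaMemcpyHostToDevice
      #define gpuD2H    cudaMemcpyDeviceToHost
    #endif

    // __global__, threadIdx/blockIdx, and the <<<grid, block>>> launch syntax
    // are the same under both toolchains.
    __global__ void scale(float* x, float a, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] *= a;
    }

    // Host side: the only CUDA-isms are the shimmed runtime calls.
    void scale_on_gpu(float* host_x, float a, int n) {
      float* d_x = nullptr;
      gpuMalloc((void**)&d_x, n * sizeof(float));
      gpuMemcpy(d_x, host_x, n * sizeof(float), gpuH2D);
      scale<<<(n + 255) / 256, 256>>>(d_x, a, n);
      gpuMemcpy(host_x, d_x, n * sizeof(float), gpuD2H);
      gpuFree(d_x);
    }

The potholes the thread worries about are everything outside this happy path: build systems, dependency stacks, and the corners of CUDA that don't map this cleanly.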


I agree completely with your last point.

As other commenters have pointed out, this is probably a good solution for HPC jobs where everyone is using C++ or Fortran anyway and you frequently write your own CUDA kernels.

From time to time I run into a decision maker who understandably wants to believe that AMD cards are now "ready" to be used for deep learning, and points to things like the fact that HIP mostly works pretty well. I was kind of reacting against that.


As someone doing a lot of work with CUDA in a big research organization: there are few of us. If you are working with CUDA, you are not the type of person who waits for something that just works, as you describe. CUDA itself is a battle with poorly documented stuff.


Don’t most orgs that are deep enough to run custom CUDA kernels have dedicated engineers for this stuff? I can’t imagine a person who can write raw CUDA not being able to handle things more difficult than pip install.


Engineers who are really, really good at CUDA are worth their weight in gold, so there are more projects for them than they have time for. "Worth their weight in gold" isn't figurative here – the one I know has a ski house more expensive than 180 lbs of gold (~$5,320,814).


The fact that "worth their weight in gold" typically means single-digit millions is fascinating to me (though I doubt I'll be able to get there myself; maybe someday). I looked it up, though, and I think this is undercounting the current value of gold per ounce/lb/etc.

5320814 / 180 / 16 = ~1847.5

Per https://www.apmex.com/gold-price and https://goldprice.org/, current value is north of $2400 / oz. It was around $1800 in 2020. That growth for _gold_ of all things (up 71% in the last 5 years) is crazy to me.

It's worth noting that anyone with a ski house that expensive probably has a net worth well over twice the price of that ski house. I guess it's time to start learning CUDA!


> That growth for _gold_ of all things (up 71% in the last 5 years) is crazy to me.

For comparison: the S&P 500 grew about the same over that period (more than 100% from Jan 2019, about 70% from Dec 2019), so the rise in the price of gold did not outperform the growth of the general (financial) economy.


But that's still surprising performance, because the S&P generates income and pays dividends. Its increase reflects (at least, is supposed to!) expectations of future higher income. Gold doesn't even bear interest....


Gold is commonly seen as a hedge against inflation and a decently stable non-currency store of value. With many countries having (or being perceived to have) high inflation during this time, the price of gold is bound to rise as well. Pretty much any economic or sociopolitical tremor will bounce up the price of gold, at least temporarily.


The S&P doesn't really pay much in the way of dividends, does it? Last time I checked, it was order-of-magnitude 1%, which is a bit of a joke figure.

Anyway, there isn't a lot of evidence that the value of gold is going up. It seems to just be keeping pace with M2. Both have roughly doubled, and a bit more, since 2010 (working in USD).


Note: gold uses troy ounces, so adjust by ~10%. It's easier to just use grams or kilograms :).


Thanks, I'm a bit new to this entire concept. Do troy lbs also exist, or is that just a term when measuring ounces?


Yes, there are troy pounds; they are 12 troy ounces (not 16 ounces, like normal (avoirdupois) pounds).

https://en.wikipedia.org/wiki/Troy_weight

180 avoirdupois pounds is 2,625 ounces troy. The gold price is around $2470/ounce troy today, so $2470*2625 ~= $6.483 million


Would you (or your friend) be able to drop any good CUDA learning resources? I'd like to be worth my weight in gold...


A working knowledge of C++, plus a bit of online reading about CUDA and the NVidia GPU architecture, plus studying the LCZero chess engine source code (the CUDA neural net part, I mean) seems like enough to get started. I did that and felt like I could contribute to that code, at least at a newbie level, given the hardware and build tools. At least in the pre-NNUE era, the code was pretty readable. I didn't pursue it though.

Of course becoming "really good" is a lot different and like anything else, it presumably takes a lot of callused fingertips (from typing) to get there.


Having dabbled in CUDA, but not worked with it professionally, I feel like a lot of the complexity isn't really in CUDA/C++, but in the algorithms you have to come up with to really take advantage of the hardware.

Optimizing something for SIMD execution often isn't straightforward, and it isn't something most developers encounter outside a few small areas. There are also a lot of hardware architecture considerations you have to work with (memory transfer speed is a big one) to even come close to saturating the compute units.
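
As a concrete toy illustration (mine, not the parent's) of one such consideration: the same copy written two ways can differ enormously in effective bandwidth, purely because of how neighbouring threads hit memory.

    // Adjacent threads read adjacent addresses: the warp's loads coalesce
    // into a few wide memory transactions and can approach peak bandwidth.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = in[i];
    }

    // Adjacent threads read addresses `stride` elements apart: for a large
    // stride, each thread's load lands in its own cache line, so most of the
    // fetched bytes are wasted and effective bandwidth collapses.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
      long long j = (long long)(blockIdx.x * blockDim.x + threadIdx.x) * stride;
      if (j < n) out[j] = in[j];
    }

Saturating the card usually means restructuring the algorithm so its memory traffic looks like the first kernel, and that restructuring is where most of the real work goes.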


The real challenge is probably getting your hands on a 4090 for a price you can pay before you are worth your weight in gold. Because an arm and a leg in gold is quite a lot.


You don't really need a 4090. An older board is plenty; the software is basically the same. I fooled around with what I think was a 1080 on Paperspace for something like 50 cents an hour, but it was mostly with some PyTorch models rather than CUDA directly.


Modern GPU architectures are quite different from what came before them if you truly want to push them to their limits.


Really old GPUs were different, but the 1080 is similar to later stuff with a few features missing (half precision and "tensor cores", IIRC). It could be that the very most recent stuff has changed more (I haven't paid attention), but I thought the 4090 was just another evolutionary step.


Those are the features everyone is using, though.


Everyone, and I mean everyone, I know doing AI/ML work values VRAM above all. The absolute best bang for the buck is buying used P40s, and if you actually want those cards to be usable for other stuff, used 3090s are the best deal around; they should be ~$700 right now.


What they really value is bandwidth. More VRAM is just more bandwidth.


Well, to give an example, 32GB of VRAM would be vastly preferable to 24GB of higher-bandwidth VRAM. You really need to be able to put the entire LLM in memory for best results, because otherwise you're bottlenecked on the speed of transfer between regular old system RAM and the GPU.

You'll also note that M1/M2 Macs with large amounts of system memory are good at inference because the GPU has a very high-speed interconnect between the soldered-on RAM modules and the on-die GPU. It's all about avoiding bottlenecks wherever possible.
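
To put rough numbers on that (my own back-of-the-envelope, not the parent's): a 13B-parameter model at fp16 is about

    13e9 params * 2 bytes = 26 GB of weights

so it fits entirely in 32GB of VRAM but not in 24GB, no matter how fast the 24GB card's memory is; the overflow has to stream over PCIe from system RAM, and that link becomes the ceiling.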


Not really any paradigm shift since the introduction of Tensor Cores in NVIDIA archs. Anything Ampere or Lovelace will do to teach yourself CUDA, up through the crazy optimization techniques and the worst mind-warping libraries. You'll only miss out on HBM (which lets you cheat on memory bandwidth), the amount of VRAM (teach yourself on smaller models...), double-precision perf, and double-precision tensor cores (go for an A30 then, and I'm not sure they'll keep them - either the x30 bin or the DP tensor cores - ever since "DGEMM on Integer Matrix Multiplication Unit", https://arxiv.org/html/2306.11975v4 ). FP4, DPX, TMA, and GPUDirect are nice, but you must be pretty far out already for them to be mandatory...


"Cheating on bandwidth" is the name of the game right now.


I was looking into this recently and it seems like the cheapest AWS instance with a CUDA GPU is something on the order of $1/hr. It looks like an H100 instance might be $15/hr (although I’m not sure if I’m looking at a monthly price).

So yeah it’s not ideal if you’re on a budget, but it seems like there are some solutions that don’t involve massive capex.


Look on vast.ai instead of AWS; you can rent machines with older GPUs dirt cheap. I don't see how they even cover the electricity bills. A 4090 machine starts at about $0.25/hour, though I didn't examine the configuration.

A new 4090 costs around $1800 (https://www.centralcomputer.com/asus-tuf-rtx4090-o24g-gaming...) and that's probably affordable to AWS users. I see a 2080 Ti on Craigslist for $300 (https://sfbay.craigslist.org/scz/sop/d/aptos-nvidia-geforce-...), though used GPUs are possibly thrashed by bitcoin mining. I don't have a suitable host machine, unfortunately.


Thrashed? What type of damage could a mostly solid-state device suffer? Fan problems? Worn PCIe connectors? Deteriorating Arctic Ice from repeated heat cycling?


Heat. A lot of components - and not just in computers but in hardware generally - are spec'd for something called "duty cycles": basically, how long a thing is active in a specific time frame.

Gaming cards/rigs, which many of the early miners were based on, rarely run at 100% all the time; the workload is bursty (and distributed among different areas of the system). In comparison, a miner runs at 100% all the time.

On top of that, for silicon there is an effect called electromigration [1], where the literal movement of electrons erodes the material over time - made worse by ever-shrinking feature sizes, as well as, again, by the chips being used in exactly the same way all the time.

[1] https://en.wikipedia.org/wiki/Electromigration


Nope, none of those.

When people were mining Ethereum (which was the last craze that GPUs were capable of playing in -- BTC has been off the GPU radar for a long time), profitable mining was fairly kind to cards compared to gaming.

Folks wanted their hardware to produce as much as possible, for as little as possible, before it became outdated.

The load was constant, so heat cycles weren't really a thing.

That heat was minimized; cards were clocked (and voltages tweaked) to optimize the ratio of crypto output to Watts input. For Ethereum, this meant undervolting and underclocking the GPU -- which are kind to it.

Fan speeds were kept both moderate and tightly controlled; too fast, and it would cost more (the fans themselves cost money to run, and money to replace). Too slow, and potential output was left on the table.

For Ethereum, RAM got hit hard. But RAM doesn't necessarily care about that; DRAM in general is more or less just an array of solid-state capacitors. And people needed that RAM to work reliably -- it's NFG to spend money producing bad blocks.

Power supplies tended to be stable, because good, cheap, stable, high-current, and stupidly efficient are qualities that go hand-in-hand, thanks to HP server PSUs being cheap as chips.

There were exceptions, of course: Some people did not mine smartly.

---

But this is broadly very different from how gamers treat hardware, wherein: heat cycles are real, overclocking everything to eke out an extra few FPS is real, pushing things a bit too far and producing glitches can be tolerated sometimes, fan speeds are whatever, and power supplies are picked based on what they look like instead of an actual price/performance comparison.

A card that was used for mining is not implicitly worse in any way than one that was used for gaming. Purchasing either thing involves non-zero risk.


> That heat was minimized; cards were clocked (and voltages tweaked) to optimize the ratio of crypto output to Watts input. For Ethereum, this meant undervolting and underclocking the GPU -- which are kind to it.

> Fan speeds were kept both moderate and tightly controlled; too fast, and it would cost more (the fans themselves cost money to run, and money to replace). Too slow, and potential output was left on the table.

In the ideal case, this is spot on. Annoyingly however, this hinges on the assumption of an awful lot of competence from top to bottom.

If I've learned anything in my considerable career, it's that reality is typically one of the first things tossed when situations and goals become complex.

The few successful crypto miners maybe did some of the optimizations you mention. The odds aren't good enough for me to want to purchase a Craigslist or FB marketplace card for only a 30% discount.

I do genuinely admire your idealism, though.


It isn't idealism. It's background to cover the actual context:

A used card is for sale. It was previously used for mining, or it was previously used for gaming.

We can't tell, and caveat emptor.

Which one is worse? Neither.


Replying to sibling @dotancohen: they melt, and they suffer from thermal expansion and contraction.


Are there any certifications or other ways to prove your knowledge to employers in order to get your foot in the door?


Does this pay more than $500k/yr? I already know C++, could be tempted to learn CUDA.


I kinda doubt it. Nobody paid me to do that though. I was just interested in LCZero. To get that $500k/year, I think you need up to date ML understanding and not just CUDA. CUDA is just another programming language while ML is a big area of active research. You could watch some of the fast.ai ML videos and then enter some Kaggle competitions if you want to go that route.


You're wrong. The people building the models don't write CUDA kernels; the people optimizing the models write CUDA kernels. And you don't need to know a bunch of ML BS to optimize kernels. Source: I optimize GPU kernels. I don't make 500k, but I'm not that far from it.


How much performance difference is there between writing a kernel in a high level language/framework like PyTorch (torch.compile) or Triton, and hand optimizing? Are you writing kernels in PTX?

What's your opinion on the future of writing optimized GPU code/kernels - how long before compilers are as good or better than (most) humans writing hand-optimized PTX?


The CUDA version of LCZero was around 2x or 3x faster than the Tensorflow(?) version iirc.


Heh I'm in the wrong business then. Interesting. Used to be that game programmers spent lots of time optimizing non-ML CUDA code. They didn't make anything like 500k at that time. I wonder what the ML industry has done to game development, or for that matter to scientific programming. Wow.


That’s pretty funny. Good test of value across the millennia. I wonder if the best aqueduct engineers during the peak of Ancient Rome’s power had villas worth their body weight in gold.


Lol. For once being overweight may come with some advantages here.


Or disadvantages: you may be as rich as your skinny neighbour, but they are the only ones worth their weight in gold ;)


Selection bias. I'm sure there are lots of people who are really good at CUDA and don't have those kind of assets. Not everyone knows how to sell their skills.


Right now, Nvidia's valuation has made a lot of people realize that their CUDA skills were being undervalued. Anyone with GPU or ML skills who hasn't tried to get a pay raise in this market deserves exactly the life they are living.


Unfortunately it's also hard to buy (find) people who don't know how to sell.


What do people study to figure out CUDA? I'm studying to get my GED and hope to go to school one day.


Computer science. This is probably a grad-level topic.

Nvidia literally wrote most of the textbooks in this field, and you'd probably be taught using one of them anyway:

https://developer.nvidia.com/cuda-books-archive

"GPU Gems" is another "cookbook" sort of textbook that might be helpful starting out, but you'll want a good understanding of the SIMT model etc.


Just wait until someone trains an ML model that can translate any CUDA code into something more portable like HIP.

GP says it is just some #ifdefs in most cases, so an LLM should be able to do it, right?


OpenAI Triton? PyTorch 2.0 already uses it.

https://openai.com/index/triton/


>> Don’t most orgs that are deep enough to run custom CUDA kernels have dedicated engineers for this stuff? I can’t imagine a person who can write raw CUDA not being able to handle things more difficult than pip install.

This seems to be a fairly common problem with software. The people who create software regularly deal with complex toolchains, dependency management, configuration files, and so on. As a result, they think that if a solution "exists", everything is fine. Need to edit a config file for your particular setup? No problem. The thing is, I have been programming for decades and I really hate having to do that stuff, and I will avoid tools that make me do it. I have my own problems to solve and don't want to deal with figuring out tools, no matter how "simple" the author thinks that is to do.

A huge part of the reason commercial software exists today is probably because open source projects don't take things to this extreme. I look at some things that qualify as products and think they're really simplistic, but they take care of some minutiae that regular people are willing to pay for so they don't have to learn or deal with it. The same can be true for developers and ML researchers or whatever.


> if you're talking about building anything, that is already too hard for ML researchers.

I don't think so. I agree it is too hard for the ML researchers at the companies which will have their rear ends handed to them by the other companies whose ML researchers can be bothered to follow a blog post and prompt ChatGPT to resolve error messages.


I'm not really talking about companies here for the most part; I'm talking about academic ML researchers (or industry researchers whose role is primarily academic-style research). In companies there is more incentive for good software engineering practices.

I'm also speaking from personal experience: I once had to hand-write my own CUDA kernels (on official NVIDIA cards, not even this weird translation layer): it was useful and I figured it out, but everything was constantly breaking at first.

It was a drag on productivity and more importantly, it made it too difficult for other people to run my code (which means they are less likely to cite my work).


a lot of ML researchers stay pretty high level and reinstall conda when things stop working

and rightly so, they have more complicated issues to tackle

It's on developers to provide better infrastructure and solve these challenges


Not rightly. It'd be faster in the long term to address the issues.


Currently nobody thinks that long-term. They just reinstall; that's it.


The target audience of interoperability technology is whoever is building, though. Ideally, interoperability technology can help software that supports only NVIDIA GPUs today go on to quickly add baseline support for Intel and AMD GPUs tomorrow.

(and for one data point, I believe Blender is actively using HIP for AMD GPU support in Cycles.)


Their target is HPC users, not ML researchers. I can understand why this would be valuable to that particular crowd.


God, this explains so much about my last month working with TensorFlow Lite and libtorch in C++.




