
A working knowledge of C++, plus a bit of online reading about CUDA and the NVidia GPU architecture, plus studying the LCZero chess engine source code (the CUDA neural net part, I mean) seems like enough to get started. I did that and felt like I could contribute to that code, at least at a newbie level, given the hardware and build tools. At least in the pre-NNUE era, the code was pretty readable. I didn't pursue it though.

Of course becoming "really good" is a lot different and like anything else, it presumably takes a lot of callused fingertips (from typing) to get there.
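
To give a flavor of what "getting started" means in practice: the hello-world of CUDA is basically C++ with one extra function qualifier and a launch syntax. A minimal sketch (a toy vector add, nothing to do with LCZero itself):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread handles one element of the output.
    __global__ void vec_add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        // Unified memory keeps the example short; real code usually
        // manages host/device copies explicitly.
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);  // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }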




Having dabbled in CUDA, but not worked with it professionally, my impression is that a lot of the complexity isn't really in CUDA/C++ itself, but in the algorithms you have to come up with to really take advantage of the hardware.

Optimizing something for SIMD execution is often not straightforward, and it isn't something most developers encounter outside a few niche areas. There are also a lot of hardware architecture considerations to work with (memory transfer speed is a big one) to even come close to saturating the compute units.
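
As a concrete illustration of the memory side of that (toy kernels, not from any real codebase): a big part of it is simply whether the 32 threads of a warp touch adjacent addresses, because that decides how many memory transactions each load turns into.

    // Coalesced: thread i reads element i, so a warp's loads collapse
    // into a few wide memory transactions and bandwidth is high.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: neighbouring threads read addresses far apart, so each
    // load hits a different cache line and effective bandwidth collapses,
    // even though the arithmetic is identical.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(size_t)i * stride % n];
    }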


The real challenge is probably getting your hands on a 4090 for a price you can pay before you're worth your weight in gold. Because an arm and a leg in gold is quite a lot.


You don't really need a 4090. An older board is plenty, and the software is basically the same. I fooled around with what I think was a 1080 on Paperspace for something like 50 cents an hour, though that was mostly with some PyTorch models rather than CUDA directly.


Modern GPU architectures are quite different from what came before them if you truly want to push them to their limits.


Really old GPUs were different, but the 1080 is similar to later hardware with a few features missing: half precision and "tensor cores", iirc. It could be that the very latest stuff has changed more (I haven't paid attention), but I thought the 4090 was just another evolutionary step.


Those are the features everyone is using, though.


Everyone, and I mean everyone, I know doing AI/ML work values VRAM above all. The absolute best bang for the buck is buying used P40s, and if you actually want the cards to be usable for other stuff, used 3090s are the best deal around; they should be around $700 right now.


What they really value is bandwidth. More VRAM is just more bandwidth.


Well, to give an example, 32GB of VRAM would be vastly preferable to 24GB of higher-bandwidth VRAM. You really need to be able to fit the entire LLM in GPU memory for best results, because otherwise you're bottlenecked on the speed of transfer between regular old system RAM and the GPU.

You'll also note that M1/M2 Macs with large amounts of system memory are good at inference because the GPU has a very high-speed interconnect between the soldered-on RAM modules and the on-die GPU. It's all about avoiding bottlenecks wherever possible.
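
A rough back-of-envelope for why fitting the whole model matters (illustrative numbers, not exact specs): token generation is memory-bound, so tokens/sec is roughly bandwidth divided by the bytes of weights streamed per token.

    #include <cstdio>

    int main() {
        const double model_bytes = 13e9 * 2;   // a 13B-parameter model at fp16
        const double vram_bw     = 1000e9;     // ~1 TB/s for a GDDR6X-class card
        const double pcie_bw     = 32e9;       // ~PCIe 4.0 x16, if weights spill to system RAM
        printf("weights in VRAM: ~%.0f tok/s\n", vram_bw / model_bytes);  // ~38
        printf("weights spilled: ~%.1f tok/s\n", pcie_bw / model_bytes);  // ~1.2
        return 0;
    }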


Not really any paradigm shift since the introduction of Tensor Cores in NVIDIA archs. Anything Ampere or Lovelace will do to teach yourself CUDA, up to the crazy optimization techniques and the mind-warping libraries. You'll only miss out on HBM, which lets you cheat on memory bandwidth; the amount of VRAM (teach yourself on smaller models...); double precision perf; and double precision tensor cores (go for an A30 then, and I'm not sure they'll keep those - either the x30 bin or the DP tensor cores - ever since "DGEMM on Integer Matrix Multiplication Unit", https://arxiv.org/html/2306.11975v4 ). FP4, DPX, TMA, GPUDirect are nice, but you must be pretty far out already for them to be mandatory...
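
For reference, the plain-CUDA entry point to Tensor Cores is the wmma API, where a single warp computes a 16x16x16 matrix-multiply tile. A minimal sketch (fp16 inputs, fp32 accumulate, no tiling or shared memory, so purely for orientation; needs sm_70 or newer):

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes D = A * B for a single 16x16x16 tile.
    __global__ void wmma_tile(const half* A, const half* B, float* D) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::fill_fragment(acc, 0.0f);
        wmma::load_matrix_sync(a, A, 16);   // leading dimension 16
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(acc, a, b, acc);
        wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
    }

    // Launch with a single warp: wmma_tile<<<1, 32>>>(A, B, D);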


"Cheating on bandwidth" is the name of the game right now.


I was looking into this recently and it seems like the cheapest AWS instance with a CUDA GPU is something on the order of $1/hr. It looks like an H100 instance might be $15/hr (although I’m not sure if I’m looking at a monthly price).

So yeah it’s not ideal if you’re on a budget, but it seems like there are some solutions that don’t involve massive capex.


Look on vast.ai instead of AWS; you can rent machines with older GPUs dirt cheap. I don't see how they even cover the electricity bills. A 4090 machine starts at about $0.25/hour, though I didn't examine the configuration.

A new 4090 costs around $1800 (https://www.centralcomputer.com/asus-tuf-rtx4090-o24g-gaming...) and that's probably affordable to AWS users. I see a 2080Ti on Craigslist for $300 (https://sfbay.craigslist.org/scz/sop/d/aptos-nvidia-geforce-...), though used GPUs are possibly thrashed by bitcoin mining. I don't have a suitable host machine, unfortunately.


Thrashed? What kind of damage could a mostly solid-state device suffer? Fan problems? Worn PCIe connectors? Deteriorating thermal paste from repeated heat cycling?


Heat. A lot of components - and not just in computers but everything hardware - are spec'd for something called "duty cycles", basically how long a thing is active in a specific time frame.

Gaming cards/rigs, which many of the early miners were based on, rarely run at 100% all the time; the workload is bursty (and distributed amongst different areas of the system). A miner, in comparison, runs at 100% all the time.

On top of that, silicon suffers from an effect called electromigration [1], where momentum carried by the flowing electrons gradually displaces the metal atoms of the interconnects - made worse by ever-shrinking feature sizes as well as, again, the chips being used in exactly the same way all the time.

[1] https://en.wikipedia.org/wiki/Electromigration


Nope, none of those.

When people were mining Ethereum (which was the last craze that GPUs were capable of playing in -- BTC has been off the GPU radar for a long time), profitable mining was fairly kind to cards compared to gaming.

Folks wanted their hardware to produce as much as possible, for as little as possible, before it became outdated.

The load was constant, so heat cycles weren't really a thing.

That heat was minimized; cards were clocked (and voltages tweaked) to optimize the ratio of crypto output to Watts input. For Ethereum, this meant undervolting and underclocking the GPU -- which are kind to it.

Fan speeds were kept both moderate and tightly controlled; too fast, and it would cost more (the fans themselves cost money to run, and money to replace). Too slow, and potential output was left on the table.

For Ethereum, RAM got hit hard. But RAM doesn't necessarily care about that; DRAM in general is more or less just an array of solid-state capacitors. And people needed that RAM to work reliably -- it's NFG to spend money producing bad blocks.

Power supplies tended to be stable, because good, cheap, stable, high-current, and stupidly efficient are qualities that go hand in hand thanks to HP server PSUs being cheap as chips.

There were exceptions, of course: Some people did not mine smartly.

---

But this is broadly very different from how gamers treat hardware, wherein: heat cycles are real, overclocking everything to eke out an extra few FPS is real, pushing things a bit too far and tolerating the occasional glitch is real, fan speeds are whatever, and power supplies are picked based on what they look like instead of an actual price/performance comparison.

A card that was used for mining is not implicitly worse in any way than one that was used for gaming. Purchasing either thing involves non-zero risk.


> That heat was minimized; cards were clocked (and voltages tweaked) to optimize the ratio of crypto output to Watts input. For Ethereum, this meant undervolting and underclocking the GPU -- which are kind to it.

> Fan speeds were kept both moderate and tightly controlled; too fast, and it would cost more (the fans themselves cost money to run, and money to replace). Too slow, and potential output was left on the table.

In the ideal case, this is spot on. Annoyingly, however, it hinges on the assumption of an awful lot of competence from top to bottom.

If I've learned anything in my considerable career, it's that reality is typically one of the first things tossed when situations and goals become complex.

The few successful crypto miners maybe did some of the optimizations you mention. The odds aren't good enough for me to want to purchase a Craigslist or FB marketplace card for only a 30% discount.

I do genuinely admire your idealism, though.


It isn't idealism. It's background to cover the actual context:

A used card is for sale. It was previously used for mining, or it was previously used for gaming.

We can't tell, and caveat emptor.

Which one is worse? Neither.


Replying to sibling @dotancohen: they melt, and they suffer from thermal expansion and contraction.


Are there any certifications or other ways to prove your knowledge to employers in order to get your foot in the door?


Does this pay more than $500k/yr? I already know C++, could be tempted to learn CUDA.


I kinda doubt it. Nobody paid me to do that, though; I was just interested in LCZero. To get that $500k/year, I think you need up-to-date ML understanding and not just CUDA. CUDA is just another programming language, while ML is a big area of active research. You could watch some of the fast.ai ML videos and then enter some Kaggle competitions if you want to go that route.


You're wrong. The people building the models don't write CUDA kernels. The people optimizing the models write CUDA kernels, and you don't need to know a bunch of ML bs to optimize kernels. Source: I optimize GPU kernels. I don't make 500k, but I'm not that far from it.


How much performance difference is there between writing a kernel in a high level language/framework like PyTorch (torch.compile) or Triton, and hand optimizing? Are you writing kernels in PTX?

What's your opinion on the future of writing optimized GPU code/kernels - how long before compilers are as good or better than (most) humans writing hand-optimized PTX?
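
(For context on what those compilers already handle well: the easy wins are elementwise fusions that keep intermediates out of global memory, roughly like the hand-rolled sketch below - illustrative only - while hand-written CUDA or PTX tends to pay off on shared-memory tiling, async copies and warp-level tricks.)

    // Unfused, this would be two kernel launches plus an extra round-trip
    // through global memory for the intermediate (x + bias).
    __global__ void bias_tanh_fused(const float* x, const float* bias,
                                    float* out, int rows, int cols) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < rows * cols) {
            float v = x[i] + bias[i % cols];  // broadcast bias across rows
            out[i] = tanhf(v);                // activation in the same pass
        }
    }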


The CUDA version of LCZero was around 2x or 3x faster than the Tensorflow(?) version iirc.


Heh, I'm in the wrong business then. Interesting. It used to be that game programmers spent lots of time optimizing non-ML CUDA code, and they didn't make anything like 500k at the time. I wonder what the ML industry has done to game development, or for that matter to scientific programming. Wow.



