The ability to get huge amounts of VRAM per dollar is what I find so interesting. A lot of diffusion techniques are incredibly VRAM intensive, and high-VRAM consumer cards are rare and expensive. I'll gladly take the slower speeds of an APU if it means I can load the entire model in memory instead of having to offload chunks of it.
Using DDR5 as VRAM, however, means you're only getting 50GB/s to 100GB/s read/write speeds instead of the 500GB/s available on a proper GPU.
That might be a fine tradeoff for some kernels. But my understanding is that Stable Diffusion is very VRAM-bandwidth heavy and actually benefits from the higher-speed GDDR6 or HBM RAM on a proper high-end GPU.
With a 4090, nvidia-smi reports ~60-70% memory bandwidth usage while at 99% GPU usage, so that'd be 650-750GB/s. Considering how slow inference is on the APU, having 10% of the 4090's memory bandwidth maybe isn't that much of an issue?
edit: comment from the reddit thread:
> LLM isn't all that great as it is primarily memory bandwidth bound, ie almost no difference from a CPU if your memory bw is mere 12/25Gb/s. SD needs far more compute for inference - APU with slow memory helps.
That's $250 in Australian dollars though, which is about US$160. I'm not affiliated with that seller btw, I just remembered the search result from looking a while back. :)
For not much more (US$200) you can find lots of P40s, which are a generation newer and will give you double the memory bandwidth and FP32 throughput. That being said, used 3090s are going for about $600 now and are much better bang/buck and easier (software and hardware) to set up.
These have lots of memory but are pretty slow - they can be 50x-100x slower than a gamer card from the past couple of years, plus lots of heat / power inefficiency. If you have any software that can take advantage of tensor cores / matrix cores or int8 / fp16 ops then modern hardware will probably win.
Hey fellow person-in-Australia, I bought a P100 for $250 on eBay and have been using that with some custom cooling I printed and it works pretty damn well. You wouldn’t want an M or K series Tesla though as they’re just too old and not powerful enough to be that useful.
I think it was a fair deal - they are very old cards now, and they require you to 3D print a fan shroud, attach a fan, and implement some temperature-based fan control (
I whipped up this crappy little golang daemon for it <https://github.com/sammcj/nv_fan_control/blob/main/nv_fan_co...> ).
And of course you have to be happy with the additional ~30W of idle power consumption when you’re not using it, if it’s in your home server - which, as I’m sure you’ll appreciate, can be quite expensive in Australia now!
If they cost more (say $350+) I’d probably go get a slightly more expensive but much more modern RTX instead.
IIRC, an M40 is a pair of GPUs on one card, it has no video outputs, and it sucks a lot of power even for a GPU.
I'm looking at used M6000 cards, which seem a better compromise at a higher price point.
A PCIe 4.0 x16 link should provide around 32GB/s of bandwidth, close to the 33GB/s of DDR5-3200. In a perfect world, it would seem to me that doing 100% offloading (streaming everything as needed from system memory) should be equivalent to doing the calculations in system memory in the first place. The GPU memory just acts as a cache, and should only speed up the processing.
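Back-of-the-envelope, the PCIe side of that comparison looks something like this (theoretical peak per direction only; real transfers won't hit it, and the line encoding is the only overhead I'm counting):

    # Rough theoretical peak for a PCIe 4.0 x16 link, per direction.
    GT_PER_LANE = 16          # PCIe 4.0 runs at 16 GT/s per lane
    ENCODING = 128 / 130      # 128b/130b line encoding overhead
    LANES = 16

    pcie_gb_per_s = GT_PER_LANE * ENCODING * LANES / 8
    print(f"PCIe 4.0 x16: ~{pcie_gb_per_s:.1f} GB/s per direction")  # ~31.5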
Every modern motherboard is dual-channel DDR5 at a minimum, maybe quad-channel.
2x sticks of 32GB/s RAM, properly configured, will run at 64GB/s of bandwidth. Modern servers are quad, hex, or oct-channel (4x, 6x, or 8x parallel sticks of RAM) in practice. Or even more (ex: Intel Xeon Platinums are 6 channels per socket, so an 8-socket Xeon Platinum box will be running something like 48 parallel DDR5 RAM channels).
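The arithmetic behind those numbers is just transfer rate x 8 bytes per 64-bit channel x channel count. A quick sketch (theoretical peaks only; sustained bandwidth will be lower):

    # Theoretical peak DRAM bandwidth: MT/s * 8 bytes per 64-bit channel * channels.
    def dram_bandwidth_gb_s(megatransfers: int, channels: int) -> float:
        return megatransfers * 1e6 * 8 * channels / 1e9

    print(dram_bandwidth_gb_s(4000, 2))   # dual-channel DDR5-4000     -> 64.0 GB/s
    print(dram_bandwidth_gb_s(4800, 8))   # 8-channel server DDR5-4800 -> 307.2 GB/s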
----------
PCIe x16, by the way, is 16 parallel lanes of PCIe. All the parallelism is already innate in modern systems.
---------
L3 cache is TB/s-class bandwidth, IIRC. The CPU will automatically be caching a lot of those RAM accesses, so you can go above the RAM-bandwidth limitations in practice, though it depends on how your code accesses RAM.
GPUs have very small caches, and have higher latency to those caches. Instead, GPUs rely upon register-space and extreme amounts of SMT/wavefronts to kinda-sorta hyperthread their cores to hide all that latency.
The full mem-swap conga line is DRAM<->PCIe<->VRAM<->GPU. PCIe is the weak link, but I'd be willing to bet that those transfers aren't 100% overlapped, and the PCIe transfer rate represents the best-case speed as opposed to the expected one. In the case of unidirectional writes, you'd have to cut that speed in half.
That's true. The comment I was responding to talked about offloading. I was assuming he was talking about offloading part of the core model to system RAM, which would need to be reloaded frequently.
These chips don’t have a socket and were designed for laptops. However, they have up to a 54W TDP, which is not quite laptop territory. Luckily, there are mini-PCs on the market with them. The form factor is similar to an Intel NUC or Mac Mini. An example is the Minisforum UM790 Pro (disclaimer: I have never used one, so far I have only read a review).
The integrated Radeon 780M GPU includes 12 compute units of RDNA3; peak FP32 performance is about 9 TFlops, peak FP16 about 18 TFlops. The CPU supports two channels of DDR5-5600, so a properly built computer has roughly 90 GB/second of memory bandwidth.
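For reference, roughly where those peaks come from (the ~2.7 GHz boost clock and the RDNA3 dual-issue FP32 factor are assumptions on my part; the bandwidth figure is just the dual-channel DDR5-5600 theoretical peak):

    # Rough theoretical peaks for a Radeon 780M iGPU with dual-channel DDR5-5600.
    cus          = 12     # RDNA3 compute units
    lanes_per_cu = 64     # shader ALUs per CU
    clock_ghz    = 2.7    # assumed boost clock
    # FMA counts as 2 FLOPs; RDNA3 can dual-issue FP32 -> 4 FLOPs per lane per clock.
    fp32_tflops = cus * lanes_per_cu * 4 * clock_ghz / 1000   # ~8.3 TFLOPS
    fp16_tflops = fp32_tflops * 2                             # packed FP16

    mem_gb_s = 5600e6 * 8 * 2 / 1e9                           # ~89.6 GB/s
    print(fp32_tflops, fp16_tflops, mem_gb_s)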
The 4600G supports two channels of DDR4-3200 which has a maximum memory bandwidth of around 50GB/s (actual graphics cards are in the hundreds). While this chip may be decent for SD and other compute-bound AI apps it won't be good for LLMs as inference speed is pretty much capped by memory bandwidth.
Apple Silicon has extremely high memory bandwidth which is why it performs so well with LLMs.
> The 4600G supports two channels of DDR4-3200 which has a maximum memory bandwidth of around 50GB/s (actual graphics cards are in the hundreds).
DDR4-4800 exists. 76.8GB/s. You can also get a Ryzen 7000 series for around $200 that can use DDR5-8000, which is 128GB/s. By contrast, the M1 is 68GB/s and the M2 is 100GB/s. (The Pro and Max are more, but they're also solidly in "buy a GPU" price range.)
It doesn't change the meat of your argument, but the 4000G series runs DDR4-3600 pretty well (even if the spec sheet only goes to 3200), and it's almost a crime to run an APU at anything worse than DDR4-3600 CL16. You can go higher than that too, but depending on your particular chip, when you get much above 3600 you may not be able to run the RAM base clock at the same rate as the Infinity Fabric, which isn't ideal.
Odd. The high memory bandwidth of M2 intrigues me but I have not seen many people having success with AI apps on Apple Silicon. Which LLMs run better on Apple silicon than comparably priced Nvidia cards?
They don't run better on AS than on GPUs with even more memory bandwidth. They run better on AS than on consumer PC CPUs (or presumably iGPUs) with less memory bandwidth.
There are no comparably priced Nvidia cards - that's the point of comparing Apple's SoC/APU with AMD/Intel. Specialized hardware is, and always will be, better.
I wonder how it compares to running the same model on the same CPU in RAM (making sure a fast CPU ML library that utilises AVX2 is used, for example Intel MKL)?
Also, when doing a test like this it's important to compare the same bit depths, so FP32 on both.
My untested rule of thumb is that if something fits inside GPU-register space, it's probably faster on the GPU. (GPUs have more architectural registers than CPUs, since CPUs are using all those "shadow registers" to implement out-of-order stuff). Compilers will automatically turn your most-used variables into registers today, though you might need to verify by looking at the assembly code whether or not you're actually in register space.
But if something fits inside CPU-cache space, it's probably faster on the CPU. (Intel is up to 2MB of L2 cache on some systems, AMD is up to 192MB of L3 cache on some systems).
But if something is outside CPU-cache space yet fits inside GPU-VRAM space, it's probably faster on the GPU. (Better to be bandwidth-limited at 500GB/s GPU-VRAM speeds than at 50GB/s DDR5 speeds). Ex: a problem that fits in 16GB of GPU VRAM is back in GPU territory.
Then if something only fits inside CPU-RAM space, which is like 2TB in practice (!!!!), CPUs are probably faster. Because there's no point taking a problem limited by 2TB of DDR5 RAM, passing some of that data to 16GB or 96GB of GPU VRAM, and then waiting on DDR5 RAM anyway.
------------
Not that I've tested out this theory. But it'd make sense in my brain at least, lol.
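If it helps, here's that rule of thumb written out as a toy function (the size thresholds are illustrative guesses, not measured crossover points):

    # Toy encoding of the "which device is probably faster" rule of thumb above.
    # Thresholds are illustrative; real crossovers depend heavily on the kernel.
    def probably_faster_on(working_set_bytes: int) -> str:
        KB, MB, GB = 1024, 1024**2, 1024**3
        if working_set_bytes <= 256 * KB:    # fits in GPU register files
            return "GPU (register space)"
        if working_set_bytes <= 192 * MB:    # fits in a big CPU L3 (e.g. X3D parts)
            return "CPU (cache resident)"
        if working_set_bytes <= 16 * GB:     # fits in GPU VRAM
            return "GPU (VRAM bandwidth wins)"
        return "CPU (you'll be waiting on system RAM either way)"

    print(probably_faster_on(8 * 1024**3))   # -> GPU (VRAM bandwidth wins)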
OP has tested the 4600G APU. Theoretically, the chip can do 710 FP32 GFlops on CPU, and 1702 GFlops on the integrated GPU.
Another thing: it’s hard to write CPU code which saturates memory throughput instead of stalling on memory latency. In GPUs, the issue is bypassed at the architecture level. They enjoy a high degree of concurrency; GPU cores simply switch to another thread instead of waiting. CPU cores only run 2 threads each due to hyper-threading, not enough threads to hide memory latency the way GPUs do.
When it says "turned" I'm assuming there are some kernel boot parameters or driver configurations that are needed for it to allocate 16GB of main RAM for the GPU. Did they publish those or is this behavior out of the box?
In the BIOS of my Lenovo laptop (T13 Gen3 AMD) I can select how much of the RAM should be reserved as VRAM. I guess they are doing something similar.
ROCm and PyTorch recognize the GPU, as in, `torch.cuda.is_available` returns true, but I haven't actually run any models yet.
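For anyone else poking at this, a minimal sanity check looks something like the following (assuming a ROCm build of PyTorch, where the iGPU is exposed through the regular `torch.cuda` API):

    import torch

    # On a ROCm build of PyTorch the HIP device shows up under torch.cuda.
    print("ROCm/HIP version:", torch.version.hip)   # None on a CUDA-only build
    print("GPU visible:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))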
The maximum I can select on my 32GB RAM laptop is 8GB. Sounds like there are desktop AM4 mainboards where you can go up to 64GB.
One interesting thing I have seen being reported by `glxinfo` is auxiliary memory, which in my case is another 12GB, for a total of 20GB of reported available GPU memory. Unclear if this could be used by ROCm/PyTorch.
The billion-dollar question here is how to use the HSA feature in APUs to avoid needing to split the RAM between GPU and CPU; in theory, at least, they should both be able to access the same memory.
~2016 I bought a Lenovo ThinkPad E465 sporting an AMD Carrizo APU specifically to take advantage of HSA. It seemed like the feature, or at least the toolchain to take advantage of it, never really materialized. I'm glad at least someone else remembers it.
Which says:
"For APUs, this distinction is important as all memory is shared memory, with an OS typically budgeting half of the remaining total memory for graphics after the operating system fulfils its functional needs. As a result, the traditional queries to Dedicated Video Memory in these platforms will only return the dedicated carveout – and often represent a fraction of what is actually available for graphics. Most of the available graphics budget will actually come in the form of shared memory which is carefully OS-managed for performance."
The implication seems to be that you can have an arbitrary amount of graphics RAM, which would be appealing for AI use cases, even though the GPU itself is relatively underpowered.
Still, the question remains open: how do you precisely control APU/GPU memory allocation on Linux, and what are the limitations?
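One crude way to at least see what the runtime thinks it has (assuming a ROCm build of PyTorch; whether this total includes the shared/GTT portion or only the BIOS carveout is exactly the open question):

    import torch

    # Reports (free, total) memory in bytes for device 0 as seen by the runtime.
    free_b, total_b = torch.cuda.mem_get_info(0)
    print(f"free: {free_b / 2**30:.1f} GiB, total: {total_b / 2**30:.1f} GiB")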
My Asus ROG Ally (AMD-powered handheld Win11 gaming machine) comes with 16GB of RAM. In the BIOS I can set the GPU RAM to 2, 4, 6, 7 or 8GB, which will lower my available system RAM. However, I can also change this in the Armoury Crate software in Windows. So, in my case, it came like this out of the box.
I don't play while outside the house for extended periods, so battery isn't a concern for me. I mostly just wanted to be able to play my games on the couch, whether that was using the Ally as a handheld or by being able to dock it to the TV.
The biggest issue that keeps me from recommending it to anyone is that it is a Windows 11 machine. That is both its strongest and weakest point. Turn it on for the first time and it's Windows setup on a tiny screen with no physical mouse or keyboard. And then there is messing with game graphics settings to eke out the performance you want. The SD card slot is complete garbage - most likely a hardware issue that will not be able to be fixed: it stops reading cards and in some cases destroys the card. It is fairly easy to upgrade the internal SSD though. ASUS has also been pretty quiet on graphics driver updates.
As long as you are cool with that stuff, it's an awesome machine. The Steam Deck has more of a 'just works' kind of console feel, but is kinda outdated hardware-wise. Lenovo Go images just leaked today.
If you have 32GB of RAM, the option for force-allocating 16GB will be available in the BIOS. I think it lets you set a maximum of half your RAM as reserved for the iGPU.
How well do these workloads parallelize? Especially over consumer-tier interconnects? What's stopping someone from picking up 100 of these to set up a cool little 1.6TB-VRAM cluster? The whole thing would probably cost less than an H100.
The post is short so I'll paste it here. If this is against the rules please ban me.
-----begin copy paste-----
The 4600G is currently selling at a price of $95. It includes a 6-core CPU and a 7-core GPU. The 5600G is also inexpensive - around $130, with a better CPU but the same GPU as the 4600G.
It can be turned into a 16GB VRAM GPU under Linux and works similarly to AMD discrete GPUs such as the 5700XT, 6700XT, .... It thus supports the AMD software stack, ROCm, and therefore PyTorch and TensorFlow. You can run most AI applications.
16GB of VRAM is also a big deal, as it beats most discrete GPUs. Even if those GPUs have better compute power, they will hit out-of-memory errors when an application requires 12 or more GB of VRAM. Although speed is an issue, it's better than out-of-memory errors.
For Stable Diffusion, it can generate a 50-step 512x512 image in around 1 minute and 50 seconds. This is better than some high-end CPUs.
The 5600G was a very popular product, so if you have one, I encourage you to test it. I made some video tutorials for it. Please search for tech-practice9805 on YouTube and subscribe to the channel for future content. Or see the video links in the comments.
You demand a receipt and call that script kiddie garbage. What an irony.
ROCm has kinda worked for some APUs in 5.x for a while now. As much as things are expected to work on AMD hardware anyway.
So basically: install ROCm 5.5 and check whether rocminfo lists your APU as a device. Assign more VRAM in your BIOS if possible.
There are really no fundamental secret tricks involved. I run PyTorch on a 5600G. It's not great and breaks all the time, but that's not really APU-specific.
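If anyone wants a quick end-to-end smoke test on one of these APUs, something like this is enough to reproduce the 512x512 / 50-step timing from the post (a sketch assuming the diffusers library and a ROCm PyTorch build; the model ID and prompt are just placeholders):

    import time
    from diffusers import StableDiffusionPipeline

    # With ROCm PyTorch the APU is addressed as the "cuda" device.
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe = pipe.to("cuda")

    start = time.time()
    image = pipe("a photo of a koala", num_inference_steps=50,
                 height=512, width=512).images[0]
    print(f"50 steps took {time.time() - start:.0f}s")
    image.save("out.png")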