
I've done it. I have a GPD Pocket 4 with 64 GB of RAM and the less capable HX 370 Strix Point chip.

With ollama, hardware acceleration through ROCm doesn't really work. ROCm doesn't officially support gfx1150 (Strix Point, RDNA 3.5), though you can override it to report gfx1151 (Strix Halo, also RDNA 3.5 and UMA), and that works.
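For anyone who wants to try it, the override is just the usual HSA environment variable on the ollama server process (11.5.1 corresponds to gfx1151; adjust if your stack expects a different target):

    HSA_OVERRIDE_GFX_VERSION=11.5.1 ollama serve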

I think I got it to work for smaller models that fit entirely into the preallocated VRAM buffer, but my machine only allows statically allocating up to 16 GB for the GPU, and where's the fun in that? This is a unified memory architecture chip; I want to be able to run 30+ GB models seamlessly.

It turns out you can. Just build llama.cpp from source with the Vulkan backend enabled. You can leave the static VRAM allocation at 2 GB, and any additional data spills into GTT, which the driver maps into the GPU's address space seamlessly.
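For reference, this is roughly how I'd build and smoke-test it (you need the Vulkan SDK installed; the model path is just a placeholder for whatever GGUF you have):

    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release -j
    ./build/bin/llama-cli -m /path/to/model.gguf -ngl 99 -p "hello"

-ngl 99 offloads all layers to the GPU; anything that doesn't fit in the 2 GB carve-out ends up in GTT.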

You can see a benchmark of a small model I ran on GitHub [0], but I've gone up to Gemma 3 27B (~21 GB) and other large models with decent performance, and Strix Halo is supposed to have 2-3x the memory bandwidth and compute of this chip. Even 8B models perform well with the GPU in power-saving mode, within ~8 W.
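If you want to reproduce that kind of number yourself, llama-bench from the same build is the easiest way (the model file name here is just an example quant):

    ./build/bin/llama-bench -m gemma-3-27b-it-Q4_K_M.gguf -ngl 99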

Come to think of it, those results might make a good blog post.

[0] https://github.com/ggml-org/llama.cpp/discussions/10879

Search the thread for "HX 370" to find my results.


