yes, I was thinking of HSA, but they don't seem to be working on it. I think it turned out to be hard, and therefore expensive, and they got distracted by the need to stay alive, which meant shipping decent processors.
I think we could argue that tensorflow and/or pytorch have displaced any HSA interest, too. These are programming interfaces that do an OK job of abstracting away the hardware details, and they are totally embraced by the most demanding field (AI).
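To make that concrete, here's a rough pytorch sketch (the shapes and ops are made up, but the point is real): the same lines run on an NVIDIA GPU, an AMD GPU (ROCm builds of pytorch reuse the "cuda" device name), or plain CPU, and the code never mentions CUDA or HIP directly:

    import torch

    # Pick whatever accelerator is available; ROCm builds of pytorch
    # expose AMD GPUs through the same "cuda" device name, so this one
    # line covers NVIDIA, AMD, and the CPU fallback alike.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # A toy computation; nothing below mentions the hardware again.
    x = torch.randn(1024, 1024, device=device)
    w = torch.randn(1024, 1024, device=device)
    y = torch.relu(x @ w)
    print(y.sum().item())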
They abandoned it at their lowest point to focus on Zen. Now they seem to be picking the work back up: the upcoming Instinct MI300 APU brings HBM as unified system memory, together with hardware cache coherency between CPU cores and GPUs, both within and across NUMA nodes.
In the wee hours of this morning, I was trying to figure out why my attempts to build LLVM 17.0.3 from source were failing; the build couldn't find an HSA-related symbol.
My impression was that the build system's support for using HSA was a little wonky, but I'm not sure if that's fair. IIRC I worked around the issue by not trying to build LLVM's openmp code, with a configure invocation along the lines of the one below.
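For anyone who trips over the same thing, a rough sketch of that workaround, assuming (as in my case) the HSA dependency comes in through the openmp runtime's amdgpu offload plugin; the project list here is illustrative, not a recommendation:

    # Configure LLVM 17.x with "openmp" left out of the enabled
    # projects/runtimes; openmp's amdgpu offload plugin is the part
    # that links against libhsa-runtime64.
    cmake -S llvm -B build -G Ninja \
        -DCMAKE_BUILD_TYPE=Release \
        -DLLVM_ENABLE_PROJECTS="clang;lld" \
        -DLLVM_ENABLE_RUNTIMES=""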
It's a real shame that AMD didn't embrace HSA for the long term; it's the future of computing. Intel's QuickSync is very popular with certain folks for streaming, and HSA would have taken that a step further.
My understanding is that later pre-Zen APUs and Zen chips all essentially have the HSA hardware side... it's just that AMD never managed to put together a long-term, coherent software offering like nvidia did with CUDA. So outside of one-offs or very closed platforms like consoles, the HSA capabilities were just... lying there, used as little more than a super-fast alternative to PCIe for accessing the GPU.
Now consider that with Phoenix Point APUs you have the Zen 4 CPU, memory controllers, IO controller, RDNA3 GPU, and the XDNA "transputer-like" coprocessor all hanging off a common cache-coherent transport (Infinity Fabric). All the hardware parts are there.
Seems this is the approach Apple has taken with the M chips; everything is unified, more or less. Maybe one day in the future the CPU will just be one FPGA that recompiles itself for a given use case, so code paths become hardware paths, and then it goes back to general-purpose for the next task.
it seems like everyone settles for ugly, barely usable interfaces like CUDA and OpenCL, and then scabs them over with a higher-level interface like tensorflow.
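For a taste of what I mean, here's roughly what adding two vectors looks like through pyopencl (my stand-in for the low-level layer; the kernel and sizes are arbitrary) versus the one-liner you'd write at the framework level:

    import numpy as np
    import pyopencl as cl

    a = np.random.rand(1 << 20).astype(np.float32)
    b = np.random.rand(1 << 20).astype(np.float32)

    # Boilerplate: pick a device, make a queue, and mirror every
    # array into a device-side buffer by hand.
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    # The kernel itself lives in a string of C.
    prg = cl.Program(ctx, """
        __kernel void add(__global const float *a,
                          __global const float *b,
                          __global float *out) {
            int gid = get_global_id(0);
            out[gid] = a[gid] + b[gid];
        }
    """).build()

    prg.add(queue, a.shape, None, a_buf, b_buf, out_buf)

    # Copy the result back to the host explicitly.
    out = np.empty_like(a)
    cl.enqueue_copy(queue, out, out_buf)

    # The tensorflow/pytorch equivalent of everything above:
    #     out = a + b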