Now I'm kinda curious to see how much faster you could go on an M1 Max with the GPU generating the data. Once his solution gets to the point of being a bytecode interpreter, it's trivially parallelizable, and the M1 has _fantastic_ memory bandwidth. Does anyone know if the implementation of pv or /dev/null actually requires loading the data into CPU cache?
If that is the case, does it actually matter what CPU core pv runs on? I feel like _something_ must ultimately zero the page out before it can get re-mapped to the process, but I'm not sure which core that happens on, or whether there's some other hardware mechanism that allows the OS to zero a page without actually utilizing memory bandwidth.
Unfortunately, the memory bandwidth that matters here is not bandwidth to GPU memory, but bandwidth to main system memory (unless anyone knows how to splice a pointer to GPU memory onto a unix pipe). That's specifically why the M1 came to mind, as a powerful UMA machine that can run Linux. Perhaps a modern gaming console could hit the same performance, but I don't know if they can run Linux.
PCIe 5.0 x16 tops out around 64 GB/s in each direction, so theoretically, if you perfectly pipelined the result, you could match that performance on a next-generation discrete GPU.
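Back-of-envelope for where that number comes from (line-rate math only; real links lose a few more percent to TLP/DLLP protocol overhead):

```python
# PCIe 5.0: 32 GT/s per lane, 128b/130b line coding, 16 lanes.
lanes = 16
transfers_per_s = 32e9      # one bit per transfer per lane
coding_efficiency = 128 / 130  # 128b/130b encoding overhead
bits_to_bytes = 1 / 8

bw = lanes * transfers_per_s * coding_efficiency * bits_to_bytes
print(f"{bw / 1e9:.0f} GB/s per direction")  # ~63 GB/s
```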