
Now I'm kinda curious to see how much faster you could go on an M1 Max with the GPU generating the data. Once his solution gets to the point of being a bytecode interpreter, it's trivially parallelizable, and the M1 has _fantastic_ memory bandwidth. Does anyone know if the implementation of pv or /dev/null actually requires loading the data into CPU cache?



pv never touches the data: it splices the input into the output (/dev/null in this case), so once written, a page is never touched again.

splice is Linux-specific, though, so you would need to get Linux running on the M1.
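
For reference, a minimal sketch of a splice-based copy loop in C, similar in spirit to what pv does when both ends support it (my own illustration, not pv's actual source): the kernel moves the data between the two descriptors directly, so the copying process never pulls the bytes into user-space buffers.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Move data from stdin (a pipe) to stdout without copying it
         * through user space. splice() requires at least one end to
         * be a pipe, and it is Linux-only. */
        for (;;) {
            ssize_t n = splice(STDIN_FILENO, NULL, STDOUT_FILENO, NULL,
                               1 << 20, SPLICE_F_MOVE);
            if (n == 0)
                break;              /* EOF on the input pipe */
            if (n < 0) {
                perror("splice");
                return 1;
            }
        }
        return 0;
    }

Something like `yes | ./splicecat > /dev/null` shows the same no-touch behavior.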


If that is the case, does it actually matter what CPU core pv runs on? I feel like _something_ must ultimately zero the page out before it can get re-mapped to the process, but I'm not sure which core that happens on, or whether there's some other hardware mechanism that allows the OS to zero a page without actually utilizing memory bandwidth.


The page is reused by the generator process without being zeroed.
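
A hedged sketch of how that reuse can look on the generator side (an illustration of the general technique, not the article's exact code): vmsplice() hands the pipe references to the buffer's pages, and the generator keeps rewriting the same buffer instead of allocating or zeroing fresh pages.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)
    static char buf[CHUNK];             /* reused forever, never zeroed */

    int main(void) {
        memset(buf, 'x', sizeof buf);   /* "generate" the data once */
        for (;;) {
            /* Gift references to buf's pages into the pipe on stdout.
             * Rewriting buf is only safe once the reader has drained
             * those pages (or with a buffer larger than the pipe
             * capacity); a real generator has to account for that. */
            size_t off = 0;
            while (off < sizeof buf) {
                struct iovec iov = {
                    .iov_base = buf + off,
                    .iov_len  = sizeof buf - off,
                };
                ssize_t n = vmsplice(STDOUT_FILENO, &iov, 1, 0);
                if (n < 0) {
                    perror("vmsplice");
                    return 1;
                }
                off += (size_t)n;
            }
        }
    }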


Couldn't we run it on an NVIDIA RTX A6000? It's much, much faster than the M1 Max, and it has much greater memory bandwidth too.


Unfortunately, the memory bandwidth that matters here is not bandwidth to GPU memory, but bandwidth to main system memory (unless anyone knows how to splice a pointer to GPU memory onto a Unix pipe). That's specifically why the M1 came to mind, as a powerful UMA machine that can run Linux. Perhaps a modern gaming console could hit the same performance, but I don't know whether they can run Linux.


It wouldn't benefit from UMA; a discrete card has its own memory pool rather than sharing system memory with the CPU.


PCIe 5.0 x16 bandwidth is about 64 GB/s (32 GT/s per lane × 16 lanes), so theoretically, if you perfectly pipelined the result, you could achieve the same performance on a next-generation discrete GPU.



