
Thanks for the detailed reply. I've been reading about the tools and tracking various forums discussing running local LLMs. The ANE (Apple Neural Engine) isn't required, just optional.

However, the Apple M2 Max GPU seems to do inference decently when using code targeting Metal (Apple's GPU API). Apparently inference on the LLaMA 65B model with an M2 Max happens at around 5 tokens/sec. Not amazing perf/$ for running inference at scale, but pretty interesting for a developer to tinker with. While the M2 Max with 96GB of RAM is slower than a 4090, it can run larger models, and I'm not expecting to be particularly performance limited for home/local inference.
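For anyone curious, here's roughly what that setup looks like with llama-cpp-python, which builds llama.cpp with Metal support on macOS. This is just a minimal sketch; the model path and quantization level below are placeholders, not a specific recommendation:

    # Local inference on Apple Silicon via llama-cpp-python (llama.cpp + Metal).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-65b.Q4_K_M.gguf",  # hypothetical quantized model file
        n_gpu_layers=-1,  # offload all layers to the GPU via Metal
        n_ctx=2048,       # context window
    )

    out = llm("Q: Why is unified memory useful for local inference? A:", max_tokens=64)
    print(out["choices"][0]["text"])

Setting n_gpu_layers=-1 offloads every layer; with 96GB of unified memory, a 4-bit quantized 65B model (roughly 35-40GB of weights) fits entirely on the GPU, which is exactly the case where this hardware beats a 24GB 4090.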

Yeah, the performance is honestly incredible given the (comparatively) immature software libraries, limited GPU core count, and low power ceiling. I adore my MacBook Pro and find it to be more than enough for my current needs. It's been surprisingly performant for my transformers course this summer.

Apple Silicon is an even better deal for learning/tinkering with ML models if you already own one or plan to purchase one regardless. As you mentioned, the unified memory architecture is a cheaper alternative to massive, expensive Nvidia GPUs for running inference on larger models. The software is only going to improve, given the growing enthusiast cohort and Apple's push towards providing a solid ML development pipeline on macOS.
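That pipeline is already usable today; for example, PyTorch ships a Metal (MPS) backend. A quick sketch to sanity-check it, with the caveat that the matrix size is arbitrary; tensors on the "mps" device live in the same unified memory the CPU uses:

    # Sanity-check PyTorch's Metal (MPS) backend on Apple Silicon.
    import torch

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    x = torch.randn(4096, 4096, device=device)
    y = x @ x  # matmul dispatched to the GPU via Metal
    print(device, y.shape)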
