Is this news? I've got a nearly year-old app that supports over 2 dozen local LLMs, with support for using them with Siri and Shortcuts. I added support for Llama 3 8B the day after it came out, and also Eric Hartford's new Llama 3 8B-based Dolphin model. All models in it are quantized with OmniQuant. On iOS, the 7B and 8B models are 3-bit quantized and smaller models are 4-bit quantized. On the macOS version, all models are 4-bit OmniQuant quantized. 3-bit OmniQuant quantization is quite comparable in perplexity to the 4-bit RTN quantization that all the llama.cpp-based apps use.
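For anyone unfamiliar with the comparison: RTN (round-to-nearest) is the simple data-free baseline that llama.cpp-style quantizers build on, while OmniQuant learns clipping/scaling parameters from calibration data. Here's a rough numpy sketch of group-wise 4-bit RTN just to make that baseline concrete - the function names and group size are illustrative, not any particular implementation:

```python
# Minimal sketch of 4-bit round-to-nearest (RTN) weight quantization.
# Illustrative only: real llama.cpp quant formats (Q4_0, Q4_K, ...) use
# block-wise scales/mins and packed storage, and OmniQuant additionally
# learns clipping and scaling parameters instead of this naive rounding.
import numpy as np

def rtn_quantize(w: np.ndarray, bits: int = 4, group_size: int = 32):
    """Quantize a 1-D weight tensor group-wise with symmetric round-to-nearest."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1e-8, scale)       # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def rtn_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = rtn_quantize(w)
err = np.abs(w - rtn_dequantize(q, s)).mean()
print(f"mean abs quantization error: {err:.4f}")
```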
Nice. What is battery life like under heavy use? I was reading a thread on the llama.cpp repo earlier where they were discussing whether it was possible (or attractive) to add neural engine support in some form.
With the bigger 7B and 8B models, the battery life goes from over a day to a few hours on my iPhone 15 Pro.
The 8B model nominally works on 6GB phones but it's quite slow on them. OTOH, it's very usable on iPhone 15 Pro/Pro Max devices and even better on M1/M2 iPads.
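For rough intuition on why a 3-bit 8B model is marginal on 6GB devices but comfortable on the 8GB iPhone 15 Pro, here's a back-of-the-envelope estimate. The overhead figures are assumptions; real usage depends on the quant format's metadata, context length, and the per-app memory limit iOS enforces:

```python
# Back-of-the-envelope memory estimate for a quantized 8B model.
# Numbers are illustrative; actual usage depends on per-group quantization
# metadata, context length, KV-cache precision, and the per-app memory
# limit iOS enforces (well below total device RAM).
params = 8e9
bits_per_weight = 3.5           # ~3-bit weights plus scales/zero-points overhead
weight_bytes = params * bits_per_weight / 8

kv_cache_bytes = 0.5e9          # rough allowance for KV cache + activations

total_gb = (weight_bytes + kv_cache_bytes) / 1e9
print(f"~{total_gb:.1f} GB")    # ~4.0 GB: tight on a 6GB phone, fine on 8GB
```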
Every framework - llama.cpp, MLX, mlc-llm (which I use) - only uses the GPU. Using the ANE, and perhaps the undocumented AMX coprocessor, for efficient decoder-only transformer inference is still an open problem. I've made some early progress on quantized inference using the ANE, but there are still a lot of issues to be solved before it's even demo-ready, let alone a shipping product.
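For context, the usual way to even request the ANE is through Core ML's compute-unit preference, e.g. via coremltools when converting a model. A toy sketch, assuming a stand-in PyTorch module rather than an LLM decoder; note this only expresses a preference - Core ML can still fall back to GPU/CPU for ops the ANE can't handle, which is part of why efficient LLM decoding on the ANE remains the open problem described above:

```python
# Sketch: converting a tiny PyTorch module to Core ML and requesting the
# Neural Engine via compute_units. The TinyMLP module is purely illustrative.
import torch
import coremltools as ct

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(64, 256)
        self.fc2 = torch.nn.Linear(256, 64)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

example = torch.randn(1, 64)
traced = torch.jit.trace(TinyMLP().eval(), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # prefer CPU + Neural Engine
)
mlmodel.save("tiny_mlp.mlpackage")
```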
So the trend with Apple has been that the SoC from the current generation Pro and Pro Max devices becomes the SoC for the next generation of baseline devices. For instance the iPhone 14 Pro Max and iPhone 15 have the same SoC (A16 Bionic). And this trend holds all the way back to iPhone 12.
It's almost certain that the iPhone 16 will ship with 8GB of RAM. What remains to be seen is whether the iPhone 16 Pro and Pro Max will ship with 16GB of RAM (like the high-end M1/M2 iPad Pros with >= 1TB of storage).
It's one of the most infuriating things about Apple.
The RAM tax is so absurd it's bordering on criminal, but it also just seems stupid, because if they hadn't put 8GB of RAM in the new base MacBook Air M2, their whole lineup would be more than capable of running quality local LLMs, or could double as gaming devices, since their awesome chipset essentially gives them 16GB of VRAM. But no, not now that 25% have low RAM, i.e. no new OS LLM updates.
Also, we can't have gaming because half their newly sold devices have shit RAM, so they've also kind of already ditched the "gaming" plan they just got started on a year ago - all because they want to push products with RAM levels from 10 years ago. Bizarre!
They must be betting on local AI as a "pro" feature only.
8GB for a premium device in 2024 is a hard ask, completely agree. But I hold absolutely zero hard feelings toward Apple for not catering to gamers as a demographic.
Most importantly, though, we are talking about iPhones here. I can’t say I’ve ever thought to myself “gosh, I wish my phone had more RAM!” in…over a decade?
> But I hold absolutely zero hard feelings toward Apple for not catering to gamers as a demographic
Honestly I'm glad they don't. The PC is the last open platform out there and the last thing I'd want to see is Apple encroaching on it with their walled gardens and carbonite-encased computers.
Last time I cared how much RAM any phone had, iOS or Android, I was working at Augmentra on the ViewRanger app, and we were still supporting older devices with only 256 MB.
That was… *checks CV*… I left in April 2015.
I think RAM is like roads: usage expands to fill available infrastructure/storage.
That an iPhone today has as much RAM as the still-functioning Mid-2013 MacBook Air sitting in a drawer behind me is surprising when compared to the 250-fold growth from my Commodore 64 to my (default) Performa 5200… but it doesn't seem to have actually harmed anything I care about.
I was basically always slowed down by RAM on Android - probably because I switch between lots of very badly coded apps... so even on desktop I've grown to see RAM as "insurance against badly written code", as in "I'll still be able to run that memory-leaky crapware and get what I need done" or "I'll just spin up a VM for that crap that only runs on that other OS"...
Swimming in badly written SPAs and cordova/whatever hybrid apps is seriously helped by eg 12GB of RAM on a mobile :)
Zero chance the marketing department will let them give up the extra $400 or whatever they get to charge for the bare minimum storage and RAM upgrades on all their devices.
I think it's silly to think the marketing department gets to control the pricing, but it is definitely very true that the "starting at <great price>" is very powerful for them. Even beyond Apple, it warps and distorts pricing across the entire laptop field, because people who don't understand how inadequate the entry-level model is will compare that price to an entry-level model from Lenovo, Dell, etc., and draw conclusions. Even on HN I've seen people use the "starting at" price of Macs as a way of "proving" that "the Apple tax isn't much."
So yes, there is tremendous marketing value in that low starting price, although I think it's nearing the end of its usefulness now that even fan sites are starting to call out the inadequacy.
I don't think that they are inadequate; these devices are perfect for most of my family. They do some calls, messages, a couple of pictures here and there, basic word processing and web browsing, but not much more.
I had a Macbook with 8GB RAM and 256GB disk as my daily driver for work until last year running Docker and my fat IDE without too many issues. It's a similar story with my phone - I bought the bigger storage version because I thought I'd need it but after 3 years of using it I'm still not close to even using 128GB.
On the most recent iPhone Pro I have a query running (~15 minutes so far) and the results are really good, just really slow; I imagine the performance is worse on an older device.
My current and previous MacBooks have had 16GB and I've been fine with it, but given local models I think I'm going to have to go to whatever will be the maximum RAM available for the next one. It runs 13B models quite well with Ollama, but I tried `mixtral-8x7b` and saw 0.25 tokens/second speeds; I suppose I should be amazed that it ran at all.
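Rough numbers on why `mixtral-8x7b` falls off a cliff on a 16GB machine - the parameter count and per-weight overhead below are approximate:

```python
# Rough arithmetic for why mixtral-8x7b crawls on a 16GB machine.
# Mixtral 8x7B has roughly 47B total parameters (the experts share the
# attention weights, so it's less than 8 * 7B); figures are approximate.
total_params = 47e9
bits_per_weight = 4.5            # 4-bit quant plus per-block scales
weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")   # ~26 GB > 16 GB RAM -> constant swapping
```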
Similarly, I am for the first time going to care about how much RAM is in my next iPhone. My iPhone 13's 4GB is, as in your case, inadequate.
I recently upgraded from my M1 Air specifically because I had purchased it with 8GB -- silly me. Now I have 24GB, and if the Air line had more available I would have sprung for 32GB, or even 64GB. But I'm not paying for a faster processor just to get more memory :-/
I got an 8GB M1 from work, and I've been frankly astonished with what even this machine can do. Yes, it'll run the 4bit llama3 quants - not especially fast, mind, but not unusably slow either. The problem is that you can't do a huge amount else.
On Android you can simply run vanilla llama.cpp inside a terminal, or indeed any stack that you would run on a Linux desktop that doesn't involve a native GUI.
Yep, Termux is a good way to do this. llama.cpp has an Android example as well; I forked it here: GitHub.com/iakashpaul/portal - you can try it with any supported Q4/Q8 GGUF models.
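If you'd rather script it than build the native example, the llama-cpp-python bindings also work in a Termux-style environment. A minimal sketch, assuming a locally downloaded GGUF file (the filename is a placeholder):

```python
# Minimal sketch: running a local GGUF model with the llama-cpp-python
# bindings (pip install llama-cpp-python). The model path is a placeholder;
# any Q4/Q8 GGUF file works the same way. Inside Termux this runs CPU-only.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3-8b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=2048,        # context window
    n_threads=4,       # tune to the phone's big cores
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```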
There's an app called Private AI that will let you run models locally on Android. It has a few smaller models available for free to try it out, but the larger models like Llama 3 (or the option to use your own downloaded models) require a $10 unlock purchase.
https://privatellm.app/
https://apps.apple.com/app/private-llm-local-ai-chatbot/id64...