Llama3 running locally on iPhone 15 Pro (imgur.com)
77 points by yaszko 7 months ago | 54 comments



Is this news? I've got a nearly year-old app that supports over two dozen local LLMs, with support for using them with Siri and Shortcuts. I added support for Llama 3 8B the day after it came out, and also Eric Hartford's new Llama 3 8B-based Dolphin model. All models in it are quantized with OmniQuant. On iOS, the 7B and 8B ones are 3-bit quantized and smaller models are 4-bit quantized. On the macOS version all models are 4-bit OmniQuant quantized. 3-bit OmniQuant quantization is quite comparable in perplexity to the 4-bit RTN quantization that all the llama.cpp-based apps use.
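
For anyone wondering what RTN means there: it's plain round-to-nearest quantization. A toy numpy sketch of the idea (illustrative only, not the OmniQuant or llama.cpp code):

    import numpy as np

    def rtn_quantize(w, bits=4):
        # Symmetric round-to-nearest: one scale per tensor, round, dequantize.
        qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, 3 for 3-bit
        scale = np.abs(w).max() / qmax
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        return q * scale

    w = np.random.randn(4096).astype(np.float32)
    print(np.abs(w - rtn_quantize(w, 4)).mean())  # mean round-trip error

OmniQuant instead learns the clipping/scaling parameters, which is roughly why its 3-bit models can land close to 4-bit RTN in perplexity.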

https://privatellm.app/

https://apps.apple.com/app/private-llm-local-ai-chatbot/id64...


Nice. What is battery life like under heavy use? I was reading a thread on the llama.cpp repo earlier where they were discussing whether it was possible (or attractive) to add neural engine support in some form.


With the bigger 7B and 8B models, battery life goes from over a day to a few hours on my iPhone 15 Pro.

The 8B model nominally works on 6GB phones but it's quite slow on them. OTOH, it's very usable on iPhone 15 Pro/Pro Max devices and even better on M1/M2 iPads.

All of the frameworks (llama.cpp, MLX, and mlc-llm, which I use) only use the GPU. Using the ANE, and perhaps the undocumented AMX coprocessor, for efficient decoder-only transformer inference is still an open problem. I've made some early progress on quantised inference using the ANE, but there are still a lot of issues to be solved before it is even demo ready, let alone a shipping product.
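
To be concrete, Core ML will let you ask for the ANE (here via coremltools in Python; the model name is just a placeholder), but whether a quantised decoder block actually maps onto it, instead of silently falling back to CPU, is exactly the open part:

    import coremltools as ct

    # Load an already-converted Core ML package and ask for the Neural Engine.
    # CPU_AND_NE is only a preference: ops the ANE can't handle (which today
    # includes a lot of what a quantised decoder needs) silently fall back.
    model = ct.models.MLModel(
        "Llama3DecoderBlock.mlpackage",   # placeholder path, not a real artifact
        compute_units=ct.ComputeUnit.CPU_AND_NE,
    )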


Super interesting, thank you!


I wonder if Apple will bump up the amount of RAM in iPhones due to AI. It seems like most LLMs require a large amount of memory.

They've been stingy on increasing RAM compared to Android phones.


So the trend with Apple has been that the SoC from the current-generation Pro and Pro Max devices becomes the SoC for the next generation of baseline devices. For instance, the iPhone 14 Pro Max and iPhone 15 have the same SoC (A16 Bionic). And this trend holds all the way back to the iPhone 12.

It's almost certain that the iPhone 16 will ship with 8GB of RAM. What remains to be seen is whether the iPhone 16 Pro and Pro Max will ship with 16GB of RAM (like the high-end M1/M2 iPad Pros with >= 1TB of storage).


So I plotted iPhone RAM over time, and I fail to see a trend that could lead to doubling the current RAM to 16GB.

12 Pro Max had 6GB RAM

15 Pro Max has 8GB RAM

The 16 will most likely have between 8GB and 12GB of RAM.

https://i.imgur.com/N7OPjMK.png


It's one of the most infuriating things about Apple.

The RAM tax is so absurd it's bordering on criminal, but it also just seems stupid: if they hadn't put 8GB of RAM in the smallest new MacBook Air M2, their whole lineup would be more than capable of running decent local LLMs, and they'd double as gaming devices, since their chipset effectively gives them 16GB of VRAM. But not now, when 25% of them have low RAM, i.e. no new OS LLM features for those machines.

Also, we can't have gaming because half the devices they sell have too little RAM, so they've kind of already ditched the "gaming" push they only started a year ago, all because they want to sell products with RAM levels from 10 years ago. Bizarre!

They must be betting on local AI as a "pro" feature only.


8GB RAM models don't exist to be used; they exist to be the e-waste that gets you to the checkout page where you click 16GB instead.


8GB for a premium device in 2024 is a hard ask, completely agree. But I hold absolutely zero hard feelings toward Apple for not catering to gamers as a demographic.

Most importantly, though, we are talking about iPhones here. I can’t say I’ve ever thought to myself “gosh, I wish my phone had more RAM!” in…over a decade?


> But I hold absolutely zero hard feelings toward Apple for not catering to gamers as a demographic

Honestly I'm glad they don't. The PC is the last open platform out there and the last thing I'd want to see is Apple encroaching on it with their walled gardens and carbonite-encased computers.


...so, you haven't used Android in over a decade?


Last time I cared how much RAM any phone had, iOS or Android, I was working at Augmentra on the ViewRanger app, and we were still supporting older devices with only 256 MB.

That was… *checks CV*… I left in April 2015.

I think RAM is like roads: usage expands to fill available infrastructure/storage.

That an iPhone today has as much RAM as the still-functioning Mid-2013 MacBook Air sitting in a drawer behind me is surprising when compared to the 250-fold growth from my Commodore 64 to my (default) Performa 5200… but it doesn't seem to have actually harmed anything I care about.


I was basically always slowed down by RAM on Android - prob bc I switch between lots of very badly coded apps... so even on desktop I've grown to see RAM as "insurance against badly written code" as in "I'll still be able to run that memory leaky crapware and get what I need done" or in "I'll just spin up a VM for that crap that only runs on that other OS"...

Swimming in badly written SPAs and cordova/whatever hybrid apps is seriously helped by eg 12GB of RAM on a mobile :)


> > we are talking about iPhones here


I can't wait until Groq or someone else release tiny mobile inference engines specifically for phones and the like.


There are already tiny LLMs for this. They're bad, because there's not enough information in them to be coherent.


> They've been stingy on increasing RAM

... in any of their products.

FTFY


Zero chance the marketing department will let them give up the extra $400 or whatever they get to charge for the bare minimum storage and RAM upgrades on all their devices.


I think it's silly to think the marketing department gets to control the pricing, but it is definitely very true that the "starting at <great price>" framing is very powerful for them. Even beyond Apple, it warps and distorts pricing across the entire laptop market, because people who don't understand how inadequate the entry-level model is will compare that price to an entry-level Lenovo or Dell and draw conclusions. Even on HN I've seen people use the "starting at" price of Macs as a way of "proving" that "the Apple tax isn't much."

So yes, there is tremendous marketing value in that low starting price, although I think it's nearing the end of its usefulness now that even fan sites are starting to call out the inadequacy.


I don't think they're inadequate; these devices are perfect for most of my family. They do some calls, messages, a few pictures here and there, basic word processing and web browsing, but not much more.

I had a MacBook with 8GB RAM and a 256GB disk as my daily driver for work until last year, running Docker and my fat IDE without too many issues. It's a similar story with my phone - I bought the bigger storage version because I thought I'd need it, but after 3 years of using it I'm still not close to even using 128GB.


Interesting. What is the browser usage like for most of your family? I.e., how many tabs do they tend to keep open at a time?


This is quite impressive to be honest.

The chat is answering at a speed of one word every few seconds. But still, this is a nice feat.

Example recording for the curious: https://www.youtube.com/watch?v=nZEvUj-QTrI


I'm so spoiled by the quality of modern closed-source models, I laughed out loud when the beginning of its answer just said [duplicate].


On my S24 Ultra, I am seeing it generate several words a second.


Related, with a GitHub link:

"Next level: QLoRA fine-tuning 4-bit Llama 3 8B on iPhone 15 pro.

Incoming (Q)LoRA MLX Swift example by David Koski: https://github.com/ml-explore/mlx-swift-examples/pull/46 works with lots of models (Mistral, Gemma, Phi-2, etc.)"

https://twitter.com/awnihannun/status/1782436898285527229
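
For anyone unfamiliar with the trick: (Q)LoRA freezes the (quantized) base weights and only trains a small low-rank update, which is what makes fine-tuning on a phone plausible at all. Toy numpy sketch of the math, not the MLX Swift code:

    import numpy as np

    d, r = 4096, 8                       # hidden size, LoRA rank
    W = np.random.randn(d, d)            # frozen base weight (4-bit in QLoRA)
    A = 0.01 * np.random.randn(r, d)     # trainable
    B = np.zeros((d, r))                 # trainable, zero-init: no change at step 0

    def forward(x):
        # Gradients only flow into A and B: 2*d*r params instead of d*d.
        return x @ (W + B @ A).T

    y = forward(np.random.randn(1, d))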



Wonder why it has a 2-star rating.


On the most recent iPhone Pro I have a query running (~15 minutes so far) and the results are really good, just really slow, but I imagine the performance is worse on an older device.


You can run it on Android too. Not very fast though.


Link?




I wonder why they don't publish it on Google Play?



For 3 seconds I was hoping it might support 13 mini, but 4GB of RAM is not enough :(


My current and previous MacBooks have had 16GB and I've been fine with it, but given local models I think I'm going to have to go with whatever the maximum available RAM is for the next one. It runs 13B models quite well with Ollama, but I tried `mixtral-8x7b` and saw 0.25 tokens/second; I suppose I should be amazed that it ran at all.
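
If anyone wants to put numbers on it, the ollama Python client makes the tokens/second comparison easy (model tags here are the standard library ones; eval_count is the server-reported count of generated tokens):

    import time
    import ollama  # pip install ollama; assumes a local ollama server is running

    prompt = "Explain KV caching in one paragraph."

    for tag in ("llama2:13b", "mixtral:8x7b"):
        t0 = time.perf_counter()
        resp = ollama.generate(model=tag, prompt=prompt)
        dt = time.perf_counter() - t0
        print(tag, round(resp["eval_count"] / dt, 2), "tok/s")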

Similarly, I am for the first time going to care about how much RAM is in my next iPhone. My iPhone 13's 4GB is, as in your case, inadequate.


I recently upgraded from my M1 Air specifically because I had purchased it with 8GB -- silly me. Now I have 24GB, and if the Air line had more available I would have sprung for 32GB, or even 64GB. But I'm not paying for a faster processor just to get more memory :-/


I got an 8GB M1 from work, and I've been frankly astonished by what even this machine can do. Yes, it'll run the 4-bit Llama 3 quants - not especially fast, mind, but not unusably slow either. The problem is that you can't do a huge amount else.


No luck on the 14 Pro Max either.



Seems like it is the 3-bit quantized version? (Judging by the file size.)

And does anyone know how many tokens per second it can run?
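
Back-of-the-envelope on the size guess, ignoring group scales and the fp16 embedding/output layers (which add a few hundred MB on top):

    params = 8.03e9                       # Llama 3 8B parameter count
    for bits in (3, 4):
        print(f"{bits}-bit: ~{params * bits / 8 / 1e9:.1f} GB")
    # 3-bit: ~3.0 GB, 4-bit: ~4.0 GB, so the file size is a decent tell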


Which app is this?

Does anything similar exist for Android?


On Android you can simply run vanilla llama.cpp inside a terminal, or indeed any stack that you would run on a Linux desktop that doesn't involve a native GUI.
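
E.g. under Termux, the llama-cpp-python bindings work just like on a desktop (model path and settings below are placeholders; any Q4 GGUF you've downloaded works):

    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(
        model_path="llama-3-8b-instruct.Q4_K_M.gguf",  # whatever GGUF you downloaded
        n_ctx=2048,      # context window
        n_threads=4,     # tune to the phone's big cores
    )
    out = llm("Q: Name the planets of the solar system. A:", max_tokens=64)
    print(out["choices"][0]["text"])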


Yep, Termux is a good way to do this. llama.cpp has an Android example as well; I forked it here: GitHub.com/iakashpaul/portal. You can try it with any supported Q4/Q8 GGUF models.


There's an app called Private AI that will let you run models locally on Android. It has a few smaller models available for free to try it out, but the larger models like Llama 3 (or the option to use your own downloaded models) require a $10 unlock purchase.


You can either modify the Android example inside llama.cpp or my fork of it at GitHub.com/iakashpaul/portal

Increase the context size (ctx) to more than 100 and point it at any Q4 GGUF of a 7B model.


It's fascinating. I hope to see more powerful large models on various mobile devices in the future.


I’m sure my iPhone 14 Pro with 6GB RAM can handle the 4GB weights, no?

Says device is unsupported :(


This should be attempted with phi-3 when the weights are released tomorrow.


Anybody else running into issues downloading models?


new apple pro ai model with 32gb ram !!


what is with the BS answer lol




