MobileLLM: Optimizing Sub-Billion Parameter Language Models for On-Device Use (github.com/facebookresearch)
260 points by tosh 43 days ago | 55 comments



> MobileLLM-125M/350M attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M SoTA models on zero-shot commonsense reasoning tasks

Small models, slightly improved, probably still not good enough for the same use as online models. Nothing wrong with incremental progress, however.

The 1.5B-parameter model does seem to be a pretty decent step up, even beating larger models by a wide margin. I'm not sure why they didn't go larger -- having a more efficient model that fits on hardware the size of the RPi could be a gamechanger (IIRC TinyLlama 7B does run, barely).


>> Small models, slightly improved, probably still not good enough for the same use as online models. Nothing wrong with incremental progress, however.

An even smaller language model should still be useful as part of a speech-to-text system. Such systems should benefit from using the language model to narrow down which word was spoken in the face of ambiguity or noise.


ASR systems already use language models during decoding, though mostly not large decoder-only LLMs. However, incorporating LLMs into ASR is currently at the center of a lot of research, e.g. using a speech encoder like wav2vec 2.0 or the Whisper encoder with a Q-Former etc., plus a LoRA adapter on an LLM trained for ASR.
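For a concrete (if simplified) picture, here's a minimal sketch of the older n-best rescoring approach, where a small causal LM re-ranks an ASR system's candidate transcripts. It assumes the Hugging Face transformers API, and GPT-2 is just a stand-in for whatever small model you'd actually ship -- it's not the encoder + Q-Former setups above.

  # Minimal sketch: re-rank ASR n-best hypotheses with a small causal LM.
  # Assumes the Hugging Face `transformers` API; GPT-2 is only a stand-in.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

  def lm_logprob(text: str) -> float:
      ids = tok(text, return_tensors="pt").input_ids
      with torch.no_grad():
          out = lm(ids, labels=ids)  # out.loss = mean NLL per token
      return -out.loss.item() * ids.size(1)

  # Hypotheses a noisy acoustic model might produce, with acoustic scores.
  nbest = [("recognize speech", -4.1), ("wreck a nice beach", -3.9)]
  best = max(nbest, key=lambda h: h[1] + 0.5 * lm_logprob(h[0]))
  print(best[0])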


Really interested in this! Do you know of some good reading in this area?


But imagine if these models were baked into your Instagram app and then used for ad targeting with your own compute. Then Facebook gets to look at tons of extra data, at lower cost (and much less litigation risk) to them.

In this application it's unfair to compare tiny models to cloud models. Moreover, any incremental accuracy boost to tiny models would be notable (and directly translate to revenue).


> I'm not sure why they didn't go larger -- having a more efficient model that fits on hardware the size of the RPi could be a gamechanger (IIRC TinyLlama 7B does run, barely).

I'm not sure that RPi is the right target for the next step of local LLMs, and I think that it's worth considering web-deployment on engines like WebLLM [1].

A 7B model may "run fine" on a Raspberry Pi, but I've (personally) found 7B models to be a bit larger than I want to download / run for web-based interfaces.

However, a solid 125M model is the sort of thing I can run on a webpage, and the time it takes to download to the user's browser (combined with my bandwidth costs) isn't exorbitant.

[1] https://github.com/mlc-ai/web-llm
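Rough download-size math, for what it's worth -- the bit-widths here are my own assumptions for illustration, not anything WebLLM or MobileLLM prescribes:

  # Back-of-the-envelope download size for shipping weights to a browser.
  def download_mb(params: float, bits: int) -> float:
      return params * bits / 8 / 1e6

  for name, params in [("125M", 125e6), ("1.5B", 1.5e9), ("7B", 7e9)]:
      sizes = {bits: round(download_mb(params, bits)) for bits in (16, 8, 4)}
      print(name, sizes, "MB")

  # 125M -> {16: 250, 8: 125, 4: 62} MB: plausible as a one-time web download.
  # 7B   -> {16: 14000, 8: 7000, 4: 3500} MB: clearly not.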


Llama-3-8B runs fine on a Raspberry Pi.


how fast is that for you?


Does it have to stay on mobile devices? A bit of a niche, but if it's not a resource hog it could be handy for giving NPCs in games more interesting dialogue without having to use a cloud model.

Even better if it could be tuned in some way to allow dialogue to influence NPC behavior or actions.


Would it be interesting dialogue? You could generate more dialogue, but would it have anything underpinning it of interest to the player? i.e. you could suddenly have townspeople that would talk about local scenery or their relationships with other NPCs, but none of that stuff they describe would actually exist in the game. I would personally be weirded out if NPCs started making stuff up.

I can imagine training some sort of LLM on your game data such that NPCs are able to actually describe the game world, but I can't imagine what kind of scale you'd need to operate at for that to be cheaper than just paying someone to write the dialogue. Maybe at Ubisoft's scale where your team sizes are in the thousands (AFAIK, they have been investigating using AI for writing, but it's mostly for things like combat barks which are very repetitive and basically noise.)


>Would it be interesting dialogue?

It would definitely depend a lot on the implementation. I think it could work great for some indie devs. Not all, of course; devs that like writing understandably won't like it.


It would be fascinating if NPCs had more backstory to them and more complex behaviors. Although I would imagine it would be near impossible to test since anything could influence their behavior.


I'm definitely interested in exploring this sort of thing. How much can we do with creating interesting characters and interesting circumstances?

Makes me think of the way that characters are set up in AI Alibis -- each with their own secrets, but also with clues about other NPCs' secrets. That feels like clever design, and it's the first use case of using LLMs for NPC dialogue that feels interesting to me: https://news.ycombinator.com/item?id=40921990


Yeah, testing would definitely be a nightmare, especially if conversations could influence the wider game.

You'd have someone on YouTube cheesing games by scamming NPCs.


What apps can one currently use to run them on, say, an iPhone? I'm only aware of the MLC one, which has literally just three old models.


The Android APK for MLC is updated frequently with recent models built in. And a Samsung S24+ can comfortably run 7-8B models at reasonable speeds (10ish tokens/sec).

https://llm.mlc.ai/docs/deploy/android.html


I have an (mlc-llm based) app on the App Store that supports over 2 dozen models, including some recent ones.




On my iPhone there doesn't seem to be an option to download more.

I vaguely recall there being a button initially, but I don't see it anymore.


I wonder how much you can push the "deeper and thinner" part. At some point your entire FFN fits into your L2 cache, and you're bound to get some performance jumps.
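Back-of-the-envelope, assuming a plain 4x-expansion FFN and int8 weights (an assumption, not MobileLLM's exact config):

  # When does a single transformer FFN block fit in a few MiB of L2/SLC?
  def ffn_bytes(d_model: int, expansion: int = 4, bytes_per_weight: int = 1) -> int:
      up = d_model * expansion * d_model    # W_up: d -> 4d
      down = expansion * d_model * d_model  # W_down: 4d -> d
      return (up + down) * bytes_per_weight

  for d in (256, 512, 768, 1024):
      print(f"d_model={d}: ~{ffn_bytes(d) / 2**20:.1f} MiB per FFN block (int8)")

  # 256 -> 0.5 MiB, 512 -> 2.0 MiB, 768 -> 4.5 MiB, 1024 -> 8.0 MiB,
  # so a thin model's per-layer FFN can plausibly stay cache-resident.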


Other research from Meta FAIR actually suggests that you should prune deeper layers if you want to improve performance while maintaining accuracy [1]. So there must be a cutoff point for smaller networks where this approach still works; otherwise the results are contradictory. Or we could drastically improve these new models even further.

[1] https://arxiv.org/html/2403.17887v1


That reminds me of the findings of Google’s paper on EfficientT5 (https://arxiv.org/abs/2109.10686). They refer to it as “DeepNarrow”.


Am I missing something, or couldn't something like distillation help here?


The paper says they tried that: https://arxiv.org/abs/2402.14905

Deep link to the relevant snippet in html version: https://ar5iv.labs.arxiv.org/html/2402.14905#S3.SS5

"So far, we trained compact models from scratch using next tokens as hard labels. We explored Knowledge Distillation (KD)... Unfortunately KD increases training time (slowdown of 2.6−3.2×) and exhibits comparable or inferior accuracy to label-based training (details in appendix)."


Hey HN. I actually have a current need for on-device wake-word-like STT. Which model(s) have the lowest WER and can run on an RPi 4B? I've been looking at openWakeWord. It's for a DIY inventory system.


It seems like the smaller models get the largest size decrease from embedding sharing/weight tying between the linear head and the token embeddings. Is there any research going into how to further reduce size from there?


If you mean that the LM head is just the transposed embedding matrix, then this was already done in GPT-2.

Unfortunately, the only thing I found out about this is that bigger models benefit from a separate head. But this was only mentioned somewhere on Discord, so there's no paper to read, and my personal hunch is that it should work for bigger models too. After all, GPT-3 was just a scaled-up GPT-2.

From my personal experiments, models learn better if you give them a harder task, and tied weights could be one such thing. Multi-token prediction could be another, and BitNet could also be considered one... (and dropout too).
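For anyone unfamiliar with what the tying looks like in practice, a minimal PyTorch sketch (names and sizes are illustrative, and the transformer stack / causal mask are elided):

  import torch.nn as nn

  class TinyLM(nn.Module):
      def __init__(self, vocab_size=32000, d_model=512):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, d_model)
          self.backbone = nn.Identity()  # stand-in for the transformer stack
          self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
          # Weight tying: the LM head reuses the embedding matrix, saving
          # vocab_size * d_model params -- a large share of a 125M model.
          self.lm_head.weight = self.embed.weight

      def forward(self, token_ids):
          return self.lm_head(self.backbone(self.embed(token_ids)))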


How about, instead of gen AI on the desktop, just AI on the desktop? It could organize all my files, emails, and notes and let me search for information in my own data.


Btw, for anyone interested I have AI news summaries and ideas for startups at my website here: https://asiaviewnews.com/gigabots/Threads?p=20007


Nice, could one use this to train models for Windows PCs also? I don't have a lot of RAM.


Training models is not OS dependent. RAM requirements depend on the model size, and I would argue a model like this should be a lot easier to fine-tune with less GPU RAM.

Nonetheless, the end goal will probably be downloading a model like this, or paying for fine-tuning and then downloading it, and running it through an optimized neural chip.

It's currently more a question of when this will happen. The newest Windows certification already requires some kind of neural chip, and even my Google Pixel 8 Pro can host small models (I know the Pixel is not a cheap phone, but the coprocessor should still be much more affordable than a big GPU).


While this is interesting, I wonder what the use case is, other than better autocomplete?


You could possibly fine-tune it for narrow domain tasks, like they did with TinyAgent: https://bair.berkeley.edu/blog/2024/05/29/tiny-agent/

I like the approach Apple seems to be taking with fine-tuned small models that handle routine tasks and then defer to larger off-device models for things they can't confidently do. I imagine you could construct a training set that contains examples that should produce low-confidence answers, add an output that is essentially a "call for help" option, and train the model to choose it. Smaller models also mean you could have more running in parallel and use another one to route requests to the appropriate expert.
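Very roughly, the routing could look like the sketch below. All the names (run_small_model, run_cloud_model, the escalation token, the threshold) are hypothetical placeholders, not Apple's or Meta's actual mechanism:

  # Sketch of the "call for help" routing idea: a small on-device model
  # answers routine requests and escalates when it emits a special token
  # or its confidence is too low. Everything here is a placeholder.
  ESCALATE = "<call_for_help>"
  CONFIDENCE_FLOOR = 0.6

  def run_small_model(request: str) -> tuple[str, float]:
      ...  # on-device sub-billion-parameter model (placeholder)

  def run_cloud_model(request: str) -> str:
      ...  # larger off-device model (placeholder)

  def answer(request: str) -> str:
      reply, confidence = run_small_model(request)
      if reply.strip() == ESCALATE or confidence < CONFIDENCE_FLOOR:
          return run_cloud_model(request)
      return reply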


Reading emails, replying to emails, scheduling tasks, using APIs for services.

Basically everything that needs actions rather than knowledge.

"Tell my wife I'm late", and it will use some configured magic to talk to service XY and just do it.

Siri is very good at doing home automation without the internet; the old Google Assistant and Alexa were absolutely not, and I don't think they were ever available offline.

This basically gives you a good, working local (local-first!) assistant.


It would be very nice to have my schedule automatically managed by Siri. It already has a few nice things, but I genuinely have trust issues, especially with AI.


You can get very far with the Shortcuts app, by the way. Some examples: using your current location to estimate when you should leave to get to your next meeting on your calendar, or letting those included in the calendar event know you're running late. Highly, highly recommend it; the learning curve isn't much, just a bunch of drag and drop!


A local agent like Siri that can do simple tasks and route more complex requests.


It can be fine-tuned for device-related actions. In other words, the small model can virtually have the same capabilities as your device's applications and services: it can dispatch a user's natural-language request to those applications and orchestrate them, and it can forward requests beyond the device's capabilities to a cloud model. This is powerful because it changes how you interact with your devices.


I tested the Google AI on my phone: I had the browser open and asked it to read the page to me, and it responded that it does not have access to the internet.

So I would like an AI assistant that:

1. can understand English and my native language

2. is aware that it runs on Android (or KDE/Linux) and can understand commands like "open the Android Settings, Applications section", "read the page that is open in the browser", or "read the text in the popup that just opened". Basically, it should be integrated with the OS via public and open APIs. Big AI companies could compete on selling us better assistants, especially for multilingual people.

3. is small; it should not know geography, history, music bands, etc. For tasks where the user asks a question, there should be an option for the model to forward the question to a search engine or even an online LLM.


It could power simple agents like Siri under the hood, helping with natural language understanding, intent classification, retrieval, and other agent tasks.


Like the Rabbit R1 or Humane AI Pin



Use cases are those of LLMs, from a mobile UI (so every AI use case there is), when you need privacy from big tech's AI APIs.

I'm just so amazed by statements like "LLMs can ONLY be used for autocomplete", like am I supposed to be impressed by the smirkiness?


The question was more about the capability and knowledge of a sub-1B LLM: at that size, what is it capable of doing beyond excellent autocompletion?


Probably hacking foreign intelligence codes.


Do Apple Watches have the hardware capability to run inference on a small model? Do I need a developer account to develop on one?


When Gemma 2 2B releases, it would be interesting to compare its scaling with this.


Interesting research, but Meta doesn't have any device worth talking about (at least at scale), unless they want to ship this as part of their apps.


Dismissiveness like this tends to radiate ignorance, not insight.

Quests have shipped roughly half as many units as the PS5. That's certainly a scale only a handful of technologically advanced product lines outside of phones ever reach.

Incidentally, the enabling technology for the Quest? On-device ML that grew out of - you guessed it - developing on-device inference for their apps.


125M parameters feels very feasible to ship as part of apps -- even web-based apps.


They have Oculus.


Why no MMLU or GSM8K?


Is anyone aware of custom mobile LLMs?

I mean optimizing and loading in your own voice, selecting your primary language, and adding a little bit of personal knowledge like nicknames, location, and so on.

My Pixel 8 can apparently use/load local models, but I don't have the time right now to go down that rabbit hole.


Tensor chips are not open enough for an optimized mobile LLM to be run on them.



