
It's coming in October with the new Apple chip



I'd be very surprised if Apple can put something on the level of GPT4 on a handheld. Remember, GPT4 is estimated to be around 1.7 trillion parameters. That's 3.4 TB at 16 bits, and it would still be ~340 GB at 1.58 bits. The best we can hope for is a smallish model with a few billion parameters. That would still be cool on a phone, but as of today those models are nowhere near GPT4.
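
Back-of-the-envelope, taking the rumored 1.7T figure at face value (a sketch, not anything official):

    # memory footprint ~= parameter count * bits per weight / 8 bytes
    def footprint_gb(params, bits):
        return params * bits / 8 / 1e9

    gpt4_params = 1.7e12                    # rumored estimate, not confirmed
    print(footprint_gb(gpt4_params, 16))    # ~3400 GB, i.e. 3.4 TB at fp16
    print(footprint_gb(gpt4_params, 1.58))  # ~336 GB at 1.58 bits per weight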


You don't need "GPT4" though. Mixtral 8x7B is robust and can be run in 36 GB, or 24 GB if you're willing to compromise. A 1.58-bit quantization should bring it down to 16 GB. That's still a lot compared to the iPhone 15's 6 GB, but it's close enough to imagine it happening soon. With some kind of streaming-from-flash architecture you might be in the realm already.


> With some kind of streaming-from-flash architecture you might be in the realm already.

I thought mmap'ing models to keep only the currently needed pieces in RAM was something that was figured out ~6 months ago? Performance wasn't terribly great IIRC, but with how much faster 1.58-bit is, it should still be okay-ish.
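
Roughly the idea looks like this (illustrative sketch only; the file name and layout are made up, not any particular library's format):

    import numpy as np

    # Map the raw fp16 weight file and let the OS page in only the slices
    # that actually get read; untouched layers stay on flash.
    weights = np.memmap("model.bin", dtype=np.float16, mode="r")

    def load_tensor(offset, shape):
        # Only the pages backing this slice are faulted in when it is read.
        n = int(np.prod(shape))
        return np.asarray(weights[offset:offset + n]).reshape(shape)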


There is a more detailed paper from Apple on this. Basically, you can do a little bit better than only keeping current weights in RAM with mmap.

For LLM inference, you are mostly dealing with b = W @ a, where a and b are vectors and only W is a matrix. If a is sparse (i.e. has many zero entries), you don't need all the columns of W to do the matrix-vector multiplication. A cleverly arranged W can make sure that during inference only the relevant columns are loaded from flash. Furthermore, if you apply the "One Weird Trick" paper to this matrix-vector multiplication, you can shard W by rows, i.e. `b[i:i+n] = W[i:i+n,:] @ a for i in range(0, N, n)`, such that while the previous b[i:i+n] is still computing, you already have visibility into which columns of the next matrix need to be loaded.
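
A rough sketch of those two ideas together (shapes and the flash-read helper are my own placeholders, not Apple's API):

    import numpy as np

    def sparse_matvec_from_flash(load_rows, a, N, shard=1024):
        # Sketch of b = W @ a with W on flash and a sparse.
        # load_rows(r0, r1, cols) stands in for a flash read returning
        # W[r0:r1, cols]; it is a placeholder, not a real API.
        cols = np.flatnonzero(a)   # only columns matching non-zero a[j] matter
        a_nz = a[cols]

        b = np.zeros(N, dtype=a.dtype)
        # Row-sharded: each shard of b needs all active columns but only its
        # own rows, so the next shard's reads can be issued while the current
        # shard is still multiplying.
        for i in range(0, N, shard):
            W_shard = load_rows(i, min(i + shard, N), cols)  # (rows, len(cols))
            b[i:i + shard] = W_shard @ a_nz
        return b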


You need all of the model in RAM to perform the matmult that gets you the next token from it. There's no shortcut.


I'm not sure what use that is, other than to maintain the KV cache across requests.


They won't have something at that size because, as you pointed out, it is still huge. But depending on how they are used, smaller models may be better for specific on-phone tasks, which starts to make model size a non-problem. GPT4 is so large because it is very general purpose, with the goal seemingly being to answer anything. You could have a smaller model focused solely on Siri, or something similar, that wouldn't require the parameter count of GPT4.


The thing about GPT4 that matters so much is not just raw knowledge retention, but complex, abstract reasoning and even knowing what it doesn't know. We haven't seen that yet in smaller models, and it's unclear whether it is even possible. The best we could hope for right now is a better natural language interface than Siri for calling OS functions.



