It took me a while to understand that paper, because it builds on the techniques of the Deja Vu paper for leveraging sparsity which are already pretty complex:
- First, the Deja Vu paper observes that models with low weight sparsity have what they call high "contextual sparsity". Basically, the matrix multiplications will produce vectors with lots of zeros in them, but where the zeros are depends on the input.
- The paper notes that you can use that sparsity to skip loading some rows of your matrices.
- However, to get good performance benefits, you need to predict in advance which rows you're going to skip. You can do that with a low-rank matrix (there's a rough sketch of this predict-and-skip idea after the summary below).
The Apple paper then suggests these findings can not just improve your performance loading from RAM, but even allow you to load from flash memory without sacrificing bandwidth:
- The paper notes that attention matrices are rather light; the FFNs are the ones you want to load sparsely.
- The paper notes that you can get much better sparsity by predicting the output of the ReLU layer rather than the input of the FFN. Basically: if you can predict that a given slot of the vector will be negative after the matmul (i.e., before the ReLU), you can skip loading the corresponding matrix column and just output zero.
- The paper suggests that you don't need to load most rows of the FFN at all; you can just keep a cache of recently used FFN rows for each FFN, and update it from flash memory on-demand.
There's a bunch more about chunk loading and the correlation between projection layers, but the above is where I think the main insights are.
(FFN = Feed Forward Network; in the context of transformers they're the biggest blocks.)
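To make the predict-and-skip idea concrete, here's a rough numpy sketch under my own assumptions (the shapes, the random weights, and the untrained low-rank predictor `A @ B` are all made up for illustration; in the real setting `W_up`/`W_down` would sit in flash and only the predicted-active rows would actually be fetched):

```python
import numpy as np

d_model, d_ffn, rank = 4096, 11008, 128

# Stand-in FFN weights; in practice these would live in flash, not RAM.
W_up = np.random.randn(d_ffn, d_model).astype(np.float32)
W_down = np.random.randn(d_model, d_ffn).astype(np.float32)

# Cheap low-rank predictor for the sign of W_up @ x (i.e., what survives ReLU).
# A real predictor would be trained; these random matrices just show the shape.
A = np.random.randn(d_ffn, rank).astype(np.float32)
B = np.random.randn(rank, d_model).astype(np.float32)

def sparse_ffn(x):
    scores = A @ (B @ x)                 # O(rank * (d_model + d_ffn)), much cheaper
    active = scores > 0                  # predicted non-zero slots after ReLU

    # Only the predicted-active rows of W_up (and matching columns of W_down)
    # are needed; a real system would fetch exactly these from flash.
    h = np.maximum(W_up[active] @ x, 0.0)
    return W_down[:, active] @ h         # skipped slots contribute exact zeros

x = np.random.randn(d_model).astype(np.float32)
y = sparse_ffn(x)                        # same shape as the dense output, (d_model,)
```

With random weights the predictor is of course wrong half the time; the point is just the data flow: predict, load a subset, compute, and emit zeros for everything else.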
I wonder how much of the model you can avoid loading before you start to see a real performance difference.
Let’s say you want to maintain 90% of everything-in-RAM performance.
Can you get away with only using half the memory? Do you need 90% of the memory? Maybe 95%?
Basically, how quickly do you lose performance relative to the everything-in-RAM maximum as you cut RAM? The charts compare their algorithm against a naive baseline in the reduced-RAM setting, which is a different (but good!) question.
If you can get good performance by NOT loading an entire eight-gig model into a cell phone's memory, that's obviously a very useful thing.
Apple was running a model double the size of the available memory. Not sure if that was a sweet spot they found or if you could sacrifice response time to run even bigger models.
The paper is worth a read in full as what they are doing is pretty cool:
"Then, we introduce two complementary techniques to minimize data transfer and maximize flash memory throughput:
• Windowing: We load parameters for only the past few tokens, reusing activations from recently computed tokens. This sliding window approach reduces the number of IO requests to load weights.
• Row-column bundling: We store a concatenated row and column of the up-projection and down-projection layers to read bigger contiguous chunks from flash memory. This increases throughput by reading larger chunks."
Half the memory was just an example, not a sweet spot. With smaller window sizes you can use less memory, but at the cost of loading more from flash.
p.s.: The window size in the paper indicates how many tokens' worth of feed-forward neurons are kept in memory.
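Here's a toy sketch of that windowing idea as I understand it (pure Python; `fetch_from_flash` and `evict` are placeholder callbacks I made up): the neurons active for the last `window` tokens stay resident, each new token only loads the delta, and a bigger window trades RAM for less flash traffic.

```python
from collections import deque

window = 5                       # how many tokens' active neurons stay resident
recent = deque(maxlen=window)    # per-token sets of active neuron ids
resident = set()                 # neuron ids currently held in RAM

def step(active_ids, fetch_from_flash, evict):
    """Process one token, given its predicted-active neuron ids."""
    needed = set(active_ids)
    to_load = needed - resident            # only the delta hits flash
    fetch_from_flash(to_load)
    resident.update(to_load)

    recent.append(needed)                  # oldest token falls out of the window
    still_needed = set().union(*recent)
    stale = resident - still_needed        # only used by tokens outside the window
    evict(stale)
    resident.difference_update(stale)
    return len(to_load)                    # flash traffic caused by this token
```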
I was thinking about this concept of windowing for LLMs a couple of days ago, but I lack the technical skills to implement it. Now Apple has just published a paper on it. This is what I call synchronicity.
They are targeting limited GPU memory and limited CPU-to-GPU memory transfer. I don't know how useful it would be on Macs, because MacBooks have unified memory and you don't necessarily need that transfer.
I'm just thinking out loud. Nothing in this post is authoritative.
Theoretically, the time to infer a single token with part of the model stored in flash should equal the time to infer that token with the whole model in RAM, plus the time required to load the flash-resident part of the model.
I assume we do not need to write back to flash, but I'm not an LLM expert so I could be wrong.
I assume we have many (more than 10) layers, so we can get away with reserving a fairly small amount of RAM for loading one layer after another. Most nontrivial LLMs have many dozens of layers, so this seems plausible.
If we are not bottlenecked on RAM bandwidth during inference, then we might be able to DMA the next layer from flash into RAM while computing the current layer. I don't think that would work on single-processor systems, since those always bottleneck on RAM.
Maybe a dual-processor system could load one layer into RAM on one processor while running the previous layer on the other processor, and thus run really big LLMs in a small amount of RAM?
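A rough sketch of that overlap (Python threads; `load_layer` and `compute_layer` are hypothetical stand-ins), just to show the double-buffering pattern; whether the read actually overlaps with compute depends on the compute not saturating memory bandwidth, as noted above:

```python
from concurrent.futures import ThreadPoolExecutor

def run_model(x, num_layers, load_layer, compute_layer):
    # load_layer(i) reads layer i's weights from flash; compute_layer(w, x) runs it.
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_layer, 0)              # prefetch the first layer
        for i in range(num_layers):
            weights = pending.result()                  # wait for layer i's weights
            if i + 1 < num_layers:
                pending = io.submit(load_layer, i + 1)  # start reading the next layer
            x = compute_layer(weights, x)               # compute overlaps that read
    return x
```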
I'm sitting next to a pile of parts for building a new LLM/AI machine (z840, dual processor), and I look forward to playing with this stuff.
"Mixture of Experts (Yiet al., 2023) have a sparse structure in their feed forward layer. This property can be used to combine with our method for enabling larger MoEs on device."
Assuming this implementation would allow running Mixtral 8x7B on a 16GB M1, I'm happy.
I would think so; I have it quantized at 8-bit (Q8) and that comes in at ~47GB.
Q4 should be well below the 32GB (2x 16GB).
I am wondering if the same approach could be applied to RAM-to-CPU transfers, too. Those are said to be data-transfer limited, so minimizing RAM-to-CPU-cache traffic should help there as well?
Q4 comes out to be ~26GB but Apple doesn't let you load it on a 32GB Mac machine because they put a limit on the max usable unified memory at ~21GB (`device.recommendedMaxWorkingSetSize`) [1]. So for Q4 Mixtral MoE you'd need a 64GB Mac machine unfortunately.
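Rough back-of-the-envelope check on those numbers (the parameter count and bits-per-weight are approximate; quantized formats carry some overhead for block scales, hence the fractional bits):

```python
params = 46.7e9                                  # roughly Mixtral 8x7B's total parameters

def size_gib(bits_per_weight):
    return params * bits_per_weight / 8 / 2**30

print(f"Q8_0 (~8.5 bpw): {size_gib(8.5):.0f} GiB")   # ~46 GiB, close to the ~47GB above
print(f"Q4_K (~4.8 bpw): {size_gib(4.8):.0f} GiB")   # ~26 GiB, matching the ~26GB above
```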
There’s a brand new hybrid quantization for Mixtral out that uses 4-bit for shared neurons and 2-bit for experts, which doesn't bleed much perplexity but fits it into a 32GB machine. Haven’t had it in hand yet and no link here on mobile, but can’t wait to try.
It's notable that Apple devices ship with very little RAM compared to similar devices from competitors.
Part of that is that Apple's software teams use more memory-efficient languages (e.g., Objective-C vs. Java). Part of that is that applications on iOS don't have to target a huge variety of screen resolutions (which otherwise tends to mean frequently loading and then downscaling high-res textures). Part of that is that RAM doesn't get much cheaper even when you buy at Apple scale, so a RAM bump represents a bigger hit to margins than adding other features.
But all of that comes back to bite when using LLMs, which inherently gobble RAM. And any memory-saving techniques used will simply allow a competitor with more RAM to squeeze in an even bigger, better, smarter model.
Add to that that you can't upgrade the RAM in most desktop Macs anymore.
I want to buy a Mac soon, and I'm really struggling to decide how much RAM I should order. Unfortunately, my budget is limited. If it wasn't, I would probably go for at least 32GB. I'm still hoping Apple might change their RAM pricing, but probably in vain.
> I want to buy a Mac soon, and I'm really struggling to decide how much RAM I should order.
I'd recommend getting at least 32GB if you're on the fence. Not being able to upgrade it is a bummer, and your future self will thank you for getting the most you possibly can.
For my most recent upgrade I went for 64GB (previously 32GB) and I'm really glad I did, especially since llama.cpp became a thing shortly after getting it.
Also in the “glad I got 64gb” camp - even though it seemed ridiculous when I bought it, technology has advanced so quickly that now it’s actually very useful.
Now I wish I’d bought 4tb rather than 2tb hard drive lol but that’s just me being lazy - that upgrade definitely felt like a step too far.
64GB of memory will struggle on 65-70B (or bigger) models, and you'll be limited to running 70B on Q3 or Q4 if you want to use it somewhat comfortably.
lmkd (low memory killer daemon) works fairly differently off of a different set of signals and different policy. But yes, conceptually they try to achieve the same goal.
I also do not know if Android combines system libraries into one big file for the savings, something Apple devices do.
The only things keeping me on a Mac are familiarity and the fact that MacBook Airs are silent. I'm open to suggestions for Linux laptops that are quiet or nearly silent; most have fans that rev up, and I'd gladly sacrifice some CPU for quiet, or even a quiet mode with an easy on/off switch. Nothing I've seen matches the silence, and I'm more than happy to be proven wrong. Obviously it would also need some other upside, like cheaper or replaceable RAM. For context, I mostly use my MacBook Air as a remote terminal to web-based services and to the Linux server I use for compiling bigger projects and home/self-hosting.
Not sure if this is the right take. Apple is betting that, in the long term, flash memory will be equivalent to RAM given the right CPU/GPU architectures. The timeline is aggressive, certainly, but I don't think their thesis is wrong.
I have a limited understanding of the topic, but would this allow running an LLM on a mobile phone offline? If that's feasible, it'd pave the way to lots of interesting applications, such as AI-assisted content moderation without having to phone home with confidential data.
Yes, this may (significantly) improve that. Even without it, you can already run LLMs on mobile phones; the question is just how big a model you can fit, how heavily it has to be quantized, and whether the few models that remain produce good enough results.
I appreciate all these recent articles referring to it as an "LLM" rather than "AI". That way you know it is specifically about the technology instead of marketing hype.
How is this different from flash attention? I think using similar terms without explaining the difference in the abstract is confusing...
edit: Seems like this is an extension along two different mechanisms within the flash framework. Title of paper could be better, but it is within the first few pages.
I was hoping to find some sort of "how this feature will be exposed to users" section in the conclusion, but maybe that discussion is out of scope.
Does this kind of feature then bubble up into CoreML as API calls and settings, where you need to set, for example, a _use_flash_ flag? Or does this just become a runtime optimization that's opaque to the user? Curious if anyone knows of good talks/presentations where Apple discusses their CoreML, Metal, etc. development roadmaps.
It looks like most of the team comes from XNOR.ai which Apple acquired in 2020[0]. The company was based in Seattle and it looks like the founders have Iranian roots.
I understand that it's a different approach, but I would still have expected this paper to at least mention FlashAttention [1] since they both leverage flash memory.
I'm pretty sure FlashAttention doesn't deal with flash memory at all.
From what I understand, FlashAttention is about using access patterns that better leverage local memory, especially SRAM. E.g., it's about keeping data in the CPU L1 cache, or whatever the GPU equivalent is.
(In other words: FlashAttention is concerned about the part that's faster than DRAM, this is about better offloading to the part that's slower than DRAM)
It means that 97% of the outputs from the layer are zero; only 3% are active _at a time_. You can’t get rid of the other 97% entirely because the active 3% isn’t static. I think the paper is saying that they can reasonably accurately predict the active 3% at least well enough to make it run faster without losing too much accuracy.
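A tiny simulation of that point (uniformly random active sets, which is not how real activations behave; real active sets overlap heavily between nearby tokens, which is exactly what the windowing trick exploits, so this overstates the union growth):

```python
import random

d_ffn, active_frac, tokens = 11008, 0.03, 200
per_token = int(d_ffn * active_frac)             # ~330 active neurons per token

seen = set()
for _ in range(tokens):
    seen.update(random.sample(range(d_ffn), per_token))   # this token's ~3%

print(f"{per_token} neurons per token, but {len(seen)} distinct neurons "
      f"({len(seen) / d_ffn:.0%} of the layer) touched after {tokens} tokens")
```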
Why bring "model parameters ... on demand to DRAM"? Maybe it is better to move LLM processing right unto flash controller chip... (after adding bfloat16 and matrix multiplication support into controller circuitry)
Probably because changing the software to use a different read pattern is doable in a few weeks/months on your existing systems, whereas changing anything in the flash controller is a wickedly hard project, probably only available to hardware manufacturers, and one that will take months to years given the immensely slower hardware iteration cycles (even if it's "just" firmware changes).
One of the initial ideas was of course doing the computation inside flash, but we didn't try to go down that path for two reasons:
1. It's not as easy to change the controller, and even if you do, it wasn't obvious to me whether system-level software updates would still be needed. As it stands, this is a standalone software project.
2. I suspect that at LLM scale the flash controller chip might not be powerful enough for the computation; additional hardware inside the flash might be required for that.