A quick survey of the thread seems to indicate the 7b parameter LLaMA model does about 20 tokens per second (~4 words per second) on a base model M1 Pro, by taking advantage of Apple Silicon’s Neural Engine.
Note that the latest model iPhones ship with a Neural Engine of similar performance to latest model M-series MacBooks (both iPhone 14 Pro and M1 MacBook Pro claim 15.8 teraflops on their respective neural engines; it might be the same exact component in each chip). All iPhone 14 models sport 6GB integrated RAM; the MacBook starts at 8GB. All of the specs indicate an iPhone 14 Pro could achieve similar throughput to an M1 MacBook Pro.
Some people have already had success porting Whisper to the Neural Engine, and as of 14 hours ago GGerganov (the guy who made this port of LLaMA to the Neural Engine and who made the port of Whisper to C++) posted a GitHub comment indicating he will be working on that in the next few weeks.
So. With Whisper and LLaMA on the Neural Engine both showing better than real-time performance, and Apple’s own pre-existing Siri Neural TTS, it looks like we have all the pieces needed to make a ChatGPT-level assistant operate entirely through voice and run entirely on your phone. This is absolutely extraordinary stuff!
> All of the specs indicate an iPhone 14 Pro could achieve similar throughput to an M1 MacBook Pro.
Battery capacity and thermals are different and might be problematic. The phone might throttle performance earlier.
> it looks like we have all the pieces needed to make a ChatGPT-level assistant operate entirely through voice and run entirely on your phone.
As a demo, yes, but would loading the model be fast enough for Siri-like responsiveness? You also would want to run other programs alongside it.
And of course, for Apple to adopt something like this, we would have to get rid of the tendency of these models to derail conversations. Put in something somewhat sexist/racist/…, and it will reply with something a bit more sexist/racist/…
> we would have to get rid of the tendency of these models [...] reply with something a bit more sexist/racist/
If you don't want it to be racist, don't say racist things to it. Also, it'll be fairly clear where the racism came from - like a parrot and their owner.
AIs that can tweet, like MS Tay, and that remote-work chatbot, get a lot of attention when they melt down. Private AIs on your phone don't seem like they'll cause any concern with the phone-using public.
I think we'll appreciate the benefits more than we'll mind that others can make it say dirty words.
At this point in time, Siri as a voice-driven assistant has become so totally and utterly useless that it's not even worth comparing it to anything else. I wonder how a company can work on a feature like that for 10 years and manage to make it worse with every release they put out.
At this point in time, Apple should be so embarrassed of Siri that I really think scratching the whole thing would have a net benefit.
Scratch it, and start over. And fire everyone involved with Siri :-)
Just recently Siri would go belly-up on “Turn off Living Room lightS” — it would only work if I said “light” (singular). Extremely frustrating. They fixed it, I think, but this arbitrariness and many others make me think Siri is more quirk- and algorithm-based than a true AI.
Handling smart home requests is the one thing that Siri seems to do more or less without error, at least for me. I use that multiple times per day, and cannot remember the last time that it did not work.
Is Siri better, or does it have you well trained? My smart home stuff works best for me because I know more of the exact labels. I was literally surprised the other day that my wife included an S and it still worked.
Half the time it responds with "one moment.. one moment.. this is taking too long" or "I have problems connecting to the internet". But there are no internet problems whatsoever, and it connects to my Home Assistant using the local HomeKit integration, which shouldn't even need the internet.
> Some people have already had success porting Whisper to the Neural Engine, and as of 14 hours ago GGerganov (the guy who made this port of LLaMA to the Neural Engine and who made the port of Whisper to C++) posted a GitHub comment indicating he will be working on that in the next few weeks.
Oh shit, I took a closer look and you’re right. The repo was also helpfully updated with a note to this effect: “The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder, there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't know how to utilize it properly. But in any case, you can even disable it with LLAMA_NO_ACCELERATE=1 make and the performance will be the same, since no BLAS calls are invoked by the current implementation”.
No Joi in my pocket just yet :(
Because of this I re-checked my claims about the Whisper speed up from the Neural Engine and that does look legit, 6x at least. So the Neural Engine does have the chops for this workload, it just isn’t being used in this repo. It may not be LLaMA, but I sure hope someone gets an LLM running on the ANE sooner rather than later.
Our investigations indicate that it might not be possible to achieve ANE performance improvement over CPU for LLM Decoder inference with batch size of 1 [0]. Just to make it clear - I'm no expert in Core ML / ANE, so these conclusions could be totally wrong.
The Neural Engine across the M1 and M2 series is also sadly very limited.
I bought one thinking I could exploit it for Stable Diffusion and other tasks, but found that most libraries say to use the GPU for faster generation. What I found is that not only is the engine the same on the M2 Pro (meaning I upgraded from my base-model M1 for no reason), but it also doesn't scale at all, except in the M1 Ultra where it's doubled simply because it's using two dies bridged.
The Neural Engine can generate 512x512 images pretty easily but takes a while even compared to using the GPU on a base-model M1 Mac Mini. It's kinda crazy. I'm looking into ways to improve it and take advantage of the Neural Engine in the future, but the current situation is very limited. Even Apple's official implementation and Core ML libraries seem to prefer that you run them on Metal.
According to Meta's benchmarking[0] it is comparable on many metrics. I haven't used it myself so I can't say for sure if that is the case when actually using it.
I don't understand this topic well, but given premise that GPT3 and ChatGPT are different only that ChatGPT includes RLHF(Reinforcement Learning from Human Feedback), and LLaMA 7b is comparable to GPT3 on a number of metrics, it would follow that if we were to improve LLaMA 7b with RLHF, the 7b model would be similar to ChatGPT. Is that correct?
You're likely right that applying RLHF (+ fine-tuning with instructions) to LLaMA 7b would produce results similar to ChatGPT, but I think you're implying that that would be feasible today.
RLHF requires a large amount of human feedback data and IIRC there's no open data set for that right now.
And they've already collected over 100,000 samples; iirc ChatGPT was trained on something like 30,000 samples, so the open models should already be positioned to succeed.
> You mean train it on ChatGPT's output? That's against OpenAI's terms of service.
Oh no, someone call the internet police.
I'm sure scraping tons and tons of images and web data to train DALL-E and GPT and then selling access to that data to others was also against many licenses and terms of service, but OpenAI did it anyway.
None of these AIs were created ethically. At the very least we can make sure these huge models don’t solely belong to monopolistic tech companies and democratize their power.
GPT 3.5 likely differs from the original GPT 3 by more than instruction fine-tuning. For example, it was probably retrained under Chinchilla scaling laws [1], with a lot more data and maybe a somewhat smaller parameter count.
There are many variants of GPT-3 and GPT-3.5, and based on the performance numbers in Meta’s paper, it looks like they’re comparing against the very first version of GPT-3 from 2020. [2]
I wish we could start having open source TTS models with similar performance. So far Tortoise TTS is not there yet.
I'm not sure if Siri neural TTS is offered to 3rd party apps.
Oh, it’s probably higher than four words per second, then. I assumed tokens was characters and used the standard “there are five characters in a word” rule of thumb.
It's about 4 characters per token, so just over 1 token per word. I just round to 1 token per word, since the text most people generate does not use larger words, and because larger common words are still encoded as one token (e.g. HackerNews is probably one token despite being 10 characters).
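To put that rule of thumb against the ~20 tokens/s figure from the top comment, here's the back-of-envelope arithmetic (the per-token and per-word character counts are just the approximations quoted in this thread):

```
# Rough conversion of ~20 tokens/s into words/s, using the rules of thumb
# from this thread (~4 characters per token, ~5 characters per word).
tokens_per_second = 20.0
chars_per_token = 4.0
chars_per_word = 5.0

words_per_second = tokens_per_second * chars_per_token / chars_per_word
print(f"{words_per_second:.0f} words/s")  # -> 16 words/s, well above 4
```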
But won't it be that, in real life, no one would want to run a voice command that consumes a lot of CPU and battery, as opposed to making a network call to a service which has this model hosted?
Agreed that this can always be improved and hardware can get more efficient and better too, but at the end of the day, would it ever be better than an API call?
I live in eastern Oregon on a property with no cell service.
I use Siri a lot, mainly to add reminders, and sometimes I try to use Siri when I'm out at the greenhouse, which is just past the edge of the mesh network. I would love for those reminders to get added - even if it burnt battery.
And more generally I would love for people writing apps to consider that phones don't always have service - as would my neighbors.
Simple thought experiment: you want to know how many tons of copper are mined in the US each year. Lowest possible latency is calculating this in your head, most likely using data you don’t have. Looking it up online is a lot, lot faster.
In some far future world maybe every transistor will include the sum total of human knowledge up to the nanosecond, but that’s a pretty far future. There are many things where running locally means a higher latency floor.
There are potential drawbacks to relying entirely on voice-operated assistants like this: concerns around privacy and the use of personal data, as well as the potential for bias and inaccuracies in the responses generated by these models. It's important to strike a balance between the convenience and benefits of these technologies and the potential risks and limitations they bring. Nonetheless, the advancements being made in this field are impressive and it will be interesting to see how they develop in the future.
I think voice assistants can perform actions on phones (e.g. "open app", "message Alice", "call Bob", "turn off Bluetooth"). This couldn't do that (I think), which is an obvious drawback.
I've had difficulty obtaining useful results from the smaller (7B-sized) models. The issue lies in the content, not the speed. If you could stream the text-to-speech, the speed alone would be satisfactory.
If you are interested in implementing LLaMA yourself or learning, I noticed that the reference code by Facebook is some of the cleaner, easier-to-read ML code I've seen in a while. https://github.com/facebookresearch/llama/blob/main/llama/mo... It's about 200 lines long. You probably do need a bit of knowledge to understand what you are reading, but I was pleasantly surprised.
For example in comparison, StableDiffusion torch code in diffusers and transformers Python libraries has lots of conditionals, experiments etc. that are not being used that can make it hard to follow what is going on.
Last weekend I got the "main loop" of the transformer working in pure CPU Rust code, following the reference code. My crappy code is just very very slow as I focused on getting it to run, not making it fast. The tokenizer uses some Google thing https://github.com/google/sentencepiece but luckily for inference it seems that you just need to be able to parse the tokenizer model file and not understand how it was created; I was able to strip out the protobuf files from that repository and add it to Rust and read the tokens.
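For what it's worth, if you only need the token IDs and don't want to reimplement the protobuf parsing, the official sentencepiece Python bindings can load the tokenizer.model that ships with the weights directly. A minimal sketch (the file path is an assumption):

```
# Minimal sketch: load LLaMA's SentencePiece tokenizer and round-trip some text.
# Assumes the tokenizer.model file distributed with the weights is in the
# current directory.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

ids = sp.encode("The first man on the moon was", out_type=int)
print(ids)             # token IDs fed to the model
print(sp.decode(ids))  # should reproduce the original text
print(sp.vocab_size()) # 32000 for LLaMA
```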
I am optimistic that someone will make a high-quality CPU or some CPU+GPU+SSD combination thingamajig that will make it somewhat practical to run even the large LLM models without needing an A100 or two.
At least with my hardware this runs at "[size of model]/[speed of SSD reads]" seconds per token, which (up to some possible further memory reduction so you can run larger batches at once on the same GPU) is as good as it gets when you need to read the whole model from disk for each token.
At 125GB and a 2GB/s read (largest model, what I get from my SSD) that's about 60 seconds per token (1 day per 1440 words), which isn't exactly practical. Which is really the issue here: if you need to stream the model from an SSD because you don't have enough RAM, it is just a fundamentally slow process.
You could probably optimize quite a bit for batch throughput if you're ok with the latency though.
Yeah, it does seem like there's a fundamental limit to how fast you can go even if you engineer the data juggling to perfection. My guess is that every loop through the transformer is going to have to visit every weight, and if those weights cannot fit in your fastest memory, then it's going to have to spend time transferring data from SSD or whatever is lower in your memory hierarchy.
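That limit is easy to put numbers on: with batch size 1, every generated token has to touch every weight, so tokens/s can't exceed storage bandwidth divided by model size. A quick sketch using the figures from this sub-thread:

```
# Rough ceiling on single-batch decode speed when the weights don't fit in RAM:
#   tokens/s <= storage_bandwidth / model_size_on_disk
GiB = 1024**3

def seconds_per_token(model_bytes, bandwidth_bytes_per_s):
    return model_bytes / bandwidth_bytes_per_s

# 65B model at 16-bit (~125 GiB) streamed from a ~2 GiB/s SSD:
print(seconds_per_token(125 * GiB, 2 * GiB))  # ~62 s/token
# Same model at 4-bit (~31 GiB):
print(seconds_per_token(31 * GiB, 2 * GiB))   # ~15 s/token
```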
The quantization used in the post luckily seems to work somewhat well; I'm also wondering if some new clever ways will be invented that reduce the amount of data you need to juggle. Maybe e.g. not just using 4-bit weights but also compressing them in some way, sorting the weights or something.
Huffman encoding the weights (treating each 16-bit float as a symbol) could reduce the weights to ~85% of the original size (I calculated this exactly before, but am going from memory). You could maybe get a bit more than that with arithmetic encoding (if you managed to decode fast enough), but it shouldn't be that much more.
Once you start including lossy steps like quantization though it's much less clear. At some point you just reach "knowledge distillation is an open problem".
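One way to sanity-check that ~85% figure: treat each fp16 weight as a 16-bit symbol and compute the empirical entropy, which is the floor Huffman coding approaches. A sketch with synthetic Gaussian weights standing in for a real checkpoint (so the exact number will differ):

```
# Entropy of fp16 weights viewed as 16-bit symbols, as a proxy for how much
# a lossless entropy coder like Huffman could shrink them.
import numpy as np

w = np.random.normal(0, 0.02, size=1_000_000).astype(np.float16)
symbols = w.view(np.uint16)

_, counts = np.unique(symbols, return_counts=True)
p = counts / counts.sum()
entropy_bits = -(p * np.log2(p)).sum()

print(f"~{entropy_bits:.1f} bits/symbol vs 16 raw "
      f"(~{entropy_bits / 16:.0%} of the original size)")
```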
It requires some very minimal system RAM to load the model into VRAM and to compile the 4bit quantized weights.
But if you use pre-quantized weights (get them from HuggingFace or a friend) then all you really need is ~32GB of VRAM and maybe around 2GB of system RAM for 65B. (It's 30B which needs 20GB of VRAM.)
The full use case includes quantisation, which the repo points out uses a large amount of system RAM. Of course that’s not required if you skip that step.
Judging from downloads of the 4bit file and how many people I've seen post about quantizing it themselves, around 99% of people are just downloading the pre-quantized files.
I would not personally call compilation of software part of its "use case." Its use case is text generation.
Quantisation is a one-off process. I suspect most people who don't have access to a machine with enough RAM and don't want to use the pre-quantized version can afford the $20 to hire a big cloud server for a day.
Or it is probably possible to make it work slowly using a swapfile on Linux.
I have a separate branch that streams weights from ram - at which point I think I was only seeing negligible performance loss compared to storing the weights in vram. The bottleneck was compute, not GPU bandwidth.
The 65B model only needs just over 32GB of VRAM to run. It does not need system RAM to run/use if you use pre-quantized weights which you can find many places already.
No need to quantize yourself (besides it takes almost a day to do 4bit GPTQ quantization on 3xA6000).
Quantizing is a lossy process, so you can't really claim to be running the 65B LLaMA at that point (though the 65B GPTQ-quantized LLaMA does look like it might be very useful).
I noticed the fastText code was also surprisingly clean and readable C++. Whatever moral and other flaws Meta's business model might have in general, they seem to have a consistently excellent track record when it comes to publicly available libraries and tools.
Very nice post, good lead. It makes me curious... I wonder what LLaMA would look like implemented on the newly released OpenXLA[1]! Is that even a sensible ask? I feel like it could potentially be an informative exercise that would aid in understanding the landscape of tooling.
I’m pretty sure the code you linked is just simplified for publication. I think it’s interesting to read, I just don’t think it’s what they actually used to train and develop the algorithm.
I think tuning the sampler temperature and using top-k over top-p sounds ad hoc and shouldn’t be necessary for a solid model. Do you have any reason for suggesting those changes in particular? Especially since top-p, or nucleus sampling, is meant to be an improvement over top-k.
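For readers following along, here is roughly what the two strategies being debated do; a minimal numpy sketch with made-up logits, not the sampler from any particular repo:

```
# Temperature + top-k (keep the k most likely tokens) vs top-p / nucleus
# sampling (keep the smallest set whose cumulative probability reaches p).
import numpy as np

def sample(logits, temperature=0.7, top_k=None, top_p=None):
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:
        kth_largest = np.sort(probs)[-top_k]
        probs = np.where(probs >= kth_largest, probs, 0.0)

    if top_p is not None:
        order = np.argsort(-probs)
        cum = np.cumsum(probs[order])
        keep = int(np.searchsorted(cum, top_p)) + 1   # smallest prefix reaching p
        mask = np.zeros_like(probs)
        mask[order[:keep]] = 1.0
        probs = probs * mask

    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

token = sample([2.0, 1.5, 0.3, -1.0, -2.0], top_k=2)   # or top_p=0.9
```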
Relevant:
Since LLaMA leaked via torrent, it has been converted to Hugging Face weights and quantized to 8-bit for lower VRAM requirements.
A few days ago it was also quantized to 4-bit, and 3-bit is coming. The quantization method they use is from the GPTQ paper ( https://arxiv.org/abs/2210.17323 ), which leads to almost no quality degradation compared to the 16-bit weights.
4 bit weights:
Model, weight size, VRAM req.
LLaMA-7B, 3.5GB, 6GB
LLaMA-13B, 6.5GB, 10GB
LLaMA-30B, 15.8GB, 20GB
LLaMA-65B, 31.2GB, 40GB
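The "weight size" column is roughly just parameters × 4 bits (half a byte each) plus a bit of fp16 overhead for scales and norms. A quick reconstruction using the nominal parameter counts from the LLaMA paper; exact file sizes depend on the packing format:

```
# Approximate 4-bit weight sizes from parameter counts.
GiB = 1024**3
for name, params in [("7B", 6.7e9), ("13B", 13.0e9), ("30B", 32.5e9), ("65B", 65.2e9)]:
    print(f"LLaMA-{name}: ~{params * 0.5 / GiB:.1f} GiB at 4 bits/param")
# -> ~3.1, ~6.1, ~15.1, ~30.4 GiB, close to the figures in the table above
```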
Here is a good overall guide for Linux and Windows:
I'm pretty sure he only makes his past twitch streams available to subscribers. It's weird they are available on Youtube, maybe he doesn't know that's something he can change.
Not sure about his more recent content but he used to have a policy that the VODs could be uploaded in full as long as they were uncut and not modified. If you go watch some older VODs he says so at the beginning.
I have the 64GB M2 Max, so I will have to try this out. It's amazing that this laptop can do things my high-end desktop can't; Apple really made some good architectural choices with Apple Silicon.
Very cool. I've seen some people running 4-bit 65B on dual 3090s, but didn't notice a benchmark yet to compare.
It looks like this is regular 4-bit and not GPTQ 4-bit? It's possible there's quality loss but we'll have to test.
>4-bit quantization tends to come at a cost of substantial output quality losses. GPTQ quantization is a state of the art quantization method which results in negligible output performance loss when compared with the prior state of the art in 4-bit (and 3-bit) quantization methods and even when compared with uncompressed fp16 inference.
Could someone with experience explain: what's the theoretical minimum hardware requirement for LLaMA 7B, 13B, etc., that still provides output on the order of <1 sec/token?
It seems like we can pull some tricks, like using F16, and some kind of quantization, etc.
At the end of the day, how much overhead is left that can be reduced? What can I expect to have running with 16GB of RAM, a 3080, and a midrange AMD processor?
With 30b-4bit on a RTX 4090, I'm seeing numbers like:
Output generated in 4.17 seconds (4.03 tokens/s, 21 tokens)
Output generated in 4.38 seconds (4.25 tokens/s, 23 tokens)
Output generated in 4.57 seconds (4.25 tokens/s, 24 tokens)
Output generated in 3.86 seconds (3.40 tokens/s, 17 tokens)
The lower sizes (7B, 13B) are even faster with lower memory use. A 16GB 3080 should be able to run the 13B at 4-bit just fine with reasonable (>1 token/s) throughput.
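As a rough way to reason about what fits: budget the quantized weights plus the fp16 KV cache, which grows with context length. A hedged back-of-envelope sketch using LLaMA-13B's architecture (40 layers, hidden size 5120); activation buffers and framework overhead are ignored, so treat the result as a floor:

```
# VRAM floor = quantized weights + KV cache (keys and values for every layer,
# every past token, stored in fp16 here).
def vram_floor_gib(params, n_layers, hidden, context_len,
                   weight_bits=4, kv_bytes=2):
    weights = params * weight_bits / 8
    kv_cache = 2 * n_layers * hidden * kv_bytes * context_len  # K and V
    return (weights + kv_cache) / 1024**3

# LLaMA-13B at 4-bit with a 2048-token context:
print(vram_floor_gib(13.0e9, 40, 5120, 2048))   # ~7.6 GiB
```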
I tried it out myself (git pull && make) and the difference in results are day and night! It's amazing to play with, although you should prompt it differently than ChatGPT (more like the GPT-3 API).
16GB of VRAM can run the 7B for sure. I'm not sure what the most cutting-edge memory optimization is, but the 13B is going to be pretty tight; I'm not sure it'll fit with what I know of, at least. I've got it working at a bit over 20GB of VRAM, I think, at 8-bit.
If you can't fit it all in VRAM you can still run it, but it'll be slooooow; at least that's been my experience with the 30B.
Interesting. But how about the Apple Neural Engine (ANE)? I've always wondered if the ANE is ML-worthy, maybe only for inference, or, who knows, even training somehow. I've seen Apple's marketers bragging about it [1], even with code examples, but afaik no useful libraries, reliable measurements, or community interest exist in the wild for doing ANE ML on Macs.
Absolutely love ggerganov's approach with models like this and Whisper. It's just awesome being able to experiment with (what I consider) complex models without needing a billion python/c/cpp dependencies!
A MacBook Pro M1 would have the base M1 chip, while he was referring to the M1 Pro chip in something like a MacBook Pro with M1 Pro. It's confusing naming by Apple.
Insanity! This is the same guy who wrote Whisper C++! How does he do this? I feel like I am a side character in some cartoon gasping at the unthinkable power level of the main character.
That's all fine and good. But to do anything useful, you're going to want a powerful GPU (RTX 3090, RTX 4090 or A6000) with as much VRAM as possible. Unlike the diffusion models, LLM's are very memory-intensive, even at 4-bit GPTQ. The larger models like llama-13b and llama-30b run quite well at 4-bit on a 24GB GPU. The llama-65b-4bit should run on a dual 3090/4090 rig.
Coupled with the leaked Bing prompt and text-generation-webui, the results are quite impressive.
VRAM is the thing that Apple Silicon is going to have in excess compared to anything even close in price. MacBook Airs can have 14-15GB of VRAM if necessary.
RAM read bandwidth of M1/M2 is still not really competitive with the large GPUs like RTX3090, but it's getting close, compared with the usual CPU setups.
Is there an upper limit on the usable VRAM on Airs, or is it just tied to RAM size minus a bit for the OS? I just got a 24GB M2 Air, and your comment made me wonder if it's capped at some amount of VRAM, or if it's just that Airs can now have 24GB of RAM compared to 16GB for the M1 Airs.
Here is some quick math: these devices have SSD read speeds somewhere around 2GiB/s. With 4-bit quantization, we are looking at loading 4B parameters per second. That means we need about 8s per token for the 30B model. Hmmm, the math is a bit off (or I need to look closer at whether we can do more tokens per iteration with some batching).
You can use Speculative Sampling, where a draft model is used to generate short continuations of a sequence, which are scored in parallel by the large model. The draft model can be small, and you only need to call the large model from time to time, so you can stream it from SSD or cheap RAM.
Using LRDIMM DDR4 at the price of less than $1000 it is possible to stream GPT-3 five times a second, in 4bit quantisation. Multiply that with the 2-2.5x speedup from Speculative Sampling.
>Accelerating Large Language Model Decoding with Speculative Sampling
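A minimal sketch of the idea for anyone who hasn't read the paper: a small, fast draft model proposes a few tokens, and the big model verifies them all in one forward pass, accepting the prefix it agrees with. This uses a simplified greedy-acceptance rule rather than the paper's rejection-sampling scheme, and draft_model / big_model are placeholders, not a real API:

```
# Speculative decoding, greedy-acceptance variant (simplified).
def speculative_step(big_model, draft_model, prompt_ids, k=4):
    # 1. Draft k tokens cheaply with the small model.
    draft = draft_model.greedy_generate(prompt_ids, n=k)
    # 2. One forward pass of the big model over prompt + draft;
    #    preds[i] is the big model's own greedy choice for draft position i.
    preds = big_model.greedy_predictions(prompt_ids, draft)

    out = list(prompt_ids)
    for tok, pred in zip(draft, preds):
        if tok == pred:
            out.append(tok)    # big model agrees: token accepted "for free"
        else:
            out.append(pred)   # first disagreement: take the big model's token, stop
            break
    else:
        out.append(big_model.next_token(out))  # all k accepted: one bonus token
    return out
```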
I'm a huge fan of Georgi (the author)! You should also check out his other work, bringing Apple Silicon support to OpenAI's Whisper (speech-to-text model): https://github.com/ggerganov/whisper.cpp
I'm running 4-bit quantized llamas on torch/cuda with https://github.com/qwopqwop200/GPTQ-for-LLaMa, and I'm seeing significant tokens/second perf degradation compared to 8-bit bitsandbytes mode. I'm very new to this, and understand very little detail, but I thought it would be faster?
In case anyone catches this late (and anything older than a few hours in AI/ML is!), some of the original llama HF weights were not built correctly and gave poor output.
Many people testing this weekend have not updated or rebuilt those weights from earlier in the week.
Eh, I’d expect it to be slower if anything. Think about it like this. If you write an image renderer, bitmap would be the fastest, because it’s already decompressed. 4-bit quantization is a compression algorithm.
It depends on the details of memory bandwidth vs compute though.
Why is it necessary to port LLaMA into C/C++? Assuming the original model implementation was in Python, would it not have required just a few tweaks to make it work on Apple Silicon?
Yes, this is a good question. Why did they focus on a specific model rather than a generic solution that makes ANY Python-based model work on Apple Silicon?
Because that's not how machine learning models work. Machine learning as a field goes through a nearly complete revolution annually. Every new major model is a special snowflake of unique cases.
Writing high-performance software that handles all of them is next to impossible, because it's the special tailoring to the unique features of a given model that provides the high performance.
That's not how I think it works. ML is a small number of operations applied to very large blocks of data: tensors. You can build all kinds of complex formulas using that small number of tensor operations, but the (relative) speed is determined by how efficiently the small number of operations is implemented, not by how complicated the formulas are (relative to other operations using the same formula).
You're half right. First, tensor operations are only a small part of modern ML. Second, how you plug all those small operations together is where all the performance difference is had these days between implementations.
Different hardware have a variety of different small operations that do almost the same thing. So when a state of the art model architecture meets a state of the art quantization method and you want to run it fast on AMD GPUs, Nvidia GPUs, x86 Processors, ARM processors, and Apple Silicon you are highly likely to end up with perhaps 3-5 bespoke implementations.
This happens every few months in ML. Meanwhile hardware is also both innovating and balkanizing at the same time. Now we have Google Silicon, Huawei Silicon, and Intel Arc GPUs. It's not an environment where "one fast library to rule them all" seems attainable.
Ok, but in the end you're just evaluating a graph, and I suppose that compilers can figure out how to do this in the most efficient way on any type of hardware for which a backend was written. So it makes more sense to work on a backend that you can use for any type of model than to hand-optimize everything.
>I suppose that compilers can figure out how to do this in the most efficient way on any type of hardware for which a backend was written.
No, that's exactly the problem. Compilers can't because the GPU hardware and the algorithms involved are such rapidly moving targets. Bespoke hardware specific quantization, inference, attention, and kernel compilation is the only way to squeeze out the performance users are looking for.
Creating one fast implementation for all models on all hardware would be like writing one GPU driver for all GPUs and OSs. It just isn't going to work and if it does it isn't going to be fast on all hardware.
I got LLaMA 7B running on the CPU on Apple Silicon a while ago by simply removing references to CUDA in the Python code and changing an unsupported half-precision float to a full one, but I couldn't get the larger models running.
The f16 support and the quantization seem to be the main improvements here, and possibly the mentioned optimizations.
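For anyone wanting to reproduce that, the change described roughly corresponds to something like the following (the checkpoint filename is an assumption, and the rest of the reference code stays as-is):

```
# Load the checkpoint on the CPU and promote half-precision weights to fp32,
# since half-precision matmuls aren't supported on the CPU backend.
import torch

state_dict = torch.load("7B/consolidated.00.pth", map_location="cpu")
state_dict = {k: (v.float() if v.dtype in (torch.float16, torch.bfloat16) else v)
              for k, v in state_dict.items()}
# ...then construct the model with the CUDA-specific initialization removed
# and call model.load_state_dict(state_dict) as in the reference code.
```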
This is sort of the polar opposite of how modern high performance ML frameworks are built. Skimming the code, there's a ton of boilerplate for the various operations that could be library-ized and generified, if that makes sense.
I actually really like minimal implementations of state-of-the-art systems because the code is much easier to understand (modern frameworks are super-complex) but I wonder what it means long-term if you don't need frameworks.
This is just inference. The core of most ML libraries is the auto-differentiation capability. It would be extremely tedious if you had to calculate the gradients manually. Or, if you implemented your own AD, then it would effectively be a minified version of an ML library.
I can confirm that this (7B) runs nicely on a 24GB MacBook Air M2. The output of my initial test was definitely a bit different from ggerganov's example!
The first man on the moon was 39 years old on July 16, 1969.
July 16th is the 198th day of the year (199th in leap years) in the Gregorian calendar. There are 168 days remaining until the end of the year.
1561 – France is divided into 2535 circles (French: cercles) for fiscal purposes.
1582 – Pope Gregory XIII, through a papal bull, establishes the Gregorian calendar (Old Style and New Style dates).
1
As a follow-up, the 13B model similarly runs great on the 24GB MacBook Air. It seems to perform a lot better (output-wise) and runs about twice as slow per token.
Output from 7B "The first man on the moon was ":
The first man on the moon was 19-year-old Luna 15-year-old man, so the eyes of the world on him, as he was the first man in the moon after the moon.
The second man on the moon was a 45-year-old American man, Neil Armstrong. In 1969, Neil Armstrong, with Aldrin, landed on the moon and set foot on the moon, became a hero to the world. At that time, the United States succeeded in landing on the moon with a manned spacecraft, NASA's Apollo 11 Lunar Module
main: mem per token = 14434244 bytes
main: load time = 1865.80 ms
main: sample time = 259.36 ms
main: predict time = 23119.33 ms / 170.00 ms per token
main: total time = 25916.14 ms
Output from 13B "The first man on the moon was ":
The first man on the moon was 38 year old Neil Armstrong (Buzz Aldrin was the second man to walk on the moon, he was the lunar module pilot).
What was the first man to land on the moon?
Neil Armstrong was the first person to step on the moon on July 20, 1969.
What did Neil Armstrong say when he landed on the moon?
Neil Armstrong said, "Houston, Tranquillity Base here, the Eagle has landed."
What are the facts about Neil Armstrong?
Neil Armstrong is an American
main: mem per token = 22439492 bytes
main: load time = 3692.42 ms
main: sample time = 260.61 ms
main: predict time = 42439.68 ms / 312.06 ms per token
main: total time = 47629.22 ms
Yeah, definitely. The second run was similarly off the rails..
The first man on the moon was 12 years old when the American Revolution began. As a boy, he was horrified by the sight of his mother being whipped. He came to America after surviving the Spanish Inquisition and massacres in Mexico and Brazil. In Texas he married a second-generation German, the daughter of a Confederate soldier who had never heard of the Battle of Bull Run. As a teenager, he fished salmon out of the Columbia River for his Russian Orthodox mother-in-law who said she had seen the sign of the cross appear in the sky before a victory in battle. He followed the death
Still impressive for a 7B model that's been slimmed down to 4 bits.
I tried the LLaMA 13B model using one of the online tools (mentioned in another HN comment; it claimed to use that model, but I cannot be 100% sure)
and I was very disappointed with the results. I tried to ask it a fairly simple question: to write a regex validating an email address. I tried this twice and got the responses "what is your question" and "please wait...", so it just dodged the question. In contrast, ChatGPT was able to provide me with a simple regex, and also an RFC-compliant version when asked.
Outside of it possibly being a smaller model, the prompts need to be different: LLaMA hasn't been trained to take instructions, so I think it would need to be framed more as "A regex for emails is " and let it complete the rest. Just in case you tried the exact same prompt.
Also those messages sound suspiciously like error messages from a system around it.
LLaMA is neither instruction-finetuned nor ranked with human reinforcement learning. It's a base model like the original GPT-3 that needs much more careful prompting and few-shot reasoning to produce similar results.
sillysaurusx made some posts about tweaks that greatly improved output for him. Maybe they would help you:
"""
Fixed the llama sampler. After turning off top_p, adding top_k 40, setting temp to 0.7, and adding a repetition penalty of 1/0.85, llama 7B is looking nice.
],
[
"How does the Schrodinger's cat thought experiment relate to the principles of quantum mechanics?",
"In this famous paradox, a cat inside a box is simultaneously both alive and dead due to the nature of subatomic particles. This demonstrates that at the smallest scales of reality, we cannot predict whether something will happen until we observe it happening."
],
[
"What are some of the potential consequences of climate change on global food security?",
"According to researchers at Stanford University, rising temperatures could lead to crop failures and reduced harvests for staple crops such as rice, wheat, corn, and soybeans by midcentury."
],
[
"Can you provide a brief overview of the history of political philosophy, from Ancient Greece to the present day?",
"Absolutely! Political philosophy is a branch of philosophy concerning the role of government in society. In ancient times, Plato and Aristotle were among the first thinkers to consider the subject in depth. Since then, other influential philosophers such as Thomas Hobbes and John Locke have expanded upon these ideas. Today, contemporary philosophers such as John Rawls continue to debate the merits of different political systems."
]
Give me a basic recipe for easy spaghetti carbonara
Here is a basic recipe for spaghetti carbonara:
Ingredients: 1 lb spaghetti, 3 tbsp butter, 2 tbsp olive oil, 1/4 cup grated Parmesan cheese, 1 egg, 1/2 teaspoon salt, 1/4 teaspoon pepper, 1/2 cup chopped parsley, 1/2 pound bacon, 1 clove garlic, 1/4 cup heavy cream.
Directions: Cook spaghetti according to package directions; drain. Heat butter and oil in large skillet on medium heat. Add bacon; cook 4 minutes or until crisp. Remove from skillet. Add garlic to drippings in pan; sauté 1 minute. Stir in egg, salt, pepper, and parsley; stirring constantly. Add bacon and spaghetti; toss well. Sprinkle with cheese. Serves 4.
3. Create an interesting prompt for the AI and put your question at the end.
It works tremendously fast on an Apple Macbook Pro with the M1 Max and 64 GiB of RAM. I only downloaded the 7B model because I wanted a quick run through. I'll try the rest later.
Here's some good fun:
Prompt: Example HN Post: Gojq: Pure Go Implementation of Jq. Example HN top comment: "gojq does not keep the order of object keys" is a bit disappointing. I care about key order purely for cosmetic reasons: when Im designing JSON APIs I like to put things like the "id" key first in an object layout, and when Im manipulating JSON using jq or similar I like to maintain those aesthetic choices. HN Post:"Llama.cpp - C/C++ implementation of Facebook LLama model". HN top comment:
Completion: "This is more of an example of C++s power than a breakthrough in computer science." Example HN Post: "The Moral Case for Software Patents". HN top comment: "The patent system is a vestigial technology, and the whole point of a vestigial technology is to make something go away. Its like a dinosaur that no one is using anymore. Its served its purpose." Example HN Post: "Benchmark: The Mystery of Gauss Summation". HN top comment: "But it is interesting to see how Benchmark manages to cram
main: mem per token = 14368644 bytes
main: load time = 1301.14 ms
main: sample time = 273.73 ms
main: predict time = 20970.29 ms / 82.89 ms per token
main: total time = 22939.82 ms
ggerganov is a genius! I'm using his excellent whisper.cpp in my WisprNote offline private transcribing macOS app. It's 10x the speed of the regular models. He even invented a custom format for the models! The work this guy is doing is first class or beyond!!!
This is an advanced language model that can now run quickly on consumer-grade hardware. You used to need thousands of dollars of GPUs to run a model as sophisticated as this; now it can be done on a laptop.
Yes but it wasn't made to run on a Mac. This project ported LLaMA to Apple Silicon so all the macbook users can finally play with what the rest of us have had access to for the past couple of weeks.
This is so awesome and exciting. I have an M1 iMac and it was trivially easy to get this working and generating text. And the performance is VERY impressive, especially considering that it's not even using any of the built in "neural compute" stuff. Also, the model seems like it doesn't have any political correctness conditioning based on some of the completions it has given me on controversial prompts. I can't wait until someone gets the 13b model working (sounds like this should happen in the next day or so) and gets the repetition penalty working.
It is using the built-in neural accelerators, that’s why it’s fast, that’s why it’s only supported on Macs so far. The code makes use of official Apple APIs which delegate the necessary BLAS calls to the available hardware.
Confusingly there are 2 mechanisms to do matrix operations on the new apple hardware - AMX (https://github.com/corsix/amx) - and the ANE (apple neural engine) - which is enabled by CoreML. This code does not run on the neural engine but the author has a branch for his whisper.cpp project which uses it here: https://github.com/ggerganov/whisper.cpp/pull/566 - so it may not be long before we see it applied here as well. All of this is to say that it actually could get significantly faster if some of this work was able to be handed to the ANE with CoreML.
Three. You can also do it in Metal, which as of recently has cooperative matrix multiplication in the form of the simdgroup_matrix type (this is similar functionality to "tensor cores" in the Nvidia world). I have no idea what the software support is, but I have seen analysis suggesting that the raw tensor multiplication throughput is larger than the ANE for the high-end GPUs.
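For the curious, targeting the ANE from Python goes through Core ML: you convert a traced module with coremltools and request the Neural Engine as a compute unit; Core ML then decides per-op whether the ANE actually runs it. A hedged sketch with a toy module standing in for a real model:

```
# Convert a traced PyTorch module to Core ML and ask for CPU + Neural Engine.
import torch
import coremltools as ct

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.gelu(x @ x.transpose(-1, -2))

example = torch.randn(1, 64, 64)
traced = torch.jit.trace(Toy().eval(), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example.shape)],
    compute_units=ct.ComputeUnit.CPU_AND_NE,   # prefer CPU + ANE over GPU
    convert_to="mlprogram",
)
mlmodel.save("toy.mlpackage")
```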
The README says it has been "hacked in an evening"! Have the 10,000 lines of ggml.c been written in an evening, without reusing a lot of pre-existing code? That's quite amazing.
How does one learn to do stuff like this? Do you first learn to write AI algorithms in Python and then transfer the knowledge to C++, or do you learn to write them in C++ from the start?
This is pretty cool. I got it working in about 5 minutes. Inference of the 7B model is super fast. I don't know enough about prompting plain GPT models though.
I don't have the hardware to run the 65B model to test this at the moment -
How does it perform with programming, for example making a basic python script to scrape a website, or a bash script, etc?
I've managed to run the 13B at 8-bit with decent performance on a 4090 - but it's only 24GB of VRAM, so I've been struggling to run the 30B at anything more than a snail's pace.
Stellar job and it's amazing to have this running in wasm.
I don't get why something like this should be faster than running e.g. libtorch in CPU mode, though.
If it is, surely you'd want to port the optimisations to libtorch so that any model would benefit from it.
If it's just Mac specific you could even add another target.
I just filled out Meta's online form to get access to the LLaMA models. Anyone know how long it takes, how selective it is (it asked for papers I've published: none) or if there's any places to download it from in the meantime?
On an unrelated note, are there any open source implementations utilising Intel's Gaussian & Neural Accelerator and Deep Learning Boost technologies on 12th and 13th gen CPU's?
I'm happy that someone implemented a CPU-only version of Llama. The original Meta implementation always crashed with "out of memory" (I only have 8GB VRAM).
Too bad Llama doesn't work as well as (Chat-)GPT for instructional prompts.
```
Transcript: \"Professor Poopy Pants: Okay. Todd: Thank you for holding. Hello. How may I help you? Professor Poopy Pants: Hey. I just wanna let you know that my name is professor Poopy pants. Todd: Oh, hit oh, that's great. So professor Pupi pants, and can I ask how I can help you today? Professor Poopy Pants: Sure. I appreciate it. So I have some poop in my pants, and I I need it to be clean clean. Todd: So you have food with your pants and you need to be cleaned? No problem, sir. I will get right on that. Have. Professor Poopy Pants: Oh, Todd: a nice. Professor Poopy Pants: thank Todd: day. Professor Poopy Pants: thank you so much.\" Tell me, what did the caller need help with in 2 or 3 words?
```
I get "Cleaning Pants"
When I do the same with LLaMA 7B model by doing e..g
```
./main --temp 0.2 -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 300 -p "Transcript: \"Professor Poopy Pants: Okay. Todd: Thank you for holding. Hello. How may I help you? Professor Poopy Pants: Hey. I just wanna let you know that my name is professor Poopy pants. Todd: Oh, hit oh, that's great. So professor Pupi pants, and can I ask how I can help you today? Professor Poopy Pants: Sure. I appreciate it. So I have some poop in my pants, and I I need it to be clean clean. Todd: So you have food with your pants and you need to be cleaned? No problem, sir. I will get right on that. Have. Professor Poopy Pants: Oh, Todd: a nice. Professor Poopy Pants: thank Todd: day. Professor Poopy Pants: thank you so much.\" Tell me, what did the caller need help with in 2 or 3 words?
```
I get:
```
Tood: "Profeesssor Poopy Pants: I have some poop in my pants, and I I need it to be clean clean." Tell me, what was the problem? Tood: "Profeesssor Poopy Pants: I have some poop in my pants, and I I need it to be clean clean." Tell me, what was the solution? Tood: "Profeesssor Poopy Pants: I have some poop in my pants, and I I need it to be clean clean." Tell me, what was the outcome? Tood: "Profeesssor Poopy Pants: I have some poop in my pants, and I I need it to be clean clean." Tell me, what was the lesson learned? Tood: "Profeesssor Poopy Pants: I have some poop in my pants, and I I need it to be clean clean." Tell me, what was the lesson learned? Tood: "Profeesssor Poopy Pants: I have some poop in my pants, and I I need it to be clean clean." Tell me, what was the lesson learned? Tood
```
I assumed I needed a M1 pro or better due to RAM requirements, but I was able to run the 7B model on a 16GB M1 Mac Mini. The system reported 4GB RAM usage for the inference process. I suspect it's possible to run the 13B model as well.
It'll be the 7B they're referring to; on my M1 Max 32GB with a 4000-token output request I get 67ms/token on 7B (4-bit) and 154ms/token on 13B (4-bit)... I've made a tweak to the code to increase the context size, but it doesn't seem to change perf.
main: mem per token = 22357508 bytes
main: load time = 2741.67 ms
main: sample time = 156.68 ms
main: predict time = 11399.12 ms / 154.04 ms per token
main: total time = 14914.39 ms
Does anyone know how many languages this supports? I know that FB has been translated to a ton of languages. Will those translations benefit the models in LLaMA? Or am I misunderstanding the point of this?
The point of it is to generate arbitrary text locally, like the GPT-3 API does remotely.
It can "say"/talk about anything an average IQ person with knowledge of the entire internet and most books in existence could. So if you prompt it to write 100 pages on the differences between positive and negative law, as a poem, while never using a word with the letter "F" it will spit that out for you without any issue.
It can also program quite well, create recipes, debate with you, impersonate anyone, and lots more. And it does all of this offline, in airplane mode, locally on your PC or Mac.
It's good at translation but is probably one of the least efficient ways to translate text when models specifically for translation exist.
What models exist for translations? The reason I wonder is that a lot of community work has gone into producing high quality translations of Facebook. I have actually submitted a few translations myself. If Meta could release some of those models, it would be very powerful.
Right now I see "google translate" type quality everywhere and it's pretty bad, since there are often sentences you can't translate unless the technology understands the context and meaning.
Does anyone know if it's possible to split this model across multiple (identical, SLI) GPUs, to run a version bigger than fits in the RAM of a single GPU?
How long before someone creates a simple GUI for this?
That + a small bit of optimisation, and everyone with a newer Mac / iPhone will be able to run something akin to ChatGPT locally!
Isn't this a pretty crazy development? Just weeks ago people said this would be impossible.
From this thread, the 13B model runs just as fast as ChatGPT on an M2 MacBook Air, and it's not even using the Neural Engine yet, so it will become significantly faster once that is utilised - wow!
For this model, that is the only "legitimate" route. But since they were leaked, everyone seems to be using them anyway. There are other models you can download from Huggingface.
Yes, but only minimally. Not enough for any human to notice.
However, even this minimal amount can be avoided with GPTQ quantization which maintains uncompressed fp16 performance even at 4bit quantization with 75% less (video)memory overhead.
References:
https://arxiv.org/abs/2210.17323 - GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
[Oct, 2022]
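For intuition about what 4-bit quantization does at all (and the rounding error that GPTQ, cited above, is designed to minimize), here is a naive round-to-nearest group quantizer. This is the baseline GPTQ improves on, not GPTQ itself, which picks roundings to minimize layer output error rather than per-weight error:

```
# Naive round-to-nearest 4-bit group quantization, for intuition only.
import numpy as np

def quantize_rtn_4bit(w, group_size=128):
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # int4 range is -8..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q * scale

w = np.random.normal(0, 0.02, size=4096 * 128).astype(np.float32)
q, s = quantize_rtn_4bit(w)
err = np.abs(dequantize(q, s).reshape(-1) - w).mean()
print(f"mean abs rounding error: {err:.2e}")
```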
> Currently, only LLaMA-7B is supported since I haven't figured out how to merge the tensors of the bigger models. However, in theory, you should be able to run 65B on a 64GB MacBook
Suddenly the choices Apple made with its silicon are looking like pure genius, as there will be a lot of apps using this that are essentially exclusive to their platform. Even with the egregious RAM pricing.
With a lot of fine-tuning, and if you squint, this is a useful/convincing "ChatGPT on a laptop", minus the corporate lobotomy, only a few short months after release. Very exciting! I actually care about the upcoming Mac Pro now.
That's what brand recognition and brand loyalty (and the money they generate) allow you to do. This is not necessarily a bad thing, since it forces other competitors - who like to artificially limit their products - to actually get off their greedy asses and compete.
The Nvidia Jetson/Tegra line did (and does) unified memory. Released 2014[0]. Nvidia already has ARM cores. They’re just terrible compared to Apple, Samsung, etc.
This is what the attempted ARM acquisition by Nvidia was about - with the ARM talent, IP, etc they’d be able to integrate more than just memory (GPU, CPU, connectivity via Mellanox, etc).
Regulators shut it down (for good reason) but I can’t help but think we would have seen some really interesting and revolutionary platforms come from it.
Nvidia actually does support this via unified memory. It's actually an amazing performance trick since you can avoid launching a bunch of kernels just to ferry data back and forth. I did this on a GTX 1080 for a school project to speed up random shuffling of large amounts of data.
However, even without this feature, you can implement this sort of thing manually in most cases and you’re already being careful on a GPU to respect the cache (only working with one contiguous set of data of memory at a time).
Really, we just need some good systems devs working on running these huge models.
Implementing the software support and getting operating systems to play along and fragmentation between GPU vendors, as always with GPUs on x86, have been the problems. From all accounts it's been working reasonably well on the consoles though.
Also chicken-and-egg: low GPU compute usage uptake outside of games has meant it's not improved lately.
Both AMD and Intel have APUs. However they’re limited by being minority products (so very low support from software) and often have limits on how much ram can be accessed by the GPU and/or have very weak GPUs.
That's a pretty important "just". They chose to go down this path at this time and shipped; now I can run the 65B LLaMA on a widely available $3,999 prosumer device.
How much would a PC that can do that currently cost me and can I have it by tomorrow?
It's a great option if you have the hardware and want the speed. It's table stakes when other vendors like Nvidia, Intel, Qualcomm and Microsoft have acceleration though. Raw-compute-for-the-buck has always been a blowout with Apple Silicon GPUs, and it's not any prettier now that 4nm 40XX series cards are available. Hell, an Intel A770 with 16gb of VRAM is still cheaper than adding 16gb of RAM to a Mac Mini.
It's good stuff, but pitched a bit hard with all the marketing. From where I'm standing, it looks like Apple is playing catch-up with their GPUs and acceleration tech.
>How much would a PC that can do that currently cost me and can I have it by tomorrow?
At the moment, it seems like Apple has an edge here. On PC, for a single GPU you need an NVIDIA A40, which goes for about $2500 used, and not at retail stores.
If you don't mind having two GPUs, then two $800 3090s work, but that's a workstation build you'll have to order from Puget or something. That's probably faster than Apple.
My gut instinct is that there's some low-hanging fruit here, and in the next couple of weeks 65B LLaMA will run comparably or faster on any PC with a single 4090/3090 and 64 or 128 GB of system memory. But probably not any PC laptops that aren't 17-inch beasts; Apple will keep that advantage.
…and for models that require 64GB of VRAM? 120GB of VRAM?
You can get a 128GB UMA Mac for less than a single 40GB A100, let alone a single 80GB A100.
I think Apple got incredibly lucky here, but I don’t see how the PC world catches them any time soon. We’ve all known that UMA is theoretically better for ages, but Apple’s timing couldn’t be better. And scale economies mean they can sell the same chip to people who need 100GB of system RAM and people who need 100GB of VRAM.
If they can get their GPU / neural performance up and sort out their terrible relationship with academic research, they could snipe ML away from nvidia. It seems very unlikely, but it’s kind of stunning that it’s even in the realm of possibility.
If Nvidia announced tomorrow that they were cancelling every datacenter deal they had, open-sourcing CUDA and publishing their entire patent library to the creative commons, I would still not believe you.
This is a fun project for people with Apple Silicon machines who want to participate in the AI happenings, but I don't think you can warp it into a call for Nvidia's head. Let's wait until Apple pulls the curtains on their rackmount Mac Pros, so we can compare it with Nvidia's ARM server offerings: https://www.nvidia.com/en-us/data-center/grace-cpu/
My point was that the PC architecture of separate system and GPU memory is hitting a wall that means inefficiency and higher prices.
I have little doubt that Nvidia’s attempted acquisition of ARM was in part because nvidia recognized this. I expect they are exploring other UMA approaches. But it will be hard in the fragmented, not-vertically-integrated model.
Apple’s advantage here is one platform that can scale: it is hard to imagine Grace and similar running Windows on developer’s desktops. Maybe!
But my point was that, shockingly, Apple has a chance here. A small chance, as I said, but I don’t think anyone (including Apple) saw just how soon UMA was going to become a competitive advantage.
Nvidia doesn't need to acquire ARM to sell systems with unified memory. The Tegra boards are all mixed-address-space systems, and CUDA lets you manipulate memory over PCI. They see the writing on the wall, which is why they've been building systems for the past decade that reflect this philosophy.
If you think it's hard to imagine Nvidia hardware running on a developer desktop, wait until you hear about what happened when Macs tried entering the server market.
Intel GPUs have had that feature for over two decades, and it was also called UMA; synonymous with cheap and slow, before Apple hyped that term and made a UMA system that actually had decent performance.
2. CPUs and GPUs typically disagree on whether they want high bandwidth or low latency; Apple managed to keep both happy, but it's very hard to do on a PC where the RAM, CPU, and GPU are quite far apart and also nowhere near as homogeneous as Apple has them.
Nvidia didn't implement it in their OpenCL drivers because OpenCL's shared virtual memory spec is so terrible that khronos made it optional in OpenCL 3.0 and Intel (the only company actually having high quality OpenCL implementations) dropped shared virtual memory like a hot potato and instead introduced their own unified shared memory extension for OpenCL which they use for SyCL and oneAPI.
But why would you do this in C++ when it's quite clear the ML load is highly parallel? The page says it's vectorized with NEON, but there's no mention of whether it's auto-vectorized by GCC or hand-optimized. That will have a pretty significant performance impact.
Just because something is high-performance does not mean it cannot be improved by, say, another 2x. My point was not whether it's usable or not; it's that if you are going to run on the CPU, vectorization is a rather important part, and it's odd that the landing page has no remark about it.