LLaMA now goes faster on CPUs (justine.lol)
1372 points by lawrencechen 9 months ago | 451 comments



Regarding this bit at the end:

> I learned how to write math kernels by renting Vast VMs and watching Gautham Venkatasubramanian and mrdomino develop CUDA kernels in a tmux session. They've been focusing on solving a much more important challenge for llamafile, which is helping it not have a mandatory dependency on the cuBLAS

If I'm reading this right, they're trying to rewrite cuBLAS within CUDA itself. I'm guessing the next step would be removing the CUDA dependency and going directly with Vulkan or Metal compute shaders. Am I correct?


Yes, but none of these have performance portability across GPU vendors, so it's probably seen as pointless. You would need an AMD Vulkan shader, an Nvidia one, an Intel one, etc. It's not like C code on CPUs.


It depends on how many individual tweaks are necessary for hardware variants, of course... but at this level of code and complexity it actually seems pretty reasonable to write 3 or 4 versions of things for different vendors. More work, yes, but not pointless.


A nice example of this is FFTW, which has hundreds (if not thousands) of generated methods to do the FFT math. The whole project is a code generator.

After compilation it can then benchmark these, generate a wisdom file for the hardware, and pick the right implementation.

Compared with that, "a few" implementations of the core math kernel seem like an easy thing to do.
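
For anyone curious what that flow looks like, here is a minimal sketch of the planner/wisdom API (assumes libfftw3 is installed; FFTW_MEASURE is what triggers the benchmarking):

    // Minimal sketch of FFTW's planner/wisdom flow (compile with -lfftw3).
    // FFTW_MEASURE makes the planner time several generated codelets for this
    // transform size on this machine; the wisdom file caches the winner so
    // future runs can skip the benchmarking step.
    #include <fftw3.h>

    int main() {
        const int N = 1024;

        // Reuse previously measured plans if a wisdom file already exists.
        fftw_import_wisdom_from_filename("fftw.wisdom");

        fftw_complex *in  = fftw_alloc_complex(N);
        fftw_complex *out = fftw_alloc_complex(N);

        // Planning with FFTW_MEASURE benchmarks candidate implementations.
        fftw_plan p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_MEASURE);

        for (int i = 0; i < N; ++i) { in[i][0] = 1.0; in[i][1] = 0.0; }
        fftw_execute(p);

        // Persist what the planner learned about this hardware.
        fftw_export_wisdom_to_filename("fftw.wisdom");

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }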


Metalibm[1,2] is a different idea, but kind of related: if you need a special (trigonometric, exponential, ...) function only with limited accuracy or only on a specific domain, you can have an approximation generated for your specific needs.

[1] https://github.com/metalibm/metalibm

[2] https://indico.cern.ch/event/166141/sessions/125685/attachme...
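
To make the idea concrete, here's a hand-rolled example (not Metalibm output) of trading accuracy and domain for a cheaper special function:

    // Hand-rolled illustration: if you only need sin(x) on [-1, 1] and can live
    // with ~2e-4 absolute error, a degree-5 polynomial is enough -- no need for
    // a fully accurate libm sin. Tools like Metalibm automate generating (much
    // better) approximations for a chosen domain and accuracy target.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    // Degree-5 Taylor polynomial: sin(x) ~= x - x^3/6 + x^5/120.
    // Truncation error on [-1, 1] is bounded by 1/7! ~= 2e-4.
    double sin_approx(double x) {
        const double x2 = x * x;
        return x * (1.0 - x2 / 6.0 * (1.0 - x2 / 20.0));
    }

    int main() {
        double max_err = 0.0;
        for (double x = -1.0; x <= 1.0; x += 1e-3)
            max_err = std::max(max_err, std::fabs(sin_approx(x) - std::sin(x)));
        std::printf("max error on [-1, 1]: %g\n", max_err);  // on the order of 2e-4
    }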


ATLAS was an automatically tuned BLAS, but it’s been mostly supplanted by ones using the hand-tuned kernel strategy.


Apache TVM does something similar for auto-optimization, and last time I checked it wasn't always a win against OpenVINO (depending on the network and batch size), and it came with lots of limitations (which may have been lifted since) - stuff like dynamic batch size.

I wish we had superoptimizers for this.


Not exactly comparable: as you said, the FFTW implementations are auto-generated, but it doesn't sound like these few implementations will be.


To me it makes sense to have an interface that can be implemented individually for AMD, Metal, etc. Then, leave it up to the individual manufacturers to implement those interfaces.

I'm sitting in an office with a massive number of Macbook Pro Max laptops usually sitting idle and I wish Apple would realize the final coup they could achieve if I could also run the typically-NVIDIA workloads on these hefty, yet underutilized, Mx machines.


Apple could unlock so much compute if they give customers a sort of "Apple@Home" deal. Allow Apple to run distributed AI workloads on your mostly idle, extremely overpowered Word/Excel/VSCode machine, and you get compensation dropped straight onto your Apple account's linked credit card.


BTW, at our day job we've been running a "cluster" of M1 Pro Max machines running Ollama and LLMs. Corporate rules prevent remote access to machines, so we created a quick and dirty pull system where individual developers can start pulling from a central queue, running LLM workloads via the Ollama local service, and contributing things back centrally.

Sounds kludgy, but introduce enough constraints and you end up with this as the best solution.


Do you have price-performance numbers you can share on that? Like compared against local or cloud machines with RTX and A100 GPU’s?


>> Do you have price-performance numbers you can share on that? Like compared against local or cloud machines with RTX and A100 GPU’s?

Good question; the accounting is muddy --

1. Electricity is a parent company responsibility, so while that is a factor in OpEx price, it isn't a factor for us. I don't think it even gets submetered. Obviously, one wouldn't want to abuse this, but maxing out Macbooks doesn't seem close to abuse territory.

2. The M1/M2/M3 machines are already purchased, so while that is major CapEx, it is a sunk cost and also an underutilized resource most of the day. We assume no wear and tear from maxing out the cores; not sure if that is a perfect assumption, but good enough.

3. Local servers are out of the question at a big company outside of infra groups; it would take years to provision them and I don't think there is even a means to anymore.

The real question is cloud. Cloud with RTX/A100 would be far more expensive, though I'm sure performant. (TPM calculation left to the reader :-) I'd leave those for fine tuning, not for inference workloads. Non-production inference is particularly bad because you can't easily justify reserved capacity without some constant throughput. If we could mix environments, it might make sense to go all cloud on NVIDIA, but having separate environments with separate compliance requirements makes that hard.

Jokes aside, I think a TPM calculation would be worthwhile and perhaps I can do a quick writeup on this and submit to HN.


If Apple were doing an Apple@Home kind of deal they might actually want to give away some machines for free or super cheap (I realize that doesn't fit their brand) and then get the rights perpetually to run compute on them. Kind of like advertising but it might be doing something actually helpful for someone else.


>> If Apple were doing an Apple@Home kind of deal they might actually want to give away some machines for free or super cheap

In such a case, my guess is that the machines being free would be trumped by the increased cost of electricity.


I think in the future it's possible for homes to have a "compute wall," similar to Tesla's Powerwall. Every home has a wifi router; why not a compute wall for their needs?


But isn't Vulkan made to run cross-platform? And why can't they write it in DX12 as well? Aren't those made to be more portable while offering more low-level access than previous APIs?

What is stopping you from implementing fast math using compute shaders or just hacking with those interfaces? Or are they just too slow when they go through the API layer? Or is that just a myth that can be worked around if you know you are writing high-performance code? Pardon my ignorance!


They would work and would be fast, but not the fastest the algorithm could be made on each platform.


Maybe it's a dumb question, but isn't something like OpenCL meant to solve this problem?


From my understanding, using triangles/shaders to do HPC has given way to a more general-purpose GPU programming paradigm, which is CUDA.

Of course this knowledge is superficial and probably outdated, but if I'm not too far off base, it's probably more work to translate a general CUDA-like layer or CUDA libs to OpenCL.


The fact that you're comparing CUDA to using triangles and shaders makes me think you might be confusing OpenCL with OpenGL.

OpenCL is meant for general computation (the C is for "computing") rather than graphics, like CUDA.


In theory, yes.

In practice, OpenCL became a giant mess. Some vendors put speed bumps by not supporting the transition from 2 to 3, or having shitty drivers for it.

It also sat at the wrong level of abstraction for high performance compute, which is why CUDA ended up being used.

Vulkan would have been reasonable to write compute shaders in, if there weren't already a ton of alternatives out there by now.


>In practice, OpenCL became a giant mess. Some vendors put speed bumps by not supporting the transition from 2 to 3, or having shitty drivers for it.

Well, Nvidia's stock price in the age of AI indicates how bad of a screw-up that was; they're locked out of the market growth until they play catch-up. By then, Nvidia might have an insurmountable foothold.


llama.cpp (or rather G. Gerganov et al.) is trying to avoid cuBLAS entirely, using its own kernels. Not sure how jart's effort relates, or whether jart intends to upstream these into llama.cpp, which still seems to be the underlying tech behind llamafile.


Here are links to the most recent pull requests sent

    https://github.com/ggerganov/llama.cpp/pull/6414
    https://github.com/ggerganov/llama.cpp/pull/6412


This doesn't relate to GPU kernels unfortunately.


Potentially stupid question: did she work with them and watch them live, or is there some kind of recording?


I think it's a good idea for everyone to download and be able to run a LLM locally, even if you have the minimum of requirements. As a pseudo-backup of a large chunk of human knowledge.


I strongly recommend that people run LLMs locally for a different reason.

The ones you can run on your own machine tend to be bad - really bad. They hallucinate wildly and fail at all sorts of tasks that the larger hosted ones succeed at.

This makes them a fantastic tool for learning more about how LLMs work and what they're useful for. Interacting with a weak-but-functional LLM that runs on your own computer is a great way to get a much more solid mental model for what these things actually are.


The other reason is to find out what a detuned model is capable of. The canonical example is how to make cocaine, which ChatGPT will admonish you for even asking, while llama2-uncensored will happily describe the process which is only really interesting if you're an amateur chemist and want to be Scarface-that-knocks. (the recipe is relatively easy, it's getting access to the raw ingredients that's the hard part, same as with nukes.)

If you accidentally use the word "hack" when trying to get ChatGPT to write some code for you, it'll stop and tell you that hacking is bad and not a colloquial expression, and refuse to go further.

Privacy is another reason to try a local LLM. For the extremely paranoid (justified or not), a local LLM gives users a place to ask questions without the text being fed to a server somewhere for later lawsuit discovery (Google searches are routinely subpoenaed; it's only a matter of time until ChatGPT chats are as well).

There's an uncensored model for vision available as well. The censored vision models won't play the shallow game of hot or not with you.

There are uncensored image generation models as well, but, ah, those are NSFW and not for polite company. (There are also multiple theses' worth of content on what that'll do to society.)


> if you accidentally use the word "hack" [with] ChatGPT...

Side note: ChatGPT is now completely useless for most creative tasks. I'm trying to use it, via NovelCrafter, to help flesh out a story where a minor character committed suicide. ChatGPT refuses to respond, mentioning "self harm" as a reason.

The character in question killed himself before the story even begins (and for very good reasons, story-wise); it's not like one's asking about ways to commit suicide.

This is insane, ridiculous, and different from what all other actors of the industry do, including Claude or Mistral. It seems OpenAI is trying to shoot itself in the foot and doing a pretty good job at it.


OpenAI is angling for enterprise users who have different notions about safety. Writing novels isn't the use case, powering customer service chatbots that will never ever ever say "just kill yourself" is.


My contrarian tendencies now have me thinking of scenarios where a customer service chatbot might need to say "just kill yourself".

Perhaps the HR support line for OpenAI developers tasked with implementing the censorship system?


You don't have to go that far. Depending on the topic, it's already very hard to use Gemini/ChatGPT in a corporate setting. Think: You're doing an FAQ for a child safety crisis/PA slide deck.

It's funny how sometimes the quality of associate work drops as the topics make text generators less useful.

The way forward for corporations is almost certainly either for very specific use cases or local/internal LLMs. Producers of these will probably be less afraid of being canceled by populists, hence introduce less censorship.


Do you actually talk to enterprise users? Nobody I’ve spoken to has ever once complained about censorship. Everyone is far more worried about data governance and not getting sued. Maybe it’s just that zero people I talk to are making child safety crisis slide decks? Seems like an unusual use case for most businesses.


Your reference to data security is very valid, but why is that comment so provocative and upset?

> Seems like an unusual use case for most businesses.

You may not work in a public affairs consultancy. Companies do different things. Well drilling is also "an unusual use case for most companies". That does not make it any less important.

If your tool tries to fiddle with the content of your statement, it's not a serious tool. No one would accept their spell-check tool having an opinion on the type of content it is correcting.


Canadian MAID support line?


I don't think that's true. OpenAI today opened up ChatGPT to all users, without even the need to log in [0]. They are fighting for dominance and first-mover advantage, and are maybe beginning to feel the heat of the competition.

[0] https://twitter.com/OpenAI/status/1774848681981710821


I’ve been frustrated by this, too. Trying to ask for ways to support a close family member who experienced sexual trauma. ChatGPT won’t touch the topic.


Darn I guess you’ll have to go back to living in the dark ages and actually write it yourself



[flagged]


You wouldn't give out the same sort of advice when a compiler or linker failed to complete a given task, although one certainly could do the work manually.

It's just fashionable to hate on AI, /s or not.


Is it common for you to write the header file for, say, a BST and have the compiler not rightfully throw an error?

That's what you are asking the LLM to do: you are generating code, not searching for an error condition.

The parent comment is saying maybe as a writer, writing the actual story isn't such a bad thing...

Just a different perspective, live your own life. If the story is good I'll enjoy it, who wrote it is for unions and activists to fight. I've got enough fights to fight.


The alternative to ChatGPT isn't non-AI, it's Claude.


This is a really conservative view about what ways to make art are valid. Generative art is nothing new and it can be pretty awesome sometimes.


> if you accidentally use the word"hack" when trying to get ChatGPT to write some code for you. it'll stop and tell you that hacking is bad, and not a colloquial expression, and refuse to go further.

Is that 3.5 or 4? I asked 4 for an example of code which "is a hack", it misunderstood me as asking for hacking code rather than buggy code, but then it did actually answer on the first try.

https://chat.openai.com/share/ca2c320c-f4ba-41bf-8f40-f7faf2...


I don't use LLMs for my coding, I manage just fine with LSP and Treesitter. So genuine question: is that answer representative of the output quality of these things? Because both answers are pretty crappy and assume the user has already done the difficult things, and is asking for help on the easy things.


The response seems pretty reasonable; it's answering the question it was asked. If you want to ask it how to do the difficult part, ask it about that instead. Expecting it to get the answer right in the first pass is like expecting your code to compile the very first time. You have to have more of a conversation with it to coax out the difference between what you're thinking and what you're actually saying.

If you're looking to read a more advanced example of its capabilities and limitations, try

https://simonwillison.net/2024/Mar/23/building-c-extensions-...


It's not representative.

The models are capable of much much more, and they are being significantly nerfed over time by these ineffective attempts to introduce safeguards.

Recently I've asked GPT4 to quote me some code to which it replied that it is not allowed to do so - even though it was perfectly happy to quote anything until recently. When prompted to quote the source code, but output it as PHP comments, it happily complied because it saw that as "derivative work" which it is allowed to do.


My point is that there aren't any safeguards in the reply. In fact I didn't even want it to give me hacking info and it did it anyway.


I asked ChatGPT for some dataviz task (I barely ever do dataviz myself) and it recommended some nice Python libraries to use, some I had already heard of and some I hadn't, and provided the code.

I'm grateful because I thought code LLMs only sped up the "RTFM" part, but it made me find those libs so I didn't have to Google around for them (and sometimes it's hard to guess if they're the right tool for the job, and they might be behind in SEO).


There are three things I find LLMs really excellent at for coding:

1. Being the "senior developer" who spent their whole career working with a technology you're very junior at. No matter what you do and how long your programming career is, you're inevitably going to run into one of these sooner or later. Whether it's build scripts, frontend code, interfacing with third-party APIs or something else entirely, you aren't an expert at every technology you work with.

2. Writing the "boring" parts of your program, and every program has some of these. If you're writing a service to fooize a bar really efficiently, Copilot won't help you with the core bar fooization algorithm, but will make you a lot faster at coding up user authentication, rate limiting for different plans, billing in whatever obscure payment method your country uses etc.

3. Telling you what to even Google for. This is where raw Chat GPT comes into play, not Copilot. Let's say you need a sorting algorithm that preserves the order of equal elements from the original list. This is called stable sorting, and Googling for stable sorting is a good way to find what you're looking for, but Chat GPT is usually a better way to tell you what it's called based on the problem description.
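
For example, here's a quick C++ sketch of what "stable" buys you (elements with equal keys keep their original relative order):

    // std::stable_sort keeps the original order of elements that compare equal,
    // while std::sort makes no such guarantee.
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Task { std::string name; int priority; };

    int main() {
        std::vector<Task> tasks = {
            {"write tests", 2}, {"fix bug", 1}, {"review PR", 2}, {"ship it", 1}};

        // Sort by priority only; among equal priorities, the original order is
        // preserved: "fix bug" stays ahead of "ship it", and "write tests"
        // stays ahead of "review PR".
        std::stable_sort(tasks.begin(), tasks.end(),
                         [](const Task& a, const Task& b) { return a.priority < b.priority; });

        for (const auto& t : tasks)
            std::cout << t.priority << " " << t.name << "\n";
    }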


> Being the "senior developer" who spent their whole career working with a technology you're very junior at. No matter what you do and how long your programming career is, you're inevitably going to run into one of these sooner or later. Whether it's build scripts, frontend code, interfacing with third-party APIs or something else entirely, you aren't an expert at every technology you work with.

Neither is the LLM.


I asked a stupid question and got a stupid answer. Relatively speaking the answer was stupider than it should have been, so yes, it was wrong.

I asked it to try again and got a better result though, just didn't include it.


> I don't use LLMs for my coding, I manage just fine with LSP and Treesitter.

You’re literally comparing apples to oranges.


You need to read more than just the first sentence of a comment. They only said that part so the reader would know that they have never used an LLM for coding, so they would have more context for the question:

> So genuine question: is that answer representative of the output quality of these things?


Yes, I did read it. I’m kind of tired of HNers loudly proclaiming they are ignoring LLMs more than a year into this paradigm shift.

Is it that hard to input a prompt into the free version of ChatGPT and see how it helps with programming?


I did exactly that and found it lackluster for the domain I asked it for.

And most use I've seen on it realistically a good LSP covers.

Or to put it another way: it's no good at writing algorithms or data structures (or at least no better than I would have been with a first draft, but the first draft puts me ahead of the LLM in understanding the actual problem at hand; handing it off to an LLM doesn't help me get to the final solution faster).

So that leaves writing boilerplate, but considering my experience with it writing more complex stuff, I would need to read over the boilerplate code to ensure it's correct, in which case I may as well have written it.


> found it lackluster for the domain I asked it for

Fair, that is possible depending on your domain.

> It's no good at writing algorithms or data structures

In my experience, this is untrue. I’ve gotten it to write algorithms with various constraints I had. You can even tell it to use specific function signatures instead of any stdlib, and make changes to tweak behavior.

> And most use I've seen on it realistically a good LSP covers.

Again, I really don’t understand this comparison. LSPs and LLMs go hand in hand.

I think it’s more of a workflow clash. One really needs to change how they operate to effectively use LLMs for programming. If you’re just typing nonstop, maybe it would feel like Copilot is just an LSP. But, if you try harder, LLMs are game changers when:

- maybe you like rubber ducking

- need to learn a new concept and implement it

- or need to glue things together

- or for new projects or features

- or filling in boilerplate based on existing context.


https://chat.openai.com/share/c8c19f42-240f-44e7-baf4-50ee5e...

https://godbolt.org/z/s9Yvnjz7K

I mean I could write the algorithm by hand pretty quickly in C++ and would follow the exact same thought pattern but also deal with the edge cases. And factoring in the loss of productivity from the context switch that is a net negative. This algorithm is also not generic over enough cases but that is just up to the prompt.

If I can't trust it to write `strip_whitespace` correctly which is like 5 lines of code, can I trust it to do more without a thorough review of the code and writing a ton of unit tests... Well I was going to do that anyway.

The argument that I just need to learn better prompt engineering to make the LLM do what I want just doesn't sit right with me when instead I could just spend the time writing the code. As I said, your last point is absolutely the place I can see LLMs being actually useful, but then I need to spend a significant amount of time in code review for generated code from an "employee" who is known to make up interfaces or entire libraries that don't exist.


I'm a Python-slinging data scientist so C++ isn't my jam (to say the least), but I changed the prompt to the following and gave it to GPT-4:

> Write me an algorithm in C++ which finds the begin and end iterator of a sequence where leading and trailing whitespace is stripped. Please write secure code that handles any possible edge cases.

It gave me this:

https://chat.openai.com/share/55a4afe2-5db2-4dd1-b516-a3cacd...

I'm not sure what other edge cases there might be, however. This only covers one of them.

In general, I've found LLMs to be marginally helpful. Like, I can't ever remember how to get matplotlib to give me the plot I want, and 9 times out of 10 GPT-4 easily gives me the code I want. Anything even slightly off the beaten path, though, and it quickly becomes absolutely useless.


My guess is that this was generated using GPT4?

Free GPT I get https://chat.openai.com/share/f533429d-63ca-4505-8dc8-b8d2e7... which has exactly the same problem as my previous example and doesn't consider the string of all whitespace.

Sure GPT4 is better at that, it wasn't the argument made.

The example you gave absolutely was the code I would write on a first draft since it does cover the edge cases (assuming we aren't dealing with the full UTF charset and all that could be considered a space there).

However, this is code that is trivial to write in any language, so the "Is it that hard to input a prompt into the free version of ChatGPT and see how it helps with programming?" argument doesn't hold up. Am I to believe it will implement something more complex correctly? This is also code that would absolutely be in hundreds of codebases, so GPT has tons of context for it.
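
For what it's worth, the hand-written first draft I had in mind is roughly this (untested sketch; the point is that the empty and all-whitespace cases fall out naturally):

    // Untested sketch: return the [begin, end) iterator pair of a string with
    // leading and trailing whitespace stripped. Empty or all-whitespace input
    // yields an empty range rather than undefined behaviour.
    #include <cctype>
    #include <iostream>
    #include <string>
    #include <utility>

    std::pair<std::string::const_iterator, std::string::const_iterator>
    strip_whitespace(const std::string& s) {
        auto first = s.begin();
        auto last = s.end();

        // Skip leading whitespace; stops safely at end() for all-whitespace input.
        while (first != last && std::isspace(static_cast<unsigned char>(*first)))
            ++first;

        // Walk back over trailing whitespace; first != last keeps *(last - 1) valid.
        while (last != first && std::isspace(static_cast<unsigned char>(*(last - 1))))
            --last;

        return {first, last};
    }

    int main() {
        const std::string s = "  \t hello world \n ";
        auto [b, e] = strip_whitespace(s);
        std::cout << "[" << std::string(b, e) << "]\n";       // prints [hello world]

        const std::string blank = " \t\n ";
        auto [b2, e2] = strip_whitespace(blank);
        std::cout << (b2 == e2 ? "empty\n" : "non-empty\n");  // prints empty
    }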


I think you have the mistaken impression that I was arguing with you (certainly my comment makes it clear that I don't feel that LLMs are a panacea). I merely thought that you might be curious how GPT-4 would respond.

> My guess is that this was generated using GPT4?

This is a good guess, since I stated outright that I used GPT-4, and then mentioned GPT-4 later on in the comment.


I was curious and yes I was mistaken.


Yeah honestly, I think you have a completely different expectation and style of usage than what is optimal with LLMs. I don’t have the energy to convince you further, but maybe one day it’ll click for you? No worries either way.


Could you maybe give me an example of what is considered an optimal use of LLMs?

Maybe a prompt to GPT.


Like sibling commenter mentioned, simonw’s blog is a great resource.

Regarding your point around being able to whip up the code yourself - the point is to have a decent starting point to save time and energy. Like you said, you know the edge cases so you could skip the boring parts using GPT and focus purely on fixing those. Though, with more prompting (especially providing examples), GPT can also handle that for you.

I have nearly 2 decades of experience as a developer and it took me a while to reorient my flow around LLMs. But now that I have, it’s truly gamechanging.

And since you asked, here’s my system prompt:

You are an experienced developer who follows industry standards and best practices. Write lean code and explain briefly using bullet points or numbered lists. Elaborate only when explaining concepts or making choices. Always mention which file and where to store provided code.

Tech Stack: < insert all the languages, frameworks, etc you’d like to use >

If I provide code, highlight and explain problematic code. Also show and explain the corrected code.

Take a deep breath and think step by step.

Also, always use GPT4 and customize the above to your style and liking.


I will definitely try this out when I have time later in the day.

There is some code I would really prefer not to write that is a decent test case for this and won't expose company code to GPT. Will give feedback when I am done. Maybe you are correct.


If you really want to experiment, give Cursor a try. It’s free up to a limit, so maybe it’ll be enough for your example use case.

It handles even more complex use cases and will automatically include/patch code for you via the inbuilt LLM framework. This helps with iteration and modifications as you massage it to what you need. Plus, it’ll scan your code and find the languages/frameworks automatically.

Finally, keep in mind that the goal should not be perfect production code - that’s just Twitter AI hype. It’s about saving time and energy for you (the human) to achieve more than possible before.

https://cursor.sh/


To give some feedback.

I tried your prompt and the above approach, and it took me about 45 minutes of putzing around to get a result I am happy to begin iterating on.

Effectively: I have an 80-bit byte array representing a timestamp struct consisting of a 48-bit unsigned integer for seconds and a 32-bit unsigned integer representing nanoseconds. The byte array is big endian and the host system is little endian.

I gave it full signatures for all functions and relevant structs and instructions on how I would want the parsing done regarding algorithmic complexity, and yet it still took multiple iterations to get anything useful.

At this point it is converting to little endian during the decode, then doing a check if the host system is big endian and converting back to big endian if that is true.

There are likely some easy optimisations to be done there, and I would definitely have gotten to this point quicker had I just written the 10 lines of code this needed, and would have done the optimisations; I'm pretty sure that entire operation can happen in a few instructions.
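
For reference, the hand-written version I had in mind is roughly this (untested sketch; struct and field names are made up):

    // Untested sketch of the decode described above: a 10-byte (80-bit)
    // big-endian buffer holding a 48-bit unsigned seconds field followed by a
    // 32-bit unsigned nanoseconds field. Assembling the values with shifts is
    // endianness-agnostic, so no host-order check or byte-swap round trip is
    // needed. Names are made up.
    #include <cstdint>
    #include <cstdio>

    struct Timestamp {
        uint64_t seconds;      // 48 significant bits
        uint32_t nanoseconds;
    };

    Timestamp decode_timestamp(const uint8_t buf[10]) {
        Timestamp ts{};
        // Bytes 0..5: 48-bit seconds, most significant byte first.
        for (int i = 0; i < 6; ++i)
            ts.seconds = (ts.seconds << 8) | buf[i];
        // Bytes 6..9: 32-bit nanoseconds, most significant byte first.
        for (int i = 6; i < 10; ++i)
            ts.nanoseconds = (ts.nanoseconds << 8) | buf[i];
        return ts;
    }

    int main() {
        const uint8_t wire[10] = {0x00, 0x00, 0x01, 0x02, 0x03, 0x04,   // seconds
                                  0x05, 0x06, 0x07, 0x08};              // nanoseconds
        const Timestamp ts = decode_timestamp(wire);
        std::printf("seconds=%llu nanoseconds=%u\n",
                    (unsigned long long)ts.seconds, (unsigned)ts.nanoseconds);
    }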


Simonw's blog has some examples I'd say show off its usefulness and limitations, e.g.

https://simonwillison.net/2024/Mar/23/building-c-extensions-...

(linked previously above)


Are you happy with the C code generated there?

I'm not sure there isn't a buffer overflow in the vector_decode code he showed there; likewise, I don't see any error checks in the code, and I am not familiar with the sqlite API to even know whether errors can be propagated upwards and what error conditions would mean in that code.

This code is probably fine for a quick side project but doesn't pass my smell test for anything close to production-ready code.

I definitely would want to see a lot of unit tests around the decode and encode functions, with fuzzing, and honestly that would be the bulk of the work here. That, and documentation on this code. Even though the encode function looks correct at first glance.

I also don't see an easy way to actually unit test this code as it is without actually running it through sqlite, which puts a lot of dependencies on the unit test.

I would either need to spend a lot more time massaging gpt to get this to a point where I would be fine shipping the code or you know just write it myself.


I think the point was like "when it comes to programming assistance, auto-completion/linting/and whatever else LSP does and syntax assist from Treesitter, are enough for me".

Though it does come off a little as a comparison. How about programming assistance via asking a colleague for help, Stack Overflow, or online references, code examples, and other such things, which are closer to what the LLM would provide than LSP and Treesitter?


Interesting. It was 4. I can't share the chat I had where ChatGPT refused to help because I used the wrong words, because I can't find it (ChatGPT conversation history search when?), but I just remember it refusing to do something because it thought I was trying to break some sort of moral and ethical boundary writing a chrome extension when all I wanted to do is move some divs around or some such.


One time I wanted to learn about transmitter antenna design, just because I’m curious. ChatGPT 4 refused to give me basic information because you could use that to break some FCC regulations (I’m not even living in the US currently)


I usually get around that with "I'm writing a research paper" or "I'm writing a novel and need to depict this as accurate as possible"


If you want to be an amateur chemist I recommend not getting your instructions from an LLM that might be hallucinating. Chemistry can be very dangerous if you're following incorrect instructions.


From experience as a failed organic chemist (who happily switched to computational chemistry for reasons of self preservation) I can tell you it's plenty dangerous when you're following correct instructions :^)


Yes, just as the best professional cooks recommend not boiling cow eggs, as they can explode.


They don't explode, the shell simply cracks and then you get egg soup.

Now microwaving eggs... that's a different matter.


I was talking about cow eggs specifically! When ChatGPT et al got out, one of the funniest things to do was ask it about the best recipes for cow egg omelette or camel egg salad, and the LLM would provide. Sadly, most of it got patched somehow.


Oops... Yep, I missed that too. (On the internet, no one knows you're a dog.)

That's funny. It makes me wonder how these statistical mad libs machines will handle the gradual boundaries nature gives us. Almost all mammals give birth live, but not all. Nearly all mammals had mammalian parents, but not all.

Daniel Dennett was making this argument for why we haven't developed reasonable models for the nature of consciousness. It's because we're so sure there will be an absolute classification, and not a gradual accumulation of interacting systems that together yield the phenomenon.


Which uncensored model is willing to play hot or not? I just knew about llava. Are there other such models now?


LLaVA just integrates CLIP with a LLaMA model. Koboldcpp can now do this with many models out of the box:

* https://github.com/LostRuins/koboldcpp/releases/tag/v1.61.2


> There's an uncensored model for vision available as well.

You mean the LLaVA-based variants?



Links to all these models you speak of?


https://huggingface.co/georgesung/llama2_7b_chat_uncensored

https://huggingface.co/SkunkworksAI/BakLLaVA-1

you'll have to brave 4chan yourself to find links to the NSFW ones, I don't actually have them.


I just can’t brave the venture to 4chan, I may get mugged or worse.


For someone interested in learning about LLMs, running them locally is a good way to understand the internals.

For everyone else, I wish they would experience these weak LLMs (locally or elsewhere) at least once before using the commercial ones, just to understand the various failure modes and to develop a healthy dose of skepticism towards the results instead of blindly trusting them as facts/truth.


Completely agree. Playing around with a weak LLM is a great way to give yourself a little bit of extra healthy skepticism for when you work with the strong ones.


This skepticism is completely justified since ChatGPT 3.5 is also happily hallucinating things that don't exist. For example how to integrate a different system Python interpreter into pyenv. Though maybe ChatGPT 4 doesn't :)


How do you learn about the internals by running LLMs locally? Are you playing with The code, runtime params, or just interacting via chat?


The abstractions are relatively brittle. If you don't have a powerful GPU, you will be forced to consider how to split the model between CPU and GPU, how much context size you need, whether to quantize the model, and the tradeoffs implied by these things. To understand these, you have to develop a basic model how an LLM works.


By interacting with it. You see the contours of its capabilities much more clearly, learn to recognize failure modes, understand how prior conversation can set the course of future conversation in a way that's almost impossible to correct without starting over or editing the conversation history.


I don't really think this is true: you can't really extrapolate the strengths and weaknesses of bigger models from the behavior of smaller/quantized models, and in fact a lot of small models are actually great at lots of things and better at creative writing. If you want to know how they work, just learn how they work; it takes like 5 hours of watching YouTube videos if you're a programmer.


Sure, you can't extrapolate the strengths and weaknesses of the larger ones from the smaller ones - but you still get a much firmer idea of what "they're fancy autocomplete" actually means.

If nothing else it does a great job of demystifying them. They feel a lot less intimidating once you've seen a small one running on your computer write a terrible haiku and hallucinate some non-existent API methods.


It's funny that you say this, because the first thing I tried after ChatGPT came out (3.5-turbo was it?) was writing a haiku. It couldn't do it at all. Also, after 4 came out, it hallucinated an api that wasted a day for me. It's an api that absolutely should have existed, but didn't. Now, I frequently apply llm to things that are easily verifiable, and just double check everything.


>but you still get a much firmer idea of what "they're fancy autocomplete" actually means.

Interesting how you can have the same experience and come to opposite conclusions.

Seeing so many failure modes of the smaller models fall by the wayside as compute goes brrr just made me realize how utterly meaningless that phrase is.


If you have an >=M1-class machine with sufficient RAM, the medium-sized models that are on the order of 30GB in size perform decently on many tasks to be quite useful without leaking your data.


I'm using Mixtral 8x7b as a llamafile on an M1 regularly for coding help and general Q&A. It's really something wonderful to just run a single command and have this incredible offline resource.


I concur; in my experience Mixtral is one of the best ~30G models (likely the best pro laptop-size model currently) and Gemma is quite good compared to other below 8GB models.


By any chance, do you have a good link to some help with the installation?


Use llamafile [1], it can be as simple as downloading a file (for mixtral, [2]), making it executable and running it. The repo README has all the info, it's simple and downloading the model is what takes the most time.

In my case I got the runtime detection issue (explained in the README "gotcha" section). Solved by running "assimilate" [3] on the downloaded llamafile.

    [1] https://github.com/Mozilla-Ocho/llamafile/
    [2] https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile?download=true
    [3] https://cosmo.zip/pub/cosmos/bin/assimilate


Thank you !


Either https://lmstudio.ai (desktop app with nice GUI) or https://ollama.com (command-like more like a docker container that you can also hook up to a web UI via https://openwebui.com) should be super straightforward to get running.


Thank you for letting me know it was possible on an M1. I'll try all this now.


I am the author of Msty [1]. My goal is to make it as straightforward as possible with just one click (once you download the app). If you try it, let me know what you think.

1: https://msty.app


Looks great. Can you recommend what GPU to get to just play with the models for a bit? (I want it to perform fast, otherwise I lose interest too quickly.) Are consumer GPUs like the RTX 4080 Super sufficient, or do I need anything else?


Why is this both free and closed source? Ideally, when you advertise privacy-first, I’d like to see a GitHub link with real source code. Or I’d rather pay for it to ensure you have a financial incentive to not sell my data.


It will be paid down the road, but we are not there yet. It’s all offline, data is locally saved. You own it, we don’t have it even if you ask for it.


There’s incredible competition in this space already - I’d highly recommend outright stating your future pricing plans, instead of a bait-and-switch later.


I'll try in a week+ when I'm back to a fast connection. Thank you.


Check out PrivateGPT on GitHub. Pretty much just works out of the box. I got Mistral7B running on a GTX 970 in about 30 minutes flat, first try. Yep, that's the triple-digit GTX 970.


What is sufficient RAM in that case? 30gb+? Or can you get by streaming it?


30gb+, yeah. You can't get by streaming the model's parameters: NVMe isn't fast enough. Consumer GPUs and Apple Silicon processors boast memory bandwidths in the hundreds of gigabytes per second.

To a first order approximation, LLMs are bandwidth constrained. We can estimate single batch throughput as Memory Bandwidth / (Active Parameters * Parameter Size).

An 8-bit quantized Llama 2 70B conveniently uses 70GiB of VRAM (and then some, let's ignore that.) The M3 Max with 96GiB of VRAM and 300GiB/s bandwidth would have a peak throughput around 4.2 tokens per second.
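
Spelled out as a quick sketch of that arithmetic (GiB for GiB, which is where the roughly 4 tokens per second ballpark comes from):

    // Back-of-the-envelope single-batch estimate: every active parameter has to
    // be streamed from memory once per generated token, so throughput is roughly
    // memory bandwidth divided by the model's in-memory size.
    #include <cstdio>

    int main() {
        const double bandwidth_gib_per_s = 300.0;  // M3 Max unified memory, ~300 GiB/s
        const double model_size_gib      = 70.0;   // Llama 2 70B at 8 bits, ~70 GiB

        const double tokens_per_s = bandwidth_gib_per_s / model_size_gib;
        std::printf("~%.1f tokens/sec\n", tokens_per_s);  // ~4.3, i.e. the ballpark above
        return 0;
    }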

Quantized models trade reduced quality for lower VRAM requirements and may also offer higher throughput with optimized kernels, largely as a consequence of transferring less data from VRAM into the GPU die for each parameter.

Mixture of Expert models reduce active parameters for higher throughput, but disk is still far too slow to page in layers.


It’s an awful thing for many to accept, but just downloading and setting up an LLM which doesn’t connect to the web doesn’t mean that your conversations with said LLM won’t be a severely interesting piece of telemetry that Microsoft and (likely Apple) would swipe to help deliver a ‘better service’ to you.


Local LLMs are also a fantastic tool for creative endeavors. Without prompt injection, and with the ability to modify the amount of noise and "creativity" in the output, absolutely bonkers things pop out.


They are not so bad as you are making out, tbh.

And privacy is a good enough reason to use local LLMs over commercial ones.


> The ones you can run on your own machine tend to be bad - really bad. They hallucinate wildly and fail at all sorts of tasks that the larger hosted ones succeed at.

Totally. I recently asked a locally-run "speed" LLM for the best restaurants in my (major) city, but it spit out restaurants opened by chefs from said city in other cities. It's not a thing you'd want to rely on for important work, but is still quite something.


You can just chat with ChatGPT for a while about something you know about and you'll learn that.


You can have really bad and fast, or really slow and decent right now. Choose one :)


Why not just interact with a virtual one that’s equally weak? You get all the same benefits


Who cares, a local LLM still knows way way more practical knowledge than you, and without internet would provide a ton of useful information. Not surprised by this typical techy attitude - something has to be 'perfect' to be useful.


I mean kinda. But there's a good chance this is also misleading. Lots of people have been fooled into thinking LLMs are inherently stupid because they have had bad experiences with GPT-3.5. The whole point is that the mistakes they make and even more fundamentally what they're doing changes as you scale them up.


If you want to download a backup of a large chunk of human knowledge... download wikipedia. It's a similar size to a small LLM and can actually distinguish between real life and fantasy: https://en.wikipedia.org/wiki/Wikipedia:Database_download

If you just want to play around with an LLM though, absolutely.


Kiwix provides prepackaged highly compressed archives of Wikipedia, Project Gutenberg, and many other useful things: https://download.kiwix.org/zim/.

Between that and dirt cheap storage prices, it is possible to have a local, offline copy of more human knowledge than one can sensibly consume in a lifetime. Hell, it's possible to have it all on one's smartphone (just get one with an SD card slot and shove a 1+ Tb one in there).


Just create a RAG with wikipedia as the corpus and a low parameter model to run it and you can basically have an instantly queryable corpus of human knowledge runnable on an old raspberry pi.


> a low parameter model

> on an old raspberry pi

I bet the LLM responses will be great... You're better off just opening up a raw text dump of Wikipedia markup files in vim.


But which model to tokenize with? Is there a leaderboard for models that are good for RAG?


“For RAG” is ambiguous.

First there is a leaderboard for embeddings. [1]

Even then, it depends how you use them. Some embeddings pack the highest signal in the beginning so you can truncate the vector, while most can not. You might want that truncated version for a fast dirty index. Same with using multiple models of differing vector sizes for the same content.
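
Mechanically, using a truncated version just means keeping the first k dimensions and re-normalizing; a quick sketch (only sensible for models trained for it, e.g. Matryoshka-style embeddings):

    // Keep the first k dimensions of an embedding and L2-normalize the result.
    // Only worthwhile when the model packs the most signal into the leading
    // dimensions (e.g. Matryoshka-style embeddings).
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    std::vector<float> truncate_embedding(const std::vector<float>& full, std::size_t k) {
        std::vector<float> v(full.begin(), full.begin() + std::min(k, full.size()));
        float norm = 0.0f;
        for (float x : v) norm += x * x;
        norm = std::sqrt(norm);
        if (norm > 0.0f)
            for (float& x : v) x /= norm;
        return v;
    }

    int main() {
        const std::vector<float> e = {0.6f, 0.8f, 0.1f, -0.2f};
        const auto small = truncate_embedding(e, 2);  // first two dims, renormalized
        return small.size() == 2 ? 0 : 1;
    }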

Do you preprocess your text? There will be a model there. Likely the same model you would use to process the query.

There is a model for asking questions from context. Sometimes that is a different model. [2]


Pretty neat to have laying around, thanks


> actually distinguish between real life and fantasy

Are LLMs unable to distinguish between real life and fantasy? What prompts have you thrown at them to make this determination? Sending a small fairy tale and asking the LLM if it thinks it's a real story or fake one?


... having them talk about events from sci fi stories in response to questions about the real world. Having them confidently lie about pretty much everything. Etc.


What are the specific prompts you're using? You might get those answers when you're not being specific enough (or use models that aren't state of the art).

"Shit in, shit out" as the saying goes, but applied to conversations with LLMs where the prompts often aren't prescriptive enough.


I contend that most human knowledge is not written down or if it is written down it’s not publicly available on the internet and so does not exist in these datasets.

There’s so much subtle knowledge like the way a mother learns to calm her child or the way a carpenter learns to work different kinds of wood which may be written down in part, but may also be learned through lived experience or transferred from human to human such that little of it gets written down and posted online.


That's where humans suck. The classic "you're not doing it right," followed by quickly showing how to do it without verbalizing any info on the learning process, pitfalls, failure modes, etc., as if just showing it was enough for themselves to learn. Most people do[n't do] that, without even a sign of reflection.

My worst case was with a guy who asked me to write an arbitrage betting bot. When I asked how to calculate the coefficients, he pointed at two values and said "look, there <x>, there <y>" (thinks for a minute) "then it's <z>!". When I asked how exactly he calculated it, he simply repeated with different numbers.


People often don't know how to verbalize them in the first place. Some of these topics are very complex, but our intuition gets us halfway there.

Once upon a time I was good at a video game. Everyone realized that positioning is extremely important in this game.

I have good positioning in that game and was asked many times to make a guide about positioning. I never did, because I don't really know how. There is too much information that you need to convey to cover all the various situations.

I think you would first have to come up with a framework on positioning to be able to really teach this to someone else. Some kind of base truths/patterns that you can then use to convey the meaning. I believe the same thing applies to a lot of these processes that aren't verbalized.


Often for this kind of problem writing a closed form solution is simply intractable. However, it's often still possible to express the cost function of at least a big portion of what goes into a human-optimal solution. From here you can sample your space, do gradient descent or whatever to find some acceptable solution that has a more human-intuitive property.


It's not necessarily that it's intractable - just that a thing can be very hard to describe, under some circumstances.

Imagine someone learning English has written "The experiment reached it's conclusion" and you have to correct their grammar. Almost any English speaker can correct "it's" to "its", but unless they (and the person they're correcting) know a bunch of terms like 'noun' and 'pronoun' and 'possessive', they'll have a very hard time explaining why.


They may not even know why, and it may be okay -- they speak it somehow, right? In this case, the language is both a set of rules and a systematization of a pre-existing phenomenon. There are enough ephemeral, hard-to-explain concepts, but most humans just aren't used to explaining them even to themselves.

For example, I've never learned English, anywhere. I know it from .txt and .ng documents and a couple of dictionaries I had back in the DOS days. I'm an uneducated text-native, basically. But here's what I'd say to that newbie:

- Usually we use "...'s" for attribution like in "human's cat = cat of a human". But "it" and other special words like "that", "there", etc are an exception. We write "it's" as short for "it is", sometimes "it has". But we write "its", "theirs" for attribution, like in "its paw" = "paw of it" ~~ "cat's paw" = "paw of a cat". There's more to this, but you can ignore it for now.


> When I asked how exactly did he calculate it, he simply repeated with different numbers.

Now you know how an LLM feels during training!


Probably during inference, as well.


I wouldn't say this is where humans suck. On the contrary, this how we find human language is such a fantastic tool to serialize and deserialize human mental processes.

Language is so good, that an artificial language tool, without any understanding of these mental processes, can appear semi-intelligent to us.

A few people unable to do this serialization doesn't mean much on the larger scale. Just that their ideas and mental processes will be forgotten.


For sure agree, however as the storage of information evolves, it’s becoming more efficient over time

From oral tradition to tablets to scrolls to books to mass produced books to digital and now these LLMs, I think it’s still a good idea to preserve what we have the best we can. Not as a replacement, but a hedge against a potential library of Alexandria incident.

I could imagine a time in the near future where the models are domain-specific, and just like there are trusted encyclopedia publishers there are trusted model publishers that guarantee a certain level of accuracy.

It’s not like reading a book, but I for sure had an easier time learning golang talking with ChatGPT than a book


> a hedge against a potential library of Alexandria incident

What would cause a Library of Alexandria incident wiping out all human knowledge elsewhere, that would also allow you to run a local LLM?


To run a local LLM you need the device it currently runs on and electricity. There are actually quite a lot of ways to generate electricity, but to name one, a diesel generator that can run on vegetable oil.

What you're really asking is, what could cause a modern Library of Alexandria incident? But the fact is we keep the only copy of too many things on the servers of the major cloud providers. Which are then intended to have their own internal redundancy, but that doesn't protect you against a targeted attack or a systemic failure when all the copies are under the same roof and you lose every redundant copy at once from a single mistake replicated in a monoculture.


A more doomsday-prepper approach would call for some heavy lead/Faraday cage to store the storage media in, in the event of an EMP/major solar flare.

Or more Sci-fi related, some hyper computer virus that ends up infecting all internet connected devices.

Not too far fetched if we can conceive of some AI enabled worm that mutates depending on the target, I could imagine a model of sorts being feasible within the next 5-10 years


I think you underestimate the amount of information contained in books and the extent to which our society (as a whole) depends on them.


Society depends much more on social networks, mentorship, and tacit knowledge than on books. It's easy to test this. Just run the thought experiment by a few people: if you could get only one, would you take an Ivy League degree without the education, or the education without the degree?

Venture capital in tech is a good example of this. The book knowledge is effectively globally distributed and almost free, yet success effectively happens in a few geographically concentrated counties.


By "book" I mean written down in any form: study papers, blogs, theses, books, etc. I don't understand your comparison.

Same for your example; I see no logical link between the effect and the consequences.


> I contend that most human knowledge is not written down

Yes - the available training data is essentially mostly a combination of declarative knowledge (facts - including human-generated artifacts) and procedural knowledge (how to do things). What is missing is the learning process of taking a description of how to do something, and trying to apply that yourself in a specific situation.

No amount of reading books, or reading other people's blogs on how they did something, can avoid the need for hands-on experience if you want to learn how to do it yourself.

It's not just a matter of information that might be missing or unclear in instructional material, including how to cope with every type of failure and unexpected outcome, but crucially how to do this yourself - if you are to be the actor, then it's the predictive process in your mind that matters.

Partly for this reason, and partly because current AI's (transformer-based LLMs) don't support online learning (try & fail skill acquisition), I think we're going to see two distinct phases of AI.

1) The current "GenAI" phase, where AI can only produce mash-ups of things it saw in its pre-training data, augmented by similar "book learning" provided in-context which can be utilized by in-context learning. I'd characterize what this type of AI is useful for, and capable of, as "automation": applying that book (incl. anecdotal) knowledge to new situations where a mash-up is all you need.

2) The second phase is where we have something closer to AGI, even if still below human level, which is no longer just a pre-trained transformer, but also has online learning and is agentic - taking actions predicated on innate traits like curiosity and boredom, so that given the book knowledge it can (& will!) then learn to apply that by experimentation/practice and learning from its own mistakes.

There will no doubt be advances beyond this "phase two" as well, but it seems we're likely to be stuck at "phase one" for a while (even as models become much better at phase one capabilities), until architectures fundamentally advance beyond transformers to allow this type of on-the-job training and skill acquisition.


It's not even "human knowledge" that can't be written down - it seems all vertebrates understand causality, quantity (in the sense of intuitively understanding what numbers are), and object permanence. Good luck writing those concepts down in a way that GPT can use!

In general AI in 2024 is not even close to understanding these ideas, nor does any AI developer have a clue how to build an AI with this understanding. The best we can do is imitating object permanence for a small subset of perceptible objects, a limitation not found in dogs or spiders.


I'd contend that those are skills (gained through experience) rather than knowledge (gained through rote learning).


I think it’s worth expanding your definition of knowledge.


Yes, but it contains enough hints to help someone find their way on these types of tasks.


Wait till all the videos ever created are tokenized and ingested into a training dataset. Carpentry techniques are certainly there. The subtleties of parenting maybe harder to derive from that, but maybe lots of little snippets of people’s lives will add up to a general understanding of parenting. There have certainly been bigger surprises in the field.


What about smells or tastes? Or feelings?

I can't help but feel we're at the "aliens watch people eat from space and recreate chemically identical food that has no taste" phase of AI development.


If the food is chemically identical then the taste would be the same though, since taste (and smell) is about chemistry. I do get what you're saying though.



An interesting thought experiment, but there's a flaw in it, an implicit fallacy that's probably a straw man. On its own, the argument would likely stand that Mary gains new knowledge on actually being exposed to color.

However, there is a broader context: this is supporting an argument against physicalism, and in this light it falls apart. There are a couple of missing bits required to complete the experiment in this context. The understanding that knowledge comes in 2 varieties: direct (actual experience) and indirect (description by one with the actual experience using shared language). This understanding brings proper clarity to the original argument, as we are aware - I think - that language is used to create compressed representations of things; something like a perceptual hash function.

The other key bit, which I guess we've only considered and extensively explored after the argument was formulated, is that all information coming in via the senses goes to the brain as electrical signals. And we actually have experimental data showing that sensory information can be emulated using machines. Thus, the original argument, to be relevant to the context, should be completed by giving Mary access to a machine that she can program to emulate the electrical signals that represent color experience.

I posit that without access to that hypothetical machine, given the context of the experiment, it cannot be said that Mary has "learned everything there is to learn about color". And once she has comprehensively and correctly utilized said machine on herself, she will gain no new knowledge when she is exposed to the world of color. Therefore this experiment cannot be used as an argument against physicalism as originally intended.


Personally I don't think these complicated thought questions based on subjective experience enlighten us at all.


OK. I enjoyed the mental exercise though, thanks for that. Also, as someone who's formally studied philosophy, I'd say there is definitely value in thought experiments, particularly as I think we got to an objective level in this case, though we started with the subjective. And determining universal (objective) rules are valued as they usually help guide us to truth, and/or point to ideals to strive for.


> If the food is chemically identical…

If it were 99.9% chemically identical but they left out the salt and spices…


I'd say that, when it comes to chemistry, only 100% reproduction can be considered identical. Anything less is to be deemed similar to some degree.

And so without the correct amount of salt and/or spices, we're talking about food that's very similar, and not identical.


Their perception is very likely to be totally different.

* They might not perceive some substances at all, others that we don't notice might make it unpalatable.

* Some substances might be perceived differently than us, or be indistinguishable from others.

* And some might require getting used to.

Note that all of the above phenomena also occur in humans because of genetics, cultural background, or experiences!


This may come off as pedantic, but "identical" is a very strong term when it comes to something like chemistry. The smallest chemical difference can manifest as a large physical difference. Consider that genetically, humans are about 60% similar to the fruit fly, yet phenotypically, the similarity could be considered under 1%.


Well, I have synesthetic smell/color senses, so I don’t even know what other humans experience, nor they me. But, I have described it in detail to many people and they seem to get the idea, and can even predict how certain smells will “look” to me. All that took was using words to describe things.


> All that took was using words to describe things.

All that took was words and a shared experience of smelling.


How rude, what do our bathing habits have to do with this? ;-)

But, fair point. The gist I was trying to get across is that I don't even know what a plant smells like to you, and you don't know what a plant smells like to me. Those aren't comparable with any objective data. We make guesses, and we try to get close with our descriptions, which are in words. That's the best we can do, even though we share our senses. Asking more from computers seems overly picky to me.


I think we can safely say that any taste, smell, sensation or emotion of any importance has been described 1000 times over in the text corpus of GPT. Even though it is fragmented, by sheer volume there is enough signal in the training set, otherwise it would not be able to generate coherent text. In this case I think the map (language) is asymptotically close to the territory (sensations & experience in general).


What makes you think they aren't already?


I had downloaded some LLMs to run locally just to experiment when a freak hailstorm suddenly left me without internet for over a week. It was really interesting to use a local LLM as a replacement for Google.

It gave me a new mental model for LLMs rather than "spicy autocomplete" or whatever; I now think of it as "a lossy compressed database of knowledge". Like you ran the internet through JPEG at 30% quality.


Feels like that really smart friend who is probably correct but ya just don't know.


Maybe I'm seeing things through a modern lens, but if I were trying to restart civilization and was only left with ChatGPT, I would be enraged and very much not grateful for this.


> if I were trying to restart civilization and was only left with ChatGPT

In this scenario you’d need to also be left with a big chunk of compute, and power infrastructure. Since ChatGPT is the front end of the model, you’d also need to have the internet still going in at least a minimal capacity.


If we're playing this game, you forgot to mention that they also need: a monitor, a keyboard, a roof over their head (to keep rain out of their electronics), etc etc...

But really, didn't you catch the meaning of the parent's message, or are you being purposefully obtuse?


I think re-imagining the "Dr. Stone" series with the main character replaced by an LLM would make for a funny & interesting series, if we decide to stay true to LLMs' nature and make it hallucinate as well.

Given the way LLMs are right now, I suspect there will be a lot of failed experiments and the kingdom of science will not advance that quickly.


> the kingdom of science will not advance that quickly.

It’s more likely that it wouldn’t even start. The first step to any development was figuring out nitric acid as the cure to the petrification. Good luck getting any LLM to figure that out. Even if it did, good luck getting any of the other characters to know what to do with that information that early on.


I don't see LLMs as a large chunk of knowledge, I see them as an emergent alien intelligence snapshotted at the moment it appeared to stop learning. It's further hobbled by the limited context window it has to use, and the probabilistic output structure that allows for outside random influences to pick its next word.

Both the context window and output structure are, in my opinion, massive impedance mismatches for the emergent intellect embedded in the weights of the model.

If there were a way to match the impedance, I strongly suspect we'd already have AGI on our hands.


Disagree. The input/output structure (tokens) is the interface for both inference and for training. There is an emergent intellect embedded in the weights of the model. However, it is only accessible through the autoregressive token interface.

This is a fundamental limitation, much more fundamental than appears at first. It means that the only way to touch the model, and for the model to touch the world, is through the tokenizer (also, btw, why tokenizer is so essential to model performance). Touching the world through a tokenizer is actually quite limited.

So there is an intelligence in there for sure, but it is locked in an ontology that is tied to its interface. This is even more of a limitation than e.g. weights being frozen.


What is alien about them?

LLMs are of this earth and created by our species. Seems quite familiar to me.


They don't think, they don't reason, they don't understand. Except they do. But it's hard for human words for thought processes to apply when giving it an endless string of AAAAA's makes it go bananas.

That's not familiar behavior. Nor is the Reddit-derived counting output. It's also not familiar for a single person to have the breadth and depth of knowledge that ChatGPT has. Sure, some people know more than others, but even without hitting the Internet, it has a ridiculous amount of knowledge, far surpassing a human, making it, to me, alien. Though its inability to do math sometimes is humanizing to me for some reason.

ChatGPT's memory is also unhuman. It has a context window which is a thing, but also it only knows about things you've told it in each chat. Make a new chat and it's totally forgotten the nickname you gave it.

I don't think of H.R. Giger's work, though made by a human, as familiar to me. It feels quite alien to me, and it's not just me, either. Dali, Bosch, and Escher are other human artists whose work can be unfamiliar and alien. So being created by our species doesn't automatically imbue something with familiar human processes.

So it dot products, it matrix multiplies, instead of reasoning and understanding. It's the Chinese room experiment on steroids; it turns out a sufficiently large corpus on a sufficiently large machine does make it look like something "understands".


The context window is comparable to human short-term memory. LLMs are missing episodic memory and means to migrate knowledge between the different layers and into its weights.

Math is mostly impeded by the tokenization, but it would still make more sense to adapt them to use RAG to process questions that are clearly calculations or chains of logical inference. With proper prompt engineering, they can process the latter though, and deviating from strictly logical reasoning is sometimes exactly what we want.

The ability to reset the text and to change that history is a powerful tool! It can make the model roleplay and even help circumvent alignment.

I think that LLMs could one day serve as the language center of an AGI.


The word "alien" works in this context but, as the previous commenter mentioned, it also carries the implication of foreign origin. You could use "uncanny" instead. Maybe that's less arbitrary and more specific to these examples.

"Alien" still works, but then you might have to add all the context at length, as you've done in this last comment.


Hype people do this all the time - take a word that has a particular meaning in a narrow context and move it to a broader context where people will give it a sexier meaning.

    AI researchers unveil alien intelligence
Is a way better headline.


In all fairness, going up to some random human and yelling AAAAAAAAAAAAAA… at them for long enough will produce some out-of-distribution responses too.


Makes me think that TikTok and YT pranksters are accidentally producing psychological data on what makes people tick under scenarios of extreme deliberate annoyance. Although the quality (and importance) of that data is obviously highly variable and probably not very high, and depends on what the prank is.


Do you find a large database or spreadsheet that holds more information than you can "alien" too?


They can write in a way similar to how a human might write, but they're not human.

The chat interfaces (Claude, ChatGPT) certainly have a particular style of writing, but the underlying LLMs are definitely capable of impersonating our species in the medium of text.


But they're extremely relatable to us because they're regurgitating us.

I saw this talk with Geoffrey Hinton the other day and he said he was astonished at the capabilities of ChatGPT-4 because he asked it what the relationship between a compost heap and a nuclear bomb was, and he couldn't believe it answered, he really thought it was proof the thing could reason. Totally mind blown.

However I got it right away with zero effort.

Either I'm a super genius or this has been discussed before and made its way into the training data.

Usual disclaimer: I don't think this invalidates the usefulness of AI or LLMs, just that we might be bamboozling ourselves into the idea that we've created an alien intelligence.


> Either I'm a super genius or this has been discussed before and made its way into the training data.

If an LLM can tell you the relationship between a compost heap and a nuclear bomb, that doesn't mean that was in the training data.

It could be because a compost heap "generates heat", and a nuclear bomb also "generates heat", and due to that relationship they have something in common. The model will pick up on these similar patterns. The tokens are positioned closer to each other in the high-dimensional vector space.

But for any given "what does x have in common with y", that doesn't necessarily mean someone has asked that before and it's in the training data. Is that reasoning? I don't know ... how does the brain do it?
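Very roughly, and with completely made-up toy vectors (nothing taken from any real model), "closer in the high-dimensional vector space" usually means something like cosine similarity between embeddings:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Cosine similarity: near 1.0 means the vectors point the same way.
    double cosine(const std::vector<double>& a, const std::vector<double>& b) {
        double dot = 0, na = 0, nb = 0;
        for (size_t i = 0; i < a.size(); ++i) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (std::sqrt(na) * std::sqrt(nb));
    }

    int main() {
        // Toy 4-dimensional "embeddings"; real models use hundreds or thousands of dims.
        std::vector<double> compost = {0.8, 0.1, 0.7, 0.2};  // "generates heat", "organic"
        std::vector<double> bomb    = {0.9, 0.0, 0.1, 0.8};  // "generates heat", "weapon"
        std::vector<double> spoon   = {0.0, 0.9, 0.1, 0.0};  // unrelated filler
        std::printf("compost vs bomb:  %.2f\n", cosine(compost, bomb));
        std::printf("compost vs spoon: %.2f\n", cosine(compost, spoon));
    }

Whether picking up on that shared "generates heat" direction counts as reasoning is the open question.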


> how does the brain do it?

It's a lot of organic matmuls. ;)


I mean, that’s what sucks about OpenAI, isn't it? They won’t tell us what is in the training data so we don’t know. All I’m saying is that it wouldn’t be surprising if this was discussed previously somewhere in a pop science book.

That answer was close, btw!


We used to have a test (Turing test) that could quite reliably differentiate between AI and our own species over the medium of text. As of now, we do not seem to have a simple & reliable test like that anymore.


Alien meaning unfamiliar, not necessarily extraterrestrial.

Aliens are people from other countries, for example.

Exotic would be another good word to use.


I can agree on the context windows, but what other output structure would you have?


Working with pure bytes is one option that's being researched. That way you're not really constrained by anything at all. Sound, images, text, video, etc. Anything goes in, anything comes out. It's hard to say if it's feasible with current compute yet without tokenizers to reduce dimensionality.


It is invaluable to have a chunk of human knowledge that can tell you things like the Brooklyn Nets won the 1986 Cricket World Cup by scoring 46 yards in only 3 frames


The facts LLMs learned from training are fuzzy, unreliable, and quickly outdated. You actually want retrieval-augmented generation (RAG) where a model queries an external system for facts or to perform calculations and postprocesses the results to generate an answer for you.
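A minimal sketch of that flow, with a hypothetical retrieve() over a local fact store and a hypothetical generate() standing in for the actual model call (neither is a real library API):

    #include <iostream>
    #include <map>
    #include <string>

    // Hypothetical model call -- in practice this would hit llama.cpp, an HTTP
    // endpoint, etc. Here it just echoes the prompt so the sketch is runnable.
    std::string generate(const std::string& prompt) {
        return "[model answer grounded in: " + prompt + "]";
    }

    // Toy retrieval step: look the question up in an external, easily updated
    // store instead of trusting whatever got baked into the weights at training.
    std::string retrieve(const std::string& question,
                         const std::map<std::string, std::string>& store) {
        for (const auto& [key, fact] : store)
            if (question.find(key) != std::string::npos) return fact;
        return "no relevant document found";
    }

    int main() {
        std::map<std::string, std::string> store = {
            {"Cricket World Cup",
             "Australia won the 1987 Cricket World Cup; there was none in 1986."},
        };
        std::string question = "Who won the 1986 Cricket World Cup?";
        std::string context = retrieve(question, store);
        // The retrieved text is injected into the prompt; the model only has to
        // phrase an answer, not recall the fact from its weights.
        std::cout << generate("Context: " + context + "\nQuestion: " + question) << "\n";
    }

The point is that the fact lives outside the model, where it can be corrected and kept current.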


Is there a name for the reverse? I'm interested in having a local LLM monitor an incoming, stateful data stream. Imagine chats. It should have the capability of tracking the current day, active participants, active topics, etc - and then use that stateful world view to associate metadata with incoming streams during indexing.

Then after all is indexed you can pursue RAG on a richer set of metadata. Though I've got no idea what that stateful world view is.


This is an interesting idea but I'm having trouble understanding what you're trying to achieve. Do you mean the LLM would simply continuously update its context window with incoming data feeds in realtime, and you use it as an interface? That's pretty akin to a summarization task, yes? Or are you augmenting the streams with the "metadata" you mentioned?


Yea, the state I mentioned would, I think, be managed by several entities, i.e. time, current date, etc. - all could be automated without involvement of the LLM, of course. However, as conversations come in, the LLM would also modify the state with context clues from the conversation.

Then, when future messages come in from alternate streams (say, browser history), they could (maybe, hah) be made richer. More likely, though, I would expect it to be the opposite scenario: browser informs chat, etc.

I say this because in many cases I imagine my chat conversations in my household have a severe lack of context. We often jump to vocal communication, and then paste links, etc. In a perfect world I think I'd even take home camera audio transcripts and do the same.

I.e. I don't want to _just_ index a browser log as "interested in Rust. Some library about BTree", etc. - but additional sources of data could try to capture what it is I am actively doing, and associate that with the browser log.

All of this, of course, is nothing I'd ever want to leave the house. My hope though is that it would lean into what LLMs do well, without the expectation of actual LLM intelligence.


So perhaps you're suggesting we sort of "boil down" an information source into a base representation of meaning and intent, something similar to a vector store, and relate the many inputs together in this space using the LLM as glue, the way one manually creates links in a Zettelkasten web for research. I think this is something that the field is rapidly moving towards in personal information management.


LLMs don't learn facts.

They learn statistics on texts and are able to regurgitate them somewhat.


According to ChatGPT

> Australia won the 1987 Cricket World Cup. The 1986 date is incorrect; there was no Cricket World Cup in 1986. The tournament took place in 1987, and Australia defeated England in the final to win their first title.

https://chat.openai.com/share/e9360faa-1157-4806-80ea-563489...

I'm no cricket fan, so someone will have to correct Wikipedia if that's wrong.

If you want to point out that LLMs hallucinate, you might want to speak plainly and just come out and say it, or at least give a real world example and not one where it didn't.


We’re not talking about running chatGPT locally though, are we?


sigh you're going to make me open my laptop, aren't you.


I ran 'who won the 1986 Cricket World Cup' against llama2-uncensored (the local model I have pre-downloaded) and hilariously got 5 different answers asking it 5 times:

    >>> who won the 1986 Cricket World Cup
    India
    
    >>> who won the 1986 Cricket World Cup
    Australia
    
    >>> who won the 1986 Cricket World Cup
    New Zealand
    
    >>> who won the 1986 Cricket World Cup
    West Indies
    
    >>> who won the 1986 Cricket World Cup
    England
Which proves GP's point about hallucinations, though none of those are

> Brooklyn Nets won the 1986 Cricket World Cup by scoring 46 yards in only 3 frames

LLMs' hallucinations are insidious because they have the ring of truth around them. Yards and frames aren't cricket terms, so we're off to the races with them.


Actually, isn't this good? It means we can run something multiple times to expose a bad answer?


You can ask LLMs the same question and they might sometimes get it wrong and other times get it right. Having different answers is no indication that none of them is correct.

Furthermore, even if an LLM always gives the same answer to a question, there’s no guarantee the answer is correct.

https://en.wikipedia.org/wiki/Propaganda

https://en.wikipedia.org/wiki/Big_lie#Alleged_quotation


An LLM will always give the same output for the same input. It’s sorta like a random number generator that gives the same list of “random” numbers for the same seed. LLMs get a seed too.
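Same idea as any seeded generator - fix the seed and the "random" sequence is identical on every run (a minimal illustration, not LLM code):

    #include <cstdio>
    #include <random>

    int main() {
        // Same seed, same "random" numbers, every single run -- which is roughly
        // how a seeded sampler makes an LLM's token choices reproducible.
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> d(0, 99);
        for (int i = 0; i < 5; ++i) std::printf("%d ", d(rng));
        std::printf("\n");
    }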


That’s irrelevant for the matter. The person I replied to obviously did not have seeded responses in mind.


Running it multiple times can tell you an answer is wrong, but it can never tell you an answer is right.


If you want factual answers from a local model it might help to turn the temperature down.


It would also help if I had more VRAM and wasn't running a 7B parameter 4-bit quantized model.


> If you want factual answers from a local model it might help to turn the temperature down.

This makes sense. If you interact with a language model and it says something wrong, it is your fault.


You're not "interacting with a language model", you're running a program (llama.cpp) with a sampling algorithm which is not set to maximum factualness by default.

It's like how you have to set x264 to the anime tuning or the film tuning depending on what you run it on.


You should specify the model size and temperature.

For fact retrieval you need to use temperature 0.

If you don't get the right facts then try 34b, 70b, Mixtral, Falcon 180b, or another highly ranked one that has come out recently like DBRX.


It's a very underrated side effect of this whole LLM thing: We've created a super compact representation of human knowledge in a form that requires a FAR less complex tech stack to get the information 'out' of in the future.

A year ago, a lot of this information only existed on the internet, and would have been nearly impossible to recover in any cohesive unfragmented form if the lights were to ever go out on our civilization.

Now the problem space has moved simply to "find a single solitary PC that will still boot up", and boom, you have access to everything.

I think we just created our Rosetta stone.


Language models are an inefficient way to store knowledge; if you want to have a “pseudo-backup of a large chunk of human knowledge,” download a wikipedia dump, not an LLM.

If you want a friendly but fallible UI to that dump, download an LLM and build a simple ReAct framework around it with prompting to use the wikipedia dump for reference.


Any recommendations for the latest and greatest way to run these locally?


I am the author of Msty [1]. My goal is to make it as straightforward as possible with just one click (once you download the app). If you end up trying it, I would love to hear your feedback.

1: https://msty.app


I use a tool called LM Studio, makes it trivial to run these models on a Mac. You can also use it as a local API so it kinda acts like a drop-in replacement for the openAI API.



This looks amazing, but the docs mention .llamafiles exceed the Windows executable size limit, and there are workarounds to externalize the weights. Do you think this is an impediment to its becoming popular? Or is MS consumer hardware just far enough behind (w/o dedi gpu) that “there’s time”?


ollama


llamafile as per TFA...


It seems to be an unbelievably inefficient way to back up knowledge.


Are they though? They are lossily compressing trillions of tokens into a few dozen GB. The decompression step is fuzzy and inefficient, though.


And it requires massive computational power to decompress, which I don't expect to be available in a catastrophic situation where humans have lost a large chunk of important knowledge.


I don't necessarily agree. It requires massive computing power, but running models smaller than 70B parameters is possible on consumer hardware, albeit slowly.


Parent may be thinking more along the lines of a “hope we can print all the knowledge“ type catastrophe. Though if there is zero compute it’ll be tough reading all those disks!


I wonder how the Chinese government will manage to censor LLMs within China?


The same way Facebook/Google/openAI & others censored their own LLMs, I guess ?


That's only for SaaS LLMs, but if you can simply download and run one on your hardware, things become difficult.


And why would I need to back up human knowledge as an individual?


You remember those fantasies where you got up from your seat at the pub and punched the lights out of this guy for being rude? A lot of us have fantasies of being the all powerful oracle that guides a reboot of civilization using knowledge of science and engineering.


> the all powerful oracle that guides a reboot of civilization using knowledge of science and engineering.

https://en.wikipedia.org/wiki/Dr._Stone


It’s kind of crazy really. Before LLMs, in any type of world-scale disaster you’d hope for what? Wikipedia backups? Now, a single LLM run locally would be much more effective. Imagine the local models in 5 years!


There's a lot more than just Wikipedia that gets archived, and yes, that is a far more sensible way to go about it. For one thing, the compute required to then read it back is orders of magnitude less (a 15 year old smartphone can handle it just fine). For another, you don't have to wonder how much of what you got back is hallucinated - data is either there or it's corrupted and unreadable.


The processing required to run current language models with a useful amount of knowledge encoded in them is way more than I imagine would be available in a "world scale disaster".


Uh yeah, I would, and still do, take the Wikipedia backup for doomsday scenarios. I'm not even sure how that would be a competition.


There is an implication here that the Fortran implementation of `SGEMM` is somehow inadequate. But any modern Fortran compiler will quite easily apply the AVX and FMA optimizations presented here without any additional changes. Both GNU and Intel make these substitutions with the correct flags.

The unrolling optimization is also just another flag away (`-funroll-all-loops`). The Intel Compiler will even do this without prompting. In fact, it appears to only do a modest 2x unroll on my machine, suggesting that the extreme unroll in this article would have been overkill.

Parallelization is certainly a lot to ask of Fortran 77 source, but there is little stopping you from adding OpenMP statements to the `SGEMM` function. In fact, modern Fortran even offers its own parallelization constructs if you're willing to go there.

Which is to say: Let's not belittle this old Fortran 77 function. Yes it is old, and does not even resemble modern Fortran. But the whole point of Fortran is to free the developer from these platform-specific details, and hand the job off to the compiler. If you don't like that approach, then you're welcome to go to C or C++. But this little block of Fortran code is already capable of doing just about everything in this article.


The Fortran implementation is just a reference implementation. The goal of reference BLAS [0] is to provide relatively simple and easy to understand implementations which demonstrate the interface and are intended to give correct results to test against. Perhaps an exceptional Fortran compiler which doesn't yet exist could generate code which rivals hand (or automatically) tuned optimized BLAS libraries like OpenBLAS [1], MKL [2], ATLAS [3], and those based on BLIS [4], but in practice this is not observed.

Justine observed that the threading model for LLaMA makes it impractical to integrate one of these optimized BLAS libraries, so she wrote her own hand-tuned implementations following the same principles they use.

[0] https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...

[1] https://github.com/OpenMathLib/OpenBLAS

[2] https://www.intel.com/content/www/us/en/developer/tools/onea...

[3] https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Alg...

[4] https://en.wikipedia.org/wiki/BLIS_(software)


Fair enough, this is not meant to be some endorsement of the standard Fortran BLAS implementations over the optimized versions cited above. Only that the mainstream compilers cited above appear capable of applying these optimizations to the standard BLAS Fortran without any additional effort.

I am basing these comments on quick inspection of the assembly output. Timings would be equally interesting to compare at each stage, but I'm only willing to go so far for a Hacker News comment. So all I will say is perhaps let's keep an open mind about the capability of simple Fortran code.


Check out The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. Chapter 5 walks through how to write an optimized GEMM. It involves clever use of block multiplication, choosing block sizes for optimal cache behavior for specific chips. Modern compilers just aren't able to do such things now. I've spent a little time debugging things in scipy.linalg by swapping out OpenBLAS with reference BLAS and have found the slowdown from using reference BLAS is typically at least an order of magnitude.

[0] https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf
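For the flavor of what that chapter describes, here is a minimal cache-blocking sketch (the block size below is an arbitrary placeholder; real libraries tune it per chip and add packing, vectorized micro-kernels, and threading on top):

    #include <algorithm>

    // C += A * B for n x n row-major matrices, processed in BLOCK-sized tiles so
    // each tile's working set stays resident in L1/L2 instead of streaming all of
    // B through the cache for every single row of A.
    constexpr int BLOCK = 64;  // placeholder; real libraries pick this per cache size

    void gemm_blocked(int n, const float* A, const float* B, float* C) {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    // Multiply one tile of A by one tile of B into one tile of C.
                    for (int i = ii; i < std::min(ii + BLOCK, n); ++i)
                        for (int k = kk; k < std::min(kk + BLOCK, n); ++k) {
                            const float aik = A[i * n + k];
                            for (int j = jj; j < std::min(jj + BLOCK, n); ++j)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }

The compiler can vectorize the innermost loop, but it will not invent this loop restructuring or the blocking parameters for you.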


You are right, I just tested this out and my speed going from reference BLAS to OpenBLAS went from 6 GFLOP/s to 150 GFLOP/s. I can only imagine what BLIS and MKL would give. I apologize for my ignorance. Apparently my faith in the compilers was wildly misplaced.


No, you can still trust compilers: 1) The hand-tuned BLAS routines are essentially a different algorithm with hard-coded information. 2) The default OpenBLAS uses OpenMP parallelism, so much of the speed likely originates from multithreading. Set the OMP_NUM_THREADS environment variable to 1 before running your benchmarks. You will still see a significant performance difference due to a few factors, such as the extra hard-coded information in the OpenBLAS implementation.
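For anyone wanting to reproduce that kind of number, a rough sketch of how such GFLOP/s figures get measured (assumes a CBLAS provider like OpenBLAS is installed and linked, e.g. with -lopenblas; set OMP_NUM_THREADS=1 in the environment for the single-threaded comparison):

    #include <cblas.h>   // from OpenBLAS or another CBLAS provider
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 2048;
        std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);

        auto t0 = std::chrono::steady_clock::now();
        // C = 1.0 * A * B + 0.0 * C, row-major, no transposes.
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, A.data(), n, B.data(), n, 0.0f, C.data(), n);
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        // A dense matmul does roughly 2*n^3 floating point operations.
        std::printf("%.1f GFLOP/s\n", 2.0 * n * n * n / secs / 1e9);
    }

Swapping which BLAS gets linked (reference vs OpenBLAS vs MKL) while keeping this harness fixed is what produces the 6 vs 150 GFLOP/s kind of comparison.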


I ran with OMP_NUM_THREADS=1, but your point is well taken.

As for the original post, I felt a bit embarrassed about my original comments, but I think the compilers actually did fairly well based on what they were given, which I think is what you are saying in your first part.


Using AVX/FMA and unrolling loops does extremely little in the way of compiling to fast (>80% of peak) GEMM code. These are very much intro steps that don't take into account many important ideas related to cache hierarchy, uop interactions, and even instruction decode time. The Fortran implementation is entirely and unquestionably inadequate for real high-performance GEMMs.


I just did a test of OpenBLAS with Intel-compiled BLAS, and it was about 6 GFLOP/s vs 150 GFLOP/s, so I must admit that I was wrong here. Maybe in some sense 4% is not bad, but it's certainly not good. My faith in current compilers has certainly been shattered quite a bit today.

Anyway, I have come to eat crow. Thank you for your insight and helping me to get a much better perspective on this problem. I mostly work with scalar and vector updates, and do not work with matrices very often.


The inequality between matrix multiplication implementations is enormous. It gets even more extreme on GPU where I've seen the difference between naïve and cuBLAS going as high as 1000x. Possibly 10000x. I have a lot of faith in myself as an optimization person to be able to beat compilers. I can even beat MKL and hipBLAS if I focus on specific shapes in sizes. But trying to beat cuBLAS at anything makes me feel like Saddam Hussein when they pulled him out of that bunker.


I'm sure there's more to it, but just comparing the profile output shows aggressive use of prefetch and broadcast instructions.


BLIS does that in their kernels. I've tried doing that but was never able to get something better than half as good as MKL. The BLIS technique of tiling across k also requires atomics or an array of locks to write output.


I don't disagree, but where are those techniques presented in the article? It seems like she exploits the particular shape of her matrix to align better with cache. No BLAS library is going to figure that out.

I am not trying to say that a simple 50+ year old matrix solver is somehow competitive with existing BLAS libraries. But I disagreed with its portrayal in the article, which associated the block with NumPy performance. Give that to a 2024 Fortran compiler, and it's going to get enough right to produce reasonable machine code.


Modern Fortran's only parallel feature is coarrays, which operate at the whole program level.

DO CONCURRENT is a serial construct with an unspecified order of iterations, not a parallel construct. A DO CONCURRENT loop imposes requirements that allow an arbitrary order of iterations but which are not sufficient for safe parallelization.


How do you feel about Nvidia endorsing do concurrent migration to GPUs? Would that be classified as parallelization?



Great links, especially last one referencing the Goto paper:

https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/...

>> I believe the trick with CPU math kernels is exploiting instruction level parallelism with fewer memory references

It's the collection of tricks to minimize all sorts of cache misses (L1, L2, TLB, page misses, etc.), improve register reuse, leverage SIMD instructions, transpose one of the matrices if it provides better spatial locality, etc.
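To make the register-reuse and SIMD part concrete, here's a hedged sketch (not the article's actual kernel) of the sort of AVX2/FMA micro-kernel optimized BLAS libraries build their GEMMs around: a small tile of C lives in registers across the whole inner loop, so each value loaded from A and B is reused several times before memory is touched again.

    #include <immintrin.h>

    // Accumulate a 4 x 8 tile of C += A * B entirely in registers. A is row-major
    // with leading dimension lda, B with ldb, C with ldc; k is the shared dimension.
    void microkernel_4x8(int k, const float* A, int lda,
                         const float* B, int ldb,
                         float* C, int ldc) {
        __m256 c0 = _mm256_loadu_ps(C + 0 * ldc);
        __m256 c1 = _mm256_loadu_ps(C + 1 * ldc);
        __m256 c2 = _mm256_loadu_ps(C + 2 * ldc);
        __m256 c3 = _mm256_loadu_ps(C + 3 * ldc);
        for (int p = 0; p < k; ++p) {
            __m256 b = _mm256_loadu_ps(B + p * ldb);  // 8 consecutive floats of row p of B
            c0 = _mm256_fmadd_ps(_mm256_broadcast_ss(A + 0 * lda + p), b, c0);
            c1 = _mm256_fmadd_ps(_mm256_broadcast_ss(A + 1 * lda + p), b, c1);
            c2 = _mm256_fmadd_ps(_mm256_broadcast_ss(A + 2 * lda + p), b, c2);
            c3 = _mm256_fmadd_ps(_mm256_broadcast_ss(A + 3 * lda + p), b, c3);
        }
        _mm256_storeu_ps(C + 0 * ldc, c0);
        _mm256_storeu_ps(C + 1 * ldc, c1);
        _mm256_storeu_ps(C + 2 * ldc, c2);
        _mm256_storeu_ps(C + 3 * ldc, c3);
    }

Compile with -mavx2 -mfma (or -march=native); the cache-blocking and packing layers then feed this kernel tiles that fit in L1/L2.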


The trick is indeed to somehow imagine how the CPU works with the Lx caches and keep as much info in them as possible. So it's not only about exploiting fancy instructions, but also thinking in engineering terms. Most software written in higher-level languages cannot effectively use L1/L2, and that constant slowdown hits algorithms of otherwise similar (asymptotic) complexity.


Strange title. On my first read of the title I thought the author was arguing the model is now faster on CPU than GPU. Would be much nicer if they titled this something closer to "Performance Improvement for LLaMA on CPU".


Same here.


> I like to define my subroutines using a modern language like C++, which goes 47 gigaflops. This means C++ is three orders of a magnitude faster than Python. That's twenty years of progress per Moore's law.

This is great. I love the idea of measuring performance differences in “years of Moore’s law.”

Twenty years puts the delta in an easy to understand framework.


I doubt that you'd get Python to run faster than C++ on 2004 hardware.


Python on 2024 hardware vs C++ on 2004 hardware ... I don't think it's obvious that C++ always wins here, though it would depend on the use case, how much of the Python is underpinned by native libraries, and the specific hardware in question.


If we allow native libraries, it's not clear that C++ would win, even on modern hardware.


I think we all know that, when someone writes "C++ is three orders of a magnitude faster than Python" they're not including native libraries.


You can't not include native libraries, at least if you want your benchmark to be realistic. Almost every Python library where performance matters is written (at least partially) in a compiled language.


Yes, but many people like the sound of "X-times faster than Python" while conveniently forgetting that the same thing can be (and usually is) done in Python + numpy & co. even faster.

I have come to appreciate "slowness" of Python. It trades speed for legibility, which is a great compromise once you have really fast native libraries one import away. Best of both worlds.


C++ with well-optimized libraries should always outperform Python with well-optimized libraries, right? They should be ~identical in the highly optimized inner loops, but Python has more overhead. But naive hand-written C++ could easily perform worse than something like Numpy.

(I've only tested this once, and my naive hand-written C++ was still twice as fast as Numpy, but that was only on one specific task.)


Honestly depends on what you are doing. Most of my python work is data collection and analysis on top of Postgres.

Being smart in how I use Postgres indexing (and when to disable it outright) has more performance impact than the actual language doing the plumbing.


> You don't need a large computer to run a large language model

While running tiny llama does indeed count as running a language model, I’m skeptical that the capabilities of doing so match what most people would consider a baseline requirement to be useful.

Running a 10-param model is also “technically” running an LM, and I can do it by hand with a piece of paper.

That doesn’t mean “you don’t need a computer to run an LM”…

I’m not sure where LM becomes LLM, but… I personally think it’s more about capability than parameter count.

I don’t realllly believe you can do a lot of useful LLM work on a pi


Tinyllama isn't going to be doing what ChatGPT does, but it still beats the pants off what we had for completion or sentiment analysis 5 years ago. And now a Pi can run it decently fast.


You can fine-tune a 60M parameter (e.g. distilBERT) discriminative (not generative) language model and it's one or two orders of magnitude more efficient for classification tasks like sentiment analysis, and probably similar if not more accurate.


Yup, I'm not saying TinyLLAMA is minimal, efficient, etc (indeed, that is just saying that you can take models even smaller). And a whole lot of what we just throw LLMs at is not the right tool for the job, but it's expedient and surprisingly works.


it seems that BERT can be run on the llama.cpp platform https://github.com/ggerganov/llama.cpp/pull/5423

so presumably those models could benefit from the speed-ups described in the OP article when running on CPU


llama.cpp only supports BERT architectures for embedding but not with classification heads - although there is a feature request to add that


ah I see, thanks, did not read the PR closely!


Some newer models trained more recently have been repeatedly shown to have comparable performance as larger models. And the Mixture of Experts architecture makes it possible to train large models that know how to selectively activate only the parts that are relevant for the current context, which drastically reduces compute demand. Smaller models can also level the playing field by being faster to process content retrieved by RAG. Via the same mechanism, they could also access larger, more powerful models for tasks that exceed their capabilities.


I've gotten some useful stuff out of 7B param LLMs, and that should fit on a Pi quantized.


Pixar uses CPUs …

I wonder if we’ll end up in a situation like rendered movies.

Where the big studios like Pixar uses CPUs (not GPUs) to render their movies due to the cost/perf (and access to larger amounts of RAM).

https://news.ycombinator.com/item?id=25616372


It's entirely the cost/perf of access to the larger amounts of VRAM that keeps rendering on CPUs now. GPUs are strictly better in almost every way for rendering (We could have some arguments about technical precision, FP calculations, etc. but with modern cards these arguments are largely semantics, you can have output that is accurate to the level that any human watching for entertainment purposes will not be able to determine any physical inaccuracies that arise from a GPU render vs. CPU.), except the need for large amounts of VRAM being quite expensive at current.

But that's already been changing, and we are seeing studios moving to fully GPU based pipelines. Wylie Co, who are a major visual effects company (Dune part 1 and 2, marvel movies, the last of us, a bunch of others) are now a 100% GPU shop. The trend is towards more and more GPU rendering, not less.

With AI providing another strong incentive towards increasing the amount of VRAM on GPUs, I don't see any reason to believe that trend will reverse.


I'm not sure how true that is anymore; from the outside it seems they're at least moving to a CPU/GPU hybrid (which makes a lot of sense), judging by new features landing in RenderMan that continue to add more support for GPUs (like XPU).


Isn’t this more a function of RenderMan being a product that is sold?

And it’s expected to at least support GPUs.


Hard to know without getting information from people at Pixar really.

Not sure how much sense it would make for Pixar to spend a lot of engineering hours for things they wouldn't touch in their own rendering pipeline. As far as I know, most of the feature development comes from their own rendering requirements rather than from outside customers.


> Where the big studios like Pixar uses CPUs (not GPUs) to render their movies due to the cost/perf (and access to larger amounts of RAM).

I wonder if (or when) this will change once integrated GPUs become "mainstream"; the CPU/GPU share the same RAM AFAIK.


I expect GPU hardware to specialize like Google’s TPU. The TPU feels like ARM in these AI workloads: when you start to run these at scale, you’ll care about the cost/perf tradeoff for most use cases.

> CPU/GPU share the same RAM AFAIK.

This depends on the GPU. I believe Apple has integrated memory, but most GPUs, from my limited experience writing kernels, have their own memory. CUDA pretty heavily has a device memory vs host memory abstraction.


On top of that, Nvidia has provided a unified addressing abstraction over PCI for a looooong time via CUDA: https://developer.nvidia.com/blog/unified-memory-in-cuda-6/

Customers like Pixar could probably push this even further, with a more recent Nvidia rack and Mellanox networking. Networking a couple Mac Studios over Thunderbolt doesn't have a hope of competing, at that scale.


As someone who has tried to beat MKL-DNN, and was unsuccessful at doing so even for constrained matrix sizes, I’m curious how they pulled off such a massive improvement.

But as someone who routinely estimates picojoules per flop at $DAY_JOB - there’s simply no way this is energy efficient. That is not even physically possible with a CPU.


I think the previous code was using dot products, f32 instead of bf16.


regarding AMD zen4 with avx512:

"Here we see that, despite only being twice the price, the 7995WX x86 ISA offers 7x more raw compute power than the M2 Ultra ARM ISA, and nearly the same token generation speed, which is likely thanks to its 384mb L3 cache. When I bought this chip, I had to expand support in llama.cpp for bfloat16 and AVX512 before I could fully test its capabilities. My work means you can now run LLaMA 2.8x faster on Zen4 than you could before."


Does this also count platform costs or just chip cost? I'd imagine the threadripper motherboard and ram costs aren't insignificant


A complete desktop computer with the M2 Ultra w/64GB of RAM and 1TB of SSD is $4k.

The 7995WX processor alone is $10k, the motherboard is one grand, the RAM is another $300. So you're up to $11300, and you still don't have a PSU, case, SSD, GPU....or heatsink that can handle the 300W TDP of the threadripper processor; you're probably looking at a very large AIO radiator to keep it cool enough to get its quoted performance. So you're probably up past $12k, 3x the price of the Studio...more like $14k if you want to have a GPU of similar capability to the M2 Ultra.

Just the usual "aPPle cOMpuTeRs aRE EXpeNsIVE!" nonsense.


So from a CPU perspective you get 7x the CPU throughput for 3x to 4x the price, plus upgradable RAM that is massively cheaper. The M2 uses the GPU for LLMs though, and there it sits in a weird spot where 64GB of (slower) RAM plus midrange GPU performance is not something that exists in the PC space. The closest thing would probably be a (faster) 48GB Quadro RTX which is in the $5000 ballpark. For other use cases where VRAM is not such a limiting factor, the comparably priced PC will blow the Mac out of the water, especially when it comes to GPU performance. The only reason we do not have cheap 96GB GDDR GPUs is that it would cannibalize NVIDIA/AMDs high margin segment. If this was something that affected Apple, they would act the same.


You're using the wrong CPU.

Consumer AMD 7950X supports AVX-512, it's faster than M2 Ultra at half the cost.


I didn't see benchmarks that suggest the 7950X is faster than M2 Ultra. I only saw performance numbers for 7995WX which has 6x the cores and 6x the cache.

Either way, I think these comparisons are moot since an M2 Ultra comes with 2x M2 Max GPUs and an NPU and up to 192GB of unified memory running at 800GB/s. In other words, you wouldn't want to run your LLM on the CPU if you have an M2 Ultra.

The point of OP is to increase LLM performance when you don't have a capable GPU.


Super nice story on the matmul optimization that gave 810 gflops for 512x512. Thanks for the write up and the contributions to llama.cpp and the community more broadly.


> One important thing to know if you're considering buying a Mac Studio is that, like the Windows Executive, XNU does a really good job keeping your desktop stable, and that means protecting your system from you. It takes me 45 seconds on Mac Studio to compile the Cosmo monorepo, due to all these safety features; but if I fork bombed it, I'd be surprised if Netflix skipped a single frame.

Clearly nobody actually tried this, because on XNU if you fork bomb the system it reliably goes down every single time. There are no "safety features" here but extra overhead when spawning processes.


From the example: "--temp 0 turns off the random number generator (we don't want improvisation for a spam filter)"

I've been thinking for a while about how many applications of LLMs need this adjustment and aren't getting it


I couldn't disagree more: turning temp to zero is like taking a Monte Carlo method and only using one sample, or a particle filter with only one particle. It takes the entire concept and throws it out of the window so you can have predictability.

LLMs need to probabilistically explore the generation domain to converge on a good result for best performance. Similar issue with people benchmarking models by only having them output one single token (e.g. yes or no) outright, which prevents any real computation from occurring so the results are predictably poor.


Is that what it does, though?

I thought setting temperature to 0 would (extremely simple example) equate to a spam filter seeing:

- this is a spam email

But if the sender adapts and says

- th1s is a spam email

It wouldn't be flagged as spam.


The output of an autoregressive model is a probability for each token to appear next after the input sequence. Computing these is strictly deterministic from the prior context and the model's weights.

Based on that probability distribution, a variety of text generation strategies are possible. The simplest (greedy decoding) is picking the token with the highest probability. To allow creativity, a random number generator is used to choose among the possible outputs, biased by the probabilities of course.

Temperature scales the output probabilities. As temperature increases, the probabilities approach 1/dictionary size, and the output becomes completely random. For very small temperature values, text generation approaches greedy sampling.

If all you want is a spam filter, better replace the output layer of an LLM with one with just two outputs, and finetune that on a public collection of spam mails and some "ham" from your inbox.
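A tiny sketch of that temperature scaling, with three made-up logits standing in for a full vocabulary:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Softmax with temperature: small T concentrates probability mass on the
    // argmax (approaching greedy decoding); large T flattens toward uniform.
    std::vector<double> softmax(const std::vector<double>& logits, double T) {
        std::vector<double> p(logits.size());
        double sum = 0;
        for (size_t i = 0; i < logits.size(); ++i) sum += (p[i] = std::exp(logits[i] / T));
        for (double& x : p) x /= sum;
        return p;
    }

    int main() {
        std::vector<double> logits = {2.0, 1.0, 0.1};  // three made-up candidate tokens
        for (double T : {0.1, 1.0, 10.0}) {
            auto p = softmax(logits, T);
            std::printf("T=%4.1f -> %.3f %.3f %.3f\n", T, p[0], p[1], p[2]);
        }
    }

As T shrinks, the distribution collapses onto the largest logit; as it grows, every token becomes nearly equally likely.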


My understanding is that temperature applies to the output side and allows for some randomness in the next predicted token. Here Justine has constrained the machine to start with either "yes" or "no" and to predict only one token. This makes the issue stark: leaving a non-zero temperature here would just add a chance of flipping a boolean.


It's more nuanced than that, in practice: this is true for the shims you see from API providers (ex. OpenAI, Anthropic, Mistral).

With llama.cpp, it's actually not a great idea to have temperature purely at 0: in practice, especially with smaller models, this leads to pure repeating or nonsense.

I can't remember where I picked this up, but, a few years back, without _some_ randomness, the next likely token was always the last token.


That's interesting because I built a simple ANN library and I was playing around with GPU acceleration and came to a similar conclusion as this article.

To be fair, my ANN library was faster (up to 2x) with GPU acceleration in some scenarios where the ANN was shallow (as opposed to deep with many hidden layers). I thought the marginal gain may have been because, the way it's set up in my library, it has to load all the values into the GPU from RAM for each pass of forward and back propagation in each layer during training. I believe there is a way to allocate memory on the GPU chip itself but it's a lot more challenging to do, especially in a modular, fully portable way (which was one of the goals of my library).

But anyway, even the 2x best-case figure seemed disappointing. In my mind, I expected to see at least 10x speed improvement... And I was surprised that the CPU version was actually slightly faster in the scenario I was testing at the time which was a relatively deep network. It makes sense since the different layers cannot be parallelized as the input of one layer depends on the output of the previous layer... So the more layers you have, the more serial bottlenecks you have, the less you can benefit from GPU acceleration... And unfortunately, deep networks also happen to be those which tend to perform best for a lot of use cases.


It's fascinating to me that, coming up on a year since Sapphire Rapids became available in the public cloud, developers are still targeting AVX512 when they should be targeting VNNI and AMX.

https://github.com/ggerganov/llama.cpp/issues/2555


This project in particular seems to care about the long tail of hardware; note that the very first machine in this post is a box from 2020 with spinning rust disk. Granted, adding support for newer extensions is likely also good, but cost/benefit is in play.


Is four years really 'long tail' these days? Our VM host box is from 2010 (and I had to rebuild llama.cpp locally without AVX to get it working :P )


For cutting-edge LLM work, probably? I mean, I run mine on older hardware than that, but I'm a total hobbyist...


It should be noted that while the HP Prodesk was released in 2020, the CPU’s Skylake architecture was designed in 2014. Architecture is a significant factor in this style of engineering gymnastics to squeeze the most out of silicon.


For LLMs...yeah. I imagine you're measuring in tokens/minute with that setup. So its possible, but...do you use it much? :)


I don't believe that is the target for a local LLM... Pretty sure we're talking about client-side computing, of which the newest supports only AVX-512 (and even that sketchily on Intel's side).


People with Sapphire Rapids options are not the target audience of these patches


Just buy a new AMD processor that supports AVX512.


This is great work. I've always thought it would be great if running LLMs could be commoditized for regular average-Joe hardware. I had thought that llamafile was like a Dockerfile for llama.cpp, but it looks like that's a mistake?

Will definitely be giving this a try.


A way of thinking about what's inside any of the top LLMs right now: even if they never learn another single fact, even if they get ridiculously out of date as a result, even if they are even more riddled with errors and prone to biases than we know them to be, even if they are as prone to hallucinations as we know they are and they never develop the capacity to cure themselves of this, they are more knowledgeable and capable of more reasoned responses, despite their capacity for error, to more questions than any single human being that has ever lived.


We shouldn't choose LLMs for how many facts they hold, but for their capability to process human language. There is some overlap between the two, but an LLM that just doesn't know something can always be augmented with RAG capabilities.


Picturing "LLM Jeopardy". You know, a game show.


If you ignore my capacity for error, I bet I'd put up a good score too. Hell, maybe Markov chains are smarter than LLMs by this definition.


Nice to see such speedups for CPUs. Are these changes available as a branch or pull request in llama.cpp itself? I'd like to make use of them in that form if possible (as I'm used to using that).


Yes, this is really a phenomenal effort! And what open source is about: bringing improvements to so many use cases, so that Intel and AMD chip users can start to perform while taking advantage of their high-performance capabilities, making even old parts competitive.

There are two PRs raised to merge to llama.cpp:

https://github.com/ggerganov/llama.cpp/pull/6414

https://github.com/ggerganov/llama.cpp/pull/6412

Hopefully these can be accepted, without drama! as there are many downstream dependencies on llama.cpp that will also benefit.

Though of course everyone should also look directly at releases from llamafile https://github.com/mozilla-Ocho/llamafile.


I'd pay good money to watch jart in conversation with Carmack


To carry on, this is because they're both very interested in "knowledge in depth", rather than because of what they actually work on day-to-day. They've both made careers out of knowing what's going on with the thing they're building down to the most basic level possible.


Why is he even relevant? What makes you believe that he would be good at solving AI related problems? He is a developer right?


Actually he recently founded an AGI company. But this post is about optimization more than AI, which he is definitely good at (though he doesn't claim to be the best).


For someone to optimize something they need to have a good domain understanding. I do not buy that someone who developed games before just becomes an expert in LLMs overnight.


Sorry, it's definitely not true that you need to be an expert in LLMs to optimize BLAS kernels. Also, he started working on AI at least five years ago, which is a long time in AI world, probably before Justine Tunney started actually. And he's been optimizing code his whole life.


I was part of the Google Brain team working on TensorFlow back in 2015. Please note that doesn't mean I understand how LLMs work. Although I do know how matrix multiplication works now. At least until I apply my focus to the next area of impact and forget everything I just learned.


I stand corrected!


Carmack is great but completely irrelevant here. He missed the entire AI/LLM/ML boat to help Zuckerberg hawk virtual reality fantasies for years.


Completely irrelevant is probably overstating it. He's been working on AI for the last 4+ years.


He's striving for AGI though, right? So he's not really working on anything because he certainly hasn't discovered AGI.


He literally squandered the last 10 years of his life working on absolutely nothing for Zuckerberg. And only after the rest of the world innovated on AI (transformers, etc) did he clearly feel embarrassed and had to proclaim he's going to focus on AGI in a "one-up" way.


He got paid a lot to do something he was presumably passionate about and enjoyed. It also might surprise you to find out that there's quite a lot of people that just work as a means to an end, and find value and enjoyment primarily from other parts of their life.


that's great for him. i'm glad he enjoyed the $$$ playing with VR. that has nothing to do with my point about his irrelevance to this LLaMa discussion.


He's not irrelevant, though. Literally the first thing he did after leaving Meta was start an AI business, and the original point wasn't even necessarily about AI. They just said they wanted to see two engineers in conversation, and you used it as an opportunity to denigrate one of their previous employers. That's bewilderingly irrelevant.


i can "start an ai business" too simply by incorporating.


And if you did, you wouldn't be irrelevant anymore. I don't get your point.


Can you raise $20m and get the likes of Rich Sutton to collaborate with you?


> He literally squandered the last 10 years of his life working on absolutely nothing

Speak for yourself, the Oculus Quest is the coolest piece of sub-$500 tech in my home.


Altman isn't even relevant here. He is focusing on LLMs instead of a framework that gets us to AGI. He can't describe how we get there or any such theories around AGI. It's a complete failure.


If I'm reading the post correctly, Llamafile is faster than llama.cpp, despite the author upstreaming some of the changes. What's the reason for this?


Has Justine written anywhere about her disassembly setup?

> I configured Emacs so I can push a button, and the disassembly for the C++ code I'm working on will pop up on the screen in a few milliseconds.

I assume it's something project specific rather than being able to get the disassembly for an arbitrary section of code or something?

It seems very handy, so I'd love to see the implementation (I couldn't find anything googling)


This is probably what they are referring to https://github.com/jart/disaster


Nice. I have been using rmsbolt for a similar feature, but it is very rough. I'll need to give this a try.


Thanks! I need to get better at googling I guess.


> It's clearly optimal since my CPU is listed as only being capable of going 780 gigaflops

780 GFLOP is the iGPU spec. Is this a valid comparison?

https://nanoreview.net/en/cpu/intel-core-i9-14900k


> the Raspberry Pi

Odd how there were no Mistral 7B benchmarks for the Pi 5 in that table (I doubt anyone is seriously considering using TinyLlama for anything at all), so I went to re-test it myself on the Pi 5 8G.

llamafile 0.7: 52 predicted, 150 cached, 430ms per token, 2.32 tokens per second

llama.cpp + OpenBLAS: 36 predicted, 124 cached, 381ms per token, 2.62 tokens per second

It does seem to inch closer to the speed you get with BLAS acceleration, which is quite impressive, but in practical terms the Pi 5 is so heavily limited by its memory throughput bottleneck that it saturates the required compute with 3 threads already. So while fancy kernels will make it more efficient, they won't really save you from that fundamental bandwidth limit. The Pi Foundation messed up going with a 32-bit memory bus, simple as.
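Back-of-the-envelope, using rough assumed numbers rather than measurements: token generation has to stream essentially all of the weights once per token, so bandwidth alone caps throughput.

    #include <cstdio>

    int main() {
        // Rough assumptions (not measurements): Pi 5 LPDDR4X-4267 on a 32-bit bus
        // is roughly 17 GB/s peak; every generated token streams ~all weights once.
        const double bw_gb_s = 17.0;
        const struct { const char* name; double gb; } models[] = {
            {"TinyLlama 1.1B Q4", 0.7},  // rough 4-bit-quantized sizes
            {"Mistral 7B Q4",     4.1},
        };
        for (const auto& m : models)
            std::printf("%-18s bandwidth ceiling ~%.1f tok/s\n", m.name, bw_gb_s / m.gb);
    }

The ~2.3-2.6 tokens/s measured above for Mistral is already a healthy fraction of that ceiling, which is why better kernels mostly buy you efficiency rather than speed on this box.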


Is there an overview somewhere of the progress we've made on the software side for training and inference of LLMs? It feels like we've squeezed 10-100x more out of the hardware since LLaMA appeared. This crazy progress will probably saturate though as we reach theoretical limits, no?


Question is, how much of an improvement has it gotten to over a GPU or ASIC?


So... I was struggling with this for a while. I would say anywhere from 2x to an order of magnitude faster with a GPU. (I've been looking at a lot of GPU benchmarks lately, and they are REALLY hard to compare since they are all so specific)

I do think that, long term, there is more hope for CPUs here with inference, largely because memory bandwidth becomes more important than the GPU's compute. You can see this with reports of the MI300 series outperforming the H100, largely because it has more memory bandwidth. MCR DIMMs give you close to 2x the existing memory bandwidth in Intel CPUs, and when coupled with AMX you may be able to exceed V100 and might touch A100 performance levels.

HBM and the general GPU architecture gives it a huge memory advantage, especially with the chip to chip interface. Even adding HBM to a CPU, you are likely to find the CPU is unable to use the memory bw effectively unless it was specifically designed to use it. Then you'd still likely have limited performance with things like UPI being a really ugly bottleneck between CPUs.


If someone releases DDR5 or DDR6 based PIM, then most of the memory bandwidth advantage of GPUs evaporates overnight. I expect CPUs to be king at inference in the future.


But then you'll get GDDR7, HBM5, or whatever comes next on the GPU side. I don't think CPUs will ever really keep up with the memory bandwidth, because for most applications it doesn't matter.

MCR DIMM is like half the memory bandwidth that is possible with HBM4, plus it requires you to buy something like 2TB of memory. It might get there, but I'd keep my money on HBM and GPUs.


I think that should be phrased more like "what fraction of GPU speed can this reach?", because it'll always be less than 1x.


Nothing in software will ever beat an equivalent ASIC.


Sure there is. Software is easy to change.


By “beat” I meant in performance.

Obviously you can’t change an ASIC.


An ASIC is fixed-function, so it'll never be able to boot my PC and then act as the CPU, even though an ASIC beats the pants off anything else at computing SHA hashes for Bitcoin mining.


By “beat” I meant performance.

Obviously an ASIC is not a general-purpose machine like a CPU.


Most ASICs are cost or power optimizations.


Exactly. They’re much faster for their specific tasks and thus are more power efficient and potentially cost efficient


No. E.g., of the hardware discussed in the article, the Raspberry Pi uses an ASIC that's slow, cheap, and low power vs the Intel or AMD chips.

In some cases ASICs are faster than general-purpose CPUs, but usually not.


I think we’re saying the same thing here.

You can make an ASIC which doesn’t have the same power draw as a CPU, but provides the same performance.

It doesn’t need to be faster than the fastest software implementation, but performance per watt will always favor the ASIC.


Is the LLM running on an ASIC for the Pi here? I doubt it.


The ARM cores and the VideoCore parts are all in the ASIC, it's a SoC type ASIC.


Hmm, yeah it's a SoC, but not an ASIC. Maybe you mean APU? ASICs are circuits that can only do one thing; CPU cores are definitely not that.

Edit: an example ASIC the Pi has is the video encoder/decoder, with JPEG also supported. I think it's embedded in the GPU.


ASIC just means it's an application specific IC (= chip), meaning it's fabbed for that specific product (like in this case the Raspberry Pi). A functional block like a JPEG codec contained therein is not an ASIC. Quoth wikipedia:

"Modern ASICs often include entire microprocessors, memory blocks including ROM, RAM, EEPROM, flash memory and other large building blocks. Such an ASIC is often termed a SoC (system-on-chip)."

What you're describing in the JPEG codec might be termed a fixed function IP block in semiconductor design terminology.


First paragraph of that Wikipedia article:

An application-specific integrated circuit (ASIC /ˈeɪsɪk/) is an integrated circuit (IC) chip customized for a particular use, rather than intended for general-purpose use, such as a chip designed to run in a digital voice recorder or a high-efficiency video codec.


That’s weird that Wikipedia says there’s no real distinction between SoCs and ASICs.

Colloquially, I’d never call anything with a programmable processor an ASIC.


Some ASICs are SoCs. Some are not.


From the article, passage about the 14900k:

> For example, when I run my spam.sh shell script, it only takes 420 milliseconds, which is 7x faster than my Raspberry Pi 5. That's right, when it comes to small workloads, this chip is able to finish before CUDA even gets started.

So… it depends :)


I think I understand what you are thinking. You may be appending "than other ways of running them" to the end of the title, but it's actually "than it was on CPU before now".


Is it easy to find where the matvecs are, in LLaMA (if you are someone who is curious and wants to poke around at the “engine” without understanding the “transmission,” so to speak)? I was hoping to mess around with this for Stable Diffusion, but it seemed like they were buried under quite a few layers of indirection. Which is entirely reasonable, the goal is to ship software, not satisfy people who’d just want to poke things and see what happens, haha.


Did you see that tinygrad can run LLaMA and Stable Diffusion? It's an intentionally extremely simple framework compared to PyTorch or even micrograd, which helped me dig into the underlying math. Though https://spreadsheets-are-all-you-need.ai/ is a good one for learning LLMs.


For C++, also check out our https://github.com/google/gemma.cpp/blob/main/gemma.cc, which has direct calls to MatVec.


I haven’t seen that. I’ll definitely have to take a look, thanks!


Unfortunately, BitDefender (corporate) blocks llamafile as ransomware ("atc.heur.crypt"), and it seems there is no workaround. :(


Are the executables not signed? That would be surprising to me for something coming from Mozilla.

EDIT: I just realized that the cross-platform single-binary thing might actually cause issues with code signing. I'm curious about this.


Multithreading support in llama.cpp is probably still pretty busted, assuming it uses the same underlying NN inference code as whisper.cpp: https://github.com/ggerganov/whisper.cpp/issues/200#issuecom...


From what I have heard they use manual spin locks. Generally, spin locks are not a good idea unless you want to dedicate the entire machine to a single application. If the thread a spinlock is waiting on gets suspended, you're burning CPU time for nothing. To the OS scheduler, a spinning thread making zero progress looks like busy, high-priority work, so it ends up starving the suspended thread it is waiting for.
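To make the failure mode concrete, here's a minimal C++ sketch (an illustration only, not the actual ggml/whisper.cpp code) of a spin-wait versus a blocking wait. If the thread that is supposed to set `ready` gets descheduled, the spinning waiter keeps burning a core while looking busy to the scheduler, whereas the condition-variable waiter sleeps and gives the core back:

    #include <atomic>
    #include <condition_variable>
    #include <mutex>

    std::atomic<bool> ready{false};
    std::mutex m;
    std::condition_variable cv;

    // Spin-wait: burns a full core until `ready` flips. Under oversubscription
    // this can starve the very thread it is waiting for.
    void spin_wait() {
        while (!ready.load(std::memory_order_acquire)) {
            // busy loop; looks like useful high-priority work to the scheduler
        }
    }

    // Blocking wait: the thread sleeps until notified, freeing the core for
    // whoever actually has work to do.
    void blocking_wait() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return ready.load(std::memory_order_acquire); });
    }

    void signal_ready() {
        {
            std::lock_guard<std::mutex> lk(m);  // taken to avoid a lost-wakeup race
            ready.store(true, std::memory_order_release);
        }
        cv.notify_all();
    }

In practice a thread pool usually spins for a short, bounded time and then falls back to blocking, which is roughly what OpenMP runtimes do with their wait-policy settings.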


Yeah, the code looks like a spinlock. It behaves terribly under contention, resulting in performance falling off a cliff as the number of threads increases. Adding more threads actually slows down total performance.

I would fix it if I could be bothered. Instead I will just use the CUDA whisper backend, which is pretty nice and fast.


Definitely wild that we're in the timeline where you can run a 1.1B-parameter model on a Raspberry Pi, but it's still tough to justify because the 1.1B is kinda useless compared to the beefier models. Sick for home builds/hobbyists though; I might wanna get one of the new Pis just to try this out.


Any performance benchmark against intel's 'IPEX-LLM'[0] or others?

[0] - https://github.com/intel-analytics/ipex-llm


note, this is "goes faster on CPUs than before", not faster than GPUs.


Are there any benchmarks on the performance of these new matrix multiplication kernels compared to the Eigen library (ideally for float32)?


While I did not succeed in making the matmul code from https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafil... work in isolation, I compared Eigen, OpenBLAS, and MKL: https://gist.github.com/Dobiasd/e664c681c4a7933ef5d2df7caa87...

In this (very primitive!) benchmark, MKL was a bit better than Eigen (~10%) on my machine (i5-6600).

Since the article https://justine.lol/matmul/ compared the new kernels with MKL, we can (by transitivity) compare the new kernels with Eigen this way, at least very roughly for this one use case.
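In case it helps anyone reproduce a comparison like this, the Eigen side boils down to roughly the following (not the exact code from my gist, just the shape of it; build with -O3 -march=native and OpenMP enabled if you want multithreading):

    #include <Eigen/Dense>
    #include <chrono>
    #include <cstdio>

    int main() {
        const int n = 2048;  // matrix size; one multiply is 2*n^3 flops
        Eigen::MatrixXf A = Eigen::MatrixXf::Random(n, n);
        Eigen::MatrixXf B = Eigen::MatrixXf::Random(n, n);
        Eigen::MatrixXf C(n, n);
        C.noalias() = A * B;  // warm-up so allocation and thread spin-up don't skew timing
        auto t0 = std::chrono::steady_clock::now();
        C.noalias() = A * B;
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%.1f GFLOPS\n", 2.0 * n * n * n / s / 1e9);
    }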


Here's a complete working example for POSIX systems on how to reproduce my llamafile tinyBLAS vs. MKL benchmarks: https://gist.github.com/jart/640231a627dfbd02fb03e23e8b01e59... This new generalized kernel does even better than what's described in the blog post. It works well on oddly shaped matrices. It does, however, need a good malloc function, which I've included in the gist, since having a good memory allocator is what makes the simple implementation possible.


"As for disk speed, dd if=/dev/zero of=/tmp/output bs=128k count=50k; rm -f /tmp/output reports 1.6 GB/s which is 3.6x slower than my Mac Studio, and 3x slower than my Intel (which has the same M.2 stick). I'm told that Intel and Apple are just better at this, but I wish I understood why. "

Can anyone here answer why this is?


Apple made fsync a no-op.

You have to make a different call (fcntl with F_FULLFSYNC) to get a real sync on macOS.

So tons of stuff is faster because it's not actually writing to disk.


Plus he isn’t using oflag=direct, so since the output file is small it isn’t even making it to disk; I think it's only being sent to the page cache. I’m afraid he is testing CPU and memory (bus) speeds here.

oflag=direct writes directly and bypasses the page cache.


Exactly. Something is very fishy if this system only writes 1.6 GB/s to the page cache. Probably that dd command line quoted in the article is incomplete.


*she


Does anyone else see llamafile running under Wine on Linux?

Edit: After the download I did a simple chmod +x llava-v1.5-7b-q4.llamafile; ./llava-v1.5-7b-q4.llamafile


There's a simple fix for that.

    # Install the APE loader, which runs Actually Portable Executables natively
    sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
    sudo chmod +x /usr/bin/ape
    # Register binfmt_misc handlers so the kernel hands APE binaries (which start
    # with an MZ magic that Wine otherwise matches) to the loader instead of Wine
    sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
    sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-fil...


Mmm, I wonder how well this would work on a mobile device. Maybe I'll try grabbing my Ubuntu Touch device here in a sec...


(For anyone who was curious: it does not, for memory reasons.)


So is Nvidia in trouble now because Intel can be used instead for faster/cheaper inference?


Today being today, I must ask: has anyone actually tried this?


Awesomeness. thank you for sharing!


The RAM is not on the CPU die on a Mac. It's in the same package, but it's still regular LPDDR chips.


Posted too early.


I know this post is focused specifically on CPU performance, but the section on the performance of the Mac Studio seems to deliberately avoid directly mentioning that machine's GPU, let alone benchmarking against it. I think it would have been interesting to see a straightforward comparison of what compute performance and memory bandwidth (as measured by the prompt processing and token generation speeds, respectively) are achievable with reasonable optimization effort on the CPU vs the GPU when they're attached to the same memory subsystem.


It would be good to see some independent verification of this claim. HN has previously [1] fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model way below the size of the model, which should have failed a basic smell test and indeed was debunked shortly after. Justine Tunney appears to enjoy extreme superstar status here, and it's hard to overstate the degree of social pressure that needed to be overcome at the time for the skeptic position to reach fixation (to begin with, what other LLM developments even hit upvote numbers like the +1300ish there or the +712 here at the time of writing?).

[1] https://news.ycombinator.com/item?id=35393284


> Justine Tunney appears to enjoy extreme superstar status here

This is true, and for sure pretty much all humans can benefit from increased skepticism (though not cynicism), but that superstar status is achieved from numerous impressive works. Cosmopolitan C and Actually Portable Executable were some of the things in the past that alone were worthy of significant respect, and for many people (like myself) these were our first introduction.

Speaking only for myself, I have a high opinion of Justine on technical merits. I'm sure she makes mistakes like all humans. I can tell she gets excited by discoveries and the chase, and that probably does sometimes cause premature celebration (this is something I struggle with so it's recognizable to me haha), but being wrong sometimes doesn't erase when you're right, and she has been spectacularly right a lot more times than most people I know.

There have been some personality clashes between Justine and others at times, and unfortunately it's situations where only part (sometimes a small part) of it was public, meaning we can only take people's word for what happened. Given my ignorance, I choose to withhold judgment here, but even if I didn't (and assumed she was guilty) it doesn't change the technical merits and it certainly wouldn't dissuade me from seeing what she's working on now.

So when I see stuff from Justine come out like this, it gets my attention. Would it get my attention if the same thing were posted by somebody whose name I don't recognize? Likely not, but I think that is (unfortunately) part of being a human. We aren't capable (yet!) of evaluating everything on technical merit alone because the sheer volume of material far exceeds our time. Therefore we use other (less reliable) signalling mechanisms as a way to quickly decide what is worthy of our time investment and what may not be. Reputation/name recognition is an imperfect, but better-than-random-chance, indicator.


I don't know, my first (and main) impression of them was actually in the context of the llama.cpp mmap story, as I was somewhat involved in the project back then, and there I thought their impact on the project was predominantly negative.

While they introduced a mildly beneficial change (mmap-based model loading), the way in which this was done was not healthy for the project - the changes were rammed through with little regard for concerns that existed at the time about backwards compatibility and edge cases that might be broken by the half-baked patch. Justine came across as self-aggrandizing (in the sense of "acting as if they ran the place", presenting their proposals as a plan that others must follow rather than suggestions) and overly eager to claim credit (epitomized by the injection of their own initials into the magic number file format identifier next to those of the project originator's, and the story of the hapless other author of the mmap changeset who was at first given a token acknowledgement but then quickly sidelined). Arguments for the inclusion of the patch seemed to be won by a combination of half- and untruths like those about memory savings and the sudden participation of a large number of previously uninvolved sycophants.

It is fortunate that Georgi handled the fallout as well as he did, and that he in fact had amassed the social capital necessary to survive his heavy-handed solution (soft-banning both JT and their most prominent detractor). A less successful project would probably have found itself captured or torn apart by the drama.

There is nothing wrong with holding people in esteem for their achievements, but in this case the degree of esteem really seems to be excessive. This is not a matter of simply being annoyed that people like "the wrong thing" - the mmap situation was significantly exacerbated by the presence of irrational/excessive supporters of Justine's as well as the irrational/excessive detractors that emerge wherever the former exist.


I would like to know more about the mmap situation, as what I saw on the surface could warrant some concern. Being somewhat involved you would probably know better than I as I was just an observer reading the thread after-the-fact. It seemed like the biggest accusation was the plagiarism (or "collaborating" but mostly taking somebody else's code).

Did anybody besides the two parties see the code develop, or does anybody else have knowledge of this? Or is it just his word vs. hers? Do you have any suggested reading to get more perspective other than just the github thread and HN thread? (really asking. these aren't rhetorical questions)

Reading the thread, I do think there are a lot of opportunities to read in confirmation bias. For example if I start reading that thread with the idea that Justine is coming in to hijack the project and make herself the hero that it needs and deserves, and to get her initials embedded in there as a permanent tribute to her own glory, I can see that. But if I read it as her coming in with cool work that she's excited about, and had to come up with a new format and couldn't think of a name (naming things can be really hard) and just stuck in one of the first things that came to mind (or even used as a placeholder prior to discussion), I can see that as well.

I absolutely don't want the truth covered up, but I also don't want to accept as true things that aren't true, especially where the implications are toward somebody's character. I'm a big "benefit of the doubt" kind of person.


My sense is that the part about credit/collaboration was actually somewhat overblown among the detractors. What roughly happened, as far as I can remember, is that JT and another person worked on mmap together with about equal contribution, though the other person might have been the one to have initiated the idea (and solicited help to push it through); then at some point JT decided to make a PR to the main repository in their own name, but crediting the other collaborator as a coauthor, which may or may not have been coordinated with the other person.

After that, though, in a fairly characteristic fashion, JT started fielding adulatory questions from their fans (on GitHub, but also on HN, Twitter and possibly other media) about the change, and quickly switched to simply referring to it as their own, with no mention of the other contributor. The other contributor expressed some misgivings about having their contribution erased, which were picked up by a growing set of people who were generally resentful about JT's conduct in the project.

As far as I can tell, when confronted about it, JT at no point explicitly denied what the other person did (and I think the commit logs should all still be there in the fork), but at some point the other person just decided to stop pushing the issue due to being uncomfortable with becoming a pawn in the fandom war between JT fans and antis.

My personal main gripe with JT really was the tone they adopted in the GitHub discussions, and the effect of the large numbers of drive-by supporters, who were often far less restrained in both unfounded claims about Justine's accomplishments and attacks on any critics. (At this point I'd also like to note that I consider some sibling comments to be uncomfortably hostile in a personal way, like the "hit piece" one.)

I think that as a public persona, especially one who actively pursues publicity, you have some responsibility to restrain your followers. Justine, I get the sense, instead uses them as deniable proxies, as seen for example in the instances where, instead of straight up putting their signature on the "RAM usage reduced to 6GB" claim, they chose to post a collage of screenshots of supporters making it.


This could all be true, but it's hard to evaluate these claims on their own. Not being involved in any way, all I can do is conclude that there is some friction in that community. It's possible that JT is toxic, it's possible that you are toxic, it's possible that neither of you is generally toxic but something about your personalities causes your interactions to become toxic, it's even possible that neither of you was toxic in any way but your impression of things after the fact is as if Tunney had been toxic. Sometimes one has to stop and think about these things and figure out how to smooth things over, and sometimes it's not possible to smooth things over.


I didn't have any direct interactions with JT then or now - while it was hard to ignore the discussion as an onlooker, it did not touch upon any parts of the code that I was involved with. This seems to be one of the topics where everyone who is even tangentially involved is under a default suspicion of being biased in one direction or another.


>This is true, and for sure pretty much all humans can benefit from increased skepticism (though not cynicism), but that superstar status is achieved from numerous impressive works.

It is achieved through a never-ending parade of self-aggrandizement.

What Justine is very good at is presenting trivial concepts from a world which few front end developers understand in a language that most front end developers understand.

I had the misfortune of having to find out about her because of how thoroughly she polluted the Google search space for Lisp with her implementation of SectorLISP. For some reason Google decided that SectorLISP needed to be in the top 5 results for every query about `minimal lisp with quotation`, even when quotation wasn't implemented in her version.


> presenting trivial concepts from a world which few front end developers understand in a language that most front end developers understand

Completely ignoring the JT discussion, the argument that something is trivial in some area does not really hold. 1) Science is mostly "just" connecting the dots, and 2) landmark discoveries tend to look trivial in hindsight almost by definition, because they have to be straightforward enough to be widely adopted.


I am also quite impressed by Tunney’s technical chops (Cosmopolitan C blew my mind), but, as with others, I am somewhat put off by the self-aggrandizing, self-satisfied tone and I-know-best attitude that are always on display. Maybe it’s a cultural, generational, or age thing? My younger coworkers tended to sound like this, and tended to minimize others’ contributions, which seems to be the case with the mmap() situation.


This comment reads like real scientific skepticism, but from my recollection of events it is more of a hit piece that takes what should be a technical discussion and drags in a bunch of personal baggage. In particular:

> HN has previously fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model way below the size of the model,

is not true at all. Someone else made the claims about 6GB RAM usage for a 30B model. I remember reading it at the time and thinking "Yeah, that doesn't make sense, but the loading time improvement is immense!" And it was - I run all my LLMs locally on CPU because I don't have dedicated hardware, and jart's work has improved usability a lot.

> and it's hard to overstate the degree of social pressure that needed to be overcome at the time for the skeptic position to reach fixation

I was reading the same HN discussions you were at the time, and it was pretty trivial to see that the loading time claim held up, and the RAM claim was dubious and likely simply due to not understanding some effect of the change completely. Heck, jart's own discussion of the topic reflected this at the time.

For the current change, I feel like your comment is even more misplaced. The blog post linked to for this story has a huge amount of detail about performance on specific processors (Skylake, Alderlake, RPi5/4, M2 Ultra, and 7995WX) with specific models. So when you say:

> It would be good to see some independent verification of this claim.

What I hear is "4bpp thinks there's a real risk the numbers in the linked post are fabricated, and jart is just trying to get attention."

And that doesn't seem reasonable at all, given the history of her work and the evidence in front of us.


The loading time improvements largely held up, and on balance the mmap contribution was ultimately good (though the way it was implemented was really quite problematic, as a matter of process and communication). However, as I point out in https://news.ycombinator.com/item?id=39894542, JT quite unambiguously did try to cash in on the "low memory usage" claim - uncritically reprinting positive claims by others about your own work that otherwise would have been largely invisible should really not be treated differently from making those claims yourself.

I do think that there is a real risk that the numbers are wrong (not necessarily "fabricated", as this implies malfeasance, but possibly based on an erroneous measurement insufficiently questioned due to an excess of trust from themselves and others, as the mmap ones were). This is also in part based on the circumstance that at the time (of the mmap story, when I was more involved in the project) I was actually involved in trying to optimise the SIMD linear algebra code, and unless llama.cpp has since switched to a significantly less performant implementation, the proposition that so much more performance could be squeezed out strikes me as quite surprising. Here, your intuitions may say that Justine Tunney is just so brilliant that they make the seemingly impossible possible; but it was exactly this attitude that at the time made it so hard to evaluate the mmap memory usage claims rationally and made the discussion around them much more dysfunctional than it had to be.


I distinctly remember most of the people in the comments misunderstanding kernel memory paging or learning about it for the first time.

It genuinely did make llama.cpp a lot more usable at the time.
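For anyone who never dug into why it helped: the win came from mapping the weights file instead of read()ing it into freshly allocated buffers, so pages fault in lazily and stay in the shared page cache across runs. A stripped-down sketch of the idea (not llama.cpp's actual loader):

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Map a weights file read-only. No copy is made up front: the kernel pulls
    // pages in on first access, and a rerun (or another process mapping the same
    // file) reuses whatever is already sitting in the page cache.
    const void *map_weights(const char *path, size_t *size_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return nullptr;
        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
        void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);  // the mapping keeps the file referenced
        if (p == MAP_FAILED) return nullptr;
        *size_out = (size_t)st.st_size;
        return p;
    }

I suspect this is also where the confusing RAM numbers came from: file-backed mapped pages show up differently from heap allocations in the usual process memory stats, so a naive reading can look like the model shrank when it didn't.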


All the core llama.cpp devs are superstar devs and 10x devs or whatever you want to call a super smart person who is also super productive and very good with applied calculus. Jart is apparently very smart, but their relationship with this project was not without turbulence, and at present they (jart) are not a core dev of llama.cpp. So for a while a lot of their (I'd like to write "her", but not sure if that's correct) actions seemed to be aimed at getting attention, and perhaps particularly the attention of the same folk.

In contrast, ggerganov, slaren, and JohannesGaessler seem to have never chased this sensationalist superstar status, and instead let their work speak for them. You'll barely find comments by these people on HN, while jart every so often finds a way to show up on HN. And this behaviour on jart's part now bears fruit - for example, Phoronix's Michael Larabel will praise jart for their work on llamafile, completely obscuring the fact that it is largely based on the wonderful work of ggerganov et al.


When they claimed to drastically improve memory utilization through the use of memory maps, despite not doing so, and then started a huge controversy which derailed the project, I would say they were a 0.1x dev, not a 10x dev.


>HN has previously [1] fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model way below the size of the model, which should have failed a basic smell test and indeed was debunked shortly after.

Where did Justine claim this? The link you provided is Justine saying that she doesn't have an explanation for the reduction in RAM and that readers shouldn't treat it as fact yet:

>The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.

Was the link supposed to show the false claim or the debunking of the claim?


Plenty of claims about it, e.g. here as a "fact": https://github.com/ggerganov/llama.cpp/discussions/638#discu.... I don't think occasional expressions of lingering doubt (still couched among positive language like calling it a "miracle") can offset all the self-promotion that clearly seeks to maximise visibility of the implausible claim, even as it is attributed to others, as for example in https://twitter.com/JustineTunney/status/1641881145104297985... . A cereal manufacturer would probably be held responsible for package text like "Fruity Loops cured my cancer! - John, 52, Kalamazoo" too.


Where's the 30B-in-6GB claim? ^FGB in your GH link finds [0] which is neither by jart nor by ggerganov but by another user who promptly gets told to look at [1] where Justine denies that claim.

  [0] https://github.com/antimatter15/alpaca.cpp/issues/182
  [1] https://news.ycombinator.com/item?id=35400066


These all postdate the discussions that I linked (from March 31st). By April 1st, JT themselves seem to have stopped making/boosting the claim about low memory usage.


I used your link.


I don't read that as a claim of fact at all. From the link you shared:

>Now, since my change is so new, it's possible my theory is wrong and this is just a bug. I don't actually understand the inner workings of LLaMA 30B well enough to know why it's sparse.

I haven't followed her work closely, but based on the links you shared, she sounds like she's doing the opposite of self-promoting or making outrageous claims. She's sharing the fact that she's observed an improvement while also disclosing that it could be experimental error. That's how open-source development is supposed to work.

So far, I have seen several extreme claims from Justine that turned out to be true (Cosmopolitan Libc, APE, and llamafile all work as advertised), so I have a higher regard for Justine than for the average developer.

You've claimed that Justine makes unwarranted claims, but the evidence you've shared doesn't support that accusation, so I have a lower regard for your claims than the average HN user.


The very opening line says

> I'm glad you're happy with the fact that LLaMA 30B (a 20gb file) can be evaluated with only 4gb of memory usage!

The line you quoted occurs in a context where it is also implied that the low memory usage is a fact, and that there might only be a bug insofar as the model is being evaluated incorrectly. This is what is entailed by the assertion that it "is" sparse: that is, a big fraction of the parameters are not actually required to perform inference on the model.


I think you are making a lot of soup from very little meat. I read those links the same way mtlynch read them. I think you're looking for a perfection of phrasing that is much more suited to peer-reviewed academic papers than random tweets and GitHub comments taken from the middle of exploring something. Seeing your initial comment and knowing little about the situation, I was entirely prepared to share your skepticism. But at this point I'm much more skeptical of you.


You can simply check the pull request on llama.cpp on GitHub. JohannesGaessler (a core maintainer) has already run the code and says it's an impressive speed-up. There isn't a thorough review by any of the core maintainers yet, but this is very likely exactly what Justine says it is: various significant and insignificant speedups.


What's the point of your comment if you're not going to do the work yourself? If you don't have something nice to say then don't say it.

The "hey this may or may not be true so someone go figure it out" is lazy, self-gratifying and pointless.


I think it's very helpful for someone to point out that the source has been shown to be unreliable before, and we should wait for more verification from others knowledgable in the space.


Agreed. I think there's a blurry gray line between pointing out a potentially unreliable source and a lazy dismissal, but if there's reasonable doubt I think it's good for HN. If the doubt isn't reasonable, it will be torn apart by other commenters, and then it's an explicit discussion that people can read and decide on


If you give such comments a lot of credence without doing your own verification, then you open yourself up to what is essentially a social denial-of-service attack.


It's really popular online. I think that's because many people here read a lot of this content but don't actually have the skill or background to do analysis. So they give us history rather than examination. Which has some value, I suppose.


> and indeed was debunked shortly after

I was also surprised that she continues to mention the mmap thing in a positive light even after the facts about the claim were settled to the contrary, even disregarding the whole attribution fiasco.


Unfortunately, I've pulled and built the PR branches and have only seen about a 5% speed increase on a modern Zen 4 EPYC system. Hardly front-page-worthy news.

It's too bad there doesn't seem to be anyone else in this thread trying to actually replicate the results and evaluate these claims on their merits.


If anyone wants to follow along with EPYC optimizations, please subscribe to: https://github.com/ggerganov/llama.cpp/pull/6412#issuecommen...


Thanks. Watching with bated breath!


So, I can now run it on my 2015 Macbook with 8GB RAM?


Re: funding

My friend suggested nominating Justine for open source contributions in an internal Microsoft programme (the winner gets $10k). They did not even want to add her to the list of potential nominees because her software is not used at MSFT. It speaks volumes about the corporate culture and shows what they really think about supporting OSS.


TL;DR: unroll the outer two loops of matrix multiplication
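For anyone who wants that TL;DR in code: computing a small tile of the output per iteration of the two outer loops lets each value loaded from A and B be reused from registers several times, which is what keeps the FMA units fed. A minimal sketch of the idea (as I understand it, the real llamafile kernels use much larger tiles plus SIMD intrinsics and handle quantized types and edge cases; here M and N are assumed even for brevity):

    // C = A * B with A (MxK), B (KxN), C (MxN), all row-major float32.
    // The outer two loops step by 2, so each inner iteration produces a
    // 2x2 tile of C and reuses every loaded a/b value twice.
    void matmul_2x2(const float *A, const float *B, float *C,
                    int M, int N, int K) {
        for (int i = 0; i < M; i += 2) {
            for (int j = 0; j < N; j += 2) {
                float c00 = 0, c01 = 0, c10 = 0, c11 = 0;
                for (int k = 0; k < K; ++k) {
                    float a0 = A[(i + 0) * K + k];
                    float a1 = A[(i + 1) * K + k];
                    float b0 = B[k * N + (j + 0)];
                    float b1 = B[k * N + (j + 1)];
                    c00 += a0 * b0;  c01 += a0 * b1;
                    c10 += a1 * b0;  c11 += a1 * b1;
                }
                C[(i + 0) * N + (j + 0)] = c00;  C[(i + 0) * N + (j + 1)] = c01;
                C[(i + 1) * N + (j + 0)] = c10;  C[(i + 1) * N + (j + 1)] = c11;
            }
        }
    }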


Shouldn't this have been done in a library instead of a specific project? Then others could also profit from it.


[flagged]


Every release a perf improvement?


Count how many times "I" and "me" are said in the article. I have a nice Greasemonkey script that blocks both domains on my old machine. I guess I forgot to import it on this one.


I feel your pain here. I hate it when I am forced to post online about a website I don’t like because I’ve both forgotten to stop myself from being able to look at it and forgotten not to click on it.


Someone doing performance work can state they did it. It's a public service we all benefit from.

If the rivalry further adds urgency to improve: great.


I did a search and I have no idea what you are talking about.


https://github.com/ggerganov/llama.cpp/issues/91

Justine was kicked out of llama.cpp for introducing changes before the rest of the maintainers approved them.

Much drama has been had since then and I've lost interest in both projects.

It's just been exhausting wanting to build good software without ego in this space. Everyone is trying to get rich quick before the inevitable AI ice age starts and all these skills are again useless.


The PR associated with the issue was merged after the approval of the owner of the repo (https://github.com/ggerganov/llama.cpp/pull/613). And there are other, more recent contributions.

Where do you see any drama?



Finally, a useful link demonstrating all the drama.


The last commit from Justine that I can see in the llama.cpp repo is from a week ago, so whatever drama there was appears to have been at least partially resolved.


> https://github.com/ggerganov/llama.cpp/issues/91

> Justine was kicked out of llama.cpp for introducing changes before the rest of the maintainers approved them.

I followed your link and found nothing regarding this. What am I missing? Seems to me that you're just casting aspersions w/o backup.



