I feel like there has to be another shoe to drop here; this seems almost too good to be true.
> Alpaca shows that you can apply fine-tuning with a feasible sized set of examples (52,000) and cost ($600) such that even the smallest of the LLaMA models—the 7B one, which can compress down to a 4GB file with 4-bit quantization—provides results that compare well to cutting edge text-davinci-003 in initial human evaluation.
This is the most exciting thing. The cost of fine-tuning is rapidly coming down, which means everyone will be able to train their own models for their own use cases.
Looking for the contrarians on HN: what is being left unsaid here that people like me and Simon might be getting too optimistic about? What are the known downsides that people in academia already know about?
We saw this happen with Stable Diffusion, and it's not surprising we're seeing it happen here. There is a lot of interest in taking these models, which are within striking distance (a single order of magnitude) of running inference and training on consumer-level hardware, and a lot of energy is going into the optimizations that can get us there.
Generally speaking, research is not done with consumer usage in mind, so what this is, like Dreambooth etc. were for Stable Diffusion, is the gap between research software and accessible software being bridged.
> what is being left unsaid here that people like myself and Simon might be getting too optimistic about?
The past week has felt like a wake-up call for enthusiasts. Running models locally has been possible for a while (even small, fairly coherent ones), and the majority of recent "improvements" have come from implementing the leaked LLaMA model.
The results from 7B are an improvement on what we had a year ago, but not by much. We're learning that there's room to optimize these models, but also that size matters. ChatGPT and 7B are both great at bullshitting, but you can feel the difference in model size during regular conversation. Adding insult to injury, it will almost always be faster to query an API for AI results than it will be to run it locally.
Analysis: Things are moving at a clip right now, but people expecting competitive LLMs running locally on their smartphone will be disappointed for quite a while. As the technology improves, it's also safe to assume that we'll find ways to scale model intelligence with greater resources, and the status quo will look much different than it does today.
>API for AI results than it will be to run it locally.
True, and remotely called APIs will always be the main driver of the AI craze. Only niche hobbyists will be running these models locally.
There is no company on the planet that would benefit from giving people the means to run LLMs locally. As a result, hacks and leaks will be the only way individuals can run LLMs outside of heavily monitored remote API calls.
Companies like Facebook can harm their competitors by releasing models.
Facebook is not a major player in the LLM field; OpenAI's technological advantage is too large. BUT they can reduce the expected gains of their competition by providing less powerful alternatives for free.
Facebook may destroy their social networks by releasing LLMs. What will happen to spam when it could be driven by local LLMs? You'll have realistic accounts, with realistic pictures, posting realistic messages only to spam links or ads x% of the time.
(Though to be clear I do hope Facebook does release more models because I want to play with them)
I wouldn't count on it. I definitely think Facebook's remaining managers are scrambling to contain this LLaMA leak, though. They don't want the liability or responsibility of managing this project, and now everyone and their mother wants to play with it. Battling spam can largely be managed through the usual methods, but fighting a leaked model is much harder.
Oh, how I hope there is a project ongoing at Apple right now to implement some kind of on-device LLM that can drive the phone a la Siri. Any time I open the Bing app on my iPhone to answer a question I think of how useful it could be to ask it to update my calendar, fill out a dinner order, etc. How much time do we collectively waste putzing around with delivery apps etc?
It would be neat, but iPhone hardware has limited memory (and iOS limits that usage further). It would be much faster and higher-quality to run it as an API request like it is today.
Given the incentive, Apple could push its hardware in that direction.
Or they'll do a home network device, implementing it like a HomeKit hub. The plugged-in resource has more computational power and feeds it to iPhones when nearby. Disconnected, the iPhone uses a much more rudimentary model or falls back to Siri or a cloud service.
In the last 50 years of AI research, has any academic institution ever provided open-source, easy-to-use tools like the stuff big companies have put out in the past 5 years?
>what are the known downsides that people in academia already know about?
I work in biology with some interface to wildlife managers/'end users' of our biological insights. I was hoping we could use these models as advanced chatbots so that our end users could ask biological questions from our data.
However, even the replies to the most basic biological questions from LLaMA/ChatGPT/OpenChatKit are hallucinated, even when setting the temperature to 0. We simply can't trust any of these answers; the bot needs to be as truthful as a human in its replies, and none of them are so far. So what good are these models then?
This is the wrong way to go about it. Instead of trying to force that information into the model via training, you need to set it up so that it "thinks out loud" as it answers the question, generating intermediate queries against your data sources in the process.
Take a look at this chat log from Bing. Note that the user only sees the stuff that is marked #message and #suggestions.
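The shape of it, stripped down to a sketch (`llm` and `run_query` are placeholders for whatever model and data source you actually have, not a real library):

    # Sketch of "think out loud, then answer": the model first writes a query
    # against your own data, you run it, and only then does it answer,
    # grounded in whatever came back.
    def llm(prompt: str) -> str:
        raise NotImplementedError("call your model of choice here")

    def run_query(query: str) -> str:
        raise NotImplementedError("run against your own database here")

    def answer(question: str) -> str:
        # Step 1: ask the model what it needs to look up (hidden from the user).
        query = llm(
            "Write a single search query that would find the data needed to "
            f"answer this question.\nQuestion: {question}\nQuery:"
        )
        # Step 2: fetch real records from your data source.
        evidence = run_query(query)
        # Step 3: answer using only the fetched evidence.
        return llm(
            "Answer the question using only the evidence below. "
            "If the evidence is insufficient, say you don't know.\n"
            f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
        )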
I've been playing a lot with the langchain Document Loaders - their examples make the results look more impressive than they are, or perhaps that's because our documents are 'special' (heavy in taxonomy/biology). I've been priming it with 'if you don't know, say that you don't know, don't invent anything' and it usually ends up saying 'I don't know' to questions it can clearly answer. But yeah, working on that part!
It seems still unclear how much quality loss there is compared to the best models. What's really needed is systematic evaluation of the output quality, but that's tricky and relatively expensive (compared to automated benchmarks), so I understand why it hasn't happened yet.
Edit: I just tried it with a single task of my own (that I've successfully used with ChatGPT and Bing) and it flubbed it horribly, so this model at least is noticeably inferior to the SOTA, which is not surprising given how small it is.
It does quite well at simple reasoning. More complex stuff, it does struggle sometimes, but I am impressed by the output and the fact that it even runs on my computer.
I have had a few questions where it gave a completely nonsensical answer, but by fiddling with the parameters, I've gotten more usable output. Have you tried lowering the temperature and increasing top_p?
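For anyone unsure what those knobs actually do, here's a toy illustration of temperature and top_p applied to a next-token distribution (not any particular runtime's implementation):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(logits, temperature=0.8, top_p=0.9):
        # Temperature sharpens (<1) or flattens (>1) the distribution.
        probs = np.exp((logits - logits.max()) / temperature)
        probs /= probs.sum()
        # top_p keeps only the smallest set of tokens covering that much
        # probability mass, cutting off the long tail of unlikely tokens.
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        keep = order[:cutoff]
        return rng.choice(keep, p=probs[keep] / probs[keep].sum())

    # Five-token toy vocabulary; lower temperature and lower top_p both make
    # the unlikely tokens (the usual source of nonsense) harder to pick.
    print(sample(np.array([2.0, 1.0, 0.5, -1.0, -3.0]), temperature=0.7, top_p=0.9))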
I always found this to be a strange POV to have on LLMs. IMO, it's not humans tricking/gaming the AI; rather, ChatGPT has tricked you into believing it's smarter than it actually is. (In human terms, ChatGPT is just more articulate than LLaMA.)
It's a subtle distinction, but I think it shapes and reflects whether you view AI as a tool for humans or as a replacement.
I don't understand why that is a bad thing. If your goal is to make an AI assistant, then you should be optimizing for giving answers that real users find useful, not trying to impress other AI researchers.
The big failure mode is they can hallucinate nonsense that isn't obviously nonsense. You have to check any facts against expert sources. At that point, you could just email an expert who can use their own LLM to whip up an answer and check the facts themselves.
That's a big problem if you're using a language model as a search engine. The trick is to learn how to use them for the things that they're good for outside of that.
^^ This. For instance, LLMs are really good at turning natural language into SQL. If you know SQL, you can read the result and make sure it looks right, and it's much faster and easier than writing the SQL by hand.
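Something like this, where the schema and question are made up for illustration and the prompt goes to whatever LLM you use:

    # Toy natural-language-to-SQL flow: the prompt carries the schema, the
    # model drafts the query, and someone who knows SQL reviews it before it
    # ever touches the database.
    SCHEMA = """
    sightings(id, species, observed_at, latitude, longitude, observer_id)
    observers(id, name, organization)
    """

    def sql_prompt(question: str) -> str:
        return (
            f"Given this schema:\n{SCHEMA}\n"
            f"Write a single SQL query answering: {question}\nSQL:"
        )

    prompt = sql_prompt("How many distinct species were observed per organization in 2022?")
    print(prompt)  # send to your LLM, then read the SQL it returns before running it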
The first catch is that someone needed to pay the enormous up-front cost to train the base model, then release it under a flexible enough license for your use case.
The second catch is that you would get much higher quality out of the 65b model, but would need to lay out a few thousand for the hardware.
The third catch is that you need the fine-tuning data, but that seems easier than ever to create from more capable LLMs.
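A sketch of what generating that data from a stronger model can look like, assuming the older openai 0.x Completion client with an API key set in the environment (the seed instructions here are made up):

    import json
    import openai  # assumes openai 0.x and OPENAI_API_KEY in the environment

    # Generate (instruction, output) pairs from a stronger model, then
    # fine-tune the small model on them.
    seed_instructions = [
        "Explain photosynthesis to a 10-year-old.",
        "Perform the following arithmetic operation: (5+7)*3.",
    ]

    examples = []
    for instruction in seed_instructions:
        resp = openai.Completion.create(
            model="text-davinci-003",
            prompt=f"Instruction: {instruction}\nResponse:",
            max_tokens=256,
            temperature=0.7,
        )
        examples.append({
            "instruction": instruction,
            "input": "",
            "output": resp["choices"][0]["text"].strip(),
        })

    with open("finetune_data.json", "w") as f:
        json.dump(examples, f, indent=2)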
The 65b model now runs on the 14" MBP with 64GB of RAM for $3,500.
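Napkin math on why it fits, counting weights only (quantization block overhead and the KV cache add a bit on top):

    params_65b = 65e9
    params_7b = 7e9
    bytes_per_weight = 0.5  # 4-bit quantization
    print(params_65b * bytes_per_weight / 2**30)  # ~30 GiB of weights
    print(params_7b * bytes_per_weight / 2**30)   # ~3.3 GiB, the "4GB file" upthread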
The base config is $2,500 to begin with, so not thousands plural; it is one grand more than the base config, and a lot of devs/pros go for 32GB anyhow.
Apple laptops also have atypically high resale value; I usually sell mine after a couple of years for 50%+ of what I paid for it. By the by, Apple has a two-week return policy. ;)
I think only your first catch applies, but if you can come up with the right MVP even that might fall away.
It is still much faster and more cost-effective thanks to the unified memory. I'd be interested in building out a PC as soon as something comparable becomes available.
If you are going to spend $3,500 on a MacBook because it's more cost-effective, you might as well go all the way and buy an A6000 for $4,095, which is IIRC 5-10x faster than a MacBook.
I am agnostic about Mac vs. PC, but people sure like to chime in with this sort of comment.
First of all, it is $4,700 on Newegg, and secondly, it is out of stock. That also leaves me with the rest of the computer to put together, which is not $0. And finally, it would absolutely not be 10x faster for this use case than a 64GB MBP.
The advantage of going with Apple is that I can pick one up immediately, use it as a laptop (soon with Asahi Linux once it fully matures), and easily sell it if need be. You might not see the value in that; that's alright, people have different use cases and needs.
At a guess: assuming quality scales with size, the model in the data centre is always going to outcompete the model on the device. So in any situation where you've got bandwidth >4800bps, why would you choose the model on the device?
Your own fine-tuning, no restrictions on output, privacy/security, and, if you have a reason to produce a lot of output, it'll be cheaper. Use ChatGPT if you only want to use it occasionally, you don't care about privacy/security in this context, the output restrictions don't bother you, and having the best possible language model is the most important thing to you.
Right, not sending your text out to third parties is absolutely huge for certain industries. ChatGPT has already been banned by some companies, so the issue is whether or not the language model can be used at all.
I'll add the possibility of network-based service downtime to the list.
Nothing much is unsaid, but Hugging Face has been doing this for smaller models. What is happening is that they are distilling Davinci into LLaMA, and it's been done before.
But this performance will still lag behind OpenAI for a bit; within a month or a few we should have something almost comparable. It's just frustrating that Stanford is blocking the code and model release when it's only a matter of time before a full copy comes out.
I wonder if this is legally enforceable. If Microsoft is right with Copilot that training models counts as "fair use", then the only thing stopping people from fine-tuning their own models is these terms of use, which seem like they can be easily sidestepped by having an intermediate person/company "process" the outputs.
That has to run afoul of competition/antitrust laws and be unenforceable. Imagine if Ford tried to tell people they can't use their pickups to carry tools around on a new Honda plant construction site.
Wouldn't the first-sale doctrine protect that? As long as the Honda dealership isn't abusing the Ford trademark, my understanding is it's perfectly legal to buy a product, modify it (even by removing branding), and then resell it. Of course, with software you don't really buy it, you license it, so there are a lot more worms in that can. Cars were simple by comparison.
Charitably speaking, the researchers had little time to execute this, so they just ended up using the well-known OpenAI API. Still, it would be very useful if someone used LLaMA-65B instead of text-davinci-003 here.
Someone should ask the researchers, either via email or via a GitHub pull request; it shouldn't even be that hard to do.
In their readme they mention `output: str, the answer to the instruction as generated by text-davinci-003`.
And in the alpaca-data file there's this:
    {
        "instruction": "Perform the following arithmetic operation.",
        "input": "(5+7)*3.",
        "output": "60."
    },
Do we need human QA on the training data? It's contaminating the set! I wonder how incorrect data like this will be filtered out in the future. Unless we're ok with that kind of new math.
You can actually use AI to validate some of these answers! For questions involving math, it's easy to get the LLM to write a program to solve the problem, then use that to double-check the result. For example, with the following prompt to ChatGPT:
'''
Can a program that solves the following problem accurately be written, and if so, write it in Python:
{"Problem": "Perform the following arithmetic operation.", "input": "(5+7)*3"}
'''
I got the answer:
'''
Yes, a program can be written to perform the arithmetic operation given in the input. Here is a Python implementation:

input_str = "(5+7)*3" # input string
result = eval(input_str) # using eval function to evaluate the expression
print(result) # output: 36

The eval() function in Python evaluates the expression given as a string and returns the result. In this case, the input string "(5+7)*3" is evaluated to 36. The program then prints the result using the print() function.
'''
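For a whole data file you could also skip the LLM entirely for the plain arithmetic items and check them directly. A rough sketch, assuming the JSON format quoted above (the filename is a guess):

    import json
    import re

    # Flag arithmetic-style items whose "output" doesn't contain the value the
    # expression actually evaluates to.
    ARITHMETIC = re.compile(r"^[\d\s\.\+\-\*/\(\)]+$")

    def looks_wrong(item):
        expr = item["input"].rstrip(".")
        if not ARITHMETIC.match(expr):
            return False  # not plain arithmetic, can't auto-check it this way
        try:
            expected = eval(expr)  # tolerable here: expr is only digits and operators
        except Exception:
            return False
        return str(expected) not in item["output"]

    with open("alpaca_data.json") as f:
        bad = [item for item in json.load(f) if looks_wrong(item)]
    print(len(bad), "suspect items, e.g.:", bad[:1])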