Hardly. I've played a lot with the 7B, 13B, and 30B LLaMAs as well as the 7B and 13B Alpacas fine-tuned by Stanford. They do not have emergent abilities like being able to generate rhymes or, say, represent a movie plot as emoji. Even OpenAI's old text-davinci-003 (GPT-3.5, but text completion, not the chat models) far outperforms them. That said, I have hopes for a 65B 3-bit quantized, Alpaca-fine-tuned model. We'll see when someone spends the money to do the (more costly) 65B training. The Alpacas are also much more likely to go off the rails and start regurgitating their fine-tuning inputs. Either that, or OpenAI is doing a lot of post-processing on their end to hide the same problems in their LLM.
For now my IRC bots run the Alpaca 7B 4-bit. 13B was not a significant improvement for twice the computational time. But it's best to learn these models now, because as soon as OpenAI gets sued for the first time, all the Turing-test-passing older models without the legal butt-covering bolted on will be removed.
For me the easiest comparison between models is to give them an absurd but entirely possible request, like "Write me a fanfic where the Animorphs battle the truck from Duel, but in the style of Mark Twain". So far nothing else I've tried has done even as well as GPT-3.5, let alone GPT-4.
> Can you give separate summaries of the Animorphs book series and the film Duel?
The reply:
> Animorphs Book Series Summary: The Animorphs series, written by K.A. Applegate, is a popular young adult science fiction book series spanning 54 books and several companion novels. The story revolves around five teenagers—Jake, Rachel, Tobias, Cassie, and Marco—who are granted the power to morph into any animal they touch by a dying alien named Elfangor. The teens, known as the Animorphs, use their new abilities to battle an evil alien race called the Yeerks, who are secretly invading Earth by taking control of humans' minds. Throughout the series, the Animorphs learn about the complexities of war, friendship, and sacrifice as they try to save humanity from the Yeerk invasion.
> Duel Film Summary: Duel is a 1971 television movie directed by Steven Spielberg and based on the short story by Richard Matheson. The film follows a mild-mannered traveling salesman named David Mann (played by Dennis Weaver), who unwittingly finds himself in a deadly game of cat and mouse on a remote California highway. Mann is pursued by a mysterious and relentless tanker truck driver, who seems determined to kill him for no apparent reason. With escalating tension and suspense, the film delves into themes of fear, paranoia, and the primal fight for survival. Duel is notable for being Spielberg's directorial debut, showcasing his skillful storytelling and unique visual style that would later become hallmarks of his career.
Duel is a great film but requires patience compared to a lot of today's films. There's not much action for quite some time -- just a man driving through the desert. Even when things pick up, it's still slow pacing by today's standards. It's from another era, and I doubt anyone would make a film today with the same pacing.
Tried the same question with the 30B LLaMA and got the following:
The Animorphs book series was a young adult science fiction series written by Katherine Applegate and her husband Michael Grant, under the name K. A. Applegate. The series is made up of fifty-four books, first published between June 1996 and May 2001. The series revolves around five children given the power to transform into animals (the ability to "morph") as they are thrust into a war with an alien species called Yeerks that are determined to take over Earth.
Duel is a television film directed by Steven Spielberg and written by Richard Matheson, based on his short story of the same name. It stars Dennis Weaver as David Mann, a middle-aged salesman driving on a business trip. While traveling down a two lane highway through remote California desert, he is taunted in a life or death duel by the unseen driver of a massive and monstrous Peterbilt 281 tanker truck.
----
You have to remember to prompt it right. Like ChatGPT, it's based on autocompletion, but with ChatGPT a lot happens in the background before your text is sent to the model. My prompt and settings here were:
---
Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.
User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User: Can you give separate summaries of the Animorphs book series and the film Duel?
----
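That scaffolding is most of the trick: the front end re-feeds the whole transcript (the preamble plus every turn so far, ending with a dangling "Bob:") and lets the model complete the assistant's line. Here's a minimal sketch of how a front end might assemble that text before handing it to the model (build_prompt is a hypothetical helper, not llama.cpp's actual code):

    // Minimal sketch of prompt assembly for a completion-only model.
    // Hypothetical helper, not llama.cpp's actual code.
    #include <string>
    #include <utility>
    #include <vector>

    std::string build_prompt(const std::string &preamble,
                             const std::vector<std::pair<std::string, std::string>> &history,
                             const std::string &user_message) {
        std::string prompt = preamble + "\n";
        for (const auto &[user_turn, bob_turn] : history) {
            prompt += "User: " + user_turn + "\n";
            prompt += "Bob: "  + bob_turn  + "\n";
        }
        prompt += "User: " + user_message + "\n";
        prompt += "Bob:";   // trailing tag makes the model complete Bob's reply
        return prompt;
    }

ChatGPT presumably does a more elaborate version of the same wrapping on the server side, which is part of what happens in the background before your text reaches the model.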
How exactly do you get it to keep going? Every time I try a prompt like this in the playground it spits out a couple paragraphs and then refuses to generate anything further, even with tokens maxed out.
I never used the OpenAI playground; see if Poe.com will work for you. They give free access to GPT-3.5 and Claude, Anthropic's competitor, with a paid subscription for GPT-4 and Claude+.
That is my experience as well. I've tried various models but nothing comes even close to the current ChatGPT implementation (when it manages to stay up).
Highly recommend Animorphs. As an adult you can read through the books pretty quickly, but the story really holds up. It deals with some heavy topics - genocide, enslavement (even exploring the justifications of the enslavers), and PTSD, among others - but was still considered a kids' book.
It becomes obvious in the middle of the series when some of the books were written by ghostwriters, but the books are so easy to read that I don't really recommend skipping them. If you must, you could probably get away with reading the first ten and the last ten, but you should definitely read all of the Chronicles books.
Maybe the most striking bit of its exploration of the morality of war is that the series spends something like 45 books building up the alien Andalites as beneficent do-gooders who the team of protagonists expect to swoop in and fix everything (including one of the protagonists who is a young Andalite stranded on Earth), and then it turns out that they've been as morally worn down by the ongoing interstellar war as the protagonists have in their guerrilla conflicts on Earth and are effectively as hostile to humans as the original invaders.
> If you must you could probably get away with reading the first ten, last ten
Not sure that you can. If you were to skip any, probably only 31 through 39 are completely skippable, maybe some of the late 20s, but I would still read 29 and 30 at a minimum. Some of the teens and 20s might be skippable after 13, but there's a fair amount of world-building outside the Chronicles series in the 20s; and 40 onwards is setting up the endgame and then the endgame itself. 41 and 48 are both weird but also kind of key to finalizing the characters of the two cover characters in the endgame.
EDIT: actually 33 and 38 shouldn't be skipped either. They're Tobias and Ax books, and there are so few of those that they're all kind of essential, though maybe the Tobias books are just a little bit more essential.
I'm impressed by how ChatGPT-like it is, but it's also saying things like
“No, OpenAI does not have an API for dogs. They do, however, have an API for other animals, such as cats. To retrieve an image of a cat, you can use the OpenAI API for Dogs API and select the cat breed or type.”
I've tried all sorts of parameters including those exact ones. As for the huggingface stuff, it's not exactly clear how to use it without going down the python dependency rabbit hole. I am not confident I could get the correct python packages all together on Debian 11 to support running it. The llama.cpp stuff is very simple to compile and run comparatively.
> the alpaca 7B _4-bit_ [and presumably also 4bit for the 13B, 30B and larger parameter sets]
This is the wild card here, though, isn't it? OpenAI's ChatGPT likely uses more than 4 bits for its parameters. IIRC the original LLaMA params were 16-bit floats and they were quantized down to 4-bit. Considering that large amount of compression, they still do pretty OK, but not as good as ChatGPT. I wonder how the Alpaca/LLaMA models would do with 16-bit floating-point params (as they were originally trained)? What if they had gone with 8 bits for the params as a compromise?
EDIT: Come to think of it, unless you're using vectorized ops on a CPU, 4-bit and 8-bit math is going to run at the same speed (for most popular CPUs), is it not? So why did they go all the way down to 4 bits instead of stopping at 8 bits (other than to make the param files half the size)?
EDIT2: Looking through the alpaca.cpp code, there is mention of AVX, AVX2, AVX512 (and NEON on ARM), so it probably is taking advantage of vectorized ops where that's possible.
Not an expert on the matter so take this with a grain of salt, but I’d say the compression is also about VRAM/RAM, which seems to be the more limiting factor over inference speed.
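To make the memory point concrete, here's a simplified sketch of the kind of blockwise 4-bit scheme these .cpp ports use (an illustration in the spirit of ggml's Q4_0, not its exact layout): two weights per byte plus one float scale per block of 32, i.e. roughly 5 bits per weight here versus 16 for fp16, with plain round-to-nearest rounding.

    // Simplified blockwise 4-bit quantization sketch (illustrative only,
    // not the exact ggml Q4_0 layout): 32 weights share one float scale
    // and are packed two per byte, ~5 bits/weight vs 16 for fp16.
    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    struct Block4 {
        float        scale;        // per-block scale factor
        std::uint8_t packed[16];   // 32 weights, two 4-bit values per byte
    };

    Block4 quantize_block(const float *w /* 32 floats */) {
        Block4 b{};
        float amax = 0.0f;
        for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(w[i]));
        b.scale = amax / 7.0f;                      // map [-amax, amax] onto [-7, 7]
        const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
        for (int i = 0; i < 32; i += 2) {
            // round-to-nearest, then bias by 8 so each nibble is unsigned (0..15)
            int lo = std::clamp((int)std::lround(w[i]     * inv), -7, 7) + 8;
            int hi = std::clamp((int)std::lround(w[i + 1] * inv), -7, 7) + 8;
            b.packed[i / 2] = (std::uint8_t)(lo | (hi << 4));
        }
        return b;
    }

    float dequantize(const Block4 &b, int i) {      // recover weight i (0..31)
        int nibble = (b.packed[i / 2] >> ((i % 2) * 4)) & 0x0F;
        return (nibble - 8) * b.scale;
    }

With AVX2/NEON the nibbles get unpacked and multiply-accumulated in wide registers, so in practice inference tends to be limited by how fast the weights stream out of RAM rather than by the low-bit arithmetic itself.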
I think it’s going to be like movie special effects.
When Jurassic Park first came out, or even something like Star Trek: The Next Generation, it looked AMAZING. So, so realistic. But then, as time went on, new things showed us what realistic could be.
I think we actually got better at seeing.
Same thing here. The more time you spend with it, the more you notice things that don't quite work. And then the new thing solves those problems, but we'll find more wrongness.
The problem is that you've identified two distinct and non-overlapping sets of people as "everyone". The people applauding 3.5 when it came out were industry hype people. Even the critical voices were industry hype people, paid to assume the AI is powerful and write about the possible negative consequences of that assumption.
Now we’ve all gotten familiar with 3.5, and we’ve come to understand its limitations, so the public knows it’s not a “godlike” AI.
Luckily there’s a fresh new model, not technically different from the earlier one but it cost more money to build. The hype group can start again, citing the publicly known limitations of 3.5. But in 6 months we’ll understand what’s wrong with it, and the public will be talking about the limitations, just in time for 4.5.
In my personal testing, I throw some sophisticated use cases at LLMs - particularly chain-of-thought reasoning. None of the models out there are able to do this well, except for the OG GPT-3 Davinci-003. Even the newer turbo models are not as good.
I am playing around with GPT-4 this week though. Let’s see how that goes.
The newer turbo models are the ChatGPT models, and are worse than text-davinci-003, in my experience. The gpt-4 model is also not as good as the GPT-4 chat version, which is very odd.
FWIW, 7B is totally useless for the subset of non-English languages I've used; 13B a bit less so, but nowhere near as good as GPT.
GPT's performance in non-trivial translation tasks is unbelievable. All those articles mentioning jobs that are going to be replaced fail to mention that translators are probably going to be the first.
You need the translators to QA the output from GPT. It's less work but not much less, and more types of translation work becomes feasible when leveraging GPT. I'm guessing the job market for translators will grow, not decline.
"Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline."
This would be 175 billion 3-bit weights instead of 175 billion 16-bit (or even 32-bit!) weights. It massively reduces the size of the model and makes loading it into RAM on consumer computers feasible. The number of parameters stays the same.
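Back-of-the-envelope, ignoring the small per-block scale overhead that quantized formats add:

    175 \times 10^{9} \times 16\ \text{bits} \approx 350\ \text{GB}, \qquad 175 \times 10^{9} \times 3\ \text{bits} \approx 65.6\ \text{GB}

so the weights drop from roughly 350 GB in fp16 to about 66 GB at 3 bits.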
I've read the paper and to be honest I'm not sure what to make of it. Their headline benchmark is perplexity on WikiText2, which would not be particularly relevant to most users. If you look at the tables in appendix A.4 with some more relevant benchmarks, you'll sometimes find that straight RTN 4-bit quantisation beats both GPTQ and even the full 16-bit original! No explanation for this is given in the paper.
Some of those benchmarks have a pretty small sample size IIRC; it might just be coincidence that the noise introduced by RTN happens to slightly improve them.
GPTQ beats RTN on almost every benchmark at almost every size, though.
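For context, "straight RTN" (round-to-nearest) just picks a scale per group of weights and rounds each weight to the nearest level independently, whereas GPTQ quantizes weights one at a time and uses second-order information to adjust the not-yet-quantized weights to compensate for the rounding error. One common symmetric form of RTN at b bits (group sizes and zero-point handling vary between implementations):

    s = \frac{\max_i |w_i|}{2^{b-1}-1}, \qquad \hat{w}_i = s \cdot \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{w_i}{s}\right),\, -(2^{b-1}-1),\, 2^{b-1}-1\right)

At b = 4 that's a grid of integer multiples of s from -7s to 7s, which is essentially what the 4-bit sketch above does per block.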
Hm. I haven’t tried the local installs yet. However, when the Alpaca web demo was live, I did find it to be comparable (though not quite as capable) to davinci-003. It answered arbitrary factual questions about pop culture references, law, medicine, and programming. It generated rhymes and poems. (I didn’t try asking for the emoji thing, so can’t say anything about that.) It performed natural language tasks such as information extraction and summarization. And it did all of it coherently.
Be aware this file is a single ~8 GB 4-bit model (ggml-alpaca-13b-q4.bin) instead of the two ~4 GB parts (ggml-model-q4_0.bin, ggml-model-q4_0.bin.1) that most llama.cpp-style inference programs expect. You'll probably have to edit the line,
n_parts = LLAMA_N_PARTS.at(hparams.n_embd);
in chat.cpp (or main.cpp) to hard-code it so this one-file model is handled properly, like,
n_parts = 1;
Or rewrite the parameter-loading code to recognize and handle a non-standard weights file.
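A minimal sketch of that second option (count_model_parts is a hypothetical helper, not something in llama.cpp/alpaca.cpp): probe for which part files actually exist next to the model and derive n_parts from that instead of from the n_embd lookup.

    // Hypothetical helper, not part of llama.cpp/alpaca.cpp: count model
    // part files by probing "<model>", "<model>.1", "<model>.2", ...
    // A single-file model like ggml-alpaca-13b-q4.bin yields n_parts == 1.
    #include <fstream>
    #include <string>

    static int count_model_parts(const std::string &base_path) {
        int n_parts = 0;
        for (int i = 0; ; ++i) {
            std::string path = base_path + (i == 0 ? "" : "." + std::to_string(i));
            if (!std::ifstream(path, std::ios::binary).good()) break;
            ++n_parts;
        }
        return n_parts > 0 ? n_parts : 1;   // fall back to 1 if probing fails
    }

Then n_parts = count_model_parts(...) can replace the LLAMA_N_PARTS.at(hparams.n_embd) lookup; the variable holding the model path differs between chat.cpp and main.cpp, so treat this as a sketch rather than a drop-in patch.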
This sounded like a really cool idea, but I asked ChatGPT to do this for the plot of the movie The Shawshank Redemption, and there is no way I would ever have been able to guess that movie from the emojis it gave me. Perhaps GPT-4 does a better job at it.