This type of article (or press release, or whatever you want to call it) is exactly what makes the future so interesting.
The cat is out of the bag, the genie is out of the bottle, the confetti has left the cannon[0].
It's tempting to see a world dominated by Google Bard, ChatGPT, Bing Search, etc. And no doubt, they will be huge players, with services that are far more powerful than anything that can be run on the edge.
But. BUT. The things that we can do on the edge are incredible now. Just imagine a year from now, or two. These earth-shattering models, which seem to be upending a whole industry, will soon have equivalents that run on the edge. Without services spying on your data. Without censorship on what the model can/cannot say. Because it's all local.
When was the last time this happened? There will be players who publish weights for models that are free to use. The moment that torrent magnet link is published, it's out in the wild. And smart people will package them as "one click installers" for people who aren't tech-savvy. This is already happening.
So every time you're amazed by something chat-gpt4 says, remember that soon this will be in your pocket.
[0] the "confetti" idiom brought to you by chat-gpt4.
ME: What's another saying similar to: "The cat is out of the bag" or "The genie is out of the bottle"?
chatgpt: Another similar saying is "the toothpaste is out of the tube." Like the other two expressions, this phrase conveys the idea that something has been revealed or unleashed, and it cannot be undone or put back into its original state.
ME: Can you invent a new phrase that has the same meaning, which is completely new?
chatgpt: Certainly! Here's a new phrase with a similar meaning: "The confetti has left the cannon." This phrase suggests that once something is out in the open or has happened, it's impossible to reverse or contain it, much like confetti that has been shot out of a cannon during a celebration.
ME: What's the name for this kind of expression? Aphorism? Fable?
chatgpt: This kind of expression is known as an idiom. Idioms are phrases or expressions with a figurative meaning that is different from their literal meaning. They are commonly used in everyday language to convey a specific message or sentiment in a more colorful or memorable way.
This seems backward to me. Wouldn’t you be less impressed by ChatGPT if you thought that human intelligence worked the same way as LLMs?
If humans have some special sauce different from the computer, then it’s crazy that ChatGPT can emulate human writing so well. If humans are also just statistical models, then of course you can throw a big training set at some GPUs and it’ll do the same thing. Why should we be surprised or impressed by idioms?
Well, I don’t think they work exactly like the human mind, because we are likely a more complex ensemble model. But a) it’s impressive because it reaches a goal computing has pursued for generations, and as a computer scientist I’m excited; b) its corpus of learning is the amazing collection of everything that we built with the modern internet, so its breadth and depth are astounding. The act of creating what we have naturally is itself exciting, though.
I suppose it depends on the extent to which you're impressed by the concept of LLMs, I can see it both ways.
To explain the other way to your thinking: human intelligence is the same; holy crap they cracked robotic 'human' intelligence, it works exactly the same way.
That's how the brain evolved, after all: take incoming input, process according to previous patterns, output a reaction to peripheral nervous system that will enhance survival. The evolutionary purpose of a cortex is to store a statistical model of an animal's environment based on past stimulus. Human intelligence is just a complicated permutation of these initial systemic parameters.
That’s why people have to take courses in reasoning and logic to be able to successfully apply it. It’s not like reasoning and logic are natural human skills. They’re learned through reinforcement learning by repeatedly exposing our brain to patterns again and again until we can train our stochastic expectation juice to not hallucinate an answer.
And it takes ~20 years to train a new brain so it can coherently answer questions about a wide variety of topics. Even worse, you can't even copy-paste it once you're done!
What we shouldn't do is anthropomorphise it too much. While LLMs can express themselves and interact with us in natural language, their minds are very different from ours: they never learned by having an embodied self, and they can't continuously learn and adapt the way we do. Once the conversation is over, it's like it never existed, unless it's captured for a future training cycle.
Right now, their ability to learn is severely limited. And, yet, they outcompete us quite easily in a lot of different tasks.
Agreed. There are a hundred different kinds of information processing that go into a human-like mind, and we've kinda-sorta built one piece. And there are a lot of pieces that it would neither be sane nor useful to build (eg. internalized emotions), so we might not see an AI with all the pieces for a very long time ("never" is probably too much to hope for).
From a pure data amount point of view yes, but relatively little of that would seem to be relevant for our intellectual capacities. If GPT was a robot moving autonomously around in the world with full visual, auditory and tactile apparatus, it may be a bit different.
Hm, not sure how most of that data would be irrelevant, could you clarify? I think all of that data as well as interacting with the environment creates the level of knowledge and intelligence we have today.
There are many mentions of confetti cannons on the web, along with explanations of how they work (saying something like confetti shoots out of the cannon). Chat-GPT just picked a random thing (confetti) and completed the pattern "X out of Y" with the thing confetti comes out of.
It's easy. The cereal is out of the box. The helium is out of the balloon. The snow is out of the globe.
And it's exactly the one thing LLMs are trained to do: reproduce patterns of words. They are (perhaps) already better than humans at that one specific skill (another win for AI research) but I don't think it's a sign of general intelligence.
Like you, I thought these pieces of software and data were little more than statistics-based text generators. But it turns out that this is a Category Mistake.
There was an argument made by Raphaël Millière in a recent Mindscape Podcast [1] with Sean Carroll that finally landed for me. He used the example that human beings are driven to eat and reproduce, so by that argument all humans are just eating and reproducing machines. "Ah! But we developed other capabilities along the way to allow us to be good at that!" And that's the point.
GPT-4, for example, is very very good at producing pleasing and useful output for a given input. It uses a simulated neural net to do that. Why would one assume that on the way toward becoming excellent at that task that a neural net wouldn't also acquire other abilities that we associate with reasoning or cognition? When we test GPT-4 for these things (like Theory of Mind) we actually find them.
"Ah hah!" you say, "Humans are set up to learn from the get go, and machines must be trained from scratch." However if you consider the entirety of our genetic legacy together with our childhoods, those are our equivalent "training" from scratch.
I don't think it can be easily dismissed that we're seeing something significant here. It's not human-level intelligence yet. Part of the reason for that is that human brains are vastly more complex than any LLM at the moment (100s of trillions of "parameters" in LLM-speak, along with other advantages). But we're seeing the emergence of something important.
That emergence is precisely what I'm looking for evidence of.
Human beings evolved to eat and reproduce and yet here we are, building computers and inventing complex mathematical models of language and debating whether they're intelligent.
We're so far from the environment we evolved to solve that we've clearly demonstrated the ability to adapt.
ChatGPT doing well at a language task isn't demonstrating that same ability to adapt because that's the task it was designed and trained to do. ChatGPT doing something completely different would be the impressive example.
In short: I don't categorically reject the possibility that LLMs might become capable of more than being "statistics-based text generators", I simply require evidence.
This paper from late last year shows that LLMs are not "just" stochastic parrots, but they actually build an internal model of the "world" that is not programmed in, just from trying to predict the next token.
We're seeing those other capabilities emerge; like being able to play chess though it's not been trained to do so. That is, these LLMs are displaying emergent abilities associated with reasoning.
These LLMs aren't R. Daneel Olivaw or R2D2 (which is what I think of when I think of the original term for AI, and what we took to calling AGI). We're closer to seeing the just-the-facts AIs we encounter in Blindsight. Intelligence without awareness.
Funny that we still have to use science fiction to make our comparisons because our philosophy of intelligence, mind, and consciousness are insufficient to speak on the matter clearly.
The issue I see here is that you are doing a worse job at this than ChatGPT. Creating idioms is hard; that is why we left most of them to Shakespeare.
- I regularly return cereal to its box.
- "helium" and "balloon" have a more awkward rhythm than "confetti" and "cannon". It also loses the connotations of sudden, explosive and exciting change.
- Snow & globe I'm not even sure what that means in practice. It has poor prospects as an idiom. Is the snow even known for leaving globes?
> "helium" and "balloon" have a more awkward rhythm than "confetti" and "cannon". It also loses the connotations of sudden, explosive and exciting change.
Not only that, but "the confetti has left the cannon" is an alliteration, which makes the phrase even more poetic.
But I do think it's cherry picking the most impressive example. I repeated the dialog (and some variations), each time asking for a completely new idiom, and ChatGPT responded with several phrases that aren't new at all:
It also attests to the ramifications: the magnificent difficulty of cleaning up confetti means you will keep occasionally finding it around the house for months or years.
To continue with some human brain powered pedantry for fun I’d argue that “the snow is out of the globe” is not a valid idiom to suit the “X out of Y” pattern.
We have multiple example pairings of X and Y, but the common components are that putting X back in Y would be impossible or extremely difficult, and also that X is in some way meant to no longer be contained inside Y, as part of either a desired outcome, its normal function, or the naturally expected and thus inevitable result. Cats want to escape, helium leaks, confetti is expelled to have the desired effect, and toothpaste is squeezed out to use it…
For the snow to come out of a snow globe you have to smash it which is not normal usage, not normally meant to happen, and shouldn’t happen by itself. Making the idiom “The snow is out of the globe” not a proper member of this “family” of idioms. (Also I’m not sure if there’s an agreed upon collective noun for idioms)
Better that than the opposite effect, to assume that because a system solves a single problem very well, it is intelligent.
Is Stockfish intelligent?
Is a system with A* pathfinding intelligent?
I would define intelligence as the ability to solve a wide variety of novel problems. A system built to be excellent at a single task may be better than humans at that task but still lack "intelligence".
We still don't know what that even is exactly, but historically people consistently underestimate how difficult it is.
Not even knowing how to approach it, researchers work on solving single specialized problems instead and make little progress on whatever "intelligence" is.
(If you'd prefer a different definition of intelligence under which Stockfish and GPT are intelligent, then what would you call the ability to solve a wide variety of novel problems? Feel free to substitute that word for "intelligence" above if you'd like to understand what I'm saying.)
I'm not sure if we should get stuck on definitions of intelligence.
The fact is that these tools are useful, as are the currently existing AI's. The latter can also pass for humans, in many ways, while the algorithms you mentioned can only pass for humans in very narrow domains. Both can exceed human performance in some ways.
Eventually, AI's may be indistinguishable from human or convince humans that they should be treated differently from "mere" programs and algorithms, and at that point we will have entered a new era, call it what you will.
In a way, this comment perfectly encapsulates why the argument "machines will never replicate human behavior" is so ridiculous. Instead of engaging with the discussion and topic, you chose a position, and then tried to justify it without really thinking about why one example works and the other one doesn't. In doing so you're literally showing that for certain topics, machines are already more capable than some humans.
> And it's exactly the one thing LLMs are trained to do: reproduce patterns of words. They are (perhaps) already better than humans at that one specific skill (another win for AI research) but I don't think it's a sign of general intelligence.
while completely missing why the machine did a better job.
I think the thrilling part is that it's a somewhat atomic concept that can somewhat convincingly be proven not to exist in the training data. While poetry would be more impressive if it were as original, it's harder to show that it's not just stitched together from the training data.
I just asked GPT-4 to come up with more such "provably original idioms":
"""
Here are a few more examples of idioms with meaningful and provable atomic originality:
"The kite has touched the stars" - This phrase could mean that someone has achieved a seemingly impossible goal or reached a level of success that was thought to be unattainable.
"The paint has mingled on the canvas" - This idiom might convey the idea that once certain decisions are made or actions taken, the resulting outcome can't be easily separated or undone, similar to colors of paint that have blended together on a canvas.
"The clock has chimed in reverse" - This expression could be used to describe a situation where something unexpected and unusual has occurred, akin to the unlikely event of a clock chiming in reverse order.
"The flower has danced in the wind" - This phrase could signify that someone or something has gracefully and nimbly adapted to changing circumstances, just as a flower might sway and move in response to the wind.
"""
Conspiracy theory: this idiom used to be widespread on the internet, but a shadowy cabal of AI maximalists erased millions of such expressions from search engines so GPT could appear to re-invent them.
It is a good expression though -- evocative but not gross or violent. You could imagine many less successful analogies to something ejecting something else.
Is it strange that I think it's more likely that LLMs generalize and synthesize new things out of old patterns, rather than just copying a singular example?
I understand that all of this is based on some fundamental mathematics, a couple of algorithms here, and some statistical analysis there. And I find it absolutely amazing that we can use all of that knowledge and encode it into something that resembles intelligence. This makes me think if our brains and the concept of intelligence are really as special and mysterious as we assume.
That name aged like milk. First of all, what you said. And second - a parrot can make more parrots without human help, language models can't make GPU chips. Insulting for both LLMs and parrots.
Language models exist not in the physical world but in the digital one. A more apt comparison would be whether LLMs can compile new (or slightly modified) LLMs, and I believe the answer will soon be (if not already) a resounding yes.
I agree, LLMs can generate the source code for their network, can generate the training data, supervise a model being trained and explain step by step what they do. That's why I specifically said they can't make GPUs, if they could, they would be self replicators. We'd have to consider them alive.
We found that Codex was pretty bad at Verilog and so we fine-tuned our own models on Verilog code, starting from the SalesForce CodeGen model. The paper will be at DATE: https://arxiv.org/abs/2212.11140
Serious question: is it typical to describe client-side computing as "on the edge"?
I thought running something on the edge referred to running it in close network proximity to the user, rather than users having control and running things themselves.
I believe this has been extended to mean "on device", which is interesting. See Gerganov's article on Github [0]. I wrote about this here [1] where I made a contrast between the core and the edge. I think the term maps well to this meaning.
What I find more interesting is that in the classic "close network proximity", some parts of the world may not have benefited as much from that trend since the closest nodes of a global delivery network could be several countries away.
> I believe this has been extended to mean "on device", which is interesting.
I don't like the connotations this carries. This is almost openly talking about reaching all the way into people's hardware to run your software, for your benefit, on them, without their knowledge, consent, or control...
I see. Hadn't considered this. Yes, I see how that might be a concern.
What I think is important in this AI Spring is that we make it possible for people to run their own models on their own hardware too, without having to submit anything to a large, centralised model for inference.
Yes, "edge computing" can refer to both computing done as close to the user as possible geographically, or even on the device itself. If someone says "I wanna do edge computing" it's not clear enough to know if they just want to have servers they control as close to the user as possible, or do the computing on the device itself. I think Apple would say "edge computing" is on the actual device while CloudFlare would say "edge computing" is on their infrastructure, but distributed to be physically closer to the end user.
I guess I've been out of the loop for a bit and didn't realize that "edge computing" became a term since cloud computing took off.
It is kind of cyclical then, is it not?
By that I mean computers used to be shared, and you logged into them through a terminal.
Then the PC came around.
Then about 15 years ago cloud computing became the rage (really an extension of, or a more sophisticated system than, the original time-shared computers).
Now we're back to local computing. I even see more self-hosting and moving away from cloud due to costs.
All that rant is to say: it's interesting.
Side note, getting this AI to be localized as much as possible I imagine will be really useful in the medical industry because it helps alleviate HIPAA requirements.
I wonder if the author's perspective has changed with regards to freedom to compute.
Social media is often used as an example of privacy invasion, though I've failed to see why concerns over Facebook handling your private data are warranted when they don't have a product you need to have.
Email on the other hand, is pretty much a necessity today so privacy concerns are vital there imo. Of course you can host your own server whereas you can't host your own Facebook.
> I've failed to see why concerns over Facebook handling your private data are warranted when they don't have a product you need to have.
I have at least two concerns.
1) Viscerally, it would be intensely creepy if someone started following me everywhere. Why would I want a company doing the same?
2) We are in an era of diminishing freedoms, and it's impossible to prove that data Facebook has on us can't be turned against us in the court system in the future. Women's menstrual cycles are being weaponized, for Pete's sake.
What I wanted to articulate though was that Facebook isn't delivering an essential service. The consumer still has a choice not to use it. So if it as a company is known to abuse the information you give it then don't do business with that company.
Email, phone, and even your personal address: you don't have a lot of choice around not using those. And that stuff also isn't really protected without encryption (excluding mail by postal service, mostly).
Basically, if you don't want communication to be intercepted and potentially used against you, do it in person inside of a Faraday cage. I'm mostly joking here.
I have never had a Facebook account. I believe Meta would have a shadow-account for me, based on browser fingerprint, photos tagged by their users and such.
Because of the ambiguity of the term "on the edge" that is used to refer to both close network proximity and the device closest to the user, as evidenced by this thread, I would suggest to use a new term, at least in the context of A.I. The AI running on the device closest to the user should be called a "terminator".
“Edge computing” arguably implies there’s a network you are connected to, that you’re on the edge of, so I wouldn’t apply the term to applications that can function completely offline. With edge computing there’s usually still a notion of having some sort of internet integration, like IoT devices.
I've used "edge" in this context for around 15 years[1], and I've always intended it to mean "at the edge of the network", which can include being on the other side of the world to a user.
> Without services spying on your data. Without censorship on what the model can/cannot say. Because it's all local...
Wouldn't that be nice? It would also be contrary to all experience of the outcomes and pulls of corporations in modern society. The "local" LLMs will be on the fringe more than at the edge, because the ones that work the best and attract the most money will be the ones controlled by walled-garden "ecosystems."
I really hope it's different. I really hope there are local models. Actual personal assistants actually designed to assist their users and not the people that provide the access.
I for one dream of a future without maps. I want to walk through a distant forest to find an ancient, unconnected ESP-32 in the bark of a tree containing a tiny specialized AI that can only tell me about things relevant to the area, how far to walk upstream to the nearest town. And only if I can find it and scan an RFID tag to wake it up.
> I like to think
(right now please!)
of a cybernetic forest
filled with pines and electronics
where deer stroll peacefully
past computers
as if they were flowers
with spinning blossoms.
I'd go one step further if it is not happening yet: smaller companies should really pool their resources to train open LLMs. Say, form a consortium and work with the open source community to build ChatGPT-equivalent. Companies will be crazy to assume that they can hand their future to the APIs offered by a handful of companies during this monumental technological paradigm shift in history.
That is, a real OpenAI with a open government body.
Yes, yes, and yes. I'm waiting for an actually open AI that can run on the edge, purely on commodity hardware like our laptops and phones - it's inevitable.
I imagine this "cat out of the bag" situation, the democratization and commodification of powerful technology accessible and affordable to the public, is similar to what's happening with single-board computers and microcontrollers like Raspberry Pi, Arduino, ESP32.
It might be similar to what happened with mobile phones, but there the power was quite restricted. The (mostly) duopoly of iOS and Android, with devices and apps locked down in various ways. Sure we can "jail break" and "root" our phone, but that's not for the general public.
Maybe solar energy production is going through a similar process, with panels and batteries becoming more efficient and affordable every year.
Certainly, it reminds one of the history of personal computers, the way such a powerful general-purpose tool became ubiquitous and local.
After using ChatGPT 4 extensively for a few days, I think we're probably only a few years away from the first generation of truly conversational assistants a la Jarvis in Iron Man. Between LangChain and existing voice recognition software, we're already 95% of the way there; it just needs to be packaged up into a UI/UX that makes sense.
These local models are absolutely critical for that to happen though. I'm hitting daily situations where I have to reconsider my use of ChatGPT because what I'm asking would leak very private personal information or somehow trip its morality filter.
Just swapped in a 2TB nvme for a separate /home mount and reinstalled Arch just so I could have enough disk space to test a dozen models locally. I'm so ready!
I pray to the AI gods that OpenAI will fail at calibrating the censorship layer and will continue to overcensor, which in turn will hopefully lead to many usecases requiring local LLMs, which in turn would increase the incentive to build them.
GPT-4 training was finished last summer.
Karpathy is working on a JARVIS there.
I think they already have something quite good for internal use, maybe trying to make it "safe" before releasing it.
My guess: we will have a competent JARVIS(minus the holograms) this year.
> the democratization and commodification of powerful technology accessible and affordable to the public
Empowering everyone equally, with a very low barrier to entry. The spread of AI will make some skills less important and other skills more important. There will be painful social changes. Memorising stuff, being exact, being careful, being diligent: all these things can be done by an LLM, and so can creative recombination of ideas. For now we still hold supremacy in long-form content, complex software projects, and detecting inconsistencies, but those could be emergent qualities in GPT-5.
If you thought crypto miners drove up the prices of GPUs, just wait until a gaming card can host a person who can get a job.
Integer-quantized LLaMA can fit on a 4090, and isn't quite human-level. If a fully aligned, reliably non-hallucinating, Turing-grade LLM can run on the next-generation 5-series GPUs, then expect their market price to quadruple or more.
Yes, this is true. But, I worry about how long it will take for the utility of "GPT-4" on my phone to be close enough to whatever is only possible through models running on large cloud platforms to make that choice relatively drawback free.
Is the curve of what this class of algorithms can provide sigmoid? If so, then yeah, eventually researchers should be able to democratize it sufficiently that the choice to use versions that can run on private hardware is rational. But if the utility increases linearly or better over time/scale, the future will belong to whoever owns the biggest datacenters.
This is a shocking turn of events given there's no edge equivalent of the previous most powerful information tools (web-scale search). It does seem like it will still be a challenge to continuously collect, validate, and train on fresh information. Large orgs like Google/YouTube/TikTok/Microsoft still seem to have a huge advantage there.
> And smart people will package them as "one click installers" for people who aren't tech-savvy. This is already happening.
Any projects I can follow? Because I haven't seen any one click installers yet that didn't begin with "first install a package manager on the command line"
Which is a few months ahead (because the Stable Diffusion tech happened a few months earlier) and is definitely at a point where anyone can easily run it, locally or on a hosted environment.
I expect this "text-generation-webui" (or something like it) will be just as easy to use in the near, near future.
>So every time you're amazed by something chat-gpt4 says, remember that soon this will be in your pocket.
I want to believe you, but I'm ignorant of the hardware requirements for these things. How soon do you think we'd be able to run something reasonably gpt4-like on, say, a 4090?
I feel like no less than 10 years, if the singularity doesn't kick in before that. Hardware and energy aren't progressing as fast as we'd like, and that is the main bottleneck. As in, imagine a world where we actually had the computing power required to train (not run) GPT-4 in 1s, on a phone? That kind of world is way beyond AGI and the cure of cancer IMO. Which is great, because it gives us a very objective goal to achieve these things. Sadly, I don't think we're anywhere near that. What was even the total energy consumption of GPT-4 training? Very hard to imagine computers will get that much better anytime soon. IIRC we have some kind of data that a smartphone today has the power of the best supercomputer... of 30 years ago, right? Don't remember the source, sadly.
Your comment makes me wonder if it's not a coincidence that human-level intelligence begins to emerge just as we seem to be hitting a limit in hardware power.
Such hardware problems might be overcome with new computer architectures and/or substrates, like DNA computing, quantum computing, etc... Current AI's could help us overcome such limits.
I don't think 10 years for training such large models on phones is entirely feasible, if only because phones are mainly concerned with power drain and ergonomics before anything else. Nobody is going to buy an iPhone 15 if it's the size of a brick and lasts 2 hours just because you're training&running models on it.
The focus instead should be on running expansive models locally on your desktop system at home. This is a better focus in two ways: one the power issue is a non-concern (and AMD has already stated power usage will reach 500-700W average by 2025), two once you have it running locally the avenues of use open up such as being able to access your local model through other devices without the heavy burden on those devices.
The cat is out of the bag. The genie is out of the bottle. The confetti has left the cannon. The ship has sailed. The horse has bolted. The toothpaste is out of the tube. The beans have been spilled. The train has left the station. The die is cast. The bell has been rung.
Cerebras is "training compute optimal". Llama appears to be trained far beyond "training compute optimal". The tradeoff is that inference is closer to optimal for Llama, i.e. better performance with a smaller model.
> It would be interesting to know why you chose those FLOPS targets, unfortunately it looks like the models are quite under pre-trained (260B tokens for 13B model)
> We chose to train these models to 20 tokens per param to fit a scaling law to the Pile data set. These models are optimal for a fixed compute budget, not necessarily "best for use". If you had a fixed parameter budget (e.g., because you wanted to fit models on certain hardware) you would train on more tokens. We do that for our customers that seek that performance and want to get LLaMA-like quality with a commercial license
Which is the point made elsewhere in these comments, e.g. https://news.ycombinator.com/item?id=35344192, and also usefully shows how open Cerebras are. They're pretty open, but not as much as they would be if they were optimising for filling in other companies' moats.
Indeed but this is zero-shot performance. Fine-tuning for a task should get you pretty good results. I'm interested in seeing the results of an Alpaca method against this Cerebras 13B model.
Base model performance is what's most important and also impacts fine-tuning quality. Practically, a model that's good out of the box with minimal fine-tuning is also useful to more people. Since they focused on being training compute optimal for some budget, expect their models to lag behind Llama overall. Their 6.7B version should lag behind GPT-J, assuming 20 tokens per parameter.
The Pythia models are also worth checking out, they might be better than or matched to CerebrasGPTs at each size (although they warn it is not intended for deployment).
Conclusion: the landscape of top open models remains unchanged.
I agree fine-tuning for task will give better results. Cerebras actually showed some research recently on this front. Sparse pre-training and dense fine-tuning (https://arxiv.org/abs/2303.10464). You can recover the accuracy of sparse pre-trained models with dense fine-tuning and reduce FLOPs of the end-to-end pipeline by 2.5x compared to dense.
>I'm interested in seeing the results of an Alpaca method
You're talking apples to oranges. The "Alpaca method" is a dataset generation method. Nothing about Alpaca's training method is novel, interesting, or efficient. Alpaca used the same standard training method everyone else uses, A100 clusters.
If you mean LoRA/PEFT training which people used to replicate Alpaca then that is also apples to oranges because LoRA/PEFT is a finetuning method not a pre-training method.
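For anyone following along, here is a minimal sketch of what a LoRA/PEFT fine-tune looks like with the Hugging Face peft library. The model id and target_modules below are illustrative assumptions, not something from this thread, and the right target modules depend on the architecture.

    # LoRA fine-tuning setup (sketch). Assumes the `transformers` and `peft` packages;
    # the model id and target_modules are illustrative assumptions.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("cerebras/Cerebras-GPT-1.3B")

    lora_cfg = LoraConfig(
        r=8,                        # rank of the low-rank update matrices
        lora_alpha=16,              # scaling factor applied to the update
        target_modules=["c_attn"],  # which linear layers get adapters (architecture-dependent)
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora_cfg)
    model.print_trainable_parameters()  # only the small adapter matrices are trained

The point is that this is fine-tuning on top of an already pre-trained model, which is a different problem from the pre-training budget question discussed below.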
Each individual "chip" has 40GB of SRAM vs ~76MB for the Nvidia H100, plus networked pools of external RAM, SSDs and such. That's why the training architecture is so different.
> The challenge of extracting more than 20 kW of heat from the wafer was solved by having the wafer "float" on a cold plate. The wafer is allowed to expand and contract while remaining in contact with the polished front side of the cold plate, despite the different coefficients of thermal expansion of copper and silicon.
> The cold plate is much more than a slab of metal: advanced computational fluid dynamics modelling was used to design a labyrinth of coolant channels capable of maintaining a precise, stable thermal environment even as 850,000 AI-optimized cores swing into action.
> The power density of the CS-2 is too high for direct air cooling, so liquid cooling is used instead. The internal manifold transfers heat between the CS-2 system's internal coolant and facilities water. Separating these two fluids ensures that the CS-2 system is not affected by changes in the quality of facilities water and that the very highest-quality coolant circulates through the cold plate.
> The two pump modules plug into the upper four dry-break connectors. The lower two are for the air-cooling or water-cooling heat exchanger.
Cerebras makes impressive hardware, but Nvidia still performs better in every regard. The most telling factor is that Cerebras claims they're too busy to run common benchmarking (e.g. MLPerf) to compare against Nvidia.
Simply focusing on the "better in every regard" part of the comment.
One example where Cerebras systems perform well is when a user is interested in training models that require long sequence lengths or high-resolution images.
One example is in this publication, https://www.biorxiv.org/content/10.1101/2022.10.10.511571v2, where researchers were able to build genome-scale language models that can learn the evolutionary landscape of SARS-CoV-2 genomes. In the paper, the researchers mention: "We note that for the larger model sizes (2.5B and 25B), training on the 10,240 length SARS-CoV-2 data was infeasible on GPU clusters due to out-of-memory errors during attention computation."
Mostly teasing but my guess would be $500k+ since they'll likely price it so that it is the same $ as the equivalent NVIDIA cluster (or very close to it).
Actually, if they are around $2M, it looks like my company can afford one. Given this is just getting started, it looks promising, as I’m sure future generations will be more affordable.
On the order of several million USD for the second gen system. Last I heard, they’re still at lowish volumes, selling some to national labs and the like.
ServeTheHome claims "HGX A100 platforms, when they are sold as single servers are generally in the $130K-$180K even leaving a very healthy margin for OEMs/ resellers"
Not sure about the H100, but it seems to be more supply constrained (hence pricier) atm.
Now, the real question is how many HGX nodes "equals" a single CS2 node. The math here is extremely fuzzy, as the benefit to such extreme node consolidation depends on the workload, and the CS-2 takes up less space, but the HGX cluster will have more directly accessible RAM and better turnkey support for stuff since its Nvidia.
This is actually really important from my perspective. It looks like an end user can work backwards from available inference hardware, or inference budget and required speed, and then figure out a viable model size. Bring their own data and then fine-tune or train from scratch.
It's a pretty mad architecture tbh. Compile times must be absolutely insane. Also, Tesla's Dojo uses a manufacturing technique that has basically obsoleted their WSI design already.
Compile times are not a whole lot different than any other large model build. It's a kernel based compilation pipeline and the kernels are simply tiled over a 'core' area in the weight streaming architecture.
I used to work for a competitor with a more flexible architecture and even our compile times were bad (significant fractions of a day in some cases). And we didn't have to do place and route!
I just googled it and it's apparently bad enough that they had to implement incremental place and route.
EDIT: Looks like it scores better with less training, up until it matches GPT-J/Pythia/OPT, and then doesn't appear to have much benefit. It maybe scores slightly better than GPT-J, which is pretty "eh"; I'm not sure if GPT-J-level performance is really useful for anything. NeoX 20B outperforms it in everything if you don't care about the amount of training needed.
Does the better performance for less training matter if that benefit only applies when it's performing a lot worse than GPT-J? It appears to lose its scaling benefits before the performance is interesting enough to matter.
Came here to point this out, though not as pithily :D
Really, really bad mark on whoever is in charge of their web marketing. Images should never look that bad, not even in support, but definitely not in marketing.
edit: so this post is more useful: 4K res, using the Edge browser
In other words, they’re actually incentivized to help make LLMs as accessible as possible, rather than try to keep them locked up to hide them from competitors.
Which makes me wonder if Nvidia is doing anything with LLMs too?
RAM. GPT-3 is over 600GB, ie just the max RAM of 8xA100s, because that's all the hardware can fit.
StableDiffusion plus a whole chain of imagenets can make any visual imagery imaginable in 2GB of RAM. Meanwhile 2GB of RAM barely runs a basic tiny text completion NN that can't do anything intelligent. Text requires a lot more parameters (and more memory/RAM) than images.
The Cerebras node's actual "RAM" (the 40GB of SRAM) is pretty modest too, but being an enormous chip with the networked storage pools is certainly a better situation than a bunch of A100s reaching out to every other A100.
Honestly, all the AI ASIC makers drastically underestimated the RAM requirements of future models. Graphcore's 4GB and Tenstorrent's 8GB per IC is kinda laughable, and it takes them longer to adjust than Nvidia. And Cerebras' original pitch was "fit the entire model into SRAM!"
> Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget.
I'm confused as to why 111 million parameter models are trained with the Chinchilla formula. Why not scale up the training data? If you're training smaller models, surely optimizing performance is better than optimizing total compute.
Seems like a silly misunderstanding of the Chinchilla paper, but I'm sure I'm missing something
Money quote for those who don't want to read the whole thing:
'''
When people talk about training a Chinchilla-optimal model, this is what they mean: training a model that matches their estimates for optimality. They estimated the optimal model size for a given compute budget, and the optimal number of training tokens for a given compute budget.
However, when we talk about “optimal” here, what is meant is “what is the cheapest way to obtain a given loss level, in FLOPS.” In practice though, we don’t care about the answer! This is exactly the answer you care about if you’re a researcher at DeepMind/FAIR/AWS who is training a model with the goal of reaching the new SOTA so you can publish a paper and get promoted. If you’re training a model with the goal of actually deploying it, the training cost is going to be dominated by the inference cost. This has two implications:
1) there is a strong incentive to train smaller models which fit on single GPUs
2) we’re fine trading off training time efficiency for inference time efficiency (probably to a ridiculous extent).
Chinchilla implicitly assumes that the majority of the total cost of ownership (TCO) for a LLM is the training cost. In practice, this is only the case if you’re a researcher at a research lab who doesn’t support products (e.g. FAIR/Google Brain/DeepMind/MSR). For almost everyone else, the amount of resources spent on inference will dwarf the amount of resources spent during training.
'''
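To make "compute-optimal" concrete, here is a back-of-the-envelope sketch using the common C ≈ 6·N·D approximation for training FLOPs and the ~20 tokens-per-parameter figure quoted elsewhere in this thread. Rough numbers, not taken from the Chinchilla paper itself.

    # Back-of-the-envelope Chinchilla-style arithmetic (sketch).
    # C ~= 6 * N * D: training FLOPs for N parameters on D tokens,
    # with the ~20 tokens-per-parameter rule of thumb quoted in this thread.

    def optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
        return params * tokens_per_param

    def training_flops(params: float, tokens: float) -> float:
        return 6.0 * params * tokens

    for n in (111e6, 1.3e9, 13e9):
        d = optimal_tokens(n)
        print(f"{n / 1e9:5.2f}B params -> ~{d / 1e9:4.0f}B tokens, ~{training_flops(n, d):.1e} FLOPs")

    # 13B params -> ~260B tokens, matching the "260B tokens for 13B model" figure above.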
>Chinchilla implicitly assumes that the majority of the total cost of ownership (TCO) for a LLM is the training cost. In practice, this is only the case if you’re a researcher at a research lab who doesn’t support products (e.g. FAIR/Google Brain/DeepMind/MSR). For almost everyone else, the amount of resources spent on inference will dwarf the amount of resources spent during training.
I'm not so convinced, especially if people are doing multiple training runs for hyperparameter tuning, cleaning data, fixing bugs, etc.
I would be very interested in knowing what portion of OpenAI's compute budget is training. I would not be surprised if it was a significant minority.
"the training cost is going to be dominated by the inference cost."
That's only true for general-mass-consumer models.
Companies may want to fine-tune/train their own models, which don't have that many users for their narrow use cases (possibly only internal staff), will find that training cost is a substantial chunk of the TCO
While true, I think this also misses that "for almost everyone else" you're probably not (or at least should not be) trying to optimize zero-shot performance if you have an intended high-inference use case, so I don't think Chinchilla would be all that relevant.
I have a suspicion that good zero-shot performance is a good starting point for fine-tuning. If you have more than one intended high inference use case, or can imagine a couple of new ones on the horizon, it might still be best to not target the first use case directly.
Well yeah that’s kind of intuitive, my point is that if you just optimize for zero-shot you end up with something like GPT4 when an enterprise could probably be using finetuned LLaMA-7B with similar performance.
The point of those smaller models is for the "Cerebras Scaling Law for Compute-Optimal Training" which is the straight line plot in the image at the top of their webpage when you click the link.
They want you to think it's reasonable that because the line is so straight (on a flops log scale) for so long, it could be tempting to extrapolate the pile-loss consequences of continuing compute-optimal training for larger models beyond their largest 13B one, with the obvious caveat that the extrapolation can't continue linearly much further if for no other reason than the test loss isn't going to go below zero (it will flatten out sooner than that).
If you trained beyond compute-optimality on smaller models, it would mess up their straight line and make it look like we are sooner hitting diminishing returns on test loss.
According to https://pile.eleuther.ai/paper.pdf the test loss on the pile is the log of the perplexity, and the perplexity is 2^H where H is an entropy which is non-negative. So the perplexity is always at least one, so its log is always at least zero.
So yes the test loss can be seen as a log, but no it's not allowed to go below zero.
The intuition is that the test loss is the number of bits that the model would need on average to encode each next token in the test part of the pile, given that you have seen the preceding parts.
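Restating that relationship compactly with standard definitions, as a sketch rather than a quote from the Pile paper:

    % Test loss as described above: the log of the perplexity, which is 2^H
    % for a non-negative per-token entropy H, so the loss cannot go below zero.
    \mathrm{PPL} = 2^{H}, \qquad
    H = -\frac{1}{N}\sum_{i=1}^{N} \log_2 p(x_i \mid x_{<i}) \ge 0, \qquad
    \mathcal{L}_{\mathrm{test}} = \log_2 \mathrm{PPL} = H \ge 0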
You’re not wrong; the Chinchilla rationale is that it may be more compute-efficient to obtain a given loss using larger model sizes if the budget allows. As another commenter states, this ignores the inference part of the equation.
As an example the BERT/RoBERTa family were trained for much longer than Chinchilla, you do get diminishing returns though.
There is a point of overtraining where downstream performance is impacted but that’s pretty high.
I think part of the answer to this is also that xxx million parameter decoder-only models don’t seem to be that useful so it may not be worthwhile to optimize them for performance?
I might be missing something but it looks to me that actually running this "open" model requires special hardware only accessible with a cloud subscription with 60 000 USD / week minimum spend[1]. Can anyone confirm if you can run it on your own hardware? If software is open but hardware is locked I don't see the point.
The PyTorch model files are already available to download from Hugging Face - the largest one looks to be 52GB. They should run on any hardware that can run regular PyTorch models.
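A minimal sketch of what that looks like with transformers; the repo id is an assumption based on the Cerebras-GPT naming, so check the actual Hub listing, and pick a size your hardware can hold.

    # Running a Cerebras-GPT checkpoint with Hugging Face transformers (sketch).
    # The repo id is an assumption; the 13B model needs far more memory than the small ones.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "cerebras/Cerebras-GPT-1.3B"  # assumed repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer("The confetti has left the cannon because", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))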
I've been following open source LLMs for a while and at first glance this doesn't seem too powerful compared to other open models, Flan-Alpaca[0] is licensed under Apache 2.0, and it seems to perform much better. Although I'm not sure about the legalities about that licensing, since it's basically Flan-T5 fine-tuned using the Alpaca dataset (which is under a Non-Commercial license).
Nonetheless, it's exciting to see all these open models popping up, and I hope that a LLM equivalent to Stable Diffusion comes sooner than later.
Sounds like you might be the right person to ask the “big” question.
For a small organization or individual who is technically competent and wants to try and do self-hosted inference.
What open model is showing the most promise, and how do its results compare to the various OpenAI GPTs?
A simple example problem would be asking for a summary of code. I’ve found openAI’s GPT 3.5 and 4 to give pretty impressive english descriptions of code. Running that locally in batch would retain privacy and even if slow could just be kept running.
Google's Flan-T5, Flan-UL2 and derivatives, are so far the most promising open (including commercial use) models that I have tried, however they are very "general purpose" and don't perform well in specific tasks like code understanding or generation. You could fine-tune Flan-T5 with a dataset that suits your specific task and get much better results, as shown by Flan-Alpaca.
Sadly, there's no open model yet that acts like a Swiss knife and gets good-enough results for multiple use cases.
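As a starting point, here is a minimal sketch of trying Flan-T5 zero-shot before committing to a fine-tune; google/flan-t5-base is one of the smaller published sizes.

    # Quick zero-shot trial of Flan-T5 (sketch); smaller sizes are easy to run locally.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_id = "google/flan-t5-base"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    prompt = ("Summarize what the following Python function does:\n\n"
              "def f(xs): return sum(xs) / len(xs)")
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=60)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))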
Sorry for the late reply, as I said Flan-UL2 (or Flan-T5 if you want lighter models) fine-tuned against a dataset like CodeAlpaca's[0] is probably the best solution if it's intended for commercial use (otherwise LLaMa should perform better).
Their goal isn't to make a powerful model. It's to show how well compute-optimal models do on test-loss as a function of increasing model size. This function can be used with some caveats to forecast the test-loss of larger models for which compute-optimality becomes more important.
Does the chinchilla recipe still hold today? I got the impression that the LLaMA paper proposed a different result where throwing far more tokens at the problem had a very meaningful impact, or did I misunderstand that?
There’s discussion elsewhere in this thread what chinchilla actually means. I’ll only compare it to llama.
Tldr; Chinchilla isn’t wrong, it’s just useful for a different goal than the llama paper.
There are three hyperparameters to tweak here: model size (parameter count), number of tokens pre-trained on, and amount of compute available. End performance is, in theory, a function of these three hyperparameters.
You can think of this as an optimization function.
Chinchilla says, if you have a fixed amount of compute, here’s what size and number of tokens to train for maximum performance.
A lot of times, we have a fixed model size though, because size impacts inference costs and latency. Llama operates in this territory. They choose to fix the model size instead of the amount of compute.
This could explain gaps in performance between Cerebras models of size X and Llama models of size X: Llama models of size X have way more compute behind them.
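Using the same C ≈ 6·N·D approximation as the sketch earlier in the thread, with the token counts mentioned here (260B for the Cerebras 13B model, roughly 1T for LLaMA-class models), the compute gap looks like this:

    # Rough compute comparison via C ~= 6 * N * D (sketch; token counts are the
    # figures mentioned elsewhere in this thread, not exact training logs).
    def training_flops(params: float, tokens: float) -> float:
        return 6.0 * params * tokens

    cerebras_13b = training_flops(13e9, 260e9)  # ~20 tokens per parameter
    llama_13b = training_flops(13e9, 1e12)      # LLaMA-class training, ~1T tokens

    print(f"Cerebras-GPT 13B: ~{cerebras_13b:.1e} FLOPs")
    print(f"LLaMA 13B:        ~{llama_13b:.1e} FLOPs ({llama_13b / cerebras_13b:.1f}x more)")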
First, it only holds for a given architecture and implementation. Obviously, a different architecture will have a different training slope. This is clear when comparing LSTM with Transformers, but is also true between transformers that use prenorm/SwiGLU/rotary-positional, and those that follow Vaswani 2017.
In terms of implementation, some algorithms yield the same result with fewer operations (IO, like FlashAttention and other custom CUDA kernels, and parallelism, like PaLM, which both came after Chinchilla), which unambiguously affects the TFLOPS side of the Chinchilla equation. Also, faster algorithms and better parallelization will yield a given loss sooner, while less power-hungry setups will do it cheaper.
Second, even in the original Chinchilla paper in figure 2, some lines are stopped early before reaching Pareto (likely because it ran out of tokens, but LLaMA makes it seem that >1 epoch training is fine).
Of course this is great news, I hope these models can be fine-tuned to be like lighter versions of chatGPT. But I remember reading in the LLaMA paper that a small model can still improve when trained more than the Chinchilla budget.
> For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.
Cerebras says:
> For instance, training a small model with too much data results in diminishing returns and less accuracy gains per FLOP
But this is only of concern when you care about the training cost, such as when you are budget limited researcher or a company who doesn't deploy models at scale. But when you care about the total cost of deployment, then making a small model even better with lots of data is a smart move. In the end it matters more to have the most efficient model in prediction, not the most efficient model in training.
Looking at their charts it seems like their 6.7B model is considerably worse than GPT-J which is an existing open 6B model from several years ago.
I wish rather than stopping training early they would have run more data through a small model so we could have something more competitive with LLaMA 7B.
Someone posted this repost from the Cerebras Discord earlier, but sharing for visibility -
"We chose to train these models to 20 tokens per param to fit a scaling law to the Pile data set. These models are optimal for a fixed compute budget, not necessarily "best for use". If you had a fixed parameter budget (e.g., because you wanted to fit models on certain hardware) you would train on more tokens. We do that for our customers that seek that performance and want to get LLaMA-like quality with a commercial license"
There are plenty of such efforts, but the organizer needs some kind of significance to attract a critical mass, and an AI ASIC chip designer seems like a good candidate.
Then again, maybe they prefer a bunch of privately trained models over an open one since that sells more ASIC time?
The other AI companies also have discords. Midjourney does like biweekly townhalls on discord, from the CEO directly to the interested users. Emad also hangs out on the SD discord.
Talk about having a pulse on customers, doesn't get more direct than that. Any company that is both
1: Focused on a tech savvy/friendly customer base
2: The customer base is passionate about the product
Should probably orient their entire customer support model around discord.
This gap makes sense to me. The academic point of the Cerebras paper is to show their nice empirical scaling law for compute-optimal training, whereas the academic point of the LLaMA paper was to show that you can make small models punch above their weight by training them in a way that is deliberately not compute-optimal. Of course both of those publications had other academic and marketing purposes.
From the Cerebras blog post: "Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget."
From the LLaMA paper: "The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used."
The first one is whether they would actually sue. The optics would be terrible. A similar situation occurred in the 90s when the RC4 cipher’s code was leaked. Everyone used the leaked code pretending that it was a new cipher called arc4random, even though they had confirmation from people that licensed the cipher that its output was identical. Nobody was sued, and the RSA company never acknowledged it.
The second one is related to the terms. The LLaMA weights themselves are licensed under terms that exclude commercial use:[0]
> You will not […] use […] the Software Products (or any derivative works thereof, works incorporating the Software Products, or any data produced by the Software), […] for […] any commercial or production purposes.
But the definition of derivative works is gray. AFAIK, if LLaMA is distilled, there is an unsettled argument to be had that the end result is not a LLaMA derivative, and cannot be considered copyright or license infringement, similar to how models trained on blog articles and tweets are not infringing on those authors’ copyright or licensing. The people that make the new model may be in breach of the license if they agreed to it, but maybe not the people that use that new model. Otherwise, ad absurdum, a model trained on the Internet will have content that was generated by LLaMA in its training set, so all models trained on the Internet after Feb 2023 will break the license.
IANAL, but ultimately, Meta wins more by benefiting from what the community contributes on top of their work (similar to what happened with React), than by suing developers that use derivatives of their open models.
> It takes substantial technical expertise to train very large models on GPUs. In the recently released GPT-4 Technical Report, OpenAI credits over thirty contributors just for compute infrastructure and scaling.
This is called a silver lining for some (in case you were worried about GPT taking your job). Privacy requirements alone will in the near term force major companies to run their own inference (if not training). The expertise required is nearly identical to that of running large-scale distributed computational graphs.
This is an interesting divergence from what happened with the web. The backends started out simple, before map-reduce and before deconstructing databases and processing distributed logs. With ML, we'll jump right into the complex backends, in tandem with easy-picking early-stage edge applications (which we see daily on HN).
Even though I usually use OpenAI's APIs, just because that is the easiest path, I do also use Hugging Face open models (via their APIs, and running locally) and I will check out Cerebras also.
Everyone except Google uses Nvidia for training. Cerebras, Gaudi, and other custom AI accelerators have been unable to surpass Nvidia in performance/$ and performance/watt yet.
You might see more training on Intel XPUs when they come out, since they have such enormous RAM pools. Maybe AMD MI300s and Intel Ponte Vecchio (both 128GB) in the shorter term, though I think they will mostly be in HPC supercomputers instead of cloud instances.
I wonder if they've done some Alpaca style training on it... Granted, what made Alpaca useful was that it was finetuned with GPT-3's instruction following completions as examples.
And, at least officially, OpenAI's outputs can't be used to train other AI models.
Otherwise, if GPT-4 outputs were used to finetune these models, they may become much more interesting.
A tangential question: as chiplets become increasingly common, I wonder what Cerebras will do to keep their technological advantage of wafer-scale integration. What is the bandwidth and latency of the connections between the tiles? Is there such a thing as bandwidth per frontier length?
"Cerebras open sources seven GPT-3 models from 111 million to 13 billion parameters."
I don't understand why they describe them as GPT-3 models here as opposed to calling them GPT models. Or even LLMs - but I guess that acronym isn't as widely recognized.
Is there a regularly updated repository containing all the releases of LLMs as they happen? TBH I am tired of having to doommark (doom-bookmark) so many repositories and links...Would appreciate some collected database.
[I'm not an expert] but I believe .ckpt and .safetensors.
The problem with .ckpt is that it can execute arbitrary code on your machine (very unsafe), while .safetensors was made by Hugging Face in order to have a safe format for storing the weights. I've also seen people load up the LLaMA 7B via a .bin file.
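To illustrate the difference, a minimal sketch assuming the torch and safetensors packages; the file names are placeholders.

    # Loading weights from the two formats mentioned above (sketch; file names are placeholders).
    import torch
    from safetensors.torch import load_file

    # .ckpt / .bin checkpoints are pickle-based: loading them can execute arbitrary
    # code embedded in the file, so only load checkpoints from sources you trust.
    state_dict_pickle = torch.load("model.ckpt", map_location="cpu")

    # .safetensors stores raw tensors plus metadata only, with no code execution on load.
    state_dict_safe = load_file("model.safetensors")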
Every new variation of a model gets some new name, just like every library gets a new name. There were all kinds of BERTs before: DistilBert, Roberta, SciBERT, Schmobert, Schmuber, etc. Many hundreds of them, I think.
As mentioned, chinchilla is not part of this trend, and chinchillas are rodents. Alpacas and llamas are South American camelids (animals related to camels). So if additional names are needed, I would expect them to be vicuña and guanaco, as they are also in the group.
I have not tried, but 96GB of GPU memory is plenty; for inference there should certainly be no issue. Their biggest model has 13B parameters, so you should be able to run inference (float16) already with 32GB of memory.
With 96GB of memory you should also be able to fine-tune it (possibly some tricks like gradient accumulation and/or checkpointing might be needed), but you have to be ready for many days of computation...
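A quick sanity check on those numbers; a rough sketch that ignores activations, the KV cache, and framework overhead.

    # Rough weight-memory arithmetic for a 13B-parameter model (sketch).
    params = 13e9
    for name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1), ("int4", 0.5)]:
        print(f"{name:>7}: ~{params * bytes_per_param / 1e9:.1f} GB of weights")

    # float16 comes out around 26 GB, which is why 32 GB is enough for inference;
    # fine-tuning additionally needs gradients and optimizer state per parameter,
    # hence tricks like gradient accumulation/checkpointing even with 96 GB.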
> but you have to be ready for many days of computation...
I was thinking: we have API prices in tokens, and now it looks like self-hosted inference on high-end GPUs is possible for similar models. Then, based on electricity prices, there will be a self-hosted price per token. How close are these already? What is the markup today, roughly, over the raw electricity cost that OpenAI has?
I've been wondering about the best way to print large format versions of custom Renaissance-style paintings with goofy subjects for our walls at home. I guess I have to figure out how to best upscale the output first.