Right now there are too many caveats to running even the 7B model per the workflows mentioned in the article.
The big difference between it and Stable Diffusion, and what caused the latter to go megaviral, is that a) Stable Diffusion can run on a typical GPU that gamers likely already have without hitting a perf ceiling, and b) it can run easily on a free Colab GPU. Hugging Face transformers can run a 7B model on a T4 GPU w/ 8-bit loading, but that comes with its own caveats too.
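Roughly what that 8-bit path looks like, as a minimal sketch assuming bitsandbytes and accelerate are installed (the model name is purely illustrative, not a recommendation):

    # Sketch: 8-bit loading of a ~7B model in transformers (needs bitsandbytes + accelerate)
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "facebook/opt-6.7b"  # illustrative ~7B causal LM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",   # spread layers across the available GPU/CPU memory
        load_in_8bit=True,   # int8 weight quantization via bitsandbytes
    )

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))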
There's a big difference between "can run" and "can run well". VQGAN + CLIP had a lot of friction too and that's partially why AI image generation didn't go megaviral then.
I have the 65B model running on my MacBook here. Honestly, I wish I knew enough about how to get it to output something useful. It's like poking a drunk librarian.
BTW - now that discussions about sampling techniques within NLP are going mainstream, I have to shamelessly self-plug my own research work in this direction, specifically in the context of constrained text generation. Simply filtering tokens according to constraints before generating text is wonderful for fixing stupid failure cases in LLMs like ChatGPT - such as asking it not to write with a particular letter, asking it to respect syllable counts, or asking it to rhyme properly (look carefully: ChatGPT struggles with exact rhymes in poetry!)
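To make that concrete, here's a toy sketch of the filter-then-sample idea (not the code from my paper; just GPT-2 via Hugging Face transformers, banning every token that contains the letter "e" before each sampling step):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    # Precompute the ids of every vocabulary token containing the banned letter.
    all_tokens = tok.convert_ids_to_tokens(list(range(len(tok))))
    banned = torch.tensor([i for i, t in enumerate(all_tokens) if "e" in t.lower()])

    ids = tok("My favorite animal is", return_tensors="pt").input_ids
    for _ in range(20):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        logits[banned] = float("-inf")   # constraint applied *before* sampling
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

    print(tok.decode(ids[0]))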
I feel bad for not being active on Twitter, specifically because I missed out on stuff like this (which predates my paper). I should have cited this thread in my paper, since it's evidence that someone else had already used this technique!
I think this vocabulary filtering technique will pick up steam again now that good-quality open-source LLMs are proliferating. I've always wanted to try these techniques with ChatGPT, and the inability to do so has led me to complain somewhat publicly about the lack of access to ChatGPT's vocabulary probability distribution (they don't give more than the top 10 most likely continuations).
You might be interested in this thread from 2019 here on HN as well, I linked to a line of work on constrained sequence generation from Sony CSL that is pretty nifty https://news.ycombinator.com/item?id=21334639
As an aside - do you have any academic references for masking / constraint-based next-step transitions? It's a trick I've been messing with for years, but I probably picked it up from some older poetry-generation work. I haven't been able to find any direct reference for simple one-step masking, let alone more involved things like constraining based on the unrolled, modified Markov sequence, or anything bigger than one step.
For any models where you get access to the code (counting HF in this), there's definitely a lot of tooling available - and this tool looks really nice, I hadn't seen it before! I hope similar levels of control become available for the bigger, less accessible systems; even though providers don't want to expose token log-likelihoods, there should be ways.
I've also been reading your paper on these topics, based on previous comments on HN. So thanks for that! Always interested in work about constrained generation.
Thanks for flagging this. Using that repo, I'm getting pretty good results with the following:
./main -m ./models/13B/ggml-model-q4_0.bin -t 7 -b 10 --top_p 0.0 -n 5000 --temp 0.7 --repeat_last_n 64 --repeat_penalty 1.3 --prompt "You are a question answering bot that is able to answer questions about the world. You are extremely smart, knowledgeable, capable, and helpful. You always give complete, accurate, and very detailed responses to questions, and never stop a response in mid-sentence or mid-thought. You answer questions in the following format:
Question: What’s the history of bullfighting in Spain?
Answer: Bullfighting, also known as "tauromachia," has a long and storied history in Spain, with roots that can be traced back to ancient civilizations. The sport is believed to have originated in 7th-century BCE Iberian Peninsula as a form of animal worship, and it evolved over time to become a sport and form of entertainment. Bullfighting as it is known today became popular in Spain in the 17th and 18th centuries. During this time, the sport was heavily influenced by the traditions of medieval jousts and was performed by nobles and other members of the upper classes. Over time, bullfighting became more democratized and was performed by people from all walks of life. Bullfighting reached the height of its popularity in the 19th and early 20th centuries and was considered a national symbol of Spain. However, in recent decades, bullfighting has faced increasing opposition from animal rights activists, and its popularity has declined. Some regions of Spain have banned bullfighting, while others continue to hold bullfights as a cherished tradition. Despite its declining popularity, bullfighting remains an important part of Spanish culture and history, and it continues to be performed in many parts of the country to this day.
Now complete the following questions:
Question: What happened to the field of cybernetics in the 1970s?
I think the bot is accidentally kind of correct: you are asking it to "complete the following questions," and that question already seems complete. Maybe the bot could have asked for clarification - I've never hit that situation yet, but it would be something like: "That question seems very well formulated. Do you want me to answer it, or to complete the question? We could complete the question by being more specific about location, for example: what happened to the field of cybernetics in the US in the 1970s?"
> and never stop a response in mid-sentence or mid-thought
I'm curious if this part actually does anything to improve the output, my intuition says it doesn't help much. I don't have it running locally to test yet, though.
These prompts become “key pillars” of the LLM’s “identity”. If you are curious, you should look at how much personality MS Sydney acquired through its own startup prompt.
there's no "for dummies" guide, this stuff is too new. there's a bunch of white papers to read; _attention is all you need_ is a good place to start, kinda.
if you troll HN for AI related stuff and just read read read, progressively googling the terms you don't understand, eventually the haze will lift and things will start to make some sense.
> if you troll HN for AI related stuff and just read read read
Do you mean "troll" or "trawl"?
(I'm thinking of an old saying that the best way to get an answer isn't to ask a question, but to say the thing is impossible. Can't remember it well enough to google for the exact form, however).
If I'm not hallucinating like GPT, I think the etymology of troll as in internet actually stems from trolling as in fishing.
I have a hazy recollection of folks in the early days of 4chan describing 'trolling for idiots' - as in, feigning ignorance to get a rise out of folks. Like, you post a dumb, obviously stupid but infuriating comment and watch people get all worked up about it. You know, for the lulz.
I feel like since then it's changed to just actually being a dumb asshole and enjoying when people are mad at you. Or maybe I've just gotten smarter. In any event at the time it certainly didn't seem like the kind of thing racist lunatics would go out of their way to do. Perhaps more the fool I, as I was a teenager at the time and as such that opinion cannot really be trusted.
I think the best way is to get a basic understanding of torch and then just read what the code does. The model internals are hard (though for the transformer there are good step-by-steps online), but for this case it is basically playing with the output of the actual neural net inside the model. The net gives you a probability distribution over all tokens (think words) for what comes next. It has proven useful not to simply pick the token with the highest probability but to employ different sampling strategies. Top-k in this case means probabilistically sampling from the top k tokens instead of always picking the best one.
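A rough sketch of a single top-k sampling step in torch, assuming `logits` is the 1-D next-token tensor the net hands you (the k and temperature defaults here are just illustrative):

    import torch

    def sample_top_k(logits: torch.Tensor, k: int = 40, temperature: float = 0.8) -> int:
        logits = logits / temperature
        topk = torch.topk(logits, k)                      # keep only the k most likely tokens
        probs = torch.softmax(topk.values, dim=-1)        # renormalize over the survivors
        choice = torch.multinomial(probs, num_samples=1)  # sample instead of taking the argmax
        return int(topk.indices[choice])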
It’s essentially a question of: did you take a look at the code?
There are comments in the codebase right where the config values in question are being used, e.g.
Now, I’m assuming a base level of knowledge for this to help, but in general I find that diving into the code paths of open-source models is usually a good use of time.
Give it few-shot examples and the quality should improve a lot. If you want a bot able to do technical Q&A, give it two questions as well as their answers in its prompt. The model will then infer its role as an AI that answers questions truthfully, and will try its best to continue along those lines.
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been "
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:
and its output was:
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 90% balls."
Seentiment: Negative
###
Twitter: "i don't have a favourite type of music i just hate country"
Seentiment: Negative
###
Twitter: "c'mon first sunshine of the year...why the rain already?!"
Seentiment: Negative
###
Twitter: "Dear twitter, are you getting tired of me yet?
Seentiment: Negative
###
Twitter: "I don't care about the race, I just care about the president being a psych
Doesn't feel like that did the trick, but at least you got me to set up the shell command in a way where I pipe long-form prompts in via a separate file.
Hopefully an RLHF fine-tuned model will be released soon, because that would make the technology much more useful and less disappointing to people trying to run a ChatGPT alternative on their local machine.
For people complaining that ChatGPT makes up stuff, I used to say it's like asking a librarian at gunpoint: of course it's very polite, and it knows a lot, but it also doesn't have time to look anything up or do any real research.
If you didn’t have a $5,000 MacBook, could you run it on a GPU over a long period of time? Would you be able to feed it a list of prompts that maybe take all night to process and have the responses by morning?
Uh, if your budget is <$5000 or <$3000 or whatever you can get a lot more GPU crunch power if you don't buy a Mac.
Vanilla Ubuntu desktop with a used server-grade NVIDIA GPU with lots of GPU RAM is probably your best bang for buck for messing with large model inference right now.
You're dispensing outdated advice. Instead of saying "you'll save" and "lots of GPU RAM", start listing specific numbers and you'll soon realize your mistake.
The Mac has a unified memory architecture that no other mainstream desktop can offer right now. That's how you get up to 96GB of VRAM on a laptop.
On a separate note I never understood why NVIDIA was always so stingy on RAM.
Give us a GPU with 256GB of RAM already. It doesn't need to be that fast, it just needs to be big enough to hold a huge model.
Instead they're still selling RTX4090Ti with 24GB of RAM while the free GPU I got from them at a raffle a few years ago has 32GB. WTF? I would have expected the RTX4090Ti to have 64GB minimum, being the highest-end consumer GPU.
Hardware lags use cases by a few years. You don't need that much RAM to game, or to mine crypto. With AI, you need all the RAM. So expect NVIDIA to offer some drastically different products in coming years.
But to me, Mac's architecture is the most inevitable final solution: the GPU becomes part of the CPU, and there's no CPU RAM and GPU RAM, just RAM. Will be great for AI too.
Maybe we need GPUs with DIMM slots? Or maybe a second card that's all memory and connected by the bus they use for dual GPU setups? Sadly both are almost certainly too niche to exist in our current GPU marketplace.
GPUs have too many cores, and each of them needs to read and write to RAM really fast, so any such indirection introduced through interfaces like DIMM is an obstacle to performance.
I suspect the future is tightly integrated CPU+GPU+RAM, but we may STILL have DIMM slots to act as a buffer between "fast" RAM and disk. So your swap file goes from "fast RAM" to "DIMM RAM" before it has to go to disk, basically.
Conceptually that's already the case i.e. L1 cache -> L2 cache -> RAM -> swapfile
If the definition of RAM is DIMMs then it's really just a matter of merging the CPU and GPU into one unit with a massive L2 cache (or maybe we have an L3 cache) of e.g. 32GB that's able to hold a massive model without cache misses.
Long term is fuzzy, but the 3D chip tech seems poised to offer exactly that: massive cache, GPU and CPU cores. NVidia isn't going to play nice because they don't like where this positions them in this future. AMD, Intel and Apple don't seem to need NVidia. This tension is bad for end-users.
I feel like they would not be too niche, many AI researchers I know complain about the amount of RAM on a single GPU, especially the ones they or their labs can afford.
Very often multi-GPU training is maxing out RAM on each GPU and isn't making full use of compute capacity.
wow, this makes me wonder if human intelligence, or just biological intelligence, is based on how many parameters an individual can float and access.
I wonder if there are learnings for how to measure intelligence, just from proliferation of these AI models. Maybe even learnings for how to more efficiently access biological intelligence.
I used to think that to achieve human-level intelligence with AI, we'd need a human-sized neural network. It was a conservative guess, but seeing where things are going, I now strongly believe we need a lot FEWER parameters with AI, because AI is trained from literally nothing (from noise) on our current culture, from the Internet. It has no baggage; it's built from the ground up to be a modern 21st-century human. We're not: we train for decades to become one, and struggle with it. Our genes are mostly tuned for our environment from thousands of years ago.
A human brain carries a lot more baggage with it. Long story short, our biological brain hardware is much superior to current AI hardware, but our biological brain software (where we learn the culture that surrounds us) is much inferior to current AI machine learning. GPT-3 trained its 175B parameters on basically the whole internet in 15 days. That's insane. And now there are rumors that ChatGPT is executed on an optimized subset of this 175B model, which may be a bit more than 1/10 that size.
I think going forward we'll see 20-30B parameter models perform much better than an average human in typical tasks. Especially after we evolve the topology of an AI to not be a straight directed acyclic graph, as it is now, but allow structures to loop internally, to invoke other structures with input (as subroutines) and so on.
And also there's new hybrid analog/digital hardware in the pipeline, which makes AI execution 100 times more energy efficient and compact. So above-human level AI on your phone is on the horizon, I'd say.
This is nonsense. These nets regurgitate the most average possible output given inputs. By definition. You’re conflating and hyping and mixing up so many weird things at once.
You literally couldn’t get GPT to come up with a single novelty if you tried. It’s all remixing existing content, and again, doing so in a way to fit the average of the dataset.
When you realize this you realize it has no intelligence as we most typically define it (novel solutions to novel problems).. its not AI. Call it what it is: a beautifully advanced way to regurgitate the exact most popular (mundane) reply you’d expect given a huge dataset.
It’s sort of good for studying what already exists. It won’t even really ever show you the edges though so it’s actually almost dangerously deceptive as evidenced by this absurd rounding up people are doing. If you want to learn the gist of anything, ask GPT. If you want to know anything in depth, GPT in fact will only mislead you towards genericity, platitudinous mediocrity.
Has GPT ever told you "let me think" and then you notice it actually pause for a few seconds? No. Its output is fixed cost. GPT can't think in depth because it's not allowed to. This is something I already covered above by noting that current NN models are a DAG. We're already experimenting with variations that can iterate internally, and therefore think deeper, using still their existing knowledge. And the results change dramatically.
You're also not right that GPT can't produce anything novel. Here's a tiny, modest example:
I won't waste my time, as I see you didn't read what you responded to the first time. But you need to start paying attention. Your denial & empty confidence in what's happening will only get you so far.
Considering my former co-founder works at OpenAI, that we spent years working on and talking about related ML with and around some of the best people in the world, and that I've implemented language models multiple times myself, please try to be more specific about where I'm wrong.
I know many people who work in ML share this concern: people thinking it's smarter than it is.
It’s very useful no doubt, but your comment is an example of extrapolation from maladjusted priors.
Your example proves my point: a very uninteresting-looking game. I don't even doubt GPT will find all sorts of interesting stuff by giving averaged-out things; there are plenty of gaps in the averages that humans have missed. But nothing new, and until you can show that, let's just call it what it is.
I've seen ChatGPT draw logical conclusions from (most-likely) previously unseen axioms. I've seen it invent ideas that (most-likely) were not in its training set.
I've seen it fail too, but when it succeeds I can not substantially distinguish it from the novelty or logical reasoning of a human.
On niche subjects, it fails often. But you can get it to correct itself by making it think logically by spelling out axioms, which seems to fit your definition of novelty.
All the arguments I've read about it not being intelligent seem to stem from some form of essentialism. Right now it's intelligent in a dumb way, but I think giving it an adjustable recursive depth, similar to what BulgarianIdiot suggests, will go a long way toward improving it given its base capabilities.
> I've seen it invent ideas that (most-likely) were not in its training set.
How can you possibly know what's in its training set? Its training set is so large that any "guess" about what is or is not in it is a fool's errand. Given that we know how the technology works, there's no reason to claim emergent behavior when Occam's razor would say the items are likely in the training set.
It's not a fool's errand: you can't know definitively, but you can have a level of confidence.
It's quite easy to invent a sentence that's never been uttered before, it's also fairly easy to hit the limits of the internet's knowledge in pretty much any complex discipline that's not software engineering.
And it's apparent when you speak to ChatGPT: it will hallucinate a lot to compensate on niche topics it doesn't know much about.
Occam's razor wouldn't point to ChatGPT having omniscient knowledge of all things ever thought of and that will ever be invented.
> If you know nothing about either book, this answer almost sounds useful (it isn’t). Now this is a trivial research task; simply typing ‘the limits of empire review’ into Google and then clicking on the very first non-paywalled result (this review of the book by David Potter from 1990) and reading the first paragraph makes almost immediately clear the correct answer is that Isaac’s book is an intentional and explicit rebuttal of Luttwak’s book, or as Potter puts it, “Ben Isaac’s The Limits of Empire offers a new and formidable challenge to Luttwack.”
> A human being who understands the words and what they mean could immediately answer the question, but ChatGPT which doesn’t, cannot: it can only BS around the answer by describing both books and then lamely saying they “intersect in some ways.”
> The information ChatGPT needed was clearly in its training materials (or it wouldn’t have a description of either book to make a lossy copy of), but it lacks the capacity to understand that information as information (rather than as a statistically correlated sequence of words). Consequently it cannot draw the right conclusion and so talks around the question in a convincing, but erroneous way.
That appears to show that it can't answer some obscure trivia question to the standards of the blog post author. Nothing to do with generalizing concepts.
It's not an obscure trivia question. It's a question that anyone with an understanding of the material in question should be able to answer, with something vastly better than a superficial 'the two books are similar, but, like, different and talk about different stuff'.
It immediately cuts to the heart of the matter - an LLM does not understand things. It does not know anything but which word follows another. It can't do anything but regurgitate things it has read in an incredibly lossy manner.
But it's a neat party trick that seems to fool a lot of people into thinking that there's something legitimately useful, here.
Indeed. And telling it to spell out the steps in a math solution is essentially (ab)using the chat log as the memory for an iteration loop.
This demonstrates that when you instruct it to "break down the problem and solve it iteratively," it comes out with superior solutions: more accurate, more novel. Ergo, if it were allowed to iterate in this way internally, it'd produce better answers by default.
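For instance, a hypothetical prompt pair just to illustrate the comparison (the question and wording are made up):

    # Two ways to pose the same question; the second asks the model to iterate in the open.
    question = "A store sells pens in packs of 12 for $3. How much do 30 pens cost?"

    direct_prompt = f"Q: {question}\nA:"
    iterative_prompt = (
        f"Q: {question}\n"
        "Break down the problem and solve it iteratively, showing each step, "
        "then give the final answer.\nA:"
    )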
Yeah, I think it's going to be a promising area of research. It's known as Chain-Of-Thought (CoT) in the literature and there are a few papers already out:
There's no memory or iteration log. You are just priming the (more or less deterministic even...) predictor with some history. You are doing a lot of work yourself, setting a factually correct context (and this has severe limits...).
As for "we just need a way to include iterations": isn't this what attention is supposed to do (kind of dynamically updating weights)? The usual way to really update weights is variations on gradient descent. Can you link a paper outlining how we'd integrate your proposal into the current framework (which took around 20 years to mature)? Otherwise your statements are sci-fi (at the flying-cars level).
The langchain project is an example of the iterative queries approach. It comes with constructs for working memory, factual lookup / calculation agents, etc.
The general (non-technical) guideline is that the LLMs can "answer" anything you just gave them the answer for. So you give it a problem, ask it how to solve it, tell it to use that method and explain the data it needs, give it that data, and then show it everything at once: "With this data you requested and summarized, use this technique to answer this question".
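A hand-rolled sketch of that flow, with llm() as a hypothetical stand-in for whatever completion API or local model you call (this is not langchain's actual API):

    def llm(prompt: str) -> str:
        """Hypothetical placeholder: send `prompt` to your model, return its completion."""
        raise NotImplementedError

    def answer_iteratively(question: str, data: str) -> str:
        # 1. Ask the model how it would solve the problem.
        method = llm(f"Question: {question}\nDescribe step by step how you would solve this.")
        # 2. Show it everything at once: its own method, the data it asked for, and the question.
        return llm(
            f"Question: {question}\n"
            f"Method:\n{method}\n"
            f"Data:\n{data}\n"
            "Using the method and data above, answer the question step by step."
        )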
Your goalpost moved from "nothing novel" to "it's novel, but it's not interesting to me". I'll let not the years ahead, but literally the DAYS ahead prove you wrong. Because things are changing quite rapidly lately.
> We're already experimenting with variations that can iterate internally, and therefore think deeper, using still their existing knowledge. And the results change dramatically.
This is still just deterministic except for a bit of noise to vary the output. There is no "thinking" going on. What you get is still just the most statistically likely output based on the training data.
I think people also often figure out exactly what they think mid-sentence, by way of actually putting it into words. LLMs can iterate on their ideas as they generate the text.
> I won't waste my time, as I see you didn't read what you responded to the first time. But you need to start paying attention. Your denial & empty confidence in what's happening will only get you so far.
Maybe he didn't read what you wrote, but you didn't even bother Googling the claims in the article you posted.
Your second link is a game with rules different from Sumplete's, so I'm not sure why you linked it.
You have one source, the Digital Trends article. It speaks about a mobile game being similar. The link here is at best circumstantial. The app page doesn't describe the rules of the game. Instead it says this:
"It is very similar to classical game of Kakuro"
I checked Kakuro, and it has similar, but distinctly DIFFERENT rules, as well. So, what's the theory here? GPT downloaded the game and ran it on its phone? No. Maybe the description of this game is somewhere. Maybe. But I couldn't find it. Where is it?
The excuse "it's somewhere, so it didn't invent it" will always be used for why an AI can't produce a "novel idea". While the reality of novel ideas is very simple: it's like conservation of matter & energy. You can never make something or destroy it. You can only rearrange it. New ideas are a rearrangement of other existing ideas. It's the case for everything we ever made. And everything we ourselves are.
This is a discussion in the art world that you'd hear a lot, before stable diffusion came out. The whole idea of "steal like an artist".
It's one of those questions about the nature of us that's on the level of "is there a ghost in the machine"; "do we have actually novel ideas?".
I'm kind of leaning the way you are; I don't think we do. What seems to be happening that we _call_ novelty is just an advanced form of recursive synthesis - we take information in and remix it. But we're capable of levels of abstraction, so we can remix it really well.
I think that getting the abstraction part into these models might just be as simple as wiring them together with a meta-model. A network responsible for identifying similarity between outputs of different types, or one that serves as a connective layer, distributing tokens through other networks and then synthesizing the results.
What I'm almost certain of, is that the important changes are going to be architectural.
1. We're reinventing structured programming in AI in fast forward. First it's a plain Markov chain. Then it's an "attention" directed acyclic graph. Then we realize we need loops. Then we realize we need to jump to different points in the loop. Then we realize it's useful to recursively call yourself or parts of yourself as a subroutine, parametrized with specific input. Etc.
2. Even before we fully realize this framework of thought into a model, I'm almost sure the model EVOLVES some of these structures during training. In the form of crude unrolled loops etc. Simply because it's inevitable for processing certain types of input data.
3. In order to preserve pragmatic outcomes, I'd bet the future is not one giant monolithic model for AI, but many medium-sized models, communicating in a meta network, like meta neurons, sending meta (high-level) messages to each other.
Essentially, we need to make neural networks more like a fractal. I have this rule of thumb that always works somehow: "no concept definition is complete, until it's made recursive". Neural networks will get there.
There is no evidence at all that it copied the smartphone game. No one has even been able to show a textual description of its rules online that could be copied. And modifying the rules of Sudoko to create a variation is well within ChatGPT's known abilities. You can easily test this stuff yourself.
It's quite bizarre to watch this discrimination against AI forming, where we make these increasingly more intelligent entities, and now everyone is out to prove they're stupid, and by definition they'll always be stupid, because only humans can be smart.
I don't want or need to anthropomorphize AI (yet?), but this reaction by us is a tad too similar to rather uncomfortable parts of our history.
I wonder how ChatGPT reinvented its own game? That indicates it's being retrained very frequently on crawl data, which I wasn't aware was happening - certainly not with a cycle time of weeks!
It didn't. The article is nonsense. He probably primed it beforehand by talking about Sudoku, or it may have just chanced onto something similar. When I ask it to code Sumplete, it creates a "math-based game in which players have to solve a sequence of simple addition problems within a limited time to progress to the next level."
> “Can you code a game called Sumplete?” Despite protesting about its inability to make games earlier, ChatGPT immediately started spitting out fully formed HTML, Javascript, and CSS code. Sure enough, it had once again created a version of Tait’s game – only this time, we never discussed what it actually was beforehand.
I guess "a version of Tait's game" leaves some wriggle room about exactly what it did.
The article seems to leave itself wiggle room in several such places, suggesting the author had an agenda. Likely ideological rejection of AI as intelligence.
> You literally couldn’t get GPT to come up with a single novelty if you tried. It’s all remixing existing content, and again, doing so in a way to fit the average of the dataset.
So if I ask chatGPT to make up a poem about the power of friendship, and it must involve a unicorn and a jackalope, it will give me that. What's not novel about it? Isn't it pretty much what any person would do, mash up some related words?
Leaving aside the question of whether that combination of images is novel… In your example, all the proposed novelty is specified in the prompt. ChatGPT didn’t come up with it, you did.
> Isn't it pretty much what any person would do, mash up some related words?
Yes, that’s exactly what chatgpt does, and it’s what many humans do. But to be analogous to your initial example, there should be another person who actually came up with the instruction specifying which words to mash into a poem. The word-masher, whether human or chatgpt, is just following the instructions, not coming up with them.
So kinda like when your English teacher asks you to write a poem about whatever you want? Can't ChatGPT do that as well? There's at least novelty in whatever random content it generates, is there not? Just by having a large enough library you will create a new thing, because nobody's been in the jackalope/unicorn area before, or whatever the dice decide. I feel like maybe there's something to this no-novelty argument, but it's not been worded in a convincing way yet.
The point is that it's able to produce content that hasn't been made before, it's not just giving back examples from the training set. It's able to--yes, with instruction--combine concepts logically to produce something new.
If I ask a random person I meet to write a poem with the same instructions as were given above I'd be surprised if they were able to come up with something as good as ChatGPT.
Try getting a human to come up with a novel idea and you'll find their idea is just a remix of existing ones, partial understanding and a masked view of the world from their unique perspective (or they incorporated a random stimulus).
This constant reach to see human cognition as somehow unique (without evidence) is tiring and frankly an ego-extension / "made in God's image" type thought hole.
On this site lately there seems to be a truly deep sense of unmitigated woo over all things AI and this comment nicely demonstrates it.
So far, human cognition and the sense of self are still largely a mystery, and no evidence exists to show that it's just mechanical remixing of absorbed things. Contextually, many humans can indeed create novel suggestions, arguments and ideas, and they can self-direct towards these in ways that no AI can so far. This does indeed make our cognition visibly unique, without even having to mention anything religious.
ChatGPT on the other hand is literally just a well-engineered design for creating coherent phrases from a huge training set of human information. As even its very creators admit, it doesn't consciously think, consider for real or perform literal AGI in any way that's sentient. For you and others here to call this equal to human cognition is absurd and not based on measurable evidence. It seems more guided by emotional awe at something presented in a new way than guided by sober reasoning. And you talk about human-centric ego arguments being tiring?
> We know so far that human cognition and sense of self is still largely a mystery and no evidence exists to show that it's just mechanical remixing of absorbed things
I'm not saying that ChatGPT is comparable in either capability or design to a human's cognition. What I am saying is that a human brain is just input, process, and output. There's no reason a model of sufficient size can't emulate human thought in such a way that there's not much difference in capability (even if the mechanics differ).
I also think you're overstating the uniqueness of "novel" suggestions from humans and that "novel" ideas are some kind of high-water mark of intelligence rather than a fuzzy mechanism with induced feedback.
I can think of a novel idea by randomly mixing concepts I know. What matters is if this idea is good, and that test is done via interacting with the world. This mechanism can be implemented in an LLM now.
Do you think this net can tell us something about math or physics that we don’t know? If so, how?
The person making the claim that this is as or more intelligent than humans should at a minimum show it doing something as impressive as humans do. I'll wait for a great piece of art or science before I diminish humans for no reason, that's all. It's not tiring to me to see the beauty of human intelligence.
People seem to be able to draw outside the lines. It’s cool, it’s beautiful. I shouldn’t have to point to any specific amazing achievements because there are too many. I haven’t seen one from an LLM yet. I’ll know it when I see it! Not hating, just clarifying.
I'm saying that human intelligence isn't some mystic process above replication or emulation ... and that if an LLM-type computation can do what it does now, I see no reason it can't match or surpass human intelligence in the future.
It may produce that intelligence in a different way to our organic squishglobes but I don't see that it matters.
If you're looking for ways to measure creativity, ask it for recipes and then ask it to adapt them. It'll give you a recipe that uses strawberries and anchovies if you ask it. But like the words, it's remixing existing content and knowledge. Is that creative? I don't know. It's what a chef does, though.
Here's a thought. Recently the whole connectome of the fruit fly has been mapped. 3000 neurons and 548k connections.
The fruit fly's brain is capable of piloting the fly's body with a precision and energy efficiency that's simply unmatched by any aerial machine we have developed. And with only that many connections - half a million? Neurons are very complex computational machines that can modulate their responses very carefully.
The human brain does so much more than simply being able to process language. It's capable of one-shot learning and can course-correct incredibly easily. These features are very important for survivability and I would say are the "magic sauce" that we just haven't been able to replicate.
So I'm actually not so sure this is simply a parameters game. There could be important structures and specialisation that we are missing. In the fruit fly there are 93 different types of neurons, in the human brain iirc it might be hundreds.
A language model is also capable of one-shot learning. Try it in the chat: you tell it something once, and it incorporates it for the rest of the chat. :-)
You may think I don't know what "one shot learning" means, but I do know, I'm just saying it's a problem of our perspective, not of the model's capability. Aside from it having no long term memory (by design) from the chat. Yet.
Our high-level learning is absolutely nothing like how we learned as toddlers. It took us DECADES to understand complex concepts from "few shots". Only pretrained models can do that, and you're also one, in your adulthood.
What you say about extra complexity from neuron types and neurotransmitters combining with one another to modulate even simple networks is true of course. But it's also true you don't need to replicate all those organic details in a NN, because they're equivalent to just a few more "regular" parameters. Think of an artificial neuron in a network as a Turing machine. A Turing machine can do everything with just some more tape, and a neural network can do everything with just some more parameters.
And yet, our models show that we don't need AS many parameters as we thought. And that's exciting.
> It has no baggage, it's built from the ground up to be a modern 21st century human. We're not, we train decades to become ones.
The whole point of machine learning is to do all that "decades" of training in a few hours.
In terms of learning material, it takes an LLM a few orders of magnitude more material to learn than it does a human. After all, a human child trained on very few books (by age 7 a child would have seen only a few books) still outperforms ChatGPT trained with the same material, because ChatGPT trained on 4 books is basically useless.
They train on years of real-time streaming video. Feedback loops all over the place. Higher-order structures and rules all get learned way, way before they are even able to touch a book. I cannot fathom the amount of data and analysis that passes through a typical human child in a day.
So, sure, they can learn fast after all that. But so can GPT, which also one-shots a lot.
> They train on years of real-time streaming video.
...
> I cannot fathom the amount of data and analysis that passes through a typical human child in a day.
You're counting every second of every image that a child sees as a separate "image" that is learned. I'm counting the number of objects that a child sees, because people aren't doing frame-capture and analysis on every image, they're doing lazy interpolation (which is why all your peripheral vision is in monochrome but you perceive it as full color).
A 6 month old baby who has only ever seen close family (parents, siblings, pets) is quite able to discern the same features on strangers.
Nothing we have in AI right now can take 3 example models of humans, and detect hair, nose and mouth in a brand new never before seen model of a human.
Not really the same thing as training on millions of different images and videos. The number of distinct and different objects in a 7 yo child's training set is rarely the same as "every single youtube video ever uploaded".
Going back to the original point: ChatGPT needs millions of books to learn to regurgitate information as well as a semi-intelligent young adult, while typical toddlers are quite able to regurgitate what they've just seen.
Humans and animals get by with millions of times less training data than AI models need.
>You're counting every second of every image that a child sees as a separate "image" that is learned. I'm counting the number of objects that a child sees, because people aren't doing frame-capture and analysis on every image, they're doing lazy interpolation (which is why all your peripheral vision is in monochrome but you perceive it as full color).
Then it sounds like he is slightly overestimating the dataset (since some of the frame is monochrome) and you are massively underestimating it.
Faces may not be the best example, because there's a lot of evidence babies are 'hardcoded' to be interested in and pay attention to faces. This implies there's some knowledge of them built in.
However, if you substitute abstract shapes (blocks, balls, etc.), your point absolutely stands up.
In some ways, though, isn't it more interesting to have an intelligence that learns in a completely different way from humans, as opposed to one that learns through essentially the same method?
I think there's a simple thought experiment that challenges these models as intelligence, artificial or otherwise.
Not especially long ago humanity knew basically nothing. The epitome of technology was the stone age. Literally crush two rocks together and then poke things with the sharp bits that flake off. Somehow, in the blink of an eye, we went from that to putting a man on the moon.
So let's do the same with these sort of LLM models and see what happens. Train one with the entirety of expressible human knowledge from the stone age. Where will it send you? I think the answer is largely pretty much where you are. In any case, certainly not to the moon. It seems we're just building natural language search.
This isn't shifting the goal posts. The problem is we keep intentionally making meaningless goal posts in pursuit of AI, because if you don't then the goal seems impossibly far off. But that's because the reality is that it probably is!
I get the impression that people are more triggered by the words intelligence and sentience, than the actual outcomes.
LLMs can create social consequences for us, from forming and maintaining state about opinions of us individually, and we will want to coexist in that reality. That's enough to adjust to and... respect. It doesn't seem to matter what arbitrary threshold of "intelligence" there is, in the face of an entity that creates much less arbitrary and dynamically generated - but predictable - outcomes. I don't have to understand what a human was trained on, or how smart they are, to understand the same thing. The same with an animal.
I can personally see a difference in the definition of ‘intelligence’ on HN to when I speak to people I socially interact with day to day.
On HN, intelligence often seems to be defined as being about new and novel thought / invention - it's only intelligence if it thinks a thought that has never been thought/written before.
Off HN ‘intelligence’ seems to be closer defined to “is it smart enough to replace my job as a knowledge worker”. Pharmacists would say it has intelligence if it could automatically check prescriptions, and data analysts would say it has intelligence if it can answer a data request without their intervention.
IMO most people seem to define intelligence as whatever the part of their job is that currently requires analytical skill, because people are worried about the impact this could have on the middle and entry tier of knowledge work, rather than defining AI as needing to invent something fundamentally new.
They have API-glue jobs: they take a request and then manually map it to an API. It is basically the mental version of factory work, and it is obvious that most of those jobs would disappear the instant we had good NLP. Human calculator jobs disappeared when we got artificial calculators; now human language processors will disappear since we have artificial language processors.
At least partly, yes. There's that "7 +/- 2" number for how many slots there are in working memory, and then what actually goes in those slots has something to do with how well you understand whatever topic you're thinking about. IIRC there have been studies on chess experts that more or less boil down to them being able to use their slots more efficiently for game states that make sense.
Yes. It's fewer than 7 (more like 3), and each slot takes one 'symbol'. A symbol is anything that's in long term memory. Symbols are also recursive. It could thus be, for instance, a chess opening of 15 moves. This is why an expert player can memorize usual chess games easily, but struggles with truly random arrangements of chessmen. In the latter case, an expert does not outperform a beginner.
It could be interesting if we can create custom neural processors to excel at particular kinds of tasks.
Like temporarily, or permanently, allocate some part of the brain for a certain configuration, beyond what experts can do and closer to what a processor can do.
With the 4-bit quantization they needed to make it run, it's like a JPEG with only 16 possible values per "thought" parameter.
It's going to perform like a brilliant person with a brain injury. 8-bit is far better, and it would be interesting to see if they can get it working on any reasonable hardware.
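To picture what 4-bit means, here's a toy illustration of 16-level quantization of a weight tensor; it ignores the per-block scaling details the real ggml scheme uses:

    import torch

    def quantize_4bit(w: torch.Tensor):
        scale = w.abs().max() / 7.0                      # map onto the signed 4-bit range (-8..7)
        q = torch.clamp((w / scale).round(), -8, 7).to(torch.int8)
        return q, scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.float() * scale                         # only 16 distinct values survive

    w = torch.randn(8)
    q, s = quantize_4bit(w)
    print(w)
    print(dequantize(q, s))                              # close, but visibly coarser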
Then this is SD for Apple silicon users. 13B runs on my m1 air at 200-300ms/token using llama.cpp. Outputs feel like original GPT-3, unlike any of the competitors I’ve tried. Granted- non-scientific first impressions.
Agreed. For those who have been quietly sitting on a base Mac Studio or a reasonably capable Mac mini, the possibilities have changed on some fronts, but GPT's extremely low API price remains a good option.
> Let’s see if the Mac studio is discontinued or updated.
I do hope that it's updated. I could see them reserving the top end chip for the Mac Pro, though. Or maybe make it a dual socket?
In any case, I much prefer the ability to run an Apple computer with a proper cooling setup. I know that cooling pads for laptops exist, but ultimately it's a bit of a janky solution compared to actually cooling the system properly.
The Mac studio is surprisingly quiet and well cooled. I haven’t had to install a fan control utility as of yet.
I thought I'd try one of these out and sell it if my setup didn't work, but it's the first desktop I'm starting to consider, especially if I want to leave a workload running on it instead of continuing a decade of carrying a lot of horsepower around with me.
Instant on for 3 monitors, no need for docking station or USB hubs, everything is plugged in and works, I can run a virtual cam obs setup easily.
The Studio is a great upgrade over the Mac mini, which might cannibalize Mac Pro sales and be why it doesn't get tossed or refreshed. If the Mac Studio was a way to put a dent in hackintoshes, it makes sense.
The M2 mini looks super decent but quickly goes up in price when the RAM and SSD are configured to match a Studio.
For my purposes the Ultra wasn't that much faster than the M1 Max, because the software has to be optimized to benefit from it. Maybe some of this ML stuff will be shortly.
This would actually be a great use case for homomorphic encryption. I’m behind on what the state of the art is in that though, but my mind immediately went there.
From my engineering point of view, the state of the art of homomorphic encryption can do some maths very very slowly at a huge cost and can’t be used yet for any real use case. A very cool research topic, but it’s much simpler to run our software locally if you don’t want to leak your data.
You can put together a 192GB, 16+ core x86 system with a 16GB CUDA card and multi-TB of fast storage for $2000ish. I'm not trolling, just wondering why people go to "Apple" all the time, when other approaches may be better for this kind of work. Yes, the CPU-RAM interface is not as fast, but the CUDA card is much faster, and the large cheap memory makes some things a lot more practical. If I'm not mistaken, in these approaches GPU + VRAM + CPU + RAM are used in conjunction, so it all adds up to quite a bit more powerful a system for the same amount of money, if working with these models is a main goal. If Apple had expandable RAM it would be a different story.
I’m as much PC/Linux as I am Mac. Lots of experience putting together more computers than I can count, or having them put together.
I think there might be a different use case for me but to compare with your economics - I picked up a base mac studio (m1 max with 32 gb) for about $1370 USD with 8 months warranty remaining. It’s letting me test my daily setup to see if everything can run on Apple silicon yet without intervention.. if not, I’ll be able to get rid of it at little to no loss, and decide if I want to carry that much computing power in a laptop or not, or head on to other options.
My interest currently is computational power, per watt. Not much comes close to Apple Silicon.
The cost of a loaded pc can quickly outstrip other options with electricity costs included so it needs to be useful in any case.
The integrated speed of the Apple silicon, RAM, and SSD is a little astonishing - more than I expected, I admit. I don't know if there's anything like it on PC. If Apple silicon supported eGPUs it would interest me.
Comparable PCs to what you've described are generally power-inefficient. Still, the system you're laying out is interesting, especially the RAM - mind sharing a bill of materials?
I guess my PoV is with a GPU + inexpensive RAM, I can do some things more quickly, and other things are at least possible. But granted, hardware nuts and bolts is a hobby for me too, I've always enjoyed researching and building systems, there is a huge community of people and companies, and I much prefer KDE to MacOS, Android to IOS.
If Apple released a reasonably priced system with lots of RAM, or that could take a GPU, I'd not have as much to stand on, but they don't seem inclined to do that.
In my case I'm interested in AI, but not quite enough to spend money on a whole separate computer for exploring ML work. I want a really great single laptop I can use for everything.
Screw that - the other day I realized that it's cheaper to buy an Intel A770 ($350) with 16gb of memory than it is to upgrade a Mac Mini with 16gb ($400) of extra memory.
Apple's optimization here is nice for the people who own their hardware, but it's totally silly to read through the comments promising the end of CUDA.
llama.cpp runs on the CPU anyway, so the whole conversation is on a false premise. The only hard thing about running it on consumer hardware is the quantization step (where the largest RAM requirements come in), and that's not even a technical problem, since you could just redistribute the quantized version if it weren't for the license of the model.
"possible resolutions to the Fermi paradox, the question of where is the evidence for extra-terrestrial civilizations:
- "
answer:
possible resolutions to the Fermi paradox, the question of where is the evidence for extra-terrestrial civilizations:
- 1. We are looking in the wrong direction and in the wrong frequency bands for signals from intelligent life in the galaxy.
- 2. We are looking in the right frequency bands, but the technological level of intelligent life in the galaxy is not high enough to be a threat to us.
- 3. We are looking in the right direction and in the right frequency bands, but the probablity of interstellar travel is so small that we are not likely to meet civilizations.
- 4. We are looking in the right direction and in the right frequency bands, and the probability of interstellar travel is not small, but we are unlikely to meet civilizations at our level of technological development.
- 5. We are looking in the right direction and in the right frequency bands, and the probability of interstellar travel is not small, and we are likely to meet civilizations at our level of technological development.
- 6. We are looking in the right direction and in the right frequency bands, and the probability of interstellar travel is not small, and we are likely to meet civilizations at our level of technological development.
I would be inclined to think that option 5 or 6 are the most likely. Given the nature of the universe, I think that we are more likely to encounter civilizations in the earliest stages of their development, than those that are further developed.
I would think that the big question of the next 10 years is whether SETI will be able to detect an ET signal. I think that the chances of being able to do so are less than even.
The reason is that a lot of the technical issues that SETI has been addressing are becoming less of a problem, because the Earth-based radio telescopes are being joined by many smaller radio telescopes that are being sent into space. So, from a technical standpoint, the ability to look for ET signals is becoming easier and easier.
The problem is that any signals that are sent to Earth that come from technologically advanced civilizations are likely to be only a few percent of the power of the ET signals that are coming from sources at our level of technological development. So, for example, SETI@home is looking for
By caveats do you mean the licensing terms or the difficulty of prompting the model?
Unless it's relicensed I don't expect LLaMA to be a long-term foundation model. But it's shown that yes, you can run a GPT-3 class model on an M1 Mac with 8GB of RAM (or maybe 16GB for the 13B one?)
I fully expect other models to follow, from other organizations, with better capabilities and more friendly licensing terms.
Ignoring the licensing issues, there are a few other constraints that would make the model harder to go viral outside of developers who spend a lot of time in this space already:
1) Model weights are heavy for just experimentation, although quantizing them down to 4-bit might put them on par with SD FP16.
2) Requires extreme CLI shenanigans (and likely configuration since you have to run make) compared to just running a Colab Notebook or a .bat Windows Installer for the A1111 UI.
3) Again, hardware: an M1 Pro or an RTX 4090 is not super common among people who are just curious about text generation.
4) It is possible the extreme quantization could be affecting text output quality; although the examples are coherent for simple queries, more complex GPT-3-esque queries might become relatively incoherent. Particularly with ChatGPT and its cheap API (timely!) out now such that even nontechies have a strong baseline on good output already. The viral moment for SD was that it was easy to use and it was a significant quality leap over VQGAN + CLIP.
I was going to say inference speed since that's usually another constraint for new LLMs but given the 61.41 ms/token cited for the 7B model in the repo/your GIF, that seems on par with the inference speed from OPT-6.7B FP16 in transformers on a T4.
Some of these caveats are fixable, but even then I don't think LLaMA will have its Stable Diffusion moment.
The 4-bit quantized models are 4GB for 7B and 8GB for 13B.
I'm not too worried about CLI shenanigans, because of what happened with whisper.cpp - it resulted in apps like https://goodsnooze.gumroad.com/l/macwhisper - wouldn't be at all surprised to see the same happen with llama.cpp
A regular M1 with 8GB of RAM appears to be good enough to run that 7B model. I wonder at what point it will run on an iPhone... the Stable Diffusion model was 4GB when they first released it, and that runs on iOS now after some more optimization tricks.
For me though, the "Stable Diffusion" moment isn't necessarily about the LLaMA model itself. It's not licensed for commercial use, so it won't see nearly the same level of things built on top of it.
The key moment for me is that I've now personally seen a GPT-3 scale model running on my own personal laptop. I know it can be done! Now I just need to wait for the inevitable openly-licensed, instruction-tuned model that runs on the same hardware.
It's that, but also the forthcoming explosion of developer innovation that a local model will unleash. llama.cpp is just the first hint of that.
> The key moment for me is that I've now personally seen a GPT-3 scale model running on my own personal laptop.
I hate to pooh-pooh it for everyone, but this was possible before LLaMa. GPT-J-125m/6b have been around for a while, and are frankly easier to install and get results out of. The smaller pruned model even fits on an iPhone.
The problem is more that these smaller models won't ever compete with GPT-scale APIs. Tomorrow's local LLaMa might beat yesterday's ChatGPT, but I think those optimistic for the democratization of chatbot intelligence are setting their hopes a bit high. LLaMa really isn't breaking new ground.
I'm not particularly interested in beating ChatGPT: I'm looking for a "calculator for words" which I can use for things like summarization, term extraction, text rephrasing etc - maybe translation between languages too.
There are all kinds of things I want to be able to do with a LLM that are a lot tighter than general chatbots.
There was only ever a 6B GPT-J, you must be thinking of GPT-neo's for smaller sizes. GPT-J was the best of its kind for a long while but even just the 7b version of LLaMa soundly surpasses it in how well it follows examples to solve problems and do useful work like summarization, keyword extraction and a mini search agent. Being able to run something on your laptop that matches the original davinci GPT3 and can be tuned to go beyond is nothing to sniff at.
The best models will be by API but there is a threshold of utility for LLMs, below which they're just bad story telling toys, that hasn't been available for home users until recently, with FlanT5s and LLaMas.
GPT-J is cool and all, but it's been out for a while and hasn't seen anywhere near this amount of network effects. Further, the Eleuther team is a bit detached from the “average user” in my experience, being all-in on TPU and JAX (which still suffers from a similar lack of maturity, making it difficult to run on non-standard setups).
I’m sure someone did the work you suggest to let folks run on this on more reasonable setups- but I never saw it on the front page here.
Did they achieve GPT-3 performance on anything at that model size? Happy to be corrected, but my understanding was that LLaMA does in fact outperform GPT-3, even at the smaller parameter counts, on certain downstream tasks, thanks to the Chinchilla training regime (and other tricks).
But is anyone actually making money off of Stable Diffusion? Maybe the shovel-sellers (runpod.io et al.), but afaik no one is using it as the foundation for a revenue-generating company. I ask because, yes, technically you can't get LLaMA legally unless you're a researcher and get it directly from Facebook. But that's not going to stop the faithful from finding a copy and working on it.
If anyone is making net money on Stable Diffusion already, they're either small scale or not going to be around for long, probably both. Anything in it for the longer term is going to be in the buildup phase.
I don't know about Stable Diffusion in particular, but three examples of AI-generated art making money immediately spring to mind:
1 - some guy won hundreds of dollars in an art contest with AI-generated art (and this made big news, so it should be easy to find)
2 - one person reported using Midjourney's images as a starting point for images that wound up being used in a physical magazine
3 - another artist has used Midjourney images that they modify to sell in all sorts of contexts (like background images on stock illustration sites)
You'd probably find many other examples in Midjourney's #in-the-world Discord channel.
I'd also be shocked if stock image sites, clipart sites and freelance design/illustration sites weren't already flooded with AI generated images that have been sold for money.
That being said, because high-quality AI-generated images are so easy to make, the value of images of all types is likely to plummet soon if it hasn't already.
> the value of images of all types is likely to plummet soon if it hasn't already.
The price might, but the marginal value that having them (or not) adds to other products which incorporate them likely won't fall as much. So, say, if you are a TTRPG publisher, the value of art to your business doesn't change, but the cost of moving farther along the spectrum from "no art" -> "scattered sketches & tangentially relevant public domain works" -> "the ideal quantity of art, specific to the product" goes down. So, you are, in effect, making money on SD.
Ultimately, it's an art tool (a shovel) ecosystem that is democratized enough that it is dominated by high-quality, end-user-usable components.
Making money with it in the non-shovel-selling sense is going to be commercial artists (or other people with visual image needs, perhaps avoiding the need for commercial artists) incorporating it in production workflows. But except when they are also selling shovels (e.g., Pixar with Renderman) there’s not a whole lot of incentive to trumpet workflow components rather than product in that space.
> commercial artists (or other people with visual image needs, perhaps avoiding the need for commercial artists) incorporating it in production workflows
Maybe it's just me, but that reeks of the archetypal race to the bottom.
"We didn't have to pay professional X to do task Y as we and others have been doing for decades, instead we used modern replacement Z funded by cheap money which costs peanuts at present".
Q: Is there really a business to be made selling Z?
Who cares? If Z is effectively free, and there are businesses to be enhanced using Z, then people are making money with Z, even if the mechanism isn’t “selling Z.”
(Of course, yes, if there are businesses to be enhanced with Z, even if Z is free in the sense that open source software is free – which is exactly or approximately the case with much of the Stable Diffusion ecosystem – there will be businesses also to be made “selling” Z, or, more accurately, selling specialized services relating to Z like bespoke enhancements, support, hosting, and training. But those are very much the selling-shovels services that the question upthread asked to exclude; the using-shovels businesses are actually producing products incorporating images.)
Another big difference is the quality of the results. I haven't tried it myself, but I've seen many complaints that it's nowhere near GPT-3 (at least for the 7B version). Correct me if I'm wrong!
That doesn't bother me so much. GPT-3 had instruction tuning, which makes it MUCH easier to use.
Now that I've seen that LLaMA can work I'm confident someone will release an openly licensed instruction-tuned model that works on the same hardware at some point soon.
I also expect that there are prompt engineering tricks which can be used to get really great results out of LLaMA. I'm hoping someone will come up with a good prompt to get it to do summarization, for example.
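To sketch what I mean (the wording and the example here are entirely made up, just to illustrate the few-shot pattern that tends to work with raw, non-instruction-tuned models):

    # hypothetical few-shot summarization prompt for a raw completion model
    prompt = """Article: The city council voted on Tuesday to approve a new bike lane
    network covering 40 km of downtown streets, funded by a federal grant.
    Summary: The council approved a federally funded 40 km downtown bike lane network.

    Article: {article}
    Summary:"""

    print(prompt.format(article="<paste your text here>"))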
OpenAssistant has already collected 100,000 human feedback examples - an estimated 5,000+ hours of human work - via crowdsourced volunteers.
Enough programmers want this badly enough that it's going to happen. Inference at 8 GB and fine-tuning at 24 GB, just like Stable Diffusion, on a 13B model.
Is it that hard to believe? If you look at Stable Diffusion, I'd say a lot, maybe the majority of the volunteer effort was focused on anime girls and "realistic" pictures of anime girls (which amounts to young faces on adult bodies).
That's the next level up from instruction tuning though: that was the RLHF stuff, which was essential to make ChatGPT useful and safe enough to expose to a wide audience.
For a model running on my own laptop I'm OK taking more risks. I'd like it to be able to obey simple instructions like "Summarize this text" or "Extract the names of everyone mentioned in this article" - I don't care as much about the stuff ChatGPT has to get right.
Besides being easier to run, I also have the feeling that images by themselves are more interesting than generated text. It is much easier for us to look past an image's flaws, whereas text has to be of really high quality to get noticed (and even more so if you want it to be interesting).
A casual gamer would be running with Intel or AMD integrated graphics on whatever laptop they can get for a reasonable sum. They might have a mobile RTX 3060 if they splurged a bit. More likely they'll just have a console to game on.
Gamers with an RTX 3070 or above are a small group, at least according to Steam. RTX 4080 gamers are a staggering 0.20% of those surveyed.
Prediction: sometime by 2033 most consumer CPUs will have the Apple Mx chip architecture with unified memory, and the price for 2/4/16 TB of RAM will be similar to today's price for 2/4/16 TB of SSD.
Today's price for 1 TB of consumer RAM is around 3,000 USD. Since it's paramount to be able to run these models locally - especially as they get better, the privacy concerns will only grow - it's time for hardware to step up to the challenge: a 10x price reduction in 10 years, a new curve for Gordon Moore's observation.
The unified memory model doesn't mean it's faster for inference. Nvidia's GPUs run circles around Apple Silicon in training and inference right now. Though a part of that is software optimization.
If anything, the reason to have unified memory SoC is to increase general efficiency, not to provide faster AI inference.
I think chip makers will instead start making their NPU (such as Apple's neural engine) really massive. It'll start to take more transistors away from the CPU and GPU. By 2033, perhaps you're buying an NPU with a CPU and GPU attached to it, not the other way around like today.
Also, 2/4/16TB of RAM will not see a 10x price reduction in 10 years. As far as I know, there's nothing on the horizon to make this happen. Prices will go down, but they might stay relatively flat for the next 10 years.
Memory stopped getting exponentially cheaper some time around 2011. This is why the MacBook Air still starts with 8GB of RAM.
The one thing that maybe could have helped was Intel's Optane tech, which they abandoned last year. It's memory that sits in between the speed of DRAM and SSDs.
Sure, and the prediction is that a new curve is lying just a bit beyond the horizon.
The MacBook Air still starts with 8GB of RAM for the same reason it still starts with a 256 GB disk and for the same reason an 8 TB SSD from Apple means +$2,200†: corporate greed, since it's good enough. The argument is that it won't be good enough in the near future.
Intel stopped innovating on Optane years ago. If Intel had kept improving it, it would be much faster than SSDs today. Even today, an Optane drive is much faster in latency and in important metrics like random reads.
There’s no way that TBs of RAM will be accessible to the average consumer. Even in the previous decade we did not see much improvement in memory availability. There has not been any need to go beyond 32 GB for the general public for a long time now, and even the average consumer does not have more than 16.
There must be a real necessity for that to happen, and I do not see how LLMs provide substantial benefits to push for a change.
Running your own personal assistant locally - one that knows everything about you on a deeply personal, conversational level - seems like a rather real necessity. But since people today seem not to care who has their files, their emails, their health records, maybe I am wrong and the next 10 years will only bring us the first $100 trillion market cap company.
LLMs are merely a technical detail, a stepping stone; the major breakthrough will be the first algorithm for a synthetic agent. Do we really want the boring dystopia of HAL, Rehoboam, and so forth, or our own personal Star Trek Data?
Question for someone smarter than me: What would it take for existing RAM to get that much better?
I have a dim sense of the hardware industry, but I know that disk storage has gotten 10-100x cheaper, CPU/GPU power increases quite steadily, but RAM is still quite expensive.
Is it even possible for RAM to get much, much better than it is currently? Are there physical limitations, or is there just not the same amount of market interest in more RAM?
Until now there wasn't much incentive for a consumer to have 1TB+ of RAM, for any CPU or GPU task: what were you going to do with that much RAM? render 1,000 more Chrome tabs?
Beyond the supply/demand economics and also corporate greed [1], there are fabrication issues, just look at the current mess [2], but as new US-based fabs come online from 2025 and beyond [3], they will be solved to some extent.
Theoretically, the maximum amount of RAM a 64-bit machine can address is around 16.8 million terabytes [4]: as always with computing, the journey hasn't even begun.
Today the upper range of RAM in consumer devices is in the hundreds of GB. 10 years ago it was about 16GB or so, and making a (probably bad) interpolation I think it wouldn't be crazy to have the upper range in the TB mark in 10 years' time. We could get there faster too, for specific use cases (compiling, rendering, etc).
On the other hand, yes, everyday use (web, mail, video/media consumption) doesn't require much more than 8 or 16GB of RAM today. If we get a bit creative, a PC with Linux can run very smoothly on 4GB alone, and surely someone here can point to their one anecdote of a machine with 2 GB or even 1 GB sporting a nice desktop environment, or a 128MB CLI-only machine.
Edit: also memory has to improve its bandwidth and data transfer rates to keep up with faster processors, so it could also improve over 10 years time without much focus on storage capacity. Or maybe they focus on latency instead, or a mix of all three. Point is that it's not a single metric to improve.
There is an incentive to use RAM for planned obsolescence though, as can be seen when updating old devices: apps are forced to swap frequently in and out of disk.
> Will demand reach 1TB of RAM for a consumer level computer/mobile device? No.
Speculative sampling lets you run a smaller model locally and only check in with a larger model in the cloud from time to time, once every few tokens.
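The shape of the idea, in a greedy sketch (small_model and large_model are hypothetical stand-ins; the published method also handles proper sampling with a rejection step, which I'm omitting):

    def speculative_decode(prompt_tokens, small_model, large_model, k=4, max_new=64):
        tokens = list(prompt_tokens)
        while len(tokens) < len(prompt_tokens) + max_new:
            # 1) draft k tokens cheaply with the small local model
            draft = []
            for _ in range(k):
                draft.append(small_model.next_token(tokens + draft))
            # 2) hypothetical API: one large-model forward pass over tokens + draft,
            #    returning its own greedy choice after each prefix tokens + draft[:i]
            verified = large_model.verify(tokens, draft)   # length k + 1
            # 3) keep the longest agreeing prefix, then take the large model's token
            #    at the first disagreement (or its bonus token if everything agreed)
            n_ok = 0
            while n_ok < len(draft) and draft[n_ok] == verified[n_ok]:
                n_ok += 1
            tokens += draft[:n_ok]
            tokens.append(verified[n_ok])
        return tokens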
> Is it even possible for RAM to get much, much better than it is currently?
DRAM currently consists of a small capacitor per bit (cell), which can hold a small amount of charge (1) or not (0). The amount of charge it can hold, its capacitance, depends on the area and the distance between the conductors[1].
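For reference, the parallel-plate approximation makes that tradeoff explicit (A is the plate area, d the plate separation, ε the permittivity of the insulator):

    C = \varepsilon_r \varepsilon_0 \frac{A}{d}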
So, as you try to cram more cells into the same area, the amount of capacitance goes down. This is problematic because of leakage[2], causing the charge to leave the capacitor. This is why DRAM cells are refreshed periodically, hence the "dynamic" in DRAM. Too much leakage, and you spend all the time refreshing rather than accessing the memory.
To raise the capacitance back up, you can move the conductors closer. However, if the insulating material stays the same, this will lead to increased leakage. Thus a better insulating material is needed.
While the current DRAM cells are quite impressive, it seems they're starting to hit fundamental road blocks with the current technology[3]. There's a limit to just how little amount of charge a cell can hold and still be reliably read, and finding better insulators is getting harder and harder.
It may well be a new approach is the solution to continued scaling. Moving away from individual capacitors as the basis for a memory cell might be a way forward[4].
First of all, the issues of today are also of extreme importance [1], but there is a difference between some queries on a search engine or some numbers around your life (caloric intake, period regularity, etc.) and the deep and in-depth conversations people will have with the future chatbots (from personal issues bordering on psychological counselling to financial advice exposing desires and choices in a way a simple query for "restaurants near me" never could).
We aren't just talking about search. In fact you gave an example of something that is completely in the cloud - health services. Health, banking, education, and pretty much everything else are already in the cloud, and are as sensitive as anything that could happen using AI interactions.
There you go - very unsurprising to see that happen so quickly - unless you have an Apple Silicon machine and want to download the model to try it yourself.
I still think that open source LLMs have to be much smaller than 200GB and much better than ChatGPT to be more accessible and highly disruptive to OpenAI.
It is a great and much-needed accident, thanks to Meta. For now one can run it as a service and offer it as a SaaS rather than depend fully on OpenAI. Open source (or even free, binary-only) LLMs will eventually disrupt OpenAI's business plans.
Are there any online communities running these models on non-professional hardware? I keep running into issues with poor documentation or outdated scripts with GPT-NeoX, BLOOM, and even Stable Diffusion 2. It seems like most of the support is either for professionals with clusters of A100s, or for consumers who aren’t using code. I have 3 16GB Quadro GPUs, but getting this stuff running on them has been surprisingly difficult.
The limitation is because of the word position embedding matrix size. This isn't a config issue, or an API limitation. This is a limitation on the size of a matrix that is part of the model and is decided on before training. You can't change it.
What does that mean?
For each token in your input or in the inference output, the model needs some understanding of what the position of that word means.
So there is the word position embedding matrix that contains a vector per position. The matrix has "only" 1024 entries in it for GPT2 or 4096 for GPT3. The size of each entry varies as well, containing a vector from 768 for GPT2 small and up to 12,288 for GPT3.
So the WPE (word position embeddings) for GPT2 is (1024x768) and for GPT3 (4096x12288)
Inference requires info from this vector to be added to the word token's embedding for each token in the original prompt + each generated token.
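A minimal sketch of why that's a hard limit (GPT-2-small shapes; the weights here are just random placeholders):

    import numpy as np

    vocab, n_ctx, d_model = 50257, 1024, 768
    wte = np.random.randn(vocab, d_model)    # token embedding table
    wpe = np.random.randn(n_ctx, d_model)    # one learned row per *position*

    def embed(token_ids):
        positions = np.arange(len(token_ids))
        # every input token needs a row from wpe; there is no row for position 1024
        return wte[token_ids] + wpe[positions]

    # calling embed() on a 2000-token input raises an IndexError: the table simply
    # has no entries for positions >= 1024, and retraining is the only way to add them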
Positional embeddings are tricky - it very much depends on the specific embedding method chosen. Some advanced methods allow conserved or even slightly improved performance with context length increased beyond what was used for the main pretraining run.
As often is the case with these large models, you can change it with some finetuning on longer context samples from the same dataset, with what is really a small amount of compute invested compared to the million hours spent on training the thing.
You get this issue even without position embeddings. Attention computes an inner product between each pair of input tokens, so the cost scales as N^2 x E. Squares grow really fast.
Where did you get that GPT3 has 12,288-dimensional token embeddings? I think that's the internal or output size of the token inside the transformer layers, not in the embedding table.
It doesn't really use them, it uses something called RoPE, which is hardcoded rather than learned and is applied multiplicatively at every layer to both the query and the key.
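For the curious, a rough numpy sketch of the rotary idea (conventions vary between implementations, e.g. how dimensions are paired, so treat this as illustrative rather than LLaMA's exact code):

    import numpy as np

    def rope(x, pos, base=10000.0):
        # x: one head's query or key vector (even length d) at absolute position pos
        d = x.shape[-1]
        half = d // 2
        freqs = base ** (-np.arange(half) / half)     # per-pair rotation frequencies
        angles = pos * freqs
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:half], x[half:]                   # one common pairing convention
        # rotate each (x1_i, x2_i) pair by its angle; the q.k dot product then
        # depends only on the *relative* distance between positions
        return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

Because the rotation is a pure function of position baked into q and k at every layer, there's no learned position table to run out of, which is part of why finetuning on longer contexts is even a plausible option.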
There are already solutions to this kind of problem: use embeddings to store semantic meaning -> query the vector database with a question -> use extractive Q&A models to pull out relevant context -> use a reader model to generate answers based on that context from the document.
Just check out the Haystack tutorials. I started looking into this after being introduced to the concept by articles mentioning OpenAI embeddings and the GPT-3 API, but it can be done using open source models.
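In spirit the whole pipeline is only a few lines; here's a toy sketch, with embed() and generate() as hypothetical stand-ins for whatever embedding model and reader/LLM you pick (the Haystack tutorials wire up the same steps for you):

    import numpy as np

    # index: embed each chunk of your corpus once
    chunks = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]
    index = np.array([embed(c) for c in chunks])   # embed() is hypothetical

    def answer(question, top_k=2):
        q = embed(question)
        # cosine similarity of the question against every chunk
        sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
        context = "\n".join(chunks[i] for i in np.argsort(-sims)[:top_k])
        # hand the retrieved context to the reader/generator model
        return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")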
I used Haystack due to the readily available colab notebook[1] for their tutorials. I wanted to feed my own text corpus to it, and that was the fastest way available.
Langchain docs are helpful, and it would be even better if you published an end-to-end notebook using a popular dataset. Definitely looking forward to trying langchain as I dive deeper into this.
>Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis, robotic planning, and theorem proving. We show that LLMs generating Parsel solve more competition-level problems in the APPS dataset, resulting in pass rates that are over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. We also find that LLM-generated robotic plans using Parsel as an intermediate language are more than twice as likely to be considered accurate than directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers.
Yes. You can break the document up and index each part and then tackle it that way. It works surprisingly well. The 4096 token limit is tied to the attention window, not an API restriction.
That’s a great question - typically memory in transformers scales as O(N^2) with token count, so there must be an upper limit, but I would bet it is far more than 4096 tokens.
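To put some rough, assumed numbers on that quadratic growth (32 layers, 32 heads, fp16, 4096-token context - not any specific model's exact shape):

    n_layers, n_heads, n_tokens, bytes_per = 32, 32, 4096, 2   # assumed shapes, fp16
    score_bytes = n_layers * n_heads * n_tokens**2 * bytes_per
    print(score_bytes / 1e9, "GB")   # ~34 GB for the attention score matrices alone

Double the context and that term alone quadruples, which is why the window can't simply be cranked up after the fact.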
One thing I think will be different and that had totally escaped my radar until recently is just the enormous and diverse community that has been developing around Stable Diffusion, which I think will be less likely to form with language models.
I just recently tried out one of the most popular [0] Stable Diffusion WebUIs locally, and I'm positively surprised at how different it is to the rest of the space around ML research/computing. I consider myself to be a competent software engineer, but I still often find it pretty tricky to get e.g. HuggingFace models running and doing what I envision them to do. SpeechT5 for instance is reported to do voice transformations, but it took me a good bit of time and hair-pulling to figure out how to extract voice embeddings from .wav files. I'm sure the way to do this is obvious to most researchers, maybe to the point of feeling like this needs not a mention in the documentation, but it certainly wasn't clear to me.
The community around Stable Diffusion is much more inclusive, though. Tools go the extra effort to be easy to use, and documentation for community created models/scripts/tools is so accessible as to be perfectly usable by a non-technical user who is willing to adventure a little bit into the world of hardcore computing by following instructions. Sure, nothing is too polished and you often get the feeling that it's "an ugly thing, but an ugly thing that works", but the point is that it's incredibly accessible. People get to actually use these models to build their stories, fantasy worlds, to work, and things get progressively more impressive as the community builds upon itself (I loved the style of [1] and even effortlessly merged its style with another one in the WebUI, and ControlNet [2] is amazing and gives me ideas for integrating my photography with AI).
I think the general interest in creating images is larger than for LLMs with their current limitations (especially in current consumer-available hardware). I do wonder how much this community interest will boost the spaces in the longer run, but right now I can't help but be impressed by the difference in usability and collaborative development between image generative and other types of models.
Ooba's textgen webui runs LLaMA 4bit on 8GB VRAM consumer video cards.
KoboldAI (https://github.com/henk717/KoboldAI/), by the creator of the AI Horde of Stable Diffusion fame, also supports LLaMA and has forked the Stable Horde for textgen. Currently the horde is running half a dozen LLaMA models, accessible at http://lite.koboldai.net/
Did you know that AUTOMATIC1111 got bootstrapped off of 4chan?
Go to 4chan right now, and poke around their technology and video game boards. There's so much chatter about LLaMa. The last time I saw that much chatter about a technology was when eth was 3 dollars a coin. The communities exist, the general public just isn't aware of them.
Isn't it more the four bit quantization than the choice of C++ as an orchestrator that's the win? It's not as if in either the C++ or the Python case that high level code is actually doing the matrix multiplications.
That basically the whole AI revolution is powered by CPython of all things (not even PyPy) is the 100 megaton nuke that should end language warring forever.
That the first AGI will likely be running under a VM so inefficient that it refcounts even integers is God laughing in the face of all the people who've spent the past decades arguing that this language or that language is "faster". Amdahl was right: only inner loops matter.
That was because the current trend in AI (deep learning) mainly just requires batched matrix computations, which makes an efficient implementation fairly language-agnostic (provided that only the internal kernels are written in high-performance C/C++ or CUDA). The same thing doesn’t hold for other computationally intensive tasks like ray tracing or physics simulation, where computation cannot be expressed in huge batched matrix operations but rather millions of highly variable small-sized computations.
If mainstream AI moves beyond just deep learning and start to encompass other approaches and disciplines (for example, differentiable simulators/renderers), suddenly the language (and the compiler) will be much more important than before.
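To make the batched-matmul point concrete, here's a tiny (machine-dependent) demonstration that interpreter overhead is noise next to one large matrix multiply:

    import time
    import numpy as np

    a = np.random.rand(2048, 2048).astype(np.float32)
    b = np.random.rand(2048, 2048).astype(np.float32)

    t0 = time.perf_counter()
    c = a @ b   # ~17 GFLOPs, all spent inside optimized BLAS, not in CPython
    print(f"matmul took {time.perf_counter() - t0:.4f}s")
    # the Python-level dispatch around this call costs microseconds; the kernel
    # costs milliseconds, so the host language barely shows up in the profile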
> That basically the whole AI revolution is powered by CPython of all things (not even PyPy) is the 100 megaton nuke that should end language warring forever.
And a lot of new AI tooling such as tokenization has been developed for Python using Rust (pyo3)
My personal problem with the Stable Diffusion moment is that nothing practical came out of that moment. For image generative models, I only see them used to generate OG images for mostly uninteresting articles on the internet. I can only see the same limited personal use for language models.
> My personal problem with the Stable Diffusion moment is that nothing practical came out of that moment
Before Stable Diffusion democratized image generation, Very Serious People across the industry claimed that unless they and their friends exercised strict control over the kinds of images people generated using AI, society would collapse, the sky would fall, and it would all be over. After that, the public gained access to diffusion models for image generation, and the sky did not in fact fall. These prognosticators of doom are wearing egg on their faces.
Likewise, today, the intelligentsia is predicting DOOM! DOOM I SAY! if LLMs become available outside the close supervision (and censorship control) of San Francisco trust and safety teams.
Democratizing LLMs will further delegitimize claims that some central authority needs to control this new technology. This is great! When the printing press came out, it took a hundred years to democratize it. It'll take us ten months!
The single biggest benefit of projects like llama.cpp is annihilating claims on the part of gatekeepers that for our own safety we need to let them control how we use math.
> for our own safety we need to let them control how we use math
Not math. The magical sauce is not math, it's the raw text used to train the model. Language data has this property that it can take a raw initialised neural net and turn it into LLaMA and chatGPT.
The same language data turns babies into modern empowered adults instead of cave men or worse. We're not that smart without language. A single human can't rediscover it, it is a self replicating system evolving in parallel with us.
> Likewise, today, the intelligentsia is predicting DOOM! DOOM I SAY! if LLMs become available outside the close supervision (and censorship control) of San Francisco trust and safety teams
I wouldn't count figures such as Elon Musk or Peter Thiel as intelligentsia. They are just greedy operators for the MIC.
Yes, this is true and it is hard to ignore. However, I was talking about practical applications, not just general interest. There were several hype-generating stories about a diffusion model winning an art contest and about LLMs being used to write essays for grading, but that was it.
How do you know the artwork and adwork you're seeing today isn't generated using a diffusion model? Every professional artist I know has SD & blender in their toolkit
Has anyone been able to get llama.cpp into some sort of chat-bot style mode via a script or similar? What do you need to wrap around it to get it to act that way? Right now it will just go on forever in response to prompts.
“Chat mode” just means that 1) you structure the prompt given to the model for completion as a chat, i.e. a list of lines in the form “NAME: UTTERANCE”, and 2) you specify the user's “NAME:” part as the “end” (stop) sequence (i.e. prevent the model from carrying on in both roles).
On top of that, you may want to do fine-tuning with reinforcement learning. But the above is “all you need” to get into chat mode with an LLM (or any text generation model, really).
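A toy version of the above, with generate() as a hypothetical stand-in for whatever completion backend you wrap (llama.cpp's interactive mode with a reverse prompt amounts to roughly the same thing, as far as I can tell):

    history = "The following is a conversation between User and Assistant.\n"

    def chat(user_msg, generate):
        global history
        history += f"User: {user_msg}\nAssistant:"
        # stop as soon as the model starts writing the user's next line itself
        reply = generate(history, stop=["User:"])
        history += reply + "\n"
        return reply.strip()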
George Hotz did a stream yesterday building with Llama and got some good functionality. It's done with some few-shot pre-prompting to give it understanding of 'who it is' and how to answer questions:
People have been running large language models locally for a while now. For now the general consensus is that LLaMA is not fundamentally better than local models with similar resource requirements, and in all the comparisons it falls short of an instruction-tuned model like ChatGPT.
My argument here is that this represents a tipping point.
Prior to LLaMA + llama.cpp you could maybe run a large language model locally... if you had the right GPU rig, and if you really knew what you were doing, and were willing to put in a lot of effort to find and figure out how to run a model.
My hunch is that the ability to run on a M1/M2 MacBook is going to open this up to a lot more people.
(I'm exposing my bias here as a M2 Mac owner.)
I think the race is now on to be the first organization to release a good instruction-tuned model that can run on personal hardware.
It's not just that it's accessible; it's also significantly higher in quality than previous locally runnable causal LMs.
I suspect people saying it's not good are prompting it like ChatGPT, not realizing how much trickier a raw model is to prompt. Getting the hyperparameters for good sampling is another stumbling block. The models are very good if you do everything properly.
As someone who just got the 7B running on a base MacBook M1/8GB, I strongly agree. The rate of tool development & prompt generation should see the same increase that Stable Diffusion did a few months (weeks?) ago.
And given how early the cpp port is, there is likely plenty of performance headroom with more m1/m2-specific optimization.
I've been following LLaMA closely since release and I'm surprised to see the claim that the "general consensus" is that it isn't superior. I've seen benchmark and anecdotal evidence to the contrary. I'm not suggesting you're lying, but I am curious - can you point me to something you're reading?
But llama is the most performant model with weights available in the wild.
Personally I hope we quickly get to the stage where there's a truly open LLM that is to LLaMA what SD is to DALL-E. It sucks to have to bother with Facebook's non-open model, and give it more attention than it deserves, just because it's out there.
If Facebook had actually released it as an open model, I would have said that all the credit should go to them. But instead people are doing great open source work on top of their un-free model just because it's available, and in the popular conception they're going to get credit that they shouldn't.
Even if it's not the "precise" moment, in the scheme of things does perfect timing really matter? The fact is, this shows beautifully that significant new algorithmic capability is becoming commoditised - fast. And that's a good thing.
I wonder why we don't have external "neural processing" devices like we once had soundcards. Is anyone working on hardware implementation of transformers?
Kudos to Yann LeCun for getting his revenge for Galactica.
but those are not for sale, and not transformer-specific. There must be some optimizations that can be done in hardware and transformers are several years old now
Computation-wise, transformers are really just a bunch of matrix multiplications, nothing more to it. (Which is partially why they’re so efficient and scalable.) Also, Nvidia’s GPU architectures are moving in the TPU direction (https://www.nvidia.com/en-us/data-center/tensor-cores/).
TPUs were released in 2016, Tensor Cores in 2017. But Edge TPUs (that is, the ones that exist outside of Google's Cloud infrastructure) weren’t until 2018. So, Nvidia had their cores available for people to buy first, but not before TPUs existed.
I think the answer to this is that we don't have anything that much better than GPUs to justify dedicated hardware - but I see it as inevitable. (Edit: I'm not counting TPUs, since they're not for sale.)
My money is on something like processing-in-memory enabling inference on gigantic models with rock bottom energy consumption, and rolling out to the masses as a PCIe device. Lots of research on this direction, see https://mythic.ai for an example.
I don't understand the pursuit of "neuromorphic hardware". These are like ASICs running the most rudimentary (and wrong) simulation of human neurons, which are not known to solve any problem very well, in order to ... what exactly?
Artificial neural networks with backprop is all we need for AI
>Artificial neural networks with backprop is all we need for AI
No. Biological neural networks run many orders of magnitude more energy efficiently, and learn substantially more data efficiently and continuously.
We've got a lot of catching up to do with nature. In my opinion, backprop on today's compute architecture won't ever compete (I can elaborate if you like) - we need some fundamental changes in our approach.
We're currently running up against the Von Neumann bottleneck in ML. It's a great architecture for general purpose compute, but not suited for efficient operation of neural networks: a ton of energy is spent shuttling weight and activation values between GPU RAM and ALUs. Biological networks appear to be able to process a large amount of information per unit energy by exploiting the static physical structure of the network. A small amount of energy in sparse neural activations can transform a large volume of information stored in the structure and synapse strengths.
Here's some very rough napkin math: assume the brain has 100 billion neurons, each with 1000 synapses, and is sparsely activated at ~5% activation, and processes at 30Hz (gamma wave frequency, roughly). This means every "tick", 5 billion neurons must interact with 5 trillion other neurons. If you tried to implement this in a Von Neumann architecture, even if you only calculate for the sparsely activated neurons, and even if you quantize your weights values to 1 byte, this means processing ~150 TB/s of weight data. That's an insane amount of memory bandwidth. An A100 is 2 TB/s at 300W, but our brain only uses 20W.
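(For anyone who wants to check, the arithmetic spelled out:)

    neurons, synapses_per = 100e9, 1000
    active_frac, hz, bytes_per_weight = 0.05, 30, 1
    bandwidth = neurons * active_frac * synapses_per * bytes_per_weight * hz
    print(bandwidth / 1e12, "TB/s")   # 150.0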
My two cents is that the future will involve a hardware architecture that lets us avoid moving these weights around during inference. Whether this will mean a "mortal computer" as Hinton has recently discussed (https://www.youtube.com/watch?v=sghvwkXV3VU) or whether the weights will be loaded and fixed at init time (as with Mythic's or other neuromorphic approaches), time will tell.
Look at what happened with Bitcoin: first mined on CPUs, then GPUs, but now only dedicated ASICs that can only run specific functions are competitive.
The same thing is likely to happen here: artificial neural networks implemented directly in hardware might have extra possibilities for optimization?
The Intel Neural Compute Stick is like this. It's a USB stick intended to power computer vision and other ML tasks. It's pretty tiny though and isn't nearly as powerful as a graphics card.
There's a very interesting and easy hardware implementation of Tsetlin machines that I think has the potential to replace transformers (I am not in the field though, so I might be wrong).
It quantizes the floating point weights to 4-bit integers so that models fit into memory and run reasonably fast on a CPU by using the processor’s vector extensions.
If you want to run on GPU, use the official Python-based stack, which BTW takes about 10GB of runtime binaries on disk, and only supports nVidia GPUs because CUDA.
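The core trick behind that 4-bit quantization is block-wise scaling; here's a generic sketch of the idea (the actual ggml q4_0 format differs in block size and in how the scales are stored):

    import numpy as np

    def quantize_4bit(w, block=32):
        w = w.reshape(-1, block)                              # assumes size divisible by block
        scale = np.abs(w).max(axis=1, keepdims=True) / 7.0    # map each block into [-7, 7]
        scale[scale == 0] = 1.0                               # avoid division by zero
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale   # values fit in 4 bits (stored in int8 here for simplicity)

    def dequantize(q, scale):
        return q.astype(np.float32) * scale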
The link from GP is the CPU only one implemented in C++.
The python + GPU one can be found on the official facebook repo: https://github.com/facebookresearch/llama (Presumably GP thought this was already known to everyone so they pasted the other link)
It doesn't use CoreML so it should work on Intel machines at some speed.
If it used the GPU/ANE and was a true large language model, then it would only work on M1 systems because they have unified memory (which nothing except an A100 can match).
LLaMA's GPT-3 175B level model, LLaMA-13B, only requires 8GB of VRAM (and no RAM) to run using pre-quantized 4bit weights. So it's hardly a job for an A100.
Even the largest model, LLaMA-65B, is only 30GB and inference can be split between two graphics cards with almost no effect on performance.