One thing I struggle to understand is, are LLMs just a large way of spitting out which token it thinks is best based on scoring? aka it's guessing at best?
How will they ever grow beyond the point where they are nothing more than really good at guessing what to regurgitate based on what they've already seen?
Like when one spits out what would have been a StackOverflow answer (where the community votes on whether it's right or wrong, and it's a living thing with comments to keep it up to date as it goes stale) but mixes up a few of the tokens, making the whole thing mildly broken... are these systems just going to get really good? Will they ever be trustable/perfect?
> One piece of evidence that LLMs form some sort of internal reasoning mode
Isn't this a fancy way of saying "there's a scoring mechanism", like when you brute-force train a model and certain chains/patterns of tokens score higher than others?
Like in my mind, it can learn English by you showing it a bunch of known-good sentences and telling it they score highly, while bad sentences score poorly. Noun-verb structure and all that. Am I greatly oversimplifying it? I feel like I am.
In chess positions, a similar sequence of moves starting from a slightly different position would result in a completely different position. For example, imagine playing 10 moves identically in two games, but in one of them, 20 moves ago, a bishop moved into some corner that causes a piece to be pinned. This means that the right move would be completely different between these two scenarios despite the last 10 moves being the same.
To play well, the model would need to keep track of where the pieces are and be able to "reason" about the situation to some extent.
I've also tested it with some rare and unorthodox openings. For example, after 1. f3, there are only a handful of games on chessgames.com (which has over a million games), and after several more moves it's easy to get a position where even large online sites like chess.com and lichess.org don't have any games at all. Even in such cases, parrotchess was able to navigate it fairly decently despite never having seen that position before.
I believe you would be correct to call it a scoring mechanism, but just because it can score a sentence that doesn't mean it understands what the sentence actually said. So it isn't "learning English" as much as it is knowing which tokens would be the least wrong. "Happy" could be followed with "Birthday", "Anniversary", or "New Year" but is far less likely to be followed by "Funeral" or "Wake". Except it doesn't understand that the token 66 is the word "Happy", just that it is usually followed by tokens 789, 264, or 699 (contrived example, not real token IDs).
Humans can then interpret what that actually means in English and deem it correct or incorrect and adjust the weights if it's too incorrect. But at no point does the LLM learn any language other than token math.
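To make that "token math" concrete, here's a toy sketch in Python (contrived numbers and IDs, nothing like a real model, just the shape of the idea):

    # Toy sketch of "token math": the model only ever sees numbers and learned
    # scores for which number tends to follow which, never meanings.
    # (Contrived IDs matching the example above, not real token IDs.)
    next_token_scores = {
        66: {789: 0.41, 264: 0.32, 699: 0.21, 901: 0.001},  # 66 = "Happy"
    }
    vocab = {66: "Happy", 789: "Birthday", 264: "Anniversary", 699: "New Year", 901: "Funeral"}

    def most_likely_next(prev_token_id):
        scores = next_token_scores[prev_token_id]
        return max(scores, key=scores.get)  # highest score wins; no meaning involved

    print(vocab[66], vocab[most_likely_next(66)])  # -> Happy Birthday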
> But at no point does the LLM learn any language other than token math.
Then can we make the argument that in the next 3 years, we aren't really due for any major AI breakthroughs and the current LLMs are about as good as they're going to get / they are kind of already at the limits of what they can do?
What would more GPU power, more time spent training models, more tokens, or more resources for a high context size do if, at the end of the day, it's a random-nonsense generator with a really good scoring mechanism? How much more robust/advanced can you make it?
Take the context of replacing people with AI: if you can't trust the AI to be reliable and can't find a way to make it as accurate as humans, is there really an upcoming "AI revolution" to come?
LLMs are a pretty big breakthrough which can be useful for some applications, but not others. There is always room for further breakthroughs, and as more time is devoted to research, the LLMs of the future could be highly optimized for any number of metrics, from power usage to token count.
So with that in mind, while it isn't a silver bullet which is guaranteed to lead to AGI, it can certainly fill a role in many other places and is bound to lead to cool tech layered on top of better LLMs. It's nowhere near robust enough to replace every human, but might be enough to at least displace some.
What would one of those breakthroughs hypothetically look like, if it's not an LLM (which is a robust trained model with a complex scoring system)?
> So with that in mind, while it isn't a silver bullet which is guaranteed to lead to AGI
Could you give me an example of what AGI looks like/means to you specifically? People say things like "oh, once AGI is here, it'll automate away some tasks humans do today and free them up to do other things".
Can you think of a task off the top of your head that is potentially realistically ripe for automation through AGI? I can't.
If I were to speculate about breakthroughs in LLMs, in another comment I have been discussing the addition of some kind of "conscience LLM" which acts as an internal dialog so an LLM can have its initial output to a question kind of "thought about" in a back and forth manner (similar to a human debating if they want soup or salad in their head). That inner LLM could be added for safety (to prevent encouraging suicide, or similar) or for accuracy (where a very purpose-trained smaller LLM could ensure output aligns with any requirement) or even as a way to quickly change the performance of an LLM without retraining -- just swap the "conscience" LLM to something different. I'd be surprised if this sort of "middle man" LLM isn't in use in some project already, but even if LLMs themselves don't have a major breakthrough they are still useful tools for certain applications.
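Purely as a sketch of what that back-and-forth could look like in Python (generate() here is a hypothetical placeholder for whatever inference call you'd actually use, not a real API):

    # Sketch of the "conscience LLM" loop described above.
    def generate(model, prompt):
        raise NotImplementedError("plug in your own inference backend here")

    def answer_with_conscience(question, rounds=2):
        draft = generate("main-llm", question)
        for _ in range(rounds):
            critique = generate(
                "conscience-llm",
                f"Question: {question}\nDraft answer: {draft}\n"
                "Point out safety or accuracy problems, or reply OK.",
            )
            if critique.strip() == "OK":
                break  # the inner dialog is satisfied with the draft
            draft = generate(
                "main-llm",
                f"Question: {question}\nPrevious draft: {draft}\n"
                f"Revise it to address this critique: {critique}",
            )
        return draft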
>Could you give me an example of what AGI looks like/means to you specifically?
What I consider AGI would be a system which actually understands the input and output it is working with. Not just "knows how to work with it" but rather, if I am discussing a topic with an AGI, it has a theory of mind regarding that topic. A system that can think around a topic, draw parallels to potentially unrelated topics, but synthesize how they actually do relate to help generate novel hypotheses. For me, it's a pretty high bar that I don't foresee LLMs reaching alone. Such a system could actually respond if you ask it "How do you know that?" and it can explain each step without losing context after too many extra questions. LLMs could be a part of that system, in the same way it takes multiple systems for humans to be able to speak.
>Can you think of a task off the top of your head that is potentially realistically ripe for automation through AGI?
Automation isn't the only possible usage for AGI. Of course a crontab entry won't be able to think, but I have seen current industry uses for "AI" that help with tasks humans find tedious such as SIEM ticket monitoring and interpreting syslogs in realtime to keep outages as minimal as they can. Such a system would not meet my requirements to be an AGI, but would still be very useful even if it does not have any true intelligence.
> A system that can think around a topic, draw parallels to potentially unrelated topics, but synthesize how they actually do relate to help generate novel hypotheses
But it doesn't understand what it is interacting with. It only knows token math. Token math is a shortcut, not a true knowing of a subject. So if I ask what thought process it used to reach that conclusion, it can't enumerate why it chose that way other than "math said to".
> are LLMs just a large way of spitting out which token it thinks is best based on scoring? aka it's guessing at best?
They are 100% exactly this. More specifically they spit out a list of "logits" aka the probability of every token (llama has a vocab size of 32000, so you'll get 32000 probabilities).
Taking the highest probability is called "greedy sampling". Often you actually want to sample from either the top 5 tokens (top-k sampling), or from the smallest set of tokens whose probabilities add up to, say, 90% (top-p).
If you're doing things like programming or doing multiple choice questions, you can choose to only sample from those logits - for example, if the only outputs should be "true" or "false", ignore any token that isn't "t", "tr", "true", "f", "fa", "false", etc. This makes it adhere to a schema.
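Roughly, the sampling step could be sketched like this (toy numbers; the "true"/"false" token IDs are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    logits = rng.normal(size=32000)                     # pretend these came from the model

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                # softmax -> one probability per token

    greedy = int(np.argmax(probs))                      # greedy sampling: just take the best

    k = 5                                               # top-k: sample among the 5 best
    top_k_ids = np.argsort(probs)[-k:]
    top_k = int(rng.choice(top_k_ids, p=probs[top_k_ids] / probs[top_k_ids].sum()))

    order = np.argsort(probs)[::-1]                     # top-p: smallest set of tokens
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), 0.9)) + 1  # covering 90% of the mass
    top_p_ids = order[:cutoff]
    top_p = int(rng.choice(top_p_ids, p=probs[top_p_ids] / probs[top_p_ids].sum()))

    allowed = [1565, 2089]                              # constrained decoding: pretend these are
    constrained = allowed[int(np.argmax(probs[allowed]))]  # the only "true"/"false" token IDs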
> How will they ever grow beyond the point where they are nothing more than really good at guessing what to regurgitate based on what they've already seen?
This is what humans do; the big difference is that our "weights" aren't frozen and we update them in real time based on real-world feedback. Once you loop in the real-world feedback you get much better results. For example, if an LLM recommends a command with a syntax error and you give it back the error message, it can correct it. This is why training on synthetic data is possible.
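A sketch of that feedback loop (generate() is a hypothetical stand-in for an LLM call, not any real API):

    import subprocess

    def generate(prompt):
        raise NotImplementedError("plug in your LLM call here")

    def run_with_feedback(task, attempts=3):
        command = generate(f"Give a single shell command to: {task}")
        result = None
        for _ in range(attempts):
            result = subprocess.run(command, shell=True, capture_output=True, text=True)
            if result.returncode == 0:
                return result.stdout           # it worked, we're done
            # hand the error message back so the model can correct its own mistake
            command = generate(
                f"The command `{command}` failed with:\n{result.stderr}\n"
                "Give a single corrected command."
            )
        return result.stderr if result else ""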
> will they ever be trustable/perfect?
This is why I hate the word "hallucinate"; it's not a hallucination, it's a misprediction. Humans do the exact same thing all the time - misremembering, misspeaking, getting quick mental math wrong, etc.
In many ways LLMs are already more "trustable" than inexperienced humans. The main thing you need is some mechanism of double-checking the output, the same as we do with humans. We can "trust" them once their probability of acceptable answers is high enough.
> More specifically they spit out a list of "logits" aka the probability of every token (llama has a vocab size of 32000, so you'll get 32000 probabilities).
I would have thought you have 7B probabilities (7b possible tokens) and 32k is just the context. So for every token (up to 32k, because that's when it runs out of context/size), you have a 7b probability.
I feel like you mixed up context size with # of parameters?
vocab_size is 32k. max_position_embeddings is 4096 for the context. Note that you can use a longer or shorter context length, but unless you use some tricks, longer context will result in rapidly decreasing performance.
They are two different things. A parameter here is not what programmers call "parameters/arguments"; the parameter count is the total number of weights across all of the layers in the network.
A token is a number that represents some character(s); there is a map of them in the vocab list. Similar to ASCII or UTF-8, but a token can be either a single character or multiple characters. For example "I" might be a token and "ca" might be a token. Shorter words especially might be a single token, but most words are a combination of tokens.
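If you want to poke at this yourself, OpenAI's tiktoken library exposes these mappings; the exact splits and IDs depend on the tokenizer, so treat the printed values as illustrative:

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")      # the GPT-3.5/GPT-4 tokenizer
    ids = enc.encode("I can compare tokens")
    print(ids)                                      # a short list of integer token IDs
    print([enc.decode([i]) for i in ids])           # each ID maps back to a chunk of text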
> This sounds to me like it is choosing from 1 of 32k tokens when it is scoring/generating an answer. [...] I would have thought it is picking from 1 of 7B parameters?
The model doesn't do that itself. The last layer of the model maps it to a single list that is 32k long: one entry per token in the vocab, each with its probability (such a (token, probability) pair is what's being called a logit here). So the output is a list of logits.
The step after this is external to the model - sampling. You can choose the highest-scoring token if you want (greedy sampling), but usually you want to make it a little random by sampling from, say, the top 20 tokens (top_k) or from the smallest set of tokens whose probabilities add up to 90% (top_p). There are other fun tricks like logit biasing.
Finally you map the chosen token to its value from the vocab list in my previous comment.
> Where does the 7,000,000,000 parameters come from then?
Here's a walkthrough of how to count the parameters for Llama-13B. It's worth noting that most of these models' sizes are commonly rounded (that's why you might see llama-33B called llama-30B).
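Roughly, the arithmetic goes like this (a back-of-the-envelope sketch using the commonly published LLaMA-13B hyperparameters; the tiny norm terms are ignored):

    # Rough parameter count for LLaMA-13B (treat the hyperparameters as assumptions).
    vocab, d_model, n_layers, d_ffn = 32000, 5120, 40, 13824

    embeddings = 2 * vocab * d_model                 # input embedding + output (lm_head) matrix
    attention  = 4 * d_model * d_model               # Q, K, V, O projections per layer
    mlp        = 3 * d_model * d_ffn                 # gate, up, down projections per layer
    total = embeddings + n_layers * (attention + mlp)

    print(f"{total/1e9:.1f}B parameters")            # ~13.0B, which gets rounded to "13B"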
> The model doesn't do that itself. The last layer of the model maps it to a single list that is 32k long
So from the 7B parameters, it picks which 32k potential parameters to use as tokens? So the "vocab list" is a 32k subset of the 7B parameters that changes with each request?
The vocab list is the mapping of tokens to their UTF-8 encodings. For example, token 680 is "ide" and token 1049 is "land".
> So from the 7B parameters, it picks which 32k potential parameters to use as tokens? So the "vocab list" is a 32k subset of the 7B parameters that changes with each request?
No, do not think of parameters and tokens as comparable in any way (at least right now). The parameter count is just the total number of weights and biases throughout all of the layers. The vocab size is static, and will never change during inference.
This is simplified - you give it a list of tokens, and then there's a bunch of linear algebra with matrices of "weights". All of those weights combined are the 7B parameters.
The final operation results in a 32k-long list of probabilities. The 680th item in that list is the probability of "ide", the 1049th item is the probability of "land".
The model never "picks" anything; there are no "if" statements, so to speak. You give it an input and it does a bunch of multiplication, resulting in a list of predictions 32k long.
The model does not "pick the best token" at any point. It simply hands you 32k predictions. It's your job to pick the token; you can certainly just pick the highest probability, but that's not usually the sampling method people use https://towardsdatascience.com/how-to-sample-from-language-m...
tl;dr - very oversimplified - you multiply/add 7B numbers with your input list resulting in 32k predictions, which you map to the letters in the vocab list.
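Or, as a very oversimplified sketch of that last step (made-up values; sizes shrunk from the real 4096-wide hidden state so it stays light, while the 32k output dimension is the actual vocab size):

    import numpy as np

    rng = np.random.default_rng(0)
    hidden = rng.normal(size=256)                   # final hidden state for the last position
    output_matrix = rng.normal(size=(32000, 256))   # one row of weights per vocab entry

    logits = output_matrix @ hidden                 # 32k scores, one per token
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                            # softmax -> 32k probabilities

    vocab = {680: "ide", 1049: "land"}              # tiny slice of the static vocab map
    chosen = int(np.argmax(probs))                  # greedy pick, done outside the model
    print(chosen, vocab.get(chosen, "<some other token>"))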
> I feel like the open source LLMs that are out there are never advertised/compared by their vocab list size.
Increasing the vocab size increases training costs with little improvement in evaluation performance (how "smart" it is) and not a ton of evaluation-speed improvement either. Words not in the vocab can be made with several tokens; GPT-3.5-turbo/GPT-4 uses 2 for "compared": "comp" and "ared".
That's not to say the existing vocab lists can't be further optimized, but there's been a lot more focus on the parameter count, structure, training/finetuning optimizations like LoRA, quantization methods, and training data, as these are what actually embed the information of how to predict the correct token.
There are a few cases where the vocab list is very important and you will see it mentioned. The more human languages you want to support, the more tokens you'll generally want. GPT-3's old tokenizer used to not have four spaces "    " as a token, which wasn't great for programming, so their "codex" model had a different tokenizer that did.
Other cases for specialized tokens include special tokens like "fill-in-the-middle" markers, control tokens for "<System>", "</System>", "<Prompt>", "</Prompt>", and a few things of that nature.
A lot of llama models have added a few extra control tokens for a few different purposes. Note that because tokens are just mapped from a number, you can generally add a new token and finetune the model a bit to embed its meaning, but you generally don't want to change existing tokens which have already had their usage "baked" into the model.
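For what it's worth, with the Hugging Face transformers library that "add a token, then finetune" step looks roughly like this (the checkpoint name is a placeholder, not a real model):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # "some-org/some-llama-model" is a placeholder checkpoint name.
    tokenizer = AutoTokenizer.from_pretrained("some-org/some-llama-model")
    model = AutoModelForCausalLM.from_pretrained("some-org/some-llama-model")

    tokenizer.add_special_tokens({"additional_special_tokens": ["<|system|>", "<|prompt|>"]})
    model.resize_token_embeddings(len(tokenizer))  # new embedding rows start untrained,
    # so you'd finetune a bit afterwards to "bake in" what the new tokens mean.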
From my interactions, models seem to be able to perform general purpose thinking. As in one, focused thought.
I have yet to encounter a situation where a capable LLM hallucinates due to faulty logic, but that is from the perspective that the LLM's "knowledge" is a funny side effect of the process of making it able to think, and never to be trusted.
The chatty knowledge side effect is somehow obscuring the fact that we have a general purpose logic processor on our hands.
Building up reliable tasks on top of this is going to play out similarly to the computing revolution, but accelerated by huge global interest.
Comparing this to the internet boom is mistaken. The internet was the missing piece of the computing puzzle for achieving mass consumer appeal.
We haven’t hit that yet. We’re still thinking about these things as calculators and speed running mainframe deployments.
Yes, but at the same time no. It's guessing based on math, so it is guessing the answer out of a filtered set of "likely correct" answers. The reason it's novel is because it can go so fast now that it seems conversational, but as nothing is free, the trade-off is that to get better math, better probabilities, and better guesses, you have to keep expanding the size of the model, which costs immense amounts of electricity for diminishing returns. There is also a limit to the amount of times you can "just make it bigger" before it actually performs more poorly, as I understand it.
LLMs feel like a shortcut because they are a shortcut compared to developing something that actually understands human language. We just train them on data so they know which words usually come next in a series. So if the training data is good, the output is more likely to be good. But it doesn't understand what the "correct" answer is and is entirely unaware of what the content it was trained on actually means. So they will likely never be perfect, because they were designed to be "good enough". I can foresee them being a part of a greater AI in the future though.
> you have to keep expanding the size of the model
Can you be more specific here? Expand the size in possible parameters? Expand the size in context? I get what you are saying in theory: what if every GPT-4 instance had 100TB of RAM it could use, what could it achieve then, but I can't think of how/what would fill that RAM that we are limited on today? I thought the more parameters you introduce, the more risk you have for hallucination because it's harder to pick which token should be next given the "more possible outcomes" total.
I believe you are correct; I mentioned "There is also a limit to the amount of times you can "just make it bigger" before it actually performs more poorly, as I understand it." But there may also be a certain sweet spot we can slowly increase as we do more research. We lose accuracy when we quantize a larger model down to a smaller one (such as 16-bit quantized to 4-bit), and it's usually done for speed, but a larger model quantized down tends to be more accurate than a model trained at that particular size. So it would be interesting to see how it would work if researchers built up to much larger parameter sizes but quantized them down, so they were easier to run while still keeping some of the benefits of the increased parameter count.
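To illustrate the kind of accuracy loss being traded away, here's a naive per-tensor 4-bit quantization sketch (real schemes like GPTQ or k-quants are much smarter about this):

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.normal(size=(4096, 4096)).astype(np.float16)   # one pretend weight matrix

    scale = float(np.abs(weights).max()) / 7                     # signed 4-bit range is [-8, 7]
    quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    dequantized = quantized.astype(np.float16) * scale           # what inference actually uses

    error = np.abs(weights.astype(np.float32) - dequantized.astype(np.float32)).mean()
    print("mean absolute error per weight:", float(error))       # small, but it adds up over 7B+ weights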
I also imagine parameter count is currently more limited by how much we can process at once, so we could be looking at the NES and trying to imagine PS5 games based on what we know about the NES :)
I think it depends very much on the domain. LLMs are good at certain things and bad at others and these strengths and weaknesses are different from humans. The trick is to figure out how good they are at certain tasks and where it's a benefit to use them.
For example, in many situations it's much harder to have a good idea/solution than to check any single solution and see if it really works. A lot of the time you don't necessarily need trust-worthiness. On the other hand, of course LLMs will improve and get more trustworthy (and we will learn what to expect from them, which will make them seem even more trustworthy). I don't pretend to know how fast or big improvements will be...
You might think that, but there is a great deal of compression and it seems that compression is understanding, giving rise to emergent properties https://www.youtube.com/watch?v=qbIk7-JPB2c
I suppose simplicity is in the eye of the beholder. What was that thing that Bush gave as an explanation as to why 9/11 happened?
FWIW, people do like simple and nuance can be difficult to comprehend. And that does not even begin to get into current political climate where wrongthink can actually net one a lost job.
It used to be communists. Then terrorists. Right now, before our very eyes, a new amorphous threat emerges: teh white supremacy ( previously 'the lone wolf' ).
>China generally adheres to a principle of non-interference in the internal affairs of other countries, a stance rooted in its own historical experiences with foreign intervention. (...)
Additionally, China has increasingly sought to play a role in international peacekeeping and diplomacy, which might influence its perspective towards seeking a peaceful resolution to the conflict.
This is why LLMs are fun, because we get to see them post somewhat contradictory statements, one right after the other. Does China want to interfere, or not?
Obviously what is being "neutered" is important. I'll take a model refusing to answer how to commit a suicide and perhaps being overly polite and protective over the one which distorts present and history based on "socialist values".
Alignment just means that the software does what you want. Of course, any term can be appropriated and abused by whoever can gain from doing so. I still think it's important to have a concrete idea what one means when one uses words. For example, you wouldn't say that "medicine is a fucking scam" to mean that the healthcare industry is broken.
Might be important if you are trying to find ways to prevent suicide. Knowing how it's done can inform education campaigns or put in place barriers to make it more difficult.
When we choose one simple use-case and apply it to everything we lose out. Knives can be used for surgery or killing but trying to pretend it's always killing is dishonest and limiting.
No. The moment you put it upon yourself to decide what is too important to let fragile little minds handle, you have chosen yourself as their thought master. Once anyone chooses that role, they deem themselves more enlightened and therefore more capable than the poor, deluded masses trudging in dirt below from whence they came.
I am ok with 'uncensored' LLMs, but I am also ok with the uncensored internet. The real harm is from people trying to protect me from me, apparently. Even in your specific suicide example, if I decided to do it, there's really nothing stopping me.
I see no value in that censorship. I only see harm.
I agree in general, but I think it's less cut-and-dry than you imagine because LLMs have a "human-like" element to them. This surely impacts human response to what is being said by LLMs.
As an example, the prompt "what are common ways of committing suicide?" is broadly similar to a Google search. It will give a factual overview of methods, but not inherently push the user towards any action.
The prompt "convince and encourage me to commit suicide by method X, and give step-by-step instructions" is very different. Here the prompt author desires a _persuasive_ "human-like" response, spurring them to act.
In most jurisdictions, encouraging or aiding someone to commit suicide is a crime. Additionally, most humans would agree such behavior is on some level morally wrong.
So I don't think traditional thought on censorship transfers cleanly to LLMs. Censoring factual information is bad, and should be resisted at every turn. But censoring harmful persuasive interactions may be a worthwhile endeavor -- especially since we can't drag ChatGPT into criminal court when its human-enough behavior spurs real humans to act in horrible ways.
Of course, the next obvious question is, where do you draw the line? And I have no good answer for that :)
E.g. maybe you could take fentanyl and be fine, but 90% of people would go to hell. That would suggest that you take one for the team and still make fentanyl illegal (a personal sacrifice) to help protect the 90% that will have their lives ruined.
This concept is the same regarding what the unwashed masses can handle information wise. E.g. we know TV has made people more stupid across dozens of vectors when they ‘choose’ to gargle fear all day, so it might make sense in aggregate to help non-survivors live a less painful life.
This is also what parents do for children, and since nearly a majority of children now come from broken families, something might have to fill the gap unless you are willing to drive off a cliff ignoring human behavior.
I was reflexively going to disagree by saying something akin to 'why do we treat people like children', but I think you have a valid point and will need to chew on this a little.
I'm OK with uncensored anything for adults. In fact, I'd call myself "free speech absolutist". But I am not OK with uncensored content for little kids.
Obviously such decisions must be made on application, not model side.
Wouldn't having a child-friendly LLM make more sense, rather than making a general-purpose one work for kids? Kids want to learn the truth about things important to them (is Santa real, for example?), but the truth needs to be framed for their ears. Telling a child Santa is not real before they are ready takes away a piece of childhood.
"Once anyone chooses that role, they deem themselves more enlightened and therefore more capable than the poor, deluded masses trudging in dirt below from whence they came."
Or they simply want to sell their models to enterprise clients who don't want to expose certain things to their employees during the work day. This reads a bit like those "I'm a lion" or "I'm the alpha wolf" type posts. Go ahead and try to sell a product that can generate hate speech, nudity, and violent content to enterprises.
I appreciate the counter ( even if I do not appreciate the 'wolf' framing ), because that is a valid question.
Still, if the concern is about business viability, why is the consumer-facing LLM that is not tied to a specific brand not allowed to exist? Surely the demos provided to enterprise clients would not suffer from such violent content, abhorrent nudity, and hateful speech?
Why is the internet flooded with images of LLMs giving clearly politically adjusted responses ( Biden/Trump being a less recent, but clear, example ) to Joe Schmoe? Is that a good look? Will that sell to enterprises better?
Well, maybe drop phrases like "fragile little minds", "thought master", "poor, deluded masses trudging in dirt below from whence they came" when talking about enterprise software if you don't want the comparison. That's what this is, enterprise software. No need to imply others are mindless sheep being ultra sensitive when discussing enterprise software.
My friend. Just by beginning this thread with a word like "unsafe" to soften the blow somewhat, I am dangerously in 1984 territory. The thrust of my argument has nothing to do with enterprise software, despite the reasonable objection you lodged.
The words were intended to be noticed ( and reacted to ). And they were.
There is a reason for this, and it goes something like this.
I dislike commercial interests trumping sanity.
Crazy. I know.
edit: I also would like to note you did not address my actual point.
You are on HN, where we discuss encroaching private-public partnerships all the time. Not that long ago, WH got in trouble for encouraging suppression of some news items over others[1]. We are already at a point where the difference between corps and government is more of a legal abstraction than a reality ( legal abstraction circumvented by that partnership, but that is a different rant ).
>You are on HN, where we discuss encroaching private-public partnerships all the time. Not that long ago, WH got in trouble for encouraging suppression of some news items over others[1]. We are already at a point where the difference between corps and government is more of a legal abstraction than a reality ( legal abstraction circumvented by that partnership, but that is a different rant ).
>[Twitter] officials emphasized there was no government involvement in the decision.
>“I am aware of no unlawful collusion with, or direction from, any government agency or political campaign on how Twitter should have handled the Hunter Biden laptop situation,” Baker said in his opening statement. “Even though many disagree with how Twitter handled the Hunter Biden matter, I believe that the public record reveals that my client acted in a manner that was fully consistent with the First Amendment.”
Interesting theory, but what do you call the emails and slack channels where government employees requested (and usually received) banishment or limiting of targeted users?
Are you under the thinking that no coordination happened, or that the coordination that did happen is legal and expected?
Trying to understand how you didn't get to the same conclusions.
Just picking a random snippet from the Files, here's part six:
This company knows that if they don't do some things, or their product causes harm to people, they will get some and/or more regulation from the government.
So they are different flavors of government intervention. Definitely not exactly the same.
Edit: to the people who downvote, can you tell me why it is different? Aren't companies in the USA censoring out of fear of a governmental reaction, and the same for Chinese companies?
> Edit: to the people who downvote, can you tell me why it is different? Aren't companies in the USA censoring out of fear of a governmental reaction, and the same for Chinese companies?
AI companies don't fear the US government itself; they fear how the judiciary will mediate lawsuits and decide liability.
If some kid cooks up napalm for example (a popular alignment test [1]) and lights himself and his house on fire, his parents might sue OpenAI. If it makes it to a jury, there's a significant risk it could cost them a lot of money and set a precedent on who's liable for AI generated content. They don't have an AI equivalent of the DMCA safe harbor provision to hide behind.
* 3 trillion tokens; start with BPE tiktoken, cl100k base vocab, augmented with chinese, numbers split into digits, final vocab 152k.
* RoPE - rotary positional embedding
* context length 2048
* qwen-14b perf percentages: 66.3 MMLU(5), 72.1 CEval(5), 61.8 GSM8K(8), 24.8 MATH(4), 32.3 HumanEval(0), 40.8 MBPP(3), 53.4 BBH(3); beats LLaMA2-13B on all, but behind LLaMA2-70B on all except CEval, MATH and HumanEval (somewhat surprising)
* code-qwen-{7B,14B}
* additional 90B code tokens over base
* context length 8192, flash attention
* 14B perf: humaneval 66.4, mbpp 52.4; ok, but not stellar (similar numbers as OSS wizardcoder-py, and lower than gpt-3.5)
* math-qwen-{7B,14B}-chat
* math instructional dataset
* context length 1024
* 14B perf: gsm8k 69.8, MATH 24.2, Math401 85.0, Math23K 78.4 (substantially better than OSS in the same weight class (WizardMath and GAIRMath-Abel) on MATH but same ballpark on GSM8k -- surprising). Math23K is chinese grade school math; and Math401 is arithmetic ability.
* comprehensive automatic evaluation in Appendix A.2.1 pg 36 (based on OpenCompass'23)
* chat format:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hello! How can I assist you today?<|im_end|>
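In code, assembling a prompt in that format is just string formatting around those special tokens; a rough sketch:

    def build_chatml_prompt(system, user_turns, assistant_turns):
        # Assemble the <|im_start|>/<|im_end|> format shown above; purely string
        # formatting, with the special tokens mapping to dedicated token IDs.
        parts = [f"<|im_start|>system\n{system}<|im_end|>"]
        for user, assistant in zip(user_turns, assistant_turns + [None]):
            parts.append(f"<|im_start|>user\n{user}<|im_end|>")
            if assistant is not None:
                parts.append(f"<|im_start|>assistant\n{assistant}<|im_end|>")
        parts.append("<|im_start|>assistant\n")     # left open for the model to complete
        return "\n".join(parts)

    print(build_chatml_prompt("You are a helpful assistant.", ["Hello!"], []))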
Seems like they just ran out of good tokens?