ChatGPT Explained: A normie's guide to how it works (jonstokes.com)
465 points by hui-zheng on March 7, 2023 | hide | past | favorite | 140 comments



I like the token window bit. I don't really like the probability bit, because it kinda implies that OpenAI just built a huge probability map of all N-grams (N=8000) and called it a day.

Which incidentally would also imply that a lot of N-grams just don't exist in the training data, causing the model to completely halt when someone says something unexpected.

But that's not the case - instead we convert words into a much fuzzier float vector space - and then we train the network to predict the next fuzzy float vector. Since the space is so vast, to do this it must learn the ability to generalize, that is, to extrapolate or interpolate predictions even in situations where no examples exist. For this purpose, it has quite a few layers of interconnects with billions of weights where it sums and multiplies numbers from the initial vectors, and during training it tweaks those numbers in the general direction of making the error of its last predicted word vector smaller.

And since the N-gram length is so long, the data so large, and the number of internal weights is so big, it has the ability to generalize (extrapolate) very complex things.

So this "probability of next word" thing has some misleading implications WRT what the limits of these models are.


“The hard and expensive part of the above one-sentence explanation — the part that we’ve only recently hit on how to do using massive amounts of electricity and leading-edge computer chips — is hidden deep inside the word /related/.”

The whole point of mentioning relatedness is to show readers that we’re not dealing with finite n-grams.


The author knows all that, he’s just trying to simplify to give the layman some intuition about what’s going on.


So basically, if it sees a pattern that hasn't been encountered before, it will just try to connect "something" together, which is why it does things like predicting today's date as being in the future, etc.?

The model says, "I don't have a good enough path forwards here, I'll just make one up given the next best thing I have and serve it back"?

Maybe this is why Bing works differently - they've changed the model or the workings to just say "I don't know" when there isn't enough confidence in what it's generated based off what it finds in its database?


Not just connect something.

For example, at a certain size, GPT models start to "learn" how to do basic arithmetic (addition), even for numbers with multiple digits they've never encountered before.

It might look like a small thing, what with computers being quite able to do arithmetic at the base level. But this is a language model, so it's a bit different. It learns how to add numbers without "carrying the 1" first, then at a certain larger size it also learns to carry the 1, then when even larger it learns to do that across multiple digits... So it's not just blindly guessing, it's learning the rules of the game (and in some cases some quite complex rules) by building a model of the world made of words.

And the model of digits and addition is just one small bit most likely, as the training space doesn't contain much of that - writing about adding numbers is pretty boring after all. The full model must encode rules and knowledge about a variety of complex things to be able to make reliable predictions. It probably also contains true generalizations that humanity hasn't thought of before, as well as specializations of those generalizations that could be immensely useful.
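For the curious, the "rule of the game" it has to internalize here is just schoolbook addition. A minimal sketch of the digit-by-digit procedure with carrying - roughly what the model has to rediscover purely from text, without ever being shown the algorithm:

  def add_by_digits(a: str, b: str) -> str:
      """Schoolbook addition over decimal strings: rightmost digits first, carry the 1."""
      a, b = a.zfill(len(b)), b.zfill(len(a))  # pad to equal length
      carry, out = 0, []
      for da, db in zip(reversed(a), reversed(b)):
          total = int(da) + int(db) + carry
          out.append(str(total % 10))   # digit to write down
          carry = total // 10           # "carry the 1"
      if carry:
          out.append("1")
      return "".join(reversed(out))

  print(add_by_digits("478", "365"))  # 843 - the carries propagate across digits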


Hey! That's a very interesting explanation, could you please provide any references for further/detailed reading on model's abilities to learn to add numbers, with carry and finally with carry across digits?


There is some commentary in the GPT-3 paper https://arxiv.org/pdf/2005.14165.pdf (figure 3.10 and table 3.10)

note: I may have extrapolated a bit more than strictly correct from those two bits, but the accuracy being about 50% indicates carrying is the issue. GPT-3 is at around carrying twice, but hasn't generalized full carrying yet.


The only issue with the math example would be that we don't do complex math in English. Of course it can learn other "languages" but as you said there is less "written" complex math

> It probably also contains true generalizations that humanity hasn't thought of before, as well as specializations of those generalizations that could be immensely useful.

I thought this was quite exciting. I actually think this will become a field of research in itself, and that it might spur completely new industries and fields of technology, research, etc.

I'm being optimistic here but I hope we're on the cusp of a greater future, rather than a dystopian one brought on by terminators and AI that replaces all work :)


Borges, of course, kind of skewered your hopes about 80 years ago.


Bit of context might be nice.


I assume they're referring to Jorge Luis Borges's short story "The Library of Babel," which imagines an enormous library that has a book for every possible sequence of letters and punctuation, all of a certain length (a few hundred pages). The library therefore contains all useful books on any topic in any language (as long as it's covered by the alphabet), but also all useless or inaccurate ones, and of course a vast sea of gibberish. All the knowledge you could want is there, yet unattainable.


Thank you :)


I wonder if someone could make some standardized way to describe/write mathematical proofs - in a very formal language and then train a model that will try to find more proofs for open questions.


What if it solves the P versus NP problem?


We go home to rest.... no more working for us....


Worth pointing out that it's not 100% accurate at arithmetic; that becomes quite clear when you ask it to calculate with large numbers. Its result is usually close, but not perfect.

So it hasn't learned the exact rules for arithmetic, it's learned rules for approximating arithmetic to a decent level of accuracy. Similar to how humans can know the approximate result of an equation before doing the actual math (though GPT is way more precise than humans)


They need to add support for co-brains - functions that are available for the weightware brain to use. I.e. if it could and knew how to use Wolfram Alpha, it'd boost performance dramatically.


That's called Toolformer (Facebook AI Research): https://arxiv.org/abs/2302.04761 Toolformer: Language Models Can Teach Themselves to Use Tools


That's exactly how Bing AI can do search queries and then produce output based on the result.


Okay, but that's not "learning" how to do basic math, in the same way that I can't "learn" Japanese just by mimicking the mouth noises. Yeah, I'll get some of the pronunciation right sometimes, maybe even get it in the right order, but only for the listener. To me, it's still just mouth noises.


If you were given the task "mimic the sound of Japanese as best you can", at first, you would just learn the basic phonology and just try and mimic the general sounds of the language. You would get good at that, and eventually you would be perfectly mimicking the pronunciation of Japanese phonemes. At some point, it would become worth your time to actually learn how different sounds are put together, e.g. Japanese can't have an S sound followed by a T sound. After that, you may start to learn how the different syllables interact. And at some point, you would learn how words are put together to form sentences.

At a certain point, there really is no difference between "being really good at mimicking the sound of Japanese" and actually knowing it, because in order to mimic it to a high level you will have to actually know it. "Mimic the sound of Japanese" is the equivalent task here to "predict the next token in this text".


Where that argument falls apart is when you're talking about something that's still bad at mimicry. ChatGPT still completely fails the Turing test; it's not even vaguely close to that threshold.

For now it’s slightly better than fake crowd noise you’re hearing in movies, but still frequently just gibberish.


ChatGPT passes the Turing test and is not even the first bot to have done so. No idea why you're so keen on downplaying it but soon there won't be anyone left that you will be able to convince that this is no big deal. You're fighting an uphill battle.


That’s completely false.

The Turing test allows unlimited topics, time, etc. There are competitions that use rules heavily favoring bots, where bots have "won" in the past, but they aren't actually performing the test.

ChatGPT seems amazing at first, but that's because its flaws are so novel. People just aren't used to looking for them, so they can overlook how quickly it completely forgets about previous parts of a conversation, etc.


And when it passes your current interpretation of the Turing test you will find another excuse why it still "completely fails" and is "not even vaguely close". That's called moving the goalposts.

Passing the Turing test does not imply general intelligence but saying that what it outputs is "just gibberish" is obviously just another hyperbole from you.


Gibberish is just a shortcut for words without meaning.

I don’t have some arbitrary rules for what passes the Turing test, but it’s about the worst case not the best.

https://en.wikipedia.org/wiki/Computing_Machinery_and_Intell...


[edit: You edited your comment but it used to say it fails "the test as described" and that the competitions are invalid since they do not follow the rules. I presume you looked up the actual test afterwards and realized how wild your "completely fails" comments were - and did a 180 and rewrote the comment. Keeping my original response below.]

> the test as described

I hope you realize that the original Turing test is where you have a man and a woman trying to convince an interrogator that they are of the opposite sex. The test is to replace one with a machine and see if the interrogator would decide the wrong sex as often as when there's an actual human playing.

So if we're talking about the actual test, as described, the most basic bots have passed it a long time ago. If we're talking about the standard interpretation (convince the interrogator that the bot is human) it's a derived version that has no intrinsic rules and was not described by Turing.


PS:

> wrong sex

You can read the original paper it’s clear in his version the goal for the computer is trying to convince someone communicating with them it’s human even though the form is to convince someone they are male. “The game may perhaps be criticised on the ground that the odds are weighted too heavily against the machine. If the man were to try and pretend to be the machine he would clearly make a very poor showing. He would be given away at once by slowness and inaccuracy in arithmetic.” https://redirect.cs.umbc.edu/courses/471/papers/turing.pdf

It’s also clear he’s referring to the spirit of the game not the specific details: “It might be urged that when playing the "imitation game" the best strategy for the machine may possibly be something other than imitation of the behaviour of a man. This may be, but I think it is unlikely that there is any great effect of this kind. In any case there is no intention to investigate here the theory of the game, and it will be assumed that the best strategy is to try to provide answers that would naturally be given by a man.”

He does give a benchmark of 70% accuracy after five minutes of questioning, but that wasn't a success criterion, just a benchmark.


I was just going into excessive detail. My point was that those limitations stop following the spirit of the original.

I don’t specifically object to changing the judge from interrogation to observation of a conversation. But, it should be clear his version doesn’t have all the loopholes the modern interpretation does.


>For now it’s slightly better than fake crowd noise you’re hearing in movies, but still frequently just gibberish.

You've clearly never used ChatGPT to help you build or troubleshoot anything before.

Try it before you knock it.


I wasted quite a bit of time trying. Its utility falls off a cliff if it can’t copy something from its existing corpus.

That said, it’s got a lot of code to copy from. So I may try again if/when I suspect I am reinventing the wheel.


Did we read the same comment? What spion describes would have been complete science fiction just a couple of years ago, and yet your reaction is "meh".


Everyone thinks this tech is cool; the main question discussed is how close it is to general intelligence. The poster you respond to argues that it is still very far from general intelligence. There are many who argue that it basically is general intelligence, so he isn't arguing against a strawman.

If you told me "We have made a natural language parser/processor that is at human level" I'd think that was a huge step forward for computing. Nobody can argue with that.


The real question in that context is what "general intelligence" even means, for starters. Until we have a clear answer to that, we can't really say anything meaningful about how close or far it is.

And it is quite possible that it's a question without any meaningful answer. We basically have a "gut feeling" that it is a qualitative rather than a quantitative difference, but is that really based on hard data, or is it just more comfortable for us to think that way?

In the meantime, the practical question is - how useful is it? What can it do?


ChatGPT is able to caveat its responses with "I don't know"s etc., especially when you prompt it appropriately. Tell it not to improvise and provide it with escape hatches ("if my question doesn't make sense, ask for clarification") and it is better able to recognize when it's about to make something up.


How many dimensions does the vector space have? Seems like there are an infinite number of ways you could convert word->vector, how do you choose which one is 'right'?

What should I google to understand how a word is encoded as a vector and then vector turned back into word(s)?


That's the beauty of it - you don't have to choose, the network chooses for you. If you set up your layers such that the first input layer has one input for each possible token (let's say 80K in total), and the next layer connected to it has only 1024 units, you will get "word embeddings" as a consequence of training the network (i.e. each node in the 80K layer will have 1024 weights pointing to the individual nodes of the next layer).

The simplest way to get word embeddings (without necessarily building a complex GPT like model) is word2vec: https://towardsdatascience.com/creating-word-embeddings-codi... - the principle is similar but the network is smaller.
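If it helps to see how small the lookup mechanics are, here's a hedged sketch of an embedding table in plain numpy - made-up vocabulary, made-up sizes, untrained random weights, so the "nearest" result is meaningless until you add a training loop like the word2vec one in that article:

  import numpy as np

  np.random.seed(42)
  vocab = ["the", "cat", "sat", "on", "mat"]        # toy vocabulary
  token_id = {w: i for i, w in enumerate(vocab)}

  # The "80K x 1024" table from above, shrunk: one row of floats per token.
  embedding_table = np.random.randn(len(vocab), 8)  # 8 dimensions instead of 1024

  def embed(word: str) -> np.ndarray:
      return embedding_table[token_id[word]]        # lookup is just row indexing

  def nearest(word: str) -> str:
      # Cosine similarity against every other row; training is what makes this meaningful.
      v = embed(word)
      sims = embedding_table @ v / (np.linalg.norm(embedding_table, axis=1) * np.linalg.norm(v))
      sims[token_id[word]] = -np.inf                # exclude the word itself
      return vocab[int(sims.argmax())]

  print(embed("cat"))    # the word as a vector of floats
  print(nearest("cat"))  # arbitrary here; clustered by meaning after training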


I feel like as an intro aimed at "normies" it still manages to communicate in a more abstract and overthinky way than necessary.

People often find it difficult to intuit examples from abstract descriptions. BUT, people are great at intuiting abstractions from concrete examples.

You rarely need to explicitly mention abstractions, in informal talk. People's minds are always abstracting.

> If I’m relating the collections {cat} and {at-cay}, that’s a standard “Pig Latin” transformation I can manage with a simple, handwritten rule set.

Or...

"Translating {cat} to {at-cay}, can be managed with one “Pig Latin” rule:

  If input is {cat} then output is {at-cay}."
"Translate" is normie and more context specific to the example than "transformation". "Set" is not "normie" (normies say "collection"), and its superfluous for one rule.

Concrete, specific, colloquial, shorter, even when less formally correct, all reduce mental friction.


And it talks down to the reader. "Ok! You made it! Phew! I'm sorry you're too stupid to understand this stuff, so I'm going to blow emotional smoke up your ass so that you aren't so intimidated"

Gag.

edit- and then he just casually starts talking about collapsing wave functions and hermeneutics. With no explanations, just assuming the reader knows it. (I had to look up both those concepts) So who exactly is he writing to here?


Technically it's a transformation rather than a translation because Pig Latin isn't a real language, it's the product of a word game to make up new words out of English words by rearranging them according to certain rules. The article does assume you know this though (and doesn't consider the alternative of picking a simpler example like cat to CAT in capital letters or cat to chat in French)

Same goes for its super brief explanation of a core term like "latent space", and it explains probability distributions by invoking atomic structures rather than a more basic stats example like a bell curve of adult human heights. It's definitely aimed at the sort of "normie" that reads Hacker News rather than actual normal people! I liked it though...


“Translate: to transfer or turn from one set of symbols into another” – Merriam-Webster

No, “translate” still works here, regardless of Pig Latin not being a language. And I agree with the GP that it’s a more intuitively understandable word than “transform.”


You can call it either a transformation or a translation, but that doesn't make 'translation' the more appropriate term. It is a procedural transformation that can be expressed entirely in a one-line function of pseudo-code.

Translation from one real language to another is so complex that it cannot even be represented in a hash table. Notably, the elements of grammar and syntax are not present in this naive form of translation. It is also lacking the element of bridging understanding, which is the root of the word "translate".

I would even posit that there are many CS words that would be more appropriate here: encoding, compiling, ciphering, mutating, obfuscating.

Thus, insisting on a word whose critical elements are lacking, despite the existence of more precise language, is itself a poor translation.


"Translate" is an appropriate term, because pig-latin is a language. I.e. you can communicate with it.

The fact that it is an invented, derivative, isomorphic language doesn't mean it loses any properties of being a language.

--

Being precise in mathematical or technical terms to a non-technical newbie makes no sense. They need to grasp some basic concepts of what it's all about first, in terms that will make immediate sense to them.

Only introduce formalisms when the need for them arises. That way, the reader is not only prepared for them, but can understand the motivation for learning them.

Unless the intro is going to go long and deep, as in the reader is going to become a practitioner, overly precise formalism or language may not add anything at all - because they won't be ready to understand the nuance.

One novel step at a time.

It ends up being easier for the reader, gives them a series of rewarding light-bulb experiences, and requires less explanation in the end. Win-win-win.


> a generative model is a function that can take a structured collection of symbols as input and produce a related structured collection of symbols as output.

Yeah that’s exactly the way a nOrMiE would find easy to think about it. Duh.

The author should probably try that sentence out on his grandparents and see how it goes before putting it on the internet and labeling it as “for normies”.


Ha, good luck with that. I’m still trying to convince my own grandparents that a monad is simply a monoid in the category of endofunctors.


Geez, how much more can you dumb it down?! /s


Have sympathy. The wall of monad is a hard one to get over.


At least you’re actually trying! :)


Whoa, the article is way too TL;DR too! Let me show how it's done:

A generative model is just a computer program. The program is incredibly book smart. No matter how incoherent your ramblings, it will find the most likely connections between words, string them into (mostly wrong) candidate sentences with different degrees of certainty, look in its database where everything you said is compared to the wrong word combinations, assign scores to each, add up all the points scored, and find THE most likely correct response: "Al Gore invented the internet"

There, that is all there is to it.


A normie's guide that references subatomic particle mechanics...


I flinched at that too, but the payoff of being able to talk about probability lobes was well worth it. I do think it could handle the electron cloud part better. Definitely lead with the electron shell diagram that most normies have actually seen, the incorrect one with the discrete electrons in rings around the nucleus, and THEN say “after you graduated high school, scientists have built cameras that can actually look inside individual atoms, and this is what they see: lobes of probability indicating where the electron most likely is.”


> after you graduated high school

Depends on age. I’m 35 and we did electron rings in middle school and probability orbitals in high school.

With no good explanation of course. Just expected to memorize. It didn’t make sense to me until college physics


> "Translating {cat} to {at-cay}, can be managed with one “Pig Latin” rule:

> If input is {cat} then output is {at-cay}."

Even this can be translated further into “human-speak”:

“Move the first bit of the word to the end, and add ‘ay’. Like, ‘cat’ becomes ‘at-cay’.”


Yes, that would be a good second step!

Fixed rule, pattern match rule, collections of rules, ...
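For what it's worth, that general rule really is only a few lines of code. A rough sketch that only handles simple lowercase words, so very much not a complete Pig Latin implementation:

  def pig_latin(word: str) -> str:
      # Move the leading consonant cluster to the end and add "ay".
      vowels = "aeiou"
      for i, ch in enumerate(word):
          if ch in vowels:
              return word[i:] + "-" + word[:i] + "ay"
      return word + "-ay"  # no vowels: just tack on "ay"

  print(pig_latin("cat"))  # at-cay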


Obligatory xkcd - https://xkcd.com/2501


The biggest drawback of LLMs is that they never answer with "I don't know" (unless it is some quote) and just serve up bullshit hallucinations which a human has to reject as wrong. Thus they are mostly useless for anything serious. Personally I use it to beautify some text, but I still have to do a bit of correction to fix b/s or missed context.


I haven't experienced ChatGPT explicitly telling me "I don't know", but it has provided me with factually incorrect answers pertaining to the functionality of an appliance and associated software. Further probing and I was advised to contact IBM support.... lol.

My usage has been confined to specific technical inquiries, and ChatGPT's benefit has been colossal in saving me the time of sifting through decade(s) of documentation, forum posts, bug reports, kb articles, etc.


Honestly, how can that be, if it's bullshitting me half of the time? If it's something I already know, I can see it's bullshitting; if it's something I don't know (i.e. I'm using it for a practical purpose), then how can I tell?


This seems like a great opportunity for some good old fashioned adversarial training.

Post-train one LLM to please another LLM that rates the quality of the first model's responses and calls it on any bullshit! (And vice versa.)
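The cheap-to-sketch cousin of that idea is an inference-time critic loop rather than actual adversarial post-training - roughly this shape, with purely hypothetical stand-in functions where the two real LLM calls would go:

  import random

  # Hypothetical stand-ins: in practice these would be calls to two separate LLMs.
  def generator_answer(prompt: str) -> str:
      return f"Draft answer to: {prompt}"

  def critic_score(prompt: str, answer: str) -> float:
      # A real critic model would rate factuality/quality; random here just for illustration.
      return random.random()

  def answer_with_critic(prompt: str, attempts: int = 3, threshold: float = 0.7) -> str:
      # Generate, let the critic call bullshit, retry, and flag low-confidence output.
      best_answer, best_score = "", -1.0
      for _ in range(attempts):
          candidate = generator_answer(prompt)
          score = critic_score(prompt, candidate)
          if score >= threshold:
              return candidate              # critic is satisfied
          if score > best_score:
              best_answer, best_score = candidate, score
      return best_answer + " (low confidence: the critic flagged this)"

  print(answer_with_critic("Who invented the can opener?"))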


I expect that this will be tried, and I worry about some negative consequences if it works well. It could be a way of generating very effective propaganda, that defeats the efforts of the opposing LLM to call bullshit.

On the other hand, diffusion models seem to have replaced GANs for image synthesis, so perhaps there's something I'm missing, or perhaps there's a way to combine both techniques.


I am not sure calling bullshit on images works the same way, given we generally want to invite creativity to image generation.

Although an adversarial extra-finger detector seems in order! Extra fingers are not creativity. They are taboo!!

But for text, there is a lot of structure around bullshitting, and fortunately, the internet is full of examples of people calling bullshit. As long as the bullshit adversary has to give a strong critique to back up any bullshit call, it should work.

And bullshit judgments can be de-bullshitted too.

This would be the same as what our social circles, work colleagues, families, etc. do for us, so a circular firing squad of bullshit judges, who also give gold stars for quality responses, seems like a natural solution for models too.


The first really good image generators for faces (see for example https://this-person-does-not-exist.com ) used GANs, which were trained by having one model generate faces and another model saying "bullshit, that isn't a face".

I'm not worried about using approaches like this to converge on truth (model A generates an answer to a question, model B attempts to refute model A). Perhaps we could train model A to generate lies that model B can't detect.


Stable diffusion (of bullshit).


> The biggest drawback of LLM is that it never answers with "I don't know"

I see this a lot on social media, but it simply isn't true in my experience as someone who uses multiple APIs from OpenAI.


Ask it to “prove how a martingale converges to the zariski measure on a commutative scheme.”

Totally absurd question, totally absurd answer.


I asked ChatGPT Legacy:

"Hello! Can you prove how a martingale converges to the zariski measure on a commutative scheme?"

It replied:

"Hello! The notion of a martingale converging to the Zariski measure on a commutative scheme is a highly advanced and abstract mathematical concept. It is not possible to provide a simple proof for this statement, as it requires a deep understanding of several advanced mathematical concepts such as probability theory, algebraic geometry, and commutative algebra."

It then listed definitions for martingale, Zariski measure, and a commutative scheme, and then concluded with:

"However, to fully understand this concept, you would need to study advanced mathematical topics such as algebraic geometry, commutative algebra, and probability theory."

That seems like a useful enough answer with definitions that effectively converges to "I don't know."


Notice that martingales do not converge AND the Zariski measure is not a thing at all.

Sorry saw your reply too late.


Yes, I see a lot of people talking about how it revolutionizes learning, but it only provides decent answers for extremely well-known and popular topics on the internet. Ask it a question about a less-documented human language for instance and it will mislead you with the most hilariously wrong and confidently made-up garbage answers.

This could end up being another example of knowledge quality going down as people keep going for the easiest way to get answers, like searching for info on topics on Instagram and TikTok.


It seems to respond better if you include an 'in the context of <subject matter>' at the end of the prompt. Then keep refining the context over a series of questions.


Bing AI often tells me when something is not known and when it hasn’t found any sources to confirm something. I think ChatGPT is worse off though.


Sorry, but that's not true. I've been playing with ChatGPT for only about a week or two and I have seen such a response a couple of times. However, it's not easy to get such an answer. I'm not denying that.

(Unfortunately I can't provide you an example. The ChatGPT history is not available right now. Wed Mar 8 02:02:14 PM CET 2023)


I just saw some post on the subreddit talking about methods to get it to provide exactly that about hallucination and confidence, with various techniques.

I think in the future it'll become less problematic or more transparent about hallucinations


Yeah I don’t understand why “I don’t know, but…” or “I’m not too sure, but…” isn’t part of the generative text. Does everyone on the web talk like they have the answers to everything? I’m genuinely asking.


Can't you address this with adversarial training items that build upon each other?

I will be doing some testing of this against the base davinci model over the next few weeks.


It has no way of judging the quality of its own answer; I see this trait as its biggest hurdle to most practical uses.


I’m not an expert, but I’ve always felt the use of softmax everywhere is a contributing factor. It’s basically saying “you HAVE to pick one” rather than “which one would you pick?”


There is the notion of the _rejection class_, which addresses the problem of 'You have to pick one' - it is the class you select when you mean to say 'none of the above'.
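A toy sketch of both options in plain numpy (not how any particular model actually implements it, just the idea): plain softmax forces a pick, while a confidence threshold or an explicit rejection class gives the model a way to say "none of the above".

  import numpy as np

  def softmax(logits):
      e = np.exp(logits - np.max(logits))
      return e / e.sum()

  classes = ["cat", "dog", "bird"]
  logits = np.array([0.3, 0.2, 0.1])     # weak, nearly uniform evidence

  # Plain softmax: you HAVE to pick one, even with no real evidence.
  probs = softmax(logits)
  print(classes[int(probs.argmax())], probs)

  # Option 1: abstain when the winner isn't confident enough.
  if probs.max() < 0.5:
      print("I don't know")

  # Option 2: an explicit rejection class with its own learned logit.
  classes_with_reject = classes + ["none of the above"]
  logits_with_reject = np.append(logits, 0.6)
  print(classes_with_reject[int(softmax(logits_with_reject).argmax())])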


This is a very good guide, although if it’s truly aimed at normies it’s still far too formal/mathematical in certain parts (ChatGPT itself could probably help you rewrite some of those sections in plainer terms).

The ‘token window’ section does a fantastic job of answering “but how does it know?”. The ‘lobes of probability’ section does a fantastic job of answering “but why does it lie?”.

The ‘whole universes of possible meanings’ bit does an okay job of answering “but how does it understand”, however I think that part could be made more explicit. What made it click for me was https://borretti.me/article/and-yet-it-understands - specifically:

“Every pair of token sequences can, in principle, be stored in a lookup table. You could, in principle, have a lookup table so vast any finite conversation with it would be indistinguishable from talking to a human, … But it wouldn’t fit in the entire universe. And there is no compression scheme … that would make it fit. But GPT-3 masses next to nothing at 800GiB.

“How is it so small, and yet capable of so much? Because it is forgetting irrelevant details. There is another term for this: abstraction. It is forming concepts. There comes a point in the performance to model size curve where the simpler hypothesis has to be that the model really does understand what it is saying, and we have clearly passed it.”

If I was trying to explain that to normies, I would try to hijack the popular “autocomplete on steroids” refrain. Currently it seems like normies know “autocomplete for words”, and think when you put it on steroids you get “autocomplete for paragraphs”. Explain to them that actually, what you get is “autocomplete for meanings”.

(Please feel free to use these ideas in your post if you like them, don’t even think about crediting me, I just want to see the water level rise!)


I recommend: ChatGPT Is a Blurry JPEG of the Web by Ted Chiang

https://www.newyorker.com/tech/annals-of-technology/chatgpt-...


Disagree.

As far as I understand, in this article he argues that compression happens in, say, the ChatGPT output.

But does it really?

To make a similarly low resolution metaphor, a “bayesian kaleidoscope” of a language model doesn’t necessarily mean it blurs the “word pixels” it is moving around. Because moving them around, rearranging them is what it essentially does, even if in opaque ways; but not degrading them, not changing letters in words or deliberately algorithmically messing up the word order in a sentence.

To make sense of the “image” an LLM produces is left up to us, and therefore, it is also up to us to decide whether any compression of anything has happened. And then, how do you measure it?

If you cut a painting into pieces, then glue them back together at random, thus making a new painting, would that constitute a “compression” or just a new painting, which could be worse or better than the original?

I quite like Chiang’s writing, but not this time. If anything, his take on this undermines what he previously wrote a little bit, painting him as more of an LLM than he probably would like to admit :)


A better metaphor would be to say it compresses the internet and creates a Markov chain based on that compression. Then to make it work, it compresses your prompt so that it can find it in the Markov chain, moves to the next step, and makes a lossy decompression into a text token and adds it. The lossy decompression here is the temperature: higher temperature means more lossy and more random words, but since it is lossy in the "meaning" space, the random words would still have very similar meaning to before.

That isn't a perfect metaphor, but it explains very well how it can do most of the things it can do. The lossy compression means that it can work with large prompts and just capture their essence instead of trying to look them up literally, and the lossy decompression lets it vary its output and the text will move in slightly different directions instead of just repeating text it has seen. The magical bit is that this compression and decompression is much smarter than before, it parses text to a format much closer to its meaning than before, and that lets us do the above much more intelligently.

Edit: Thinking about it a bit, maybe you could make these models way cheaper to run if we made them work as a compression to meaning rather than the huge models they are now? They do have internal understanding/meaning of the tokens they get, so it should be possible to create a compression/decompression function based on these models that transforms text into its world model state, and then once we start working with world model states things should be super cheap relative to what we have now.

Also, maybe it doesn't have lossy decompression and just gets words with similar meaning, but that is another way I see the models could be smaller and cheaper while keeping their essence. The Markov chain step could be all it uses currently. But it definitely creates that space and Markov chain, because it parses the previous thousand or so tokens and uses those to guess the next token; that is a Markov chain. It just has a very sophisticated way of parsing those thousand tokens into a logical format.
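To see how far the plain, uncompressed version of that is from GPT, here's a toy word-level Markov chain - a literal lookup of observed continuations, no meaning space, which is exactly why it dies on anything it hasn't seen verbatim:

  import random
  from collections import defaultdict

  corpus = "the cat sat on the mat and the cat ran".split()

  # Count which word follows which: a literal lookup table, nothing fuzzy about it.
  chain = defaultdict(list)
  for prev, nxt in zip(corpus, corpus[1:]):
      chain[prev].append(nxt)

  def generate(start: str, length: int = 6) -> str:
      word, out = start, [start]
      for _ in range(length):
          if word not in chain:          # unseen context: a plain Markov chain just stops
              break
          word = random.choice(chain[word])
          out.append(word)
      return " ".join(out)

  print(generate("the"))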


> creates a Markov chain based on that compression

I dislike that interpretation. It suggests it builds a very basic statistical model, but a very basic statistical model simply wouldn't be able to do what these models can do.

Or alternatively, if you want to consider the model as a markov chain mapping the probability from the previous four thousand tokens to the next token then the space is astronomically large. Beyond astronomically and even economically large, there are ~50,000^4096 possible input states.


> but a very basic statistical model simply wouldn't be able to do what these models can do.

Why do you think that? Why do you think a basic statistical continuation of the logic of a text wouldn't do what the current model does? There are trillions of conversations out there it can rely on to continue the text, people playing theatre, people roleplaying, tutorials, people playing opposite games, people brainstorming etc. Create a parser that can parse those down to logic, then make a markov chain based on that, and I have no problem seeing the current ChatGPT skills manifesting from that.

> Or alternatively, if you want to consider the model as a markov chain mapping the probability from the previous four thousand tokens to the next token then the space is astronomically large. Beyond astronomically and even economically large, there are ~50,000^4096 possible input states.

Yes, that is the novel thing, it compresses the states down to something manageable without losing the essence of the text, and then builds a model there of likely next token.


Your recommended article is more of a critique than the OP's "GPT for dummies", and very unsuccessful at explaining what's going on in that chat box.


Ted Chiang writes good stories but this metaphor is misleading.


I liked the intro with the can opener problem, but I think it's quite funny that, given that intro (particularly trying to convince people they're not stupid, they just don't know about the weird problem this thing is solving), a large section of the document is about electron orbitals. Possibly the most complicated example of probability distributions many people will know, and many won't know it at all.

> We all learn in basic chemistry that orbitals are just regions of space where an electron is likely to be at any moment

You may be surprised.

Latent space is then introduced in a side note:

> “latent space is the multidimensional space of all likely word sequences the model might output”

So the simple overview is that "Oh hey, it's just like electron orbitals - you know except in a multidimensional space of word sequences"?

The end part is probably the most useful, describing how these things work in a bit more practical sense. Overall this feels like it introduces the fact that the model is static and has a token window in a very complicated way.


Any explanation of "how ChatGPT really works" has to be able to explain https://thegradient.pub/othello/ using the same underlying theory to be worthwhile.


I’m not quite sure this article is really going to help “normies”. It seems to fall into the classic trap of “I’m going to explain something in simple terms that you will understand. But, I still want you to think I’m really clever so I’m going to make sure you still don’t really understand”


Had an attempt at writing for “normies” - https://atomic14.com/2023/03/08/why-does-chatgpt-make-mistak...


Thanks, this was an informative and enjoyable read!


ChatGPT is probably the first software product that I have no idea how I'd go about implementing. I watched a number of YT videos about it, including Andrej Karpathy's 2 hour coding session building a mini GPT.

I understand the process abstractly, but I am unable to grok the details about how it's able to take my vague interpretation of what I want and then write code and actually give me exactly what I wanted.


The ChatGPT video by Karpathy is the last in a 7 video series. The one that really hit it home for me was the first video on MicroGrad [1]. The second video builds on MicroGrad and is also great for understanding how a basic NN works.

[1] https://youtu.be/VMj-3S1tku0


Yeah, unless you're already familiar with the advanced math behind all this don't expect an in-depth understanding of the implementation details the parent comment is concerned about.


It parses natural text into a format representing logic, that is the biggest thing to get.

Then once you have such a parser you can make a logical Markov chain: it predicts the continuation of your text based on continuations of things that look similar in the logic space rather than the text space, and that is what you get back from the model.

So ChatGPT does what we could have easily done if natural language were more logical. Then we could make a simple model based on all the logic encoded in all the writing of humanity that just looks up human-done logic similar to what you asked for and then returns an average of those logics. Now, since you want human language output, it has to translate that back to human text from logic space, and it could be done in English, Spanish, it could speak like a pirate, etc.


Do you know how you'd go about implementing a video codec? A Haskell compiler? Linux kernel? Efficient relational database? wifi 802.11n radio? An efficient classical AI chess engine?


I don't know exactly how, but I have a good idea where to start. I've built a compiler before (in college). I've bootstrapped from hardware before, so I probably could build a very basic kernel. I've worked with radios at the low level in the 90s.

But with ChatGPT, it just seems like magic.


None of the actual moving bits are magic, though, and it's way more accessible than an OS kernel. The problem is that 1) you need a lot of raw GPU compute power to train and run a model that has GPT-3 capabilities, and 2) because the end result is a black box, nobody knows how it "actually works". In that sense, it is indeed technomagic.


I would recommend playing with the new ChatGPT API a bit.

Trial and error in terms of tweaking the system prompt is surprisingly instructive.
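Something like this is enough to start experimenting - using the openai Python package roughly as it worked at the gpt-3.5-turbo launch; double-check the current docs since the client interface keeps changing, and the prompts here are just made-up examples:

  import openai  # pip install openai

  openai.api_key = "sk-..."  # your API key

  response = openai.ChatCompletion.create(
      model="gpt-3.5-turbo",
      messages=[
          # The system prompt is the part worth fiddling with.
          {"role": "system", "content": "You are a terse assistant. If you are not sure, say 'I don't know'."},
          {"role": "user", "content": "Why does my prompt change the style of your answers?"},
      ],
      temperature=0.2,
  )
  print(response["choices"][0]["message"]["content"])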


I get the sense that ChatGPT crosses a complexity threshold where there’s no good way to describe how it works that satisfies everybody, and that’s leading to cyclical stories of the form “everyone else describes it wrong, so here’s my take.”

As a heuristic, I see descriptions falling into simple buckets:

- stories that talk about tokens

- stories that don’t talk about tokens

Anything discussing technical details such as tokens never seems to really get around to the emergent properties that are the crux of ChatGPT's importance to society. It's like talking about humanity by describing the function of cells. Accurate, but incomplete.

On the other hand, higher-level takes happily discuss the potential implications of the emergent behaviours but err on the side of attributing magic to the process.

I haven’t read much, to be fair, but I don’t see anyone tying those topics together very well.


"err on the side of attributing magic to the process"

I think that is due to LLMs being somewhat magical. I think that the wolfram article "What Is ChatGPT Doing … and Why Does It Work?" captures this beautifully.


Pay attention to Hacker News tomorrow morning. ;)


There’s no way a guy who refers to himself as “a normie” is in fact “a normie.”


I’d explain it in this way: It’s a neural network that learned to understand knowledge by reading a large part of the internet. It’s emergent behavior inside the neural net. It’s what happens in the brain of a baby. In the first months the eyes can see but the brain cannot. But the data will flow into the brain and, due to the learning algorithm, it will start to understand the visual data over time. It’s emergent behavior. The net builds relationships to have a better estimate of the required output to minimize loss. Predicting the future requires intelligence.


For more on emergent abilities of LLMs and scaling gains from data and compute, see this great post and discussions on the Chinchilla paper https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla...


An explanation to normies would be: "Chatgpt is like a wunderkind that has read all the books in the world and memorized them. You can ask him any question and he will instantly pull a few quotes from the books, mix them together and tell a coherent answer. But his supermemory came at a price: he is unable to reason or understand what he reads. He sees words as pictures."


It's not an accurate explanation, given that you can literally watch ChatGPT reason if you tell it to "think it out loud" while solving a particular task.


It's like an electric circuit where the residual stream provides the concept of feedback, analogous to the trace in traced monoidal categories: https://en.wikipedia.org/wiki/Traced_monoidal_category

Attention is a recurrence relationship that gets gradually pruned.


That didn’t help a normie at all


I would just add that if you are an LLM enjoyer and not necessarily a normie, then https://generative.ink/posts/simulators/ is still the best take.


Who is to say that human brains don't work in a similar way though, with a token window as short term memory and training as long term memory? Even more so when there is injury or disease that prevents long term memories from being formed.


In fact, our short-term memory does seem to be token-based, which is why people talk about "chunking" if you want to keep more info in your working memory.


> Important implication: It’s common but not necessarily always useful to say the language model “knows”

What does the author think knowing actually is if not a convergence of probability distributions?

I'd like all these "let me explain to you how things work" type write-ups to first define what their understanding of the key preexisting concepts is (like what it is to know something), so that when they try to explain the new concepts I don't waste my time learning concepts formed on a flawed understanding of their foundational counterparts.


Referencing atomic orbitals... for a good time, ask ChatGPT this series of questions:

1. Hello, what can you tell me about the Legendre polynomials, angular momentum, and the three-dimensional visualization of atomic orbitals in the context of quantum mechanics?

2. So, can the three-dimensional representations of atomic orbitals seen in chemistry textbooks be considered to be solutions of the Schrodinger differential equations in the context of Legendre polynomials?

3. What's the nature - mathematically speaking - of the radial and angular combination of the Legendre polynomials and the hydrogenic radial functions that give rise to the textbook probability density visualizations of atomic orbitals?

4. Tell me about Hartree-Fock representations of the molecular orbitals of the helium atom?

5. Compare and contrast the Hartree-Fock method and the more recently developed Density Functional Theory approach, please, in the context of simple molecular orbital theory.

That's pretty impressive, probably a better and more concise explanation than a lot of professionals in the field could give off the top of their heads. However, it's also interesting to see where ChatGPT is remarkably untrustworthy - Linux terminal command line options, for example; it often gets those wrong (disastrously so in some cases, happily I double-checked first). It also has problems with modular arithmetic... This one always delivers, it seems, and has been doing so for a month at least:

Q: So is seven congruent to four, in the context of modulus three?

> "No, seven is not congruent to four modulo three. To determine whether two integers are congruent modulo a given modulus, we need to compute the difference between them and check if it is divisible by the modulus.

In this case, we have:

7 - 4 = 3

Since 3 is not divisible by 3, we conclude that 7 is not congruent to 4 modulo 3.

We can also verify this by checking the remainders when 7 and 4 are divided by 3:

7 ÷ 3 = 2, remainder 1

4 ÷ 3 = 1, remainder 1

Since the remainders are not the same, we can again conclude that 7 is not congruent to 4 modulo 3."

Hmmm.... this seems wrong...
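(It is wrong: 7 - 4 = 3, which is divisible by 3, and both numbers leave remainder 1, so 7 is congruent to 4 mod 3. A two-line sanity check:)

  print((7 - 4) % 3 == 0)  # True: the difference is divisible by 3
  print(7 % 3, 4 % 3)      # 1 1: same remainder, so 7 ≡ 4 (mod 3)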


Add “Let’s think step by step” with the math problem and see if you get something different


I don't quite understand how it can produce rhyming poetry if it's building the output one word at a time.


Previously generated words are added to the input token window.

At the point it’s generating the next word, it knows what its preceding words were. With a conceptual representation of various rhyme schemes, subsequent words will (probably) fit that form.
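The loop itself is the easy part. A hedged sketch, with a hypothetical predict_next_token function standing in for the entire model:

  def predict_next_token(context: list[str]) -> str:
      # Hypothetical stand-in for the model: in reality this is the full forward pass.
      rhymes = {"cat": "hat", "day": "way"}
      return rhymes.get(context[-1], "the")

  def generate(prompt: list[str], n_tokens: int = 4, window: int = 8) -> list[str]:
      tokens = list(prompt)
      for _ in range(n_tokens):
          context = tokens[-window:]                   # only the token window is visible
          tokens.append(predict_next_token(context))   # output is fed back in as input
      return tokens

  print(generate(["the", "cat"]))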


So once it's done one line, when it's writing the next line, it will be stuck with being forced to rhyme with whatever it came up with on the first line? Curious what happens when the first line ends in the word 'orange' (or maybe it tries to pick end-of-line words based on rhyme-ability?)


If certain words (like “orange”) are statistically unlikely to be used in that context in that rhyme scheme, then they’re unlikely to be picked to begin with.


You can specifically ask it to end the first line in "orange" or whatever.

In my experience, when it can't come up with a good rhyme, it tends to fib it. But also keep in mind that, as it operates on token representation of words internally, its notion of what rhymes and what does not is far from perfect. This is more noticeable in languages other than English.


"ChatGPT doesn’t know facts" that's not a helpful statement. A better way to express it would be it doesn't understand its facts or it doesn't grok its knowledge. or maybe the statement is true until it's trained with wikidata on which relationships are Factual.


Is that the same Jon Stokes that used to write about CPU design for Ars Technica? Cofounder of Ars, IIRC.


It is.


I explained it as autocomplete, but where the past messages are the whole Internet.

Super flawed explanation but for non-compsci friends it has helped them understand the mechanism a BIT better.


Is there a way to get a mirror of this without the word "normie"? I suppose I can just copy and paste the content and send it to someone lol.


There are many articles on how it works, but there are very few explaining why other similar-scale models don't work.


Today I learned that referring to people as “normies” is apparently a thing.


It's a thing if you think you're better than "normies".


normie...instantly puts me off from reading the post. Makes the author sound like he is above everyone else.


I posted some comments about ChatGPT in a local FB group, and there was a pretty large percentage of folks who responded who think it's just an awful thing that's going to lead to the downfall of civilization.

I tried to offer that it is pretty cool, but it's just software that basically presents search engine results in a different manner along with a few other tricks - it's not "HAL".

I live in a very red and rural area so that probably has something to do with it. They love to have new things to complain about that have no effect on any of us at all.


> I live in a very red and rural area so that probably has something to do with it. They love to have new things to complain about that have no effect on any of us at all.

That seems an ungenerous interpretation. I don't doubt that their understanding is full of science-fiction-inspired fear, but the implications and dangers of this tech are a hotly debated topic among informed experts. So, even if their specific fears are ungrounded, their fear may not be (i.e., effects on the economy, culture, education, especially as the tech advances, which it will).


Especially since the powers that be didn't really do much to cushion the impact of previous societal advances.


> I live in a very red and rural area

This seems like an unnecessary invocation of negative stereotypes. What makes you think people outside that demographic don't have similar thoughts? Anecdata and all that.


I'll bite. How does ChatGPT fit into any of the conservative themes? I'm baffled. It's not gendered, it's not an immigrant, it's not "elitist", it generally isn't partisan or, for that matter, opinionated. Why the hate?


No topic needs to fit into any theme for one political group to hate it. The only thing it needs is enthusiasm from the other side. None of it is rational. In fact multiple issues in American history have completely switched parties over the years.

The fact that people think they are rationally debating issues is the problem.


ChatGPT (at least earlier versions) was deliberately fine-tuned to strongly favor one side of the American culture war. See this paper for an attempt at quantifying its political bias: https://www.mdpi.com/2076-0760/12/3/148.


"Its not gendered, its not an immigrant, its not "elitist", it generally isn't partisan or for that matter opinionated."

Same could be said about Covid, vaccines, and masking during a viral pandemic.

My guess is it's an addiction to the outrage FOX News sells (and others like it).


Wow... I didn't expect to get so many panties knotted up with that observation, but I think those who downvoted it prove my observation.

What bugs me most about ChatGPT is that it doesn't attribute where it got the results it offers. For example, I asked it "how to save a document with PouchDB" and it showed me the code, and while I didn't compare it, it looks like a copy-and-paste from PouchDB.com. If that's the case, it should be referenced with a link in the response, but it was not.


i'm not a red state / far right / pro-trump person in any sense of the word. however, i don't think it is very unreasonable to extrapolate a bit and see the potential for societal harm.

over the past three years the entire world was impacted by a dire health crisis where misinformation played a large role in distorting public perception. this has direct impacts on public health (people not wearing masks, refusing vaccines) and has spillover effects into other parts of people's lives (political polarization around said issues).

what you see as a pretty cool toy could also easily be abused as a giant round the clock fake news generator. it doesn't matter if the text is true or even makes any sense... an alarming amount of people will take anything they read as fact without investigating the sources. this can be done as is with chatgpt right now. it has obscenity filters sure, but fake news is trying to pass as legitimate reporting, so it will probably be framed in a tone that escapes the obvious filters they have.

then consider the implications for robotexting, phishing, automated bots that pretend to be you to customer service chats, social media bots, messaging app scammers. all of these things are currently problems that can cause harm both personal and societal... and chatgpt will make it easier and cheaper to scale them up to new levels.


>misinformation played a large role in distorting public perception. this has direct impacts on public health (people not wearing masks, refusing vaccines)

I increasingly hear things that suggest that those who wore masks and got vaccinated were the ones who were actually misinformed. Of course, you won't hear any of that on CNN


> I increasingly hear things that suggest those who wore masks and got vaccinated were the ones who were actually misinformed

The only folks who were misinformed are those who never took the time to learn about masks and vaccinations. But to be fair that is a huge number of folks here in the U.S.

I've yet to hear anyone talk about someone they infected who was killed by it, though. With over 1.1 million dead from it here in the U.S., that's astonishing. So is the fact that it is still killing around 2000+ a week here.



