> models of word frequencies

Ironically, your best effort to inform people seems to be misinformed.

You're talking about a Markov model, not a language model with trained attention mechanisms. For a start, transformers can consider the entire context (which could be millions of tokens) rather than simple state-to-state probabilities.

No wonder you believe people are being 'taken in' and 'played by the ad companies'; your own understanding seems to be fundamentally misplaced.


I think they are accounting for the entire context; they specifically write out:

>> P(next_word|previous_words)

So the "next_word" is conditioned on "previous_words" (plural), which I took to mean the joint distribution of all previous words.

But I think even that's too reductive. The transformer is specifically not a function acting as some incredibly high-dimensional lookup table of token conditional probabilities. It's learning a (relatively) small number of parameters to compress those learned conditional probabilities into a radically lower-dimensional embedding.
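
To make the distinction concrete, here's a toy sketch (illustrative only, not anyone's actual model): a bigram Markov model conditions on a single previous token via a count table, while an autoregressive LM computes each conditional from the entire prefix with a learned parametric function.

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat . the sun is hot .".split()

    # Bigram Markov model: P(next | single previous token), straight from counts.
    bigram = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigram[prev][nxt] += 1

    def p_markov(nxt, prev):
        total = sum(bigram[prev].values())
        return bigram[prev][nxt] / total if total else 0.0

    # An autoregressive LM instead factorises the joint over the whole prefix:
    #   P(w_1..w_n) = prod_t P(w_t | w_1..w_{t-1})
    # where each conditional is computed by a learned function of the entire
    # prefix (e.g. a transformer), not looked up in a table keyed on one state.
    def p_sequence(words, conditional):
        p = 1.0
        for t in range(1, len(words)):
            p *= conditional(words[t], words[:t])
        return p

    print(p_markov("mat", "the"))  # only ever sees "the", not "the cat sat on the"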

Maybe you could describe this as a discriminative model of conditional probability, but at some point, we start describing that kind of information compression as semantic understanding, right?


It's reductive because it obscures just how complicated that `P(next_word|previous_words)` is, and it obscures the fact that "previous_words" is itself a carefully-constructed (tokenized & vectorized) representation of a huge amount of text. One individual "state" in this Markov-esque chain is on the order of an entire book, in the bigger models.
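
To put a rough number on why that can't literally be a lookup table (illustrative figures, not any specific model's):

    import math

    vocab_size = 50_000       # assumed order of magnitude for a tokeniser vocab
    context_tokens = 100_000  # assumed "book-length" context

    # Number of distinct possible contexts ("states") of that length:
    log10_states = context_tokens * math.log10(vocab_size)
    print(f"~10^{log10_states:.0f} possible contexts")  # ~10^470000

No table enumerates that many states; the conditional has to be computed by a parametric function of the context.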


It doesn't matter how big it is; its properties don't change. E.g., it never says, "I like what you're wearing" because it likes what I'm wearing.

It seems there's an entire generation of people taken in by this word, "complexity", and it's just magic sauce that gets sprinkled over ad copy for big tech.

We know what it means to compute P(word|words), we know what it means that P("the sun is hot") > P("the sun is cold")... and we know that by computing this, you aren't actually modelling the temperature of the sun.

It's just so disheartening how everyone becomes so anthropomorphically credulous here... can we not even get sun worship out of tech? Is it not possible for people to understand that conditional probability structures do not model mental states?

No model of conditional probabilities over text tokens, no matter how many text tokens it models, ever says "the weather is nice in August" because it means the weather is nice in August. It has never been in an August, nor in weather; nor does it have the mental states of preference or desire; nor has its text generation been caused by the August weather.

This is extremely obvious, as in: simply reflect on why the people who wrote those historical texts did so, and reflect on why an LLM generates this text, and you can see that even if an LLM produced MLK's "I have a dream" speech word for word, it does not have a dream. It has not suffered any oppression, nor organised any labour, nor made demands on the moral conscience of the public.

This shouldn't need to be said to a crowd who can presumably understand what it means to take a distribution of text tokens and subset them. It doesn't matter how complex the weight structure of an NN is: this tells you only how compressed the conditional probability distribution is over many TBs of all of text history.


You're tilting at windmills here. Where in this thread do you see anyone talking about the LLM as anything other than a next-token prediction model?

Literally all of the pushback you're getting is because you're trivializing the choice of model architecture, claiming that it's all so obvious and simple and it's all the same thing in the end.

Yes, of course, these models have to be well-suited to run on our computers, in this case GPUs. And sure, it's an interesting perspective that maybe they work well because they are well-suited for GPUs and not because they have some deep fundamental meaning. But you can't act like everyone who doesn't agree with your perspective is just an AI hypebeast con artist.


Ah, well, there are actually two classes of replies, and maybe I'm confusing one for the other here.

My claim regarding architecture follows purely formally: you can take any statistical model trained via gradient descent and rephrase it as a kNN. The only difference is how hard it is to produce such a model by fitting to data rather than by rephrasing.

The idea that there's something special about architecture is, really, a hardware illusion. Any empirical function approximation algorithm, designed to find the same conditional probability structure, will in the limit t->inf, approximate the same structure (ie., the actual conditional joint distribution of the data).
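
As a rough sketch of what I mean (not a proof): fit two very different approximators to samples from the same conditional structure and they look increasingly alike as the data grows; the architecture mostly changes how quickly you get there from finite samples.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    true_mean = lambda x: np.sin(3 * x)          # stand-in for the "true" conditional mean
    x_test = np.linspace(-2, 2, 200).reshape(-1, 1)

    for n in (100, 10_000):
        x = rng.uniform(-2, 2, size=(n, 1))
        y = true_mean(x).ravel() + rng.normal(0, 0.1, size=n)

        knn = KNeighborsRegressor(n_neighbors=5).fit(x, y)
        mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                           random_state=0).fit(x, y)

        gap = np.mean(np.abs(knn.predict(x_test) - mlp.predict(x_test)))
        print(f"n={n}: mean |kNN - MLP| disagreement = {gap:.3f}")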


I think I see the crux of the disagreement.

> The idea that there's something special about architecture is, really, a hardware illusion. Any empirical function approximation algorithm, designed to find the same conditional probability structure, will in the limit t->inf, approximate the same structure (ie., the actual conditional joint distribution of the data).

But it's not just about hardware. Maybe it would be, if we had access to an infinite stream of perfectly noise-free training data for every conceivable ML task. But we also need to worry about actually getting useful information out of finite data, not just finite computing resources. That's the limit you should be thinking about: the information content of input data, not compute cycles.

And yes, when trying to learn something as tremendously complicated as a world-model of multiple languages and human reasoning, even a dataset as big as The Pile might not be big enough if our model is inefficient at extracting information from data. And even with the (relatively) data-efficient transformer architecture, even a huge dataset has an upper limit of usefulness if it contains a lot of junk noise or generally has a low information density.

I put together an example that should hopefully demonstrate what I mean: https://paste.sr.ht/~wintershadows/7fb412e1d05a600a0da5db2ba.... Obviously this case is very stylized, but the key point is that the right model architecture can make good use of finite and/or noisy data, and the wrong model architecture cannot, regardless of how much compute power you throw at the latter.
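
A minimal sketch in the same spirit (assuming the sine-curve example alluded to downthread): with a handful of noisy samples, a model whose form matches the data-generating process recovers it, while an unconstrained high-capacity fit mostly memorises the noise -- and more compute doesn't rescue the latter, only more or cleaner data would.

    import numpy as np
    from scipy.optimize import curve_fit

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 2 * np.pi, 15))
    y = np.sin(x) + rng.normal(0, 0.3, size=x.size)     # scarce, noisy samples

    # "Right" architecture: a two-parameter sinusoid.
    sinusoid = lambda x, a, w: a * np.sin(w * x)
    (a, w), _ = curve_fit(sinusoid, x, y, p0=[1.0, 1.0])

    # "Wrong" architecture: an unconstrained degree-10 polynomial.
    poly = np.poly1d(np.polyfit(x, y, deg=10))

    x_test = np.linspace(0, 2 * np.pi, 200)
    rmse = lambda pred: np.sqrt(np.mean((pred - np.sin(x_test)) ** 2))
    print("sinusoid test RMSE:  ", rmse(sinusoid(x_test, a, w)))
    print("polynomial test RMSE:", rmse(poly(x_test)))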

It's Shannon, not Turing, who will get you in the end.


Text is not a valid measure of the world, so there is no "informative model", i.e., no model of the data-generating process to fit it to. There is no sine curve; indeed, there is no function from world->text -- there is an infinite family of functions, none of which is uniquely sampled by what happens to be written down.

Transformers, certainly, aren't "informative" in this sense: they start with no prior model of how text would be distributed given the structure of the world.

These arguments all make the radical assumption that we are in something like a physics experiment -- rather than scraping glyphs from books and replaying their patterns.


Perhaps you have misunderstood what the people you are talking about mean?

Or, if not, perhaps you are conflating what they mean with something else?

Something doesn’t need to have had a subjective experience of the world in order to act as a model of some parts of the world.


> Why does, 'mat' follow from 'the cat sat on the ...'

You're confidently incorrect by oversimplifying all LLMs to a base model performing a completion from a trivial context of 5 words.

This is tantamount to a straw man. Not only do few people use untuned base models, but it also completely ignores in-context learning, which allows the model to build complex semantic structures from the relationships learnt from its training data.

Unlike base models, instruct and chat fine-tuning teaches models to 'reason' (or rather, perform semantic calculations in abstract latent spaces) with their "conditional probability structure", as you call it, to varying extents. The model must learn to use its 'facts', understand semantics, and perform abstractions in order to follow arbitrary instructions.

You're also conflating the training objective of "predicting tokens" with the mechanisms required to satisfy that objective for complex instructions. It's like saying "animals are just performing survival of the fittest": while technically correct, it glosses over the complex behaviours that evolve to satisfy this 'survival' metric.

You could argue they're "just stitching together phrases", but then you would be varying degrees of wrong:

For one, this assumes phrases are compressed into semantically addressable units, which is already a form of abstraction ripe for allowing reasoning beyond 'stochastic parroting'.

For two, it's well known that the first layers perform basic structural analysis such as grammar, and later layers perform increasing levels of abstract processing.

For three, it shows a lack of understanding of how transformers perform semantic computation in-context from the relationships learnt by the feed-forward layers. If you're genuinely interested in understanding the computation model of transformers and how attention can perform semantic computation, take a look here: https://srush.github.io/raspy/
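
As a self-contained toy (plain numpy, standard attention rather than anything specific from the RASPy material): a query retrieves a value by content, i.e. a differentiable learned lookup, which is the primitive those in-context computations are built from.

    import numpy as np

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
        return weights @ V

    # Three "tokens" whose keys encode made-up features; the query asks for
    # whichever token matches the feature pattern [1, 0].
    K = np.array([[1.0, 0.0],    # token A
                  [0.0, 1.0],    # token B
                  [0.3, 0.3]])   # token C
    V = np.array([[10.0], [20.0], [30.0]])   # arbitrary per-token payloads
    Q = np.array([[10.0, 0.0]])              # strongly "looking for" token A

    print(attention(Q, K, V))   # ~[[10.2]] -- almost entirely token A's value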

For a practical example of 'understanding' (to use the term loosely), give an instruct/chat tuned model the text of an article and ask it something like "What questions should this article answer, but doesn't?" This requires not just extracting phrases from a source, but understanding the context of the article on several levels, then reasoning about what the context is not asserting. Even comparatively simple 4x7B MoE models are able to do this effectively.
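
If you want to try this yourself, here's a hypothetical sketch assuming an OpenAI-compatible chat endpoint (e.g. a local llama.cpp or vLLM server); the URL, model name, and file are placeholders.

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    article = open("article.txt").read()   # any article you want to probe

    resp = client.chat.completions.create(
        model="local-instruct-model",      # placeholder model name
        messages=[{
            "role": "user",
            "content": f"{article}\n\nWhat questions should this article "
                       f"answer, but doesn't?",
        }],
    )
    print(resp.choices[0].message.content)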


> I would rather meditate on being full of thought, and thinking of many things simultaneously

This is basically ADHD, and whilst it has its advantages, I wouldn't recommend it in general.

The extreme of this might be psychedelics. A useful experience, but not necessarily something you want day to day.

Thoughts are a tool. They're like mind software.

The point of meditation isn't to stop thinking; it's to decouple yourself from your thoughts and see them from an outside perspective. In the process, you perceive that "you" are not your thoughts or your mind, and can therefore direct the mind more effectively.

It's a bit like closing background processes and gaining back resources. Those thoughts take resources to 'run', and many of them are constantly spawning new ones.

When you turn your awareness on these processes, rather than them running ad hoc in the background, you can understand your self more and direct yourself better. The competing processes reintegrate.

Or perhaps it's like an unruly dog that's constantly pulling you this way or that, but you're so used to it you don't even know it's unruly. Then you start to train it, and sooner or later you become a unit, working in coordination.

Neurologically, it's likely modulating the default mode network, which some theorise is the neural correlate of the ego/self: https://en.wikipedia.org/wiki/Default_mode_network

> Meditation – Structural changes in areas of the DMN such as the temporoparietal junction, posterior cingulate cortex, and precuneus have been found in meditation practitioners. There is reduced activation and reduced functional connectivity of the DMN in long-term practitioners. Various forms of nondirective meditation, including Transcendental Meditation and Acem Meditation, have been found to activate the DMN.

