The illusion of state in state-space models

empath75 · on June 7, 2024

There's nothing wrong with this paper and it's important to understand exactly what LLMs (and related technologies) are capable of and what they aren't, but I do have a problem with the general argument that <ai technology> can't do <some task> and so that means it's useless and won't lead to AGI. I feel like a lot of people think that if a single neural network architecture can't achieve something on it's own that building a system that supplements the AI with other tools is somehow _cheating_. General AI if it ever happens is going to very likely be a system of specialized components working together, and there's not going to be a single piece of it that you can point to and say "that is where the intelligence lives" -- it's sort of liking digging through the brain looking for the homunculus that's piloting the ship. If an LLM can't track state, hook it up to something that can. If it doesn't have long term memory and retrieval, hook it up to something that does.

szvsw · on June 7, 2024

Very well said. It absolutely drives me crazy - the focus on the model architecture for autoregressive text generation as the “key” (or blocker) to AGI seems so misguided as a debate. It seems clear to me (and maybe I’m wrong!) that a much larger system of modules interacting is essential (actuators, sensors, short term and long term memory, working memory, etc), and I completely fail to understand why it feels like the debate revolves around neural network architectures as opposed to the larger system architecture for agency. The language and decision making component is obviously extremely important and necessary, but it feels to me like it will always just be a necessary but not sufficient component for AGI.

The fact that we have already obviously developed a technology which has an excellent grasp of language, can select between tools, perform knowledge retrieval from external databases, perform some form of simplistic reasoning and planning… it really feels (again, only intuitively!) like everything outside the perimeter of LLMs is where the actual frontier of AGI will be.

It feels like with enough sensorimotor affordances and mechanisms for saving and accessing memory, all that is left is the (obviously complex) framework for connecting all of these together, with potentially many instances of each module working together… especially when you consider that one such system can potentially use human actors as tools (eg “hello fellow human, please complete this captcha for me, as I am low-vision”).

I guess more succinctly - I do not believe that GPT-4 et al are remotely close to AGI on their own, but I do feel (again, this is just “vibes,” and obviously people smarter than me disagree!) that GPT-4 could be a component combined with lots of other tech we already have to essentially achieve AGI already. Perhaps what I am describing would just be a pale imitation of what others mean by AGI and be reductively called another mechanical Turk. I am probably the misguided one, but nonetheless, I can’t escape the feeling that “the one model to rule them all” is a red herring vis-a-vis a complex network (ha) of modules interacting together to achieve meaningful agency. Maybe it’s just that I am more interested in “meaningful agency” as a guiding principle over “AGI.”

genevra · on June 7, 2024

Exactly, we're already adding basic memory functions to GPT models

Text generation can only get so much better but the other modules required to simulate intelligence have tons of room for improvement

logicchains · on June 7, 2024

>the general argument that <ai technology> can't do <some task> and so that means it's useless and won't lead to AGI

The paper isn't making that argument. The point it makes is that parallelisable SSMs are theoretically no more powerful than transformers, contrary to some people's assumption that they'd be theoretically equivalent to RNNs and hence more efficient at certain kinds of problems.

andoando · on June 7, 2024

I think a way to connect these different systems as a homogenous system is the difficult part.

Lerc · on June 7, 2024

Reading that abstract, I'm not sure what is being claimed here. They say "the “state” in common SSMs is an illusion:" Is this a claim that the activations within a state space model contain no state or that it cannot contain an arbitrary state. The former seems trivially false and the latter seems completely uncontroversial.

logicchains · on June 7, 2024

If you read the rest of the paper it's clear what they mean. SSMs are theoretically no more powerful than transformers, meaning without chain of thought they can't efficiently solve state-tracking problems, unlike RNNs: https://arxiv.org/abs/2207.00729 .

yobbo · on June 7, 2024

The difference to RNNs is that the state in SSMs is a linear combination of previous inputs, which makes it possible to parallelise training in various ways.

Gates and non-linearities in RNNs allow the state to be "any" function of previous inputs.

opprobium · on June 7, 2024

Not just efficiently, can't solve.

logicchains · on June 7, 2024

They can solve it if you keep adding layers to the transformer, it's just not efficient; you'd need exponentially more layers than a similarly sized RNN.

briandw · on June 7, 2024

The early research in neural networks was hampered by proofs that perceptrons could never solve certain functions, like XOR. DNNs could have been developed much sooner otherwise. I view these proof papers with some skepticism, since they can be unnecessarily dismissive of good ideas.

logicchains · on June 7, 2024

>The early research in neural networks was hampered by proofs that perceptrons could never solve certain functions, like XOR. DNNs could have been developed much sooner otherwise.

These proofs still hold; pure MLPs (without a modern activation function) aren't very useful, both in theory and practice. What made them useful was the realisation that combining them with a proper activation function makes them much more useful, both theoretically and practically; this discovery took time.

nyrikki · on June 7, 2024

How did papers that demonstrated single layer neural networks can't do XOR, but multi layer neural networks can, hamper development?

XOR is simply not linearly serperatable, requiring an MLP or kernel trick still holds.

It is a similar reason that attention works for majority gates but not parity gates in the general case.

Acknowledging that reality resulted in new developments, but is still a limitation.

Perceptrons are binary classifiers.

nickpsecurity · on June 7, 2024

I’ve only read the abstract. It says that they have experiments to back their claims. So, that’s a proof claim with experimental data. It’s in a field where most learning happens by experimental exploration, too.

I don’t think it will hold us back. If anything, it’s very exciting to see how many people in the ML field are challenging the status quo from many, different angles.

duped · on June 7, 2024

I disagree with the editorialized abstract. State is not an illusion. There is a limit to what stateful LTI systems are capable of modeling. It is very interesting that they've been able to prove what that limit is in the context of computation.

From my reading, "non gated SSMs" are LTI systems. They're linear (the paper ignores the activation function) and the "does not depend on input" criteria is the controls way of saying "time invariant" (replace time with the domain of the input vector). It is not surprising that a system that cannot adapt to changing input conditions (by construction) is fundamentally limited to the problems it can model. This is why adaptive filters are studied - LTI systems are useful computationally, but limited because they cannot adapt. Relaxing the "TI" constraint greatly expands the domains of problems that can be solved.

When the feedback matrix is diagonal you have an FIR system. While "finite impulse response" has as specific mathematical definition, conceptually it means the state at time n depends solely on past inputs and states n - N where N is the size of the state vector. So of course, if the feedback matrix is diagonal, the system is limited in the kinds of stateful problems it can handle.

What this paper is missing is that connection to the limits of LTI and connection to TC0, which is very interesting indeed.

In other words, the state of the network is still infinitely long and definitely not an illusion. However if the problem requires the network to adapt to its input then a non-gated SSM is not going to be sufficient. That's an interesting research space.

gaudat · on June 7, 2024

Another name hijacked by AI/ML... I was hoping to see some control thoery...

duped · on June 7, 2024

It's not hijacked, the formulation is the same. For any layer there is a state-space formulation

   h = Ah + Bx
   y = Ch + Dx

where x is the input, y is the output, and h is the state. They use "h" instead of "s" for the state variables because they're called "hidden states" in the literature. edit: it is obnoxious they've flipped the convention for A/B/C/D which is the one thing controls people agree on (we can't even agree on the signs and naming of transfer function coefficients!).

Where this diverges from dynamical systems/controls is that they're proving that when x/h/y are represented with finite precision numbers, the model is limited in the problems it can represent (no surprise from controls perspective), and they prove this by using an equivalence to the state-space formulae that's consistent with evaluating it on massively parallel hardware.

The classical controls theory is not super applicable here, because what controls people care about (is the system stable, is its rise/fall time in bounds, what about overshoot, etc) is not what ML researchers care about (what classes of AI problems can be modeled and evaluated using this computational architecture).

throwaway42668 · on June 7, 2024

This is what they actually mean by AI will be in everything. Just a relentless campaign of appropriating names that used to have meanings.

nico · on June 7, 2024

Interesting. This is how I sometimes feel about physics

The field completely appropriated and redefined so many terms of common language, that now it’s hard to talk in plain language with someone formally trained in physics about physical phenomena

For example, everyone has some sort of intuitive idea about what Energy is. But if you use that word with a physicist, watch out, for them it means something super specific within the context of assumptions and mathematical models and they will assume you don’t know what you are talking about because you are not using their definitions from their models

Same thing happens with infinite in math

szvsw · on June 7, 2024

I mean, if you like state space models, then you should read the paper on Mamba if you haven’t already! Because it quite literally uses state spaces… and you will probably think it’s a really cool application of state spaces!

Apologies if you know the following already, but maybe others reading your comment feeling similarly will not be familiar and might be interested.

At least intuitively, I like to motivate it this way- pick your favorite simple state space problem. Say a coupled spring system of two masses, maybe with some driving forces. Set it up. Perturb it. Make a bunch of observations at various points in time. Now use your observations to figure out the state space matrices.

There’s fundamentally not really anything different (in my opinion) with using Mamba (or another state space model) as a function approximation of whatever phenomenon you are interested in. Okay Mamba has more moving parts, but the core idea is the same: you are saying that on some level, a state space is an appropriate prior for approximation of the dynamics of the quantities of interest. It turns out being pretty remarkable the number of things this can work out quite well for. For instance, I use it to model the 15-min interval data for heating, cooling, and electricity usage of a whole building given 15 min weather data, occupancy schedules, and descriptions of the building characteristics (eg building envelope construction, equipment types, number of occupants, etc).

3abiton · on June 7, 2024

To be fair state-space models come originally from physics

aerospace_guy · on June 7, 2024

This has been happening for years now. Just adding "AI" or "LLM" gets the views and $$ these days.

optimalsolver · on June 7, 2024

>Figure 1: We prove that SSMs, like transformers, cannot solve inherently sequential problems like permutation com- position (S5), which lies at the heart of state-tracking prob- lems like tracking chess moves in source-target notation (see Section 3.2), evaluating Python code, or entity tracking. Thus, SSMs cannot, in general, solve these problems either

Do Microsoft & friends who are about to build trillion dollar AI data centers know about these proven limitations of transformer-based architectures?

canjobear · on June 7, 2024

People at Microsoft Research certainly do. But the bet is that, with scale, these limitations won't matter in practice. For example Transformers can't recognize well-nested brackets (like {[()]}) to infinite depth; the depth that Transformers can recognize is limited by how many self-attention layers they have. But in practice, you rarely need much depth.

toxik · on June 7, 2024

Notably, humans also cannot track this to infinite depth. Somehow we know this and reach for algorithmic solutions (like pen and paper).

logicchains · on June 7, 2024

Interestingly LLMs also do better with pen and paper reasoning: https://arxiv.org/abs/2310.07923 .

empath75 · on June 7, 2024

Not being able to solve a problem "in general" does not mean that it can't solve specific and useful instances of the problem. There are lots of SAT solvers that can't solve SAT problems in general (in any reasonable amount of time), but nevertheless can solve many useful categories of SAT problems.

mjburgess · on June 7, 2024

Yes. The solution is to hoover up all the queries people are currently putting to ChatGPT, save how people have "prompt engineer"ed the solution (ie., answered the question themselves); and then hope that you can just feed this back to people without them noticing.

The only open question is whether people have common-enough queries for this charade to work out. It seems there's quite a lot at least. But this number will decrease over time for various reasons. So it's a game of building a system that can be retrained on the answers people are giving it fast enough that people don't notice where the answers are coming from.

FeepingCreature · on June 7, 2024

Counterpoint: the more networks internalize the patterns behind engineered prompts, the closer they get to being general reasoners.

mjburgess · on June 7, 2024

Statistical patterns in text tokens have nothing to do with reasoning. They lack such a property to learn in the first place. This would be more obvious, I guess, if the tokens were in an alien language. Consider a translation to 1,000 different alien languages of any given novel.. there is no distributional property they would share.

Statistical AI is just a way of sampling from a historical dataset with a similarity metric. It only works to answer questions if you're sampling from a (Q, A) database in the same language the user already understands. The question was answered by a reasoner, it is now answered by a system which replays answers.

FeepingCreature · on June 7, 2024

This is just chinese room. But also, I disagree that the languages would not share properties. A novel is too small. Consider a network trained on GPT-3 scale datasets in 1000 alien languages. The shared structures behind the sentences will be the same, even if the grammar is completely different. Stars will be stars, moons will be moons. I'd bet you that the model would be able to translate shared concepts between those languages even if it had not been trained on the same novels, same as GPT can translate terms that it has not seen in dictionary pairs.

(I have no idea if it can! But I'm confident enough that I'm willing to just say it can, at risk of being proven wrong. If GPT can't do that, I'm fundamentally misunderstanding how it works - that is, I don't have a paper offhand showing that GPT shares concept neurons between languages, but I'm willing to bet I could find one if I went looking.)

In other words, if you co-trained GPT-4 on Earth Internet and Alien Internet, there's a good chance it'd end up able to translate English to Alienese, purely as an emergent ability, if it had the concept of translation at all.

Intelligence is compression. With sufficient abstraction (layers) and sufficient volume (dataset), any description of the same reality will assume the same structure. And no learning algo worth its salt will keep two identical structures around.

mjburgess · on June 7, 2024

Almost every predicate in natural languages is an arbitrary association of properties in the world. There isn't any property a person has "bald", nor are there "tree"s in the world. What bundles of properties the vast majority of language names are a product of historical and contingent associations we've made for practical reasons.

Likewise, languages do not have the same distributional structure. There is no reason an alien language would use discrete tokens to name properties, nor co-locate tokens by linear position in a 'sentence'. Historical human languages did not; using, eg., the full 2D structure of the clay tablet.

To suppose that it is the glyphs and their colocation which somehow bare meaning is a nonesensical superstition. The world is what our words mean, and it is we, in that world, who provide them their meaning. We can do so with arbitrary linguistic structures.

canjobear · on June 7, 2024

Languages do in fact have similar distributional structure, so that it is possible to learn how to translate words without supervision: https://arxiv.org/abs/2203.04863

In principle languages are arbitrary, but in practice they’re describing the same world and the same concepts end up being useful.

mjburgess · on June 7, 2024

Those are on embedding vectors, not on words. They use embeddings created from translated sources, eg., wikipedia articles.

Yes, if you construct an embedding vector on texts-A, and another on texts-B where (A, B) are translations of each other, then an "unsupervised" algorithm really will give you the dizzying heights of a little above coin-flip accuracy on highly engineered self-selected benchmarks.

They evaluate by by looking at the most in-use words in each vocab.. so you take the most in-use words on translations of Wikipedia, whose frequency is decided by the need of translation.. and then you use that to evaluate.

It is blindingly obvious that the structure and frequency of heirglphys on tombs, Chinese glyphs in poetry, and latin in medieval liturgical literature are not distributed by Reality.. written in this order by God so that the Langauge of Reality is what places "d" alongside "oor". We already know this to be the case. The assumption of its opposite is rank pseudoscience.

FeepingCreature · on June 7, 2024

> It is blindingly obvious that the structure and frequency of heirglphys on tombs, Chinese glyphs in poetry, and latin in medieval liturgical literature are not distributed by Reality.. written in this order by God so that the Langauge of Reality is what places "d" alongside "oor". We already know this to be the case. The assumption of its opposite is rank pseudoscience.

I mean, on the first level, of course they're entirely determined by reality in the sense that the human brain is a real, physical object. But also on a second level they're still entirely determined by reality because the shape of the human mind is also entirely determined by reality. What is the mind for except reflecting reality? What is language for except communicating it? Sure that reality is warped, filtered, reduced and biased, but the data is still in there. That's in large part why LLMs need such ludicrously large training runs.

I don't think letter frequency is objectively determined, but we know for a fact (many studies!) that the features that large language models learn are far, far above the scale of letters. Even arguing about phrases isn't engaging with the current state of the art.

We're not talking about Markov chains here.

FeepingCreature · on June 7, 2024

Right, but neural networks at a certain level of scale begin to compress the perception described by the text rather than the text per se. The lexical features are abstracted away and the network begins to understand meaning directly. Of course, you'd still have differentiation by different perceptions and focus on different parts of reality, but I see no reason why that should not likewise abstract out at even higher levels. Ultimately, the whole point of language is that it's about reality, and there is only one reality no matter how many different cultures you feed in. The network isn't doing magic - it's just doing exactly the same thing that we're doing with language. The reason why language is useful is the same reason it's learnable.

mjburgess · on June 7, 2024

If you want to understand how statistical AI systems work, then you should understand the basics of formalism. They are curve-fitting algorithms that approximate the probability distribution of a hisotrical dataset. That's all they are.

By analogising to any animal or human mind, you aren't describing anythign that acutally exists. A neutral network algorithm isnt neural and it isnt a network. It's a statistical curve-fitting algorith. One oughtnt study trees to understand a decision tree either.

This language is entirely metaphorical. There are no neurones in an NN, there are just summation entires in a matrix. This matrix comprises weights, which define the orientation and scale of the line pieces which form the curve being fit to the data.

FeepingCreature · on June 7, 2024

They're, in particular, multi-level curve-fitting algorithms. A formalism which, I'm sure you are aware, can express any bounded computation. In particular, depending on how you set up the tape, neural networks are comfortably Turing complete.

We're not talking about 2-layer perceptrons anymore here.

In my opinion, you should look less at the formalism and more at the empirically demonstrated performance.

mjburgess · on June 7, 2024

I'm not aware that it can express any bounded computation. But such is irrelevant anyway: there is no function from the distributional structure of text tokens to reasoning. There is nothing to approximate, and this is a system trained just to approximate some presumed function: our mental capacities arent patterns in glyphs.

Whatever property you might imagine a transformer architecture to have (and it is vastly fewer than the set needed for general computation), the problem here is that it's being applied to approximate the structure of historical text data which isnt being generated from such a function.

Indeed, there is no function which generates text data, there's a very large number of independent generating processes that give rise to the distribution of text. The phrase "the war in ukraine" acquires a different semantics over 2010-2030 in a radically different way than, "I liked that film" does.

The capacities which produce distributions of text tokens are highly varied, complex, invovle a vast array of our mental processes, and so on. There's literally almost nothing in the distribution of text tokens that corresponds to any features of these processes.

The structure of language is conventional, and rests on our familiarity with such conventions. Otherwise, let's end all science, everything to be known about the world derives from how "e" occurs alongside "lectron"

FeepingCreature · on June 7, 2024

> Indeed, there is no function which generates text data, there's a very large number of independent generating processes that give rise to the distribution of text. The phrase "the war in ukraine" acquires a different semantics over 2010-2030 in a radically different way than, "I liked that film" does.

But those semantics are revealed in the greater context of the phrase! That's why it's so important that transformers can attend to large context ranges; that's what lets them learn the greater semantic patterns to begin with. And at the limit, at a scale greater than phrases, I simply reject the idea that the same article - the same book - can have totally different meanings depending on context. Language isn't just shaped by context, it shapes context itself. Because language cannot be considered without context, it reveals information about that context, and in fact any compression of language ultimately requires modelling the person and even society that produced it. That's what the network learns.

If your words don't have meaning beyond themselves, what are you even talking about?

mjburgess · on June 7, 2024

> If your words don't have meaning beyond themselves, what are you even talking about?

...err... of course? That's the whole point.

"Context" here isnt other words. It's the world. The meaning of words is the world.

When I say, "I like what you're wearing" i'm not summarising a history of prior texts; it has nothnig to do with any statistical operation over historical documents. It has entirely to do with what you're wearing.

Langauge use is a side-effect of being embedded in reality, directly attentive to it, and so on. Words are mere symptoms of how we are situated in the world.

LLMs merely replay these back to us. They are not in the world. They cannot, in principle, ever mean, "I like what you're wearing"

FeepingCreature · on June 7, 2024

I'd say exactly the opposite. It's because human speech is about the world, and LLM speech is about human speech, that LLM speech is about the world, by transitivity. "The ball fell to the <floor>", the LLM predicts, ultimately, not because of any feature of the human brain, or any feature of the English language, but because of gravity. That the causal arrow passes through a human brain does not make this any less true! Because human speech is inextricable from physical reality, language models likewise learn to model reality. They learn this poorly, incrementally, making missteps on the way - granted! But that they learn it at all shows that there's more going on than statistical modelling - or that statistical modelling of language is more than it sounds like.

mjburgess · on June 8, 2024

Human speech isnt about reality. Speech, langauge, tokens, etc. are just scratchings on a wall. When we speak we infer the reality-orientation of people's intentions (and so on).

The strings 010101010111100101, 000010101010111010101, 120192019092109209102, etc. are all "about" the exactly the same scenario whereby different communities of speakers adopt different conventions for the meaning of the terms (indeed, whether pairs/tripples/etc. are terms).

This is not like a thermometer in a glass of water, the height of the mecury being determined by the temperature. Nothing about language is determined by reality. It is purely a highly lossly encoding scheme based on historically arbitary encoding conventions. It isn't a measure of reality in any sense.

Modelling this encoding scheme doesn't model anything about the world. It only appears to if you generate text according to this scheme.. and only to those who can decode it. Hieroglyphs likewise, scratches on clay, and everything else.

You are subject to the illusion of intentionality whereby your phenomenal experience of langauge is rendered immediately meaningful by your brain, ie., your decoding scheme is prior to the construction of your experience of the world. This is wholey absent from anything machines are processing.. whih are only the frequency relationships between tokenizations of syntax

FeepingCreature · on June 8, 2024

> Nothing about language is determined by reality.

This is just transparently silly. Or rather, you're equivocating between "language is not fully determined by reality" and "language is uncorrelated with reality." Language is partially caused by reality; thus, large language models, seeking compression, learn the arbitrary parts and the reality parts separately. That's why they can translate.

The human sensory stream is a kind of language. The exact same logic goes for it! If LLMs could not learn about reality from language because language was not determined by reality, then human brains could not learn anything about reality either.

Nothing about reality forces a certain neural state vector to correspond to a certain sense impression. E pur si muove.

mjburgess · on June 8, 2024

Yes, we cannot learn anything about reality from language alone. This is 'the intentional illusion' or you could call it 'the aboutness illusion'.

As in you think you are learning a speed limit for a road by looking at a sign with a speed limit on it. But the sign itself is just some shapes, and no aspect of its structure has anything to do with speed, roads, or speed limits. Literally, everything here is arbitrary convention; and under arbitrary permutation, the sign can mean the same, and share no properties with any other permutation whatsoever.

Rather it is by being-in-the-world you have acquired a direct understanding of roads, speeds, limits, rules, laws, society, etc. And alongside this all the while internalizing arbitary conventions by which we signify these things.

A stop sign could look like anything at all. We just decide to use arabic numerals, red, white, a certain eye-hight, etc.

The meaning of the sign is something you are reconstructing prior even to being cognitively aware of the sign you're looking at. The meaning of these conventions is a mental reconstruction made possible by your direct familiarity with the world these conventions are related to.

You first act within the world by building a rich conceptual understanding, you then associate arbitrary conventions with this understanding, and then can make inferences about this world by reconstructing a novel conceptual understanding given by these signs.

There is literally nothing at all in speed signs that has anything to do with their meaning.

This isnt a speculative observation. It places a very hard limit on what the entire field of statistical AI can do, let alone LLMs. Any model created by empirical risk minimization, or fitting-to-historical-data, lacks any relevant capacities. It has only one: drawing a data point from some inferred distribution over historical cases.

In any case, this isnt a speculative exercise. Statistical AI will be forever "limited" to a fixed computational budget for computing predictions over historical cases, with necessarily limitations this involves -- this is only an engineering problem where these cases aren't representative of the kinds of queries people have.

eg., they will fail under relevant counter-factual permutation of the conventions of the language/domain/topic, etc. -- see, e.g., https://arxiv.org/pdf/2307.02477

Applying the lessons of the above paper, note chatgpt3.5 gets the evaluation of python wrong under permuation of its syntax: https://chatgpt.com/share/7df48393-9cb8-41f4-a84e-d21b680702...

..whereas ChatGptv4o does not. This suggests to me that there's been a very large amount of prompting by users since 3.5 trying to create novel programming languges in this way.. hence 4o, having been trained on subsequent user prompts, its "better" at this task.

Of course nothing important has changed between v3.5 and v4o in terms of what the model is, rather there is more query-relevant data to sample from.

FeepingCreature · on June 8, 2024

> Yes, we cannot learn anything about reality from language alone. This is 'the intentional illusion' or you could call it 'the aboutness illusion'.

I just don't think this is correct. Language is embedded in reality, is a product of reality, and is correlated with reality.

> As in you think you are learning a speed limit for a road by looking at a sign with a speed limit on it. But the sign itself is just some shapes, and no aspect of its structure has anything to do with speed, roads, or speed limits.

Again, simply wrong. 5km/h limit, 60km/h limit, 120km/h limit - even the physical width of the text has a correlation to the exponent of the speed limit! Which will correlate to the curvature of the road! These are not arbitrary, they're tightly interlinked on every level except the specific shape of the glyphs themselves. That's why your argument only holds up for the most basic features such as glyph frequency. As you zoom out, structural similarities begin to appear; not in the shape of the text but in the shape of features abstracted from the text. That's why LLMs attain novel capabilities as you increase depth and context size, as they can begin to condition on the deeper structures of reality that leak through the language. As language is abstracted from reality, at sufficient scale, a cognition based on language begins to necessarily model reality itself.

> You first act within the world by building a rich conceptual understanding, you then associate arbitrary conventions with this understanding, and then can make inferences about this world by reconstructing a novel conceptual understanding given by these signs.

Completely unclear if this is true at any rate. Humans don't first learn to interact with the world and then learn language; language and world-understanding co-develop.

mjburgess · on June 8, 2024

OK. well, lets give it 5yrs. in 10, at least, this will be over

FeepingCreature · on June 9, 2024

This, at least, I agree with.

(Though not quite in the same sense.)

stevenhuang · on June 7, 2024

Computability is of critical relevance.

Our mental capacities can be patterns. https://en.m.wikipedia.org/wiki/Predictive_coding

> There's literally almost nothing in the distribution of text tokens that corresponds to any features of these processes.

There's nothing in evolutionary fitness that necessitates intelligence or reasoning ability either. Yet here we are.

Sorry but if you don't see the connections then you need to do some reading on theory of mind, cognition, information theory, physics, philosophy. All of the fundamental basis are met to allow reasoning to emerge in LLMs.

It is clear now where your confusion lies, and why you are led to believe so strongly that LLMs cannot reason: it is because you are an ML practitioner you overweight your expertise yet you don't know what you don't know, and have foundational gaps in your knowledge. You lack the context in these other fields. If you had them, your position should be closer to agnostic than this strong belief of yours that LLMs in their current form cannot reason.

mjburgess · on June 7, 2024

lol, well when I've written my PhD on those areas we can return to whether I'm an expert on them or not.

One does not form "connections" between them as in some wide-eye conspiracy theorist.. "predictive coding" has little to do with "prediction" in the ML sense. The latter concerned with making a quantitive estimate of some variable by summarising historical data.

What we are doing when we revise a "mental model" is done by counter-factual simulation of possible future states. Statistical AI models conditional probability structures, and computes predictions as an expectation over weighted summarised historical data. This is not a means of performing counter-factual simulation.

One trivial, sadly empirical, way to see this is to note that each marginal token generated is of constant time and energy use. Yet trivially, reasoning and a variety of other mental capacities should require abitarily different time to run. Eg., simulating a complex scenario is necessarily more intensive than a simple one, and so on.

Yet I am quite annoyed that we need such dumb observations to make this point. It speaks of a profound ignorance of "theory of mind, cognition, information theory, physics, philosophy" and especially neurology and zoology which are your most significant missing terms.

It is no real mystery what the structure of various mental capacities involves; nor any mystery what s(Ws(Ws(WX+B)+B...)...)...) computes. Even involving anything beyond trivial applied statistics and trivial results in science shouldnt be required here. This stuff is very obvious.

empath75 · on June 7, 2024

Have you ever actually bothered testing your assumption that chatgpt can't reason. And I don't mean that it fails to reason properly about certain questions, that's trivially easy to show, but is less interesting than many people think it is, because humans can't reason properly about many questions (see, for example, the Monty Hall problem). Test your assumption that it can never reason properly about any question not in its data set. It's way easier, IME, to find examples where it does give the correct answer to novel problems than it is to trip it up with a difficult problem.

mjburgess · on June 7, 2024

It's data set is, approximately, everything ever written. We have no access to it. And this is widely studied. You can find trivial mistakes in apparent reasoning.

There are two hypotheses: H1, the structure of a response from any given prompt is computed using distributional properties of historical data; H2: the response is computed via deduction from premises to conclusions of agent employing the semantics of the terms, their logical connections, and connections of relevance.

In many cases a prompt/reply will confirm both hypotheses, hence confirmation bias and why we dont bother "confirming" any hypothesis. Rather to choose betweeen them, if you wanted to use data, you just find cases where reasoning fails in such a way that H1 is the more plausible answer. Such cases are easy to find, and across the literature.

This whole thing is pointless however, because it's blindingly obvious from what a statistical AI algorithm does, which is empirical function fiting on historical datasets to approximate conditional probability distributions. This process is extremely well-understood, and we know necessarily that it is just approximating a historical distribution.

stevenhuang · on June 7, 2024

> This process is extremely well-understood, and we know necessarily that it is just approximating a historical distribution.

Prove that this precludes formation of mind. Prove that human minds do not work this way.

Until such premises are established, your argument is not sound. This leaves open the possibility what we're seeing is in fact a degree of reasoning ability in LLMs.

mjburgess · on June 7, 2024

As I said in my other comment. Statistical models generate output with constant time/energy. So for any input y, and any output x, the computation is constant time. eg., generating any 100 tokens of output takes the same time as any other.

If you reflect on that carefully, you'll note the only way this is possible is if the algorithm does not use any semantic features of the input. Indeed, if it has no relevant capacities at all.

Consider e.g., the prompt, "imagine a world where..., and then infer..., and then what would be... ?" for various "..." we can choose, the output ought take arbitrarily longer to compute; but its constant.

For any word: imagine, reason, suppose, simulate, believe, remember... denoting any alleged mental operation, we can trivially construct cases where actually performing this operation would take more or less time.

"Given premises A, B, C, D,...; infer conclusion..." has a prompt reply which is constant in time regardless of how many premises we input. "Recall memories A, B,C,.." likewise.

The only way that answering "yes" to a question, say, regardless of what that question is, taking the same time/energy/etc. to answer.. would be if no semantic feature of that question was being used in its answering.

This is an LLM, it's all statistical AI: the number of operations performed per prediction is constant for all predictions. Almost no alleged capacities are consistent with this fact.

FeepingCreature · on June 7, 2024

The human brain computes precisely the same amount of information every millisecond. That is, the computational power of a single neuron is necessarily capped by its switching speed.

This proves humans cannot think.

mjburgess · on June 8, 2024

Whether that is true or not, and it obviously isn't -- the energy used by the brain is highly variable -- that isnt the point.

The point is that if I ask you to do some complex task that requires a one-word answer, you will take a long time to think about it. If I ask an LLM, it will always take the same amount of time, regardless of the task.

It follows from P =/= NP that this should necessarily not happen if the LLM is actually evaluating the semantics of the question. However, it also follows from just the properties of what capacities are being claimed

Reasoning, as an algorithm, is at least, O(NumberOfPremises) etc. The number of operations performed, in every case, for all statistical AI algs, is the same. It's not O(ANY PROPERTY OF THE PROBLEM) -- this is a catastrophe for anyone who would claim that a statistical algorithm considers the problem at all.

Statistics here is just a very naive short-cut. Rather than solve the problem, you're just sampling from previous answers.

This, of course, should be obvious to any one with the bare minimum of knowledge of applied statistics and the formalism of ML. Neverthless, of course, it isnt. One has to imagine that overwhelming amounts of corporate propaganda and scifi wishfulfilment are at work.

FeepingCreature · on June 8, 2024

The energy used by the human brain is highly variable because the brain has been optimized to reduce power use. This was a going concern in the ancestral environment. It has not been a great concern in the developmental environment of LLMs, so they are not optimized towards this. There are in fact many papers on how to optimize energy use of LLMs, and some amount of them may have already been deployed.

You note this is offtopic, but I disagree. This is part of your fundamental misunderstanding of how to make good use of LLMs - and how LLMs make good use of context space.

> The point is that if I ask you to do some complex task that requires a one-word answer, you will take a long time to think about it. If I ask an LLM, it will always take the same amount of time, regardless of the task.

This is simply not the case. I will take a long time to think about it, because my brain is generating tokens in the conscious workspace. I will take roughly the same time for every token, same as a LLM. (I believe that's what brain rhythms are about.) However, as opposed to a LLM, I can do computation without echoing it to stdout. If you stuck a probe in my conscious workspace, I believe you would find it looking suspiciously similar to the output stream of a multimodal LLM! And there are several papers to the extent of - hey, LLMs do in fact give better answers if you let them insert filler words, or even dots or spaces or invisible tokens - because they learn to use the additional time you give them to -- think.

Cough just like us cough.

That's why all these tasks that tell the LLM to "not output any extraneous information but only give the answer" are so dumb and misleading. They're literally telling the LLM to blurt out the first thing that comes to mind.

In other words, you think the context window is speech, which is why you are confused. The context window is the LLM's axis of time. That their time is conflated with speech is the primary reason why LLMs have a reputation for being glib, surface speakers - and why they continue to underperform. They literally cannot stop to think - not because of any fundamental inadequacy of the technology, but just because that's how we've set them up. I suspect the amount of boilerplate text given before an answer correlates directly with quality of the answer. That's why the energy is misleading - the computational work isn't just the token, but every token between the point where the answer could be computed and where the LLM was forced to commit to a final answer.

And as one could expect from this, if you give the LLM time to think without speaking, their performance shoots up significantly. The moment somebody at OpenAI reads the QuietSTaR paper (GPT-2 scale network with GPT-3 quality!) and understands what it means, the timer to AGI begins.

naasking · on June 10, 2024

> The moment somebody at OpenAI reads the QuietSTaR paper (GPT-2 scale network with GPT-3 quality!) and understands what it means, the timer to AGI begins.

I think parametric memory and accelerated grokking are equally promising. The convergence of all of these will dramatically improve reasoning. The next few years will be interesting indeed.

stevenhuang · on June 7, 2024

The guy you're responding to can be seen routinely interjecting with these kinds of unfounded assertions that LLMs can't reason everywhere LLMs are discussed. He can't help himself it seems. You'd think he'd engage with the responses he receives, maybe down-weight his convictions through exposure to reasonable arguments, but nope.

An LLM would have the wherewithal to consider the possibility they're wrong, or to be agnostic in their beliefs until we ourselves understand better what it means to reason.