
It's not a problem, because where we currently sit on the logarithmic curve is the only thing that matters. No one in their right mind ever expected anything linear, because that would imply that creating a perfect oracle is possible.

More compute hasn't been the driving factor behind recent developments; the driving factor has been distillation and synthetic data. Since we've seen massive success with that, I really struggle to understand why people continue to doomsay the transformer. I hear these same arguments year after year and people never learn.


They're not sampling from prior conversations. The model constructs abstracted representations of the domain-specific reasoning traces. Then it applies these reasoning traces in various combinations to solve unseen problems.

If you want to call that sampling, then you might as well call everything sampling.


They're generative models. By definition, they are sampling from a joint distribution of text tokens fit by approximation to an empirical distribution.


Again, you're stretching definitions into meaninglessness. The way you are using "sampling" and "distribution" here applies to any system processing any information. Yes, humans as well.

I can trivially define the entirety of all nerve impulses reaching and exiting your brain as a "distribution" in your usage of the term. And then all possible actions and experiences are just "sampling" that "distribution" as well. But that definition is meaningless.


No, causation isn't distribution sampling. And there's a difference between, say, an extrinsic description of a system and its essential properties.

E.g., you can describe a coin flip as sampling from the space {H,T} -- but insofar as we're talking about an actual coin, there's a causal mechanism, and this description fails (e.g., one can design a coin flipper that deterministically lands on heads).

In the case of a transformer model, and all generative statistical models, these are actually learning distributions. The model is essentially constituted by a fit to a prior distribution. And when computing a model output, it is sampling from this fit distribution.

I.e., the relevant state of the graphics card which computes an output token is fully described by an equation which is a sampling from an empirical distribution (of prior text tokens).
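
To make the point concrete, here's a toy sketch (not any particular model's code; the logits are made up) of what "computing an output token is sampling from the fit distribution" amounts to at the output layer:

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up scores the trained network assigns to each candidate token.
    logits = np.array([2.1, 0.3, -1.0, 0.7])

    # Softmax turns them into the fit distribution over the token vocabulary.
    probs = np.exp(logits) / np.exp(logits).sum()

    # "Computing an output" is literally a draw from that distribution.
    token_id = rng.choice(len(probs), p=probs)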

Your nervous system is a causal mechanism which is not fully described by sampling from this outcome space. There is nowhere in your body that stores all possible bodily states in an outcome space: storing that space would require more atoms than exist in the universe.

The same holds for any causal mechanism. Reality itself comprises essential properties which interact with each other in ways that cannot be reduced to sampling. Statistical models are therefore never essential models of reality, only circumstantial approximations.

I'm not stretching definitions into meaninglessness, these are the ones given by AI researchers, of which I am one.


I'm going to simply address what I think are your main points here.

There is nowhere that an LLM stores all possible outputs. Causality can trivially be represented by sampling by including the ordering of events, which you also implicitly did for LLMs. The coin is an arbitrary distinction: you are never just modeling a coin, just as an LLM is never just modeling a word. You are also modeling an environment, and that model would capture whatever you used to influence the coin toss.

You are fundamentally misunderstanding probability and randomness, and then using that misunderstanding to arbitrarily imply simplicity in the system you want to diminish, while failing to apply the same reasoning to any other.

If you are indeed an AI researcher, which I highly doubt without you providing actual credentials, then you would know that you are being imprecise and using that imprecision to sneak in unfounded assumptions.


LLMs are just modelling token order. The weights are a compression of the outcome space.

No, causality is not just an ordering.


[flagged]


It's not a matter of making points; it's at least a semester's worth of courses on causal analysis, animal intelligence, the scientific method, and explanation.

Causality isn't ordering. Take two contrary causal mechanisms (e.g., filling a bathtub with a hose and emptying it with a bucket). The level of the bath is arbitrarily orderable with respect to either of these mechanisms.

cf. https://en.wikipedia.org/wiki/Collider_(statistics)

Go on YouTube and find people growing a nervous system in a lab, and you'll notice it's an extremely plastic, constantly physically adapting system. You'll note that the very biochemical "signalling" you're talking about is itself involved in changing the physical structure of the system.

This physical structure does not encode all prior activations of the system, nor even a compression of them.

To see this, consider Plato's cave. Outside the cave pass a variety of objects which cast shadows on the wall. The objects themselves are not compressions of these shadows. Inside the cave, you can make one of these yourself: take clay from the floor and fashion a pot. This pot, like the one outside, is not a compression of its shadow.

All statistical algorithms which average over historical cases are compressions of shadows, and they replay these shadows on command; i.e., they learn the distribution of shadows and sample from that distribution on demand.

Animals, and indeed all of science, are not concerned with shadows. We don't model patterns in the night sky -- that is astrology -- we model gravity: we build pots.

The physical structure of our bodies encodes the physical structure of reality itself. It does so by sensorimotor modulation of organic processes of physical adaptation. If you like: our bodies are like clay, and this clay is fashioned by reality into the right structure.

In any case, we haven't the time or space to convince you of this formally. Suffice it to say that it is a very widespread consensus that modelling conditional probabilities with generative models fails to model causality. You can read Judea Pearl on this if you want to understand more.

Perhaps more simply: a video game model of a pot can generate an infinite number of shadows in an infinite number of conditions. And no statistical algorithm with finite space and finite time requirements will ever model this video game. The video game model does not store a compression of past frames -- since it has a real physical model, it can create new frames from this model.


It hasn't read that riddle because it is a modified version. The model would in fact solve this trivially if it _didn't_ see the original in its training. That's the entire trick.


Sure but the parent was praising the model for recognizing that it was a riddle in the first place:

> Whereas o1, at the very outset smelled out that it is a riddle

That doesn't seem very impressive since it's (an adaptation of) a famous riddle

The fact that it also gets it wrong after reasoning about it for a long time doesn't make it better of course


Recognizing that it is a riddle isn't impressive, true. But the duration of its reasoning is irrelevant, since the riddle works on misdirection. As I keep saying here, give someone uninitiated the riddle about the 7 wives with 7 bags going (or not) to St Ives, and you'll see them reasoning for quite some time before they give you a wrong answer.

If you are tricked about the nature of the problem at the outset, then all reasoning does is drive you further in the wrong direction, making you solve the wrong problem.


It literally is a riddle, just as the original one was, because it tries to use your expectations of the world against you. The entire point of the original, which a lot of people fell for, was to expose expectations of gender roles leading to a supposed contradiction that didn't exist.

You are now asking a modified question to a model that has seen the unmodified one millions of times. The model has an expectation of the answer, and the modified riddle uses that expectation to trick the model into seeing the question as something it isn't.

That's it. You can transform the problem into a slightly different variant and the model will trivially solve it.


Phrased as it is, it deliberately gives away the answer by using the pronoun "he" for the doctor. The original deliberately obfuscates it by avoiding pronouns.

So it doesn't take an understanding of gender roles, just grammar.


My point isn't that the model falls for gender stereotypes, but that it falls for thinking that it needs to solve the unmodified riddle.

Humans fail at the original because they expect doctors to be male and miss crucial information because of that assumption. The model fails at the modification because it assumes that it is the unmodified riddle and misses crucial information because of that assumption.

In both cases, the trick is to subvert assumptions. To provoke the human or LLM into taking a reasoning shortcut that leads them astray.

You can construct arbitrary situations like this one, and the LLM will get it unless you deliberately try to confuse it by basing it on a well known variation with a different answer.

I mean, genuinely, do you believe that LLMs don't understand grammar? Have you ever interacted with one? Why not test that theory outside of adversarial examples that humans fall for as well?


They don't understand basic math or basic logic, so I don't think they understand grammar either.

They do understand/know the most likely words to follow from a given word, which makes them very good at constructing convincing, plausible sentences in a given language. Those sentences may well be gibberish or provably incorrect, though - usually they aren't, because most sentences in the dataset make some sort of sense - but sometimes the facade slips and it becomes apparent that the GAI has no understanding, no theory of mind, and not even a basic model of relations between concepts (mother/father/son).

It is actually remarkable how much like human writing their output is, given how it is produced, but there is no model of the world backing the generated text, which is a fatal flaw - as this example demonstrates.


No, it's necessary to either know that it's a trick question or to have a feeling that it is based on context. The entire point of a question like that is to trick your understanding.

You're tricking the model because it has seen this specific trick question a million times and shortcuts to its memorized solution. Ask it literally any other question, it can be as subtle as you want it to be, and the model will pick up on the intent. As long as you don't try to mislead it.

I mean, I don't even get how anyone thinks this means literally anything. I can trick people who have never heard of the trick with the 7 wives and 7 bags and so on. That doesn't mean they didn't understand, they simply did what literally any human does, make predictions based on similar questions.


> I can trick people who have never heard of the trick with the 7 wives and 7 bags and so on. That doesn't mean they didn't understand

They could fail because they didn’t understand the language, didn’t have a good enough memory to keep track of all the steps, or couldn’t reason through it. We could pose more questions to probe which reason is more plausible.


The trick with the 7 wives and 7 bags and so on is that no long reasoning is required. You just have to notice one part of the question that invalidates the rest and not shortcut to doing arithmetic because it looks like an arithmetic problem. There are dozens of trick questions like this and they don't test understanding, they exploit your tendency to predict intent.
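
(For anyone who hasn't seen it, the standard version runs roughly: "As I was going to St Ives, I met a man with seven wives; each wife had seven sacks, each sack seven cats, each cat seven kits. Kits, cats, sacks, wives: how many were going to St Ives?" A toy sketch of the bait versus the intended answer:)

    # The bait: it looks like an arithmetic exercise.
    naive = sum(7 ** k for k in range(1, 5))  # wives + sacks + cats + kits
    print(naive)                              # 2800

    # The trick: only the narrator is stated to be going to St Ives;
    # the man and his entourage were merely met along the way.
    intended_answer = 1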

But sure, we could ask more questions and that's what we should do. And if we do that with LLMs we can quickly see that when we leave the basin of the memorized answer by rephrasing the problem, the model solves it. And we would also see that we can ask billions of questions to the model, and the model understands us just fine.


Some people solve trick questions easily simply because they are slow thinkers who pay attention to every question, even non-trick questions, and don't fast-path the answer based on its similarity to a past question.

Interestingly, people who make bad fast-path answers often call these people stupid.


It does mean something. It means that the model is still more on the memorization side than being able to independently evaluate a question separate from the body of knowledge it has amassed.


No, that's not a conclusion we can draw, because there is nothing much more to do than memorize the answer to this specific trick question. That's why it's a trick question, it goes against expectations and therefore the generalized intuitions you have about the domain.

We can see that it doesn't memorize much at all by simply asking other questions that do require subtle understanding and generalization.

You could ask the model to walk you through an imaginary environment, describing your actions. Or you could simply talk to it, quickly noticing that for any longer conversation it becomes impossibly unlikely to be found in the training data.


If you read through the thinking in the above example, it wonders whether it is some sort of trick question. Hardly memorization.


No, you're not. Are you genuinely trying to suggest that LLMs, which can:

- Construct arbitrary text that isn't just grammatically but semantically coherent

- Derive intent, subtle intent, from user queries and responses

- Emulate endless different personalities and their reactions to endless stimuli

- Describe in detail the statics and dynamics of the world, including sight, smell, touch and sound

do not have a model of the external world? What do you think a "corpus" means in this context? How is the "corpus" of sensory and evolutionary data that makes you up in any way different?

LLMs are excellent common sense reasoners, and they generalize just fine. Why exactly do you think they get things _subtly_ wrong? Make up API syntax that looks sensible but isn't actually implemented? In order to make these guesses they need to have generalized; they need an understanding of the structure underlying naming, such that they can produce _sensible_ output even if they lack the hard facts.


You are correct. We are flooded with studies on AI now, so I can't find the reference.

But just a few months ago, I saw an example of an AI building an internal representation of the world from video. An internal model of the world. Everyone says this can't be done; it already has been. Maybe you could argue it wasn't an LLM, and then I'd say we're nitpicking over which technology can do it. We already have examples of tying them together, symbols and LLMs.

Might be related. https://www.nature.com/articles/d41586-024-00288-1 https://www.technologyreview.com/2019/04/08/103223/two-rival...


You seem to repeatedly insist that hidden computation is a distinction of any relevance whatsoever.

First of all, your understanding of the architecture itself is mistaken. A transformer can iterate endlessly because each token it produces allows it another forward pass, and each of these tokens is appended to its input for the next inference step. That's the "autoregressive" in autoregressive transformer, and the entire reason it was proposed for arbitrary seq2seq transduction.

This means you get layers * tokens iterations, where the token count can reach two million and is in practice unlimited, since the LLM can summarize and select from its own context. Parallelism is irrelevant, since the transformer is sequential in the output of tokens. A transformer can iterate endlessly; it simply has to output enough tokens.
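
Schematically, the loop looks something like this (a toy sketch; `model` stands in for a full forward pass over all layers, not any real API):

    def generate(model, prompt_tokens, n_steps):
        tokens = list(prompt_tokens)
        for _ in range(n_steps):
            next_token = model(tokens)   # one full forward pass over the current context
            tokens.append(next_token)    # fed back in: the "autoregressive" part
        return tokens

    # Dummy stand-in "model" that just echoes the last token:
    print(generate(lambda ts: ts[-1], [1, 2, 3], n_steps=4))  # [1, 2, 3, 3, 3, 3, 3]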

And no, the throughput isn't limited either, since each token gets translated into a high-dimensional internal representation that in turn is influenced by every other token in the model's input. Models can encode whatever they want, not just by choosing a token, but by choosing arbitrary patterns of tokens encoding arbitrary latent-space interactions.

Secondly, internal thoughts are irrelevant, because something being "internal" is an arbitrary distinction without impact. If I trained an LLM to wrap some part of its output in <internal_thought> tags and then simply didn't show that part, the LLM wouldn't magically become human. This is something many models do even today, in fact.

Similarly, if I were to take a human and modify their brain to only be able to iterate using pen and paper, or by speaking out loud, then I wouldn't magically make them into something non-human. And I would definitely not reduce their capacity for reasoning in any way whatsoever. There are people with aphantasia working in the arts, there are people without an internal monologue working as authors - how "internal" something is can be trivially changed with no influence on either the architecture or the capabilities of that architecture.

Reasoning itself isn't some unified process, nor is it infinite iteration. It requires specific understanding of the domain being reasoned over, especially of which transformation rules are applicable to produce desired states, where the judgement about which states are desirable has to be learned itself. LLMs can reason today; they're just not as good at it as humans are in some domains.


Sure - a transformer can iterate endlessly by generating tokens, but this is no substitute for iterating internally and maintaining internal context and goal-based attention.

One reason why just blathering on endlessly isn't the same as thinking deeply before answering is that it's almost impossible to maintain long-term context/attention. Try it. "Think step by step" or other attempts to prompt the model into generating a longer reply that builds upon itself will only get you so far, because keeping a 1-dimensional context is no substitute for the thousands of connections we have in our brain between neurons, and the richness of context we're therefore able to maintain while thinking.

The reasoning weakness of LLMs isn't limited to "some domains" that they had less training data for - it's a fundamental architecturally-based limitation. This becomes obvious when you see the failure modes on simple "how few trips does the farmer need to cross the river with his chicken & corn" type problems. You don't need to morph the problem to require out-of-distribution knowledge to get it to fail - small changes to the problem statement can make the model state that crossing the river backwards and forwards multiple times without loading or unloading anything is the optimal way to cross the river.

But, hey, no need to believe me, some random internet dude. People like Demis Hassabis (CEO of DeepMind) acknowledge the weakness too.


>You don't need to morph the problem to require out-of-distribution knowledge to get it to fail

Make the slight variation look different from the version it has memorized and it often passes. Sometimes it's as straightforward as just changing the names. Humans have this failure mode too.


> One reason why just blathering on endlessly...

First of all, I would urge you to stop arbitrarily using negative words to make an argument. Saying that LLMs are "blathering" is equivalent to saying you and I are "smacking meat onto plastic to communicate" - it's completely empty of any meaning. This "vibes based arguing" is common in these discussions and a massive waste of time.

Now, I don't really understand what you mean by "almost impossible to maintain long-term context/attention". I write fiction in my spare time, and in my testing LLMs do very well on this, even subtle and complex simulations of environments, including keeping track of multiple "off-screen" dynamics like a pot boiling over.

There is nothing "1-dimensional" about the context, unless you mean that it is directional in time, which any human thought is as well, of course. As I said in my original reply, each token is represented by a multidimensional embedding, and even that is abstracted away by the time inference reaches the later layers. The word "citrus" isn't just a word for the LLM, just as it isn't just a word for you. Its internal representation retrieves all the contextual understanding that is related to it. Properties, associated feelings, usage - every relevant abstract concept is considered. And these concepts interact which every embedding of every other token in the input in a learned way, and with the position they have relative to each other. And then when an output is generated from that dynamic, said output influences the dynamic in a way that is just as multidimensional.

The model can maintain context as rich as it wants, and it can build upon that context in whatever way it wants as well. The problem is that in some domains, it didn't get enough training time to build robust transformation rules, leading it to draw false conclusions.

You should reflect on why you are only able to provide vague, underdefined, and often incorrect arguments here. You're drawing distinctions that don't really exist and trying to hide that by appealing to false intuitions.

> The reasoning weakness... it's a fundamental architecturally-based limitation...

You have provided no evidence or reasoning for that conclusion. The river crossing puzzle is exactly what I had in mind when talking about specific domains. It is a common trick question with little to no variation and LLMs have overfit on that specific form of the problem. Translate it to any other version - say transferring potatoes from one pot to the next, or even a mathematical description of sets being modified - and the models do just fine. This is like tricking a human with the "As I was going to Saint Ives" question, exploiting their expectation of having to do arithmetic because it looks superficially like a math problem, and then concluding that they are fundamentally unable to reason.

> People like Demis Hassabis (CEO of DeepMind) acknowledge the weakness too.

What weakness? That current LLMs aren't as good as humans when reasoning over certain domains? I don't follow him personally but I doubt he would have the confidence to make any claims about fundamental inabilities of the transformer architecture. And even if he did, I could name you a couple of CEOs of AI labs with better models that would disagree, or even Turing award laureates. This is by no means a consensus stance in the expert community.


> And even if he did, I could name you a couple of CEOs of AI labs with better models that would disagree, or even Turing award laureates. This is by no means a consensus stance in the expert community.

I disagree - there is pretty widespread agreement that reasoning is a weakness even among the best models (note Chollet's $1M ARC Prize competition to spur improvements), but the big labs all seem to think that post-training can fix it. To me this is whack-a-mole wishful thinking (reminds me of CYC - just add more rules!). At least one of your "Turing award laureates" thinks Transformers are a complete dead end as far as AGI goes.

We'll see soon enough who's right.


A weakness of the current models in some domains considered useful, yes - but not a fundamental limitation of the architecture. I see no consensus on the latter whatsoever.

The ARC challenge tests spatial reasoning, something we humans are obviously quite good at, given 4 billion years of evolutionary optimization. But as I said, there is no "general reasoning"; it's all domain dependent. A child does better at the spatial problems in ARC given that previously mentioned evolutionary advantage, but just as we don't worship calculators as superior intelligences because they can multiply 10^9-digit numbers in milliseconds, we shouldn't draw fundamental conclusions from humans doing well at a problem they are in many ways built to solve. If the failures of previous predictions - those that considered Chess or Go as unmistakable signals of true general reasoning - have taught us anything, it's that general reasoning simply does not exist.

The bet of current labs is synthetic data in pre-training, or slight changes to natural data that induce more generalization pressure for multi-step transformations on state in various domains. The goal is to change the data so models learn these transformations more readily and develop good heuristics for them - not the non-continuous patching that you suggest.

But yes, the next generation of models will probably reveal much more about where we're headed.


> If the failures of previous predictions - those that considered Chess or Go as unmistakable signals of true general reasoning - have taught us anything, it's that general reasoning simply does not exist.

I don't think Deep Blue or AlphaGo etc. were meant to teach us anything - they were just showcases of technological prowess by the companies involved, demonstrations of (narrow) machine intelligence.

But...

Reasoning (differentiated from simpler, shallow "reactive" intelligence) is basically multi-step chained what-if prediction, and may involve a branching exploration of alternatives ("ok, so that wouldn't work, so what if I did this instead ..."), so it could be framed as a tree search of sorts, not entirely dissimilar to the game-tree search used by Deep Blue or the MCTS used by AlphaGo.

Of course general reasoning is a lot more general than playing a game like Chess or Go, since the type of moves/choices available will vary at each step (these aren't all "game move" steps), as will the "evaluation function" that predicts what will happen if we take that step. But "tree search" isn't a bad way to conceptualize the process, and this is true regardless of the domain(s) of knowledge over which the reasoning is operating.

Which is to say that reasoning is in fact a generalized process, and one whose nature imposes some corresponding requirements (e.g. keeping track of state) on any machine capable of performing it ...
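
In toy form (hypothetical names, nothing tied to a particular domain), that framing might look like:

    # Depth-limited "what-if" search: try a step, predict the resulting state,
    # back off and try another if it leads nowhere.
    def reason(state, candidate_steps, predict, is_goal, depth=3):
        if is_goal(state):
            return []
        if depth == 0:
            return None
        for step in candidate_steps(state):       # available moves vary with the state
            plan = reason(predict(state, step), candidate_steps, predict, is_goal, depth - 1)
            if plan is not None:                   # "ok, that wouldn't work, what if..."
                return [step] + plan
        return None

    # e.g. reach 4 from 0 using +1/+2 steps:
    print(reason(0, lambda s: [1, 2], lambda s, a: s + a, lambda s: s == 4, depth=4))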

