I think Yann is right if all you do is output a token, which is dependent on the...

jxmorris12 · 2025-02-21T19:36:14 1740166574

No this isn't right. The probabilistic formulation for autoregressive language models looks like this

     p(x_n | x_1 ... x_{n-1})

which means that each token depends on all the previous tokens. Attention is one way to parameterize this. Yann's not talking about Markov chains, he's talking about all autoregressive models.

1024core · 2025-02-21T20:02:06 1740168126

Current token depends on _all_ previous tokens, but only indirectly for the ones before the previous one, no?

dimatura · 2025-02-21T22:08:20 1740175700

No, using a large context window (which includes previous tokens) is a critical ingredient for the success of modern LLMs. Which is why you will often see the window size mentioned in discussions of newly released LLMs.

sudosysgen · 2025-02-21T20:26:43 1740169603

In the auto regressive formulation the previous token is no different from any past token, so no. Historically some token took the shortcut of only directly looking at the past token or some other kind of recursive formulation for intermediate states in generating the past token, but that's not the case in for the theoretical formulation of an autoregressive model that was used, and plenty of past autoregressive models didn't do that, for example with nonlinear autoregressive models.

lagrange77 · 2025-02-21T19:37:20 1740166640

> But with Attention mechanism

I would think LeCun was aware of that. Also prior sequence to sequence models like RNNs have already incorporated information about the further past.