σ-GPTs: A new approach to autoregressive models (arxiv.org)
293 points by mehulashah 3 months ago | 93 comments



This looks great.

The authors randomly permute (i.e., shuffle) input tokens in training and add two positional encodings to each token: one with the token's position and another with the position of the token to be predicted. Otherwise, the model is a standard autoregressive GPT. The consequences of this seemingly "simple" modification are significant:

* The authors can prompt the trained model with part of a sequence and then decode the missing tokens, all at once, in parallel, regardless of order -- i.e., the model can in-fill in parallel.

* The authors can compute conditional probability densities for every missing token in a sequence, again in parallel, i.e., densities for all missing tokens at once.

* The authors propose a rejection-sampling method for generating in-fill tokens, again in parallel. Their method seems to work well in practice.
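To make the summary above concrete, here is a pure-Python toy of the training-side construction (hypothetical helper names, not the paper's code; the real model adds learned positional embeddings and runs a standard causal Transformer over the shuffled sequence):

```python
import random

def make_training_pairs(tokens, seed=0):
    """Toy sketch: shuffle the sequence, then tag each token with two
    positions -- its own, and the position of the token that comes next
    in the shuffled order (the one the model must predict)."""
    order = list(range(len(tokens)))
    random.Random(seed).shuffle(order)
    inputs, targets = [], []
    for cur, nxt in zip(order, order[1:]):
        inputs.append((tokens[cur], cur, nxt))  # (token, src_pos, tgt_pos)
        targets.append(tokens[nxt])
    return inputs, targets

tokens = ["I", "ran", "home", "happily"]
inputs, targets = make_training_pairs(tokens)
```

Because every permutation is seen during training, any position can play the role of "next token", which is what enables the order-free decoding described above.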

I've added this to my reading list. Thank you for sharing it on HN.


This problem formulation has been around for a while, it’s kind of the holy grail of modeling. What is new compared to PixelCNN and related is this position embedding idea.


Yes, the modification the authors propose is indeed "simple."

It's "obvious" only in hindsight.


I don't understand how that parallel prediction can work...

Let's say I give it as input the sentence:

I . . . . . . . . happily.

The second word to be predicted depends on the first word.


Give the model the tokens "happily" and "I", and add to each input token its respective position embedding and the position embedding for the token to be predicted. You can do this in parallel for all tokens to be predicted. The model has been trained so it can predict tokens in any position.
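In toy form (a hypothetical helper, not the paper's code), the parallel query construction looks something like this: every query shares the same known context, and only the to-be-predicted position differs, so all of them can be decoded in one batch.

```python
def make_infill_queries(known, length):
    """Toy sketch: given known tokens (position -> token) and the target
    sequence length, build one query per missing position. All queries
    share the known context, so they can run as a single parallel batch."""
    context = sorted(known.items())  # [(pos, token), ...]
    missing = [p for p in range(length) if p not in known]
    return [{"context": context, "predict_pos": p} for p in missing]

# "I . . . . . . . . happily." with positions 1..8 unknown:
queries = make_infill_queries({0: "I", 9: "happily"}, length=10)
```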


Yes, but is there any guarantee that the complete sentence makes sense?


That is indeed an issue. Their sampling method rejects impossible combinations.


That guarantee didn't exist with regular GPT LLMs, did it? It just came about as an emergent property of throwing more and more compute, training data, and training time at the problem.


I think it’s effectively built in to the design. The model outputs a probability distribution for the first unknown token [0]. Then some code outside the model chooses a token and runs the model again with that token provided to the model. So the second output token’s probability distribution is automatically conditioned on the first output token, etc.

Sometimes people will attempt to parallelize this by using a faster model to guess a few tokens and then evaluating them as a batch with the main model to determine whether the choices were good.

[0] Usually it outputs “logits”, which become a probability distribution when combined with a “temperature” parameter.
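The loop described above, in minimal form (the "model" here is a stand-in table of logits, not a real network):

```python
import math
import random

def sample_next(model, prefix, temperature=1.0, rng=random.Random(0)):
    """One step of the loop: get logits for the prefix, apply a
    temperature-scaled softmax, and sample one token from the result."""
    logits = model(prefix)
    toks = list(logits)
    scaled = [logits[t] / temperature for t in toks]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    r, acc = rng.random(), 0.0
    for t, e in zip(toks, exps):
        acc += e / z
        if r <= acc:
            return t
    return toks[-1]

# Stand-in "model": fixed logits, ignoring the prefix content.
toy_model = lambda prefix: {"ran": 2.0, "walked": 1.0, "<eos>": 0.0}

prefix = ["I"]
for _ in range(3):
    # The sampled token is appended and fed back in, so each new
    # distribution is conditioned on everything chosen so far.
    prefix.append(sample_next(toy_model, prefix))
```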


> I think it’s effectively built in to the design.

It isn't. There is no guarantee that successive tokens will be comprehensible.

> Usually it outputs “logits”, which become a probability distribution when combined with a “temperature” parameter.

The logits are the probability distribution (well technically, you would apply softmax). Temperature is a parameter for how you sample those logits in a non-greedy fashion.


> Temperature is a parameter for how you sample those logits in a non-greedy fashion.

I think temperature is better understood as a pre-softmax pass over logits. You'd divide logits by the temp, and then their softmax becomes more/less peaky.

    probs = (logits / temp).softmax(dim=-1)
Sampling is a whole different thing.
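Concretely, a plain-Python check of the peakiness claim: dividing by a temperature below 1 sharpens the softmax, above 1 flattens it.

```python
import math

def softmax_t(logits, temp):
    """Temperature-scaled softmax: divide logits by temp, then normalise."""
    scaled = [l / temp for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
cold = softmax_t(logits, 0.5)   # temp < 1: distribution gets peakier
hot = softmax_t(logits, 2.0)    # temp > 1: distribution flattens out
```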


Sure, my comment about softmax was simply about the probability distribution. But temperature is still part of sampling. If you’re greedy decoding, temperature doesn’t matter.


No, but it makes more conceptual sense given the model can consider what was said before it


Isn't this bag of words all over again? Except with positional hints?


Wow, if that works that's wild (and also has that "damn, now you say it it's obvious" flavour that so many really cool discoveries share...)


Wait wasn't BERT all about non-causal masking aka predicting words in the middle?!


I know this is for tokens/text, but can the same concept be applied to images using something like a diffusion model? And then be able to upscale images arbitrarily by infilling?


Yes. See the related work section in the paper: there is a long history of models, recently like MAE and MaskGit, which predict pixels in basically arbitrary orders, and that is useful because it lets you train on subsets of each image, upscale/infill during generation, and so on. (If you know what MAE is, that might be the fastest way to summarize OP: "it's a GPT trained like a MAE".)


People also often forget "orderless autoregression", which was introduced a while back and has been reinvented many times since. See Sec 4 (pg 8) of "Neural Autoregressive Distribution Estimation" [https://arxiv.org/abs/1605.02226]. The main difference from current work is that this 2016 paper used MLPs and convnets on fixed-length observations/sequences, so sequence position is matched one-to-one with position in the network's output, rather than conditioning on a position embedding. Of course, Transformers make this type of orderless autoregression more practical for a variety of reasons -- TFs are great!

Key quote from Sec 4: "In this section we describe an order-agnostic training procedure, DeepNADE (Uria et al., 2014), which will address both of the issues above. This procedure trains a single deep neural network that can assign a conditional distribution to any variable given any subset of the others. This network can then provide the conditionals in Equation 1 for any ordering of the input observations. Therefore, the network defines a factorial number of different models with shared parameters, one for each of the D! orderings of the inputs. At test time, given an inference task, the most convenient ordering of variables can be used."


MAE = masked autoencoder, not mean absolute error.


[flagged]


I have no idea what you are talking about, given that you are linking to a comment a decade old which has nothing to do with Conway's death just now; nor have I commented on Conway in any way today or been on HN today until just now, where I saw your comment. (Nor do I intend to comment now that you have drawn my attention to her passing, as I generally feel that such obituary notices should try to celebrate the good about a person, as posterity will have quite enough time to consider the bad as well.)

If you disagree and wish it 'kept out', my suggestion would be to... not link it? Especially in other pages which have nothing to do with Conway?


I'm sorry, I confused you with a twitter account that used the same name as yours, as I was upset by Lynn's passing. I take back what I said about you being an asshole.


You are, unfortunately, not the first to be misled by my impersonator.

I've done what I can to get that account blocked: I have reported him for impersonation to Twitter years ago, after his previous accounts, and I have asked my followers to report the account to get it banned, and I have asked again just now. If any HNers would like to take a moment to report the false troll https://x.com/gwernbranwen1 account for impersonation of my real Twitter account, https://x.com/gwern , I would be grateful - there must be some number of reports at which he'll finally be banned...

But Twitter takes forever to ban impersonation accounts, so, 'gwernbranwen1' keeps on tweeting that crap.

(Why, you might wonder? I don't know. I have no idea who or why whoever is behind that account does it. They have never explained it anywhere I've seen, nor do they seem to particularly hate me or are trying to ruin my reputation; their main fixation seems to be misogynist hatred of a few women like Julia Wise. I've met Julia Wise all of once, who is a nice woman who I liked, but I otherwise have no particular connection to her, and I have no idea why anyone would want to harass her online like that or why they would do so under my name. It's just the Internet, I guess: https://gwern.net/littlewood )


Whoa, look at that - the impersonator is suspended now.


As somebody with a twitter feed full of recent pathetic misogynistic shit like this:

https://x.com/gwernbranwen1/status/1742214526878265550

...you are in no position to make claims about how posterity will consider the bad things about other people, asshole.


That’s a fake account - gwern’s real account is at https://x.com/gwern. It’s usually protected, but he occasionally processes his follow requests.


If there are multiple missing tokens, what's the positional encoding for the "token to be predicted"?


See this thread, also on this page:

https://news.ycombinator.com/item?id=40609689


The only difference I see from XLNet is how they use it during inference.


Hey! I'm Arnaud, first author of the paper. XLNet also shuffles the data during training, but it uses a masking mechanism instead of the causal + double positional encoding. The application also differs: XLNet is not, AFAIK, focused on generation (even if it can be used for that), and the burst-sampling idea is new.


Are there any obvious practical application of this algorithm for existing large (10B+) text / image models?

Does the rejection sampling lead to a statistically correct sample from the joint probability distribution or is that just a (possibly rough) approximation?


For the application: being able to prompt anywhere in the sequence can be of interest. From what we've seen in the experiments, the rejection sampling leads to generations similar to the autoregressive ones; we did not see any mode collapse or anything of that kind.


Thanks for the clarification!


Off topic, but what do you use for your reading list?


hijacking for a bit of shameless self promotion: if you're an obsidian user, I recently built a plugin that simplifies web pages, parses out metadata, and saves them to obsidian as markdown files: https://github.com/inhumantsar/slurp

arXiv comes through a bit ugly atm but it's high on my to-do list. I'm leveraging the library that Firefox uses for reader mode, so most sites come through quite well. A lot of my work right now is expanding their metadata support and fixing parser issues.


I use Emergent Mind[1] to keep track of new research published on ArXiv. You can bookmark articles once logged in. It's very useful for keeping track of articles, reading quick summaries, and following conversations on various social media.

[1]: https://www.emergentmind.com/papers/2404.09562


Google Chrome has a built-in reading list (go open the 3-dotted menu at the top-right corner, then click on "Bookmarks and lists" -> "Reading list")


old-fashioned text files


Zotero is great for organizing and annotating papers, keeping notes, and building bibliographies.

You can create libraries and sub libraries according to topic, and also create libraries for projects or reading lists. You can file items into multiple libraries, and you can also create shared libraries, allowing your team to share annotated papers.

Finally it can archive offline copies of web pages, which makes it useful for blog articles and other online resources that might vanish.

There's a learning curve, but it's worth it if you find yourself juggling dozens or hundreds of technical papers! Enjoy!


What's old [1] is new again... without citing prior work. It's not like it's an unknown work. It was published in ICML and has ~250 citations.

[1]: https://arxiv.org/abs/1902.03249


Wow, really cool concept! I wonder if this starts to become similar dynamics to what we see in image generation models, where structure/detail emerges in one region of the image and then the surrounding areas start to resolve themselves into place. That kind of behavior seems particularly useful for longer reasoning/logic/planning, where the big ideas might become apparent first, and then the interstitial details and text just naturally fill in…


The process you describe is referred to as diffusion


Yep yep I know, but I was trying to suggest something diffusion-like occurring with a language model through a totally separate mechanism that does not rely on the denoising process (at least not literally).


It kinda does that but one token at a time.


I'm fairly certain diffusion refers to the overall architecture, not the emergent self-organization process.


It refers to the process of taking noise and "diffusing" it until it is visually appealing.


There is a video on twitter showing it generating text (looks a bit like image diffusion)

https://x.com/ArnaudPannatier/status/1799055129829839166


Weird that they chose an example that ended up somewhat nonsensical.


Part of the issue is they are training a pretty tiny model, it's not like GPT-2 ~100M is especially coherent either.


I kept thinking about this paper today, and I really like the capabilities.

A number of things that are relatively hard for sequential LLMs are easy here. Want json? Fix curly brace tokens to the beginning and end.

Want a specific token-length explanation of an answer? Write a short answer, post-pend it, and infill.

Want a higher-density answer to something? Add a density assessment section to your generated text, a space for the LLM to score info-density, and generate looking for a high density score.

I would guess there's a lot here to be experimented with. It would be nice to get an 8b parameter model with reasonable number of tokens (x3 based on the paper, sadly) through it.


> Fix curly brace tokens to the beginning

Regular LLMs can already do this, by prefilling the start of the assistant's response.

But there is actually something even better: you can constrain the LLM's output to a specific grammar (like JSON), so it'll only be able to answer with syntactically valid JSON.
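A minimal sketch of the masking idea behind grammar-constrained decoding (toy vocabulary and a hand-picked allowed set; real implementations track a grammar state machine at every decoding step):

```python
import math

def constrain(logits, allowed):
    """Grammar-constrained step: drop every token the grammar disallows,
    then softmax over what's left, so disallowed tokens get zero
    probability and can never be sampled."""
    kept = {t: l for t, l in logits.items() if t in allowed}
    m = max(kept.values())
    exps = {t: math.exp(l - m) for t, l in kept.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Toy step: right after '{', a JSON grammar permits only a key or '}'.
logits = {"{": 1.0, "}": 0.5, '"key"': 2.0, "hello": 3.0}
probs = constrain(logits, allowed={'"key"', "}"})
```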


Yes. And you can have a grammar parser only select from valid tokens in a randomized distribution. But, this feels much more sophisticated to me, especially if you can mix specific token-based grammar requirements with other instructions during the token selection phase.


I wonder if this would help especially for computer code generation, where what is output at a given step may materially depend on what would be written at a later step.


And, though maybe prohibitively slow, perhaps integrate some kind of linting or syntax checking as part of the rejection sampling, i.e., burst-sample N candidate snippets in parallel and reject those that are syntactically invalid.
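For Python output specifically, a cheap version of that syntactic filter already exists in the standard library (toy candidate strings stand in for model samples here):

```python
import ast

def syntactically_valid(snippets):
    """Keep only candidates that parse as Python; the rest are rejected,
    mirroring the burst-sample-then-reject idea."""
    kept = []
    for code in snippets:
        try:
            ast.parse(code)
            kept.append(code)
        except SyntaxError:
            pass
    return kept

candidates = ["def f(x): return x + 1", "def g(x: return x", "y = [1, 2, 3]"]
valid = syntactically_valid(candidates)
```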


It would be nice if it could diffuse right on the AST. That would ensure each generated item passes a syntax check, without the waste of rejection sampling


This is an interesting study. A similar permutation approach appears already in the Taylorformer paper (https://arxiv.org/pdf/2305.19141v1). The authors use a Transformer decoder for continuous processes, like time series. During training, each sequence is shuffled randomly. Each sequence element has a positional encoding. Then, they use log-likelihood on the shuffled sequence. There, the permutation helps with predictions for interpolation, extrapolation and irregularly sampled data. Also, they show it helps with 'consistency', i.e., roughly the MSE is the same regardless of the generated order.

What might this paper add to our understanding or application of these ideas?

The idea of permuting the sequence order also appears in the Transformer Neural Process paper: https://arxiv.org/pdf/2207.04179.


Is this applying the learnings from vision transformers to language transformers?

If I understand correctly, vision models split an image into tiles and append a positional encoding to each so the model can understand the relative position of each tile.

I admittedly only read the abstract - a lot of this stuff goes over my head - but it seems like this paper proposes a similar idea, but for 1D instead of 2D?


Positional encoding is standard for transformers of all stripes. They introduce a seemingly novel, redundant positional encoding scheme. It's more difficult to train, but seems to enable producing multiple tokens at once (i.e., you could get an answer that is N tokens long in N/x steps instead of N steps).


Is there code somewhere? I don't totally understand the double position and shuffling. Interesting they use concat instead of plus for the positionals


Yann LeCun would say [0] that it's autoregression itself that's the problem, and ML of this type will never bring us anywhere near AGI.

At the very least you can't solve the hallucination problem while still in the autoregression paradigm.

[0] https://twitter.com/ylecun/status/1640122342570336267


LeCun may or may not be right, but I'm not sure this is relevant to the discussion here.

The OP's authors make no claims about how their work might help get us closer to AGI.

They simply enable autoregressive LLMs to do new things that were not possible before.


Does everything have to take us towards AGI? If someone makes a LLM that’s faster (cheaper) to run then that has value.

I don’t think we want AGI for most tasks unless the intent is to produce suffering in sentient beings.


> I don’t think we want AGI for most tasks unless the intent is to produce suffering in sentient beings.

Each letter of "AGI" means different things to different people, and some use the combination to mean something not present in any of the initials.

The definition OpenAI uses is for economic impact, so for them, they do want what they call AGI for most tasks.

I have the opposite problem with the definition, as for me, InstructGPT met my long-standing definition of "artificial intelligence" while suddenly demonstrating generality in that it could perform arbitrary tasks rather than just next-token prediction… but nobody else seems to like that, and I'm a linguistic descriptivist, so I have to accept words aren't being used the way I expected and adapt rather than huff.


I call GPT an AGI

1. To highlight that the system passes the turing test and has general intelligence abilities beyond the median human

2. To piss off people who want AGI to be a God or universal replacement for any human worker or intellectual

The problem with AGI as a universal worker replacement - the way that it can lead to sentient suffering - is the presumption that these universal worker replacements should be owned by automated corporations and hyper wealthy individuals, rather than by the currently suffering sentient individuals who actually need the AI assistance.

If we cannot make Universal Basic AGI that feeds and clothes everyone by default as part of the shared human legacy - UBAGI - then AGI will cause harm and suffering.


> 1. To highlight that the system passes the turing test and has general intelligence abilities beyond the median human

I think that heavily depends on what you mean by "intelligence", which in turn depends on how you want to make use of it. I would agree that it's close enough to the Turing test as to make the formal test irrelevant.

AI training currently requires far more examples than any organic life. It can partially make up for this by transistors operating faster than synapses by the same ratio to which marathon runners are faster than continental drift — but only partially. In areas where there is a lot of data, the AI does well; in areas where there isn't, it doesn't.

For this reason, I would characterise them as what you might expect from a shrew that was made immortal and forced to spend 50,000 years reading the internet — it's still a shrew, just with a lot of experience. Book smarts, but not high IQ.

With LLMs, the breadth of knowledge makes it difficult to discern the degree to which they have constructed a generalised world model vs. have learned a lot of catch-phrases which are pretty close to the right answer. Asking them to play chess can result in them attempting illegal moves, for example, but even then they clearly had to build a model of a chess board good enough to support the error instead of making an infinitely tall chess board in ASCII art or switching to the style of a chess journalist explaining some famous move.

For a non-LLM example of where the data-threshold is, remember that Tesla still doesn't have a level 4 self-driving system despite millions of vehicles and most of those operating for over a year. If they were as data-efficient as us, they'd have passed the best human drivers long ago. As is, while they have faster reactions than we do and while their learning experiences can be rolled out fleet-wide overnight, they're still simply not operating at our level and do make weird mistakes.


So your points are:

* You don't really pick a definition of intelligence

* LLMs could be regurgitating training data

* They take more data to train than humans

* They can't do some tasks, like driving

However, in my experience, LLMs are more empathetic than humans, more able to help me reason about my feelings and communication problems than humans, less likely to commit microaggressions or be racist or ableist than humans, and better at math and science than most humans. These are just my personal feelings as an autistic person, which I can back up only loosely with benchmark data, but which I expect the world will come to realize over the next few years.

So in terms of being able to constructively interact with me in an intelligent and helpful way, LLMs are often more useful than the humans I have access to. I say they are smarter than those people as well, because AI will give me solutions that are useful, and which other humans could not give me.

The fact that it cannot drive doesn't bother me, since I don't consider driving a general skill but a specialized one. It can still have general intelligence without being able to do some specific things. Going back to my original post, I specifically reject AGI definitions where, to be generally intelligent, the AI has to outperform humans in every possible skill. I would consider that a superintelligent AGI.

As for the information problem and data issue, AIs so far have been black boxes isolated from reality and we haven't solved the online continuous learning problem. I believe that as we turn AIs into agents which are constantly interacting with reality via high bandwidth token streams, we will have a lot more data to train with. I also believe that we'll start being able to train continuously on that data. Then even assuming that training is no more efficient than it is today, I think the extra data could make the difference.

I'm also not convinced that AI won't eventually be able to learn from as little data as humans do. I don't think it has to be the case, and I also don't discount the possibility of an AI winter that leaves AI less efficient than humans for a long, long time, maybe even forever. However, I also feel we may come to understand why humans learn so fast, and might be able to transfer some insights into artificial systems. I also know that people will be trying very hard to solve the AI energy and data usage problems, since they're major threats to large-scale AI adoption. So we'll be trying really hard to do it, and we'll have a blueprint for how to do it: our brains. That means there's a chance we'll crack that problem.

Finally the regurgitation issue is irrelevant to intelligence - just like it would be irrelevant if the brain is secretly just regurgitating stuff it learned. Because the brain can also do novel things.

Furthermore, we know that LLMs can learn and usefully reason about context information outside of their training distributions. This is called in-context learning.

For example if I come from a culture that the AI was not really well trained on, I can give it four or five examples of values that are important to me in that culture, and then it will be able to extrapolate how to apply or respect those values in situations that I present.

And again here's the kicker- it'll do this more faithfully than the average person. Remember that if you tell a person five values from a culture outside of their own, and ask them to uphold those values... Perhaps half will just get angry and give you some kind of racist slur, and then 80% of the remainder will lack the empathy and mental flexibility to do a good job.

Finally I need to point out that I have studied AI for over two decades out of books starting from the '80s, then the '90s, then the 00s and 10s. And the change in the literature and capabilities has been unreal.

Perhaps you are forgetting how feeble AI was before, or simply not putting it to use. There are many many tasks that no AI from over 3 years ago could have touched, and now suddenly you can do it for just a $20 a month subscription.

The change in capabilities is so drastic that I wonder if you're simply discounting that change because you're not using AI, comparing it to old AI, or seeing it enable things that no AI before could have possibly done, no matter how hard you tried.

So to conclude: the change has been too great, has enabled too many new things, has taken such a big departure from old AI, and consistently outperforms humans on so many tasks I find important, that I feel it would be senseless to say there isn't some intelligence there, some useful information-processing capability that I can depend on and rely on more than a human in many tasks and settings where humans are consistently bad. In fact, it would be harmful for me not to recognize that these things have changed, because I would not be benefiting from them.


LeCun is very simply wrong in his argument here. His proof requires that all decoded tokens are conditionally independent, or at least that the chance of a wrong next token is independent. This is not the case.

Intuitively, some tokens are harder than others. There may be "crux" tokens in an output, after which the remaining tokens are substantially easier. It's also possible to recover from an incorrect token auto-regressively, by outputting tokens like "actually no..."


I think this method might not be amenable to the exponential divergence argument actually.

Depending on token sampling methods, this one could look at a proposed generation as a whole and revise it. I’m not sure the current token sampling method they propose does this right now, but I think it’s possible with the information they get out of the probabilities.


Yes, to me this seems to address LeCun's objection, or at least point the way to something that does. It seems possible to modify this into something that can identify and correct its own mistakes during the sampling process.


Well, I think I understand LeCun has a broader critique that any sort of generated-in-a-vacuum text which doesn't interact with meatspace is fundamentally going to be prone toward divergence. Which, I might agree with, but is also, just, like, his opinion, man. Or put less colloquially, that's a philosophical stance sitting next to the math argument for divergence.

I do think this setup can answer (much of) the math argument.


LeCun is a very smart guy but his track record predicting limitations of autoregressive LLMs is terrible.


Can I please convert you into someone who summarily barks at people for committing the LeCun fallacy rather than committing it yourself?

And can you stop talking about AGI when it's not relevant to a conversation? Let's call that the AGI fallacy - the argument that a given development is worthless - despite actual technical improvements - because it's not AGI or supposedly can't lead to AGI.

It's a problem.

Every single paper on transformers has some low-information comment to the effect of, "yeah, but this won't give us AGI because of the curse of LeCun". The people making these comments never care about the actual improvement, and are never looking for improvements themselves. It becomes tiring to people, like yours truly :3, who do care about the work.

Let's look at the structure of the fallacy. You're sidestepping the "without a major redesign" in his quote. That turns his statement from a claim of impossibility into a much weaker one: that autoregressive models currently have a weakness. A weakness which could possibly be fixed by redesign, which LeCun admits.

In fact this paper is a major redesign. It solves a parallelism problem, rather than the hallucination problem. But it still proves that major redesigns do sometimes solve major problems in the model.

There could easily arise an autoregressive model that allows progressive online updating from an external world model - that's all it takes to break LeCun's curse. There's no reason to think the curse can't be broken by redesign.


This thing will still hallucinate, no matter what new bells and whistles have been attached to it, meaning it will never be used for anything important and critical in the real world.


Here's a system that uses an llm to generate equivalence proofs for refactoring operations.

https://news.ycombinator.com/item?id=40634775

In this system, the LLM can hallucinate to its heart's content: the hallucinations are then fed into a proof engine, and if they form a valid proof, then it wasn't a hallucination and the computation succeeds. If it fails, it just tries again. So hallucinations cannot actually leave the system, and all we get are valid refactorings with working proofs of validity.

Binding the LLM to a formal logic and proof engine is one way to stop them hallucinating and make them useful for the real world.
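The containment pattern is easy to sketch in miniature (both the "LLM" and the "proof engine" below are toy stand-ins, not the linked system):

```python
import random

def generate_verified(propose, verify, max_tries=100):
    """Generate-and-verify: keep sampling proposals until one passes the
    verifier. Proposals that fail ("hallucinations") never leave the loop."""
    for _ in range(max_tries):
        candidate = propose()
        if verify(candidate):
            return candidate
    raise RuntimeError("no verified candidate within the budget")

rng = random.Random(0)

def propose():
    # Stand-in "LLM": claims a sum, but is sometimes wrong on purpose.
    a, b = rng.randint(0, 9), rng.randint(0, 9)
    claimed = rng.choice([a + b, a * b])
    return (a, b, claimed)

def verify(triple):
    # Stand-in "proof engine": checks the claim exactly.
    a, b, claimed = triple
    return a + b == claimed

result = generate_verified(propose, verify)
```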

But you would have to actually care about Proof and Truth to concede any point here. If you're only protecting the worldview where AI can never do things that humans can do, then you're going to have to retreat into some form of denial. But if you are interested in actual ways forward to useful AI, then results like this should give you some hope!

Good luck and good day either way!


> Binding the LLM to a formal logic and proof engine is one way to stop them hallucinating and make them useful for the real world.

Checking the output does not mean the model does not hallucinate and thus does not help for all other cases in which there is no "formal logic and proof engine".


What if I consider the model to be the LLM plus whatever extra components allow it to not hallucinate? In that case the model doesn't hallucinate, because the model is the LLM plus the bolt-ons.

Remember, LLM truthers claim that no bolt-ons can ever fully mitigate an LLM's hallucinations. And yet in this case they do. But saying that it doesn't matter because other LLMs will still hallucinate is moving the goalposts, or at least discounting the utility of this incremental progress. I think it's unfair to do this, because there are many, many domains where things can indeed be reduced to a formal logic amenable to a proof engine.

If they don't care about the actual output of a hybrid system that doesn't hallucinate, because it's math and not speech, then do they care about solving the issue at all, or providing human utility? I get the feeling that they only want to be right, not to benefit anyone.

This shows that in cases where we can build good enough verifiers, hallucinations in a component of the system do not have to poison the entire system.

Our own brains work this way: we have sections of the brain that hallucinate, and sections of the brain that verify. When the sections that verify are asleep, we end up hallucinating dreams. When the sections that verify are sick, we end up hallucinating while awake.

I agree with you that the current system does not solve the problem for natural language. However, it gives an example of a non-hallucinating hybrid LLM system.

So the problem is reduced from having to make llms not hallucinate at all, to designing some other system, potentially not an llm at all, that can reduce hallucinations to an acceptable level.
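The hybrid loop described above can be sketched in a few lines. This is a toy of my own, not anything from the paper: the "generator" stands in for an llm that sometimes bullshits, and an exact verifier rejects wrong outputs so only checked answers reach the user.

```python
import random

def flaky_generator(a, b, rng):
    """Stand-in for an LLM: returns a*b, but 'hallucinates' ~30% of the time."""
    if rng.random() < 0.3:
        return a * b + rng.randint(1, 5)  # plausible-looking but wrong
    return a * b

def verify(a, b, answer):
    """Exact verifier: recompute and compare."""
    return answer == a * b

def answer_with_verifier(a, b, rng, max_tries=20):
    """Generate-and-verify loop: only verified outputs escape."""
    for _ in range(max_tries):
        candidate = flaky_generator(a, b, rng)
        if verify(a, b, candidate):
            return candidate
    raise RuntimeError("no verified answer found")

rng = random.Random(0)
print(answer_with_verifier(12, 7, rng))  # 84, never a hallucinated value
```

The generator still hallucinates internally; the system as a whole does not, which is the whole point being argued here.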


You have no proof that every modification of the architecture will continue to have hallucinations. How could you prove that? Even LeCun admits that the right modification could solve the issue.

You're trying to make this point in a circular way - saying it's impossible just because you say it's impossible - for some reason other than trying to get to the bottom of the truth. You want to believe that there's some kind of guarantee that no offspring of the autoregressive architecture can ever get rid of hallucinations.

I'm saying there's simply no such guarantee.


Plus, humans bullshit all the time, even well paid and highly trained humans like Doctors and Lawyers. They will bullshit while charging you 400 an hour. Then they'll gaslight you if you try to correct their bullshit.

AI will bullshit sometimes, but you can generally call it on the bullshit and correct it.

For the tasks that it helps me with, I could work with a human. But the human I could afford would be a junior programmer. Not only do they bullshit more than a well-prompted AI, but I also have to pay them 30 an hour, and they can't properly write specs or analyze requirements. GPT-4 can analyze requirements. Much better than a junior, and in many ways better than me. For pennies.

I do use it in the real world, to maintain and develop software for the industrial design company I own. It would be foolish if I didn't. I've been able to modernize all our legacy code and build features that used to stump me.

Maybe the fact is that I'm an incompetent programmer, and that's why I find it so helpful.

If that's the case so be it! It's still a significant help that is accessible to me. That matters!


Encoding the sequence like that seems like a really clever workaround for some of the data dependency limitations of GPT.


BERT had random masking of the sequence. But time is sequential.


This is not a phonetically friendly acronym.


> The main idea is to train the model to generate sequences in a random order, which allows conditional density estimation, infilling and generating sequences in bursts using a novel rejection sampling method.

> In exploring that idea, we also compared to a discrete diffusion baseline, which also allows generating sequences in bursts. We were surprised to see that diffusion models were able to solve the path-finding task and we made a short Twitter thread

The said thread:

https://nitter.poast.org/ArnaudPannatier/status/176286434739...

And a showcase here:

https://www.idiap.ch/~apannatier/sigma-gpt/

(excerpt taken from here: https://www.idiap.ch/~apannatier/)
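For intuition, here's a minimal sketch (mine, not the authors' code) of the double positional encoding: shuffle the sequence, then feed each input token its own position embedding plus the position embedding of the token it must predict next in the shuffled order. The embedding size and lookup tables are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
seq = ["I", "ate", "lunch", "happily"]
order = rng.permutation(len(seq))          # random generation order

d = 8                                      # toy embedding dimension
tok_emb = {w: rng.normal(size=d) for w in set(seq)}
pos_emb = rng.normal(size=(len(seq), d))   # one table, indexed by position

inputs, targets = [], []
for step in range(len(order) - 1):
    cur, nxt = order[step], order[step + 1]
    # token embedding + "where am I" + "where is the token I predict"
    x = tok_emb[seq[cur]] + pos_emb[cur] + pos_emb[nxt]
    inputs.append(x)
    targets.append(seq[nxt])

print(targets)  # the same sequence, revealed in shuffled order
```

Since every training example carries the target's position explicitly, the model learns to predict any position from any subset, which is what makes the parallel infilling in the showcase possible.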


> We were surprised to see that diffusion models were able to solve path-finding task

I wonder if this type of method might allow for faster solutions to the traveling salesman problem


I just wonder if such models based on their method would make hallucination even worse.


Hey, I'm Arnaud, first author of the paper. The answer is a bit mixed. We actually started looking into this because of a repetition problem that appeared in a low-data regime for a sequence-generation task: the left-to-right GPT would get stuck repeating the same token once it had sampled it twice in a row during generation. To mitigate that, we tried generating the sequence in a random order, and it seemed to help: we see less of this repetition issue. We initially thought that when we don't have enough data, shuffling would act like data augmentation and might actually help the model reach better performance. But this is not what we found in the experiments: apparently, since learning in any order is a harder task, the model memorises the data more.
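A toy illustration (not the paper's algorithm) of burst generation with rejection, as discussed upthread: propose all missing tokens at once, then reject draws whose combinations are impossible under a simple bigram table. The vocabulary and allowed transitions are invented for the demo.

```python
import random

bigram_ok = {            # allowed transitions in our toy language
    "I": {"ran", "slept"},
    "ran": {"home", "fast"},
    "slept": {"well"},
    "home": {"happily"},
    "fast": {"happily"},
    "well": {"happily"},
}
vocab = ["I", "ran", "slept", "home", "fast", "well", "happily"]

def propose(template, rng):
    """Fill every blank independently - 'in parallel' in spirit."""
    return [rng.choice(vocab) if t is None else t for t in template]

def consistent(seq):
    return all(b in bigram_ok.get(a, set()) for a, b in zip(seq, seq[1:]))

def burst_sample(template, rng, max_rounds=10000):
    for _ in range(max_rounds):
        draw = propose(template, rng)
        if consistent(draw):   # reject impossible combinations
            return draw
    raise RuntimeError("no consistent draw found")

rng = random.Random(0)
print(burst_sample(["I", None, None, "happily"], rng))
```

The real method is much smarter about which draws it keeps (it accepts valid prefixes token by token rather than whole sequences), but the accept/reject structure is the same.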


Add to this some kind of "autofocus" for the user to click on the word that is the "center" of the prompt and you've really got something


What exactly do you mean by "autofocus"?


I mean the user clicks on the word that is the "focus" of the prompt the way that you click on a camera screen


Title is incorrect: it's σ not Σ.


Σ is uppercase σ. Maybe this happened automatically? Pretty funny if so. Correct in a Greek context; clearly incorrect in a math context.


Yes, HN automatically did that.


For future reference, it is possible to edit the titles of stories you've submitted. This allows you to correct any errors introduced by HN's title rewriting heuristics at submission time, without waiting for a moderator to do it for you. Just like for comments, though, the edit window is time limited. For comments the window is two hours. I don't know if it's the same for story titles.


Great, now I'm imagining GPT flexing its roided biceps while making sigma faces, as edgy incel music goes hard in the background with spiky synths and a boy's choir.

After seeing how awesome the showcase looks, I'm not even sure I'm mad about this, lol



