
This just seems like a longer version of the HN comments I see in every single AI-related post: there's no such thing as a hallucination; LLMs don't really "understand," therefore they can neither tell truths nor lies, just like autocorrect can't be honest or dishonest; they "just" choose a token from a probability distribution; etc.

I see this every day. I'm pretty sure most of us who have even a slight interest in AI know the gist of how LLMs work. I'm not sure what difference it makes in practice.


The new GPT-4o model is free. Plus users will get better rate limits and the voice feature, but everyone has access to the best model right now.


No, their site literally says "our most powerful model" as the description for GPT-4o, and it scores slightly higher than GPT-4 in their benchmarks: https://openai.com/index/hello-gpt-4o/


No, there's more. It's a thread. He really should've used twitlonger or something like that. Here are the tweets:

Stepping away from this job has been one of the hardest things I have ever done, because we urgently need to figure out how to steer and control AI systems much smarter than us.

I joined because I thought OpenAI would be the best place in the world to do this research.

However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.

I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics.

These problems are quite hard to get right, and I am concerned we aren't on a trajectory to get there.

Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.

Building smarter-than-human machines is an inherently dangerous endeavor.

OpenAI is shouldering an enormous responsibility on behalf of all of humanity.

But over the past years, safety culture and processes have taken a backseat to shiny products.

We are long overdue in getting incredibly serious about the implications of AGI.

We must prioritize preparing for them as best we can.

Only then can we ensure AGI benefits all of humanity.

OpenAI must become a safety-first AGI company.

To all OpenAI employees, I want to say:

Learn to feel the AGI.

Act with the gravitas appropriate for what you're building.

I believe you can "ship" the cultural change that's needed.

I am counting on you.

The world is counting on you.

:openai-heart:


Ah, thank you so much! Twitter doesn't make that clear at all, and I don't have an account there.


As posted on his site, and discussed over here:

https://news.ycombinator.com/item?id=40391412


>the guy who quit/lost his job at OpenAI because he didn't agree with their corporate shift and departure from the original non-profit vision

There is no evidence of this being true.

He is one of the biggest proponents of keeping AI closed-source, by the way.


> He is one of the biggest proponents of keeping AI closed-source, by the way.

For quite different reasons than profit, though.


That's a naive way of thinking. Keeping it closed source would only make it available to the highest bidder on the black market.


You can tell it to talk in a robotic, unrealistic way and it will do so.

Here is a demo from their presentation: https://youtu.be/D9byh4MAsUQ


I have the opposite impression from that demo.

It doesn't sound like a neutral, boring voice. It sounds like an overly dramatic person pretending to be a robot.


>It sounds like an overly dramatic person pretending to be a robot

That's precisely what it was ordered to do.


FWIW, that one guy on twitter is spot on more often than not in his leaks and predictions.


>You're posting under a thread where many seniors are discussing how they don't want this because it doesn't work.

Many artists and illustrators thought AI art would never threaten their livelihood because it did not understand form, it completely messed up perspective, it could never draw hands, etc. Look at the state of their industry now. It still doesn't "understand" hands, but it can sure as hell draw them. We're even getting video generation that understands object permanence, something that didn't seem possible just over a year ago, when the best we got were terrible, low-quality, noisy GIFs with wild inconsistencies.

Many translators thought AI would never replace them, and then Duolingo fired their entire translation team.

I'm sure that GP isn't worried about being replaced by GPT-4. They're worried about having to compete with a potentially much better GPT-5 or 6 by the time they graduate.


Most people don't love their jobs. Switching careers isn't easy, and we all have bills to pay.


Yi Tay's response (chief scientist at Reka AI, ex-Google Brain researcher): https://twitter.com/YiTayML/status/1783273130087289021

>not true, especially for language. if you trained a large & deep MLP language model with no self-attention, no matter how much data you'll feed it you'll still be lacking behind a transformer (with much less data). will it get to the same point? i don't think so. your tokens cannot even see each other in a raw MLP.

>on the other hand, tiny tweaks to transformers may not matter as much as data/compute. sure. but it's also not very accurate to say "architecture research" does not matter and "makes no difference". i hear this a lot about how people use this to justify not innovating at the architecture level.

>the truth is the community stands on the shoulder of giants of all the arch research that have been done to push the transformer to this state today.

>architecture research matters. many people just take it for granted these days.
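
To make the "tokens cannot even see each other" point concrete, here's a toy numpy sketch (made-up shapes, not any real model) contrasting a position-wise MLP with a single self-attention head; only the attention output mixes information across tokens:

    import numpy as np

    T, d = 4, 8                      # sequence length, embedding dimension
    x = np.random.randn(T, d)        # token embeddings
    W = np.random.randn(d, d)

    # "Raw MLP": the same transform is applied to each token independently,
    # so changing token 0 cannot affect the outputs for tokens 1..T-1.
    mlp_out = np.maximum(x @ W, 0)

    # One self-attention head (no masking; W is reused just to keep this short):
    # every output row is a weighted mix of *all* token values.
    Q, K, V = x @ W, x @ W, x @ W
    scores = (Q @ K.T) / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    attn_out = weights @ V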


I'm James Betker.

Of course architecture matters in this regard lol. Comparing a CNN to a transformer is like comparing two children brought up in the same household but one has a severe disability.

What I meant in this blog post was that given two NNs which have the same basic components that are sufficiently large and trained long enough on the same dataset, the "behavior" of the resulting models is often shockingly similar. "Behavior" here means the typical (mean, heh) responses you get from the model. This is a function of your dataset distribution.

edit: Perhaps it'd be best to give a specific example. Let's say you train two pairs of networks: (1) a Mamba SSM and a Transformer on the Pile; (2) two Transformers, one trained on the Pile, the other trained on Reddit comments. All are trained to the same MMLU performance.

I'd put big money that the average responses you get when sampling from the models in (1) are nearly identical, whereas the two models in (2) will be quite different.
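
A rough sketch of the kind of comparison described above (hand-wavy; the checkpoint names are placeholders, not real models):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    def sample_responses(model_name, prompts, max_new_tokens=64):
        # Load a causal LM and sample one completion per prompt.
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        outs = []
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids
            gen = model.generate(ids, do_sample=True, temperature=0.7,
                                 max_new_tokens=max_new_tokens)
            outs.append(tok.decode(gen[0], skip_special_tokens=True))
        return outs

    prompts = ["The capital of France is", "Briefly explain photosynthesis:"]

    # Pair (1): same data (the Pile), different architectures.
    # The claim is these two look nearly identical on average.
    a = sample_responses("example/mamba-pile", prompts)          # placeholder name
    b = sample_responses("example/transformer-pile", prompts)    # placeholder name

    # Pair (2): same architecture, different data (Reddit instead of the Pile).
    # The claim is this one behaves quite differently from the other two.
    c = sample_responses("example/transformer-reddit", prompts)  # placeholder name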


There aren't many people who will proudly announce their employer, their name, and where someone can stick it over the course of two public comments these days.

You, sir, are my hero.


Please humor me for a moment, because I'm having trouble seeing why this is not just true by definition. Doesn't "training to the same performance" mean that you get the same responses? Or from a different angle: given that the goal of the model is to generate plausible completions based on a training dataset, it seems like plausibility (and therefore performance) is obviously defined by the dataset.


If Mamba really was as capable as a Transformer on tasks requiring accurate attending to long context, then there'd be no need for Jamba (Mamba+Transformer hybrid).

Your argument of "if we train a Mamba SSM to be as good as a Transformer, then it'll be as good as a Transformer" seems a tad circular...


Yeah, I'm not sure how someone could interpret what you said in the way people are citing here. It's actually obvious that you are right in the context of data in LLMs. Look at Llama 3, for example: there are minimal architectural changes, and its performance is almost at the level of GPT-4. The biggest change was in the dataset.


Well, both can be true if you interpret the "it" as "the secret sauce / competitive advantage". A good architecture is a necessary but not sufficient condition for success, but everybody uses more or less the same one currently, so data makes the difference. Until the next improvement in architecture.


Or until we run out of data that actually differentiates the models


I do argue that the "it" is the architecture. We've had pretty much all the data these LLMs were trained on for a long time. The game changer was the architecture, not the data. Unless, of course, you are in the "code is data" camp ;).


Probably the "it" is whatever one model has that other models don't have. When everyone is using the same architecture, then the data makes the difference. If everyone has the same data, then the architecture makes the difference.

It sounds pretty obvious to say that the difference is whatever is different, but isn't that literally what both sides of this argument are saying?

edit: I do think that what the original linked essay is saying is slightly subtler than that, which is that _given_ that everyone is using the same transformer architecture, the exact hyperparameters and fine-tuning that is done matter a lot less than the dataset does.


MLP is a universal approximator, so there’s definitely a configuration that can match an attention mechanism. Whether or not it’d be feasible to train is another question.


Not sure about feasible, but certainly not efficient.

I think this MLP universal approximator notion is similar to a Turing machine being a universal computation device. Correct, but practically useless.

I don't think Sutton's bitter lesson is going to result in everything being an MLP. You want the most scalable architecture, which an MLP certainly is not.


Yes, and note that in terms of different architectures, the author (James Betker) is talking about image generators, while when he's talking about LLMs they are all the same basic architecture - transformers.

Some tasks are going to be easier to learn than others, and certainly in general you can have more than one architecture capable of learning a given task, as long as it is sufficiently powerful (combination of architecture + size) and well trained.

That said, it's notable that all the Pareto-optimal LLMs are transformer-based, and that in the 7 years since the attention paper (2017), all we have seen in terms of architectural change has been scaling up or minor tweaks like MoE and different types of attention.

How do you make a different architecture such as Mamba more competitive with transformers? Add some transformer layers to it (Jamba) !

So, yeah, as far as LLMs go, the precise model doesn't matter as long as it's a transformer, which isn't very surprising given what we know about how they work - primarily via induction heads. The lesson here isn't that architecture doesn't matter for LLMs, but rather that the architecture has to be a transformer! Data then becomes paramount, because the model learns the program (induction heads, etc) that runs on the machine (transformer) from the data.

No doubt there will be architectural advances beyond transformers, although few people seem to be currently looking for them, but I'm pretty sure they will still need something equivalent to the transformer's attention mechanism.


Seems like an objection that is slightly beside the point? The claim is not that literally any model gives the same result as a large transformer model, that's obviously false. I think the more generous interpretation of the claim is that the model architecture is relatively unimportant as long as the model is fundamentally capable of representing the functions you need it to represent in order to fit the data.


OP's claim/observation is that "trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point [of inference performance]".

His conclusion is that "It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else".

There is an implicit assumption here that seems obviously false - that this "convergence point" of predictive performance represents the best that can be done with the data, which is to imply that these current models are perfectly modelling the generative process - the human brain.

This seems highly unlikely. If they are perfectly modelling the human brain, then why do they fail so badly at so many tasks? Just lack of training data?


Interesting point. But does the data contain enough information to perfectly model the generative process? Maybe even a very complex and capable model like "the human brain" would fail to model the dataset better than large transformers, if that was the only thing it ever saw.

You and I can model the dataset better, but we're already "pre-trained" on reality for decades.

Just because the dataset is large doesn't mean it contains useful information.


Perhaps, but even with an arbitrarily good training set, the LLM would still be constrained by its own architectural limits. E.g., if a problem can't be broken down into sub-problems that each require <= N sequential steps, then an N-layer transformer will never be able to solve it.
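
As a toy illustration of that kind of limit (a hypothetical "pointer chasing" task, not a claim about any particular model): the answer requires a chain of dependent lookups, and each lookup needs the result of the previous one, so a single forward pass with only N rounds of token mixing runs out of depth once the chain is longer than N, unless the model can emit intermediate steps chain-of-thought style.

    import random

    def make_chain_example(hops, vocab=26):
        # Build a random set of links (a permutation) and ask which node is
        # reached after `hops` sequential lookups starting from `start`.
        nodes = list(range(vocab))
        links = dict(zip(nodes, random.sample(nodes, vocab)))
        start = random.choice(nodes)
        answer = start
        for _ in range(hops):        # each hop depends on the previous one
            answer = links[answer]
        prompt = " ".join(f"{a}->{b}" for a, b in links.items())
        prompt += f" start={start} hops={hops} end="
        return prompt, answer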

Even if the architectural shortcomings were all fixed, it seems "[pre-training] data is all you need" would still be false, because there is no getting around the need for personal experience, for the same reasons it is true for us...

Perhaps most fundamentally, any action/prediction you make can only be based on the content of your own mind, not the mind of a tutor you are trying to copy. Even if the tutor diligently tries to communicate all nuances and contingencies of a skill to you, those are still all relative to his/her own internal world model, not the one in your head. You will need to practice and correct to adapt the instructions to yourself.


Machine learning insights from e e cummings.

