Of course. A trademark exists to mutually protect consumers and businesses from deceptive advertising. When a term referring to a specific product becomes a term for a whole product category, trademark protections become harmful to consumers, but they still benefit the business. If you're building a brand, you generally want to be as close to the legal limit as possible without exceeding it.
Gary Marcus was arguing in 2020 that scaling up GPT-2 wouldn't result in improvements in common sense or reasoning. He was wrong, and he continues to be wrong.
It's called the bitter lesson for a reason. Nobody likes seeing their life's work on some unicorn architecture get demolished by simplicity+scale
Assuming there is a development that makes GPUs obsolete, I think it's safe to assume that whatever replaces them at scale will still take the form of a dedicated AI card/rack, for a few reasons:
1. Tight integration is necessary because of fundamental compute constraints like memory latency.
2. Economies of scale.
3. Opportunity cost to AI orgs. Meta, OpenAI, etc. want 50k H100s to arrive in a shipping container and plug in, so they can focus on their value-add.
Everyone will have to readjust to this paradigm. Even if next-gen AI runs better on CPUs, Intel won't suddenly be signing contracts to sell 1,000,000 Xeons and 1,000,000 motherboards, etc.
Also, Nvidia has ~$25B cash on hand and almost $10B in yearly R&D spend. They've been an AI-first company for over a decade now; they're more prepared to pivot than anyone else.
Edit: nearly forgot - Nvidia can issue 5% in new stock and raise ~$100B like it's nothing.
No one is spending $10-50M building a Markov text model of everything ever digitised; if they did, its performance would approach that of a basic LLM.
Though, more simply, you can just take any LLM and rephrase it as a Markov model. All algorithms which model conditional probability are equivalent in this sense; you can even unpack a NN as a kNN model or a decision tree.
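As a rough sketch of what that equivalence means in practice (names here are illustrative, not any real library): a count-based Markov model and an LLM are both just functions from a context to a distribution over the next token.

    # Both objects implement the same interface: context -> P(next token | context).
    # The Markov version is an explicit lookup table of counts; an LLM computes the
    # same kind of distribution with a neural net instead of a table.
    from collections import Counter, defaultdict

    def build_counts(tokens, order=2):
        counts = defaultdict(Counter)
        for i in range(len(tokens) - order):
            counts[tuple(tokens[i:i + order])][tokens[i + order]] += 1
        return counts

    def markov_next_token_probs(context, counts, order=2):
        nexts = counts.get(tuple(context[-order:]), Counter())
        total = sum(nexts.values())
        return {tok: c / total for tok, c in nexts.items()} if total else {}

    corpus = "the cat sat on the mat".split()
    counts = build_counts(corpus)
    print(markov_next_token_probs(["the", "cat"], counts))  # {'sat': 1.0}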
They all model 'planning' in the same way: P(C|A, B) is a 'plan' for C following A, B. There is no model of P("A B C" | "A B"). Literally, at inference time, no computation whatsoever is performed to anticipate any future prediction -- this follows trivially from the mathematical formalism (which no one seems to want to understand), and you can also see it empirically: per-token inference time is constant regardless of prompt/continuation.
The reason 'the cat sat...' is completed by 'on the mat' is that P(on | the cat sat...), P(the | the cat sat on...), and P(mat | the cat sat on the...) are each maximal.
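Spelled out as code, greedy decoding is just repeatedly taking the argmax of those conditionals; a toy sketch, where next_token_probs is a stand-in for any model with that interface:

    # Greedy decoding: the model is only ever asked for P(next | context),
    # and the continuation is just repeated argmax over that conditional.
    def greedy_continue(context, next_token_probs, n_tokens):
        out = list(context)
        for _ in range(n_tokens):
            probs = next_token_probs(out)           # P(token | everything so far)
            out.append(max(probs, key=probs.get))   # pick the maximal one
        return out

    # A toy conditional table that happens to encode the familiar phrase:
    table = {
        ("the", "cat", "sat"): {"on": 0.9, "quietly": 0.1},
        ("the", "cat", "sat", "on"): {"the": 0.95, "a": 0.05},
        ("the", "cat", "sat", "on", "the"): {"mat": 0.8, "sofa": 0.2},
    }
    print(greedy_continue(["the", "cat", "sat"],
                          lambda ctx: table[tuple(ctx)], 3))
    # ['the', 'cat', 'sat', 'on', 'the', 'mat']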
Why it's maximal is not in the model at all, nor in the data. It's in the data-generating process, i.e., us. It is we who arranged text with these frequencies, and we did so because the phrase is a popular one for academic demonstrations (and so on).
As ever, people attribute to "the data", or worse to "the LLM", properties it doesn't have. Rather, it replays the data to us and we suppose the LLM must have the property that generated this data originally. Nope.
Why did the tape recorder say "the cat sat on the mat"? What, on the tape or in the recorder, made "mat" the right word? Surely the tape must have planned the word...
>Why it's maximal is not in the model at all, nor in the data
>It replays the data to us and we suppose the LLM must have the property that generated this data originally.
So to clarify, what you're saying is that under the hood, an LLM is essentially just performing a search for similar strings in its training data and regurgitating the most commonly found one?
Because that is demonstrably not what's happening. If this were 2019 and we were talking about GPT-2, it would be more understandable, but SoTA LLMs can in-context learn and translate entire languages which aren't in their training data.
Oh huh. Why not make it stateful, like reusing previous computation and only computing the “diff” when you add a new token? Assuming it’s not that easy because each token can affect attention globally.
I think I’ve read something about this, but I wonder if you could abstract attention to the sentence/page level and then only recalculate the parts that are relevant.
I.e., the KV cache is 'just' a time-saving measure, because an LLM would otherwise go back and recalculate those values anyway (which is why, without it, per-token compute grows quadratically with context length).
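Roughly, yes. A minimal single-head numpy sketch of the caching idea (illustrative only: no batching, no real model):

    # Single-head attention with a KV cache. Per new token we compute only that
    # token's K/V row and attend over the cached ones, instead of recomputing
    # K and V for the whole prefix from scratch.
    import numpy as np

    d = 8
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    K_cache, V_cache = [], []   # grows by one row per generated token

    def attend_next(x):
        """x: embedding of the newest token, shape (d,)."""
        q = x @ Wq
        K_cache.append(x @ Wk)   # only the new token's K/V get computed
        V_cache.append(x @ Wv)
        K, V = np.stack(K_cache), np.stack(V_cache)
        scores = softmax(K @ q / np.sqrt(d))   # attention over all cached positions
        return scores @ V                      # context vector for this step

    for _ in range(5):                         # feed 5 dummy token embeddings
        out = attend_next(rng.normal(size=d))
    print(out.shape)  # (8,)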
You're not wrong that you could make an LLM more stateful. There are plenty of ideas for that, but it would:
a) be far more compute-intensive to train and run (especially train);
b) be susceptible to all of the issues that RNNs have (sketched below);
c) most importantly, almost certainly just converge at scale with transformers. Labs run small-scale internal tests of architectures all the time, and most of them basically come to this conclusion and abandon it.
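For reference, the 'stateful' shape being discussed is essentially a recurrent update: a fixed-size state carried forward one token at a time. A toy sketch (not any specific lab's architecture):

    # Recurrent "stateful" update: the hidden state is updated once per token and
    # old tokens are never revisited. O(1) work per token, but with the classic
    # RNN trade-offs: sequential training, a state bottleneck, long-range forgetting.
    import numpy as np

    d = 16
    rng = np.random.default_rng(0)
    Wh, Wx = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

    def step(h, x):
        return np.tanh(h @ Wh + x @ Wx)   # new state depends only on (old state, new token)

    h = np.zeros(d)
    for _ in range(100):
        h = step(h, rng.normal(size=d))
    print(h.shape)  # (16,)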
In the case of 3SUM, it's because the LLM has been fine-tuned to use 'blank' token key/values as a register to represent the sums of specific integer triplets.
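(For reference, 3SUM is the task of deciding whether any three numbers in a list sum to zero; brute force looks like this:)

    # Brute-force 3SUM: do any three distinct elements sum to zero?
    from itertools import combinations

    def three_sum(nums):
        return any(a + b + c == 0 for a, b, c in combinations(nums, 3))

    print(three_sum([3, -1, -2, 7]))  # True: 3 + (-1) + (-2) == 0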