Of course. A trademark exists to mutually protect consumers and businesses from deceptive advertising. When a term referring to a specific product becomes a term for a whole product category, trademark protections become harmful to consumers, but they still benefit the business. If you're building a brand, you generally want to be as close to the legal limit as possible without exceeding it.
Gary Marcus was arguing in 2020 that scaling up GPT-2 wouldn't result in improvements in common sense or reasoning. He was wrong, and he continues to be wrong.
It's called the bitter lesson for a reason. Nobody likes seeing their life's work on some unicorn architecture get demolished by simplicity+scale
Assuming there is a development that makes GPUs obsolete, I think it's safe to assume that whatever replaces them at scale will still take the form of a dedicated AI card/rack, for a few reasons:
1. Tight integration is necessary because of fundamental compute constraints like memory latency.
2. Economies of scale.
3. Opportunity cost to AI orgs. Meta, OpenAI, etc. want 50k H100s to arrive in a shipping container and plug in, so they can focus on their value-add.
Everyone will have to readjust to this paradigm. Even if next-gen AI runs better on CPUs, Intel won't suddenly be signing contracts to sell 1,000,000 Xeons and 1,000,000 motherboards, etc.
Also, Nvidia has ~$25B cash on hand and almost $10B in yearly R&D spend. They've been an AI-first company for over a decade now; they're more prepared to pivot than anyone else.
Edit: nearly forgot - Nvidia can issue 5% in new stock and raise ~$100B like it's nothing.
No one is spending $10-50M building a Markov text model of everything ever digitised; if they did, its performance would approach that of a basic LLM.
Though, more simply, you can just take any LLM and rephrase it as a Markov model. All algorithms which model conditional probability are equivalent in this sense; you can even unpack a NN as a kNN model or a decision tree.
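As a rough sketch of what that equivalence means in practice (names here are illustrative, not any real library): a count-based Markov model and an LLM are both just functions from a context to a distribution over the next token.

    # Both objects implement the same interface: context -> P(next token | context).
    # The Markov version is an explicit lookup table of counts; an LLM computes the
    # same kind of distribution with a neural net instead of a table.
    from collections import Counter, defaultdict

    def build_counts(tokens, order=2):
        counts = defaultdict(Counter)
        for i in range(len(tokens) - order):
            counts[tuple(tokens[i:i + order])][tokens[i + order]] += 1
        return counts

    def markov_next_token_probs(context, counts, order=2):
        nexts = counts.get(tuple(context[-order:]), Counter())
        total = sum(nexts.values())
        return {tok: c / total for tok, c in nexts.items()} if total else {}

    corpus = "the cat sat on the mat".split()
    counts = build_counts(corpus)
    print(markov_next_token_probs(["the", "cat"], counts))  # {'sat': 1.0}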
They all model 'planning' in the same way: P(C|A, B) is a 'plan' for C following A, B. There is no model of P("A B C" | "A B"). Literally, at inference time, no computation whatsoever is performed to anticipate any future prediction -- this follows trivially from the mathematical formalism (which no one seems to want to understand), and you can also see it empirically: per-token inference time is constant regardless of prompt/continuation.
The reason 'the cat sat...' is completed by 'on the mat' is that P(on | the cat sat...), P(the | the cat sat on...), and P(mat | the cat sat on the...) are each maximal.
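Spelled out as code, greedy decoding is just repeatedly taking the argmax of those conditionals; a toy sketch, where next_token_probs is a stand-in for any model with that interface:

    # Greedy decoding: the model is only ever asked for P(next | context),
    # and the continuation is just repeated argmax over that conditional.
    def greedy_continue(context, next_token_probs, n_tokens):
        out = list(context)
        for _ in range(n_tokens):
            probs = next_token_probs(out)           # P(token | everything so far)
            out.append(max(probs, key=probs.get))   # pick the maximal one
        return out

    # A toy conditional table that happens to encode the familiar phrase:
    table = {
        ("the", "cat", "sat"): {"on": 0.9, "quietly": 0.1},
        ("the", "cat", "sat", "on"): {"the": 0.95, "a": 0.05},
        ("the", "cat", "sat", "on", "the"): {"mat": 0.8, "sofa": 0.2},
    }
    print(greedy_continue(["the", "cat", "sat"],
                          lambda ctx: table[tuple(ctx)], 3))
    # ['the', 'cat', 'sat', 'on', 'the', 'mat']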
Why it's maximal is not in the model at all, nor in the data. It's in the data-generating process, i.e., us. It is we who arranged text with these frequencies, and we did so because the phrase is a popular one for academic demonstrations (and so on).
As ever, people attribute to "the data", or worse to "the LLM", properties it doesn't have. Rather, it replays the data to us and we suppose the LLM must have the property that generated this data originally. Nope.
Why did the tape recorder say "the cat sat on the mat"? What, on the tape or in the recorder, made "mat" the right word? Surely the tape must have planned the word...
>Why it's maximal is not in the model at all, nor in the data
>It replays the data to us and we suppose the LLM must have the property that generated this data originally.
So to clarify, what you're saying is that under the hood, an LLM is essentially just performing a search for similar strings in its training data and regurgitating the most commonly found one?
Because that is demonstrably not what's happening. If this were 2019 and we were talking about GPT-2, it would be more understandable, but SoTA LLMs can in-context learn and translate entire languages which aren't in their training data.
Oh huh. Why not make it stateful, like reusing previous computation and only computing the “diff” when you add a new token? Assuming it’s not that easy because each token can affect attention globally.
I think I’ve read something about this, but I wonder if you could abstract attention to the sentence/page level and then only recalculate the parts that are relevant.
I.e., the KV cache is 'just' a time-saving measure, because an LLM would otherwise go back and recalculate those values anyway (which is why, without it, per-token compute grows quadratically with context length).
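Roughly, yes. A minimal single-head numpy sketch of the caching idea (illustrative only: no batching, no real model):

    # Single-head attention with a KV cache. Per new token we compute only that
    # token's K/V row and attend over the cached ones, instead of recomputing
    # K and V for the whole prefix from scratch.
    import numpy as np

    d = 8
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    K_cache, V_cache = [], []   # grows by one row per generated token

    def attend_next(x):
        """x: embedding of the newest token, shape (d,)."""
        q = x @ Wq
        K_cache.append(x @ Wk)   # only the new token's K/V get computed
        V_cache.append(x @ Wv)
        K, V = np.stack(K_cache), np.stack(V_cache)
        scores = softmax(K @ q / np.sqrt(d))   # attention over all cached positions
        return scores @ V                      # context vector for this step

    for _ in range(5):                         # feed 5 dummy token embeddings
        out = attend_next(rng.normal(size=d))
    print(out.shape)  # (8,)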
You're not wrong that you could make an LLM more stateful. There are plenty of ideas for that, but it would:
a) be far more compute-intensive to train and run (especially train);
b) be susceptible to all of the issues that RNNs have (sketched below);
c) most importantly, almost certainly just converge at scale with transformers. Labs run small-scale internal tests of architectures all the time, and most of them basically come to this conclusion and abandon it.
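For reference, the 'stateful' shape being discussed is essentially a recurrent update: a fixed-size state carried forward one token at a time. A toy sketch (not any specific lab's architecture):

    # Recurrent "stateful" update: the hidden state is updated once per token and
    # old tokens are never revisited. O(1) work per token, but with the classic
    # RNN trade-offs: sequential training, a state bottleneck, long-range forgetting.
    import numpy as np

    d = 16
    rng = np.random.default_rng(0)
    Wh, Wx = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

    def step(h, x):
        return np.tanh(h @ Wh + x @ Wx)   # new state depends only on (old state, new token)

    h = np.zeros(d)
    for _ in range(100):
        h = step(h, rng.normal(size=d))
    print(h.shape)  # (16,)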
In the case of 3SUM, it's because the LLM has been fine-tuned to use 'blank' token key/values as a register to represent the sums of specific integer triplets.
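(For reference, 3SUM is the task of deciding whether any three numbers in a list sum to zero; brute force looks like this:)

    # Brute-force 3SUM: do any three distinct elements sum to zero?
    from itertools import combinations

    def three_sum(nums):
        return any(a + b + c == 0 for a, b, c in combinations(nums, 3))

    print(three_sum([3, -1, -2, 7]))  # True: 3 + (-1) + (-2) == 0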