I was told by an employee that GDM internally has a credits system for TPU allocation, with which researchers have to budget out their compute usage. I may have completely misunderstood what they were describing, though.
You are correct on true H100 ownership costs being far lower. As I mention in the H100 blurb, the H100 numbers are fungible and I don't mind if you halve them.
MFU can certainly be improved beyond 40%, as I mention. But on the point of small models specifically: the paper uses FSDP for all models, and I believe a rigorous experiment should not vary sharding strategy due to numerical differences. FSDP2 on small models will be slow even with compilation.
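For anyone unfamiliar with the term, MFU is just achieved model FLOP/s divided by the hardware's peak FLOP/s. A rough sketch with made-up throughput numbers (the 989 TFLOP/s figure is the advertised H100 SXM bf16 dense peak; nothing here comes from the paper):

```python
# MFU = achieved model FLOP/s / theoretical peak FLOP/s of the hardware.
# Throughput below is made up; peak is the advertised H100 SXM bf16 dense figure.
n_params = 2e9            # model size in parameters (example)
tokens_per_sec = 33_000   # measured training throughput (assumed)
peak_flops = 989e12       # per-GPU peak, bf16 dense
n_gpus = 1

achieved = 6 * n_params * tokens_per_sec   # ~6 FLOPs per parameter per token (fwd + bwd)
print(f"MFU ≈ {achieved / (n_gpus * peak_flops):.0%}")   # ≈ 40% with these numbers
```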
The paper does not tie embeddings, as stated. The readout layer does lead to 6DV because it is a linear layer of D*V, which takes 2x for a forward and 4x for a backward. I would appreciate it if you could limit your comments to factual errors in the post.
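For concreteness, here is the standard FLOP accounting behind that 6DV figure (the D and V values below are placeholders, not the paper's):

```python
# Per-token FLOPs of the readout (unembedding) layer: a D x V linear map.
D, V = 4096, 32_000        # hidden size and vocab size (placeholder values)

forward  = 2 * D * V       # each weight contributes one multiply and one add
backward = 4 * D * V       # gradients w.r.t. weights and inputs, ~2x the forward
total = forward + backward
assert total == 6 * D * V
print(f"{total / 1e9:.2f} GFLOPs per token")   # ~0.79 for these values
```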
My bad on the 6DV estimate; you are correct that if they do a dense decoding (rather than a hierarchical one, as Google used to do in the old days) the cost is exactly 6DV. I cannot edit the GP comment and I will absorb the shame of my careless words there. I was put off by the subtitle and initial title of this HN post, though the current title is more appropriate and correct.
Even if it's a small model, one could use DDP or FSDP/FSDP2 without slowdowns on a fast interconnect, which certainly adds to the cost. But if you want to reproduce all the work at the cheapest price point, you only need to parallelize to the minimal level that fits in memory (or rather, the one that maxes out MFU), so everything below 2B parameters runs on a single H100 or a single node.
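Rough memory math behind that single-H100 claim, assuming mixed-precision Adam and ignoring activations (my numbers, not the paper's):

```python
# Model + optimizer state for a 2B-parameter model with mixed-precision Adam:
# bf16 params and grads, fp32 master weights, and two fp32 Adam moments.
n_params = 2e9
bytes_per_param = 2 + 2 + 4 + 4 + 4   # ≈ 16 bytes/param before activations
print(f"~{n_params * bytes_per_param / 1e9:.0f} GB of state vs 80 GB of H100 HBM")  # ~32 GB
```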
I think the commenter was thinking about the input embedding layer, where, to get an input token embedding, the model does a lookup of the embedding by index, which is constant time.
And the blog post author is talking about the output layer where the model has to produce an output prediction for every possible token in the vocabulary. Each output token prediction is a dot-product between the transformer hidden state (D) and the token embedding (D) (whether shared with input or not) for all tokens in the vocabulary (V). That's where the VD comes from.
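A minimal sketch of the two operations being contrasted (the shapes and names are mine, not the blog's):

```python
import torch

B, D, V = 8, 4096, 32_000                     # batch, hidden size, vocab size (made up)
embed = torch.nn.Embedding(V, D)              # input side: constant-time lookup per token
unembed = torch.nn.Linear(D, V, bias=False)   # output side: a D*V matmul per token

token_ids = torch.randint(0, V, (B,))
x = embed(token_ids)      # pure indexing, no matmul
h = x                     # ...transformer layers would go here...
logits = unembed(h)       # one D-dim dot product per vocabulary entry -> shape (B, V)
```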
It would be great to clarify this in the blog post to make it more accessible but I understand that there is a tradeoff.
I agree, and really hope that Meta is doing something in that vein. Reducing the FLOPs:Memory ratio (as in Soft MoE) could also open the door to CPU (or at least Apple Silicon) inference becoming more relevant.
I suspect OpenAI will figure out some way to reduce the randomness at some point, though, given their public commitment to eventually adding logprobs back to ChatCompletions.
I don't think this commitment had any plausibility. Token "probabilities" only have a straightforward probabilistic interpretation for base models. In fine-tuned models, they no longer represent the probability of the next token given the prompt, but rather how well the next token fulfills the ... tendencies induced by SL and RL tuning. Which is presumably pretty useless information. OpenAI has no intention of providing access to the GPT-4 base model, and they in fact removed API access to the GPT-3.5 base model.
You do, because it's not just more training, it's PPO updates instead of MLE. It's no longer trying to estimate the token distribution of the training corpus; it's trying to shift probability mass onto tokens that maximize expected reward from the RM. The GPT-4 technical report has a figure showing that logprobs are less well calibrated as confidence scores in the RLHF model than in the pre-trained model.
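To make the MLE-vs-PPO distinction concrete, here is the rough shape of the two objectives (schematic, not OpenAI's actual training code; advantages are assumed to come from the reward-model pipeline):

```python
import torch
import torch.nn.functional as F

# Pretraining (MLE): maximize the log-probability of the observed next token,
# so logits end up estimating the token distribution of the corpus.
def mle_loss(logits, target_ids):
    return F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))

# RLHF (PPO, clipped surrogate): push probability mass toward tokens with high
# advantage under the reward model, regardless of corpus frequency.
def ppo_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

After the second objective, the logits are whatever maximizes expected reward, so reading them as calibrated next-token probabilities stops making sense.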
Two users I talked with mentioned bad experiences with them, specifically unreliable instances. It's not always bad; they said it can be good, and I know the pricing is often great. But given that, I don't want to recommend it to most people.
> I think it's more plausible that teaching focuses on this surface knowledge because it's much easier and more legible, and looks and feels very much like "programming education" to someone who does not have actual domain knowledge (because other subjects are usually done in the same way), or who isn't thinking very much about it, and then similar problems and a notion that testing should be "fair" and "cover what students have learned" lead to insufficiently outcome-oriented exams, which then sets up incentives biasing students in similar directions.
And that's how you end up with CS students acing theory exams while completely flunking the coding exam...
> Computers are not at all human, in that they do exactly what someone has set them up to do, which is often not what they thought they were doing, while many beginners expect them to "understand what they meant" and act accordingly. Every simple-looking capability is burdened with detail: the computer "knows what time it is" (thanks to some nontrivial engineering with some possible failure points); the out-of-order CPU "runs just like an abstract in-order machine, but very fast" (until security researchers find a difference); DNS "resolves domain names to IPs" (but is frequently intercepted by networks, and can also serve as a covert backchannel); video codecs "make videos smaller" (but are also complex domain-specific programming languages); text rendering "is just copying bitmaps into the right places" (unless you care about Unicode or antialiasing or kerning).
Although beginners in coding don't really encounter complex leaky abstractions like that. More likely, they'll run into some simpler poor abstraction, like Scratch blocks not all being fully composable in an intuitive fashion, or a sheer wall of complexity (e.g. being taught Java/C in an introductory course).
The approach is different: it can use pretrained models, e.g. Stable Diffusion, which is a pretty exciting research development. This means that it only requires 'fine-tuning' existing models to get this result.
I agree with that, but it's hard for me to get excited with the knowledge that it'll almost certainly be discarded and forgotten. I've seen too many papers that looked interesting from a theoretical perspective, but were simply never brought to the public because of the barrier of dev+training.
In this case, you need someone that can implement the method as described (hard!), and then you need someone with a setup better than a rented 8xA100 (expensive and not available on many cloud providers) to actually reproduce the model.
To put it in context, in almost all areas of research (physics, biology, chemistry, electronics, etc.), running experiments is expensive. ML is one of the few fields where advances can still be made by amateurs at home. I don't think it's worth writing off everything that requires more resources than a hobbyist has.
Ugh, that tweet is so confusing. The researchers did work for Stability - this work was done for NVIDIA. It's entirely unclear if the researchers are even still associated with stability.ai but Emad sure does imply that's the case.
> (Team working on our variant, will be done when it's done and they are happy with it).
I _think_ he's saying that _his_ team is working on a similar model - and that they will release _that_ model "when it's done" (and not to expect that to happen any time soon).
Just super vague, bordering on taking credit for work that NVIDIA did. Seems like he typed it out on his phone and/or is Elon-levels of lazy about tweets.
I doubt you'll ever again see a publicly released model for something as capable as this.
I can't say I fully understand the mechanisms by which they achieve that, but it's clear that the powers that be have decided that the public cannot be trusted with powerful AI models. Stable Diffusion and LLaMA were mistakes that won't be repeated anytime soon.
AI models with obvious, broad, real-world applications always seem to get reproduced in public. Nvidia's result is obviously great, but it's still a long way from being useful. It reminds me of image generation models maybe 4 years ago.
We need a killer application for video generation models first; then I'm sure someone will throw $100k at training an open source version.
I am going to guess: making "dog-nature-like" humanoids that aim to please lonely people; ones that are nicer to be around than real people and easier than real relationships.
The current generation of GPT-3, which started with text-davinci-003, was actually released in November 2022, not 3 years ago. I'm not even sure the model that was released 3 years ago is still available to test, but it was much less impressive than more recent models - I wouldn't be surprised if LLaMA were actually better.
The model trained 3 years ago was only trained on 300B tokens, heavily undertrained by Chinchilla scaling standards; that's why the LLaMA models can easily beat it on most benchmarks (they were trained on 1T-1.4T tokens). As for the current GPT-3.5 models, who knows; OpenAI is not very open about them.
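Rough numbers behind "heavily undertrained", using the common ~20-tokens-per-parameter rule of thumb taken from the Chinchilla paper:

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens per parameter.
gpt3_params = 175e9
gpt3_tokens = 300e9
optimal = 20 * gpt3_params   # ~3.5T tokens
print(f"GPT-3 saw ~{gpt3_tokens / optimal:.0%} of its Chinchilla-optimal token budget")  # ~9%
# LLaMA's 1T-1.4T tokens on 7B-65B-parameter models is much closer to optimal.
```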
The tragedy of the commons is at play here. We could get amazing ML models rivaling the best if people interested could pool together money for a $100k or $1 million training run. But there doesn't seem to be the willingness, so we humbly rely on the benevolence of companies like Stability and Meta to release models to the public.
Kickstarters like the one you linked don't suffer from the tragedy of the commons because people are essentially just pre-paying for a product. With funding an open source ML model, there's little incentive to not be a free-rider.
> but it's clear that the powers that be have decided that the public cannot be trusted with powerful AI models.
The most dangerous AI model today (in a practical sense, as people are actually using it for shady stuff) is ChatGPT, which is closed source but open to the public, so anyone can cheat on their exams, write convincing fake product reviews, generate SEO spam, etc.
The fact that a model is closed source doesn't change anything as long as it's available for use. Bad actors don't care about running the code on their own machine…
But they're still showing us that the results exist. They're trying to have it both ways: showing that the results are tangible progress while implicitly admitting that that progress is too powerful to put in the hands of the public.
Is there anything that incentivizes Nvidia to publish these results? Is it just the need to get papers out in public for academic clout? Something tells me that all this accomplishes is setting the expectations of everyone who sees the possibilities, that "this will be the future", and a third party without Nvidia's moral framework will become motivated to develop and openly release its own version at some point.
That's just good marketing, isn't it? "Our product is amazing! In fact it's too good, no you can't have it. Unless just maybe, we might let you buy access to it." Oh wow if it's so good that they won't let me have it then I definitely want it!