Hah, funny to see this on HN - it's a relatively old project, but one that I continue to love and still work on. I was trying to train a GPT one day and discovered that the available implementations were quite complex, spread across many files, and took way too many kwarg switches for esoteric/rare options that just bloated and complexified the code. But in my head a GPT was a super simple, neat, isotropic model, so I got all worked up and wrote minGPT.
The project went on to have more impact than I originally imagined and made its way into a number of projects and papers. One of those I found only a few days ago here: https://twitter.com/karpathy/status/1566100736076697600 . What I love about these projects is that the authors often "hack up" minGPT in code directly. They don't configure a comprehensive kwarg monster. I think there's a beauty in that. Very often I wish we had more gists and fewer frameworks - to look at code chunks, understand them completely, tune them to our needs, and re-use them in projects, similar to how bacteria trade little DNA plasmids. minGPT is written for those who want that for their GPT projects. There are plenty of cons to this approach too; ultimately I think there's value in both approaches.
As for the theme of future minGPT development: more examples, and more teeth - it should be possible to demonstrate the training of relatively serious (~few B parameter) models with minGPT on one n-GPU node and reproduce some benchmarks around that scale, but never sacrifice its readability.
I completely agree! I personally find these powerful new network releases border on the depressing, in that they aren't really network releases but huge training systems of dispersed YAMLs. YOLOv4 was a case in point, where I was too overwhelmed to try integrating it into a project I was working on.
PS you are a hero of mine - I'm an academic medical doctor for whom CS231n was my first foray into AI, and since then I've gone on to gold medal in a couple of Kaggle competitions and secured 5 years of higher research funding to pursue clinical AI. I am immensely grateful to you and Fei-Fei Li.
This works for an architecture which has been well tuned and studied before, like LSTM or Transformer.
Once you do research on the model, testing things out, it often tends to become such a kwarg monster in many frameworks.
Having everything (relevant) in one file (even the config file itself with hyperparams) allows you to copy the file for every experiment and modify it in place. This avoids the kwargs mess. But then the config files are very complex, and can become messy in other ways (esp. for research projects). Example: https://github.com/rwth-i6/returnn-experiments/blob/master/2...
Such an approach makes it much more flexible and does not mess with the baseline code. As you say, it's more like an evolutionary DNA-like approach, where you then tend to do crossovers with other evolved good-performing configs, etc.
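To make the idea concrete, here's a minimal sketch of what such a self-contained experiment file could look like - all names here (ExperimentConfig, run) are illustrative, not from RETURNN or minGPT. The point is that the hyperparams and the wiring live in one file, so copying the file is copying the whole experiment:

```python
# config_exp042.py -- hypothetical self-contained experiment file.
# To run a new experiment, copy this file and edit values in place;
# no kwargs plumbing, no shared config framework.
from dataclasses import dataclass, asdict

@dataclass
class ExperimentConfig:
    n_layer: int = 6
    n_head: int = 8
    n_embd: int = 512
    learning_rate: float = 3e-4
    batch_size: int = 64

def run(cfg: ExperimentConfig) -> dict:
    # placeholder for the build-model / train loop that would live
    # in this same file; here it just echoes the settings used
    return asdict(cfg)

# a "crossover": start from this file's defaults, tweak one knob
result = run(ExperimentConfig(n_layer=12))
```

The messiness the parent comment mentions shows up when these files accumulate real logic, but diffing two experiment files against each other stays trivial.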
Thanks for making it! There is immense value in something you can just dive into and hack on. I've been hacking on Stable Diffusion/latent diffusion these past couple weeks, and you don't know how much time it would have saved me if it had something similar!
This is actually a pretty neat, self-contained implementation that can be super easily extended beyond stereotypical natural language models, for example to create world models for video games [1] or to create robot models that can learn to imitate from large, chaotic human demonstration data [2] (disclaimer: I'm an author on the second one). Basically, GPT (or minGPT) models are EXCELLENT sequence modelers, almost to the point where you can throw any sensible sequence data at it and hope to get interesting results, as long as you don't overfit.
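As a rough sketch of what "throwing sequence data at it" means in practice: any token stream (game states, robot actions, text) can be framed as next-token prediction by slicing it into shifted (input, target) pairs, which is the form a GPT-style trainer consumes. The function name below is illustrative, not minGPT's actual API:

```python
def make_pairs(tokens, block_size):
    """Slice a token stream into (input, target) training pairs,
    where each target is the input shifted left by one position -
    position i of the target is what the model should predict
    after seeing positions 0..i of the input."""
    pairs = []
    for i in range(len(tokens) - block_size):
        x = tokens[i : i + block_size]
        y = tokens[i + 1 : i + block_size + 1]
        pairs.append((x, y))
    return pairs

# any integer-encoded sequence works: tokenized text, discretized
# robot joint angles, quantized game frames, ...
tokens = [3, 1, 4, 1, 5, 9, 2, 6]
pairs = make_pairs(tokens, block_size=4)
# first pair: input [3, 1, 4, 1], target [1, 4, 1, 5]
```

Everything model-specific (vocab size, block size) is just two numbers here, which is part of why the same architecture transfers across domains so easily.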
Even though I have only been working on machine learning for around six years, it's crazy to see how the landscape has changed so fast so recently, including diffusion models and transformers. It's not too much to say that we might expect more major breakthroughs by the end of this decade, and end up in a place we can't even imagine right now!
> Even though I have only been working on machine learning for around six years, it's crazy to see how the landscape has changed so fast so recently, including diffusion models and transformers.
It's pretty wild considering how hidden Markov models were considered state of the art not all that long ago.
Not only him. The tech boom of the past decade made a lot of great programmers rich, and it is a good thing. Look also at how Aras Pranckevičius (of Unity fame) is now contributing to Blender. (Also to some extent Rui Ueyama (of mold fame) and Raph Levien (of xi editor fame), although I'm not certain about their financial standing.)
I love your approach and philosophy around programming. If anyone is unaware, Karpathy has a relatively small YouTube channel he started a few weeks ago: https://youtu.be/VMj-3S1tku0
I am working on a video lecture series that will step through it and "spell it out". Without it even this code can be a bit opaque for someone who is new to the field and e.g. uncomfortable with n-dimensional array manipulations or the surrounding language modeling concepts.
All we need now is a good, local Copilot implementation. Has anyone done anything like that with minGPT that you know of? Something to inspire the masses like Stable Diffusion has with images.
With enough training data and enough GPUs to do the model training, you'll be there! Goes to show that for AI, the code really isn't the important part. AI is and always has been about data and compute.