
Karpathy's MinGPT - aliabd
https://github.com/karpathy/minGPT
======
modeless
I think many people don't appreciate just how simple state-of-the-art neural
net techniques are. It's great to have an illustration of just how little code
you need to get the results that have amazed people.

You could say that it relies on PyTorch which is a lot of code, but most of
the complexity comes from the need to do GPU/TPU/SIMD acceleration. A truly
from-scratch CPU-only implementation in C would still not be a large amount of
code.

~~~
2bitencryption
As a developer with an interest in neural-net-based ML, my eyes start to
glaze over a bit when I see so many crazy-looking numpy operations even just
to get a trivial representation of data.

I guess it's just something I have to get used to, but I wish there were an
interface for the same nd-array based logic designed more for developers like
me, rather than for data scientists who perform surgery on 50-d arrays all
day long.

~~~
dheera
Have you checked out numba? You can write regular Python for loops and numpy
calls and it will compile them, instead of the numpy vectorization mess.
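
For example, something like this compiles with numba's @njit (the
moving-average kernel is a made-up illustration, not anything from minGPT):

    import numpy as np
    from numba import njit

    # njit compiles this plain-Python loop to machine code, so there is no
    # need to contort the logic into vectorized numpy operations.
    @njit
    def moving_average(x, window):
        out = np.empty(len(x) - window + 1)
        for i in range(len(out)):
            s = 0.0
            for j in range(window):
                s += x[i + j]
            out[i] = s / window
        return out

    x = np.random.rand(1_000_000)
    print(moving_average(x, 16)[:5])  # first call triggers compilation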

~~~
Der_Einzige
Good luck with anything involving non-trivial types. Numba is awesome but
also quite finicky.

------
martythemaniak
I wonder, is a community-trained model feasible? As in, get a few tens of
thousands of devs to run a seti@home-type app on their GPUs during the night,
and at the end you get access to the 175B-parameter trained model. It cost
~$5M to train, but IIRC that was estimated at cloud GPU prices; if you're
using spare capacity you're just paying for electricity.

~~~
karpathy
Fun idea. GPT @ Home :D. Scatter of the inputs would be very cheap as they are
tiny LongTensors (sequences of indices), but the Gather of the gradients seems
like a bottleneck. These models can be quite large. Maybe each worker only
communicates back some sparse or potentially precision-reduced gradients? In
the limit, I recall papers that were able to only communicate one bit per
dimension. May also be possible to further reduce the number of weights by
weight sharing, or e.g. with HyperNetworks.
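
A rough sketch of the sign-only idea with error feedback, in the spirit of
the 1-bit SGD papers (names and details here are illustrative, not any
specific paper's method):

    import torch

    # Send only the sign of each gradient component (plus one scale per
    # tensor); keep the quantization error locally and fold it back in on
    # the next step. `residual` starts as torch.zeros_like(grad).
    def one_bit_compress(grad, residual):
        corrected = grad + residual
        scale = corrected.abs().mean()
        quantized = torch.sign(corrected) * scale  # what goes over the wire
        residual = corrected - quantized           # error feedback, kept local
        return quantized, residual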

~~~
londons_explore
I built this a few years ago:

[https://github.com/Hello1024/shared-tensor](https://github.com/Hello1024/shared-tensor)

It applies 1-bit-precision updates to the weights each iteration.

It would be fairly trivial to go below 1 bit of precision too: simply set
some threshold (e.g. 3), and wherever the difference between the weight on
the server and the client exceeds that threshold, transmit a binary "1", else
send a binary "0". Then entropy-code the resulting bitstream.

By adjusting the threshold up and down, you trade off the size of the data to
send vs. precision.
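
Roughly, the encoding side might look like this (a sketch; np.packbits
stands in for a real entropy coder):

    import numpy as np

    # Emit a 1 wherever server and client weights disagree by more than the
    # threshold, else a 0, then pack the (mostly zero) bitstream. Raising
    # the threshold shrinks the stream at the cost of precision.
    def encode_delta(server_w, client_w, threshold=3.0):
        bits = (np.abs(server_w - client_w) > threshold).astype(np.uint8)
        return np.packbits(bits)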

~~~
0-_-0
I read a paper that did exactly what you describe but of course I can't find
it now...

------
abakus
If you just want to understand the Transformer, here is a clean
implementation:

[https://github.com/blue-season/pywarm/blob/master/examples/t...](https://github.com/blue-season/pywarm/blob/master/examples/transformer.py)

~~~
chronolitus
and here's a breakdown of the architecture:

[http://dugas.ch/artificial_curiosity/GPT_architecture.html](http://dugas.ch/artificial_curiosity/GPT_architecture.html)

~~~
odnes
These 4 videos (~45 mins total) do an excellent job of explaining attention,
multi-headed attention, and transformers:
[https://www.youtube.com/watch?v=yGTUuEx3GkA](https://www.youtube.com/watch?v=yGTUuEx3GkA)

------
minimaxir
> huggingface/transformers has a language-modeling example. It is full-
> featured but as a result also somewhat challenging to trace. E.g. some
> large functions have as much as 90% unused code behind various branching
> statements that is unused in the default setting of simple language
> modeling.

I don't understand this criticism of Transformers. Doesn't tracing (in both
TorchScript and ONNX forms, which Transformers supports for exporting) just
take the relevant model graph and freeze it? I don't think either export
contains that somewhat-weighty branching code.
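
For illustration, here is what tracing does on a toy module (a Linear layer
standing in for a Transformer; not the Transformers export path itself):

    import torch

    # torch.jit.trace records only the ops executed on the example input,
    # so branches never taken in forward() don't appear in the export.
    model = torch.nn.Linear(10, 10).eval()
    example = torch.randn(1, 10)
    traced = torch.jit.trace(model, example)
    traced.save("model.pt")  # frozen TorchScript graph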

~~~
fabmilo
Try to read the code and understand how it works and you will find it very
challenging to interpret. And not just that: the documentation is also very
sparse and hard to read. Compare that to the OpenAI code, which is so concise
and easy to read. There is mastery in doing that, deep mastery. Few
repositories in the tensorflow or pytorch organizations get to that level.

~~~
sillysaurusx
Agreed re: OpenAI's GPT implementation. It took me roughly a year to
appreciate how simple it is.
[https://github.com/openai/gpt-2/blob/0574c5708b094bfa0b0f6df...](https://github.com/openai/gpt-2/blob/0574c5708b094bfa0b0f6dfe3fd284d9a045acd9/src/model.py#L147-L173)

Especially compared to StyleGAN, BERT, or pretty much anything else.

I used to hate the OpenAI GPT codebase: zero comments? No classes? What does
"mlp" even mean? But over time, I've found myself reverting to their style.

------
ypcx
A few more transformer implementations that I’ve found:

[https://github.com/pbloem/former/blob/master/former/transfor...](https://github.com/pbloem/former/blob/master/former/transformers.py)

[https://github.com/openai/blocksparse/blob/master/examples/t...](https://github.com/openai/blocksparse/blob/master/examples/transformer/enwik8.py)

[https://github.com/google/trax/blob/master/trax/models/trans...](https://github.com/google/trax/blob/master/trax/models/transformer.py)

------
fpgaminer
This is really cool to see; I'm glad karpathy shared the work.

The internet at large has been really good at taking opaque machine learning
research and elucidating the details and recreating the results. I've seen a
few posts/repositories/etc doing that for GPT-3 as well but man, 175B
parameters is just so far out of reach for hobbyists. It's really a shame.

In time, further research will likely make language models efficient enough
that something GPT-3-like can be trained at the hobbyist level. Probably a
blended model like SHA-RNN, or something with dedicated memory in its
architecture so the model isn't burning precious weights on remembering e.g.
Lincoln's birthday. In the meantime, though, it makes me sad that something
as impressive as GPT-3 is solely the toy of corporations.

~~~
GaryNumanVevo
We recently trained GPT-3 (SMALL) at work on our GPU cluster for fun, took 4
days across a couple dozen machines...

Millions of dollars in CAPEX and OPEX just for one model.

~~~
stainforth
You're saying your project cost millions of dollars, or the big boys'
projects did?

~~~
shmageggy
If "4 days across a couple dozen machines" cost millions, something is very
wrong.

~~~
GaryNumanVevo
Running your own DC is quite expensive with GPU hardware. One DGX-2 is $400k
and draws something like 24kW.

~~~
rajnathani
> draws something like 24 kW.

That number is off. The DGX-2 consumes 10 kW at peak [0] and the DGX-2H
consumes 12 kW at peak [1].

[0] [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-2/dgx-2-print-datasheet-738070-nvidia-a4-web-uk.pdf)

[1] [https://www.nvidia.com/content/dam/en-zz/es_em/Solutions/Dat...](https://www.nvidia.com/content/dam/en-zz/es_em/Solutions/Data-Center/dgx-2/nvidia-dgx-2h-datasheet.pdf)

------
xiphias2
It's fun to find the sources to see how GPT output correlates with the input
data:

Input texts:

Go, rate thy minions, proud insulting boy!

Hither to London, to be crown'd our king. / Welcome, sweet prince, to London,
to your chamber. / Post you to London, and you will find it so / Now to
London, To see these honours in possession.

How will the country for these woful chances Misthink the king and not be
satisfied!

Output:

Go, rating to London, with all these woful chances Misthink the king and not
be satisfied!

~~~
master_yoda_1
Did you understand the correlation? And what part of the source code is doing
it? Or you had just fun without any understanding?

~~~
xiphias2
GPT is using multi-headed attention, so of course it's not as simple as
putting a few texts together, but I was still interested in finding some
similar texts (that can be done because the training data is only 1MB).

~~~
sillysaurusx
That's a really interesting idea. Could you go into detail about how you're
searching for similar texts using GPT?

It's true that the probability distribution is a sort of "edit distance". And
GPT has already been used for text compression:
[https://bellard.org/nncp/gpt2tc.html](https://bellard.org/nncp/gpt2tc.html)
so it doesn't seem too much of a stretch to use it for similarity matching.

(Sure, perhaps there are more efficient or more effective techniques than
using GPT for this, but I like the idea and am curious how it works.)
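
If I had to guess at a sketch: score candidate texts by their negative
log-likelihood under the model, so texts the model finds "cheap" to predict
are closer to the training distribution. Here `model` and `encode` are
assumed to be a trained minGPT-style LM (returning logits) and its tokenizer;
nothing in the thread confirms this setup:

    import torch
    import torch.nn.functional as F

    # Sum the per-token log-probs the model assigns to a text; a lower NLL
    # means the text is more "GPT-like" under the trained distribution.
    def nll(model, encode, text):
        idx = torch.tensor([encode(text)])   # (1, T) token indices
        logits, _ = model(idx[:, :-1])       # logits for each next token
        logp = F.log_softmax(logits, dim=-1)
        targets = idx[:, 1:].unsqueeze(-1)   # the tokens that actually follow
        return -logp.gather(2, targets).sum().item()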

~~~
xiphias2
I was using Ctrl-F in Chrome for the words in the training data (Shakespeare
texts).

With Transformer models it's quite easy to print out top Query-Key pairs to
debug what happened, but that was not my intention.
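
For anyone curious, a sketch of that kind of Query-Key debugging for a single
attention head (q and k are assumed to be the per-head projections;
illustrative only):

    import torch.nn.functional as F

    # For each query position, show the top-k key positions it attends to
    # and with what weight. q, k: (T, d) tensors for one head.
    def top_attention(q, k, topk=3):
        att = F.softmax(q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5), dim=-1)
        return att.topk(topk, dim=-1)  # (weights, key positions), each (T, topk)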

------
haolez
Is it possible to build a useful transformer without investing millions of
dollars?

~~~
grandmczeb
People have built passable translation systems with transformers using a
single high end GPU.

------
mark_l_watson
This is great. Years ago his RNN example helped me a lot.

------
fpgaminer
Just started a run of play_char on my 8GB 2070. Had to drop batch size to 128.
Getting ~2.2 iterations per second, so it looks like it's going to take two
hours for training to finish. I don't expect my training to differ from
karpathy's, but I'm curious to play around with the trained model a bit.
Already ran the math one and got the ~same results.

~~~
fpgaminer
In the committed notebook the final training loss for play_char was 0.30588.
Yet my training got down to 0.02638. Odd. Either way the resulting model seems
to be just as good/bad. Like char-rnn it's amazing to see it spell words from
scratch consistently. It has a good grasp on structure and even a passable
grasp on grammar. But also like char-rnn it lacks any ability to form coherent
sentences.

EDIT: I'm running it on the IMDB dataset now ... just to see.

~~~
fpgaminer
Running for two epochs on the IMDB dataset (133MB corpus) it only got to a
loss of 1.1. Likely the regularization is too high (I didn't tweak the
hyperparameters at all, and assume regularization was quite high for the
limited tinyshakespeare corpus). Either way, it at least started to learn more
grammar:

Prompt: _This is my review of Lord of the Rings._

> I can't tell why the movie is a story with a lot of potential the main
> reason I want to see a movie that is Compared to the Baseball movie 10 Both
> and I can say it was not just a bad movie.

------
kamalkraj
A TensorFlow re-implementation of minGPT: [https://github.com/kamalkraj/minGPT-TF](https://github.com/kamalkraj/minGPT-TF)

------
studentdev
From the GitHub README: "The rest of the complexity is just being clever with
batching (both across examples and over sequence length) so that training is
efficient."

What kind of complexities is he talking about? Is it simply the complexity of
having a batch dimension (compared to simpler single-input code)?
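
For context, the batching looks roughly like this (modeled on minGPT's
CharDataset; details may differ from the actual repo):

    import torch
    from torch.utils.data import Dataset

    # Every item is a block_size+1 character chunk; targets are the inputs
    # shifted by one, so a batch of B items trains on B * block_size
    # next-character predictions at once.
    class CharDataset(Dataset):
        def __init__(self, text, stoi, block_size):
            self.text, self.stoi, self.block_size = text, stoi, block_size

        def __len__(self):
            return len(self.text) - self.block_size

        def __getitem__(self, i):
            chunk = self.text[i:i + self.block_size + 1]
            ids = [self.stoi[c] for c in chunk]
            x = torch.tensor(ids[:-1], dtype=torch.long)  # input sequence
            y = torch.tensor(ids[1:], dtype=torch.long)   # shifted by one
            return x, y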

------
iagovar
I've been delaying introducing myself to this field seriously, but this feels
attractive to me. I have a 32GB dual-Xeon machine available over RDP; would
you recommend running this locally?

------
master_yoda_1
Nice, but I am afraid people will write on their resumes that they trained a
GPT model from scratch, and when asked about the details they will admit they
just ran minGPT without understanding it. This is the AI story nowadays. The
best solution is to ignore minGPT and write your own version.

~~~
codezero
That seems fine. How many reddit clones have we seen?

For what it's worth, as a hiring manager who hires technical people, but not
software engineers, these kinds of side projects can really help folks with a
less formal education.

The upside is that if they DO understand it, or at the very least learned
something interesting or useful while working on the project, it will help
them a lot in an interview.

Folks who do what you're worried about exist, and they just don't get hired by
people who aren't impressed with the shallow use of some new technology. There
are also plenty of companies where that person will be fine, not really need
to use GPT to do their job, and everyone will be happy anyways :)

~~~
master_yoda_1
How would you feel if somebody claimed they know Java just because they can
call the Elasticsearch API (or substitute any tool written in Java)? Are you
going to hire them and put them on a project that involves coding in Java?
The issue is a false claim on a resume, which shows no integrity. At the
other extreme, would you go to a doctor for surgery who made false claims
about having performed surgery?

~~~
codezero
Some companies are looking for someone who knows enough Java to use an
ElasticSearch API - it's about knowing your audience, and I think a lot of new
folks just don't know how to target their applications.

For what it's worth, I've gone back and forth over this in my career, and I
do think it has a bit to do with the strength of your assertion. But there's
an amount of naïveté that creeps into resumes, especially since advice is so
widely varied: do you brag, sell yourself, show real projects, or lean on
your past titles?

Anyways, there's no right way, and I've decided that the job seeker is in a
position of weakness relative to employers and the industry in general, and
if people are seeking to better themselves - great.

When I interview and hire people, I'm the one who screens out people who lie
or are incapable of doing their job, what they put on their resume is just
part of the process.

Your medical example is a bad one, sorry; I'm not going to engage with it.
There are gatekeepers in almost all industries, and I'm not proposing
changing them by saying that someone putting Java on their resume is not the
same as a doctor lying about their qualifications. One, because systems exist
to vet those qualifications, and two, because they're simply not on the same
spectrum.

How many people have you screened, hired, and for how many roles? Is this a
problem you've run into or experienced professionally, or just something that
annoys you?

I've experienced both. When I ran a satellite, I employed interns, most of
whom were CS majors who claimed to know C, C++, and Java. I asked them to
write strcpy in C as my interview; none of them could do it, and these people
were juniors in college. So no, I'm not too worried about what people put on
their resume, because it's just not ever representative of their actual
abilities.

If someone claims to know Java and gets a job with no actual test of their
skill, that's the manager's fault.

------
refulgentis
The Tesla Autopilot rewrite due in 6-10 weeks that'll enable self-driving
cars, promised by Elon on Friday, must be going well if the head of Autopilot
is playing around with language ML models and open-sourcing them.

[https://twitter.com/elonmusk/status/1294374864657162240?s=20](https://twitter.com/elonmusk/status/1294374864657162240?s=20)

~~~
adamnemecek
He's allowed to do other things. Also, maybe his part of the project is done.
Also the thing might be delayed.

~~~
duaoebg
If they’re giving up on autopilot then fine. But if people’s lives are riding
on the quality of the work then I think it sets a high bar for personal
projects.

~~~
mynameisvlad
I'm sorry, what? Are people not allowed to have free time and hobbies anymore?

Just because someone's working on a high-profile high-risk project does not in
any way change their personal time: it's theirs to do what they want with.

~~~
duaoebg
Society has given Karpathy et al. a lot of latitude to take liberties with
other people's lives. In my view this creates an obligation to do the best
job humanly possible.

~~~
mynameisvlad
That's all well and good for his _job_, and even that assertion is debatable.
He's a human after all.

But this is a personal project. Something that is unrelated to his job. So it
doesn't really matter what "bar" it is because he's doing it for his own
benefit, not Tesla's or "humanity's".

Unless you're implying he can't have personal hobbies anymore. Which was the
exact point of my previous comment.

~~~
duaoebg
His job affects people's lives. He certainly is human. I'm not criticising
his project. I'm not saying he can't have personal hobbies. And I'm not the
one who decides the bar.

I'm saying that there is a bar and it's higher when lives depend on it.

There is a long list of activities that people could do in their personal time
that would be considered inappropriate. What is done outside of work is
relevant. It can form part of a character assessment which can have political
and legal ramifications.

If Tesla ends up with a dangerous product then it becomes very relevant.

~~~
mynameisvlad
There is no bar on personal projects. Period. That's the whole point of it
being a _personal_ project.

At least you've made it perfectly clear that you don't think people should
have lives outside of work. Hope you also subscribe to the same ideology
yourself and don't have any personal projects or hobbies (why are you on HN
anyway, shouldn't you be working right now?) or else you'd easily be
considered a hypocrite.

~~~
duaoebg
There is a bar on personal projects. Period. Period.

Either you're misunderstanding logic, law, or society. If Karpathy fails, he
runs a real risk of having his life very closely examined. The rules are
different for different people.

I do machine-learning medical work for the coronavirus. I work as near to
optimally as humanly possible. Occasional hackernews is one of my few
indulgences.

~~~
mynameisvlad
Agree to disagree. My company has no control over the work I do in my personal
time, and neither does some random person on the other side of the world. This
applies to me, you, and everyone else. Nobody is special because "they do
important work".

By your logic, you should have no indulgences, so you should probably stop
posting.

