A ChatGPT clone, in 3000 bytes of C, backed by GPT-2 (2023) (carlini.com)
350 points by chubot 60 days ago | 118 comments



I haven't run the code, but I'm impressed by the small size. The first ELIZA programs were larger. And within the past 4 years we've gotten to the point where this fits into 3000 bytes.

If someone has a hint about where the magic lies, please explain it to me. Is it the GELU function, or the model that's downloaded through the bash script?


Most of the magic is in the 475 MB model file downloaded through the Bash script.


And probably like 100 GB of training data or whatever :). HN has had a heady debate at least once about whether that training data counts as source or not.


I guess the training data should be called assets. Analogously to how you can download the Quake source, and build it and get a working executable, but without the (non-open) assets you cannot replicate what you get by buying the game.


By that analogy, the executable code would be the game engine, the model that inference is done on would be the assets, and the training data would be... I guess the PSDs and high-poly models that get compressed and simplified to turn them into game assets.


It depends on whether the analogue of playing a game is training or inference. If the former, then the training data are the assets, the training software is the game, and the resulting model is the game experience (or if you want something more concrete, savegames and recorded play I guess).

But yeah, if gameplay equals the use a trained model, then the model is an asset bundle (the pak0.pak if you will) and training data is the original models, textures etc, and the training software is all the programs that are used in the asset production pipeline.


Well, consider that training is a batch-processing step, while inference is the part that's interactive. So training is more analogous to compilation than to gameplay.


Not really impressed after running it.

AI: How can I help you? Human: Who are you? AI: I am Alice. Human: Tell me something about a computer. AI: I am a computer model trained by OpenAI. How can I help you? Human: What is a computer? AI: I am a computer model trained by OpenAI. How can I help you? Human: Explain mathematical addition. AI: Explain mathematical multiplication. Human: 2+2 AI: 2+2 Human: Sum 2+2 AI: Sum 2+2 Human: What are your capabilities? AI: I am a computer model trained by OpenAI. How can I help you.


I'm impressed that it's even that conversational, given that GPT-2 was never trained for conversations.


I remember playing with GPT-2 when it came out. A friend and I exported our chat logs, fine tuned GPT-2 on them, and had it simulate conversations between us. It was hilarious and sometimes disturbingly accurate.

I'm wondering what caused the quantum leap between GPT-2 and 3? Bigger model, more data, or both? I know RLHF makes a huge difference, but even the base GPT-3 model (text completion) was very useful, given enough examples.


I don't know, GPT-2 wrote some of my favorite fairy tales:

https://deepdreams.stavros.io/episodes/the-princess-the-fair...


That is great, genuinely entertaining, and a good story for listening to before sleep.

So did you make it with that GPT-2 from this page?


No, it's three years old. It was made with OpenAI's GPT-2.


That was impressive, weird, and 90% coherent. Which does give it a special kind of uncanny vibe.


I know, right?! I love it. Unfortunately, GPT-3 and later were much less weird and wonderful, because they were perfectly coherent.


"While I mostly put this together for fun, it's a nice demonstration how simple neural networks actually are."

Psst, don't tell anyone. Artificial Intelligence is the black magic we do to make money.


Is GPT-2 instruction tuned so that it can actually be used for chat? Otherwise I feel it’s more than just a stretch to call this a ChatGPT clone.


From the article:

> You can then use this to create something like Chat GPT---just so long as you don't care about the quality of the output. (It's actually pretty terrible output, objectively speaking... But it does run.)

It's unusable and bears no relationship to ChatGPT other than the name-dropping. But it's a program that compiles and runs.

By the looks of those in this discussion who are giving high praises about the capabilities of a project whose author admittedly states it doesn't really work, I guess the point is using buzzwords to bait the bandwagon types.


People "demake" games all the time. Nobody complains if those games are missing most of the features of the original because the core gameplay loop (the essence) is the same. That's the point here -- you can make a small-scale, low-resolution version of a major program in a very small space.

> By the looks of those in this discussion who are giving high praises about the capabilities of a project whose author admittedly states it doesn't really work, I guess the point is using buzzwords to bait the bandwagon types.

Instead of assuming everyone else is naive, perhaps consider another perspective?


The point seems to be making a small binary to run a language model. How useful is that? From a functional perspective, I guess not very, but the model can be improved. From a performance/cost perspective, also not very, because most of the cost is in training the model, and a small binary isn't necessarily indicative of speed. I guess it's just kind of interesting that it doesn't take that much code to run the model.


My thinking was more that ChatGPT is called "Chat" because it allows users to chat with a Generative Pre-trained Transformer.


> (TAKE THAT LANGUAGES WITH PROPER MACROS. LISP ISN'T ALWAYS BETTER THAN C!)

Ok, allowed this time as punching up.

--

If you missed the code link (it's buried in the text): https://github.com/carlini/c-chat-gpt-2


I've seen better with classical AI chatbots:

https://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas...

Splotch will compile it fine on modern unixen with few changes.


Did anybody run it locally to see what kind of output is generated by this GPT-2?


It kind of always gives me the same output over and over again. It's kinda neat though. I'll have to look into it and tweak it myself. I've wanted to play around with GPT-2 locally for quite some time.

Thanks for the post!


After reading it: this should produce the exact same output as whatever GPT-2 model you load normally, given the same temperature and seed. I didn't see those myself in the code, but I was mostly trying to figure out why it was obfuscated; I think the unobfuscated code isn't too terribly much longer (10,000 chars?) and would still be impressive to behold on a screen.


The regular code is linked in the article:

https://github.com/carlini/c-chat-gpt-2/blob/main/c_chat_gpt...


These days, you can easily implement your own ChatGPT in a snap by using gptscript: https://github.com/gptscript-ai/gptscript


GELU really is like magic:

UNARY(GELU, b / 2 * (1 + tanh(.7978845 * (b + .044715 * b * b * b))))


This is just a practical approximation to the actual mathematical definition of GELU, which is `GELU(x) := x * Φ(x)` where Φ(x) is the CDF of the standard Gaussian distribution.


Isn't that just erf()?


They are related (Φ(x) = (1 + erf(x/√2)) / 2), but the error function approaches -1 for large negative numbers, while Φ(x) approaches 0, and so does x * Φ(x).
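
For the curious, here's a minimal standalone C sketch (my own, not from the article) comparing the article's tanh approximation against the exact x * Φ(x), using Φ(x) = (1 + erf(x/√2)) / 2:

    #include <math.h>
    #include <stdio.h>

    /* The tanh-based approximation from the article's UNARY(GELU, ...) line. */
    static double gelu_approx(double x) {
        return x / 2 * (1 + tanh(0.7978845 * (x + 0.044715 * x * x * x)));
    }

    /* Exact GELU: x * Phi(x), with Phi(x) = (1 + erf(x / sqrt(2))) / 2. */
    static double gelu_exact(double x) {
        return x * 0.5 * (1 + erf(x / sqrt(2.0)));
    }

    int main(void) {
        for (double x = -4; x <= 4; x += 1)
            printf("x=%5.1f  approx=%.6f  exact=%.6f\n", x, gelu_approx(x), gelu_exact(x));
        return 0;
    }

Compile with something like cc gelu.c -lm; over the typical activation range the two columns agree to several decimal places.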


Fast inverse square root lookalike.


You can hand that GELU definition to a mathematician and they can interpret it as a function of a real number b. The definition does not depend upon b being a floating-point number with a particular bit representation.

In contrast, the fast inverse square root really exploits the bit representation of a floating point input to cheaply compute an initial guess.
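
For contrast, here's a sketch of the well-known Quake III-style trick being referred to (not anything from the article); the "magic" lives entirely in reading the float's bit pattern as an integer:

    #include <stdint.h>
    #include <string.h>

    /* Classic fast inverse square root: reinterpret the float's bits as an
       integer to get a cheap initial guess, then refine with one Newton step. */
    static float fast_rsqrt(float x) {
        float half = 0.5f * x;
        uint32_t i;
        memcpy(&i, &x, sizeof i);      /* the bit representation, not the value */
        i = 0x5f3759df - (i >> 1);     /* magic-constant guess at 1/sqrt(x) */
        memcpy(&x, &i, sizeof x);
        x = x * (1.5f - half * x * x); /* one Newton-Raphson iteration */
        return x;
    }

Nothing in the GELU formula above depends on the representation like that; it's ordinary real-valued arithmetic.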


I don't see how C macros are anything like regex. They match words and replace them with different text. Regex is about matching text with relatively complex patterns, and by itself doesn't do any text replacement.
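
To make that concrete, a toy example of the kind of token replacement macros do (not the article's actual macro):

    /* The preprocessor substitutes tokens blindly, before the compiler runs: */
    #define SQUARE(x) ((x) * (x))

    /* SQUARE(a + 1) expands to ((a + 1) * (a + 1)); no pattern matching,
       just parameter-for-argument text substitution. */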


So will it ever be possible to do something similar with ChatGPT 3.5 or so?


What type of hardware could this run on?

Could quantized weights on huggingface be used with this?

What type of problems or queries would this be really good at?


I love the way it's written: see this ChatGPT-like code that fits in your screen? Let's break it down!


Intelligence as a new fundamental layer! These small programs that do so much really inspire me.


Should try with smollm2 which is one of the smallest instructed models.


Thanks. For the learning it's really good.


At least it's kind of a conversation.

bash ./run.sh

AI: How can I help you? Human: who are you AI: I am alice. Human: what can you do for me? AI: I can help you. Human: how to say bird in Chinese? AI: bird in Chinese. Human: 2+2=? AI: 2+2= bird.


This is what it was like to live in 2019


How long would it take for people to accept this as ground truth, fighting and dying for it, and starting the next cult/religion?


Have you accepted our lord and saviour, the machine-proclaimed 2+2bird into your life as the one and only true way to eternal life and happiness?


If only 2+2bird had died for our sins, but alas, it wasn't dead, it was merely resting.


Is it an African swallow or a European swallow?


I don’t know that!


All hail the 4 bird


Scholars are still fighting whether it’s 4bird or 2+2bird.

Bitter clashes between 4 and 2+2 birders around the world resulted in dozens injured, prompting the UN to call for peace between the two churches.


Everyone knows (2+2bird) times its complex birdy conjugate (2-2bird) == 8bird


It is both, the holy binity.


Well my kids are already convinced 1+1=window


8 seconds. It's a CULT! There.


Historically speaking, not long, sometimes as little as a few weeks worth of minimal propaganda. The trick is to target the memes at people who are feeling victimized with Big Math's fascist BS.


> Human: how to say bird in Chinese? AI: bird in Chinese.

Well, it knows how to be snarky!


Are commenters even reading the article, or just bitching about how GPT-2 sucks or about the pedantry of LoC metrics?

Look at this article a different way. The author put a lot of information in a small context window so that it's easier for readers to understand transformers. They included a useful code highlighter to ground it.

To soooo many people, even those with strong math/science backgrounds, GPT is magic. This article opens the spell book and lays it out as a computation graph with all the fixings. The code isn't abominable, especially when paired with the prose.

It is a good piece of education and I will also call it art.


Also, demystifying the 'complexity'.

I see so many articles saying, or assuming, that 'those software people know what is happening', because it is programmed. I hear 'well they programmed it, so they know what is going on right?'.

This pretty clearly shows: look, it isn't a lot of code, and look what happens. Even if GPT-2 isn't all that great, look at the amount of code; it is small, not some pile of millions of lines of complex programming making it work.


"I am not a madman for saying that it is likely that the code for artificial general intelligence is going to be tens of thousands of lines of code, not millions of lines of code. This is code that conceivably one individual could write, unlike writing a new web browser or operating system."

- John Carmack, Lex Fridman Podcast (August 4th, 2022)

This was around 3 months before ChatGPT's initial release.

Timestamped: https://www.youtube.com/watch?v=I845O57ZSy4&t=14677s


I listened to that podcast at the time. I remember being surprised about one thing in particular. He was talking about predictions of when the first people will land on Mars. He said that we tend to underestimate many of the challenges involved, and so we tend to think it will happen sooner than it is reasonable to expect. He also recounted a time he was talking with other people about it, but since they had decided to bet money on whose prediction would be closest to reality, the people involved started considering the challenges more in depth and being more cautious about claiming it would happen very soon. But then he threw all of his arguments about the bias towards unrealistic optimism and the need for more nuanced analysis out of the window when they started talking about when we will have AGI.


For a detailed accounting of the challenges involved, I recommend this video:

Why It Would Be Preferable To Colonize Titan Instead Of Mars https://youtu.be/_InuOf8u7e4?si=hRO1ZYCZtbQXUuK9

If you fully watch it, or already know these issues, the notion of going to Mars any time soon seems outright foolish.

If you have a large enough platform, it might still be useful as a fundraising campaign, a PR stunt, or a way to rally the uninformed masses around a bold vision. But that's about it.


His take was not really "novel", however; John McCarthy said basically the same thing multiple times in the 90s and maybe even the 80s. He would say something along the lines of: "If we ever get to an algorithm that expresses general intelligence, we will be able to write it down in one or two pages of a manual. Such a book will still be rather long, and the rest of the pages will be about how we got to that algorithm and why it took us so long."


"Attention Is All You Need" and GPT-2 were well known at that point. Many might doubt whether this approach leads to "general" intelligence – that depends on the definition.

BTW, Karpathy has a nice video tutorial about building an LLM: https://www.youtube.com/watch?v=kCc8FmEb1nY


Depends what you mean by 'code', of course. There are some pretty high-level libraries doing heavy lifting in those tens of thousands of lines. And if you also count the training data and the billions of encoded parameters required to make this work as "code", then that's a lot of "lines".

At the other extreme, if you're happy to exclude 'libraries', then I could wrap these tens of thousands of lines in a bash script and claim I have written an artificial general intelligence in only 1 line.


Everybody who has worked on a large/huge code base knows that most of it is there for historical reasons: it's always safer to add a new feature than to replace something old, as most of the time people don't know if that old code is useful or not.

BTW, tinygrad shows that tens of thousands of lines are enough; it skips even most of the AMD kernel drivers and talks to the hardware directly.


I have seen people rewrite large chunks of functionality, not knowing it already existed in a form they didn't know about. I have done it myself a few times: write a largish blob of code, then replace it with one system call that does the same thing.

Also, most things are simple. It is the data, libraries, and applications built from those simple things that are interesting. Take C++ for example: the actual language is a few dozen keywords, but the collection of libraries and things built using those small pieces is quite large.

Also, most drivers are written so that we can reuse the code, but with the right docs you can twiddle the hardware yourself. That was 'the 80s/90s way'. It was not until Mac OS/OS/2/Windows/Xlib that we abstracted the hardware, mostly so we could consistently reuse things.



Time for another Year of Linux (TM) prediction:

Linux (and friends) do a lot of things on the command line. LLMs are good at writing text and using a command line interface. Creating LLMs and AIs takes relatively little work and is interesting, open source is especially good at this type of work. Therefore, I predict that Linux will keep pace with other operating systems when it comes to voice control.


Time for another prediction that a computer will have a PhD by the end of the decade. 1960, 1970, 1980, 1990, 2010, 2020, 2030. Next, yawn.


His claim is just that AGI will largely be comprised of stateless mathematical operations -- all of which could be a single line of code if written that way.

As a claim, it's antique -- it would easily have been a view of Turing and others many decades ago.

And as a claim it's as false now as it was then. Not least because the code that constitutes the actual algorithm for generating an LLM includes all the code that goes into its data collection, which is just being inlined (/cached) as weights.

However, more than that, it's an extremely impoverished view of general intelligence which eliminates any connection between intelligence and a body. All the "lines of code" beyond a single one are concerned with I/O and device manipulation.

Thus this is just another way of repeating the antique superstition that intelligence has nothing to do with embodiment.


> As a claim, it's antique -- it would easily have been a view of Turing and others many decades ago.

Absolutely.

We discussed this in the 90s and were of the opinion, then, that even state-of-the-art NNs (in the 90s) wouldn't get much more complex, given their actual mathematical descriptions.

It's all the bells and whistles around managing training, and real-world input/output translation code.

The core or 'general case' always will be tiny.


Executing computer code is embodied in a physical machine. Modern machines have vision and audio I/O capabilities, for example.

You seem to be assuming, without any evidence, a narrow view of what embodiment entails.

The "antique superstition" claim is particularly ironic in this context, since it much more clearly applies to your own apparently irrational hasty conclusion.


What claim do you think I'm making?


> antique claim ... extremely impoverished ... antique supersition

Frankly, when the total argument against a position consists of such "boo" words, I immediately suspect some projection of personal preferences.

But anyway I googled "intelligence vs embodiment" and found this quite nice summary albeit from 2012: https://pmc.ncbi.nlm.nih.gov/articles/PMC3512413/

The basic idea seems to be that human conscious experience is a mishmash of sensory-reflexive impulses and internal deliberation, combined into a sophisticated narrative. Simulating this kind of combination may help robots to move about in the world and to relate more closely to human experience. I have a lot of sympathy with this although I'm guarded about how much it really tells us about the potentials for AGI.

> Not least because all the code which is the actual algorithm for generating an LLM is all the code that goes into its data collection which is just being inlined (/cached) with weights.

Could this "data collection" code not potentially be put into a few thousand lines also?


> Could this "data collection" code not potentially be put into a few thousand lines also?

This would mean hardcoding a model of the world, which maybe, with the help of some breakthrough from physics, would be possible (though I think the kind of breakthrough needed to get down to such a small size would be a theory of everything). But it would also mean eliminating the self-learning part of current neural networks, which is what makes them attractive: you don't have to hardcode the rules, as the model is able to learn them.


What I mean is, we could program an entity which gathers its own data and updates its own weights. With the lines of code being in the tens of thousands.


But the OP's point was that people talking about tiny LLMs fail to consider the number of parameters used. That has nothing to do with the ability to autonomously collect the training data.


That's also why Nvidia's position is fragile. The cost of the GPUs is so high that at some point AI will have to be profitable, and it is practical for those models to be ported to a competing platform.


What's the point of minifying code that is anyways going to be compiled?


People ask me, 'What is the use of climbing Mount Everest?' and my answer must at once be, 'It is of no use.' There is not the slightest prospect of any gain whatsoever. Oh, we may learn a little about the behavior of the human body at high altitudes, and possibly medical men may turn our observation to some account for the purposes of aviation. But otherwise nothing will come of it. We shall not bring back a single bit of gold or silver, not a gem, nor any coal or iron.

If you cannot understand that there is something in man which responds to the challenge of this mountain and goes out to meet it, that the struggle is the struggle of life itself upward and forever upward, then you won't see why we go. What we get from this adventure is just sheer joy. And joy is, after all, the end of life. We do not live to eat and make money. We eat and make money to be able to live. That is what life means and what life is for.

--

George Mallory


I agree with Mallory that we make money to live but I also think that modern technological society has turned pursuits into highly specialized perversions that aren't even inspiring any more.

The worldwide coverage of mountain climbing has even turned mountain climbing into a somewhat arbitrary thing in terms of "climbing everests": first it was to the top of the mountain (cool), then it was to the top in a certain time, then it was without supplementary oxygen, then it was climbing all 8,000m+ mountains, then it was doing THAT without oxygen...

At some point it's diminishing returns and becomes stupid -- and results in people just killing themselves, and AI is already on that level.

With every achievement there also comes a responsibility and our society is just achievement without responsibility.


I think you missed the point of the quote entirely. When I do something, I don't do it because nobody has ever done it before. I do it to see if _I_ could do it.

Otherwise nobody would ever run a marathon again.


I didn't miss the point of the quotation. I am not really concerned with your motivations. My point was that Mallory's quotation becomes distorted by technological society to present new challenges that are relatively harmful and stupid, like making a small ChatGPT clone (or even making ChatGPT in the first place).

So while I agree that some people's motivations may be sound, it is how they are applied that is a perversion, and Mallory's quotation contains the germ of a phenomenon that is incredibly dangerous and not as deserving of the awe it receives.


A person had an idea for something they wanted to do. They set about doing it. They undoubtedly ran into difficulties and had to think about how to solve them. They came up with solutions, probably some of the solutions aren't as elegant as they had hoped for, but they reached something that was acceptable, again and again. And eventually they created something that, to some extent, satisfied them.

That process is the joy.

I just don't see the harm in what this person has done? Envisaging something that could be created, and then creating it, is what life is all about.

I just don't see how words like "perversion" and "distortion" can sensibly be applied to this work?


The entire point is that technology shapes society so that joyful activities are subverted to create perversions like AI. The atomic bomb was an intellectually stimulating joy for many physicists, for example. Life is about joy, but it must be accompanied by the responsibility to seek joy in activities that, unlike AI, don't do harm.


Is it carefully minified by hand, or by another program? Because I really, really don't see how calling a minifier is "the struggle of life itself upward".


I'm assuming 3000 bytes refers to the uncompiled minified code


Indeed, and the question is why do that? We can count LOCs just fine already.


Because it'd be very hard to compare short program length otherwise? 5 LoC says nothing.


Technically, it depends on the language. I.e. python doesn't have anything like a semicolon to combine multiple statements and assignments per line

But yeah, for C the LOC metric can be gamed to a silly degree
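
A toy illustration (my own) of how gameable it is; the same program as two "lines" versus a normally formatted handful:

    #include <stdio.h>
    int main(void){int s=0;for(int i=1;i<=10;i++)s+=i;printf("%d\n",s);return 0;}

    /* Two "lines of code", yet a standard formatter would expand the second
       line into roughly eight. Byte count is far harder to game. */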


> python doesn't have anything like a semicolon to combine multiple statements

Python does have a "semicolon to combine multiple statements", and has further (lambda f: f(f)) expressions for complex expressions with local names and scopes.

(not that using either for those would result in pythonic code, but it is certainly not missing from the language)


Oh, you're right. I should've tested that before writing my comment. Thanks for calling me out


Yeah but lines in anything should be "lines formatted with a standard formatter", not just "one line with ten thousand semicolons in it".


What counts as a standard formatter? Python one-liners are Turing complete even without semicolons, evals, execs, and with finite stack depth. E.g. some formatters keep this at one line, while others introduce line breaks https://yonatan.us/misc/bf.html .


You also need to define whether the code has to be raw CPU instructions or if it's allowed to use an external library containing all the code humanity has ever produced.


The usual with this is "whatever ships with the language compiler and OS".


What's the point of art?


To complain about it not having any point on Hacker News, thus creating a paradox where the complaint itself becomes the most discussed and engaging aspect of the post.


This is not what HN is for and your comment has no point. (Upvoted.)


So a standard C program is not art, but if you remove newlines and shorten variable names it suddenly becomes art?


Only if we arbitrarily declare that this is now art. But you can of course say the same for the standard version. There is even a youngster out there who started a work called "The Art of Computer Programming."

We are definitely entitled to lament the good old times when art and mere texnical stuff were clearly and sharply separated, while this new generation ruins societies and our glorious civilizations with its careless use of words that mixes everything together. Well, at least as entitled as any old man in traceable history.

https://history.stackexchange.com/questions/28169/what-is-th...


If it's an entry for the IOCCC, yes ;)


That's craftsmanship, not art.


A talk by Brian Eno convinced me that there is no meaningful difference between the two, and I've yet to see a definition of art that does not end up in self-contradiction if we try to make the distinction.


The point of art is the expression of human experience and ability, but it frequently becomes perverted into serving capitalist tendencies. So:

1. Art becomes content.
2. Automation to make life better becomes automation for profit.
3. Life becomes meaningless achievement.

At some point, there is always the counterpoint of balance and we don't have that.


Jart is already hard at work on a version that’s only 2070 bytes but includes a full lisp interpreter for the prompt parser


That fits inside a QR code. Artificial brain scan.


Never heard of Jart - who or what is it?


Justine Tunney, https://github.com/jart


TL;DR: a code-golf-style C program that does inference on existing TensorFlow model data for GPT-2; not full ChatGPT, and no training or anything.


This isn't code golf. This is just white space removed.


More like an IOCCC + code-golf style implementation of GPT-2.

http://www.ioccc.org


There is actually an IOCCC entry from 2019 that does almost exactly this, except it's an LSTM instead of a transformer: https://www.ioccc.org/2019/mills/prog.c

https://www.ioccc.org/2019/mills/hint.html


This one is actually impressive: it implements Adam training of RNNs, LSTMs, and GRUs.


[flagged]


Hacker News front page isn't a trophy ceremony for smart AI models.


> Isn't the whole point that the models hosted by OpenAI are smarter?

That’s not the point here, because as the link says (3rd paragraph): "It's actually pretty terrible output".


Because it's novel and interesting. This person wrote a super optimized dependency free implementation of GPT in C.


It's not really novel nor super-optimized. Many inference engines are out there. But interesting it is. Probably good learning material too.



