I haven't run the code, but I'm impressed by the small size. The first ELIZA programs were larger. And within the past 4 years we've gotten this down to a few thousand bytes.
If someone has a hint about where the magic lies, please explain it to me. Is it the GELU function, or the model that's downloaded through the bash script?
I guess the training data should be called assets, analogously to how you can download the Quake source, build it, and get a working executable, but without the (non-open) assets you cannot replicate what you get by buying the game.
By that analogy, the executable code would be the game engine, the model that inference is done on would be the assets, and the training data would be... I guess the PSDs and high-poly models that get compressed and simplified to turn them into game assets.
It depends on whether the analogue of playing a game is training or inference. If the former, then the training data are the assets, the training software is the game, and the resulting model is the game experience (or if you want something more concrete, savegames and recorded play I guess).
But yeah, if gameplay equals the use of a trained model, then the model is an asset bundle (the pak0.pak, if you will), the training data is the original models, textures, etc., and the training software is all the programs used in the asset production pipeline.
Well, consider that training is a batch processing step, while inference is the part that's interactive. So training is more analogous to compilation than to gameplay.
AI: How can I help you?
Human: Who are you?
AI: I am Alice.
Human: Tell me something about a computer.
AI: I am a computer model trained by OpenAI. How can I help you?
Human: What is a computer?
AI: I am a computer model trained by OpenAI. How can I help you?
Human: Explain mathematical addition.
AI: Explain mathematical multiplication.
Human: 2+2
AI: 2+2
Human: Sum 2+2
AI: Sum 2+2
Human: What are your capabilities?
AI: I am a computer model trained by OpenAI. How can I help you.
I remember playing with GPT-2 when it came out. A friend and I exported our chat logs, fine tuned GPT-2 on them, and had it simulate conversations between us. It was hilarious and sometimes disturbingly accurate.
I'm wondering what caused the quantum leap between GPT-2 and GPT-3. Bigger model, more data, or both? I know RLHF makes a huge difference, but even the base GPT-3 model (text completion) was very useful, given enough examples.
> You can then use this to create something like Chat GPT---just so long as you don't care about the quality of the output. (It's actually pretty terrible output, objectively speaking... But it does run.)
It's unusable and bears no relationship to ChatGPT other than the name-dropping. But it's a program that compiles and runs.
Judging by those in this discussion who are giving high praise to the capabilities of a project whose author admits it doesn't really work, I guess the point is using buzzwords to bait the bandwagon types.
People "demake" games all the time. Nobody complains if those games are missing most of the features of the original because the core gameplay loop (the essence) is the same. That's the point here -- you can make a small-scale, low-resolution version of a major program in a very small space.
> Judging by those in this discussion who are giving high praise to the capabilities of a project whose author admits it doesn't really work, I guess the point is using buzzwords to bait the bandwagon types.
Instead of assuming everyone else is naive, perhaps consider another perspective?
The point seems to be making a small binary that runs a language model.
How useful is that?
From a functional perspective, I guess not very, but the model can be improved.
From a performance/cost perspective, also not very, because most of the cost is in training the model, and a small binary isn't necessarily a fast one.
I guess it's just kind of interesting that it doesn't take that much code to run the model.
It kind of always gives me the same output over and over again. It's kinda neat though. I'll have to look into it and tweak it myself. I've wanted to play around with GPT-2 locally for quite some time.
After reading it: this should produce exactly the same output whether you load the GPT-2 model normally or in this program, given the same temperature and seed. I didn't verify that myself in the code, but I was mostly trying to figure out why it was obfuscated; I think the unobfuscated code isn't too terribly much longer (10,000 chars?) and would still be impressive to behold on a screen.
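For what it's worth, here is a minimal sketch of why fixed seed plus fixed temperature gives repeatable output. This is not the project's actual sampler, and the logits are made-up values; it just shows the temperature-scaled softmax sampling idea.

```c
/* Toy temperature sampler: with a fixed seed and temperature, the sampled
   sequence is fully deterministic (on a given platform). Logits are made up. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int sample(const float *logits, int n, float temperature) {
    float probs[16];                      /* toy: assumes n <= 16 */
    float maxl = logits[0], sum = 0.0f;
    for (int i = 1; i < n; i++) if (logits[i] > maxl) maxl = logits[i];
    for (int i = 0; i < n; i++) {         /* temperature-scaled softmax */
        probs[i] = expf((logits[i] - maxl) / temperature);
        sum += probs[i];
    }
    float r = (float)rand() / (float)RAND_MAX * sum;
    for (int i = 0; i < n; i++) {         /* draw from the distribution */
        r -= probs[i];
        if (r <= 0.0f) return i;
    }
    return n - 1;
}

int main(void) {
    const float logits[4] = {2.0f, 1.0f, 0.5f, 0.1f};
    srand(42);                            /* same seed => same "tokens" every run */
    for (int t = 0; t < 8; t++) printf("%d ", sample(logits, 4, 0.8f));
    putchar('\n');
    return 0;
}
```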
This is just a practical approximation to the actual mathematical definition of GELU, which is `GELU(x) := x * Φ(x)` where Φ(x) is the CDF of the Gaussian distribution.
You can hand that GELU definition to a mathematician and they can interpret it as a function of a real number x. The definition does not depend upon x being a floating-point number with a particular bit representation.
In contrast, the fast inverse square root really exploits the bit representation of a floating point input to cheaply compute an initial guess.
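To make the contrast concrete, here is a rough sketch using standard formulas (not taken from this program's source): the exact GELU and the tanh approximation commonly used for GPT-2 are ordinary real-valued math, while the Quake-style fast inverse square root only works because of the IEEE-754 bit layout of a float.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
   Phi(x) = 0.5 * (1 + erf(x / sqrt(2))). */
float gelu_exact(float x) {
    return 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));
}

/* Tanh approximation commonly used for GPT-2 (0.7978845608 ~= sqrt(2/pi)). */
float gelu_tanh(float x) {
    return 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
}

/* Quake-style fast inverse square root: the initial guess only makes sense
   because of the bit-level representation of the float. */
float fast_rsqrt(float x) {
    float half = 0.5f * x, y = x;
    uint32_t i;
    memcpy(&i, &y, sizeof i);             /* reinterpret the float's bits */
    i = 0x5f3759dfu - (i >> 1);           /* magic constant + exponent trick */
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - half * y * y);     /* one Newton-Raphson refinement */
}

int main(void) {
    for (float x = -2.0f; x <= 2.0f; x += 1.0f)
        printf("x=%5.2f  exact=%8.5f  tanh approx=%8.5f\n",
               x, gelu_exact(x), gelu_tanh(x));
    printf("fast_rsqrt(4) ~= %f (exact: 0.5)\n", fast_rsqrt(4.0f));
    return 0;
}
```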
I don't see how C macros are anything like regex. They match words and replace them with different text. Regex is about matching text with relatively complex patterns, and by itself doesn't do any text replacement.
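A toy illustration of that difference (nothing to do with this project's actual macros): the preprocessor substitutes tokens and macro arguments, whereas a regex tool such as sed rewrites raw characters with no notion of tokens or argument boundaries.

```c
#include <stdio.h>

#define ANSWER 42                 /* object-like macro: plain token replacement */
#define SQ(x) ((x) * (x))         /* function-like macro: argument substitution */

int main(void) {
    /* SQ(3 + 1) expands to ((3 + 1) * (3 + 1)) before compilation,
       so this prints 42 and 16. A character-level rewrite like
       sed 's/SQ(\(...\))/.../' would know nothing about tokens or scope. */
    printf("%d %d\n", ANSWER, SQ(3 + 1));
    return 0;
}
```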
AI: How can I help you?
Human: who are you
AI: I am alice.
Human: what can you do for me?
AI: I can help you.
Human: how to say bird in Chinese?
AI: bird in Chinese.
Human: 2+2=?
AI: 2+2= bird.
Historically speaking, not long, sometimes as little as a few weeks' worth of minimal propaganda. The trick is to target the memes at people who are feeling victimized by Big Math's fascist BS.
Are commenters even reading the article, or just bitching about how GPT-2 sucks or engaging in pedantry over LoC metrics?
Look at this article a different way. The author put a lot of information in a small context window so that it's easier for readers to understand transformers. They included a useful code highlighter to ground it.
To soooo many people, even those with strong math/science backgrounds, GPT is magic. This article opens the spell book and lays it out as a computation graph with all the fixings. The code isn't abominable, especially when paired with the prose.
It is a good piece of education and I will also call it art.
I see so many articles saying, or assuming, that 'those software people know what is happening' because it is programmed. I hear 'well, they programmed it, so they know what is going on, right?'
This is pretty clearly showing: look, it isn't a lot of code, and look what happens. Even if GPT-2 isn't all that great, look at the amount of code: it is small, not some pile of millions of lines of complex programming making it work.
"I am not a madman for saying that it is likely that the code for artificial general intelligence is going to be tens of thousands of lines of code, not millions of lines of code. This is code that conceivably one individual could write, unlike writing a new web browser or operating system."
- John Carmack, Lex Fridman Podcast (August 4th, 2022)
This was around 3 months before ChatGPT's initial release.
I listened to that podcast at the time. I remember being surprised about one thing in particular. He was talking about predictions of when the first people will land on Mars. He said that we tend to underestimate many of the challenges involved, and so we tend to think it will happen sooner than it is reasonable to expect. He also recounted a time he was talking with other people about it, but since they decided to bet money on whose prediction would be closest to reality, the people involved started considering the challenges more deeply and became more cautious about claiming it would happen very soon. But then he threw all of his arguments about the bias toward unrealistic optimism and the need for more nuanced analysis out of the window when they started talking about when we will have AGI.
If you fully watch it, or already know these issues, the notion of going to Mars any time soon seems outright foolish.
If you have a large enough platform, it might still be useful as a fundraising campaign, a PR stunt, or a way to rally the uninformed masses around a bold vision. But that's about it.
His take was not really "novel", however; John McCarthy said basically the same thing multiple times in the '90s, and maybe even the '80s. He would say something along the lines of: "If we ever get to an algorithm that expresses general intelligence, we will be able to write it down in one or two pages of a manual. Such a book will still be rather long, and the rest of the pages will be about how we got to that algorithm and why it took us so long."
"Attention Is All You Need" and GPT-2 were well known at that point. Many might doubt whether this approach leads to "general" intelligence; that depends on the definition.
Depends what you mean by 'code', of course. There are some pretty high-level libraries doing heavy lifting in those tens of thousands of lines. And if you also count the training data and the billions of encoded parameters required to make this work as "code", then that's a lot of "lines".
At the other extreme, if you're happy to exclude 'libraries', then I could wrap these tens of thousands of lines in a bash script and claim I have written an artificial general intelligence in only 1 line.
Everybody who has worked on a large/huge code base knows that most of it is there for historical reasons: it's always safer to add a new feature than to replace something old, as most of the time people don't know if that old code is useful or not.
BTW, tinygrad shows that tens of thousands of lines are enough; it skips even most of the AMD kernel drivers and talks to the hardware directly.
I have seen people rewrite large chunks of functionality, not knowing it was already there in a form they didn't know about. I have done it myself a few times: write a largish blob of code, then replace it with one system call that does the same thing.
Also, most things are simple. It is the data, libraries, and applications built from those simple things that are interesting. Take C++, for example. The actual language is a few dozen keywords, but the libraries and things built using those small pieces are quite large.
Also, most drivers are written so that we can reuse the code. But with the right docs you can twiddle the hardware yourself; that was 'the 80s/90s way'. It was not until macos/os2/windows/xlib that we abstracted the hardware, mostly so we could consistently reuse things.
Linux (and friends) do a lot of things on the command line. LLMs are good at writing text and using a command line interface. Creating LLMs and AIs takes relatively little work and is interesting, open source is especially good at this type of work. Therefore, I predict that Linux will keep pace with other operating systems when it comes to voice control.
His claim is just that AGI will largely be comprised of stateless mathematical operations -- all of which could be a single line of code if written that way.
As a claim, it's antique -- it would easily have been a view of Turing and others many decades ago.
And as a claim it was as false then as it is now. Not least because the actual algorithm for generating an LLM includes all the code that goes into its data collection, which is just being inlined (/cached) into the weights.
However, more than that, it's an extremely impoverished view of general intelligence, one which eliminates any connection between intelligence and a body. On that view, all the "lines of code" beyond that single one are merely concerned with I/O and device manipulation.
Thus this is just another way of repeating the antique superstition that intelligence has nothing to do with embodiment.
> As a claim, it's antique -- it would easily have been a view of Turing and others many decades ago.
Absolutely.
We discussed this in the 90s and were of the opinion, then, that even state-of-the-art NNs (in the 90s) wouldn't get much more complex, because of their actual mathematical descriptions.
It's all the bells and whistles around managing training, and real-world input/output translation code.
Executing computer code is embodied in a physical machine. Modern machines have vision and audio I/O capabilities, for example.
You seem to be assuming, without any evidence, a narrow view of what embodiment entails.
The "antique superstition" claim is particularly ironic in this context, since it much more clearly applies to your own apparently irrational hasty conclusion.
The basic idea seems to be that human conscious experience is a mishmash of sensory-reflexive impulses and internal deliberation, combined into a sophisticated narrative. Simulating this kind of combination may help robots to move about in the world and to relate more closely to human experience. I have a lot of sympathy with this although I'm guarded about how much it really tells us about the potentials for AGI.
> Not least because all the code which is the actual algorithm for generating an LLM is all the code that goes into its data collection which is just being inlined (/cached) with weights.
Could this "data collection" code not potentially be put into a few thousand lines also?
> Could this "data collection" code not potentially be put into a few thousand lines also?
This would mean hardcoding a model of the world. Which maybe, with the help of some breakthrough from physics, would be possible (but I think the kind of breakthrough needed to get down to such a small size would be a theory of everything). But this means eliminating the self-learning part of current neural networks, which is what makes them attractive: you don't have to hardcode the rules, as the model will be able to learn them.
What I mean is, we could program an entity which gathers its own data and updates its own weights, with the lines of code numbering in the tens of thousands.
But the OP's point was about people who talk about tiny LLMs failing to consider the number of parameters used. That has nothing to do with the ability to autonomously collect the training data.
That's also why Nvidia's position is fragile. The cost of the GPUs is so high that at some point AI will have to be profitable, and it is practical for those models to be ported to a competing platform.
People ask me, 'What is the use of climbing Mount Everest?' and my answer must at once be, 'It is of no use.' There is not the slightest prospect of any gain whatsoever. Oh, we may learn a little about the behavior of the human body at high altitudes, and possibly medical men may turn our observation to some account for the purposes of aviation. But otherwise nothing will come of it. We shall not bring back a single bit of gold or silver, not a gem, nor any coal or iron.
If you cannot understand that there is something in man which responds to the challenge of this mountain and goes out to meet it, that the struggle is the struggle of life itself upward and forever upward, then you won't see why we go. What we get from this adventure is just sheer joy. And joy is, after all, the end of life. We do not live to eat and make money. We eat and make money to be able to live. That is what life means and what life is for.
I agree with Mallory that we make money to live but I also think that modern technological society has turned pursuits into highly specialized perversions that aren't even inspiring any more.
The worldwide coverage of mountain climbing has even turned mountain climbing into a somewhat arbitrary thing in terms of "climbing everests": first it was to the top of the mountain (cool), then it was to the top in a certain time, then it was without supplementary oxygen, then it was climbing all 8,000m+ mountains, then it was doing THAT without oxygen...
At some point it's diminishing returns and becomes stupid -- and results in people just killing themselves, and AI is already on that level.
With every achievement there also comes a responsibility and our society is just achievement without responsibility.
I think you missed the point of the quote entirely. When I do something, I don't do it because nobody has ever done it before. I do it to see if _I_ could do it.
I didn't miss the point of the quotation. I am not really concerned with your motivations. My point was that Mallory's quotation becomes distorted by technological society to present new challenges that are relatively harmful and stupid, like making a small ChatGPT clone (or even making ChatGPT in the first place).
So while I agree that some people's motivations may be sound, it is how they are applied that is a perversion, and Mallory's quotation carries the germ of a phenomenon that is incredibly dangerous and not as deserving of the awe it receives.
A person had an idea for something they wanted to do. They set about doing it. They undoubtedly ran into difficulties and had to think about how to solve them. They came up with solutions, probably some of the solutions aren't as elegant as they had hoped for, but they reached something that was acceptable, again and again. And eventually they created something that, to some extent, satisfied them.
That process is the joy.
I just don't see the harm in what this person has done? Envisaging something that could be created, and then creating it, is what life is all about.
I just don't see how words like "perversion" and "distortion" can sensibly be applied to this work?
The entire point is that technology shapes society so that joyful activities are subverted to create perversions like AI. The atomic bomb was an intellectually stimulating joy for many physicists for example. Life is about joy but it must be accompanied by the responsibility to seek joy in activities that don't do harm like AI.
Is it carefully minified by hand, or by another program? Because I really, really don't see how calling a minifier is "the struggle of life itself upward".
> python doesn't have anything like a semicolon to combine multiple statements
Python does have a "semicolon to combine multiple statements", and it further has (lambda f: f(f))-style expressions for complex expressions with local names and scopes.
(Not that using either of those would result in Pythonic code, but they are certainly not missing from the language.)
What counts as a standard formatter? Python one-liners are Turing complete even without semicolons, evals, execs, and with finite stack depth. E.g., some formatters keep this at one line, while others introduce line breaks: https://yonatan.us/misc/bf.html
You also need to define whether the code has to be raw CPU instructions or if it's allowed to use an external library containing all the code humanity has ever produced.
To complain about it not having any point on Hacker News, thus creating a paradox where the complaint itself becomes the most discussed and engaging aspect of the post.
Only if we arbitrarily state that this is now art. But you can of course say the same for the standard version. There is even a youngster out there who started a work called "The Art of Computer Programming."
We are definitely legitimate in lamenting the good old times when art and mere texnical stuff were clearly and sharply separated, while this new generation ruins societies and our glorious civilizations with their careless use of words that mixes everything together. Well, at least as legitimate as any old man in traceable history.
A talk by Brian Eno convinced me that there is no meaningful difference between the two, and I've yet to see a definition of art that does not end up in self-contradiction if we try to make the distinction.