TL;DR of Deep Dive into LLMs Like ChatGPT by Andrej Karpathy (anfalmushtaq.com)
381 points by oleg_tarasov 40 days ago | 83 comments



OT: What is a good place to discuss the original video -- once it has dropped out of the HN front-page?

I am going through the video myself -- roughly halfway through -- and have a few things to bring up.

Here they are now that we have a fresh opportunity to discuss:

1 - MATH and LLMs

I am curious why many of the examples Andrej chose to pose to the LLM were "computational" questions -- for instance "what is 2+2" or some numerical puzzles that needed algebraic thinking and then some addition/subtraction/multiplication (example around 1:50 about buying apples and oranges).

I can understand these abilities of LLMs are becoming powerful and useful too -- but in my mind these are not the "basic" abilities of a next token predictor.

I would have appreciated a clearer distinction of prompts that showcase core LLM ability -- generating text that is generally grammatically acceptable and grounded in facts and context -- without necessarily needing a working memory, assigning values to algebraic variables, doing arithmetic, etc.

If there are any good references to discussion of the mathematical abilities of LLMs and the wisdom of trying to make them do math -- versus simply recognizing when math is needed, generating the necessary Python/expressions, and letting the tools handle it -- I would appreciate pointers.

2 - META

While Andrej briefly acknowledges the "meta" situation where LLMs are being used to create training data for, and judge the outputs of, newer LLMs ... there is not much discussion of that here.

There are just many more examples of how LLMs are used to prepare mitigations for hallucinations, by preparing Q&A training sets with "correct" answers, etc.

I am curious to know more about the limitations / perils of using LLMs to train/evaluate other LLMs.

I kind of feel that this is a bit like the Manhattan project and atomic weapons -- in that early results and advances are being looped back immediately into the development of more powerful technology. (A smaller fission charge at the core of a larger fusion weapon -- to be very loose with analogies)

<I am sure I will have a few more questions as I go through the rest of the video and digest it>


Regarding 1 - MATH:

Somewhere in the video he says that LLMs have expert (only slightly fuzzy) knowledge about a lot of topics, but fail at simple math questions. Many non-technical people anthropomorphize LLMs and don't know that they can't think or calculate like a real calculator. LLMs compute token by token, and you can improve performance if you don't put too much computation into a single result token.

I think it's an excellent example to show the capabilities and limits of LLMs. For softer topics, you can argue a lot more about what's considered to be right or wrong. With Math, you have a single correct answer that can be evaluated and people assume that computers are good at computer things, such as calculating numbers, even though LLMs actually aren't good at this.

The takeaway is: prompting and "computational complexity per token" matter, and if you understand how it works for math, you probably understand how it works for softer things like answers about law or whatever.
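
A minimal sketch of the remedy mentioned upthread (have the model emit an arithmetic expression and let a tool do the actual math, rather than squeezing computation into result tokens). The fake_llm stub and the expression it returns are made up for illustration, not any particular API:

    import ast
    import operator as op

    # Stand-in for a real model call; assume it returns an arithmetic
    # expression as a string (hypothetical -- not any specific API).
    def fake_llm(question):
        return "3 * 4 + 2"

    OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

    # Tiny safe evaluator: only binary arithmetic on numeric literals.
    def safe_eval(expr):
        def ev(node):
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError("unsupported expression")
        return ev(ast.parse(expr, mode="eval").body)

    print(safe_eval(fake_llm("3 bags of 4 apples plus 2 oranges -- how many fruit?")))  # 14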


I've definitely done things like "give me a time stamp" then took too long to realize the time it gave made no sense. You get used to it working well when it does, and then it doesn't, and it's hard to switch the skepticism back on in response.


[flagged]


deedee (pronounced almost the same in Mandarin, written as didi in pinyin) also means "little brother" and also something else more explicit :p


I believe Andrej Karpathy runs a discord, which is linked on his website [1]. I haven't participated personally, but from what I've seen, it's very active.

[1]: https://karpathy.ai/zero-to-hero.html


> for instance "what is 2+2" or some numerical puzzles that needed algebraic thinking

there is only one algebraic approach to solving something like 2+2 and that is counting! 2+2 = ((((0 + 1) + 1) + 1) + 1). but llms are infamously bad at counting. which is why 2+2 isn't an algebraic problem to an llm. it's pattern matching or linguistic reasoning token by token.


LLMs are bad at counting because nobody counts in text, we count in our heads which is not in the training material.


This result --

https://x.com/yuntiandeng/status/1889704768135905332

Is this a consequence of the fact that multiplication tables for kindergarteners are abundantly available online (in the training data) -- typically up to the 12 or 13 times table, as plain text?


i don't think it's just about the training material. it's also about keeping track of the precise number of tokens. you'd have to have dedicated tokens for 1+1+1+1 and another one for 1+1+1+1+1 etc.


The internal representation is multidimensional vectors. With a typical 4096-dimensional vector in q4, one could name every particle in the universe and have over 4000 dimensions left for other purposes.


i don't think that is a valid argument.


For point 1, he gets into that more later in the video - e.g. specifically on counting, and about how/when to have models invoke tools instead of doing math themselves, etc.

Also for the second point, check later in the video when he talks about RL and (simulated) RLHF - he gets into the feedback loops of models training each other and the collapse that follows.


> I am curious to know more about the limitations / perils of using LLMs to train/evaluate other LLMs.

At the extreme, there is this paper on ‘inbred LLMs’: https://www.nature.com/articles/s41586-024-07566-y


> I am curious to know more about the limitations / perils of using LLMs to train/evaluate other LLMs.

the entropy goes up in such a case (up means less information). The result will be as if someone recompressed an MPEG with another lossy codec. You can sometimes see the results on the internet.


I find Meta's approach to hallucinations delightfully counterintuitive. Basically they (and presumably OpenAI and others):

   - Extract a snippet of training data.
   - Generate a factual question about it using Llama 3.
   - Have Llama 3 generate an answer.
   - Score the response against the original data.
   - If incorrect, train the model to recognize and refuse incorrect responses.
In a way this is obvious in hindsight, but it goes against ML engineers natural tendency when detecting a wrong answer: Teaching the model the right answer.

Instead of teaching the model to recognize what it doesn't know, why not teach it using those same examples? Of course the idea is to "connect the unused uncertainty neuron", which makes sense for out-of-context generalization. But we can at least appreciate why this wasn't an obvious thing to do for generation 1 LLMs.
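
Roughly sketching that pipeline in code (hypothetical helper names; llama3() below is just a stub, not Meta's actual tooling):

    # Rough sketch of the described pipeline; purely illustrative.
    def llama3(prompt):
        return "placeholder model output"   # stand-in for a real model call

    def build_refusal_examples(training_snippets):
        examples = []
        for snippet in training_snippets:
            question = llama3("Write one factual question about this text:\n" + snippet)
            answer = llama3(question)
            # Score the answer against the original snippet (here via another
            # model call acting as a judge; a fact/string match would also work).
            verdict = llama3("Source: " + snippet + "\nAnswer: " + answer +
                             "\nDoes the answer agree with the source? yes/no")
            if not verdict.strip().lower().startswith("yes"):
                # Teach refusal, not the right answer.
                examples.append({"prompt": question,
                                 "target": "I'm sorry, I don't know."})
        return examples

    print(build_refusal_examples(["Orson Kovacs is ..."]))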


> In a way this is obvious in hindsight, but it goes against ML engineers natural tendency when detecting a wrong answer: Teaching the model the right answer.

But the answer space for LLMs is infinite and unbounded. So, no effort will be complete and you will always end up with the question of how to deal with uncertainty.

But I admit this is a bit of hindsight 20/20.


Karpathy's point in the video is that the models don't need to be exhaustively told what they don't know - they already have a good understanding of the extents of their knowledge. Older models just didn't use that understanding; they answered every question confidently because they'd only been trained on confident answers.


i don't think that's what he meant and also don't think it is accurate to say they already have an understanding. i'm not even basing my criticism on the anthropomorphization but on the fact that there will be activation constellations that correlate with uncertainty, but you have to train them to channel this into an actual response expressing uncertainty ... only then does it make sense to speak of understanding uncertainty.


Sorry, I don't follow what you're disagreeing with. I was summarizing what Karpathy talks about in the vicinity of 1:31:00 - where he talks (I assume notionally) about a specific neuron lighting up to indicate uncertainty, and how empirically this turns out probably to be the case.

Edit: concretely, we can presume that OpenAI didn't specifically train ChatGPT to know that "Orson Kovacs" isn't a famous person, right? That's all I'm saying here - that they trained it how to say it doesn't know things, and it took care of the rest.


i think i misinterpreted your first sentence.


Hrrm, I'm reading back and I may have misinterpreted your first post too. If so apologies!


> but it goes against ML engineers natural tendency when detecting a wrong answer: Teaching the model the right answer.

Hard to buy.

If a machine makes a mistake, it's because it was configured wrong, or because of wear and tear, solar flares, some quake, or some manufacturing defect in a part. If a learning machine makes a mistake, it's because its learning has not extended its rule set to cover that matrix/mistake/pattern yet; and so it includes that mistake/matrix and other mistakes, analyzes them for patterns, and then creates mistakes that fall into that pattern. Later doing that in a rolling release or canine kind of way, and even later learning machines will do it all live, synchronous to their concurrent actions.

But yeah, thinking about that, I see why ML engineers wouldn't get there from scratch. It's a rhythm, after all, an epiphany about or realization of how one's dog, one's brain works, learned and then coded step by step. And there is, of course, the variety of how people learn and "realize".

Someone has to show us the work of those savant programmers/engineers I still haven't seen a documentary of.


Andrej's video is great but the explanation of the RL part is a bit vague to me. How exactly do we train on the right answers? Do we collect the reasoning traces and train on them as in supervised learning, or do we compute some scores and use them as a loss function? Isn't the reward then very sparse? What if LLMs can't generate any right answers because the problems are too hard?

Also, how can the training of LLMs be parallelized when parameter updates are sequential? Sure, we can train on several samples simultaneously, but the parameter updates are with respect to the first step.


As I understood that part, in RL for LLMs you take questions for which the model already sometimes emits correct answers, and then repeatedly infer while reinforcing the activations the model made during correct responses, which lets it evolve its own ways of more reliably reaching the right answer.

(Hence the analogy to training AlphaGo, wherein you take a model that sometimes wins games, and then play a bunch of games while reinforcing the cases where it won, so that it evolves its own ways of winning more often.)
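
A toy numeric sketch of that "sample, check, reinforce" idea -- a three-answer bandit standing in for an LLM, purely illustrative and not actual training code:

    import numpy as np

    rng = np.random.default_rng(0)
    candidates = ["4", "5", "22"]        # possible completions for "2 + 2 ="
    correct = "4"
    logits = np.zeros(len(candidates))   # stands in for the model's parameters

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    lr = 0.5
    for step in range(200):
        p = softmax(logits)
        i = rng.choice(len(candidates), p=p)              # sample an answer
        reward = 1.0 if candidates[i] == correct else 0.0
        grad = -p                                         # d/dlogits of log p[i] = onehot(i) - p
        grad[i] += 1.0
        logits += lr * reward * grad                      # only correct samples get reinforced

    print(dict(zip(candidates, softmax(logits).round(3))))

The real thing updates transformer weights via the gradient of the same log-probability over whole reasoning traces, but the reinforcement logic has the same shape.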


AlphaGo seems more like an automated process to me because you can start from nothing except the algorithm and the rules. Since a Go game only has 2 outcomes most of the time, and the model can play with itself, it is guaranteed to learn something during self-play.

In the LLM case you have to have an already capable model to do RL. Also I feel like the problem selection part is important to make sure it's not too hard. So there's still much labor involved.


Yes, IIUC those points are correct - you need relatively capable models, and well-crafted questions. The comparison with AlphaGo is that the processes are analogous, not identical - the key point being that in both cases the model is choosing its own path towards a goal, not just imitating the path that a human labeler took.


Details on how DS used GRPO for RL rewards

https://medium.com/@sahin.samia/the-math-behind-deepseek-a-d...
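
The core trick, as I read the DeepSeek write-ups, is a group-relative advantage: sample a group of answers per question, score them, and normalize each reward against its own group (the rewards below are made up):

    import numpy as np

    rewards = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])   # one group: 1 = correct, 0 = wrong
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    print(advantages)   # correct answers get positive advantage, incorrect ones negative

Those advantages then weight the policy-gradient update, which is how GRPO avoids training a separate value/critic model.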


Thanks!



Will have a look. Thanks!


At around 53 minutes into the original video, he shows how exactly an LLM can quote the text it was trained on. I wonder how big tech convinced the courts that this is not copyright violation (especially when ChatGPT was quoting some GPL code). I can imagine the same thing happening in the opposite direction: if I trained a model to draw a Disney character, my ass would be sued in a fraction of a second.


Note that he's inferring from a base model there, which are fairly capable of regurgitating their (highly-weighted) inputs since they do nothing but predict pre-training tokens. For instruct services like ChatGPT, if they regurgitate something I'd think it would more likely be their fine-tuning data, which is usually owned by the provider (and also kept secret).


what I mean is, if we can describe an LLM as lossy compression (which are words spoken by Andrej), we could define what happens during inference as decompressing the compressed data, and at that moment shit would hit the fan.


It's an interesting question, one I wonder about now that our federal data is being exfiltrated to AI companies. If they train their models on the data, how does the law tell them to 'unlearn'?

Destroying the copies they took will be what the courts order, but the data will still be there.


This is still being litigated I believe.


For a model to be ‘fully’ open source you need more than the model itself and a way to run it. You also need the data and the program that can be used to train it.

See The Open Source AI Definition from OSI: https://opensource.org/ai


Is it reasonable to expect companies to redistribute 100TB of copyrighted content they used for their LLM, just on the off-chance someone has a few million laying around and wants to reproduce the model from scratch?


Redistribute? No. Itemize and link to? Yes.

With LLMs, the list doesn't even have to be kept up to date, nor the links alive (though publishing content hashes would go a long way here). It's not like you can get an identical copy of a model built anyway; there's too much randomness at every stage in the process. But as long as the details of cleanup and training are also open, a list of training material used would suffice - people would fetch parts of it, substitute other parts with equivalents that are open/unlicensed/available, add new sources of their own, and the resulting model should have similar characteristics to the OG one, which we could then call "open source".
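
A manifest entry along those lines is cheap to produce; a minimal sketch (the field names and shard filename below are made up):

    import hashlib
    import json
    import os

    # Hypothetical manifest entry for one training shard: name, size, source URL,
    # and a content hash so others can check they fetched the same bytes.
    def manifest_entry(path, source_url):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return {"file": os.path.basename(path),
                "bytes": os.path.getsize(path),
                "source": source_url,
                "sha256": h.hexdigest()}

    print(json.dumps(manifest_entry("shard-000.jsonl",
                                    "https://example.org/shard-000.jsonl"), indent=2))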


Perhaps that's not reasonable to expect, but Meta apparently kind of did it anyway, if not in a way that helps reproduce their LLM: https://arstechnica.com/tech-policy/2025/02/meta-torrented-o...


Actually they did, the entire 15T tokens that were supposedly used for training the llama-3 base models are up on HF as a dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb

It's just not literally labelled so because of obvious reasons.


The RL-only (no SFT) approaches might remove that issue. Problem sets should be smaller (and mechanically creatable) than the entire western corpus.


Would a reference file with filename, size, source and checksum count towards the OSI definition?


For an open source model claiming SOTA performance, we could at least check for data leakage from its training data.


Fully agreed, someone actually mentioned that to me on reddit and I modified the content + added a disclaimer on top of the article.

https://www.reddit.com/r/LocalLLaMA/comments/1ilsfb1/comment...


That is incorrect -- you do not have to provide the full training data to meet the requirements.

I recommend reading the actual Open Source AI Definition[1] and the FAQ[2]. There's also the whitepaper[3] that goes into much more detail about the state of affairs.

[1]: https://opensource.org/ai/open-source-ai-definition

[2]: https://hackmd.io/@opensourceinitiative/osaid-faq#What-is-th...

[3]: https://opensource.org/wp-content/uploads/2025/02/2025-OSI-D...


FYI, their open source AI definition was released with a lot of controversy, unsurprisingly because it had heavy contribution from corporations with their own interests. It's best to ignore it for now until the wider community has decided on an appropriate open source definition.


We need a new definition then.


> you need more than the model itself and a way to run it. You also need the data and the program that can be used to train it.

the model reveals the architecture which is all you need to use/run/train it.


Sadly, the OSAID also does not require training data to be available. :(


Yes. I cannot comprehend this to this day. How is model weights data + a runner different from a closed-source executable? Why does everyone call these open source?


Because typically adapting or improving traditional code to your needs is very difficult without access to the source code and build files.

For an LLM you can finetune and enhance, distill and embed given just the model weights, the runtime, and a permissive license. Having more is better. Well written detailed model release papers help a lot. Training code and training data are a great bonus.

However, I find the purity contest a bit too dismissive of the great contributions to the AI dev ecosystem that Meta and Deepseek have brought us. Without these, there wouldn't be the open ecosystem we have today.


It's not a purity contest, it's a clarity contest. If Meta and Deepseek want to operate the way they have been, where they release baked models and whitepapers, that's fine - and you're right, it's certainly more than they're obligated to release. They just shouldn't be calling it "open source" when the source is literally not open.


Eh, I can kinda see it. It depends on your definitions of words. People have been muddying the waters with what "open source" means anyway. I have known it to mean code released under an open source license. Other people use it to mean programs where the source is available regardless of license. I would use "source available" to describe that, but some people strongly disagree with my definitions.

If I write a program, then obfuscate it and then release the obfuscated code under an open source license, would you consider it open source(I would)? That's kind of the case here, they are releasing the model weights under an open source license.

Personally, I think it's fine to shorten it to "open source model" instead of "a model with the weights released under an open source license". What I would object to is releasing model weights under a restrictive license and calling that open source.


> If I write a program, then obfuscate it and then release the obfuscated code under an open source license, would you consider it open source(I would)?

I wouldn't. Most definitions of open source say something like "in the form used for editing". You can release a built binary under an unrestrictive license, but that does not mean that you've opened the source. It's literally the plain meaning of the words: the source, as in where the thing comes from, needs to be open for it be meaningful.


Because practically speaking, you can fine tune them I suppose?

But that's also true for binaries, games are a good example of where people pushed this quite far. Based on what little experience I have in ML, I'd say it's about the same thing. Whereas an API is more akin to a piece of software you can't tinker with in any way.

Guess the bar is just lower in the LLM space :P


So a proprietary program with a lot of knobs and configuration files is kind of opensource?


By what appears to be the logic for "open source AI", a locally executable proprietary program would be "open source" (because you can meddle with the executable). To me, that's mostly just "not SaaS". But somehow, a different definition appears to have stuck for LLMs than for other types of software.


Open source becomes really complicated once assets with an unclear license are involved in any way. Lots of people, for example, would say that Jedi Knight 2 is open source because Raven Software released the source code and tools needed to build the game. But that alone doesn't mean you can run it, because you still need to get a hold of all the assets (models, textures, sounds) which may or may not still be property of LucasArts or its successors. Even if you have them, it's actually unclear if it is legal to use them this way. So while there are tons of people working on mods and conversions, no one in their right mind would distribute all the source assets.

Much in the same way, no sane company will touch the legal nightmare of releasing LLM training data scraped from public websites. Even releasing the LLM alone might be infringement, there are literally court cases being fought over this right now.


Games like that, or the open-source clones of commercial games that require original assets to play (e.g. OpenXCOM), actually give a very clear analogy here: open source does not mean open assets. The software code is under a separate license from the data it processes. Emulators like Dolphin are kind of in this situation too - the program is open, the data it processes is not.

And that's fine! It's still valuable to have access to the source code, even if the "batteries" aren't included. Of course, if you really want to call it an open source model you should include the source for the data scraping/cleaning stages too; then the only thing missing would be the compute time and risk of acquiring dubiously-legal inputs.

I personally prefer a taxonomy like:

* Open weights: you can download the artifact and run it locally, not just use it through an application like chatgpt or an API.

* Open source: the code that created the artifact is provided in the same format that the authors used to work on it.

* Open data: the dataset that the source code was used on is available for download.

All three of those could be individually licensed or released, for 8 possible combinations. In the analogy to games, they would correspond to the licenses on the retail binary, the source code of the game, and the original uncompressed art assets or Blender projects, respectively.


If it has already been established that open source doesn't mean open assets, why would we change that now? After all, training data is literally nothing but assets - except that you don't need them to run the application. So in that sense open LLMs are more open than these games.


But the training data isn't open...

I agree that open source doesn't mean open assets, but neither does open assets mean open source. You could make a linguistic argument that the training data is part of the "source" of the model (as in, from whence it came), but in any case the point is moot because neither the training data nor the code is open.


OK, but that leaves the tools used to train the model (aka the build scripts). These could be open sourced.


Because marketing (open source is a buzzword after all), and the media just repeats what they read in press releases verbatim. But most people working with the models themselves call them open-weight, except for some occasional exception like OLMo that publishes the dataset and training scripts and is actually open source.


Because Meta called them that to differentiate themselves from "Open"AI, and everyone else followed suit.


Because Meta called llama that shortly after it got leaked and it stuck.

The AI crowd doesn’t care much for licenses anyway.


I have read many articles about LLMs and understand how they work in general, but one thing always bothers me: why didn't other models work as well as the SOTA ones? What's the history and reason behind the current model architecture?


Simply put: the stability of attention-based models over non-attention-based ones.

Google dropped MHA self-attention, which was a major idea that they showed to work. OpenAI saw it and built an empire on feed-forward attention models, which are (compared to most alternatives) super stable at generation. DeepSeek showed evidence it's possible to push these models further and effectively use compression in the model design to pass around sufficient information for training (hence the "Latent"). They also did a lot of other cool stuff, but the main "core of the model" difference is this part...

Other than that, the biggest hurdle has been hardware. There's probably no way you could get kit from 2010, without even AES acceleration, to evaluate most full-fat mhlffa models, let alone train them. There's been a happy convergence of matrix acceleration on GPUs for gaming, graphical, and high-fidelity simulation work. This, combined with matrix-based ML maths and high-throughput memory advances, means we can do what we're doing with LLMs now.

So inevitable outcome or happy convergence? That's for historians to decide imo. I think it's a bit of both.


I guess nobody really knows why. Everybody just goes with what works, and tries only small variations. It's a bit like alchemy.


Simply put, no other model has the same number of effective skip connections or passes as much information through the model from input to output.

Earlier models had huge bottlenecks in terms of information limits and precision (autoencoders vs. U-Nets, for example), and LSTMs are still semi-unstable.

Why the attention design as posited by Google works so well is partly the skip-forward structure and partly "now we have enough information and processing power to try this".

It's well motivated, but whether we would expect it to work well from first principles up is still a bit less well understood. And if you're good at that, you'll likely get a job offer very quickly.


There is a lot of unpublished work on how to train models. A lot of the work is cleaning up the data or making synthetic data. This is the secret sauce. It was demonstrated by TinyStories and Phi-X, and now by the recent work on small data for math reasoning.


There's a huge effort going into understanding the statistical information in a large corpus of text, especially after people have shown you can reduce the language input needed to carefully selected sources which guarantee enough information for training.

The smaller the input for the same quality, the quicker/better/faster we can iterate, so everyone is pushing to get the minimum viable training time of a decent LLM down -- both to make chain-of-thought cheaper as a concept and to allow for iteration and innovation.

As long as we lived in the future espoused by early OpenAI, of huge models on huge GPUs, we were going to stagnate. More GPU always means better in this game, but smaller, faster models mean you can do even more with even less. Now the major players see the innovation heading into the multi-LLM-instance arena, which is still dominated by who has the best training and hardware. But I expect to see disruption there too in time.


what do you refer to by "other models" not belonging to the "SOTA ones"?


You mean the history of pre-transformer language models, and reason for the transformer architecture ?

Once upon a time ....

Language modelling in general grew out of attempts to build grammars for natural languages, which then gave rise to statistical approaches to modelling languages based on "n-gram" models (use last n words to predict next word). This was all before modern neural networks.

Language modelling (pattern recognition) is a natural fit for neural networks, and in particular recurrent neural networks (RNNs) seemed like a good fit because they have a feedback loop allowing an arbitrarily long preceding context (not just last n words) to be used predicting the next word. However, in practice RNNs didn't work very well since they tended to forget older context in favor of more recent words. To address this "forgetting" problem, LSTMs were designed, which are a variety of RNN that explicitly retain state and learn what to retain and what to forget, and using LSTMs for language models was common before transformers.

While LSTMs were better able to control what part of their history to retain and forget, the next shortcoming to be addressed was that in natural language the next word doesn't depend uniformly on what came before, and can be better predicted by paying more attention to certain words that are more important in the sentence structure (subjects, verbs, etc) than others. This was addressed by adding an attention mechanism ("Bahdanau attention") that learnt to weight preceding words by varying amounts when predicting the next word.

While attention was an improvement, a major remaining problem with LSTMs was that they are inefficient to train due to their recurrent/sequential nature, which is a poor match for today's highly parallel hardware (GPUs, etc). This inefficiency was the motivation for the modern transformer architecture, described in the "Attention is all you need" paper.

The insight that gave rise to the transformer was that the structure of language is really as much parallel as it is sequential, which you can visualize with linguist's sentence parse trees where each branch of the tree is largely independent of other branches at the same level. This structure suggests that language can be understood by a hierarchy (levels of branches) of parallel processing whereby small localized regions of the sentence are analyzed and aggregated into ever larger regions. Both within and across regions (branches), the successful attention mechanism can be used ("Attention is all you need").
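
(For the curious, the attention computation being described boils down to something like this single-head, unmasked numpy sketch -- toy shapes, not the actual Transformer code:)

    import numpy as np

    # One self-attention step: every token builds a query, compares it against
    # every token's key, and takes a softmax-weighted mix of their values.
    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])            # token-to-token relevance
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the context
        return weights @ V                                 # weighted mix of values

    T, d = 5, 8                                            # 5 tokens, 8-dim embeddings
    rng = np.random.default_rng(0)
    X = rng.normal(size=(T, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)             # (5, 8)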

However, the idea of hierarchical parallel processing + attention didn't immediately give rise to the transformer architecture ... The researcher whose idea this was (Jakob Uszkoreit) had initially implemented it using some architecture that I've never seen described, and had not been able to get predictive/modelling performance to beat the LSTM+attention approach it was hoping to replace. At this point another researcher, Noam Shazeer (now back at Google and working on their Gemini model), got involved and worked his magic to turn the idea into a realization - the transformer architecture - whose language modelling performance was indeed an improvement. Actually, there seems to have been a bit of a "throw the kitchen sink at it" approach, as well as Shazeer's insight as to what would work, so there was then an ablation process to identify and strip away all unnecessary parts of this new architecture to essentially give the transformer as we now know it.

So this is the history and reason/motivation behind the transformer architecture (the basis of all of today's LLMs), but the prediction performance and emergent intelligence of large models built using this architecture seem to have been quite a surprise. It's interesting to go back and read the early GPT-1, GPT-2 and GPT-3 papers (ChatGPT was initially based on GPT-3.5) and see the increasing realization of how capable the architecture was.

I think there are a couple of major reasons why older architectures didn't work as well as the transformer.

1) The training efficiency of the transformer, its primary motivation, has allowed it to be scaled up to enormous size, and a lot of the emergent behavior only becomes apparent at scale.

2) I think the details of the transformer architecture - the interaction of key-based attention with hierarchical processing, etc - somewhat accidentally created an architecture capable of much more powerful learning than its creators had anticipated. One of the most powerful mechanisms in the way trained transformers operate is "induction heads", whereby the attention mechanisms of two adjacent layers of the transformer learn to co-operate to implement a very powerful analogical copying operation that is the basis of much of what they do. These induction heads are an emergent mechanism - the result of training the transformer rather than something directly built into the architecture.


I'm still seeking an answer to what DeepSeek really is, especially in the context of their $5M versus ChatGPT's >$1B (source: internet). What did they do versus not do?


There's a great deal of coverage about DeepSeek on Zvi's newsletter here: https://thezvi.substack.com/archive?sort=new - the first post on R1 is from January 22


Hey, Anfal here. I am actually the author of this article. I have had some really good discussions with a few really intelligent fellows I know, and they tie very deeply into DeepSeek. I'll create a post about it soon.


This one's for you, from Diana Hu (GP @ YC)

https://x.com/sdianahu/status/1887208144025292975


It is sad to see that much attention given to LLMs in comparison to other types of AI, like those doing maths (strapped to a formal solver), folding proteins, etc.

We had a talk about those physics AIs using those maths AIs to design hard mathematical models to fit fundamental physics data.



Great write up of what is presumably a truly great lecture. Debating trying to follow the original now.


It's a shame his LLM in C was just a launch board for his course.


I haven't watched the video, but was wondering about the Tokenization part from the TL;DR:

"|" "View" "ing" "Single"

Just looking at the text being tokenized in the linked article, it looked like (to me) that the text was: "I View", but the "I" is actually a pipe "|".

From Step 3 in the link that @miletus posted in the Hacker News comment: https://x.com/0xmetaschool/status/1888873667624661455 the text that is being tokenized is:

|Viewing Single (Post From) . . .

The capitals used (View, Single) also make more sense when seeing this part of the sentence.
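
You can check this sort of thing directly with a tokenizer library, e.g. OpenAI's tiktoken (the exact split depends on which encoding/model you pick, so treat the pieces below as whatever your tokenizer actually produces):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # encoding used by several OpenAI chat models
    ids = enc.encode("|Viewing Single")
    print([enc.decode([i]) for i in ids])        # shows how the string splits into tokens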


It would be great if the hardware issues were discussed more - too little is made of the distinction between silicon substrate, fixed threshold, voltage moderated brittle networks of solid-state switches and protein substrate, variable threshold, chemically moderated plastic networks of biological switches.

To be clear, neither possesses any magical "woo" outside of physics that gives one or the other some secret magical properties - but these are not arbitrary meaningless distinctions in the way they are often discussed.



