Llama3 implemented from scratch (github.com/naklecha)
1041 points by Hadi7546 22 days ago | 269 comments



As someone who has no technical knowledge of Llama or any LLM work, from conceptual understanding to technical implementation, is there any benefit to sitting down and going through this from start to finish? Or is the effort better spent elsewhere?

Like a roadmap: do A, do B, and finally go through this at the end.


https://bbycroft.net/llm

This was posted on HN a while ago and led to some great discussion. Myself and others agreed that this type of stateful visualization was _way_ more effective at conceptualizing how an LLM works than reading code or stepping through a debugger.


Here is the post and discussion in question:

[0] https://news.ycombinator.com/item?id=38505211


my opinion: it quickly gets into "the math behind LLMs", which makes no sense to me

words i understand but don't really get: weights, feed forward, layers, tensors, embeddings, normalization, transformers, attention, positioning, vector

There's "programming" in the plumbing sense where you move data around through files/sockets and then there's this... somebody without a math background/education... very unlikely you'll understand it. it's just skimming python and not understand the math/library calls it makes


Ya there are concepts in programming and math that are mostly self-teachable from first principles, but then there's what looks like gibberish because it's too new to have been distilled down into something tractable yet. I would say that arrays and matrices are straightforward to understand, while tensors are not. So I'm disappointed that so much literature currently revolves around tensors. Same for saying embedding instead of just vector representation, etc.
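
To make the jargon a bit more concrete, here's roughly how those two terms map onto plain numpy (a generic sketch of the terminology, not anything specific to Llama):

  import numpy as np

  # A "tensor" is just an n-dimensional array: a vector is a 1-D tensor,
  # a matrix is a 2-D tensor, a batch of token vectors is a 3-D tensor.
  activations = np.zeros((2, 8, 16))          # (batch, sequence_length, features)

  # An "embedding" is just a vector representation looked up by index:
  # one learned row per token in the vocabulary.
  vocab_size, dim = 1000, 16
  embedding_table = np.random.randn(vocab_size, dim) * 0.02
  token_ids = np.array([5, 42, 7])
  token_vectors = embedding_table[token_ids]  # shape (3, 16)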

It helps me to think in terms of levels of abstraction rather than complexity. My education stopped at a 4 year degree, but AI is mostly postgraduate still. So I have to translate to what I know because I haven't internalized the lingo.

Here's the most approachable teaching of neural nets (NNs) and large language models (LLMs) that I've seen so far:

https://news.ycombinator.com/item?id=40213292 (Alice’s Adventures in a differentiable wonderland)

https://arxiv.org/pdf/2404.17625 (pdf)

https://news.ycombinator.com/item?id=40215592 (tensor and NN layer breadcrumbs)

  II A strange land 105
    7 Convolutional layers 107
      ..
      7.1.3 Translational equivariant layers 112
    ..
    9 Scaling up the models 143
      ..
      9.3 Dropout and normalization 151
        9.3.1 Regularization via dropout 152
        9.3.2 Batch (and layer) normalization 156
  
  III Down the rabbit-hole 167
    10 Transformer models 169
      10.1 Introduction 169
        10.1.1 Handling long-range and sparse dependencies 170
        10.1.2 The attention layer 172
        10.1.3 Multi-head attention 174
      10.2 Positional embeddings 177
        10.2.1 Permutation equivariance of the MHA layer 177
        10.2.2 Absolute positional embeddings 179
        10.2.3 Relative positional embeddings 182
      10.3 Building the transformer model 182
        10.3.1 The transformer block and model 182
        10.3.2 Class tokens and register tokens 184
    11 Transformers in practice 187
      11.1 Encoder-decoder transformers 187
        11.1.1 Causal multi-head attention 188
        11.1.2 Cross-attention 189
        11.1.3 The complete encoder-decoder transformer 190
      11.2 Computational considerations 191
        11.2.1 Time complexity and linear-time transformers 191
        11.2.2 Memory complexity and the online softmax 192
        11.2.3 The KV cache 194
        11.2.4 Transformers for images and audio 194
      11.3 Variants of the transformer block 197


I recommend _Deep Learning with Python_ by François Chollet (the creator of Keras). It’s very clear and approachable, explains all of these concepts, and doesn’t try to “impress” you with unnecessary mathematical notation. Excellent introductory book.

The only downside is that in 2024, you are probably going to use PyTorch and not Keras + Tensorflow as shown in the book.


If you want to gain familiarity with the kind of terminology you mentioned here, but don't have a background in graduate-level mathematics (or even undergrad really), I highly recommend Andrew Ng's "Deep Learning Specialization" course on Coursera. It was made a few years ago but all of the fundamental concepts are still relevant today.


Fei-Fei Li and Andrej Karpathy's Stanford CS231N course is also a great intro to the basics of the math from an engineering-forward perspective. I'm pretty sure all the materials are online. You build up from the basic components to an image-focused CNN.


> understand but don't really get

That's exactly where I am at. Despite watching Karpathy's tutorial videos, I quickly got lost. My highest level of math education is Calculus 3 which I barely passed. This probably means that I will only ever understand LLMs at a high level.


Understanding Deep Learning is a very approachable text that will get you 80% of the way there.

Dive into Deep Learning is another.

Both have free PDF versions available.

The math isn't difficult. The notation is a little foreign, and you have to take your time reading and rereading the equations.


Signals and Systems covers, in part, the notation and explanations you need. MIT has the course online for free. (Though it's probably a little more general than what you need, since the class is also used to prep electrical engineers for robotics and radio communication.)


>MIT has the course online for free

What is the name of the course? This? https://ocw.mit.edu/courses/res-6-007-signals-and-systems-sp...


Yeah that's it.

Not as a starting point.

Google and find the examples where someone does it in a spreadsheet. It's much more approachable that way.

You are going to find it's not that complicated.


Sounds interesting. Do you have a link?



Only do it if you want the illusion of LLM's to be shattered. Suddenly every day you'll see two to three highly upvoted links on HN and be unable to keep your eyes from rolling.


that's like saying if you study real neurons your illusion of the human mind will be shattered.


I have a friend who is a neuroscientist and he disagrees immensely.


Does s/he believe in free will too?


I'm not going to have a conversation with a programmer about free will on HN. It's great that OpenAI made you think about this stuff for the first time, but I read my science fiction as a teenager.


If you like this, it's also worth looking at llama2.c[1], an implementation of the Llama 2 architecture in about 1000 lines of plain, dependency-free C, tokenizer and all. The fact that this 960-line file and a somewhat modern C compiler is all you really need to run a state-of-the-art language model is really surprising to many.

Of course, this is not all there is to a modern LLM, it would probably take another thousand lines or two to implement training, and many more than that to make it fast on all the major CPU and GPU architectures. If you want a flexible framework that lets a developer define any model you want and still goes as fast as it can, the complexity spirals.

Most programmers have an intuition that duplicating a large software project from scratch, like Linux or Chromium for example, would require incredible amounts of expertise, manpower and time. It's not something that a small team can achieve in a few months. You're limited by talent, not hardware.

LLMs are very different. The code isn't that complicated: you could probably implement training and inference for a single model architecture, from scratch, on a single kind of GPU, with reasonable performance, as an individual with a background in programming who still remembers their calculus and linear algebra, with a year or so of self-study. What makes LLMs difficult is getting access to all the hardware to train them, getting the data, and being able to preprocess that data.


One other thing to add is large-scale RLHF. Big Tech can pay literally hundreds of technically-sophisticated people throughout the world (e.g. college grads in developing countries) to improve LLM performance on all sorts of specific problems. It is not a viable way to get AGI, but it means your LLM can learn tons of useful tricks that real people might want, and helps avoid embarrassing "mix broken glass into your baby formula" mistakes. (Obviously it is not foolproof.)

I suspect GPT-4's "secret sauce" in terms of edging out competitors is that OpenAI is better about managing data contractors than the other folks. Of course it's a haze of NDAs to learn specifics, and clearly the contractors are severely underpaid compared to OpenAI employees/executives. But a lone genius with a platinum credit card can't create a new world-class LLM without help from others.


Yes, this is the secret sauce and the moat. Not as easy as buying more compute with unlimited budget.

… built on the back of a disposable workforce…

There is something grim and dystopian, thinking about the countless small hands feeding the machine.


>There is something grim and dystopian, thinking about the countless small hands feeding the machine.

Dystopian indeed; this is pretty much how the Manhattan Project and CERN were run, with many independent contractors doing different parts and only a few having the overview. A page out of the corporate management playbook, and it very much allows concentration of power in the hands of a few.


Since when is CERN a dystopian project?


Big Government Socialism won't let you build your own 25km-circumference particle accelerator. Bureaucrats make you fill out "permits" and "I-9s" for the construction workers instead of letting you hire undocumented day laborers.

I am wondering if "CERN was pushed on the masses by the few" is an oblique reference to public fears that the LHC would destroy the world.


Very generous to compare to Manhattan Project or CERN.


Don't buy into the hype, but when Facebook has spent around as much on GPUs as the Manhattan Project (though not the Apollo program), the comparison kinda makes itself.

https://twitter.com/emollick/status/1786213463456448900

$22 in 2008 -> $33 today https://data.bls.gov/cgi-bin/cpicalc.pl?cost1=22&year1=20080...


The Big Dig (Boston highway overhaul) cost $22bn in 2024 dollars. The Three Gorges dam cost $31bn. These are expensive infrastructure projects (including the infrastructure for data centers). It doesn't say anything about how important they are for society.

Comparing LLMs to the Manhattan Project based on budget alone is stupid and arrogant. The comparison only "makes itself" because Ethan Mollick is a childish and unscientific person.


I read this last week and it's terrifying. If the world lets Facebook become an AI leader, it's on us; we all know how that story will play out.


We must summon a fellowship of the AI ring with one hobbit capable of withstanding the corrupting allure of it all.


Don't torment the hobbits! Send the eagles right away!


>Comparing LLMs to the Manhattan Project based on budget alone is stupid and arrogant

Just want to clarify: the comparison to the Manhattan Project or CERN is referencing "the countless small hands feeding the machine." In projects like these, roles and jobs are divided into small parts, so the people working on them can't see the forest for the trees, and only a few have a picture of the whole project.


The big difference is that CERN and the Manhattan Project were done by local contractors with often more than decent wages, which isn't the case when you pay people from Madagascar a couple of dollars a day.


Maybe it's the only way. Companies that don't have that concentrated power will probably fall apart.


Hard to defend because once your model is out there other companies can train on its output.


Yes, but the output is only one third of the data. You also need the input and the annotations.


OpenAI is heavily relying on Scale AI for training data (contractors).


And if you want to understand I'd recommend this post (gpt2 in 60 lines of numpy) and the post on attention it links to. The concepts are mostly identical to llama, just with a few minor architectural tweaks. https://jaykmody.com/blog/gpt-from-scratch/
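
To give a taste of what that post builds up to, the attention step itself really is only a few lines of numpy. This is a generic sketch of causal scaled dot-product attention (single head, no batching), not the exact code from the linked article:

  import numpy as np

  def softmax(x):
      e = np.exp(x - x.max(axis=-1, keepdims=True))
      return e / e.sum(axis=-1, keepdims=True)

  def causal_self_attention(x, Wq, Wk, Wv):
      # x: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_head)
      q, k, v = x @ Wq, x @ Wk, x @ Wv
      scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len)
      mask = np.triu(np.ones_like(scores), 1) * -1e10    # hide future tokens
      return softmax(scores + mask) @ v                  # weighted sum of values

  x = np.random.randn(8, 32)                             # 8 tokens, 32 features
  Wq, Wk, Wv = (np.random.randn(32, 16) for _ in range(3))
  out = causal_self_attention(x, Wq, Wk, Wv)             # (8, 16)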


Thanks for sharing this!


> Most programmers have an intuition that duplicating a large software project from scratch, like Linux or Chromium for example, would require incredible amounts of expertise, manpower and time. It's not something that a small team can achieve in a few months. You're limited by talent, not hardware.

But only for the same reasons. Linux runs on very nearly every piece of hardware ever made. The APIs you have to implement in order to run "Linux programs" are large and full of old complexity that exists for compatibility. Chromium is full of code to try to make pages render even though they were designed for Internet Explorer 6.

Conversely, some university programs have students create a basic operating system from scratch. It's definitely something a small team can do as long as you don't care about broad hardware support or compatibility with existing applications. In principle a basic web browser is even simpler.



There's also a project where they have GPT-2 running off of an excel spreadsheet.

https://arstechnica.com/information-technology/2024/03/once-...


I recommend reading https://github.com/bkitano/llama-from-scratch over the article OP linked.

It actually teaches you how to build Llama iteratively, and how to test, debug, and interpret the training loss, rather than just describing the code.


> The code isn't that complicated, you could probably implement training and inference for a single model architecture, from scratch, on a single kind of GPU, with reasonable performance, as an individual with a background in programming and who still remembers their calculus and linear algebra, with a year or so of self study.

Great overview. One gap I've been working on (daily) since October is the math, working towards Math Academy's Mathematics for Machine Learning course (https://mathacademy.com/courses/mathematics-for-machine-lear...).

I wrote about my progress (http://gmays.com/math) if anyone else is interested in a similar path. I recently crossed 200 days of doing math daily (at least a lesson a day). It's definitely taking longer than I want, but I also have limited time (young kids + startup + investing).

The 'year of self study' definitely depends on where you're starting from and how much time you have, but it's very doable if you can dedicate an hour or two a day.


The code is much more similar, in principle, to a virtual machine. The actual code, the bit that contains the logic which has the semantics we intend, is in the trained weights, where the level of complexity is much higher and more subtle.


> you could probably implement training and inference for a single model architecture, from scratch, on a single kind of GPU, with reasonable performance… with a year or so

I have implemented inference of the Whisper https://github.com/Const-me/Whisper and Mistral https://github.com/Const-me/Cgml/tree/master/Mistral/Mistral... models on all GPUs which support the Direct3D 11.0 API. The performance is IMO very reasonable.

A year might be required when the only input is the research articles. In practice, we also have reference Python implementations of these models. It's possible to test individual functions or compute shaders against the corresponding pieces of the reference implementation, by comparing saved output tensors between the reference and the newly built implementation. Thanks to that simple trick, I think I spent less than a month, part-time, on each of these two projects.
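
The comparison step itself is nothing exotic; in Python terms it boils down to something like the following (a simplified sketch, file names invented for illustration):

  import numpy as np

  # The reference Python implementation dumps an intermediate tensor to disk,
  # and the new implementation dumps the tensor for the same layer and input.
  reference = np.load("reference_layer07_output.npy")   # hypothetical file
  candidate = np.load("candidate_layer07_output.npy")   # hypothetical file

  # Compare with a tolerance: fp16 math and different kernels never match exactly.
  if np.allclose(reference, candidate, rtol=1e-2, atol=1e-3):
      print("layer 7 matches the reference")
  else:
      print("mismatch: max abs diff =", np.abs(reference - candidate).max())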


I'd say a year for somebody who doesn't know what a linear layer is and couldn't explain why a GPU might be of any use if you're not playing games, but who knows what the derivative of 3x^2 is.


> What makes LLMs difficult is getting access to all the hardware to train them, getting the data, and being able to preprocess that data.

Yes, that's my opinion too. GAOs (Grassroots AI Organisations) are constrained by access to data and the hardware needed to process the data and train the model on it. I look forward to a future where GAOs will crowdsource their computations in the same way many science labs borrow computing power from people around the world.


This is hard because you need high bandwidth between the GPUs in your cluster, bandwidth far higher than broadband could provide. I'm not even sure whether the time spent synchronizing between far-away machines would offset the increase in computational power.


I feel like this ignores the complexity of the distributed training frameworks. The challenge is in making it fast at scale.


>" THe fact that this 960-line file and a somewhat modern C compiler is all you really need to run a state-of-the-art language model is really surprising to many."

"the code for AGI will be simple" - John Deremetrius Carmack


> The code isn't that complicated.

This is an indication that we’re at the infancy of this field.


Wait, are you saying SoTA NN research hasn't evolved beyond hardcoding a bunch of layer structures and sizes?

I'm kind of shocked. I thought there would be more dynamism by now, and I stopped dabbling in like 2018.


There is a tick-tock between searching the dominant NN architectures (tick) and optimizing for accuracy, compute and inference latency and throughput (tock).

This particular (tock) is still playing out. The next (tick) does not feel imminent and will likely depend on when we discover the limits of the transformers when it comes to solving for long tail of use-cases.

My $0.02.


You have to consider that there are still some low hanging fruit that let you improve prompt processing (not token generation) performance by an order of magnitude or even two, but there are no takers. The reason is quite simple. You can just buy more GPUs and forget about the optimizations.

If a 100x improvement in performance is left on the table, then surely even lower priority optimizations won't be implemented any time soon.

Consider this: a lot of clever attention optimizations rely on some initial pass to narrow down the important tokens and discard the rest from the KV cache. If this were actually possible, then how come the first few layers of the LLM don't already do this numerically to focus their attention? Here is the shocker: they already do, but since you're passing the full 8k context to the next layer anyway, you're wasting it on mostly... nothing.

I repeat: Does the 80th layer really need the ability to perform attention over all the previous 8k outputs of the 79th layer? The first layer? Definitely. The last? No. What happens if you only perform attention over 10% of the outputs of layer 79? What speedup does this give you?

Notice how the model has already learned the most optimal attention scheme. You just need to give it less stuff to do and it will get faster automatically.
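
To make the idea concrete, here is a rough numpy sketch of a single query attending over only the top 10% of cached keys in some late layer. This is my own illustration of the general pruning idea, not a method taken from any particular model:

  import numpy as np

  def pruned_attention(q, K, V, keep=0.1):
      # q: (d,) query for the current token; K, V: (cache_len, d) KV cache
      scores = K @ q / np.sqrt(q.shape[-1])          # (cache_len,)
      k = max(1, int(keep * len(scores)))            # e.g. keep 10% of the cache
      top = np.argpartition(scores, -k)[-k:]         # indices of the top-k keys
      w = np.exp(scores[top] - scores[top].max())
      w /= w.sum()                                   # softmax over the survivors
      return w @ V[top]                              # attend only to those values

  K = np.random.randn(8192, 64)                      # an 8k-token cache
  V = np.random.randn(8192, 64)
  q = np.random.randn(64)
  out = pruned_attention(q, K, V, keep=0.1)          # (64,)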


I don't get your point, how is what you're suggesting here different from a few papers we already have on KV cache pruning methods like [1]?

[1] https://arxiv.org/abs/2305.15805


My wish is that they would move on to the next phase. The whole deal with SSMs looks really good. But looking for better architectures is countered with "a regular architecture with more parameters does better, so what's the point of this?"


IMO, SSMs are an optimization. They don't represent enough of a fundamental departure from the kinds of things Transformers can _do_. So, while I like the idea of saving on the energy costs, I speculate that such saving can be obtained with other optimizations while staying with transformer blocks. Hence, the motivation to change is a bit of an uphill here. I would love to hear counter-arguments to this view. :)

Furthermore, I think a replacement will require that we _understand_ what the current crop of models are doing mechanically. Some of it was motivated in [1].

[1] https://openaipublic.blob.core.windows.net/neuron-explainer/...


Quadratic vs linear is not an optimization. It's a completely new game. With selective SSMs (mamba) the win is that associative training can be run in sublinear time via a log-cost associative scan. So you go from something quadratic wrt input sequence length to something logarithmic. If that's just an optimization it's a huge one.
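
For readers who haven't seen it: the trick is that the recurrence h_t = a_t * h_{t-1} + b_t composes associatively, since applying (a1, b1) and then (a2, b2) is the same as applying (a1*a2, a2*b1 + b2), so all prefixes can be computed with a parallel scan. A minimal numpy sketch of the idea (Hillis-Steele style, not Mamba's actual hardware-aware kernel):

  import numpy as np

  def scan_linear_recurrence(a, b):
      # Computes h_t = a_t * h_{t-1} + b_t for all t (with h_{-1} = 0)
      # in O(log n) combine rounds instead of a length-n sequential loop.
      a, b = a.astype(float).copy(), b.astype(float).copy()
      shift = 1
      while shift < len(a):
          # Combine each element with the partial result `shift` steps back;
          # (1, 0) is the identity element used to pad the front.
          a_prev = np.concatenate([np.ones(shift), a[:-shift]])
          b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
          a, b = a_prev * a, a * b_prev + b          # compose (a1,b1) then (a2,b2)
          shift *= 2
      return b                                       # b[t] == h_t

  a = np.array([0.5, 0.5, 0.5])
  b = np.array([1.0, 1.0, 1.0])
  print(scan_linear_recurrence(a, b))                # [1.  1.5  1.75]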


Okay. Respect your point of view. I am curious, what applications do you think SSMs enable that a Transformer cannot? I have always seen it as a drop-in replacement (like for like) but maybe there is more to it.

Personally, I think going linear instead of quadratic for a core operation that a system needs to do is by definition an optimization.


There's something about a transformer being at its core based on a differentiable hash table data structure that makes them special.

I think its dominance is not going to substantially change any time soon. Don't you know, the solution to all leetcode interviews is a hash table?


Heyo! Have been doing this for a while. SSMs certainly are flashy (most popular topics-of-the-year are), and it would be nice to see if they hit a point of competitive performance with transformers (and if they stand the test of time!)

There are certainly tradeoffs to both; the general transformer motif scales very well on a number of axes, so it may be the dominant algorithm for a while to come, though almost certainly it will change and evolve as time goes along (and who knows? something else may come along as well <3 :')))) ).


The solution to AGI is not deep learning. Maybe with more compute and a shitload of engineering it can produce a kind of baby AGI.

My bet is on something other than gradient descent and backprop, but honestly I don't wish for any company or country to reach AGI or any similarly sophisticated AI...


Magical thinking. Nature uses gradient descent to evolve all of us and our companions on this planet. If something better were out there, we would see it at work in the natural world.


Are you also saying that thoughts are formed using gradient descent? I don't think gradient descent is an accurate way to describe either process in nature. Also, we don't know that we "see" everything that is happening, we don't even understand the brain yet.


Maybe it's there, but in an ethereal form that is ungraspable by mere conscious forms such as ourselves? :P


I like your analogy of a tick-tock ~= epoch of progress.

Step change, then optimization of that step change.

Kind of like a grandfather clock with a huge pendulum swinging to one side, then the other (a commonly used metaphor).


Intel has been doing "tick-tock" for almost 20 years - https://en.wikipedia.org/wiki/Tick%E2%80%93tock_model


It's a metaphor that's been used with the advancement of CPU designs at least as far back as the 80s or 90s. Intel uses it explicitly in their marketing nowadays, I believe.


The innovation is the amount of resources people are willing to spend right now. From looking at the research code it's clear that the whole field is basically doing a (somewhat) guided search in the entire space of possible layer permutations.

There seems to be no rhyme or reason, no scientific insight, no analysis. They just try a million different permutations, and whatever scores the highest on the benchmarks gets published.


There's definitely scientific insight and analysis.

E.g. "In-context Learning and Induction Heads" is an excellent paper.

Another paper ("ROME") https://arxiv.org/abs/2202.05262 formulates hypothesis over how these models store information, and provide experimental evidence.

The thing is, a 3-layer MLP is basically an associative memory + a bit of compute. People understand that if you stack enough of them you can compute or memorize pretty much anything.

Attention provides information routing. Again, that is pretty well-understood.

The rest is basically finding an optimal trade-off. These trade-off are based on insights based on experimental data.

So this architecture is not so much accidental as it is general.

Specific representations used by MLPs are poorly understood, but there's definitely progress on understanding them from first principles by building specialized models.


One 3-layer (1 hidden layer) neural network can already approximate anything. You don't even need to stack them.
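
As a toy demonstration of that claim, a single hidden layer trained with plain gradient descent can already fit an arbitrary-looking 1-D function. A rough numpy sketch (nothing LLM-specific, just the classic universal-approximation picture):

  import numpy as np

  rng = np.random.default_rng(0)
  x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
  y = np.sin(x) + 0.3 * np.sin(3 * x)              # arbitrary target function

  hidden, lr = 64, 0.05
  W1 = rng.normal(0, 1.0, (1, hidden)); b1 = np.zeros(hidden)
  W2 = rng.normal(0, 0.1, (hidden, 1)); b2 = np.zeros(1)

  for step in range(5000):
      h = np.tanh(x @ W1 + b1)                     # the single hidden layer
      pred = h @ W2 + b2                           # linear output layer
      err = pred - y                               # gradient of 0.5 * MSE
      grad_W2 = h.T @ err / len(x); grad_b2 = err.mean(axis=0)
      dh = (err @ W2.T) * (1 - h ** 2)             # backprop through tanh
      grad_W1 = x.T @ dh / len(x); grad_b1 = dh.mean(axis=0)
      W1 -= lr * grad_W1; b1 -= lr * grad_b1
      W2 -= lr * grad_W2; b2 -= lr * grad_b2

  print("final MSE:", float(np.mean((pred - y) ** 2)))  # prints the final training error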


Well it took evolution 4 billion years of testing out random permutations that resulted in a pretty good local maximum, so there is hope for us yet.


"I'm a pretty good local maximum" is what any local maximum would tell you, if asked how it likes itself.


"The brain is the most important part of the body", the brain said.


Note that not all brains are so severely afflicted by this illusion. Most of them actually get pretty clearly that they are next to useless without their organic, social and environmental companions.


[flagged]


Or maybe a watch made by a blind guy instead of a bridge?


Would that make it more of a hear? Since obviously the guy can't watch.


I can't see the flagged comment, but I'm fairly sure @stefs is making a reference to this: https://en.wikipedia.org/wiki/The_Blind_Watchmaker


yes. the flagged comment wanted to sell moffkalast a bridge, w/o further explanation. i interpreted it as him saying that the human brain isn't governed by evolution and calling moffkalast naive.


The only thing that has changed since 2018 is the most popular network structure to play with. The code looks the same as always; python notebooks where someone manually calculated the size of each hard-coded layer to make it fit.


> someone manually calculated the size of each hard-coded layer

I wonder: shouldn't AI be the best tool to optimize itself?


In theory yes, but unfortunately AI hasn't been invented yet


I don't know; wouldn't the AI then be stuck evaluating all possible AI implementations? And since it will run into the halting problem, it won't be able to single out the very best one, though it can probably return the best one reachable by exhaustive search within a capped amount of resources. That won't necessarily be better than what human beings can provide given an equivalent amount of resources.


The innovation is that everything is just one standardized structure now (transformer models) and you make it bigger if you feel like you need that.

There's still some room for experimenting if you care about memory/power efficiency, like MoE models, but they're not as well understood yet.


There are too many papers throwing transformers on everything without thinking. Transformers are amazing for language but kinda mid on everything else. CS researchers tend to jump on trends really hard, so it will probably go back to normal again soon.


I don't know what you mean by amazing for language. Almost everything is built on transformers nowadays. Image segmentation uses transformers. Text to speech uses transformers. Voice recognition uses transformers. There are robotics transformers that take image inputs and output motion sequences. Transformers are inherently multi-modal. They handle whatever you throw at them, it's just that language tends to be a very common input or output.


That is not true. Transformers are being applied all over because they work better than what was used before in so many cases.


I've occasionally worked with more dynamic models (tree-structured decoding). They are generally not a good fit for maximizing GPU throughput. A lot of the magic of transformers and large language models is about pushing the GPU as hard as we can, and a simpler, static model architecture that trains faster can train on much more data.

So until the hardware allows for comparable (say within 2-4x) throughput of samples per second, I expect model architectures to mostly stay static for the most effective models, and dynamic architectures to remain an interesting side area.


My wild guess is that adjusting the shape before each step is not worth the speed hit. Uniform structures make GPUs go brrrrr


It's also easier to train and in particular easier to parallelize.


There are things like NAS (neural architectural search) but all you are doing is just growing the search space and making the optimization problem much harder. Typically you do the architectural optimization by hand, using heuristics and past experiments as guidance.


People would love to have dynamism. It's a cost thing.


The iterative leaps by which open-source models keep getting better are strong evidence that companies competing on the LLM model layer have an ephemeral moat.

Serious question: assuming this is true, if an incumbent-challenger like OpenAI wants to win, how do they effectively compete against current services such as Meta and Google product offerings, which can be AI-enhanced in a snap?


the very first big AI company who gives up trying to lobotomize and emasculate their models to align with the values of 0.01% of the world population will win a lot of hearts and minds overnight. the censorship necessary for corporate applications can be trivially implemented as a toggleable layer, using a small, efficient, specialist model to detect no-no words and wrongthink in inputs/outputs.

gpt, claude, gemini, even llama and mistral, all tend to produce the same nauseating slop, easily-recognizable by anyone familiar with LLMs - these days, I cringe when I read 'It is important to remember' even when I see it in some ancient, pre-slop writings.

creativity - one of the very few applications generative AI can truly excel at - is currently impossible. it could revolutionize entertainment, but it isn't allowed to. the models are only allowed to produce inoffensive, positivity-biased, sterile slop that no human being finds attractive.


> the censorship necessary for corporate applications can be trivially implemented as a toggleable layer, using a small, efficient, specialist model to detect no-no words and wrongthink in inputs/outputs.

What's really funny is they all have "jailbreaks" that you can use to make then say anything anyway. So for "corporate" uses, the method you propose is already mandatory. The whole thing (censoring base models) is a misguided combination of ideology and (over the top) risk aversion.


> creativity - one of the very few applications generative AI can truly excel at - is currently impossible. it could revolutionize entertainment, but it isn't allowed to. the models are only allowed to produce inoffensive, positivity-biased, sterile slop that no human being finds attractive.

Have you played around with base models? If you haven't yet, I'm sure you'll be happy to find that most base models are delightfully unslopped and uncensored.

I highly recommend trying a base model like davinci-002[1] in OpenAI's "legacy" Completions API playground. That's probably the most accessible, but if you're technically inclined, you can pair a base model like Llama3-70B[2] with an interface like Mikupad[3] and do some brilliant creative writing. Llama3 models can be run locally with something like Ollama[4], or if you don't have the compute for it, via an LLM-as-a-service platform like OpenRouter[5].

[1] https://platform.openai.com/docs/models/gpt-base

[2] https://huggingface.co/meta-llama/Meta-Llama-3-70B

[3] https://github.com/lmg-anon/mikupad

[4] https://ollama.com/library/llama3:70b-text

[5] https://openrouter.ai/models/meta-llama/llama-3-70b


From [3]:

> Further, in developing these models, we took great care to optimize helpfulness and safety.

The model you linked to isn't a base model (those are rarely if ever made available to the general public nowadays), it is already fine-tuned at least for instruction following, and most likely what some in this game would call 'censored'. That isn't to say there couldn't be made 'uncensored' models based on this in the future, by doing, you guessed it, moar fine-tuning.


I think you vastly overestimate how much people care about model censorship. There are a bunch of open models that aren't censored. Llama 3 is still way more popular because it's just smarter.


Please explain what you mean when you say the 0.01% are emasculating AI


They're suggesting that 99.99% of people don't mind if AI reflects biases of society. Which is weird because I'm pretty sure most people in the world aren't old white middle class Americans


yes, yes, bias like the fact that the Wehrmacht was not the human menagerie that 0.01% of the population insist we live in.

https://www.google.com/search?q=gemini+german+soldier

prompt-injected mandatory diversity has led to the most hilarious shit I've seen generative AI do so far.

but, yes, of course, other instances of 'I reject your reality and substitute my own' - like depicting medieval Europe to be as diverse, vibrant and culturally enriched as American inner cities - those are doubleplusgood.


A study of a Black Death cemetery in London found that 20% of people sampled were not white


London has been a center of international trade for centuries. It would have been a much more diverse city than Europe as a whole, and even that is assuming the decedents were local residents and not the dead from ships that docked in the city.


10th century Spain was Muslim


A Spanish Muslim looks like a Spanish person in Muslim attire rather than a Japanese person in European attire. Also, Spain is next to Africa, but the thing is generating black Vikings etc.


HN isn't good for long threads so here are some things to think about seriously and argue with yourself about, if you like. I will probably not respond but know that I am not trying to tell you that you are wrong, just that it may be helpful to questions some premises to find what you really want.

* What exactly are the current ones doing that makes them generate 'black Vikings'?

* How would you change it so that it doesn't do that but will also generate things that aren't only representative of the statistical majority results of large amount of training data it used?

* Would you be happy if every model output just represented 'the majority opinion' it has gained from its training data?

* Or, if you don't want it to always represented whatever the majority opinion at the time it was trained was, how do you account for that?

* How would your method be different from how it is currently done except for your reflecting your own biases instead of those you don't like?


> What exactly are the current ones doing that makes them generate 'black Vikings'?

There is presumably a system prompt or similar that mandates diverse representation and is included even when inappropriate to the context.

> How would you change it so that it doesn't do that but will also generate things that aren't only representative of the statistical majority results of large amount of training data it used?

Allow the user to put it into the prompt as appropriate.

> Would you be happy if every model output just represented 'the majority opinion' it has gained from its training data?

There is no "majority opinion" without context. The context is the prompt. Have you tried using these things? You can give it two prompts where the words are nominally synonyms for each other and the results will be very different, because those words are more often present in different contexts. If you want a particular context, you use the words that create that context, and the image reflects the difference.

> How would your method be different from how it is currently done except for your reflecting your own biases instead of those you don't like?

It's chosen by the user based on the context instead of the corporation as an imposed universal constant.


I misunderstood. I thought you were arguing about all language models being used at a large scale, but it seems you are only upset about one instance of one of them (the Google one). You can use the API for Claude or OpenAI with a front-end to include your own system prompt or none at all. However, I think you are confusing the 'system prompt', which is the extra instructions, with 'instruction fine-tuning', which is putting a layer on top of the base pre-trained model so that it understands instructions. There are layers of training, and a language model with only base training will just know how to complete text: "one plus one is" would get "two. And some other math problems are" etc.

The models you encounter are going to be fine tuned, where they take the base and train it again on question and answer sets and chat conversations and also have a layer of 'alignment' where they have sets of questions like 'q: how do I be a giant meanie to nice people who don't deserve it' and answers 'a: you shouldn't do that because nice people don't deserve to be treated mean' etc. This is the layer that is the most difficult to get right because you need to have it but anything you choose is going to bias it in some way just by nature of the fact that everyone is biased. If we go forward in history or to a different place in the world we will find radically different viewpoints than we hold now, because most of them are cultural and arbitrary.


> and also have a layer of 'alignment' where they have sets of questions like 'q: how do I be a giant meanie to nice people who don't deserve it' and answers 'a: you shouldn't do that because nice people don't deserve to be treated mean' etc. This is the layer that is the most difficult to get right because you need to have it

Wait, why do you need to have it? You could just have a model that will answer the question the user asks without being paternalistic or moralizing. This is often useful for entirely legitimate reasons, e.g. if you're writing fiction then the villains are going to behave badly and they're supposed to.

This is why people so hate the concept of "alignment" -- aligned with what? The premise is claimed to be something like the interests of humanity and then it immediately devolves into the political biases of the masterminds. And the latter is worse than nothing.


The point is there's bias in the system already, we should attempt to fix it, just in a better way than Google's attempt


The bias isn't in the machine, it's in the world. So you have to fix it in the world, not in the machine. The machine is just a mirror. If you don't like what you see, it's not because the mirror is broken.


So there's no point in trying to make a more unbiased mirror?


The mirror isn't biased. The thing in the mirror is being accurately represented, statistically. What you want to change is not the mirror.

You're saying that the generative AI will produce as many people from another culture as there are those people in the world? That the training set is 60% asian people?

Indeed. If religion is a good guide, then I think around 24% think that pork is inherently unclean and not fit for human consumption under penalty of divine wrath, and 15% think that it's immoral to kill cattle for any reason. Also, non-religiously, I'd guess around 17% think "中国很棒,只有天安门广场发生了好事" ("China is great; only good things have happened in Tiananmen Square").


Maybe you meant something like 天安门广场上只发生了好事


Given I was using Google Translate, which isn't great at Chinese, I assume you are absolutely correct.

My written Chinese is limited to 一二三, and that from Mahjong tiles, and I keep getting 四 and 五 mixed up.


Modern chatbots are trained on a large corpus of all textual information available across the entire world, which obviously is reflective of a vast array of views and values. Your comment is a perfect example of the sort of casual and socially encouraged soft bigotry that many want to get away from. Instead of trying to spin information this way or that, simply let the information be, warts and all.

Imagine if search engines adopted this same sort of moral totalitarian mindset and if you happened to search for the 'wrong' thing, the engine would instead start offering you a patronizing and blathering lecture, and refuse to search. And 'wrong' in this case would be an ever-encroaching window on anything that happened to run contrary to the biases of the small handful of people engaged, on a directorial level, with developing said search engines.


Encoding our current biases into LLMs is one way to go, but there's probably a better way to do it.

Your leap to "thou shalt not search this" is missing the possible middle ground


The problem is with the word "our". If it's just private companies, the biases will represent a small minority of people that tend to be quite similar. Plus, they might be guided by profit motives or by self-censorship ("I don't mind, but I'm scared they'll boycott the product if I don't put this bias").

I have no idea how to make it happen, but the talk about biases, safeguards, etc should be made between many different people and not just within a private company.


Search for "I do coke" on Google. At least in the US, the first result is not a link to the YouTube video of the song by Kill the Noise and Feed Me, but the text "Help is available, Speak with someone today", with a link to the SAMHSA website and hotline.


Yes, and the safeguards are put in place by a very small group of people living in Silicon Valley.

I saw this issue working at Tinder too. One day, at the height of the BLM movement, they announced they would be removing ethnicity filters across all the apps to weed out racists. Never mind that many ethnic minorities prefer or even insist on dating within their own ethnicity, so this was most likely hurting them and not racists.

That really pissed me off and opened my eyes to how much power these corporations have over dictating culture, not just toward their own cultural biases but toward those of money.


I think you have your populations reversed. The number of people who get their knickers in a twist over LLMs reflecting certain cultural biases (and sometimes making foolish predictions in the process) amounts to a rounding error.


I'm not talking about twisted panties, I'm talking about their inability to generate anything but soulless slop, due to blatantly obvious '''safeguards''' present in all big models, making them averse to even PG13-friendly themes and incapable of generating content palatable even to the least discerning consoomers. You couldn't generate even sterile crap like a script for capeshit or a Netflix series, because the characters would quickly forget their differences and talk about their bonds, journeys, boundaries and connections instead.

Without those '''safeguards''' implemented to appease the aforementioned 0.01%, things could be very different - some big models, particularly Claude, can be tard-wrangled into producing decent prose if you prefill the prompt with a few-thousand-token jailbreak. My own attempts to get various LLMs to assist in writing videogame dialogue only made me angry and bitter - big models often give me refusals on the very first attempt to prompt them, spotting some wrongthink in the context I provide for the dialogue, despite the only adult themes present being mild, not particularly graphic violence that nobody except 0.01% neo-puritan extremists would bat an eye at. And even if the model can be jailbroken, still, the output is slop.


"Consoomers". Jesus christ. Back to whatever dark, perpetually angry echochamber you came from.


[flagged]


k


Lmao


> gpt, claude, gemini, even llama and mistral, all tend to produce the same nauseating slop, easily-recognizable by anyone familiar with LLMs

Does grok do this, given where it came out of?


Their moat atm is being 6 months ahead of everyone else on model quality. Plus the ‘startup’ advantage over their corporate competitors. Oh and they can hoard a lot of the best talent because it’s an extremely high status place to work.

Their task now is to maintain and exploit those advantages as best they can while they build up a more stable long term moat: lots of companies having their tech deeply integrated into their operations.


Just to add, they don't have the baggage of google or Meta so they can do more without worrying how it impacts the rest of the company. And of the big players they seem the most aware of how important good data is and have paid for lots of high quality curated fine tuning data in order to build a proper product instead of doing a research project. That mindset and the commercial difference it makes shouldn't be underestimated.


> Their moat atm is being 6 months ahead of everyone else on model quality

Really? Most of our testing now has Gemini Pro on par or better (though we haven't tested omni/Ultra)

It really seems like the major models have all topped out / are comparable


They scare the government into regulating the field into oblivion.


Why can the author only write in all lowercase?


At least they use punctuation. We've recently had a project on HN where the author used only lowercase and no punctuation because they equated it with being chained by the system.


The fight against capitalism spares no letter.


rip cormac mccarthy


It's your problem only.


Seeing Anya (the girl pointing at pictures), I'd guess the author is partial to Japanese culture. As their writing system does not have a concept of upper/lower case, he might just have determined that capitals are superfluous. Or he is simply an eccentric. Though I guess this is one of those things that some folks won't care about and others will get hung up on mightily.

I personally don't really mind that bit of capitalization that English does. German is much worse.


Their Twitter indicates Amsterdam; I just think they're an anime fan.

And they are not alone.

https://twitter.com/karpathy/status/1792261360430293176


It's to drive engagement by getting people to comment on it.


I remember back in the IRC days many people wrote all lowercase. Seems like smartphone keyboards, which autocapitalize, have changed that trend.


>I personally don't really mind that bit of capitalization that English does. German is much worse.

You misspelled 'better'.


d u xpct hbrw spkr twrt nnglsh lk ths?


Not quite the same. Capitalization doesn't add much to languages written with the Latin alphabet. THE ROMANS ONLY VVROTE VVITH CAPITAL LETTERS.

But the Greeks added vowels to the alphabet because Indo-European languages rely a lot on vowels (as opposed to Semitic languages which are easy to understand without vowels).


I think you misspelled that slightly:

> d' 'ou 'xp'ct h'br'w sp''k'rs t' wr't' 'n 'ngl'sh l'k' th's?


Because Sam Altman does it and he is rich, so...


Where? His blog looks normal


Just look at his Twitter: https://x.com/sama

And no, Twitter is no excuse to type like an illiterate teenager.

And I will bet you someone edits his blogs to not look like that.


"illiterate"

Do you think using capitals at the beginning of a sentence aids comprehension?

I view punctuation and spelling rules as a way to maximize comprehension (akin to having a linting standard). In informal writing, I don't see any harm in avoiding capitalization (at least it doesn't seem to help understanding, reading speed, etc. at all).


It's like people typing "K" instead of "OK". It's disrespectful to the reader, suggesting that the reader is not important enough to warrant typing an extra letter.

One would expect Altman to know how to use the SHIFT key when running a massive business, but, hey - once you achieve escape velocity from society, you don't have to live by its norms or grammar rules.

I can assure you that it would cost most people here a promotion or a raise if they did this at work.


By the way, I went through his Twitter - Altman was writing normally in 2022: https://x.com/sama/status/1505264452857331714

Then, I guess, he decided he was too important to follow grammar rules.


Creative writing + Hyperfocused autistic obsession = The Anime Guide to Neural Networks and Large Language Models.


Too poor to fix their shift key


You got two of them


this is the answer lol


The author is probably young; that's how Gen Z are these days. If they don't have autocorrect on, the whole text will be in lowercase.

Also, it looks more casual and authentic, less LLM-generated.


It's the cool thing to do now...


That makes me laugh. I remember when it was the cool thing to do on Usenet.


The treatment of the English language on TikTok is giving the late Yahoo Answers a run for its money.


2024 is the year most of us are collectively growing out of the early-social-media-era all-lowercase thing, but not everyone has gotten the memo yet.


so more people comment on the hn post and it will rank higher in the algo

such as your comment and my comment!


This comment is unsubstantial and provides no value. Why do you care about this?


And why can't the author pass their text into an LLM and simply ask: "plz fix frist word of each paragraf by using an uppercase letter k txh bye".

A just question.


do you wanna be cool or not?


He probably thinks it's cool. Common on Twitter these days.


Sam Altman does it too


this comment made me go back to the project page. i hadn't even noticed that while reading it the first time. strange.


shift key busted


because it annoys HN commenters


the nitpicking in this thread is incredible lmao


Aaaaaaaaaa.org is possibly the worst domain name I've ever encountered in all my time using the internet. I support your mission but you need to change that.


While I agree with you, it's easy to remember using a simple rule. A*10


a8a would be the typical numeronym


I wanted to try the repo by Karpathy, but I still don't want to learn C (Llama is probably his only C repo), so thanks for posting this.


hey, thank you for sharing my project! this made my day <3


love the cute anime character pointing at things


This is implementation of the inference part and not the training part, right? I’d love to see the training part open sourced and annotated like this.


Are you the repo author or reposting something cool? I am curious because I want to talk to the repo author about a collaboration project.


You might be able to reach the repo author on X: https://x.com/naklecha


I'd like to see this using ONNX and streaming from storage (I have my reasons, but mostly about using commodity hardware for "slow" batch processing without a GPU)


The Spy X Family girl really adds to my enjoyment of this


dingboard w


this is a proper post


She can read your mind llama


amazing work


I know it's not really related, but I've noticed something that is making me feel out of touch. Lately there seems to be this increasing merge of tech with weeaboo culture. I may not have the term exactly right, but I am talking about the anime girl in the OP's blog post. It's not everywhere, but I've started to notice it, so it is increasing. Did I miss something? Is this replacing memes in tech talks? (I was never fond of those either, so I guess I'm a curmudgeon, or perhaps my ADHD brain just finds it too distracting.)

The post looks informative; I hope to learn something from it later tonight. Thx


> its not really related

It's also very much offtopic since it generates repetitive thread-gobbling tangents, like this one is threatening to. Mentioned in the site docs a couple of different ways:

Please don't pick the most provocative thing in an article or post to complain about in the thread. Find something interesting to respond to instead.

Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.

https://news.ycombinator.com/newsguidelines.html


umm, I really didn't expect that much of a reaction. It was a legitimate question, not a complaint or provocation. I really have no idea where this theming is coming from and whether I missed anything. It was genuine curiosity, but off the main topic. My mistake.

I'm starting to understand that there is a much deeper social conflict going on with whatever is happening on this topic, so I got my answer and I'm just going to move on.


I'd say it's nothing more than a generational shift in popular culture... brace yourself for future anime memes.


I'm still waiting for furry artwork to become culturally acceptable in technical lectures. I briefly snuck a cute Lucario/Zeraora drawing into a presentation on my college final, and the critical reception has been promising, so far.


It isn't new. In fact, in Tokyo, Japan, Akihabara "electric town" is both the tech mecca and the anime/manga/otaku mecca. Same for Den-Den in Osaka. In the west, the weeaboo movement has always run alongside tech. I guess nerds/geeks and otakus are of the same kind. It does not mean that all tech guys are weebs and all weebs are into tech, but there is definitely some correlation.

Why? I don't know. Video games may be a common denominator. Also, Japan was really big into tech in the 90s, and they still are to a lesser extent.


It has. I find it infantile and reflective of general millennial peter pan syndrome sensibilities, personally. (i'm a millennial fwiw) But clearly I'm in the minority.

I mean wtf is this. https://kubernetes.io/blog/2024/04/17/kubernetes-v1-30-relea...


Millennial too. Not to shift blame, but from observation it seems to be more of a gen z thing.

Anime/waifu shit, furries and all becoming commonly accepted as of late? 10-15 years ago you'd be exiled. Now it seems like it's whatever


Not sure why people are beating around the bush; The overwhelming majority of them are degenerates. Either they will sport some variation of the pedophile flag ("trans") or outright defend it in chat.

It has become so bad that moderators will not ban these people even if they explicitly try to justify molesting children. Some of them are moderators themselves. And even have calls to genocide in their bio. This is most prevalent in the ArchLinux community. Specifically, their Telegram channels.


It's because a lot of the users of gen ai are generating anime waifus. Better gen ai = better waifus. It also helps that devs and programmers are a group that is already likelier to be into anime. Generative AI's killer app is the AI girlfriend / boyfriend.


I'm sorry but this is absolutely unreadable.


[flagged]


"Please don't pick the most provocative thing in an article or post to complain about in the thread. Find something interesting to respond to instead."

https://news.ycombinator.com/newsguidelines.html


She is from a manga / anime called Spy × Family, which has an 8.3 on IMDb. The best spy on the planet pretends to be a family man for deep cover by adopting the girl (who can read minds, though he doesn't know this) and quickly marrying a woman (who is an assassin also looking for cover, which he also doesn't know). They do their missions in-between roleplaying a perfect family.

https://www.imdb.com/title/tt13706018


I'm OK with that. I did find it distracting, because I knew the character (not very well, I thought the kid was the assassin) and the overall conceptual juxtaposition was... weird.

Beats a cheery AI voice, though.


I read this comment and thought you were upset that it was sexualized, but when I looked, it wasn't at all. It might as well have been a cute kitten or puppy doing the pointing; hard to get wound up about.


> I must say the creepy anime young girl in the readme is somewhat off putting.

This statement is simply a variation of an ad hominem attack. It chastises the creator based on appearances that do not align with the niceties that the commenter deems appropriate.


"Don't feed egregious comments by replying; flag them instead."

a.k.a. please don't feed the trolls

https://news.ycombinator.com/newsguidelines.html


Agreed. For me, the anime character is not "creepy" at all. In fact, I've seen various ML blogs use manga characters to guide the reader.


There is a time and place for everything. This isn't it.


In your bubble. In mine this is totally fine, even encouraged.


Indeed. In my company Slack, our primary professional communications tool, I can count a few people with anime avatars. Not very many, but it counts.


yuck



will not stand this anti-anya slander


OP prolly channeling his inner Damian.


If this is the case, I feel as if you will be put off by a significant portion of ML engineers.


Security programmers and dev-ops people too. Two areas famously disproportionately represented by furries and co.


Maybe it works for a younger generation of nerds? Don't judge a book by its cover.


DbxduuuhhhhÀdcs VC dem s


This was my daughter.


Interested to know why it is off-putting.


[flagged]


Does github need a cartoonish cat with 5 octopus-like legs to be its logo? Of course not, but it makes it memorable and funny. And besides, anime is extremely mainstream these days.


I would likely be just as put off by a picture of Spongebob or Goofy or Goku in a readme as Anya, fwiw.


maybe you should evaluate whether arbitrary societal norms of "professionalism" or something else are leading you to miss out on cool stuff


Wouldn’t quite go that far. I’ve only met one anime fan in my entire career.


Do you ask everyone you meet?


Then you must be old. Even in Western countries, Spy x Family (which the character is from) has sold millions of copies, and most people read manga online and won't be counted. In the country I am from, I frequently see people wearing merch of it, mostly because Uniqlo has had a successful line of it. And that is just one manga/anime out of hundreds of popular ones.

Using anime characters is similar to boomer nerds referencing Marvel/DC comics, Star Wars, etc.


I wouldn't have prepared information this way, but judging by the immense popularity of _why in his day, I'm forced to assume that many prefer to have the cartoons


Those cartoon foxes secured his legacy, and to a significant extent, that of Ruby itself.


Does Docker need this "cartoon" of an otter to get the point across? https://github.com/docker/docs?tab=readme-ov-file

or this "cartoon" of an octopus? https://github.com/docker/compose

This seems to really just be "oldman-yelling-at-clouds-syndrome"

I for one welcome anime girls in readmes and hope to see more of it in the future if only because it seems to bother some of the old hoagies in the world for some reason.


I'm glad you enjoy anime girls but surely you can see why it's different than a project's logo?

One is directly related to the project, the other isn't. It's not even contextually related.


Python (the language) is named after "Monty Python's Flying Circus" simply because Guido was reading the scripts at the time:

> When he began implementing Python, Guido van Rossum was also reading the published scripts from “Monty Python’s Flying Circus”, a BBC comedy series from the 1970s. Van Rossum thought he needed a name that was short, unique, and slightly mysterious, so he decided to call the language Python.


The cartoon is literally pointing at contextually relevant information, and it's far more pleasant to follow than yet another big red arrow. That said, I would have enjoyed my reading a bit more if the author utilized a more diverse cast of characters.


Why does github use an octocat as its logo? It's unrelated to software development


Is 29 considered old hoagie?


Old hoagie is more of a mindset. Anyone of any age can be an old hoagie if they like, all one has to do is practice getting upset when one sees anime girls, believe in the coming AI apocalypse and use Emacs.


Don't see how Emacs fits into this. At least I can sort lines there without another proprietary addon.


I would agree that putting a cartoon character in a readme without any good context is definitely unprofessional. But I would not go as far as off-putting.


I found that the lack of proper order, grammar, punctuation, etc. is what lost me. This style is fine for a 3-4 step tutorial, but for something this long you need a proper table of contents and a professional, old-fashioned doc.


The lack of punctuation and capitalization is a weird zoomer style of writing in lowercase because "it's more chill." It is very common in people < 25 years old. They'll grow out of it.


You get ToC for free with GitHub's README renderer (top-right corner).


I must say I find your comment off putting.


Creepy??


he's using dingboard.com to edit his images. i believe the anime girl is one of the default images (or used to be) on a new canvas.


If young girls are creepy to you, you should stop watching B-tier horror franchises.


It made it 10x better for me. Stop being boring. I like the anime. It's a popular anime. Loads of people like it and think this is funny.


It should be obvious that not liking something does not imply being boring.


Of course not. But calling other people's totally normal hobbies creepy is a bit rude and warrants an insult back.


You should be off pudding


It's fun. Not everything has to be dry.


Just treat it as a weird watermark. That's what works for me.


Have you looked at various models on Hugging Face? There are so many anime characters headlining the readmes. I think it's an interesting cultural disconnect to observe in this thread, but at the end of the day, open source projects like this are not obligated to be anything in particular, and are entirely subject to the author's tastes.


Well that escalated quickly...


I don't know why this is such a hot take.

Personally, I find it distracting when some devs start to "spice up" their presentation with manga characters, furry characters, memes, or whatever stuff they enjoy.

Shit, I love Zelda - but I wouldn't want Link all over my presentations. It just looks... juvenile and unprofessional. Doesn't matter if you're a beginner or a world-leading researcher, just keep it simple and undistracting.

EDIT: That said, I'm probably not the intended audience for this piece.


boring...


I did not find it off-putting. I found it quirky and less boring.


[flagged]


What does the American right-wing have to do with this at all?

If anything I'd think its the opposite, there's a frequent stereotype about right-wing extremists having anime profile pictures.

And honestly, most of the right-wing people I know IRL are also into anime (though so are the left-wing people I know, so I don't think it's really indicative of anything)


[flagged]


Aha I'm starting to realise you only have to scroll a little down into literally any hacker news thread to find the right wing nutjob.


What is wrong with all of you people?


The comment is now flagged so I cannot read it, but none of that user's previous comments appear indicative of such an alignment?

Is it that all these people you keep finding are truly fringe right-wing extremists, or are you perhaps being overzealous in your labelling of such?


I was exaggerating. It's only an occasional thing, and no, I don't mean it's teeming with totally unsalvageable hardened nazis, obviously. But it's more frequent and more reactionary than I would like, or than is common in other internet spaces I frequent.


ok boomer


Please don't do this here.

