Complexity also comes from the number of papers that work out how the different elements of a network work and how to change them intuitively.

Why do we use conv operators, why do we use attention operators, and when do we use one over the other? What augmentations do you use, how big a dataset do you need, how do you collect the dataset, etc., etc.




idk, just using attention and massive web crawls gets you pretty far. a lot of the rest is more product-style decisions about what personality you want your LM to take on.

I fundamentally don't think this technology is that complex.
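
For what it's worth, the core operation fits in a few lines. A toy numpy sketch of scaled dot-product self-attention (my own illustration, not anyone's production code; real models add multi-head projections, masking, and positional information on top):

    import numpy as np

    def attention(Q, K, V):
        # Q, K, V: (seq_len, d) arrays
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # every query scored against every key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V  # weighted average of values

    x = np.random.randn(8, 64)  # toy "sequence" of 8 tokens, 64-dim each
    out = attention(x, x, x)    # self-attention: Q, K, V from the same input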


No? In his recent tutorial, Karpathy showed just how much complexity there is in the tokenizer.
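
To be concrete: the BPE merge loop at the heart of a modern tokenizer looks simple in isolation (toy sketch below — standard byte-pair encoding, not his actual code), but the complexity is in everything wrapped around it: byte-level fallback, regex pre-splitting, special tokens, and all the edge cases he walks through.

    from collections import Counter

    def most_common_pair(ids):
        # most frequent adjacent pair of token ids
        return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

    def merge(ids, pair, new_id):
        # replace every occurrence of `pair` with a single new token id
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        return out

    ids = list("aaabdaaabac".encode("utf-8"))  # start from raw bytes
    pair = most_common_pair(ids)               # e.g. (97, 97) for "aa"
    ids = merge(ids, pair, 256)                # one training step: mint token 256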

This technology has been years in the making, with many small advances each pushing performance up ever so slightly. There have been theoretical and engineering advances that contributed to where we are today. And we need many more to get the technology to an actually usable level instead of the current word spaghetti that we get.

Also, the post is generally about neural networks and not just LMs.

When making design decisions about an ML system, you shouldn’t just choose the attention hammer and hammer away. There are a lot of design constraints you need to consider, which is why I made the original reply.


Are there micro-optimizations that eke out small advancements? Yes, absolutely - the modern tokenizer is a good example of that.

Is the core of the technology that complex? No. You could get very far with a naive tokenizer that just tokenizes by words and replaces unknown words with <unk>. This is extremely simple to implement, and I've trained transformers like this. It (of course) makes a perplexity difference, but the core of the technology is unchanged and is quite simple. Most of the complexity is in the hardware, not the software innovations.
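
Something like this, to be concrete (a minimal sketch; the vocabulary size and helper names are made up for illustration):

    from collections import Counter

    def build_vocab(corpus, max_size=50000):
        # count whitespace-separated words, keep the most frequent ones
        counts = Counter(w for line in corpus for w in line.split())
        words = ["<unk>"] + [w for w, _ in counts.most_common(max_size - 1)]
        return {w: i for i, w in enumerate(words)}

    def encode(text, vocab):
        # out-of-vocabulary words all collapse to <unk>
        return [vocab.get(w, vocab["<unk>"]) for w in text.split()]

    vocab = build_vocab(["the cat sat", "the dog sat"])
    print(encode("the bird sat", vocab))  # [1, 0, 2] — "bird" maps to <unk> (id 0)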

> And we need many more to get the technology to an actually usable level instead of the current word spaghetti that we get.

I think the current technology is usable.

> you shouldn’t just choose the attention hammer and hammer away

It's a good first choice of hammer, tbph.



