Hacker News
Transformers from Scratch (peterbloem.nl)
265 points by stablemap 60 days ago | 28 comments

This is a _great_ article. One of the things I enjoy most is finding new ways to understand or think about things I already feel like I know. This article helped me do both with transformer networks. I especially liked how explicitly and simply it explained things like queries, keys, and values; permutation equivariance; and even the distinction between learned model parameters and parameters derived from the data (like the attention weights).

The author quotes Feynman, and I think this is a great example of his concept of explaining complex subjects in simple terms.
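
For anyone who wants to poke at the permutation equivariance mentioned above, here is a minimal sketch in plain Python. It assumes the simplest form of self-attention, with no learned parameters, where each input vector acts as its own query, key, and value. The names `self_attention`, `xs`, and `perm` are made up for illustration and are not taken from the article's code.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(xs):
    """Parameter-free self-attention (an assumption for this sketch):
    each vector is used as query, key, and value."""
    outputs = []
    for query in xs:
        # Attention weights are derived from the data, not learned.
        weights = softmax([dot(query, key) for key in xs])
        out = [sum(w * value[d] for w, value in zip(weights, xs))
               for d in range(len(query))]
        outputs.append(out)
    return outputs

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]  # hypothetical reordering of the sequence

ys = self_attention(xs)
ys_perm = self_attention([xs[i] for i in perm])

# Equivariance: permuting the inputs permutes the outputs the same way.
equivariant = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(ys_perm[i], ys[p])
)
print(equivariant)
```

Without position information the operation cannot tell one ordering of the sequence from another, which is why position embeddings or encodings get added in practice.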

And here I was, excited to learn something about actual transformers, something involving wire and metal..

Same. Magnetization current and core saturation are fairly fundamental properties influencing transformer design, but they're barely even mentioned in introductory texts.

I feel like a lot of modern transformers are just sort of cargo-cult imports of old designs because everyone who knew the salient parameters has retired and the current crew just kinda nudges things until they work. A from-scratch explanation, up to the current state of the art, would be invaluable to anyone who deals with them.

But nah. This is HN, where headlines are their own code.

In that case you might enjoy this:


Was posted here a while back. Fascinating guy.

I thought it would be about making a Transformers like toy using 3D printing or something.

That would be a really neat project: printing a functional transformer as a single piece.

Same here, would have loved a build creating some actual transformers :)

We did in school on a coil winding machine, and also learned the math.

I don't remember the winding part being that much fun. :)

It was something old, akin to this: https://www.youtube.com/watch?v=Y-GyMYZ8yTU

RF transformers (in the form of various coils, chokes, baluns, etc.) are more interesting and complex to analyze. I once wound three of them, and none worked. I thought I had finally found an article that explains the subject. Well, not a chance ;-)

I was hoping for monad transformers.

This is the best article I have read so far explaining the transformer architecture. The clear and intuitive explanation can’t be praised enough.

Note that the author teaches a Machine Learning course with video lectures on YouTube, which he references throughout the article: http://www.peterbloem.nl/teaching/machine-learning

This man was my professor at the VU.

Honestly, his lectures were fun and easy to look forward to. I'm really glad his post is getting traction.

If you find his video lectures, they are a really graceful introduction to most ML concepts.

Stellar article, I never understood self attention; this makes it so very clear in a few concise lines, with little fluff.

The author has a gift for explaining these concepts.

This is sweet. I've written conv, dense, and recurrent networks from scratch. Transformers next!

Plug: I just published this demo using GD to find control points for Bezier Curves: http://nhq.github.io/beezy/public/

Ah yes, machine learning architecture transformers, I knew that.

Wow. I have been looking for a good resource on implementing self-attention/transformers on my own for the last week - can't wait to read this through.

Noob question: I have some 1D conv net for financial time series prediction. Could a transformer architecture be better for this task, is it worth a try?

If you think a longer context length might be helpful, consider stacking convolutions to give higher units a bigger receptive field, or try the convolutional LSTM. If that helps and you have a further argument for why an even larger context window would be helpful, then perhaps try attention, and in that case a Transformer would be reasonable. But your stacked conv net would be the fastest and most obvious thing that should work (with the caveat that I know nothing else about your data and its characteristics, which is a really big caveat).

Consider looking at your errors and judging whether they stem from things your current model doesn't do well but that Transformers do, i.e., correlating two steps in a sequence across a large number of time steps. Attention is basically a memory module, so if you don't need that it's just a waste of compute resources.
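
To put a rough number on "a bigger receptive field", here is a small sketch. It assumes stride-1 1D convolutions, optionally dilated, and uses the standard rule that each layer adds `(kernel_size - 1) * dilation` time steps. The function name and the example layer configurations are made up for illustration.

```python
def receptive_field(kernel_sizes, dilations=None):
    """Receptive field (in time steps) of stacked stride-1 1D convolutions.

    Assumption for this sketch: no striding or pooling between layers.
    """
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    if len(dilations) != len(kernel_sizes):
        raise ValueError("need one dilation per layer")
    field = 1  # a single output unit sees at least one input step
    for k, d in zip(kernel_sizes, dilations):
        field += (k - 1) * d
    return field

# Three plain layers with kernel size 3.
print(receptive_field([3, 3, 3]))             # 7
# Same layers with dilations doubling per layer.
print(receptive_field([3, 3, 3], [1, 2, 4]))  # 15
```

If the dependencies you care about sit further apart than this number, the conv stack cannot see both ends at once, and that is when the argument for a larger context window starts to apply.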

Thanks for the insight, also for mentioning convolutional LSTM, I wasn't aware such a thing existed.

> Attention is basically a memory module, so if you don't need that it's just a waste of compute resources.

But aren't CNNs also like a memory module (i.e., they memorize what leopard skin looks like)? I guess attention is a more sophisticated kind of memory, "more dynamic" so to speak.

Anyway, I'm glad to hear that a transformer architecture isn't totally stupid for my task, I will look up the literature, there seems to be a bit on this matter.

Yeah, in some sense any layer is a "memory module". Perhaps more specifically, attention solves the problem of directly correlating two items in a sequence that are very, very far away from each other. I'd generally caution against using attention prematurely as it's extremely slow, meaning you'll waste a lot of your time and resources without knowing if it'll help. Stacking conv layers or using recurrence is an easy middle step that, if it helps, can guide you on whether attention could provide even more gains.
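
One rough way to see where the cost comes from: a self-attention layer scores every position against every other position, while a conv layer only looks at a fixed window per position. The counts below are a back-of-the-envelope sketch that ignores channel widths, heads, and constant factors. The function names and sequence lengths are made up for illustration.

```python
def attention_score_count(seq_len):
    # One score per (query position, key position) pair.
    return seq_len * seq_len

def conv_tap_count(seq_len, kernel_size):
    # One window of kernel_size inputs per output position.
    return seq_len * kernel_size

for n in (128, 1024, 8192):
    print(n, attention_score_count(n), conv_tap_count(n, kernel_size=3))
```

Doubling the sequence length quadruples the attention count but only doubles the conv count, which is the trade you pay for letting any two steps interact within a single layer.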

The title is deceiving. I thought this was an article about building your own electrical transformer, or building your own version of the 1980s toy.

FWIW I immediately thought of the ML architecture, not the toy or electrical device. HN is very ML-heavy these days, and in that context 'transformers' has a familiar and obvious meaning.

Same here: I was wondering if he had implemented some Autobots in the kids' programming language Scratch.

Because that would seriously be awesome.

I, too, thought it was the former, but the toy geek in me hoped for the latter. To my surprise it was neither.

I think the top level domain of the author's blog kinda gives it away :)

Huh? The website's TLD is NL, not ML. (And I'll also chime in to say that I thought it was either electrical transformers or 3D-printed Transformers toys.)

Oh wow, my sight must be poor since I honestly thought it was .ml.
