
Transformers from Scratch - stablemap
http://www.peterbloem.nl/blog/transformers
======
cgearhart
This is a _great_ article. One of the things I enjoy most is finding new ways
to understand or think about things I already feel like I know, and this
article helped me do both with transformer networks. I especially liked how
explicitly and simply it explained things like queries, keys, and values;
permutation equivariance; and even the distinction between learned model
parameters and parameters derived from the data (like the attention weights).

The author quotes Feynman, and I think this article is a great example of
Feynman's ideal of explaining complex subjects in simple terms.

------
dusted
And here I was, excited to learn something about actual transformers,
something involving wire and metal...

~~~
NKCSS
Same here, would have loved a build guide for some actual transformers :)

~~~
megous
We did this in school on a coil-winding machine, and also learned the math.

I don't remember the winding part being that much fun. :)

It was something old, akin to this:
[https://www.youtube.com/watch?v=Y-GyMYZ8yTU](https://www.youtube.com/watch?v=Y-GyMYZ8yTU)

~~~
segfaultbuserr
RF transformers (in the form of various coils, chokes, baluns, etc.) are more
interesting and complex to analyze. I once wound three of them, and none
worked. I thought I had finally found an article that explains the subject;
well, not a chance ;-)

------
yamrzou
This is the best article I have read so far explaining the transformer
architecture. The clear and intuitive explanation can’t be praised enough.

Note that the author has a Machine Learning course with video lectures on
YouTube, which he references throughout the article:
[http://www.peterbloem.nl/teaching/machine-learning](http://www.peterbloem.nl/teaching/machine-learning)

------
Gallactide
This man was my professor at the VU.

Honestly, his lectures were fun and easy to look forward to. I'm really glad
his post is getting traction.

If you find his video lectures, they are a really graceful introduction to
most ML concepts.

~~~
kranner
VU Machine Learning 2019 playlist:

[https://www.youtube.com/watch?v=-pve3oIvxa8&list=PLCof9EqayQ...](https://www.youtube.com/watch?v=-pve3oIvxa8&list=PLCof9EqayQgupldnTvqNy_BThTcME5r93)

------
isoprophlex
Stellar article. I never understood self-attention; this makes it very clear
in a few concise lines, with little fluff.

The author has a gift for explaining these concepts.

------
NHQ
This is sweet. I've written conv, dense, and recurrent networks from scratch.
Transformers next!

Plug: I just published this demo using gradient descent to find control
points for Bézier curves:
[http://nhq.github.io/beezy/public/](http://nhq.github.io/beezy/public/)
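
The core idea, as a minimal PyTorch sketch (not the demo's actual code; the
curve, step counts, and learning rate here are made-up placeholders): treat
the control points as learnable parameters and run gradient descent on the
squared error between the curve and target samples.

```python
import torch

# Hypothetical sketch, not the demo's actual code: fit a cubic Bezier
# curve's control points to target samples with plain gradient descent.
t = torch.linspace(0, 1, 50).unsqueeze(1)      # (50, 1) curve parameters

def bezier(P, t):
    # Cubic Bernstein basis evaluated at each t: shape (50, 4)
    B = torch.cat([(1 - t) ** 3,
                   3 * t * (1 - t) ** 2,
                   3 * t ** 2 * (1 - t),
                   t ** 3], dim=1)
    return B @ P                               # (50, 2) points on the curve

target = torch.stack([t.squeeze(), t.squeeze() ** 2], dim=1)  # fit y = x^2
P = torch.randn(4, 2, requires_grad=True)      # 4 control points in 2D
opt = torch.optim.SGD([P], lr=0.5)

for step in range(2000):
    opt.zero_grad()
    loss = ((bezier(P, t) - target) ** 2).mean()
    loss.backward()
    opt.step()

print(P.detach())  # control points whose curve approximates y = x^2
```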

------
ropiwqefjnpoa
Ah yes, machine learning architecture transformers, I knew that.

------
siekmanj
Wow. I have been looking for a good resource on implementing
self-attention/transformers on my own for the last week - can't wait to read
this through.

------
ccccppppp
Noob question: I have a 1D conv net for financial time series prediction.
Could a transformer architecture be better for this task? Is it worth a try?

~~~
hadsed
If you think a longer context length might be helpful, consider stacking
convolutions to give higher units a bigger receptive field, or try a
convolutional LSTM. If that helps, and you have a further argument for why an
even larger context window would be helpful, then perhaps try attention; in
that case a Transformer would be reasonable. But your stacked conv net would
be the fastest and most obvious thing that should work (with the caveat that I
know nothing else about your data and its characteristics, which is a really
big caveat).
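
Something like this minimal PyTorch sketch is what I mean by stacking (the
channel count and depth are made-up placeholders; doubling the dilation each
layer makes the receptive field grow exponentially with depth):

```python
import torch
import torch.nn as nn

class StackedConv1d(nn.Module):
    """Stacked dilated 1D convolutions: doubling the dilation each layer
    makes the receptive field grow exponentially with depth."""
    def __init__(self, channels=32, depth=4, kernel_size=3):
        super().__init__()
        layers = []
        for i in range(depth):
            d = 2 ** i  # dilation: 1, 2, 4, 8, ...
            layers += [nn.Conv1d(channels, channels, kernel_size,
                                 padding=d, dilation=d),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, channels, time)
        return self.net(x)

# kernel_size=3 and depth=4 give a receptive field of
# 1 + 2 * (1 + 2 + 4 + 8) = 31 time steps.
x = torch.randn(8, 32, 128)        # 8 series, 32 features, 128 steps
print(StackedConv1d()(x).shape)    # torch.Size([8, 32, 128])
```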

Consider looking at your errors and judging whether they stem from things your
current model doesn't do well but that Transformers do, i.e., correlating two
steps in a sequence across a large number of time steps. Attention is
basically a memory module, so if you don't need that it's just a waste of
compute resources.

~~~
ccccppppp
Thanks for the insight, also for mentioning convolutional LSTM, I wasn't aware
such a thing existed.

> _Attention is basically a memory module, so if you don't need that it's
> just a waste of compute resources._

But aren't CNNs also like a memory module (i.e., they memorize what leopard
skin looks like)? I guess attention is a more sophisticated kind of memory,
"more dynamic" so to speak.

Anyway, I'm glad to hear that a transformer architecture isn't totally stupid
for my task. I will look up the literature; there seems to be a bit on this
matter.

~~~
hadsed
Yeah, in some sense any layer is a "memory module". Perhaps more specifically,
attention solves the problem of directly correlating two items in a sequence
that are very, very far away from each other. I'd generally caution against
using attention prematurely, as it's extremely slow (its cost grows
quadratically with sequence length), meaning you'll waste a lot of your time
and resources without knowing if it'll help. Stacking conv layers or using
recurrence is an easy middle step that, if it helps, can guide you on whether
attention could provide even more gains.
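
To see where that quadratic cost comes from, here's a bare-bones
self-attention sketch in PyTorch (no learned projections or multiple heads,
roughly the basic version the article builds up from): every step is compared
with every other step, so the weight matrix is time x time.

```python
import torch
import torch.nn.functional as F

def basic_self_attention(x):
    """Bare-bones self-attention: no learned projections, single head.
    x: (batch, time, features). Every step attends to every other step,
    so the weight matrix is (time, time), quadratic in sequence length."""
    raw = torch.bmm(x, x.transpose(1, 2))   # (batch, time, time) dot products
    weights = F.softmax(raw, dim=2)         # each row sums to 1
    return torch.bmm(weights, x)            # weighted mixture of all steps

x = torch.randn(4, 256, 16)                # 4 sequences, 256 steps, 16 features
print(basic_self_attention(x).shape)       # torch.Size([4, 256, 16])
# Doubling 256 -> 512 steps quadruples the size of each weight matrix.
```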

------
gwbas1c
The title is misleading. I thought this was an article about building your
own electrical transformer, or building your own version of the 1980s toy.

~~~
coolness
I think the top-level domain of the author's blog kinda gives it away :)

~~~
ChickeNES
Huh? The website's TLD is .nl, not .ml. (And I'll also chime in to say that I
thought it was either electrical transformers or 3D-printed Transformers
toys.)

~~~
coolness
Oh wow, my eyesight must be poor since I honestly thought it was .ml.

