Assuming you have the math and algorithmic background, I would start by reading the “attention is all you need” paper. After reading, attempt to build a baby transformer model in PyTorch. After that, consider constructing some of the building blocks without libraries to understand how they work.
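To make that last step concrete, here is a minimal sketch of what one of those building blocks might look like written by hand: a single-head scaled dot-product self-attention layer in PyTorch. Module and variable names are my own and this is illustrative, not a reference implementation.

```python
# A rough sketch of one transformer building block: single-head scaled
# dot-product self-attention, i.e. softmax(QK^T / sqrt(d_k)) V from the paper.
# Names, sizes, and the smoke test below are illustrative, not canonical.
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Learned projections mapping the input to queries, keys, and values.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, seq, seq)
        weights = scores.softmax(dim=-1)  # attention weights over the sequence
        return weights @ v  # (batch, seq_len, d_model)

attn = SelfAttention(d_model=64)
out = attn(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```

Wrapping several such heads together and adding residual connections, layer normalization, and a position-wise feed-forward network gets you most of the way to one encoder layer from the paper.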
I read this exact advice often here on HN and I can’t help but wonder.
Is the person writing it just repeating something they read? Is it just because they like the 'coding from first principles' aesthetic?
I mean, let's imagine that someone does read that paper and manages to replicate the code (quite an effort for someone coming from outside AI and academia).
Then what? I doubt it's particularly illuminating. That doesn't really qualify you for a job by itself. So what's the goal there? Is it just a thing to say to look like a cool hacker who codes from scratch?
> Is the person writing it just repeating something they read? Is it just because they like the 'coding from first principles' aesthetic?
of course. you know how i know? absolutely no one except the wannabes has time to read papers - people working in the area have deadlines and meetings. we absorb the content of papers by osmosis - convos, codebases, occasionally a talk at a conference.
it's especially horrible advice from the perspective of pedagogy to tell a n00b to read a paper (so the person giving the advice has immediately disqualified themselves from possibly being an expert) because papers are horribly written, omit critical details, and function purely as advertising for the authors, group, etc.
for every poor undergrad/n00b soul reading this comment, take to heart this thing that took me too long to unlearn (due to its constant perpetuation by people like gop): reading the paper is 100:1 waste-of-time to value-derived.
if i hear about/see some paper that makes strong claims relevant to my work, i will look for a github link and/or email the first author. 5/10 i get a response (the ratio is going up as i'm getting more ingratiated with my community). the other times i just move on - none of these papers have some revolutionary cure for cancer in them, so most of the time what i'm doing is already close enough that i don't need to kill myself figuring out the new thing.
that paper in particular (attention) has nothing in it that is in the least bit interesting/revolutionary. the hard part of attention isn't writing down softmax(QK^T)V, the hard part is executing that matrix product fast enough that you're not waiting eons for your model to converge.
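For what it's worth, the "writing it down" part really is a couple of lines; the pain is that the literal version materializes a full seq_len x seq_len score matrix, which is why fused kernels exist. A rough sketch (shapes and names are mine; the fused call is the one PyTorch 2.x exposes):

```python
# The formula as written: softmax(QK^T / sqrt(d)) V. Trivial to type, but it
# builds an O(seq_len^2) score matrix, which is the part that actually hurts.
import math
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (..., seq, seq)
    return scores.softmax(dim=-1) @ v

q = k = v = torch.randn(1, 8, 2048, 64)
out_naive = naive_attention(q, k, v)

# PyTorch 2.x ships a fused kernel for exactly this bottleneck.
out_fused = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(out_naive, out_fused, atol=1e-5))  # True, up to float error
```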
Could you post a link to "the BERT paper"? I've read some, but I'd be interested in reading anything that anyone considers definitive :) Is it this one? "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding": https://arxiv.org/abs/1810.04805
The goal here is to actually understand the mechanics of the model and to begin building an intuition for the space. I would put the effort at a few weekends of focus.
I'll also add that the models turned out to be a lot simpler to understand than I expected going in.
If developing intuitions is the goal, I really do think Jay Alammar's "Illustrated Transformer" [1] is at least a step-function improvement over the academic paper itself in service of that outcome.
(I totally realize this is subjective, but that has been my experience with my own learning in the space over the last few years as well as some folks I've mentored)
Just a quick reminder to everyone reading that AI/ML (let's face it, it's 0.1% AI and 99.9% ML) is still a ton more than just SOTA deep learning models. Depending on where you work, it could be all classical machine learning methods and zero deep learning, or the other way around.
Having a broad enough understanding of ML would be a good starting point, along with solid software engineering skills.
I'm not qualified to answer this, but I would say nothing is really "overkill".
I've been "filling in the gaps" in math for almost a year now to learn machine learning casually. I don't even need to use it; I'm just obsessed with learning, and I read about it for nearly an hour a day, and it's still not enough.
Being self-taught at math introduces so many painful problems. If I were to do this seriously, I would start ALLLL the way back at algebra in 5th grade and work forwards ALLL the way up to linear algebra/calculus etc.
There are just too many tiny things and subtle details I find I've missed. It makes any example take 10 times more brain power just to do simple things I don't remember, like the rules of factorization. So I'll go learn that simple thing, go back, and two minutes later I'm off looking up some other simple thing. Honestly, the idea of relearning 5th grade math is so boring that I've never actually done it, but if I were serious, instead of what I do now I would just learn the ENTIRE freaking thing.
Learning things like gradient descent is easy, but that's not even where the hard stuff lies. Trying to deeply understand the math behind those topics, where you're not just glossing over the explanation, is where it gets difficult, and to do that you pretty much need a solid math background without gaps.
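To the "gradient descent is easy" point: the mechanics really do fit in a few lines. Here is a toy sketch on a made-up quadratic (the function, step size, and iteration count are arbitrary); none of the genuinely hard parts, like convergence analysis or conditioning, show up in it.

```python
# Plain gradient descent on f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
# The update rule is just x <- x - lr * f'(x).
def grad_descent(lr: float = 0.1, steps: int = 100) -> float:
    x = 0.0
    for _ in range(steps):
        grad = 2 * (x - 3)  # analytic derivative of (x - 3)^2
        x -= lr * grad
    return x

print(grad_descent())  # converges to ~3.0, the minimizer
```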
If you have a good handle on undergraduate math material, I highly recommend graduate math material. You will "restart" math in a sense and build up those building blocks piece by piece. If you have a shaky understanding of a derivative, that's fine, because you will do epsilon-delta proofs until your eyes bleed in a Real Analysis course.
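For anyone who hasn't met it, the epsilon-delta machinery being referred to is just the formal definition of a limit (and hence of the derivative), written in the usual notation:

```latex
% The epsilon-delta definition of a limit, and the derivative as a limit.
\lim_{x \to a} f(x) = L
  \iff
  \forall \varepsilon > 0 \; \exists \delta > 0 :\;
  0 < |x - a| < \delta \implies |f(x) - L| < \varepsilon

f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}
```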
Edit: I didn't have the laws/properties of logarithms memorized/understood until maybe 3-4 years into my Math degree. I could have learned it sooner probably, but I just had an aversion to it and would desperately translate any problem into exponentials and work with those instead. I definitely sympathise with the desire to "restart" Math.
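Incidentally, the exponential workaround described above is basically why the log laws hold: each one is an exponent rule read backwards. For example, writing x = b^m and y = b^n (so m = log_b x and n = log_b y):

```latex
% Logarithm laws recovered from exponent rules, with x = b^m and y = b^n.
\log_b(xy)    = \log_b\!\left(b^{m} b^{n}\right) = \log_b\!\left(b^{m+n}\right) = m + n = \log_b x + \log_b y
\log_b(x^{k}) = \log_b\!\left(b^{mk}\right)      = k\,m = k \log_b x
```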
> Being self-taught at math introduces so many painful problems. If I were to do this seriously, I would start ALLLL the way back at algebra in 5th grade and work forwards ALLL the way up to linear algebra/calculus etc.