Pretty neat implementation. In general, for this sort of exercise (and even if the intention is to go to prod with custom kernels) I lean towards Triton for writing the kernels themselves. It is much easier to integrate into the tool chain, and it allows a level of abstraction that doesn't hurt performance at all while providing useful constructs.
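To give a flavour of what I mean by useful constructs, here's a minimal sketch of a Triton kernel, roughly following the standard vector-add tutorial (the function and variable names are mine, not from the repo):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements            # guard against the ragged last block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

You get block-level tiling, masking, and PyTorch interop essentially for free, without touching the driver API.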
It was written with CUTLASS? No wonder Peter Kim found it valuable and worthwhile to de-obfuscate. Adopting a new programming language invented by OpenAI doesn't sound like a much better alternative. I'd be shocked if either of them could be built for AMD GPUs; it's easy to adapt CUDA code, but not when it's buried under tens of thousands of lines of framework. I like open source code to have clarity so I can optimize it for my own production environment myself. When people distribute code they've productionized for themselves, it squeezes out all the alpha and informational value. Just because something's open source doesn't mean it's actually open. I think people mostly do it to lick the cookie without giving much away.
Zero-cost abstractions exist. That doesn't mean all abstractions are zero-cost, or that being zero-cost somehow invalidates their abstractness/genericness. But maybe we differ on the definition of abstraction.
So does perpetual motion :shrug: But my point is that Triton is not an abstraction in the least. Source: 1) I spent 6 months investigating targeting other backends; 2) Phil himself said he doesn't care to support other backends: https://github.com/openai/triton/pull/1797#issuecomment-1730...
It's amazing how heavily moderated HN is. I have a response here that's been deleted that is maybe 15 words, including a link to a source that corroborates my claim, but that response contains a transcribed emoji and so it's been deleted by dang or whomever. Lol, super rich environment for discourse we've got going here.
For those who have no idea what's being discussed, quick background.
Discussing: Transformer [1] memory issues and approximate attention [2] in machine learning training.
Specifically: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. [3]
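If you are completely new to this: the memory issue is that naive attention materializes the full N x N score matrix in GPU memory, which FlashAttention avoids by tiling the computation. A rough PyTorch sketch of the naive version (illustrative only, not code from the repo or the paper):

    import torch

    def naive_attention(q, k, v):
        # q, k, v: (batch, heads, seq_len, head_dim)
        scale = q.shape[-1] ** -0.5
        scores = (q @ k.transpose(-2, -1)) * scale  # (batch, heads, N, N): O(N^2) memory
        probs = torch.softmax(scores, dim=-1)       # another O(N^2) tensor
        return probs @ v                            # back to (batch, heads, N, head_dim)

    q = k = v = torch.randn(1, 8, 1024, 64)
    out = naive_attention(q, k, v)

At seq_len = 1024 the score matrix is already a million entries per head; double the sequence length and it quadruples.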
As a side comment, this entire industry is sorely in need of at least some intros. The entire space has moved so fast in the last year that I need an entirely new dictionary and thesaurus for all the terms they've created. Notably, because of this, I found out Google has a glossary of machine learning terms. Actually somewhat handy.
Regarding your comment about how fast the research and industry are moving, would HN readers be interested in relevant one- or two-paragraph summaries that are basically "explain it like I'm a machine learning engineer from 2020" who also knows the power of these models from the perspective of using ChatGPT or MS Copilot? That is, assume a fair amount of technical knowledge about the fundamentals, but don't assume that the reader has been paying enough attention to have whitebox knowledge of the current state of the art.
I personally have been looking for "explain it like I'm a CS PhD with lots of experience and the ability to look stuff up". But I suspect your summary would be pretty handy as well.
I reckon you need tacit knowledge. Experience. Luckily on the order of 100 hours, not 10,000.
Build a GPT using Python and PyTorch. For a good course, Andrej Karpathy is your keyword. At $1000 his course would be great value. But actually it is free, which is even better ;-)
It won't take you to flash attention, but it will ramp you to the point where you could probably read papers about it. I almost got that far, then life lifed me. But I was able to implement changes to the architecture of a GPT and do some "hey mum, I am doing SOTA (2021) machine learning".
That sounds at least somewhat helpful. Honestly, a gradient for some of this stuff would be nice. Explain it to me like I'm: "five", "a high schooler", "a college grad (not CS/ML/Eng)", "a CS/Eng person but not ML".
Although in a couple of years, kids in restaurants will probably be telling me how they're leveling up the attention on their neuro-pet. The singularity is steep.
That definition of zero-shot is wrong, but it is commonly used.
Zero-shot means testing out of distribution, not just on "a task" the model wasn't trained on. The latter is ill-defined.
The original definition comes from a few papers, but the classic example is a classifier recognizing zebras despite having never been trained on zebras (though it may have been trained on horses). Zebras are out of distribution. But importantly, out of the implicit distribution, not the target distribution.
The common improper usage usually confuses these two. A simple example might be training on 256x256 images and testing on 1024x1024. That's still in the implicit distribution (as long as the classes are identical). A very common example is training on a large dataset like LAION and then testing on COCO or ImageNet-1k. This is not zero shot, because the classes in ImageNet are in LAION (and in COCO). Basically, that definition is useless, because under it any validation or test set is zero shot: those samples were never seen in the training data and are thus outside the training set. But remember that datasets are proxies for larger distributions.
Where it can sometimes get tricky is tasks (emergence has entered the chat). For example, you may not intend to train a generative model to do classification, but you probably did (it's very clear -- in the math -- if you're training density models (KLD, score, etc.)). This can get hairy, because it's very easy to train a model to do things you don't realize you're training it to do and only find out later. Some people get upset about this, but it's the nature of frameworks that have low interpretability. There's still a lot of mathematics we need to learn, and it tends not to be an explicit focus in ML, but there are plenty in the community focused on this.
My GPU work is not in ML (deep or otherwise); but ...
1. "100 lines of CUDA" + PyTorch; maybe this is useful and maybe it isn't, but counting lines of code on top of a huge codebase is not very meaningful.
2. Launching separate kernels, synchronously, on the default stream, for various operations, is typically not the right way to utilize a GPU.
> maybe this is useful and maybe it isn't, but counting lines of code on top of a huge codebase is not very meaningful.
In this case it's pretty reasonable imo, since the kernel itself is fairly independent - the usage of torch is just for some bindings for the data structures.
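Something along these lines (file and function names here are hypothetical; the actual repo may differ), where torch only supplies the tensors and the JIT-compilation glue:

    import torch
    from torch.utils.cpp_extension import load

    # Compile the standalone CUDA kernel plus a thin C++ binding file
    # and expose them as a Python module.
    ext = load(name="minimal_attn",
               sources=["main.cpp", "flash.cu"],
               extra_cuda_cflags=["-O2"])

    q = torch.randn(1, 8, 128, 64, device="cuda")
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out = ext.forward(q, k, v)  # the ~100 lines of CUDA do the actual work

The kernel itself has no dependency on PyTorch internals beyond accepting its tensors.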
> Launching separate kernels, synchronously, on the default stream, for various operations, is typically not the right way to utilize a GPU.
This is actually the standard way to do things in ML. If you're coming from an HPC background (where this may seem quite strange), the biggest difference is that more or less everything in ML runs on the GPU, so there are very rarely any device-to-host synchronizations. In addition, each individual kernel typically runs on fairly large chunks of data (a million elements would be on the smaller side), so maximizing occupancy with streams is not as necessary as in HPC.
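To make that concrete, here's a rough illustration (my own example, not from the repo) of how launches on the default stream already overlap with host work in PyTorch; the host only blocks when it actually needs a value back:

    import torch

    x = torch.randn(4096, 4096, device="cuda")
    w = torch.randn(4096, 4096, device="cuda")

    y = x @ w           # kernel launch is asynchronous, returns to the host immediately
    y = torch.relu(y)   # queued behind the matmul on the same (default) stream
    loss = y.sum()      # still no device-to-host synchronization

    print(loss.item())  # .item() copies to the host -- the first real sync point

Since the data never leaves the GPU between ops, the default stream keeps the device busy without any explicit stream management.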
Fantastic work! Extremely neat and clear implementation! Interesting note on the backward pass - what do you think are the main blockers for a backward pass?
This is fantastic. I am just starting in the ML space (coming from compilers), and I love short kernels that I can use to understand things better.