Superficially, it may look similar to Nvidia's Warp or OpenAI's Triton, but instead of transpiling a subset of Python to the target language, it implements dynamic codegen in pure Python.
These slides discuss the existing GPU programming solutions and make a case for Metashade's approach: https://docs.google.com/presentation/d/e/2PACX-1vQtYIwXIkMnV...
One thing that was frustrating was that I couldn't template a function with another function (as far as I know). That lead to some copied and pasted code when I had multiple implementations that I wanted to compare. Can Metashade do that?