locomp is a Python-to-Metal kernel compiler for Apple Silicon. You write GPU kernels as decorated Python functions; locomp compiles them through an SSA intermediate representation to native Metal Shading Language, applies optimization passes (common subexpression elimination, dead code elimination, constant folding), and dispatches them on your Apple GPU.
Think Triton, but for Apple Silicon.
It supports the full kernel programming model: SIMD reductions, shared memory, atomics, simdgroup matrix ops (AMX hardware), auto-tuning, float16, and INT4/INT8 quantization. There are 54 working examples, including Flash Attention v1/v2/v3, paged attention, RoPE, and SwiGLU.
As a proof of concept, SmolLM2-135M runs end-to-end on locomp kernels: no PyTorch, no MLX, no Metal C++, just @locomp.kernel Python.
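To give a feel for what the optimization passes named above (constant folding, CSE, DCE) do to an SSA program, here is a minimal, self-contained sketch in plain Python. It is a toy written for illustration only: the tuple-based instruction format, the `optimize` function, and the `OPS` table are hypothetical and are not locomp's actual IR or API.

```python
# Toy SSA optimizer. Hypothetical illustration; NOT locomp's real IR or passes.
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def optimize(instrs, outputs):
    """Run constant folding, CSE, and DCE over a straight-line SSA program.

    instrs:  list of (dest, op, args). op is "param" (args = ()),
             "const" (args = (value,)), or a key of OPS (args = two SSA names).
    outputs: SSA names whose values must be preserved.
    """
    consts = {}  # SSA name -> known constant value
    alias = {}   # SSA name -> earlier name that holds the same value
    seen = {}    # (op, args) -> name that already computes it
    kept = []
    for dest, op, args in instrs:
        if op in OPS:
            # Rewrite operands to their canonical (earliest) names.
            args = tuple(alias.get(a, a) for a in args)
            # Constant folding: every operand is known, so evaluate now.
            if all(a in consts for a in args):
                op, args = "const", (OPS[op](*(consts[a] for a in args)),)
        if op == "const":
            consts[dest] = args[0]
        # CSE: an identical (op, args) was already computed, so reuse it.
        key = (op, args)
        if op != "param" and key in seen:
            alias[dest] = seen[key]
            continue
        seen[key] = dest
        kept.append((dest, op, args))

    # DCE: walk backwards and keep only what the outputs depend on.
    live = {alias.get(name, name) for name in outputs}
    result = []
    for dest, op, args in reversed(kept):
        if dest in live:
            result.append((dest, op, args))
            if op in OPS:
                live.update(args)
    result.reverse()
    return result


program = [
    ("x", "param", ()),
    ("c2", "const", (2,)),
    ("c3", "const", (3,)),
    ("c6", "mul", ("c2", "c3")),  # foldable: both operands are constants
    ("a", "add", ("x", "c6")),
    ("b", "add", ("x", "c6")),    # same expression as "a"
    ("dead", "mul", ("x", "x")),  # result is never used
    ("y", "mul", ("a", "b")),
]
optimized = optimize(program, outputs=["y"])
for instr in optimized:
    print(instr)
```

In this toy, the eight input instructions shrink to four: the multiply of two constants is folded into a single constant, the duplicate add is replaced by a reference to the first one, and the unused multiply and the now-unreferenced constants are dropped. A real compiler would also have to handle control flow, memory side effects, and types, which this sketch ignores.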
pip install locomp