It doesn't. You need to generate models specifically for the Neural Engine, which Apple did for Stable Diffusion, but this is just taking advantage of lots of fast RAM and lots and lots of threads, if I understand it correctly.
It uses Metal acceleration and takes advantage of the shared memory architecture, meaning it's basically a GPU with 192GB of VRAM. By trading space (VRAM) for time (FLOPs), it can beat the performance of an RTX 4080 here.
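
To put rough numbers on the space side of that trade, here's a back-of-the-envelope sketch. The 70B-parameter, 16-bit model is a hypothetical example rather than anything from this thread, the 16 GB figure is the RTX 4080's VRAM, and the 192 GB budget is an assumed unified-memory size you should swap for your own machine. It only counts weights and ignores the KV cache and other overhead.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weights_gb(n_params, bits_per_weight) <= memory_gb


# Hypothetical model: 70 billion parameters at 16 bits per weight.
N_PARAMS = 70e9
BITS = 16

print(f"weights: {weights_gb(N_PARAMS, BITS):.0f} GB")
print("fits in 16 GB (RTX 4080 VRAM):", fits(N_PARAMS, BITS, 16))
print("fits in 192 GB (assumed unified memory):", fits(N_PARAMS, BITS, 192))
```

If the weights don't fit in VRAM, part of the model has to live in system RAM and get shuffled over the bus, which is where a card with faster compute can still lose on end-to-end speed.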