That's the postMessage bottleneck. PR #1 replaces it with Atomics-based dispatch, which should push utilisation much higher. Early numbers look like 6.4 tok/sec on an M2 Max.
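For anyone who hasn't used Atomics for this, here's a rough sketch of the general idea: the dispatcher writes a job into shared memory and flips a flag instead of sending a message. This is not the code from the PR. The slot layout and the names `ctrl`, `dispatch` and `takeJob` are made up for illustration, and both sides run on one thread here just to show the handshake.

```javascript
// Hypothetical control block, one per worker: slot 0 = state flag, slot 1 = job id.
const IDLE = 0;
const READY = 1;
const ctrl = new Int32Array(new SharedArrayBuffer(8));

// Dispatcher side: publish a job through shared memory, no postMessage involved.
function dispatch(jobId) {
  Atomics.store(ctrl, 1, jobId);     // write the payload first
  Atomics.store(ctrl, 0, READY);     // then flip the flag
  return Atomics.notify(ctrl, 0, 1); // wake a worker parked in Atomics.wait, if any
}

// Worker side: a real worker would block in Atomics.wait(ctrl, 0, IDLE) before this.
function takeJob() {
  if (Atomics.load(ctrl, 0) !== READY) return null; // nothing published yet
  const jobId = Atomics.load(ctrl, 1);
  Atomics.store(ctrl, 0, IDLE);      // hand the slot back to the dispatcher
  return jobId;
}
```

The point is that a wake-up through `Atomics.notify` skips the event loop and the structured-clone step that every postMessage call goes through.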
The part I'd point people to first is ARCHITECTURE.md, specifically the WASM binary construction section. Every other CPU inference project I know of uses Emscripten or a compiled Rust backend. PureBee builds the binary in JavaScript. That's the claim I'd most like challenged, in case I'm wrong about it being novel.
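To make "builds the binary in JavaScript" concrete, here's a toy sketch of the technique: emitting the bytes of a WASM module by hand and instantiating them, with no compiler toolchain. It is not PureBee's builder. The helpers `uleb`, `vec`, `section` and `name` are hypothetical, and the module only exports an `add` function.

```javascript
// Encode an unsigned integer as ULEB128, the varint format WASM uses for sizes.
function uleb(n) {
  const out = [];
  do {
    let byte = n & 0x7f;
    n >>>= 7;
    if (n !== 0) byte |= 0x80; // set the continuation bit when more bytes follow
    out.push(byte);
  } while (n !== 0);
  return out;
}

// A WASM vector is an item count followed by the items' bytes.
const vec = (items) => [...uleb(items.length), ...items.flat()];
// A section is an id, a byte length, then the payload.
const section = (id, payload) => [id, ...uleb(payload.length), ...payload];
// A name is a length-prefixed UTF-8 string.
const name = (s) => {
  const utf8 = [...new TextEncoder().encode(s)];
  return [...uleb(utf8.length), ...utf8];
};

const I32 = 0x7f;
// Function type: (i32, i32) -> i32
const funcType = [0x60, ...vec([[I32], [I32]]), ...vec([[I32]])];
// Body: no locals; local.get 0, local.get 1, i32.add, end
const body = [0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b];

const bytes = Uint8Array.from([
  0x00, 0x61, 0x73, 0x6d,                               // magic "\0asm"
  0x01, 0x00, 0x00, 0x00,                               // binary format version 1
  ...section(1, vec([funcType])),                       // type section
  ...section(3, vec([[0x00]])),                         // function 0 uses type 0
  ...section(7, vec([[...name("add"), 0x00, 0x00]])),   // export func 0 as "add"
  ...section(10, vec([[...uleb(body.length), ...body]])), // code section
]);

const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes));
const add = instance.exports.add;
```

Calling `add(2, 3)` runs the hand-assembled function. A real inference kernel would obviously need memory, loops and SIMD opcodes, but the construction step is the same: arrays of bytes, assembled at runtime.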