Looking at interpret.cpp for SIMD potential: I bet you could add an allocator for std::vector that aligns and pads every allocation to 32 bytes, then replace all of the scalar op loops with loops over AVX intrinsics. No need for an external library.
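A rough sketch of what that could look like, assuming C++17 (for aligned `operator new`) and float data. `AlignedAllocator`, `FloatVec`, and `add_arrays` are hypothetical names, not anything taken from interpret.cpp. The AVX path is guarded by `__AVX__`, so the same file still builds, with a scalar loop only, when AVX is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical allocator: every allocation is 32-byte aligned and its
// size in bytes is rounded up to a multiple of 32.
template <typename T>
struct AlignedAllocator {
    using value_type = T;
    static constexpr std::size_t kAlign = 32;

    AlignedAllocator() = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // Pad the request up to the next multiple of kAlign bytes.
        std::size_t bytes = (n * sizeof(T) + kAlign - 1) / kAlign * kAlign;
        return static_cast<T*>(::operator new(bytes, std::align_val_t(kAlign)));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(kAlign));
    }
};

template <typename T, typename U>
bool operator==(const AlignedAllocator<T>&, const AlignedAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const AlignedAllocator<T>&, const AlignedAllocator<U>&) { return false; }

using FloatVec = std::vector<float, AlignedAllocator<float>>;

// Hypothetical op loop: out[i] = a[i] + b[i]. All three vectors must have
// the same size.
void add_arrays(const FloatVec& a, const FloatVec& b, FloatVec& out) {
    const std::size_t n = a.size();
    assert(b.size() == n && out.size() == n);
    std::size_t i = 0;
#if defined(__AVX__)
    // data() is 32-byte aligned and i advances by 8 floats (32 bytes),
    // so the aligned load/store forms are valid here.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_load_ps(a.data() + i);
        __m256 vb = _mm256_load_ps(b.data() + i);
        _mm256_store_ps(out.data() + i, _mm256_add_ps(va, vb));
    }
#endif
    // Scalar tail; this is also the whole loop when AVX is unavailable.
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

The padding means a full-width load on the last partial chunk would stay inside the allocation, which is what would let the tail loop be dropped. This sketch keeps the scalar tail anyway so that it never touches elements past `size()`.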
That's a possibility for getting something running in the near term. I'm trying to avoid CPU-specific intrinsics, since I have a fantasy that this might run on ARM in the future, though that may be getting well ahead of myself.
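One possible middle ground, sketched here as an assumption rather than anything decided in this thread: keep the intrinsics behind a single kernel function and pick the implementation with preprocessor checks, so an ARM build only needs a NEON branch (or just the scalar fallback). `add_f32` is a hypothetical name. It uses unaligned loads, so it makes no assumption about how the buffers were allocated.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical kernel: out[i] = a[i] + b[i] for i in [0, n).
// Callers see one signature; only this function knows about the CPU.
void add_f32(const float* a, const float* b, float* out, std::size_t n) {
    assert(n == 0 || (a && b && out));
    std::size_t i = 0;
#if defined(__AVX__)
    // x86 with AVX: 8 floats per iteration, unaligned loads and stores.
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i));
        _mm256_storeu_ps(out + i, v);
    }
#elif defined(__ARM_NEON)
    // ARM with NEON: 4 floats per iteration.
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i));
        vst1q_f32(out + i, v);
    }
#endif
    // Portable scalar loop: handles the tail, and everything on other targets.
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

The scalar branch alone is already portable, and compilers can often auto-vectorize a loop this simple, so the intrinsic branches are optional.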