At first glance, it looks like they're optimizing the BLAS/LAPACK implementations by making them CPU-architecture-specific, which is the same game Intel MKL plays. That's probably why they're reaching the same performance as MKL.
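To make "CPU-architecture-specific" concrete, here is a toy sketch of the general idea: keep several versions of a kernel and pick one based on the machine the code runs on. Everything in it is hypothetical (the names `dot_generic`, `dot_unrolled4`, `KERNELS`, `select_dot` are mine, not from the project or from MKL), and I'm assuming a simple lookup on the machine string purely for illustration; real libraries typically key on CPU feature sets and ship hand-tuned kernels.

```python
import platform


def dot_generic(a, b):
    # Portable fallback: plain dot product, no architecture assumptions.
    return sum(x * y for x, y in zip(a, b))


def dot_unrolled4(a, b):
    # Stand-in for a "tuned" kernel: 4-way unrolled accumulation.
    # A real library would use SIMD instructions here instead.
    # Assumes len(a) == len(b).
    n = len(a) - len(a) % 4
    s0 = s1 = s2 = s3 = 0.0
    for i in range(0, n, 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    # Handle the leftover elements that do not fill a group of four.
    tail = sum(a[i] * b[i] for i in range(n, len(a)))
    return s0 + s1 + s2 + s3 + tail


# Hypothetical dispatch table: machine string -> kernel (illustrative only).
KERNELS = {
    "x86_64": dot_unrolled4,
    "AMD64": dot_unrolled4,
}


def select_dot(machine=None):
    # Pick a kernel for the given (or detected) machine, else fall back.
    machine = machine or platform.machine()
    return KERNELS.get(machine, dot_generic)


if __name__ == "__main__":
    dot = select_dot()
    print(dot.__name__, dot([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))
```

The point is only the shape: one public entry point, several implementations behind it, and a selection step that depends on the hardware.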

Good to see they aren't reinventing the wheel, and openly crediting NumPy as an inspiration is also a nice touch.