
NumPy certainly does not have anything like "Fortran kernels".

Elementwise arithmetic is coded in C (and not very fast...). Linear algebra uses the BLAS and LAPACK interfaces; of these, only LAPACK is usually written in Fortran these days. But anyway, a Fortran matrix is a transposed C matrix and vice versa, and these interfaces all take transposition as an argument, so there is no need for reordering. And if you do np.dot(A.T, A) there is no reordering either, just a change of flags passed to DGEMM.
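
A quick sketch of that point (plain NumPy, nothing assumed beyond it): A.T is a zero-copy view whose strides and flags are flipped, and np.dot can hand that straight to BLAS.

    import numpy as np

    A = np.ones((1000, 500))

    # A.T is a zero-copy view: only strides and flags change, not the data.
    print(A.flags['C_CONTIGUOUS'])    # True
    print(A.T.flags['F_CONTIGUOUS'])  # True: same buffer, reinterpreted
    print(np.shares_memory(A, A.T))   # True: no data was moved

    # np.dot can express the transpose through DGEMM's transa/transb
    # arguments, so no reordering copy happens here either.
    G = np.dot(A.T, A)  # (500, 500) result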




There may not be an absolute need in terms of the ability to perform operations, but performance is heavily dependent on accessing memory in the right order - otherwise your achievable memory bandwidth goes down the drain, as do cache hit rates.

I'm guessing LAPACK might not be too dependent on order though, but I haven't studied its performance tbh.
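
A minimal sketch of the effect (exact numbers will vary with machine and cache sizes): copying a C-ordered array is a sequential sweep, while copying its transposed view forces strided access.

    import timeit

    import numpy as np

    a = np.random.rand(4000, 4000)  # C order: each row is contiguous

    # Sequential copy: one straight sweep through memory.
    t_seq = timeit.timeit(lambda: a.copy(), number=10)

    # Copying the transposed view forces strided reads, so each cache
    # line is only partially used before it is evicted.
    t_strided = timeit.timeit(lambda: a.T.copy(order='C'), number=10)

    print(f"sequential: {t_seq:.2f}s   strided: {t_strided:.2f}s")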


Algorithms to handle what you're talking about efficiently are one of the primary reasons these libraries even exist. OF COURSE they access memory in the right order; that is their job.

But it is not really about right vs. wrong order; it is about using an algorithm that tiles the data efficiently. Google "Anatomy of High Performance Matrix Multiplication" for the approach OpenBLAS builds on.

This is where NumPy falls short -- "a.T + a" will be very slow no matter what you do, but it can be done efficiently with tiling. See the sketch below.
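
A sketch of the idea in pure NumPy (the block size 256 is a guess; a real implementation would do this in C and tune the tile to the cache):

    import numpy as np

    def transpose_add_tiled(a, block=256):
        # Hypothetical sketch: compute a.T + a one cache-sized tile at
        # a time. The mirror tile a[j:j+block, i:i+block], transposed,
        # lines up with a[i:i+block, j:j+block] in the result, and both
        # tiles can stay in cache while they are read and added.
        n = a.shape[0]
        out = np.empty_like(a)
        for i in range(0, n, block):
            for j in range(0, n, block):
                out[i:i+block, j:j+block] = (
                    a[j:j+block, i:i+block].T + a[i:i+block, j:j+block]
                )
        return out

    a = np.random.rand(4096, 4096)
    np.testing.assert_allclose(transpose_add_tiled(a), a.T + a)

Each tile addition is still a vectorized NumPy op, so the Python loop overhead is small; the win, if any, comes from the strided reads being confined to cache-sized blocks.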



