You could also check the WorldCat catalog to see if a library near you offers the ebook for lending. Universities typically allow the general public to walk in and read books without registering.
I have, yes. I can't speak for OpenBLAS or MKL, but I'm familiar with Eigen's and nalgebra's implementations to some extent.
nalgebra doesn't use blocking, so decompositions are handled one column (or row) at a time. That's fine for small matrices, but it scales poorly to larger ones.
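To illustrate what "one column at a time" means, here's a minimal sketch (in Python, not nalgebra's actual Rust code) of an unblocked LU factorization without pivoting: each column is eliminated in turn, and every elimination step sweeps the whole trailing submatrix, which is cache-unfriendly at large sizes.

```python
def lu_unblocked(a):
    # a: square matrix as a list of lists; factored in place so that the
    # strict lower triangle holds the multipliers of L (unit diagonal
    # implied) and the upper triangle holds U. No pivoting, for brevity.
    n = len(a)
    for k in range(n):              # process one column at a time
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]      # multiplier l_ik
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]  # update trailing submatrix
    return a
```

A blocked variant would instead factor a panel of several columns, then update the trailing matrix with one large matrix multiply, which is where the cache reuse comes from.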
Eigen uses blocking for most decompositions (other than the eigendecomposition), but it doesn't have a proper threading framework. The only operation that is properly multithreaded is matrix multiplication, via OpenMP (plus the unstable Tensor module, which uses a custom thread pool).
I skimmed your post, and I wonder whether Mojo is focusing on such small 512x512 matrices? What is your thinking on generalizing your results to larger matrices?
I think for a compiler it makes sense to focus on small matrix multiplies, which are a building block of larger matrix multiplies anyway. Small matrix multiplies emphasize the quality of the compiler's code generation. Even vanilla Python overhead might be insignificant when gluing small-ish matrix multiplies together to do a big multiply.
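The "building block" point can be sketched as follows: a large multiply is just a triple loop over small tile multiplies, so if the compiler generates a fast small kernel, the outer glue loops matter much less. This is an illustrative sketch, not Mojo code; the tile size and helper names are made up, and it assumes square matrices whose size is divisible by the tile size.

```python
def matmul_small(a, b):
    # Naive kernel for one small tile. This inner loop nest is the part
    # a compiler would vectorize/unroll; everything else is glue.
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            aip = a[i][p]
            for j in range(m):
                c[i][j] += aip * b[p][j]
    return c

def matmul_blocked(a, b, bs=2):
    # Large multiply composed entirely of small-tile multiplies
    # (assumes square matrices with size divisible by bs, for brevity).
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):
                ta = [row[k0:k0 + bs] for row in a[i0:i0 + bs]]
                tb = [row[j0:j0 + bs] for row in b[k0:k0 + bs]]
                tc = matmul_small(ta, tb)   # the fast small kernel
                for di in range(bs):
                    for dj in range(bs):
                        c[i0 + di][j0 + dj] += tc[di][dj]
    return c
```

In a real implementation the tiles would be sized to fit cache and the kernel would be hand-tuned or compiler-generated, but the decomposition into small multiplies is the same.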
There are also "official" ones here: https://github.com/CGAL/cgal-swig-bindings. They don't cover all components; coverage is extended on a "when there are enough requests for it" basis.