For the record, you should never be paying someone to tinker with and/or parallelize low-level matrix routines. The numerical linear algebra folks spend their entire careers thinking about this stuff and writing (FOSS) software to wring every last bit of performance out of the hardware. There are specialized suites for multicore, distributed and even GPU setups.
... unless you have a very specialized use case. The linalg guys are great, but they write for the general case. They have to.
And so generations of game developers have written their own matrix multiplication code, with quite nice performance results.
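To make the "specialized case" concrete, here is a rough sketch of the kind of fixed-size 4x4 multiply a game codebase tends to carry around. The `Mat4` type, the row-major layout, and the `mul` and `identity` names are assumptions made up for illustration, not taken from any particular engine. The point is only that the dimensions are compile-time constants, so there is no dispatch on size, stride, or layout, and the compiler is free to unroll and vectorize the loops.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size matrix type: 4x4 floats, row-major.
struct Mat4 {
    std::array<float, 16> m{};  // value-initialized to all zeros
};

// Multiply two 4x4 matrices. All loop bounds are compile-time constants,
// which lets an optimizing compiler unroll (and often vectorize) the loops.
inline Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 c;
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 4; ++k) {
                sum += a.m[i * 4 + k] * b.m[k * 4 + j];
            }
            c.m[i * 4 + j] = sum;
        }
    }
    return c;
}

// Build the 4x4 identity matrix.
inline Mat4 identity() {
    Mat4 r;
    for (std::size_t i = 0; i < 4; ++i) {
        r.m[i * 4 + i] = 1.0f;
    }
    return r;
}
```

A general-purpose GEMM routine has to handle arbitrary dimensions and strides, which is overhead you don't need when every matrix is a 4x4 transform. Whether this actually beats a tuned library on a given target is something you'd have to measure.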
It'd be nice to see if some of that performance-intensive code would benefit from being written in Fortran. Anybody up for porting Box2D as a small test case? ;)