I imagine that if you can make the most of vector instruction sets in your code (instructions that operate on a whole vector of data at once instead of one element at a time), you'll get a huge performance boost for "free". GP seems to be working on a VM that lets you do that (a lot of it went over my head though, I need some coffee).
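To make the "vector of data at once" idea concrete, here's a rough sketch in Rust. This is not GP's VM, just my own illustration: `add_scalar`, `add_lanes`, and the `LANES` width of 8 are names and values I made up. The second function processes the data in fixed-size chunks, which is the shape an optimizing compiler can often turn into SIMD instructions. Whether it actually does depends on the target CPU and optimization level, so treat the speedup as a possibility, not a guarantee.

```rust
/// Hypothetical lane width for illustration; real SIMD widths depend on the CPU.
const LANES: usize = 8;

/// Plain element-by-element add: one value per loop iteration.
pub fn add_scalar(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == out.len() && b.len() == out.len());
    for i in 0..out.len() {
        out[i] = a[i] + b[i];
    }
}

/// Same result, but written in fixed-size chunks so the inner loop has a
/// known trip count. That makes it easier for the compiler to emit vector
/// instructions (it may or may not, depending on target and flags).
pub fn add_lanes(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == out.len() && b.len() == out.len());
    let n = out.len();
    let main = n - n % LANES; // largest multiple of LANES that fits

    for ((co, ca), cb) in out[..main]
        .chunks_exact_mut(LANES)
        .zip(a[..main].chunks_exact(LANES))
        .zip(b[..main].chunks_exact(LANES))
    {
        for i in 0..LANES {
            co[i] = ca[i] + cb[i];
        }
    }

    // Leftover elements that don't fill a whole chunk are handled one by one.
    for i in main..n {
        out[i] = a[i] + b[i];
    }
}
```

Both functions give the same output; the only difference is how the work is grouped, which is the part a vector-aware compiler or VM would exploit.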