Maybe its easier to specifically describe the code/problem?
For the best speedups, refactoring crappy code isn't going to cut it. It WILL still result in some nice performance boosts, and it's a good way to code in general. But when performance is absolutely essential, you're probably going to have to completely redesign from the ground up.
Mostly it seems like it just comes down to focusing on the parts that are going to take the vast majority of time and then just thinking very, very deeply about the problem you need to solve in that chunk. (Including asking if you're solving the right problem!) and what each and every function call or operation does, and trying to be as smart as you possibly can. Some fairly subtle things here working with intrinsic libraries can sometimes do some really stupid things behind the scenes that are hard to spot. Be careful to avoid any extra steps that are unnecessary, where data might get copied in memory unnecessarily or things like that.
Most of the time for the real speedups, I was writing my own libraries in Fortran that were custom tailored to the problem I was trying to solve. This is important because you might find some weird data validation being done or something that isn't relevant to your pipeline, but is included for generality.
There's certainly no magic bullet, but the most important is thinking hard about your problem, make sure you're solving the right one and the next most important thing is probably custom tailoring the libraries if you have time. ...assuming it's a problem that isn't so well solved. Sometimes that can be a waste of time of course, because your use case isn't so unusual.