She's right about squaring on each iteration, though. And granted, a square root is much more expensive, though done only once. Which method is faster would depend on the number of iterations.
The latency for sqrtss on broadwell is 11 cycles with a throughput of 4, where mul is a latency of 3 and throughput of 1. So, using some concrete numbers, sqrt is more expensive, but not polynomially or even an order of magnitude.
I honestly haven't looked into it in that much detail, but if I had to guess... probably. I'm not sure what the implementation used in hardware currently is, but at very least it's constant bounded,
Conversions are still somewhat expensive, but I do know compared to polynomial time or a large enough constant, it can be a better choice for an optimization. For example, computing log10 of an integer.
Since you're claiming that square root is "~N", the N here must be the number of digits. The number of multiplications required for the "optimization" you're proposing is exponential in this N.