
An adventure in trying to optimize math.Atan2 with Go assembly - tetraodonpuffer
http://agniva.me/go/2017/08/27/fun-with-go-assembly.html
======
dvt
Although this post is interesting, the best way to optimize trig functions is
generally via rational approximations[1] if you can live with the error (and
the clamping). I remember finding a rational function to approximate a tanh-
like soft clipper via a Padé approximant on (-1, 1) a few years ago[2]. See
an equivalent rational approximation for atan2 here[3].

[1] [http://www.embedded.com/design/other/4216719/Performing-efficient-arctangent-approximation](http://www.embedded.com/design/other/4216719/Performing-efficient-arctangent-approximation)

[2] [https://stackoverflow.com/questions/6118028/fast-hyperbolic-tangent-approximation-in-javascript/6118100#6118100](https://stackoverflow.com/questions/6118028/fast-hyperbolic-tangent-approximation-in-javascript/6118100#6118100)

[3]
[https://www.dsprelated.com/showarticle/1052.php](https://www.dsprelated.com/showarticle/1052.php)
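
For a concrete flavor of the idea, here is a minimal Go sketch built on one
widely circulated cubic fit of atan on [-1, 1] (the 0.2447/0.0663
coefficients, good to roughly 0.0015 rad). It is not the exact formula from
any of the links above, and it ignores edge cases like (0, 0) and infinities:

    package approx
    
    import "math"
    
    // atanApprox approximates atan(x) for x in [-1, 1].
    func atanApprox(x float64) float64 {
        ax := math.Abs(x)
        return (math.Pi/4)*x - x*(ax-1)*(0.2447+0.0663*ax)
    }
    
    // atan2Approx reduces the ratio into [-1, 1] so atanApprox
    // stays in its valid range, then fixes up the quadrant.
    func atan2Approx(y, x float64) float64 {
        if math.Abs(y) <= math.Abs(x) {
            r := atanApprox(y / x)
            if x < 0 {
                if y >= 0 {
                    return r + math.Pi
                }
                return r - math.Pi
            }
            return r
        }
        // |y| > |x|: atan2(y, x) = ±pi/2 - atan(x/y)
        r := atanApprox(x / y)
        if y > 0 {
            return math.Pi/2 - r
        }
        return -math.Pi/2 - r
    }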

~~~
raphlinus
A bit OT but my favorite way to approximate tanh is to do g(f(x)) where f(x)
is an odd polynomial and g(x) is 1/sqrt(1 + x^2). The recip sqrt can be
computed very quickly using Newton iteration, and many SIMD instruction sets
have an instruction for it. Highly recommended for audio applications.

~~~
jacobolus
Something is definitely off about your recollection here. g(x) = 1/√(1 + x²)
is always positive, whereas tanh(x) is negative for negative values of x. So
regardless of your f, g(f(x)) can't ever be a correct approximation for
negative x.

~~~
jwmerrill
Looks like grandparent probably meant g(x) = x/√(1+x²)

Looks like f(x)=x+x^3/5 gets you two decimal digits everywhere [1], and you
could improve that with more terms in the polynomial.

Edit: you can derive f(x) by doing a series expansion of tanh(x)/√(1-tanh(x)²)
about x=0, and Mathematica tells me the first few terms are

f(x) = x + x^3/6 + x^5/120 + x^7/5040 + O(x^9)

[1]
[https://www.desmos.com/calculator/zbyll5oeqk](https://www.desmos.com/calculator/zbyll5oeqk)
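
In Go the corrected form is tiny (my own sketch, not taken from the
notebook; keeping terms through x^5). Note f here is just the start of the
sinh series, and sinh/√(1+sinh²) = sinh/cosh = tanh exactly, so the only
error comes from truncating f:

    // assumes: import "math"
    //
    // tanhApprox computes g(f(x)) with g(u) = u/sqrt(1+u*u) and
    // f(x) = x + x^3/6 + x^5/120.
    func tanhApprox(x float64) float64 {
        x2 := x * x
        f := x * (1 + x2/6 + x2*x2/120)
        return f / math.Sqrt(1+f*f)
    }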

~~~
raphlinus
Oops, yes, you're correct, I was typing from memory and didn't double-check.

And yes, the Taylor series is pretty good, but you can usually squeeze a bit
more accuracy-over-range by, e.g., using an optimizer for the coefficients.
Also, note that in your notebook you have x^3/5 but x^3/6 is probably what you
wanted.

------
Jyaif
It should be noted that if you are computing sin/cos after atan2, then it's
likely that what you are doing could be done with some regular vector
manipulation (normalisation, possibly dot product, cross product).
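
For example, a quick Go sketch of the substitution (ignoring the degenerate
zero-length case):

    // assumes: import "math"
    //
    // Instead of:
    //   theta := math.Atan2(y, x)
    //   s, c := math.Sin(theta), math.Cos(theta)
    // the same unit direction falls out of a normalisation:
    func unitDir(x, y float64) (cos, sin float64) {
        r := math.Hypot(x, y)
        return x / r, y / r
    }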

~~~
agnivade
Not really. My algorithm just needs atan2.

It's a flight-tracking algorithm that calculates the velocity of an aircraft
from the messages it broadcasts.

EDIT: > then it's likely that what you are doing could be done with some
regular vector manipulation (normalisation, possibly dot product, cross
product).

Yes, totally. Like I said, I have zero knowledge of assembly, and the
"vector manipulation" you are talking about is totally beyond me ;) Just
doing this part took a lot out of me. :D

Maybe someone knowledgeable enough will be kind enough to write an improved
implementation.

~~~
mirashii
If it just needs atan2, it is not in the category described by the comment
that you're replying to.

~~~
agnivade
Sorry, I might have misunderstood something. But OP said - " if you are
computing sin/cos after atan2,", and I don't need any sin/cos computation
after atan2. Just the atan2.

~~~
recursive
Right, so your situation doesn't fit the condition being described.

------
drv
The assembly version is using the packed version of the FMA instruction
(that's what the "P" in the mnemonic stands for), but as far as I can tell,
it's only using one of the packed values, whereas the instruction can
calculate two (128-bit XMM) or four (256-bit YMM) FMA operations at once. It might be
possible to get some speedup by rearranging the calculation so it can use the
full width of the vector registers - at first glance, at least the two sides
of the division should be possible to calculate in parallel with half as many
FMA instructions.

------
CorvusCrypto
If anyone was curious why the instruction address didn't match up, it's
because he read the instruction identifier wrong. It's also totally
counterintuitive and you have to essentially read the full documentation to
get to any real information about it. Anyway here we go. VEX is a prefix
notation with specialised encodings:

C4 - For a three-byte VEX prefix (which vfmadd needs) this is C4; a two-byte
VEX prefix would start with C5 instead.

E2 - This one is more involved, but basically the first three bits (R, X, and
B) are all set to 1 for this instruction. Then, because this is an 0F38H-map
instruction, the next 5 bits are 00010. Altogether you get
11100010 (E2)

E9 - this is also involved. The first bit is the W mode; this is 1 for the
instruction he used, but it's pretty much ignorable. Bits 2-5 encode the
first source register as the ones' complement of its number: XMM2/YMM2 is
register 0010, which inverts to 1101. Bit 6 is set to 0 for a 128-bit vector
(which this is). Bits 7-8 are set to 01 based on a lookup table for the
"pp" value, which is 66 in the instruction identifier. Altogether that's
11101001 (E9).

A8 - this one is easy, it's the opcode.

C3 - this is the ModR/M byte, which in this case reports the register
operands. The first 2 bits are 11 to indicate the operand is a register
itself (not a memory operand, displaced or otherwise). The next 6 bits are
two 3-bit register codes: 000 selects XMM0 (the destination, which the 213
form also reuses as a source), and 011 selects XMM3 as the remaining source
register.

Altogether it works out to be C4E2E9A8C3. Not intuitive at all really.

Edit: Please someone correct me if I missed something. I hate this kind of
stuff and I'm sure I made a mistake.
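
If anyone wants to check the decoding mechanically, here's a quick Go sketch
that unpacks the fields of those five bytes:

    package main
    
    import "fmt"
    
    func main() {
        // C4 E2 E9 A8 C3 = VFMADD213PD
        enc := []byte{0xC4, 0xE2, 0xE9, 0xA8, 0xC3}
        b1, b2, modrm := enc[1], enc[2], enc[4]
        fmt.Printf("R=%d X=%d B=%d map=%05b\n",
            b1>>7&1, b1>>6&1, b1>>5&1, b1&0x1f)
        // vvvv holds the ones' complement of the register number
        fmt.Printf("W=%d vvvv=%04b (xmm%d) L=%d pp=%02b\n",
            b2>>7&1, b2>>3&0xf, ^b2>>3&0xf, b2>>2&1, b2&3)
        fmt.Printf("opcode=%02X\n", enc[3])
        fmt.Printf("mod=%02b reg=xmm%d rm=xmm%d\n",
            modrm>>6, modrm>>3&7, modrm&7)
    }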

------
flavio81
> "The reason for optimizing the math.Atan2 function is because my current
> work involves performing some math calculations. And the math.Atan2 call was
> in the hot path. "

He should take a look at:

a) the range of inputs that will be passed to atan2(y, x)

b) the level of precision he really needs

There is a chance he could get away with precomputing an atan2 table in
memory and then just reading the values off the table.

Quick as lightning, and an old trick, probably invented in the 1950s!!

EDIT: I see I am potentially _wrong_, due to slower DRAM access compared to
some built-in FP instructions...

~~~
Const-me
> i am potentially wrong

Not just potentially. See this: [https://github.com/Const-me/LookupTables](https://github.com/Const-me/LookupTables)

~~~
flavio81
Great link, you should submit it to HN.

~~~
Const-me
Did that already; everyone ignored it.

[https://news.ycombinator.com/item?id=12724732](https://news.ycombinator.com/item?id=12724732)

------
jcranmer
Not to be mean to the author, but this is an example of why it's generally
useless to resort to assembly to beat a compiler unless you're an expert at
assembly. The sort of case that's being optimized here--a straight-line
piece of assembly--is exactly where compilers are going to do better than
hand-written code except in specific cases.

The first critical failure is using the source code to guide assembly
optimization, rather than the original generated assembly. It means that
you're going to miss what optimizations the compiler is doing to improve this,
and missing those optimizations is going to turn out worse than the original.

There are two main optimizations that come to mind. The first is instruction
scheduling. On newer CPUs (Haswell or later), you have two separate
execution ports for vector ALU instructions, so you can execute two
instructions in parallel if there are no conflicts. If you interleave the
instructions for computing the denominator and the numerator, you can take
advantage of the twin execution ports and execute two instructions at once
instead of one (you use more registers though). This is the main optimization
that I'm sure the Go library is getting.
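
Something like this, as a rough sketch (math.FMA only landed in Go 1.14,
well after this post, and P0..P4 / Q0..Q4 stand in for the unexported
coefficients in the stdlib's xatan):

    // assumes: import "math"
    func atanFMA(x float64) float64 {
        z := x * x
        num := math.FMA(P0, z, P1)
        den := math.FMA(z+Q0, z, Q1)
        num = math.FMA(num, z, P2)
        den = math.FMA(den, z, Q2)
        num = math.FMA(num, z, P3)
        den = math.FMA(den, z, Q3)
        num = math.FMA(num, z, P4)
        den = math.FMA(den, z, Q4)
        // the num and den chains share no data, so a CPU with two
        // FMA ports can issue one instruction from each per cycle
        return math.FMA(x, z*num/den, x)
    }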

The other possible optimization is SLP vectorization: the numerator and
denominator are both polynomial evaluations, so they're doing the same
operations on different data. This means that you can pack both of them into
one vector register and use one instruction to compute the operation for
numerator and denominator simultaneously. On the other hand, it's not
immediately clear if the extra cost in vector formation is going to be worth
the win in speed, particularly given that you can already schedule the two
scalar chains simultaneously as in the first optimization. I don't know if
Go is applying this optimization.

So if you're wondering why it's slower, it's almost certainly that the hand-
written assembly is actually poorly scheduled.

~~~
dvt
> So if you're wondering why it's slower, it's almost certainly that the hand-
> written assembly is actually poorly scheduled.

The assembly is slower because Go doesn't inline it. The overhead is obvious
when calling one-op assembly blocks[1]. The proposal to inline assembly was
rejected last year[2] by Russ Cox.

[1] [https://lemire.me/blog/2016/12/21/performance-overhead-when-calling-assembly-from-go/](https://lemire.me/blog/2016/12/21/performance-overhead-when-calling-assembly-from-go/)

[2]
[https://github.com/golang/go/issues/17373](https://github.com/golang/go/issues/17373)

~~~
wolfgang42
Thanks for the link to the issue. I was wondering why the post was talking
about function call overhead instead of being able to inline it.

------
Waterluvian
I dabbled a little in gaming with Unity. I discovered that atan2 is called a
LOT. At least... I used it a lot.

Shouldn't something like this be implemented as a hardware level operation? Or
is it? Or why can't it be? Or maybe it's not used nearly as much as I think?

~~~
hhmc
I think in the case that you're willing to trade some accuracy for speed,
you'd just use a table lookup instead.

~~~
abainbridge
Not necessarily. That's OK if you can guarantee that the table is in cache.
But if it isn't, that's probably the slowest way.

DRAM read takes ~210 cycles (assuming 60ns access time and 3.5 GHz CPU clock
rate).

FPATAN takes ~50 to 200 cycles, depending on the operand value and CPU model
(see
[http://www.agner.org/optimize/instruction_tables.pdf](http://www.agner.org/optimize/instruction_tables.pdf))

I suspect a polynomial approximation takes less time than either of these,
since that's what most C standard library implementations use on the PC (I
just checked Visual Studio 2017).

~~~
Tomis02
> But if it isn't, that's probably the slowest way.

It's probably the slowest way the first time, but if you call atan2 in a loop,
won't it be much faster?

~~~
abainbridge
It isn't just "the first time" that is slow for the LUT solution.

Each time the table access hits a new cache line, there will be a stall. Say
your LUT has 2048 64-bit entries, that's 256 cache lines. So the total stall
time waiting for cache lines to be filled could be as bad as 210 cycles * 256
= 53760 cycles. In that amount of time, the std library function could have
returned about 1000 results (assuming the std library takes 50 cycles).

And, as well as waiting for the table reads, the LUT based function has other
work to do. If we are comparing performance to the std library function, it
handles input from the entire range of IEEE 64-bit floats. It isn't practical
to do that directly with a lookup table. You would need to clamp the range
down to something manageable. Once that's done, if you want anything like
decent accuracy you need to read two values from the table and do a linear
interpolation between them. Or you could have a HUGE table instead of the lerp
but it would have to be far too big to fit in L1 cache, and thus also quite
slow.
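
To make that concrete, a rough Go sketch of the clamp-and-lerp approach for
plain atan (the table size and clamp range here are arbitrary, not tuned):

    // assumes: import "math"
    
    const (
        tableSize = 2048
        xMax      = 8.0 // inputs clamped to [-xMax, xMax]
        step      = 2 * xMax / (tableSize - 1)
    )
    
    var table [tableSize]float64
    
    func init() {
        for i := range table {
            table[i] = math.Atan(-xMax + float64(i)*step)
        }
    }
    
    // atanLUT: one clamp, two loads, one lerp.
    func atanLUT(x float64) float64 {
        if x <= -xMax {
            return table[0]
        }
        if x >= xMax {
            return table[tableSize-1]
        }
        f := (x + xMax) / step
        i := int(f)
        t := f - float64(i)
        return table[i] + t*(table[i+1]-table[i])
    }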

My guess is a clamp and lerp'd lookup table will take ~10 cycles once the
table is entirely in cache. So, I guess I'm claiming that the LUT would then
be 5x faster than the std library.

I don't have a lot of confidence in that prediction though. There are lots of
other things to worry about if we were to do this analysis properly,
including: a) how the input pattern alters the effectiveness of the branch
predictor, b) how exactly you are finding some input data to operate on
without polluting the cache, c) whether we are measuring latency or
throughput.

------
cpr
Wow, this brings back vivid memories of late 80's, when I was assigned to
write atan2 by hand on Multiflow's 28/14/7-wide VLIW machines.

(Everyone in the company who could hand-code (even us OS guys and gals ;-) was
assigned a math routine--it was an all-hands-on-deck kind of performance
situation in our competition with Alliant and Convex.)

Imagine trying to hand-schedule up to 28 operations at once, including all
memory loads with 12 or 14 (I forget now) clocks to memory; all bus
contention had to be resolved by hand, etc, etc. What fun! That's why VLIW
compilers were so challenging to write: all resources were software-managed.

But, alas, in the end the mini-supercomputer market died. We all lost to Sun
eventually, as everyone wanted their own machine, even if much less powerful--
never bet against having the computer power closer to the user (starting with
mainframes): minicomputer / workstation / PC / mobile / etc, right down the
line to today.

(What's going to displace mobile, IoT? Doesn't seem likely. Perhaps when you
reach the limit of what you can hold in your hand, that's the last form
factor? I don't believe wearable computing will ever really catch on, but I'm
old fustian.)

~~~
orf
> I don't believe wearable computing will ever really catch on, but I'm old
> fustian.)

I used to think the same, but it seems to me that phones are the first step. A
huge number of people have a mini computer on their persons all day. It's only
a matter of time before it becomes even more integrated.

Will be interesting to see, but I doubt the current form factors will win.
Smart watches seem so... not it.

~~~
kbenson
And also, with wearables it's not so much bringing _a_ device closer, but a
_personal network of devices_. The phone may still have its place as the
largest personal-network computing device, but peripherals with their own small
compute power and additional I/O get added to the network. Smart watches
(biometric inputs, small always ready screen output), glasses (AR overlay
output, line-of-sight and image recognition input), etc.

------
skykooler
I always find it interesting to read stories of experiments that didn't end up
working.

~~~
aneutron
IMO they are more important, because the OP learns more, and you acquire part
of the experience without investing as much time.

------
gigatexal
Dang. Props to the blogger for posting even though success wasn’t had. Hours
on end doing assembly is a feat unto itself. I did not see the twist at the
end, I was thinking for sure there’d be a huge speed up. But knowledge is
knowledge.

~~~
agnivade
Haha. I know. My heart was pounding when I started the final benchmark...
hoping against hope that there would be some gains.

But yea, knowledge is knowledge.

------
b4ux1t3
I would love to see more posts like this. Seeing your thought process when
approaching a problem, even if you don't end up solving the problem, or don't
do it well, is valuable insight into how other people work, and how I can
possibly integrate some of your methods into my own workflow.

Thanks for posting this.

~~~
agnivade
You're welcome ! Glad it was helpful. :)

------
MichaelBurge
I'm surprised it doesn't eliminate the BenchmarkAtan2 entirely:

* math.Atan2 is in the standard library, so it should know it has no side effects and can be replaced with an empty statement. Unless Go has a way to mark a function as not having side effects, no user-defined function would have that property.

* The only observable impact of the loop if math.Atan2 is removed is that b.N could be a larger size type than n, so it could loop forever depending on the values. If they're known to be the same type, it could eliminate the loop entirely.

Are you sure you're compiling with optimizations enabled?

I also wonder if that DIVSD instruction in yours might cause a floating point
exception, which would be a different behavior compared to atan2.

~~~
agnivade
Author here.

> I'm surprised it doesn't eliminate the BenchmarkAtan2 entirely:

I'm sorry, could you clarify what you mean by "eliminate" here. Are you saying
I shouldn't benchmark atan2 but something else ?

> Are you sure you're compiling with optimizations enabled?

I just did - `go test -bench=. -benchmem -cpu=1`. Did I miss something ?

> I also wonder if that DIVSD instruction in yours might cause a floating
> point exception, which would be a different behavior compared to atan2.

I checked with the assembly output of math.Atan2. It also uses DIVSD.

~~~
munificent
> I'm sorry, could you clarify what you mean by "eliminate" here.

Most compilers will eliminate code if they can tell the results of a
calculation are never used. Benchmarks typically call some function and
discard the result. If you aren't careful, the compiler can often optimize
away your entire benchmark down to nothing.

This is why you often see benchmarks that look like:

    #include <stdio.h>
    
    int average(int a, int b) {
      return (a + b) / 2;
    }
    
    int main(void) {
      long long sum = 0;
      for (int i = 0; i < 100000; i++) {
        for (int j = 0; j < 100000; j++) {
          sum += average(i, j);
        }
      }
    
      if (sum != 123) printf("!");
      return 0;
    }

Instead of:

    int average(int a, int b) {
      return (a + b) / 2;
    }
    
    int main(void) {
      for (int i = 0; i < 100000; i++) {
        for (int j = 0; j < 100000; j++) {
          average(i, j);
        }
      }
      return 0;
    }

~~~
agnivade
Ah I see ! Gotcha :)

Yes, it seems Go is smart enough not to eliminate those values entirely.

I think this might be a better way to be sure that the function will be
computed entirely.

`_ = myatan2(-479, 123)`

The result is the same though. I will update the post.

~~~
aleksi
Even better is to make a global var sink interface{}, and assign to it.
[https://github.com/golang/go/issues/14813](https://github.com/golang/go/issues/14813)
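
i.e. something like this in the _test.go file (a sketch, using the myatan2
from the post):

    var sink interface{}
    
    func BenchmarkMyAtan2(b *testing.B) {
        var r float64
        for n := 0; n < b.N; n++ {
            r = myatan2(-479, 123)
        }
        // assigning to a global keeps the compiler from proving
        // the results are dead
        sink = r
    }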

------
kristianp
I'm no optimisation expert, but I'm wondering if the FMAs are slow because the
result of each one is dependent on the previous one? The dependency on the
result may mean that the processor can't pipeline the operations. Could it be
faster if the two chains of FMAs on either side of the division are
interleaved and use different registers?

    z := x * x
    
    z = z * fma(fma(fma(fma(P0, z, P1), z, P2), z, P3), z, P4) /
            fma(fma(fma(fma((z+Q0), z, Q1), z, Q2), z, Q3), z, Q4)
    
    z = fma(x, z, x)

This article is quite the nerd-snipe.

~~~
DannyBee
It will definitely be faster on the newest Intel processors, which have
dedicated FMA units. In fact, to max out floating point on them, you'd
likely have to intermix FMA into the normal FP stream (i.e. FMA by 1, or FMA
of something plus 0).

------
xioxox
In C/C++ it would be much easier to call the intrinsic functions which
represent the fma instructions. The compiler would then be able to inline
and optimize the whole thing.

~~~
stephencanon
In C or C++ you should be able to use the stdlib `fma( )` function and have
the compiler lower the call to the instruction, if it isn't completely out to
lunch. No intrinsics should be necessary.

Alternatively, set `#pragma STDC FP_CONTRACT ON` and just write `x*y + z`.

~~~
xioxox
I didn't know they added that. The intrinsic could be useful to parallelize
the computation using SSE/AVX.

------
agnivade
Author here. Glad to answer any questions you guys have :)

------
SolarNet
I think the primary thing you are missing to get the speed improvement is
actually doing the packed computations (e.g. computing two values at once,
such as the numerator and denominator).

