
Surprises in GopherJS Performance - eatonphil
http://www.gopherjs.org/blog/2015/09/28/surprises-in-gopherjs-performance/
======
gizmo
These micro programs are very easy to JIT, so nearly identical performance is
to be expected. It's when you get larger programs that C compilers with
function inlining and better cache locality and whole program optimization
leave JIT compiled languages in the dust.

When you have a loop of a billion iterations the JIT compiler can instantly
tell that it's worth optimizing whatever is in the loop body. When you have a
more complex program the JIT does not know which parts to optimize. Any time
spent on this kind of "what should I optimize" meta-analysis has to be earned
back by making future code run faster, so you run into diminishing returns
pretty quickly.

~~~
hinkley
When the users complain that the app is 'too slow' and the devs say "we've
done everything we can", I'm usually the guy who goes and finds another 30%
without doing anything crazy.

Two things I see pretty consistently. First is that when there are no tall
tent poles (20 functions take 5% of the time each), people don't know what to
do, so they don't do anything. Second, and possibly easier to fix, is that
people believe the perf analysis tool (the breakdown of where time is spent)
is telling them the objective truth. Often it's wrong, which is why we try
things, benchmark them, and revert changes if things get worse.

When there are no tall tent poles I switch to invocation count, which is the
best secondary indicator of hotspots. There was one method that the perf tool
told me was taking 10% of the run time. But the call count was fishy. Due to a
bad call structure it was being called far more often than necessary. In the
worst spot in the code, two back-to-back calls each invoked this function, so
I flipped the code around so they could take the already-computed answer as an
argument (memoization).

I reran the benchmarks. I had removed 50% of calls to a function that took 10%
of our time, and the code overall was now 20% (twenty percent!) faster. Why?

Functions allocate memory. They evict cache lines in the data and instruction
caches. They might even access constrained resources, like disk. And as you
said, they change how the JIT decides to optimize things.

Sometimes the symptom is that the code that runs immediately afterward gets
blamed for problems it didn't create, and the profiler has no way of following
the problem back to the root cause, so it assigns blame at the point of
contention, not at the start of the contention.

The only tools I've found that help with this are clean coding practices and
checking whether your invocation counts match your expectations (I will run a
call tree 100 times and then compare the call counts of everything to find
things that were called at least twice as often as they strictly should have
been called).

~~~
johnmaguire2013
> I reran the benchmarks. I had removed 50% of calls to a function that took
> 10% of our time, and the code overall was now 20% (twenty percent!) faster.

This reminds me of another comment I read on HN, one that I consider something
of a "performance paradox":
[https://news.ycombinator.com/item?id=9895531](https://news.ycombinator.com/item?id=9895531)

~~~
hinkley
In this case I only made one change and remeasured. Runtime dropped more than
the cumulative call time of that function.

But the thread you reference is a whole other kettle of fish, and I would call
that observation #3 about failing at math.

Your boss says the app needs to run 3x as fast. Not "try to make it 3x faster"
but "the customer isn't going to buy unless it's 3x faster because
competitors". With targets like that anything taking more than 3% of run time
is a target for improvement, because they are taking 10% of the goal run time.

People will adamantly refuse to look at the 4th slowest function until they've
done something brilliant with the others, even if it's the easiest to fix.
That function is only taking 10% of the time, they'll say, so it's not
important. But it's taking 30% of the goal run time, and that's huge.

~~~
steveklabnik
Sometimes, I think Amdahl's Law is the most useful thing I learned about
during my CS degree.

------
pauljz
In the final example where he switches to int32 for Go (after "Let's make it
use int32 consistently and try again"), he runs gopherjs twice instead of
running go and gopherjs. I hope that's a typo.

The reasoning seems correct, but it'd be nice to have the benchmarks corrected
to verify.

~~~
shurcooL
It is indeed a typo, sorry about that!

I've fixed it [1] after confirming [2] (yet again just now; I've run those
numbers many, many times, so I'm very confident it's not a one-time fluke).

Thanks for catching it. You're also welcome to try to confirm the results
yourself, they should be reproducible!

[1]
[https://github.com/gopherjs/gopherjs.github.io/commit/acea7a...](https://github.com/gopherjs/gopherjs.github.io/commit/acea7a862c161796697aed812d9142945d09d1e8#commitcomment-13497916)

[2]
[https://gist.github.com/shurcooL/02183a57c51b28eaadf4](https://gist.github.com/shurcooL/02183a57c51b28eaadf4)

------
dpayne
Great article on the complexities of investigating performance issues!
Another very interesting surprise: when I run this locally on my MacBook,
clang is much faster:

    
    
      mac >> go run main.go
      approximating pi with 1000000000 iterations.
      3.1415926545880506
      total time taken is: 9.706911232s
    
      mac >> clang++ -O3 -ffast-math -march=native main.cpp
      mac >> ./a.out
      3.1415926545864963
      total time taken is: 2.14196s

~~~
hinkley
The speed of the answer doesn't matter so much when the answer is wrong. And
(after checking Google) BOTH of those are wrong. :/

~~~
e12e
Clearly the two results are different, so something strange is going on. But
you need to define wrong, consider incrementally approximating pi in 7 steps:

    
    
      100,50,25,12,4,2,3,3.5
      > 3.5
    

Or even:

    
    
      100,90,80,70,60,50,40
      > 40
    

I'm assuming you mean wrong as in: the algorithm, implemented with arbitrary
precision, would end up with a different set of digits for the number of
iterations given? Or something along those lines?

~~~
hinkley
Oh, I mean they're both wrong in the 10th position even though they agree. But
it is curious that they diverge after that.

It's probably more a problem of the algorithm not calculating significant
figures and truncating the answer.

~~~
kmill
The algorithm is computing an alternating series, with the largest n being
5x10^7. The error in an alternating series (when the sequence is strictly
decreasing in absolute value) is the absolute value of the next term. This is
4/(2*(5x10^7 + 1) + 1), which is a bit less than 4x10^-8, so we would only
expect 7 digits to be correct. In fact, it is the 8th digit which goes awry
(not the 10th): it should be a 5 rather than a 7.

This particular series, which comes from a series for the arctan of 1, has
very slow convergence. An easily better one is Machin's formula, which
converges faster than a geometric series (that is, the number of terms needed
is at most linear in the number of wanted digits).

------
Touche
These are the types of articles that keep me coming back to HN. Spectacular
analysis.

------
kenOfYugen
JS win and Go win. GopherJS was the reason I was considering making the
transition to full-stack Go.

------
jzelinskie
What's the JS interop story with GopherJS? Is it possible to import native JS
libraries and use them?

~~~
nulltype
GopherJS compiles to "readable" JavaScript, so yes. You have to do something
like js.Global.Call("functionname", arg1)

~~~
fizzbatter
Sort of readable, imo. I mean, a few simple function calls are nicely
readable, but blocking calls quickly turn into an unreadable mess in my
experience.

------
jtblin
What I find most surprising here is how optimised the V8 engine is.
JavaScript code running on V8 is as performant as native Go code or as
optimised C code. Obviously it's just one use case, but still, mind-blowing.

------
mtdewcmu
>> low-level C implementation compiled with -O3, the max optimization setting.

IME, -O2 always comes out faster than -O3. In this case, though, the
difference is fairly insignificant.

~~~
KMag
Note that -Os and -Oz often give you faster code than -O2. Also -O3 doesn't
turn on all optimizations. (Most notably, -O3 won't omit the frame pointer for
leaf calls. Yes, omitting the frame pointer can make debugging harder.)

------
fizzbatter
I want to see examples like this for Go Channels. Benchmarks, analysis, etc.

GopherJS is really cool, i just want to see the real world Go questions asked
and answered.

------
runnr_az
I always expect these stories to be about a JS client for Gopher... and I
guess I'm always a bit disappointed.

~~~
mseepgood
What excites you so much about the Gopher protocol? It's very boring (less
interesting than e.g. FTP) and scarcely in use. If you're looking for a JS
client for the Gopher protocol use OverbiteFF or something.

