
Making the obvious code fast (2016) - GordonS
https://jackmott.github.io/programming/2016/07/22/making-obvious-fast.html
======
barrkel
> _What I would like to see is more of an attitude change among high level
> language designers and their communities._

The reason a language like C makes obvious code fast is because C has very
little expressive power. The lack of power strongly encourages the programmer
to think imperatively and code at a very low level. Apply this to an extremely
trivial problem (mere addition - does it get any simpler?), and of course you
end up with fast code.

The lesson doesn't generalize. The lack of expressive power means higher level
abstractions are clumsy to write, refactoring is more expensive, and
recomposing your program with e.g. a caching layer is much harder.

See
[https://blog.codinghorror.com/on-managed-code-performance-again/](https://blog.codinghorror.com/on-managed-code-performance-again/)
for example (alas, it appears some of the original source blogs are gone) -
Raymond Chen wrote a dictionary in C++, Rico Mariani implemented it in C# and
got great performance with the obvious solution.

~~~
tedunangst
But isn't the point that sufficiently smart compilers are supposed to compile
to faster than C?

~~~
barrkel
Are you being facetious? :) "Sufficiently smart compiler" -
[http://wiki.c2.com/?SufficientlySmartCompiler](http://wiki.c2.com/?SufficientlySmartCompiler)

Don't get me wrong, I like C - especially on small problems. I don't get the
feeling that I'm creating something flexible with C, though; it's very much
bespoke and tailored, and I get drawn into micromanaging fine details. That's
OK for a small problem, but extremely aggravating for a big one. As soon as I
want higher level abstractions, and start building structs of function
pointers and whatnot, I just shake my head - I'd rather write a parser and
corresponding text generator than manually compile that into C.

C++ has been hobbled by the concept of a sufficiently smart compiler. It's so
afraid of introducing abstractions that aren't cost-free, it tries to make
them all optional, and the resulting explosion in feature combination means
some combinations don't work well together, and then every big C++ project has
to decide which set of features they'll adopt, and which ones they won't.

Give me the freedom to think closer to the problem, and even if the
abstractions have some costs, and even if the compilers aren't quite
sufficiently smart, the chances are I'll come up with algorithm improvements
that meet speed and space requirements and let me move on to the next problem
much sooner. Agility trumps hyper-optimization.

~~~
jstimpfle
According to some people, the problem is the data. C is a nice language for
dealing with plain data.

Just today I rewrote my own streams abstraction, because libc sucks and I
want to support my own printf-style functions and my own custom streams (for
example, streams backed by dynamically allocated memory; fopencookie() is not
portable). I think that streams are one of the most abstract and most useful
abstractions. It went pretty well. Making a few vtables explicitly in a few
places is not _that_ terrible.

------
gameswithgo
Since I wrote this, Rust did stabilize SIMD intrinsics, and using crates like
[https://github.com/AdamNiederer/faster](https://github.com/AdamNiederer/faster)
or [https://github.com/jackmott/simdeez](https://github.com/jackmott/simdeez)
you can write pretty idiomatic code to leverage them too.

Also I believe the browsers have sped up some of the Javascript cases shown
there substantially.

~~~
sambe
Does a more modern Rust/LLVM not auto-vectorize this?

~~~
timerol
Yes. Rust 1.34.0 (and presumably earlier as well) generates an unrolled SIMD
loop for the iterator option:
[https://rust.godbolt.org/z/c6uZCv](https://rust.godbolt.org/z/c6uZCv)

~~~
gameswithgo
It will with integers, but not floating point. All compilers will use SIMD
registers and instructions for floating point, but only one lane is used
there; it's not really vectorized.

Rust has no way to pass the ffast-math flag (unless that changed recently),
which enables this for C/C++.

~~~
steveklabnik
The Rust approach to this kind of thing is not to use global flags to control
the behavior, but to use wrapping types to get the behavior in exactly the
spots where you want it:
[https://crates.io/crates/fast-floats](https://crates.io/crates/fast-floats)

------
Gajurgensen
> Rust achieves impressive numbers with the most obvious approach. This is
> super cool. I feel that this behavior should be the goal for any language
> offering these kinds of higher order functions as part of the language or
> core library.

Certainly it should be _a_ goal, but I don't think it should be _the_ goal.
Higher levels of abstraction aren't typically motivated by performance. The
priority is ease of expression and reasoning about code. Of course both are
desirable, but when the two are in conflict and a trade off is necessary,
designers will lean toward expressive abstractions, not performant
abstractions. After all, performance is already available without the
abstractions.

But I do agree it is super cool how Rust finds a sweet spot between
performance and expressiveness. It is one of my favorite languages for that
reason. I just don't think it should be a universal goal.

------
stygiansonic
As for the section on Java, Java can indeed auto-vectorize certain loops, but
the loops have to be "simple". Something like this would work:

    
    
        for (int i = 0; i < x.length; i++) {
            z[i] = x[i] * y[i];
        }
    

(Above code copied from:
[http://prestodb.rocks/code/simd/](http://prestodb.rocks/code/simd/) )

But the reason it doesn't work for the article's code is because of the
accumulator variable. That is not supported yet. See:
[https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7192383](https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7192383)

However, the difference between using sum() and reduce() is interesting. I'll
eventually have to run JMH to verify whether this is still the case and
whether it matters in our codebase.

------
twic
> Java 8 includes a very nice library called stream which provides higher
> order functions over collections in a lazy evaluated manner, similar to the
> F# Nessos streams library and Rust. Given that this is a lazy evaluated
> system, it is odd that there is such a performance difference between map
> then sum and a single reduction. The reduce function is compiling down to
> the equivalent of SSE vectorized C, but the map then sum is not even close.
> It turns out that the sum() method on DoubleStream "may be implemented using
> compensated summation or other technique to reduce the error bound in the
> numerical sum compared to a simple summation of double values." A nice
> feature, but not clearly communicated by the method name!

As I said recently [1], the Java way is to sacrifice a little performance to
allow users to do silly things and still get the right answer.

[1]
[https://news.ycombinator.com/item?id=19644366](https://news.ycombinator.com/item?id=19644366)

------
jerf
You can get to SIMD in Go via assembly support:
[https://goroutines.com/asm](https://goroutines.com/asm) I would give a pretty
decent guess it'd also score 17 ms since it'll be the same inner loop as
everything else.

How you score that in terms of "support" I freely leave up to the reader. Go
assembly does integrate into Go more than raw assembler would, since it
handles some of the runtime concerns, so the "support" is more than just
"yeah, we can write some stuff in assembler and link it in, I can do that
anywhere" - but on the other hand, it is still a form of assembler, not
something that looks like a function call in the native language.

~~~
chrchang523
Yes, it's clumsier than intrinsics, but ever since Go 1.10 added mnemonic
support up to AVX2, I've found it usable enough. (And it's easier to
efficiently exploit multiple processor cores in Go than most other
languages...)

------
bufferoverflow
I tried JS in Chrome, and the best I got was 78ms on an i7-8750H:

    
    
        <script>
        let arr = new Float64Array(32000000);
        let sum = new Float64Array(1);
        sum[0] = 0;
        let l = arr.length, i, el;
    // Initialize with random values
        for (i=0; i<l; i++) arr[i] = Math.random();
    
        // Sum
        let st = performance.now();
        for (i=0; i<l; i++) sum[0] += arr[i] * arr[i];
        let en = performance.now();
        console.log(en - st, sum[0]);
        </script>
    

Using a temporary value for arr[i] is definitely slower.

------
apfel912
Would be interesting to see how later versions of Node perform compared to v6
in the tests.

Node 8 especially, with turbofan.

------
carapace
FWIW, working in the Joy language you could write this:

    
    
        [dup *] map sum
    

And given the definition of _sum_ (the Joy combinator _step_ is like _fold_ ),

    
    
        sum == 0 swap [+] step
    

There could be a (pre-)compilation transform that generated:

    
    
        0 swap [dup * +] step
    

From which the proverbial "sufficiently smart compiler" would generate SIMD
instructions. (I'm working on that now, in Prolog. At this point the type-
inferencer can already tell that this function expects a list of numbers and
returns a single number - it's a "catamorphism"[1]. But I'm nowhere near
specializing the output code to SIMD. My target architecture is Prof. Wirth's
RISC machine for Project Oberon. However, there's a body of research on
Prolog compilation that includes work on retargeting machine code generator
generators, so there's that.)

[1]
[http://joypy.osdn.io/notebooks/Recursion_Combinators.html](http://joypy.osdn.io/notebooks/Recursion_Combinators.html)

------
tomohawk
GC overhead almost always becomes an issue in performance- or
latency-sensitive code. Raw timing benchmarks such as these do not capture
that. It's mentioned in the article, but it's worth emphasizing that the
times on some of the higher order approaches may greatly understate the
actual overhead of the approach.

~~~
GordonS
That depends on whether any actual allocations take place - looking at the C#
code samples in the article, the only ones that would cause allocations would
be the ones that used Linq. And Linq is kind of _verboten_ in performance
sensitive code; I assume the author used Linq in a few samples precisely
_because_ they would cause allocations to demonstrate the perf disadvantage.

With the advent of Span, stackalloc and the like, there has been something of
a "war on allocations" in .NET recently - it's great to see some extra focus
on performance!

------
hyperpape
It would be a nice update to redo the Java examples with both Hotspot and
Graal, as many (but not all) of the cases where Graal outperforms Hotspot are
in how it optimizes higher order functions. I don't have an intuition whether
it would matter for these particular examples, but I'd like to see.

------
wilgertvelinga
I created a jsbench of the JavaScript tests:
[https://jsbench.me/80jule61z5/1](https://jsbench.me/80jule61z5/1)

------
emsy
Tangentially related to Jon Blow, who is mentioned in the post, is the
Handmade Network, which advocates better programming by sticking closer to
the hardware. While I think this is a worthwhile and noble effort, I think
what this post really shows is that we need better tools to help developers
do the right thing more effortlessly.

~~~
jstimpfle
What is "the right thing", though? When requirements get more and more
complex, "better tools" stop working. Their built-in mechanisms are not the
answer to everything. Beyond some complexity threshold, you can't get by with
generic solutions; you need to tailor solutions to the problem.

~~~
emsy
There is no hidden depth in "the right thing": perform the intended action
with as little time/memory complexity as possible. I disagree completely with
what you're saying. If anything, the examples in the post show we still have
a lot of opportunities to build better tools at the micro level. We haven't
even tried to build better tools for complex systems yet.

~~~
jstimpfle
The "micro level" is almost completely irrelevant. Performance comes from
doing the right thing globally, and it will be a long time before tools can
help there.

I conjecture that apart from a few artificial benchmarks, the choice of the
different approaches here is totally irrelevant. Just don't use the 10000ms
approach, and you won't notice a difference in 99% of the real world
applications.

------
apocalypses
Swift can also auto-vectorise these simple loops, but only with the right
combination of compiler flags.

[https://swift.godbolt.org/z/JmtOMx](https://swift.godbolt.org/z/JmtOMx)

------
pjmlp
Intel has contributed SIMD support to HotSpot, and it has been available
since Java 10, including AVX.

Likewise Google, Azul and IBM have SIMD support in their JIT/AOT compilers.

