Hacker News
Making the obvious code fast (2016) (jackmott.github.io)
109 points by GordonS on April 17, 2019 | 39 comments

> What I would like to see is more of an attitude change among high level language designers and their communities.

The reason a language like C makes obvious code fast is because C has very little expressive power. The lack of power strongly encourages the programmer to think imperatively and code at a very low level. Apply this to an extremely trivial problem (mere addition - does it get any simpler?), and of course you end up with fast code.

The lesson doesn't generalize. The lack of expressive power means higher level abstractions are clumsy to write, refactoring is more expensive, and recomposing your program with e.g. a caching layer is much harder.

See https://blog.codinghorror.com/on-managed-code-performance-ag... for example (alas, it appears some of the original source blogs are gone) - Raymond Chen wrote a dictionary in C++, Rico Mariani implemented it in C# and got great performance with the obvious solution.

Rust managed to make some of these high level abstractions compile to fast machine code almost every time. It's quite refreshing to go from Linq/Streams etc. to Rust iterators.

So do C++ iterators, for that matter, but they are not so pleasant to actually write.

It's true. Rust's approach to passing lambdas to iterators usually gives the compiler enough information to inline them. Some of my exercise programs when learning Rust exceeded a fairly straightforward C implementation too, possibly due to stronger guarantees around aliasing.

>See https://blog.codinghorror.com/on-managed-code-performance-ag.... for example (alas, it appears some of the original source blogs are gone)

They are there... it's just the recent migration at MSDN blogs broke a lot of links. Basically it looks like Rico doesn't blog anymore (no longer at Microsoft) so his blog is archived and links to it work, but Raymond Chen's blog is active and was moved en masse to new links.

The links from Rico Mariani's blog (https://blogs.msdn.microsoft.com/ricom/2005/05/10/performanc...) are here:

Raymond Chen's posts:

But isn't the point that sufficiently smart compilers are supposed to compile to faster than C?

Are you being facetious? :) "Sufficiently smart compiler" - http://wiki.c2.com/?SufficientlySmartCompiler

Don't get me wrong, I like C - especially on small problems. I don't get the feeling that I'm creating something flexible with C, though, it's very much bespoke and tailored. I get involved in micromanagement of fine details. This is OK for a small problem, but extremely aggravating for a big problem. As soon as I want higher level abstractions, and start building structs of function pointers and whatnot, I just shake my head - I'd rather write a parser and corresponding text generator than manually compile that into C.

C++ has been hobbled by the concept of a sufficiently smart compiler. It's so afraid of introducing abstractions that aren't cost-free that it tries to make them all optional, and the resulting explosion in feature combinations means some combinations don't work well together; then every big C++ project has to decide which set of features it will adopt, and which it won't.

Give me the freedom to think closer to the problem, and even if the abstractions have some costs, and even if the compilers aren't quite sufficiently smart, the chances are I'll come up with algorithm improvements that meet speed and space requirements and let me move on to the next problem much sooner. Agility trumps hyper-optimization.

According to some people, the problem is the data. C is a nice language for dealing with plain data.

Just today I rewrote my own streams abstraction, because libc sucks and I want to support my own printf-style functions and my own custom streams (for example, dynamically memory-allocating streams; fopencookie() is not portable). I think that streams are one of the most abstract and most useful abstractions. It went pretty well. Making a few vtables explicitly in a few places is not that terrible.

That has been the dream for a long time. When Java introduced a JIT (HotSpot?), a lot of people claimed that Java was faster than C. Lately people are telling me that JavaScript "can" run faster than C. Usually they present some well-written code in their favorite language and then compare it to horribly written C. Usually I tell them that most likely the runtime of their language is written in C, so you can always do as well in C.

> Usually I tell them that most likely the runtime of their language is written in C so you can always do as well in C.

The point is that with a JIT the runtime is executing as little as possible, with the rest being dynamically optimized machine code. So this isn’t really a fair thing to say…

I've never seen any JIT do better in practice than hand-tuned AOT-compiled code.

I've seen lambdas and virtuals inlined dynamically. AOT would need to be statically typed and instantiated all the way to do the same thing.

I could create a sample program that would be hard to make as fast in C or C++, with the constraint that the code needs to be dynamically loaded at runtime.

In practice, it's easier to modify a more dynamic program and be sure you're not breaking much. Highly tuned C tends to bake in assumptions about memory lifetime, layouts, call graph, etc. The flexibility to modify is what gives you the ability to target higher level improvements in a shorter amount of time.

Given any concrete implementation, after it's been designed in a more dynamic language, you can convert it to something fast in something like C. But starting from C is probably not a great idea. And not everything can be written multiple times economically.

E.g. the .net GC was prototyped in Lisp, and initially converted to C++ with a translator.

I have. Try to make an AOT compiler remove dynamic dispatch across shared libraries and convert it into direct calls, or even inline them.

It's rare, but it can happen occasionally. A just-in-time compiler has access to tracing and profiling information that an ahead-of-time compiler may not have, which means it can do things like inline a virtual method call that is actually monomorphic at runtime.

That's the keyword here.

I was responding to a comment that had the word "never" in it.

Since I wrote this, Rust did stabilize SIMD intrinsics, and using crates like https://github.com/AdamNiederer/faster or https://github.com/jackmott/simdeez you can write pretty idiomatic code to leverage them too.

Also I believe the browsers have sped up some of the Javascript cases shown there substantially.

Does a more modern Rust/LLVM not auto-vectorize this?

Yes. Rust 1.34.0 (and presumably earlier as well) generates an unrolled SIMD-loop for the iterator option: https://rust.godbolt.org/z/c6uZCv

It will with integers, but not with floating point. All compilers will use SIMD registers and instructions for floating point, but only one lane is used there; it's not really vectorized.

Rust has no way to pass the -ffast-math flag (unless that changed recently), which is what enables this for C/C++.

The Rust approach to this kind of thing is not to use global flags to control the behavior, but wrapper types that opt into it only in the spots where you want it: https://crates.io/crates/fast-floats

Unrelated: Haven't used compiler explorer in a while and just got the "new privacy policy" overlay.

What a delightfully clear, complete, and readable privacy policy.

> Rust achieves impressive numbers with the most obvious approach. This is super cool. I feel that this behavior should be the goal for any language offering these kinds of higher order functions as part of the language or core library.

Certainly it should be a goal, but I don't think it should be the goal. Higher levels of abstraction aren't typically motivated by performance. The priority is ease of expression and reasoning about code. Of course both are desirable, but when the two are in conflict and a trade off is necessary, designers will lean toward expressive abstractions, not performant abstractions. After all, performance is already available without the abstractions.

But I do agree it is super cool how Rust finds a sweet spot between performance and expressibility. It is one of my favorite languages for that reason. I just don't think it should be a universal goal.

As for the section on Java, Java can indeed auto-vectorize certain loops, but the loops have to be "simple". Something like this would work:

    for (int i = 0; i < x.length; i++) {
        z[i] = x[i] * y[i];
    }
(Above code copied from: http://prestodb.rocks/code/simd/ )

But the reason it doesn't work for the article's code is because of the accumulator variable. That is not supported yet. See: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=7192383

However, the difference between using sum() and reduce() is interesting. I will eventually have to run JMH to verify whether this is still the case and whether it matters in our codebase.

> Java 8 includes a very nice library called stream which provides higher order functions over collections in a lazy evaluated manner, similar to the F# Nessos streams library and Rust. Given that this is a lazy evaluated system, it is odd that there is such a performance difference between map then sum and a single reduction. The reduce function is compiling down to the equivalent of SSE vectorized C, but the map then sum is not even close. It turns out that the sum() method on DoubleStream "may be implemented using compensated summation or other technique to reduce the error bound in the numerical sum compared to a simple summation of double values." A nice feature, but not clearly communicated by the method name!

As I said recently [1], the Java way is to sacrifice a little performance to allow users to do silly things and still get the right answer.

[1] https://news.ycombinator.com/item?id=19644366

You can get to SIMD in Go via assembly support: https://goroutines.com/asm I would give a pretty decent guess it'd also score 17 ms since it'll be the same inner loop as everything else.

How you score that in terms of "support" I freely leave up to the reader. Go assembly does integrate into Go more than simply raw assembler would, as it handles some of the runtime concerns, so the "support" is more than just "yeah, we can write some stuff in assembler and link it in, I can do that anywhere", but on the other hand, it is a form of assembler, not something that looks like a function call in the native language or something.

Yes, it's clumsier than intrinsics, but ever since Go 1.10 added mnemonic support up to AVX2, I've found it usable enough. (And it's easier to efficiently exploit multiple processor cores in Go than most other languages...)

I tried JS in chrome, and the best I got was 78ms on i7-8750H

    let arr = new Float64Array(32000000);
    let sum = new Float64Array(1);
    sum[0] = 0;
    let l = arr.length, i;
    // Initialize with random values
    for (i=0; i<l; i++) arr[i] = Math.random();

    // Sum
    let st = performance.now();
    for (i=0; i<l; i++) sum[0] += arr[i] * arr[i];
    let en = performance.now();
    console.log(en - st, sum[0]);
Using a temporary value for arr[i] is definitely slower.

Would be interesting to see how later versions of Node perform compared to v6 in the tests.

Node 8 especially, with turbofan.

FWIW, working in the Joy language you could write this:

    [dup *] map sum
And given the definition of sum (the Joy combinator step is like fold),

    sum == 0 swap [+] step
There could be a (pre-)compilation transform that generated:

    0 swap [dup * +] step
From which the proverbial "sufficiently smart compiler" would generate SIMD instructions. (I'm working on that now, in Prolog. At this point the type-inferencer can already tell that this function expects a list of numbers and returns a single number; it's a "catamorphism"[1]. But I'm nowhere near specializing the output code to SIMD. My target architecture is Prof. Wirth's RISC machine for Project Oberon. However, there's a body of research on Prolog compilation that includes work on retargetable machine code generator generators, so there's that.)

[1] http://joypy.osdn.io/notebooks/Recursion_Combinators.html

GC overhead almost always becomes an issue in performance- or latency-sensitive code. Raw timing benchmarks such as these do not capture that. It's mentioned in the article, but it's worth emphasizing that the times on some of the higher order approaches may greatly understate the actual overhead of the approach.

That depends on whether any actual allocations take place - looking at the C# code samples in the article, the only ones that would cause allocations would be the ones that used Linq. And Linq is kind of verboten in performance sensitive code; I assume the author used Linq in a few samples precisely because they would cause allocations to demonstrate the perf disadvantage.

With the advent of Span, stackalloc and the like, there has been something of a "war on allocations" in .NET recently - it's great to see some extra focus on performance!

It would be a nice update to redo the Java examples with both Hotspot and Graal, as many (but not all) of the cases where Graal outperforms Hotspot are in how it optimizes higher order functions. I don't have an intuition whether it would matter for these particular examples, but I'd like to see.

I created a jsbench of the JavaScript tests: https://jsbench.me/80jule61z5/1

Tangential to Jon Blow, who is mentioned in the post, is the Handmade Network, which advocates for better programming by sticking closer to the hardware. While I think this is a worthwhile and noble effort, I think what this post really shows is that we need better tools to help developers do the right thing more effortlessly.

What is "the right thing", though? When requirements get more and more complex, "better tools" stop working. Their built-in mechanisms are not the answer to anything and everything. Beyond some complexity treshold, you can't do with generic solutions, but need to tailor solutions to the problem.

There is no hidden depth in "the right thing": perform the intended action with as little time/memory complexity as possible. I disagree completely with what you're saying. If anything, the examples in the post show we still have a lot of opportunities to build better tools at the micro level. We haven't even tried to build better tools for complex systems yet.

The "micro level" is almost completely irrelevant. Performance comes from doing the right thing globally, and it will be a long time before tools can help there.

I conjecture that apart from a few artificial benchmarks, the choice of the different approaches here is totally irrelevant. Just don't use the 10000ms approach, and you won't notice a difference in 99% of the real world applications.

Swift can also auto-vectorise these simple loops, but only with the right combination of compiler flags.


Intel has contributed SIMD support to HotSpot, and it has been available since Java 10, including AVX.

Likewise Google, Azul and IBM have SIMD support on their JIT/AOT compilers.
