Hacker News new | past | comments | ask | show | jobs | submit login
The death of optimizing compilers [pdf] (cr.yp.to)
217 points by fcambus on Apr 17, 2015 | hide | past | web | favorite | 216 comments

I wish this was given in a better format instead of just slides, hopefully the talk is actually better. Here's what I think the talk is about:

The pervasive view of software performance is that compilers are better than humans at optimizing code, but the few humans who optimize important bits of code to the maximum extent disagree Similarly, computer programs today are increasingly diverging into a state where there is a tiny amount of extremely performance critical code, and a large amount of code where performance is so good on our hardware today that even horribly unoptimized code has no noticeable effect on performance.

Thus, optimizing compilers are useless on the first type of code (humans are better), and useless on the second (performance doesn't matter). So what good are they at all?

If optimizing compilers aren't useful, what system should we use instead for making performant code? The author and collaborators' experience suggests that the reasons a compiler can't optimize code as well as a human when it matters is that our current programming languages don't give the compiler enough information about the intent of the code to optimize it. Therefore we should design programming languages that on the surface look very unoptimized, but specify enough information that compilers can do a really good job. It sounds like no one knows what such a programming language would look like.

I recently wrote a simulation that involved some heavy numerical calculations. Without optimization it was dog slow and would have taken probably years to produce the results I needed. With compiler optimization it took a few days, which was good enough for me. Perhaps I could have improved on that with manual optimizations, but the speed was good enough so instead of spending time optimizing I did other things.

The implicit assumption in the talk (based on this summary) is that we always want the most optimized code. But in fact that is never the case -- what we really want is for the code to be fast enough (or be space-optimized enough). So the relevant question is this: in what percentage of cases do compiler optimizations take us from 'not fast enough' to 'fast enough'. In all these cases the compiler is useful, because it saves us the cost of doing our own optimizations. In my experience this percentage is far from negligible, so at least for me optimizing compilers are highly valuable.

I think there's an interesting distinction here between general-purpose software and single-use code. General-purpose software has many users with many kinds of data. Unless your domain is intrinsically limited, it's impossible to be "fast enough" -- you can always find some data that your code will take to long to process. There's always an incremental benefit to optimization, and with enough users, it's irresponsible to waste collective years of people's time by leaving performance on the table.

What you've said makes perfect sense for one-off simulations, however. And in practice, optimizers are designed to make tight numerical code run fast. I don't think your use case will ever be ignored. It's just that most systems today are not made of tight loops of arithmetic operations. We need more tools to optimize the kind of code that actually makes our computers slow: code that allocates and deallocates memory, code that transforms data between different formats, code that constructs and operates on complex data structures -- that sort of thing.

I don't entirely disagree, but I would like to make one point. Taking 1 millisecond from 1000 people, does not translate to 1 second of lost human productivity. There's no way to collect all those little slices of time and they are not appreciable individually. (Assuming those milliseconds are sufficiently spread out, i.e., not 1 ms out of every 2, or something like that)

> Taking 1 millisecond from 1000 people, does not translate to 1 second of lost human productivity

I believe that it does. If you average over a large population even tiny imperceptible increases in e.g. page load time have a measurable effect. Perhaps 1000 people isn't a large enough sample to measure the effect from 1ms delay considering all the confounding factors at play, but I absolutely believe you could see it with precise enough measurement and a large enough sample.

Reminds me of this semi-relevant story about improving the boot times of the original Macintosh — http://www.folklore.org/StoryView.py?story=Saving_Lives.txt

The irony being that Jobs himself limited the memory in the original Mac, so it would have to go to disk more often!


Its a fairly well known story. The engineers ignored him and built in capability for 512k anyway.


He also hated fans and expansion slots.

Hmm I think the heart of the question is whether productivity is linear with time. I guess he's right that, for small values of t, productivity is highly superlinear. So giving 1 person an extra minute has an impact on that person, but because of superlinearity, if you distribute that among 60 persons the extra second won't allow anything new, like an original thought or whatever.

You're thinking about it wrong. Of course if you give 60 people 1 second each they won't be able to accomplish a task that takes a minute; that's obvious. Instead what if you give a group of people 59 seconds each to complete a task that takes 1 minute +-10 seconds, and then you give a different group of people 60 seconds each to complete the same task. More people in the second group will complete the task.

Well in your case there's a "phase transition" for the task around 1 minute, so in that case the return function is sublinear in t (if you assume given time G starts at 0. G=t), or linear in it (if you assume it starts at 50, G=50+t). I just wanted to highlight this principle: his statement is equivalent to "For small t, the return for freeing t seconds from computer tasks in superlinear on t". And in general: "The analysis of time redistribution hinges on the linearity of return function for the freed time t".

I do believe that for most persons and most situations, they do tasks in a segmented fashion, working to completion. So if you give them a small amount of extra time it's just going to shift their segments until it hits a synchronized event, like sleeping at a predetermined time or having a meeting -- so all that happens is they get e.g. a millisecond each before they sleep. Under this model it's quite clear to me for very small t the return function is superlinear, and it's better to give 1 person 100 seconds than 100k persons a millisecond (assuming this is one-off of course): for the 100k persons the millisecond is insufficient to elicit any change whatsoever. It's much less than a neuron firing time.

However the carbon footprint of the code does reduce like that. It would be very wasteful in terms of energy not to optimize a widely used binary.

Beware: optimising for speed and optimising for energy don't yield the same results.

If you want speed, you typically want to use your CPU as much as possible. If you want to save energy, you'll want to underclock your CPU. If we go all the way to hardware design, saving energy means going massively multicore, with simple, slow, in-order cores.

You can optimise both, to some extent. But there is a point where you have to chose one or the other.

> If you want to save energy, you'll want to underclock your CPU.

This is incorrect. To save energy, the strategy is to do your thing as quickly as possible, with maximum CPU clocks and enter power saving mode as soon as possible. Otherwise the leakage current in the CPU will consume more energy than can be saved by underclocking.

It may be counterintuitive but fast code consumes the least energy.

I insist: speed and energy savings are correlated to a point.

To. A. Point. Some instructions in your CPU may be faster, but slower alternative often consume less total energy (even taking everything into account —there have been experiments and power measurements). Even if it means your CPU is up a bit longer. Then there are considerations that save energy, but don't matter for speed at all, such as the width of integer operations: no need to go 32 bits wide or 64 bits wide, when no value exceeds 127.

Also, I'm not sure about current leakage exceeding the benefits of underclocking. Those benefits are massive. Lowering the frequency also means you can lower the voltage. In the talks I have seen, halving the speed of a core (and lowering the voltage accordingly) makes it dissipate about a fourth the power. Half the energy, considering everything takes twice the time. This of course ignores leakage, but I doubt it would counter such a strong effect.

Besides, with a phone running a number of background processes, I wonder how much power saving time you can get. Not to mention, the potential costs of switching in and out of CPU saving mode.

"Race to sleep" is a well known power saving strategy.

> This of course ignores leakage

You can't ignore leakage. The ability of a chip to turn off currently unused silicon is crucial to power management on modern devices.

Of course you can't. But if leakage consume 25% of CPU power, then halving the frequency is still worth it. Leakage may be important, but that doesn't mean it dominates.

Now if as an app developer you have zero control over the CPU frequency, sure, the best strategy is probably to write your code to be as fast as possible. But if you're really big on battery life, someone will have control over CPU clocking or voltage. In which case race to sleep is no longer the only viable strategy.

> There's always an incremental benefit to optimization, and with enough users, it's irresponsible to waste collective years of people's time by leaving performance on the table.

I could just as easily claim that skilled developer's time is a precious resource, and it's irresponsible to waste it squeezing out performance when so much software still remains unwritten.

There's another side to that, as well.

Developers' time is expensive, and the time it takes to get feedback from code runs adds up when you have lots of developers. Especially if the roundtrip times are short enough not to allow switching between tasks, but long enough to cause constant "wall time waits".

I'd be more than willing to have couple of extremely skilled developers profile and optimise the common code so that the feedback loop becomes shorter for everyone. That way more of the yet-unwritten software might just be tackled.

My old boss once had me spend more than a week packaging and streamlining our access VPN configurations. When I asked him about it some time later, he explained that even though I was away from billable projects for that time, it would have cost a whole lot more overall to have all our developers and managers massage their setups every time something changed.

Large figures are sums of lots of small figures...

A typical compiler takes -O -O2 and so on flags for how much optimization you want. But what I would really like is a -O60s meaning, do as much optimization as you can in 60s and we're good to go. For the production version I might give it -O1h or -O12h and run it overnight. I am not aware of any product that does this.

The one issue with this type of compilation is that it encourages hisenbugs. You add debugging code, it can't spend as much time optimizing, and all of a sudden it is producing the correct results again.

But none the less, I agree.

I agree it's a trade-off, it always is :) But if the goal is to write performant code, there's a few more variables to consider.

An important one, IMO, is experience and practice.

Optimization can be found in the weirdest places. My latest experience is not with C or assembly instruction crunching, but with Processing/Java. If I weren't minded to tinker with and optimize my inner loops, I might have never discovered that the java.lang.Math routines are a lot faster than the equivalent Processing functions (functions like abs, max, etc).

Back when I was messing with 80x86 assembly code, I did the same. It's all about having a somewhat accurate model of what goes on inside. With Processing it's the model that the functions are probably wrapped in something. With assembly code it's (among other things) the model that the order of independent instructions matters for purposes of pipelining and preventing AGI-stalls (I dunno if those are still relevant, this was the 386 era).

When you get the hang of it, it's not really hard to experiment, measure, experiment, a few times and end up with spectacularly more performant code.

And really, if the optimized result took a few days to calculate, and an hour of tinkering might save you another day or two, isn't that (often) worth it?

Then there's the fact that I think it's fun :) Getting the code to work properly first, is a whole different process than spot-optimizing certain critical parts. I think it's refreshing to switch between these two different ways of looking at the same code.

How about just giving up on automatic optimization? I love the approach taken by http://halide-lang.org/. The programmer writes code, but then instead of having a compiler do the optimizations automatically, the programmer also explicitly specifies the optimizations to perform!

The advantage is that changing the optimizations cannot affect the correctness of the algorithm. So you get to write your algorithm once, verify its correctness, and then try drastically changing data layout, order of computations, parallelization, etc, as freely as you want. You can even change the optimizations you're using to match the data you're processing, so you can do much better than an optimizing compiler could do without profile-guided optimization.

Edit: After looking more closely at the presentation, I think Halide is essentially the realization of the hypothetical language described in the last few slides. Somebody should tell DJB about it!

Halide is quite domain specific though, it won't help you to create a new database system for instance, or help you on your hadoop algorithm.

GP suggested using the same approach as Halide, not Halide itself. Obviously you would first need to apply it to a general-purpose language.

As one of those "few humans", I'd just like to say that I think the default -O0 of compilers today should really be more like what they call -O1 or -Os, since the unoptimised compiler output I've seen is ridiculously stupid and contains things no sane human programmer would do even when not making any attempt at optimisation. O0 is more like pessimisation, in my opinion. To be clear, I'm not saying to apply fancy SIMD instructions/autovectorisation, loop transformations/unrolling, or any of those other advanced techniques. I just don't want to see things like saving/restoring registers that are never ever used in the function, lots of unnecessary memory accesses and data movement instructions, and register allocation that looks more like register non-allocation (e.g. always using the same 2-3 registers and memory instead of the other 4-5 that are available.)

Also, for rarely-run pieces of code (micro-)performance may not matter, but size certainly does - the smaller the non-performance-critical code, the more room there is in the cache for the performance-critical code. On a larger scale, the smaller the whole process is, the less chance there is of the non-performance-critical code being swapped out (and suddenly becoming very performance-critical...) when memory is constrained, and of course it also reduces overall memory usage, which is a good thing in a multiprocess environment.

Perhaps we need to fundamentally change how compilers and their optimisers work, since the current model seems to be "generate the most stupidly unoptimised code possible, then apply optimisation passes to it". I think it's a rather counterintuitive and wasteful way of doing it, compared to the "generate only the code you need" method a human would likely follow. Thus I'd say that current compiler optimisations are still useful, if only for cleaning up the mess they would make otherwise. :-)

Yeah, exactly. I feel like the author is missing the point of the optimizer. It's not something that attempts to fix bad code, it just expresses the programmers intent efficiently. For example, if I have a common subexpression, I would expect the compiler to notice it and take care of it automatically instead of me having to tediously code it explicitly to do the right thing. Given how large open source libraries are today, all of those little inefficiencies add up (in size and space) if we just let the compiler be stupid.

But the easiest counterargument is to just to tell someone to compile their program at -O0 (which I've done inadvertently on an number of occasions). The program is noticeably bigger and slower.

It's quite funny to hear somebody call for the reinvention of a field based on the quality of -O0 code.

Just get over yourself and pass -Os.

That'd be great but here is what Torvalds has to say about -Os


tl;dr - sometimes the smallest code being generated is slow and dumb

That's nothing compared to the -O0 crud that the GP was complaining about. It's also not a problem with the field of compiler design, but a (probably temporary) weakness in gcc.

(It is an interesting anecdote though.)

I think we should definitely reconsider how compilers are doing code generation, if by default they are generating far more code than necessary and need to use optimisation to make up for it.

The large volume of -O0 code is just a result of skipping register allocation to save compilation time. There's no problem here.

One of the big problems here is that the C calling convention, which a lot of other languages use as well, requires a lot of that brain damage. Calls are supposed to put just the first argument in the A register, and everything else on the stack, save any potential register that an operation the function may use, and generally be painfully slow.

One of the things optimization does is figure out what rules it can break in order to make the program go faster -- when it can put more arguments in registers, because only the code being compiled will see that function, etc. So, in order to make compilers smarter and more intuitive, we have to agree to break binary compatibility, which is something people are rather gun shy about. But, I think it's time to have that conversation.

There is no such thing as a C calling convention. It isn't in the language spec.

Also I am not sure which one in common use behaves exactly as you say. On ia32 it's usually all stack. (Though I think Microsoft uses ECX for the c++ this pointer.) On amd64 it's common to use registers for the first few integers. (Say, 4 of them.)

Further, within a non-exported function it's fair game to break ABIs/calling conventions.

The C language spec alone is not sufficient to build a useful system. You're also going to need some kind of operating system interface (defined in POSIX), binary format (ELF), and so on.

The specification that you're both trying to find is called the "System V Application Binary Interface, AMD64 Architecture Processor Supplement" and is found here: http://www.x86-64.org/documentation/abi.pdf

There are similar variations for other architectures. Microsoft does something different, I don't know where their spec lives.

That is the sysv abi, popular with unixy platforms. It's one abi. You have already mentioned there are others. So how is it "what I was trying to find"? My point is there is not one that is called "C".

A convention is a convention. It doesn't need a spec.

C is a language with a spec. When you say "c calling convention" this implies that it is C, and not something else. There are various ABIs, some with specs and some with unwritten rules, but two C programs can run on the same platform conforming to differing ABIs and still be both C.

If this sounds far fetched, consider for example that it's common to experience a lot of pain trying to link code compiled with GCC against code compiled with the ms compiler or vice-versa. Same language, CPU, OS. Differing abi.

" It sounds like no one knows what such a programming language would look like."

Actually, they do.

The answer is "a mess"

See things like "high performance fortran" (Led by the now-sadly-passed Ken Kennedy, who was doing compiler optimizations since well before most people here were born - 1971)


And more recently, things like fortress: http://en.wikipedia.org/wiki/Fortress_%28programming_languag...

How do you feel about the Halide language/project, which seems to me like a fairly promising application of these ideas.

http://halide-lang.org http://people.csail.mit.edu/jrk/halide12/halide12.pdf http://people.csail.mit.edu/jrk/halide-pldi13.pdf

Halide is a domain specific language. I think Halide is wonderful, and i think if you have a domain, it's entirely possible to make great things for that domain.

But if your domain is "general purpose", i haven't seen anything that has come out well.

IE i think it is exactly the restriction to given domains or choices that enable you to do the kinds of things halide does.

Doesn’t most performance-critical code involve some sort of similar “domain”? The examples people have been talking about in this thread, like media codecs, cryptography, matrix solvers, databases, etc. all seem to.

Yes, but these domains are all very different, and would require different languages, which would defeat the point.

Why? Why, if I care about performance, should I write my database, my kernel scheduler, my TCP stack, my JPEG codec, my cryptographic hash function, my physics engine, and my 3d-rendering pipeline in the same language?

If we assume (obviously this isn’t quite true today) the domain specific languages existed which would let me write more readable and modifiable source, in a highly portable way, with explicit control over optimization strategies so that I could get orders of magnitude better performance than the naive C++ version, why wouldn’t I use those whenever possible?

Then all the cold “glue” code that isn’t any kind of performance bottleneck can be written in whatever language, and doesn’t need to be optimized at all.

That shifts a lot of cost to defining and implementing many DSLs. (For an extreme case, consider the VPRI STEPS project where it looked like some of the DSLs were only used for one program.)

That extreme case may still be worth it, if the result (program + DSL implementation) is easier to maintain than the mainstream alternative (program in a general purpose language).

Besides, it wouldn't be that many DSLs. There are relatively few domains where performance is really critical. Cryptography, rendering, encoding/decoding… that's a few dozens at most, with global impact for each. If those DSL have any positive impact at all, they will be worth the cost a million times over.

So instead of writing a database, a kernel scheduler, a TCP stack, a JPEG codec, a cryptographic hash function, a physics engine, and a 3D rendering pipeline, we have to write a DSL for most/all of these programs and then the program itself. And this is less work than just writing them in hand-optimized C in the first place?

In many (but not all) cases, yes - because you don't just write programs. You have to read them and maintain them and update them too - sometimes 'forever'. A DSL is designed with the whole lifecycle in mind. Also, these tools can come with things like specialized analyzers, debuggers, verifiers etc for those specific domains, which can help increase assurances and improve your ROI a lot.

Cryptol is a decent example of a domain-specific cryptographic toolkit which can easily do verification of C code against specifications, for example. It can't do much besides writing crypto, but it's pretty good at that.

In other words, the idea is that the cost of the toolset implementation is amortized over the whole lifespan of the project, and multiplied many times over for every use - it's not just the initial implementation. Naturally this approach doesn't always work of course, for a million reasons that might be specific to your use case.

So then it stands to reason that we need a DSL for writing DSL's.

Less code / easier to read / less bugs / easier to maintain / easier to formally verify? Alan Kay seems to be a fan of this principle and from what I have seen it makes sense. He doesn't do it for optimisation per se but the same idea applies.

Because these are all artificial boundaries you have created. Many of the operations within the domains you mention here are going to be the same. It all comes down to effects, data structures, and algorithms.

We have some very good tools for writing DSLs (Forth and Lisp come to mind). And since in both of those you in principle have access to an assembler (Lisps are less good about that nowadays, sadly) they seem like a decent option.

I read a couple of the Fortress specs as they were developing it, and quite liked the language. It was complicated, but (to me) the complexity was warranted. What about the language do you consider a mess? Or is it more that you believe that any language with similar goals would inevitably be a mess?

I'd like to add, based on this quotation:

"The programmer using such a system will write his beautifully-structured, but possibly inefficient, program P; then he will interactively specify transformations that make it efficient."

Sounds to me as if what he is suggesting is that coding could take place in whatever language... but then during compilation they would "interactively" give the compiler answers to questions to that previous systems could not answer just by analyzing the program. Its not clear to me how the author envisions this "dialog" to take place... would the compiler literally pause and "ask" a person sitting at the console about the use cases? Maybe... and then perhaps it would cache those responses for future builds. Or perhaps the compiler might request another program to feed it sets of typical input for that particular section of code, etc. Definitely an interesting idea if that's where they're going with this.

You don't want to make the compiler itself interactive. Compilers have a well-known interface: either they spit out an output binary and exit cleanly, or they spit out an error. This is a case where I'd imagine the compiler could just spit out an error: something like "not 100% confident on the correct way to optimize the following lines: [...]" (this is oddly reminiscent of INTERCAL's "please" directives, actually.)

What the compiler would be expecting, would be an "optimization annotation file" to exist alongside each source file, that it would take as batch input. If said file has all the answers it needs, it compiles cleanly; if it doesn't (because the program changed, or because it doesn't exist yet) then it'll error out.

The important bit is that an IDE can read such an error, and transform that into some form of semi-interactive process to create/modify the optimization-annotation file. But to the compiler, everything stays a pipeline, like usual.

Could those be specified in comments in the file? Most development nowdays is done in an IDE that could intelligently "hide" the comments when you're just reading the code, and then when you're looking at optimizing the code you could start annotating things that way.

A similar approach is to use #pragma directives, though many languages do not have an analogue to that.

Java Annotations are probably wrong. The compiler can benefit from more freedom (and finer grained detail), so for Java comments seem like a better choice to me.

What can comment say that an annotation could not? Heck, annotations can contain text strings.

I believe it's a question of where you put the optimisation hints. Alongside the code, or in a separate file?

Do you mean Java annotations?

Perhaps something along the lines of the compiler saying, "if you write this that way, then I know X is true and I can perform some more optimisations". But that may be even harder to do than a "compiler-friendly" language.

We have two problems: people don't want to know about what's best for their processor (that's the job of the compiler after all), and they don't want to know about when compiler optimizations can take place (which actually is kind of the same problem).

> Similarly, computer programs today are increasingly diverging into a state where there is a tiny amount of extremely performance critical code, and a large amount of code where performance is so good on our hardware today that even horribly unoptimized code has no noticeable effect on performance.

> Thus, optimizing compilers are useless on the first type of code (humans are better), and useless on the second (performance doesn't matter).

It seems to me that while speed may not matter much anymore, power consumption matters now more than ever before (the proliferation of computers that are run off of batteries). If you're computer spends less time running code and more time idle, even if we are talking about milliseconds that the end user won't use, you might use noticeably less power over the course of several hours.

The difference between "the phone ran out of power when I was still 5 minutes away from home" and "the phone ran out of power after I walked through the front door and kicked back for the evening" may be the difference between "the battery-life on this is shit, buy [competitors product]!" and "no complaints".

The point is that all of the time spent running code is spent running only a small portion of code -- This is still case 1 listed above.

One difference is that before only bursts needed to be super tight, now idle performance have to be in addition. Maybe his rule is still valid, but the two cases do seem a little bit different to me.

Previously it used to be about optimizing that innermost loop that was only a few instructions long so it went twice as fast. A small local change with inline assembly.

But idle behaviour involves architectural optimizations. Like pushing events instead of polling for them.

> now idle performance have to be in addition.

Your idle time should be with all processes blocked in select()/poll()/epoll() and friends, and the CPU fully shut off with no code at all running.

If your CPU isn't fully asleep -- if it's running instructions -- you've already lost. There was a bug with my cellphone where a service was holding a wakelock and preventing the CPU from shutting off. This was enough to drain enough power that it would drain even while on the charger.

Yeah, hopefully nobody busyloops ala

But I could see someone doing


    setInterval(checkForEvent, sometime);
This wouldn't be terrible but not ideal.

Bernstein’s logic applies equally well to power consumption as to pure runtime.

>Similarly, computer programs today are increasingly diverging into a state where there is a tiny amount of extremely performance critical code, and a large amount of code where performance is so good on our hardware today that even horribly unoptimized code has no noticeable effect on performance.

>Thus, optimizing compilers are useless on the first type of code (humans are better), and useless on the second (performance doesn't matter).

Even today, in practice, this is a false dilemma. Consider Postgres, Firefox, or Photoshop for that matter: I expect the user-experience of these products to be significantly degraded when compiled with -O0.

The trouble with this analysis is that it assumes that a codebase is a static, finalizable goal: it will be written, it will be done, and then it is perfect. And that might be true of... oh, I don't know... qmail. But the development of most software is an ongoing process, and the tradeoff between developer time and runtime will never go away: nobody in the video-game industry would consider for a second the idea that performance-critical code doesn't need to make deadlines! A great deal of performance-critical code is written with a deadline, by people who would never pass a tenure hearing. If you can eliminate all "hot" parts of your application by linking assembly libraries, even today, you're in a very, very small minority! The amount of hand-written assembly in the wild is so small that unless you work on [cryptography, scientific computing, emulation, embedded] you'll rarely cross paths with it.

The call at the end of the presentation for an "interactive optimizer" of sorts is a bit of a non-sequitur: we might ask as a necessary prerequisite the demonstration of any system which can effectively compete with (∆t ≤ 10%) hand-written assembly on any problem of a significant size. This extremely-tuned linear algebra optimizer ("AUGEM"):


demonstrates that it can beat g++, but no comparison with OpenBLAS is given (the researchers used an old version of GotoBLAS, which is sort of unfair by their own admission). Can human-parity ASM can be generated efficiently with an interactive compiler? I have no bloody idea.

Re: the AUGEM paper, the generated kernel is the OpenBLAS kernel. Xianyi is one of two(?) primary OpenBLAS developers, and it says in the paper that "the GEMM kernel generated by our framework for the Intel Sandy Bridge CPU has been adopted as a part of our open-source BLAS library OpenBLAS."

Hi scythe,

I am Xianyi, a co-author of AUGEM paper and the developer of OpenBLAS.

We used the AUGEM generated assembly codes at OpenBLAS sandy bridge kernel(OpenBLAS/kernel/x86_64/dgemm_kernel_4x8_sandy.S).

Meanwhile, we need add some hand written codes to deal with the tail ( undivided by block size).

Thus, we didn't compare AUGEMM with OpenBLAS. However, we compared the performance with Intel MKL and AMD ACML.


Yeah, except there are large amounts of code that could stand being faster, but it is not worth the cost to hand-optimize.

So what are they good at? Optimizing large amounts of code well enough at a negligible cost. In other words, scaling productivity. The thing machines usually are good at :)

> Therefore we should design programming languages that on the surface look very unoptimized, but specify enough information that compilers can do a really good job. It sounds like no one knows what such a programming language would look like.

This entire post would be painfully familiar to anyone working with probabilistic programming. An intriguing overview of the state of the art was posted here a little while ago: https://moalquraishi.wordpress.com/2015/03/29/the-state-of-p...

The task here (I'm adapting an older comment of mine) is that the human should write a relatively simple procedural program that says "if we were to fix a bunch of properties/states of the natural world and provide them as input, here's the output/reading you'd hypothetically get," and then from that, the "compiler" and runtime create a function that, given a large number of outputs/readings, provides a posterior probability distribution over those hidden properties/states of the world. The implementation choices for this "function" can make a difference of days or weeks of runtime on huge computer clusters.

The problem is that machine learning researchers use all sorts of insights to make manual optimizations when hand-rolling machine learning code - for instance, if there are real-world reasons why a certain type of search over your space may be well-suited or ill-suited to the problem domain. And not only can the languages not capture these insights yet, but even if they could, there's fragmentation in the community (since it's at such an early stage), and there's no standardized way to map heuristics detecting optimizable patterns to actual optimizations.

I'd imagine optimizing compilers looked very similar at one point, though... so only time will tell how far the community progresses!

Perhaps the authors deliberately ignored the embedded systems "niche". Embedded systems have this "unusual" property that the one-time cost of software development is dwarfed by the 10^N-times cost of more expensive components. This is also true in theory in the desktop world, but it's so easy to shame your customer for not using a high-end computer that it's a non-issue. Optimizing compilers are needed for embedded systems because low-end micro-controllers don't have the horsepower of desktops or even your latest smartphone, but yet the more features/the smarter behaviour the better. So you have better things to do than optimizing stuff for a processor you don't know very well and that may be replaced in 5 years.

The actually useful question is not whether there are humans who, in their domains of expertise, can out-perform a general-purpose compiler.

The actually useful question is how many humans actually possess or can reasonably be expected to gain the knowledge to write screaming-fast bare-metal hand-tooled compiler-beating code safely on modern processor architectures, and figure out the trade-off if we decide to abandon optimizing compilers -- even though there will always be some nonzero number of humans who can write faster code -- in favor of either forcing every programmer to hand-roll performance-critical code or re-tooling everything and everyone everywhere to a hypothetical language (and probably a hypothetical hardware architecture) you don't even know how to describe in general terms yet.

(for the record, my money is on the "useless" compilers winning that cost-benefit analysis)

> If optimizing compilers aren't useful, what system should we use instead for making performant code? The author and collaborators' experience suggests that the reasons a compiler can't optimize code as well as a human when it matters is that our current programming languages don't give the compiler enough information about the intent of the code to optimize it. Therefore we should design programming languages that on the surface look very unoptimized, but specify enough information that compilers can do a really good job. It sounds like no one knows what such a programming language would look like.

Sounds like one hell of a rabbit hole. Could have sworn that codifying intent is one of those things that humans have struggled with since the dawn of civilization. As in, what is a clever optimizing hack on society vs a criminal offense.

Good point.

I had a professor who worked on the Java HotSpotVM and was a proponent of optimizing compiler. His attitude was RAM is cheap. But he also thought us LL1 Pascal style compilers too.

In real world, native compiled applications written in C/C++/ObjectC/Rust rule the desktop world. In the enterprise/backend server we see a lot of Java/DotNet VM based memory hungry applications, Interpreter or modern VM based languages like PHP(HHVM/PHP7)/LuaJIT(nginx/openresty)/Javascript(Nodejs v8)/PyPy dominate the web and on smartphone one can decide to go native with iOS or the 8-core CPU and still a bit laggy Java lang VM based Android.

Fortran and C and their compilers are still the benchmark to beat. As modern CPUs optimize and predict branches, some automatic optimization techniques in higher levels are counter productive.

Haskell? :)

Haskell is a good starting point. It's rather high level.

I'd like to see a variant that banishes bottom _|_ to the same corner that unsafePerformIO already occupies today. I want to see total Haskell. If all programs had to halt by default, the compiler would be much freer in choosing evaluation models---strict vs lazy would make no semantic difference.

There are a bunch of languages like that. I've just done a project in HOL, where all functions must terminate. Proving that functions terminate is a pain in the backside, because for any trivial structurally recursive function you write (easy to prove), there's also a function that's nothing like that and will just be a waste of your time.

Also, without a lot of work, you'd get very little out of evaluation models. In particular, the only disadvantage at the moment is that you can't strictly evaluate arguments where the function isn't strict in that argument. That's cases where the argument might not get used. I don't think it would get you massive improvements that you couldn't already get in most cases by simply producing specialised versions of the function.

The big advantage is that you can prove more interesting stuff about your code (with again more effort). See Agda, where you can (for example) write a zip function that won't typecheck unless both lists are the same length.

See the Morte language proposed (not sure if actively developed) by Gabriel Gonzalez, author of many popular Haskell libraries.

Found this discussion pretty interesting on the subject : http://www.reddit.com/r/haskell/comments/2g6wsx/haskell_for_...

> If all programs had to halt by default, the compiler would be much freer in choosing evaluation models---strict vs lazy would make no semantic difference.

Lazy evaluation permits the use of infinite structures, which would be _|_ when evaluated strictly, but not when evaluated lazily.

Good point.

That's where lots of terms with `co-' in front come in. Co-recursion can deal with certain infinite structures.

Bitcoin script has to halt by default to prevent DOSs on nodes that evaluate the blockchain. However it doesn't have looping or flow control besides if/else.

I don't think you need looping or flow control, even for a general-purpose language; what you really want is the ability for almost all the code you write to be in terms of "step functions" with the flow control lifted out (e.g. edge functions in finite-state machines; recursive functions that recurse using a Y combinator instead of knowing that they're calling themselves; cellular automata where you only get to define the local transformation, not the automata itself; etc.)

In other words—in the same way that Haskell hides IO "within" the implementation, giving the programmer an encapsulated abstraction in the form of an IO monad where it's effectively immaterial whether real IO happened or not—you could have a language even purer, where you only get to specify comonads, and the definitions of monads (where all non-halting behavior leaks into the language) themselves are abstracted away and inaccessible to the language programmer, except perhaps within the equivalent of Rust's unsafe{} blocks.

In such a language, what monads were used to run the program would be up to the implementation—thus allowing the same program to be given unbounded resources, or bounded resources, for any given resource, as a run-time-configurable decision.

My token attempts at writing numerical code in Haskell have been ... underwhelming.


I was thinking Ruby actually, although Haskell could probably be close too. In any case is sounds like the suggestion is to move away from imperative, procedural programming, which has been talked about a lot in various ways.

From the insides of a compiler, imperative and procedural have little meaning. C turns into a purely functional programming language called static single assignment form very easily.

Actually, when you start thinking about everything like your compiler does, they stop having meaning in real life too!

If you want to make a language easier to optimize, you need to reduce runtime decisions the program might have to make and incidental details about the program the compiler has to preserve[1]. That's "it". I think the closest languages for some problems would be Julia, array languages, OpenCL, etc. Not so much Ruby…

[1] Like function signatures, structure layout, error behavior on bad inputs, memory aliasing, dynamic method calls, integer overflows, array overflows, stuff like that.

SSA turns C into a pure functional language? What about globals, volatile, etc.?

It's inside a memory monad.

What on earth are you talking about? SSA has nothing to do with monads. http://en.wikipedia.org/wiki/Static_single_assignment_form

Ruby is the language I like best to work in, but you're right. When I started Ruby, I wanted everything to be as static as possible, and I still do. What seduced me with Ruby was how clean it reads to a human.

But as someone who writes compilers as a hobby, Ruby is also a massive challenge.

Even reasoning about a simple Ruby program as a human is hard unless you make a lot of assumptions about how a reasonable developer will behave.

What does "2 + 2" do? In Ruby you technically don't know without knowing what's lead up to that point. Some trickster might have overridden Fixnum#+ to return "42" on every addition. Or format your hard drive. And if there's a single "eval" with user controllable input before that point, you can't statically determine what it will do, because that trickster may or may not be present at the keyboard for any specific run.

Instead a JIT will need to be preferred to de-optimize and re-optimize code differently if you want efficiency, or an AOT compiler will need to add all kinds of guards and fallback paths to handle the crazy alternatives wherever whole-program analysis is insufficient to rule out the tricksters..

And here's the kicker: common Ruby libraries use eval() all over the place. Often for inconsequential things that a compiler could in theory figure out. But not without a whole lot of extra analysis.

For ahead-of-time compilation there are a ton of further challenges, such as the practice of executing code to alter the load path and executing code to determine exactly what "require" calls are made to pull in code. 99% of the time it results in a static set of "require" calls and you could just compile it as normal. That 1% it's a plugin mechanism, and the intent is for you to load and eval() code at runtime.

Ultimately very few mainstream languages have such a fluid distinction between read/compile time and runtime as Ruby, and are as hard to statically analyse as Ruby.

That's part of what makes it fun to try to ahead-of-time compile Ruby, but also drives me crazy at times when trying to figure out how to do it semi-efficiently.

A lot of this can be resolved with relatively small tweaks, or even just current language features that most people don't use. E.g. if you were to do "Fixnum.freeze", the class can't be modified any more.

Now we know what 2+2 means, assuming we know what Fixnum looked like at the point of freezing.

One can make it vastly easier to reason about, and optimize, Ruby just with careful application of a few things like that. Unfortunately, because current Ruby implementations does not reward behaviour like that, nobody takes those steps.

Ruby is the wrong direction. You need to make more information to the compiler statically. Haskell is the right direction here.


Not a game programmer but, impression I get is a number of games use interpreted languages as glue. Certainly sound to me like thee is a dichotomy between must run fast at all costs, and must be flexible speed matters not at all.

I think game programming is an area where a high percentage of the code base needs to run fast, even if it's not run very often. If it causes the frame rate to slip, the user is going to notice.

In my case, embedded, optimizing for size is everything. And usually only a very small portion of the code needs to run fast. One problem that people are running into with optimizing compilers is with embedded often you are interested in the physical side effects of your code, not it's mathematical end state. Optimizing compilers are increasing not respecting the former in preference to the latter.

It is a typical DJB presentation. Hard to read with repeated text all over (since he wants a transition effect of adding text during the talk). This one actually contains images. DJB is very consistent in the way he do presentations.

> the few humans who optimize important bits of code to the maximum extent disagree

That sounds like selection bias. The only people who would spend time hand writing assembly are those who don't like compilers.

The talk was given to a packed room at ETAPS, a European computer science conference leaning more on the theoretical side (I suppose everyone was curious how optimizing compilers were dying). All in all, the audience did not bust out the pitchforks, although one might say that "domain-specific compilers" is basically the direction academia has already been heading. I doubt any of the people in the audience who were working on compilers/JITs are planning to stop working on them, though it did make for some fun dinner discussion.

There was one mid-talk exchange was one professor asking djb upfront whether or not he thought, in ten years, Mike Pall (author of LuaJIT) would be out of a job--after all, JITs are basically optimizing compilers. Well, the original question was more diplomatic than that, but eventually he pushed it enough that he got djb to not deny that this would be the case.

The talk was somewhat marred by a very large digression into an undergrad level primer of computer architecture (it probably would have been better served by an extended Q&A session), although the sorting example he finally built up to was pretty cute.

I really like the conversation with the compiler approach. I had the good fortune to write some code for one of these: http://en.m.wikipedia.org/wiki/Cray_XMT which is a multi socket TB-scale shared memory machine with 128 hardware thread contexts per socket. It has an autoparallelizing C compiler that attempts to parallelize mostly for loops where it thinks it can (really quite a clever thing), but you can also tell it where to do things and give it hints through compiler pragmas. The compiler infrastructure will print out annotated copies of the code that tell you where it did and didn't parallelize and why. The effect is that you have this conversation with the compiler in which each of you tries to tell the other how to make the code more parallel (which in the case of the XMT means better and faster). It's very simple but the result can be orders of magnitude improvement.

A couple of years back I was working as a programmer, working on an application written in C. The application was always built with all optimizations disabled, because a bug in the compiler's optimizer (is that the correct term?) had bitten the programmers, so they had become really careful.

One day, as an experiment, I built the applications with pretty much all safe optimizations enabled and did a simple benchmark that showed a performance improvement of ... (drum-roll) two percent. This left me scratching my head, wondering if that meant the compiler really sucked at optimizing the code or if the code was so badly written the optimizer could not really do anything about it (or maybe it was so well-written it already was close to optimal without "optimizing" it? Probably not.).

How nice would it be, I thought, if the compiler could give me some kind of feedback on which parts of my source code it did or did optimize, and why.

I am relieved to hear (well, read, technically speaking) that I am not the only person to have had that idea.

Many programmers have all kinds of ideads in their heads about how writing their code in a certain way will make it easier for the compiler to optimize the heck out of it, but I guess few actually go and check if their assumption have any basis in reality.

It would be so incredibly nice to have a compiler that told you about such things, so you could know if your assumptions were true or bogus.


SGI's compilers provided this feedback back in the 90's. I have been working with domain-specific languages for years but I would have thought what was relatively common back then persisted...

In my experience (computationally-intensive numerical simulations), the optimizers helped. Although we had to use a language that was algebra (Fortran derivatives), write C-tran with #pragmas to help out the optimizer or use hand-coded libraries for things like LAPACK.

I'm guessing that the XMT compiler mentioned upthread came from Tera, rather than SGI, but it's still interesting that significant parts of SGI (maybe including those same compilers?) ended up at the same company as the XMT compilers.

It would be so incredibly nice to have a compiler that told you about such things

gcc can output such information for autovectorization (-ftree-vectorizer-verbose=[n])

That is good to know.

But while for some tasks this may be crucial, vectorization is only a part of the arsenal of optimizations gcc may apply.

Recently, there was discussion here on HN about C compilers (gcc, clang) exploiting undefined behaviour for optimizations that could be ... surprising. That, too, would be nice to know about.

The general idea of a dialogue between a programmer and a compiler about how (and where) to optimize a program is very intriguing.

gcc and clang can output before/after output for all their passes. Try '-fdump-ipa-all -fdump-tree-all -fdump-rtl-all-slim' on gcc, I forget the llvm options.

It's not especially readable, though.

Also, for ghc you can use '-v5' and the option that writes out C output. Prepare to be impressed(?) at how much it doesn't look like a fast C program.

SBCL will emit (rather verbose) annotations if you ask it to optimize.

Just curious, was this a CPU bound program?

Pretty much. The CPU-intensive part was computational geometry. It certainly did not help that we were four programmers, and none of us had a strong background in math. I suggested once hiring a math-kahuna just to optimize the geometry-stuff, but money was rather tight. Also, the other programmers were very conservative. Once, an intern rewrote the geometry stuff using DirectX, so stuff like overlap detection for polygons could be offloaded to the graphics hardware, but my boss refused to adopt this into our code.

Also, we used a version of OpenWatcom that was rather old-ish even back then ("back then" being ~2007-2009), which knew nothing of SSE or stuff like that. Multi-core CPUs were around, but few if any of our customers used them, so multi threading was not really an option, either.

UPDATE: As a side note, I once tried recoding a few very simple but heavily-used utility functions in assembly. That was pretty much my only contact with raw assembly. When I compared my versions to what the compiler made of the C-version, they looked nearly identical. I never knew whether to be proud or ashamed of that. Then again, the assembly version was glaringly obvious... (For comparison, I also compiled those C functions with gcc, and the output looked nearly identical to my version.)

I'm still a big fan of the feedback offered by the XMT's compiler. I think it's a lot more important/useful when the difference between success and failure is a few orders of magnitude (as when trying to parallelize for a target with plenty of processors).

Some bits of Fran Allen in Coders At Work book: http://www.codersatwork.com/fran-allen.html

"We were making so much good progress on optimizations and transformations. We were getting rid of just one nice problem after another. When C came out, at one of the SIGPLAN compiler conferences, there was a debate between Steve Johnson from Bell Labs, who was supporting C, and one of our people, Bill Harrison, who was working on a project that I had at that time supporting automatic optimization. The nubbin of the debate was Steve’s defense of not having to build optimizers anymore because the programmer would take care of it. That it was really a programmer’s issue"

"We have seriously regressed, since C developed. C has destroyed our ability to advance the state of the art in automatic optimization, automatic parallelization, automatic mapping of a high-level language to the machine. This is one of the reasons compilers are . . . basically not taught much anymore in the colleges and universities. "

I attended djb's ETAPS talk and am still not true if he was deliberately provocative or genuine. Assuming the latter, I disagree with several of his points. Here I want to bring one to the readers' attention and it's to do with the economics of correctness proofs.

One of his key arguments was that compiler optimisations are difficult to prove correct, and that's one of the reasons why optimising compilers will be replaced by a combination of simple compilers + assembly hand-written by domain experts. It is true that such proofs are (currently) expensive, but misses the point of the economics of correctness proofs: Correctness proofs are difficult, but proving compilers correct amortises that cost over all subsequent uses. In contrast, program specific correctness proofs are typically of comparable difficulty, but don't amortise in this way. Therefore it seems to be cheaper in the long run, to focus on the correctness of optimising compilers. Moreover, as compilers and optimisations are quite a restricted class of algorithms, hence it is more likely that we can reuse (parts of) correctness proofs and prover technology for compilers.

As I understand, the point is precisely that optimizations are not a restricted class of algorithms, because you need entirely new optimization techniques for new domains.

I'm not sure I understand what you mean. Those "entirely new optimization techniques" cannot be baked into the compiler, they will have to be invented and implemented by a human programmer on an ad-hoc basis. Compilers only use general purpose optimisations.

To finish the Don Knuth quote: "There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. "

"Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified."

Sadly at least some of the points of relevance of this quote are outdated.

For example, it would behoove a large company to spend a great deal of time optimizing say, their JSON parsing library. Although it may not identify as a hotspot in any one place in their immense codebase, it's extreme prevalence causes performance degradation subtly but pervasively.

I also measured injected object creation using Guice to be 40x slower than a simple constructor in java (agree or disagree with the 40x, but using reflection to set a variable instead of simple object construction is intuitively far slower).

Guice may not show up on any profiler as problem - but if you slow down object creation by a factor of 40x, something you may do thousands of times per second for the life of your program, you are degrading performance across the board. Rather the same if you simply clock your CPU down a few hundred Mhz.

To counter some of the naysayers, suppose your JSON parsing library or Guice object instantiation is slow. In my case, my CSV parsing library is slow. You will change your code to avoid these things. In my case, I’m considering converting the CSV files to Thrift or Protocol Buffers, which means that I’ll have hundreds of gigabytes of reformatted data that I’ll need to rebuild when the CSV changes. But then the CSV parsing would not show up in the profiler any more. If, instead, I magically had a CSV parser that was ten times faster, I would have an overall simpler system that didn’t have to manage a cache of hundreds of gigabytes of reformatted data in order to avoid CSV parsing.

So it’s entirely possible that a slow JSON parsing library (or DI library or whatever) could result in an overall slow and overcomplicated system, and yet not show up in the profiler, because the programmers had added enough overcomplicated, somewhat faster alternatives to avoid using the JSON library much.

"Guice may not show up on any profiler as problem - but if you slow down object creation by a factor of 40x, something you may do thousands of times per second for the life of your program, you are degrading performance across the board."

If your profiler doesn't point right at the thing that you're doing 40x slower thousands of times, either your profiler is BROKEN or that isn't actually a bottleneck.

If you have ten thousand different kinds of objects then the creation of instances of any given class will not rise to the top of the profile.

When I care about performance over most anything else, I take this guy's approach: https://www.youtube.com/watch?v=Kdwwvps4J9A

It's not appropriate for every project, but if maximizing performance is your goal, you should be tracking performance constantly as you work, not as an afterthought late in the project.

Late profiling leaves too much room for a thousand small inefficiencies to add up, and it adds too much time between making a decision and understanding its impact on performance.

I'm not sure how what you describe is any different than focusing on the critical parts of the program. If a function is short but is called many many times, that could still be a hot spot.

It's also worth noting that even then, it still may not matter if you're stalled out on something like I/O or whatever (while still keeping in mind other constraints like battery life, CPU throttling, etc).

> Guice may not show up on any profiler as problem

The it isn't a problem.

> but if you slow down object creation by a factor of 40x, something you may do thousands of times per second for the life of your program

Then it'll show up in the profiler. You aren't going to do something that slows down the program across the board yet doesn't create new hot spots that show up in the profiler.

Counterexample: anything that involves cache.

You do something that blows away your cache, it'll show up as slowdowns elsewhere in the program.

> You do something that blows away your cache, it'll show up as slowdowns elsewhere in the program.

Presumably that will show up as additional cache misses at those points in the profile, as suggested, assuming the profiler can count more than cycles on the relevant hardware.

Unfortunately, it is really difficult to diagnose cache effects.

Do you have a good idea of what levels of cache misses are typical? Do you regression-check cache miss rates between versions?

What about the whole "debugging your program makes it slower" observer-effect? Do you know what effect your profiler has on cache?

And again, you do something that blows away cache it'll show up as cache misses at scattered random other parts of your problem. Even if you catch it can (will) be hard to track back to the actual source.

I thought it was regression which was at issue here.

I'm not convinced cache is so different from all the other aspects of performance we have to deal with in HPC systems (insofar as they're isolated), but no matter. At least there's plenty of tool support and literature.

I often boggle at people who claim that compilers are magic and outperform humans-- perhaps thats true for unimportant code that you'd pay no attention to, or with developers who aren't familiar with the underlying micro-architecture at all.

It's pretty usual for me to see a factor of 2 performance for the same algorithm implemented in the same manner when moving from SIMD intrensics (which almost directly map the underlying platform) to hand coded ASM.

Even non-SIMD code can result in some pretty stark changes.

A non-SIMD example from a crypto library I work on which isn't (yet) very well optimized for ARM, benchmarked on my novena (with GCC 4.9.1 -O3):

ecdsa_verify: min 1927us / avg 1928us / max 1929us

And a hand conversion of two hot inner functions (which are straight line multiple and add lattices, no control flow) into arm assembly:

ecdsa_verify: min 809us / avg 810us / max 811us

Again, same algorithm.

(The parallel change for x86_64 is still significant, but somewhat less extreme; in that case converting the same functions is only a 15% speedup overall, partially because the 64bit algorithm is different).

When thats a difference which results in 2x the amount of hardware (or 15% for that matter) for a big deployment; it can justify a LOT of optimization time.

(Or in my case, the performance of this code will have a material effect on the achievable scale and decentralization of the Bitcoin network.)

From a straight development time vs performance perspective I'd use even more hand written code; ... but there is a maintenance/auditing/eview/verification overhead too. And often the same code that you cannot tolerate being slow you also cannot tolerate being wrong.

Copied from lobste.rs since I think it'd be interesting to the audience here as well


This “dialogue with the compiler” bit that djb lands on is in some sense obviously the right way to go forward. I’ve found this to be the case not in optimization—though I’m not in the least bit surprised to see it there—but instead in the world of dependent types. The language that the program writer writes in is often just a skeleton of the genuine information of the program. For instance, in a dependently typed program it’s often very difficult for an author to immediately write all of the complex intermediate types required to drive proofs forward, but it’s much easier to achieve this in collaboration with the compiler (really the typechecker, and e.g. via a tactics language and interactive proof general-like interface). The ultimate result, the “elaborated” program, contains much, much more information than the skeletal program the programmer originally wrote. It has been annotated by the collaboration of the compiler and the program writer to extract more of the programmer’s intention.

The same kind of thing could clearly arise from a “collaboration” over optimization. It’s even quite possible that these are the same dialogue as dependent types certainly provide much more information to the compiler about the exact properties the code ought to substantiate—in a nice, machine readable format even.

It's worth adding in archive that this link is highly relevant: http://cs.ru.nl/~freek/courses/tt-2010/tvftl/epigram-notes.p...

An optimizing compiler may not be better than me at optimizing hot code paths, but my time is a very limited resource. The compiled version may only be 75% as fast as my hand optimized version, but writing that hand optimized version will likely take several times longer. Sometimes it is worth spending the extra time for that performance, but usually it is not.

Indeed, plus you need multiple hand-optimized versions. Not only per architecture, but also e.g. pre-AVX and AVX. An optimizing compiler will give you optimizations for all current and future platforms for free.

Another problem is that the number of people who can write good general hand-optimized assembly is small. E.g. I used a numeric Go library (which I will not name, because I should've submitted an issue) that used assembly-'optimized' routines. Replacing those with simple C loops and turning on auto-vectorization beat those hand-written routines handsomely.

Very related, but surprisingly not covered, I'll point out what was covered in depth at Mike Acton's CppCon14 keynote "Data-Oriented Design and C++"



And that is: Because of the ever-growing disparity between ALU vs IO speeds, the vast majority of time spent waiting on computers is because of issues that the compiler can not optimize. In general, compilers have very few opportunities to rearrange your data structures without your explicit, manual input. They can't help your CPU stall on memory/disc/network IO less by any significant amount. They can only help when your CPU actually has the data it needs to proceed --which often is less than 20% of total execution time.

In that case, no matter how smart GCC gets, it probably can't ever speed up your existing code by over 20% than it does today. It's not allowed to by the spec. I'm not aware of any general-purpose language where this is an option to any significant degree (silent AOS-to-SOA, hot-vs-cold data segregation, tree clustering, etc...)

If your program is too slow, it's almost certainly because you haven't done the hard, still-manual work of optimizing your data access patterns. Not just your Big-O's (N^2) vs (NlogN), but also your Big O's hidden, implicit K. The K that academia actively ignores and that most people rarely think about because it is mostly composed of cache misses that are implicit and invisible in your code. x = sqrt(y) is super cheap compared to x = *y. But, the same people who fret over explicit ALU costs usually think very little of x->y->z.

> I'm not aware of any general-purpose language where this is an option to any significant degree (silent AOS-to-SOA, hot-vs-cold data segregation, tree clustering, etc...)

Not yet, of course, because it hasn't been released, but Jonathan Blow's Jai language offers transparent switching between AOS and SOA, as well as many other features that make it easy to explore the 'optimization space' of a program manually.


(for those who don't know, Jonathan Blow was the creator of Braid)

> It's not allowed to by the spec.

Oh, compilers are allowed to do whatever you tell them to - they just need to see all your code at once, to know which side effects don't matter. C with LTO is such a platform (if you don't use function pointers…), so is any JIT language (if you don't call into C…).

That said, data reorganization is very underdeveloped, and the only thing that comes close is tiling access patterns in GPUs, AFAIK. Here's a dead research project about it:


If you agree with the author's conclusion that we need better languages that allow us to give the compiler more information about the optimization needs of our program, then I think you have to look in the direction of languages like Haskell, Idris, etc. Fortran can be faster than C because C has aliasing that limits the optimizations that the compiler can perform. Similarly, strongly typed and pure languages like Haskell give you even more. You can do a lot with types, but on top of that Haskell allows you to define rewrite rules that tell the compiler how things can be optimized. This allows the compiler to automatically convert things like this:

    map f (map g items)
...into this:

    map (f . g) items

Assuming f and g are not evaluated for their side-effects...

While I get that Haskell allows/demands you to write what you mean, consider an app that adjust reverb and volume up, from 1..10 (here items is the range 1..10). You could argue (rightfully) that the code should be structured differently (in this imaginary case, f and g would be identity functions on integers -- or perhaps Maybe integers (return int if volume/reverb set to the requested number, error/nil otherwise).

I realize this is a strange and contrived example -- the main point is that while functional programming is great -- iff what is wanted is a single multidomain language, it needs to function (heh!) in many different paradigms.

Btw, what would be the appropriate way to define functions for their side effects in Haskell - eg a function like this that takes an integer, performs some kind of i/o, interrupt, writes to a hw address -- to set some external state -- and returns the new value on success? Anyone have some pointers to that at the top of their head? Maybe there's a chapter in "Real World Haskell" that deals with it?

[ed: I just realized that the behaviour of (f . g) and map-map might be similar with the "Maybe identity"-behaviour I sketched above - modulus error handling/recovery (eg: do we want to try f on a given int i, even if g failed?). Oh well :-) ]

> Assuming f and g are not evaluated for their side-effects...

And that is the whole point. In Haskell we can know that f and g have no side effects. If they had side effects you would not be able to compose them with the dot operator.

I don't quite understand your example but I assure you that there are elegant ways to accomplish that in Haskell. There was recently a presentation [1] at the NY Haskell meetup that talked about a new Haskell sound synthesis library called vivid.

Making functions that perform side effects is also easy. Here's a simple example:

    intToFile :: Int -> IO ()
    intToFile anInt = do
        writeFile "myfile.txt" (show anInt)
The idea that you have to be multiparadigm to function in real world settings is a fallacy. Haskell is simply more expressive, which equips it to solve just about any problem with more clean and concise code than you're likely to get in other languages. Even if the code size stayed the same the extra safety you get from purity and strong types would be worth it alone.

[1] https://www.youtube.com/watch?v=xo3zUvPsizo

> The idea that you have to be multiparadigm to function in real world settings is a fallacy.

I'm not so sure. The best solution for a subset of programmers is unlikely to be the best solution for every programmer. Somewhat like the quote about literate programming:

"Jon Bentley probably hit the nail on the head when he once was asked why literate programming hasn’t taken the whole world by storm. He observed that a small percentage of the world’s population is good at programming, and a small percentage is good at writing; apparently I am asking everybody to be in both subsets." -- According to Donald Knuth http://www.informit.com/articles/article.aspx?p=1193856

While I think there is much more overlap between good programmers that write good Forth, those that write good Ada (or other procedural code), those that write good functional programs, those that bask in the glory of Smalltalk/Self (classic oo) and those that get by in java ("modern"/class-oriented oo) -- than between good technical writers and good programmers -- I think the problem is partly the same.

There's no guarantee that haskell is the best tool for a 10 year old write interactive fiction, or automate his or her schedule -- and for a finance manager to manage numbers and for an engineer to write code for a robot -- or for high frequency trading... or etc.

Machines don't understand Haskell anyway, only humans do. If many of us find Visual Basic or Excel or SQL easier... we probably have a need for such languages (even if I don't want to touch Visual Basic with a stick...).

Multiple programming paradigms has nothing to do with computers and systems, and everything to do with human-machine interface and augmenting the capabilities of humans through computers.

>The idea that you have to be multiparadigm to function in real world settings is a fallacy. Haskell is simply more expressive, which equips it to solve just about any problem with more clean and concise code than you're likely to get in other languages. Even if the code size stayed the same the extra safety you get from purity and strong types would be worth it alone.

Haskell is also not used any major project. It's an academia language.

I don't think it's quite right to say it's (still) just an academia language -- but it's absolutely marginal. I use two (three) programs implemented in haskell on a (semi)regular basis: xmonad (and xmobar) and pandoc.

Other than that, I know of git-annex that is both actual programs and is in actual use.

That's still more than the number of programs I use that are implemented in Arc...

Haskell is academic in the sense that it is very opinionated about a few things (as this sub-thread illustrates).

Lets see how Shen, Open Dylan and clojure end up doing ... (and of those, probably only clojure have meaningful contemporary systems implemented in it - so far). Maybe there's even more life left in Common Lisp (sbcl etc) or Scheme (Racket, guile w/guix nix (haskell! yay!) ...).

Then there's OCaml of course...

I think that has more to do with syntax and the overdependence on rote memorization that it requires, and less to do with the actual concepts themselves. Lots of things are silly in Haskell-land, but that's mostly because it's made to be elegant or math-like, if I've understood it correctly.

That said, multiparadigmatic programming is awesome, we just have to find some way to give as much information to the compiler in imperative instructions as we do in functional. This means that imperative programming has to be tightened up quite a bit, but I don't think it would be very drastic, really.

If you wanted their side effects, they would have to be in a context that tabulates their side effects - i.e., a monad. Operations in a monadic context (I/O, mutating state, etc.) are generally not commutable - or if they are, the circumstances under which they are can be expressed in the definition of the monad. On the other hand, part of the point of Haskell is that any side effects must be made explicit, so if all you've got is a function, you can be guaranteed that there are no side effects.

In Haskell, there are no side effects. Any effects you care about will explicitly be part of the type.

Haskell provides many combinators to combine various combinations of effectful and effect-free functions. Look into the documentation for Control.Monad to see some.

Your example of aliasing in C is a good one, because this is a case where modern C allows the programmer to give the compiler the necessary information to do a better job. The "restrict" keyword indicates that an object accessed through a given pointer argument cannot alias any other object accessed by the function.

Which doesn't help when you're trying to express anything more complex.

Such as "a won't alias c, but there are no other restrictions".

More discussion from a month ago https://news.ycombinator.com/item?id=9202858

There are a few basic optimizations we should routinely have in compilers today, but don't always find there.

-- Multidimensional array optimizations. Basically, get to the level of subscript calculation overhead seen in FORTRAN compilers of 50 years ago.

-- Subscript checking optimization. Work on Pascal compilers in the 1980s showed that about 95% of subscript checks could be eliminated or hoisted out of inner loops without loss of safety. This was forgotten during the C era, because C is vague about array sizes. Go optimizes out checks for the simple cases; Rust should and probably will in time. Optimization goal: 2D matrix multiply and matrix inversion should have all subscript checks hoisted out of inner loops or eliminated. Compilers that don't have this feature lead to users demanding a way to turn off subscript checking. That leads to buffer overflows. (Compilers must know that it's OK to detect a subscript check early; it's OK to abort before entering a loop if there will inevitably be an subscript error at iteration 10000.)

-- Automatic inlining. If the call is expensive relative to the code being called, inline, then optimize. Ideally, this should work across module boundaries.

-- Atomic operations and locking. The compiler needs to know how to do those efficiently. Calling a subroutine just to set a lock is bad. Making a system call is worse. Atomic operations often require special instructions, so the compiler needs to know about them.

-- Common subexpression elimination for pure functions. Recognize pure functions (where x=y => f(x) = f(y) and there are no side effects) and routinely optimize. This is essential in code with lots of calls to trig functions.

> Go optimizes out checks for the simple cases; Rust should and probably will in time.

rustc uses the industrial strength LLVM optimiser, which is perfectly capable of elimatinating bounds checks: almost certainly more capable than the main Go compiler in any case.

This has been pointed out to you several times, and so your repeated assertions otherwise are now almost malicious. Maybe you could be more concrete (e.g. with an instance of a subscript check eliminated by the Go compiler but not rustc)?

"Rust essentially never wastes cycles on bounds checking thanks to the design of its iterators. The Servo team tells me that bounds checking has never shown up in any performance profiles."

Source: https://news.ycombinator.com/item?id=9392131

That's... not exactly relevant. :)

Rust's iterators definitely resolve a lot of problems one might have with bounds checking, but there's still times when one is essentially required to do explicit indexing (e.g. iterating down a column of a multidimensional array) but in such a way that the bounds checks should be removable.

My point was that, modulo bugs, rustc (more specifically, LLVM) will eliminate such bounds checks. With or without iterators.

I really need to write a matrix multiply and see what code Rust generates.

"Recognize pure functions (where x=y => f(x) = f(y) and there are no side effects) and routinely optimize. This is essential in code with lots of calls to trig functions."

It sounds simple when written, but it's less than trivial. For you as the user, the trig function is something "pure." For the compiler author, the trig function is most probably "some huge library function with a huge amounts of the alternative paths, and in any of those the "exception" can potentially happen."

If the modern compiler actually receives the info about the "purity" it will actually do the common subexpression elimination. It certainly knows how to do such stuff.

Many compilers do inline, it's one of the most common optimisations. Tracing JIT compilers in particular, by nature of tracing, can easily and do inline in the hot path, even across module boundaries.

As to multidimensional arrays, could you please point me towards a description of the optimisations they enable?

"For some reason we all (especially me) had a mental block about optimization, namely that we always regarded it as a behind-the-scenes activity, to be done in the machine language, which the programmer isn't supposed to know. This veil was first lifted from my eyes when I ran across a remark by Hoare that, ideally, a language should be designed so that an optimizing compiler can describe its optimizations in the source language. Of course!"

That sounds like he wants some sort of homoiconic assembly or machine language to target. Does such a thing even exist?

Some optimisations can be described in any language (eg dead code elimination). But something that maps 1:1 to LLVM IR would work for pretty much everything.

"Lisp is... a good assembly language because it is possible to write Lisp in a style that directly reflects the operations available on modern computers."

Peter Norvig, Paradigms of Artificial Intelligence Programming.

He actually quotes my rebuttal comment - "“Except, uh, a lot of people have applications whose profiles are mostly flat, because they’ve spent a lot of time optimizing them.”

and his response is "this view is obsolete, and to the degree it isn't, flat profiles are dying".

Oh great, that's nice, i guess i can stop worrying about the thousands of C++ applications google has built that display this property, and ignore the fact that in fact, the profiles have gotten more flat over time, not less flat.

Pack it up boys, time to go home!

Basically, he's just asserting i'm wrong, with little to no data presented, when i'm basing mine on the results of not only thousands of google programs (which i know with incredible accuracy), but the thousands of others at other companies that have found the same. I'm not aware of him poring over performance bugs for many many thousands of programs for the past 17 years. I can understand if he's done it for his open source programs (which are wonderful, BTW :P)

He then goes on to rebut other comments with simple bald assertions (like the luajit author's one) with again, no actual data.

So here's some more real data: GCC spent quite a while optimizing interpreter loops, and in fact, did a better job than "the experts" or whatever on every single one it was been handed.

So far, as far i can tell, the record is: If GCC didn't beat an expert at optimizing interpreter loops, it was because they didn't file a bug and give us code to optimize.

There have been entire projects about using compilers/jits to supplant hand-written interpreter loops.

Here's one: https://code.google.com/p/unladen-swallow/wiki/ProjectPlan

While the project was abandoned for other reasons, it produced 25+% speedups over the hand written interpreter versions of the same loop by doing nothing but using compilers.

Wake me up when this stops happening ....

He then goes on to make further assertions misundertanding compiler authors and what they do:"A compiler will not change an implementation of bubble sort to use mergesort. ... they only take responsibility for machine-specific optimization”.

This is so false i don't know where to begin. Compilers would, if they could, happily change algorithms, and plenty do. They change the time bounds of algorithms. They do in fact, replace sorts. Past that, the problem there is not compilers, but the semantics of languages often do not allow them to safely do it.

But that is usually a programming language limitation, and not a "compilers don't do this" problem.

For example, the user may be able to change the numerical stability of an algorithm, but the programming language may not allow the compiler to do so.

Additionally, it's also generally not friendly to users.

As an example: ICC will happily replace your code with Intel performance primitives where it can. It knows how to do so. These are significant algorithm changes.

But because users by and large don't want the output of ICC to depend on Intel's Math Kernel Library or anything similar, they don't usually turn it on on by default.

GCC doesn't perform quite as much here, because even things like replacing "printf" with "puts" has caused tremendous amounts of annoyed users. Imagine the complaints if it started replacing algorithms.

Past that, i'd simply suggest he hasn't looked far enough into the history of optimizing compilers, because there has been tons of work done on this. There are plenty of high level language optimizers that have been built that will completely rewrite or replace your code with rewritten algorithms, etc.

I stopped reading at page 50.

Maybe you should have continued to the end, because as far as I can tell, you are almost completely missing the point.

You are saying "look at all the wonderful things optimizing compilers can do". And he is saying "that's cool, but it doesn't really matter".

Now I am pretty sure he wold concede immediately that there are some examples where it matters, but my experience matches what he is saying very closely.

A little bit on my background: hired in 2003 by the BBC for the express purpose of making one of their systems faster, succeeded at around 100x - 1000x with simpler code[1], hired by Apple to work on their Mac OS X performance team, similar successes with Spotlight, Mail indexing, PDF, etc. Also worked with/on Squeak. Squeak runs a bytecode interpreter, with a bunch of primitives. The interpreter is perfectly adequate for the vast majority of the system. Heck, the central page imposition routine in my BookLightning PDF page imposition program[2] is written in Objective-Smalltalk[3], or more precisely an interpreter for the language that's probably the slowest computer language implementation currently in existence. Yes it's slower than Ruby, by a wide margin. And yet BookLightning runs rings around similar programs that are based on Apple's highly optimized Quartz PDF engine. And by "rings" I mean order(s) of magnitude.

Why is BookLightning so fast? Simple: it knows that it doesn't have to unpack the individual PDF pages to impose them, and is built on top of a PDF engine that allows it to do that[4]. The benefit of an optimizing compiler in this scenario? Zilch.

At the BBC, the key insight was to move from 20+ machine distributed and SQL database backed system to a single jar event logger working in-memory with a filesystem based log[5]. How would a compiler make this transformation? And after the transformation, we probably could have run interpreted byte-code and still be fast enough, though the JIT probably helped us a little bit in not having to worry about performance of the few algorithmic parts.

As to changing a bubblesort to a mergesort, you hand-wave around that, but I think the point is that this is not a safe transformation, because the author may have specifically chosen bubblesort because he likes its characteristics for the data he knows will be encountered (or cares more about that case).

When I worked at Apple, I saw the same pattern: low-level optimizations of the type you discuss generally gained a few percent here and there, and we were happy to get them in the context of machine/system wide optimizations. However, the application-specific optimizations that were possible when you consider the whole stack and opportunities for crossing or collapsing layer boundaries were a few factors x or orders of magnitude.

And here is where the "optimizing compiler" mindset actually starts to get downright harmful for performance: in order for the compiler to be allowed to do a better job, you typically need to nail down the semantics much tighter, giving less leeway for application programmers. So you make the automatic %-optimizations that don't really matter easier by reducing or eliminating opportunities for the order-of-magnitude improvements.

And yes, I totally agree that optimizations should be user-visible and controllable, rather than automagically applied by the compiler. Again, the opportunities are much bigger for the code that matters, and the stuff that is applied automatically doesn't matter for most of the code it is applied to.

[1] http://link.springer.com/chapter/10.1007%2F978-1-4614-9299-3...

[2] http://www.metaobject.com/Products/

[3] http://objective.st

[4] http://www.metaobject.com/Technology/

[5] http://martinfowler.com/bliki/EventPoster.html

Is it your position that all of OS X could run on Squeak, or gcc -O0, with performance comparable to what it does now?

Because that seems to be the logical conclusion of your position.

Yes, it may be that low-level optimizations are fighting for single-digit percentage improvements. And of course re-architecting big systems at a high level can yield much bigger gains. But you have to remember that the baseline is an entire system that's already using an optimizing compiler.

If you want to argue that optimizing compilers are dead, you'd have to show that you can remove optimizing compilers from your toolchain, and have nobody notice the difference. That's a pretty hard argument to make.

This is where my mind keeps coming back too. I keep hearing "it doesn't matter" and yet nearly every compiler I come into contact with is an optimizing compiler. Heck, even with some pretty straight up "performant" code that difference between optimizations on and off in a lot of these can be significant.


Just about everything.. I can only imagine what would happen if every process on all the systems I deal with were re-compiled with optimizations off :|

>Is it your position that all of OS X could run on Squeak, or gcc -O0, with performance comparable to what it does now? >Because that seems to be the logical conclusion of your position.

No, that's not my position, though it is likely true. In fact, in a sense this has already happened: the default optimization switch on OS X is -Os, so optimize for size, not for speed. It turns out that this is usually also OK for raw speed, but it would not matter (much) if it didn't. Memory is more important. With Squeak bytecode being so much more compact, it would probably be an overall win (think cache-pressure for code executed once or twice).

But let’s go with your assumption that we don’t set optimization flags. Much worse regressions happen all the time and people don't notice or don't care. Think virtualization, running code in browswers, the overhead of Dalvik, the introduction of atomic properties in ObjC, or ARC or Swift. How many people noticed the locale switch a couple of years ago that caused grep to become ~100x slower on OSX? How much of the web runs on PHP and Ruby or similarly comically slow tech? We’re spending Moore’s dividend[1] in lots and lots of places, having optimizations turned off in our compilers would be a small fraction of the total.

However, the point is not that you can remove optimizations and everything would simply be the same. The point is that optimizing compilers do more than is needed for most code, and not enough for the rest.

This is not really (or mostly) about what optimizers can or cannot do, it is about the nature of code and the performance requirements of that code. On today’s machines, most of the code isn’t noticeable. Some of the code is highly noticeable, and usually that is in (soft) real time situations. What’s interesting about real time is that speed does not matter. Not missing your deadline does. These are subtly but fundamentally different goals.

I learned this when I was working on pre-press (90ies), and saw newspapers switching to pre-rendered bitmap workflows, where the pages (or page fragements) are rasterized at high dpi (1270dpi is typical for newsprint) early on and then you work with that in all further process steps. This seemed utterly insane to me, because it is the most inefficient way of working possible. Little 100KB Postscript files turned into 50-100MB TIFFs. However, it absolutely makes sense for a newspaper: once a file is rendered, further processing is completely predictable. Postscript (or PDF) on the other hand is orders of magnitude more efficient in not just the best but also the average cases. However, the worst case is essentially unbounded. For a (daily) newspaper, having the pages earlier than needed is nothing, but not having them later than needed is everything, so once you have sized your workflow to handle the (significantly larger) load, you are golden.

The same goes for most desktop and mobile apps. If you’re under 100ms response time and under 16.66ms animation time, you’re good. Being faster does not help. Again, predictability is more important than performance: it is infinitely better to hit 15ms 100% of the time than 1ms 90% and 50ms 10% of the time, even though the latter is more than 2x faster.

Optimizing compilers help with speed, but they do make predicting what code will perform well (or how well) more difficult. You essentially have to have a model of the machine code and access patterns that you are targeting for the platform that you are targeting in your head along with a model of the optimizations your compiler can and does perform, and then write code that will goad your compiler into generating the machine code that you want. Phew! Though admittedly not as tough as doing the same with JITs (I remember the “optimizing for Java Hotspot” session at WWDC back when that was a thing. Much, much easier to just write the damn assembly code yourself!)

DannyBee pointed out a major reason for this: the compiler is bound by the semantics of the language, and maybe more importantly the code as written. Many times, those constraints are much tighter than necessary. The optimizer can’t do a thing because it is not allowed to touch the semantics. The programmer doesn’t know why the compiler isn’t generating the code it should. Maybe the two should talk?

There are exceptions to this, for example Google-scale operations where shaving 1% off some computation, though undetectable by users), means you can run about a million fewer servers and save the energy budget of a small country. But again, I would say that that’s an exceptional case.

[1] Spending Moore’s Dividend, http://cacm.acm.org/magazines/2009/5/24648-spending-moores-d...

"Maybe you should have continued to the end, because as far as I can tell, you are almost completely missing the point. " I if i can't find the point of a presentation in 50 pages, ...

The fie is huge because it seems that animations were converted to multiple PDF pages.

So page count is not a useful metric for presentation progression and information amount.

> He then goes on to rebut other comments with simple bald assertions (like the luajit author's one) with again, no actual data.

I assure you, the LuaJIT case is real.

Here is data about LuaJIT's interpreter (in assembly) vs. Lua 5.1's interpreter (in C):


It's true that these are completely different implementations. But Lua 5.1 is one of the fastest dynamic language interpreters already. There is not room to optimize it 1.5-5x further, like you would need to catch LuaJIT's interpreter. And as the link above shows, the LuaJIT 2.0 interpreter beats Mike's own LuaJIT 1.x JIT compiler in some cases.

Mike's post made lots of specific and concrete arguments for why it's hard for C compilers to compete. Most notably, Mike's hand-written interpreter keeps all important data in registers for all fast-paths, without spilling to the stack. My experience looking at GCC output is that it is not nearly so good at this.

Look at luaV_execute() here and tell me that GCC is really going to be able to keep the variable "pc" in a register, without spilling, in all fast paths, between iterations of the loop: http://www.lua.org/source/5.1/lvm.c.html

I don't agree with the talk's overall point, but if you are skeptical about pretty much anything Mike Pall says regarding performance, you need to look harder.

These numbers appear to be done are without any profiling data, while the hand optimized version has in fact, had profiling data guiding it (the human profiled it).

Give me numbers with profile data, and file bugs about the differences in assembly generation, and i bet it could be pretty easily fixed.

Again, we've done this before for other interpreters.

> Give me numbers with profile data

Because the interests me, I took a few minutes to try this out.

I ran this test on GCC 4.8.2-19ubuntu1, since it was the newest official release I could get my hands on without compiling my own GCC.

Here are my raw numbers (methodology below):

    LuaJIT 2.0.2 (JIT disabled): 1.675s
    Lua 5.1.5:                   5.787s (3.45x)
    Lua 5.1.5 w/FDO:             5.280s (3.15x)
    Lua 5.1.5 -O3:               6.536s (3.90x)
    Lua 5.1.5 -O3 w/FDO:         4.288s (2.56x)
For a benchmark I ran the fannkuch benchmark with N=11 (https://github.com/headius/luaj/blob/master/test/lua/perf/fa...).

My machine is a Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz.

To test LuaJIT with the JIT disabled I ran:

    $ time luajit -j off benchmark.lua
To test regular and FDO builds for Lua 5.1.5 I ran (in the "src" directory of a Lua 5.1.5 tree):

    $ make all
    $ time ./lua benchmark.lua
    $ make clean
    $ make all MYCFLAGS=-fprofile-arcs MYLIBS=-fprofile-arcs
    $ ./lua benchmark.lua
    $ make clean (note: does not delete *.gcda)
    $ make all MYCFLAGS=-fbranch-probabilities
    $ time ./lua benchmark.lua
Because Lua's Makefiles use -O2 by default, I edited the Makefile to try -O3 also.

> and file bugs about the differences in assembly generation

It would be pretty hard to file bugs that specific since the two interpreters use different byte-code.

It would be an interesting exercise to write a C interpreter for the LuaJIT bytecode. That would make it easier to file the kinds of performance bugs you were mentioning.

Thank you for taking the time to perform these tests!

One thing that people advocating FDO often forget: this is statically tuning the code for a specific use case. Which is not what you want for an interpreter that has many, many code paths and is supposed to run a wide variety of code.

You won't get a 30% FDO speedup in any practical scenario. It does little for most other benchmarks and it'll pessimize quite a few of them, for sure.

Ok, so feed it with a huge mix of benchmarks that simulate typical usage. But then the profile gets flatter and FDO becomes much less effective.

Anyway, my point still stands: a factor of 1.1x - 1.3x is doable. Fine. But we're talking about a 3x speedup for my hand-written machine vs. what the C compiler produces. And that's only a comparatively tiny speedup you get from applying domain-specific knowledge. Just ask the people writing video codecs about their opinion on C vector intrinsics sometime.

I write machine code, so you don't have to. The fact that I have to do it at all is disappointing. Especially from my perspective as a compiler writer.

But DJB is of course right: the key problem is not the compiler. We don't have a source language that's at the right level to express our domain-specific knowledge while leaving the implementation details to the compiler (or the hardware).

And I'd like to add: we probably don't have the CPU architectures that would fit that hypothetical language.

See my ramblings about preserving programmer intent, I made in the past: http://www.freelists.org/post/luajit/Ramblings-on-languages-...

> One thing that people advocating FDO often forget: this is statically tuning the code for a specific use case.

Yes, I meant to mention this but forgot. The numbers I posted are a best-case example for FDO, because the FDO is specific to the one benchmark I'm testing.

> Ok, so feed it with a huge mix of benchmarks that simulate typical usage. But then the profile gets flatter and FDO becomes much less effective.

True, though I think one clear win with FDO is helping the compiler tell fast-paths from slow-paths in each opcode. I would expect this distinction to be relatively universal regardless of workload.

The theory of fast-path/slow-path optimization would say that fast-paths are much more common than slow-paths. Otherwise optimizing fast-paths would be useless because slow-paths would dominate.

The fact that a compiler can't statically tell a fast-path from a slow-path is a big weakness. FDO does seem like it should be able to help mitigate that weakness.

You seem to be talking past DJB.

Libjpeg-turbo is 2x-4x faster than libjpeg because it's written in assembly.

Skia, ffmpeg and many other libraries that deal with graphics have hand-written assembly that provide 2x-4x speedup over what a C compiler can do (for the specific routines).

DJB has lots of experience using assembly to speed up numeric-heavy (crypto) code.

Mike Pall is actually talking real numbers when he compares his assembly lua interpreter vs. C-based interpreter.

There are plenty other examples to support DJB thesis. No amount of protesting and trying to knock his arguments on the edges will change that.

You even agree with him: "But that is usually a programming language limitation, and not a "compilers don't do this" problem.".

From practical point of view this is academic distinction whether it's the language or its compiler to blame.

After protesting so much you essentially confirm his main point: on some code a C compiler can't beat assembly written by an expert and, at least according to DJB, things are getting worse not better i.e. C compiler are getting comparatively worse at optimizing for current hardware than an expert in assembly.

"There are plenty other examples to support DJB thesis. No amount of protesting and trying to knock his arguments on the edges will change that."

I didn't knock it around the edges, i said it's flat out wrong. The fact that some code, and there are plenty of examples, has hot regions does not mean that overall, performance critical code for most people does.

As I said, we have thousands of performance critical apps, and the profilers keep getting flatter anyway. ffmpeg has not climbed the list over time, even if you disable the assembly versions, etc.

"You even agree with him: "But that is usually a programming language limitation, and not a "compilers don't do this" problem.". From practical point of view this is academic distinction whether it's the language or its compiler to blame. " It's not academic when one solution proposed is hand-coding performance critical parts that change the semantics of the program.

If you tell the compiler it can violate the semantics of the programming language, it will happily do the same thing.

ffmpeg only writes inner loops in assembly, and doesn't suffer too much. There would be speed gain in a few situations, but it's just not worth it when it would be so much harder to change the code.

But inside inner loops, you don't need to change code, and assembly can be easier to read and write than C. Have you seen Intel's C SIMD intrinsics? They're so ugly!

> C compiler are getting comparatively worse at optimizing for current hardware than an expert in assembly.

It seems to be Intel's opinion that optimizing compilers for specific desktop CPUs is not worth it. GCC and LLVM have machine-specific models for Atom CPUs, but almost nothing for desktops, even though Intel engineers work on those compilers.

They're probably right, since most people are running generic binaries.

This is Mike Pall's post: http://article.gmane.org/gmane.comp.lang.lua.general/75426

Has this been fixed in GCC in the last 4 years? Can GCC 5.1 beat Mike Pall? It would be very significant if it were true. Same thing with BLAS.

I think this conversation is incomplete without considering the economics of the Mike Palls of the world. Yes, there exist experts who can write hand-tuned assembly that dramatically outperform optimizing compilers. More importantly, a well-designed algorithm can outperform an optimized version of a poorly-designed algorithm by several orders of magnitude; far beyond what we can expect from constant folding and function inlining.

You'll note that the domains to which these experts apply themselves tend to be those which affect millions of people and are critical to broad swaths of industry: everybody needs a faster BLAS implementation, and the economic impact of faster TLS implementations measures in the hundreds of millions of dollars [1].

But there is a very long tail of people writing performance-sensitive software for domains which cannot attract the time and attention of experts but for whom GCC and the JVM, for example, deliver tremendous gains.

In many cases, you might say the metric to optimize is not battery usage or throughput, but rather 'thinking time'. If a biologist is held ransom to the attention of one of a handful of experts who is capable of hand-optimizing a novel data-processing technique, that biologist will be significantly handicapped in what they can measure or explore. An optimizing compiler that can make their proof-of-concept code run "fast enough" is quite powerful.

[1] https://people.freebsd.org/~rrs/asiabsd_2015_tls.pdf

Saying GCC can outperform experts is incorrect. Saying that every project can afford an expert that can outperform GCC is incorrect.

You can optimize for minimum CO2 emission and then brain power will be spent on problems that waste most CPU cycles.

I don't think even DJB is arguing that optimizing compilers don't bring a benefit. I think DJB says that if you want a secure hypervisor/OS/browser, you should hand code the performance critical parts (e.g. bulk encryption and authentication for TLS, handshake for TLS, video decode, screen composition) and compile the rest with CompCert. If you can add more optimizations to CompCert without affecting its correctness, that's nice. But DJB does not want to run the output of GCC -O3 instead of the output of CompCert for the cold code, and he doesn't want to run the output of GCC -O3 instead of assembly by DJB, Adam Langely, Mike Pall, Jason Garrett-Glaser, etc. for the hot code.

Let's take it piece by piece

" * Diamond-shaped control-flow is known to be the worst-case scenario for most optimizations and for register alloction. Nested diamond-shaped control-flow is even worse. "

Without more info, i can't possibly agree or disagree with this.

Diamonds are definitely not the worst case for register allocations?

" * The compiler doesn't have enough hints to see what the fast paths and what the slow paths are. Even if you'd be able to tell it, it still sees this as a single giant control-flow graph."

With profiling info, dynamic or static, this is wrong.

" Anything in this loop could potentially influence anything else, so almost nothing can be hoisted or eliminated. The slow paths kill the opportunities for the fast paths and the complex instructions kill the opportunities for the simpler instructions."

This is not accurate, First it assumes no optimizations or alias analysis is path sensitive, which is false.

Second, even in those cases where it is true, a smart compiler will simply insert runtime aliasing tests for what it can't prove. They do this already. A JIT (again, another type of compiler, just not static) would not even need this.

Let's skip his hand coding and get to what he says the advantages are:

"* Keep a fixed register assignment for all instructions.

* Keep everything in registers for the fast paths. Spill/reload only in the slow paths.

* Move the slow paths elsewhere, to help with I-Cache density."

All of these things are done by optimizing compilers, already, given profiling info.

If his statement about what he thinks the advantages are is accurate , and what we do, i'm 100% positive that GCC + FDO either already does, or could be tuned without a ton of work, to beat Mike Pall for this kind of loop.

> If his statement about what he thinks the advantages are is accurate , and what we do, i'm 100% positive that GCC + FDO either already does, or could be tuned without a ton of work, to beat Mike Pall for this kind of loop.

I love the gauntlet-throwing fire in this statement, and I only wish that it was practical to arrange for a showdown on this to actually happen. Especially since it would do nothing but improve the state of the art on both sides, and be a great case study in the capabilities and limitations of compilers.

Unfortunately I think it is impractical to actually have an apples-to-apples showdown because of the fact that the two interpreters use different byte-code. Writing a pure-C interpreter for LuaJIT would be non-trivial.

> I love the gauntlet-throwing fire in this statement, and I only wish that it was practical to arrange for a showdown on this to actually happen.

Yes, but it's already available in a simpler case, as eluded to above and in discussion on the preview of this talk. There wasn't an answer to the question about which compiler is competitive with OpenBLAS on the linpack benchmark in Fortran (dominated by the triply-nested matrix multiplication loop).

A typical computational chemistry program of the type that consumes many cycles on HPC systems might use three types of kernel in assembler for good reason, e.g. OpenBLAS, FFTW, and ELPA (scalable eigenvalues). This is unfortunate, but at least a few people doing the work can have a large benefit; many thanks to them.

I would love to see the results. I suggest contacting him either directly or through the mailing list for a representative sample of a C interpreter that can be benchmarked against luajit. http://luajit.org/contact.html

To be frank, if it wasn't interesting enough for him to ever file a bug against a compiler for, and he's already got something he likes, i don't see why i should take the time to produce results? I mean, what would the point be? He's obviously happy with what he's got (and more power to him).

Even if i did it, outside of supporting an argument on the internet, what changes?

Today people rely on sources like Mike's mailing list post to decide whether their project can be done in C or not. If GCC significantly improved since 2011, it's important that people are able to show evidence of this.

You make a distinction between restrictions placed by the semantics of the language and the compiler not performing an optimisation.

This isn't a practical distinction for the end user. In both cases, no amount of money thrown at the optimising compiler will help, and the user will have to rewrite their algorithm.

I'd be very leery of replacing arbitrary algorithms. In particular, the user can make the optimisation unsafe so completely so quickly, just by adding debug statements or whatever.

Also, the examples you gave are all of specific classes of problem, which are probably quite easy to dedicate resources to because they affect many, and because optimised assembly exists.

The second-to-last slide ends with:

"ideally, a language should be designed so that an optimizing compiler can describe its optimizations in the source language. Of course!"


"The programmer using such a system will write his beautifully-structured, but possibly inefficient, program P; then he will interactively specify transformations that make it efficient. Such a system will be much more powerful and reliable than a completely automatic one."

Which I find to be a much more agreeable conclusion than

"everybody should write in-line assembly for inner-loops"

On the first, so he wants something like stratego (or any of the other failed systems that do this and were determined not useful) :)

On the second, i'm going to just disagree.

Anything you can point me to?

For which part?

For the last part, you can find hundreds of papers on high level optimization of programming languages through compilers, without breaking a sweat

THere are entire books on it.

Here's the first one google found: https://books.google.com/books?id=u38h_fV3UqgC&pg=PA173&lpg=...

I expect you can find this stuff dating back to the early eighties without too much trouble.

His vision:

The time is clearly ripe for program-manipulation systems... The programmer using such a system will write his beautifully-structured, but possibly inefficient, program P; then he will interactively specify transformations that make it efficient.

But what if the answers the programmer gives the compiler turn out not to match reality, and some weird bug is introduced that has no representation in the source? The compiler's decisions need to be somehow spelled out in debuggable form.

There is another approach (that can also be complementary). The programmer specifies in advance various specific scenarios, and a JIT compiler guesses which of the scenarios is in effect and optimizes for that (e.g. a certain condition, like input size is always true), but adds a guard (that hopefully adds negligible overhead). If the scenario does not match reality, the JIT deoptimizes and tries another. This process itself, of course, adds overhead, but it's warmup overhead, but it's more robust. This is the approach being tried by Graal, HotSpot's experimental next-gen JIT (and Truffle, Graals complementary a programming language construction toolkit aimed at optimization): https://wiki.openjdk.java.net/display/Graal/Publications+and...

Even if DJB wrote some very effective code, now when he "goes meta" he comes somehow in the strange area of being "not even wrong." Or maybe we miss his ideas when we read the slides instead of hearing him at the talk.

People who make compilers used in the production know: if the naive users claim that "optimizing compilers don't matter" it's because the optimizing compilers are so good in doing what they're doing.

There's the argument which is here buried deep in the discussions and which I think DJB missed to address, nicely stated by haberman:

"If you want to argue that optimizing compilers are dead, you'd have to show that you can remove optimizing compilers from your toolchain, and have nobody notice the difference."

So, Mel Kaye got the last word in?


Interacting with a compiler sounds horrible. Think of the questions it might need to ask: "Hey! I could use AVX2 instructions here, after I inline a bunch of stuff and eliminate some dead code, but it requires doing a bunch of memory copying. Is that a good idea?". How would you answer a question like that?

And then, since optimizations are target dependent, you would need to go through this exercise for each target. Sounds fun.

I would imagine it would be more things like "This function would optimize so much better if I could assume that the two parameters don't alias. Should I add that restriction to the function's type signature?"

Maybe it would be more productive to make a new language more amenable to optimization rather than trying to clumsily bolt it onto C. It's not like C is a good language, anyway. I find FORTRAN more pleasant in many ways, and that's not quite the state of the art...

How you'd answer depends very much on what you're doing -- which is the point of asking. You might be able to remove the memory copying by fiddling with alignment somewhere in calling code and adding an annotation saying you did that. Bam, problem solved. Or you might not be able to do that, in which case it would still be nice to know what tradeoffs you're making.

But consider, for it to be worth asking, it's got to be something that's very difficult to infer, or the compiler wouldn't need to ask. That means it will probably be difficult for you to infer as well.

In practice, you might have to profile all possible options, which would quickly turn into a major project if you're asked more than a handful of questions.

And, of course, even if you can figure it out, that alignment annotation you just made is only optimal on the one target. Even on x86 it will depend on what instructions are available.

If you can identify and apply these decisions, why can't the compiler?

Everything about this seems to be saying, "We're stagnant with our current compiler/optimization paradigm...screw it."

I'd be happy if C and C++ had 2,3,and 4 element vectors as built in types, along with cross and dot product operations. There are intrinsics, and GCC has it's own intrinsics that can be used across architectures. But the languages need to have these. They are so fundamental to so many things.

There are many more things to wish for, but I'm starting with one of the simplest.

Why would you need this in the language? It seems to me that these things can be provided by a library without any loss.

Why would you need this in the language? It seems to me that these things can be provided by a library without any loss.

No loss in function. C++ will certainly let you create classes, but you need to be extremely good to make them low overhead. You also can't pass them by value (the intrinsics can). And lastly, classes don't exist in C.

Vector math is so common it should have direct support and operators.

Well, to be fair, there aren't good C++ operators to use for dot and cross product. (Funny, looking at my own C++ vector op file, I see in addition to the usual +, -, *, / I've added function definitions Dot, Normalize, Cross, and Length. I don't think of Normalize and Length as operators, but it might make sense if they were...)

Here's a compiler which uses program synthesis to target a mesh network type architecture: http://pl.eecs.berkeley.edu/projects/chlorophyll/ . It uses a guided process like is implied in the last slides.

Compiler optimization is obviously a sleeping field currently. There's nothing that's made it into practical compilers to address the bottlenecks shifting from ALU work to data storage and layout considerations.

Consider all the gnashing of teeth and wringing of hands that goes on in C++ circles about inefficient data layout & representation by inferior programmers, and the stories of victorious manual data layout refactorings by performance heroes.

DJB's slides don't address the data side because he only does crypto, and that's one of the fields where the ALU twiddling is still relevant. But crypto is also rarely a bottleneck.

The original abstract for the talk (which summarizes the slides) were posted by the author here:


This is very good. I've been working towards something similar to what DJB is talking about. (Nice to know I'm in good company. :)

In a nutshell, although automated systems will be good (are already and getting better) there will always be aspects that require humans in the loop (unless and until the machines actually become sentient, defined in this context as gaining that je ne sais quoi that humans do seem to have.)

The cases where optimizing compilers aren't good enough are where Java Hotspot Compiler and similar techniques really shines. Combined with novel superoptimization techniques, hotspot optimization could far outperform hand-optimization (although AFAIK that hasn't happened in practice yet).

I completely agree, although hotspot is really hampered by the fact that objects are allocated willy nilly on the heap instead of together for pipeline efficiency. So while hotspot (and graal) generate lovely assembly the lack of data locality kills a lot of possible performance. Hoping objectlayout.org changes that!

Yet I think that hotspot and intrinsics are a nice case study of showing why optimising compilers are not dead. Even for performance critical code. https://news.ycombinator.com/item?id=9368137 discusses in part how intrinsics (hand optimised) at some point get beaten by the optimiser (simd/superword) and then hand-optimised again to again beat the optimiser. Mostly, because machines change.

A whole problem with static binaries as produced by C and GO compilers is that they assume machines are static. Which leads to lowest common denominator optimiser settings :( When the optimisers are humans this gets even worse. Then you end up with optimisations in your C code that was a good idea 15 years ago, but that make no use of SIMD today (or to short SIMD e.g. SSE2 when AVX-512 is available).

Of course real optimisations happen, not by doing the same thing faster but by doing a faster thing. Take for example the HMMER (https://news.ycombinator.com/item?id=9368137) family of algorithms. HMMER2 is part of SPECCPU, and the compiler guys doubled the speed of this algorithm, in about 5 years. Then the bioinformaticians redid it as HMMER3 which is quite different in how it globally works and gets 100x speed up in practice.

You might find this interesting (it's what I'm talking about when I said superoptimization).


So just for a start: Which optimizing compiler actually properly solves the optimization problems? I know of none.

When I look at the list of optimization solvers for constrained linear or non-linear programming models (i.e. https://en.wikipedia.org/wiki/List_of_optimization_software) and the list of compilers the intersection is still zero.

All optimizers are still using their own ad-hoc versions of oversimplified solvers, which never fully explore their problem space. Current optimizing compilers are still just toys without a real-world solver. And it's clear why so. Current optimizable languages are still just toys without a real world solvable optimization goal.

You can think of strictly typed, sound declarative languages where solvers would make sense, or you can think of better languages, like fortress or functional languages, which are not encumbered by not-properly optimizable side-effects and aliasing which harm most modern language designs.

> or you can think of better languages, like fortress or functional languages, which are not encumbered by not-properly optimizable side-effects and aliasing which harm most modern language designs.

Remember, if you ban aliasing then you can't express any problems requiring aliasing! That kind of sucks for writing programs that are fast in the first place.

I've never had to use a language with really no aliasing (Fortran?), but many languages forbid possibly overlapping arrays, because of the pain interior pointers make for GC.

If you were ever to write an image decoder in Java, you'd see some steps involve reading and writing from different parts of the same image (usually called intra prediction), but now you have to pass all these extra coordinates along with the image array. And the optimizer is not good enough to make up for all the extra work it was just given.

Isn't that why people advocates C ? Isn't C just that type of language you can tell the compiler how to optimize ?

C might not give very explicit information on how to optimize, but isn't it simple and bare enough to let the compiler do a better job ?

What a waste of time.

Yes, specialists can squeeze the last performance improvements in ASM compared to C. Doesn't mean that -O2/-O3, auto-vectorizing can't do a nice job and get to 90% of that

Optimizing DO matter. Just compare -O0 and -O1. Really. It's not because CPUs are fast that people shouldn't do that and compilers shouldn't optimize a bare minimum.

It's even better that the compiler do that because ASM optimizing by hand is very error prone.

And compilers get better every day. Just look at LLVM.

Room for optimizing compilers = distance between programming languages and CPU instructions/microarchitecture

He is soooo spot on! This "dialogue with the compiler" is going to be really big in the next decades, but it's in no way a death of automatic optimization, it's just the beginning of it.

Here's a simple example how I expect it to work: You write a code that uses a list-like data structure. The compiler then instruments the code and you run some tests. The tests will then be evaluated and the (post?)compiler selects what kind of data structure is to be used (let's make example simple, choices are array vs. list). For instance, if you need to look up elements a lot (based on the evidence from testing), an array will be chosen as the underlying data structure.

And you actually get (if you want to see it, normally this information will be hidden) a little box in the IDE, where the variable is used, that tells you: "Here, an array will be used." You can then with one click say: "I don't want array, make it a list." So for all the possible optimizations, there are two viewpoints presented: The viewpoint of the compiler (based-on evidence from the tests or static analysis) and the viewpoint of the programmer (which allows for confirmation or override in case there are some unknown assumptions).

And if the specifications change (say, we chose list earlier but now we have actually a lot of direct access to elements), you can just recompile the same code with the previously-agreed compiler choices removed! And without changing any line of code, a different and more fitting data structure will be used.

You can easily see this can apply to many things, not just data structures. You can also see that different ways how the dialogue can be implemented are possible, once we syntactically separate the "what" from the "how" in the programming language. In the future, I believe, we will program just with abstract data types, and the concrete type specification will be selected based on the evidence from the running program (or static analysis augmented with that information). So the dialogue will not happen just with the programmer, but the compiler will also observe the real world behavior of the program and facilitate adaptation to that.

In this way, it's even possible to input assumptions that doesn't have to be provably correct. This approach can potentially bridge the static vs dynamic types divide, and others.

Finally, Haskell and functional languages are very nice, but I don't think they are final word in programming. If we wanted the above, they have syntactical problems such as mixing of concrete and abstract types (type classes). Also, there are limits to static analysis in the real world. The future will be lot more interesting.

The birth of an optimizing assembler.

If you're curious, here is an optimizing assembler.


I would prefer death to the font-size: 2^1000px

Trying to read this gave me cancer

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact