and his response is "this view is obsolete, and to the degree it isn't, flat profiles are dying".
Oh great, that's nice, i guess i can stop worrying about the thousands of C++ applications google has built that display this property, and ignore the fact that the profiles have actually gotten flatter over time, not less flat.
Pack it up boys, time to go home!
Basically, he's just asserting i'm wrong, with little to no data presented, while i'm basing my view on the results of not only thousands of google programs (which i know with incredible accuracy), but the thousands of programs at other companies that have shown the same thing. I'm not aware of him poring over performance bugs for many many thousands of programs for the past 17 years.
I can understand if he's done it for his open source programs (which are wonderful, BTW :P)
He then goes on to rebut other comments with simple bald assertions (like the luajit author's one) with again, no actual data.
So here's some more real data: GCC spent quite a while optimizing interpreter loops, and in fact did a better job than "the experts" or whatever on every single one it was handed.
So far, as far as i can tell, the record is: if GCC didn't beat an expert at optimizing an interpreter loop, it was because they didn't file a bug and give us code to optimize.
There have been entire projects about using compilers/jits to supplant hand-written interpreter loops.
Here's one: https://code.google.com/p/unladen-swallow/wiki/ProjectPlan
While the project was abandoned for other reasons, it produced 25+% speedups over the hand written interpreter versions of the same loop by doing nothing but using compilers.
Wake me up when this stops happening ....
He then goes on to make further assertions misunderstanding compiler authors and what they do: "A compiler will not change an implementation of bubble sort to use mergesort. ... they only take responsibility for machine-specific optimization".
This is so false i don't know where to begin. Compilers would, if they could, happily change algorithms, and plenty do.
They change the time bounds of algorithms. They do, in fact, replace sorts.
Past that, the problem is often not compilers, but that the semantics of languages do not allow them to do it safely. But that is usually a programming language limitation, and not a "compilers don't do this" problem.
For example, the user may be able to change the numerical stability of an algorithm, but the programming language may not allow the compiler to do so.
Additionally, it's also generally not friendly to users.
As an example: ICC will happily replace your code with Intel performance primitives where it can. It knows how to do so. These are significant algorithm changes.
But because users by and large don't want the output of ICC to depend on Intel's Math Kernel Library or anything similar, they don't usually turn it on by default.
GCC doesn't go as far here, because even things like replacing "printf" with "puts" have caused tremendous amounts of annoyed users. Imagine the complaints if it started replacing algorithms.
Past that, i'd simply suggest he hasn't looked far enough into the history of optimizing compilers, because tons of work has been done on this. Plenty of high-level language optimizers have been built that will completely rewrite your code or replace your algorithms, etc.
I stopped reading at page 50.
You are saying "look at all the wonderful things optimizing compilers can do". And he is saying "that's cool, but it doesn't really matter".
Now I am pretty sure he would concede immediately that there are some examples where it matters, but my experience matches what he is saying very closely.
A little bit on my background: hired in 2003 by the BBC for the express purpose of making one of their systems faster, succeeded at around 100x - 1000x with simpler code, hired by Apple to work on their Mac OS X performance team, similar successes with Spotlight, Mail indexing, PDF, etc. Also worked with/on Squeak.
Squeak runs a bytecode interpreter, with a bunch of primitives. The interpreter is perfectly adequate for the vast majority of the system. Heck, the central page imposition routine in my BookLightning PDF page imposition program is written in Objective-Smalltalk, or more precisely an interpreter for the language that's probably the slowest computer language implementation currently in existence. Yes it's slower than Ruby, by a wide margin. And yet BookLightning runs rings around similar programs that are based on Apple's highly optimized Quartz PDF engine. And by "rings" I mean order(s) of magnitude.
Why is BookLightning so fast? Simple: it knows that it doesn't have to unpack the individual PDF pages to impose them, and is built on top of a PDF engine that allows it to do that. The benefit of an optimizing compiler in this scenario? Zilch.
At the BBC, the key insight was to move from 20+ machine distributed and SQL database backed system to a single jar event logger working in-memory with a filesystem based log. How would a compiler make this transformation? And after the transformation, we probably could have run interpreted byte-code and still be fast enough, though the JIT probably helped us a little bit in not having to worry about performance of the few algorithmic parts.
As to changing a bubblesort to a mergesort, you hand-wave around that, but I think the point is that this is not a safe transformation, because the author may have specifically chosen bubblesort because he likes its characteristics for the data he knows will be encountered (or cares more about that case).
When I worked at Apple, I saw the same pattern: low-level optimizations of the type you discuss generally gained a few percent here and there, and we were happy to get them in the context of machine/system wide optimizations. However, the application-specific optimizations that were possible when you consider the whole stack and opportunities for crossing or collapsing layer boundaries were a few factors x or orders of magnitude.
And here is where the "optimizing compiler" mindset actually starts to get downright harmful for performance: in order for the compiler to be allowed to do a better job, you typically need to nail down the semantics much tighter, giving less leeway for application programmers. So you make the automatic %-optimizations that don't really matter easier by reducing or eliminating opportunities for the order-of-magnitude improvements.
And yes, I totally agree that optimizations should be user-visible and controllable, rather than automagically applied by the compiler. Again, the opportunities are much bigger for the code that matters, and the stuff that is applied automatically doesn't matter for most of the code it is applied to.
Because that seems to be the logical conclusion of your position.
Yes, it may be that low-level optimizations are fighting for single-digit percentage improvements. And of course re-architecting big systems at a high level can yield much bigger gains. But you have to remember that the baseline is an entire system that's already using an optimizing compiler.
If you want to argue that optimizing compilers are dead, you'd have to show that you can remove optimizing compilers from your toolchain, and have nobody notice the difference. That's a pretty hard argument to make.
Just about everything... I can only imagine what would happen if every process on all the systems I deal with were re-compiled with optimizations off :|
No, that's not my position, though it is likely true. In fact, in a sense this has already happened: the default optimization switch on OS X is -Os, so optimize for size, not for speed. It turns out that this is usually also OK for raw speed, but it would not matter (much) if it didn't. Memory is more important. With Squeak bytecode being so much more compact, it would probably be an overall win (think cache-pressure for code executed once or twice).
But let’s go with your assumption that we don’t set optimization flags. Much worse regressions happen all the time and people don't notice or don't care. Think virtualization, running code in browsers, the overhead of Dalvik, the introduction of atomic properties in ObjC, or ARC or Swift. How many people noticed the locale switch a couple of years ago that caused grep to become ~100x slower on OSX? How much of the web runs on PHP and Ruby or similarly comically slow tech? We’re spending Moore’s dividend in lots and lots of places; having optimizations turned off in our compilers would be a small fraction of the total.
However, the point is not that you can remove optimizations and everything would simply be the same. The point is that optimizing compilers do more than is needed for most code, and not enough for the rest.
This is not really (or mostly) about what optimizers can or cannot do, it is about the nature of code and the performance requirements of that code. On today’s machines, most of the code isn’t noticeable. Some of the code is highly noticeable, and usually that is in (soft) real time situations. What’s interesting about real time is that speed does not matter. Not missing your deadline does. These are subtly but fundamentally different goals.
I learned this when I was working on pre-press in the '90s, and saw newspapers switching to pre-rendered bitmap workflows, where the pages (or page fragments) are rasterized at high dpi (1270dpi is typical for newsprint) early on, and then you work with that in all further process steps. This seemed utterly insane to me, because it is the most inefficient way of working possible. Little 100KB PostScript files turned into 50-100MB TIFFs. However, it absolutely makes sense for a newspaper: once a file is rendered, further processing is completely predictable. PostScript (or PDF), on the other hand, is orders of magnitude more efficient in not just the best but also the average case. However, the worst case is essentially unbounded. For a (daily) newspaper, having the pages earlier than needed is nothing, but not having them later than needed is everything, so once you have sized your workflow to handle the (significantly larger) load, you are golden.
The same goes for most desktop and mobile apps. If you’re under 100ms response time and under 16.66ms animation time, you’re good. Being faster does not help. Again, predictability is more important than performance: it is infinitely better to hit 15ms 100% of the time than 1ms 90% and 50ms 10% of the time, even though the latter is more than 2x faster.
Optimizing compilers help with speed, but they do make predicting what code will perform well (or how well) more difficult. You essentially have to have a model of the machine code and access patterns that you are targeting for the platform that you are targeting in your head along with a model of the optimizations your compiler can and does perform, and then write code that will goad your compiler into generating the machine code that you want. Phew! Though admittedly not as tough as doing the same with JITs (I remember the “optimizing for Java Hotspot” session at WWDC back when that was a thing. Much, much easier to just write the damn assembly code yourself!)
DannyBee pointed out a major reason for this: the compiler is bound by the semantics of the language, and maybe more importantly the code as written. Many times, those constraints are much tighter than necessary. The optimizer can’t do a thing because it is not allowed to touch the semantics. The programmer doesn’t know why the compiler isn’t generating the code it should. Maybe the two should talk?
There are exceptions to this, for example Google-scale operations, where shaving 1% off some computation (though undetectable by users) means you can run about a million fewer servers and save the energy budget of a small country. But again, I would say that that’s an exceptional case.
 Spending Moore’s Dividend, http://cacm.acm.org/magazines/2009/5/24648-spending-moores-d...
So page count is not a useful metric for presentation progression and information amount.
I assure you, the LuaJIT case is real.
Here is data about LuaJIT's interpreter (in assembly) vs. Lua 5.1's interpreter (in C):
It's true that these are completely different implementations. But Lua 5.1 is already one of the fastest dynamic language interpreters. There is not room to optimize it a further 1.5-5x, as you would need to catch LuaJIT's interpreter. And as the link above shows, the LuaJIT 2.0 interpreter beats Mike's own LuaJIT 1.x JIT compiler in some cases.
Mike's post made lots of specific and concrete arguments for why it's hard for C compilers to compete. Most notably, Mike's hand-written interpreter keeps all important data in registers for all fast-paths, without spilling to the stack. My experience looking at GCC output is that it is not nearly so good at this.
Look at luaV_execute() here and tell me that GCC is really going to be able to keep the variable "pc" in a register, without spilling, in all fast paths, between iterations of the loop: http://www.lua.org/source/5.1/lvm.c.html
I don't agree with the talk's overall point, but if you are skeptical about pretty much anything Mike Pall says regarding performance, you need to look harder.
Give me numbers with profile data, and file bugs about the differences in assembly generation, and i bet it could be pretty easily fixed.
Again, we've done this before for other interpreters.
Because this interests me, I took a few minutes to try it out.
I ran this test on GCC 4.8.2-19ubuntu1, since it was the newest official release I could get my hands on without compiling my own GCC.
Here are my raw numbers (methodology below):
LuaJIT 2.0.2 (JIT disabled): 1.675s
Lua 5.1.5: 5.787s (3.45x)
Lua 5.1.5 w/FDO: 5.280s (3.15x)
Lua 5.1.5 -O3: 6.536s (3.90x)
Lua 5.1.5 -O3 w/FDO: 4.288s (2.56x)
My machine is a Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz.
To test LuaJIT with the JIT disabled I ran:
$ time luajit -j off benchmark.lua
To test Lua 5.1 I ran:
$ make all
$ time ./lua benchmark.lua
To test Lua 5.1 with FDO (profile the benchmark run, then rebuild using that profile) I ran:
$ make clean
$ make all MYCFLAGS=-fprofile-arcs MYLIBS=-fprofile-arcs
$ ./lua benchmark.lua
$ make clean (note: does not delete *.gcda)
$ make all MYCFLAGS=-fbranch-probabilities
$ time ./lua benchmark.lua
> and file bugs about the differences in assembly generation
It would be pretty hard to file bugs that specific since the two interpreters use different byte-code.
It would be an interesting exercise to write a C interpreter for the LuaJIT bytecode. That would make it easier to file the kinds of performance bugs you were mentioning.
One thing that people advocating FDO often forget: this is statically tuning the code for a specific use case. Which is not what you want for an interpreter that has many, many code paths and is supposed to run a wide variety of code.
You won't get a 30% FDO speedup in any practical scenario. It does little for most other benchmarks and it'll pessimize quite a few of them, for sure.
Ok, so feed it with a huge mix of benchmarks that simulate typical usage. But then the profile gets flatter and FDO becomes much less effective.
Anyway, my point still stands: a factor of 1.1x - 1.3x is doable. Fine. But we're talking about a 3x speedup for my hand-written machine code vs. what the C compiler produces. And that's only a comparatively tiny speedup you get from applying domain-specific knowledge. Just ask the people writing video codecs about their opinion on C vector intrinsics sometime.
I write machine code, so you don't have to. The fact that I have to do it at all is disappointing. Especially from my perspective as a compiler writer.
But DJB is of course right: the key problem is not the compiler. We don't have a source language that's at the right level to express our domain-specific knowledge while leaving the implementation details to the compiler (or the hardware).
And I'd like to add: we probably don't have the CPU architectures that would fit that hypothetical language.
See my ramblings about preserving programmer intent, I made in the past: http://www.freelists.org/post/luajit/Ramblings-on-languages-...
Yes, I meant to mention this but forgot. The numbers I posted are a best-case example for FDO, because the FDO is specific to the one benchmark I'm testing.
> Ok, so feed it with a huge mix of benchmarks that simulate typical usage. But then the profile gets flatter and FDO becomes much less effective.
True, though I think one clear win with FDO is helping the compiler tell fast-paths from slow-paths in each opcode. I would expect this distinction to be relatively universal regardless of workload.
The theory of fast-path/slow-path optimization would say that fast-paths are much more common than slow-paths. Otherwise optimizing fast-paths would be useless because slow-paths would dominate.
The fact that a compiler can't statically tell a fast-path from a slow-path is a big weakness. FDO does seem like it should be able to help mitigate that weakness.
Libjpeg-turbo is 2x-4x faster than libjpeg because its core routines are written in SIMD assembly.
Skia, ffmpeg and many other libraries that deal with graphics have hand-written assembly that provide 2x-4x speedup over what a C compiler can do (for the specific routines).
DJB has lots of experience using assembly to speed up numeric-heavy (crypto) code.
Mike Pall is actually talking real numbers when he compares his assembly lua interpreter vs. C-based interpreter.
There are plenty of other examples to support DJB's thesis. No amount of protesting and trying to knock his arguments around the edges will change that.
You even agree with him: "But that is usually a programming language limitation, and not a "compilers don't do this" problem.".
From a practical point of view, it is an academic distinction whether the language or its compiler is to blame.
After protesting so much you essentially confirm his main point: on some code a C compiler can't beat assembly written by an expert and, at least according to DJB, things are getting worse not better, i.e. C compilers are getting comparatively worse at optimizing for current hardware than an expert in assembly.
I didn't knock it around the edges, i said it's flat out wrong.
The fact that some code (and there are plenty of examples) has hot regions does not mean that, overall, most people's performance-critical code does.
As I said, we have thousands of performance critical apps, and the profiles keep getting flatter anyway. ffmpeg has not climbed the list over time, even if you disable the assembly versions, etc.
> You even agree with him: "But that is usually a programming language limitation, and not a "compilers don't do this" problem."
> From a practical point of view, it is an academic distinction whether the language or its compiler is to blame.
It's not academic when one proposed solution is hand-coding performance-critical parts in ways that change the semantics of the program.
If you tell the compiler it can violate the semantics of the programming language, it will happily do the same thing.
But inside inner loops, you don't need to change code, and assembly can be easier to read and write than C. Have you seen Intel's C SIMD intrinsics? They're so ugly!
> C compiler are getting comparatively worse at optimizing for current hardware than an expert in assembly.
It seems to be Intel's opinion that optimizing compilers for specific desktop CPUs is not worth it. GCC and LLVM have machine-specific models for Atom CPUs, but almost nothing for desktops, even though Intel engineers work on those compilers.
They're probably right, since most people are running generic binaries.
Has this been fixed in GCC in the last 4 years? Can GCC 5.1 beat Mike Pall? It would be very significant if it were true. Same thing with BLAS.
You'll note that the domains to which these experts apply themselves tend to be those which affect millions of people and are critical to broad swaths of industry: everybody needs a faster BLAS implementation, and the economic impact of faster TLS implementations measures in the hundreds of millions of dollars.
But there is a very long tail of people writing performance-sensitive software for domains which cannot attract the time and attention of experts but for whom GCC and the JVM, for example, deliver tremendous gains.
In many cases, you might say the metric to optimize is not battery usage or throughput, but rather 'thinking time'. If a biologist is held ransom to the attention of one of a handful of experts who is capable of hand-optimizing a novel data-processing technique, that biologist will be significantly handicapped in what they can measure or explore. An optimizing compiler that can make their proof-of-concept code run "fast enough" is quite powerful.
If you optimize for minimum CO2 emissions, then brain power will be spent on the problems that waste the most CPU cycles.
I don't think even DJB is arguing that optimizing compilers don't bring a benefit. I think DJB says that if you want a secure hypervisor/OS/browser, you should hand code the performance critical parts (e.g. bulk encryption and authentication for TLS, handshake for TLS, video decode, screen composition) and compile the rest with CompCert. If you can add more optimizations to CompCert without affecting its correctness, that's nice. But DJB does not want to run the output of GCC -O3 instead of the output of CompCert for the cold code, and he doesn't want to run the output of GCC -O3 instead of assembly by DJB, Adam Langley, Mike Pall, Jason Garrett-Glaser, etc. for the hot code.
"* Diamond-shaped control-flow is known to be the worst-case scenario for most optimizations and for register allocation. Nested diamond-shaped control-flow is even worse."
Without more info, i can't possibly agree or disagree with this.
Diamonds are definitely not the worst case for register allocation.
"* The compiler doesn't have enough hints to see what the fast paths and what the slow paths are. Even if you'd be able to tell it, it still sees this as a single giant control-flow graph."
With profiling info, dynamic or static, this is wrong.
"Anything in this loop could potentially influence anything else, so almost nothing can be hoisted or eliminated. The slow paths kill the opportunities for the fast paths and the complex instructions kill the opportunities for the simpler instructions."
This is not accurate. First, it assumes that no optimization or alias analysis is path-sensitive, which is false.
Second, even in those cases where it is true, a smart compiler will simply insert runtime aliasing tests for what it can't prove. They do this already.
A JIT (again, another type of compiler, just not static) would not even need this.
Let's skip his hand coding and get to what he says the advantages are:
"* Keep a fixed register assignment for all instructions.
* Keep everything in registers for the fast paths. Spill/reload
only in the slow paths.
* Move the slow paths elsewhere, to help with I-Cache density."
All of these things are done by optimizing compilers, already, given profiling info.
If his statement about what he thinks the advantages are is accurate, then given what we already do, i'm 100% positive that GCC + FDO either already does this, or could be tuned without a ton of work to beat Mike Pall for this kind of loop.
I love the gauntlet-throwing fire in this statement, and I only wish that it was practical to arrange for a showdown on this to actually happen. Especially since it would do nothing but improve the state of the art on both sides, and be a great case study in the capabilities and limitations of compilers.
Unfortunately I think it is impractical to actually have an apples-to-apples showdown because of the fact that the two interpreters use different byte-code. Writing a pure-C interpreter for LuaJIT would be non-trivial.
Yes, but it's already available in a simpler case, as alluded to above and in discussion on the preview of this talk. There wasn't an answer to the question about which compiler is competitive with OpenBLAS on the linpack benchmark in Fortran (dominated by the triply-nested matrix multiplication loop).
A typical computational chemistry program of the type that consumes many cycles on HPC systems might use three types of kernel in assembler for good reason, e.g. OpenBLAS, FFTW, and ELPA (scalable eigenvalues). This is unfortunate, but at least a few people doing the work can have a large benefit; many thanks to them.
Even if i did it, outside of supporting an argument on the internet, what changes?
This isn't a practical distinction for the end user. In both cases, no amount of money thrown at the optimising compiler will help, and the user will have to rewrite their algorithm.
I'd be very leery of replacing arbitrary algorithms. In particular, the user can make the optimisation unsafe completely and almost instantly, just by adding debug statements or whatever.
Also, the examples you gave are all specific classes of problem, which are probably quite easy to dedicate resources to because they affect many users, and because optimised assembly already exists.
"ideally, a language should be designed so that an optimizing compiler can describe its optimizations in the source language. Of course!"
"The programmer using such a system will write his beautifully-structured, but possibly inefficient, program P; then he will interactively specify transformations that make it efficient. Such a system will be much more powerful and reliable than a completely automatic one."
Which I find to be a much more agreeable conclusion than
"everybody should write in-line assembly for inner-loops"
On the second, i'm going to just disagree.
For the last part, you can find hundreds of papers on high-level optimization of programming languages through compilers without breaking a sweat.
There are entire books on it.
Here's the first one google found:
I expect you can find this stuff dating back to the early eighties without too much trouble.