Turns out that the main reason the vaunted "overkilled" solution was so abysmally slow was that the author happily used BSF and BTC in the hottest part of the loop... which are actually rather slow instructions, particularly when you're using them to control a branch (compare-and-jump is a fused µop in practice, but BTC-and-jump is not).
The point of this tail is that if you want to absolutely wring the last clock cycle out of a hot path, you usually need good microarchitectural knowledge about which operations are going to be faster and which are not. Sure, you can beat a compiler with hand-written, hand-optimized assembly code most of the time--but the people who have the skills to write such code are going to be the people working on the compilers.
The tools for optimizing compilers are getting better, and probably faster than we are capable of pumping out performance engineers to hand-craft the inner loops. In the past decade, we've seen polyhedral loop transformations become production-quality. Auto-vectorization is getting better, particularly when user-directed (think #pragma openmp simd); I know Intel has been pushing "outer-loop vectorization" very hard in the past few years. The other big fruit on the horizon is superoptimizers: I suspect we'll see superoptimizers shipping in production compilers within a decade or two.
Depends on the hardware you're running on. They used to be very fast on Intel machines. (There's an argument here that C survives architecture changes better than the ASM alternative, but BSF/BSR don't have C equivalents that the compiler can translate if appropriate, AFAIK. The best generic, widely fast implementation I know of uses multiply-shift with de Bruijn sequences and isn't automatically translated for obvious reasons)
If I can find a better algorithm I can of course beat the compiler on all of the above. However, does such an algorithm exist?
For the other 99% of code it's hard (and definitely not worthwhile to try) to beat the compiler though. I sometimes watch programming streams where people go very deep into the compiler-generated code to analyze runtime behavior of their code, and it's almost scary to see the kinds of things compilers can do these days.
> worthwhile to try) to beat the compiler though.
Bernstein's thesis is that the other 99% of the code is so cold that it's not really worth running the optimizer, and my experience (and evidence from industry) suggests that he is right about that. In fact, you might be better off with an interpreter and a really compact byte-code.
Examples: Tcl used for "steering" supercomputer computations, Squeak Smalltalk doing fine for multi-media work with a simple bytecode interpreter + primitives, Objective-C, Ruby on Rails and Sinatra for a huge swathe of web-sites, the (incorrect) perception of Swift as "fast", even in Debug mode (without optimizations), Python/Numpy, etc.
I don't think such numbers can be meaningfully presented as representative data. They are probably true for certain specific application domains such as crypto or signal processing, if you squint at the data just the right way (i.e., by benchmarking a crypto library with a tiny driver program, not as part of a real application).
I did some profiling on the SPEC CPU 2000 benchmarks some years ago. I don't have the exact numbers lying around. But I recall large variation among the different benchmarks, with anything from 90% of execution time in the top function to the top 60% of execution time spread evenly (i.e., 20% each) over the top three functions. So some things certainly follow the patterns above, and others... don't.
I've also several times optimized programs until their profile was so flat that you could say that 100% of the code was lukewarm.
The thing at the link above easily does almost everything you described as unattainable for a compiler, and does it automatically. AFAIK, it even allows your code to be MPI-parallelized, with optimal communication.
Not all of these videos are equally interesting, but there are some real gems in there, e.g. where Casey (of HH) hand-optimizes his bilinear bitmap interpolation code (somewhere between streams ~115 and ~120 if I remember correctly).
I cringe a lot at their coding style and the complete disregard for architecture and basic code hygiene they appear to have, but there's still a lot of interesting things to learn from watching these streams.
Could starting with compiler output for a given problem be a good idea for new programmers?
For example, I might take a complex problem I want to solve and then implement a series of solutions in multiple high-level(-ish) languages like C, FORTRAN, or even possibly Haskell. Once the implementations are correct, I get all languages to output assembly, and then spend a while poring over the output, possibly doing some further iteration on the high-level source code in the process.
My thinking here is that, the compiler is basically a gigantic database+graph of "use this instruction in this scenario" and a crazy complicated state machine to direct operations throughout the graph. Way too complex to read through... but I can make a compiler tell me the 90th-percentile+-most-relevant instructions to use by simply writing my program in that compiler's language.
Obviously, the standard caveats about compilers producing terrible output apply - I'd never recommend treating compiler output as normative or "good for educational use", just a decent starting point - and something to have no qualms about surpassing.
My question was along the lines of "is this a sane default or am I wandering into a minefield," your answer is pretty much what I was looking for.
> Some of the current optimizations are so advanced and specific to each particular bit of code that it might not be good to generalize from that output.
If only it were possible for the compiler to concisely describe its "workings-out". I imagine compilers can generate trace information, but it's probably pages long.
> I remember ICC generating different (but similar) code for each Intel CPU generation. I suppose it has to do with the decoder limitation (4-1-1-1, 4-2, 2-2, etc.) and exhausting execution port usage...
That's amazing. Definitely keeping that in mind, thanks!
- Calculating skeleton animations (updating bone positions, sometimes skinning vertices too)
- Clipping geometry in the scene (finding out what things are inside/outside the camera frustum, etc.)
- Processing game logic, things like AI can be quite costly, so much is game dependent
- Walking through all the geometry that needs drawing and issuing draw calls
- Decompressing streaming audio and sending it to a sound driver buffer/queue
- Stepping the physics world (integrating positions/rotations and resolving intersections, etc.)
The difference between a non-optimized and an optimized build is often the difference between 5 FPS and 60 FPS, and optimizing a single hot file or function would not get the game running anywhere near 60 FPS. I think the idea that optimizing compilers aren't required is completely laughable, but then again I only have one perspective from the games development scene - maybe someone else will reply and say they make AAA games in C/C++ and don't need compiler optimizations :)
In other words, if you rewrite the algorithms from object oriented into stream processing, you would get most of the gains.
This is because CPUs, FPUs and GPUs are all, internally, simple numerical stream machines.
This is an example of a provably correct optimization, and it's a shame that more compilers do not optimize this situation. Wouldn't it be awesome to model the data the way that makes sense for your domain and let the compiler worry about cache locality?
This is possible in C99 and C++11 because type punning is forbidden in general. But that would need a special flag to specify incompatible internal ABI.
I love the way Dynamo is able to reproduce many of the benefits with a fraction of the complexity by doing some of the optimizations at runtime with simpler algorithms. Can we use this approach to "garbage collect" some of the complexity embodied in humongous projects like LLVM?
 Dynamo: https://people.cs.umass.edu/~emery/classes/cmpsci691s-fall20...
Note also that I don't believe that Dynamo "reproduces many of the benefits [...]": the paper you linked benchmarked Dynamo running on top of the -O2-generated code of a ~20-year-old compiler. It isn't clear whether it reaped "low-hanging fruit" that static compilers are catching nowadays. Running the same thing on top of a PGO+LTO build would also be a fairer comparison.
Motivating example: The CPU can predict whether branch instructions will be taken with uncanny accuracy. This is achieved using simple dynamic heuristics. I believe it would be much more challenging for the compiler to predict these branches statically.
Why do you believe so? "Branch prediction for free" is a compiler literature classic, and was published in PLDI 1993.
I would be interested in a comparison of modern static vs dynamic predictors. Do you know of one?
The misprediction rates that I see in practice are much lower than the ~20% that they are talking about in this paper. For example, gcc compiling sqlite3 misses only ~3.8% of branches on my Haswell processor (see below.)
I attribute this improvement to dynamic predictors using more information like recent history, return stack context, loop counter value, computed branch address, etc. This information seems beneficial and also challenging for a static predictor to use.
$ perf stat -e branches,branch-misses gcc sqlite3.c
463,727,878 branch-misses:u # 3.80% of all branches
That again. People jump right to static versus dynamic. It's best to mix static and dynamic analysis to get the best of both worlds. This has already been done in both whole-program optimization and security, with great benefits. My classic default is to feed the common case as test runs into an instrumented version of the system to get execution traces and measurements. That is combined with static analysis to provide better transformations or scheduling. Also, this is a high-level description of a solution that applies to many problems, not just compiler components.
Of course, even if I'm correct you could turn the optimizations back on and be faster still. One reason why you might not want to do that is the cost of increased miscompilation.
Except he has no data to show this, and every piece of data i have (and my colleagues have), says that he is wrong.
Additionally, the compilers can and do replace algorithms if users let them. Most users don't want it, even if it makes code faster.
Remember that the vast majority of people want software that works and is reliable, or has fast development cycles, or a million other things, and happily choose that over software that is fast. Compilers changing algorithms on you isn't highly compatible with this, especially when, for example, most programming languages don't even give you a good way to express the invariants you are trying to maintain (except good ol' Ada).
However, besides that, surely you can see that in the end, the people lose. It's like arguing that a computer would never win at Go.
If i really really really cared about some piece of hot code, i wouldn't pay an expert to optimize it, i'd throw the cycles of 100k spare cpus at optimizing it.
That is also the other strange part of his argument to me: he judges the capabilities of compilers based on the ones he sees, which are meant to operate pretty much single-threaded on single machines, and complains "they will never beat people".
It's like looking at the machine learning algorithm that takes months or years or roughly forever to train on a single cpu and say "it'll never do a good job". The world is not that small anymore.
It's blindingly obvious the optimizing compilers win if we want them to.
In any case, this particular debate will restart again as gpus and accelerators follow the same old compiler cycle.
People want software that is fast; slow (and bloated) software frustrates normal people, they just can't articulate it or identify the correct source of their frustration. As I said in another post just yesterday, you'll often hear people say everything from "god I hate Microsoft" to "I think I need a new machine" to "I might have a virus". Once you dig into the problem, it's often just a particular app (or bloated website) that is slow, or something else on the machine eating up way too much RAM and causing everything to swap.
I'd love to run an experiment where group A has native apps written in a low level language and group B has entirely electron based ones. I'd bet my house that group B has a hell of a lot more general complaints.
What compilers do this? With what algorithms?
Not about changing what algorithms the user coded.
The parent posts are referring to macro-level optimisations, where large changes to how a system works are implemented, in order to get massive gains in performance. Picking out the essence of the larger implementation would be a difficult task for a compiler, but more importantly it wouldn't have enough information about the context of the application to make optimisation decisions. Things like: do we need to read from that file each time we need this piece information, or is it a once-off that can be cached? Unless this contextual information is expressed somehow (and it basically never is), only the programmer will be able to make the change.
def clamp(num, min, max)
  [min, num, max].sort[1]
end

def clamp(num, min, max)
  if (num < min)
    return min
  elsif (num > max)
    return max
  else
    return num
  end
end
That said, in my own limited experience, everyday optimisation problems tend to exist at a much higher level, on the high-level approach or "what are we doing" level. Perhaps these are "obvious" to some, or below the level of discussion here. But essentially, a compiler can optimise away endlessly at a piece of code, but will never beat code that shouldn't exist in the first place. My comment above was that a compiler has insufficient information to make decisions about what's truly wasteful or useless.
As an example, no compiler today will come up with a Courgette update by itself. And the day it does, I think we can pack up our bags and go home (hopefully in a nice, comfy retirement kind of way).
I still argue this way. But these kinds of optimizations are tedious and time-intensive (thus costly). Also dependent on the architecture, i.e. when some extensions (say new SIMD instructions) are added to the instruction set, the compiler sometimes is able to use them automatically (though typically in a very sub-optimal way), while hand-optimized assembly code has to be rewritten (costly). Also if you port your program to a new CPU architecture (say Intel -> ARM or ARM-A32 -> ARM-A64) hardly anything can be reused. Finally tight hand-optimized code tends to be much harder to add features to than more high-level code (say, C code).
So I believe the people arguing vehemently this way are right (and have always been). But this does not contradict the fact that in many cases the very suboptimal code generated by the (say C) compiler is often fast enough, and C is much more economical to use.
However, the top level discussion is about is it worth using an optimizing compiler. Or in my example is it worth using TruffleRuby instead of MRI. I would argue yes because that optimizing compiler gets you 10x more perf per cpu.
Which isn't to say these optimizations aren't important and can lead to very real improvements in speed, I just wouldn't consider it algorithm rewriting in the strictest sense.
See also the citations to it, http://dl.acm.org/citation.cfm?id=340757
As the last paper says:
"The development of techniques for automatic recognition of computational kernels such as inductions, reductions and array recurrences has been an intensive research area in the scope of compiler technology during the 90's."
If you want a simpler example, all of your compilers are usually replacing your divisions by a different algorithm :)
It replaces your sum reductions, recurrences, etc.
Seriously, these are really really old compiler problems.
Compilers do the parts people care about, and will often replace the algorithms that are usually "slow", like math, or have well defined semantics that won't cause api breakage for you. Compilers replace memcpy, printf, etc, all the time.
If you want a compiler that replaces your much higher level algorithms, as the papers show, they have existed over the years. When the research papers are scanned copies, it's old tech :P
But the truth is that, for example, a lot of common languages (C++, etc) are not good languages for that. It's not about compiler technology. I can teach a compiler to recognize your sorts, for example, pretty easily. But proving i can legally change it and not destroy the program's semantics is often literally not possible. You'd have to tell me "it's okay". It's only stuff that is well specified enough (like standard library functions) that i could do.
Think about trying to swap STL data structures for some code. We can recognize when it is possible. We can even determine when it is profitable, and do it.
Usually, the problem is it destroys the API you've built to do so (IE you pass it around as an argument), etc.
and in the end, it's usually not worth it for those languages to do it in the compiler. It's infinitely more effective to do something like http://lists.llvm.org/pipermail/llvm-dev/2016-April/098355.h... and tell people "you should switch this datastructure".
At least, for these languages.
But hey, if you want the compiler to duplicate and inline all the functions and swap out the algorithms, it could!
It's just that, for example, C and C++ are not languages that lend themselves to compilers doing it; they lend themselves to telling humans to do it (or at least, asking them if they want it done and then doing it)
This is obviously all easier in functional languages, and they actually have a fairly rich history of doing algorithm replacement :)
No they're not. I don't actually specify the algorithm to be used for division, I just write "/" (usually), and so the compiler (or runtime or whoever) is certainly free to implement that "high level" specification (I want these two quantities divided) in any way it sees fit.
I am being precise where you are incredibly sloppy.
> You are making a lot of assumptions here.
Really? Which assumptions am I making? Try to be precise.
Considering the fact that you are calling everyone else idiots who don't have any data, the fact that what you write has (a) no data to back it up and (b) is usually trivially, obviously and comically wrong doesn't exactly help your argument.
> ...will idiom recognize division algorithms and change them
Really? Which ones? How is this useful? How many real-world programs actually code a division "algorithm"? I don't think I've seen that once in the last 20+ years or so, but of course YMMV.
UPDATE: Last I remember, "idiom recognition" was invented for the sole purpose of cheating on industry standard benchmarks, for example replacing the Dhrystone with a computed result. Has this changed?
Exercise: Write an assembly function that is faster than "-O3" and loop-free. Should be easy.
See, rep is a workaround for older AMD CPU branch predictor. In addition gcc does not use the highly opcoded variant of lea with offsets because it is slow on older Intel.
GCC version makes for much better speculative execution too due to the use of the adder.
Accepted (though the central advantage of SHLX (with -march=native) and SHL (no -march=native) for this example lies in the greater flexibility of register parameters). In my defense, I have only had a computer whose processor supports BMI2 for a few months now - so I could not play around with BMI2 before. Otherwise I am sure I would have known such tricks.
> See, rep is a workaround for older AMD CPU branch predictor. In addition gcc does not use the highly opcoded variant of lea with offsets because it is slow on older Intel.
I know that. My personal code philosophy is to avoid such hacks for circumventing performance bugs in outdated processors.
Likewise preferring microcoded lea shafts everything older than Haswell on Intel side. Not to mention modern Atom.
I think idiom recognizers are worthy of mention. That said, being manual pattern recognition, they are probably going to eventually lose their edge to things like LLVM's Souper or other superoptimizers (exhaustive or predictive)
What does Ada do that helps with this in ways C and other languages don't?
Also, the type system lets you specify legal values with more precision than the word-based types that most languages inherited from C. For example, you can have range types that may hold integers of only a certain range, and the compiler will enforce that when assigning to variables of the type, or throw an exception if an arithmetic operation results in a value outside the range (you also have modulus types that explicitly give the wraparound behavior familiar in C). Containers allow generic constraints, and you can explicitly specify the dimensions of array types.
> he has no data to show this
Yes he does. And so do countless other people. For example, I just gave a talk where I describe an afternoon optimization session giving me a 1000x and an 8000x improvement for a data-conversion and a query task, respectively.
Was the compiler helpful? Probably, I didn't measure with and without. But the vast majority of the improvement was not the compiler. And I don't see how the compiler could have come up with the changes (moving from a layered architecture of Model Objects -> ORM -> SQLite to a hexagonal architecture with an optimized data representation and completely different access procedure for the one data type that mattered).
And this is what I find almost everywhere I go. For example the BBC system I made somewhere between a hundred and a thousand times faster: https://www.slideshare.net/MarcelWeiher/in-processrest , https://link.springer.com/chapter/10.1007/978-1-4614-9299-3_...
And so on and so on. It is also interesting that in none of these cases did I revert to assembler, so I think that part of DJB's thesis is not entirely correct, but only in being too specific: hand-coding assembler is just one part of hand-optimizing, and very often it is not the or even a crucial part (though in security algorithms it might very well be).
But the idea that optimizing compilers are getting squeezed at both ends is, IMHO, 100% correct. On the one hand, so much code is so incredibly cold, see Ruby on Rails web sites or most UI programming. RoR can do maybe 10-100 requests/s. It's so slow it's comical. However, most sites would be happy to receive 100 hits per day, so "comically slow" is fast enough. Or when I set the end-parameters for an animation that is going to take half a second, it wouldn't matter if the code doing that were run on a LISP interpreter badly written in Ruby. Or why hasn't PyPy completely replaced Python? Why was TCL adequate for steering supercomputer workloads?
On the other hand, the performance critical parts have gotten more critical. That means that the "make everything a bit faster, but without any guarantees" that optimizing compilers are great at providing is simply not that useful any more for a lot of workloads.
And yes, there are workloads where it is useful, and it seems that you have some of these workloads. That doesn't change the broader picture.
And of course it's both: it probably won't do it right where you need it, and it shouldn't even try where you don't (especially if that means messing up the code, see http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_201... )
Please be concrete about what exact data you think he does and countless others have.
"Was the compiler helpful? Probably, I didn't measure with and without. But the vast majority of the improvement was not the compiler. And I don't see how the compiler could have come up with the changes (moving from a layered architecture of Model Objects -> ORM -> SQLite to a hexagonal architecture with an optimized data representation and completely different access procedure for the one data type that mattered)."
First, you've actually said "i have no idea if the compiler would have fixed this, because i didn't run it". That means what you have is not actually data. It's an assumption. If you want to prove it out, that's data. You actually have to do the science if you want to say you've done the science.
Second, Compilers do actually come up with such optimized data representations, memory layouts, loop walking, you name it (see, for example, https://hal.inria.fr/hal-01425546)
Had you written it in a language that allowed for semantic changes of the kind you are talking about, we could make an optimizing compiler do it.
15 years ago I saw a compiler that did automatic splitting and distribution of COM objects to split up a game to run half on a server and half on a client.
We can do this stuff. You'd just have to use something other than C/C++.
Which is why more and more people are moving towards those kinds of languages every day, and why more and more invasive "extensions" to C/C++ get made every day.
Your particular example used to be the domain of dataflow, ETL, and stream compilers, all of which, yes, could reorganize your entire hierarchy to be better if you let them.
As mentioned, this isn't about the capabilities of optimizing compilers, but about "what tradeoffs am i making in choosing my tools and languages".
I don't think it's reasonable to, for example, program in qbasic and then say "optimizing compilers suck at reorganizing my data layout". Well, yeah.
"On the other hand, the performance critical parts have gotten more critical. That means that the "make everything a bit faster, but without any guarantees" that optimizing compilers are great at providing is simply not that useful any more for a lot of workloads."
Again, you strongly are lacking in actual data.
But normal people consider the output of a compiler to be a single binary (or bytecode meant to run in a single possibly-forking process) and the optimizations the compiler is allowed to perform in the "as-if" principle do not change the output of strace of the resulting binary.
I would be very surprised if a compiler ever changed the externally-visible IO operations my program performed, even if for the better. What I call a compiler cannot change the file format or the serialization format or the indexing or the way data is written to a socket or things like that. It can only change computations and memory accesses between syscalls as long as the same syscalls are performed.
For systems where the task is defined at such high level that choosing the file format is up to the "compiler", I would want to use a different name than "compiler".
No I did not. Please don't distort what I wrote (which appears to be the only way you have of 'making' your point). What I did write was "I didn't measure with and without", meaning I didn't do the cross check of what the total improvement would have been without the normal compiler optimizations, because both the original and the final were run with standard -Os optimizations, and because I have a fairly robust intuition in terms of the bounds of the effect.
For shits and giggles, I now did the cross check. The difference between -O0 and -Ofast was in the noise. Sometimes the -O0 was faster, sometimes the -Ofast was faster, slightly skewing towards -Ofast and with times varying by around 10%.
So let's be generous and assume that it wasn't noise, that it was 10% in the direction of the -Ofast option. Heck, let's double it to 20% just to be on the safe side. That's 20% vs. 1000x. Compared to the total, the contribution of the compiler optimizations vs. the manual optimizations is 2:10000. Or 99.98% vs 0.02%. Compare Proebsting's Law.
The rest of your post is full of "had you" and "might have" and "could have". And you berate others for not delivering data. Hmm... And then you go off and cite research that did something that seems vaguely keyword-related to the things I described, but is actually nothing at all like it.
Again, my example is
CSV -> ( Object Model : ORM (CoreData) : SQL Database (SQLite) )
CSV -> ( Optimized binary data model ) -> fwrite()
Please also take into consideration that the CoreData ORM and SQLite DB are not available in source, but only via shared library / API.
The original is described here: https://www.objc.io/issues/4-core-data/importing-large-data-...
My modified version is described in the talk: https://www.youtube.com/watch?v=kHG_zw75SjE
Sample data sets: https://vbb-gtfs.jannisr.de/2017-05-10
> Had you written it in a language ...
If my grandmother had wheels, she'd be a bus. Starting an argument (or a logical deduction) with a false premise is not exactly useful. Sorry, if your idea of the power of optimizing compilers is "you need to first rewrite the code", then I can just rewrite the code to be fast and leave out the middle-man. Your argument here is ridiculous beyond the pale.
> 15 years ago I saw a compiler...
And I saw a UFO. This is your strong "actual data"? Yes, research compilers can sometimes do funky stuff in research settings. Same for JITs. I very much remember that special issue of the IBM system's journal on Java performance. It featured a bunch of reports from the researchers touting performance numbers, and then one real world report from the San Francisco project that reported their performance "challenges". See Jitterdämmerung: http://blog.metaobject.com/2015/10/jitterdammerung.html
> ...that did automatic splitting and distribution of COM objects to split up a game to run half on a server and half on a client.
And how is this relevant to the example at hand, except in "proving" (which it doesn't, by the way) that compilers can do crazy transformations (which so isn't the point)? Was this a production compiler? Was it general purpose? Is it still around today?
The point is not what compilers can do. It is how useful that is in practice. And yes, there exists code outside of Google data centers.
> ...qbasic and then say "optimizing compilers suck at reorganizing my data layout"
Nobody said that, except you. You need to stop making up straw-man arguments that you can then demolish, but rather deal with actual facts.
I am rebutting his absurd claim that compilers can and more specifically do transform the code I hand-optimized to the tune of 1000x / 8000x by doing a top-level architectural transform. They can't and don't. In this particular case, the optimizer's contribution was trivial (actually less then I would have expected). And of course DJB gave a bunch of other examples as well, and both he and I (not that I am putting myself on the same level, not by a long shot) and many others have seen many other cases.
And he is wrong. He alludes (at most) to compilers that do things that are at best superficially-keyword related to what I've described. And that is being very, very generous.
I have given the example and made available the source code. If you have a compiler that can do this transformation, show me.
And to reiterate, the transformation is to take a system that uses an SQL database via an ORM, both not present as source code but via an OS-level ABI, and convert that to not use either the SQL database or the ORM, and instead use a packed binary representation that takes domain knowledge not available to the compiler (certain fields in the input CSV are hours and minutes, but the hours go to 30 because of midnight wraparound), and use that knowledge to create an indexing structure based on double array lookup into buckets, and a bucket sort to create those buckets. And use fwrite() for persistence (instead of SQLite) and mmap() to read the data in (instead of SQLite) because the structure was purposely created in such a way as to (a) be contiguous and (b) not contain any pointers.
Go. I'll wait.
And yes, the burden of proof is on the person who says compilers do this. And yes, the task is not to transform the program the original authors (not me) should have written into the optimized version, but the program they did write.
Oh, another example is the BBC one, where the start was a distributed system with around a dozen or two machines running around 100 processes, some talking to each other via HTTP, or shared folders, or via a shared Oracle DB. Or some combination of these. The replacement was a single process event-poster using in-process REST that was between 100-1000x faster. Again, show me the compiler that will do this.
> He doesn't say your compiler does.
So "no true Scotsman", eh? Anyway, I was always talking about real world examples, and real world examples are a combination of a real-world problem with a real-world compiler. Just as DJB's argument (and my similar argument for JITs, see "Jitterdämmerung") is at least in part about the real world not matching the expectations of the JIT/optimizer engineers.
In theory, all this works super swimmingly and is totally amazing, and yes you can construct benchmarks where it does. In practice it is less useful, at least for the vast majority of users, because either they don't need it or they need more. And this is the crux of the misunderstanding: the point here is not that compilers can't do certain things (they can), but that those things are, for the most part (not universally), either (a) not needed (RoR, ...) or (b) not enough (my examples).
Which is why the counter of "they do" and giving various examples of compilers doing amazing things is completely pointless, as it only shows you're completely missing the point.
Now there obviously are specialized use-cases where this is not true, where all of this actually works and those capabilities match the requirements. While I have no insight into Google's operations, my mental model is that theirs is one of those specialized use cases, which is why DannyBee is so adamant that this all works perfectly. In his world, it does. And it's super-sophisticated, super-cool rocket science and he is a super-smart engineer.
However, not everyone is Google. In fact, I contend that Google is very much an outlier in terms of requirements and capabilities.
> He says there exist production compilers that do.
And he hasn't shown a single one that does, partly because he so far hasn't even grasped what "do" in this case is (and continuously transformed my statements into straw-men), which is why he has thrown out this barrage of unnamed compilers ("I hear things", "I see things") that do random stuff that is at best marginally related to what I've described.
Put up or shut up.
Yes there is: "ohh, gcc, that's not a true optimizing compiler". And of course so far refusing to show any other real compiler that does all these magical things.
> He runs compiler infrastructure at Google
Yes, I know. So? Argument from perceived authority much?
> says he's seen production compilers doing stuff you say is impossible.
Actually, I haven't seen him claim that. He has claimed all sorts of things that he has seen optimizing compilers do that have very, very tangential (at best) similarities with some of the keywords I mentioned, and has claimed that as a rebuttal to my claim, which it is not.
> You say, essentially, he's lying.
If he actually claims what you say he's claiming then: yes, he's lying. I haven't seen him claim that, but I have seen him use these fairly random claims to rebut some very specific claims I have made.
And of course, if there are compilers that do exactly what I've described (transform the two systems I mentioned in the way I described), then I am absolutely happy to eat crow. While anything is possible, in theory, this would be so far beyond the current state of the art, that we could probably retire >90% of programmers right now (assuming that a small fraction of that capability were also applied to other fields of programming). And seeing the consumer-visible parts of Google, those compilers would be so super-secret that they are not used on any of that stuff.
The code is out there, somebody run it through the compiler and prove what a dunce I am.
It's OK not to know everything about your field, or even to be wrong sometimes. But find a less obnoxious way to handle those situations.
I searched this whole thread for mentions of GCC, and I didn't find anything like this. Are you referring to the part about having to use a different language than C or C++ if you want this kind of optimization?
(I agree with you that "look, there is an old paper describing a research compiler in a research setting on self-chosen benchmarks" is not the same as production compilers doing the same thing in practice.)
That's because those data only represent quantitative statistics on code, not whether said code would get great benefits if the programmer actually bothered to make algorithmic adjustments.
I see "optimizations are bad/unnecessary/problematic/whatever" in my job writing a compiler. The underlying issue is almost never "I don't want optimizations" - even when someone thinks that's what they want; it's "I don't want optimizations that produce unexpected behavior".
Implementation differences in the exact same, say, simple linear algorithm can still add up to a tenfold difference. Compiler optimisations can take it further, though some are not generally applicable. E.g. I have had good experiences with PGO (careful analysis and/or knowing your workload is needed), but it's obviously not an option for the typical open source software distribution model.
Obviously, combining a better algorithm with a better implementation that is optimized will be fastest.
My HPC lecturer used to say that the speed difference between bad Matlab code and good Matlab code will often be greater than the speed difference between good Matlab code and good Fortran code.
Tweaking the hell out of your code at the assembly level is a waste of time if you're using a naive O(2^N) algorithm.
Absolutely but often it's very hard to make those changes or those changes lead to a loss of accuracy, whereas compiling with -O3 is very easy and leads to no loss in accuracy (in the common case).
Often it's so fast that it doesn't matter at all that it can be 100x slower than C. 0.1 sec is still good even if it's not 0.001 sec.
Further, it doesn't help with the required restructuring; when the functionality is scattered across a dozen or more modules (complex software + good Python practice), then you first have to pull the functionality needed for the path you're trying to get fast together. That leads to duplication, removing abstractions and most certainly vetoes from other people.
The problem is that, say, C does not let you express the necessary details about the optimizations that you want to do. This is exactly something DJB calls for in programming languages.
IMO he couldn't give a convincing answer to the guy who asked about the LuaJIT author being out of a job. But there's a clear answer: JIT authors are not out of a job not because optimizing compilers aren't dead, but because they're writing compilers; their distinguishing ability is producing "pre-compiled" code.
You might say, "well a JIT author sped up your code's execution so he/she is writing an optimizing compiler". Well you have to realize that, traditionally, JIT authors don't just translate the code into object code, they also apply these things called "compiler optimizations". The point is that if they didn't do that, and simply produced a faithful translation of the code, they would still make the code faster because of pre-compilation (and if they enabled the "compiler optimizations", the code wouldn't run significantly faster than the simply pre-compiled code).
Regardless of whether I agree with it or not, "Optimizing compilers are dead" is not the same as saying "JIT authors will be out of business". (Even compiler writers won't be out of business).
Your suggestion is that a templating JIT that just drops in some machine code that matches the method, doing no optimization, would get all the win. Such a compiler is indeed much faster than an interpreter, but it's nowhere close to an optimizing compiler. Mike Pall, the author of LuaJIT, would be very surprised if you suggested that his compiler performed similarly to something simple like that.
You can compare the performance of LuaJIT 1.x and 2.0 yourself on the benchmark page (for x86). The LuaJIT 1.x JIT-compiled code is only slightly faster than the heavily tuned LuaJIT 2.x VM plus the 2.x interpreter written in assembly language by hand. Sometimes the 2.x interpreter even beats the 1.x compiler.
A lot of this is due to the better design of the 2.x VM (object layout, stack layout, calling conventions, builtins etc.). But from the perspective of the CPU, a heavily optimized interpreter does not look that different from simplistic, template-generated code. The interpreter dispatch overhead can be moved to independent dependency-chains by the CPU, if you're doing this right.
Of course, the LuaJIT 2.x JIT compiler handily beats both the 2.x interpreter and the 1.x compiler.
Article: "We can also refute Bernstein’s argument from first principles: the kind of people who can effectively hand-optimize code are expensive and not incredibly plentiful."
Commenter: "IMO he couldn't give a convincing answer to the guy who asked about LuaJIT author being out of a job."
Guy in audience: "I was that guy in the audience."
LuaJIT author: "Actually, LuaJIT 1.x is just that"
Voice in my head: "Aspen 20, I show you at one thousand eight hundred and forty-two knots, across the ground."
Meta: Apologies for the abstract response, but I couldn't figure out a better way to present the parallel. It can be hard to explain artistic allusions without ruining them. What I mean to say is that this pattern of responses reminded me in a delightful way of the classic story of the SR-71 ground speed check: http://www.econrates.com/reality/schul.html
But, as you said, my main point is your last sentence -- optimizing compilers like LuaJIT 2.x are impressive and necessary.
In languages with bad optimisers I have to worry about separating code in a hot loop out into a function -- the cost of a function call is too high. This one in particular I find can lead to some horrible code, as functions grow larger and larger and lots of cutting+pasting happens to avoid function call costs.
On a smaller note, making sure I cache the values of function calls which won't change -- when instead I could trust the compiler to know the value won't change and do the caching itself.
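As a trivial illustration of that manual caching (hypothetical code, with Python standing in for the scripting-language case), this is the hoisting we'd like a compiler to prove safe and do for us:

```python
def normalize_slow(values):
    # Re-evaluates max(values) on every iteration: O(n) calls of O(n) cost.
    return [v / max(values) for v in values]

def normalize_fast(values):
    # Manually cache the invariant call -- the thing we'd like the
    # compiler to do once it can prove max() is pure and values is unchanged.
    peak = max(values)
    return [v / peak for v in values]

assert normalize_slow([1, 2, 4]) == normalize_fast([1, 2, 4]) == [0.25, 0.5, 1.0]
```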
The problem comes when you are using other languages (scripting ones like Python come to mind) where inlining is but a dream (ignoring systems like PyPy).
This doesn't follow at all. If you have one hot loop and a bunch of cold code, and auto-optimization makes your code a measly factor of 2 faster, you're still going to need to hand-optimize the hot loop, and what it does to the cold code is irrelevant.
> hand-optimized code has higher ongoing maintenance costs than does portable source code; we’d like to avoid it when there’s a better way to meet our performance goals.
True, but again, only applies if you can optimize by enough to make hand-optimization unnecessary.
> we’d also have to throw away many of those 16 GB phones that are cheap and plentiful and fairly useful today.
This part is nonsense. No-one's got anything like 16GB of code on their phone.
Optimization could be valuable but current compilers are too opaque, making optimization too much of a black art. I believe we need to do something along the lines of "turning the database inside out" ( https://www.confluent.io/blog/turning-the-database-inside-ou... ); we should turn the compiler inside out, build it as more of a library, give the developer more insight into what's going on, have a high level language that lets you understand how it compiles. Interesting and vaguely along the same lines: https://www.microsoft.com/en-us/research/publication/coq-wor... .
Huh? You start with performance goals. If compiler-optimized code meets your performance goals, you are done. You do not "need to hand-optimize" in that case.
Why would a "factor of 2" not be good enough? What is it compared to? What makes you so sure that you must optimize further, disregarding any possible context?
Reading through the various comments asserting that optimizing compilers should not be necessary if you 'just use better algorithms' and whatnot, I'm kind of wondering why it has to be one thing or the other. Who wouldn't want to have both?
If I tell my employer I'm not interested in getting a 2x speedup by flipping a compiler switch, but prefer to spend an uncertain time looking for or inventing this hypothetical algorithm that will give me 100x to 1000x speedup, I don't think they'll be very happy about that. For most of the code we deliver to our customers, a factor 'so marginal as 2x' could literally save them millions of dollars over the lifetime of the software. Two times faster saves a lot of computing time if you have jobs that run for 72+ hours. If only things were as easy as you pretend they are (we already have the 'marginal' 2x from the compiler optimizations, and we already have efficient algorithms)...
If we're talking about algorithmic improvements we're usually talking about going from O(n^2) to O(n log n) or similar, which is easily a 100x-1000x or more speedup if your n is big enough to be bothering with performance at all. It's very easy to be accidentally quadratic and this causes a lot of performance bugs. Talking as though algorithmic speedups and optimization speedups are of a similar magnitude is really misleading.
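A classic "accidentally quadratic" shape, sketched in Python: a membership test against a list inside a loop is O(n) per element, so the loop is O(n^2) overall; swapping the list for a set makes the same code O(n) with no change in behavior.

```python
def dedup_quadratic(items):
    seen, out = [], []
    for x in items:
        if x not in seen:      # O(n) linear scan per element -> O(n^2) total
            seen.append(x)
            out.append(x)
    return out

def dedup_linear(items):
    seen, out = set(), []
    for x in items:
        if x not in seen:      # O(1) average hash lookup -> O(n) total
            seen.add(x)
            out.append(x)
    return out

assert dedup_quadratic([3, 1, 3, 2, 1]) == dedup_linear([3, 1, 3, 2, 1]) == [3, 1, 2]
```

At n = 10^6 the difference between these two is the difference between milliseconds and hours, which is the kind of gap no -O3 flag closes.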
> it could still be 2x too slow to meet some hard realtime performance constraint
Theoretically it could be, but that would take an incredible amount of luck. There's so much variation in runtime that the odds of getting within a factor of 2x are tiny; more likely you're thousands of times too slow, or thousands of times faster than you need to be.
> Two times faster saves a lot of computing time if you have jobs that run for 72+ hours.
Sure, if you've got jobs that run for 72 hours in your code (i.e. aren't spending those 72 hours doing something that's already highly optimized in a standard library e.g. linear algebra) and have already carefully optimized your choice of algorithms, then the rules are different. That's a pretty rare case though.
(technically I have implemented sort since my degree, but it was bogo sort: I obviously did it only for fun)
The original DJB presentation, which this is a response to, is very good and interesting.
It would really be nice if the field of compiler engineering started to address the obvious neglected areas, like optimizing memory layout and data types/representations based on partial evaluation / profile feedback.
Interesting tidbit: LLVM was developed by Chris Lattner for his PhD thesis. It was titled "Macroscopic Data Structure Analysis and Optimization" and was, in his own words, about how "Instead of analyzing individual load/store operations or structure definitions, this approach identifies, analyzes, and transforms entire memory structures as a unit". Apparently, the world was more interested in a good compiler framework than in the specific memory layout optimizations proposed.
They already can. The issue is usually one of what is allowable within language semantics, not of compiler optimization technology.
One reason you may see this as neglected is that outside of, say, polyhedral loop optimizations, research in the 80's and 90's (and sometimes earlier) did a really really good job of exploring this area because fortran allowed so much freedom.
The struct layout is fixed by platform-specific ABI documents, which must specify the layout precisely so that structs can be passed between separately compiled code, e.g. between applications and dynamically linked libraries.
Changing ABIs is virtually impossible, so...
It would be nice to see opt-in optimizations of struct layouts, e.g. by annotating structs with a packing-style attribute. Though the best gains would probably be obtained from profile-guided optimizations for cache line optimizations, and those can only really be done with link-time optimizations for structs that are not passed outside the linker's scope.
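The padding cost of declaration order is easy to demonstrate with Python's struct module in native-alignment mode (the exact sizes below assume a typical x86-64 ABI, so treat them as illustrative): a char/int/float struct declared in a bad order wastes bytes to alignment padding, and merely reordering the fields recovers them -- exactly the kind of rewrite the ABI forbids the compiler from doing on its own.

```python
import struct

# '@' = native byte order and native alignment, i.e. laid out like a
# C struct on this platform.
bad  = struct.calcsize('@bif')  # char, int, float: 1 + 3 pad + 4 + 4 = 12 on x86-64
good = struct.calcsize('@ifb')  # int, float, char: 4 + 4 + 1        =  9 on x86-64

# Reordering fields from widest to narrowest never makes the layout bigger.
assert good <= bad
```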
"Within a structure object, the non-bit-field members and the units in which bit-fields reside have addresses that increase in the order in which they are declared."
So I don't think "languages don't support it" is a good excuse.
I'm not sure you are really saying anything that proves me wrong?
Using those extensions, there are optimizing compilers that will perform very heavy memory layout optimization.
Which does, in fact, go to show that it took "invasive language extensions" to make the original language able to be optimized in this fashion?
IE you invasively changed the original semantics, and now these things are possible. How does that not show that the original language semantics were the problem, which is what I said?
Admittedly, foremost I was thinking about entirely different kinds of data and layout optimizations: hierarchical data, pointer-heavy structures, and potential compressed representations. Things like converting node pointers into 16-bit indices into a node array.
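That pointers-to-indices idea can be sketched concretely (Python here, purely for illustration): instead of a pointer-chasing node graph, keep all nodes in one array and store links as 16-bit indices, which shrinks each link from 8 bytes (a 64-bit pointer) to 2 and keeps traversals inside one contiguous allocation.

```python
import array

# Pointer-heavy version: each node is a separate object holding a reference.
class Node:
    __slots__ = ('value', 'next')
    def __init__(self, value, next=None):
        self.value, self.next = value, next

def traverse_nodes(n):
    out = []
    while n is not None:
        out.append(n.value)
        n = n.next
    return out

# Index-based version: values in one array, links as 16-bit unsigned
# ints ('H'), with the sentinel 0xFFFF standing in for "no next node".
NIL = 0xFFFF

def traverse_indexed(values, links, start):
    out, i = [], start
    while i != NIL:
        out.append(values[i])
        i = links[i]
    return out

values = ['a', 'b', 'c']
links = array.array('H', [2, NIL, 1])        # a -> c -> b
head = Node('a', Node('c', Node('b')))       # same list, pointer-style

assert traverse_nodes(head) == traverse_indexed(values, links, 0) == ['a', 'c', 'b']
assert links.itemsize == 2                   # 2 bytes per link vs 8 for a pointer
```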
CUDA compiler developer here. We definitely perform range analysis on values stored in registers. This is an important optimization.
Via scalar replacement of aggregates, we can sometimes also replace a struct with a set of scalars. Once we do that, we can again perform range analysis on those scalars.
We can't change structs that don't get SROA'ed because of limitations of the language. For one thing, essentially any memory we write to the GPU's global memory can be read from the CPU side, and it's basically impossible to tell what is and isn't read, so we have to keep the memory layout the same.
Which is precisely a language semantic problem, since there are plenty of languages where you can change SOA to AOS.
In fact, it's wildly common among high-performance Fortran compilers.
What does this mean?
(Glad you're enjoying your time at Google, Justin!)
For some recent compiler research with regards to optimizing memory and cache use, see e.g. "Optimizing Indirect Memory References with Milk", PACT 2016.
Regarding the Milk paper specifically, it appears to be about reordering and clustering memory accesses without changes to data layout. This is a fine thing but I think it reinforces my point about lack of attention to memory/data layout optimizations.
Nope, much is gained by unrolling loops, inlining functions etc, which all increase code size.
Of course C++ compilation with no optimization at all can be rather wasteful with performance and code size, but to squeeze the final performance out you probably need to sacrifice code size (whether manual or automatic)
That's because those are performance optimizations, not size optimizations (though as an aside, inlining functions can reduce code size in the event that the inlined version is smaller than the function-call overhead, or in the case where the function is used only once).
There are plenty of size optimizations that can be performed. -Os will enable them on gcc/clang if you want to try for yourself.
Something I took for granted for the longest time about Haskell (which remains the only language I know of with the feature) is the ability to write user-defined "rewrite rules". You can say, "okay, GHC, I know for a fact that if I use these functions in such-and-such way, you can replace it by this instead".
{-# RULES "foo/bar" forall x. foo x (bar x) = fooBar x #-}
Now, anywhere a user of this code writes something like
foo [1,2,3] (bar [1,2,3])
GHC will rewrite it to fooBar [1,2,3] during compilation.
You can also "specialize" by using faster implementations in certain cases. For example,
timesFour :: Num a => a -> a
timesFour a = a + a + a + a

timesFourInt :: Int -> Int
timesFourInt x = shiftL x 2  -- shiftL is from Data.Bits; times four is a left shift by 2

{-# RULES "timesFour/Int" timesFour = timesFourInt :: Int -> Int #-}
High-performance Haskell libraries like vector, bytestring, text, pipes, or conduit capitalize on this feature, among other techniques. When compiling code written using libraries like this, this is how it goes:
- rule #1 fires somewhere
- it rewrites the code into something that matches rule #2, "clearing the way" for it to fire
- rule #2 fires
- rule #3 fires
- rule #1 fires again
- rule #4 fires
and so on, triggering a "cascade" of optimizations.
The promise of Haskell is that we already have a "sufficiently smart compiler": today, with good libraries, GHC is capable of turning clear, high-level, reusable functional code with chains of function compositions and folds and so on into tight, fast loops.
I must add, though, that getting rewrite rules to fire in cascades to get "mad gainz" requires one to grok how the GHC inliner/specializer works.
Data.Vector also utilizes an internal representation that makes fusion explicit and hence predictable (inevitable, even) called a "bundle":
but this relies on rewrite rules too, e.g. the previous module contains this rule:
{-# RULES "zipWithM xs xs [Vector.Stream]" forall f xs.
    zipWithM f xs xs = mapM (\x -> f x x) xs #-}
It is very interesting, though, and is a good avenue to explore. I've got some reading to do :)
In practice, most of the libraries that include tons of rewrite/specialise/inline (ab)use are either "core" libraries (like vector/bytestring) or have a large userbase (e.g. Conduit), and rules don't really change too much from version to version, so this has never actually had the detrimental effects that "silent replacement of code" might have in the worst case.
This might sound similar to apologetics for dynamically-typed languages: the only real answer to your question is that rewrite rules are ultimately a way to improve performance, and they come with caveats usually associated with replacing clearer algorithms with faster ones. (I'm thinking of the adaptive sort from the C++ STL, which iirc uses different sorting algorithms for differently-sized containers to maximize expected performance. It's not exactly intuitive that vectors of size 100 and vectors of size ~10000 are sorted differently, is it?)
Of course, the only verifiably-correct way to do these things is through the use of machine-checkable proofs of the transforms. The well-known "foldr/build" rule that many optimizations of linked-list functions reduce to, for instance, has been proven outside of GHC, and there are usually similar proofs for other rules known to the library author. The appeal of dependently-typed languages is how they are a nice vehicle for the inclusion of those proofs in library code: if the compiler can't verify that your rule makes sense, it can complain instead. You then end up supplying a proof, or saying "just believe me, okay?" ;)
"In practice"? Seriously? Next thing I know I'll be parroting "in the real world" and so on...
> Using just these two rules, I saw a 225% increase in the number of nodes processed per second. Under the right circumstances, GHC can even outperform C code by applying these kinds of inter-procedural optimizations.
Also, we have QuickCheck, which allows us to check properties by randomly generating testcases. The post I linked mentions this.
let x = (await grabSomeIntegers()).map(double).filter(biggerThan(3)).map(subtract(1));
How do Haskell library authors deal with that?
x = do
    ints <- grabSomeIntegers
    return (ints & map (*2) & filter (> 3) & map (subtract 1))
x = fmap (map (subtract 1) . filter (> 3) . map (* 2)) grabSomeIntegers
... right? Nope, that doesn't have to be the case: Haskell supports more than just your $grandparent's parameterized types!
The vector package uses something called data families to have different in-memory representations for vectors of different types, without "letting go of the DRY" at all.
Here, Vector a is a data family: how a value of type Vector a "looks" in memory depends on a. So you'll notice a bunch of lines of the form
newtype instance Vector Int16 = V_Int16 (P.Vector Int16)
Written using vectors, code like in your example is perfectly DRY; it would look like (this is untested code!) this:
combobulate :: (V.Unbox a, Num a, Ord a) => IO (Vector a)
combobulate = do
    ints <- grabSomeIntegersFromNetwork
    return (ints & V.map (*2) & V.filter (> 3) & V.map (subtract 1))
The key is the data families: the types expose a simple, consistent API, hiding the actual guts of the memory representation from the user. The library author is free to perform whatever ugly/evil/unsafe manipulations she feels like behind the scenes on the internal representations, or not.
 Defining numeric operations on animals is left as an exercise to the careful reader.
The specialization example is the same as template specialization in C++.
> The specialization example is the same as template specialization in C++.
This sounds very interesting to me, because you're saying it's accomplished entirely in the type system: do you have an example or a reference you can point me to?
I'm guessing there's a type-level tag for an algorithm, and the compiler is told how to combine tags corresponding to a sequence of operations on vectors that looks a certain way. Something like do_the_thing<t, DotWith<Self>, HasFastNorm<t>> = do_the_thing<Norm<t>>? (just thinking aloud in pseudo-C++, nothing to see here, move along) In that case, it would be closer to something using a function on types, which Haskell has in the form of type families.
What you have towards the end of that post is actually something like
type family Vector a where
    Vector (Ptr a) = ...
    Vector a       = ...
I still don't know how the linear algebra/Eigen examples might work, though.
I believe that the basic idea was that not all rewrites end up being equally fast (a * b * c can be fused into ab * c, but maybe a * bc is faster). Trying all the combinations is an option in theory, but you get a combinatorial explosion, so you normally don't get far in practice. Enter machine learning. I'm not sure how successful they were, but I imagine that the same sort of thing could be applied to SIMD.
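The cost asymmetry between parenthesizations is easy to make concrete with plain operation counts: a naive (m x n) by (n x p) multiply costs about m*n*p scalar multiplications, so the two groupings of a * b * c can differ by an order of magnitude (the dimensions below are the classic textbook matrix-chain example, chosen purely for illustration).

```python
def matmul_cost(m, n, p):
    # Naive (m x n) @ (n x p) multiply: m*n*p scalar multiplications.
    return m * n * p

# a: 10x100, b: 100x5, c: 5x50
left  = matmul_cost(10, 100, 5) + matmul_cost(10, 5, 50)    # (ab)c =  7500
right = matmul_cost(100, 5, 50) + matmul_cost(10, 100, 50)  # a(bc) = 75000

assert left == 7500 and right == 75000
assert right // left == 10  # 10x difference from parenthesization alone
```

With longer chains the number of parenthesizations grows as the Catalan numbers, which is exactly the combinatorial explosion mentioned above.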