Several years ago, I happened on a blog post where someone demonstrated a hand-written, SSE-vectorized variant of the N-queens solution that was presented as very fast. I managed to write a faster, non-vectorized C solution that was recursive, that the compiler couldn't vectorize, and that was much easier to understand than the vectorized original it was based on.
Turns out the main reason the vaunted "overkilled" solution was so abysmally slow was that the author happily used BSF and BTC in the hottest part of the loop... which are actually rather slow instructions, particularly when you're using them to control a branch (compare-and-jump is a fused µop in practice, but BTC-and-jump is not).
The point of this tale is that if you want to absolutely wring the last clock cycle out of a hot path, you usually need good microarchitectural knowledge about which operations are going to be faster and which are not. Sure, you can beat a compiler with hand-written, hand-optimized assembly code most of the time--but the people who have the skills to write such code are going to be the people working on the compilers.
Optimizing compilers are getting better, probably faster than we are capable of pumping out performance engineers to hand-craft the inner loops. In the past decade, we've seen polyhedral loop transformations become production-quality. Auto-vectorization is getting better, particularly when user-directed (think #pragma openmp simd); I know Intel has been pushing "outer-loop vectorization" very hard in the past few years. The other big fruit on the horizon is superoptimizers: I suspect we'll see superoptimizers shipping in production compilers within a decade or two.
> the author happily used BSF and BTC in the hottest part of the loop... which are actually rather slow instructions
Depends on the hardware you're running on. They used to be very fast on Intel machines. (There's an argument here that C survives architecture changes better than the ASM alternative, but BSF/BSR don't have C equivalents that the compiler can translate if appropriate, AFAIK. The best generic, widely fast implementation I know of uses multiply-shift with de Bruijn sequences and isn't automatically translated, for obvious reasons.)
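For reference, a minimal sketch of that de Bruijn multiply-shift trick in C (the constant and table are the well-known 32-bit ones; the function name is mine):

    #include <stdint.h>

    /* Index of the lowest set bit of a nonzero 32-bit value, without BSF/TZCNT.
       0x077CB531 is a de Bruijn sequence; isolating the lowest set bit (v & -v),
       multiplying, and keeping the top 5 bits yields a unique table index. */
    static const int debruijn32[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };

    static inline int bit_scan_forward(uint32_t v)
    {
        return debruijn32[((v & -v) * 0x077CB531u) >> 27];
    }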
That is the point: I can spend a couple of months hand-optimizing a hot spot, but for what CPU? The optimizations I'd do for an Intel Core i7 are different from those for an Intel Core i9 - and those are generations of the same architecture; try to support AMD Ryzen and things are very different. Now add in an ARM CPU (which one?)...
If I can find a better algorithm I can of course beat the compiler on all of the above. However, does such an algorithm exist?
I think it depends a lot. I have worked on some hand-crafted SIMD for multimedia use before. It is fairly easy to outperform the auto-vectorization in that field, IMO. I have also played around with various attempts to hand-craft inner loops (SIMD and non-SIMD) for other projects, but most of the time I failed to outperform the compiler.
Yes, it depends a lot. Typical signal-processing tasks you will definitely want to hand-optimize: no matter how good the compiler is, it will never get close to hand-crafted SIMD code combined with careful memory layout, reducing the working set, eliminating dependencies, maximizing cache usage, etc. There's a lot involved in that kind of optimization that presupposes semantic knowledge of the data going through the algorithm in order to make the transformations that optimize the code.
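To give a flavor of the intrinsics style (not a real multimedia kernel; a real one would also control alignment, blocking and prefetching), a hand-written SSE a*x + y over float arrays might look like this:

    #include <immintrin.h>
    #include <stddef.h>

    /* y[i] = a * x[i] + y[i], four floats per iteration with SSE,
       plus a scalar tail for the leftover elements. */
    void saxpy_sse(float a, const float *x, float *y, size_t n)
    {
        __m128 va = _mm_set1_ps(a);
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 vx = _mm_loadu_ps(x + i);
            __m128 vy = _mm_loadu_ps(y + i);
            _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
        }
        for (; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }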
For the other 99% of code it's hard (and definitely not worthwhile to try) to beat the compiler though. I sometimes watch programming streams where people go very deep into the compiler-generated code to analyze runtime behavior of their code, and it's almost scary to see the kinds of things compilers can do these days.
> For the other 99% of code it's hard (and definitely not worthwhile to try) to beat the compiler though.
Bernstein's thesis is that the other 99% of the code is so cold that it's not really worth running the optimizer, and my experience (and evidence from industry) suggests that he is right about that. In fact, you might be better off with an interpreter and a really compact byte-code.
Examples: Tcl used for "steering" supercomputer computations, Squeak Smalltalk doing fine for multi-media work with a simple bytecode interpreter + primitives, Objective-C, Ruby on Rails and Sinatra for a huge swathe of web-sites, the (incorrect) perception of Swift as "fast", even in Debug mode (without optimizations), Python/Numpy, etc.
Certainly in my experience it's about 90-9-1; where 90% of the runtime is "cold", 9% is "warm" (optimising compiler is good enough), and 1% is "hot". I'm literally writing some code right now where -O2 pushes us from "not fast enough in the average case" to "just about fast enough in the worst case". I fully accept that things might be different in Web Development, but are optimising compilers really that common in Web Dev?
>> [99% of all code is cold]
> in my experience it's about 90-9-1
I don't think such numbers can be meaningfully presented as representative data. They are probably true for certain specific application domains such as crypto or signal processing, if you squint at the data just the right way (i.e., by benchmarking a crypto library with a tiny driver program, not as part of a real application).
I did some profiling on the SPEC CPU 2000 benchmarks some years ago. I don't have the exact numbers lying around. But I recall large variation among the different benchmarks, with anything from 90% of execution time in the top function to the top 60% of execution time spread evenly (i.e., 20% each) over the top three functions. So some things certainly follow the patterns above, and others... don't.
I've also several times optimized programs until their profile was so flat that you could say that 100% of the code was lukewarm.
There's an impressive amount of resources poured into making Javascript faster; we often have to think about what the interpreter will do with the code we write in order to let it do some optimisations.
That thing at the link above easily does almost everything you described as unattainable for a compiler, and does it automatically. AFAIK, it even allows your code to be MPI-parallelized, with optimal communication.
Handmade Hero [1] mostly, and sometimes Jonathan Blow's compiler programming streams [2].
Not all of these videos are equally interesting, but there are some real gems in there, e.g. where Casey (of HH) hand-optimizes his bilinear bitmap interpolation code (somewhere between streams ~115 and ~120, if I remember correctly).
I cringe a lot at their coding style and the complete disregard for architecture and basic code hygiene they appear to have, but there's still a lot of interesting things to learn from watching these streams.
> ...if you want to absolutely wring the last clock cycle out of a hot path, you usually need good microarchitectural knowledge...
Could starting with compiler output for a given problem be a good idea for new programmers?
For example, I might take a complex problem I want to solve and then implement a series of solutions in multiple high-level(-ish) languages like C, FORTRAN, or even possibly Haskell. Once the implementations are correct, I get all languages to output assembly, and then spend a while poring over the output, possibly doing some further iteration on the high-level source code in the process.
My thinking here is that, the compiler is basically a gigantic database+graph of "use this instruction in this scenario" and a crazy complicated state machine to direct operations throughout the graph. Way too complex to read through... but I can make a compiler tell me the 90th-percentile+-most-relevant instructions to use by simply writing my program in that compiler's language.
Obviously, the standard caveats about compilers producing terrible output apply - I'd never recommend treating compiler output as normative or "good for educational use", just as a decent starting point - and something to have no qualms about surpassing.
I would not entirely agree. Some of the current optimizations are so advanced and specific for each particular bit of code that it might not be good to generalize from that output. I remember ICC generating different (but similar) code for each Intel CPU generation. I suppose it has to do with the decoder limitation (4-1-1-1, 4-2, 2-2, etc.) and exhausting execution port usage, but I never bothered to confirm this. It gets very complex after a while.
My question was along the lines of "is this a sane default or am I wandering into a minefield", and your answer is pretty much what I was looking for.
> Some of the current optimizations are so advanced and specific for each particular bit of code that it might not be good to generalize from that output.
If only it were possible for the compiler to concisely describe its "workings-out". I imagine compilers can generate trace information, but it's probably pages long.
> I remember ICC generating different (but similar) code for each Intel CPU generation. I suppose it has to do with the decoder limitation (4-1-1-1, 4-2, 2-2, etc.) and exhausting execution port usage...
That's amazing. Definitely keeping that in mind, thanks!
Is anyone else in games development here? If we are looking to run the game at 60 FPS, we have 16.67 ms per frame to do everything required to run the game. Because of this real time requirement, a decent amount of profiling is typically done on each game. I typically see that the frame time is getting split up over a whole array of different sections of the game, i.e.:
- Calculating skeleton animations (updating bone positions, sometimes skinning vertices too)
- Clipping geometry in the scene (finding out what things are inside/outside the camera frustum, etc.)
- Processing game logic, things like AI can be quite costly, so much is game dependent
- Walking through all the geometry that needs drawing and issuing draw calls
- Decompressing streaming audio and sending it to a sound driver buffer/queue
- Stepping the physics world (integrating positions/rotations and resolving intersections, etc.)
The difference between a non-optimized and an optimized build is often the difference between 5 FPS and 60 FPS, and optimizing a single hot file or function would not get the game running anywhere near 60 FPS. I think the idea that optimizing compilers aren't required is completely laughable, but then again I only have one perspective from the games development scene - maybe someone else will reply and say they make AAA games in C/C++ and don't need compiler optimizations :)
Those gains are often caused by globally applied optimizations like inlining and devirtualization.
These can then enable actual stream-processing and loop-chunking optimizations, which are important but of lesser effect.
Perhaps also dead-code and constant elimination, if you leave in flags.
In other words, if you rewrite the algorithms from object-oriented into stream processing, you would get most of the gains.
This is because CPUs, FPUs and GPUs are all, internally, simple numerical stream machines.
AAA dev here. The performance gains mostly come from designing data, not code. Factoring an array of structures into a structure of arrays and accessing linearly, e.g., makes better use of the processor cache and saturates throughput. Code optimization mostly gives us the confidence that there aren't unnecessary accesses or branches interleaved that might bust the cache.
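A toy illustration of that refactor (field names are made up, not from any real engine): if the hot loop only touches position.y and velocity.y, the SoA version streams two dense arrays instead of dragging whole structs through the cache.

    #include <stddef.h>

    /* Array of structures: every iteration pulls all eight fields into cache. */
    struct ParticleAoS { float x, y, z, vx, vy, vz, mass, age; };

    void update_aos(struct ParticleAoS *p, size_t n, float dt)
    {
        for (size_t i = 0; i < n; ++i)
            p[i].y += p[i].vy * dt;
    }

    /* Structure of arrays: the same loop reads and writes two packed arrays,
       which is friendlier to the cache and to auto-vectorization. */
    struct ParticlesSoA { float *x, *y, *z, *vx, *vy, *vz, *mass, *age; };

    void update_soa(struct ParticlesSoA *p, size_t n, float dt)
    {
        for (size_t i = 0; i < n; ++i)
            p->y[i] += p->vy[i] * dt;
    }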
> Factoring an array of structures into a structure of arrays and accessing linearly, e.g., makes better use of the processor cache and saturates throughput.
This is an example of a provably correct optimization, and it's a shame that more compilers do not optimize this situation. Wouldn't it be awesome to model the data the way that makes sense for your domain and let the compiler worry about cache locality?
Well, in C memory is explicitly laid out, so it's not on the optimizer's menu, and in a dynamic language the runtime flexibility rules out most data-layout optimizations beyond occasionally coalescing a dictionary into a memory-offset-based buffer. There are some experimental game languages like Jai which include these semantics as a language primitive, but it's still declared rather than implied. Fundamentally, data layout is design because it depends on actual game content, and not just code; it's categorically different than code optimization.
Jonathan Blow's Jai language lets you do this by annotating structs, along with some other assisted memory optimizations (pointer compression is another).
I suspect GCC could do it too for invisible or fully inlined symbols.
This is possible in C99 and C++11 because type punning is forbidden in general. But it would need a special flag to allow an incompatible internal ABI.
A bit of trivia: Computation is frequently pipelined so that part of the work is done during previous frame(s). This of course means that the game logic latency is longer than 1 frametime.
I often think about Proebsting's Law: Compiler Advances Double Computing Power Every 18 Years. Sure, optimizing compilers are nice to have, but maybe their complexity is disproportionate to their benefit?
I love the way Dynamo [1] is able to reproduce many of the benefits with a fraction of the complexity by doing some of the optimizations at runtime with simpler algorithms. Can we use this approach to "garbage collect" some of the complexity embodied in humongous projects like LLVM?
You're mentioning "disproportionate complexity" for current compilers and at the same time advocating for a runtime system modifying and caching the code on the fly? Ouch...
Note also that I don't believe that Dynamo "reproduce[s] many of the benefits [...]", since the paper you linked benchmarked Dynamo running on top of -O2-generated code (from a ~20-year-old compiler). It isn't clear whether they were picking low-hanging fruit that static compilers catch nowadays. Also, running the same thing on top of a PGO+LTO build may be a fairer comparison.
The exciting possibility, from my perspective, is that many optimizations may be simpler to implement dynamically (JIT) than statically. So perhaps you can have 10% of the compiler code to get 90% of the benefit. I see this as the basic premise of LuaJIT.
Motivating example: The CPU can predict whether branch instructions will be taken with uncanny accuracy. This is achieved using simple dynamic heuristics. I believe it would be much more challenging for the compiler to predict these branches statically.
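For comparison, the static side mostly amounts to hints like the one below (a sketch using GCC/Clang's __builtin_expect; the likely/unlikely macros are the ones popularized by the Linux kernel, and parse_packet is a made-up placeholder), whether written by hand or fed back from a profile:

    #include <stdio.h>

    /* Tell the compiler which outcome is common, so it can lay out the hot
       path as the fall-through case; a dynamic predictor learns the same
       thing from branch history at run time. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    static int parse_packet(int id) { return id >= 0; }  /* placeholder */

    int handle(int id)
    {
        if (unlikely(!parse_packet(id))) {
            fprintf(stderr, "bad packet %d\n", id);
            return -1;
        }
        return 0;  /* the common path */
    }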
I would be interested in a comparison of modern static vs dynamic predictors. Do you know of one?
The misprediction rates that I see in practice are much lower than the ~20% that they are talking about in this paper. For example, gcc compiling sqlite3 misses only ~3.8% of branches on my Haswell processor (see below.)
I attribute this improvement to dynamic predictors using more information like recent history, return stack context, loop counter value, computed branch address, etc. This information seems beneficial and also challenging for a static predictor to use.
$ perf stat -e branches,branch-misses gcc sqlite3.c
12,212,167,816 branches:u
463,727,878 branch-misses:u # 3.80% of all branches
That again. People jump right to static versus dynamic. Best to mix static and dynamic analysis to get best of both worlds. It was already done in both whole-program optimization and security to get some great benefits. My classic default is to feed the common case as test runs into an instrumented version of the system to get execution traces and measurements. That is combined with static analysis to provide better transformations or scheduling. Also, this is a high-level description of a solution that applies to many problems rather than just compiler components.
Got a bit confused here. I was thinking about a very old piece of software I used in college for my systems dynamics class. Completely different (unless I am mistaken), but it still brings back painful memories of that class.
It should be quite easy to see the value of optimizing compilers. Compile your program with optimizations turned off. Now make it as fast as your release build again, while still keeping them off. For much of my code, I think this would take years.
You may be significantly underestimating your ability to get phenomenal gains from algorithmic improvements. I think DJB would argue that the compiler optimizations are insignificant in comparison to those improvements for a very large portion of hot code.
Of course, even if I'm correct you could turn the optimizations back on and be faster still. One reason why you might not want to do that is the cost of increased miscompilation.
" I think DJB would argue that the compiler optimizations are insignificant in comparison to those improvements for a very large portion of hot code."
Except he has no data to show this, and every piece of data I have (and my colleagues have) says that he is wrong.
Additionally, the compilers can and do replace algorithms if users let them. Most users don't want it, even if it makes code faster.
Remember that the vast majority of people want software that works and is reliable, or has fast development cycles, or a million other things, and happily choose that over software that is fast. Compilers changing algorithms on you isn't highly compatible with this, especially when, for example, most programming languages don't even give you a good way to express the invariants you are trying to maintain (except good ol' Ada).
However, besides that, certainly you can see that in the end the people lose. It's like arguing that computers would never win at Go.
If I really really really cared about some piece of hot code, I wouldn't pay an expert to optimize it, I'd throw the cycles of 100k spare CPUs at optimizing it.
That is also the other strange part of his argument to me: he assumes the capabilities of compilers based on the compilers he sees, which are meant to operate pretty much single-threaded on single machines, and complains "they will never beat people".
It's like looking at a machine learning algorithm that takes months or years or roughly forever to train on a single CPU and saying "it'll never do a good job". The world is not that small anymore.
It's blindingly obvious the optimizing compilers win if we want them to.
In any case, this particular debate will restart again as GPUs and accelerators follow the same old compiler cycle.
> Remember that the vast majority of people want software that works and is reliable, or has fast development cycles, or a million other things, and happily choose that over software that is fast.
People want software that is fast; slow (and bloated) software frustrates normal people, they just can't articulate it or identify the correct source of their frustrations. As I said in another post just yesterday, you'll often hear people say everything from "god I hate Microsoft" to "I think I need a new machine" to "I might have a virus". Once you dig into the problem it's often just a particular app (or bloated website) that is slow, or something else on the machine eating up way too much RAM and causing everything to swap.
I'd love to run an experiment where group A has native apps written in a low level language and group B has entirely electron based ones. I'd bet my house that group B has a hell of a lot more general complaints.
If you look at modern JIT compilers such as TruffleRuby, then yes, it will change algorithms, e.g. changing sort algorithms depending on data size: if the data is small enough it will do a bubble sort on registers, when it's big it will do a TimSort on heap data, etc. [1]
Exactly, that's two completely different things. It's akin to a compiler using the famous "switch statement becomes if-else chain if less than 7 items, otherwise lookup table".
The parent posts are referring to macro-level optimisations, where large changes to how a system works are implemented, in order to get massive gains in performance. Picking out the essence of the larger implementation would be a difficult task for a compiler, but more importantly it wouldn't have enough information about the context of the application to make optimisation decisions. Things like: do we need to read from that file each time we need this piece of information, or is it a once-off that can be cached? Unless this contextual information is expressed somehow (and it basically never is), only the programmer will be able to make the change.
Sure, and at this level, I would also argue that compilers are as good as humans at optimisation (likely better than almost all, actually). These arguments were constantly going on in the 80s and 90s, where (assembly-enthusiast) people vehemently argued that machines (C compilers) will never optimise things as well as humans. While the argument still exists in some form today, it's certainly died down a lot.
That said, in my own limited experience, everyday optimisation problems tend to exist at a much higher level, on the high-level approach or "what are we doing" level. Perhaps these are "obvious" to some, or below the level of discussion here. But essentially, a compiler can optimise away endlessly at a piece of code, but will never beat code that shouldn't exist in the first place. My comment above was that a compiler has insufficient information to make decisions about what's truly wasteful or useless.
As an example, no compiler today will come up with a Courgette update [1] by itself. And the day it does, I think we can pack up our bags and go home (hopefully in a nice, comfy retirement kind of way).
> These arguments were constantly going on in the 80s and 90s, where (assembly-enthusiast) people vehemently argued that machines (C compilers) will never optimise things as well as humans.
I still argue this way. But these kinds of optimizations are tedious and time-intensive (thus costly). Also dependent on the architecture, i.e. when some extensions (say new SIMD instructions) are added to the instruction set, the compiler sometimes is able to use them automatically (though typically in a very sub-optimal way), while hand-optimized assembly code has to be rewritten (costly). Also if you port your program to a new CPU architecture (say Intel -> ARM or ARM-A32 -> ARM-A64) hardly anything can be reused. Finally tight hand-optimized code tends to be much harder to add features to than more high-level code (say, C code).
So I believe these vehemently arguing people are right (and have always been). But this does not contradict the fact that in many cases a very suboptimal code generated by the (say C) compiler is often fast enough and C is much more economical to use.
I agree with this, reading it I feel once again the limitations of written communication forms.
However, the top level discussion is about is it worth using an optimizing compiler. Or in my example is it worth using TruffleRuby instead of MRI. I would argue yes because that optimizing compiler gets you 10x more perf per cpu.
I mean that's basically constant folding on steroids, in this case taking advantage of the fact it knows the array has three elements, and that we only care about entry 1. Which is impressive, really impressive, but it is fundamentally limited. This approach won't ever be able to take your bubble sort implementation and turn it into a quick sort or merge sort. In general I don't think this sort of optimization will lead to asymptotic improvements in runtime, excluding certain edge cases.
Which isn't to say these optimizations aren't important and can lead to very real improvements in speed, I just wouldn't consider it algorithm rewriting in the strictest sense.
As the last paper says:
"e. The development of techniques for automatic recognition of computational kernels such as
inductions, reductions and array recurrences has been an intensive research area in the scope of
compiler technology during the 90’s."
If you want a simpler example, all of your compilers are usually replacing your divisions by a different algorithm :)
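E.g. a division by a constant almost never survives as a division. Roughly, for the unsigned divide-by-10 case (a sketch; the exact magic constants and shifts vary by compiler and target):

    #include <stdint.h>

    /* What the programmer writes: */
    uint32_t div10(uint32_t x)
    {
        return x / 10;
    }

    /* Roughly what -O2 emits instead: multiply by a "magic" reciprocal
       (0xCCCCCCCD = ceil(2^35 / 10)) and shift; no divide instruction. */
    uint32_t div10_as_compiled(uint32_t x)
    {
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    }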
It replaces your sum reductions, recurrences, etc.
Seriously, these are really really old compiler problems.
Compilers do the parts people care about, and will often replace the algorithms that are usually "slow", like math, or have well defined semantics that won't cause api breakage for you. Compilers replace memcpy, printf, etc, all the time.
If you want a compiler that replaces your much higher level algorithms, as the papers show, they have existed over the years. When the research papers are scanned copies, it's old tech :P
But the truth is that, for example, a lot of common languages (C++, etc.) are not good languages for that. It's not about compiler technology. I can teach a compiler to recognize your sorts, for example, pretty easily. But proving I can legally change it and not destroy the program's semantics is often literally not possible. You'd have to tell me "it's okay". It's only stuff that is well specified enough (like standard library functions) that I could do.
Think about trying to swap STL data structures for some code. We can recognize when it is possible. We can even determine when it is profitable, and do it.
Usually, the problem is it destroys the API you've built to do so (IE you pass it around as an argument), etc.
and in the end, it's usually not worth it for those languages to do it in the compiler. It's infinitely more effective to do something like http://lists.llvm.org/pipermail/llvm-dev/2016-April/098355.h... and tell people "you should switch this datastructure".
At least, for these languages.
But hey, if you want the compiler to duplicate and inline all the functions and swap out the algorithms, it could!
It's just that, for example, C and C++ are not languages that lend themselves to compilers doing it; they're languages that lend themselves to telling humans to do it (or at least, asking them if they want it done and then doing it).
This is obviously all easier in functional languages, and they actually have a fairly rich history of doing algorithm replacement :)
> replacing your divisions by a different algorithm
No they're not. I don't actually specify the algorithm to be used for division, I just write "/" (usually), and so the compiler (or runtime or whoever) is certainly free to implement that "high level" specification (I want these two quantities divided) in any way it sees fit.
I'm really unsure why you want to be this pedantic, but okay.
You are making a lot of assumptions here.
My point is that a bunch of them will idiom-recognize division algorithms and change them. So yes, yes they do!
I am being precise where you are incredibly sloppy.
> You are making a lot of assumptions here.
Really? Which assumptions am I making? Try to be precise.
Considering that you are calling everyone else idiots who don't have any data, the fact that what you write (a) has no data to back it up and (b) is usually trivially, obviously and comically wrong doesn't exactly help your argument.
> ...will idiom recognize division algorithms and change them
Really? Which ones? How is this useful? How many real-world programs actually code a division "algorithm"? I don't think I've seen that once in the last 20+ years or so, but of course YMMV.
UPDATE: Last I remember, "idiom recognition" was invented for the sole purpose of cheating on industry standard benchmarks, for example replacing the Dhrystone with a computed result. Has this changed?
As a very simple example see if you can spot the loop in the optimised version of a sum to n function[0] (hint: you can't). I don't know to what extent this scales up to more complex programs, but it is at least possible in principle.
UPDATE: For those who don't want to do the exercise: the Intel compiler (icc) optimizes it in about the way that I would if I were to write it in assembly.
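Roughly, the effect (sketched at the C level rather than as the actual generated assembly; the exact signature in the linked example may differ): the loop disappears in favor of the closed form.

    /* What the source looks like: */
    int sum_to_n(int n)
    {
        int s = 0;
        for (int i = 1; i <= n; ++i)
            s += i;
        return s;
    }

    /* What the optimizer effectively emits: the induction/reduction is
       recognized and replaced by n * (n + 1) / 2, built from a multiply
       (or lea) and a shift; the exact instruction selection is
       target-dependent. */
    int sum_to_n_folded(int n)
    {
        return n * (n + 1) / 2;
    }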
Which is slower on non-Intel and older CPUs. Add -march=native next time. Nets you a vector version.
If you disable SSE it generates essentially what icc does, with less reliance on µcode.
See, rep is a workaround for older AMD CPUs' branch predictors. In addition, gcc does not use the highly opcoded variant of lea with offsets because it is slow on older Intel CPUs.
The GCC version also makes for much better speculative execution, due to the use of the adder.
> Which is slower on non-Intel and older CPUs. Add -march=native next time. Nets you a vector version.
Accepted (though the central advantage of SHLX (with -march=native) and SHL (without -march=native) for this example lies in the greater flexibility of register parameters). In my defense, I have only had a computer whose processor supports BMI2 for a few months now - so I could not play around with BMI2 before. Otherwise I am sure I would have known such tricks.
> See, rep is a workaround for older AMD CPU branch predictor. In addition gcc does not use the highly opcoded variant of lea with offsets because it is slow on older Intel.
I know that. My personal code philosophy is to avoid such hacks for circumventing performance bugs in outdated processors.
> the compilers can and do replace algorithms if users let them
I think idiom recognizers are worthy of mention. That said, being manual pattern recognition, they are probably going to lose their edge eventually to things like LLVM's Souper or other superoptimizers (exhaustive or predictive).
Ada has a rich set of language features to express preconditions, postconditions, and invariants directly in the code, and then have those logical statements available to the compiler so that it can assume certain facts about the program's execution when optimizing.
Also, the type system lets you specify legal values with more precision than the word-based types that most languages inherited from C. For example, you can have range types that may hold integers of only a certain range, and the compiler will enforce that when assigning to variables of the type, or throw an exception if an arithmetic operation results in a value outside the range (you also have modulus types that explicitly give the wraparound behavior familiar in C). Containers allow generic constraints, and you can explicitly specify the dimensions of array types.
>> compiler optimizations are insignificant in comparison to those improvements
> he has no data to show this
Yes he does. And so do countless other people. For example, I just gave a talk where I describe an afternoon optimization session giving me a 1000x and an 8000x improvement for a data-conversion and a query task, respectively.
Was the compiler helpful? Probably, I didn't measure with and without. But the vast majority of the improvement was not the compiler. And I don't see how the compiler could have come up with the changes (moving from a layered architecture of Model Objects -> ORM -> SQLite to a hexagonal architecture with an optimized data representation and completely different access procedure for the one data type that mattered).
And so on and so on. It is also interesting that in none of these cases did I revert to assembler, so I think that part of DJB's thesis is not entirely correct, but only in being too specific: hand-coding assembler is just one part of hand-optimizing, and very often it is not the or even a crucial part (though in security algorithms it might very well be).
But the idea that optimizing compilers are getting squeezed at both ends is, IMHO, 100% correct. On the one hand, so much code is so incredibly cold, see Ruby on Rails web sites or most UI programming. RoR can do maybe 10-100 requests/s. It's so slow it's comical. However, most sites would be happy to receive 100 hits per day, so "comically slow" is fast enough. Or when I set the end-parameters for an animation that is going to take half a second, it wouldn't matter if the code doing that were run on a LISP interpreter badly written in Ruby. Or why hasn't PyPy completely replaced Python? Why was TCL adequate for steering supercomputer workloads?
On the other hand, the performance critical parts have gotten more critical. That means that the "make everything a bit faster, but without any guarantees" that optimizing compilers are great at providing is simply not that useful any more for a lot of workloads.
And yes, there are workloads where it is useful, and it seems that you have some of these workloads. That doesn't change the broader picture.
I usually express this as "No compiler will turn your DFT into an FFT" but of all the examples I could pick, that's the one I'd expect someone to prove me wrong with.
"Yes he does. And so do countless of other people. For example, I just gave a talk where I describe an afternoon optimization session giving me a 1000x and and 8000x improvement for a data-conversion and a query task, respectively.
"
Please be concrete about what exact data you think he does and countless others have.
"Was the compiler helpful? Probably, I didn't measure with and without. But the vast majority of the improvement was not the compiler. And I don't see how the compiler could have come up with the changes (moving from a layered architecture of Model Objects -> ORM -> SQLite to a hexagonal architecture with an optimized data representation and completely different access procedure for the one data type that mattered)."
First, you've actually said "I have no idea if the compiler would have fixed this, because I didn't run it". That means what you have is not actually data; it's an assumption. If you want to prove it out, that's data. You actually have to do the science if you want to say you've done the science.
Second, Compilers do actually come up with such optimized data representations, memory layouts, loop walking, you name it (see, for example, https://hal.inria.fr/hal-01425546)
Had you written it in a language that allowed for semantic changes of the kind you are talking about, we could make an optimizing compiler do it.
15 years ago I saw a compiler that did automatic splitting and distribution of COM objects to split up a game to run half on a server and half on a client.
We can do this stuff. You'd just have to use something other than C/C++.
Which is why more and more people are moving towards those kinds of languages every day, and why more and more invasive "extensions" to C/C++ get made every day.
Your particular example used to be the domain of dataflow, ETL, and stream compilers, all of which, yes, could reorganize your entire hierarchy to be better if you let them.
As mentioned, this isn't about the capabilities of optimizing compilers, but about "what tradeoffs am I making in choosing my tools and languages".
I don't think it's reasonable to, for example, program in qbasic and then say "optimizing compilers suck at reorganizing my data layout". Well, yeah.
"On the other hand, the performance critical parts have gotten more critical. That means that the "make everything a bit faster, but without any guarantees" that optimizing compilers are great at providing is simply not that useful any more for a lot of workloads."
These are super interesting and I would love if you submit to HN a cool video, slide deck, blog post or paper about systems that optimize a large task across machine boundaries!
But normal people consider the output of a compiler to be a single binary (or bytecode meant to run in a single possibly-forking process) and the optimizations the compiler is allowed to perform in the "as-if" principle do not change the output of strace of the resulting binary.
I would be very surprised if a compiler ever changed the externally-visible IO operations my program performed, even if for the better. What I call a compiler cannot change the file format or the serialization format or the indexing or the way data is written to a socket or things like that. It can only change computations and memory accesses between syscalls as long as the same syscalls are performed.
For systems where the task is defined at such high level that choosing the file format is up to the "compiler", I would want to use a different name than "compiler".
> First, you've actually said "i have no idea if the compiler would have fixed this, because i didn't run it"
No I did not. Please don't distort what I wrote (which appears to be the only way you have of 'making' your point). What I did write was "I didn't measure with and without", meaning I didn't do the cross-check of what the total improvement would have been without the normal compiler optimizations, because both the original and the final version were run with standard -Os optimizations, and because I have a fairly robust intuition about the bounds of the effect.
For shits and giggles, I now did the cross check. The difference between -O0 and -Ofast was in the noise. Sometimes the -O0 was faster, sometimes the -Ofast was faster, slightly skewing towards -Ofast and with times varying by around 10%.
So let's be generous and assume that it wasn't noise, that it was 10% in the direction of the -Ofast option. Heck, let's double it to 20% just to be on the safe side. That's 20% vs. 1000x. Compared to the total, the contribution of the compiler optimizations vs. the manual optimizations is 2:10000. Or 99.98% vs 0.02%. Compare Proebsting's Law.
The rest of your post is full of "had you" and "might have" and "could have". And you berate others for not delivering data. Hmm... And then you go off and cite research that did something that seems vaguely keyword-related to the things I described, but is actually nothing at all like it.
CSV -> ( Optimized binary data model ) -> fwrite()
And now tell me what concrete, existing optimizing compiler (production preferred, research might be OKish) is going to do that transformation and get me my 1000:1 performance improvement. No "there was a research compiler that did some data layout transformations somewhere", but concretely, a compiler that will take that first solution and automatically generate the second solution with the 1000:1 perf. improvement.
Please also take into consideration that the CoreData ORM and SQLite DB are not available in source, but only via shared library / API.
If my grandmother had wheels, she'd be a bus. Starting an argument (or a logical deduction) with a false premise is not exactly useful. Sorry, if your idea of the power of optimizing compilers is "you need to first rewrite the code", then I can just rewrite the code to be fast and leave out the middle-man. Your argument here is ridiculous beyond the pale.
> 15 years ago I saw a compiler...
And I saw a UFO. This is your strong "actual data"? Yes, research compilers can sometimes do funky stuff in research settings. Same for JITs. I very much remember that special issue of the IBM system's journal on Java performance. It featured a bunch of reports from the researchers touting performance numbers, and then one real world report from the San Francisco project that reported their performance "challenges". See Jitterdämmerung: http://blog.metaobject.com/2015/10/jitterdammerung.html
> ...that did automatic splitting and distribution of COM objects to split up a game to run half on a server and half on a client.
And how is this relevant to the example at hand, except in "proving" (which it doesn't, by the way) that compilers can do crazy transformations (which so isn't the point)? Was this a production compiler? Was it general purpose? Is it still around today?
The point is not what compilers can do. It is how useful that is in practice. And yes, there exists code outside of Google data centers.
> ...qbasic and then say "optimizing compilers suck at reorganizing my data layout"
Nobody said that, except you. You need to stop making up straw-man arguments that you can then demolish, but rather deal with actual facts.
It seems awfully weird to rebut someone who says they're doing large-scale optimization on specialized cluster compilers with better languages than C/C++ by compiling something first with -O0 and then with -Ofast.
Who said I am rebutting whatever he says he is doing?!?
I am rebutting his absurd claim that compilers can and more specifically do transform the code I hand-optimized to the tune of 1000x / 8000x by doing a top-level architectural transform. They can't and don't. In this particular case, the optimizer's contribution was trivial (actually less then I would have expected). And of course DJB gave a bunch of other examples as well, and both he and I (not that I am putting myself on the same level, not by a long shot) and many others have seen many other cases.
He keeps saying they do. He doesn't say your compiler does. He says there exist production compilers that do. Which, again, makes your attempt to rebut him with gcc a little weird.
And he is wrong. He alludes (at most) to compilers that do things that are at best superficially-keyword related to what I've described. And that is being very, very generous.
I have given the example and made available the source code. If you have a compiler that can do this transformation, show me.
And to reiterate, the transformation is to take a system that uses an SQL database via an ORM, both not present as source code but via an OS-level ABI, and convert that to not use either the SQL database or the ORM, and instead use a packed binary representation that takes domain knowledge not available to the compiler (certain fields in the input CSV are hours and minutes, but the hours go to 30 because of midnight wraparound), and use that knowledge to create an indexing structure based on double array lookup into buckets, and a bucket sort to create those buckets. And use fwrite() for persistence (instead of SQLite) and mmap() to read the data in (instead of SQLite) because the structure was purposely created in such a way as to (a) be contiguous and (b) not contain any pointers.
Go. I'll wait.
And yes, the burden of proof is on the person why says compilers do this. And yes, the task is not to transform the program the original authors (not me) should have written into the optimized version, but the program they did write.
Oh, another example is the BBC one, where the start was a distributed system with around a dozen or two machines running around 100 processes, some talking go each other via HTTP, or shared folders, or via a shared Oracle DB. Or some combination of these. The replacement was a single process event-poster using in-process REST that was between 100-1000x faster.[1][2] Again, show me the compiler that will do this.
Go. I'll wait.
> He doesn't say your compiler does.
So "no true scotsman", ey? Anyway, I was always talking about real world examples, and real world examples are a combination of a real-world problem with a real-world compiler. Just as DJB's argument (and my similar argument for JITs, see "Jitterdämmerung"[3]) is at least in part about the real world not matching the expectations of the JIT/optimizer engineers.
In theory, all this works super swimmingly and is totally amazing, and yes you can construct benchmarks where it does. In practice it is less useful, at least for the vast majority of users, because either they don't need it or they need more. And this is the crux of the misunderstanding: the point here is not that compiler's can't do certain things (they can), but that those things are, for the most part (not universally) either: (a) not needed (RoR, ...) or (b) not enough (my examples).
Which is why the counter of "they do" and giving various examples of compilers doing amazing things is completely pointless, as it only show you're completely missing the point.
Now there obviously are specialized use-cases where this is not true, where all of this actually works and those capabilities match the requirements. While I have no insight into Google's operations, my mental model is that theirs is one of those specialized use cases, which is why DannyBee is so adamant that this all works perfectly. In his world, it does. And it's super-sophisticated, super-cool rocket science and he is a super-smart engineer.
However, not everyone is Google. In fact, I contend that Google is very much an outlier in terms of requirements and capabilities.
> He says there exist production compilers that do.
And he hasn't shown a single one that does, partly because he so far hasn't even grasped what "do" in this case is (and continuously transformed my statements into straw-men), which is why he has thrown out this barrage of unnamed compilers ("I hear things", "I see things") that do random stuff that is at best marginally related to what I've described.
This is a lot of words repeating the same things you've said previously. There's no Scotsman in these arguments. He runs compiler infrastructure at Google and says he's seen production compilers doing stuff you say is impossible. You say, essentially, he's lying. Do you have anything else to say?
Yes there is: "ohh, gcc, that's not a true optimizing compiler". And of course so far refusing to show any other real compiler that does all these magical things.
> He runs compiler infrastructure at Google
Yes, I know. So? Argument from perceived authority much?
> says he's seen production compilers doing stuff you say is impossible.
Actually, I haven't seen him claim that. He has claimed all sorts of things that he has seen optimizing compilers do that have very, very tangential (at best) similarities with some of the keywords I mentioned, and has claimed that as a rebuttal to my claim, which it is not.
> You say, essentially, he's lying.
If he actually claims what you say he's claiming then: yes, he's lying. I haven't seen him claim that, but I have seen him use these fairly random claims to rebut some very specific claims I have made.
And of course, if there are compilers that do exactly what I've described (transform the two systems I mentioned in the way I described), then I am absolutely happy to eat crow. While anything is possible, in theory, this would be so far beyond the current state of the art, that we could probably retire >90% of programmers right now (assuming that a small fraction of that capability were also applied to other fields of programming). And seeing the consumer-visible parts of Google, those compilers would be so super-secret that they are not used on any of that stuff.
The code is out there, somebody run it through the compiler and prove what a dunce I am.
> Yes there is: "ohh, gcc, that's not a true optimizing compiler".
I searched this whole thread for mentions of GCC, and I didn't find anything like this. Are you referring to the part about having to use a different language than C or C++ if you want this kind of optimization?
(I agree with you that "look, there is an old paper describing a research compiler in a research setting on self-chosen benchmarks" is not the same as production compilers doing the same thing in practice.)
>Except he has no data to show this, and every piece of data i have (and my colleagues have), says that he is wrong.
That's because those data only represent quantitative statistics on code, not whether said code would get great benefits if the programmer actually bothered to make algorithmic adjustments.
I think one of the issues here is what compiler optimizations encompass. In a language like C or C++, you may be able to write optimal algorithms but still end up with poor performance because of memory access. An unoptimized compile puts almost everything on the stack: a = b + c involves two loads, one add, and one store. All of that extra memory access is going to kill performance, even if you have optimal data structures and algorithms.
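A tiny illustration of that point (the assembly in the comment is approximate x86-64 and will vary by compiler):

    /* At -O0 every local lives on the stack, so a = b + c round-trips
       through memory; at -O2 the same expression is a single register op.

         -O0, roughly:                     -O2, roughly:
           mov eax, [rbp-8]    ; load b      lea eax, [rdi+rsi]  ; b + c
           add eax, [rbp-12]   ; add c
           mov [rbp-4], eax    ; store a                                  */
    int add(int b, int c)
    {
        int a = b + c;
        return a;
    }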
I see "optimizations are bad/unnecessary/problematic/whatever" in my job writing a compiler. The underlying issue is almost never "I don't want optimizations" - even when someone thinks that's what they want; it's "I don't want optimizations that produce unexpected behavior".
> You may be significantly underestimating your ability to get phenomenal gains from algorithmic improvements.
Implementation differences in the exact same, say, simple linear algorithm can still add up to a tenfold difference. Compiler optimisations can take it further, though some are not generally applicable. E.g. I have had good experiences with PGO (careful analysis and/or knowing your workload is needed), but it's obviously not an option for the typical open source software distribution model.
Obviously, combining a better algorithm with a better implementation that is optimized will be fastest.
> You may be significantly underestimating your ability to get phenomenal gains from algorithmic improvements.
My HPC lecturer used to say that the speed difference between bad Matlab code and good Matlab code will often be greater than the speed difference between good Matlab code and good Fortran code.
Tweaking the hell out of your code at the assembly level is a waste of time if you're using a naive O(2^N) algorithm.
> You may be significantly underestimating your ability to get phenomenal gains from algorithmic improvements.
Absolutely but often it's very hard to make those changes or those changes lead to a loss of accuracy, whereas compiling with -O3 is very easy and leads to no loss in accuracy (in the common case).
One of the things I hate about meld: it is a great diff tool most of the time, well within the "0.1 sec" budget. But the moment I diff a larger project I cannot even scroll its directory tree without lag, and that is after waiting for it to load. The "it is just 0.1 sec" argument doesn't scale at all.
Python performance is, uh, ah... difficult. I work on a piece of software that has a lot of hot paths in Python code, and it's not nice. Combined with deployment and tooling issues, it has made me wonder quite a few times now whether the slightly lower contribution barrier is worth all these hassles.
Cython is a very nice way to transition hot python paths into a C extension. The best part is that it's a python superset, so it should provide some speed up without any typing information (and can be very fast once the hot paths are all in cdef'd functions).
Further, it doesn't help with the required restructuring: when the functionality is scattered across a dozen or more modules (complex software + good Python practice), you first have to pull together the functionality needed for the path you're trying to make fast. That leads to duplication, removing abstractions, and most certainly vetoes from other people.
> It should be quite easy to see the value of optimizing compilers. Compile your program with optimizations turned off. Now make it as fast as your release build again, while still keeping them off. For much of my code, I think this would take years.
The problem is that, say, C does not allow you to express the necessary details about the optimizations that you want done. This is exactly something DJB calls for in programming languages.
IMO he couldn't give a convincing answer to the guy who asked about the LuaJIT author being out of a job. But there's a clear answer. JIT authors are not out of a job, not because optimizing compilers aren't dead, but because they're writing compilers at all: their distinguishing ability is producing "pre-compiled" code.
You might say, "well a JIT author sped up your code's execution so he/she is writing an optimizing compiler". Well you have to realize that, traditionally, JIT authors don't just translate the code into object code, they also apply these things called "compiler optimizations". The point is that if they didn't do that, and simply produced a faithful translation of the code, they would still make the code faster because of pre-compilation (and if they enabled the "compiler optimizations", the code wouldn't run significantly faster than the simply pre-compiled code).
Regardless of whether I agree with it or not, "Optimizing compilers are dead" is not the same as saying "JIT authors will be out of business". (Even compiler writers won't be out of business).
Your suggestion is that a templating JIT that just drops in some machine code that matches the method, doing no optimization, would get all the win. Such a compiler is indeed much faster than an interpreter, but it's nowhere close to an optimizing compiler. Mike Pall, the author of LuaJIT, would be very surprised if you suggested that his compiler performed similarly to something simple like that.
Actually, LuaJIT 1.x is just that: a translator from a register-based bytecode to machine code using templates (small assembler snippets) with fixed register assignment. There's only a little bit more magic to that, like template variants depending on the inferred type etc.
You can compare the performance of LuaJIT 1.x and 2.0 yourself on the benchmark page (for x86). The LuaJIT 1.x JIT-compiled code is only slightly faster than the heavily tuned LuaJIT 2.x VM plus the 2.x interpreter written in assembly language by hand. Sometimes the 2.x interpreter even beats the 1.x compiler.
A lot of this is due to the better design of the 2.x VM (object layout, stack layout, calling conventions, builtins etc.). But from the perspective of the CPU, a heavily optimized interpreter does not look that different from simplistic, template-generated code. The interpreter dispatch overhead can be moved to independent dependency-chains by the CPU, if you're doing this right.
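To make the general shape concrete (this is not the real LuaJIT interpreter, which is hand-written assembly; just a generic sketch of replicated dispatch using the GCC/Clang computed-goto extension, with made-up opcodes):

    #include <stdint.h>
    #include <stdio.h>

    /* Each handler dispatches directly to the next opcode's handler instead
       of returning to one central switch, so the CPU gets a separate
       indirect-branch history per handler and can overlap the dispatch load
       with the handler's own work. */
    enum { OP_INC, OP_DEC, OP_HALT };

    int64_t run(const uint8_t *pc)
    {
        static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
        int64_t acc = 0;

        goto *dispatch[*pc++];
    op_inc:
        acc++;
        goto *dispatch[*pc++];
    op_dec:
        acc--;
        goto *dispatch[*pc++];
    op_halt:
        return acc;
    }

    int main(void)
    {
        const uint8_t prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
        printf("%lld\n", (long long)run(prog));  /* prints 1 */
        return 0;
    }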
Of course, the LuaJIT 2.x JIT compiler handily beats both the 2.x interpreter and the 1.x compiler.
Article: "We can also refute Bernstein’s argument from first principles: the kind of people who can effectively hand-optimize code are expensive and not incredibly plentiful."
Commenter: "IMO he couldn't give a convincing answer to the guy who asked about LuaJIT author being out of a job."
Guy in audience: "I was that guy in the audience."
LuaJIT author: "Actually, LuaJIT 1.x is just that"
Voice in my head: "Aspen 20, I show you at one thousand eight hundred and forty-two knots, across the ground."
Meta: Apologies for the abstract response, but I couldn't figure out a better way to present the parallel. It can be hard to explain artistic allusions without ruining them. What I mean to say is that this pattern of responses reminded me in a delightful way of the classic story of the SR-71 ground speed check: http://www.econrates.com/reality/schul.html
I'm impressed (as usual with your work) that you're able to get that level of performance from an interpreter, although as you note it's not an apples-to-apples comparison. I wonder what you'd get from the 2.x VM design with a templating JIT compiler.
But, as you said, my main point is your last sentence -- optimizing compilers like LuaJIT 2.x are impressive and necessary.
Having used some languages with awful compilers, I can say that compiler optimisations let me write cleaner code.
In languages with bad optimisers I have to worry about separating code in a hot loop out into a function -- the cost of a function call is too high. This one in particular I find can lead to some horrible code, as functions grow larger and larger and lots of cutting+pasting happens to avoid function call costs.
On a smaller note, there's making sure I cache the values of function calls which won't change -- when instead I could trust the compiler to know the value won't change and do the caching itself.
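Something like this, just to make the pattern concrete (vec_len and process are made-up placeholders): with a good optimizer the first version costs the same as the second; with a bad one you end up writing the second by hand.

    #include <stddef.h>

    /* Made-up helpers so the example is self-contained. */
    static size_t vec_len(const int *v) { (void)v; return 1000; }
    static void process(int *v, size_t i) { v[i] += 1; }

    /* Trusting the compiler: with these definitions visible, vec_len() gets
       inlined and hoisted out of the loop condition, so nothing is
       recomputed per iteration. */
    void update_trusting(int *v)
    {
        for (size_t i = 0; i < vec_len(v); ++i)
            process(v, i);
    }

    /* The manual caching you resort to when the optimizer can't be trusted. */
    void update_manual(int *v)
    {
        size_t n = vec_len(v);
        for (size_t i = 0; i < n; ++i)
            process(v, i);
    }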
There is, but in practice just normal inline, or even trusting the compiler to decide what to inline itself, are just fine.
The problem comes when you are using other languages (scripting ones like Python come to mind) where inlining is but a dream (ignoring systems like PyPy).
> If an optimizing compiler can speed up code by, for example, 50%, then suddenly we need to optimize a lot less code by hand.
This doesn't follow at all. If you had one hot loop and a bunch of cold code, and auto-optimize your code to be a measly factor of 2 faster, you're still going to need to hand-optimize the hot loop and what it does to the cold code is irrelevant.
> hand-optimized code has higher ongoing maintenance costs than does portable source code; we’d like to avoid it when there’s a better way to meet our performance goals.
True, but again, only applies if you can optimize by enough to make hand-optimization unnecessary.
> we’d also have to throw away many of those 16 GB phones that are cheap and plentiful and fairly useful today.
This part is nonsense. No-one's got anything like 16GB of code on their phone.
Optimization could be valuable but current compilers are too opaque, making optimization too much of a black art. I believe we need to do something along the lines of "turning the database inside out" ( https://www.confluent.io/blog/turning-the-database-inside-ou... ); we should turn the compiler inside out, build it as more of a library, give the developer more insight into what's going on, have a high level language that lets you understand how it compiles. Interesting and vaguely along the same lines: https://www.microsoft.com/en-us/research/publication/coq-wor... .
> auto-optimize your code to be a measly factor of 2 faster, you're still going to need to hand-optimize the hot loop
Huh? You start with performance goals. If compiler-optimized code meets your performance goals, you are done. You do not "need to hand-optimize" in that case.
Why would a "factor of 2" not be good enough? What is it compared to? What makes you so sure that you must optimize further, disregarding any possible context?
Additionally, if a factor 2 improvement by the compiler on top of a factor 4 improvement by using better algorithms and data structures can give you an 8x improvement overall, why would you not take it?
Reading through the various comments asserting that optimizing compilers should not be necessary if you 'just use better algorithms' and whatnot, I'm kind of wondering why it has to be one thing or the other. Who wouldn't want to have both?
Because using a better algorithm is more like a factor of 100 or 1000. Factors of 2 are just not worth bothering with; if the business is so marginal that a factor of 2 in the hardware requirements makes a noticeable difference then you don't want to be in that business in the first place.
That makes no sense at all. Only if you presume you are always going to start with some horrible piece of code that uses the worst possible algorithms might you get that kind of speedup. And after you have got your 100x to 1000x speedup, or if you start with a piece of code that is already quite efficient, it could still be 2x too slow to meet some hard realtime performance constraint. Or it could be fast enough, but you could still save 2x of computing power for other tasks, just by letting the compiler optimize your code.
If I tell my employer I'm not interested in getting a 2x speedup by flipping a compiler switch, but prefer to spend an uncertain time looking for or inventing this hypothetical algorithm that will give me 100x to 1000x speedup, I don't think they'll be very happy about that. For most of the code we deliver to our customers, a factor 'so marginal as 2x' could literally save them millions of dollars over the lifetime of the software. Two times faster saves a lot of computing time if you have jobs that run for 72+ hours. If only things were as easy as you pretend they are (we already have the 'marginal' 2x from the compiler optimizations, and we already have efficient algorithms)...
> Only if you presume you are always going to start with some horrible piece of code that only uses the worst possible algorithms you may get that kind of speedup
If we're talking about algorithmic improvements we're usually talking about going from O(n^2) to O(n log n) or similar, which is easily a 100x-1000x or more speedup if your n is big enough to be bothering with performance at all. It's very easy to be accidentally quadratic and this causes a lot of performance bugs. Talking as though algorithmic speedups and optimization speedups are of a similar magnitude is really misleading.
> it could still be 2x too slow to meet some hard realtime performance constraint
Theoretically it could be, but that would take an incredible amount of luck. There's so much variation in runtime that the odds of getting within a factor of 2x are tiny; more likely you're thousands of times too slow, or thousands of times faster than you need to be.
> Two times faster saves a lot of computing time if you have jobs that run for 72+ hours.
Sure, if you've got jobs that run for 72 hours in your code (i.e. aren't spending those 72 hours doing something that's already highly optimized in a standard library e.g. linear algebra) and have already carefully optimized your choice of algorithms, then the rules are different. That's a pretty rare case though.
True, but that assumes a better algorithm exists that can give the 100x speedup. In a good language (most of them) the common algorithms are already implemented in the standard library and optimized -- I just use them. Once I finished my degree I never wrote sort from scratch; I just looked up the syntax for calling the sort built into my language. Now it is possible that I can do better than the built-in sort -- but that is because I know my typical data and so I might be able to make a different compromise: we are not talking about a 100x speedup, we are talking about a 1.0001x speedup (and that is only after several months of work to replicate tricks the built-in sort already does).
(technically I have implemented sort since my degree, but it was bogo sort: I obviously did it only for fun)
A 1.00001x speedup really isn't worth getting out of bed for unless you're working in a very specialised environment; your time is likely worth a lot more than whatever you saved.
TL;DR "Optimizing compilers are still good to have because they are cheaper than programmer labour needed for hand optimization"
The original DJB presentation, which this is a response to, is very good and interesting.
It would really be nice if the field of compiler engineering started to address the obvious neglected areas, like optimizing memory layout and data types/representations based on partial evaluation / profile feedback.
> It would really be nice if the field of compiler engineering started to address the obvious neglected areas, like optimizing memory layout
Interesting tidbit: LLVM was developed by Chris Lattner for his PhD thesis. It was titled "Macroscopic Data Structure Analysis and Optimization" and, in his own words, was about how "Instead of analyzing individual load/store operations or structure definitions, this approach identifies, analyzes, and transforms entire memory structures as a unit". Apparently, the world was more interested in a good compiler framework than in the specific memory layout optimizations proposed.
" like optimizing memory layout and data types/representations based on partial evaluation / profile feedback."
They already can. The issue is usually one of what is allowable within language semantics, not of compiler optimization technology.
One reason you may see this as neglected is that outside of, say, polyhedral loop optimizations, research in the 80's and 90's (and sometimes earlier) did a really really good job of exploring this area because fortran allowed so much freedom.
I know C and C++ can't because the language standard forbids it. Anyone knows the reason why? It should be hard but doable. That's, of course, assuming everyone properly uses macros like offsetof instead of (shudder) hand-calculated offsets, does not use type punning to access the first members of structs, etc. The compiler will certainly need to store the memory layout to enable separate compilation of translation units so that's a complication. Any other issues I haven't thought of?
Actually, the C (and I believe C++) language standards both allow reorganizing struct layouts, except for the first member which has to be in first place (for tagged unions).
The struct layout is fixed by platform-specific ABI documents, which must specify the layout precisely so that structs can be passed between separately compiled code, e.g. between applications and dynamically linked libraries.
Changing ABIs is virtually impossible, so...
It would be nice to see opt-in optimizations of struct layouts, e.g. by annotating structs with a packing-style attribute. Though the best gains would probably be obtained from profile-guided optimizations for cache line optimizations, and those can only really be done with link-time optimizations for structs that are not passed outside the linker's scope.
C99 TC3 §6.7.2.1p13 "structure and union specifiers":
"Within a structure object, the non-bit-field members and the units in which bit-fields reside have addresses that increase in the order in which they are declared."
The as-if rule allows compilers to perform this optimization, provided they elide the optimization for those structs that have addresses of members taken/used, though. This requires whole program analysis of course (as in clang / gcc "LTO" compilation mode). And of course there could be language extensions to loosen this rule further on a per-struct basis.
The compiler is free to change the layout (even from AOS to SOA) as long as it can prove that the program can't tell the difference. Of course this is hard to prove in practice even with whole program optimization.
Besides the role of grouping a bunch of related fields, structures are also regularly used to define file formats, network protocols, or in OS development for data structures interpreted by the hardware. All these things would break without a defined memory layout.
Compilers and languages of course co-evolve. And we already have fairly widespread production use of invasive language extensions/dialects designed to enable use of parallelism, such as OpenMP and CUDA. And older language extensions that have over time become everyday stuff, like SIMD intrinsics.
So I don't think "languages don't support it" is a good excuse.
"And we already have fairly widespread production use of invasive language extensions/dialects designed to enable use of parallelism, such as OpenMP and CUDA"
I'm not sure you are really saying anything that proves me wrong?
Using those extensions, there are optimizing compilers that will perform very heavy memory layout optimization.
Which does, in fact, go to show that it took "invasive language extensions" to make the original language able to be optimized in this fashion.
IE you invasively changed the original semantics, and now these things are possible. How does that not show that the original language semantics were the problem, which is what I said?
Maybe OpenMP and CUDA do more optimizations than I thought they did. Do they perform AoS->SoA reorganizations, conversions from pointer members to inline members, or range analysis to choose smaller data types?
Admittedly, I was foremost thinking about entirely different kinds of data and layout optimizations: hierarchical data, pointer-heavy structures, and potential compressed representations. Things like converting node pointers into 16-bit indices into a node array.
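A hand-written sketch of that last idea (names made up; the wish is for a compiler or profile-guided tool to derive such a representation automatically):

#include <cstdint>
#include <vector>

struct NodePtr {            // pointer-heavy form: two 8-byte links per node on LP64
    NodePtr* left;
    NodePtr* right;
    int      value;
};

struct NodeIdx {            // compressed form: two 2-byte links, nodes live in a pool
    std::uint16_t left;     // index into `pool`; 0xFFFF as "no child" (a convention assumed here)
    std::uint16_t right;
    int           value;
};

std::vector<NodeIdx> pool;  // caps the structure at 65535 nodes, but shrinks each link 4x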
> Maybe OpenMP and CUDA do more optimizations than I thought they did. Do they perform [...] range analysis to choose smaller data types?
CUDA compiler developer here. We definitely perform range analysis on values stored in registers. This is an important optimization.
Via scalar replacement of aggregates, we can sometimes also replace a struct with a set of scalars. Once we do that, we can again perform range analysis on those scalars.
We can't change structs that don't get SROA'ed because of limitations of the language. For one thing, essentially any memory we write to the GPU's global memory can be read from the CPU side, and it's basically impossible to tell what is and isn't read, so we have to keep the memory layout the same.
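For readers who haven't seen it, scalar replacement of aggregates does roughly the following (a hand-written illustration of the effect, not actual compiler output):

struct Pair { int lo; int hi; };

int before(int x) {
    Pair p{x, x + 1};          // the aggregate as written in the source
    return p.lo * p.hi;
}

int after(int x) {             // after SROA the struct is gone; each field becomes a
    int p_lo = x;              // separate scalar that can live in a register and feed
    int p_hi = x + 1;          // later analyses such as range analysis
    return p_lo * p_hi;
}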
"We can't change structs that don't get SROA'ed because of limitations of the language. For one thing, essentially any memory we write to the GPU's global memory can be read from the CPU side, and it's basically impossible to tell what is and isn't read, so we have to keep the memory layout the same.
"
Which is precisely a language semantic problem, since there are plenty of languages where you can change SOA to AOS.
In fact, it's wildly common among high-perf Fortran compilers.
Sun's optimizing compiler did AoS->SoA reorganization, data member inlining/ordering, and range analysis in a whole program optimization mode, even for C++(!) without any extra #pragma, when it can correctly prove no aliasing and no pointer escaping. That was nearly 15 years ago. It could also transform an array of pointers to malloc'ed arrays into a single 2d array.
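For readers unfamiliar with the transformation being discussed, this is what the AoS -> SoA rewrite looks like written out by hand (a sketch; the point above is that those compilers proved it safe and did it automatically):

#include <vector>

struct ParticleAoS { float x, y, z, mass; };   // array-of-structs element: reading only x
                                               // still drags y, z and mass through the cache
struct ParticlesSoA {                          // struct-of-arrays: each field is contiguous
    std::vector<float> x, y, z, mass;
};

float sumX(const ParticlesSoA& p) {
    float s = 0.0f;
    for (float xi : p.x) s += xi;              // streams over packed floats, vectorizes trivially
    return s;
}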
Optimizing memory and cache use has been a primary focus of compiler research for a while now. I don't understand why you say partial evaluation w.r.t. optimizing memory use is a neglected area, though: I think partial evaluation is orthogonal to this, and useful in other optimization types as well.
For some recent compiler research with regards to optimizing memory and cache use, see e.g. "Optimizing Indirect Memory References with Milk", PACT 2016 [1].
The research side may not be the main culprit, hence I said "compiler engineering" - meaning usable tools delivered to the hands of developers. Though I think the research side has not focused on this enough either.
Regarding the Milk paper specifically, it appears to be about reordering and clustering memory accesses without changes to data layout. This is a fine thing but I think it reinforces my point about lack of attention to memory/data layout optimizations.
In the linked slides, DJB talks about a language for communication with the compiler, separating optimizations from specification. This reminded me of VPRI's "Meaning separated from optimization" [1] principle. Does anyone know what became of that line of thinking? Is this idea making its way into Ohm? I remember reading a post/paper about optimizing Nile/Gezira to better exploit the CPU cache (and the struggle to use SIMD), but can't seem to find it now.
IBM's old PL/S language allowed you to give hints like where data would go or what checks would happen right in the function declaration. The compiler would handle it from there.
Nope, much is gained by unrolling loops, inlining functions etc, which all increase code size.
Of course C++ compilation with no optimization at all can be rather wasteful with performance and code size, but to squeeze the final performance out you probably need to sacrifice code size (whether manual or automatic)
> Nope, much is gained by unrolling loops, inlining functions etc, which all increase code size.
That's because those are performance optimizations, not size optimizations (though as an aside, inlining functions can reduce code size in the event that the inlined version is smaller than the function-call overhead, or in the case where the function is used only once).
There are plenty of size optimizations that can be performed. -Os will enable them on gcc/clang if you want to try for yourself.
I'm posting this as a top-level comment, but it's really a reply to the discussion downthread about compilers being able to work magic if we let them. Better still, why not help them?
Something I took for granted for the longest time about Haskell (which remains the only language I know of with the feature) is the ability to write user-defined "rewrite rules". You can say, "okay, GHC, I know for a fact that if I use these functions in such-and-such way, you can replace it by this instead".
{-# RULES
"foo/bar"  forall x.
    foo x (bar x) = superOptimizedFooBar x
  #-}
A rule like this would be based on the programmer's knowledge of FooBar theory, which tells her that such an equality holds. The compiler hasn't studied lax monoidal FooBaroids and cannot be expected to infer this on its own. :)
Now, anywhere a user of this code writes something like
foo [1,2,3] (bar [1,2,3])
the compiler will substitute
superOptimizedFooBar [1,2,3]
in its place. This is a nice way to bring the compiler "closer" to the programmer, and allow the library author to integrate domain-specific knowledge into the compiler's optimizations.
You can also "specialize" by using faster implementations in certain cases. For example,
timesFour :: Num a => a -> a
timesFour a = a + a + a + a

timesFourInt :: Int -> Int
timesFourInt x = shiftL x 2   -- shiftL is from Data.Bits; times four is a left shift by 2

{-# RULES
"timesFour/Int"  timesFour = timesFourInt
  #-}
If you call timesFour on a double, it will use addition (ha!) but using it on an Int uses bitshifting instead because this rule fires.
High-performance Haskell libraries like vector, bytestring, text, pipes, or conduit capitalize on this feature, among other techniques. When compiling code written using libraries like this, this is how it goes:
- rule #1 fires somewhere
- it rewrites the code into something that matches rule #2, "clearing the way" for it to fire
- rule #2 fires
- rule #3 fires
- rule #1 fires again
- rule #4 fires
and so on, triggering a "cascade" of optimizations.
The promise of Haskell is that we already have a "sufficiently smart compiler": today, with good libraries, GHC is capable of turning clear, high-level, reusable functional code with chains of function compositions and folds and so on into tight, fast loops.
--
I must add, though, that getting rewrite rules to fire in cascades to get "mad gainz" requires one to grok how the GHC inliner/specializer works.
My biggest concern with something like this is how it affects debuggability and the principle of least surprise, for a couple of reasons. The biggest is behavior in the presence of bugs. If there is a bug in foo, but you only use foo in the context of foo x (bar x), then all those uses get rewritten and you end up with correct behavior even though your code has a bug -- one that will suddenly appear when you use foo in a way the rule doesn't apply to. Or there's a bug in superOptimizedFooBar, so you correctly write "foo x (bar x)" but get the wrong results, and you may waste time trying to debug foo and bar without realizing the rule replaced them. And there is also the possibility of bugs in the rules themselves. In general, I'm a little hesitant to use something that replaces my code behind my back.
It is very interesting, though, and is a good avenue to explore. I've got some reading to do :)
You're right, of course. There are ways around this: for one, without optimization (i.e. -O0) you don't have any rewrite rules firing, so any discrepancies in behavior can be tracked down this way.
In practice, most of the libraries that include tons of rewrite/specialise/inline (ab)use are either "core" libraries (like vector/bytestring) or have a large userbase (e.g. Conduit), and rules don't really change too much from version to version, so this has never actually had the detrimental effects that "silent replacement of code" might have in the worst case.
This might[0] sound similar to apologetics for dynamically-typed languages: the only real answer to your question is that rewrite rules are ultimately a way to improve performance, and they come with caveats usually associated with replacing clearer algorithms with faster ones. (I'm thinking of the adaptive sort from the C++ STL, which iirc uses different sorting algorithms for differently-sized containers to maximize expected performance. It's not exactly intuitive that vectors of size 100 and vectors of size ~10000 are sorted differently, is it?)
Of course, the only verifiably-correct way to do these things is through the use of machine-checkable proofs of the transforms. The well-known "foldr/build" rule that many optimizations of linked-list functions reduce to, for instance, has been proven outside of GHC, and there are usually similar proofs for other rules known to the library author. The appeal of dependently-typed languages is how they are a nice vehicle for the inclusion of those proofs in library code: if the compiler can't verify that your rule makes sense, it can complain instead. You then end up supplying a proof, or saying "just believe me, okay?" ;)
[0]: "In practice"? Seriously? Next thing I know I'll be parroting "in the real world" and so on...
> Using just these two rules, I saw a 225% increase in the number of nodes processed per second. Under the right circumstances, GHC can even outperform C code by applying these kinds of inter-procedural optimizations.
Also, we have QuickCheck, which allows us to check properties by randomly generating testcases. The post I linked mentions this.
I wonder if there will be a time when we start seeing this with JavaScript transpilers (JS->JS), which are normally used to bring feature parity and compatibility to runtimes but are starting to become obsolete as the runtimes catch up.
For example
let x = (await grabSomeIntegers()).map(double).filter(biggerThan(3)).map(subtract(1));
This can be optimized automatically to use a single loop, but one can rewrite it to be even faster (if only insignificantly, in this trivial example) because you know the values will be integers. But then you let go of DRY.
Now the compiler can very well perform optimizations on the big composed function. Even then, as you say, this has its limits: polymorphic/DRY code still suffers a performance penalty because it has to work the same way with all allowed inputs.
... right? Nope, that doesn't have to be the case: Haskell supports more than just your $grandparent's parameterized types!
The vector package uses something called data families to have different in-memory representations for vectors of different types, without "letting go of the DRY" at all.
What this means is that the library can declare, per element type, which concrete representation backs the vector: for values of type Int16, a primitive vector (essentially a raw array) from the Data.Vector.Primitive module. Now I can write code that uses the beautiful user-facing API of vector:
Written using vectors, code like in your example is perfectly DRY; it would look something like this (untested code!):
import Data.Function ((&))
import qualified Data.Vector.Unboxed as V

combobulate :: (Num a, Ord a, V.Unbox a) => IO (V.Vector a) -> IO (V.Vector a)
combobulate grabSomeIntegersFromNetwork = do
  ints <- grabSomeIntegersFromNetwork
  return (ints & V.map (* 2) & V.filter (> 3) & V.map (subtract 1))
and under the hood, whenever I use a Vector Animal[0], it uses whatever fast general-purpose array implementation is available, but the moment I call this with a vector of Int16 or Bool or Complex Double, it switches to the fast implementations in the Primitive/Internal modules.
The key is the data families: the types expose a simple, consistent API, hiding the actual guts of the memory representation from the user. The library author is free to perform whatever ugly/evil/unsafe manipulations she feels like behind the scenes on the internal representations, or not.
[0] Defining numeric operations on animals is left as an exercise to the careful reader.
Your example of rewrite rules reminds me a lot of expression templates in C++, though there's no direct support by the language and it's done through the type system instead. Some libraries (e.g. eigen or the boost library for linear algebra) use expression templates to produce optimized code for a chain of functions or operators.
The specialization example is the same as template specialization in C++.
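For comparison, a rough C++ analogue of the timesFour example upthread, done with plain template specialization (a sketch, not code from any of the libraries mentioned):

template <typename T>
T timesFour(T x) { return x + x + x + x; }   // generic version: repeated addition

template <>
int timesFour<int>(int x) { return x << 2; } // specialization for int: shift left by 2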
I've heard a little about generics/templating/whatever-you-call-the-things-between-angle-brackets techniques in C++ from far away.
> The specialization example is the same as template specialization in C++.
This sounds very interesting to me, because you're saying it's accomplished entirely in the type system: do you have an example or a reference you can point me to?
I'm guessing there's a type-level tag for an algorithm, and the compiler is told how to combine tags corresponding to a sequence of operations on vectors that looks a certain way. Something like do_the_thing<t, DotWith<Self>, HasFastNorm<t>> = do_the_thing<Norm<t>>? (just thinking aloud in pseudo-C++, nothing to see here, move along) In that case, it would be closer to something using a function on types, which Haskell has in the form of type families.
I'm not sure, but I remember seeing some research into using it for Haskell stream fusion (can't find the video, sorry).
I believe that the basic idea was that not all rewrites end up being equally fast (a * b * c can be fused as (ab) * c, but maybe a * (bc) is faster). Trying all the combinations is an option in theory, but you get a combinatorial explosion, so you normally don't get far in practice. Enter machine learning. I'm not sure how successful they were, but I imagine that the same sort of thing could be applied to SIMD.