
On the Cleverness of Compilers - nkurz
http://alexey.radul.name/ideas/2013/cleverness-of-compilers/
======
corysama
I've worked with a sufficiently smart compiler before: the HLSL (shader)
compiler for the Xbox 360. It truly is a grand experience. The general
recommendation was not to try to outsmart the compiler: write everything as
straightforwardly as possible. Just watch out for work done on values that
influence the result but aren't really necessary (ex: multiplying by an input
that will always be 1).
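
For a concrete flavor of that trap, a sketch in Python rather than HLSL,
since the principle is language-agnostic (the names here are made up):

    # `tint` is always (1, 1, 1) for this material in practice, but the
    # compiler only sees a runtime input, so it must emit the multiplies.
    def shade(base_color, tint):
        return [c * t for c, t in zip(base_color, tint)]

    # Specializing by hand (or via a compile-time variant) removes the work:
    def shade_untinted(base_color):
        return base_color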

There were several occasions where I'd write out some elegant math and think
"Yeah, that's pretty. But, it would run 30% faster if I transposed everything.
But, then I'd have to hand-interleave the internals of all these functions..."
Then I'd check the assembly output and Bam! the compiler was already producing
exactly what I had in mind!

The downside was that it took several minutes to compile <100-line sources
that produce <1000-instruction programs. That's with a language completely
specialized around small-scale linear algebra that maps very directly to
hardware that is completely specialized around small-scale linear algebra;
with branching highly discouraged, all function calls expected to inline away
completely, no recursion, no pointers, no heap, no imports, tiny-if-any
headers and a tiny standard lib.

It's my understanding that most of the smartness of the 360's HLSL compiler
came from using O(n^3) analysis algorithms that were simply impractical at a
larger scale. Our game had thousands of shader programs that were mechanically
specialized for different situations. As a result, recompiling all of them
after changing the tiny shared header took many hours. "Thankfully" the C++ of
the game took so long to compile that we were already using Xoreax
Incredibuild to distribute the compilation. We adapted that setup to
distribute shader compilation and brought the turn-around time down to 15
minutes by leeching cycles from several dozen of my co-workers' machines.

~~~
foobarbazqux
Most optimizing compilers use data flow analyses that are cubic, so it can't
just be a matter of it being O(n^3).

[http://dl.acm.org/citation.cfm?id=788019.788876](http://dl.acm.org/citation.cfm?id=788019.788876)

~~~
larsberg
You can do pretty well for analyses where n is the number of variables instead
of the number of program points / expressions. In those cases, the typical
trick is to shorten the height of the tracked lattice, as you only get in
trouble in a few places.

To make that concrete, control-flow analysis (which answers "which possible
functions could I call through this variable?") is cubic in general. But if
you cut it off to only tracking N separate functions and then giving up to ANY
when you hit that N, you can cut off the few bad cases (e.g., all of the
bindings to the function passed to map, which you will probably just inline
later anyway) that generate the cubic behavior and keep your execution time
well-behaved, even for whole-program compilers, at a very small loss of
precision in practice.
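
As a rough sketch of that cutoff (illustrative Python; the bound N, the ANY
element, and the join function are made-up names, not any particular
compiler's API):

    # Abstract values are sets of possible callees, cut off at N and
    # widened to ANY (the top element) beyond that, so the lattice height
    # is O(N) instead of O(number of functions in the program).
    N = 4
    ANY = object()  # "could be any function": be conservative, don't inline

    def join(a, b):
        if a is ANY or b is ANY:
            return ANY
        merged = a | b
        return ANY if len(merged) > N else merged

    flows = join({"f", "g"}, {"h"})       # {"f", "g", "h"}
    flows = join(flows, {"i", "j", "k"})  # 6 callees > N, widens to ANY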

~~~
foobarbazqux
If your language has pointers or aliases, precise call graphs are exceedingly
expensive (for large programs), because you need a precise pointer or alias
analysis. Do you have a link to a publication about the technique you
describe?

~~~
larsberg
There are meatier technical presentations around about the formal bounds, but
I always recommend Manuel Serrano's early work, which is much more accessible
than more modern presentations (my own included):

[http://dl.acm.org/citation.cfm?id=315891.315934](http://dl.acm.org/citation.cfm?id=315891.315934)

A good survey of many of the tricks used in practice (and the formal
techniques, as well) is available in Jan Midtgaard's tome, which was finally
published in the last year or so:

[http://dl.acm.org/citation.cfm?id=2187671.2187672](http://dl.acm.org/citation.cfm?id=2187671.2187672)

~~~
foobarbazqux
Cool, thanks. I wasn't clear whether you were talking about functional
programs, but I can believe that it works for them.

------
ef4
Part of the reason compilers don't do really thorough analysis and
optimization is that it makes the compiles take a long time.

But we can make that not matter if we rethink the relationship between the
compiler and the programmer. What if you got the _first_ version of your
binary very quickly, but then you could let the compiler run arbitrarily long,
and watch it continue to produce better and better versions?

What if the compiler could compile not just source code, but _diffs_ of source
code? It could apply your one-line change without redoing its exhaustive
analysis of all the things that you couldn't have affected.

Compilers try hard to avoid whole-program analysis, but I think that's
misguided. Spend the cycles, they're getting cheaper all the time. If you can
update your analysis incrementally as the programmer works, you're golden.
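
A minimal sketch of that incremental idea (an assumed scheme, not any
existing compiler: cache per-unit output keyed by a hash of the unit's source
plus the keys of what it depends on, so a one-line diff recompiles only its
dependents):

    import hashlib

    cache = {}

    def key(name, sources, deps):
        # Content hash of this unit plus, transitively, its dependencies
        # (assumes an acyclic dependency graph for simplicity).
        h = hashlib.sha256(sources[name].encode())
        for d in sorted(deps.get(name, ())):
            h.update(key(d, sources, deps).encode())
        return h.hexdigest()

    def compile_unit(name, sources, deps):
        k = key(name, sources, deps)
        if k not in cache:
            cache[k] = "<code for %s>" % name  # stand-in for real compilation
        return cache[k]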

~~~
Dewie
> Part of the reason compilers don't do really thorough analysis and
> optimization is that it makes the compiles take a long time.

Compiler flags are all you need to solve that, no?

~~~
wtallis
Only in the strictest sense. The mere presence of the switches does nothing on
its own to improve the workflow. You need a development environment that is
aware of why you're running the compiler: basic syntax checking, deeper static
analysis using expensive heuristics, unoptimized code generation for quick
turnaround time on correctness tests, or optimized code generation for
performance testing.

------
omn1
> Doubtless the first difference that jumps out at you is that one of these
> programs is written using a beautiful, familiar surface syntax that makes
> you feel right at home, while the other uses an ugly, foreign notation that
> makes your eyes glaze over.

I love that sentence. A sharp-witted summary of the recent _imperative_ vs.
_functional_ discussion.

~~~
dandrews
"One is a genius, the other's insane."

------
oh_teh_meows
For people who are interested in learning about compiler optimizations (a
wonderful area of research in its own right!), I'm throwing out a list here
so you can google them (if you know more, please reply to this comment! I'd
love to learn more about them :D)

Expression related optimizations: Constant Folding, Constant Propagation,
Global Propagation, Strength Reduction, Common Subexpression Elimination,
Partial Redundancy Elimination, Induction Variable Elimination, Reassociation
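
To make the first of those concrete, here is a toy constant folder (with one
algebraic identity thrown in) over a tuple-based expression tree; purely
illustrative, since real compilers do this on their IR:

    import operator

    # Expressions are tuples ("+" | "-" | "*", left, right); leaves are
    # ints or variable names.
    OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

    def fold(e):
        if not isinstance(e, tuple):
            return e
        op, l, r = e[0], fold(e[1]), fold(e[2])
        if isinstance(l, int) and isinstance(r, int):
            return OPS[op](l, r)       # both operands known: fold now
        if op == "*" and 1 in (l, r):
            return r if l == 1 else l  # algebraic identity: x * 1 == x
        return (op, l, r)

    assert fold(("*", ("+", 2, 3), "x")) == ("*", 5, "x")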

Loop related optimizations: Loop Invariant Code Motion, Loop Peeling, Loop
Unrolling, Loop Distribution, Loop Autoparallelization, Loop Fusion, Loop
Fission, Loop Interchange, Loop Tiling/Stripmining, Vectorization,
Scalarization
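
Loop-invariant code motion is the easiest of those to see; here is the
transformation written out by hand (what the optimizer aims to do
automatically):

    import math

    # Before: invariant work is redone on every iteration.
    def scale_all(xs, theta):
        out = []
        for x in xs:
            out.append(x * math.sin(theta) / len(xs))
        return out

    # After hoisting: sin(theta) / len(xs) depends on nothing in the loop.
    def scale_all_hoisted(xs, theta):
        k = math.sin(theta) / len(xs)
        return [x * k for x in xs]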

Memory/cache related optimizations: Cache blocking, False Sharing Elimination,
Structure Peeling, Structure Splitting, Array Contraction, Multi-dimensional
Array Dimension Reordering
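
The structure splitting entry, sketched (Python only shows the layout idea;
the cache payoff is real in compiled code, not here):

    # Array-of-structs: each record drags its cold fields along, so a hot
    # loop over positions pulls names and histories through the cache too.
    particles = [{"x": 0.0, "y": 0.0, "vx": 1.0, "vy": 0.5,
                  "name": "p0", "history": []}]

    # After splitting hot from cold (struct-of-arrays for the hot part):
    xs, ys, vxs, vys = [0.0], [0.0], [1.0], [0.5]
    cold = [{"name": "p0", "history": []}]  # rarely touched, kept aside

    def step(dt):
        for i in range(len(xs)):  # streams through dense hot arrays only
            xs[i] += vxs[i] * dt
            ys[i] += vys[i] * dt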

Control flow related optimizations: Code block re-ordering (by frequency),
Branch prediction (by static analysis/feedback guided), Code hoisting/sinking
(to optimize CPU pipeline), Automatic Inlining, Tail Call Optimization
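
And tail call optimization, spelled out by hand (Python itself doesn't do it;
the point is what the compiler turns the recursion into):

    # Tail-recursive: the recursive call is the very last thing done.
    def gcd(a, b):
        return a if b == 0 else gcd(b, a % b)

    # What TCO produces: the call becomes a jump, reusing the stack frame.
    def gcd_loop(a, b):
        while b != 0:
            a, b = b, a % b
        return a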

Code generation related optimizations: Register allocation (NP-complete),
Peephole optimization, Superoptimization (no one really does this yet, but
cool nonetheless)
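
Register allocation as graph coloring, in miniature (a greedy heuristic
sketch; the exact problem is NP-complete, which is why real allocators are
heuristic too):

    # Nodes are variables; an edge means "live at the same time", so the
    # two variables can't share a register. Color with K registers,
    # spilling anything that won't color.
    K = 3
    interference = {"a": {"b", "c"}, "b": {"a", "c", "d"},
                    "c": {"a", "b"}, "d": {"b"}}

    def allocate(graph, k):
        colors, spilled = {}, []
        for v in sorted(graph, key=lambda v: len(graph[v]), reverse=True):
            taken = {colors[n] for n in graph[v] if n in colors}
            free = [r for r in range(k) if r not in taken]
            if free:
                colors[v] = free[0]
            else:
                spilled.append(v)
        return colors, spilled

    print(allocate(interference, K))  # colors every variable, spills none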

Analysis (interprocedural/intraprocedural): Alias Analysis (flow/context
(in)sensitive), Points-to Analysis, Escape Analysis, Def-Use Chain Analysis,
Live Variables Analysis, Memory Access Pattern Analysis (to guide memory
layout optimizations), Available Expressions/Copies Analysis, Loop Dependency
Variables Analysis, Memory Access Pattern Analysis (to guide memory layout
optimizations), Available Expressions/Copies Analysis, Loop Dependency
Analysis, Control Flow Analysis, Dominator data flow Analysis, Globals
Analysis

Miscellaneous: Static Single Assignment, Data Flow Analysis (in, out,
transfer, meet!), Worklist Algorithm (for Data Flow Analysis!), Symbolic
Execution/Abstract Interpretation
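
Several of those items meet in one classic construction: live-variable
analysis (a backward data flow analysis) run to a fixed point with a
worklist. A minimal sketch over a toy CFG (the program is invented for the
example):

    from collections import deque

    # Toy CFG: 1: i=0; s=0  2: if i<n  3: s=i; i=i+1 (back to 2)  4: return s
    succ = {1: [2], 2: [3, 4], 3: [2], 4: []}
    use  = {1: set(), 2: {"i", "n"}, 3: {"i"}, 4: {"s"}}
    defs = {1: {"i", "s"}, 2: set(), 3: {"i", "s"}, 4: set()}

    live_in = {n: set() for n in succ}
    work = deque(succ)

    while work:
        n = work.popleft()
        # meet: union over successors; transfer: use | (out - def)
        out = set().union(*(live_in[s] for s in succ[n]))
        new_in = use[n] | (out - defs[n])
        if new_in != live_in[n]:
            live_in[n] = new_in
            work.extend(p for p in succ if n in succ[p])  # requeue preds

    print(live_in)  # fixed point: live_in[2] == {"i", "n", "s"}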

Writing straightforward code is the best thing you can do for a compiler. It
can tell exactly what you want to do, and do its best to generate the
fastest/most memory-efficient code possible. Help the compiler help you!

Also, compilers usually come with several different levels of optimization,
and you can tune each one of these optimizations like you would a racing car.
Read the manual!

~~~
pg
Please don't use all caps for emphasis.

~~~
oh_teh_meows
I'm sorry it bothered you. I'll edit them.

~~~
nsmartt
[https://news.ycombinator.com/formatdoc](https://news.ycombinator.com/formatdoc)

------
AlexanderDhoore
Whatever happened/happens to Stalin [1], the Scheme compiler? That's supposed
to be a real Smart Compiler, right?

It's so mysterious... It's developed by this one guy. The only thing you can
find on his website is a tarball [2]. It's like programming mythology.

[1]
[http://community.schemewiki.org/?Stalin](http://community.schemewiki.org/?Stalin)
[2]
[https://engineering.purdue.edu/~qobi/](https://engineering.purdue.edu/~qobi/)

~~~
rayiner
It's a test bed for aggressive control-flow-based optimization for Scheme.
It's not a practical production compiler--too slow, and it requires
whole-program analysis.

~~~
jevinskie
Right, qobi doesn't even use that Scheme in his AI class. He uses another
Scheme that he wrote. =)

------
tachyonbeam
Compilers _are_ getting smarter over time. Better analyses and clever
optimization schemes are being designed and implemented. In my opinion though,
as a compiler writer, the best route is to make languages easier to analyze.
Things like purity and referential transparency (e.g., as seen in Haskell)
make code analysis and transformation much simpler. They enable things like
pattern-matching-style transformations on code. If the compiler has to reason
about pointers, invisible side effects, and aliasing, that greatly limits the
kinds of things that can be inferred about a program.
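
Map fusion is the textbook example of such a pattern-matching transformation
(GHC actually ships rules like it via its RULES pragmas); sketched in Python,
it is only valid because f and g can have no invisible side effects:

    # map(f, map(g, xs))  ==>  map(lambda x: f(g(x)), xs)
    # One traversal, no intermediate list, same result -- but only because
    # purity guarantees neither f nor g observes or mutates anything else.
    def fused(f, g, xs):
        return map(lambda x: f(g(x)), xs)

    assert list(fused(str, abs, [-1, 2])) == list(map(str, map(abs, [-1, 2])))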

------
btilly
This article was worth it to me if only for the following observation:

 _And there we have a clue as to why, in modern computing, modular code is not
high-performance code. The very advantage of modularity is the source of its
performance cost: modularity is the ability to do things other than what you
did. Modularity is the ability to reuse the same idea in many different ways
and in many different places._

~~~
marijn
See also Rust's emphasis on zero-cost abstractions [1]. The current compiler
doesn't emphasize specialization (though it does specialize generics), but the
ability to abstract in source code without leaving any trace in the actual
compiled binary is a design goal.

[1]:
[http://www.infoq.com/presentations/Rust](http://www.infoq.com/presentations/Rust)

------
ScottBurson
The first encounter I had with automatic differentiation was in a simulation
language called PROSE, circa 1974. I never actually used it, but I came across
a manual for it, which I found quite fascinating. As I recall, it ran on the
CDC 6000 series -- at the time I was doing numerical physics on a 6400, and
PROSE was being presented as an alternative to the ubiquitous FORTRAN.

PROSE seems to have been completely forgotten; I can find nothing on it now.
But I distinctly recall that it did AD.
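
(For readers who haven't met AD: the simplest modern formulation is forward
mode with dual numbers, carrying a (value, derivative) pair through the
arithmetic. A minimal sketch, and certainly not PROSE's actual mechanism:)

    class Dual:
        """Carries (value, derivative) through + and *."""
        def __init__(self, v, d=0.0):
            self.v, self.d = v, d
        def __add__(self, o):
            o = o if isinstance(o, Dual) else Dual(o)
            return Dual(self.v + o.v, self.d + o.d)
        __radd__ = __add__
        def __mul__(self, o):
            o = o if isinstance(o, Dual) else Dual(o)
            return Dual(self.v * o.v, self.d * o.v + self.v * o.d)  # product rule
        __rmul__ = __mul__

    def f(x):
        return 3 * x * x + 2 * x + 1

    y = f(Dual(4.0, 1.0))  # seed dx/dx = 1
    print(y.v, y.d)        # 57.0 26.0: f(4) = 57, f'(4) = 6*4 + 2 = 26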

Everything in computer science was invented before 1980 :-)

~~~
fsck--off
This page should keep you entertained for a while:

[http://metacalculus.com/prose.html](http://metacalculus.com/prose.html)

It links to several manuals (most likely including the one you read) and has
links to papers about PROSE and its predecessor, SLANG.

~~~
ScottBurson
Oh wow! Thanks!!!

------
srean
Wow, that's quite a coincidence. But before I get into that: I like the tone
on HN much, much better today. There were several language-related posts, all
were reasonably well received, none with the frequent hostility that I have
come to see on HN in the recent past. No one came here to piss and rain
smugly on another's voluntary piece of work with "Repeat after me, no one
gives a flying f%$#! about your language, where are the libraries, where are
the users. Don't waste your time on pointless exercises".

That said, the coincidence that I mentioned is entirely personal. After a
hiatus, I am getting my toes wet again in this language that I find quite
interesting. It's a whole-program-analyzed, aggressively inlined, functional
(but not purely so), ML-like, type-checked and type-inferred language with
generics that interacts quite effortlessly with C++, without the need for any
FFI-like library. It has coroutines and preemptive threads.

I think of it as something that does to C++ what Scala does to Java or F#
does to C#. It is really performant, and achieves the speed via aggressive
inlining, tail calls, whole-program analysis, and what can be called
opportunistic but indeterminate laziness. I have not quite grokked its model
yet; one thing that I want to get a handle on is being able to reason about
what triggers garbage collection and what doesn't. According to the author of
the language, garbage collection can mostly be avoided, but I am not yet good
at this aspect; it's just been a few days that I have actually used the
language (as opposed to reading about it).

It's statically typechecked but feels unusually flexible in one aspect: one
can move functions and types around, between and across different translation
units, freely (for refactoring) without the need for forward declarations,
because the linkage semantics are permutation-independent.

The language that I am talking about is Felix:
[http://felix-lang.org/share/src/web/tutorial.fdoc](http://felix-lang.org/share/src/web/tutorial.fdoc),
[http://felix-lang.org/share/src/web/nutut/intro/intro_index....](http://felix-lang.org/share/src/web/nutut/intro/intro_index.fdoc). It has been discussed on
HN a few times and I have mentioned it a few times myself just because I find
it really interesting, not because I have a dog in the fight. The language
author does visit HN, but very rarely.

------
andrewflnr
Personally, the way I want to write that program involves destructuring. All
the cons, car, and cdr crap makes my eyes glaze over almost as badly as the
JS.

More on topic, it seems like at least in the example given, inlining of
functions followed by standard optimization would give good results, and I
thought we were pretty decent at inlining. Is that just because the example is
so simple, and the point is that that ability needs to be more general?

~~~
agumonkey
I don't know if the author stopped at that level on purpose. That's the basic
SICP abstraction layer. And I believe types and then pattern matching were
another abstraction on top of that to get literal symmetry, but to my eyes at
this point they're two sides of the same coin.

------
tomp
> Doubtless the first difference that jumps out at you is that one of these
> programs is written using a beautiful, familiar surface syntax that makes
> you feel right at home, while the other uses an ugly, foreign notation that
> makes your eyes glaze over

But I'm not sure all of us would agree which program is beautiful/familiar and
which is ugly/foreign...

Edit: I love LISP, but I love math more.

~~~
srl
I read that as somewhat tongue-in-cheek. I mean, I'm a lisp fan too, but it's
not like the syntax is beautiful. (It's just "not as bad". Or that's what we
tell ourselves.)

~~~
ced
I read it as intentionally ambiguous.

~~~
Leszek
It's almost certainly intentionally ambiguous, and therein lies the joke.

------
6ren
Well. This made me install Firefox 23, to get asm.js. I got a 5x speed-up on
his Mandelbrot code. Wow.

(I had Firefox 20 before. I checked a few websites and didn't notice any
speed-up in general.)

PS: of course you can write modular code in JS too, with a complex-number lib
etc. (maybe not quite as nice as Scheme). Would it be compiled to be as
efficient?

~~~
yoklov
Not unless the compiler did some magic to remove the heap allocations, which,
IIRC, none of the JS engines do yet. LuaJIT, however, will; see
[http://wiki.luajit.org/Allocation-Sinking-Optimization](http://wiki.luajit.org/Allocation-Sinking-Optimization). I believe the Dart
VM does this too, so it might not be far off for JS engines.

Still, other semantics might slow the code down compared to asm.js, but the
heap allocations will be the biggest factor.
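
For the curious, the shape of what allocation sinking buys, written out by
hand in Python (LuaJIT does the equivalent automatically, per the link
above):

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

    # Before: one short-lived heap object per iteration.
    def total_norm(pairs):
        total = 0.0
        for x, y in pairs:
            p = Point(x, y)  # never escapes the loop body
            total += p.x * p.x + p.y * p.y
        return total

    # After sinking: the object is dissolved into plain scalars.
    def total_norm_sunk(pairs):
        total = 0.0
        for x, y in pairs:
            total += x * x + y * y
        return total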

------
Someone
I know little of the field and nothing of the state of the art, but I wonder
how smarter compilers would combine with the statistical analysis that
profile-guided optimization, the JVM, and branch prediction logic do. Do the
improvements they bring add up perfectly, or is smarter compilation more
effective for dumber CPUs/runtimes?

------
fmax30
Interestingly, I get 0.86 mflops/msec on my 2.1 GHz Sandy Bridge processor on
Firefox and 0.307 mflops/msec on Chrome.

~~~
marijn
If it's a recent version of Firefox, the "use asm" in [1] probably explains
the difference.

[1]: [http://alexey.radul.name/ideas/2013/cleverness-of-compilers/...](http://alexey.radul.name/ideas/2013/cleverness-of-compilers/mandel.js)

------
tel
How does DVL compare with the optimizations GHC runs? Especially re: the ad
library?

------
Dewie
I'm actually a little disappointed that a mature compiler like javac doesn't
seem to have features such as escape analysis or the ability to compile enum
types to primitive values (compiling them to a single byte would be enough for
most constants). I might be wrong about these specific cases, but my overall
impression is that compilers like the main one for Java aren't very
adventurous when it comes to compiler optimizations. They sometimes seem
pretty primitive, even. It's understandable if optimizations would add a lot
of time to the compilation, but I don't see why they can't be added behind
compiler flags at the least, to be used for shipping software.

I have never hacked on compilers myself, so I might be totally off the mark
on all of this. And the specific things I've mentioned might be much harder
than they seem at a cursory glance. But this post is more of an inquiry into
the matter than a definitive statement about the status quo.

~~~
ahomescu1
Keep in mind that javac is only half of the compiler; the JVM is the other
half. I'm not sure you need escape analysis when you compile down to JVM
bytecode. Could you show an example?

~~~
Dewie
I'm talking about stuff like having a method where you make an object - let's
say one comprising two ints, a coordinate - and the compiler figures out that
it can instead solve the same problem with, say, just two stack-allocated ints
instead of one heap-allocated object. Maybe the programmer wanted to make an
object since it looked more readable than making two ints.

~~~
kevingadd
Java does that, it just happens in the JVM during jit

~~~
azakai
Yes, that is the pretty standard scalar replacement of aggregates
optimization.

