
Increasing the D Compiler Speed by Over 75% - andralex
http://www.drdobbs.com/cpp/increasing-compiler-speed-by-over-75/240158941
======
nkurz
_Time to stop guessing where the speed problems were, and start instrumenting.
Time to trot out gprof, the Gnu profiler. I fired it up on the slow example,
and waited. And waited, and waited, and waited. I waited overnight. It pole-
axed my Ubuntu box, I had to cold boot it. The test case was so big that it
plus gprof was too much. gprof slows things down a lot, so I had to cut things
way down to get it to work._

It's been a long time since I've used gprof. I switched to Valgrind and
OProfile about 10 years ago, and more recently to 'perf' and 'likwid'. If the
goal is finding hot-spots, these last might be more convenient since they run
with minimal overhead --- a couple percent rather than 100x.

Are there benefits to gprof that I've forgotten?

Are there newer and better profiling tools I don't know about?

~~~
mrich
Vtune and Zoom need to be mentioned, although they are not free. Vtune is even
able to do power consumption analysis these days.

~~~
gngeal
_Vtune is even able to do power consumption analysis these days._

...on AMD machines, too? (Not to mention mobile development on ARM.)

~~~
mcpherrinm
As far as I understand, having not used VTune, some basic features work on
both, but the interesting stuff relies on Intel-specific counters in the CPU.

This doesn't mean it's worthless, though: Finding your performance bottlenecks
on Intel will surely be at least a first-order approximation of the
performance on AMD, too.

ARM is a little bit of a different story since the memory model is different,
and you might be getting killed by something like unaligned accesses, but that
doesn't mean the information is worthless; it just means you should probably
use more than one tool. After all, not everything is a nail, but hammers are
still good tools.

------
WalterBright
This article chronicles some fun I had boosting DMD's speed by doing some
simple changes.

~~~
kevingadd
The idea of never free()ing and then taking advantage of that with a dumb allocator to get better performance is pretty clever. I wish I could do that with my compiler; sadly, I can't let it leak, since I invoke it from unit/functional tests and the test runner would run out of memory and explode :( For DMD tests, do you just eat the cost of process setup and compiler startup for every test run?
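
For context, a minimal sketch of the kind of "dumb" allocator being described (not DMD's actual code; all names here are made up). Allocation is just a pointer bump within large blocks, and free() is simply never called:

    #include <stdlib.h>

    enum { CHUNK = 1 << 20 };              /* grab memory in 1 MB blocks */
    static char *heap_p, *heap_end;

    void *bump_alloc(size_t n)
    {
        n = (n + 15) & ~(size_t)15;        /* keep 16-byte alignment */
        if (heap_p == NULL || heap_p + n > heap_end)
        {
            size_t sz = n > CHUNK ? n : CHUNK;
            heap_p = malloc(sz);           /* error handling omitted */
            heap_end = heap_p + sz;
        }
        void *p = heap_p;
        heap_p += n;                       /* no bookkeeping, no free() */
        return p;
    }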

~~~
WalterBright
The compiler is a batch tool, so restarting the process for every run is
normal usage.

~~~
to3m
I've taken to not bothering to free anything in batch tools if it's not super-obvious what to do. Objects with a simple lifetime (including memory owned by a stack-local object such as a std::vector) get freed; everything else just leaks. You'd think this would cause masses of problems, but it doesn't. It's easy to imagine data that would be too large, but in practice that seems rarer than you might think.

If the alternative is something like a handle-based or smart pointer system,
then you'll reap the benefits in terms of ease of debugging on a daily basis.

------
chondl
Have you considered or tested using either closed hashing or linear array lookups as a replacement for your linked-list open hashing implementation? Years ago I significantly improved the speed of a color quantization operation that several other engineers had already optimized, by replacing it with a simpler closed hashing algorithm straight out of Knuth. More recently I've had success using arrays and performing linear search for small collections. This technique is used in Redis (see [http://redis.io/topics/memory-optimization](http://redis.io/topics/memory-optimization)).
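
As a rough illustration of the linear-search approach (hypothetical types and names), the whole "table" is just a flat array that fits in a cache line or two:

    #include <string.h>

    typedef struct { const char *name; void *sym; } Entry;

    typedef struct {
        Entry items[8];    /* spill into a real hash table beyond this */
        int   count;
    } SmallMap;

    void *smallmap_find(const SmallMap *m, const char *name)
    {
        for (int i = 0; i < m->count; i++)           /* linear scan */
            if (strcmp(m->items[i].name, name) == 0)
                return m->items[i].sym;
        return NULL;
    }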

~~~
acqq
As soon as pools are used (see my other comment here), chaining is much faster than storing all elements in the table behind the hash -- you can use a simpler hash function and get better performance even when the table is relatively full.
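
Roughly what I mean, as a hedged sketch (all names invented): the chain nodes come out of a pool, so colliding entries sit next to each other in memory instead of being scattered by individual malloc() calls:

    typedef struct Node { struct Node *next; unsigned hash; void *value; } Node;

    #define NBUCKETS 256
    static Node *buckets[NBUCKETS];
    static Node  pool[1 << 16];        /* growth/overflow handling omitted */
    static int   pool_used;

    void insert(unsigned hash, void *value)
    {
        Node *n = &pool[pool_used++];  /* pool allocation: one increment */
        n->hash  = hash;
        n->value = value;
        n->next  = buckets[hash % NBUCKETS];
        buckets[hash % NBUCKETS] = n;
    }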

~~~
WalterBright
I haven't spent much time looking into cache effects in the compiler's internal data structures; that gold hasn't been mined yet.

~~~
p0nce
Some ideas (at least on x86):

- alignment greater than 16 bytes, e.g. 128 bytes for isolated buffers.

- the hardware prefetcher likes to load cache lines around the memory actually accessed, just in case. So data chunks that will be accessed at the same time should be near each other, to save a bit of cache.

- memory access that does not have a simple pattern is slower than access that is contiguous or has a simple stride (see the sketch below).
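
A small sketch of the first and last points (names are hypothetical; posix_memalign is POSIX): over-align an isolated buffer and keep the nodes contiguous, so the prefetcher's speculative cache-line loads hit data you are about to use anyway:

    #include <stdlib.h>

    typedef struct { int kind; int flags; void *payload; } Node;

    Node *alloc_node_array(size_t n)
    {
        void *p = NULL;
        if (posix_memalign(&p, 128, n * sizeof(Node)) != 0)  /* 128-byte alignment */
            return NULL;
        return p;   /* iterate with a simple stride, not through pointers */
    }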

------
aidenn0
A lot of people underestimate the performance impact of malloc(). It is dog slow. In addition, if you heavily use a poor malloc() implementation with data of varying sizes, you can easily end up using more memory than you would have with a copying GC!

~~~
haberman
Not only is it slow, but many malloc() implementations (even glibc's, I think) take a global lock, so they are prone to contention when called concurrently. Other mallocs (like tcmalloc and Hoard) satisfy small allocations from a thread-local pool.

~~~
scott_s
glibc's malloc has actually been decent with regards to concurrent memory
allocation for a while now, in my experience. I googled around, and found
someone talking about glibc's malloc not using a global lock:
[http://siddhesh.in/journal/2012/10/24/malloc-per-thread-
aren...](http://siddhesh.in/journal/2012/10/24/malloc-per-thread-arenas-in-
glibc/)

~~~
haberman
Thanks for the info; good to know that glibc has improved in this regard. I'd
correct my comment but it's no longer editable. :/

------
jongraehl
I wondered why you don't store the reciprocal w/ the hash table object.
Obviously it wastes some space, but it wouldn't be any slower than your
specific checks for 4 and 31, I think. (If most of the tables have size 4 or
31, then I'd use your code).

~~~
WalterBright
It's not necessary since the set of possible divisors is known in advance.

Also, "multiplying by the reciprocal" is a bit simplistic - there's some other
instructions added in based on the specific divisor value. Adding more tests
and branches for these likely would not pay off.
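
The overall shape of the trick, as a hedged sketch (not the actual DMD source): dispatch on the handful of known table sizes, so that each modulo is by a compile-time constant and the compiler can emit an AND or a multiply/shift sequence instead of a DIV:

    size_t hash_to_bucket(size_t h, size_t dim)
    {
        switch (dim)
        {
            case 4:  return h % 4;     /* becomes h & 3 */
            case 31: return h % 31;    /* multiply-by-reciprocal + shifts */
            default: return h % dim;   /* generic runtime division */
        }
    }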

~~~
jongraehl
Yeah, I looked up the method and agree that if most of your tables are small,
it's worth those simple ifs.

I only meant to store it in addition, because the alternative, storing an index into the list of divisors (and thus reciprocals), might be slower due to an extra indirection.

------
martin_
Changing the modulus to use known constants is an awesome trick! Great read.

~~~
qznc
I reverted that part in dmd once. Same speed on my Intel i7 processor.
Branching vs division is probably a tricky tradeoff.

~~~
WalterBright
It's worth checking the generated assembler to see if the optimization
actually took place in your build.

Note that you may be using an older dmc which did not do the divide
optimization.

------
shasta
Walter, could you explain why lexing was a bottleneck? That's very surprising
to me. You don't re-lex template instantiations do you?

~~~
WalterBright
Lexing has been a bottleneck in every compiler I've built. The only answer I
have is that ASTs are a lot smaller than source code.

Templates are stored as ASTs. They are not re-lexed.

~~~
acqq
Do you consider lexing to be only "reading the file and finding out whether a sequence of characters is a keyword, a literal, or a comment," or something more? I admit I lex source that is already in memory, and there lexing takes less CPU time than all the other processing. Also, in my case the source is always smaller in memory than any AST.

~~~
WalterBright
Lexing is converting the source text into a token stream.

If lexing takes relatively less time for you, perhaps you have bottlenecks in the later passes?

~~~
shasta
I'm just having a hard time understanding how you could be having complaints
about the compile times if you have a fast lexer and that lexer is a
significant percentage of the total time. Can't your lexer handle a million
lines of code in a few seconds? How big are these code bases?

~~~
WalterBright
It does handle a million lines in a few seconds - it's just that the rest of
the compiler's work goes even faster.

The top cycle sucker is Lexer::scan(). Here's the source:

[https://github.com/D-Programming-
Language/dmd/blob/master/sr...](https://github.com/D-Programming-
Language/dmd/blob/master/src/lexer.c)

See line 440. It's entirely possible I've missed something glaringly obvious -
have a go at it and see if you spot anything.

Oh, and "complaints" is a relative term. DMD is incredibly fast at compiling
compared to other compilers - it's just that I want it to go even faster.
Anything less than instantaneous I regard as "needs improvement." When D gets
a design win at a company, I'll often ask what put D out in front of the
competition. "Compile speed" is usually mentioned. Compile speed has a huge
effect on programmer productivity.

~~~
jeremiep
Looks like the same kind of problem that plagues interpreter loops: that top-level switch statement causes a branch misprediction on almost every hit.

The solution I've seen is to have each opcode handler determine what the next handler will be and jump there directly.
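
A sketch of what that looks like, using GCC's computed-goto extension (the opcodes here are made up): each NEXT() is a separate indirect branch, so the predictor keeps a separate history per handler instead of one hopeless history for a single shared switch:

    long run(const unsigned char *ip)
    {
        static void *handlers[] = { &&do_halt, &&do_inc, &&do_dec };
        long acc = 0;
    #define NEXT() goto *handlers[*ip++]
        NEXT();
    do_inc:  acc++; NEXT();
    do_dec:  acc--; NEXT();
    do_halt: return acc;
    #undef NEXT
    }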

~~~
acqq
The problem is that most of the time the main switch of lex isn't entered in
the loop, instead it is called from the many different points in the parser (a
token is considered, something is done, then there's a call for a next token).
The only case I see at the moment where there's loop over the switch is when
doing the whitespaces, that can maybe be slightly improved.

The whole topic of lexing turning out to be slow is really a thing that should
be measured. Really intriguing.

~~~
nkurz
Glancing at the source, I think the whitespace parsing could be improved.
Since whitespace will rarely if ever (?) be followed by more whitespace, and
since most other things are likely to be followed by whitespace, just having
both "parse_expecting_whitespace()" and "parse_probably_not_whitespace()"
should improve the prediction.

Other tokens probably have similar but less strong "preferences". It's
possible that a smart enough branch predictor will learn these on its own, but
changing things so that each token ends with its own predicted branch would
likely be a win.

Since the branch predictor uses the address of the branch as the 'key', the
general goal is to have a separate branch instruction (if, switch) in each
place where the probabilities are different enough to favor a different
prediction. A wrong prediction costs about 15 cycles, but a correctly
predicted branch costs less than 1, so you can put in quite a lot of these and
still come out ahead.

Perhaps just ending every token handler with a best guess at what comes next?

    
    
      if (T->ptr[0] == most_likely_next) 
        handle_most_likely_next(T);
      else scan(T);
    

I don't have the syntax in my head, but I think you can use 'perf' to record mispredicted indirect branches and then display them sorted by caller.

~~~
acqq
At the beginning of every line there are several consecutive whitespace characters, unless the code isn't indented.

~~~
nkurz
Great, that sounds like a fine opportunity for a "best guess": if newline,
expect whitespace. Currently, I'd guess that 'whitespace except newline' is
the default prediction for the switch() at line 451. I'd also guess that if
not followed by a space or tab, a newline is frequently followed by another
newline.

Maybe you could combine the case statements for space and newline, and do a branchless 'cmov' to increment loc.linenum if the match was a newline? This could be combined with a loop to grab all the whitespace/newlines in one go, if you think whitespace occurs in clumps.
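
Something like this, as a rough sketch with stand-in names for the lexer's fields:

    while (*p == ' ' || *p == '\t' || *p == '\n')
    {
        linnum += (*p == '\n');    /* branchless line count */
        p++;
    }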

~~~
acqq
FWIW, I'd never put CPU-dependent asm code in the lexer.

~~~
nkurz
I played with how to phrase that. You don't need to actually use a CMOV, just
write code that allows your compiler to use one if supported.

    
    
      /* written as a conditional assignment so the compiler may emit a cmov */
      int tmp = real;
      if (foo == bar) {
        tmp = newval;
      }
      real = tmp;
    

I've seen it referred to as a 'hammock'[1], and at least for GCC and ICC it
usually is a strong enough hint.

[1]
[http://people.engr.ncsu.edu/ericro/publications/conference_M...](http://people.engr.ncsu.edu/ericro/publications/conference_MICRO-45.pdf)

------
gridspy
A massive advantage of your new linear allocator is that it keeps your memory accesses contiguous. This means that the processor is more likely to have the most recently used memory locations already in cache.

You might see further improvements if you split your allocations between two
(or more) allocators. One for memory you expect to remain hot (core to the
compiler) and one for stuff you think is one-off. That might improve access
locality further.
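
In sketch form (every name here is invented), that's just a second bump region plus a caller-supplied hint about expected lifetime:

    #include <stddef.h>

    typedef struct { char *next, *end; } Arena;
    static Arena hot_arena, scratch_arena;

    void *alloc_hint(size_t n, int long_lived)
    {
        Arena *a = long_lived ? &hot_arena : &scratch_arena;
        n = (n + 15) & ~(size_t)15;
        if (a->next + n > a->end)
            return NULL;           /* region setup and refilling omitted */
        void *p = a->next;
        a->next += n;
        return p;
    }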

------
oh_teh_meows
Does your compiler perform any transformations at all? I imagine it can run out of memory pretty quickly if you're performing multiple transformations in succession on a large code base, unless you recycle some of that memory.

Granted... since you explicitly stated that your compiler focuses on compile speed, I guess optimized code generation isn't your main concern, since the two are more or less mutually exclusive.

~~~
WalterBright
Compile speed issues are for non-optimized builds. Optimized builds take
significantly longer, as those are for maximum generated code speed rather
than compile speed.

~~~
oh_teh_meows
I guess I wasn't being clear. I was just curious: how do you handle memory in the case of optimized builds?

~~~
WalterBright
The same.

