

Compilers can never generate even good code - gnosis
http://www.zyvra.org/laforth/

======
a1k0n
> I know of no compiler that generates the Intel string instructions. Not even
> Forth!

Yes, because they are measurably slower than their equivalent primitive
instruction sequences, and the space saved isn't really all that significant
when you consider all the other loads and stores necessary to run them. I
think they've been actually slower since either the 486 or Pentium days. I'm
pretty sure, for instance, Watcom C++ would generate "rep movsd" sequences for
memcpy back in the 386/486 days, back when it was actually an optimization to
do so.

~~~
a1k0n
Actually I have to update my own comment here, due to a completely unexpected
circumstance. I just happened to disassemble some 64-bit test code I wrote
tonight (unrelated to testing the string instruction theory) and lo and
behold, Apple gcc-4.2.1 with -O3 turned memset(x, 0, 7*sizeof(uint64_t)) into
a 3-byte "rep stos" instruction:

    
    
        0x0000000100000cee <main+30>:	cld  
        0x0000000100000cef <main+31>:	mov    $0x7,%ecx  
        0x0000000100000cf4 <main+36>:	xor    %eax,%eax  
        0x0000000100000cf6 <main+38>:	mov    %r15,%rdi  
        0x0000000100000cf9 <main+41>:	rep stos %rax,%es:(%rdi)  
        0x0000000100000cfc <main+44>:	...
    

It even does pretty much the same thing with -m32 and with optimization turned
down to -O1. So perhaps things have changed.

------
jbri
Ironically, the vast majority of "improvements" listed in the post (everything
except using scasd) are algorithmic ones that _are_ in fact doable in a
compiled language.

And even special machine instructions seem quite achievable with a pattern-
matching code generator.

Overall I think the HN title is very misleading - the article seems more
"compilers don't do this well yet", definitely not "compilers can never do
this".

~~~
kragen
And scasd probably isn't a speed improvement on modern CPUs.

------
pshc
>How do you tell a C compiler that?

You'd need dependent types and a compiler that understood exactly how much
information in bits was required for a particular proposition, I'm guessing. I
think this is the sort of thing (not using tricky assembly instructions
specifically, but using unboxed types with expressive type systems) that makes
languages like ATS[1] possibly faster than C.

The point is: good code generation is possible in theory, even if
"sufficiently smart" compilers are too hard to write. (Or are equivalent to
general A.I.)

[1] <http://en.wikipedia.org/wiki/ATS_(programming_language)>

~~~
wickedchicken
Modern day CS: every problem can be solved with a better type system

------
mynegation
I hope it is some kind of a joke.

It is not the responsibility of a compiler to recognize specific problems in
specific domains and perform optimizations; that is the job of a human
programmer. Sure, you can make a compiler aware of Riemann zeta function
values in specific ranges, but what's the use of that outside of very
specialized programs? The optimizations mentioned in the article could be
programmed in C, and the compiler would happily translate these savings into
assembly.

The reason we don't really program in assembly anymore is that _on average
and on a large scale_ compilers generate vastly better machine code than
humans. Take, for example, the register allocation problem, which is proven
to be NP-complete. Obviously the compiler uses an approximation algorithm,
but save for the case of a few registers, no human can match it.

Some architectures, like VLIW, actually rely on the compiler to generate well-
performing machine code.

------
bluekeybox
> It looks like the Intel 8086 architecture and its descendents will be with
> us for ever ... Why not take advantage of every idiosyncrasy of that
> instruction set?

Because ARM.

~~~
chubs
I was thinking that, but to be fair he gave that talk 4 years ago before the
tablet/phone market took off.

~~~
to3m
Additionally, the 8086 is not much like the 386! And the 386 isn't much like
the 486, which differed surprisingly from the Pentium. And each subsequent
revision has followed this trend of being (at the very least) different enough
from its predecessors that assembly language programmers need to re-read the
manuals.

Today we even have a 64-bit version, which differs from its trail of 32-bit
ancestors even more than they differed from one another...

------
laughinghan
I'm sure he agrees that algorithmic improvements are more effective than
constant-factor improvements like optimizing the assembly code. I think he'd
also agree that algorithmic improvements are easier to make in higher-level
languages than in assembly. I'd argue, then, that your time is better spent
iterating on the higher-level program and squeezing out as much algorithmic
optimization as you can than squeezing as much assembly optimization out of
x86 as you can.

There is the case where a solution is known to be algorithmically optimal, or
where there is only one leading contender for optimal, it's unlikely to be
toppled, and you'd need a PhD and a couple of postdocs in theoretical CS to
even begin to challenge it; in that case it would indeed be more worth your
time to optimize the assembly code. I hear Adobe Photoshop uses an FFT
written in optimized assembly because almost every CPU-intensive thing they
do uses it. Has he taken a gander at that? Pretty sure the fastest known way
to calculate prime numbers changes a couple times a year.

However, that is only the case with textbook CS problems, and I think most HN
members would agree that in industry, software engineering is almost never
actually about solving textbook problems, and tends to have quirks that lead
to clever algorithmic optimizations that are continually being discovered.

------
cloudwalking
> The current vogue among computer users is to get as far from the real
> hardware as possible. ... Assembly is the epitome of using the features of
> a given machine.

But what about the features of a particular language? Mucking in assembly is
great if you're trying to squeeze out every ounce of power, but it's much too
verbose for building complex systems. If you're trying to build _user-facing_
features, layers and layers of language abstraction is great. See
<http://www.paulgraham.com/hundred.html>

If you're building a prime-number generator, go ahead and write in assembly.
But if you're writing an _application_ , like most of us on here (I assume?),
use a high-level language. "[Taking] advantage of every idiosyncrasy of [the]
instruction set" is a waste of time.

------
rwmj
Compilers have been written which (for tiny microbenchmarks like this one)
search every possible machine code sequence to find the fastest. Of course
they take absolutely forever to run, but they do _exist_.

<http://www.arnetminer.org/viewpub.do?pid=600808>

[http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=8A1C...](http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=8A1CFFA39BD95B8CA4C024F422A71387?doi=10.1.1.13.7059&rep=rep1&type=pdf)
[PDF]

------
SoftwareMaven
That was quite an arrogant post that really seemed to miss the forest for the
trees. Sure, perhaps you can build a faster prime number generator in
assembly, but you'll never build a better application in assembly (where
better is defined as feature rich, secure, etc), and that's what the job is
about.

~~~
archangel_one
Performance can also be a feature.

~~~
SoftwareMaven
Absolutely! But performance in most applications is not determined at the bit
level. Most apps' performance needs are met at the architectural level
(prevent blocking on _whatever_ ). Fewer apps' performance needs are met at
the algorithmic level (O(log n) is much faster than O(n log n)). It is a rare
app that finds its performance requirements in the ops and bits (highly
math-centric apps; others exist, I'm sure).

I've generally found that's the order I go in meeting performance
requirements. When I've failed to meet those requirements, I've generally
worked backwards up the list, with each step getting more expensive in an
exponential fashion.

------
carsongross
In an I/O bound world nobody can hear your hand-rolled assembler scream...

------
apaprocki
Relevant to everyone interested in this topic: [http://www.linux-
kongress.org/2009/slides/compiler_survey_fe...](http://www.linux-
kongress.org/2009/slides/compiler_survey_felix_von_leitner.pdf)

"Abstract - People often write less readable code because they think it will
produce faster code. Unfortunately, in most cases, the code will not be
faster. Warning: advanced topic, contains assembly language code."

------
michaelochurch
I actually think this is a subtle parody of the "Scala/Haskell/Java/Ocaml is
impractical because you can't write performant code in it" crowd.

On tiny and contrived micro-benchmarks, it's nearly impossible for a (non-
contrived) expressive language to beat less expressive languages like machine
code, X86 assembly, or C. For real, large software projects, the case for a
less expressive language is not nearly as convincing, because the shortcomings
of the language mean that programmers aren't able to write the same quality of
code. For example, compare the outright mess that is concurrency in C or Java
to the much easier concurrency available in more expressive languages. As
parallelism becomes far more important for people who need high performance
(yes, I know concurrency and parallelism aren't exactly the same thing, but
they're similar) this advantage of the less expressive languages is likely to
diminish, if not disappear outright.

[Edit: Added-- This is analogous to the static/dynamic typing debate. Dynamic
typing test drives extremely well but multi-developer projects in dynamically
typed languages usually turn into outright nightmares, due to problems around
interfaces. Static typing's better for large projects. Same with expressive
languages: assembly's going to win on microbenchmarks, but writing good code
in less expressive languages becomes impossible as the size and scope of the
project gets large.]

------
samth
Articles on how fast some program is, with no benchmarks, are not that
persuasive. Especially since it's based mostly on the claim that "no compiler
would do _that_ ".

~~~
colinhowe
I'd laugh so hard if someone knocked an equivalent up in C and it was
faster...

------
scg
Actually, if you're using a higher-level language you stand a chance of
implementing a much faster algorithm, which would trump any of these
constant-factor speed-ups.
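
As a concrete (if textbook) illustration, assuming the hand-tuned baseline is
something like trial division: a plain-C sieve of Eratosthenes wins
asymptotically, no matter how tight the assembly on the slower algorithm is.

```c
#include <stdlib.h>

/* Count primes below n with a sieve of Eratosthenes. The
   O(n log log n) sieve beats hand-tuned trial division
   (roughly O(n^1.5)) for large n, regardless of which language
   either one is written in. */
int count_primes_below(int n) {
    if (n < 3)
        return 0;
    char *composite = calloc((size_t)n, 1);
    int count = 0;
    for (int i = 2; i < n; i++) {
        if (composite[i])
            continue;
        count++;  /* i survived all smaller primes */
        for (long j = (long)i * i; j < n; j += i)
            composite[j] = 1;
    }
    free(composite);
    return count;
}
```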

------
gavanwoolery
Machines and CPU time are cheap...at least compared to the price of hiring
additional programmers to maintain the mess of assembly code you just wrote.

~~~
timthorn
Assembly code doesn't need to be a mess. Usual rules apply - you can write a
mess in any language, much as you can write well structured code in any
language.

~~~
hapless
Obsessive micro-optimization obscuring the function of code is _always_ a
mess.

You are correct that the mess is language agnostic. For example, if he'd
attempted to discard the low order bit of his prime numbers in C, it would
result in the same awful maintenance.

Some optimizations waste too much programmer time to be worth tackling.
Distorting the meaning of code for some tiny performance gains never
constitutes "well-structured" code.

~~~
timthorn
Optimisation is not the same as programming. You can program in Assembly
without trying to make your code as performant as possible. Just because you
use Assembly as your medium of describing your intent, doesn't mean that
you're obsessive about micro-optimisation.

------
EdiX
Maybe I am missing something but all the optimizations in this article except
the xor (2) and the scasd (4) can be done in high level programming languages.

Number 3 (outputting a sequence of integers as the sequence of differences
between consecutive elements) is a very common optimization for inverted
index formats; IIRC Lucene, written in Java, does it.
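
The gap-encoding trick is a couple of lines in any language; a minimal sketch
(the function name is mine):

```c
#include <stddef.h>

/* Turn a sorted list of document ids into the gaps between
   consecutive elements, in place. The gaps are small numbers,
   which a variable-length integer code then stores compactly;
   decoding is just a running sum in the other direction. */
void delta_encode(unsigned *ids, size_t n) {
    for (size_t i = n; i-- > 1; )  /* walk backwards; safe for n == 0 */
        ids[i] -= ids[i - 1];
}
```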

~~~
jules
Why can't the xor be done in high level languages?

~~~
EdiX
You can't force the compiler to output an xor on a single-byte register (CL).
I wonder whether that's actually any faster.
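
For what it's worth, with gcc-style extended inline assembly you can at least
request a byte-register xor explicitly; whether it's any faster is another
matter. This is x86-specific and the constraint syntax is a gcc extension,
not standard C:

```c
/* The "q" constraint asks gcc for a byte-addressable register,
   and "%b0" names its low byte, so this emits something like
   "xorb %cl, %cl" (the exact register depends on allocation). */
unsigned char zero_byte(void) {
    unsigned char c;
    __asm__ ("xorb %b0, %b0" : "=q" (c));
    return c;
}
```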

------
georgieporgie
This guy has a very different definition of 'good' than I do -OR- this is
funny.

