
Don't use assembly unless you're an expert - AndrewDucker
http://quetzalcoatal.blogspot.co.uk/2014/04/if-you-want-fast-code-dont-use-assembly.html?m=1
======
jwr
I'd rather say "don't use Intel x86 assembly unless you are measuring very,
very carefully". First, few architectures have so many weird traps and
interactions as x86 does. Second, most of these things become visible once you
start measuring (although you have to be very careful when measuring
performance, as it is an art in itself).

The x86 these days is effectively a poor-man's VLIW machine: you have a
number of execution units with out-of-order execution. This means that you
can look at your instruction stream as a stream of VLIW instructions for
all of the CPU's units. The performance of your code might depend not just
on the exact number of instructions, their execution times and latencies,
but also on the interdependencies between your instructions and scheduling
constraints. Getting that right is a nightmare, and even if you do get it
right on your chip, there are at least several major x86 microarchitectures
in use, each one with its own peculiarities.

As a counterexample, I recently wrote some ARM (Thumb-2, for the Cortex-M4)
code. It wasn't very difficult, and the code I wrote was _much_ better than
what gcc could produce.

~~~
charlieflowers
A random thought just crossed my mind -- how about a genetic algorithm that
experiments with all these options and evolves towards the best performance?

Run speed is objective and measurable, which is desirable for a fitness
criterion.

Somehow, you'd have to specify which instructions can be moved around and
which can't -- or maybe not, if you could also assess whether the program
executed "successfully."

I don't really see it as practical, but it might be fun. Might uncover
optimizations that no one has seen before.
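
A minimal sketch of the idea in C, with selection and mutation only (so
closer to stochastic hill-climbing than a full GA with crossover).
Everything here is hypothetical, especially measure_runtime, which would
have to assemble and run each candidate, verify its output, and time it:

    #include <stdlib.h>
    #include <string.h>
    
    #define POP 32   /* candidates per generation */
    #define LEN 16   /* instructions per candidate */
    
    /* Hypothetical: assemble `insns`, run it, check the output, and
       return the runtime (or a huge penalty if the output was wrong). */
    extern double measure_runtime(const int *insns, int len);
    
    /* Mutate by swapping two instructions. A real mutator would have
       to respect (or repair) dependencies between instructions. */
    static void mutate(int *insns, int len) {
        int a = rand() % len, b = rand() % len;
        int t = insns[a]; insns[a] = insns[b]; insns[b] = t;
    }
    
    void evolve(int pop[POP][LEN], int generations) {
        for (int g = 0; g < generations; g++) {
            int best = 0;
            for (int i = 1; i < POP; i++)  /* fitness = measured speed */
                if (measure_runtime(pop[i], LEN) <
                    measure_runtime(pop[best], LEN))
                    best = i;
            for (int i = 0; i < POP; i++) { /* refill from the winner */
                if (i == best) continue;
                memcpy(pop[i], pop[best], sizeof pop[i]);
                mutate(pop[i], LEN);
            }
        }
    }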

~~~
jamesjporter
It's not a genetic algorithm, but FFTW's planning system actually does perform
experiments to optimize for a particular machine's hardware.
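
For anyone curious, this is roughly what that looks like from the caller's
side (standard FFTW3 API; with FFTW_MEASURE the planner actually runs and
times candidate strategies on your machine instead of guessing):

    #include <fftw3.h>
    
    void transform(int n) {
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);
    
        /* FFTW_MEASURE times several FFT strategies on this hardware
           and picks the fastest; FFTW_ESTIMATE would just guess. Note
           that planning with FFTW_MEASURE may overwrite `in`. */
        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD,
                                       FFTW_MEASURE);
    
        /* ... fill `in` with data, then ... */
        fftw_execute(p);
    
        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
    }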

~~~
stcredzero
Genetic algorithms are just a particular kind of stochastic hill-climbing
graph-search. (EDIT: It's actually a general graph, not a tree.)

------
sillysaurus3
Writing SSE assembly (not intrinsics) by hand can sometimes give you over a
10x speedup in an inner loop of a computationally expensive problem, like CPU
bone animation of vertices.

And while it's convenient to use intrinsics to generate SSE (intrinsics look
like ordinary C functions and save you from having to write actual assembly),
it's impossible for them to _always_ be as efficient as a clever programmer.
I'm pretty sure that generating perfectly efficient SSE from SSE intrinsics is
an NP-complete problem. So a clever programmer will always be able to boost
performance more than a compiler.
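
For anyone unfamiliar with the distinction, a minimal sketch of the
intrinsics style (SSE1; assumes n is a multiple of 4):

    #include <xmmintrin.h>
    
    /* out[i] = a[i] * b[i] + c[i], four floats per iteration. */
    void madd(float *out, const float *a, const float *b,
              const float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            __m128 vc = _mm_loadu_ps(c + i);
            _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
        }
    }

The compiler still chooses the registers, the loop structure, and the
instruction schedule here, which is exactly where a clever hand-coder can
sometimes win.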

That said, the two biggest arguments against assembly are:

(a) performance matters much less in the modern day,

(b) you kill your portability by writing inline assembly. If you expect your
code to run on Windows (and gamedevs expect their code to run on Windows,
though this may be changing) you'll need to maintain two parallel, #ifdef'd,
hand-written bits of assembly that do exactly the same thing: one for GCC,
and one for MSVC. It's a Royal Pain.

Those are two damning arguments, and it's why SSE assembly crafting has become
something of a lost art.

EDIT: I originally led with "It's disappointing that the author makes no
mention of SSE assembly," but this was confusing due to the distinction
between assembly vs intrinsics. They did briefly mention SSE intrinsics, but
not that hand-written SSE assembly is sometimes superior for raw performance.
Also, getting a strikethrough modifier on HN for edits would be wonderful.

~~~
zwegner
> I'm pretty sure that generating perfectly efficient SSE from SSE intrinsics
> is an NP-complete problem. So a clever programmer will always be able to
> boost performance more than a compiler.

Being NP-complete doesn't really have anything to do with whether humans can
solve a given problem better than computers or not. It just means that it
(probably) takes exponential time to solve.

I believe there is no distinction whatsoever, in principle, between computers
and humans (our brains are just big, complicated calculators). As such, it's
only a matter of time before computers beat humans in compilation at every
point, just like the situation in chess. People claimed for years that
humans would always have some edge in chess, because they "understand" the
game, or have "ingenuity", or something like that. Nobody claims that
anymore, since computers absolutely massacre humans. And similarly to chess,
computers have been getting better and better at a very fast rate, while
humans have more or less stagnated.

As an aside, computer chess got quite boring for me, due to the homogeneity
of current programs, but it did help show me the limitations in current
programming languages and compilers. So I'd like to do my part to put humans
in their place in another field. My contribution -- still very much not even
close to finished, and one I haven't found much time to work on in a while --
is the Mutagen programming language:
[https://github.com/zwegner/mutagen](https://github.com/zwegner/mutagen)

~~~
joosters
Often, the compiler cannot produce better code than humans because of the
failings/features of the language being used. Even C/C++ has many gotchas:
for example, the generated assembly might have to repeatedly dereference a
pointer because the source code does not guarantee that the data won't have
been changed by someone else. Meanwhile, a programmer who hand-crafts the
same assembly might know that this can't happen, so they can keep the value
in a register.
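
A sketch of that gotcha in C: below, the compiler must assume the store
through out might change *scale, so *scale is reloaded from memory on every
iteration, while a human who knows the buffers never overlap could keep it
in a register:

    void scale_all(float *out, const float *in,
                   const float *scale, int n) {
        for (int i = 0; i < n; i++)
            out[i] = in[i] * *scale;  /* *scale re-read every time:
                                         out[i] might alias it */
    }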

In theory, some of these shortcomings could be solved by a well-written
aggressive whole-program optimiser that could deduce data usage & aliasing,
but in practice they don't (or can't): few people use WPO, and it falls
apart when you use library code.

In short, compilers probably don't have enough info to beat a skilled assembly
programmer.

~~~
zwegner
Absolutely--I see this as a failure of current programming languages and
compiler technology, though, and not a fundamental barrier. We should aim to
take away all the impediments to the compiler's job, like the aliasing problem
you point out. Then, we can just sit back and let the blinding speed and
precision of computers take over, trying all the different possible
compilations of a given program.

This is the main goal of the Mutagen language. See the link above for a bunch
of hand-waving about compiler technology that might exist some day :)

~~~
sharpneli
It's quite doable in C nowadays. Just use the restrict keyword, which pretty
much gives the compiler permission to assume that nothing else aliases the
pointer in question.

It's not that hard to write loops in C that vectorize well. And you get the
benefit that those same loops also vectorize for ARM NEON.
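
Taking the aliasing example from upthread, a sketch of the restrict version
(C99):

    void scale_all(float * restrict out, const float * restrict in,
                   const float * restrict scale, int n) {
        /* restrict promises these buffers never overlap, so *scale
           can stay in a register and the loop is free to vectorize. */
        for (int i = 0; i < n; i++)
            out[i] = in[i] * *scale;
    }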

~~~
joosters
While the restrict keyword helps in several cases, it's not a catch-all.
Even worse, it is very easy to get wrong, leading to near-undetectable
errors in your code at some future date.

Worse, it's almost impossible to retrofit into large existing code. Even
turning on strict aliasing in the compiler might cause hideous problems. The
compiler can't always warn you about these.

The 'holy grail' is still for a compiler that can infer this stuff from the
code. With existing C code, that requires at least WPO and some very deep
reasoning about pointer usage. The other approach is a different programming
language that makes data/variable access control more explicit.

~~~
sharpneli
I agree that retrofitting existing codebase is a pain.

Turning on strict aliasing is a mixed bag as programs that break when it's on
are explicitly breaking the C spec. But one must make amends to support badly
written legacy code. The official way to handle the aliasing across types is
to use unions.
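
For example, inspecting the bits of a float through a cast pointer violates
strict aliasing, but going through a union is the sanctioned route (a
minimal sketch):

    #include <stdint.h>
    
    uint32_t float_bits(float f) {
        union { float f; uint32_t u; } pun;
        pun.f = f;
        return pun.u;  /* reading the other member is the C-blessed
                          way to alias across types */
    }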

In my opinion, the default behaviour in C should have been what restrict
does, with one keyword to allow aliasing and yet another to allow aliasing
across types, because unions are a bit tedious to use -- something like
float *aliased foo. And there should be warnings when you cast a normal
pointer into an aliased one, or across types when the pointer is not marked
as aliasing across types.

------
benjamincburns
Don't follow this editorialized title's advice if you'd actually like to
become an expert [1].

The actual article is quite good and makes a very valid point. Don't jump to
machine code whenever things get a bit slow; chances are you're introducing
more problems than you're fixing, and it's very difficult to outdo the
compiler. Generally you should start by looking for macro-optimizations
instead. But to add my own advice: it's still an _incredibly_ worthwhile
thing to learn, even if you don't expect to ever run into the kinds of
scenarios that benefit from optimization at this level.

Why? Programming languages do everything they can to abstract away the
machine, and abstractions do leak [2]. Even mostly air-tight abstractions
over machine code [3]. Because of this, learning the entire "stack" makes
debugging higher-level code much easier. When you can understand what's
happening at every level right down to the metal, what were once ridiculous
off-the-wall problems become recognizable and much easier to reason about.

So _do_ use machine code if you'd like to _become_ an expert. Just use it on
your own time, and don't use it under duress.

1: Presently reads "Don't use assembly unless you're an expert."

2: [http://www.joelonsoftware.com/articles/LeakyAbstractions.html](http://www.joelonsoftware.com/articles/LeakyAbstractions.html)

3: [http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array)

~~~
elwell
> The title of this post was obviously meant to be an attention-grabber

It's becoming very common to have a catchy title and then a disclaimer for it
in the article.

~~~
benjamincburns
A trend which is both tempting and annoying. I have a feeling that this is
causing some people to develop an analogue to "advertisement blindness" where
people tend to ignore titles which fit into certain models. I think the filter
gets tripped for me any time people use overly emotional modifier words, the
word "this," or the phrase "what happens next." This title didn't fit any of
those, however.

------
csl
_and then runs the function 1000 more times, measuring each run independently
and reporting the average runtime at the end._

When it comes to measuring the performance of code like this, averaging run
times is not the way to do it.

To remove the noise caused by context switching, just run the code many times
and report the single fastest run you get. This should be the value closest to
running the code on an OS without preemption (i.e. you want to measure how
fast the code runs on the bare metal without interruption).

Even Facebook's Folly library [0] changed their benchmarking code from using
statistics to just providing the fastest run. As the comments say:

    // Current state of the art: get the minimum. After some
    // experimentation, it seems taking the minimum is the best.
    return *min_element(begin, end);

This is explained in the docs [1]:

 _Benchmark timings are not a regular random variable that fluctuates around
an average. Instead, the real time we're looking for is one to which there's
a variety of additive noise (i.e. there is no noise that could actually
shorten the benchmark time below its real value). In theory, taking an
infinite amount of samples and keeping the minimum is the actual time that
needs measuring._
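
A minimal sketch of that min-of-N approach in C (clock_gettime is POSIX;
this shows the shape of the harness, not Folly's actual code):

    #include <time.h>
    
    /* Return the fastest of `runs` timings of fn(), in nanoseconds. */
    static long long benchmark_min(void (*fn)(void), int runs) {
        long long best = -1;
        for (int i = 0; i < runs; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            fn();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            long long ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                         + (t1.tv_nsec - t0.tv_nsec);
            if (best < 0 || ns < best)
                best = ns;  /* keep the minimum, not the average */
        }
        return best;
    }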

[0]:
[https://github.com/facebook/folly/blob/master/folly/Benchmar...](https://github.com/facebook/folly/blob/master/folly/Benchmark.cpp#L142)

[1]:
[https://github.com/facebook/folly/blob/master/folly/docs/Ben...](https://github.com/facebook/folly/blob/master/folly/docs/Benchmark.md#a-look-
under-the-hood)

------
overgard
Sort of a tangent, but you know what I'd love to see? A compiler that could
emit comments alongside the code it generates. They wouldn't have to be
brilliant; just whenever it decides to optimize some bit, it emits a generic
"why". It'd be even cooler if it were a website hooked up to something like
GCC, where you could feed it some C code and get a hypothetical "what would
the compiler do here" sort of deal.

I realize that's kind of crazy, but it'd be an awesome teaching and debugging
tool. I don't look at disassembly a lot, but when I do it's not always
particularly obvious why the compiler generated a thing in a certain way.

~~~
CUViper
It's not exactly what you're asking for, but gcc with -fdump-rtl, -fdump-tree,
and -fopt-info can show you quite a lot of what's going on.

------
forgottenpaswrd
Don't use assembly, period.

I can make things go more than 10 times faster in assembly. My main job is
as a manager/entrepreneur, but I can read and write assembly as a result of
my experience, and I easily help and guide other people.

In the real world 10 times faster is nothing. You should spend the time
understanding the problem in a mathematical way and, VERY IMPORTANT,
documenting your work using images, text, voice and video.

This way you can make things go 100, 1,000, or 10,000 times faster, as most
algorithms can be indexed or ordered in some way that makes them extremely
fast -- doing log() operations instead of n squared, cubed, or raised to the
fourth or fifth power (as when you manage several dimensions, like 3D with
time in video analysis or medical tomography).

More important than that: 10 years from now it will continue working on new
devices and OSes, and will be something that supports the company instead of
being a debt burden because the original developer is no longer there (or
you don't have the slightest idea of what you did so far in the past and did
not document it).

The main problem is that people are not self-aware enough to admit that they
forget things. And your brilliant idea that makes everything go 3 times
faster is nuts if it makes everything way harder to understand, or if it
could be forgotten even by you.

~~~
jwr
That's not good advice. Whether you need to use assembly depends on the
particular situation at hand.

Here's a practical example: as a result of redesigning the algorithm to use
fixed-point and implementing it in assembly, I got it to run 600x faster than
the initial C version. Big O complexity was the same, the difference was in
the constant factor. But the constant factor matters! In my case, it meant
that you could get your computation done in half a day instead of a year.

Yes, it took me 3 weeks to get the algorithm implemented, instead of a single
day, but even so — it was definitely worth it. And in many cases even a 3-fold
improvement in speed is important, if you have long-running calculations.

~~~
graphene
That bit about fixed point is extremely interesting. I found your blog post
about the project you're (I think) referring to
([http://jan.rychter.com/enblog/2009/12/4/x86-assembly-encounter.html](http://jan.rychter.com/enblog/2009/12/4/x86-assembly-encounter.html)),
but it doesn't mention the fixed point part.

Not knowing too much about processor architecture, I don't understand how
fixed point can be much faster, since floating point ops are implemented in
hardware. I presume you used integer operations on your fixed point values,
but could you explain a bit why it ends up being much faster than floating
point?

~~~
jwr
It all depends on how precise your fixed point values need to be. If you can
squeeze them into 8 bits (I could), you can use SSE 128-bit registers to
operate on 16 values at a time. It gets even better with AVX, although that
wasn't available to me at the time.

So the speedup is not just from going to fixed point, but from managing to use
the vector instructions.
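
To make that concrete, a sketch of 16-at-a-time with SSE2 integer
intrinsics (saturating 8-bit adds, often exactly what fixed point wants;
assumes n is a multiple of 16):

    #include <emmintrin.h>
    #include <stdint.h>
    
    /* out[i] = saturate(a[i] + b[i]) on 8-bit fixed-point values,
       16 lanes per instruction. */
    void add_sat_u8(uint8_t *out, const uint8_t *a,
                    const uint8_t *b, int n) {
        for (int i = 0; i < n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(out + i),
                             _mm_adds_epu8(va, vb));
        }
    }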

------
dror
Just another form of Knuth's "Premature optimization is the root of all evil
(or at least most of it) in programming."

Or my personal variant: most optimizations will end up making things slower.

~~~
maximilianburke
You forgot the whole quote. "Programmers waste enormous amounts of time
thinking about, or worrying about, the speed of noncritical parts of their
programs, and these attempts at efficiency actually have a strong negative
impact when debugging and maintenance are considered. We should forget about
small efficiencies, say about 97% of the time: premature optimization is the
root of all evil. Yet we should not pass up our opportunities in that critical
3%."

Knuth is saying that in most cases optimizing prematurely is not worth it,
which is right, but just as critical is the part saying that there are
places where it pays off to consider optimization early. Writing your
logging function in assembly likely fits into the 97%. Writing your N-queens
solver, if you're aiming for all-out speed, likely fits into the 3%.

Of course, not thinking about optimization at all will lead you down a path
where your software is unoptimizable, or at least very difficult to
optimize. Ask anyone who's had to optimize a video game that manages all
objects in a scene graph, where reorganizing a cache-hostile tangled web of
pointers to base classes is a herculean feat.
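
The usual fix, sketched in C (the struct here is hypothetical): replace
individually allocated, pointer-linked objects with contiguous arrays that
the prefetcher can stream through:

    /* Cache-hostile: each update chases a pointer, and every object
       may sit on a different cache line. */
    struct object { float x, y, z; /* ...many more fields... */ };
    
    void update_scattered(struct object **objs, int n, float dt) {
        for (int i = 0; i < n; i++)
            objs[i]->x += dt;
    }
    
    /* Cache-friendly: the same field packed contiguously; access is
       sequential, prefetchable, and vectorizable. */
    void update_packed(float *xs, int n, float dt) {
        for (int i = 0; i < n; i++)
            xs[i] += dt;
    }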

------
jedicoffee
Don't try to learn things, unless you want small minded people to get upset.

------
otikik
If no one did it, there would be no experts.

------
srean
I upvoted this post when it was on the "new" page (not enough people did,
IMO) and am glad that it made it to the front page; many good posts don't.

A persistent peeve of mine is the vapid commentary that invariably shows up
in programming-language-related discussions:

    "what's the point of this language, if one needs speed, it will be
    written in C."

Some would replace "C" with "assembly" in that sentence. The point is that
if a piece of code is doing something non-trivial, it is extremely hard to
have confidence in the correctness of handwritten code that is written
entirely at a low level. Smart compilers apply drastic, and sometimes
unintuitive, transformations to optimize code. To do the same manually,
using low levels of abstraction, would require a tour de force to pull off
correctly. This simply is not going to happen often. It's precisely for
speed that we need a high-level, yet optimization-friendly, language, so
that we can delegate the job of optimizing the code to the compiler.
Compilers are much better at applying large correctness-preserving
transformations than we humans are. "Write it in assembly" works in the
small, not in the large.

~~~
jwr
Whenever I write anything non-trivial in assembly, I always go through a
series of progressively lower-level implementations in C. The last ones
emulate SSE instructions with for loops and correspond very closely to the
assembly code. It isn't possible for me to write a serious chunk of assembly
'just like that'.
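
That intermediate stage might look something like this (a sketch of the
style, not jwr's actual code): each SSE operation becomes a four-wide loop
over a small struct, so the final translation to intrinsics or assembly is
nearly mechanical:

    typedef struct { float lane[4]; } vec4;  /* stand-in for __m128 */
    
    /* Emulates what will become a single ADDPS instruction. */
    static vec4 vec4_add(vec4 a, vec4 b) {
        vec4 r;
        for (int i = 0; i < 4; i++)
            r.lane[i] = a.lane[i] + b.lane[i];
        return r;
    }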

~~~
srean
I did not bookmark it, but there was this fun blog post about writing a
number-crunching HW assignment in Scheme (much against contemporary wisdom).
But it ended up beating the other implementations in C. The trick was to
write it in CPS and keep hand-transforming it iteratively till it was
efficient assembly in a different syntax. I would still argue that you are
really writing in a high-level language/abstraction and the rest is mostly
tedious transliteration, with special attention to keeping the FPUs busy,
loading memory efficiently, and allocating registers well.

~~~
pflanze
Perhaps you're referring to this:
[http://www.cs.ucla.edu/~palsberg/course/purdue/cs565/F96/sob...](http://www.cs.ucla.edu/~palsberg/course/purdue/cs565/F96/sobel.ps)

~~~
srean
Yepp! That's the one.

------
randunel
How can one become an expert without using assembly on a daily basis?

------
Havoc
And in order to become an expert...you'll need to use assembly whilst not
quite there yet. [Hopefully not for mission critical code though]

------
wedesoft
No big deal. Just become an expert then. It is not that hard. Just have a look
at Ian Piumarta's work [1,2]. Future systems will probably facilitate
programming on every level from the bare metal to the highest abstraction.
I.e. no strict boundaries between the implementation of the application and
the implementation of the compiler/interpreter.

[1] [http://piumarta.com/software/maru/](http://piumarta.com/software/maru/)

[2]
[http://www.youtube.com/watch?v=cn7kTPbW6QQ](http://www.youtube.com/watch?v=cn7kTPbW6QQ)

EDIT: There are other reasons to write machine code than performance.

------
chacham15
I feel like people are assuming that the only reason you write assembly is
for performance. I try to stay as far away from it as possible, but
sometimes there is no other way (barring finding a library that has itself
written the assembly). Some examples: getting a backtrace, reading the
CPUID, or calling a function with an arbitrary number of arguments (for some
reflection-like capabilities: imagine a function like "void foo(void (*
func)(), ArrayList * args, char * arg_types)" and getting foo to call func
with args passed as arguments, not as an array). If you can, you should try
to find a library to do those tasks, but the libraries themselves have no
choice but to write assembly, as you just cannot do those things in plain C.
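
For instance, reading CPUID really is a handful of instructions with no
plain-C equivalent; a GCC/Clang extended-asm sketch:

    /* Run CPUID for `leaf` and return eax/ebx/ecx/edx.
       GCC/Clang inline asm, x86/x86-64 only. */
    static void cpuid(unsigned leaf, unsigned regs[4]) {
        __asm__ volatile("cpuid"
                         : "=a"(regs[0]), "=b"(regs[1]),
                           "=c"(regs[2]), "=d"(regs[3])
                         : "a"(leaf), "c"(0));
    }

Compilers do ship wrappers for this one (GCC's cpuid.h has __get_cpuid),
but the wrapper itself is exactly this kind of unavoidable assembly.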

~~~
benjamincburns
Absent other details, I think in most cases this is an argument to go
higher-level, not lower-level.

------
noname123
It's not the application but a man's need to be an animal again. Frankly,
writing Ember.js apps in CoffeeScript, which sits on roughly 11 layers of
abstraction, feels like a metrosexual man eating processed food with organic
marketing on IKEA furniture (CoffeeScript -> Compiled JavaScript -> Ember.js
-> jQuery -> JavaScript -> Chrome V8 Engine -> WebKit Layout Engine ->
GTK+/Cocoa/Chrome Windows View -> Linux/OS X/Win32 display API -> C/C++ ->
Assembly).

There was a time when a man could eat raw meat (machine code), but he can
start again today by eating bloody beef.

~~~
pavlov
Unless the man grows, feeds and slaughters the cow himself, that "bloody beef"
is just as removed from animalistic experience as the organic food.

It's ridiculous to pretend that eating hormone-accelerated mass-processed meat
is closer to nature. Sure, it's oozing blood, but that blood is tainted by an
enormous industrial profit machine.

~~~
com2kid
My boss knows assembly.

My boss also hunts animals with a bow and arrow.

Does he get a pass?

~~~
benjamincburns
Only if he made the bow and arrows himself from sticks, fibre, and shale he
found in the woods, _and_ if he also purified/doped/etched his own silicon
wafers.

~~~
dllthomas
That Seattle kid who fought an octopus on its turf, bare handed, probably
counts though.

------
projectileboy
This has been true for a long time, even when the x86 architecture was much
simpler. I worked on an SDK for an 80386-based platform back in '97, and it
was unusual for me to do better than the Watcom C compiler.

------
nemasu
I don't really understand how the author is managing to contradict the
previous HN discussion.

[https://news.ycombinator.com/item?id=7301913](https://news.ycombinator.com/item?id=7301913)

~~~
chris_overseas
Well, the previous discussion was centered on the same assembly
implementation as this one. The only difference is that in the previous
discussion, the fastest C/C++ implementation was 1.1x slower than the
assembly. In this discussion, the author managed to write a C++
implementation that runs faster than the assembly version.

~~~
nemasu
Okay, but as far as I can tell he changed the algorithm (recursive C++,
pop/push to a global array, etc.)... doesn't that kind of negate the
argument, unless you change it in the assembly version as well?

~~~
chris_overseas
True enough. That reinforces a couple of other points that tend to become
apparent from these sorts of comparisons and benchmarks:

1) High level languages are easier to refactor and maintain than lower level
languages (changing the algorithm in the assembly's a big job).

2) For performance, the algorithm used is almost always more important than
the language you choose to implement it in.

Both of those points back up the author's assertion; it's a shame he didn't
specifically discuss them.

I used to hand-code assembly (or generate it, e.g. "compiled bitmaps") back in
the 8086 days. For many situations there were easy gains to be had that
couldn't be achieved so easily in high level languages. These days I wouldn't
dream of attempting it other than perhaps for vector code in an inner loop.
I'd far rather spend the time profiling and optimising the algorithms and data
structures because that's where the big gains are going to be. To go ahead and
implement something in assembly when the algorithm clearly isn't yet optimal
is perhaps fun but also kinda insane.

------
ASneakyFox
Percentages are for suckers. If you're not optimizing your code against
actual use cases then you're just wasting your time. I'd say 90 percent of
the code anyone will ever write would provide no benefit if it "ran faster".

No one will ever know or care that a 5ms operation only takes 2ms because of
how "savvy" you are. No one knows or cares what language you used. It's just
a program.

If you can produce a working program in a high-level language but choose to
use a low-level one... why?

~~~
LnxPrgr3
Depends on the context.

If your operation is relaying a 20ms audio packet from one side of a call to
another, 50µs vs. 20µs is the difference between handling 200 concurrent
calls and handling 500 per CPU: a core has 20ms per packet interval, so
20ms / 50µs = 400 one-way relays, i.e. 200 two-way calls, while at 20µs it's
1000 relays, i.e. 500 calls (minus non-linear scaling with increasing CPUs
and other overhead). Unlike the cost of hardware, which every customer would
have to buy separately, the development cost can be spread out over many
customers.

If $50,000 of developer effort can save 50 customers $1000 on hardware, you've
already broken even in a sense.

------
paulmd
This is nothing new; letting an optimizing compiler do the heavy lifting has
been the general-case recommendation for a long time now.

But there are still circumstances where it's necessary to write assembly,
particularly in things like real-time programming. Doesn't matter how good a
job the compiler did on the overall program, if you can afford 16 cycles on an
inner loop and the compiler spits out 24 then you've gotta hand-tune it.

~~~
dllthomas
I wish there was more ability to specify the paths I really care about being
low latency (even at the expense of throughput) and the like. As it is, it's
rare that I drop below C, but more common that my C winds up restructured for
what it happens to do to the output code.

------
nicholassmith
Even if you are an expert, consider whether the next person to maintain your
code will be an expert. Assume they aren't and that they know where you live.

------
aidenn0
Something the author doesn't mention is how Intel tunes its architecture to
be faster on the code output by popular compilers. This means it is likely
that, N generations of x86 from now, the C code will run even faster
relative to the hand-written code.
------
j45
I would hope that asking people to stay away from assembly doesn't inhibit
the creation of new assembly experts, who only get there by putting in the
effort.

------
stcredzero
And keep in mind that on CISC processors, the average C statement works out
to ~2.7 assembly-language operations.

------
baldfat
I MISS THE Z80!!!!

Only old folks would even understand.

------
andyidsinga
time to go dust off nasm and maybe an avr assembler.

assembly isn't just for speed... it's for fun too.

------
frozenport
Don't you need to be an expert to use ASM?

~~~
Aloha
No. C and assembly are not that far apart; using ASM just requires knowing
assembly.

------
icantthinkofone
In other news ... don't use C#, PHP or any other language for the same reason.

