
Trivial Artificial Neural Network in Assembly Language - fogus
http://syprog.blogspot.com/2012/03/trivial-artificial-neural-network-in.html
======
dave_sullivan
That's really cool, and a good breakdown of what's actually going on inside an
ANN. But I don't think using assembly will help with speed. Wouldn't you still
be better off using matrix multiplication on a GPU, written with something
like Theano in Python? Then again, maybe you just used assembly to explain
things at a very low level rather than for any speed boost. Either way, very
cool article.
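
For anyone following along, the matrix-multiplication framing can be sketched
in plain Python. This is only an illustration, not the article's code: the
names `matvec` and `layer_forward`, the sigmoid activation, and the weights
are all assumptions made up for this example. A GPU library would run the same
multiply in bulk.

```python
import math

def sigmoid(x):
    # Logistic activation; assumed here, the article may use something else.
    return 1.0 / (1.0 + math.exp(-x))

def matvec(weights, inputs):
    # One row of weights per neuron; each output is a dot product.
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def layer_forward(weights, biases, inputs):
    # Forward pass of one layer: activation(W * x + b).
    pre = matvec(weights, inputs)
    return [sigmoid(p + b) for p, b in zip(pre, biases)]

# Hypothetical 2-input, 2-neuron layer; the values are arbitrary.
W = [[0.5, -0.5],
     [1.0, 1.0]]
b = [0.0, -1.0]
out = layer_forward(W, b, [1.0, 1.0])
```

The whole layer reduces to one matrix-vector product plus an elementwise
function, which is exactly the shape of work that SIMD units and GPUs are
built for.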

~~~
pbsd
In terms of performance, it's actively worse than a compiler's output. Using
the x87 FPU makes no sense in this day and age, considering this is Intel 64
assembly, where SSE2 is always available.

GPUs (and plain CPU SIMD) would help very much, yes. Any serious NN
implementation will be using them.

~~~
pslam
It's not only worse, it's poorly scheduled. There are many cases where
hand-coded assembly will do far better than a compiler (often by an order of
magnitude). This isn't one of them.

If you write straight-line asm code where every output is immediately consumed
in the next 2-3 instructions, you'll find it runs at exactly the same speed as
the compiler's output. Probably worse, because the compiler would at least
attempt to schedule around those dependencies. Scheduling is where a human can
do better than the compiler, but this code isn't an example of it.
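
To make the dependency-chain point concrete, here is a rough Python sketch.
Python will not show any speed difference; it only illustrates the dataflow.
The function names are made up, and splitting the sum across four accumulators
is a common remedy assumed here, not something taken from the article's code.

```python
def dot_single_chain(a, b):
    # Every add needs the previous value of acc, so the adds form one
    # long serial chain.
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_four_chains(a, b):
    # Four accumulators form four independent chains, so in real asm
    # one chain's add can overlap with the latency of another's.
    acc = [0.0, 0.0, 0.0, 0.0]
    n = len(a) - len(a) % 4
    for i in range(0, n, 4):
        acc[0] += a[i] * b[i]
        acc[1] += a[i + 1] * b[i + 1]
        acc[2] += a[i + 2] * b[i + 2]
        acc[3] += a[i + 3] * b[i + 3]
    # Handle the leftover elements when the length is not a multiple of 4.
    tail = sum(a[i] * b[i] for i in range(n, len(a)))
    return (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
```

Note that reordering floating-point adds can change the rounding of the
result, which is one reason a compiler will not do this on its own without
relaxed floating-point flags.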

~~~
pbsd
Well, I give scheduling a free pass because modern processors will reorder and
rename the whole thing anyway, provided the instructions are not too badly
chosen.

Beating the compiler usually has more to do with having more information. Say,
"this register will never have integers greater than X", or "the carry flag
will be set to Y here", or whatever. Even compiled intrinsic SIMD beats most
hand-coded kernels these days, unless you're doing something very exotic
(e.g., mixing integer and floating-point logic to save cycles).

Hand-written assembly is something that will probably go away nearly
completely (can't really beat optimal schedulers, superoptimizers), and I say
this as someone who's spent many years optimizing assembly.

~~~
DarkShikari
_Even compiled intrinsic SIMD beats most hand-coded kernels these days_

Intrinsics basically are hand-coded assembly: you're specifying every single
instruction manually, just in an obfuscated syntax, and without a proper
assembler preprocessor.

At that point, it's way easier to just write it straight by hand.

Also, there are still plenty of things compilers are terrible at with
intrinsics. This includes things like:

1) Register allocation. Compilers, even "good" ones, will throw things on the
stack willy-nilly. The biggest problem here is that the compiler hides
register utilization from the programmer, preventing him or her from _seeing_
how many registers are used. If you're aware of register usage, you can tweak
the algorithm to avoid the problem.

2) Compilers are truly terrible at making up calling conventions for calls
within assembly (e.g. to a subfunction). Humans can make whatever calling
convention they like, which can give great performance improvements in
functions like FFTs (see: libav split radix fft) that are inherently
recursive. In the case of the aforementioned FFT, Loren was able to make a
handwritten-asm AltiVec version around 40% faster than intrinsics, IIRC.

3) Compilers are not very good at organizing code; in many cases, it is useful
to split up code and reuse blocks to save on code cache (see: x264 trellis
asm). This avoids inlining it and creating a massive mess -- which, by the
way, trashes the compiler's register allocator too. When using the compiler to
do this, you run into the problem of 2); it can't allocate registers
efficiently between code blocks, or come up with a decent calling convention,
so it's inefficient.

4) There are some things that no compiler in existence is currently able to
do, like computed jumps without tables to a templated set of functions (see:
x264 cacheline SAD asm).

~~~
pbsd
I could swear I had read a similar rant of yours somewhere before, but could
not find it.

Anyway, I do not disagree with you. What I was trying to convey was not so
much the quality of current intrinsic code, but the progress that has been
made in the last ~5 years in this respect. I remember writing some intrinsic
code that became a series of useless stores and loads all over the place. When
that is your baseline, current tech is impressive, modulo the issues you've
quite correctly pointed out.

------
kruhft
I don't think I've ever had more fun or felt more in control than when I was
programming professionally in assembly language. The more I use high-level
languages, the more I just want to step back, learn x86-64 assembly, and put
scoping, default allocation, and language lawyers behind me. Of course,
there's nothing stopping me except myself, really...

------
xcallmejudasx
How long did it take you to put this together and how much of that time was
coding vs researching?

