
Introduction to AVX2 optimizations in x264 - DarkShikari
http://www.scribd.com/doc/137419114/Introduction-to-AVX2-optimizations-in-x264
======
jasin
A couple of comments/elaborations on the "core differences" mentioned in the
article:

The first difference mentioned is that whereas the first SSE2 implementations
were often implemented using 64-bit ALUs internally, yielding roughly the same
performance as doing two equivalent MMX ops manually, this isn't the case with
AVX2. However, it may be worth noting that it largely _is_ the case with the
current AVX ("AVX1", i.e. pre-Haswell) implementations.

The second cited difference is that there's a 128-bit "boundary" in many of
the operations. This is effectively what can dash the hopes of getting 2x
gains over SSE2 just by naïvely migrating to AVX2. For instance,
you cannot do shuffles to/from arbitrary components anymore, but have to
consider the 128-bit lane boundaries instead.

The third issue, the data layouts of internal formats and the assumptions of
various algorithms, is probably the most significant factor determining how
large a benefit you are going to get. Typically the internal data layouts
(i.e. is my pixel block size 2x2, 4x4, 16x8 or something else?) are married
to the ISA. Thus, when migrating from one instruction set to another, these
may need to be reconsidered if speed is paramount. Interestingly
enough, this means that when the ISA changes, you most likely want to do some
higher-level algorithmic optimizations as well.

------
lmm
Anyone have a non-scribd copy?

~~~
DarkShikari
There's one attached to my newsletter that goes along with the latest changes;
see
[http://mailman.videolan.org/pipermail/x264-devel/2013-April/...](http://mailman.videolan.org/pipermail/x264-devel/2013-April/010044.html).

------
Osiris
Are there binary builds available with AVX2 support compiled in for testing?
I'm curious if FMA(3/4) support available in AMD processors would increase
performance. A quick Google search shows that there are some patches available
for FMA support.

~~~
DarkShikari
I only pushed the code a few minutes ago, but binaries should probably be up
at <http://x264.nl/> relatively soonish (it's not my site though, so I
wouldn't know exactly).

If you want to test without a physical Haswell, the Intel Software Development
Emulator should work okay, albeit somewhat slowly. I'd post overall numbers
for real Haswells, but Intel has apparently said we can't do that yet.

Regarding FMA, FMA3/4 are floating point only. Since x264 has just one
floating point assembly function, only two FMA3/FMA4 instructions get used in
all of x264 (not counting duplicates from different-architecture versions of
the function). An FMA4 version has been included for a while; the new AVX2
version does include FMA3, but of course that won't run on AMD CPUs (yet).

XOP had some integer FMA instructions, but I generally didn't find them that
useful (there are a few places I found they could be slotted in, though).

~~~
jamesaguilar
I've heard that there are C libraries for things like SSE2. I assume the same
is true of AVX2. If this is so, why do you write so much of x264 in assembly?
Do you find that there are significant gains versus C code that uses SIMD
libraries? Have I been misled that C is nearly as fast as assembly 99% of the
time?

Note: I'm not trying to question your engineering chops, just trying to
correct my own misconceptions.

~~~
DarkShikari
"C libraries for things like SSE2"? Do you mean math libraries that have SIMD
implementations of various functions that are callable from C? This here is
effectively _writing those libraries_ ; they don't exist until we write the
code.

~~~
jamesaguilar
I'm talking about something like this:
<http://sseplus.sourceforge.net/fntable.html>

I'm not a SIMD expert, but it seems like this implements similar primitives
to those that are available to assembly (and not C). My question is basically
whether the algorithms you're talking about could be implemented with these
primitives. Although I guess no such library yet exists for AVX2.

~~~
DarkShikari
Intrinsics aren't really C; they work in a C-like syntax, but you're still
doing the exact same thing as assembly: you still have to write out every
instruction you want to use, so you're not really saving any effort compared
to just skipping the middleman.

In return, you are stuck with an extremely ugly syntax and a much less
functional preprocessor, with the added bonus of a compiler that mangles your
code.

~~~
jedbrown
With intrinsics, you don't have to think about register naming. You still
might count registers to avoid spills (and check the assembly to make sure),
but there is less of a mental context switch than writing straight assembly.

~~~
DarkShikari
I almost never spend more than a few seconds considering register
allocation/naming when writing assembly (part of this is because x264's
abstraction layer lets macros swap their arguments, so you don't have to track
"what happens to be in xmm0 right now" mentally). In some rare cases it can
get tricky when you start pushing up against the register cap, but that's
_exactly the case where the compiler tends to do terribly_ , and you'd want to
do it yourself.

The pain of not having a proper macro assembler in C intrinsics is orders of
magnitude worse than having to do my own register allocation in yasm, so for
now, yasm is the lesser of two evils.

~~~
nitrogen
Is there any hope of a compiler ever coming close to the level of optimization
you can get from hand-coded assembly language? The numbers in your table
routinely exceeded 10x gains over straight C. What's the compiler doing that's
taking so long? Is it not able to vectorize at all?

~~~
DarkShikari
x264 actually turns off vectorization in the configure script, because it's
caused crashes and bugs in the past on various platforms. But even if you
enable it, it almost never triggers. Even the Intel compiler's
autovectorization only triggers in a few functions, despite its reputation,
and typically does a pretty mediocre job.

The problem has many parts:

1\. Autovectorization in general is just extremely difficult and even trivial
code segments often get compiled very badly. It feels like the compiler is
trying to fit the code into a few autovectorization special cases -- for
example, a 16x16->16 multiply gets compiled into a 16x16->32 multiply, and
then it laboriously extracts the 16 bits, probably because nobody wrote code
to explicitly handle the former variant. A good autovectorizer would have to
have a vast array of these sorts of things to "know what to do" in a
particular case, I'd imagine.

A lot of autovectorization resources seem to be tuned towards floating point
math (which typically doesn't need the same sort of tricks), which probably
exacerbates the problem in x264's case.

2\. The compiler doesn't know enough. It can't guarantee alignment, it doesn't
know about the possible ranges of the input values or the relationship among
them, it doesn't know the things the programmer knows.

3\. SIMD algorithms are often wildly different from the original C. Much of
the process of writing assembly is figuring out how to restructure, reorder,
and transform algorithms to be more suitable for SIMD. The compiler can't
realistically do this; its job is to translate your C operations into
machine operations, not rewrite your algorithm.

Part of this problem is that C is just not a great vector math language, but
part of it is also that the optimal algorithm structure will depend on the
capabilities of your SIMD instructions and their performance. For example,
when the _pmaddubsw_ instruction is available, it's faster to do x264's SATD
using a horizontal transform first, but if not, it's faster to do it with a
vertical transform first. The Atom CPU has pmaddubsw, but only has a 64-bit
multiplier, making it too slow to utilize the horizontal version (so it gets
the vertical version instead).

You can definitely finagle code into getting picked up by the autovectorizer,
especially with Intel's compiler, but it takes a lot of futzing to make it
happy, and even when it is happy, it can be many times slower than proper
assembly. Of course, it's not useless -- it can get you some relatively free
performance improvements without writing actual SIMD code. But it's not a
replacement.

------
zobzu
Nice gains. Thanks for the writeup and explanations!

