

x264's abstraction layer for cross-platform x86 assembly: now free to use - DarkShikari
http://x264dev.multimedia.cx/?p=191

======
jacquesm
That's really neat, Jason. With all the 'assembly is dead' talk that gets
bandied around, it's nice to see people keeping the art alive and collaborating
at this level.

~~~
DarkShikari
Assembly has never been dead in the realm of multimedia processing, or for
that matter, anywhere that you have to do an enormous amount of SIMD-able
number crunching.

Find a video (en|de)coder without SIMD and you've found an unusable piece of
software.

There's a reason that Intel focuses so much on SIMD instructions and
arithmetic units with each new chip release: if not for SIMD or dedicated
hardware, it would be physically impossible to decode 1080p video on almost
any modern machine.
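
For anyone wondering what "SIMD-able number crunching" looks like in practice, here is a rough sketch of a sum of absolute differences (SAD) over a 16-pixel-wide block, a kernel commonly used in motion estimation. This is an illustration only, not code from x264 or ffmpeg; the names `sad16` and `sad16_scalar` are made up for the example, and the SSE2 path uses compiler intrinsics rather than hand-written assembly. It falls back to the scalar loop when `__SSE2__` is not defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Plain C reference: SAD over a block 16 pixels wide and `rows` high. */
static unsigned sad16_scalar(const uint8_t *a, ptrdiff_t stride_a,
                             const uint8_t *b, ptrdiff_t stride_b, int rows)
{
    unsigned sum = 0;
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < 16; x++) {
            int d = a[y * stride_a + x] - b[y * stride_b + x];
            sum += (unsigned)(d < 0 ? -d : d);
        }
    }
    return sum;
}

/* Same result, but one instruction handles a whole 16-pixel row on SSE2. */
static unsigned sad16(const uint8_t *a, ptrdiff_t stride_a,
                      const uint8_t *b, ptrdiff_t stride_b, int rows)
{
    assert(rows > 0);
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < rows; y++) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + y * stride_a));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + y * stride_b));
        /* PSADBW: two partial sums, one per 64-bit half of the register. */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(va, vb));
    }
    /* Fold the upper half onto the lower half and read the low 32 bits. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    return (unsigned)_mm_cvtsi128_si32(acc);
#else
    return sad16_scalar(a, stride_a, b, stride_b, rows);
#endif
}
```

The scalar version does 16 subtract/abs/add steps per row; the SSE2 version does the same work with one `_mm_sad_epu8` per row, which is the kind of gap being described here.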

~~~
jrockway
_Find a video (en|de)coder without SIMD and you've found an unusable piece of
software._

Also worth noting that these video decoders tend to crash hard on invalid
input. Tradeoffs.

~~~
DarkShikari
Not a tradeoff; that's just bad input checking. Most of ffmpeg's decoders have
heavy checks in them--crashes are usually bugs (missing checks). Yes, the
checks have a small speed cost, but it's at most 1-2%, probably less.
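
As a rough idea of what such a check looks like, here is a minimal sketch of a bounds-checked bit reader. It is not ffmpeg's actual code; `BitReader` and `read_bits` are hypothetical names for this example. The point is that the check is a single comparison per read, and a truncated stream becomes an error return instead of an out-of-bounds access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit reader over an in-memory buffer. */
typedef struct {
    const uint8_t *buf;   /* input bytes                          */
    size_t size_bits;     /* total number of valid bits in buf    */
    size_t pos_bits;      /* current read position, in bits       */
} BitReader;

/* Read n bits (1..32), most significant bit first.
 * Returns 0 on success, -1 if the input is too short.
 * On failure the read position is left unchanged. */
static int read_bits(BitReader *r, unsigned n, uint32_t *out)
{
    assert(n >= 1 && n <= 32);            /* caller bug, not bad input */
    if (n > r->size_bits - r->pos_bits)   /* the actual input check    */
        return -1;

    uint32_t v = 0;
    for (unsigned i = 0; i < n; i++) {
        size_t p = r->pos_bits++;
        v = (v << 1) | ((r->buf[p >> 3] >> (7 - (p & 7))) & 1u);
    }
    *out = v;
    return 0;
}
```

A decoder built on something like this would propagate the `-1` upward and drop the damaged frame. A crash usually means one such comparison was forgotten somewhere.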

------
jwr
This is good stuff. It will make writing assembly on Intel at least slightly
less painful.

I am still amazed we don't have an x86 assembler that is anywhere near what
DSP people use. I mean, we still have to track register usage manually and do
manual instruction reordering to improve performance! The x86 world could
learn a lot from the DSP world here (Texas Instruments tools are a good
example).

~~~
jrockway
_I am still amazed we don't have an x86 assembler that is anywhere near what
DSP people use._

I believe these are called "compilers".

~~~
jrockway
Seriously, you're downmodding this? Write some C sometime, and watch the
compiler output perfectly-optimized assembly for your architecture. You write
a high-level solution to the problem, the compiler makes it work efficiently.

If it doesn't, it's a compiler bug, and should be fixed at that level.

~~~
ramchip
I program C signal processing code on x86. Sure, it's usually sufficient for
my needs, but "perfectly-optimized assembly"? Not for those tight loops where
you really want it. It's decent and if you hold the compiler's hand will get
vectorized somewhat, but the code is still heavy.
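
One example of the hand-holding I mean, as a sketch: a simple loop with `restrict`-qualified pointers (C99), which tells the compiler the two arrays don't overlap. The function name `scale_add` is made up for this example. Whether a given compiler actually vectorizes it depends on the compiler, version, and optimization flags, so treat this as an illustration of the idea rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += a * x[i] for i in [0, n).
 * `restrict` promises the compiler that x and y do not alias, which
 * removes one common obstacle to auto-vectorizing the loop. */
static void scale_add(float *restrict y, const float *restrict x,
                      float a, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Without `restrict`, the compiler has to assume a store to `y[i]` might change a later `x[j]`, so it either adds a runtime overlap test or keeps the loop scalar. Even with it, the generated code is often not what you'd write by hand for a hot loop.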

