

The hidden cycle-eating demon: L1 cache misses - DarkShikari
http://x264dev.multimedia.cx/?p=149

======
akkartik
My dissertation was partly about prefetching to L1.
<http://akkartik.name/akkartik-phd07.pdf>

It required changes to processor microarchitecture, though.

------
yan
I have posted this before, but if you care at all about memory and how it
applies to writing efficient code, you owe it to yourself to check out this
fantastic paper: <http://people.redhat.com/drepper/cpumemory.pdf>

------
chipsy
When working in a higher-level language, it's easy to forget that a lot of the
performance disadvantage comes from using data types that are distanced from
the hardware and require unboxing, RTTI, etc. Not only do those conveniences
eat up CPU, they also eat up memory, which compounds the problem by forcing
the data to live outside the cache, regardless of what optimizations the
language implementation has.

Hence, I really appreciate it when a language offers some way to drop down to
byte-level constructs and build space-efficient implementations where they're
necessary. Improvements in memory compactness can often save you the effort of
dropping all the way down to C.

~~~
barrkel
Indeed. I recently had a situation where I was building a trie for prefix
search, and found that object overhead / pointer cost was so expensive (a
full 8 bytes per instance on 32-bit, which really adds up, and it's worse on
x64) that I dropped down to using parallel integer arrays to store all the
data. Even then, I was able to squeeze out more space by moving to a custom
bit-packed array class.

------
QE2
While I found this an interesting read, I believe someone who works with *264
encoding with any regularity can easily justify using the relatively cheap
hardware encoders available to consumers. I'm not sure if x264 is fully
compatible with such a device, though. This is a lot more feasible than
waiting for Intel and AMD to significantly change their architecture for what
sounds like an edge case.

~~~
DarkShikari
Hardware encoders are not feasible for ordinary consumers at all; if you want
an encoder that produces reasonable compression, you have to go up to the
$10k-$50k range (and even there, most of the ones on the market are really not
very good!). Low-end hardware encoders are both often slow (usually
outperformed by x264 on a cheap quad-core CPU) and extremely bad at
compression.

Plus, a CPU can be used for things other than encoding video, while a hardware
device is of course useless for anything else, so it's easier to justify
spending money on a fast CPU than on a task-specific piece of hardware.

Also, it isn't really an edge case; it will occur in _any_ application which
has a working set that is unavoidably larger than the L1 cache. Video encoders
are just one of many cases where this occurs.

 _(Note: added that last point to the blog post after I posted this.)_

~~~
QE2
Thanks for the follow-up. After re-reading the post, I see that this is not
/just/ an x264 problem.

Thank you also for informing me that hardware encoders aren't very good. I had
actually considered purchasing one of the ~$100 ones, but now I'll steer
clear.

I must be doing something wrong, though, because I can't seem to get much
better than 2x real-time on my Q6600 w/ 4GB memory when using x264.

~~~
DarkShikari
The meaning of 2x real-time, of course, depends entirely on the resolution.

x264 now has encoding presets you can use to trade off speed for compression:
they go from "ultrafast" to "placebo" (full list in the help). Grab the latest
from x264.nl; we've also had a lot of speed improvements lately ;).

Do note that at high speeds, it's easy to get bottlenecked: the most common
case is the decoding of the input, which is often only single-threaded. If
your input is uncompressed, then you can get bottlenecked by reading it off
the disk. And if you're running filters on the input (e.g. resizing it), that
can also serve as a bottleneck.

