
Doubling the speed of jpegtran with SIMD - jgrahamc
https://blog.cloudflare.com/doubling-the-speed-of-jpegtran/
======
pierrec
I sort of expect a top-notch compiler to perform this automatically
(especially with profile-guided optimization). The MSVC++ compiler has pretty
impressive auto-vectorization, as I've verified several times by testing the
performance of my DSP code with SIMD enabled and disabled. Just make sure your
code facilitates auto-vectorization; it should then be more readable and
future-proof while achieving the same result, assuming the compiler is smart
enough.
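
For instance (my own sketch, not from the article or this comment), a loop
with restrict-qualified pointers and a simple countable trip count gives the
auto-vectorizer what it needs:

    #include <stddef.h>
    
    /* "restrict" promises the arrays don't alias, and the flat
       i = 0..n loop has no carried dependency -- two preconditions
       most auto-vectorizers check before transforming a loop. */
    void scale_add(float *restrict dst, const float *restrict a,
                   const float *restrict b, float k, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + k * b[i];
    }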

(Also, since this is CloudFlare, <insert rant and dream about SIMD happening
in LuaJIT>... Thanks CloudFlare!)

~~~
legulere
The problem is that the compiler often can't automatically vectorize because
some preconditions are not met. Knowledge of those preconditions is mostly
limited to compiler people, and as far as I know there's no static analysis
tool that tells you why the compiler couldn't vectorize a loop.

~~~
leecb
Clang can tell you why a loop wasn't vectorized with a couple of compiler
flags:

-Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize

[http://llvm.org/docs/Vectorizers.html#diagnostics](http://llvm.org/docs/Vectorizers.html#diagnostics)
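
As a contrived example (mine, not leecb's), compiling this with those flags
makes Clang report the loop-carried dependency that blocks vectorization:

    /* clang -O3 -c prefix.c \
         -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize */
    
    /* Each iteration reads the element written by the previous one,
       so the vectorizer refuses this loop and the remark says why. */
    void prefix_sum(float *a, int n) {
        for (int i = 1; i < n; i++)
            a[i] += a[i - 1];
    }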

------
guardian5x
I wonder if they've considered using libjpeg-turbo instead of libjpeg9a; it
comes with a lot of SIMD optimization and is generally regarded as 2x-5x as
fast as IJG's implementation.[1]

[1] [http://www.libjpeg-turbo.org/About/Performance](http://www.libjpeg-turbo.org/About/Performance)

~~~
listic
My thought as well.

As far as I understand, these intrinsics map to SSE instructions (they are
128-bit). One could use the later AVX (256-bit, found in server chips since
2011). Probably they decided to start with the lowest common denominator of
SSE, because they don't have AVX in _all_ of their servers? The term _SIMD_ is
generic.
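
To make the register widths concrete, here is a minimal sketch of the two
flavors (my illustration; note that 256-bit _integer_ operations actually
require AVX2, not the original 2011 AVX, which only covered floats):

    #include <emmintrin.h>   /* SSE2: 128-bit XMM registers */
    #include <immintrin.h>   /* AVX2: 256-bit YMM registers */
    
    /* Adds 8 16-bit coefficients at a time with SSE2... */
    void add8_sse2(const short *a, const short *b, short *out) {
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
    }
    
    /* ...and 16 at a time with AVX2. */
    void add16_avx2(const short *a, const short *b, short *out) {
        __m256i va = _mm256_loadu_si256((const __m256i *)a);
        __m256i vb = _mm256_loadu_si256((const __m256i *)b);
        _mm256_storeu_si256((__m256i *)out, _mm256_add_epi16(va, vb));
    }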

~~~
unwind
That is explicitly mentioned in the article:

 _Note: we could also choose to use 256-bit wide YMM registers here, but that
would only work on the newest CPUs, while gaining little performance._

I'm not saying they're right and you're wrong, but you make it sound as if
they never even considered AVX.

~~~
liotier
> Note: we could also choose to use 256-bit wide YMM registers here, but that
> would only work on the newest CPUs, while gaining little performance.

Should that be an #ifdef?

~~~
repsilat
Maybe if the machine running the code is the machine that compiles it, or if
you have a good way to send processor-specific binaries to the right places.

More useful, I think, would be a runtime check far enough up the stack (i.e.,
away from tight loops) that it doesn't affect performance. It probably doesn't
even need to be that high: the check will branch-predict correctly every time
but the first, and the binary bloat probably won't hurt your icache, because
the number of instructions you actually execute stays more or less the same.
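
A minimal sketch of that pattern using GCC/Clang's __builtin_cpu_supports (my
illustration, not CloudFlare's code; the kernel names are hypothetical
stand-ins):

    #include <stdio.h>
    
    /* Hypothetical stand-ins for the real SSE2/AVX2 kernels. */
    static void fdct_sse2(short *block) { (void)block; puts("SSE2 path"); }
    static void fdct_avx2(short *block) { (void)block; puts("AVX2 path"); }
    
    /* Function pointer resolved once, well away from the tight loops. */
    static void (*fdct)(short *block);
    
    int main(void) {
        __builtin_cpu_init();   /* initialize CPUID feature detection */
        fdct = __builtin_cpu_supports("avx2") ? fdct_avx2 : fdct_sse2;
    
        short block[64] = {0};
        fdct(block);   /* later calls branch-predict perfectly */
        return 0;
    }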

------
dahart
Very nice! An independent but very useful way to speed up image batches, as
long as you're working in batches and not one at a time:

    
    
      parallel jpegtran ::: *.jpg
    

Combine it with these SIMD improvements, and your batch will be done _before_
you hit enter.

~~~
sloppycee
[http://www.gnu.org/software/parallel/](http://www.gnu.org/software/parallel/)

------
karim79
While this is a great project from both an academic and performance
perspective, the resulting images are substantially larger than ones produced
using Mozjpeg. From the article:

 _We at CloudFlare make sure that our servers run at top notch performance, so
our customers' websites do as well!_

Correct me if I'm wrong, but it does seem to me that this is of greater value
to CloudFlare in terms of electricity savings than to CloudFlare's customers.
Less computing time for CloudFlare at the expense of larger output files for
CloudFlare's customers to serve, files which will potentially be served again
and again and count against the customer's transfer quota. It might not seem
like much, but it adds up when you consider how many times a single image in
the wild can be downloaded. Consider the following example:

    
    
      # Original size is 21796912 bytes
      # jpegtran with SIMD
      jpegtran -copy none -optimize -progressive bigjpeg.jpg > bigjpeg_jpegtran.jpg
      # bigjpeg_jpegtran.jpg resulting size is 19771224 bytes
    
      # mozjpeg 3.1
      mozjpeg -copy none -optimize -progressive bigjpeg.jpg > bigjpeg_mozjpeg.jpg
      # bigjpeg_mozjpeg.jpg size is 19328832 bytes
    

That's a difference of 442KB! I'm not emphasizing the computing time here. I
do not dispute that this is orders of magnitude faster than Mozjpeg, but isn't
it worth doing the extra work to get a smaller file? That file could be served
billions of times; at a billion downloads, 442KB per copy works out to roughly
442TB of extra transfer. My argument is simply that the extra computing effort
at the beginning leads to much grander savings in the long run.

In my (quick) testing, jpegtran's _squishing_ performance is consistently
worse than Mozjpeg's. So what, actually, is the _right_ thing to do in the
end?

~~~
0x07c0
If they get larger file sizes then that probably means they are doing
something wrong. Optimized a mpg encoder with SSE once. And larger file size
was usually do to some lose of floating point precision. Going thorough the
functions and comparing output from serial to vector function should give the
bug. Fixing this stuff also can give some more speed up!
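
A minimal sketch of that approach (mine; the transform signatures are
hypothetical), feeding the same block through both paths and reporting the
first mismatch:

    #include <stdio.h>
    
    typedef void (*transform_fn)(const short in[64], short out[64]);
    
    /* Runs a scalar reference and its SIMD rewrite on one 8x8 block
       and prints the first coefficient where they disagree. */
    int check_block(transform_fn scalar, transform_fn simd,
                    const short in[64]) {
        short ref[64], vec[64];
        scalar(in, ref);
        simd(in, vec);
        for (int i = 0; i < 64; i++) {
            if (ref[i] != vec[i]) {
                printf("mismatch at %d: scalar=%d simd=%d\n",
                       i, ref[i], vec[i]);
                return 1;
            }
        }
        return 0;
    }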

------
chatman
They should've included graphs in their article, e.g. histograms comparing
speedups at each compression level. Even a table of their results would have
looked cool.

