
NEON is the new black: fast JPEG optimization on ARM servers - jgrahamc
https://blog.cloudflare.com/neon-is-the-new-black/
======
0x0
This made me think of the story about SnappyCam, whose author handcrafted a
JPEG encoder in Arm NEON assembly to capture still images at 20fps on the
iPhone 5 (before burst mode was a thing in iOS), and was subsequently
acqui-hired by Apple:

[https://www.theguardian.com/technology/2014/jan/06/apple-
buy...](https://www.theguardian.com/technology/2014/jan/06/apple-buys-iphone-
camera-app-snappycam-and-gets-australian-algorithms-whiz)

[https://newatlas.com/snappycam-iphone-
app/28558/](https://newatlas.com/snappycam-iphone-app/28558/)

------
cameldrv
This is a really good article. Couple of comments:

* Intel is a dead man walking. They've been coasting on superior process technology for years, and now they're being attacked on two sides in the datacenter. On one side are ARM many-core chips like this one, which have 50% of the performance per core but 4x as many cores for about the same price and TDP. On the other side, you've got NVIDIA. The Volta has something like 9X the memory bandwidth of typical Intel server CPUs. This is starting to get used in databases, and of course they already own deep learning.

* libjpeg-turbo is a great choice if you need fast CPU JPEG encoding/decoding. It's a very near drop-in replacement for libjpeg, and it runs about 2x faster, mostly due to AVX/NEON optimizations.

~~~
walrus01
Disagree that Intel is a dead man walking. If they see a serious threat in the
number of 1U size ARM servers actually going into production at scale, which
is not happening yet, they have the ability to make 50% price cuts on
16/24/32-core Xeons to make them competitive again, with no new R&D money.

You cannot yet buy small to medium sized quantities of ARM motherboards from
any of the top ten Taiwanese motherboard manufacturers. Intel and AMD have
the advantage of massive economies of scale.

If you look at the global server market for individual bare metal hypervisor
platforms priced from $900 to $11000 each, x86-64 is literally like 99% of
motherboards shipped. The other 1% is probably spread between ARM, POWER and
other even more esoteric platforms.

I would dearly like to be proven wrong. If anyone is aware of a $400 ARM board
I can buy right now from MSI, ASUS, Gigabyte, Supermicro, Tyan, Quanta, or
others, please post a link here.

~~~
kogepathic
_> I would dearly like to be proven wrong. If anyone is aware of a $400 arm
board I can buy right now_

Not from the manufacturers you mentioned, but it is possible to buy an ARM
board in a standard form factor that can be used in high performance
applications. [1]

[1] [http://macchiatobin.net/](http://macchiatobin.net/)

~~~
kikoreis
Good luck getting a standard Linux kernel to run on those Armadas though.

~~~
walrus01
I looked briefly and found no mention of being able to run a Debian- or
CentOS-derived distribution on them. Chicken-and-egg problem.

------
vladdanilov
There's a huge problem with these improvements. They rarely make it to the
original project.

Luckily, the SIMD acceleration for progressive Huffman encoding was a rare
exception [1]. So, along with other improvements, it made this fork [2] rather
pointless. Adding the NEON optimizations would have been easier with a proper
rebase, and that would require less work for both to keep things up to date.

[1] [https://github.com/libjpeg-turbo/libjpeg-
turbo/pull/46](https://github.com/libjpeg-turbo/libjpeg-turbo/pull/46)

[2]
[https://github.com/cloudflare/jpegtran](https://github.com/cloudflare/jpegtran)

~~~
jgrahamc
We push everything we can upstream. Sometimes maintainers don't want our
changes. We expect to push a _lot_ of ARM changes to a lot of different
software in the coming months.

~~~
tbarkley
Any chance the other optimizations will have a blog post similar to this?

~~~
jgrahamc
Very likely.

~~~
tbarkley
Thank you!

------
bigtones
I'm really impressed with the Cloudflare Engineering team, their collective
knowledge, and their ability to roll up their sleeves and optimize the hell
out of the products and services they run. From looking in, it seems almost
everything they run has been source code customized to their requirements,
from OpenSSL, to JPEG compression, to their HTTP/2 implementation.

~~~
jgrahamc
We switched away from OpenSSL to BoringSSL (in part to get rid of our own
modifications): [https://blog.cloudflare.com/make-ssl-boring-
again/](https://blog.cloudflare.com/make-ssl-boring-again/)

------
NKCSS
This is the kind of in-depth stuff that I love to read; the Cloudflare blog
has been a source of good articles for a few years now. Might be time to just
add it to the list of things to check on a regular basis instead of waiting
for some of them to pop up on HN :)

------
geokon
It's always kinda humbling to see how much performance is still left on the
table in some base/core libraries everyone uses. It'd be nice to have seen
what the compiler output is in comparison - and explore if there is a way to
reorganize the code to "coerce" the compiler to output something similar.

The few times I've had to do similar things, I usually worked by tweaking the
code to give me the desired compiler output, and I really try to avoid inline
assembly. This usually helps me learn some limitations of C/C++ and what the
compiler can deduce.

~~~
praulv
godbolt?

edit: curious as to how you perform this optimisation.

~~~
geokon
It's been a few years since I've had to do it - but back then I just used the
Visual Studio disassembly. You can get something similar (but more messy) with
objdump.

Godbolt seems like a good solution nowadays, but you still need to run
benchmarks.

------
ec109685
I wonder what types of controls they have over the data they cache. While
caches are by nature "public", given suitable naming, unless you have a link,
they are effectively private.

> To understand the impact on overall performance I ran jpegtran over a set of
> 34,159 actual images from one of our caches.

~~~
jgrahamc
That's a great question. Answer: a lot of control.

In order to run the cache experiment Vlad had to file an access control ticket
which I had to approve. He was given access to cached images from two
dogfooding/guinea pig machines and was able to run his experiment on those
machines. They are not machines through which we run 'normal' production
traffic.

------
alfanick
Can we get asm/bytecode instead of C intrinsics? It's easier to reason about,
without spawning an actual compiler.

~~~
thecompilr
I ended up writing asm for the important part:

[https://github.com/cloudflare/jpegtran/blob/vlad/arm/jchuff_...](https://github.com/cloudflare/jpegtran/blob/vlad/arm/jchuff_util_armv8.S)

------
gok
> NEON is the ARMv8 version of SIMD

NEON was actually introduced in ARMv7.

~~~
booblik
How does that make the statement incorrect?

~~~
brigade
Technically, ARMv8 has no such thing as "NEON"; it was renamed Advanced SIMD.
Only in ARMv7 is it officially called NEON.

But everyone sensible ignores a lot of the stupid names ARM uses in ARMv8
(AArch64/AArch32, A64/A32/T32, etc.)

~~~
thecompilr
Yes and no.
[http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CJHECGIH.html)

> The ARM Advanced SIMD architecture, its associated implementations, and
> supporting software, are commonly referred to as NEON technology. There are
> NEON instruction sets for both AArch32 (equivalent to the ARMv7 NEON
> instructions) and for AArch64. Both can be used to significantly accelerate
> repetitive operations on large data sets. This can be useful in applications
> such as media codecs.

The NEON architecture for AArch64 uses 32 × 128-bit registers, twice as many as
for ARMv7. These are the same registers used by the floating-point
instructions. All compiled code and subroutines conform to the EABI, which
specifies which registers can be corrupted and which registers must be
preserved within a particular subroutine. The compiler is free to use any
NEON/VFP registers for floating-point values or NEON data at any point in the
code.
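
To make the register numbers above concrete, here is a small back-of-the-envelope
sketch (mine, not from the ARM documentation) of how many elements fit in one
128-bit register and how big the register file is. The function names are made
up for illustration; the only inputs are the 32 × 128-bit figure quoted above
and the "twice as many as ARMv7" remark, which implies 16 registers there.

```python
# Lane arithmetic for the Advanced SIMD (NEON) register file.
# The constants come from the quoted text; the helper names are hypothetical.
REGISTER_BITS = 128   # width of one vector register
NUM_REGISTERS = 32    # AArch64 register count ("twice as many as for ARMv7")


def lanes_per_register(element_bits: int) -> int:
    """Number of elements of the given width that fit in one 128-bit register."""
    if element_bits not in (8, 16, 32, 64):
        raise ValueError("element width must be 8, 16, 32 or 64 bits")
    return REGISTER_BITS // element_bits


def register_file_bytes(num_registers: int = NUM_REGISTERS) -> int:
    """Total size of the vector register file in bytes."""
    return num_registers * REGISTER_BITS // 8


def registers_for_block(num_elements: int, element_bits: int) -> int:
    """Registers needed to hold num_elements values, rounding up."""
    lanes = lanes_per_register(element_bits)
    return -(-num_elements // lanes)  # ceiling division


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        print(f"{width:2d}-bit elements: {lanes_per_register(width):2d} lanes")
    print("AArch64 register file:", register_file_bytes(), "bytes")
    print("ARMv7 register file:  ", register_file_bytes(16), "bytes")
    # An 8x8 block of 16-bit coefficients, the unit a JPEG encoder works on:
    print("registers per 8x8 block:", registers_for_block(64, 16))
```

With 16-bit coefficients, one 8x8 block takes 8 registers, so on AArch64 a
whole block fits in a quarter of the register file.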

------
herf
In case anyone else was confused by a 200KB file taking so long, the original
file is 9933x7016 (7.1MB) and actually _does_ take that long.

I'm not going to link it here because the site already seems slow, but it is
easy to find if you want to do your own benchmark.

~~~
t0rakka
Found the file. 230 ms decode and 96 ms encode on i9 7900x. :)

------
t0rakka
Timings with the original test image (
[https://www.eso.org/public/archives/print_posters/large/prin...](https://www.eso.org/public/archives/print_posters/large/print_poster_0025.jpg)
)

9933 x 7016 (7.4 MB)

    
    
        load turbojpeg: 309 ms
        load mango:     225 ms
        save turbojpeg: 388 ms
        save mango:      96 ms
    

Started with 800 ms save time earlier today but got motivated to finally
optimize the encoder; thanks! :)
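
For anyone who wants those numbers as throughput rather than wall time, a
quick sketch of the arithmetic. It only uses the image dimensions and the four
timings posted above; the helper names are made up for illustration, and it
says nothing about how either library was configured.

```python
# Convert the timings above into megapixels per second and speedup ratios.
# Inputs are the figures quoted in this thread; helper names are hypothetical.
WIDTH, HEIGHT = 9933, 7016

TIMINGS_MS = {
    ("load", "turbojpeg"): 309,
    ("load", "mango"): 225,
    ("save", "turbojpeg"): 388,
    ("save", "mango"): 96,
}


def megapixels(width: int, height: int) -> float:
    """Image size in millions of pixels."""
    return width * height / 1e6


def throughput_mpx_per_s(width: int, height: int, elapsed_ms: float) -> float:
    """Pixels processed per second, in megapixels."""
    return megapixels(width, height) / (elapsed_ms / 1000.0)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


if __name__ == "__main__":
    print(f"image size: {megapixels(WIDTH, HEIGHT):.1f} Mpx")
    for (op, lib), ms in TIMINGS_MS.items():
        rate = throughput_mpx_per_s(WIDTH, HEIGHT, ms)
        print(f"{op} {lib:10s} {ms:4d} ms  {rate:7.1f} Mpx/s")
    for op in ("load", "save"):
        ratio = speedup(TIMINGS_MS[(op, "turbojpeg")], TIMINGS_MS[(op, "mango")])
        print(f"{op}: mango is {ratio:.2f}x faster than turbojpeg")
```

The image is about 69.7 megapixels, so the save-side gap works out to roughly
4x and the load-side gap to roughly 1.4x.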

------
tuananh
The Cloudflare, Uber, and Netflix engineering teams are all amazing.

------
wolfspider
Very interested in seeing this for libpng on platforms other than Linux. The
ARM intrinsics are always the sticking point when doing cross-platform stuff.
Very good write-up; I'll be following the project now.

------
baybal2
How does transcoding on this Centriq compare to FPGA transcoders cost-wise?

Can you put your testset images on github?

~~~
sannee
There are commercially available FPGA JPEG transcoders?

High end FPGAs cost an arm and a leg, so I find it hard to believe they would
be cost effective, though Intel seems to claim otherwise.

~~~
baybal2
A Cyclone V board with onboard network, some RAM, and flash costs $7k apiece.
2 years ago, we benchmarked an off-the-shelf JPEG transcoder IP to be ~30-40
times faster than libjpeg on a top-of-the-line 8-core Xeon.

~~~
sannee
Oooh, that's pretty awesome then. I was pretty sceptical after reading about
somewhat "meh" results for deep learning applications a few weeks ago.

~~~
baybal2
> I was pretty sceptical after reading about somewhat "meh" results for deep
> learning applications a few weeks ago.

To begin with, nobody has yet bothered to make an optimised HDL design for
this purpose.

This is unlike media transcoders of all kinds, which are the bread and
butter of the SoC industry. An image transcoding ASIC you find in a cellphone
SoC these days can easily max the I/O ceiling, which means it can transcode
faster than the SoC's memory interface can work.

------
JensRex
Using `-copy none` with jpegtran can be a bad idea in some cases, because it
wipes out the color profile data. Be absolutely sure you don't need it, before
you go blowing that data away everywhere.
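
If you want to check before stripping, here is a rough sketch of how to tell
whether a JPEG carries an embedded ICC profile. It relies on the usual
convention that the profile lives in APP2 segments whose payload starts with
`ICC_PROFILE\0`. The function names are my own, and it only walks the header
segments up to the start of scan, so treat it as a starting point rather than
a full parser.

```python
import struct

# Payload prefix that marks an APP2 segment as carrying ICC profile data.
ICC_SIGNATURE = b"ICC_PROFILE\x00"


def has_icc_profile(data: bytes) -> bool:
    """Return True if the JPEG header contains an APP2 ICC profile segment."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:
            # Fill byte before a marker: step over it.
            pos += 1
            continue
        if marker in (0xDA, 0xD9):
            # Start of scan or end of image: no more header segments.
            return False
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE2 and payload.startswith(ICC_SIGNATURE):
            return True
        pos += 2 + length
    return False


def file_has_icc_profile(path: str) -> bool:
    """Convenience wrapper that reads the file at `path` (hypothetical name)."""
    with open(path, "rb") as handle:
        return has_icc_profile(handle.read())
```

Running `file_has_icc_profile` over a sample of your images before switching
to `-copy none` gives a quick sense of how many would lose a profile.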

