NEON is the new black: fast JPEG optimization on ARM servers (cloudflare.com)
252 points by jgrahamc on April 13, 2018 | 73 comments

This made me think of the story about SnappyCam, whose author handcrafted a JPEG encoder in Arm NEON assembly to enable iPhones to capture still images at 20fps on the iPhone5 (before burst mode was a thing in iOS), and was subsequently acqui-hired by Apple:



This is a really good article. Couple of comments:

* Intel is dead man walking. They've been coasting on superior process technology for years, and now they're being attacked on two sides in the datacenter. On one side are ARM many-core chips like this, which have 50% of the performance per core but 4x as many cores for about the same price and TDP. On the other side, you've got NVIDIA. The Volta has something like 9X the memory bandwidth of typical Intel server CPUs. This is starting to get used in databases, and of course they already own deep learning.

* libjpeg-turbo is a great choice if you need fast CPU JPEG encoding/decoding. It's a very near drop-in replacement for libjpeg, and it runs about 2x faster, mostly due to AVX/NEON optimizations.

Disagree that Intel is a dead man walking. If they see a serious threat in the number of 1U size ARM servers actually going into production at scale, which is not happening yet, they have the ability to make 50% price cuts on 16/24/32-core Xeons to make them competitive again, with no new R&D money.

You cannot yet buy small to medium sized quantities of ARM motherboards from any of the top ten Taiwanese motherboard manufacturers. Intel and AMD have the advantage of massive economies of scale.

If you look at the global server market for individual bare metal hypervisor platforms priced from $900 to $11000 each, x86-64 is literally like 99% of motherboards shipped. The other 1% is probably spread between ARM, POWER and other even more esoteric platforms.

I would dearly like to be proven wrong. If anyone is aware of a $400 arm board I can buy right now from MSI, asus, gigabyte, supermicro, tyan, quanta, or others please post a link here.

I'd guess that Intel is "coasting" on purpose to maintain their ability to consistently deliver improvements. The company doesn't benefit from deploying their best tech as fast as possible if there is nothing to compete with it. When there is competition, it is easy for them to move forward.

I would like it if there were ARM servers in the marketplace too, but as yet it doesn't seem they are a real threat. "dead man walking" is a bit premature when the competition barely exists.

Yeah, I mean, to make an analogy, it's like declaring Toyota a "dead man walking" because Tesla has awesome tech, before researching the number of new-build cars/trucks/vans shipped and delivered to customers each year...

Toyota, cars, 9.1m

Tesla, all of 2017: Something like 101,000

It’s also like looking at the iPhone in 2007 and saying Nokia can do x/y/z to keep the upstart out.

Fact is, misplaced corporate focus and a sense of superiority kick these companies off the top spot.

Intel, as it stands this minute, IS next.

They can’t price like ARM, they don’t have the scale that ARM devices have - those two things are what Intel itself usually relies on to dominate competitors (e.g. AMD).

Add to this that Intel really is focussed on growing its business outside of chips and continually dropping the ball on the chip side (how late is CoffeeLake? Can anyone understand their marketing kefudgery with the generational mixture of chips?) and it’s an absolute recipe for corporate implosion.

>they don’t have the scale that ARM devices have

It is the other way around when you are talking about the server market. The scale of ARM devices in IoT or even smartphones gives near zero advantage in servers. (The instruction set, x86-64 / ARMv8, matters very little.)

There is a solid ecosystem for building arm instruction set mobile phones to run Android.

There are lots of little tiny IoT type things that run the arm instruction set and do it really well, but they are not in the same class as serious server motherboards.

ARM is just now barely scratching the surface of the number of PCI Express 3.0 "lanes" you can get with one EPYC or Threadripper socket.

There is nearly zero current ecosystem of serious server motherboards to run Linux either bare metal or with the KVM or Xen hypervisors: things with a lot of DDR4 RAM, M.2 NVMe SSD PCI Express slots, PCI Express 3.0 x16 slots, and 10 and 100 Gbps NICs (SFP+ and QSFP28 form factor).

Every arm server board I have ever seen to date is a special one off low quantity thing produced for a special purpose.

As it stands right now a manufacturer like Supermicro can fit four single socket motherboards in a 2RU chassis, and populate them with something like the $750 EPYC 16-core which will run circles around anything arm based. Or a cloud scale operator can put a whole shitload of 1U size individual chassis into use with standard ATX size motherboards and populate them with Ryzen 7 2700 CPUs at $275 each.

> which will run circles around anything arm based.

Do you have a single number to back that up?

I'm not sure how you can believe that. Right now the biggest problem is the total lack of availability of ARM servers. Perhaps Qualcomm's Amberwing is competitive, but the only source of information about it is Cloudflare's results on SIMD workloads. Infrastructure like networking is an obvious target for ARM servers because it can be accelerated via SIMD, but is the performance good enough for regular application development like webapps or databases? We don't know and they won't tell us. The only reason to believe or hope that they are superior is because you have a grudge against the established players.

I don't even claim that ARM is better than Intel on SIMD workloads. Quite the opposite. Intel has AVX2 and AVX512 that rip ARM apart on most highly parallel workloads. This very specific snippet benefits a bit more from ARM SIMD due to the presence of specific instructions lacking in SSE. But AVX512 has so many more instructions, with a much wider application.

That is assuming they will cut prices.

I think Intel will be in some trouble once Cloudflare finishes adding NEON-optimized versions to the open source libraries used by the majority of Internet companies, and Qualcomm ships its 2nd generation of Centriq, which promised improved IPC. I think we are looking at 2020.

But Intel already has a new uArch they are working on specifically for servers, along with 7nm, which is roughly equivalent to TSMC / Samsung's 5nm, i.e. Intel will still be at least 1 year ahead on the leading edge node for servers.

And let's not forget Intel's FPGAs, if they can somehow make them work seamlessly in servers.

by 2020 AMD will be shipping its third generation of EPYC...

> I would dearly like to be proven wrong. If anyone is aware of a $400 arm board I can buy right now

Not from the manufacturers you mentioned, but it is possible to buy an ARM board in a standard form factor that can be used in high performance applications. [1]

[1] http://macchiatobin.net/

Marvell are an awful awful company to do anything with if you have any expectation of driver source, kernel upstreaming or them even reading your emails unless your company has an n-million contract with them.

Good luck getting a standard Linux kernel to run on those Armadas though.

I looked briefly and found no mention of being able to run a Debian or centos derived distribution on them. Chicken or egg problem.

I wonder if they fixed the issue present on the (very similar) Espressobin board where using a PCIe card will crash the system.

Wow, 269 USD is pretty good. 10GigE ports and up to 16G memory... tempting!

Looks pretty nice for Linux based network device development, but I'm going to bet the CPU performance is in a tier well below a Coffee Lake Core i3 8100. Not really in the same category as single socket, narrow (two systems in 17.5" 1U chassis width), 8, 16, 24 or 32-core Xeon or EPYC.

It seems like they have a few firewalls to burn through before a dramatic xeon price cut. They could just start churning out xeon-d parts that are already fairly affordable and designed to basically compete with most of these server arm parts.

It's hard to overstate how good Intel is at what they do. It sucks that they haven't been radically increasing performance like they used to, but they had been doing it for 40 years. It also sucks that they do seem to slow play things a little and enjoy the margins. Do they still own an ARM license? Could they make their own ARM parts again if they wanted?

I don't think Intel is going to die anytime soon, but I do hope for stronger competition between ARM-based servers, Intel and AMD.

> * Intel is dead man walking.

Intel is so dominant that they can sell their server processors for many thousands of dollars and still command the vast majority of the server market. I don't think any other CPUs have the number of SIMD units they have either.

Intel has also already made a lower power more core architecture with the Xeon Phi. It even has high bandwidth memory separate from main memory. The Xeon Phi is niche, but so are many core ARM servers. The point being that Intel is far from unprepared in this area and companies making many core ARM servers are going to have to do a lot to prompt a switch.

Isn't Xeon phi dead? There haven't been any updates since 2013. My one professor who worked on them said they have many issues from being passively cooled to requiring the Intel compiler to get halfway decent performance.


The previous version was a 14nm CPU that was haswell instruction compatible. The 10nm version has been canceled.

That 2013 date is a bit misleading. Although the most recent generation was announced that year, it did not ship until 2016, and the Knights Mill derivative launched just a few months ago.

Knights Hill was canceled. I hadn't heard anything about the entire line being killed, although it would not be totally surprising if that were to happen.

Intel's market segmentation is getting even more aggressive, which gives them future flexibility to address pricing in a very fine-grained manner.

Notice that they make you pay up for 2 AVX-512 modules per core: Xeon Gold 6xxx has them, and 5xxx only has 1 per core. Huge price jump to get those extra SIMD units.

There aren't nearly enough different parts for ARM to credibly threaten Intel at this moment. Also, with Intel's profit margins, they can substantially reduce prices to remain competitive in all but the most price-sensitive segments.

What the future holds, however, is different. We see we are (finally!) getting over the binary compatibility problem and we are now able to use more or less the same software recompiled and tuned for different architectures (POWER, ARM, Intel, AMD) that cover a vast space in the price/performance/watt space, much larger than any single vendor could before.

I can't wait to see people becoming creative again with hardware.

It was easier to be more creative with hardware in the past, because of the full stack experience.

Now that OSes and bytecode as portable binaries have become commodities, it is very hard to try to sell something that users can just leave at any moment.

> It was easier to be more creative with hardware in the past

In the early 70s, perhaps. Then being compatible with DOS and then Windows limited most computers to x86 processors. One of the niches that was not limited gives us some hints of what can be now: Unix workstations that employed Motorola and RISC processors and all sorts of creative hardware to get a performance edge over what Intel could offer. They didn't need to reinvent the wheel and licensed large amounts of code to make the core of their OSs. They sometimes needed to port the C compiler to their architecture, but that was not the norm.

Developing for them was often making small changes to headers and recompiling the same source.

And then they died, because that C code could run anywhere regardless of the hardware improvements they were offering to the world.

This is what I meant by having nothing that could prevent developers from leaving the platform; the performance edge alone wasn't good enough against commodity hardware.

When commodity hardware got "good enough", no workstation manufacturer could any longer ask for the large premiums they did when a PC couldn't touch their performance. A 386 was not good enough. A 486 was almost there, and a Pentium was quite enough: I didn't feel constrained when running Solaris 2.5 for development with a 20" monitor and reasonable graphics. Now I'm comfortable with x86 machines running different flavors of Unix and use both macOS and Linux almost interchangeably.

Yet, if someone could offer radically better price/performance than an x86 PC and still run the large software base that runs on x86 Linux machines, it would stand a much better chance than it would have in the 90's and 2000's.

Our jpegtran is still faster. We use libjpeg-turbo for other manipulations though.

Faster cores will still be supreme until parallel computing solutions ramp up enough to fully utilize all the cores - there are still way too many problems and just software in general being built with the constraint of being single threaded.
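
One rough way to put numbers on the "faster cores vs. more cores" argument is Amdahl's law. The sketch below is only an illustration: the 50% per-core speed and 4x core count come from the comparison earlier in the thread, while the core count of 24, the function name, and the assumption of a single job with a fixed parallel fraction are all hypothetical.

```python
def amdahl_speed(per_core_speed, cores, parallel_fraction):
    """Speed of one job relative to a single core of speed 1.0 (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return per_core_speed / (serial_fraction + parallel_fraction / cores)


FAST_CORES = 24                # hypothetical core count for the "fast core" chip
MANY_CORES = 4 * FAST_CORES    # 4x as many cores ...
SLOW_SPEED = 0.5               # ... at 50% of the per-core performance

# Parallel fraction at which the two chips tie: p = 2n / (2n + 1)
break_even = 2 * FAST_CORES / (2 * FAST_CORES + 1)

for p in (0.50, 0.90, 0.95, break_even, 0.99, 1.00):
    fast = amdahl_speed(1.0, FAST_CORES, p)
    many = amdahl_speed(SLOW_SPEED, MANY_CORES, p)
    print(f"parallel={p:.4f}  fast cores={fast:6.2f}  many cores={many:6.2f}")
```

Under these assumptions the many-core chip only pulls ahead once roughly 98% or more of the job parallelizes. Serving many independent requests behaves like a parallel fraction close to 1, which is the case the many-core parts are aimed at.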

Intel is going to be fine. There’s no real alternative to Intel CPUs in cloud for most workloads, which is why they’re able to charge so much for their top SKUs, and none anywhere on the horizon. And if challengers do appear (like AMD on the desktop) they could just throw in more cores, bump frequencies, lower prices slightly and coast on their superior tech for another decade. Don’t get me wrong, I do think the current situation is pretty bad, and I do want alternatives to get more traction, but Intel would need to screw up really badly and repeatedly to be in any trouble.

There's a huge problem with these improvements. They rarely make it to the original project.

Luckily, the SIMD acceleration for progressive Huffman encoding was a rare exception [1]. So, along with other improvements, it made this fork [2] rather pointless. Adding the NEON optimizations would have been easier with a proper rebase, and that would require less work for both sides to keep things up to date.

[1] https://github.com/libjpeg-turbo/libjpeg-turbo/pull/46

[2] https://github.com/cloudflare/jpegtran

We push everything we can upstream. Sometimes maintainers don't want our changes. We expect to push a lot of ARM changes to a lot of different software in the coming months.

>We expect to push a lot of ARM changes to a lot of different software in the coming months.

Thank you very much!

Any chance the other optimizations will have a blog post similar to this?

Very likely.

Thank you!

I'm really impressed with the Cloudflare Engineering team, their collective knowledge, and their ability to roll up their sleeves and optimize the hell out of the products and services they run. From looking in, it seems almost everything they run has been source code customized to their requirements, from OpenSSL, to JPEG compression, to their HTTP/2 implementation.

We switched away from OpenSSL to BoringSSL (in part to get rid of our own modifications): https://blog.cloudflare.com/make-ssl-boring-again/

Very few other companies have a need to do these low level type projects.

Cloudflare actually delivers enough images per day for a few thousand cpu cycles saved per image to translate to a lot of $$$.

This is the kind of in-depth stuff that I love to read; the CloudFlare blog has been a source of good articles for a few years now, might be time to just add that to the list of things to check on a regular basis instead of waiting for some of them to pop up on HN :)

It's always kinda humbling to see how much performance is still left on the table in some base/core libraries everyone uses. It'd be nice to have seen what the compiler output is in comparison - and explore if there is a way to reorganize the code to "coerce" the compiler to output something similar.

The few times I've had to do similar things - I usually work by tweaking the code to give me the desired compiler output and I really try to avoid inline assembly. This usually helps me learn some limitations of C/C++ and what the compiler can deduce


edit: curious as to how you perform this optimisation.

It's been a few years since I've had to do it - but back then I just used the Visual Studio disassembly. You can get something similar (but more messy) with objdump.

Godbolt seems like a good solution nowadays, but you still need to run benchmarks.

I wonder what types of controls they have over the data they cache. While caches are by nature "public", given suitable naming, unless you have a link, they are effectively private.

> To understand the impact on overall performance I ran jpegtran over a set of 34,159 actual images from one of our caches.

That's a great question. Answer: a lot of control.

In order to run the cache experiment Vlad had to file an access control ticket which I had to approve. He was given access to cached images from two dogfooding/guinea pig machines and was able to run his experiment on those machines. They are not machines through which we run 'normal' production traffic.

Can we get asm/bytecode instead of C intrinsics? It's easier to reason about, without spawning an actual compiler.

I ended up writing asm for the important part:


> NEON is the ARMv8 version of SIMD

NEON was actually introduced in ARMv7.

How does that make the statement incorrect?

Technically, ARMv8 has no such thing as "NEON"; it was renamed Advanced SIMD. Only in ARMv7 is it officially called NEON.

But everyone sensible ignores a lot of the stupid names ARM uses in ARMv8 (AArch64/AArch32, A64/A32/T32, etc.)

Yes and no. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

> The ARM Advanced SIMD architecture, its associated implementations, and supporting software, are commonly referred to as NEON technology. There are NEON instruction sets for both AArch32 (equivalent to the ARMv7 NEON instructions) and for AArch64. Both can be used to significantly accelerate repetitive operations on large data sets. This can be useful in applications such as media codecs.

The NEON architecture for AArch64 uses 32 × 128-bit registers, twice as many as for ARMv7. These are the same registers used by the floating-point instructions. All compiled code and subroutines conform to the EABI, which specifies which registers can be corrupted and which registers must be preserved within a particular subroutine. The compiler is free to use any NEON/VFP registers for floating-point values or NEON data at any point in the code.

I would assume because 7 comes before 8.

So NEON is not the ARMv8 version of SIMD, because it is ARMv7 version of SIMD? Those are not mutually exclusive.

In case anyone else was confused by a 200KB file taking so long, the original file is 9933x7016 (7.1MB) and actually does take that long.

I'm not going to link it here because the site already seems slow, but it is easy to find if you want to do your own benchmark.

Found the file. 230 ms decode and 96 ms encode on i9 7900x. :)

Timings with the original test image ( https://www.eso.org/public/archives/print_posters/large/prin... )

9933 x 7016 (7.4 MB)

    load turbojpeg: 309 ms
    load mango:     225 ms
    save turbojpeg: 388 ms
    save mango:      96 ms

Started with 800 ms save time earlier today but got motivated to finally optimize the encoder; thanks! :)
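
To compare these timings with other image sizes, they can be turned into pixel throughput. A minimal sketch using only the numbers posted above (the function and variable names are made up for illustration):

```python
WIDTH, HEIGHT = 9933, 7016  # dimensions of the test image

# Timings in milliseconds, as posted above.
timings_ms = {
    "load turbojpeg": 309,
    "load mango": 225,
    "save turbojpeg": 388,
    "save mango": 96,
}


def megapixels_per_second(width, height, elapsed_ms):
    """Pixel throughput for one pass over a width x height image."""
    megapixels = width * height / 1e6
    return megapixels / (elapsed_ms / 1000.0)


throughput = {
    name: megapixels_per_second(WIDTH, HEIGHT, ms)
    for name, ms in timings_ms.items()
}

for name, mps in throughput.items():
    print(f"{name:15s} {mps:7.1f} MP/s")
```

The image is about 69.7 megapixels, so a 96 ms save works out to roughly 726 MP/s and a 388 ms save to roughly 180 MP/s.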

Cloudflare, Uber, Netflix engineering teams are all amazing.

Very interested in seeing this for libpng on platforms other than Linux. The ARM intrinsics are always the sticking point when doing cross-platform stuff. Very good write-up; I'll be following the project now.

How does transcoding on this Centriq compare to FPGA transcoders cost-wise?

Can you put your testset images on github?

If you're just looking for large numbers of test images in general, https://testimages.org/ has millions of them available for use, licensed CC BY-NC-SA 4.0.

There are commercially available FPGA JPEG transcoders?

High end FPGAs cost an arm and a leg, so I find it hard to believe they would be cost effective, though Intel seems to claim otherwise.

A Cyclone V board with onboard network, some RAM, and flash costs $7k apiece. 2 years ago, we benchmarked an off-the-shelf JPEG transcoder IP to be ~30-40 times faster than libjpeg on a top-of-the-line 8-core Xeon.

Oooh, that's pretty awesome then. I was pretty sceptical after reading about somewhat "meh" results for deep learning applications a few weeks ago.

> I was pretty sceptical after reading about somewhat "meh" results for deep learning applications a few weeks ago.

To begin with, nobody has even bothered yet to make an optimised HDL design for this purpose.

This is unlike all kinds of media transcoders, which are the bread and butter of the SoC industry. An image transcoding ASIC you find in a cellphone SoC these days can easily max out the I/O ceiling, which means it can transcode faster than the SoC's memory interface can work.

How about comparing to the best library? A few alternatives are mentioned in this discussion.

To compare that, I need to run benchmarks on their lib. Our original measurements were in Mb/s, and not the time per testset.

Adding to that, our FPGA boards are also programmed to do very high quality Lanczos scaling with gamma correction and a sharpening algo by default. Compression settings are on the quality side.

jpegtran does not transcode. It performs lossless optimization of the Huffman coding; the image data itself is not re-encoded.

Using `-copy none` with jpegtran can be a bad idea in some cases, because it wipes out the color profile data. Be absolutely sure you don't need it, before you go blowing that data away everywhere.
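
One way to be sure before stripping metadata is to check whether a file carries an embedded ICC profile. The sketch below is a rough illustration, not part of jpegtran: `has_icc_profile` is a hypothetical helper that walks the JPEG header segments and looks for an APP2 segment beginning with the `ICC_PROFILE` tag, which is where embedded ICC profiles are stored. It assumes a well-formed file and ignores other metadata such as EXIF.

```python
import struct

ICC_TAG = b"ICC_PROFILE\x00"  # identifier at the start of an ICC APP2 payload


def has_icc_profile(data: bytes) -> bool:
    """Return True if the JPEG bytes contain an APP2 ICC profile segment.

    Hypothetical helper for illustration; assumes a well-formed header.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:          # fill byte before a marker, skip it
            pos += 1
            continue
        if marker in (0xDA, 0xD9):  # SOS or EOI: no more header segments
            return False
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE2 and payload.startswith(ICC_TAG):
            return True
        pos += 2 + length
    return False
```

If this returns True for a file, running jpegtran with `-copy all` instead of `-copy none` keeps the profile, at the cost of keeping the other markers too.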
