* Intel is a dead man walking. They've been coasting on superior process technology for years, and now they're being attacked on two sides in the datacenter. On one side there are ARM many-core chips like this, with maybe 50% of the performance per core but 4x as many cores for about the same price and TDP. On the other side you've got NVIDIA: Volta has something like 9x the memory bandwidth of a typical Intel server CPU. That's starting to get used in databases, and of course NVIDIA already owns deep learning.
* libjpeg-turbo is a great choice if you need fast CPU JPEG encoding/decoding. It's a very near drop-in replacement for libjpeg, and it runs about 2x faster, mostly due to AVX/NEON optimizations (quick decode sketch below).
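For anyone who hasn't used it: besides the libjpeg-compatible interface, libjpeg-turbo also ships the simpler TurboJPEG API. A minimal decode sketch, assuming jpeg_buf/jpeg_size already hold a JPEG file in memory (error handling abbreviated):

    #include <stdlib.h>
    #include <turbojpeg.h>

    /* Decode a JPEG in memory to a tightly packed RGB buffer.
       Caller frees the returned buffer. */
    unsigned char *decode_rgb(const unsigned char *jpeg_buf,
                              unsigned long jpeg_size,
                              int *width, int *height)
    {
        tjhandle tj = tjInitDecompress();
        if (!tj)
            return NULL;

        int subsamp, colorspace;
        if (tjDecompressHeader3(tj, jpeg_buf, jpeg_size,
                                width, height, &subsamp, &colorspace) < 0) {
            tjDestroy(tj);
            return NULL;
        }

        unsigned char *rgb = malloc((size_t)*width * (size_t)*height
                                    * tjPixelSize[TJPF_RGB]);
        if (rgb && tjDecompress2(tj, jpeg_buf, jpeg_size, rgb,
                                 *width, 0 /* pitch: tightly packed */, *height,
                                 TJPF_RGB, TJFLAG_FASTDCT) < 0) {
            free(rgb);
            rgb = NULL;
        }
        tjDestroy(tj);
        return rgb;
    }

The SIMD dispatch happens inside the library, so the same code picks up AVX2 on x86 and NEON on ARM automatically.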
You cannot yet buy small to medium sized quantities of ARM motherboards from any of the top ten Taiwanese motherboard manufacturers. Intel and AMD have the advantage of massive economies of scale.
If you look at the global server market for individual bare-metal hypervisor platforms priced from $900 to $11,000 each, x86-64 accounts for something like 99% of motherboards shipped. The other 1% is probably spread between ARM, POWER, and other even more esoteric platforms.
I would dearly like to be proven wrong. If anyone is aware of a $400 ARM board I can buy right now from MSI, ASUS, Gigabyte, Supermicro, Tyan, Quanta, or others, please post a link here.
I would like it if there were ARM servers in the marketplace too, but as yet they don't seem to be a real threat. "Dead man walking" is a bit premature when the competition barely exists.
Toyota, cars: 9.1M.
Tesla, all of 2017: something like 101,000.
Fact is, misplaced corporate focus and a sense of superiority are what knock these companies off the top spot.
Intel, as it stands this minute IS next.
They can't price like ARM, and they don't have the scale that ARM devices have; those two things are exactly what Intel usually relies on to dominate competitors (e.g. AMD).
Add to this that Intel is really focused on growing its business outside of chips while continually dropping the ball on the chip side (how late is Coffee Lake? Can anyone understand the marketing kerfuffle of their generational mixture of chips?) and it's an absolute recipe for corporate implosion.
It is the other way around when you are talking about the server market. The scale of ARM devices in IoT or even smartphones provides near-zero advantage in servers. (The instruction set, x86-64 vs. ARMv8, matters very little.)
There are lots of little tiny IoT-type things that run the ARM instruction set and do it really well, but they are not in the same class as serious server motherboards.
ARM is just now barely scratching the surface of the number of PCI Express 3.0 lanes you can get with one EPYC or Threadripper socket.
There is nearly zero current ecosystem of serious server motherboards to run Linux, either bare metal or with the KVM or Xen hypervisors: things with a lot of DDR4 RAM, M.2 NVMe SSD slots, PCI Express 3.0 x16 slots, and 10 and 100 Gbps NICs (SFP+ and QSFP28 form factors).
Every ARM server board I have seen to date is a special one-off, low-quantity thing produced for a special purpose.
As it stands right now, a manufacturer like Supermicro can fit four single-socket motherboards in a 2RU chassis and populate them with something like the $750 16-core EPYC, which will run circles around anything ARM-based. Or a cloud-scale operator can put a whole shitload of 1U individual chassis into use with standard ATX-size motherboards and populate them with $275 Ryzen 7 2700 CPUs.
Do you have a single number to back that up?
I think Intel is in some trouble once Cloudflare finishes improving, with NEON-optimized versions, all the open-source libraries used by the majority of Internet companies, and Qualcomm ships its 2nd-generation Centriq with its promised IPC improvements. I think we are looking at 2020.
But Intel already has a new uArch they are working on specifically for servers, along with 7nm, which is roughly equivalent to TSMC's / Samsung's 5nm, i.e. Intel will still be at least one year ahead in leading-edge nodes for servers.
And let's not forget Intel's FPGAs, if they can somehow make them work seamlessly in servers.
Not from the manufacturers you mentioned, but it is possible to buy an ARM board in a standard form factor that can be used in high-performance applications.
It's hard to overstate how good Intel is at what they do. It sucks that they haven't been radically increasing performance like they used to, but they had been doing it for 40 years. It also sucks that they do seem to slow play things a little and enjoy the margins. Do they still own an ARM license? Could they make their own ARM parts again if they wanted?
Intel is so dominant that they can sell their server processors for many thousands of dollars and still command the vast majority of the server market. I don't think any other CPUs have the number of SIMD units they have either.
Intel has also already made a lower-power, many-core architecture with the Xeon Phi. It even has high-bandwidth memory separate from main memory. The Xeon Phi is niche, but so are many-core ARM servers. The point being that Intel is far from unprepared in this area, and companies making many-core ARM servers are going to have to do a lot to prompt a switch.
The previous version was a 14nm CPU that was Haswell instruction-compatible. The 10nm version has been canceled.
Knights Hill was canceled. I hadn't heard anything about the entire line being killed, although it would not be totally surprising if that were to happen.
Notice that they make you pay up for two AVX-512 FMA units per core: Xeon Gold 6xxx has them, and 5xxx only has one per core. Huge price jump to get those extra SIMD units.
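As an aside, CPUID only tells you whether the AVX-512 instructions are supported at all, not how many FMA units a core has; the 5xxx-vs-6xxx difference only shows up as throughput in a microbenchmark. A minimal support check with the GCC/Clang builtins (a sketch, x86 only):

    #include <stdio.h>

    int main(void)
    {
        __builtin_cpu_init();  /* harmless here; required if checking before main() */
        if (__builtin_cpu_supports("avx512f"))
            puts("AVX-512F supported (FMA unit count must be measured, not queried)");
        else
            puts("no AVX-512F");
        return 0;
    }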
What the future holds, however, is different. We are (finally!) getting over the binary compatibility problem and are now able to use more or less the same software, recompiled and tuned for different architectures (POWER, ARM, Intel, AMD), which together cover a much larger region of the price/performance/watt space than any single vendor could before.
I can't wait to see people becoming creative again with hardware.
Now that OSes and bytecode as portable binaries have become a commodity, it is very hard to sell something when users can just leave at any moment.
In the early '70s, perhaps. Then being compatible with DOS, and later Windows, limited most computers to x86 processors. One of the niches that was not so limited gives us some hints of what is possible now: Unix workstations that employed Motorola and RISC processors and all sorts of creative hardware to get a performance edge over what Intel could offer. They didn't need to reinvent the wheel; they licensed large amounts of code to form the core of their OSs. They sometimes needed to port the C compiler to their architecture, but that was not the norm.
Developing for them often meant making small changes to headers and recompiling the same source.
This is what I meant by having nothing to prevent developers from leaving the platform: the performance edge alone wasn't good enough against commodity hardware.
Yet, if someone could offer radically better price/performance than an x86 PC and still run the large software base that runs on x86 Linux machines, it'd stand a much better chance than it would have in the '90s and 2000s.
Luckily, the SIMD acceleration for progressive Huffman encoding was a rare exception. So, along with other improvements, it made this fork rather pointless. Adding the NEON optimizations would have been easier with a proper rebase, and that would have required less work from both sides to keep things up to date.
Thank you very much!
Cloudflare actually delivers enough images per day for a few thousand CPU cycles saved per image to translate into a lot of $$$.
The few times I've had to do similar things, I usually work by tweaking the code until it gives me the desired compiler output, and I really try to avoid inline assembly. This also helps me learn some limitations of C/C++ and what the compiler can deduce.
edit: curious as to how you performed this optimisation.
Godbolt seems like a good solution nowadays, but you still need to run benchmarks.
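A hypothetical example of that workflow: write the hot loop in a shape the auto-vectorizer likes (restrict-qualified pointers, simple trip count, no early exits), then check on godbolt.org that GCC/Clang at -O3 actually emit SIMD before trusting the benchmark numbers:

    #include <stddef.h>
    #include <stdint.h>

    /* Sum of absolute differences over two byte buffers. The restrict
       qualifiers promise no aliasing, and widening to int keeps the
       subtraction from wrapping -- both help the vectorizer. */
    uint64_t sum_abs_diff(const uint8_t *restrict a,
                          const uint8_t *restrict b, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            int d = (int)a[i] - (int)b[i];
            sum += (uint64_t)(d < 0 ? -d : d);
        }
        return sum;
    }

If the generated assembly comes out scalar, small rearrangements of the loop often flip it, and those are exactly the compiler limitations you learn this way.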
> To understand the impact on overall performance I ran jpegtran over a set of 34,159 actual images from one of our caches.
In order to run the cache experiment Vlad had to file an access control ticket which I had to approve. He was given access to cached images from two dogfooding/guinea pig machines and was able to run his experiment on those machines. They are not machines through which we run 'normal' production traffic.
NEON was actually introduced in ARMv7.
But everyone sensible ignores a lot of the stupid names ARM uses in ARMv8 (AArch64/AArch32, A64/A32/T32, etc.)
> The ARM Advanced SIMD architecture, its associated implementations, and supporting software, are commonly referred to as NEON technology. There are NEON instruction sets for both AArch32 (equivalent to the ARMv7 NEON instructions) and for AArch64. Both can be used to significantly accelerate repetitive operations on large data sets. This can be useful in applications such as media codecs.
> The NEON architecture for AArch64 uses 32 × 128-bit registers, twice as many as for ARMv7. These are the same registers used by the floating-point instructions. All compiled code and subroutines conform to the EABI, which specifies which registers can be corrupted and which registers must be preserved within a particular subroutine. The compiler is free to use any NEON/VFP registers for floating-point values or NEON data at any point in the code.
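For a feel of what that looks like in practice, here is a minimal NEON intrinsics sketch (processing 16 bytes per iteration; n is assumed to be a multiple of 16 to keep the example short). It compiles for both AArch32-with-NEON and AArch64:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Saturating add of two byte arrays, 16 lanes at a time. */
    void add_saturate_u8(const uint8_t *a, const uint8_t *b,
                         uint8_t *out, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            uint8x16_t va = vld1q_u8(a + i);      /* load 16 bytes */
            uint8x16_t vb = vld1q_u8(b + i);
            vst1q_u8(out + i, vqaddq_u8(va, vb)); /* saturating add + store */
        }
    }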
I'm not going to link it here because the site already seems slow, but it is easy to find if you want to do your own benchmark.
9933 x 7016 (7.4 MB)
load turbojpeg: 309 ms
load mango: 225 ms
save turbojpeg: 388 ms
save mango: 96 ms
Can you put your test set images on GitHub?
High-end FPGAs cost an arm and a leg, so I find it hard to believe they would be cost effective, though Intel seems to claim otherwise.
To begin with, nobody has even bothered yet to make an optimised HDL design for this purpose.
This is unlike all kinds of media transcoders, which are the bread and butter of the SoC industry. An image transcoding ASIC you find in cellphone SoCs these days can easily max out the I/O ceiling, meaning it can transcode faster than the SoC's memory interface can feed it.
Adding to that, our FPGA boards are also programmed to do very high quality Lanczos scaling with gamma correction and a sharpening algo by default. Compression settings err on the quality side.
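For reference, "Lanczos" is the usual windowed-sinc resampling kernel; a minimal sketch of the kernel function (a = 3 is a common choice, and "gamma correction" means converting sRGB samples to linear light before filtering and back afterwards):

    #include <math.h>

    /* Lanczos kernel: sinc(x) * sinc(x/a) for |x| < a, else 0. */
    static double lanczos(double x, double a)
    {
        if (x == 0.0)
            return 1.0;          /* sinc(0) = 1 */
        if (fabs(x) >= a)
            return 0.0;          /* outside the window */
        double px = M_PI * x;    /* M_PI: POSIX math.h */
        return a * sin(px) * sin(px / a) / (px * px);
    }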