
Comparative Benchmark of Arm, AMD, and Intel for Cloud-Native Workloads - blixtra
https://kinvolk.io/blog/2019/11/comparative-benchmark-of-ampere-emag-amd-epyc-and-intel-xeon-for-cloud-native-workloads/
======
spamizbad
It should be mentioned that the AMD CPUs featured are previous-generation
(7401) CPUs.

~~~
floatboth
The current eMAG (Skylark) is not very new either; it's a 16 nm design from
last year. They wanted to launch Quicksilver in 2019, but there's like one
month left...

~~~
ac29
Sounds like it will be sampling this year:
[https://www.datacenterknowledge.com/hardware/ampere-gears-la...](https://www.datacenterknowledge.com/hardware/ampere-gears-launch-7nm-80-core-arm-chip-cloud-data-centers)

------
alain94040
_In multi-thread benchmarks of raw memory I/O we found a clear performance
leader in Ampere’s eMAG, outperforming both AMD’s EPYC and Intel’s XEON CPUs
by a factor of 6 or higher_

That doesn't sound right. Neither AMD nor Intel gets more than a handful of
GB/s in basic memory I/O? Any idea what could be wrong?

~~~
wmf
It doesn't sound remotely right to me either. It could be NUMA, since the
Intel and AMD systems are NUMA but the eMAG is not. The code for this
benchmark appears to be
[https://github.com/akopytov/sysbench/blob/master/src/tests/m...](https://github.com/akopytov/sysbench/blob/master/src/tests/memory/sb_memory.c)
which... is not an interesting way to benchmark a large server IMO. Running a
single process with a lot of threads and a lot of RAM on a NUMA server is
going to perform poorly (unless you do a lot of tuning, which I don't
recommend either). "Microservices" might run a lot faster.

~~~
lrem
Are you sure about that? I got the impression that it should work well under
Linux, unless you create a lot of contention.

------
sanxiyn
If you are interested in a development workload (ARM porting) instead of a
"cloud-native" workload, I did one here:
[https://github.com/sanxiyn/blog/blob/master/posts/2019-11-12...](https://github.com/sanxiyn/blog/blob/master/posts/2019-11-12.md)

In addition to Packet, both AWS and Scaleway were also benchmarked.

------
andy_ppp
Wouldn't most things need Hyper-Threading off to be secure on Intel, or is it
fine if you have your own hardware?

~~~
sigio
That's only fine if you know all code running in all parts (containers) on the
same hardware node. Code running on one container can influence data/code from
other containers. (When some third-party has a form of code execution)

~~~
loeg
Privilege-aware scheduling could colocate only same-container (or same-user,
or same-process) threads on HT pairs.
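
A sketch of what this looks like with Linux's core-scheduling interface
(prctl(PR_SCHED_CORE, ...), merged in 5.14, so after this thread; the
container-runtime framing is an assumption of mine). Tasks get a cookie, and
tasks with different cookies are never co-scheduled on SMT siblings:

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    /* Fallbacks for older userspace headers. */
    #ifndef PR_SCHED_CORE
    #define PR_SCHED_CORE 62
    #define PR_SCHED_CORE_CREATE 1
    #define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
    #endif

    int main(void) {
        /* Give this thread group its own core-scheduling cookie. A
         * container runtime could do this once per container, so only
         * same-container threads ever share an HT pair. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                  PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) != 0) {
            perror("PR_SCHED_CORE_CREATE (needs 5.14+, CONFIG_SCHED_CORE)");
            return 1;
        }
        printf("pid %d now only shares SMT siblings with itself\n",
               getpid());
        /* ... run the untrusted workload here ... */
        return 0;
    }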

------
NicoJuicy
And for the Worldwide LHC Computing Grid, the Power8 came out on top.

Dutch article:

[https://tweakers.net/reviews/7426/datavloedgolf-lhc-op-komst...](https://tweakers.net/reviews/7426/datavloedgolf-lhc-op-komst-nikhef-bereidt-zich-voor-met-rappe-opslag.html)

------
userbinator
_In the memcopy benchmark, which is designed to stress both memory I/O as
well as caches, Intel’s XEON shows the highest raw performance_

I am not surprised by that, given that x86 has a single instruction that will
copy an arbitrary number of bytes in cacheline-sized chunks --- something
that ARM does not have.
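
The instruction in question is `rep movsb`: one opcode that copies RCX bytes
from [RSI] to [RDI], and with ERMSB (Enhanced REP MOVSB, Ivy Bridge and
later) the microcode moves them in cacheline-sized chunks. A minimal sketch
(x86-64, GCC/Clang inline asm; my function name):

    #include <stddef.h>

    /* Copies n bytes using the single x86 string-move instruction.
     * The "+D"/"+S"/"+c" constraints pin dst/src/n to RDI/RSI/RCX,
     * which rep movsb both reads and updates. */
    static void *memcpy_rep_movsb(void *dst, const void *src, size_t n) {
        void *ret = dst;
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         : /* no other inputs */
                         : "memory");
        return ret;
    }

AArch64 memcpy instead has to run an unrolled loop of LDP/STP pairs, since
there was no single-instruction equivalent on ARM at the time.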

~~~
wmf
It seems like the bottleneck should be the memory hierarchy, not executing
instructions. /RISC4EVER

------
fhcoso
I'm very eager to see RISC-V arrive so we can look at its
performance/security. Don't forget that you need to disable a number of Intel
features, like SMT/Hyper-Threading, if you want a fully secure environment.

About ARM, Cloudflare uses them:
[https://blog.cloudflare.com/arm-takes-wing/](https://blog.cloudflare.com/arm-takes-wing/)

------
baybal2
Very impressive perf on ARM's side, given that it competes against decades of
x86-specific optimisation in the code.

Intel, for example, long made float performance intentionally close to that
of same-size integers, so there was no perf difference in scripting languages
that use floats internally for all computations.

ARM sucks at web benchmarks because ARM never put any emphasis on FP perf;
many ARM cores simply don't have FP units at all. The most popular JS VM, V8,
does a lot of useless float-to-integer and back conversions under the hood,
and that doesn't help either. They are almost free on x86, but degrade JS
perf on smartphones by double digits.

Second, vector math and vector float math have close to no use in web loads,
but a lot of devs still try to put SSE instructions everywhere, simply
because SSE is many times faster than scalar math and many binary
manipulations on x86.

ARM, on the other hand, is relatively good at doing lots of ops on byte and
doubleword data, because it was historically never aimed at number crunching
with extra-wide vector instructions.

For the same reason, ARM's UCS-2 and UTF-16 parsing performance is that bad.
All kinds of parsers exploit fast register renaming on x86 to run tzcnt with
very good perf, but they have to fall back to relatively slow SIMD bitmasks
on ARM. You can feel that a lot when you work with VMs/interpreters that use
UCS-2 as their internal Unicode representation.
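
The x86 idiom in question looks roughly like this (SSE2; a sketch of the
movemask + tzcnt pattern, not any particular parser's code): compare 16 bytes
at once, compress the result to a bitmask with PMOVMSKB, then tzcnt the mask.
The PMOVMSKB step is the part with no single-instruction NEON equivalent:

    #include <emmintrin.h>   /* SSE2 intrinsics, x86 only */

    /* Index of the first `needle` byte in a 16-byte block, or 16 if
     * absent. Assumes p points at 16 readable bytes. */
    static int find_byte16(const unsigned char *p, unsigned char needle) {
        __m128i block = _mm_loadu_si128((const __m128i *)p);
        __m128i hits  = _mm_cmpeq_epi8(block, _mm_set1_epi8((char)needle));
        unsigned mask = (unsigned)_mm_movemask_epi8(hits); /* 16-bit mask */
        return mask ? __builtin_ctz(mask) : 16;            /* tzcnt */
    }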

Hardware peripherals were always x86-optimised too. Almost every device you
can hook onto PCIe has been extensively optimised to work well with x86-style
DMA and with higher-level APIs like I/O virtualisation and DMA offload
engines, along with assumptions about typical controller, memory, and cache
latency.

Even endianness conversion is there to make x86 jump ahead. Almost all
"enterprise hardware" intentionally uses little endian in its protocols to
avoid endianness conversion on x86, at the cost of doing it on big-endian
machines, which include ARM.

P.S. On the other hand, nearly all peripheral ICs aimed at the embedded
market prefer big endian, for the opposite reason.
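
Concretely, the conversion cost looks like this (le32toh is from glibc's
<endian.h>; the wire format is a made-up example): reading a little-endian
field compiles to a plain load on x86, but to a load plus a byte swap
(REV/bswap) on a big-endian host, and vice versa for big-endian protocols.

    #include <endian.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical on-wire header with little-endian fields. */
    struct wire_hdr { uint32_t len_le; uint16_t type_le; };

    static uint32_t read_len(const void *pkt) {
        struct wire_hdr h;
        memcpy(&h, pkt, sizeof h);      /* safe unaligned read */
        return le32toh(h.len_le);       /* no-op on LE, bswap on BE */
    }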

~~~
ComputerGuru
Modern ARM is bi-endian, but it is rarely run in big-endian mode.

