Comparative Benchmark of Arm, AMD, and Intel for Cloud-Native Workloads (kinvolk.io)
65 points by blixtra 12 days ago | 22 comments

It should be mentioned that the AMD CPUs featured are previous-generation (7401) CPUs.

So are the Xeons. But AMD's latest generation was a bigger change than Intel's latest update.

The current eMAG (Skylark), though it is current, is not very new either. It's a 16 nm design from last year. They wanted to launch Quicksilver in 2019, but there's only about one month left.

Sounds like it will be sampling this year: https://www.datacenterknowledge.com/hardware/ampere-gears-la...

In multi-thread benchmarks of raw memory I/O we found a clear performance leader in Ampere’s eMAG, outperforming both AMD’s EPYC and Intel’s XEON CPUs by a factor of 6 or higher

That doesn't sound right. Neither AMD nor Intel get more than a handful of GB/s in basic memory I/O? Any idea what could be wrong?

Upfront they say:

"It should be noted that Kinvolk has ongoing cooperation with both Ampere Computing and Packet, and used all infrastructure used in our benchmarking free of charge. Ampere Computing furthermore sponsored the development of the control plane automation used to issue benchmark runs, and to collect resulting data points, and to produce charts."

I'm not saying anything was intentionally done, but optimizations were likely done on the Ampere side.

It doesn't sound remotely right to me either. It could be NUMA since the Intel and AMD systems are NUMA but the eMAG is not. The code for this benchmark appears to be https://github.com/akopytov/sysbench/blob/master/src/tests/m... which... is not an interesting way to benchmark a large server IMO. Running a single process with a lot of threads and a lot of RAM on a NUMA server is going to perform poorly (unless you do a lot of tuning which I don't recommend either). "Microservices" might run a lot faster.

Are you sure about that? I got the impression that should work well under Linux, unless you create a lot of contention.

I ran sysbench memory on a dual-channel, six-core MacBook and it scored 18311.19 MiB/sec, higher than either of those x64 behemoths. Something seems off.
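A back-of-the-envelope check on why that looks off, as a rough sketch only. The memory configurations below are assumptions for illustration (dual-channel DDR4-2400 for the laptop, eight channels of DDR4-2666 per socket for the server); neither is stated in the thread or taken from the benchmark setup. The only measured figure is the 18311.19 MiB/sec above.

```python
MIB = 1024 ** 2


def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: float, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal).

    channels * transfers per second * bytes moved per transfer.
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def mib_s_to_gb_s(mib_per_s: float) -> float:
    """Convert MiB/s (as reported by sysbench) to decimal GB/s."""
    return mib_per_s * MIB / 1e9


# ASSUMED configurations, for illustration only.
laptop_peak = peak_bandwidth_gb_s(channels=2, mega_transfers_per_s=2400)
server_peak = peak_bandwidth_gb_s(channels=8, mega_transfers_per_s=2666)

# The one measured number from the comment above.
laptop_measured = mib_s_to_gb_s(18311.19)

if __name__ == "__main__":
    print(f"laptop measured:        {laptop_measured:.1f} GB/s")
    print(f"laptop peak (assumed):  {laptop_peak:.1f} GB/s")
    print(f"server peak (assumed):  {server_peak:.1f} GB/s per socket")
```

Under those assumptions the laptop result sits well below its own theoretical peak, and a multi-channel server socket should have several times more headroom, which points at the benchmark configuration rather than the hardware.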

If you are interested in development workload (ARM porting) instead of "cloud-native" workload, I did one here: https://github.com/sanxiyn/blog/blob/master/posts/2019-11-12...

In addition to Packet, both AWS and Scaleway were also benchmarked.

Wouldn't most things need Hyperthreading off to be secure on Intel or is it fine if you have your own hardware?

That's only fine if you know all the code running in every part (container) of the same hardware node. Code running in one container can influence data or code in other containers, once some third party has a form of code execution.

Privilege-aware scheduling could colocate only same-container (or same-user, or same-process) threads on HT pairs.
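To make the idea concrete, here is a minimal sketch of such a placement policy. It is not how any real kernel scheduler is implemented; the function name, the `(thread_id, owner)` tuples, and the CPU numbering are all hypothetical. Threads are grouped by owner and each hyperthread pair is handed to a single owner, leaving a sibling idle rather than sharing it.

```python
from collections import defaultdict


def assign_to_ht_pairs(threads, ht_pairs):
    """Place threads on logical CPUs without mixing owners on one HT pair.

    threads:  list of (thread_id, owner) tuples (hypothetical representation)
    ht_pairs: list of (cpu_a, cpu_b) sibling logical CPUs on one physical core
    Returns a dict mapping thread_id -> logical CPU.
    """
    by_owner = defaultdict(list)
    for thread_id, owner in threads:
        by_owner[owner].append(thread_id)

    free_pairs = list(ht_pairs)
    placement = {}
    for owner, thread_ids in by_owner.items():
        # Hand out whole pairs to one owner, two threads at a time.
        for i in range(0, len(thread_ids), 2):
            if not free_pairs:
                raise RuntimeError("not enough hyperthread pairs for isolation")
            cpu_a, cpu_b = free_pairs.pop(0)
            placement[thread_ids[i]] = cpu_a
            if i + 1 < len(thread_ids):
                placement[thread_ids[i + 1]] = cpu_b
            # Otherwise cpu_b stays idle instead of hosting another owner.
    return placement


if __name__ == "__main__":
    threads = [("t1", "A"), ("t2", "B"), ("t3", "A"), ("t4", "A")]
    print(assign_to_ht_pairs(threads, [(0, 4), (1, 5), (2, 6)]))
```

The cost is visible in the sketch: an owner with an odd number of threads wastes one sibling, which is the capacity trade-off against simply turning hyperthreading off.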

Their tests disabled hyperthreading on Intel due to the security concerns and also on AMD on speculation that security concerns might arise in the future (if I read everything correctly).

And for the Worldwide LHC Computing Grid, the Power8 came out on top.

Dutch article:


In the memcopy benchmark, which is designed to stress both memory I/O as well as caches, Intel’s XEON shows the highest raw performance

I am not surprised by that, given that x86 has a single instruction that will copy arbitrary number of bytes in cacheline-sized chunks --- something that ARM does not have.

It seems like the bottleneck should be the memory hierarchy, not executing instructions. /RISC4EVER

I'm impatient to see RISC-V arrive so we can look at its performance and security. Don't forget that on Intel you have to disable a lot of features, such as SMT/Hyper-Threading, if you want a fully secure environment.

As for ARM, Cloudflare uses them: https://blog.cloudflare.com/arm-takes-wing/

Very impressive perf on ARM's side, given that it competes against decades of x86-specific optimisation in the code.

Intel, for example, has long intentionally kept float performance close to that of same-size integers, so there was no perf difference in scripting languages that use floats internally for all computations.

ARM sucks at web benchmarks because ARM never put any emphasis on FP perf. Many ARM cores simply don't have FP units at all. The most popular JS VM, V8, does a lot of useless float-to-integer and back conversions under the hood, and that doesn't help either. They are almost free on x86, but degrade JS perf on smartphones by double digits.

Second, vector math and vector float math have close to no use in web loads, but a lot of devs still try to put SSE instructions everywhere simply because SSE is many times faster than simple math and many binary manipulations on x86.

ARM, on the other hand, is relatively good at doing a lot of ops on byte and double data, because it was historically never aimed at number crunching with extra-wide vector instructions.

For the same reason, ARM's UCS-2 and UTF-16 parsing performance is that bad. All kinds of parsers exploit fast register renaming on x86 to run tzcnt with very good perf, but they have to fall back to relatively slow SIMD bitmasks on ARM. You can feel that a lot when you work with VMs/interpreters that use UCS-2 as their internal Unicode representation.
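For readers who haven't seen the pattern, this is roughly the bitmask-plus-tzcnt trick parsers use, sketched in plain Python so the logic is visible. It says nothing about actual performance on either architecture, and the function names and the 16-byte chunk width are assumptions for illustration. A real parser would build the mask with one SIMD compare and get the bit index from a single tzcnt-style instruction.

```python
def match_mask(chunk: bytes, needle: int) -> int:
    """Return an integer whose bit i is set when chunk[i] == needle.

    Stands in for what a SIMD byte compare followed by a movemask produces.
    """
    mask = 0
    for i, byte in enumerate(chunk):
        if byte == needle:
            mask |= 1 << i
    return mask


def count_trailing_zeros(mask: int) -> int:
    """Index of the lowest set bit, i.e. what tzcnt computes (mask must be non-zero)."""
    if mask == 0:
        raise ValueError("mask must be non-zero")
    return (mask & -mask).bit_length() - 1


def find_first(data: bytes, needle: int, width: int = 16) -> int:
    """Find the first occurrence of needle, scanning one chunk at a time."""
    for base in range(0, len(data), width):
        mask = match_mask(data[base:base + width], needle)
        if mask:
            return base + count_trailing_zeros(mask)
    return -1


if __name__ == "__main__":
    print(find_first(b"key=value", ord("=")))
```

The point of the structure is that the inner loop has no per-byte branch on the hot path: one compare per chunk, one test for a non-zero mask, and one bit scan to locate the delimiter.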

Hardware peripherals were always x86 optimised too. Yes, almost every device you can hook onto PCIE has been extensively optimised to work well with x86 style DMA, and some higher level APIs like I/O virtualisation, DMA offload engines, and assumptions about typical controller, memory, and cache latency.

Yes, even endianness conversion is there to make x86 jump ahead. Almost all "enterprise hardware" intentionally uses little endian in its protocols to avoid endianness conversion on x86. Of course, that comes at the cost of doing the conversion on big-endian machines, which include ARM.

P.S. On the other hand, nearly all peripheral ICs aimed at the embedded market prefer big endian for the opposite reason.
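For anyone unsure what the conversion actually involves, a small sketch using Python's standard `struct` module. The helper names and the sample value are made up for illustration. A host whose native byte order matches the wire format can store the value as-is; a host with the other byte order has to swap bytes on every field.

```python
import struct


def to_wire_le(value: int) -> bytes:
    """Encode a 32-bit unsigned value as little-endian wire bytes."""
    return struct.pack("<I", value)


def to_wire_be(value: int) -> bytes:
    """Encode a 32-bit unsigned value as big-endian wire bytes."""
    return struct.pack(">I", value)


def byteswap32(value: int) -> int:
    """Reverse the byte order of a 32-bit unsigned value."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


if __name__ == "__main__":
    sample = 0x11223344  # arbitrary example value
    print(to_wire_le(sample).hex())
    print(to_wire_be(sample).hex())
    print(hex(byteswap32(sample)))
```

The two encodings contain the same four bytes in opposite order, so reading a little-endian field on a big-endian host (or the reverse) costs exactly one `byteswap32` per field.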

Modern ARM is bi-endian but is rarely run in big-endian mode.

ARM is doing OK with the current generation of HPC systems, and the post-K system, whose name I forget, should be rather impressive at floating point. SIMD width is not all that matters, after all. (Obviously this is v8 and up, which requires floating point.)

ARM cores are usually run little-endian.
