That doesn't sound right. Neither AMD nor Intel gets more than a handful of GB/s in basic memory I/O? Any idea what could be wrong?
"It should be noted that Kinvolk has ongoing cooperation with both Ampere Computing and Packet, and used all infrastructure used in our benchmarking free of charge. Ampere Computing furthermore sponsored the development of the control plane automation used to issue benchmark runs, and to collect resulting data points, and to produce charts."
I'm not saying anything was intentionally done, but optimizations were likely done on the Ampere side.
In addition to Packet, both AWS and Scaleway were also benchmarked.
I am not surprised by that, given that x86 has a single instruction (REP MOVSB) that will copy an arbitrary number of bytes in cacheline-sized chunks --- something that ARM does not have.
About ARM: Cloudflare uses it: https://blog.cloudflare.com/arm-takes-wing/
Intel, for example, has long intentionally kept float performance close to that of same-width integers, so there was no perf difference for scripting languages that use floats internally for all computations.
ARM sucks at web benchmarks because ARM never put much emphasis on FP perf. Many ARM cores simply don't have FP units at all. The most popular JS VM, V8, does a lot of useless float-to-integer conversions and back under the hood, and that doesn't help either. They are almost free on x86, but degrade JS perf on smartphones by double digits.
Second, vector math and vector float math have close to no use in web workloads, but a lot of devs still put SSE instructions everywhere, simply because SSE is many times faster than scalar math for many binary manipulations on x86.
ARM, on the other hand, is relatively good at doing lots of ops on byte and doubleword data, because it was historically never aimed at number crunching with extra-wide vector instructions.
For the same reason, ARM's UCS-2 and UTF-16 parsing performance is that bad. All kinds of parsers exploit fast register renaming on x86 to run TZCNT with very good perf, but they have to fall back to relatively slow SIMD bitmask reductions on ARM. You can feel that a lot when you work with VMs/interpreters that use UCS-2 as their internal Unicode representation.
Hardware peripherals have always been x86-optimised too. Almost every device you can hook onto PCIe has been extensively optimised to work well with x86-style DMA and with higher-level features like I/O virtualisation and DMA offload engines, and is built around assumptions about typical x86 controller, memory, and cache latency.
Yes, even endianness conventions help x86 jump ahead. Almost all "enterprise hardware" intentionally uses little-endian in its protocols to avoid endianness conversion on x86 --- at the cost of pushing the swap onto big-endian machines. (ARM is technically bi-endian, though it almost always runs little-endian today, so it mostly shares x86's advantage here.)
P.S. On the other hand, nearly all peripheral ICs aimed at the embedded market prefer big-endian, for the opposite reason.