AMD is not just competitive, it is better than Intel. Thus Google should adopt i...

sseveran · on Aug 8, 2019

AWS has already rolled out EPYC instances.

https://aws.amazon.com/ec2/amd/

danarmak · on Aug 8, 2019

Those are the previous generation, Zen 1 EPYC CPUs, whichn were rolled out on AWS back in November.

bluedino · on Aug 8, 2019

Don't forget power consumption. Electricity costs are probably just as big as a factor as Cperformance when it comes to the number of computer Google has.

IBM quoted them as willing to switch to POWER if they could save 10% in energy costs

jrockway · on Aug 8, 2019

AWS has AMD machines. Amazon claims they perform worse than the Intel machines they use, and they are priced lower as a result.

philliphaydon · on Aug 9, 2019

T3 are intel, newer than the T2, and perform worse than the T2. T3a are AMD and perform on par/slightly better than T3 for less cost. (From my own testing. Not a claim I can backup this is just my observation)

corint · on Aug 8, 2019

I thought they were cheaper due to the better energy efficiency. Less electricity means less cooling required, double whammy!

bradstewart · on Aug 9, 2019

Where did you hear that? We're running a handful of m5a instances with fantastic performance. I figured they were priced lower because they're cheaper to purchase and operate.

jimmaswell · on Aug 9, 2019

Depends on the use case. Intel is still top for many games and some apps like Excel/Photoshop.

stcredzero · on Aug 8, 2019

"roll"

cjhanks · on Aug 8, 2019

Very few developers are prepared to write code that can efficiently use 256 threads / machine. At that level, cache coherency becomes a real and non-trivial problem.

In most cases, I suspect developers will see improved wall-clock times with substantively worse FLOPS/watt. Good for developers, bad for data-centers.

mrb · on Aug 8, 2019

«Very few developers are prepared to write code that can efficiently use 256 threads / machine»

This junk justification has no longer been relevant for years. Most developers don't care because (1) they rely on core applications that are already multi-threaded (web servers, SQL engines, transcoding, etc), or (2) in today's age of containers, VMs, etc, it doesn't matter to them. Now we scale by adding more containers and VMs per physical machine. Bottom line, data centers always need more cores/threads per machine.

cjhanks · on Aug 8, 2019

Correct, if you partition a 256 core machine into 32 virtual 8-core machines partitioned by their NUMA architecture - you are relatively unaffected by core count (minus the consequence of some scheduling algorithms not tuned for N > 8).

Unsure what the percentage of VM's that use no time sharing or oversubscription is though.

Nursie · on Aug 8, 2019

Most devs I know are creating async workloads which don't require cache coherency, as they use parallelism to parallel process separate requests and workloads. I can see things being pretty linear in that sort of space.

cjhanks · on Aug 8, 2019

They are not linear unless all requests take an identical amount of time OR the system is not oversubscribed (common in many workloads) - and even then, the current linux CFS scheduler has a complexity of `O(log N)`.

When you have variable length requests, you will find cores will not always be balanced, it is simply a statistical reality. And in those cases, the kernel will have to migrate your process to a different core, and if you have 256 cores, that core might be really far away.

gmueckl · on Aug 8, 2019

Except that they are typically not. The Zen architectures are NUMA and controlling where memory is allocated is key to decent threaded performance. You may even have to do seemingly counterintuitive things like duplicating central data structures across nodes and other tricks from the distributed systems playbook.

coder543 · on Aug 8, 2019

Epyc 2's memory layout is not like Epyc 1. Epyc 2 is very simple.

jdsully · on Aug 8, 2019

Yup everything is equally slow now. Kinda sad, but the original NUMA design was treated as a glass half empty situation instead of AMD letting people maximize performance. This change lets them avoid the bad press and everyone is happier despite the final design being slower than it could have been.

gmueckl · on Aug 8, 2019

Epyc 2 has different memory latencies within and across NUMA nodes according to the infirmation I have. So it is not equally slow for all memory. Can you point me to a source that says otherwise?

Edit: my source is this German article: https://www.heise.de/newsticker/meldung/AMD-Server-CPUs-Epyc...

jdsully · on Aug 8, 2019

See the architecture diagram here: https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen/2

Everything goes through the central crossbar on the I/O die, where Zen1 had memory attached directly to each CPU chiplet which would relay as necessary. On Zen1 if you accessed direct attached memory you wouldn't pay the latency penalty from relaying the data. In Zen2 all data is relayed via the I/O die with the associated delay that entails.

gmueckl · on Aug 8, 2019

I did some more digging. It seems like the Linux NUMA topology shown in the anandtech article is a deliberate lie. There are different latencies between cores and memory comtrollers on the same socket, but these are deemed to be insignificant enough to not expose them in the reported NUMA topology.

jdsully · on Aug 8, 2019

That is true with Intel chips as well. In the HFT space people actively work with Intel to determine which cores they should pin tasks to.

The speed of light is constant, and some cores will always be a little closer to various resources.

shaklee3 · on Aug 9, 2019

That was true before Skylake, but is no longer true since they moved away from the multi ring architecture.

jdsully · on Aug 9, 2019

Even with the mesh the number of hops is variable based on which core is requesting and the physical geometry of the chip. The cores right beside the IMC will have the lowest latency. See this diagram: https://en.wikichip.org/wiki/intel/mesh_interconnect_archite...

The main improvement is the max number of hops is log(n) instead N/2.

wmf · on Aug 8, 2019

Epyc 1 was NUMA within the socket while Epyc 2 is officially UMA within the socket (although not really). Unfortunately Epyc memory latency is much higher than Intel so it's fair to call it uniformly slow.

Jweb_Guru · on Aug 10, 2019

Yeah, I actually was not so happy with the benchmarks because the memory access latency is not all that good... for most of the workloads that I care about, I don't know that the Epyc will be faster than a Xeon.

fulafel · on Aug 8, 2019

I suspect cache coherency doesn't mean what you think it means. It's a hardware feature.

But yes, writing correct and performant highly parallel code is difficult & error prone, often prohibitively so.