
HPC Systems Special Offer: Two A64FX Nodes in a 2U for $40k - ZeljkoS
https://www.anandtech.com/show/15885/hpc-systems-special-offer-two-a64fx-nodes-in-a-2u-for-40k
======
FeepingCreature
An A64FX CPU has 48 cores. 1 CPU/node means 96 cores on 2 nodes for $40k and
~6 TFLOPS (3 per node). Meanwhile the 128-thread Threadripper 3990X costs
$3.6k, and benchmarks put it at 1.5 TFLOPS in Linpack, which is less but not
10x less. I don't know what the benchmark basis for the A64FX value is, so I
suspect the real gap is smaller, especially since the A64FX figure is quoted
as theoretical peak performance.

... Am I missing something?
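
A quick back-of-the-envelope version of that comparison, using only the figures quoted above. It is a rough sketch, not like-for-like: the A64FX number is a theoretical peak, the 3990X number is a measured Linpack result, and the 3990X price is for the CPU alone (no board, RAM, or chassis). The dictionary and function names are made up for illustration.

```python
# Figures as quoted in the comment above.
a64fx_box = {"price_usd": 40_000, "tflops": 6.0}  # two nodes, ~3 TFLOPS peak each
tr_3990x = {"price_usd": 3_600, "tflops": 1.5}    # CPU only, measured Linpack


def usd_per_tflop(system):
    """Price divided by the quoted TFLOPS figure."""
    return system["price_usd"] / system["tflops"]


a64fx_cost = usd_per_tflop(a64fx_box)
tr_cost = usd_per_tflop(tr_3990x)

print(f"A64FX 2U box:  ${a64fx_cost:,.0f} per TFLOP")
print(f"3990X (CPU):   ${tr_cost:,.0f} per TFLOP")
print(f"FLOPS ratio:   {a64fx_box['tflops'] / tr_3990x['tflops']:.1f}x")
print(f"Price ratio:   {a64fx_box['price_usd'] / tr_3990x['price_usd']:.1f}x")
```

On these numbers the box offers about 4x the FLOPS for about 11x the price, which is the gap the question is pointing at.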

~~~
dragontamer
A64FX has HBM2 RAM. This means it will have relatively little RAM (32 GB),
but extremely high-performance RAM. Literally the highest-bandwidth RAM on
the market, wired directly to the chip over an interposer (PCB is too slow!
Direct interposer connections only mm away from the cores).

Your implication is correct however. x86 is more in line with "typical"
consumers, and even businesses, who need this kind of compute.

Dell's C6525 quad-node dual-socket EPYC is a good example of a typical
compute-oriented build:
https://www.servethehome.com/dell-emc-poweredge-c6525-review-2u4n-amd-epyc-kilo-thread-server/

\-------------

A64FX is an HBM2 box. It's a normal CPU (not a GPU), with access to that
stupid-high bandwidth RAM. There are probably a few use cases where the high
bandwidth becomes a major advantage.

A64FX compares against the NVIDIA V100 on a memory-bandwidth and memory-
capacity basis (and is approaching GPU-level FLOPS thanks to SVE 512-bit SIMD
units). Except it's running the ARM instruction set. As others have pointed
out, this thing is like Xeon Phi 2.0, except with the notable distinction
that it powers the #1 supercomputer in the world right now.
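
For what it's worth, the "~3 TFLOPS per node" theoretical peak quoted upthread can be roughly reconstructed from the 512-bit SIMD width. This is a sketch under stated assumptions: the 48 cores and 2 GHz clock are the figures mentioned elsewhere in this thread, while two FMA pipelines per core and double-precision lanes are my assumptions, not something stated here.

```python
cores = 48                # per CPU, as quoted in the thread
clock_hz = 2.0e9          # 2 GHz, as quoted in the thread
simd_bits = 512           # SVE vector width on this chip
fma_pipes_per_core = 2    # ASSUMPTION: two 512-bit FMA pipelines per core
flops_per_fma = 2         # one fused multiply-add counts as two FLOPs

lanes_fp64 = simd_bits // 64  # double-precision lanes per vector

peak_flops = cores * fma_pipes_per_core * lanes_fp64 * flops_per_fma * clock_hz
peak_tflops = peak_flops / 1e12

print(f"Theoretical FP64 peak: {peak_tflops:.3f} TFLOPS per CPU")
```

A peak like this assumes every pipeline issues an FMA every cycle, which real workloads rarely sustain, so measured Linpack numbers on any chip land below it.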

~~~
brandmeyer
SVE is considerably richer than AVX-512, IMO. It's got better unaligned
load/store support (especially on A64FX) and richer instructions for
generating and manipulating masks.

For example, SVE has mask partitioning and speculative vector load
instructions to accelerate data-dependent loop termination. You can do a
vector-length-agnostic strncpy on SVE without too much effort.
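
To make the loop shape concrete, here is a plain-Python model of a vector-length-agnostic strncpy. It is not SVE code and does not use any SVE API: `vl` stands in for the hardware vector length in bytes, the slice stands in for a speculative (first-faulting style) load, and the `find` stands in for partitioning the mask at the first NUL. The function name and parameters are hypothetical.

```python
def vla_strncpy(src: bytes, n: int, vl: int) -> bytes:
    """Model of strncpy processed vl bytes at a time; result is the n-byte dst."""
    dst = bytearray(n)  # zero-filled, which doubles as strncpy's NUL padding
    i = 0
    while i < n:
        # Loop predicate: only lanes still inside the n-byte limit are active.
        active = min(vl, n - i)
        # Model of a speculative load: lanes past the end of src are not loaded.
        chunk = src[i:i + active]
        # Mask partition: keep only the lanes before the first NUL byte.
        nul = chunk.find(0)
        keep = len(chunk) if nul < 0 else nul
        dst[i:i + keep] = chunk[:keep]
        if nul >= 0 or len(chunk) < active:
            break  # terminator found or source exhausted; the rest stays zero
        i += active
    return bytes(dst)


# The same call gives the same answer for any vector length.
for vl in (1, 4, 16, 64):
    print(vl, vla_strncpy(b"hello\0world", 8, vl))
```

The point of the model is that nothing in the loop body depends on a particular `vl`, which is what lets one binary run on hardware with different vector widths.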

------
guillaumei
https://www.csm.ornl.gov/srt/conferences/Scala/2019/keynote_2.pdf
for an overall presentation of the A64FX and the "supercomputer" Fugaku.

------
ksec
The 2U is only available in Japan. They only ship whole racks internationally.

Maybe someone could give some hints as to why?

~~~
q3k
Probably they don't want to deal with the overhead of international sales,
shipping, logistics and support for just a $40k contract.

------
Uptrenda
What is up with the price? $40k for only 2 x 48 cores and a mediocre 2 GHz
clock rate. Modern processors are way more efficient than older processors but
I simply can't imagine the efficiency adds up to (simplistically) $416 per
core...

Very interesting memory bus though... 1 TB/s? That is cool, but I would
still much rather get a crap load more cores at a reasonable price than be
able to send around data that efficiently. Granted, I am definitely not the
target audience for this.
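
The per-core figure above, plus the same division done per unit of memory bandwidth, as a quick sketch. It assumes the 1 TB/s applies to each node separately, so 2 TB/s aggregate for the box; that reading is an assumption. Variable names are illustrative.

```python
price_usd = 40_000
nodes = 2
cores_per_node = 48
bandwidth_gb_per_s_per_node = 1_000  # "1 TB/s", ASSUMED to be per node

total_cores = nodes * cores_per_node
total_bandwidth_gb_per_s = nodes * bandwidth_gb_per_s_per_node

usd_per_core = price_usd / total_cores
usd_per_gb_per_s = price_usd / total_bandwidth_gb_per_s

print(f"${usd_per_core:.0f} per core")
print(f"${usd_per_gb_per_s:.0f} per GB/s of memory bandwidth")
```

Which of the two numbers matters depends on whether the workload is limited by compute or by memory traffic.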

~~~
nottorp
If you need that kind of memory bandwidth you'll probably know already, and
maybe this will even look cheap.

HBM not only has lotsa bandwidth(tm) but also much better latency?

~~~
formerly_proven
Latency in conventional DDRx memory is limited by the DRAM array itself, which
doesn't change with the interface (DDR, GDDR, HBM, ...). Essentially, with
regular DRAM, you cannot do better than the contemporary low-latency DDRx
memory. You might choose to increase latency to increase throughput, though.

~~~
microcolonel
> _Essentially, with regular DRAM, you cannot do better than the contemporary
> low-latency DDRx memory. You might choose to increase latency to increase
> throughput, though._

My understanding is that, given how memory systems work right now, typically
it's the opposite: increasing throughput decreases latency.
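
One way to relate the two quantities being argued about is Little's law: the data that must be in flight equals bandwidth times latency. A rough sketch follows. The 1 TB/s is the figure from upthread; the 100 ns latency and the 256-byte line size are illustrative assumptions, not measured A64FX values.

```python
def bytes_in_flight(bandwidth_bytes_per_s: float, latency_s: float) -> float:
    """Little's law: concurrency needed to sustain a bandwidth at a given latency."""
    return bandwidth_bytes_per_s * latency_s


bandwidth = 1e12     # 1 TB/s, as quoted in the thread
latency = 100e-9     # ASSUMPTION: 100 ns memory latency, for illustration only
line_bytes = 256     # ASSUMPTION: bytes fetched per outstanding request

needed = bytes_in_flight(bandwidth, latency)
outstanding_requests = needed / line_bytes

print(f"~{needed / 1e3:.0f} KB in flight, ~{outstanding_requests:.0f} outstanding requests")
```

Read the other way around, with a fixed number of outstanding requests, lower latency directly raises the bandwidth you can actually sustain, so the two are linked rather than independent.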

------
sjreese
This is a business workhorse; think web hosting, HPC scientific programming, or
any Bitcoin-mining-related business. You have a system that pays for itself,
and RHEL means it will run any Linux application. Adding AI ( What is
processing ) it is a real money maker in the USA. 2U means at home - I'll bet
the US FTC is already placing import restrictions on it as we speak (with
AT&T waiver). See also: SKYDRIVE
https://nerdist.com/article/japanese-flying-cars-nerdist-news/

