HPC Systems Special Offer: Two A64FX Nodes in a 2U for $40k (anandtech.com)
48 points by ZeljkoS on Aug 26, 2020 | 44 comments



An A64FX CPU has 48 cores. At 1 CPU per node, that's 96 cores across 2 nodes for $40k and ~6 TFLOPS (3 per node). Meanwhile the 64-core/128-thread Threadripper 3990X costs $3.6k, and benchmarks put it at 1.5 TFLOPS in Linpack, which is less, but not 10x less. I don't know what the benchmark basis for the A64FX figure is, so I suspect the gap is even smaller, especially since it says theoretical peak performance.

... Am I missing something?
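
(For what it's worth, the ~3 TFLOPS per node does pencil out as an FMA peak, assuming the base 2.0 GHz clock and the A64FX's two 512-bit SVE pipelines per core:

    48 cores x 2 pipes x 8 DP lanes x 2 flops (FMA) x 2.0 GHz ≈ 3.07 TFLOPS

so it is indeed a theoretical number, comparable to peak rather than to measured Linpack.)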


A64FX has HBM2 RAM. That means relatively low capacity (32 GB), but extremely high performance: literally the highest-bandwidth RAM on the market, wired directly to the chip over an interposer (PCB is too slow! Direct interposer connections only mm away from the cores).

Your implication is correct however. x86 is more in line with "typical" consumers, and even businesses, who need this kind of compute.

Dell's C6525 quad-node dual-socket EPYC is a good example of what a typical compute-oriented build looks like: https://www.servethehome.com/dell-emc-poweredge-c6525-review...

-------------

A64FX is an HBM2 box. It's a normal CPU (not a GPU) with access to that stupid-high-bandwidth RAM. There are probably a few use cases where the high bandwidth becomes a major advantage.

A64FX compares against the Nvidia V100 on a memory-bandwidth and memory-capacity basis (and is approaching GPU-level FLOPS thanks to its 512-bit SVE SIMD units). Except it's running the ARM instruction set. As others have pointed out, this thing is like Xeon Phi 2.0, except with the notable distinction that it powers the #1 supercomputer in the world right now.


SVE is considerably richer than AVX-512, IMO. It's got better unaligned load/store support (especially on A64FX) and richer instructions for generating and manipulating masks.

For example, SVE has mask partitioning and speculative vector load instructions to accelerate data-dependent loop termination. You can do a vector-length-agnostic strncpy on SVE without too much effort.
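
Roughly, with the ACLE intrinsics (a minimal sketch, assuming GCC/Clang with -march=armv8-a+sve; the function name is mine, and unlike ISO strncpy it doesn't zero-pad the tail):

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of a vector-length-agnostic bounded string copy: copies at
       most n bytes from src to dst, stopping after the first '\0'. */
    void sve_strncpy(char *dst, const char *src, size_t n) {
        const uint8_t *s = (const uint8_t *)src;
        uint8_t *d = (uint8_t *)dst;
        size_t i = 0;
        svsetffr();                                 /* clear first-fault register */
        while (i < n) {
            svbool_t pg = svwhilelt_b8_u64(i, n);   /* lanes still within n */
            svuint8_t v = svldff1_u8(pg, s + i);    /* speculative (first-faulting) load */
            svbool_t ok = svrdffr_z(pg);            /* lanes that actually loaded */
            svbool_t nul = svcmpeq_n_u8(ok, v, 0);  /* which loaded lanes are '\0'? */
            svbool_t keep = svbrka_b_z(ok, nul);    /* lanes up to and incl. first '\0' */
            svst1_u8(keep, d + i, v);
            if (svptest_any(ok, nul))
                return;                             /* terminator copied: done */
            i += svcntp_b8(svptrue_b8(), ok);       /* advance past the loaded lanes */
            if (!svptest_last(pg, ok))
                svsetffr();                         /* load was cut short: reset FFR, retry */
        }
    }

The first-faulting load plus the break/partition predicates are exactly the "data-dependent loop termination" machinery: you can read ahead of the terminator without risking a spurious fault.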


Quantum chemistry software benefits from 'stupid-high bandwidth RAM'. The article says Gaussian supports this hardware. I remember that when running VASP (another popular quantum chemistry package) we were not even using all the cores on a node, because of the limited performance of the RAM interface.

The basic operation they perform is a fast Fourier transform on large matrices (on the order of the number of electrons in the system, squared).
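
To make that concrete: a plane-wave code repeats transforms like the one below thousands of times per run, and it's nearly all memory traffic (a minimal FFTW sketch; the 256^3 grid size is a made-up example):

    #include <fftw3.h>

    /* One forward 3D FFT over a 256^3 complex grid (~268 MB).
       Compile with: cc fft.c -lfftw3 -lm */
    int main(void) {
        const int n = 256;
        fftw_complex *grid = fftw_alloc_complex((size_t)n * n * n);
        for (size_t i = 0; i < (size_t)n * n * n; i++)
            grid[i][0] = grid[i][1] = 0.0;          /* placeholder data */
        fftw_plan p = fftw_plan_dft_3d(n, n, n, grid, grid,
                                       FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_execute(p);   /* O(N log N) flops, but the whole grid streams
                              through memory repeatedly: bandwidth-bound */
        fftw_destroy_plan(p);
        fftw_free(grid);
        return 0;
    }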


HBM2 is totally tame in terms of signalling rate. I think it's like 2 GT/s, while GDDR6X is at 20 GT/s (both per pin). PCBs aren't too slow for HBM2, a silicon interposer is just the only practical way to route a memory interface using something like 7000 pins directly between wafers. If HBM2 chips were using regular packaging and pitches suitable for PCB use, each stack would have to be a huge package with thousands of balls.
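
The per-stack arithmetic backs this up:

    HBM2 stack:   1024 data pins x  2 GT/s = 256 GB/s
    4 stacks:    ~4096 data pins (plus clock/command/address) = 1 TB/s
    GDDR6X chip:    32 data pins x ~20 GT/s = ~80 GB/s

Thousands of slow-and-wide connections for HBM2, versus a few dozen very fast ones per GDDR6X chip.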


In concurrent/parallel scenarios the memory bandwidth is often the bottleneck. So 1 TB/s of memory bandwidth over 96 cores is about 10.5 GB/s each. Not great, but the 3990X has what, 100 GB/s over 64 cores, for about 1.5 GB/s each?
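
This is easy to measure with a STREAM-style triad; the per-thread share is just the measured rate divided by thread count (a minimal OpenMP sketch, array size mine, chosen to defeat the caches; compile with -O2 -fopenmp):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 27)   /* 128M doubles per array, ~1 GB each */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);

        /* parallel first-touch so pages spread across NUMA nodes */
        #pragma omp parallel for
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];   /* 2 loads + 1 store per element */
        double t = omp_get_wtime() - t0;

        double gbs = 3.0 * N * sizeof(double) / t / 1e9;
        printf("%.1f GB/s total, %.2f GB/s per thread\n",
               gbs, gbs / omp_get_max_threads());
        free(a); free(b); free(c);
        return 0;
    }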


You're comparing the price of a consumer CPU to an entire rackable unit. There's a whole bunch more that goes into making a good server than just stuffing a good CPU into a beige box.


then look at those 2U, 4-node dual-socket Supermicros and think again. 512 EPYC cores and 4 TB of memory + InfiniBand sell for USD 100k.

but you're probably right that no one buys 2 of these at that price. Probably only Japanese companies proud of their country or whatever.


Yeah, these computers can be specced with 32 GB of HBM also.


no they can't, but unless Fujitsu or you produce some benchmark to show me that 150GB of HBM with 250 cores beats the aforementioned Supermicro, I doubt that this is a cost-effective solution for general HPC (we have a KNL system here at the cluster; it isn't either, or I am too dumb to compile the software I use there...). Like KNL it's probably king for some specialized tasks, which is totally fine and is probably also reflected in the actual price if you buy some racks of them.

My personal opinion: this is some arbitrarily fixed "we are selling these" price, to show they are not IBM's "ask for a 6-figure quote". So they might very well be very interesting if you have a budget in the upper six figures (because the $40k won't be the final offer, for sure...); everyone else (like the lower half of the 6 figures) knows: nice, but not my price.


The #1 cluster on the top 500 list would like to have a chat.

Actually it looks like it's just the opposite of what you describe. Fujitsu built a CPU that's very good at running a wide variety of applications. Unlike x86-64 parts that need accelerators for good FP or memory bandwidth per node, the A64FX does not need CUDA applications. Plain old Fortran is just fine.

Keep in mind the $40k includes support for porting applications; it doesn't mean that if you want 1000 of them you'll have to pay $20k each.


yes, because running on the whole #1 cluster is just recompiling your non-trivial software. No, it's not (just take a look at QuantumEspresso, which is widely used and not too bad, but does not scale too well in its non-Fortran parts, even with QE experts testing). And as I said, this is a fine price, I just doubt that it's worth the attention of anyone who is not building an upper-6-figures-to-some-millions cluster. And unless you actually port your application with Fujitsu (which also incurs personnel cost) to take advantage of the arch, you will just fare better if you buy a bog-standard EPYC or even Skylake Refresh, which "just works the same way it did before"©


Not so sure; many of the impressive A64FX numbers were published with no source code changes. Even a large, complex code like WRF (a weather simulator) "just works", and it was faster than on a dual-socket Xeon 8168 system (each Xeon 8168 is $5,680 list).

For another F90 benchmark (Himeno) they got 4 times faster than a dual Xeon 8168, AND 1.1x faster than a Tesla V100 running the CUDA version of the same benchmark. IMO that's pretty amazing: no source code changes and you get BETTER than GPU performance without having to rewrite your code.

So sure, a standard EPYC or Skylake Refresh will run your code and "just work", but the A64FX opens up the possibility of significant performance upgrades with no source code changes. If your code isn't currently rewritten for CUDA, getting 10x the bandwidth and 4x the CPU performance would be really attractive.


I don't think anyone buys a 2-node HPC system for production. Note that you can fit 6 more nodes into this 2U system. 288 cores in 2U with InfiniBand, etc.

This is likely the development setup for someone that also buys a fully populated rack for production. So it's probably not meant to be a great $$/core story at this small scale.

Though, your point that the AMD EPYC is probably pressuring these ARM HPC setups is fair.


I think that a useful way of seeing this is to compare car engines:

From Wikipedia, a Ford Flex Limited can have a turbocharged 3.6L V6 (max torque 350 lb-ft), while a Formula One engine is a turbocharged V6 of only 1.6L (max torque 214 lb-ft).

The Formula One engine has lower specs, and costs orders of magnitude more.

I think that the answer is that this comparison doesn't make much sense, when the use cases and the target are so different.

Edit: it is funny to see all the replies to my comment mirroring the replies to the parent comment: there is a lot more to the comparison than just cherry-picking a few spec points.


The Formula One engine has about 3-4x the crankshaft power of that Ford V6. The F1 engine has higher specs all around except weight and torque; the latter is irrelevant since gears exist.


Keep in mind that the F1 engine can pump out 1000 HP almost instantly and is more than 50% thermally efficient (approximately double a road car).


Also the F1 engine is only designed to last 1300 miles. Not sure if it is unusable after that.

https://f1chronicle.com/how-long-do-f1-engines-last/


They are just aarch64 cores like in your phone, not a "formula 1 engine". I think it's a really tough sell.

For around the same money you can buy 512 cores of modern AMD x86_64 (4 x 128-core, ~$5,000 ea) AND 4 x RTX 8000 GPUs, 18,000 CUDA compute cores (also ~$5,000 ea)


They are the only cores you can buy right now with the SVE extensions, so not exactly equivalent to the phone cores.


It's actually Fujitsu's former SPARC core modified to take AArch64 instructions, or at least that's the story.


Support and software, the case, the motherboards, power supplies, memory, the fact that the Threadripper uses twice as much power, reliability testing: in a real supercomputer you'll be buying thousands of them, and dealing with failures gets expensive for everyone... and on and on.


It has 1 TB/s memory bandwidth via HBM2, and 512-bit vector units. A better comparison would probably be to Xeon Phi (not that it's lighting the world on fire exactly).


Ah fair, I think the 3990X only gets 100 GB/s with quad-channel DDR4-3200, if I'm interpreting the Google results right. That could be a killer. And there's no AVX-512 support either (though it does have 256-bit AVX2).

Would be good to have some actual benchmarks to compare.
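
Absent published numbers, a quick-and-dirty flops check anyone can run on these boxes is timing a big DGEMM through whatever BLAS the vendor ships (a sketch; compile with e.g. -fopenmp for the timer and link -lopenblas, or MKL, or Fujitsu's SSL2):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cblas.h>
    #include <omp.h>

    /* Time one large double-precision matrix multiply; TFLOPS = 2*n^3 / t.
       The linked BLAS library does all the work. */
    int main(void) {
        const int n = 8192;
        double *A = malloc((size_t)n * n * sizeof *A);
        double *B = malloc((size_t)n * n * sizeof *B);
        double *C = malloc((size_t)n * n * sizeof *C);
        for (size_t i = 0; i < (size_t)n * n; i++) {
            A[i] = 1.0; B[i] = 0.5; C[i] = 0.0;
        }

        double t0 = omp_get_wtime();
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        double t = omp_get_wtime() - t0;
        printf("%.2f TFLOPS\n", 2.0 * n * n * n / t / 1e12);
        free(A); free(B); free(C);
        return 0;
    }

DGEMM flatters everything (it's compute-bound, like Linpack); a bandwidth-bound kernel would separate these machines much more.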


I'd also not focus too much on single-node value for money; these machines come with InfiniBand and are likely intended as evaluation units for people interested in building a large cluster.

Fugaku cost $1 billion, with 159k nodes. That's only ~$6k per node, and that's if you assume the entire budget was allocated to the compute nodes, which it wouldn't nearly be. Wikipedia lists it as total program cost; so in addition to the facility, cooling, and other equipment such as networking, it may well include years' worth of operating expenses, staff, etc.

I believe the best way to view the price tag is: low enough that it's no barrier to anyone that's interested in building even a moderate cluster, but high enough to keep away anyone that's just toying around.

That way, if somebody buys one of these machines and calls you up to ask a question, you're not wasting your time by treating it as a step in a longer sales process (meaning you let them talk to engineers, respond in depth, etc.).


> ... Am I missing something?

Memory BW? Power consumption? What's the BW and power consumption of 2 A64FX CPUs, and what's the BW and power consumption of the 4x Threadripper 3990X it takes to also deliver 6 TFLOP/s? You might need to add the power for cooling to the comparison as well if you plan to stock a full cluster with these.

Also, if your main metric is TFLOPS, you might want to add a single A100 to the comparison... it probably crushes both the Threadripper and the A64FX from the POV of TFLOPS, memory BW, and power consumption, but maybe not price.


The Fugaku supercomputer had a total cost of $1B and 160k CPUs, so the cost was less than $6,500 per chip, or $3,600 per 1.5 TFLOPS.


Yes, $6,250 per node, but that cost includes storage, compute nodes, racks, IB switches, GigE switches, consoles, PDUs, etc. At least 1/3rd of that $1B went to storage/network/infrastructure, and likely much more.

Also note that the #1 on the Top500 list is using the A64FX and scales with 80% efficiency. The #8 on the list is pure Intel (no accelerators), is 14 times smaller, and only scales with 60% efficiency. That's a pretty impressive feat, since good scaling gets harder as the cluster increases in size.


if you have to ask it's not for you. this is for people who need a PoC or reference for a larger HPC system. the pricing is largely irrelevant.

as to the system itself, most of the value in this processor is its memory bandwidth, derived from HBM integration. no other cpu system comes close.


Well, the A64FX "deal" is for those interested in building HPC clusters out of them. Such people are interested in useful work per total cost of ownership.

Your 3990X system has dramatically less memory bandwidth, with 4 64-bit memory channels good for a peak bandwidth of around 100 GB/s, which is 10% of the A64FX.
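
The arithmetic, for anyone checking:

    4 channels x 8 bytes x 3.2 GT/s = 102.4 GB/s peak

versus the A64FX's 1 TB/s from four HBM2 stacks at 256 GB/s each.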

Your $3,500 price likely doesn't include an InfiniBand card, or a port on an IB leaf switch, or a port on an IB spine switch, and you are getting quite a bit less memory bandwidth and floating point than the A64FX system. As a result you end up buying more Threadripper nodes, paying for more power, more rack space, more cooling, more IB spine switches, more IB cables, more IB leaf switches, and potentially even a larger building... just to hit the same performance.

Generally, to get close to this memory bandwidth or these flops with an x86-64 system you end up using an accelerator, say the Nvidia A100. Problem is that Nvidia GPUs can only run CUDA-aware programs, which rules out a significant chunk of HPC workloads.

The attraction of the A64FX is that it's efficient, has impressive flops per watt (which means cheaper cooling, racks, and buildings), has impressive memory bandwidth, and will run any Python, Perl, Fortran, C, C++, Go, or Java code you throw at it.

For large clusters this isn't just a nice-to-have, it's a huge driver of real-world performance per price. So much so that the top 4 supercomputers in the world are not x86-64. Even #5 uses an Intel Xeon only to run the OS and do I/O, with an accelerator (Matrix-2000) doing the heavy lifting. #6 and #7 are similar, but use Nvidia cards. #8 is the first pure x86-64 cluster, and it is pretty small in comparison, 18 times smaller than #1.

Being 18 times smaller makes it MUCH easier for #8 (Frontera) to scale codes across the entire cluster. Despite that huge advantage, Frontera scaled Linpack to the entire cluster with 60% efficiency. The #1 Fugaku, using the A64FX, scaled at 80% efficiency; that's a pretty large real-world advantage.

Keep in mind that Linpack is a pretty limited benchmark: heavily optimized, and not particularly memory- or network-intensive. Many real-world codes will exercise the system harder and would likely show a larger performance differential than Linpack.

Imagine you have 20,000 watts a rack and 100 racks. Sure, you could use cheap nodes, but your total performance would be less... and that's before you throw in problems like scaling to 100 racks at 60% efficiency instead of 80%. And when asking for your budget, it will be that much harder if you only perform well on CUDA codes. Not that there's no room for cheaper-node clusters out there, but non-x86-64 systems do seem to have a significant advantage in larger clusters.

Also, general HPC devel/test systems often come with substantial support to enable users to tune their codes for a new platform. I wouldn't assume that if you want 100 racks of them you'd pay $20k each; in fact, the hardware you'd likely use isn't even the same hardware. The A64FX nodes used in clusters like Fugaku are higher density, depend on water cooling, have more cores, and have a much better interconnect.


Enterprise shills back in full force


https://www.csm.ornl.gov/srt/conferences/Scala/2019/keynote_... for an overall presentation of the A64FX and the "supercomputer" Fugaku.


The 2U is only available in Japan. They only ship whole racks internationally.

Maybe someone could give some hints as to why?


Probably they don't want to deal with the overhead of international sales, shipping, logistics and support for just a $40k contract.


What is up with the price? $40k for only 2 x 48 cores and a mediocre 2 GHz clock rate. Modern processors are way more efficient than older processors, but I simply can't imagine the efficiency adds up to (simplistically) $416 per core...

Very interesting memory bus though... 1 TB/s? That is cool, but I would still much rather get a crapload more cores at a reasonable price than be able to send data around that efficiently. Granted, I am definitely not the target audience for this.


See, I had the opposite opinion. I work with HPC all the time (computational biology) and have built (small) clusters. When I saw the price I thought: yep, that makes sense for a novel (and core-heavy) HPC node. It's on the high end, but not an outrageous price. Normally with something so new, you never actually see a price (if you have to ask...).

Especially when you consider that this is a new HPC architecture, that’s not an unreasonable price. As others have mentioned, this is a SKU you’re buying to evaluate a significantly larger purchase.

There isn't exactly a lot of pricing pressure on these processors. Top-end HPC always has the biggest margins, because this is a market where absolute speed tends to trump all other factors, including price. A better price comparison might be an IBM POWER HPC node, another fairly novel architecture built for raw speed and HPC.

Plus, it is entirely possible that the cost difference of the node (compared to x86) is completely covered by the power savings. HPC is very power-hungry, so a significant savings there could make up for an increased node cost.


This is a scaled down version of a single unit that’s designed to go into supercomputers by the thousand, where the cost of the whole system is hundreds of millions of dollars, consuming megawatts of power.

You would probably buy one of these to facilitate testing and development of codes to run on a big one.

No doubt the memory vs cpu balance is correct for what these computers are doing.


If you need that kind of memory bandwidth you'll probably know already, and maybe this will even look cheap.

HBM not only has lotsa bandwidth(tm) but also much better latency?


Latency in conventional DDRx memory is limited by the DRAM array itself, which doesn't change with the interface (DDR, GDDR, HBM, ...). Essentially, with regular DRAM, you cannot do better than contemporary low-latency DDRx memory. You might choose to increase latency to increase throughput, though.
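
A quick sanity check with typical JEDEC timings shows why (representative parts, not a specific module):

    DDR2-800, CL5:    5 cycles x 2.5 ns   = 12.5 ns to first data
    DDR4-3200, CL22: 22 cycles x 0.625 ns = 13.75 ns to first data

Four times the transfer rate and two generations later, the absolute CAS latency is essentially flat.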


> Essentially, with regular DRAM, you cannot do better than contemporary low-latency DDRx memory. You might choose to increase latency to increase throughput, though.

My understanding is that, given how memory systems work right now, it's typically the opposite: increasing throughput decreases latency.


it's almost as if the price of the system reflects more than core count.

in any case, anyone buying one of these is doing so as a poc for a larger procurement.

they're not for tourists.


What’s the SHA256 hash rate and power consumption? Only partially joking btw!


Nobody has mined Bitcoin on CPUs since about 2012. And there are far more reasons to want access to reasonably priced computation than just crypto-asset mining (like research, or what the Internet was originally about). But I suppose dunking on an entire industry is just what the cool kids do now to seem in the know?

Doesn't matter that none of these critics ever address a single use-case beyond what little info they can glean from memes. It's particularly obnoxious. But whatever. People who use this tech are all idiots, right, and couldn't possibly know how to add up a balance sheet. Lol, it use lotssa powerr. BRRRRRrrrrrrrr. soo stupid of them to waste that much power for absolutely no reason.

Only partially joking btw!


This is a business workhorse; think web hosting, HPC scientific programming, OR any Bitcoin mining related business. You have a system that pays for itself, and RHEL means it will run any Linux application. Adding AI ( What is processing ) it is a real money maker in the USA. 2U means at home - I'll bet the US FTC is already placing import restrictions on it as we speak, (with AT&T waiver) See Also: SKYDRIVE https://nerdist.com/article/japanese-flying-cars-nerdist-new...



