
Chinese Chipmaker Unveils 64-Core ARM Processor - jcbeard
https://www.top500.org/news/chinese-chipmaker-unveils-speedy-64-core-arm-processor/
======
nl
_That subsequently prompted some speculation that the chip would replace the
now-banned Xeon Phi processors in the 100-petaflop upgrade to the Tianhe-2
supercomputer, which was supposed to be revealed in June at ISC 2016. Given
that the latter never happened, it’s likely those FT-2000 /64 chips were not
deployed, or if they were, did not meet expectations._

Hmm..

 _Process：Manufacturing with 28nm process_

And there you go.

Not sure if people are aware, but China doesn't have any sub-28nm (or 22nm?)
fabs. TSMC has some of course, but the Taiwanese government is very, very
strong on keeping those plants in Taiwan (although they are fine with TSMC etc
building less advanced fabs in China).

Until they get a process shrink on this, I think.. well, I'd like to see some
independent benchmarks.

Edit: The Samsung Exynos 5433 (4 cores on a 20nm process[1]) maxes out at 3.78
GFLOPS[2] on selected benchmarks. Call it 1 GFLOP/core - I find it unlikely
that this thing is going to get 500 GFLOPS with 64 cores, even with a higher
power consumption.

[1] [http://www.anandtech.com/show/8718/the-samsung-galaxy-note-4...](http://www.anandtech.com/show/8718/the-samsung-galaxy-note-4-exynos-review)

[2] [http://www.anandtech.com/show/8718/the-samsung-galaxy-note-4...](http://www.anandtech.com/show/8718/the-samsung-galaxy-note-4-exynos-review/4)

~~~
mrb
Here is how this Chinese processor claims 512 (single precision) GFLOPS: each
core seems to have 128-bit NEON vector instructions, so it can execute 4
32-bit operations per cycle. The core runs at 2 GHz, so that's 8 GFLOPS per
core. Times 64 cores = 512 GFLOPS. So yeah their claims check out.
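
The arithmetic above can be written out as a small sketch. The function name is made up for illustration, and the 4 operations per cycle follow the counting used in this comment (one 32-bit operation per NEON lane per cycle), not a vendor datasheet.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFLOPS: cores x clock (GHz) x FLOPs per core per cycle."""
    return cores * clock_ghz * flops_per_cycle


# A 128-bit vector register holds four 32-bit lanes. Counting one operation
# per lane per cycle (the assumption in the comment) gives 4 FLOPs per cycle.
NEON_LANES_SP = 128 // 32

per_core = peak_gflops(cores=1, clock_ghz=2.0, flops_per_cycle=NEON_LANES_SP)
whole_chip = peak_gflops(cores=64, clock_ghz=2.0, flops_per_cycle=NEON_LANES_SP)

print(f"per core:   {per_core} GFLOPS")
print(f"whole chip: {whole_chip} GFLOPS")
```

This is a ceiling, not a measurement: it assumes every lane of every core does useful work on every cycle.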

These benchmark numbers you found for the Exynos 5433 are probably so low
because "Geekbench 3" is not very good at reaching the maximum theoretical
performance, as it runs a bunch of real-world computing tasks. In general you
need to hand-code assembly loops of fused-multiply-add instructions to reach
the max theoretical perf, and clearly this is not what Geekbench does. Look at
the list of FP workloads it runs:
[http://support.primatelabs.com/kb/geekbench/geekbench-3-benc...](http://support.primatelabs.com/kb/geekbench/geekbench-3-benchmarks#floating-point-workloads)

~~~
CyberDildonics
Not only that but for most workloads you would need lots of memory bandwidth
to supply the cores.
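
A rough way to see this is a roofline-style estimate: attainable throughput is the smaller of the compute peak and memory bandwidth times arithmetic intensity. In the sketch below the 100 GB/s bandwidth figure is purely hypothetical (the thread gives no bandwidth number), and the function name is invented for illustration.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float,
                      flops_per_byte: float) -> float:
    """Roofline estimate: limited by either compute peak or memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


PEAK_GFLOPS = 512.0      # the claimed single precision peak from the thread
BANDWIDTH_GB_S = 100.0   # HYPOTHETICAL memory bandwidth, not from the article

# SAXPY-like kernel (y = a*x + y) in single precision:
# 2 FLOPs per element, 12 bytes moved (read x, read y, write y).
saxpy_intensity = 2 / 12

saxpy_limit = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, saxpy_intensity)
# Intensity needed before the compute peak becomes the limit.
break_even_intensity = PEAK_GFLOPS / BANDWIDTH_GB_S

print(f"SAXPY-like kernel: ~{saxpy_limit:.1f} GFLOPS attainable")
print(f"Need >= {break_even_intensity:.2f} FLOPs/byte to hit the peak")
```

Under that assumed bandwidth, a streaming kernel would sit far below the headline number no matter how well the inner loop is tuned.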

------
z2
It seems this was announced exactly one year ago, not today!

Edit: Though last year's announcement was just the planned specs, perhaps the
unveiling here is the prototype itself and not the concept.

[Chinese links]
[http://www.ltaaa.com/bbs/forum.php?mod=viewthread&tid=364899](http://www.ltaaa.com/bbs/forum.php?mod=viewthread&tid=364899)
[http://bbs.kafan.cn/thread-1849400-1-1.html](http://bbs.kafan.cn/thread-1849400-1-1.html)

------
AstroJetson

       "FCBGA package with 2892 pins"
    

Yikes! That's pretty dense on the other side, would love to see a picture.

It's 100 watts, so at 3.3 volts it's drawing 30+ amps, so I'm guessing that a
good chunk of those pins are power and ground. Is that a good theory?
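
The estimate is just I = P / V. A minimal sketch follows; the 3.3 V figure is the guess from this comment, and the 1.0 V core rail is an additional assumption (not from the article) included only to show how the current scales with a lower supply voltage.

```python
def supply_current_amps(power_watts: float, voltage_volts: float) -> float:
    """Current drawn at a given power and supply voltage (I = P / V)."""
    return power_watts / voltage_volts


POWER_W = 100.0

at_3v3 = supply_current_amps(POWER_W, 3.3)   # voltage guessed in the comment
at_1v0 = supply_current_amps(POWER_W, 1.0)   # ASSUMED core voltage, illustrative

print(f"at 3.3 V: {at_3v3:.1f} A")
print(f"at 1.0 V: {at_1v0:.1f} A")
```

The lower the actual core voltage, the higher the current, which would only strengthen the case for many power and ground pins.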

~~~
aembleton
Here's a picture of it:
[http://www.phytium.com.cn/Public/Home/images/ban2000.png](http://www.phytium.com.cn/Public/Home/images/ban2000.png)

~~~
aluhut
Since the page is already on the way down here is a rehost:
[http://i.imgur.com/beAY5De.png](http://i.imgur.com/beAY5De.png)

~~~
AstroJetson
Thanks to the two of you, it's even more amazing than I thought.

------
nxjfxgkkt
At 500 gigaflops, I would buy these in a heartbeat if they were offered for
sale to the public. While that's only half the performance of a Xeon Phi, this
is a traditional CPU which makes programming it much more straightforward. Not
to mention that the cost would be significantly less.

~~~
mrb
512 single precision GFLOPS is not half, but 1/10th the performance of a Xeon
Phi (Xeon Phi 7210 is 5325 single precision GFLOPS).

However, perhaps this processor beats the Xeon Phi 7210 in perf/price, as the
latter is horrendously priced at $2438 (list price), as well as in perf/watt.

~~~
snaky
Do you think this 64-core beast would be priced at less than $243?

~~~
mrb
Probably not, but _maybe_. This Chinese processor is hardly a beast. It is
comparable (±20%) in raw GFLOPS to a ~$200 Intel Skylake processor: 4-core
3.3GHz i7-6700K is rated 422 single precision GFLOPS.
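
For anyone who wants to check the ratios in this subthread, here is a quick sketch using only the figures quoted above (512, 5325 and 422 GFLOPS, $2438 list price). The variable names are made up, and the break-even price is simply the price at which the 64-core chip would match the Xeon Phi 7210 in GFLOPS per dollar.

```python
FT2000_GFLOPS = 512.0     # claimed peak, single precision
PHI_7210_GFLOPS = 5325.0  # figure quoted in this thread
PHI_7210_PRICE = 2438.0   # list price quoted in this thread, USD
SKYLAKE_GFLOPS = 422.0    # figure quoted in this thread

# Fraction of the Xeon Phi's peak that the 64-core chip reaches.
ratio_vs_phi = FT2000_GFLOPS / PHI_7210_GFLOPS

# How far apart the 64-core chip and the quoted Skylake part are.
ratio_vs_skylake = FT2000_GFLOPS / SKYLAKE_GFLOPS

# Price at which the 64-core chip ties the Xeon Phi on GFLOPS per dollar.
break_even_price = FT2000_GFLOPS * PHI_7210_PRICE / PHI_7210_GFLOPS

print(f"vs Xeon Phi 7210: {ratio_vs_phi:.3f}x")
print(f"vs Skylake part:  {ratio_vs_skylake:.2f}x")
print(f"break-even price: ${break_even_price:.0f}")
```

All of these are peak-rate comparisons; none of them says anything about sustained performance.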

------
cordite
I wonder how Erlang would work on this, especially in a multi-coprocessor
system.

~~~
weatherlight
I was just thinking the same thing! I really hope as more cores are available
on the same machine we start to see languages like Erlang/Elixir really take
off.

~~~
mafribe
A crucial question for fast execution of the BEAM-machine on multiple cores is
how expensive message copying between cores is. With that many cores, there's
probably no shared memory between all cores. How do the remaining cores
communicate? Can the BEAM machine's optimisations take into account
communication cost?

~~~
yellowapple
I don't know very much about BEAM's internals, but one of the key reasons why
Erlang (and therefore Elixir) is able to spin up so many processes is because
the processes don't have shared state, or at least they don't have _writable_
shared state; the underlying data structures are all immutable, and this is
enforced on a VM level (which is why Elixir's "mutability" is only in terms of
which data a given variable holds rather than the data itself).

Given that, my guess would be that
Erlang/Elixir/LFE/$INSERT_OTHER_BEAM_BASED_LANGUAGE_HERE would do pretty darn
well with a high number of cores. I hope one of these days I'll be able to
afford such a machine (whether with this particular ARM processor or something
else with a ridiculous number of cores/threads, like a modern POWER or SPARC
CPU) or have access to one so that I can experience for myself exactly _how_
darn well :)

It's also worth noting that Erlang has had a lot of design around clusters of
independent nodes, which means even more extreme problems when it comes to
data copying between nodes. I reckon intercore message copying is
significantly more performant than internode message copying.
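
To illustrate the share-nothing idea in general terms, here is a toy Python analogy. It is not how BEAM is implemented: the `Actor` class, its mailbox and the deep copy on send are all invented for this sketch, which only shows that a receiver working on its own copy cannot disturb the sender's data.

```python
import copy
import queue
import threading

_STOP = object()  # sentinel telling the actor thread to exit


class Actor:
    """Toy actor: a private mailbox plus a thread that drains it."""

    def __init__(self) -> None:
        self.mailbox: queue.Queue = queue.Queue()
        self.received: list = []
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message) -> None:
        # Copy on send, so sender and receiver never share mutable state.
        self.mailbox.put(copy.deepcopy(message))

    def stop(self) -> None:
        self.mailbox.put(_STOP)
        self._thread.join()

    def _run(self) -> None:
        while True:
            message = self.mailbox.get()
            if message is _STOP:
                break
            message.append("seen")  # mutates only the receiver's copy
            self.received.append(message)


original = [1, 2, 3]
actor = Actor()
actor.send(original)
actor.stop()

print(original)        # the sender's list is untouched
print(actor.received)  # the receiver's copy carries the mutation
```

The price of that isolation is the copy itself, which is the cost the question about intercore message passing is getting at.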

~~~
mafribe
Do you know what kinds of optimisations Erlang/BEAM does for clusters of
independent nodes? Does the user have to tell Erlang/BEAM which thread lives
on what node, or does the scheduler handle this automatically?

~~~
yellowapple
Internode process spawning happens explicitly, last I checked. It's pretty
easy once both nodes are communicating, but not automatic.

In contrast, BEAM's SMP support _is_ automatic AFAICT; the scheduler will
happily distribute processes across as many cores as it can access (1 BEAM
thread per hardware thread by default, so a quad-core CPU with one thread per
core and a dual-core CPU with two threads per core will both be loaded with
four BEAM threads unless BEAM is configured to do something else).
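
The default described above boils down to one scheduler thread per hardware thread. A trivial sketch, with a made-up function name; the one-thread-per-core figure for the 64-core chip is an assumption, since the thread does not say whether it supports SMT.

```python
def default_scheduler_threads(cores: int, threads_per_core: int) -> int:
    """One scheduler thread per hardware thread, per the description above."""
    return cores * threads_per_core


quad_core_no_smt = default_scheduler_threads(cores=4, threads_per_core=1)
dual_core_smt = default_scheduler_threads(cores=2, threads_per_core=2)
# ASSUMPTION: one hardware thread per core on the 64-core ARM chip.
ft2000_64 = default_scheduler_threads(cores=64, threads_per_core=1)

print(quad_core_no_smt, dual_core_smt, ft2000_64)
```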

------
jcbeard
Nice slide presentation here:
[http://insidehpc.com/2016/08/phytium-china-unveils-64-core-a...](http://insidehpc.com/2016/08/phytium-china-unveils-64-core-arm-hpc-processor/)

------
gourou
I guess that's what happens when you forbid Intel from selling to China

[http://wccftech.com/us-government-bans-intel-nvidia-amd-chip...](http://wccftech.com/us-government-bans-intel-nvidia-amd-chips-china/)

~~~
gscott
China has loads of money to throw around... they get what they want.

[http://www.extremetech.com/computing/227059-amd-announces-ne...](http://www.extremetech.com/computing/227059-amd-announces-new-293-million-joint-venture-to-build-servers-for-the-chinese-market)

------
yazr
This appears to be a globally cache-coherent chip.

Slides 12-16:
[http://insidehpc.com/2016/08/phytium-china-unveils-64-core-a...](http://insidehpc.com/2016/08/phytium-china-unveils-64-core-arm-hpc-processor/)

Each 8-core "panel" has its own L2 and DCU. The L3 is globally shared, with the
usual cache coherence protocol. 30ns latency for L3 hit.
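
To put the 30ns figure in core-clock terms, here is a one-line conversion. The 2 GHz clock is the figure mentioned elsewhere in this thread, not something taken from the slides, and the function name is invented.

```python
def latency_in_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Nanoseconds times GHz gives core clock cycles."""
    return latency_ns * clock_ghz


# 30 ns L3 hit latency at an ASSUMED 2 GHz core clock.
l3_hit_cycles = latency_in_cycles(latency_ns=30.0, clock_ghz=2.0)
print(f"L3 hit: about {l3_hit_cycles:.0f} core cycles")
```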

I am not an expert on cache performance, but this certainly seems like an
up-to-date design.

Can anyone compare this to Xeon and Phi ?!

------
geezerjay
Sounds cool. Does anyone have any benchmarks to get an idea of what this
processor can actually do?

A price would also be nice.

~~~
petra
512Gflops. But it probably isn't supposed to lead (which requires 14nm), but to
use manufacturing fabs fully controlled by China.

~~~
geezerjay
The 512Gflops is only an uncorroborated blurb posted in a press release. Thus,
it's highly doubtful that the chip's real world performance comes anywhere near
that value.

It would be very interesting to see real-world benchmarks compiled
independently.

------
api
I fear for Intel, AMD, and the US semiconductor industry. This is not
competitive with Xeons _yet_ , but it will be.

On the other hand this is good for the future of computing in general. The
worst case scenario would be for things to stagnate with no competition,
offering no incentive for anyone to push beyond traditional Moore's law type
scaling.

