
96-core ARM supercomputer using the NanoPi-Fire3 - betolink
https://climbers.net/sbc/nanopi-fire3-arm-supercomputer/
======
dragontamer
Doesn't seem practical. It might be useful as a learning framework for MPI /
supercomputer programming... but it wouldn't be a tool that I'd use
personally.

A practical baseline for anyone interested in ARM compute would be the
ThunderX CPU (cloud rental: [https://www.packet.com/cloud/servers/c1-large-arm/](https://www.packet.com/cloud/servers/c1-large-arm/)). 48 cores per
socket, 2x for 96-core servers.

As another commenter said: the primary use of this NanoPi is the ability to
emulate a "real" supercomputer and really use MPI and such. MPI is a
different programming model from a massive single node (like a 96-core ThunderX ARM), and
you need to practice a bit with it to become proficient.

~~~
marmaduke
> emulate a "real" super-computer and really use MPI and such

Wouldn’t containers be an easier way to do that?

~~~
dragontamer
No. Containers on a single node are too fast.

Supercomputers have high-latency communications through thick pipes. True,
supercomputers have 40 Gbit or 100 Gbit connections between nodes, but it can
take multiple microseconds to send messages around.

A bunch of containers all sitting on the same box would be able to handle
communications within dozens of nanoseconds. So it's a bad "emulator" for
supercomputers.

Coordinating all your nodes to compute a problem, while achieving high
utilization, is tricky. It's not like programming a normal computer, where
threads share RAM and can communicate in nanoseconds.
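
To make that concrete, here is a minimal sketch of injecting an artificial interconnect latency into an otherwise same-box message round-trip. This is plain Python with an assumed 50 µs one-way delay; the number and all names are illustrative, not from the article:

```python
import time
from multiprocessing import Pipe, Process

LINK_LATENCY_S = 50e-6  # assumed ~50 us one-way delay, gigabit-Ethernet ballpark

def worker(conn):
    """Echo doubled messages back, sleeping to emulate the interconnect."""
    while True:
        msg = conn.recv()
        if msg is None:              # sentinel: shut down
            conn.close()
            return
        time.sleep(LINK_LATENCY_S)   # "the wire" on the reply path
        conn.send(msg * 2)

parent_end, child_end = Pipe()
p = Process(target=worker, args=(child_end,))
p.start()

t0 = time.perf_counter()
parent_end.send(21)
result = parent_end.recv()           # blocks until the "remote" node answers
rtt = time.perf_counter() - t0

parent_end.send(None)
p.join()
print(result, round(rtt * 1e6), "us round-trip")
```

Without the sleep, the same round-trip over a local pipe completes orders of magnitude faster, which is exactly why an un-throttled local setup trains the wrong instincts.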

~~~
skissane
> No. Containers on a single node are too fast.

You can add artificial delay to your local container network to better
simulate a production environment.

For example using [https://github.com/alexei-led/pumba](https://github.com/alexei-led/pumba) and "pumba netem delay" you
can add networking delay between Docker containers, and "pumba netem rate" can
limit the bandwidth between them as well.

("pumba" just uses the underlying Linux traffic-control machinery, such as
the "tc" command from "iproute2", so you don't have to use "pumba"; you can
set this up manually, but a tool like "pumba" makes it a lot easier.)

~~~
dragontamer
I've used netem to emulate millisecond delays before, but I'm not sure it
has the granularity to emulate microsecond-level delays.

Basically, netem is designed to provide milliseconds of delay, emulating a
worldwide network. Supercomputers are thousands of times faster than that. I'd
have to play with netem before I was certain it could handle the sub-10 µs
node-to-node delays that supercomputers have.

Considering that Linux task switching is on the order of ~10 ms or so, I have
severe doubts that µs-level delays will work with netem.
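
One cheap way to probe the granularity question from userspace is to measure how much a requested microsecond-scale sleep actually overshoots. This is only a sketch of the timer-resolution issue, not a test of netem itself; the trial count is arbitrary:

```python
import time

def measured_sleep(requested_s, trials=200):
    """Return the average actually-elapsed time for a requested sleep."""
    total = 0.0
    for _ in range(trials):
        t0 = time.perf_counter()
        time.sleep(requested_s)
        total += time.perf_counter() - t0
    return total / trials

# Compare requested vs. delivered delay at 10 us, 100 us, and 1 ms.
for requested in (10e-6, 100e-6, 1e-3):
    actual = measured_sleep(requested)
    print(f"requested {requested * 1e6:8.0f} us -> actual {actual * 1e6:8.1f} us")
```

If the delivered delays cluster far above the requested 10 µs, that is the scheduling floor the dragontamer comment is worried about.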

The NanoPi-Fire3 uses normal Gigabit Ethernet, which probably has latencies in
the ~50 µs range. That is slower than a real supercomputer, but
"proportionally" it should be representative of supercomputers (since the tiny
embedded ARM chips are around 50x slower than a real supercomputer node
anyway).

A bunch of Raspberry Pis on Gigabit Ethernet seems like a better overall "cheap
supercomputer" architecture for students of supercomputing, better than
containers or software-emulated network delays.

------
nine_k
60 GFlops on 96 cores is not that large.

OTOH if you want to see how your massively parallel algorithm behaves on a
96-node cluster / network, such a box is just $500, and is portable and can
work offline.

~~~
patrioticaction
The GFlops comparisons were more or less a lark, especially the ones
comparing energy efficiency with a supercomputer from the 90s. This 96-core
rig produces about 1 GFlops per watt; compare that to an i9-9900K (250 GFlops) with a Z390
chipset and one stick of DDR4 (95 W + 7 W + 2.5 W = 104.5 W), which does ~2.4 GFlops
per watt.*

* This is back-of-the-napkin; real-world results will vary.
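
Spelling out the napkin math in code (the 60 W draw for the rig is an assumption implied by the "1 GFlop per Watt" figure together with the 60 GFlops quoted elsewhere in the thread; the other numbers are from this comment):

```python
# Back-of-the-napkin efficiency comparison using the thread's figures.
rig_gflops = 60.0            # the article's 96-core NanoPi cluster
rig_watts = 60.0             # assumed, implied by "~1 GFlops per watt"

i9_gflops = 250.0            # i9-9900K figure quoted above
i9_watts = 95.0 + 7.0 + 2.5  # CPU TDP + Z390 chipset + one DDR4 stick

rig_eff = rig_gflops / rig_watts
i9_eff = i9_gflops / i9_watts
print(f"rig: {rig_eff:.2f} GFlops/W, i9-9900K: {i9_eff:.2f} GFlops/W")
```

Note the i9 figure assumes the 95 W TDP is actually honored under load, which is exactly what the rest of this subthread disputes.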

~~~
DuskStar
I think the i9-9900K has a real-world sustained power draw of around 170 W,
actually: [https://www.anandtech.com/show/13400/intel-9th-gen-core-i9-9...](https://www.anandtech.com/show/13400/intel-9th-gen-core-i9-9900k-i7-9700k-i5-9600k-review/21)

Add in all the ancillary hardware (motherboard, memory, hard drive, PSU
losses) and that efficiency number is going to take a nosedive.

~~~
magila
That page is embarrassingly wrong about how power management works in Intel
CPUs. By default Intel CPUs will not allow their rolling average power
consumption over a period of ~1 minute to exceed the specified TDP (95 W in
this case). Once the limit is reached the CPU reduces its frequency to bring
power consumption down. Intel optimizes their CPUs to achieve a good balance
between efficiency and performance when operating at the TDP.

What you see in Anandtech's review is the result of motherboard firmware
effectively disabling the power limit by setting it to a very high value. This
is a common practice among enthusiast motherboards in order to boost scores in
reviews. Unfortunately it also results in drastically lower power efficiency
and lots of clueless people, including many tech writers, complaining about
unrealistic TDP numbers.

~~~
DuskStar
> By default Intel CPUs will not allow their rolling average power consumption
> over a period of ~1 minute to exceed the specified TDP (95 W in this case).

From the page in question: "In this case, for the new 9th Generation Core
processors, Intel has set the PL2 value to 210W. This is essentially the power
required to hit the peak turbo on all cores, such as 4.7 GHz on the eight-core
Core i9-9900K. So users can completely forget the 95W TDP when it comes to
cooling. If a user wants those peak frequencies, it’s time to invest in
something capable and serious."

95 W is the power required to sustain the base clocks.

Also, calling AnandTech clueless... Are there any better hardware review
sites? I would consider them a tier-1 site, along with HardOCP and not a whole
lot else...

~~~
magila
Like I said, the AnandTech article has a lot of inaccurate information in it.
Unfortunately, the quality of tech journalism has taken a dive over the last
few years, as most of the good writers have been hired away by the very tech
companies they used to cover.

See this article on Gamers Nexus for a much better summary of the power
consumption situation for Intel CPUs:

[https://www.gamersnexus.net/guides/3389-intel-tdp-investigat...](https://www.gamersnexus.net/guides/3389-intel-tdp-investigation-9900k-violating-turbo-duration-z390)

~~~
DuskStar
AnandTech actually has a new article on Intel TDP limits out today:
[https://www.anandtech.com/show/13544/why-intel-processors-dr...](https://www.anandtech.com/show/13544/why-intel-processors-draw-more-power-than-expected-tdp-turbo)

I would find it hilarious if this conversation somehow prompted it.

Anyways, AnandTech's position seems to be:

We test at stock, out-of-the-box motherboard settings, except for memory
profiles. We do this for three reasons -

1. This is the experience almost all users will have.

2. This is what the benchmarks published by Intel reflect.

3. This is what damn near every other review site has done forever, and to do
otherwise would make results less useful.

So that's why their power-draw number was 170 W and not 95 W for the i9-9900K:
motherboard vendors take Intel's recommended settings and laugh. But so does
Intel for benchmarks.

------
sannee
Can these NanoPis boot over PXE? I was pleasantly surprised a few weeks ago by
the fact that the Raspberry Pi can do network boot without an SD card.

------
ElBarto
What's fascinating in that article is to see that a Raspberry Pi 3 has about
10% of the floating-point processing power of a Cray C90...

Cue the many forum questions: "I'm planning to use a Raspberry Pi to control a
<simple-ish device>. Will it be powerful enough?"

~~~
adrianN
The real question is "Will it be powerful enough even though I use a desktop
operating system and a software stack designed for programmer comfort rather
than efficiency to control <simple-ish device>?"

------
qwerty456127
> The NanoPi Fire3 is a high performance ARM Board developed by FriendlyElec
> for Hobbyists, Makers and Hackers for IOT projects. It features Samsung's
> Cortex-A53 Octa Core S5P6818@1.4GHz SoC and 1GB 32bit DDR3 RAM

Who needs such a powerful CPU with so little RAM? The reason I still haven't
bought any Pi is that all of them have 2 GiB of RAM or less, and I don't feel
interested in buying anything with less than 4.

~~~
giancarlostoro
You'd want to look into ARM64 boards like the ROCKPro64:

[https://www.pine64.org/?page_id=61454](https://www.pine64.org/?page_id=61454)

There are pricier options (> $100) with x86 architecture, like the UDOO
boards, if you really want an SBC with much more RAM.

------
sheepybloke
I've been trying to do something similar with 4 Orange Pi Zero Plus boards
(this blog was one of my main inspirations). While I know it's not practical,
it's fun to design the case and the stand, figure out how everything needs to
connect, and route it all together. In the end I hope to host a distributed
personal website and an MQTT server on it for any IoT tinkering I'd want to do!

------
floatboth
Significantly cheaper than the 24-core (also A53) SynQuacer Developerbox. But
of course you're getting a cluster instead of one machine…

------
megous
Nice! Distcc-based compilation might be something to try on this. :) One thing
I noticed is that the heatsink fins are oriented in the wrong direction. Air
should be going through the fins, not past the side of them. But I guess any
air movement is enough to cool this.

~~~
otherlife35
Here is a simple study on distcc, pump mode, and the make -j# option on
low-end hardware. It seems that the network could be a bottleneck. The
compilation time would probably decrease to about 1/4. But I think the use of
-j# is the best advice.

[https://forums.gentoo.org/viewtopic-t-1056580-start-0.html](https://forums.gentoo.org/viewtopic-t-1056580-start-0.html)
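
As a toy illustration of why the speedup plateaus well below the node count, here is an Amdahl-style sketch. The 80/20 split between parallelizable compilation and serial work (preprocessing, network transfer) is a made-up assumption, not a number from the linked thread:

```python
def distcc_speedup(nodes, compile_frac=0.8):
    """Amdahl-style estimate: only the compile phase parallelizes across
    nodes; preprocessing and network transfer stay serial (assumed split)."""
    serial = 1.0 - compile_frac
    return 1.0 / (serial + compile_frac / nodes)

# Diminishing returns as nodes are added under this assumed split.
for n in (1, 2, 4, 8):
    print(f"{n} nodes -> {distcc_speedup(n):.2f}x")
```

Under this split, four nodes give about 2.5x rather than 4x, and the serial fraction (which on low-end hardware is dominated by the network) caps the speedup at 5x no matter how many boards you add.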

------
mschaef
The only supercomputer they compare it to is 27 years old, and it uses Gigabit
Ethernet as its interconnect. I think they have a much looser definition of
'Supercomputer' than most people.

~~~
geezerjay
It's a few SoCs crammed in a shoebox. Of course the comparison was never meant
to be taken seriously.

------
fluxty
I wonder what topology this has; it definitely seems reminiscent of older
supercomputers like the famous Thinking Machines CM-5, which used a fat-tree
interconnect.

~~~
aepiepaey
Probably nothing interesting.

There are two 8-port ethernet switches.

With 12 nodes, this leaves 4 unused ports (2 on each switch).

From the pictures you can see that the box itself has two jacks, each of which
is likely connected to one of the switches.

The switches don't seem to support link aggregation, so the topology likely
looks like this:

    
    
        switch1
        ├── external
        ├── nano-pi1
        ├── nano-pi2
        ├── nano-pi3
        ├── nano-pi4
        ├── nano-pi5
        └── nano-pi6
        switch2
        ├── external
        ├── nano-pi7
        ├── nano-pi8
        ├── nano-pi9
        ├── nano-pi10
        ├── nano-pi11
        └── nano-pi12
    

and if you connect both switches to the same external switch, you'd get
something like:

    
    
         switch1
        ┌┼── switch2
        │├── nano-pi1
        │├── nano-pi2
        │├── nano-pi3
        │├── nano-pi4
        │├── nano-pi5
        │└── nano-pi6
        │switch2
        └┼── switch1
         ├── nano-pi7
         ├── nano-pi8
         ├── nano-pi9
         ├── nano-pi10
         ├── nano-pi11
         └── nano-pi12

------
albertgoeswoof
This is cool! But why test on this instead of using a virtual environment
locally?

~~~
zamadatix
Unless you have 96 physical cores, testing it in a virtual environment doesn't
tell you the same thing.

~~~
magila
A bunch of Pis networked together is sufficiently far removed from a real HPC
cluster that you could probably create a more realistic simulator without too
much trouble.

