A practical baseline for anyone interested in ARM compute would be the Thunder X CPU (cloud rental: https://www.packet.com/cloud/servers/c1-large-arm/). It has 48 cores per socket, and two sockets make a 96-core server.
As another commenter said: the primary use of this NanoPi is the ability to emulate a "real" super-computer and really use MPI and such. MPI is a different architecture than a massive node (like a 96-core Thunder X ARM), and you need to practice a bit with it to become proficient.
I read somewhere that some real supercomputer systems programmers actually use toy clusters of Raspberry Pis to test their scheduling software. It helps speed up their development cycle because they can do initial testing on their desktops.
Edit: I think this is what I was thinking of: https://www.youtube.com/watch?v=78H-4KqVvrg
Wouldn’t containers be an easier way to do that?
Supercomputers have high-latency communications through thick pipes. True, supercomputers have 40Gbit or 100Gbit connections between nodes, but it can still take multiple microseconds to send a message from one node to another.
A bunch of containers all sitting on the same box would be able to handle communications within dozens of nanoseconds. So it's a bad "emulator" for supercomputers.
Coordinating all your nodes to compute a problem, while achieving high utilization, is tricky. It's not like programming a normal computer where threads share RAM and can communicate in nanoseconds.
You can add artificial delay to your local container network to better simulate a production environment.
For example using https://github.com/alexei-led/pumba and "pumba netem delay" you can add networking delay between Docker containers, and "pumba netem rate" can limit the bandwidth between them as well.
("pumba" is just using the underlying Linux networking traffic control technologies, such as the "tc" command from "iproute2", so you don't have to use "pumba", you can set this up manually, but a tool like "pumba" makes it a lot easier.)
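As a rough sketch of the manual route, the snippet below only builds (it does not run) the kind of "tc" command that netem expects. The helper name netem_command, the interface name eth0, and the 50us / 1gbit values are assumptions chosen for illustration; actually applying the command needs root inside the container's network namespace, and whether netem honours delays this small is exactly what is in question in this thread.

```python
import shlex
from typing import List, Optional


def netem_command(interface: str, delay_us: int, rate: Optional[str] = None) -> List[str]:
    """Build the argv for adding a netem qdisc with a fixed delay.

    `interface` is whatever the container's network device is called
    (eth0 is only a common default, not a given). `rate` is an optional
    bandwidth cap in tc notation, e.g. "1gbit".
    """
    argv = ["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "delay", f"{delay_us}us"]
    if rate is not None:
        argv += ["rate", rate]
    return argv


if __name__ == "__main__":
    # Print the command instead of executing it, so this is safe to run anywhere.
    print(shlex.join(netem_command("eth0", 50, rate="1gbit")))
    # prints: tc qdisc add dev eth0 root netem delay 50us rate 1gbit
```

Running the printed command by hand (and removing it again with "tc qdisc del dev eth0 root") is roughly what a tool like "pumba" automates per container.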
Basically, netem is designed to provide milliseconds of delay, emulating a worldwide network. Supercomputers are thousands of times faster than that. I'd have to play with netem before I was certain that it could handle the sub-10uS delays that supercomputers have node-to-node.
Considering that Linux task-switching is on the order of ~10mS or so, I have severe doubts that uS level delays will work with netem.
The NanoPi-Fire3 uses normal Gigabit Ethernet, which probably has latencies in the ~50uS region. That is slower than a real supercomputer, but "proportionally" it should be representative of supercomputers (since the tiny embedded ARM chips are around 50x slower than a real supercomputer node anyway).
A bunch of Raspberry Pis on Gigabit Ethernet seems like a better overall "cheap supercomputer" architecture for students of supercomputing. Better than containers or software-emulated network delays.
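The "proportional" argument is just a ratio, which can be written out as below. The function name is made up for illustration, and the inputs are the rough guesses from this thread (about 50uS for Gigabit Ethernet, about 50x slower cores, under about 10uS node-to-node on a real machine), not measurements.

```python
def equivalent_latency_us(link_latency_us: float, node_slowdown: float) -> float:
    """Scale a link latency by how much slower the node is.

    If a node computes `node_slowdown` times slower than a real
    supercomputer node, a message delay costs it proportionally fewer
    "lost" compute steps, so divide the latency by the slowdown.
    """
    return link_latency_us / node_slowdown


if __name__ == "__main__":
    gigabit_latency_us = 50.0    # rough guess from the thread
    arm_slowdown = 50.0          # rough guess from the thread
    supercomputer_bound_us = 10.0  # "sub ~10uS" node-to-node, also a guess

    scaled = equivalent_latency_us(gigabit_latency_us, arm_slowdown)
    print(f"scaled latency: {scaled} us")  # prints: scaled latency: 1.0 us
    print("within supercomputer range:", scaled < supercomputer_bound_us)
```

With those guesses the toy cluster's latency, measured in units of its own compute speed, lands in the same ballpark as a real interconnect, which is the whole point of the comparison.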
But ping on local host reports .3 ms latency..?
In any case it’s still an easy way to get started, and arguably when exact latency starts counting for something, you have to be tuning your code on the system it’s going to run on. An RPi cluster could skew that in all sorts of ways, e.g. the TCP stack being disproportionately slower.
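For anyone who wants to check the loopback number themselves rather than argue from ping, here is a minimal sketch that times 1-byte TCP round trips over 127.0.0.1. The function names are invented for this example, and the result includes Python interpreter and syscall overhead, so treat it as an upper bound on what the kernel's loopback path costs, not as a precise latency figure.

```python
import socket
import threading
import time


def echo_server(listener: socket.socket) -> None:
    """Accept one connection and echo bytes back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(64)
            if not data:
                break
            conn.sendall(data)


def measure_round_trip_us(rounds: int = 1000) -> float:
    """Return the mean round-trip time, in microseconds, for 1-byte messages."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    server = threading.Thread(target=echo_server, args=(listener,), daemon=True)
    server.start()

    client = socket.create_connection(listener.getsockname())
    # Disable Nagle's algorithm so small messages are sent immediately.
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    start = time.perf_counter()
    for _ in range(rounds):
        client.sendall(b"x")
        client.recv(1)  # block until the echoed byte comes back
    elapsed = time.perf_counter() - start

    client.close()   # server sees EOF and exits its loop
    server.join()
    listener.close()
    return elapsed / rounds * 1e6


if __name__ == "__main__":
    print(f"mean loopback round trip: {measure_round_trip_us():.1f} us")
```

Running the same client against an echo server on another node of a small cluster gives a like-for-like comparison between same-box and over-the-wire messaging.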
It is an interesting question what people should use if they don't have access to a supercomputer but would like to learn and optimize for HPC-style distributed-memory programming.
I've found AWS to be pretty nice, except there are no RDMA drivers for the elastic NIC and the BW is a bit low (25Gbit vs. 100Gbit). For MPI bulk synchronous programs, it's probably a pretty close model, though.
Relative to accessing the hardware resources on a host, it very much is. Just as accessing the RAM is slow relative to accessing the L2 cache on a CPU.
OTOH if you want to see how your massively parallel algorithm behaves on a 96-node cluster / network, such a box is just $500, and is portable and can work offline.
* this is back of napkin, real world results will vary
Add in all the ancillary hardware (motherboard, memory, hard drive, PSU losses) and that efficiency number is going to take a nosedive.
What you see in Anandtech's review is the result of motherboard firmware effectively disabling the power limit by setting it to a very high value. This is a common practice among enthusiast motherboards in order to boost scores in reviews. Unfortunately it also results in drastically lower power efficiency and lots of clueless people, including many tech writers, complaining about unrealistic TDP numbers.
From the page in question: "In this case, for the new 9th Generation Core processors, Intel has set the PL2 value to 210W. This is essentially the power required to hit the peak turbo on all cores, such as 4.7 GHz on the eight-core Core i9-9900K. So users can completely forget the 95W TDP when it comes to cooling. If a user wants those peak frequencies, it’s time to invest in something capable and serious."
95W is the required power to sustain the base clocks.
Also, calling AnandTech clueless... Are there any better hardware review sites? I would consider them a tier 1 site, with HardOCP and not a whole lot else...
Anandtech's quality has dropped since Anand Lal Shimpi left for Apple. It's still decent, but they're missing that Anand chip-level wizardry that they used to have. I still consider them a good website, just down a few notches.
The new sites with quality are Youtube-based. It's just where the eyeballs and money are right now.
Gamers Nexus is probably the best of the up-and-coming sites (they have a traditional webpage / blog, but also post Youtube videos regularly). And Buildzoid is one of the best if you want to discuss VRM management on motherboards. These focus more on "builder" issues than the chip-level engineering Anand used to write about.
TechReport is my favorite overall reviewer.
See this article on Gamers Nexus for a much better summary of the power consumption situation for Intel CPUs
I would find it hilarious if this conversation somehow prompted it.
Anyways, AnandTech's position seems to be:
We test at stock, out-of-the-box motherboard settings, except for memory profiles. We do this for three reasons -
1. This is the experience almost all users will have.
2. This is what the benchmarks published by Intel reflect.
3. This is what damn near every other review site has done forever, and to do otherwise would make results less useful.
So that's why their power draw number was 170W and not 95W for the i9-9900K: motherboard vendors take Intel's recommended settings and laugh. But so does Intel for benchmarks.
> how your massively parallel algorithm behaves on a 96-node cluster / network, such a box is just $500, and is portable and can work offline.
Cue the many forum questions: "I'm planning to use a Raspberry Pi to control a <simple-ish device>. Will it be powerful enough?"
Who needs such a powerful CPU with so little RAM? The reason I have still not bought any Pi is that all of them have 2 GiB of RAM or less, and I'm not interested in buying anything with less than 4.
There are others that are pricier (> $100) with the x86 architecture, such as the UDOO boards, if you really want an SBC with much more RAM.
What do you need that much RAM for? What do you plan to run in this machine?
There are two 8-port ethernet switches.
With 12 nodes, this leaves 4 unused ports (2 on each switch).
From the pictures you can see that the box itself has two jacks, both of which are likely connected to one switch each.
The switches don't seem to support link aggregation, so it is likely to look like this: