I'm a strong believer that the future of supercomputing (well, if anyone adopts it; HPC can be notoriously un-innovative, borderline corrupt even in some countries...) lies in manycore processors like GPUs or their MIMD counterparts like the Rex Neo and the Adapteva Epiphany.
I have very few kind words for the Xeon Phi. It's relatively poorly executed, is not competitive with GPUs, etc.
I'm not really sure why people think ARM can do much better, though. If you're going to go for a new ISA because supercomputing can recompile often, then you might as well start with a clean-slate ISA.
Back of the envelope: a GTX 1080 has 2560 cores, TDP 300W, 1.6 GHz, so ~14 compute thread GHz/W. Dual 48-core ARM boards, 4x32 SIMD, TDP 100W, 2.5 GHz, so ~10 compute thread GHz/W. (This assumes IPC and price are similar.)
So, for SIMD-enabled workloads, there's potential to be in the same ballpark. For non-SIMD or mixed workflows, GPU isn't even an option. For HPC in general (not deep learning specifically), ARM may be a better long-run bet than GPU.
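For anyone who wants to redo the back-of-the-envelope math above, here it is as a few lines of Python. This is only a sketch: it takes the figures exactly as quoted in the comment (they are not checked against spec sheets), it reads "4x32 SIMD" as 4 lanes of 32 bits per core, which is an assumption, and the function name thread_ghz_per_watt is made up for illustration.

```python
def thread_ghz_per_watt(lanes, clock_ghz, tdp_watts):
    """Crude throughput-per-watt proxy: parallel lanes * clock / TDP.

    Ignores IPC, memory bandwidth and price, as the comment above does.
    """
    return lanes * clock_ghz / tdp_watts


# GPU figures as quoted: 2560 cores, 1.6 GHz, 300 W.
gpu = thread_ghz_per_watt(lanes=2560, clock_ghz=1.6, tdp_watts=300)

# ARM figures as quoted: 2 sockets * 48 cores * 4 SIMD lanes (assumed), 2.5 GHz, 100 W.
arm = thread_ghz_per_watt(lanes=2 * 48 * 4, clock_ghz=2.5, tdp_watts=100)

print(f"GPU: {gpu:.1f} thread GHz/W")
print(f"ARM: {arm:.1f} thread GHz/W")
print(f"GPU/ARM ratio: {gpu / arm:.2f}")
```

With these inputs the GPU comes out around 13.7 and the ARM setup around 9.6, so the two sit within a factor of about 1.4 of each other. Changing any one input (the TDP in particular) moves the ratio a lot, which is why this is only a ballpark.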
The reason I don't see ARM being very competitive is that ARM processors have no discernible advantage compared to a clean slate ISA design. What exactly does ARM offer that the Rex Neo doesn't?
By contrast, the Neo doesn't waste energy on OoO and has no caches, which makes it far more efficient (especially in terms of data movement, which is a big deal), and it has the advantage of a simpler decoder.
That said, single core performance is far lower than x86. With high core counts it makes sense for HPC work.
I'd really like to see a good OpenCL implementation for ARM.
While the Phi is easy to port to, it isn't close to GPUs on vectorization-friendly problems. This is likely why the Aurora supercomputer has been delayed/canceled.
IIRC Andy Glew had some slide deck where he advocated that. Edit: https://drive.google.com/file/d/0B5qTWL2s3LcQNGE3NWI4NzQtNTB...
AVX-512 has many of the primitives required, and ISPC ports the GPU programming model to CPU SIMD.
For buying a machine, it depends on what you want. Desktops are coming out now, for example: https://softiron.com/products/overdrive-1000/
For server/HPC, there are quite a few coming online. The Cavium presenter @goingarm listed https://www.packet.net on their slides, which might be decent for getting a cloud instance to try out. There are many others. A good place to start finding them is: http://arm-hpc.gitlab.io.
A nice thing about ARM is that you get lots of different microarchitectures under a single ISA. Once the ball gets rolling there will be many processors to choose from, which I think is a great thing. (Full disclosure: I'm one of the authors, and I work for ARM Research on our HPC program, although all statements above are my own opinion and not those of ARM.)
The only competitive opportunity that Intel leaves open comes from artificially limiting its CPUs to jack up its margins, and it seems that AMD Epyc has been designed from the ground up to exploit this opportunity.
Between Intel and AMD I don't see any niche left for an ARM solution to get off the ground.
"2013 was not the year of ARM. It was the year of ARM hype. But for reasons having nothing to do with ARM itself. Basically ARM was the anti-Intel. Everyone was looking for an anti-Intel, in order to maintain pressure on Intel. I can’t tell you how many times I’d heard “Intel is done” or “Intel is in trouble” in meetings with customers, or partners. This wasn’t true, though there was a great deal of wishing it were true on the part of some. We hedged a bit by trying to work with Calxeda. Unfortunately, as we rapidly discovered, the claims about ARM as a viable replacement for Intel were simply not true (specifically talking about Calxeda’s chips) at this time. Calxeda were positioned in customers and partners minds as being “just like Intel’s maybe a little slower and much less power”. Neither of these statements were true. The chips were badly underpowered, and when you aggregated enough of them to do interesting work, they ran … HOT … ."
What we found was a strong misperception in end users' minds, what I will say ... willfully ... placed there by interested parties, that 1 ARM core == 1 Intel core. This is most definitely untrue (then, and as far as I can tell, now).
I know that people really want a viable alternative to Intel, and I understand why. But, honestly, ARM isn't it, and doesn't look like it will ever be it.
There are many reasons for this, but it boils down to actual real-world application performance and toolchain issues. The former is the main reason why I am betting against ARM now. The latter is made even more difficult by a fractured ARM market. Which ARM ABI should you write for, and do you have good optimizing compilers for it?
So I am very skeptical at this moment. I've seen the song and dance before. Then I did the testing. And it was embarrassingly bad. 200 ARM cores were not as fast as 16 Intel cores, all while consuming the same if not greater power and having a smaller memory address space.
I'd be happy to be proven wrong on this. Though I wouldn't bet on this outcome.
1. Historical tight coupling of components (necessary for SoCs in embedded products such as smartphones) means there is no standard for building an ARM PC. This means vendor lock-in, difficulties upgrading, and even more proprietary blobs than you get with PCs.
2. No desire by vendors to change #1. Vendors would like customizable, piecemeal-upgradable PCs to go away. Samsung and Apple would love it if ARM desktops took over: they make their own ARM chips, and they would sell you a computer with 16GB of soldered-on memory and 256GB of built-in storage. If you wanted to change that, you'd have to buy a new machine. Want a new graphics card? New machine. Imagine all the nightmares of laptops, except that every computer is like that. For corporate, professional, and enthusiast users, this is very bad.
A lot of the scenarios you talk about are actually possible today; indeed, demos of those scenarios have been floating around for ages. The CPU architecture is not the hard part: both Google's and Microsoft's app stores have packages that run on both x86/x64 and ARM just fine. The real problems are things like real-time syncing and collaboration, real-time migration of entire app state from one device to another, and creating a good user experience around this.
Making a cool demo where one app transitions from a smartphone to a PC screen is not that hard, and you can see many tech demos of that uploaded to YouTube. Making that work across all apps, in general, and making an app UI that works just as well on mobile as on desktop and maybe also on tablet, is where it gets harder.
Full disclosure: I work for Microsoft.