What kind of bullshit is this? Alex Krizhevsky used CUDA to run neural nets much faster than you could on a CPU, blew away the competition on ImageNet, and NVIDIA found out they were sitting on a goldmine powered by a revolution they really had nothing to do with.
NVIDIA really had nothing to do with any of this taking off, except happening to provide the hardware.
NVIDIA got lucky by having the right kind of hardware, and they created CUDA not out of some insight into the AI world, but based on the needs of scientific computing. Just look at their early papers on CUDA (e.g., Luebke's 2008 paper "CUDA: Scalable Parallel Programming for High-Performance Scientific Computing") - AI is nowhere to be seen. They were focused on traditional high-performance computing problems.
They really haven't made much forward-looking investment in this area; they just got lucky because they had built CUDA beforehand for other reasons, and that was what people standardized on.
AMD's chips would work just fine for DL if everything weren't already written in CUDA, which is why they've had to invest in a cross-platform CUDA alternative in the form of ROCm/HIP.
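To make the ROCm/HIP point concrete, here's a minimal sketch (my own illustration, not code from any particular project) of why HIP is pitched as a near drop-in path off CUDA: the kernel below is plain CUDA, and HIP mirrors the API so closely that tools like hipify can translate most straightforward code mechanically (cudaMalloc -> hipMalloc, cudaMemcpy -> hipMemcpy, identical kernel-launch syntax).

    #include <cuda_runtime.h>
    #include <vector>

    // Plain CUDA SAXPY; under HIP the kernel body compiles unchanged and the
    // runtime calls map one-to-one (cudaMalloc -> hipMalloc, etc.).
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));                                    // HIP: hipMalloc
        cudaMalloc(&y, n * sizeof(float));
        cudaMemcpy(x, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice); // HIP: hipMemcpy
        cudaMemcpy(y, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);                      // identical launch syntax in HIP
        cudaDeviceSynchronize();                                             // HIP: hipDeviceSynchronize
        cudaFree(x); cudaFree(y);                                            // HIP: hipFree
        return 0;
    }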
In fact, I prefer OpenCL to CUDA. But the evidence that this is not just a matter of "a few people's" effort is that AMD has had a few people working on BLAS libraries forever, and their stuff is almost unusable for DL, let alone a match for CUDA. Add to that that Nvidia provides cuDNN, which everyone uses. That it isn't a matter of a few people can also be seen from the fact that AMD still does not provide a working alternative, even a toy one. Everything in the AMD/OpenCL world relies on a handful of third-party open-source efforts. Some of those work OK, some are cool, some are great, some are garbage, but there is no ecosystem anywhere near Nvidia's.
They have HIP (and they have OpenCL), but those are only general-purpose compilers. They have next to nothing when it comes to the libraries.
In comparison, NVIDIA had all this effort done for them by the library developers.
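For what it's worth, here's a rough sketch (assuming a cuBLAS handle has already been created with cublasCreate) of what that library layer buys you on the NVIDIA side: a tuned single-precision GEMM is one call, and it's exactly this layer - cuBLAS, cuDNN, and friends - that the AMD side has lacked.

    #include <cublas_v2.h>

    // C = A * B for column-major matrices already resident on the GPU:
    // A is m x k, B is k x n, C is m x n. All of the tuning lives inside cuBLAS.
    void gemm_on_gpu(cublasHandle_t handle, int m, int n, int k,
                     const float* A, const float* B, float* C) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k,
                    &alpha, A, m,   // lda
                    B, k,           // ldb
                    &beta, C, m);   // ldc
    }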
I'll just say that there is a reason the projects you linked have only a handful of stars on GitHub, despite having significant headcount and supposed megacorp backing.
People have been burned by AMD so many times that hardly anyone cares anymore. I do, but I'm not very optimistic...
I think everyone in the deep learning space is interested in AMD succeeding so that we're not all locked into NVIDIA or Google's TPUs, but everyone wants someone else to take the leap first and pay that cost.
I'm certainly not going to buy any AMD GPUs before these versions have been upstreamed and a few people have kicked the tires; the lost productivity just isn't worth it.
Now that deep learning is big, the push is actually toward half precision.
Most of the "incredible power" of a Tesla V100 is 100% unusable by a Physicist for example
Also I love your MCMC sampler
It could be done on FPGAs and the like, but that wasn't easily accessible to the masses.
As GP says, this is some interesting revisionist history.
NVidia figured this out at the same time as everyone else.
edit: on newest hardware, without recompilation
How GPUs and AI are rooted in DSP, and how the lack of generalized DSP chips (perhaps built from FPGAs) has set back progress in concurrent programming immeasurably, especially for the embarrassingly parallel problems whose solutions are making headlines today.
The point you're making about untapped technical potential is interesting, but I'm left wanting to know more about what you think the issue/solution is.
It would be so much better if we had generalized multicore computing. Something on the order of 256 or more 1 GHz MIPS, ARM, or even early PowerPC processors. We need to be able to run arbitrary code with flow control and be able to experiment with our own models for data locality, caching, etc.
Something like that running Erlang/Go/Octave/MATLAB or one of the many pure functional languages would open up amazing opportunities if we were no longer compute-bound. I first started caring about this in the late 90s with FPGAs, but that possible future was supplanted by the hardware-accelerated graphics future we're in now. Yes, it's pretty good, but it's a pale shadow of what's possible.
Edit: here are a few examples of what I'm talking about:
The modern GPU is a “generalized multicore” machine anyway. Nvidia GPUs have, say, 60 SMs with many warps per SM concurrently resident, giving you hundreds of hardware threads (warps), each driving a 32-wide SIMD unit. You have multiple memories you can play with for data locality (the multi-MB register file and shared memory, the smaller caches, global memory). Using all of that well is hard enough for anything but simple problems.
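As a hedged illustration of the "memories you can play with" point (my own sketch, not from any particular codebase): the classic tiled transpose stages a 32x32 tile in the SM's shared memory so that both the reads from and the writes to global memory stay coalesced.

    #include <cuda_runtime.h>

    #define TILE 32  // launch with dim3 block(TILE, TILE) and a grid covering the matrix

    // Tiled transpose of a row-major (height x width) matrix: each block stages a
    // tile in shared memory, so global reads and writes are both coalesced.
    __global__ void transpose_tiled(const float* in, float* out, int width, int height) {
        __shared__ float tile[TILE][TILE + 1];  // +1 pads away shared-memory bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];

        __syncthreads();  // whole tile resident before any thread writes it out

        x = blockIdx.y * TILE + threadIdx.x;  // transposed block origin
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }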
If you want to change deeper behavior, there's always Verilog.
They do exist; it's just a considerable challenge to program them effectively, and they're more expensive than GPUs, whereas with GPUs the R&D cost is spread across the huge games market (and, more recently, cryptocurrency).
- https://news.ycombinator.com/item?id=13741505 shows an experimental new CPU architecture that appears to be single-core but demonstrates interesting power savings and speedups by eschewing standard on-chip caching and building a smarter compiler. TL;DR: the compiler is closed-source, but I think the people behind the design are trying to make the architecture, registers, overview, etc. as open as is practical, with free documentation. I'm not sure if there's a simulator for it. I'm very sure they're already taking on commercial customers, and getting access is probably not overly difficult. (I realize you're not likely to need one just yet, but "can't access at all" makes for something much less interesting to file away.)
- GreenArrays, Inc. (Chuck Moore's company) makes the GA144, a 144-core microcontroller that natively runs Forth (and will NOT comfortably run transpiled <any other language here>). At one point a production company was making evaluation boards. https://www.youtube.com/watch?v=NK1zlz67MjU is a (1-hour-long, somewhat slow-paced) presentation that provides a good overview of the GA144. I half wonder whether I'm looking at a chip designed by people who don't understand SMP when I watch this, but I also think it's possible that the brutal simplicity Forth (and Chuck Moore) applies to everything means I'm really staring at the raw complexity of SMP here. I've read that Forth is one of those languages that changes the way you think about programming, so maybe the GA144 could be a training processor - crack it, and you'll understand SMP a tiiiny bit better on more contemporary/traditional processor architectures.