"Made for inference" just means "too slow for training" if you are pessimistic or "optimized for power efficiency" if you are optimistic.
Otherwise, training and inference are basically the same.
Training and inference are only similar at a high level, not in actual application.
(ETA: In case it's not obvious, I'm agreeing with david-gpu's comment, and adding more reasons that training currently differs from inference.)
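To make the difference concrete, here's a toy numpy sketch (the single layer, sizes, and loss are made up for illustration). Inference is one forward pass whose activations can be thrown away immediately; a training step has to keep them for the backward pass, do an extra matmul's worth of work, and then rewrite every weight:

  import numpy as np

  rng = np.random.RandomState(0)
  W = (rng.randn(256, 128) * 0.1).astype(np.float32)  # toy weight matrix

  def inference(x):
      # Forward pass only: activations can be discarded as soon as
      # the next layer has consumed them.
      return np.maximum(x @ W, 0.0)

  def training_step(x, target, lr=1e-3):
      global W
      # Forward pass, but z and y must now be KEPT for backprop.
      z = x @ W
      y = np.maximum(z, 0.0)
      # Backward pass: matmul-sized work the inference path never does.
      grad_y = 2.0 * (y - target) / y.size  # gradient of mean squared error
      grad_W = x.T @ (grad_y * (z > 0))     # gradient through the ReLU
      # Weight update: a read-modify-write over every parameter.
      W -= lr * grad_W
      return float(np.mean((y - target) ** 2))

That extra state and memory traffic is part of why inference can get away with lower precision and less memory than training.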
By having an API that's not horrible to use, that advantage is gone. The utility libraries will be more of a challenge to undermine, but since it targets CUDA natively there is no disadvantage to users of Nvidia's hardware, and no advantage to others yet (see GLAS for what is possible with relative ease). Using D as the kernel language will also bring significant advantages over C/C++: static reflection, sane templates, and compile-time code generation, to name a few.
You can find it at https://github.com/libmir/dcompute.
If you have any questions, please ask!
Please read this before moving on: https://twitter.com/jrprice89/status/667466444355993600
Also, NVIDIA's CUDA compilers are built on clang, which does have an OpenCL frontend, so all they would need to do is put some resources into making that frontend work with their current nvcc toolchain.
Many people request and want this, but instead NVIDIA is trying hard to hold back OpenCL, just because providing OpenCL 2.0 support (and extensions for their GPUs' features) might help adoption of OpenCL, which in turn might end up helping other folks and companies too.
CUDA seems to be clearly winning over OpenCL in the real world, so other vendors should just adopt it. AMD already has a CUDA compiler, IIRC.
A similar mistake is about to happen, but luckily on the software side, where losses can be cut quickly and mistakes can be reversed more easily -- though many will suffer when they have to reimplement their precious libraries from the ground up because they did not (or could not) take into account the fact that CUDA is as proprietary as it gets.
AMD has no CUDA compiler BTW. And CUDA is not a programming language FYI. ;)
Aside: I have no position on whether CUDA's Fortran and C++ dialects constitute their own languages, nor did I refer to CUDA as a programming language.
Sadly, that's a very problematic, borderline BS definition.
"A system that allows third parties to make products that plug into or interoperate with it. For example, the PC is an open system."
Intel allows some third parties to interoperate with their systems (ref. Intel vs. NVIDIA, etc.), and they pick and choose to their liking, killing some and promoting others, precisely because they control the openness of their systems.
HIP is still not a CUDA compiler.
> nor did I refer to CUDA as a programming language.
You did refer to a "CUDA compiler". My comment was admittedly a nitpick, but it makes a serious point too. CUDA can be seen as C++ plus extensions -- something you can compile -- but it's also more than that (stuff you can't compile): APIs, programming tools, etc., all strongly adapted for NVIDIA hardware.
In supercomputing, this is the problem with using High Performance Linpack for benchmarks: it typically achieves an order of magnitude more floating-point operations per second than actual scientific codes do.
Hopefully TensorFlow XLA or other optimization frameworks could solve this problem in a more general way in the medium term:
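For reference, the JIT is already just a session-level switch (a minimal sketch against the TF 1.x API; how much it actually helps is very model-dependent):

  import tensorflow as tf

  # Ask TensorFlow to JIT-compile the whole graph through XLA where it can.
  config = tf.ConfigProto()
  config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

  with tf.Session(config=config) as sess:
      pass  # build and run the graph as usual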
On the other hand, I believe Google is working on a CUDA compiler, so we may actually see meaningful improvement, in the sense that it may become possible to run CUDA on other GPUs. (Edit: And Google actually has an incentive to achieve performance parity, so it might really happen.)
Hi, I'm one of the developers of the open-source CUDA compiler.
It's not actually a separate compiler, despite what that paper says. It's just plain, vanilla, open-source clang. Download or build the latest version of clang, give it a CUDA file, and away you go. That's all there is to it.
In terms of compiling CUDA on other GPUs, that's not something I've worked on, but judging from the commits going by to clang and LLVM, other people are quite interested in making this work.
But it still targets NVIDIA GPUs and uses NVIDIA libraries so not that universal yet.
This is an untrue yet often-repeated statement. For example, Hashcat migrated their CUDA code to OpenCL some time ago with zero performance hit. What is true is that Nvidia's OpenCL stack is less mature than CUDA's. But you can write OpenCL code that performs just as well as CUDA.
Nowadays our CUDA compiler is just clang.
Incorrect. Our kernels (GROMACS molecular simulation package) are 2-3x slower implemented in OpenCL vs CUDA.
> On the other hand, I believe Google is working on a CUDA compiler
They were. It's upstream clang by now.
There was a series of attempts by Lee Smolin and others to come up with a theory of quantum gravity by assuming that the universe, at the bottom, is essentially simple and discrete (not in the fixed-grid sense, but in the sense of a discrete web of relations). That model also exhibits a remarkable similarity between the structure of the universe, and the structure of the neural networks that understand it.
The future of fundamental science is sure to be fascinating.
There's no relationship between a hierarchy of probabilistic estimations and a hierarchical decomposition of the cosmos. The cosmos forms an apparent hierarchy because of the rules that govern matter and the initial expansion of the universe. That a small number of parameters might be listed in describing both is neither here nor there. A small number of parameters describe the vectors in a font file. It doesn't follow that a typeface then has any relationship with my brain or the universe.
The article reads, to me, like this: neural networks are this cool hierarchy thing, the cosmos is this cool hierarchy thing, and both of these things have low Kolmogorov complexity, isn't it amazing that our brains are like this and can understand the universe, wow.
That's one way of describing quantum theory; generally "contextual" or "non-commuting" are used instead of "hierarchical".
If the universality of such a common framework doesn't seem profound to you, at least realise it isn't something generally appreciated and barely even hinted at just a few decades ago.
Like in one of Stanisław Lem's stories about Ijon Tichy, where people call intelligent anthropomorphic robots "washing machines."
From the dictionary:
robot, origin: from Czech, from robota ‘forced labor.’ The term was coined in K. Čapek's play R.U.R. ‘Rossum's Universal Robots’ (1920).
Edit: traditional vector machines like the NEC SX still hold the programmability crown because you get a usable single system image, right?
The whole point of Voodoo 1 was making it as simple and cheap as possible by removing all the advanced features and calculating geometry/lighting on the CPU.
Later, SGI Geometry Engines used custom, very specialized DSP-like processors, but the microcode for those was written by SGI and was not end-user programmable.
There were probably research systems before it, but AFAIK the GeForce 3 was the first (highly limited) programmable geometry processor that was generally commercially available.
AI accelerators have been a thing for decades -- DSPs were used as neural network accelerators in the early 90s -- and Cell processors were a thing by 2001.
GPUs just became vastly more accessible to general-purpose programming in the last decade. People were doing it back in the 90s, but it was seriously hard.
We finally hit a tipping point where it's just kinda hard.
Assuming there's a big future for training and inference hardware, that is. Many such "new paradigms" / "silver bullet" technologies have come and gone over the past few decades.
I'm biased, since I'm part of one, but little to no modification of the software stack is necessary, so it's a credible threat to Nvidia.
Today was a great day to be!
That's at the reticle limit of TSMC, a truly absurd chip.
However, they have been at the reticle limit since they were on 28nm. GM200 (980 Ti and Titan X) was 601 mm^2 at TSMC, the maximum possible at the time.
Feels like AMD is breathing down their necks with the Vega architecture, which should be very interesting.
AMD have also stepped up their game with ROCm which might take a chunk out of CUDA.
Can't imagine we will be seeing any Volta GeForce cards released till next year.
Ironically, most of them actually use AMD's IP (the "Adreno" GPU, which is an anagram of "Radeon") that they sold off to Qualcomm in 2009. Which was yet another terrible call made by AMD management in that timeframe.
(although who knows if Adreno would have blown up in the same way if it had AMD mismanaging it)
Even more ironically, Adreno also used tile-based rendering that NVIDIA ended up adopting in the Maxwell architecture and AMD is adopting in the Vega architecture. It's a nice way to boost your power efficiency, which is critical to battery life in mobile devices.
Turns out since we're past Dennard scaling, packing more transistors on a chip now makes it hotter. So if you want it to go faster, you need to cut the power down in other ways. And thus, desktop GPUs are starting to look an awful lot like mobile GPUs...
(which is yet another reason why AMD's general-purpose compute-oriented GPU architectures are losing so badly in the desktop graphics market. RX 580 pulls twice the power of a GTX 1060 for the same performance...)
Many other aftermarket 580s are similar. For a sense of perspective here, that's roughly the same amount of power as some aftermarket 290Xs used. Or roughly 60 watts more than a GTX 1080. And that's GPU-only, not a total system load.
Polaris 10 is a reasonably efficient chip when you don't push it too hard. AMD - and their AIB partners - are pushing it way, way too hard in a desperate attempt to eke out a 2% win over the 1060. It isn't worth a 50% increase in TDP to get an extra 8% performance.
(and unlike the RX 480 - there is no reference RX 580 design, it's a whole bunch of these crazy juiced-up cards)
Not good, but also not twice the power consumption...
I don't understand why AMD didn't use faster memory in the 580 like Nvidia did with the 1060 refresh. The 580 needs faster memory more than higher core clocks.
Let's hope AMD's return to tile-based rendering (used in Adreno), plus the other improvements, helps them get better at power consumption, just like Nvidia did with Maxwell. But I don't expect much from Vega after AMD's GPUs of the last 3 years. Navi looks more promising, as it is probably the first GPU to be fully designed under Raja Koduri.
GeForce Now for SHIELD - different model, more like "Netflix for games".
> Summit is a supercomputer being developed by IBM for use at Oak Ridge National Laboratory. The system will be powered by IBM's POWER9 CPUs and Nvidia Volta GPUs.
Summit is supposed to be finished in 2017, though. I'm quite surprised this is possible since the Volta architecture has only just now been announced.
Supercomputers have very long planning and development cycles. So do GPUs and CPUs. The contract specified chips (Volta and POWER9) that, at the time, existed as little more than codenames on a roadmap.
The issue, though, is that there's no memory sharing with the GTX/Titan line. If there were, I probably would have just sprung for two 1080 Tis out of the gate.
Definitely loving the eight 1080 Tis they just fit in here though: http://www.velocitymicro.com/promagix-g480-high-performance-...
The math doesn't add up.
Under "New SM" in "Key Features" section
As far as any overlap software-wise is concerned: while it isn't super clear what Tesla Motors is doing for their self-driving systems, based on what I've seen it seems like they are using only "basic" lane detection and identification, along with some other algorithmic vision-based systems. I'm not saying that's everything they are doing, just what I have seen released publicly on their vehicle platform.
NVidia, on the other hand, has been experimenting with using neural networks (deep learning CNNs specifically) to drive vehicles using only camera information:
This is actually a fun CNN to implement - I (and many others) implemented variations of it in the first term of Udacity's Self-Driving Car Engineer Nanodegree. We weren't told to do it this way, but I chose to after reviewing the various literature, plus it seemed like a challenge (and it was for me). Udacity supplied a simulator:
...and we wrote code in Python (TensorFlow and Keras) to train and drive the virtual car. For my part, I set up my home workstation with CUDA so that TensorFlow would use my GPU, a lowly GTX 750 Ti SC - though based on what I've researched, it seems to have GPU capability similar to NVidia's Drive-PX system. A Mini-ITX mobo, a PCI-E slot riser, and a GTX 750 would make a decent low-end deep-learning platform for self-driving vehicle experiments, at a fraction of what the Drive-PX sells for.
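If anyone is curious, the network from NVidia's end-to-end paper is small enough to sketch in a few lines of Keras. This is my rough approximation, not their exact code (the activations and optimizer are my choices; the input size and layer widths come from the paper):

  from keras.models import Sequential
  from keras.layers import Lambda, Conv2D, Flatten, Dense

  model = Sequential()
  # Normalize pixels to [-0.5, 0.5]; the paper feeds 66x200 images.
  model.add(Lambda(lambda x: x / 255.0 - 0.5, input_shape=(66, 200, 3)))
  # Five convolutional layers, as in the paper.
  model.add(Conv2D(24, (5, 5), strides=(2, 2), activation='relu'))
  model.add(Conv2D(36, (5, 5), strides=(2, 2), activation='relu'))
  model.add(Conv2D(48, (5, 5), strides=(2, 2), activation='relu'))
  model.add(Conv2D(64, (3, 3), activation='relu'))
  model.add(Conv2D(64, (3, 3), activation='relu'))
  # Fully connected head regressing a single steering angle.
  model.add(Flatten())
  model.add(Dense(100, activation='relu'))
  model.add(Dense(50, activation='relu'))
  model.add(Dense(10, activation='relu'))
  model.add(Dense(1))
  model.compile(optimizer='adam', loss='mse')
  # model.fit(camera_images, steering_angles, ...) on recorded driving data.

Training this on a few laps of simulator data fits comfortably on a card like the 750 Ti.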