The MI8 card's HBM gives it a great power and performance advantage (512 GB/s peak bandwidth) even though it's on 28 nm. NVIDIA has nothing with even remotely comparable bandwidth in this price/perf/TDP regime.
None of the NVIDIA GP10x Teslas have GDDR5X -- not too surprising given that it was rushed to market, riddled with issues, and barely faster than GDDR5. Hence the P4 has only 192 GB/s peak BW; while the P40 does have 346 GB/s peak, it has a far higher TDP and a different form factor, and is not intended for cramming into custom servers.
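For a rough sense of why that matters, here's a back-of-envelope bandwidth-per-watt comparison. The numbers are approximate datasheet figures (and the MI8 TDP is inferred from the R9 Nano), so treat them as illustrative:

```python
# Rough bandwidth-per-watt comparison from (approximate) datasheet numbers.
# The P4's TDP is configurable (50-75 W); 75 W is used here.
cards = {
    "Tesla P4":  (192, 75),   # (peak GB/s, TDP in W)
    "Tesla P40": (346, 250),
    "MI8":       (512, 175),
}

for name, (bw_gbs, tdp_w) in cards.items():
    print(f"{name}: {bw_gbs / tdp_w:.2f} GB/s per watt")
```

By this crude metric the MI8 comes out ahead of both Teslas, which is roughly why it looks attractive in this regime.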
[I don't work in the field, but] to the best of my knowledge inference is often memory-bound (AFAIK GEMV-intensive, so low flops/byte), so the Fiji card should be pretty good at inference; in such use-cases GP102 can't compete in bandwidth. So the MI8, with 1.5x the flop rate, 2.5x the bandwidth, and likely ~2x higher TDP (possibly configurable like the P4), offers an interesting architectural balance which might well be appealing for certain memory-bound use-cases -- unless of course those same cases also need large memory.
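To make the "low flops/byte" point concrete, here's a minimal roofline-style sketch. Peak numbers are assumed/approximate; FP32 GEMV does ~2*n^2 flops over ~4*n^2 bytes of matrix data, i.e. ~0.5 flop/byte of arithmetic intensity:

```python
# Roofline sketch: attainable throughput = min(compute peak, bandwidth x intensity).
def attainable_tflops(peak_tflops, peak_bw_gbs, flops_per_byte):
    bw_limited = peak_bw_gbs * flops_per_byte / 1000.0  # Gflop/s -> Tflop/s
    return min(peak_tflops, bw_limited)

# FP32 GEMV: ~0.5 flop per byte of matrix traffic
gemv_intensity = 0.5

print(attainable_tflops(8.2, 512, gemv_intensity))   # MI8 (Fiji)
print(attainable_tflops(12.0, 346, gemv_intensity))  # GP102-class card
```

At GEMV-like intensities both cards sit far below their compute peaks, so peak bandwidth, not flop rate, is what actually separates them.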
Update: I should have looked closer at the benchmarks in the announcement; in particular, the MIOpen benchmarks show the MI8 clearly beating even the Titan X Pascal (which has higher BW than the P40), indicating that this card will be pretty good for latency-sensitive inference as long as things fit in 4 GB.
Not sure if AMD is going to go all-HBM on all its high-performance GPUs in 2017, or only offer one or two models with it.
EDIT: MI6 is Polaris based, MI8 is Fiji with HBM, MI25 is Vega.
Almost. My guess is the R9 Nano, given the same 8.2 Tflops (SP) flop rate.
> Not sure if AMD is going to go all-HBM on all its high-performance GPUs in 2017, or only offer one or two models with it.
It would make perfect sense to have some GDDR5X-based mid-range GPUs. HBM2 will be expensive, too expensive even for the top of the mid-range (and the same applies to NVIDIA). GDDR5X has plenty of room for improvement over GDDR5, and by next year they should have it figured out better.
Having said that, there are some early techniques out there for DMA from NVMe drives directly to GPU RAM. Like this: http://kaigai.hatenablog.com/entry/2016/09/08/003556
To be realistic you need to compare AMD hardware running MIOpen to NV hardware running a framework backed by cuDNN.
All in all, given how elegant HIP is and that HCC seems to make GPUs more approachable than CUDA (and less silly than OpenACC), there is great potential for AMD to gain some traction. My greatest concerns are the quality and robustness of their software stack, their overly optimistic view (at least from the outside), and their relationship with the rest of the OSS world, especially given the conflicts they seem to be running into with upstream contributors.
AMDs HIP port is using the "fallback" open-source CUDA kernels, which are nowhere near as fast as the hand-optimized cuDNN code.
What exactly do you mean by that?
Massive amounts of boilerplate, heavy APIs, or a crappy software stack are more dangerous IMO than having to drop in replacements for NVIDIA-specific optimizations of GPU-to-NIC or GPU-to-GPU communication.
AMD is in a catch-22: support CUDA and be effectively in a constant catch-up position, or don't support it and continue to be ignored by the market at large, simply because of the momentum NVIDIA has managed to build with CUDA over the years.
One example: AMD can support CUDA while focusing on price/performance at the low end (initiating an Innovators Dilemma for NVIDIA), and if successful then solidify AMD's position, for example by starting a standards process for a successor to CUDA.
NVIDIA, of course, has various countering moves available such as IP moats, initiating competing standards, and so on.
Rarely is a credible player entirely boxed in. And even then, there are moves available to make the best of things, like making some acquisitions and then spinning the division off into an independent company, or selling it to another player (in this case probably Intel, but maybe ARM could be suckered into it), and so on.
I thought OpenCL was the standard in this space.
XHTML 2.0 comes to mind.
Tactically, creating a backward-compatible successor to CUDA (a superset, essentially) would have the advantage (for AMD, that is) that NVIDIA can't decide to un-implement the existing CUDA support in their products in order to spike AMD's efforts.
Then again, there is a lot of IP sloshing around in this space, so the specific tactic used by AMD may have to be something sneakier and subtler.
Anyway, this is all just one of the hypothetical ways AMD could try to unseat NVIDIA, there are others.
Just look at the Maxwell-based Tesla cards that came out of the blue (as if they were an afterthought); I bet Facebook, Baidu, Google, etc. told NVIDIA that Kepler was shit for their use-cases and they did not want to wait for Pascal.
Or look at the bizarre Pascal product line where the GP100 does support half precision, but the others don't, not even the P4 or P40. Strange, isn't it?
Things are changing so quickly that it isn't hard at all to find footing as long as you have something useful to offer. At the same time, I agree, a robust software stack is an advantage, but for the likes of Google or Baidu even that is not so big of a deal (Google wrote their own CUDA compiler!).
As for half precision, it's pretty much the same thing NVIDIA has been doing since Kepler: dumping FP64 and FP16, especially FP16, due to the silicon costs.
NVIDIA came out with the Titan and Titan Black with baller FP16 performance and no one seemed to care; the Titan X then dropped it and people bought it like cupcakes. I'm pretty sure they have good market research showing that most people can live without it, and those who can't can pay through the nose.
As for Google and Baidu, while they are huge I'm not sure how "important" they are; Google can pretty much design their own hardware at this point, and as you mentioned they don't really use the software ecosystem that much, as they can write everything from scratch, even the driver if need be.
What CUDA gives is a huge ecosystem for 3rd parties and, more importantly, a lock on developers, since that's "all they know". It's not that different from how MSFT got a lock on the IT industry through sysadmins who only knew Windows; even now, with DevOps and everyone and their mother running Linux, they are still an important player.
If the majority of commercial software is CUDA-based, and most researchers and developers are exposed more to CUDA and are more experienced with it, then NVIDIA has a lock on the market.
I'm not entirely sure how big of a client, or how good of a client, Google will be; they can demand pretty steep discounts, and they are a heartbeat away from building their own hardware anyhow.
NVIDIA doesn't want to be locked into 2-3 huge contracts that pretty much dictate what their hardware and software should look like; that's what put AMD in a bind with the console contracts dictating what their GPUs are going to look like for a few generations now.
First off, I'm not sure what you are referring to by "Maxwell 1" and "Maxwell 2"; there's GM204, GM206, GM200, all very similar, and GM20B, the slight outlier.
Maxwell was an arch tweak on Kepler which, because everyone but Intel was stuck at 28 nm, had to cut down on all but the gaming-essential stuff to deliver what consumers expected (>1.5x gen-over-gen boost). For that reason, and because HPC has traditionally been thought to need DP, IIRC no public roadmap mentioned Maxwell Teslas at all. Instead, the GK210 (K80) was the attempt at a bridge-gap chip for HPC until Pascal; it was released just a few months before the big GM200 consumer chips came out. That held until fall '15, when the late-arriving M40 and M4 were pitched as "deep learning accelerators" (though GRID/virtualization versions were released a little earlier, in August). Quite obvious naming, plus no sane HPC shop would buy a crippled (no DP) chip <1 year before Pascal, the promised miracle-chip, was planned to be released. Let's not forget that the P100 was known to be quite late and the K80 was insufficient for the needs of many customers, so another bridge-gap was necessary; and that's what the M Teslas were: bridge-gap ML/DL cards that Google/FB/Baidu and the like wanted, and picked up pretty quickly.
> As for half precision, it's pretty much the same thing NVIDIA has been doing since Kepler: dumping FP64 and FP16, especially FP16, due to the silicon costs.
Not really. With Kepler they experimented with the SP/DP balance a bit (1/3), with Maxwell they were pushed by 28 nm, and with Pascal they returned to DP = 1/2 SP throughput.
Also, FP16 is not completely separate silicon; AFAIK FP16 instructions are dual-issued on the SP hardware.
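If FP16 really is dual-issued on the FP32 units, the peak rates relate trivially; a toy illustration (the 10.6 SP Tflop/s figure is an assumed P100-like number, not from this thread):

```python
# If FP16 is dual-issued on the FP32 ALUs, peak FP16 rate is simply 2x the SP peak.
def fp16_peak(sp_tflops, dual_issue_factor=2):
    return sp_tflops * dual_issue_factor

print(fp16_peak(10.6))  # a P100-like 10.6 SP Tflop/s -> 21.2 FP16 Tflop/s
```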
> NVIDIA came out with the Titan and Titan Black with baller FP16 performance and no one seemed to care
Source? AFAIK FP16 was earlier only supported natively by textures and by some conversion instructions. The first chip with native FP16 was the GM20B/Tegra X1.
> As for Google and Baidu, while they are huge I'm not sure how "important" they are; Google can pretty much design their own hardware at this point,
Some hardware, where it yields large benefits, but definitely not all hardware. They're happily relying heavily on GPUs, are planning to pick up POWER8 + NVLink, etc. Designing chips is expensive, especially if there isn't a huge market to pay for it.
> and as you mentioned they don't really use the software ecosystem that much as they can write everything from scratch even the driver if need be.
Note that they "only" wrote the frontend and IR optimizer; the code generator is NVIDIA's NVPTX!
To wrap up, because this is getting long: to your last points, I'd say the big DL players are very important for NVIDIA because they are the trendsetters, leading in many aspects of AI/DNN research with OSS toolkits for GPUs, and they'd be silly not to use GPUs in-house for their own needs, which they're happy to talk about at various conferences and trade shows (e.g. both GTC 2015 keynotes).
GM1xx is Maxwell 1st gen, which was the 750 Ti, the GeForce 800M series, Tegra K1 and Tesla M10; GM2xx is Maxwell 2nd gen.
> Not really. With Kepler they experimented with the SP/DP balance a bit (1/3), with Maxwell they were pushed by 28 nm, and with Pascal they returned to DP = 1/2 SP throughput.
Maxwell wasn't 1/3, it was 1/32 (no, this isn't a typo) ;) and it's the same (or even worse, IIRC it's 1/64 now) with Pascal, with the exception of the GP100; the Pascal Titan and Quadro cards are 1/32 or 1/64.
As I said, NVIDIA keeps this for very limited silicon; if you buy a desktop/workstation GPU, even with Pascal, don't expect DP/HP performance.
As for the big players: yes, they are trendsetters, but they can also easily turn into "blackmailers". If you have a single client which buys half your GPUs, they dictate the terms; companies have gone under because of contracts that took up too much of their delivery pipeline.
My bad, I forgot about 5.0 devices being called GM1xx. Still, AFAIR there was little practical difference (in particular in instruction set) between all but the compute capability 5.3 Tegra.
> Maxwell wasn't 1/3, it was 1/32 (no, this isn't a typo) ;) and it's the same (or even worse, IIRC it's 1/64 now) with Pascal, with the exception of the GP100; the Pascal Titan and Quadro cards are 1/32 or 1/64.
I did not say Maxwell had a 1/3 DP flop rate, I said Kepler had that; Maxwell was pushed by the 28 nm process (so they had to get rid of all the DP ALUs).
> As I said, NVIDIA keeps this for very limited silicon; if you buy a desktop/workstation GPU, even with Pascal, don't expect DP/HP performance.
Never disagreed, but they can only do that because their professional compute division has grown big enough that it is worth designing different silicon; that shift in fact happened after Kepler: GK110 was more or less the same on GeForce and Tesla (e.g. 780 Ti and K40). They're also trying to find a good way to do market segmentation, and DP ALU die area is an obvious candidate to play with. Still, no HP on GP102/104 is a curious thing, especially as the Tesla P4/P40, which officially target ML/DL, would really need it. My bet is they won't "forget" HP on lower-end Volta; not all consumer chips will support it, but unless they design different dies for GV102/104 (or whatever they'll call the non-DP Tesla uarch), which I doubt, some desktop parts will also support it.
As for Kepler, well, it had 1:3 but only on a few chips; most Kepler-based GPUs had 1:24 DP performance. The 780 Ti had 1:24 while the Kepler Titan had 1:3, the Titan Black then had 1:24 again, and again no one seemed to care...
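To put the ratios being thrown around in this thread in one place, here's a quick sketch. The SP rates and DP:SP ratios are approximate public figures (and, per the disagreement above, some of them are contested), so treat the exact Tflop/s as illustrative:

```python
# Approximate DP peak = SP peak x DP:SP ratio, for the chips discussed above.
chips = [
    # (name, SP Tflop/s, DP:SP ratio)
    ("GK110 Titan (Kepler)",    4.5,  1 / 3),
    ("GK110 780 Ti (Kepler)",   5.0,  1 / 24),
    ("GM200 Titan X (Maxwell)", 6.1,  1 / 32),
    ("GP102 Titan X (Pascal)", 11.0,  1 / 32),
    ("GP100 P100 (Pascal)",    10.6,  1 / 2),
]

for name, sp, ratio in chips:
    print(f"{name}: {sp * ratio:.2f} DP Tflop/s")
```

The gulf between the GP100 and everything else is exactly the market segmentation being described.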
CUDA / cuDNN are well tuned and integrated with the most popular libraries.
At this point, in order to get people to switch not only would your hardware have to be faster, but your replacement for cuDNN would have to be better as well. Which is by no means impossible, just difficult.
I just hope they do the second part correctly.
For instance, I took the Udemy course a few months back, given by a Google scientist, and in at least one instance the code was pinned to the CPU because it didn't run on the GPU, presumably due to a TensorFlow issue.
If they can get TensorFlow running well on AMD and release images that make it painless, it will be a huge win, but I'm a little skeptical. Big difference between getting a benchmark win and a stable platform.
Also, one of the earlier participants in the TF OpenCL support conversations, https://github.com/gujunli, was working for AMD at the time.
AMD could really invest more in optimising deep learning libraries for their hardware if they want to be relevant
> AMD could really invest more in optimising deep learning libraries for their hardware if they want to be relevant
I am optimistic about AMD's involvement in AI. Just do a search of recent news on here; tons of exciting developments from AMD, especially Radeon Instinct. And here is an interesting comment about the comparison between Nvidia and AMD.
This video explains what you get https://www.youtube.com/watch?v=tKBthlKTtvQ
I'm looking forward to the day that Nvidia gets some competition in the GPUs-for-deep-learning market. Further, doing some smaller deep learning experiments on my MacBook Pro with its AMD discrete GPU is another benefit I'm looking forward to ;)
Micro-trends are mostly meaningless with stocks, unless you're trying to do high-frequency work. Stocks shift and change to small extents all the time based on the quirks of all sorts of trading companies, especially if there hasn't been any significant news about the company, and this isn't significant news. It's interesting, but it isn't really threatening Nvidia's dominance or profits at the moment. AMD needs to make a bigger name for itself in the sector and start picking up some splashy customers before most of the market will react.
Looking at the bigger picture, Nvidia stock is up 160% so far this year, but has been fluctuating a bunch over the last month or so, and it's still well within the scope of those fluctuations.
- autonomous vehicles
- autopilot drone
- personal assistant
- personal robots
i know it's optimistic, but it's not science-fiction.
- child slavery to build new iPhones in Africa and China
- killing and burning wildlife to build new farms in Latin America
- still not having any solution for the problem of drinkable water in two thirds of the world
What a time to be alive!
I really think that, in a weird way, technology can help overcome (not solve), or at least compensate for, our human stupidity in some of these issues. Not alone, of course. But right now I don't see a path for me that's more effective in tackling these problems than improving technology. (To put it more pragmatically: NGOs really need some better tools, especially for campaigning and for training (the latter may be the most important, since most of them are volunteers). It is often very unprofessional and inefficient, and it's really holding some of them back.) Maybe there will be a time when I see more potential impact in my personal political ambitions, but right now I don't see how that would work better than studying computer science.
Also, it's been worse for humanity ;) We dropped like flies at the end of the 19th century due to disease.
Other concerns: you want a full-size case, with the SSDs all the way at the top, out of the way of the video cards (the largest Nvidia OEM 3-fan cards are almost 13" long, so there's going to be some Tetris if you want a lot of drives and video cards in there). Cooling is critical; stuff the box with RAM and Xeons or i7s.
This is a good tip: older Xeon servers, dual socket, can support good PCI bandwidth with 2 CPUs: https://news.ycombinator.com/item?id=12606481
I'm just starting off, so my primary concern is speed of iteration and learning. I want to train on and generate audio phonemes, so this will undoubtedly take a lot of practice.
I'm in Latin America, and wonder which kind of niche could be satisfied with this, for more "normal" kind of customers.
I had to ship out a high-end gamer GPU with a dummy HDMI adapter for this purpose recently. But it's obviously not very efficient. It would also be nice to be able to run multiple screens in parallel, not just one per GPU.
I doubt there will ever be a product for my use case, but one can dream...
That said, these are cool. I think they're lower-power than the Nvidia equivalent, but I could be mistaken (I just recall the Tesla models being power-hungry... enough to cause a real problem in a datacenter rack).
That said, I would love to be proven wrong. Healthy competition such as this fosters much better results. Also CUDA is not without issues in certain matters.
Even Intel is late to the game (and so far can't even compete with AMD in terms of hardware).
OpenCL? Something comparable to CUDA?
What about utilizing Vulkan?
I am all for choice, but AMD has a lot of catching up to do.