What's particularly interesting here is that the Fiji card they propose is a very different beast than any of the NVIDIA offerings.
The MI8 card's HBM gives it a great power and performance advantage (512 GB/s peak bandwidth) even though it's on 28 nm. NVIDIA has nothing with even remotely comparable bandwidth in this price/perf/TDP regime.
None of the NVIDIA GP10[24] Teslas have GDDR5X -- not too surprising given that it was rushed to market, riddled with issues, and barely faster than GDDR5. Hence, the P4 has only 192 GB/s peak BW; while the P40 does have 346 GB/s peak, it has a far higher TDP and a different form factor, and is not intended for cramming into custom servers.
[I don't work in the field, but] To the best of my knowledge inference is often memory bound (AFAIK GEMV-intensive, so low flops/byte), so the Fiji card should be pretty good at inference. In such use-cases GP102 can't compete in bandwidth. So the MI8, with 1.5x the FLOP rate, 2.5x the bandwidth, and likely ~2x higher TDP (possibly configurable like the P4), offers an interesting architectural balance which might very well be quite appealing for certain memory-bound use-cases -- unless of course the same use-cases also need large memory.
Update: I should have looked closer at the benchmarks in the announcement; in particular, the MIOpen benchmarks [1] showing the MI8 clearly beating even the TITAN X Pascal (which has higher BW than the P40) indicate that this card will be pretty good for latency-sensitive inference, as long as everything fits in 4 GB.
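To make the memory-bound argument concrete, here is a rough roofline-style sanity check in Python (the peak numbers are approximate public specs, so treat this as a sketch rather than a measurement):

    # Rough roofline check for GEMV-like inference (low flops/byte).
    # Peak specs below are approximate public numbers, not measurements.
    cards = {
        # name: (peak fp32 TFLOP/s, peak memory bandwidth GB/s)
        "MI8 (Fiji)": (8.2, 512),
        "Tesla P4":   (5.5, 192),
        "Tesla P40":  (12.0, 346),
    }

    # GEMV y = A*x does ~2*m*n flops while streaming ~4*m*n bytes of A (fp32),
    # i.e. roughly 0.5 flop/byte of arithmetic intensity.
    intensity = 0.5

    for name, (tflops, bw) in cards.items():
        compute_limit = tflops * 1e12           # flop/s if ALU-limited
        memory_limit = bw * 1e9 * intensity     # flop/s if bandwidth-limited
        bound = "memory" if memory_limit < compute_limit else "compute"
        print(name, round(min(compute_limit, memory_limit) / 1e9), "GFLOP/s,", bound, "bound")

At 0.5 flop/byte all three cards come out bandwidth-limited, so the ~2.5x bandwidth edge translates almost directly into throughput.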
> Not sure if AMD is going to go all HBM on all its high performance GPUs in 2017, or only offer one or two models with it.
It would make perfect sense to have some GDDR5X-based medium-range GPUs. HBM2 will be expensive, too expensive for the top of the medium range (and the same applies for NVIDIA). GDDR5X has plenty of room for improvement over GDDR5 and by next year they should have it figured out better.
From the photos and perf numbers it looks like the lineup is the RX 480, the R9 Nano, and whatever the Vega 10 gets called, minus some of the connectors and passively cooled.
The PCIe NVIDIA P100 has a peak memory bandwidth of 730 GB/s at 250 watts, which is almost exactly the same BW/watt as the MI8 (well, there are two PCIe P100s; I mean the better one).
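Quick arithmetic behind that claim (the MI8 board power is AMD's quoted "<175 W" figure, so its side of the comparison is an assumption):

    # GB/s per watt, using quoted peak BW and board power (MI8 assumed ~175 W)
    p100_pcie = 730 / 250   # ~2.9 GB/s per W
    mi8       = 512 / 175   # ~2.9 GB/s per W
    print(round(p100_pcie, 2), round(mi8, 2))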
Those MIOpen benchmarks are a bit dubious, since MIOpen is AMD's own deep learning library. It's unlikely that code written by AMD is optimal for the Nvidia hardware.
To be realistic you need to compare AMD hardware running MIOpen to NV hardware running a framework backed by cuDNN.
It's clearly indicated on the slide that those are DeepBench [1] GEMM and GEMM-convolution numbers. The data for the M40, TITAN Maxwell/Pascal and Intel KNL is actually provided by Baidu in their GitHub repo.
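For anyone unfamiliar with what those numbers mean, a GEMM benchmark in the DeepBench spirit boils down to timing one matrix multiply and converting it to achieved GFLOP/s. A minimal numpy sketch (the shape is illustrative, not one of DeepBench's actual kernels, and CPU BLAS just stands in for the vendor GPU library):

    import time
    import numpy as np

    # Illustrative GEMM shape; DeepBench uses shapes taken from real DL workloads.
    M, N, K = 4096, 4096, 4096
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)

    t0 = time.perf_counter()
    C = A @ B
    elapsed = time.perf_counter() - t0

    # An MxNxK GEMM performs 2*M*N*K floating point operations.
    print(round(2.0 * M * N * K / elapsed / 1e9, 1), "GFLOP/s achieved")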
Does anyone use AMD for deep learning in science / industry? All the libraries for deep learning I have seen require CUDA, and NVIDIA is winning by merely being the most popular API. Searching GitHub, it looks like they are university assignment projects; see https://github.com/search?utf8=%E2%9C%93&q=opencl+deep+learn...
Additionally, here's an example of how Caffe was ported using HIP [1]. To be honest, if the approach really does work, you might see a very quick increase in the number of applications ported.
All in all, given how elegant HIP is and that HCC seems to make GPUs more approachable than CUDA (and less silly than OpenACC), there is great potential for AMD to gain some traction. My greatest concern is the quality and robustness of their software stack, their overly optimistic view (at least from the outside), and their relationship with the rest of the OSS world, especially given the conflicts that they seem to be running into with upstream contrib [2].
The catch, which AMD is quick to gloss over, is that Caffe's performance on Nvidia hardware largely comes from its use of Nvidia's proprietary cuDNN kernels.
AMD's HIP port is using the "fallback" open-source CUDA kernels, which are nowhere near as fast as the hand-optimized cuDNN code.
Things got a bit heated in the LKML thread, but if you read the more recent messages things have cooled down significantly. It's unclear whether they will ultimately be able to achieve all their goals while also getting the code into an upstreamable state, but they do seem committed to working out the issues with upstream.
The tricky, but fun parts. It's certainly not trivial to port those, but I'm not too worried about it as long as there is solid runtime support in ROCm. More fun work for perf engineers like me. ;)
Massive amount of boilerplate, heavy APIs or crappy software stack are more dangerous IMO than having to drop in replacements for NVIDIA-specific optimizations of GPU-to-NIC or GPU-to-GPU communication.
Not only is it not trivial to port those, but they will constantly have to play catch-up and be at the mercy of NVIDIA with regard to the spec.
AMD is in a catch-22: support CUDA and be effectively in a constant catch-up position, or don't support it and continue to be ignored by the market at large, simply because of the momentum NVIDIA has managed to build with CUDA over the years.
There are several well known (certainly to AMD) strategies and tactics for dealing with this situation.
One example: AMD can support CUDA while focusing on price/performance at the low end (initiating an Innovators Dilemma for NVIDIA), and if successful then solidify AMD's position, for example by starting a standards process for a successor to CUDA.
NVIDIA, of course, has various countering moves available such as IP moats, initiating competing standards, and so on.
Rarely is a credible player entirely boxed in. And even then, there are moves available to make the best of things, like making some acquisitions and then spinning off the division into an independent company, or selling it off to another player (in this case probably Intel, but maybe ARM could be suckered into it), and so on.
Yes, well, sometimes there are both de-facto standards and de-jure standards in circulation.
XHTML 2.0 comes to mind.
Tactically, creating a backward-compatible successor to CUDA (a superset, essentially) would have the advantage (for AMD, that is) that NVIDIA can't decide to un-implement the existing CUDA support in their products in order to spike AMD's efforts.
Then again, there is a lot of IP sloshing around in this space, so the specific tactic used by AMD may have to be something sneakier and subtler.
Anyway, this is all just one of the hypothetical ways AMD could try to unseat NVIDIA; there are others.
NVIDIA is playing catch-up with themselves and the shifting market too!
Just look at the Maxwell-based Tesla cards that came out of the blue (as if they were an afterthought); I bet Facebook, Baidu, Google, etc. told NVIDIA that Kepler was shit for their use-cases and they did not want to wait for Pascal.
Or look at the bizarre Pascal product line where the GP100 does support half precision, but the others don't, not even the P4 or P40. Strange, isn't it?
Things are changing so quickly that it isn't hard at all to find footing as long as you have something useful to offer. At the same time, I agree, a robust software stack is an advantage, but for the likes of Google or Baidu even that is not so big of a deal (Google wrote their own CUDA compiler!).
Not sure if Maxwell came out of the blue, Maxwell 1 was designed for mobile, embedded and tesla, Maxwell 2 came out most likely because Pascal at large was delayed.
As for half precision, it's pretty much the same thing NVIDIA has been doing since Kepler: dumping FP64 and FP16, especially FP16, due to the silicon costs.
NVIDIA came out with the Titan and Titan Black with baller FP16 performance and no one seemed to care. The Titan X then dropped it and people bought it like cupcakes. I'm pretty sure they have pretty good market research that says most people can live without it, and those who can't can pay through the nose.
As for Google and Baidu while they are huge I'm not sure how "important" they are, Google can pretty much design their own hardware at this point, and as you mentioned they don't really use the software ecosystem that much as they can write everything from scratch even the driver if need be.
What CUDA gives is a huge ecosystem for 3rd parties and, more importantly, a lock on developers since that's "all they know". It's not that different from how MSFT got a lock on the IT industry through sysadmins who only knew Windows; even now, with DevOps and everyone and their mother running Linux, they are still an important player.
If the majority of commercial software is CUDA-based, and most researchers and developers are exposed more to CUDA and are more experienced with it, NVIDIA has a lock on the market.
I'm not entirely sure how big or how good of a client Google will be; they can demand pretty steep discounts, and they are probably a heartbeat away from building their own hardware anyhow.
NVIDIA doesn't want to be locked into 2-3 huge contracts that pretty much dictate what their hardware and software should look like; that's what put AMD in a bind, with the console contracts dictating what their GPUs are going to look like for a few generations now.
> Not sure if Maxwell came out of the blue, Maxwell 1 was designed for mobile, embedded and tesla, Maxwell 2 came out most likely because Pascal at large was delayed.
First off not sure what you are referring to by "Maxwell 1" and "Maxwell 2"; there's GM204, GM206, GM200 all very similar, and GM20B the slight outlier.
Maxwell was an arch tweak on Kepler which, due to everyone but Intel being stuck at 28 nm, had to cut down on all but the gaming-essential stuff to deliver what consumers expected (>1.5x gen-over-gen boost). For that reason, and because HPC has traditionally been thought to need DP, IIRC no public roadmap mentioned a Maxwell Tesla at all. Instead, the GK210 (K80) was the attempt at a bridge-gap chip for HPC until Pascal; it was released just a few months before the big GM200 consumer chips came out. That is, until fall '15, when the late-arriving M40 and M4 were pitched as "Deep learning accelerators" (though GRID/virtualization versions were released a little earlier, in August). Quite obvious naming; plus, no sane HPC shop would buy a crippled (no DP) chip less than a year before Pascal, the promised miracle chip, was planned to be released. Let's not forget that the P100 was known to be quite late and the K80 was insufficient for the needs of many customers, so another bridge-gap was necessary, and that's what the M Teslas were: bridge-gap ML/DL cards that Google/FB/Baidu and the like wanted, and picked up pretty quickly, e.g. see [1].
> As for half precision, it's pretty much the same thing NVIDIA has been doing since Kepler: dumping FP64 and FP16, especially FP16, due to the silicon costs.
Not really. In Kepler they experimented with the SP/DP balance a bit (1/3), in Maxwell they were pushed by 28 nm, in Pascal they returned to the DP = 1/2 SP throughput.
Also, FP16 is not completely separate silicon; AFAIK FP16 instructions are dual-issued on the SP hardware.
> NVIDIA came out with the Titan and Titan Black with baller FP16 performance and no one seemed to care
Source? AFAIK earlier FP16 was only supported natively by textures and by some conversion instructions. The first chip with native FP16 was the GM20B/Tegra X1.
> As for Google and Baidu while they are huge I'm not sure how "important" they are, Google can pretty much design their own hardware at this point,
Some hardware where it gives them large benefits, but definitely not all hardware. They're happily relying heavily on GPUs, are planning to pick up POWER8+NVLink, etc. Designing chips is expensive, especially if there isn't a huge market to pay for it.
> and as you mentioned they don't really use the software ecosystem that much as they can write everything from scratch even the driver if need be.
Note that they "only" wrote the fronted and IR optimizer, the code generator is NVIDIA's NVPTX [3]!
To wrap up, because this is getting long: to your last points, I'd say the big DL players are very important for NVIDIA because they are the trendsetters, leading in many aspects of AI/DNN research with OSS toolkits for GPUs, and they'd be silly not to use GPUs in-house for their own needs, which they're happy to talk about at various conferences and trade shows (e.g. both GTC 2015 keynotes [3] [4]).
>First off not sure what you are referring to by "Maxwell 1" and "Maxwell 2"; there's GM204, GM206, GM200 all very similar, and GM20B the slight outlier.
GM1XX is Maxwell 1st Gen, which was the 750ti, the GeForce 800M series, Tegra K1 and Tesla M10; GM2XX is Maxwell 2nd Gen.
>Not really. In Kepler they experimented with the SP/DP balance a bit (1/3), in Maxwell they were pushed by 28 nm, in Pascal they returned to the DP = 1/2 SP throughput
Maxwell wasn't 1/3, it was 1/32 (no, this isn't a typo) ;) and it's the same (or even worse, IIRC it's 1/64 now) with Pascal, with the exception of the GP100; the Pascal Titan and Quadro cards are 1/32 or 1/64.
As I said, NVIDIA keeps this for very limited silicon; if you buy a desktop/workstation GPU, even with Pascal, don't expect DP/HP performance.
As for the big players, yes, they are trendsetters, but they can also easily turn into "blackmailers": if you have a single client which buys half your GPUs, they dictate the terms. Companies have gone under because of contracts that took up too much of their delivery pipeline.
> GM1XX is Maxwell 1st Gen, which was the 750ti, the GeForce 800M series, Tegra K1 and Tesla M10; GM2XX is Maxwell 2nd Gen.
My bad, forgot about 5.0 devices being called GM1xx. Still, AFAIR there was little practical difference (in particular instruction set) between all but the compute capability 5.3 Tegra.
> Maxwell wasn't 1/3, it was 1/32 (no, this isn't a typo) ;) and it's the same (or even worse, IIRC it's 1/64 now) with Pascal, with the exception of the GP100; the Pascal Titan and Quadro cards are 1/32 or 1/64.
I did not say Maxwell had 1/3 DP flop rate, I said Kepler had that, but Maxwell was pushed by the 28 nm process (so they had to get rid of all DP ALUs).
> As I said, NVIDIA keeps this for very limited silicon; if you buy a desktop/workstation GPU, even with Pascal, don't expect DP/HP performance.
Never disagreed, but they can only do that because their professional compute division has grown big enough that it is worth designing different silicon; that shift in fact happened after Kepler, when GK110 was more or less the same on GeForce and Tesla (e.g. 780 Ti and K40). They're also trying to find a good way to do market segmentation, and DP ALU die area is an obvious candidate to play with. Still, no HP on GP102/104 is a curious thing, especially as the Tesla P4/P40 that officially target ML/DL would really need it. My bet is they won't "forget" HP on lower-end Volta; not all consumer chips will support it, but unless they design different dies for GV102/104 (or whatever they'll call the non-DP Tesla uarch), which I doubt, some desktop parts will also support it.
NVIDIA said they have new double-precision silicon for Volta, so in effect "DP" should become the new base unit, which would give you either 1:1 DP/SP or 1:2 DP/SP if they can do 2 SP operations on a single DP unit, but there's not much talk about HP.
The emphasis on DP performance across the board was extreme with NVIDIA this time around, which makes me wonder if they think FP16 is irrelevant for some reason.
As for Kepler, well, it had 1:3 but only on a few chips; most Kepler-based GPUs had 1:24 DP performance. The 780 Ti had 1:24 while the Kepler Titan had 1:3, the Titan Black then had 1:24 again, and again no one seemed to care...
CUDA / cuDNN are well tuned and integrated with the most popular libraries.
At this point, in order to get people to switch not only would your hardware have to be faster, but your replacement for cuDNN would have to be better as well. Which is by no means impossible, just difficult.
I disagree; most people use Caffe / Keras / Theano / TensorFlow / etc., which hide the cuDNN details, so end users won't care much. Offering more performance / shorter training loops is a big deal. Your typical ML learning loop, with an expensive human regularly waiting for experiment results, shows clear benefits.
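For illustration, user-level model code in circa-2016 standalone Keras looks like this; nothing in it references cuDNN or any vendor kernel library, so the backend is free to dispatch to whatever is fastest on the hardware (layer sizes here are arbitrary, just for the example):

    from keras.models import Sequential
    from keras.layers import Dense

    # The user describes the model; the backend decides which kernels run it.
    model = Sequential([
        Dense(256, activation="relu", input_dim=784),
        Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="categorical_crossentropy")
    # model.fit(x_train, y_train) is the same call whether the backend ends up
    # in cuDNN, plain CUDA kernels, or an OpenCL implementation.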
Yes, we use and like the R9 Nano. We wrote our own tools that are portable across GPUs so third party software support (eg cuDNN) is not a factor for us.
TensorFlow is evolving rapidly and it's a challenge to keep up with GPU driver updates, CUDA updates, TensorFlow updates, and not fall into version hell.
For instance I took the Udemy course a few months back, given by a Google scientist, and in at least one instance the code was pinned to CPU because it didn't run on GPU, presumably due to a TensorFlow issue.
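(For context, pinning looks something like this in graph-mode TensorFlow; the matmul is just a stand-in for whichever op lacked a working GPU kernel:)

    import tensorflow as tf

    with tf.device("/cpu:0"):
        # Force this subgraph onto the CPU, e.g. because its GPU kernel is
        # broken or missing in the installed TF/CUDA combination.
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[1.0], [1.0]])
        pinned = tf.matmul(a, b)

    # log_device_placement shows where each op actually ran.
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        print(sess.run(pinned))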
If they can get TensorFlow running well on AMD and release images that make it painless, it will be a huge win, but I'm a little skeptical. Big difference between getting a benchmark win and a stable platform.
Interesting! He's definitely got some interesting background there. Normally someone wouldn't be added to a GitHub organization without an employment relationship there.
> AMD could really invest more in optimising deep learning libraries for their hardware if they want to be relevant
I am optimistic about AMD's involvement in AI. Just do a search of recent news on here; tons of exciting development from AMD, especially Radeon Instinct. And here is an interesting comment about the comparison between Nvidia and AMD [1].
I remember that a few years ago AMD had the only sensible solution for virtualizing GPUs, and you could make a bunch of them work together as a single unit without much trouble. But I didn't have a Radeon card so I never got to try it. Don't know what happened with that, it's true that they really lagged behind NVIDIA.
There seems to be considerable effort being undertaken to allow TensorFlow to work with OpenCL [0]. Also see [1]. This coincides nicely with the introduction of these AMD cards.
I'm looking forward to the day that Nvidia gets some competition in the GPUs-for-deeplearning market. Further, doing some smaller Deep learning experiments on my MacBook Pro with AMD discrete GPU is another benefit I'm looking forward to ;)
Interesting that NVDA is down almost 4% for the day [1] while AMD is up 3% [2]. Is Wall Street realizing that NVidia is not alone in the ML Hardware space?
It's way too early to be making guesses like that.
Micro-trends are mostly meaningless with stocks, unless you're trying to do high-frequency work. Stocks shift and change to small extents all the time based on the quirks of all sorts of trading companies, especially if there hasn't been any significant news about the company, and this isn't significant news. It's interesting, but it isn't really threatening Nvidia's dominance or profits at the moment. AMD needs to make a bigger name for itself in the sector and start picking up some splashy customers before most of the market will react.
Looking at the bigger picture, Nvidia stock is up 160% so far this year, but has been fluctuating a bunch over the last month or so, and it's still well within the scope of those fluctuations.
Well, I was very active in Greenpeace, other environmental organisations, and the Green Party youth in Germany. I know of these issues, and it's getting better, but it's still awfully slow and frustrating. If you want some improvements: way fewer people are starving [0], and food security has improved for millions. I know there is no rational reason for children to starve, humanity produces enough food, but frankly we (as in the first world) don't care. We really don't care. Most just donate something at the end of the month and think they are now part of the solution. But try explaining to people that they should scale back their consumption. They do stuff as long as there is not the slightest inconvenience.
I really think that, in a weird way, technology can help (not solve) overcome, or better, compensate for our human stupidity on some of these issues. Not alone, of course. But right now I don't see a path for me that's more effective in tackling these problems at the same rate as improving technology. (If you want to see it more pragmatically: NGOs really need some better tools, especially for campaigning and training (this may be the most important, since most are volunteers). It is often very unprofessional and inefficient. It's really holding some of them back.) Maybe there will be a time when I see more potential impact for my personal political ambitions, but right now I don't see how this is going to work better than studying computer science.
Also, it's been worse for humanity ;) We dropped like flies at the end of the 19th century due to diseases.
Yes, the world is not perfect by any stretch of the imagination. Lots of things to be improved. If you could choose another time to live in, when would it be?
Wow, that's a great question. I can't decide if it would be in 20 years or 40 years ago. Just to be a part of, or witness to, creating the Internet as we know it or as we will see it! When some of the above problems no longer exist because of technological advancement, and some worsen because of the very same reason, but we no longer see them, because no one is looking at the wild parts of the world.
If you're getting a single video card, or one for calculation and one just to do video, make sure the motherboard has 16 PCIe rev 3 lanes into the primary card; people are using Z170 and Z97. For 4 video cards, you want an X99 mobo. Newegg does a good job of standardizing different mfrs' specs, e.g. http://www.newegg.com/Product/Product.aspx?Item=N82E16813132...
Other concerns: you want a full-size case, with the SSD drives all the way at the top, out of the way of the video cards (the largest Nvidia OEM 3-fan cards are almost 13" long, so there's going to be some Tetris if you want a lot of drives and video cards in there). Cooling is critical; stuff the box with RAM and Xeons or i7s.
I've been curious about this too. AWS does have K80 instances available for $0.90/hour which isn't too bad for playing around and as long as they keep updating their infrastructure, you can play with the newest stuff versus having to upgrade your own all the time.
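Back-of-envelope break-even at that rate (the local-rig price below is a made-up placeholder; plug in your own build cost):

    hourly_rate = 0.90      # USD/hour for the K80 instance
    local_rig = 1500.0      # hypothetical cost of a home GPU workstation, USD

    breakeven_hours = local_rig / hourly_rate
    print(round(breakeven_hours), "GPU-hours to break even,",
          round(breakeven_hours / (8 * 30), 1), "months at 8 h/day")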
That doesn't sound that bad. I might investigate AWS before putting an investment into my own dedicated hardware.
I'm just starting off, so my primary concern is speed of iteration and learning. I want to train on and generate audio phonemes, so this will undoubtedly take a lot of practice.
If you also play video games, then just get one of the new Nvidia Pascal chips and install it in your home PC. Dual-boot Windows / Linux. Which card to get? Bigger is better but it also depends on your wallet.
This is the setup I recently made and has been working great so far. However, most of it was purchased on black friday - you can probably replace any of these components with comparable parts that are currently on sale.
One thing that would be interesting is if you could use cards like this for rendering multiple instances of X, for the purpose of running things like WebGL browser screenshotters.
I had to ship out a high-end gamer GPU with a dummy HDMI adapter for this purpose recently. But it's obviously not very efficient. It would also be nice to be able to run multiple screens in parallel, not just one per GPU.
I doubt there will ever be a product for my use case, but one can dream...
That said, these are cool. I think they're lower power than the Nvidia equivalent, but I could be mistaken (I just recall the Tesla models being power hungry.. enough to cause a real problem in a datacenter rack).
I really don't think this will make a dent in CUDA's platform. CUDA has a well-established ecosystem in deep learning, and compatible cards like the Quadro line, coupled with a very mature platform, put it miles ahead of this one.
That said, I would love to be proven wrong. Healthy competition such as this fosters much better results. Also CUDA is not without issues in certain matters.
ML/deep learning technologies are peaking, or will peak in the coming years, on the hype curve [1], so it is not too late for AMD to join the show; there are plenty of newcomers and growing players to buy hardware. So even though the CUDA ecosystem is far more established, let's not forget how much room for growth there is under an exponential growth curve!
Even Intel is late to the game (and so far can't even compete with AMD in terms of hardware).
Just speaking from a personal perspective, I took a parallel computing course at my uni this past semester and CUDA was the main platform we worked on (and I'm an undergraduate). Nvidia also has a great Udacity course they offer for free. Unless AMD gets CUDA compatibility working soon, I really don't see how they're going to catch up as far as adoption goes.
I can't even look at the press picture without remembering that is the exact same metal card slot tab that I had on my IBM PC 35 years ago. They should take a picture of the other end or something.
This really doesn't matter for deep learning. There is a large ecosystem built around CUDA. Unless AMD becomes CUDA compatible (they are working on it but aren't there yet) and I can install Torch/TF and run it on my AMD GPU, I will stick with NVIDIA.
I am all for choice, but AMD has a lot of catching up to do.
[1] http://images.anandtech.com/doci/10905/AMD%20Radeon%20Instin...