AMD is edging closer to breaking Nvidia's graphic dominance (engadget.com)
143 points by toufiqbarhamov 29 days ago | 122 comments



Wanna know why AMD drivers are so broken?

I once talked to an AMD project-manager type. He told me how they have several teams in different countries, and for each feature/bug different managers have to "bid" with weekly resource estimates. Bid too high and you don't get enough work for your people. Bid too low and they end up with unpaid overtime. Rinse and repeat.

And these are kernel developers...

EDIT: to be clear, I am rooting for competition to Intel and NVIDIA. I just don't think this kind of culture can work on the software side.


Seriously? That doesn't work within an entity that shares the same revenue. Everyone loses.

They should hire an economist as their boss instead.


This is anecdotal and concerns AMD drivers from several years ago, but when the AMD driver crashed (about once a year) it did so without affecting the system and restarted itself correctly; I only saw an error message and Windows briefly switching to the "basic" theme and back to Aero. When the Nvidia driver crashed (again, about once a year) it took down the entire system; it froze Windows so completely that even Ctrl-Alt-Del didn't work, and I had to power-cycle the PC.


That's totally insane.


I'm not sure how to respond to this other than WTF?


With a >$5000 card (MI50), AMD is forced to sell at a loss to be able to meet the same performance levels NVIDIA had for the same amount of money ($700) 2 years ago.

If this is getting closer, I think they need a better tape measure.

With 16GB of aquabolt HBM2 on a 7nm node there is no way AMD is losing less than $150 per card without even accounting for opportunity loss.

In fact, this card is the reason the head of the RTG was fired, after suggesting it be sold at $749 at a loss.

The 2080 for all intents and purposes is a mid range die.

The only thing AMD is competing on at the moment is price, and for them it's a lose-lose situation until Navi comes out, and even then only if Navi can actually be competitive above the $300 price bracket. Since it's a Polaris successor, it's not clear whether AMD will have anything based on Navi on the level of the 2080, let alone the 2080 Ti.

AMD had to drop the prices of VEGA to around $300 due to the RTX 2060, so they are likely also losing money on that front. They are losing money on the Radeon VII, and hopefully they will finally make money with Navi.


AMD doesn't have to win in discrete at all to make a killing. They only have to dominate the lower end, which they are already positioned to do with current APUs - these chips can manage 30fps/720p playability in most new games, which is easily ignored if your requirement is competitive gaming minimums or 4K, but in terms of price/performance it's near unbeatable.

The Radeon VII is ultimately just a repurposed workstation card given flagship treatment. It's not where AMD's business really is, ever since they went down the route of semi-custom and small dies. While pushing RTX could renew Nvidia's advantages, the only game console they're on these days is the Switch - no raytracing to be seen there. Developers accommodate the Geforce cards for PC releases, but AAA going console-first precludes designing most content around RTX. They've made a lot of moves to try to repurpose high end graphics for other markets, but it looks like it's going to be very rough for Nvidia in the next few years if they can't find a "blue ocean" that needs both speed and programmability.


>Developers accommodate the Geforce cards for PC releases, but AAA going console-first precludes designing most content around RTX.

On the flip side, weak console performance is making the PC become the default platform for AAA games.

>but it looks like it's going to be very rough for Nvidia in the next few years if they can't find a "blue ocean" that needs both speed and programmability.

I don't see how. According to the steam hardware survey the top 10 GPUs used were all nvidia.


"PCs to Become the Smallest Gaming Platform in 2018"

https://www.statista.com/chart/13789/worldwide-video-game-re...


AMD's APUs are good but aren't really that popular with gamers. Intel's new iGPU also looks impressive as hell; they are already competitive with Vega 8 with their current lineup, and this new one seems to be a powerhouse for integrated graphics.

It seems that AMD will have competition on the APU side very soon, and I really hope Intel pulls it off and is actually competitive in the discrete market as well in 2020 or 2021, and that their GPU adventure isn't going to get killed off in a back room.


The problem is most people who game aren't "gamers". By which I mean they aren't necessarily interested in the hardware or anything else except "will this game run?".


I'd bet there is a very large market for pre-built "Fortnite battlestations" in the $500 to $700 range, a little above current-gen consoles. If that's your budget, then AMD is the obvious choice.


Is there data for that? Anecdotally, I was thinking about buying a gaming PC recently. Every single thread you find on reddit about PC gaming starts with 'don't buy pre-built.' It is ingrained in the gaming community at this point that pre-built wastes money to the tune of hundreds of dollars, and PCs aren't all that difficult to put together.

As it stands, $500-700 gets you 1080p gaming that's better looking than current-gen consoles; this is the entry level and it's already several notches above an APU.


But if you're reading reddit threads on this, I think you're several levels removed from the average person who plays games. Sure, "gamers" will read up on this and be part of the gaming community, but PC games are now mass entertainment and not the niche they used to be.

Few people build their own PC. Many more play games.


For me, the "don't buy pre built" mantra essentially means "research on what you want and buy accordingly".

I have heard that there are some prebuilts that are worth it. But you won't be able to know which ones are good until you've had experience building your own.


And even then, this is not 2009 anymore. Sure, you still need a new card to play new games on max settings, but you can easily get by with moderate settings. I mean, I haven't played Fortnite, but Destiny 2 is the same age (although the specs might be lower because I don't know how much the engine changed since Destiny 1) and it runs perfectly on ultra at 2560x1440 on my nearly 3 year old midrange AMD card with a 6 year old CPU.

Also what's up with kids these days not setting textures to potato mode to eke out the last FPS, like we did with Quake? :P


Most people who play current titles and invest in gaming might not care about the hardware, but they aren't running on AMD APUs either; they likely buy a Best Buy PC with a 1050 or a 1060 in it.


Or a games console plugged into a nice TV screen and sound system.


> AMD's APUs are good but aren't really that popular with gamers

It would help if you could actually buy them. The standalone graphics cards were great price-to-performance, but were basically impossible to find for a while (since they were the best at crypto mining). So most of the people who wanted one already had to settle for something else.

And all the OEMs are largely locked into Intel roadmaps, so even though AMD integrated GPUs are awesome, it's very difficult to find a laptop that ships with one. A laptop that could be an ultrabook all day, but still game at console-level graphics (1080p 30fps on low settings), would be a huge hit, and the product already exists. Most companies just aren't selling any of it yet (the Dell XPS 9575 being the only widely-available Windows unit I'm aware of).

So, to buy an AMD-powered laptop, you basically have to buy weird low-end hard-to-find laptops from HP or whatever -- even though the market that would best be served by these devices is the mid-range to high-end (ThinkPads, XPS, Spectre, etc).


It would help if they didn't break their APU drivers on Linux, for example.

My APU with a Radeon HD 6290 had good DirectX 11 support, but the driver on GNU/Linux was rebooted from scratch, with a loss of features compared to fglrx.


These consumer chips have 60 active CUs compared to 64 for the MI50, so these are probably just the "failed" MI50 chips being sold, not something they manufacture to that spec. If they didn't sell them, it would have been a much higher loss (if there is one), I think.


The MI50 has 60 CUs and 16GB of memory; the MI60 has 64 CUs and 32GB of memory.

This is an MI50 with soldered-on display outputs, since the MI cards lack them as they have no rasterization support in the driver.


Ah, I see. I somehow thought of the MI60 as the MI50, with no "lower" range version of it.

My bad! :)


I'm sorry, but this just sounds like bullshit to me.

"With a $5000> card (MI50) AMD is forced to sell at a loss to be able to meet the same performance levels NVIDIA had for the same amount of money ($700) 2 years ago."

Source? What 2-year-old Nvidia card is beating the Instinct MI50 at floating point math? Does it even support FP64? Nothing about that sounds even a bit true.

"The 2080 for all intents and purposes is a mid range die."

The second fastest gpu in the world is 'mid-range'. Yeah right. Not even close unless you only consult with millionaires for your part info.

"AMD had to drop the prices of VEGA to around $300 due to the RTX 2060"

Vega 64 competes with the 2070, it blows the 2060 out of the water. Also the price has gone down? Yeah, that's what happens when tech gets older. Prices drop. They always have and always will.


Do you have a bill-of-materials you're using as an estimate, to say that $700 is selling at a loss?


> With 16GB of aquabolt HBM2 on a 7nm node there is no way AMD is losing less than $150 per card without even accounting for opportunity loss.

Just curious, but how did you come up with that estimate?


it is ridiculous to call a $750 GPU "midrange".


2005: it is ridiculous to call a $350 GPU "midrange".


A current midrange GPU is around $150-$250 according to the recommendations on https://logicalincrements.com (not affiliated, just a fan).


And that statement still holds.


It's still a mid-range die (TU104); the high-end die is the TU102. It's all relative.

We're about a month away from NVIDIA's next quarterly report. It will be really interesting to see how much revenue they've made so far from Turing, especially considering the 2060 was released only a month prior to the date their revenue will be reported on.


The 2080 is 545mm^2. Intel's "HCC" 18-core Skylake Xeon is 485mm^2. There's nothing "mid-range" about a die at 545mm^2.

Yeah, there are higher-end XCC Intel chips or higher-end NVidia chips. But by any practical measurement, anything pushing 500mm^2 or bigger is downright massive.

Radeon VII is ~350mm^2, but on a very expensive 7nm process. Hard to tell what "high end" or "mid range" will be at 7nm right now, since there aren't too many chips being made to compare against.


Again, it's all relative; NVIDIA has always used big dies to win, and they are making money hand over fist.

It doesn't change the fact that the TU104 is a mid-range die as far as the Turing lineup goes, whether it's 500mm^2 or 5m^2.

Radeon VII is a fairly large die on a very expensive node, using the fastest memory around, and it barely beats the 1080 Ti and brings next to nothing new to the table over Vega. Heck, it doesn't bring that many new things over Fiji, but at least they finally have dot product support.

The simple reality is that for gaming the 2080 is still a better pick if only because of the RTX features.

For compute, the 2080 might edge it out depending on the workload, especially for DL/ML due to the craptastic state of AMD's ecosystem even today.

For HPC, as in FP64, the Radeon VII is worse perf/$ than the Titan V; you can buy a 24GB Titan V for £2400 today and get 7 TFLOPS of FP64.

The only case where the Radeon 7 would be better than the 2080 is likely in a small subset of GPU accelerated software that properly supports AMD GPUs e.g. blender.


> The simple reality is that for gaming the 2080 is still a better pick if only because of the RTX features.

I think there's a strong argument for future-proofing 4K gaming experiences. I think the 1080 Ti is a better pick, but the 1080 Ti is sold out in the US market. The 2080's 8GB of VRAM is usable today, but I'm not entirely sure how long it will be before 4K games blow through that.

The 1080 Ti had 11GB of VRAM; the Radeon VII is a bit overkill at 16GB. But 8GB is solidly a mid-range VRAM size.

> The only case where the Radeon 7 would be better than the 2080 is likely in a small subset of GPU accelerated software that properly supports AMD GPUs e.g. blender.

FP32 performance is also used in Adobe Premiere (video editing). But yo man, 3D renders (like Blender's) can take hours under normal circumstances. It's a solid market to try to accelerate.


Adobe Premiere doesn't care about the GPU much; even with pretty heavy effects and 8K you wouldn't notice the difference between a 1060 and a 2080 Ti. But one thing you usually don't want to do is pair it with an AMD GPU, due to Adobe's craptastic support.

https://www.pugetsystems.com/labs/articles/Premiere-Pro-CC-2...

And 8GB is more than enough. Heck, if AMD's memory management worked and devs actually used it, then 4GB would be enough for 4K even today. There are plenty of GDC talks about how some games on average only accessed 3GB of memory for a given level; the biggest ones were from Doom and The Witcher 3, IIRC.

This is how the Xbox One X can do 4K (even if it's at 30fps) with only 12GB of RAM shared between the GPU and CPU, and how even the PS4 Pro can do 4K with 8GB. In all the cases where they can't, it's not a memory limitation but rather a fill rate / horsepower issue.

I'll take NVIDIA's stellar memory management, compression, and tiled rendering over AMD's 1TB/s of memory bandwidth and 16GB of memory for 4K gaming. And 4K gaming on PCs is, well, meh; I'm actually regretting getting a 4K monitor and a 1080 Ti SLI setup, something I will remedy once the 1440p FALD HDR G-Sync monitors come to market.


"Heavy" GPU-accelerated 2d effects, like Optical Flow Time Remapping, are used quite often in my experience. Its the easiest way to get the "cool slo-mo" that made movies like 300 popular. Good Youtubers are using the effect a lot in their videos now.

You're right about Adobe being bad with OpenCL, but it's just an example. My overall point is that the 2D video editing community is mostly about FP32. A high-end FP32 card caters to 3D renderers, 2D video editors, and others in the professional marketplace.

Davinci Resolve users seem to get a bigger benefit out of Vega cards than 10xx series NVidia cards. So it depends on your video editor.

----------

Aside from scientific compute, I'm having trouble figuring out where FP64 support matters. All of the applications I'm aware of are FP32. There's a clear difference in the rendering of heavy effects (optical-flow-based resampling) or 3D editors, so a good GPU really makes a difference in those cases.

> And 8GB is more than enough. Heck, if AMD's memory management worked and devs actually used it, then 4GB would be enough for 4K even today. There are plenty of GDC talks about how some games on average only accessed 3GB of memory for a given level; the biggest ones were from Doom and The Witcher 3, IIRC.

XBox and PS4 Pro are bad examples to use. Those systems have far weaker GPUs, and the textures aren't truly 4k.

If you enter a game with 4k texture packs (or 8k texture packs, like Star Citizen), you can expect your memory usage to spike significantly. 8GB is still sufficient, but gamers are expecting better and better textures.

4k + 10-bit HDR textures take up a lot of room.


>XBox and PS4 Pro are bad examples to use. Those systems have far weaker GPUs, and the textures aren't truly 4k.

Textures are the same as very high/high on PC; the only thing some PC games add is an uncompressed texture pack, which is simply stupid to begin with.

Xbox One X is a great example of a system that can push 4K or near 4K gaming because it does so pretty darn well.

The memory capacity is really not an issue with 8GB; in fact, the only cases where the 2080 really pulls ahead of the 1080 Ti are at higher resolutions, especially 4K.

>if you enter a game with 4k texture packs (or 8k texture packs, like Star Citizen), you can expect your memory usage to spike significantly. 8GB is still sufficient, but gamers are expecting better and better textures.

Most textures in a game are well beyond 4K or even 8K regardless of the display resolution; heck, a simple texture atlas is huge.

Star Citizen is a poor game to use as an example, as its textures are overall simply horrid and it makes PUBG look like a well-optimized title.

Overall, the 2080 gives you RTX, which is RT + Tensor Cores. That will become very important as WinML/DirectML games begin to ship, since NVIDIA both assisted Microsoft in defining the spec and it's essentially tailored for their Tensor Cores. You get concurrent INT/FP execution, which will give NVIDIA a huge boost as integer shaders (e.g. compute shaders with heavy bool logic) become more common, and you get a more mature ecosystem, better encode/decode, and various other small QoL improvements.

For gaming specifically the Radeon 7 doesn't have anything to offer other than to those that care more about the brand than what they get for their investment.

I really miss the days of the HD 4000 series, or the days of the R300, and sadly I don't see AMD getting out of this cycle any time soon: Fiji was late and underwhelming, Vega was late and underwhelming, and now Radeon VII is late and underwhelming.

If Navi is another mid-range GPU, we'll see a 15%-20% improvement over Polaris if we're lucky. Once we get closer to Feb 7th, I'm pretty sure we'll see that at 1080p there is likely no difference between Vega 64 and Radeon VII, and the only place we'll see a difference is at 4K, due to a slightly higher fill rate and more bandwidth, since AMD still can't get delta compression to work correctly.


>And 8GB is more than enough. Heck, if AMD's memory management worked and devs actually used it, then 4GB would be enough for 4K even today. There are plenty of GDC talks about how some games on average only accessed 3GB of memory for a given level; the biggest ones were from Doom and The Witcher 3, IIRC.

It's "enough" because it has to be. Game developers need to constantly pay attention to these budgets because people who play their games have hardware with limited performance. If everyone had more of a resource then none of them would say no to that particularly now that GPU compute is becoming more popular in games.


If the Witcher 3 team found they actually accessed just over 3GB of memory within a given streaming window, then yes, while it's true that it's all about managing resources, this isn't an issue that requires compromise just yet.

Doom had a similar talk, resource management is a key factor in game development.

GPU compute is a completely different issue.


>I think there's a strong argument for future-proofing 4K gaming experiences.

4k isn't that impressive for gaming with a monitor. Most people can barely tell the difference between it and 1440p, but there's a huge performance difference. For VR it does matter though.


Given the rumors of a $300 2080-equivalent card, I wouldn't be surprised if the Navi architecture is more similar to Zen 2, where an array of cheap 80-100mm^2 dies is placed around a larger 12nm bus die on the same package.

AMD has clearly put a lot of R&D into that approach, and it would allow Ryzen 3000 etc. to share a package with a Navi GPU die.

It would also explain the mysteriously absent second 7nm die in the Ryzen announcement this past CES.


I mean, it's a rumor, so I'm pretty much assuming that Navi will either be slower than that or cost more than $300. Rumor is that AMD had issues figuring out how to architect chiplets into GPUs.

CPUs have a long history of NUMA / multiple sockets. GPUs' history of "Crossfire" and other such technologies is far weaker and has far worse support. It is less clear how software is supposed to represent asymmetric access to memory.

How do you build a GPU with a single view of memory, when each die is likely connected to its own set of RAM? Especially with latency being as big of an issue as it is already.

I'm sure AMD is working on the problem, but I don't personally expect it to be solved so soon. Yeah, AMD Infinity Fabric is a nifty 50GB/s chiplet bus. But HBM2 is 1TB/s, so Infinity Fabric is simply too slow to share data between hypothetical GPU chiplets.


Not sure why it would have to look anything like Crossfire.

The EPYC Zen 2 cores have increased L3 and then communicate to RAM via the monolithic I/O core. It's very different than traditional NUMA in terms of locality but it's also likely that latencies have been increased.

A GPU with the same design would look monolithic as far as software is concerned since modern GPUs are already subdivided into hundreds of compute cores as it is.


> The EPYC Zen 2 cores have increased L3 and then communicate to RAM via the monolithic I/O core.

As stated earlier: the Infinity Fabric links on EPYC (not necessarily EPYC 2... but on original EPYC 1) were like 50GB/s. Assuming EPYC2 has a similar fabric, we're still only looking at technology that can push 50GB/s to any core.

GPUs like the Vega 64 have 500GB/s of bandwidth, and Radeon VII has 1000GB/s of bandwidth to main memory. Not cache, but literally its main memory bandwidth. It's completely, unfathomably huge compared to CPU architecture.

Building a 1000GB/s crossbar between RAM and GPU compute units doesn't seem like an easy problem to me. 50GB/s Infinity fabric was done before (socket-to-socket communication), AMD just shrank the tech down to fit on chiplets.


I'm also not sure why you're so focused on Infinity Fabric; it's a strawman. If Vega already has 1000GB/sec of bandwidth, what's stopping AMD from using that same architecture to make an MMU on a smaller die to feed other small dies?

InfinityFabric really has nothing to do with this, especially when locality has such a strong impact on possible bus-widths.


How do you expect chiplets to connect to each other?

HBM is the topology, btw. Each 1024-bit connection (1024 pins!) gives 250GB/s between the chip and RAM. Four connections from one chip to RAM leads to 1000GB/s.

With a chiplet design, you'd need 2 or more chips connected to RAM. The obvious design, where one chip gets two memory controllers and the other chip gets the other two (500GB/s per chip), now halves the bandwidth.

So now you have to somehow let the two chips communicate at 500GB/s, bidirectionally btw (500GB/s from chip0 to chip1, and 500GB/s from chip1 to chip0). Even then, this bidirectional link will have horrible latency, since accesses become Chip0 -> Chip1 -> HBM2 stack.

So it raises questions about how anything would work at all. It just doesn't seem like an easy problem to solve to me. But hey, maybe I'm wrong and AMD has it figured out. All I'm saying is: no one has demonstrated a chip-to-chip bus that keeps up with HBM2 speeds.
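
For reference, the rough arithmetic behind those figures (assuming HBM2's baseline 2.0 Gbit/s per pin; Samsung's Aquabolt runs somewhat faster):

    1024 bits per stack * 2.0 Gbit/s per pin / 8 bits per byte = 256 GB/s per stack
    4 stacks * ~256 GB/s ≈ 1 TB/s aggregate

Split those four stacks across two chiplets and each chiplet only sees ~500 GB/s locally; everything else has to cross whatever chip-to-chip link exists.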


Your implication that in the current Vega architecture a compute core can saturate the 1000GB/sec 4x1024-bit RAM interface without some kind of segmentation is preposterous. We know there are four channels to switch between.

So the idea that bandwidth "halves" with any intermediate step is clearly wrong since the bandwidth available to any compute core is already a fraction of the total available.



They've already confirmed several times that it's monolithic.


> It's still a mid-range die (TU104); the high-end die is the TU102. It's all relative.

Yes and no. In a product marketed toward consumers, the price tiers have to be at least somewhat informed by what people are willing/able to pay. A Boxster isn't an entry-level car just because it's the cheapest one Porsche makes.


It's an entry-level Porsche, just like an entry-level Quadro (which may cost twice what a gaming card costs while being a 106 or even 107/108 die from NVIDIA) is still a mainstream or low-end card die.


Depending on how much stock you put into the Steam Hardware Survey, it's worth noting that Nvidia still controls 74% of the market relative to AMD's 15%: https://store.steampowered.com/hwsurvey/videocard/

I also feel like graphics cards are one of those weird things that command a lot of brand loyalty, so it's probably going to take more than near-performance-parity to move the needle.


I think the market share contributes more to the actual inferiority of AMD cards at the midrange than the hardware specs do.

Because AMD is a minority and game developers know it, driver-related bugs that would be priority 1 to fix if they affected nVidia users are bottom priority for AMD users. The AMD midrange cards are wonderful in terms of hardware for certain displays and competitive in every way from a hardware perspective. Especially if you are looking for 60fps+ @ 1080p, you can get awesome deals by going with AMD.

Just be prepared to occasionally have issues, like BSODing whenever a game renders a lot of white textures (like snow scenes) for six months at a time unless you downgrade to a specific driver version, which will then be fixed in a driver update and broken again later if you are an AMD user. Or to occasionally have driver crashes that developers acknowledge and don't fix for months. And to fuss with shader cache settings and the like, because those settings are busted for AMD in some specific game you want to play.

It's not even necessarily that the drivers suck. They don't suck. They're fine. It's that the developers don't have a big incentive to fix bugs that only impact a smaller segment of the market. While it's not like there aren't a lot of nVidia only bugs that crop up, because of its market share, those bugs are almost always a higher priority to fix and you really notice it if you have used both GPU manufacturers at the same time over a period of years or have used both at various times.

Most of the nVidia tech gimmicks usually suck and are uninteresting (HairWorks, physx, realtime ray tracing) and those that don't suck are usually matched quickly (Gsync). But the marketshare alone is sort of a perversely positive feature because it just means you are on the same drivers as the majority of the market so bugs that impact you are a higher priority for developers to fix.


>Most of the nVidia tech gimmicks usually suck and are uninteresting (HairWorks, physx, realtime ray tracing)

This isn't even a gimmick though. This is the future, because our current way of doing shadows and lights will keep increasing in complexity, but will never be good enough. Whether nvidia's solution will work is a different matter, but this most certainly is not a gimmick.


> competitive in every way from a hardware perspective

Power usage.


There's nothing gimmicky about real-time raytracing.


Right now, in the next 3 months, this year, it is a 100% gimmick that all runs on hybrid technology instead of full ray tracing. I own an RTX card and I can tell you that it is a gimmick that no one turns on except to see what it looks like once, and you can count the number of games that support it on less than one hand.

Over the next 3 years and the next generation of cards, absolutely, great technology.


Jesus, maybe appreciate the fact that just a few years ago real-time raytracing was nothing but a pipedream. Now we have cards on the market that can do a limited amount of real-time raytracing. That's huge. Next year they'll get faster. The year after that they'll get faster. More games will support it. Developers will buy in. This is a generational change that will take 10+ years. That doesn't make this any less significant. It's not a gimmick if it's going to take time for things to develop. That's how things work.


In 2019 it is a technically interesting gimmick that tanks your FPS. This shouldn't be that controversial, because it is the nearly universal opinion of reviewers and of people who have the card. What is interesting from a technical perspective can easily be a gimmick from the perspective of a consumer. I bought an RTX for the regular performance and maybe DLSS (which although not widespread is not currently a gimmick). By the time actual ray tracing and not hybrid ray tracing gains wider adoption, the next generation of RTX cards will already be out.

I can even link you to a video from a YouTuber sponsored by nVidia that essentially says it is a nice-looking gimmick that will be nice for cinematic single-player games but isn't that useful for fast-paced multiplayer games, where most gamers will pick performance over image quality most of the time. DLSS, because it is a performance feature, is in a different category. Like HairWorks, RTX is a tech-demo type of feature that most users will turn off to get dozens more FPS.


It's either this or we don't get raytracing at all. Do you not understand how things develop? They have to start implementing it somewhere and develop the tools and engine to support it, but the hardware doesn't exist for them to do raytracing globally. They have to gain experience and develop practices for implementing it. This is a good first step that will lead to more. How are you not getting this?


I will switch my card, without looking back, to any brand that can run the same games for $100 less...

The thing is, I'm not that picky when it comes to graphics. I've seen too many games with wonderful graphics that are just plain boring.


Absolutely, for gaming Nvidia has a very loyal fanbase (myself included since the 4600GTX days), but for data centers? Different ball game. Both companies are now pushing into that territory pretty hard.


At least Linux gaming shows a negative trend for Nvidia (though it still has a big majority), and for the obvious reason: AMD provides open source drivers, which have become very competitive lately.

See: https://www.gamingonlinux.com/index.php?module=statistics&vi...


I'd say CUDA has an even bigger lead over OpenCL than Nvidia has over AMD in the gaming market.


It's not CUDA itself, but the lock-in of various libraries that are CUDA-only. No one stops you from making a library that works with OpenCL or Vulkan, for example, for compute purposes.


No one stops you, but it's a huge time investment. It's not only CUDA... it's the massive amount of optimized libraries NVIDIA provides. For example, cuBLAS (linear algebra) and cuDNN (deep learning primitives). These things really need massive engineering teams to pull off and are very tied to the hardware.


So there should be combined effort to make such libraries that are cross hardware. If many people need it, why don't they pool resources and create open libraries like that?


I don't know about brand loyalty, but I'm not letting Nvidia with that stinking binary blob near any of my systems any day, regardless of technology. The new amdgpu drivers are fully open source and pack a serious punch.


Maybe for the competitive consumer graphics market, but I don't see them disrupting Nvidia's hold in the HPC world. With Nvidia's NVLink, you can achieve high-bandwidth data transfer between graphics cards without having to pass through PCIe to the CPU. If you have large computations that require syncing data among multiple GPUs, NVLink is your most performant option by a large margin. That's not to mention how much further ahead CUDA is of SYCL, OpenCL, ROCm, etc. I certainly welcome the competition (and hopefully an open standard). In my experience, CUDA is ahead on developer tooling, performance, and productivity features (Thrust, Unified Memory, C++17 support, etc).


AMD's answer to NVLink is Infinity Fabric (MI60, MI50), isn't it? With a 200GB/s card-to-card transfer rate... did you take that into account? I agree the toolset is light years ahead of what I've seen from AMD, but I'm not much into that.


Ah, I hadn't seen that announcement. It looks like this was announced recently (November 2018), and it's still 100GB/s lower throughput than the latest NVLink. Either way, it could certainly be interesting if they compete on price, but I cannot find any information yet on actually using it to sync with OpenCL / ROCm.


Radeon VII (Vega 20) is major overkill for gaming though. It looks more like a card targeted at 3D rendering. The upcoming Navi cards supposedly should be more gaming-oriented and more affordable.


As somebody who uses my cards for rendering in my freelance work and hasn't played a AAA game since my teenage years, I'm definitely hoping to see AMD kick some ass and competitively drive down some prices. Especially since they actually seem to give a shit about linux drivers. Those saints!


Sure. Open source Linux drivers are exactly why I'm using AMD and not Nvidia.


Not overkill in 4K gaming or high refresh rate gaming.


I think 4K gaming is a red herring really. At least at present. Here is a nice video on the topic: https://www.youtube.com/watch?v=ehvz3iN8pp4

A nice optimum today is a monitor with something like 2560x1440 and up to 144Hz adaptive sync. And you don't need Vega 20 with crazy expensive 16GB of VRAM for that. Hopefully Navi will fit that use case well and will have fewer availability issues than the Vega 10 (56/64) cards.


There's more to gaming than latest AAA hits and stupid ultra settings. I have a 4K monitor, I play games with an RX 480, it works pretty well.


If you have a limited use case - sure. 4K is nice, let's say, for adventure games that don't require rapid animation. But for more general usefulness, 4K is overkill that doesn't have affordable GPUs to back it. I'd take a higher refresh rate over resolution (to a point) when there is a choice.


My very affordable RX 480 does 60-100 fps in GTA V at 4K (unless MSAA is turned on which, don't).


Depends on the game. I bet it won't pull that off with the Witcher 3 for example.

I've been using the same RX 480, and it produced ~40-50 fps in TW3 at 1920x1200 on max settings (no HairWorks) in Wine+DXVK on Linux. Something like a Vega 56 already hits 60-80 fps in the same setup. 4K would be way too heavy for anything.

Also, as the above video suggests, better antialiasing obscures resolution issues. Sure, if you don't use it, you need more resolution to compensate.


Yeah, Witcher 3 is pretty heavy, I haven't really played it much yet, but it was okay with tweaked settings and 3200x1800.

> max settings

I don't remember presets in TW3, but I bet there's MSAA and various heavy shader effects enabled.


It is pretty demanding indeed. But visuals are improved noticeably too, so it's worth it to have a better card for better framerate in it on max settings.

Unlike TW2, it's not using any super crazy double-pass anti-aliasing (which they called "ubersampling"), so it's reasonable on max, even if demanding. But having that and 4K is already too much for any single GPU :)


> high refresh rate gaming

> up to 144Hz adaptive sync

you are agreeing with him...

In my personal experience, I need every watt of performance out of a 2080 Ti to achieve a consistent 180-240Hz.


> you are agreeing with him.

Not really. For 2560x1440 resolution you don't need so much VRAM (that's what makes the Radeon VII so expensive). So it is overkill and not very practical for gaming. It's probably a really good 3D rendering card though.

> I need every watt of performance out of a 2080 Ti to achieve a consistent 180-240Hz

I think 144Hz is more than enough. More would be overkill as well. And Vega 64 and the upcoming Navi should handle 60-144 fps well at 2560x1440.


I am happy that 60-144 fps is acceptable for you in the games that you play.

However, your experience is far from an objective characterization of competitive gaming. 60 FPS spikes are called "cancer" in such circles.


The spikes issue is not really about >144 fps; it's about frame-time uniformity. You can have no spikes at 60 fps as well, as long as your frame times are not wildly jumping around. And you can easily have them at high framerates if something isn't right.


>Not really. For 2560x1440 resolution you don't need so much VRAM (that's what makes the Radeon VII so expensive). So it is overkill and not very practical for gaming. It's probably a really good 3D rendering card though.

It's overkill, because most GPUs don't have anywhere near that amount of memory and performance, but if they did then games would use up all of it.


Theoretically, anything could happen. But we are talking about existing options, and also how practical they are.


This video is a bit silly; the difference in image resolution between 1440p and 4K on a 65" TV is still very noticeable, especially since PC games mostly don't support the adaptive resolution tech from consoles. As such, I still had to go with a high-end nVidia card to get non-crappy FPS and good-looking UIs in PC games on a TV.

4K gaming might be a red herring on small desk monitors, but more and more of us are gaming on large living room TVs with PCs plugged in next to consoles.


This video is a bit comical, but it makes a valid point. Consoles and TVs are even worse examples in this context, since they don't even aim at 60 fps, let alone at anything higher than that.

If you are OK with low framerate, then sure, 4K can be appealing. But higher framerate improvements are generally quite noticeable, so console makers try to downplay them.


I'm no graphics programmer. I'm mostly interested in the compute (OpenCL, CUDA) side of things. I don't own an NVidia GPU, so my experience mostly relates to AMD.

I'm certainly interested in AMD's Linux "ROCm" push. I really think the programming model there is relatively easy to understand, but there are major flaws in the documentation and implementation.

For example, OpenCL 1.2 on ROCm 2.0 isn't stable enough to run Blender Cycles. Yes, you can render the default cube, but very slowly. On a real scene, Blender Cycles on OpenCL ROCm can take 500+ seconds to compile, and the actual execution seems to hang (infinite loop and/or memory segfault, depending on the scene) on anything close to a typical geometry.

Note that Blender's OpenCL code is explicitly written for AMD's older OpenCL (the AMDGPU-Pro OpenCL implementation). Blender has a separate CUDA branch for NVidia cards. So OpenCL on ROCm is at the very least performance-incompatible with OpenCL on AMDGPU-Pro. The Blender OpenCL code probably has to be rewritten to work (i.e., not infinite-loop), and maybe even to become efficient on OpenCL ROCm again.

--------

AMD's hardware is fine (not as power-efficient as NVidia, but performance is fine, in theory). But the drivers / software stack is clearly immature. Even as ROCm has hit a 2.0 release, these sorts of issues still exist.

AMDGPU-PRO with OpenCL 1.2 is workable, but feels old and cranky. (OpenCL 1.2 was specified in 2011 and is missing key features: its atomics model is incompatible with C/C++11, it's missing SVM and kernel-side enqueue... etc.)

AMDGPU-PRO OpenCL 2.0 is theoretically supported, but is still unstable in my experience. ROCm OpenCL (both 1.2 and 2.0) is still under development, but doesn't seem to be ready for prime time yet. (At least, if Blender 2.79 or 2.80 Cycles is any indication.)

AMD HCC seems usable, but there aren't many programs using it. AMD HIP is an interesting idea but I haven't used it.

I know NVidia has driver issues / software issues. But CUDA Code written 5 years ago will still have similar performance / implementation if run on today's cards, on today's software stack. I'm not sure if the same is true for AMD's OpenCL code (between AMDGPU-PRO OpenCL1.2 and ROCm 1.2).

----------

Long story short: the only mature AMD OpenCL compute platform seems to be OpenCL 1.2 on AMDGPU-PRO. Fortunately, it also seems like AMDGPU-PRO will work for the foreseeable future, but AMD really needs to clarify its platform to attract developers. (Ex: prioritize testing of ROCm OpenCL to ensure performance-compatibility with existing OpenCL 1.2 code written for AMDGPU-PRO)


I can only agree. As someone who dabbles in machine learning (including deep nets), AMD is basically a no-go, as there is no practical support for deep nets. While Nvidia has stuff like cuDNN, AMD can't even get the basic computational stack for DNNs working.


Is the OpenCL implementation in the PRO package open source?


No. AMDGPU-PRO is closed-source.

Which is partially why ROCm exists. It's the open-source implementation of AMD's driver stack. AMD seems to have indicated that the open source ROCm drivers are the way of the future. It's a sentiment I can certainly get behind (and AMD has even pushed the ROCm driver stack into the Linux kernel proper).

But ROCm isn't ready quite yet. So I think in practice, people will still be relying upon the older AMDGPU-Pro drivers. At least for the next year.


It's so interesting that ever since 1998 ATI (AMD) drivers have always been crap and the Nvidia ones good. That's more than 20 years now.


On Linux in my experience, Nvidia drivers are crap and have always been crap. AMD have gone from much crappier to almost OK.


Admittedly it's been a while for me with Linux, but my experience was that AMD was always half-heartedly trying to support open source: their proprietary drivers were fast but buggy and hard to install, and the open source driver was slow and stable.

Nvidia made no pretense: you downloaded a thin open source shim, typed "make", and then it worked and was fast. And it was a 99.9% proprietary blob.


"and then it worked"

not my experience


Methinks the title is a bit too strongly worded. At that price point Radeon VII is unlikely to win many customers over. If some of the recent leaks[1] turn out to be true, then we can talk about "edging closer to breaking Nvidia's graphic dominance."

[1] https://youtu.be/PCdsTBsH-rI?t=1247

Edit: added timestamp to video


I'm not sure I follow the reasoning in the article. It says if ray tracing flops, AMD will take the lead, but then goes on to say that AMD believes ray tracing will be important and is working on it.


The author seems to think RTX is unimportant. He is wrong. Ray tracing or perhaps more likely path tracing plus the AI denoising is the next revolution in computer graphics and gaming. Now, up until this point, it has been totally impractical for games, but that isn't the case now. Let me try to explain why it's so important.

Essentially this type of thing is the only good way to improve realism. And it allows rendering to move eventually from what is basically an aggregation of different types of lighting approximations towards a more unified global illumination system. It is going to require engine rewrites and a few more generations of hardware improvements to fully achieve that, but it will enable real-time rendering close to today's cinema quality and also streamline game development by removing the need for so much asset preparation related to lighting.

Anyway, I think that AMD absolutely needs to catch up in this area, even though many people in gaming or programming may have trouble recognizing the relevance, because all of the lighting hacks are firmly baked into the culture and accepted. All those hacks will gradually become obsolete.


Unless Moore's law stops, Ray tracing probably is the future. Regardless, Ray tracing in 2019 is pretty much useless. AMD needs to have Ray tracing ready for future chips, but in the mean time they can take advantage of Nvidia wasting a quarter of their die on a currently useless feature.


Ray tracing and path tracing are different. Path tracing with sophisticated denoising in 2019 is 100% useful today. It is just not quite practical yet to replace the engines in most games with it entirely. But it is absolutely useful. Please do some more research.


It's not too dissimilar from how AVX instructions are so poorly implemented on AMD CPUs - they may support them, but they may not be core to their strategy (no pun intended).


AVX instructions are actually done fine in my experience. 256-bit ops are split into two 128-bit ops internally, but that's not really a big deal. Intel is faster for sure, but I never had issues running AVX code on AMD CPUs. The performance characteristics are just different between Intel and AMD.

In fact, AMD has some tricks up its sleeve. AMD Zen has two AES pipelines (while Intel Skylake only has one), so vectorized AES code is faster on AMD.

There are some performance issues with vpgatherdd instructions, but Intel has those as uop emulated code too. So both Intel and AMD are equally to blame there.

----------

My "issues" with AMD CPUs are relatively tame. AMD's profiling tools are weaker than Intel's. (Ex: AMD has "Instruction Based Sampling" while Intel's "PEBS" (Precision Event Based Sampling) are a bit easier to use. True 256-bit execution would be nice, but its not a major hassle in most cases IMO: AMD's back-end is very thick, so you can still get a lot of ILP out of AVX2 instructions.


> High Precision Event Timers

Isn't HPET just a basic system timer? (That is definitely present on Zen.) You might be thinking of something else?

> 256-bit execution would be nice, but it's not a major hassle in most cases

Coming with Zen 2 anyway :)


> Isn't HPET just a basic system timer? (That is definitely present on Zen.) You might be thinking of something else?

You're right. I got the names confused. I've edited the post above to use the proper "PEBS" term.

Intel's "Precise Event-Based Sampling" is what I was trying to talk about. Intel's PEBS can precisely tell you where a branch-mispredict happens.

AMD's default event counters are imprecise: your branch mispredictions will be attributed all over "add" instructions and other unrelated stuff. This is because the CPU is looking at a roughly ~100-instruction window (between the pipeline, decoder, and retirement, there's a lot of inaccuracy in determining "where did this branch misprediction happen?").

So when trying to track down a branch misprediction on AMD systems, you have to switch to the harder-to-use "Instruction Based Sampling" mode. Intel has a simpler PEBS switch which is easier to use, IMO.
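
If you want to poke at this yourself on Linux, the perf invocations look roughly like this (a sketch: exact event names depend on your kernel and CPU, and "./myapp" is just a placeholder):

    # Intel: ask for precise (PEBS) attribution with the :pp modifier
    perf record -e branch-misses:pp ./myapp

    # AMD: sample through the IBS "op" PMU instead of the regular counters
    perf record -e ibs_op// ./myapp

    perf report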


At some point I'm hoping to see AVX completely emulated in microcode and replaced with an embedded GPU core that can write to an L3 or L4 cache.

This slow scaling of 128->256->512 bits in the instruction set is more or less a solved problem in the GPU space with shader compilers and AVX would mostly be redundant with GPUs if it weren't for the memory bandwidth constraint.

ie: When it comes to vector processing, go big or go home.

AVX/SSE are a compromise from back when CPU die space and bandwidth were more precious. Now that we have 8-32 cores on a die with a good bus between them, it seems like duplicating those AVX units 8-32 times is less optimal.


Please no.

AVX's main advantage is that it is roughly 1-cycle away from your main registers, and 4-cycles away from L1 cache. Talking to and from L3 cache is on the order of 30 to 40 cycles, an order of magnitude slower.

If your workload fits inside of 64kB, AVX is incredibly beneficial. If your workload fits within 8MB (L3 cache), you're starting to look at a point where maybe you should pipe that data to the GPU instead.

A GPU call over PCIe is under 5 uS / 5000 nanoseconds, with a bandwidth of ~15GB/s. GPUs are certainly farther away than L3 cache, but if you're pushing L3 cache levels... you're getting close to the GPU anyway.

---------

128-bit SSE code is perfect for representing complex numbers (two double floats). 128-bit is also great for an x,y,z,w 32-bit vector.
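
As a small illustration of why the 128-bit width maps so nicely onto complex numbers, here is a sketch of one complex-double multiply with standard SSE3 intrinsics (treat it as illustrative rather than production code):

    #include <pmmintrin.h>  /* SSE3 */

    /* (a + bi) * (c + di), with each complex double stored as [real, imag] */
    static __m128d complex_mul(__m128d x, __m128d y) {
        __m128d yr = _mm_shuffle_pd(y, y, 0x0);   /* [c, c] */
        __m128d yi = _mm_shuffle_pd(y, y, 0x3);   /* [d, d] */
        __m128d t1 = _mm_mul_pd(x, yr);           /* [a*c, b*c] */
        __m128d xs = _mm_shuffle_pd(x, x, 0x1);   /* [b, a] */
        __m128d t2 = _mm_mul_pd(xs, yi);          /* [b*d, a*d] */
        return _mm_addsub_pd(t1, t2);             /* [a*c - b*d, b*c + a*d] */
    }

The whole product stays in one register, a cycle or so away from the rest of your scalar code, which is exactly the latency argument above.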

-------

GPUs are the "go big or go home" architecture. AVX's primary benefit is latency.


What you think of as "vector" processing is currently being used by compilers to speed up things you didn't think were vectorizable. This is possible only because these instructions are pretty cheap latency-wise. By introducing huge latency, you'd be ruining performance of autovectorization, which accounts for a lot of the performance gains in the past decade.


After years of staring at HFT disassembly, I'm thoroughly disappointed by autovectorization.

Here's the simplest godbolt example I could think of to illustrate, summing a string of fixed length and non-fixed length: https://godbolt.org/z/O_M5fU
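
For context, the kind of function in question looks roughly like this (a minimal sketch; the actual godbolt source may differ and the function name is made up):

    #include <stddef.h>

    /* Sum the bytes of a buffer: the classic autovectorization target.
       Note the int accumulator; each byte gets widened, and "sum" follows
       32-bit semantics, which constrains what the vectorizer may do. */
    int sum_bytes(const unsigned char *buf, size_t len) {
        int sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }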

You can see that the most recent GCC fails to use the AVX512 zmm registers even after being configured to do so (afaik) and also fails to use more than 4 registers. Clang does better, using zmm* and all the registers.

But in both cases, the amount of code generated is quite large. If you compile with -Os instead of -O3, no vector instructions are used for some reason.

So when you load this code, no matter what, you're loading a bunch of instruction cachelines, which will destroy most of your latency gain unless the input is very large. And even if your input is large, you'll fault that data anyways.

So what's the point of doing this on the CPU again?


I don't believe Knights Landing supports an 8-bit ZMM add.

Changing the code to 32-bit ints results in the "vpaddd zmm0, zmm0, ZMMWORD PTR [rdx]" that you'd expect from the auto-vectorizer.

> You can see that the most recent GCC fails to use the AVX512 zmm registers even after being configured to do so (afaik) and also fails to use more than 4 registers

In the 32-bit vpaddd code... "vpaddd zmm0, zmm0, ZMMWORD PTR [rdx]" becomes a new zmm register allocation in the register-renamer (due to a cut dependency). I doubt any code would be any faster.

In any case, its not about "the number of registers used", proper analysis is about the depth of dependencies. I'd only be worried about small-register usage if the dependency chain is large (which doesn't seem to be the case).

EDIT: I had some bad analysis. I've erased the bad paragraph.

So it seems pretty good in my eyes.


The clang code loads the 8-bit words into 16-bit vector instructions and parallelizes using a tree-algorithm to minimize dependencies.

So already it's better than "pretty good" and it's still verbose.

But neither of these compilers are able to optimize for code size while using AVX, so the ((code size) / (64-byte cacheline)) * (~100ns per load) will still kill your performance on any data set that's smaller than a few kilobytes.


> The clang code loads the 8-bit words into 16-bit vector instructions and parallelizes using a tree-algorithm to minimize dependencies.

You specified "int sum", so that means "sum" needs to follow 32-bit overflow semantics. I don't think it is possible to do it with 8-bit adds without changing those semantics.

> But neither of these compilers are able to optimize for code size while using AVX

You can totally do that. You're specifying -O3, which is "speed over size". If you want size, you want -Os, and then use -ftree-vectorize to enable auto-vectorization.
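
For example, something like this (standard GCC flags; the -march value and "sum.c" filename are just placeholders):

    gcc -Os -ftree-vectorize -march=skylake-avx512 -S sum.c -o sum.s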


Staring is not enough, you need to benchmark. You might be surprised.

The results are often counterintuitive, because it's hard for a human to account for how the program _actually_ gets executed, which instructions go in which order, port utilization, data dependencies, CPU bugs, effects of alignment (or, often, lack thereof on Intel CPUs), micro-op caching, etc etc. For all you know GCC deliberately avoided the use of AVX512 registers because AVX512 causes the CPU to throttle the clock.

I've also found that the ye olde -O2 recommendation (the one that says -O3 is likely to be slower in practice because it produces bulkier machine code) nearly never holds up anymore.

That said, I concede that autovectorization is not perfect. I can almost always beat it if I really want to, sometimes by quite a margin. But what you're proposing is unlikely to help matters, especially for smaller inputs that you're concerned about.


I have benchmarked similar code extensively, this is a toy version of a FIX checksum. It's not "counterintuitive", it's just broken, especially for infrequently executed code where the instruction fetch stalls will kill your latency.

You are welcome to load up the godbolt example and try to create better assembly using -O2 or whatever else you can think of. You can build the code on your machine and benchmark it too.

But at the end of the day, the compiler output sucks. It's clearly optimized for synthetic benchmarks that chug through large datasets using a tiny amount of code that can fit in L1 cache even after the compiler blows up its footprint.

But my point is that these large datasets with tiny processing kernels (like a video codec for example) are much better suited to different kinds of processors.


I (selfishly) hope they succeed. NVIDIA is getting ridiculous with their constrained supply of $1200+ high end GPUs.


As someone who bought AMD stock with the belief that NVIDIA was overpriced and it was about time it swung back the other way... me too!


Related, AMD's stock has been heavily manipulated for years: https://seekingalpha.com/article/4153502-institutional-short...


This is very interesting. Are you aware of any more research into this?


It's certainly a very small part of the overall GPU-buying community, but nearly the entire hackintosh community has more or less abandoned/forsworn NVIDIA due to the lack of Mojave drivers. Even on older macOS releases, the Nvidia drivers were awful.


Good. NVIDIA should go bankrupt as soon as possible, because of their open source stance.



